| [**Overview**](./00_overview.ipynb) | [Getting Started](./01_jupyter_python.ipynb) | **Examples:** | [Access](./02_accessing_indexing.ipynb) | [Transform](./03_transform.ipynb) | [Plotting](./04_simple_vis.ipynb) | [Norm-Spiders](./05_norm_spiders.ipynb) | [Minerals](./06_minerals.ipynb) | [lambdas](./07_lambdas.ipynb) | [CIPW](./08_CIPW_Norm.ipynb) | [Lattice Strain](./09_lattice_strain.ipynb) | **Extensions:** | [ML](./11_geochem_ML.ipynb) | [Spatial Data](./12_spatial_geochem.ipynb) |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |

# Getting Started with Jupyter and Python


**New to Jupyter?**<br>

If you're new to Jupyter, the main thing you'll need to know for this set of notebooks is that you can use <code>Shift+Enter</code> to execute cells of code (or press the <i class="fas fa-play"></i> button). This notebook should have launched in Jupyter Lab if you've come via the Binder link; this is an interface which provides a file browser to your left, a main workspace (if you're reading this, the space occupied by this notebook) and organises separate open notebooks in tabs (see above this cell). If you'd like a more detailed overview of the Jupyter interface, you can check out [the respective documentation](https://jupyterlab.readthedocs.io/en/stable/user/interface.html). Jupyter notebooks typically contain both text (markdown, like this one) and executable code cells. Try running the next two, which generate simple output interactively: 

In [None]:
"Run me with Shift-Enter/Cmd-Enter!"

In [None]:
2 + 1

Note that a few conventions are used throughout: code or keyboard commands are typically formatted `in grey boxes`, and where a piece of code refers to something you can change/input, I'll often use references between angle brackets (e.g. `print(<thing>)`, where `<thing>` shouldn't be inserted literally, but instead can be replaced by e.g. `"Hello world!"`, giving `print("Hello world!")`). Especially in this notebook, I've used **bold** to highlight some key terms related to core Python basics.

---
## What is Jupyter?

[Jupyter](https://jupyter.org/) is an ecosystem of open source tools which provide interfaces for working with a variety of programming languages. The most well known of these is the Jupyter notebook - which in its simplest form is an electronic notebook consisting of a series of cells (like this one) which can contain a mix of text, code, output, metadata and potentially even interactive elements. Today we're working in Jupyter Lab - which is an environment which combines an interface to notebooks with a file explorer (left) and enables the integration of a variety of other tools.
#### Should you use notebooks?

Jupyter notebooks can be a good way to organise prototype workflows, and are often a good mechanism for sharing and explaining your code in a way which invites conversation and interaction (hence using them here!). Notably though, they're not necessarily the solution for everything. While you can construct workflows and models through Jupyter notebooks, they are more difficult to manage than standalone scripts and libraries when it comes to version management, integration and automation. For this reason, once you have something working well, it's worth considering writing it up as a separate script or even a Python library/module!

#### Using Notebooks for Today (if you haven't seen them before)

The key thing to note for today is that it's common to find a mix of text cells like this one (typically written in [Markdown](https://www.markdownguide.org/) for easy markup of text) and code cells (scroll down a bit, they'll have a grey background). While it's not necessary for today, knowing a bit of markdown syntax can help structure notes and documentation accompanying your code. 

Code cells are not static - here in the workshop JupyterHub/Binder you can run them (`Shift-Enter` or use the <i class="fas fa-play"></i> button), edit and re-run them! We encourage you to edit, change and break things within reason to get to know the tools (you can always restart Jupyter!).

You can tell which cells have been executed by the notation to the left of each one - cells already run will have a number (e.g. `[1]`) noting the order in which they were run, cells which are currently running (or queued to run) will have an asterisk (`[*]`) and cells which haven't been executed will have empty brackets (`[ ]`). Also check the small circle in the upper right - if it's <i class="far fa-circle"></i> then it's stopped/hasn't started executing, if it's <i class="fas fa-circle"></i> it's trying to execute something/busy. If you get stuck and it looks like nothing's happening, the kernel might have stalled; you can restart it under the `Kernel` menu to the top left, using `Restart Kernel...`.

<div class='alert alert-warning'> <font color="black"><b>Note:</b> Binder will not save/persist your progress or changes outside your session! If you want to keep a modified notebook, you can right click and download from the file browser on the left (or, in Binder - you can also click the download link provided above).</font></div>



---
## What is Python?

[Python](https://www.python.org/) is a high-level multi-purpose programming language. It's freely and openly available and you'll be able to find a distribution which can run on just about any system (e.g. 'micropython' runs on bare-metal for tiny microcontrollers). There is a large community which uses Python, the majority of which revolves around open-source projects. You can use Python as a fancy calculator, build websites, run servers, build machine learning models, image black holes or provide testing and code generation for an [embedded software framework for NASA](https://github.com/nasa/fprime).

Python is an *interpreted* language, which means that rather than being compiled (like e.g. C, C++ and Fortran) it's read, interpreted and executed as needed. For this reason, it'll typically be a bit slower for most tasks (but not necessarily by much), but it also makes it much less complex to get into, read and run. When working with numerically intense workflows, you're often actually running code which was written in a more performant language in the background - and this bridges a large part of the 'performance' gap between languages. Notably, however, Python tends to be written to be later read (or at least it can and should be) - and the accessibility together with its flexibility are some of the key reasons it's so widely used.

You can run Python from the terminal, but typically we want to either write and execute programs (e.g. like 'scripts'; Python is often termed a 'scripting language') or play to the language's strengths and execute code interactively (e.g. in these notebooks!). To do this we need some kind of editor - whether it be notepad, Jupyter Notebooks or a dedicated development environment. While Python is often distributed with some kind of editor, many people have their own favourites - and it tends to depend a bit on what you're doing (e.g. I use 'Atom' to write pyrolite, but write these workshops/demonstrations in Jupyter notebooks).

---
## Some Basic Python


The typical first line that new Python coders will execute will print the words 'Hello World!':

In [None]:
print("Hello World!")

The `print` statement is a useful way to get some quick output from your code, and you'll likely end up using it a lot! Here the `print` statement is actually a **function** (a specific chunk of code which typically takes some input and generates some output), and we're passing it the **argument** `"Hello World!"`, which itself is a **string** (text enclosed by either single or double quotation marks; `""` or `''`).

Note that in an **interactive environment** (like this notebook) the execution of code will typically output the result of the last expression - so here we should also be able to just output the string by having it as the last line in a cell (note the inclusion of the quotation marks denoting the string type, however!):

In [None]:
"Hello World!"

### Operators and Numbers

Beyond using Python as a glorified digital printer, you can also use it as a flexible calculator for just about any kind of numerical task you desire. There are a range of **operators** which when used with numeric types allow you to add, subtract, multiply, divide and more:

In [None]:
2 + 2

In [None]:
2 * 4 - 4 / 2

Note that there is a distinction between **integers** (type `int`, e.g. `2`) and **floats** (type `float`, e.g. `2.0`), and that depending on the operations used you may get a different type of result than what you put in. Beyond the simple operators for numeric types, there are a few more complicated ones for integer/floor division (`//`), remainder (`%`) and exponent (`**`).

In [None]:
4 // 2, 4 / 2

In [None]:
4 % 2

In [None]:
4**2
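
To make the difference between these operators (and the types they return) a bit more concrete, here's a minimal sketch. The variable names are just for illustration and aren't used elsewhere in the notebook:

```python
# floor division of two integers gives an integer
floor_result = 7 // 2
# 'true' division always gives a float, even when the result is a whole number
true_result = 7 / 2
whole_but_float = 4 / 2
# the remainder left over after floor division
remainder = 7 % 2
# raising a number to a power
power = 2**3

print(floor_result, type(floor_result))
print(true_result, type(true_result))
print(whole_but_float, type(whole_but_float))
print(remainder, power)
```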

There are also operators for comparisons: `==` for equal-to, `>` and `<` for greater-than and less-than, `>=` and `<=` for greater-than-or-equal-to and less-than-or-equal-to; these will return a **boolean** (type `bool`, `True` or `False`):

In [None]:
4 < 2

In [None]:
8 >= 7

In [None]:
1 == 2

Beyond simple numerical calculations and printing, we'll typically want to store the value of some of these expressions as a **variable** - essentially giving it a name. We can then reuse and reassign these variables by referring to them (note these variables persist between cells; they're defined **globally**):

In [None]:
a = 2
a * 2

In [None]:
a = a + 3
a

There are some operators which allow this type of assignment with a type of shorthand, e.g. for incremental addition:

In [None]:
a += 1
a
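
The same shorthand exists for the other arithmetic operators. As a small sketch (using a made-up variable `count` so we don't interfere with `a` above):

```python
count = 10  # hypothetical starting value
count -= 4  # same as count = count - 4, so count is now 6
count *= 3  # same as count = count * 3, so count is now 18
count /= 2  # true division, so count becomes a float
print(count, type(count))
```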

### Comments

As you start to get a bit of structure in your code and have multiple variables floating around, it's a good idea to start commenting and documenting your code. To do this, the use of comments and multi-line strings is typical:

In [None]:
"""
This is a multi-line string to describe the general behaviour of the code to follow below.
"""
# this is a comment
b = 14
# halve b
b /= 2

### Collections

Besides types which correspond to an individual value, there are also types of collections of values, including lists, tuples ("lists which you can't change"), dictionaries (key-value pairs), and sets (a collection of unique things).

In [None]:
list_0 = ["a", "b", 1, 2]

tuple_0 = ("a", "b", 1, 2)

dict_0 = {"a": "b", 1: 2}

set_0 = {"a", "b", 1, 2}

We can refer to/**index** the values within these collections, and in some cases assign new values (note that Python uses 0-based indexing; you need to use 0 to ask for the first element, and so on):

In [None]:
list_0[0], tuple_0[1]

Notably, you can't use these indexes to get the elements/values of a dictionary or set; a dictionary has key-value pairs - and you can access the values by using the keys:

In [None]:
dict_0["a"], dict_0[1]

To get the last elements of an indexable collection, you can also use negative indexing:

In [None]:
list_0[-1]
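
As a rough illustration of which collections let you assign new values, the sketch below uses fresh, made-up names (`samples`, `fixed`, `lookup`, `unique_items`) rather than the collections defined above:

```python
samples = ["a", "b", 1, 2]
samples[0] = "z"  # lists can be modified in place

fixed = ("a", "b", 1, 2)
try:
    fixed[0] = "z"  # tuples can't be changed after creation
except TypeError as err:
    print("Can't modify a tuple:", err)

lookup = {"a": "b", 1: 2}
lookup["c"] = "d"  # add a new key-value pair to a dictionary

unique_items = {"a", "b", 1, 2}
unique_items.add("a")  # adding an existing value to a set changes nothing

print(samples, lookup, len(unique_items))
print("a" in unique_items)  # sets aren't indexed, but you can test membership
```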

Another handy function for dealing with collections is `len`, which will return the length of a collection (at its root level):

In [None]:
lst = [["b", "d", "a"], "c", 1, 5.6]
len(lst)

Note that the first element only counts as one! But we can also check its length if we want to dig into it:

In [None]:
len(lst[0])

### Loops and Iteration

A common construct in many programming languages is a loop, and especially a `for` loop. When we want to iterate through items within an object (like a list), for loops *can* be a useful way to do so (sometimes there are better ways).

In [None]:
lst = ["a", "b", "c"]
for item in lst:
    print(item)

Two other useful things to mention here are `range` and `enumerate`. `range` gives you an ordered sequence of integers starting from zero, up to (but not including) the specified maximum:

In [None]:
range(10)

In [None]:
for ix in range(10):
    print(ix)
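
Note that `range(10)` on its own just displays as `range(0, 10)`, as the numbers are only generated as they're needed. As a small sketch, you can wrap it in `list` to see the values, and optionally give a start and a step as well (the names `first_ten` and `evens` are just for illustration):

```python
# convert the range to a list to see all of the values at once
first_ten = list(range(10))
# range(start, stop, step): start at 2, stop before 11, in steps of 2
evens = list(range(2, 11, 2))

print(first_ten)
print(evens)
```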

`enumerate` gives you an indexed reference to items in a collection, and is often used in for loops. Here we also use some **string formatting** (the `{}` and the `.format(...)`) to print out the indexes and items returned from `enumerate`:

In [None]:
for ix, item in enumerate(lst):
    print("The item at position {} is {}".format(ix, item))
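
As an aside, in recent versions of Python (3.6 and later) you'll also often see 'f-strings' used for the same purpose, where the values are placed directly inside the curly brackets. A minimal sketch of the same loop, using a hypothetical list called `items` and collecting the messages so we can look at them afterwards:

```python
items = ["a", "b", "c"]
messages = []
for ix, item in enumerate(items):
    # the f prefix means {ix} and {item} are replaced with their values
    message = f"The item at position {ix} is {item}"
    messages.append(message)
    print(message)
```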

Something which might be a bit advanced for now, but you'll see a lot of later today and within `pyrolite` is something called **list comprehension**, which is basically a for loop contained within a list. It enables some quick modifications/filtering combined with iterating through collections, e.g.:

In [None]:
[a for a in lst]

In [None]:
[a * 2 for a in lst]

We can even use conditional statements within them (the `if`):

In [None]:
[a for a in lst if a != "c"]
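
If list comprehensions look a bit opaque at first, it can help to see one written out next to the equivalent `for` loop. This is just a sketch with illustrative names (`letters`, `doubled_loop`, `doubled_comprehension`):

```python
letters = ["a", "b", "c"]

# the long way: build up a new list item by item
doubled_loop = []
for letter in letters:
    if letter != "c":
        doubled_loop.append(letter * 2)

# the same thing as a list comprehension
doubled_comprehension = [letter * 2 for letter in letters if letter != "c"]

print(doubled_loop)
print(doubled_comprehension)
```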

### Functions 
Once you start to write a bit of code, you'll notice that a lot of times you'll repeat yourself/copy-paste code from one script/notebook to another. At this point you should consider starting to organise code which is run repeatedly into a relevant structure. **Functions** are often the primary structure you'll use to do this - wrapping up and naming a bit of code, defining its input and output.

Functions can take a variety of inputs. These inputs are divided into **arguments** (passed simply as values, typically noted `args`) and **keyword arguments** (key-value pairs, often noted `kwargs`). The following function takes a single input, modifies it and returns a different value:

In [None]:
def add_one(x):
    return x + 1


add_one(2)

Functions need not return a value (e.g. sometimes you want functions to save something to disk, print output or a variety of other use cases), but often do - and can return multiple.
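
As a quick sketch of both ideas, the hypothetical function below (`min_and_max` isn't part of Python or `pyrolite`, it's just made up for this example) takes one argument and one optional keyword argument, and returns two values at once:

```python
def min_and_max(values, round_to=None):
    # find the smallest and largest values in the collection
    lowest, highest = min(values), max(values)
    # optionally round both results to a given number of decimal places
    if round_to is not None:
        lowest, highest = round(lowest, round_to), round(highest, round_to)
    return lowest, highest  # multiple values are returned together as a tuple


# the returned values can be 'unpacked' into separate variables
low, high = min_and_max([0.62, 1.34, 2.01])
rounded = min_and_max([0.62, 1.34, 2.01], round_to=1)
print(low, high, rounded)
```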

It's a good idea to give your functions and their arguments readable names, and document these in docstrings (the multi-line string below). Here we use some conditional statements to control scaling of a number, and add some appropriate documentation for the process: 

In [None]:
def to_wt_percent(
    x,  # this is an argument
    from_units="ppm",  # this is a keyword argument
):
    """
    Convert a value to wt%.

    Parameters
    ----------
    x : float | array
        Value to convert units for.
    from_units : str
        Current units the value is in.

    Returns
    -------
    float
    """
    if from_units == "wt%":
        pass
    elif from_units == "ppm":
        x = x / 10000  # 10,000 ppm = 1 wt%; avoids modifying array inputs in place
    else:
        raise NotImplementedError
    return x

In [None]:
to_wt_percent(10, from_units="ppm")

### Inbuilt Help

If you write functions with docstrings, you're then also able to access them as needed through built-in help functions:

In [None]:
help(to_wt_percent)

Within Jupyter, you can also use some built-in shortcuts (and usually, tab-completion):

In [None]:
to_wt_percent?

The other useful aspect of bundling up your code in this way is that you can save it to a separate file (e.g. we might put the function above in a file called `units.py`), and then import it so you can use it across multiple scripts - or even put together a collection of functions into a library (this is how pyrolite began...).

As a bit of an aside, some additional methods beyond the standard help and inbuilt Jupyter `?` might be useful:
* `whos` will give you an indication of existing objects and their types in an interactive Python session (like in Jupyter)
* `dir(<object>)` will give you the list of methods and attributes defined for an object, handy for finding what an object can do/has within it


In [None]:
whos

In [None]:
dir(lst)

### Using More of Python

One of the great advantages of working with Python is the ecosystem of tools which you can leverage in your own work - from `numpy`, `scipy`, `matplotlib` and `pandas` through to `pyrolite`, which is built upon the others. To be able to use these libraries/packages, you'll need to **import** them. There are a few conventions for importing some common packages to reduce the amount of code you need to type which are handy to recognise. For example, instead of importing `pandas` like this:

In [None]:
import pandas

We can import `pandas` *as* `pd` like this, and thereafter reference it using `pd`:

In [None]:
import pandas as pd

Similarly, there are conventions for other libraries:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

If you're just after one part of a library, you can import the single class/function/submodule *from* that library:

In [None]:
from pyrolite.util.plot import save_figure

There's lots we haven't covered in much/any detail, including:
* classes and objects
* reading and writing files
* the rest of the standard Python library  ... 

Feel free to ask about any of this later, or do some online reading!

----
<div class='alert alert-warning'> <font size="+1" color="black"><b> Checkpoint & Time Check</b><br>How are things going?</font></div>

----

<div class='alert alert-success'><font color='black'>Below we'll have a quick look at the packages that pyrolite is built upon - particularly because pyrolite replicates their API (Application Programming Interface - how you interact with the code) and the outputs of pyrolite are typically objects created using these libraries.</font></div>


## `numpy`

`numpy` is a Python package for working with numeric data, with the array (type `numpy.ndarray`, typically created using `np.array`) being the core type of data you'd likely run into:

In [None]:
import numpy as np

arr = np.array([[0.6, 1.3, 2.0, 4.1], [0.2, 1.1, 1.9, 3.2]])
arr

In some ways arrays are a lot like lists, but can be quite a bit more performant when you're getting to larger amounts of data - simply because arrays are typically restricted to containing a single type of data, and the shapes which arrays can take are also restricted (e.g. such that for a multidimensional array you'll need to combine sequences of the same length).

In [None]:
arr.dtype

In [None]:
arr.shape
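
One practical consequence of this is that you can apply operations to a whole array at once, rather than looping over it like you would a list. A minimal sketch using the same values as above (the names `doubled`, `column_means` and `first_row` are just illustrative):

```python
import numpy as np

arr = np.array([[0.6, 1.3, 2.0, 4.1], [0.2, 1.1, 1.9, 3.2]])

# arithmetic is applied to every element of the array
doubled = arr * 2
# many operations can be applied along an axis; axis=0 averages down each column
column_means = arr.mean(axis=0)
# indexing works much like it does for lists, here giving the first row
first_row = arr[0]

print(doubled)
print(column_means)
print(first_row)
```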

---
## Getting Started with `pandas`

`pandas` is a Python package for working with tabular data, and in many ways could replace what most folks do in Excel. It provides an interface to your data in such a way that you'll be looking at more than just the numbers (in contrast to numpy), and allows you to index, subset, filter and otherwise manipulate your dataset based on indexes - specifically the column names and index values. Like `numpy`, it has some restrictions on the shape of your data, and values within each column all have the same data type.

The core objects you'll likely be working with in `pandas` are `pandas.DataFrame`s. You can build dataframes from a variety of sources - including numpy arrays, but also a number of different file types (using the `pd.read_<format>` functions). The `df.to_<format>` methods similarly save our dataframes into other formats. See the [docs](https://pandas.pydata.org/pandas-docs/stable/io.html) for a list of all the file types `pandas` can read and write.

In [None]:
import pandas as pd

df = pd.DataFrame(arr, columns=["A", "B", "C", "D"])
df

We can see that the dataframe has the column names we specified:

In [None]:
df.columns

But when we look at the `.index` (the other axis of the table), we haven't assigned anything so it'll use a default range of integers:

In [None]:
df.index

To get the underlying `numpy` array, we can use `.values`:

In [None]:
df.values

The data types are based on columnar formats - so there is one for each column:

In [None]:
df.dtypes

In a lot of cases, rather than constructing dataframes yourself, you'll likely want to read a file. Here we'll pull in some spinel geochemistry data from Norilsk, which contains data on the geochemical features of spinels found as inclusions within different phases. Note that `pandas` has a [range of import and export options](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) (and `geopandas` - which we might look at towards the end - has more).

This data is available as supplementary material in [Schoneveld, L., Barnes, S. J., Williams, M., Le Vaillant, M., and Paterson, D. (2020). Silicate and Oxide Mineral Chemistry and Textures of the Norilsk-Talnakh Ni-Cu-Platinum Group Element Ore-Bearing Intrusions. Economic Geology](http://doi.org/10.5382/econgeo.4747). We can see that each major-element mineral analysis below includes relevant context as to the data source, analysis, thin section location and the enclosing phase within which the spinel sits. The analyses include major oxides in weight percent, and calculated atoms per formula unit (apfu) for each of these cations:

In [None]:
df = pd.read_csv("../data/spinel/Schoneveld2020.csv")
df.head(2)

In [None]:
df.columns

In [None]:
df.index

In [None]:
df.dtypes  # note that here 'object' typically refers to strings - always worth checking if your numbers get converted to strings if you have 'NA', '<LOD' etc!
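
If you do find that a column of numbers has been read in as text, one common way to deal with it is `pd.to_numeric`. The sketch below uses a small made-up table (the sample names and `Ni_ppm` values are invented for illustration, and aren't from the Norilsk dataset):

```python
import pandas as pd

# a hypothetical table where a below-detection value has been recorded as text
raw = pd.DataFrame({"Sample": ["A1", "A2", "A3"], "Ni_ppm": ["120", "<LOD", "95"]})
print(raw.dtypes)  # the Ni_ppm column is text rather than numbers

# errors="coerce" converts anything which can't be parsed as a number to NaN
raw["Ni_ppm_numeric"] = pd.to_numeric(raw["Ni_ppm"], errors="coerce")
print(raw)
print(raw.dtypes)
```

Whether replacing values below detection with missing values is appropriate will depend on what you're doing with the data, so treat this as a starting point rather than a recommendation.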

---

#### Indexing

`pandas` allows you to index data in a number of ways (with a few restrictions). You can access individual columns using `df["<column_name>"]` or `df.column_name` (if there are no spaces in the column name!), which will return a `pandas.Series` (a single column, with the index of the dataframe):

In [None]:
df["Fe2_apfu"].head()

In [None]:
df.Fe2_apfu.head()

You can access multiple columns at once by indexing with a list rather than a single value, which will return a new dataframe containing just those columns (you can also do this to return a one-column dataframe!):

In [None]:
df[["Fe2_apfu", "Fe3_apfu"]].head()

If you also want to filter the dataframe based on the index (e.g. where a certain criterion is met), you can use the `df.loc` accessor (and in some cases `df.iloc`, where all you care about is the relative position of columns/rows, and not their names/values). To do this, you can use `df.loc[<index_filter>, <column_filter>]`:

In [None]:
df.loc[df.Mineral == "spinel", [c for c in df.columns if "apfu" in c]].dropna(how="any")
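
To see the difference between `.loc` and `.iloc` on something small enough to check by eye, here's a sketch using a tiny made-up table (the `demo` dataframe and its values are invented for this example, and only loosely mimic the column names used above):

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "Mineral": ["spinel", "olivine", "spinel"],
        "Cr_apfu": [1.2, 0.0, 0.9],
        "Fe2_apfu": [0.5, 0.3, 0.6],
        "MgO_pct": [10.1, 45.2, 11.3],
    },
    index=["s1", "s2", "s3"],
)

# .loc works with labels and boolean filters: spinel rows, apfu columns only
apfu_columns = [c for c in demo.columns if "apfu" in c]
by_label = demo.loc[demo["Mineral"] == "spinel", apfu_columns]

# .iloc works purely on position: first row, second column
by_position = demo.iloc[0, 1]

print(by_label)
print(by_position)
```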

In [None]:
df.columns

In [None]:
df.Site.unique()

---

#### Exporting Tables

In a similar manner to how we imported the CSV file above, we can export our table as-is (e.g. after some modification/cleaning/transformation):

In [None]:
df.to_csv("test_output.csv")
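
By default `to_csv` also writes the index as the first column of the file, which you may or may not want. As a sketch, the example below writes a small made-up table to a temporary folder without the index, then reads it back in to check it survived the round trip (the `table` dataframe and the use of a temporary folder are just for illustration):

```python
import os
import tempfile

import pandas as pd

table = pd.DataFrame({"A": [0.6, 0.2], "B": [1.3, 1.1]})

# write to a temporary folder so we don't leave files lying around
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test_output.csv")
    table.to_csv(path, index=False)  # index=False leaves out the row index
    reloaded = pd.read_csv(path)

print(reloaded)
```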

## Some Simple Plotting - Getting Started with `matplotlib`

`matplotlib` is a fairly extensive library of plotting tools originally written to avoid paying for MATLAB licenses! It's highly customisable (which is why it's used in pyrolite), but can also allow you to quickly produce simple plots from your data without too much fuss. With a bit more work you can be directly making publication quality (or better..) figures.

You can use it with array-based data, or if you already have them, use your `pandas` objects:

In [None]:
import matplotlib.pyplot as plt

plt.scatter(df["Cr_apfu"], df["Fe2_apfu"])

The core structures of `matplotlib` are a `figure` (the whole area of all plots) and `axes` (the area pertaining to a single pair of x-y or other axes):

In [None]:
fig, ax = plt.subplots(ncols=2, nrows=1)
# create a figure with two subplots/axes - note that 'ax' is a collection/array of axes!

This is just one way to create a figure; the same call can also be written with the shorthand `fig, ax = plt.subplots(1, 2, ...)` (note that the number of rows comes first, then the number of columns).
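
As a small sketch of what you get back from `plt.subplots` (the names `axes` and `single_ax` and the titles are just for illustration), you can check the shape of the returned collection of axes and refer to each one by its position:

```python
import matplotlib.pyplot as plt

# one row and two columns of axes, side by side
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))
print(type(axes), axes.shape)  # an array containing two axes

# each axes can then be accessed by index and modified separately
axes[0].set_title("Left")
axes[1].set_title("Right")

# with a single subplot you get back one axes object instead of an array
fig_single, single_ax = plt.subplots(1)
single_ax.set_title("Just one")
```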

On these axes we can add a variety of different plot types, like the scatter plot above or e.g. histograms:

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 4))  # one row, two columns
# plot a histogram on the first axis
ax[0].hist(df["Cr_apfu"], color="seagreen", bins=10)
ax[0].set(ylabel="Frequency", xlabel="Cr (apfu)")  # label the axes
# plot a histogram on the second axis
ax[1].hist(df["Al_apfu"], color="royalblue", bins=10)
ax[1].set(ylabel="Frequency", xlabel="Al (apfu)")  # label the axes

In [None]:
fig, ax = plt.subplots(1)
# scatter plot colored by Al2O3
cs = ax.scatter(df["MgO_pct"], df["FeO_pct"], c=df["Al2O3_pct"])
ax.set(xlabel="MgO", ylabel="FeO")  # set the axis labels

# create a colorbar for the current axes based on the scatter data
cb = plt.colorbar(cs, orientation="vertical")
# these $ allow you to put subscript/other LaTeX in the labels
cb.set_label("Al$_2$O$_3$", fontweight="bold", fontsize=12)

The `pandas` API also allows you to directly plot your data from the dataframe itself (this uses matplotlib in the background):

In [None]:
fig, ax = plt.subplots(1)
bins = np.linspace(0, 1.5, 20)

subdf = df.groupby("Enclosing Phase").hist(
    column="Cr_apfu", alpha=0.5, ax=ax, bins=bins, grid=False
)

# groupby plots the groups in sorted order, so sort the labels to match
ax.legend(sorted(df["Enclosing Phase"].dropna().unique()))

In [None]:
df["Cr_apfu"].groupby(df["Site"]).plot.hist(alpha=0.5, bins=bins)
plt.legend()

`pyrolite` in many ways aims to replicate this type of API which is centred around your data, allowing you to work directly from dataframes to transform your data and create visualisations.

---

Try playing around with some of the plotting functions above - or others you find through e.g. the `df.plot` interface:

In [None]:
df.plot?

----
<div class='alert alert-warning'> <font size="+1" color="black"><b> Checkpoint & Time Check</b><br>How are things going?</font></div>

----
