# Introduction to gstlearn and minigst
Emilie Chautru, Mike Pereira, and Thomas Romary


## Introduction

The **gstlearn** Python package is a cross-platform wrapper around the [gstlearn C++ Library](https://gstlearn.org) developed by the Geostatistics Team of the [Geosciences Research Center](https://www.geosciences.minesparis.psl.eu/).

To install the **gstlearn** Python package, you need Python 3.8 (or higher). You can then run the following pip command (commented out here):

In [None]:
#! pip install gstlearn[all]

Then, you can import gstlearn with:

In [None]:
import gstlearn as gl
import gstlearn.plot as gp

In [None]:
import numpy as np  # can be useful!
import pandas as pd
import matplotlib.pyplot as plt

## About C++ & python

* Remember that *gstlearn* is primarily a C++ package with a Python interface: as such, Python only sees *gstlearn* objects as pointers to their corresponding C++ objects. The full list of C++ classes and functions is given in the *gstlearn* API documentation, [available here](https://soft.mines-paristech.fr/gstlearn/doxygen-latest/).

* You can access the methods of a C++ object using the dot operator (`.`). For instance, if `db` is a Db object, the command `db.display()` calls the method `display` of the Db object.

* If you need to duplicate an object, a simple assignment (e.g. `db2 = db1`) is not enough: it only copies the pointer to the object. Use the `clone` method (`db2 = db1.clone()`) or call the copy constructor of the `Db` class (`db2 = gl.Db(db1)`).

* If you ask for the class type of a gstlearn object under python (e.g. `type(mygrid)`), you will obtain the C++ class name (e.g. gstlrn.DbGrid).
		
* If you want to recover the content of your objects in a future Python session, save them to 'Neutral Files' using the `dumpToNF` method before quitting Python. All classes that inherit from `ASerializable` have this capability. You can then retrieve your object in another session by using, for example, `gl.DbGrid.createFromNF` if `DbGrid` is the class of your object.
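
The difference between assignment and cloning described above is ordinary Python reference semantics. The sketch below illustrates it with a hypothetical stand-in class `Table` (not part of gstlearn), using `copy.deepcopy` in the role that `clone()` plays for gstlearn objects:

```python
import copy


class Table:
    """Hypothetical stand-in for a gstlearn object; not part of gstlearn."""

    def __init__(self, names):
        self.names = list(names)


t1 = Table(["Longitude", "Latitude"])
alias = t1                        # plain assignment: a second name for the same object
independent = copy.deepcopy(t1)   # real copy, playing the role of clone()

# Modify the original object after both "copies" were made
t1.names.append("Elevation")

print(alias is t1)        # the alias and the original are one object
print(alias.names)        # so the alias sees the modification
print(independent.names)  # the deep copy does not
```

Only the deep copy is unaffected by later changes to the original, which is why `clone()` is needed whenever two independent objects are wanted.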


In [None]:
# Load the minigst package

import minigst as mg

## The Db object

The Db objects are numerical databases for spatial analysis. They can be seen as pandas dataframes with a geographical context, in the sense that they are numerical tables containing spatial coordinates and variables "measured" at these spatial coordinates.

To create a Db object, you can simply convert a pandas dataframe using the `df_to_db` function. Note that any string in the original dataframe is converted to None in the Db object, as Db objects can only contain numerical values.

In [None]:
# Load Data
Scotland, _ = mg.data(
    "Scotland"
)  # <- "Scotland" is a data frame stored in the minigst package
print(Scotland.head())

# Create Db object from the dataframe
db = mg.df_to_db(df=Scotland, coord_names=["Longitude", "Latitude"])

# Print summary of content
db.display()

Db objects can be manipulated much like pandas dataframes. Indeed, calling `.toTL()` on a Db object turns it into a dataframe.

In [None]:
print(type(db.toTL()))
print(db.toTL().head())

You can also extract some of the variables/columns from the Db using one of the following commands.

In [None]:
## Extracts the variables "January_temp" and "Elevation" into a new dataframe
db[["January_temp", "Elevation"]]

## Extracts rows 5 to 13 of the variable "Elevation" (zero-based; following Python's slice convention, the end index 14 is excluded)
db[5:14, "Elevation"]

## Extracts rows 5 to 13 of the 4th column (i.e. "January_temp")
db[5:14, 3]

In [None]:
## You can use patterns with wildcards (here `*`) in variable names
db["*temp"]  # <- Extract all variables whose name ends with "temp"
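
To make the wildcard idea concrete, here is a minimal sketch using the standard-library `fnmatch` module on the column names used above. It only illustrates how a `*` pattern selects names; that Db name matching behaves identically for more complex patterns is an assumption, not something checked against gstlearn.

```python
from fnmatch import fnmatchcase

# Column names taken from the Scotland example above
names = ["Longitude", "Latitude", "Elevation", "January_temp"]

# "*" stands for any run of characters, so "*temp" means "ends with temp"
ends_with_temp = [n for n in names if fnmatchcase(n, "*temp")]
ends_with_tude = [n for n in names if fnmatchcase(n, "*tude")]

print(ends_with_temp)
print(ends_with_tude)
```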

You can add new variables to a Db using the `add_var_to_db` function.

In [None]:
import numpy as np

## Compute cos of coordinates
cosCoord = np.cos(db[["Longitude", "Latitude"]])

## Add them to the Db `db` under the names "cosLongitude","cosLatitude"
mg.add_var_to_db(db, var=cosCoord, vname=["cosLongitude", "cosLatitude"])

## Display Db
db

You can also delete variables from a Db using the `del_var_from_db` function.

In [None]:
## Delete the variables "cosLongitude","cosLatitude" from the Db `db`
mg.del_var_from_db(db, vname=["cosLongitude", "cosLatitude"])

## Display Db
db

You can plot the contents of a Db using the `plot` function of `gstlearn.plot` (imported above as `gp`), which relies on **matplotlib**. The plot can therefore be combined with other **matplotlib** functions.

In [None]:
# Plot the variables Elevation (for the size of the points) and Longitude (for the color of the points)
ax = gp.plot(db, nameSize="Elevation", nameColor="Longitude", sizmax=30)
plt.title("Elevation and Longitude")
plt.axis("equal")

You can add lines to a plot using the `add_lines` function of **minigst**, and points using the `scatter` function of **matplotlib**.

In [None]:
# Create a plot of the variable Elevation and store the result in a variable `ax`
ax = gp.plot(db, nameSize="Elevation", c="blue", sizmax=30)

# Add a red vertical line at the level v=300 to the plot
ax = mg.add_lines(v=300, c="red")

# Add two triangular, orange points to the plot at the coordinates (100,600) and (400,1100)
ax = plt.scatter(x=[100, 400], y=[600, 1100], c="orange", marker="^", s=20)
plt.axis("equal")

## The DbGrid object

The DbGrid objects are derived from Db objects, and are aimed at storing data that are located on a regular grid.

To create a DbGrid object (from scratch), you can use the `create_dbgrid` function.

In [None]:
# Define grid points
ngrid = 100  # Number of points in each dimension of the grid
xseq = np.linspace(0, 1, ngrid)  # Coordinates of the grid points in the x-axis
yseq = np.linspace(0, 1, ngrid)  # Coordinates of the grid points in the y-axis

# Create DbGrid
dbG = mg.create_dbgrid(coords=[xseq, yseq], coord_names=["xcoord", "ycoord"])
dbG

# Alternative way of creating the same DbGrid
dbG = mg.create_dbgrid(
    nx=[ngrid, ngrid], dx=[1 / (ngrid - 1), 1 / (ngrid - 1)], x0=[0, 0]
)
dbG
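
Both calls describe the same grid because `ngrid` nodes span `ngrid - 1` intervals, so covering [0, 1] requires a mesh size of `1 / (ngrid - 1)`. The plain-Python check below (no gstlearn involved) rebuilds the node coordinates along one axis from the origin, the mesh size and the number of nodes:

```python
import math

ngrid = 100             # number of nodes along one axis
x0 = 0.0                # coordinate of the first node
dx = 1 / (ngrid - 1)    # mesh size: ngrid nodes span ngrid - 1 intervals

# Node coordinates implied by (x0, dx, ngrid)
nodes = [x0 + i * dx for i in range(ngrid)]

print(len(nodes))                    # number of nodes
print(math.isclose(nodes[-1], 1.0))  # whether the last node reaches the upper bound
```

Using `1 / ngrid` instead would leave the last node short of 1, which is the usual off-by-one pitfall when converting between node coordinates and mesh sizes.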

Alternatively, you can convert a pandas dataframe using the `df_to_dbgrid` function.

In [None]:
# Load Grid data
_, ScotlandGrid = mg.data("Scotland")
print(ScotlandGrid.head())

# Create DbGrid from dataframe
dbG = mg.df_to_dbgrid(df=ScotlandGrid, coord_names=["Longitude", "Latitude"])
dbG

Since DbGrid objects are also Db objects, extracting, adding and deleting variables can be done in the same way. Plotting also works through `gp.plot`, which here receives the DbGrid and the name of the variable to display.

In [None]:
# Plot the variable Elevation by color using the "RdBu" palette
gp.plot(dbG, "Elevation", cmap="RdBu")
plt.axis("equal")
plt.show()

# Plot the variable Elevation by contour
# mg.dbplot_grid(dbG,contour="Elevation",cmap = "RdBu",nlevels = 15)

As before, you can add lines, points and even plots of additional Db objects into a single plot.

In [None]:
# Plot the variable Elevation of the DbGrid `dbG` (by color using the "RdBu" palette)
gp.plot(dbG, "Elevation", cmap="RdBu")

# Add two triangular, orange points to the plot at the coordinates (100,600) and (400,1100)
plt.scatter(x=[100, 400], y=[600, 1100], c="orange", marker="^", s=30)

# Add a red vertical line at the level v=300 to the plot
mg.add_lines(v=300, c="red")

# Add a plot of the variable "January_temp" in the Db object `db` created earlier
gp.plot(db, nameSize="January_temp", c="gray", sizmax=20)
ax = plt.axis("equal")

## Selection

You can add a mask/selection to a Db or DbGrid object to mask off part of the points in the database. Once specified, any function (e.g. plotting) applied to the Db is only applied to the active/selected samples. This can be done with the function `add_sel`, which expects a binary variable specifying which samples should be kept. You can remove a selection by calling the function `clear_sel`.

In [None]:
## Display and plot the Db (before adding the selection)
dbG.display()
gp.plot(dbG, "Longitude")
plt.axis("equal")
plt.show()
## Create binary variable equal to 1 when the variable "Longitude" of `dbG` is greater than 250
binarySel = dbG["Longitude"] > 250

## Add selection
mg.add_sel(dbG, binarySel)

## Display Db  (after adding the selection)
dbG.display()
gp.plot(dbG, "Longitude")
plt.axis("equal")
plt.show()
## Remove selection
mg.clear_sel(dbG)

## Display Db (after removing the selection)
dbG.display()
gp.plot(dbG, "Longitude")
plt.axis("equal")
plt.show()
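
The effect of a selection can be pictured without gstlearn. In the sketch below, the longitude values are hypothetical and plain lists stand in for the Db column; it only shows which samples a binary variable such as `binarySel` keeps active, while the underlying data stay untouched:

```python
# Hypothetical longitudes standing in for the "Longitude" column of a Db
longitude = [120.0, 260.0, 250.0, 310.5, 95.2]

# Binary selection: True where the sample stays active (strictly greater than 250)
binary_sel = [x > 250 for x in longitude]

# Values that a computation restricted to the selection would see
active = [x for x, keep in zip(longitude, binary_sel) if keep]

print(binary_sel)
print(active)
print(len(longitude))  # the full data are still there; only the mask changes
```

Clearing the selection corresponds to dropping `binary_sel`: all samples become active again because nothing was deleted.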

## Exercise

We start by loading the data Meuse (coming from the `sp` R package). We load two data frames:

  * `meuse` is a dataframe containing metal concentrations measured along the Meuse river in the Netherlands. It contains the following variables:
    -   **x** and **y**: easting and northing (m) coordinates
    -   **cadmium**, **copper**, **lead**, **zinc**: topsoil heavy metal concentrations (ppm) (NB: obtained from composite samples **15m x 15m**)
    -   **elev**: relative elevation above the river (m)
    -   **dist**: distance to the river (normalized between 0 and 1)
    -   **om**, **soil**, **lime**: soil characteristics (content of organic matter, type of soil, presence of lime)
    -   **ffreq**: flooding frequency class: 1 = once in two years; 2 = once in ten years; 3 = once in 50 years
    -   **landuse**: landuse classes
    -   **dist.m**: distance to the river (m)
    
  * `meuse.grid` is a dataframe describing a grid covering the Meuse river (and the samples in the `meuse` dataframe). It contains the following variables:
    -   **x** and **y**: easting and northing (m) coordinates
    -   **dist**: distance to the river (normalized between 0 and 1)
    -   **soil**: soil type
    -   **ffreq**: flooding frequency class: 1 = once in two years; 2 = once in ten years; 3 = once in 50 years
    -   **part.a, part.b**: arbitrary division of the area in two areas, a and b

1. Create a Db object from the Meuse dataset (dataframe `meuse`). Remember to set the correct variables as coordinates.

2. Add the log-concentrations of metals to the Db.

3. Compute basic statistics of each heavy metal log-concentration (see the function `summaryStats`)

4. Plot each heavy metal log-concentration. 

5. Plot each heavy metal log-concentration, but only the samples with a distance to the river smaller than 0.25. 

6. Create a **DbGrid** from the Meuse dataset (dataframe `meuse.grid`). Remember to set the correct variables as coordinates.

7. Plot the map of soil characteristics from the resulting DbGrid (use the argument `cat_color` in the `dbplot_grid` function).

Note: to load the Meuse files:

In [None]:
import minigst as mg

meuse, meuse_grid = mg.data("Meuse")