# Introduction to gstlearn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## gstlearn
import gstlearn as gl
import gstlearn.plot as gp  ## for plots

The aim of this practical session is to introduce you to a few useful utilities of the gstlearn package. We will work with the *meuse* dataset (from <https://www.rdocumentation.org/packages/sp/versions/1.3-1/topics/meuse>). It contains:

* **x** and **y**: easting and northing (m) coordinates
* **cadmium**, **copper**, **lead**, **zinc**: topsoil heavy metal concentrations (ppm) (NB: obtained from composite samples **15m x 15m**)
* **elev**: relative elevation above the river (m)
* **dist**: distance to the river (normalized between 0 and 1)
* **om**, **soil**, **lime**: soil characteristics (content of organic matter, type of soil, presence of lime)
* **ffreq**: flooding frequency
* **landuse**: landuse classes
* **dist.m**: distance to the river (m)

1. Import and explore the meuse dataset ('meuse.csv').

In [None]:
## Import through Pandas
meuse=pd.read_csv("meuse.csv")

## Turn concentrations into log-concentrations
for i in range(2,6):
    meuse.iloc[:,i]=np.log(meuse.iloc[:,i])
    
## Summary statistics
meuse.describe()
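
The positional loop above relies on the four metal columns sitting at positions 2 to 5 of the table. As a minimal sketch, assuming the columns carry the names listed in the introduction, the same transformation can be written by name. The small table below is a stand-in with illustrative values, not data read from 'meuse.csv', and the names `toy`, `toy_log` and `metal_cols` are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for the meuse table (illustrative values, not read from the file)
toy = pd.DataFrame({
    "x": [181072.0, 181025.0, 181165.0],
    "y": [333611.0, 333558.0, 333537.0],
    "cadmium": [11.7, 8.6, 6.5],
    "copper": [85.0, 81.0, 68.0],
    "lead": [299.0, 277.0, 199.0],
    "zinc": [1022.0, 1141.0, 640.0],
})

## Columns to transform, selected by name rather than by position
metal_cols = ["cadmium", "copper", "lead", "zinc"]

## Work on a copy so that the raw concentrations stay available
toy_log = toy.copy()
toy_log[metal_cols] = np.log(toy[metal_cols])

## The coordinates are left untouched; only the metal columns change
print(toy_log.describe())
```

Selecting by name keeps the transformation correct if the file gains an extra leading column, at the cost of hard-coding the variable names.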

2. Create a **Db** from the meuse dataset using the gstlearn *Db_fromPanda* function.

In [None]:
db_meuse=gl.Db_fromPanda(meuse)
db_meuse.display()

3. Set the correct variables as coordinates and the Cadmium concentration as the regionalized variable of interest, by assigning them the appropriate locators using the *setLocators* method of the *Db* object.

In [None]:
## Define the variables 'x' and 'y' as coordinates -> Locator "x"
## The cleanSameLocator option clears any prior assignment of this locator to other variables of the Db
db_meuse.setLocators(['x','y'],gl.ELoc.X,cleanSameLocator=True)

## Define the variable 'cadmium' as the variable of interest -> Locator "z"
db_meuse.setLocators(['cadmium'],gl.ELoc.Z,cleanSameLocator=True)

db_meuse.display()

4. Plot each heavy metal concentration. 

In [None]:
## Names of all the variables in the Db
all_names=db_meuse.getAllNames()

## Metal names
metal_names=all_names[2:6]
print("Metal names :",metal_names)

## For loop for plots: The color of the points indicates the concentration
for i in range(len(metal_names)):
    fig, ax = gp.initGeographic()
    ax.symbol(db_meuse, nameColor=metal_names[i],  flagLegendColor=True, legendNameColor="Concentration")
    ax.decoration(title=metal_names[i], xlabel="x", ylabel="y")
    plt.show()

In [None]:
## For loop for plots: The size of the points indicates the concentration
for i in range(len(metal_names)):
    fig, ax = gp.initGeographic()
    ax.symbol(db_meuse,nameSize=metal_names[i], flagLegendSize=True, legendNameSize="Concentration")
    ax.decoration(title=metal_names[i], xlabel="x", ylabel="y")
    plt.show()

5. Compute basic statistics of each heavy metal concentration (using the *dbStatisticsMono* function)

In [None]:
## Names of all the variables in the Db
all_names=db_meuse.getAllNames()
## Metal names
metal_names=all_names[2:6]

## Compute statistics: Mean, Min, Max, Variance, Standard-dev
gl.dbStatisticsMono(db_meuse,
                    names=metal_names,
                    opers=gl.EStatOption.fromKeys(["MEAN","MINI","MAXI","VAR","STDV"]))
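
To make the five requested statistics concrete, here is a small NumPy sketch of what each key stands for. It is not the gstlearn implementation: the helper `basic_stats` and its `ddof` parameter are hypothetical, and the source does not say whether gstlearn normalizes the variance by n or by n-1, so both can be tried here.

```python
import numpy as np

def basic_stats(values, ddof=0):
    """Mean, min, max, variance and standard deviation of the defined values.

    ddof=0 divides the variance by n, ddof=1 by n-1 (assumption: the
    convention used by gstlearn is not stated in this notebook).
    """
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  ## ignore undefined values
    return {
        "MEAN": values.mean(),
        "MINI": values.min(),
        "MAXI": values.max(),
        "VAR": values.var(ddof=ddof),
        "STDV": values.std(ddof=ddof),
    }

## Toy values with one undefined entry
stats = basic_stats([1.0, 2.0, 3.0, 4.0, np.nan])
print(stats)
```

Comparing the output of this helper with the table returned by *dbStatisticsMono* on one variable is a quick way to check which variance convention is in use.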


6. Fit the regression line of the coordinate **y** as a function of the coordinate **x** (we use the *ols* function from the *statsmodels* package, which defines the regression through an R-style formula and also handles categorical variables). Compare the basic statistics of each heavy metal concentration above and below the regression line. (Use the locator **ELoc.SEL** to mask samples.)


In [None]:
## For linear regressions (using R-style formulas to define regression)
from statsmodels.formula.api import ols

## Fit the regression line
## The toTL method converts a gstlearn Db into a Pandas dataframe
reg_line = ols(formula='y ~ x', data=db_meuse.toTL()).fit()
# print(reg_line.summary())

## Plot the line together with the Cadmium concentration values: all points are shown, since any prior selection is cleared first
db_meuse.clearLocators(gl.ELoc.SEL) ## Clear any prior selection on the points
fig, ax = gp.initGeographic()
ax.symbol(db_meuse,nameSize='cadmium', flagLegendSize=True, legendNameSize="Concentration")
ax.plot(db_meuse['x'],reg_line.predict(db_meuse.toTL()),color="black",linewidth=2)
ax.decoration(title='cadmium', xlabel="x", ylabel="y")
plt.show()

In [None]:
## Create binary variable for points above the line
db_meuse["Sel_above"]=db_meuse['y'] > reg_line.predict(db_meuse.toTL())

## Use the binary variable as a selection
db_meuse.setLocators(["Sel_above"],gl.ELoc.SEL)

## Plot the line together with the Cadmium concentration values: Only the selected points remain
fig, ax = gp.initGeographic()
ax.symbol(db_meuse,nameSize='cadmium', flagLegendSize=True, legendNameSize="Concentration")
ax.plot(db_meuse['x'],reg_line.predict(db_meuse.toTL()),color="black",linewidth=2)
ax.decoration(title='cadmium', xlabel="x", ylabel="y")
plt.show()

## Compute the statistics as before: they are computed while considering only the selected points
gl.dbStatisticsMono(db_meuse,
                    names=metal_names,
                    opers=gl.EStatOption.fromKeys(["MEAN","MINI","MAXI","VAR","STDV"]))


In [None]:
## Create binary variable for points below the line
db_meuse["Sel_below"]=db_meuse['y'] < reg_line.predict(db_meuse.toTL())

## Use the binary variable as a selection
db_meuse.setLocators(["Sel_below"],gl.ELoc.SEL)

## Plot the line together with the Cadmium concentration values: Only the selected points remain
fig, ax = gp.initGeographic()
ax.symbol(db_meuse,nameSize='cadmium', flagLegendSize=True, legendNameSize="Concentration")
ax.plot(db_meuse['x'],reg_line.predict(db_meuse.toTL()),color="black",linewidth=2)
ax.decoration(title='cadmium', xlabel="x", ylabel="y")
plt.show()


## Compute the statistics as before: they are computed while considering only the selected points
gl.dbStatisticsMono(db_meuse,
                    names=metal_names,
                    opers=gl.EStatOption.fromKeys(["MEAN","MINI","MAXI","VAR","STDV"]))
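
The selection logic of the last three cells can be sketched without gstlearn or statsmodels. The example below is only an illustration under stated assumptions: it swaps *ols* for `np.polyfit`, uses made-up arrays `x`, `y` and `z`, and replaces the **SEL** locator with plain boolean masks.

```python
import numpy as np

## Toy coordinates and a toy variable of interest (made-up values)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.5, 0.5, 2.5, 2.5, 4.5])
z = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

## Least-squares line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept

## Boolean masks playing the role of the selections "Sel_above" and "Sel_below"
above = y > fitted
below = y < fitted

## Statistics restricted to each group
print("Mean above the line:", z[above].mean())
print("Mean below the line:", z[below].mean())
```

Because both comparisons are strict, a point lying exactly on the line would belong to neither group. This is also the case for the two selections defined in the cells above.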