# Using Data Base in gstlearn

In this preamble, we load the **gstlearn** package.

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

In [None]:
import gstlearn as gl
import gstlearn.plot as gp
import matplotlib.pyplot as plt
import numpy as np
import os
import urllib.request

flagInternetAvailable = True  # Set to False if no internet access is available

# Main classes

This is the (non-exhaustive) list of classes (of objects) in gstlearn:

* Db, DbGrid: numerical data base
* DirParam, VarioParam and Vario: experimental variograms
* Model: variogram model
* Neigh: neighborhood
* Anam: gaussian anamorphosis
* Polygon: 2-D polygonal shapes
* Rule: lithotype rule for thresholds used for truncated plurigaussian models

# Importing External File

## Loading a CSV file

We start by downloading the ASCII file *Scotland_Temperatures.csv* (organized as a CSV file) into the current working directory and we keep its path in the variable *temp_csv*. We then load it into a Pandas data frame (named *datcsv*) using `pd.read_csv`. Note that the keyword "MISS" is used in this file to indicate a missing value. Such values are replaced by NaN.

In [None]:
fileCSV='Scotland_Temperatures.csv'
if flagInternetAvailable:
    temp_csv, head = urllib.request.urlretrieve('https://soft.minesparis.psl.eu/gstlearn/data/Scotland/'+fileCSV,'./'+fileCSV)
else:
    temp_csv='./'+fileCSV

In [None]:
import pandas as pd
datcsv = pd.read_csv(temp_csv, na_values="MISS")
datcsv

We can check the contents of the data frame (by simply typing its name) and see that it contains four columns (respectively called *Longitude*, *Latitude*, *Elevation*, *January_temp*) and 236 rows (header line excluded).
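
To make the handling of the "MISS" keyword concrete, here is a minimal sketch using only the Python standard library. The two data rows are made up for illustration (they are not taken from the real file), and the helper `to_float` is a hypothetical name, not part of Pandas or **gstlearn**.

```python
import csv
import io
import math

# Made-up sample with the same header as the file (values are illustrative only)
sample = """Longitude,Latitude,Elevation,January_temp
100.0,900.0,50,MISS
200.0,800.0,120,2.5
"""

def to_float(text, na_string="MISS"):
    """Convert a CSV cell to float, mapping the missing-value keyword to NaN."""
    return math.nan if text == na_string else float(text)

# Read each record as a dictionary and convert every cell
rows = [
    {key: to_float(value) for key, value in record.items()}
    for record in csv.DictReader(io.StringIO(sample))
]

# Count the samples where the target variable is not informed
n_missing = sum(math.isnan(row["January_temp"]) for row in rows)
print(n_missing)
```

The same convention (a conventional string turned into NaN at reading time) is what the `na_values` argument requests from Pandas.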

## Creating Db object from a Pandas frame

The user can then create a database of the **gstlearn** package (*Db* class) directly from the previously imported Pandas frame.

In [None]:
# Create an empty Db
dat = gl.Db()
# And import all columns in a loop using the [] operator
for field in datcsv.columns :
    dat[field] = datcsv[field]
dat

## Creating Db object directly from CSV file

These operations can be performed in a single step by reading the CSV file again and loading it directly into a Db.

Note that we introduce a *CSVformat* description where we can specify the particularities of the file to be read. In particular, we can tell how the conventional value used for coding missing information is spelled.

In [None]:
csv = gl.CSVformat.create(flagHeader=True, naString = "MISS")
dat = gl.Db.createFromCSV(temp_csv, csv=csv)
dat

Note that a "rank" variable has been automatically added. The *rank* is always 1-based and must be distinguished from an *index* (0-based). The *rank* variable may be useful later for certain functions of the **gstlearn** package.

## Importing Db File from a "Neutral File"

A last solution is to import it directly from the set of demonstration files (provided together with the package), which are stored in a specific format (Neutral File). The path of the file is kept in the variable *temp_nf*.

These *NF* (or Neutral Files) are currently used for serialization of the gstlearn objects. They will probably be replaced in the future by a facility that backs up the whole workspace in one step.

Note that the contents of the Db are slightly different from the result obtained when reading from CSV. Essentially, some variables have a *Locator* field defined, some do not. This concept will be described later in this chapter and the difference can be ignored for now.

In [None]:
fileNF='Scotland_Temperatures.NF'
if flagInternetAvailable:
    temp_nf, head = urllib.request.urlretrieve('https://soft.minesparis.psl.eu/gstlearn/data/Scotland/'+fileNF,'./'+fileNF)
else:
    temp_nf='./'+fileNF

dat = gl.Db.createFromNF(temp_nf)
dat

# Discovering Db

## The Db class

*Db* objects (like all objects that inherit from *AStringable*) have a `display` method which prints a summary of the contents of the data base. The same printout is obtained when typing the name of the variable at the end of a cell (see above).

In [None]:
dat.display()

There, we can check that the 4 initial fields have been considered, in addition to a first one, automatically called *rank*, for a total of 5 columns (the information regarding *UID* will not be addressed in this chapter).

We can check that each field is assigned to a numbered *Column* (0-based index). Finally the total number of samples is 236 as expected.

In addition, the printout tells you that this data base is defined in a 2-D space: this will be described later together with the use of the *Locator* information.


To get more information on the contents of the Db, it is possible to build a *DbStringFormat* object and pass it to the *display* method. There are several ways to specify the type of information that is searched for (see the documentation of this class for details): typically here we ask for statistics but restrict them to a list of variables.

In [None]:
dbfmt = gl.DbStringFormat.createFromFlags(flag_stats=True, names=["Elevation", "January_temp"])
dat.display(dbfmt)

Monovariate statistics are better displayed using a single function called *dbStatisticsMonoT*. This function expects a vector of enumerators of type EStatOption as statistic operators. Such a vector is created using a static function called *fromKeys*, which is available in all enumerator classes (i.e. those that inherit from *AEnum*).
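
For intuition about what the keys "MEAN", "MINI" and "MAXI" stand for, the following plain-Python sketch computes the same three quantities on a made-up list. It assumes that undefined values (NaN) are skipped, and the function `summarize` is a hypothetical helper, not a **gstlearn** function.

```python
import math

def summarize(values):
    """Return mean, minimum and maximum over the defined (non-NaN) values."""
    defined = [v for v in values if not math.isnan(v)]
    if not defined:
        # No defined value: every statistic is undefined
        return {"MEAN": math.nan, "MINI": math.nan, "MAXI": math.nan}
    return {
        "MEAN": sum(defined) / len(defined),
        "MINI": min(defined),
        "MAXI": max(defined),
    }

# Made-up values; the NaN plays the role of a "MISS" entry
temps = [1.0, math.nan, 3.0, 5.0]
stats = summarize(temps)
print(stats)
```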

In [None]:
gl.dbStatisticsMonoT(dat,
                     names=["Elevation", "January_temp"],
                     opers=gl.EStatOption.fromKeys(["MEAN","MINI","MAXI"]))

## Accessors for Db class

We can also consider the data base as a 2D array and use the *[  ]* accessors. The following usage shows the whole content of the data base dumped as a 2D **Numpy array**.

In [None]:
dat[:]

We can access one or several variables. Note that the contents of the Column corresponding to the target variable (i.e. *January_temp*) are produced as a 1D **numpy array**.

Also note the presence of samples with *nan* corresponding to those where the target variable is not informed (*'MISS'* in the original dataset file).

In [None]:
dat["January_temp"]

The access can be more restrictive, as in the following cell, where we only consider the samples 10 to 15 and the variables *rank*, *Latitude*, *Elevation*. Remember that indices run from 0 to N-1. The index slice '10:15' in Python means indices {10,11,12,13,14} (different from R), which corresponds to ranks {11,12,13,14,15}.
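
The correspondence between 0-based indices and 1-based ranks can be checked in plain Python, independently of **gstlearn**. This is a small sketch where the sample count of 236 is the one reported for this data base.

```python
n_samples = 236

# A Python slice excludes its upper bound
indices = list(range(n_samples))[10:15]

# The rank of a sample is its index plus one
ranks = [i + 1 for i in indices]

print(indices)  # [10, 11, 12, 13, 14]
print(ranks)    # [11, 12, 13, 14, 15]
```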

In [None]:
dat[10:15, ["rank", "Latitude", "Elevation"]]

We can also replace the variable *Name* by its *Column* index, although this is not recommended as the Column index may vary over time.

In [None]:
dat[10:15, 2:4]

A particular function is available to convert the whole data base into an appropriate object of the Target Language (here Python). A gstlearn *Db* is converted into a *Pandas frame* using **toTL**.

In [None]:
dat.toTL()

Note that a variable whose name does not exist in the data base (here *newvar*) is created on the fly. Also note that variables may be designated using wildcard patterns, where the symbol '*' replaces any sequence of characters:
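
To see which names a pattern such as "*temp" would select, the wildcard semantics can be mimicked with the standard `fnmatch` module. This is only an illustration of the matching rule on the five variable names of this data base; it does not claim to reproduce how **gstlearn** implements name expansion, and `expand` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Variable names present in the data base at this stage
names = ["rank", "Longitude", "Latitude", "Elevation", "January_temp"]

def expand(pattern, candidates):
    """Return the candidate names matching a case-sensitive wildcard pattern."""
    return [name for name in candidates if fnmatchcase(name, pattern)]

matched = expand("*temp", names)
print(matched)  # ['January_temp']
```

A pattern matching several names (for example "L*") would return several variables, so a pattern used as a single operand should be chosen to match exactly one name.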

In [None]:
dat["newvar"] = 12.3 * dat["Elevation"] - 2.1 * dat["*temp"]
dat

The user can also remove a variable from the data base as follows:

In [None]:
dat.deleteColumn("newvar")
dat.display()

## Locators

The locators are used to specify the **role** assigned to a Column for the rest of the study (unless changed further). The *locator* is characterized by its name (*Z* for a variable and *X* for a coordinate) within the Enumerator *ELoc*.

In [None]:
dat.setLocators(["Longitude","Latitude"], gl.ELoc.X)
dat.setLocator("January_temp", gl.ELoc.Z)
dat

As can be seen in the printout, variables *Longitude* and *Latitude* have been designated as coordinates (pay attention to the order: *Longitude* comes first) and *January_temp* is the (unique) variable of interest. Therefore any subsequent step will be performed as a monovariate 2-D process.

The locator is translated into a *letter*,*number* pair for better legibility: e.g. *x1* for the first coordinate.
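
The following sketch illustrates how such *letter*,*number* labels follow from the order in which the names are given. It is a plain-Python illustration under the assumption that numbering starts at 1 within each role; `locator_labels` is a hypothetical helper, not a **gstlearn** function.

```python
def locator_labels(names, letter):
    """Pair each name with a label made of the role letter and a 1-based number."""
    return {name: f"{letter}{i}" for i, name in enumerate(names, start=1)}

# Same order as in the setLocators call: Longitude first, then Latitude
coords = locator_labels(["Longitude", "Latitude"], "x")
target = locator_labels(["January_temp"], "z")

print(coords)  # {'Longitude': 'x1', 'Latitude': 'x2'}
print(target)  # {'January_temp': 'z1'}
```

Swapping the two names in the list would swap *x1* and *x2*, which is why the order matters.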

Here are all the **roles** known by **gstlearn**:

In [None]:
gl.ELoc.printAll()

# More with Db

## Plotting a Db

We plot the contents of a Db using functions of the **gstlearn.plot** sub-package (which relies on **matplotlib**). The color option (**name_color**) is used to represent the **January_temp** variable.

Note: Non-available values (NaN) are converted into 0 for display purposes. This behavior will be modified and made tunable in future versions.

In [None]:
fig, ax = gp.initGeographic()
ax.symbol(dat, name_color="January_temp", flagLegend=True, legendName="Temperature")
ax.decoration(title="January Temperature", xlabel="Easting", ylabel="Northing")
plt.show()

A more elaborate graphic representation displays the samples with a symbol size proportional to the Elevation (**name_size**) and a color representing the Temperature (**name_color**).

In [None]:
fig, ax = gp.initGeographic()
ax.symbol(dat, name_size="Elevation", name_color="*temp", flagLegend=True, legendName="Elevation")
ax.decoration(title="January Temperature", xlabel="Easting", ylabel="Northing")
plt.show()

Of course, you can use your own graphical routines (for example, a direct call to **matplotlib**) by simply accessing the *gstlearn* data base values (using the '[ ]' accessor):

In [None]:
plt.figure(figsize=(20,8))
plt.scatter(dat["x1"], dat["x2"], s=20, c=dat["*temp"]) # Locator or variable name is OK
plt.title("January Temperatures")
plt.xlabel("Easting")
plt.ylabel("Northing")
plt.colorbar(label="Temperature (°C)")
plt.gca().set_aspect('equal') # Respect aspect ratio
plt.show()

## Grid Data Base

On the same area, a terrain model is available (as a demonstration file available in the package distribution). We first download it and create the corresponding data base defined on a grid support (*DbGrid*).

In [None]:
fileNF='Scotland_Elevations.NF'
if flagInternetAvailable:
    elev_nf, head = urllib.request.urlretrieve('https://soft.minesparis.psl.eu/gstlearn/data/Scotland/'+fileNF,'./'+fileNF)
else:
    elev_nf='./'+fileNF

grid = gl.DbGrid.createFromNF(elev_nf)
grid

We can check that the grid consists of 81 columns and 137 rows, i.e. 11097 grid cells. We can also notice that some locators are already defined (this information is stored in the Neutral File).
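
The cell count quoted above is simply the product of the two grid dimensions, as this short check shows (the variable names `nx` and `ny` are chosen here for illustration):

```python
nx, ny = 81, 137  # number of columns and rows of the grid

n_cells = nx * ny
print(n_cells)  # 11097
```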


## Selection

We can check the presence of a variable (called *inshore*) which is assigned to the *sel* locator: this corresponds to a *Selection* which acts as a binary filter: some grid cells are active and others are masked off. The count of active samples is given in the previous printout (3092). This selection remains active until the locator 'sel' is replaced or deleted (there may not be more than one selection defined at a time per data base). This is what can be seen in the following display where the *Elevation* is automatically represented **only** within the *inshore* selection.

Note that any variable (having values equal to 0/1 or True/False) can be considered as a Selection: it must simply be assigned to the *sel* locator using the *setLocator* method described earlier.
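
The effect of a selection can be pictured as a 0/1 mask applied to the samples. The sketch below uses made-up values and plain Python lists; it illustrates the filtering idea only and is not the **gstlearn** implementation.

```python
# Made-up values for five grid cells (illustrative only)
elevation = [12.0, 0.0, 250.0, 0.0, 87.0]
inshore = [1, 0, 1, 0, 1]  # 1 = active cell, 0 = masked-off cell

# Keep only the values of the active cells
active = [value for value, keep in zip(elevation, inshore) if keep]

print(len(active))  # 3
print(active)       # [12.0, 250.0, 87.0]
```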

In [None]:
fig, ax = gp.initGeographic()
ax.raster(grid, name="Elevation", flagLegend=True)
ax.decoration(title="Elevation", xlabel="Easting", ylabel="Northing")
plt.show()

## Final plot

On this final plot, we combine grid and point representations.

In [None]:
fig, ax = gp.initGeographic()
ax.raster(grid, name="Elevation", flagLegend=True)
ax.symbol(dat, name_size="*temp", flagLegend=True, legendName="Temperature", sizmin=10, sizmax=30, c="yellow")
ax.decoration(title="Elevation and Temperatures", xlabel="Easting", ylabel="Northing")
plt.show()