# Demonstration of the Statistics procedure in gstlearn

This file demonstrates the use of the Statistics functions applied to a Point data base and to Grids (in 2-D).

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

## Import packages

In [None]:
import numpy as np
import pandas as pd
import sys
import os
import matplotlib.pyplot as plt
import gstlearn as gl
import gstlearn.plot as gp

We first define the Grid file called *grid*. The grid contains three variables with randomly generated values (called "SG1", "SG2" and "SG3").

In [None]:
grid = gl.DbGrid.create(nx=[150,100])
ngrid = grid.getSampleNumber()
grid.addColumns(gl.VectorHelper.simulateGaussian(ngrid),"SG1",gl.ELoc.Z)
grid.addColumns(gl.VectorHelper.simulateGaussian(ngrid),"SG2",gl.ELoc.Z)
grid.addColumns(gl.VectorHelper.simulateGaussian(ngrid),"SG3",gl.ELoc.Z)
grid

We then define a Point Data Base called *data*, covering the extent of the grid. The data base contains three randomly generated variables (called "SD1", "SD2" and "SD3").

In [None]:
nech = 100
data = gl.Db.createFromBox(nech, grid.getCoorMinimum(), grid.getCoorMaximum())
data.addColumns(gl.VectorHelper.simulateGaussian(nech),"SD1",gl.ELoc.Z)
data.addColumns(gl.VectorHelper.simulateGaussian(nech),"SD2",gl.ELoc.Z)
data.addColumns(gl.VectorHelper.simulateGaussian(nech),"SD3",gl.ELoc.Z)
data

The following plot displays the variable *SG1* from the Grid Data Base (in color scale) and the variable *SD1* from the Point Data Base (in proportional symbols).

In [None]:
ax = grid.plot("SG1")
ax = data.plot(color="white")
ax.decoration(title="Data")

Note that all subsequent tests require a set of statistical operations. This list is defined once and for all, using the *fromKeys* utility to make the script more legible.

In [None]:
opers = gl.EStatOption.fromKeys(["NUM", "MEAN", "STDV"])

In the next paragraph, we calculate some monovariate statistics on the variables contained in the Point Data Base. For all methods, several calls are available, depending on:
- how the target variables are specified
- how the results are produced

In [None]:
gl.dbStatisticsMonoT(data, ["SD*"], opers = opers)
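
The `"SD*"` pattern above selects every variable whose name starts with `SD`. As a rough illustration only, assuming the pattern follows shell-style wildcard rules (an assumption: the exact matching rules used by gstlearn are not described here), the standard-library `fnmatch` module reproduces this selection on a hypothetical list of column names. The helper `expand` is not a gstlearn function.

```python
from fnmatch import fnmatchcase

# Hypothetical column names, for illustration only
names = ["rank", "x1", "x2", "SD1", "SD2", "SD3"]

def expand(patterns, names):
    """Return the names matching at least one pattern, in their original order."""
    return [n for n in names if any(fnmatchcase(n, p) for p in patterns)]

selected = expand(["SD*"], names)
print(selected)
```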

The next command produces the correlation matrix of the selected variables.

In [None]:
gl.dbStatisticsCorrelT(data, ["SD*"])
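
For reference, each entry of a correlation matrix is the Pearson coefficient between two variables. The sketch below computes it with the standard library on small made-up samples. It does not call gstlearn, and the helper names `pearson` and `correlation_matrix` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally sized samples."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Square matrix (list of rows) of pairwise Pearson coefficients."""
    return [[pearson(ci, cj) for cj in columns] for ci in columns]

# Made-up samples: the second is proportional to the first, the third is reversed
cols = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
corr = correlation_matrix(cols)
for row in corr:
    print(row)
```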

The following command prints the statistics on the selected variables (including the correlation matrix).

In [None]:
gl.dbStatisticsPrint(data, ["SD*"], opers=opers, flagCorrel=True)

The following command returns an array containing the evaluation of a given statistic for a set of variables contained in a Db.

If 'flagMono' is set to True, the statistic is calculated for each variable in turn. Otherwise, it is calculated for each variable using only the samples where one of the other variables is defined. In that case, the dimension of the output is equal to the square of the number of target variables.

In our case, the contents of these two outputs do not differ, as the data set is isotopic.

In [None]:
gl.dbStatisticsMulti(data, ["SD*"], gl.EStatOption.MEAN,  flagMono = True)

In [None]:
gl.dbStatisticsMulti(data, ["SD*"], gl.EStatOption.MEAN,  flagMono = False)
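
The two modes only differ on a heterotopic data set, where some samples lack some variables. The following sketch, written in plain Python on made-up values (it does not call gstlearn), contrasts the per-variable mean with the pairwise mean, where entry (i, j) is the mean of variable i restricted to the samples at which variable j is also defined. The helper names and the use of `None` for undefined values are assumptions made for illustration.

```python
def mean_defined(values):
    """Mean of the defined (non-None) values, or None if there are none."""
    vals = [v for v in values if v is not None]
    return sum(vals) / len(vals) if vals else None

def mean_mono(columns):
    """One mean per variable, each using all of its own defined samples."""
    return [mean_defined(c) for c in columns]

def mean_pairwise(columns):
    """Entry [i][j]: mean of variable i over samples where variable j is defined."""
    result = []
    for ci in columns:
        row = []
        for cj in columns:
            masked = [a if b is not None else None for a, b in zip(ci, cj)]
            row.append(mean_defined(masked))
        result.append(row)
    return result

# Made-up heterotopic data: z2 is undefined at two samples
z1 = [1.0, 2.0, 3.0, 4.0]
z2 = [10.0, None, 30.0, None]

mono = mean_mono([z1, z2])
pair = mean_pairwise([z1, z2])
print(mono)
print(pair)
```

On an isotopic data set every mask keeps all samples, so each row of the pairwise result simply repeats the per-variable value.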

## Using the Grid

We now calculate the statistics of the data contained in the Point Db, per cell of the output DbGrid. This function returns the results as an array of values (which has the dimension of the number of cells of the output Grid).

For these calculations, we consider a coarse grid overlaying the initial grid, whose mesh sizes are multiples of the initial ones.

In [None]:
gridC = grid.coarsify([5,5])
gridC

In [None]:
tab = gl.dbStatisticsPerCell(data, gridC, gl.EStatOption.MEAN, "SD1")
iuid = gridC.addColumns(tab, "Mean.SD1", gl.ELoc.Z)
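
Conceptually, a per-cell statistic assigns each point to the grid cell that contains it, then reduces the values cell by cell. Here is a minimal sketch in plain Python, assuming a regular 2-D grid described by the lower-left corner of its first cell, a cell size and a cell count. The function `mean_per_cell` and its parameters are hypothetical, not part of gstlearn, and `None` marks cells that contain no data.

```python
def mean_per_cell(points, values, origin, cell, ncell):
    """Mean of the values falling in each cell of a regular 2-D grid.

    Cells are stored row by row (x varies fastest); empty cells yield None.
    """
    nx, ny = ncell
    sums = [0.0] * (nx * ny)
    counts = [0] * (nx * ny)
    for (x, y), v in zip(points, values):
        ix = int((x - origin[0]) // cell[0])
        iy = int((y - origin[1]) // cell[1])
        # Points outside the grid are ignored
        if 0 <= ix < nx and 0 <= iy < ny:
            k = iy * nx + ix
            sums[k] += v
            counts[k] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

# Made-up points: two in the first cell, one in the second, one outside the grid
points = [(0.5, 0.5), (1.0, 4.0), (7.0, 2.0), (12.0, 9.0)]
values = [1.0, 3.0, 5.0, 7.0]

means = mean_per_cell(points, values, origin=(0.0, 0.0), cell=(5.0, 5.0), ncell=(2, 2))
print(means)
```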

In [None]:
ax = gridC.plot("Mean.SD1")

It may be more convenient to store the statistic (say the *Mean*) directly as new variables in the output Grid File. These calculations are performed for each input variable (i.e. each variable with a Z locator) in the input file.

In [None]:
data.setLocators(["SD*"],gl.ELoc.Z)
err = gl.dbStatisticsOnGrid(data, gridC, gl.EStatOption.MEAN)

As expected, the results for the first variable are similar to those of the previous calculation (as demonstrated by the scatter plot below). The statistics for the other variables have been calculated simultaneously.

In [None]:
ax = gp.correlation(gridC,namex="Mean.SD1",namey="Stats.SD1", bins=100)

More interesting is the ability to dilate the size of the cell while performing the calculations. Here, each grid node is dilated with a *ring* extension of 2: the initial node extension is multiplied by 5. As a result, very few cells contain no data within their dilated extent.

In [None]:
err = gl.dbStatisticsOnGrid(data, gridC, gl.EStatOption.MEAN, radius=2, 
                            namconv=gl.NamingConvention("Stats.Dilate"))
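
With a ring extension of `radius`, the neighborhood of a node spans `2 * radius + 1` cells along each axis, hence the factor 5 for a radius of 2. The sketch below, in plain Python on made-up sample counts, shows how such a dilation reduces the number of empty cells. The helper `dilated_count` is hypothetical and does not reproduce the gstlearn implementation.

```python
def dilated_count(counts, nx, ny, radius):
    """Number of samples within `radius` cells of each cell (square window)."""
    out = []
    for iy in range(ny):
        for ix in range(nx):
            total = 0
            # The window is clipped at the grid borders
            for jy in range(max(0, iy - radius), min(ny, iy + radius + 1)):
                for jx in range(max(0, ix - radius), min(nx, ix + radius + 1)):
                    total += counts[jy * nx + jx]
            out.append(total)
    return out

# Made-up 6 x 6 grid holding only two samples
nx, ny = 6, 6
counts = [0] * (nx * ny)
counts[2 * nx + 2] = 1  # one sample in cell (ix=2, iy=2)
counts[5 * nx + 5] = 1  # one sample in cell (ix=5, iy=5)

radius = 2
width = 2 * radius + 1  # window width, in cells, along each axis
empty_before = dilated_count(counts, nx, ny, 0).count(0)
empty_after = dilated_count(counts, nx, ny, radius).count(0)
print(width, empty_before, empty_after)
```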

In [None]:
ax = gridC.plot("Stats.Dilate.SD1")

The same feature can be used to calculate the dispersion variance of blocks (say the cells of the fine grid) within panels (say the cells of the coarse grid).

In [None]:
grid.setLocator("SG1",gl.ELoc.Z, cleanSameLocator=True)
err = gl.dbStatisticsOnGrid(grid, gridC, gl.EStatOption.VAR, radius=2, 
                            namconv=gl.NamingConvention("Var.Disp"))
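
As a reminder of what this quantity measures, the dispersion variance of blocks within a panel is the variance of the block values around the panel mean. Here is a minimal sketch in plain Python, assuming non-dilated panels made of whole blocks and the population (1/n) variance. Both are assumptions, the helper `variance_in_panels` is hypothetical, and the values are made up.

```python
def variance_in_panels(blocks, nx, ny, factor):
    """Population variance of block values inside each panel.

    `blocks` is a row-major nx x ny fine grid; each panel groups
    factor x factor blocks.
    """
    px, py = nx // factor, ny // factor
    out = []
    for jy in range(py):
        for jx in range(px):
            vals = [
                blocks[iy * nx + ix]
                for iy in range(jy * factor, (jy + 1) * factor)
                for ix in range(jx * factor, (jx + 1) * factor)
            ]
            m = sum(vals) / len(vals)
            out.append(sum((v - m) ** 2 for v in vals) / len(vals))
    return out

# Made-up 4 x 2 fine grid: the left panel is contrasted, the right one is constant
blocks = [1.0, 3.0, 2.0, 2.0,
          1.0, 3.0, 2.0, 2.0]
disp = variance_in_panels(blocks, nx=4, ny=2, factor=2)
print(disp)
```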

In [None]:
ax = gridC.plot("Var.Disp.SG1")
ax.decoration(title="Dispersion Variance of blocks into panels")