# Cross-validation

<!-- SUMMARY: Example of Cross-validation calculation with various output formats -->

<!-- CATEGORY: Methodology -->

## Import packages

In [None]:
import numpy as np
import pandas as pd
import sys
import os
import matplotlib.pyplot as plt
import gstlearn as gl
import gstlearn.plot as gp
import gstlearn.document as gdoc

gdoc.setNoScroll()

## Introduction

This is a small script which illustrates the capabilities of the cross-validation feature within *gstlearn*.

We generate a fictitious data set (by sampling a given simulation). Then we will use this fictitious data set to demonstrate the cross-validation tool.

We generate a data set composed of a few samples located randomly within a 100 by 100 square.

The number of data can be set to a small number (*10* for example) in order to make the results more legible. However, you can set it to a more reasonable value (say *100*) to see better results (avoid large numbers, as a test is performed in Unique Neighborhood). In the latter case, you may switch OFF the variable 'turnPrintON' to remove tedious printouts.

In [None]:
nech = 10
turnPrintON = True

data = gl.Db.createFromBox(nech, [0,0], [100,100])
gp.plot(data)
gp.decoration(title="Measurement location")

We define a model (spherical structure with range 30 and sill 4) with the Universality condition, and perform a non-conditional simulation at the data locations. The simulated values, renamed as *data*, now become the data set.

In [None]:
model = gl.Model.createFromParam(type=gl.ECov.SPHERICAL,range=30,sill=4)
model.setDriftIRF(0)
err = gl.simtub(None,data,model)
data.setName("Simu","data")
gp.plot(data, nameSize="data")
gp.decoration(title="Measurement values")
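
As a reminder of what the spherical structure used above looks like, here is a minimal NumPy sketch of the textbook spherical covariance with range 30 and sill 4. The function name `spherical_cov` and the sample lags are illustrative assumptions; this is not the *gstlearn* implementation.

```python
import numpy as np

def spherical_cov(h, range_=30.0, sill=4.0):
    """Textbook spherical covariance (hypothetical helper, not a gstlearn call)."""
    h = np.abs(np.asarray(h, dtype=float))
    # Beyond the range the covariance is zero, so the reduced distance is capped at 1
    r = np.minimum(h / range_, 1.0)
    return sill * (1.0 - 1.5 * r + 0.5 * r**3)

# Covariance at a few lags: zero distance, half the range, the range, beyond the range
lags = np.array([0.0, 15.0, 30.0, 45.0])
cov_values = spherical_cov(lags)
print(cov_values)
```

The covariance equals the sill at zero distance and vanishes from the range onwards, which is why samples more than 30 units apart are uncorrelated under this model.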

Now we perform the cross-validation step. This requires the definition of a neighborhood (called *neighU*) that we consider as unique, due to the small number of data. Obviously this could be turned into a moving neighborhood if necessary.

In [None]:
neighU = gl.NeighUnique()
err = gl.xvalid(data,model,neighU,flag_xvalid_est=1,flag_xvalid_std=1,
                namconv=gl.NamingConvention("Xvalid",True,True,False))

The cross-validation feature offers several types of outputs, according to the flags:

- *flag_xvalid_est* tells if the function must return the estimation error Z*-Z (flag_xvalid_est=1) or the estimation Z* (flag_xvalid_est=-1)

- *flag_xvalid_std* tells if the function must return the normalized error (Z*-Z)/S (flag_xvalid_std=1) or the standard deviation S (flag_xvalid_std=-1)
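
The four quantities selected by these flags can be sketched in plain NumPy. The block below is only an illustration under stated assumptions: it uses ordinary kriging (constant unknown mean) in a unique neighborhood as a stand-in for the Universality condition, random points rather than the data set of this notebook, and hypothetical helper names (`spherical_cov`, `loo_ordinary_kriging`). It is not how *gstlearn* implements `xvalid`.

```python
import numpy as np

def spherical_cov(h, range_=30.0, sill=4.0):
    """Textbook spherical covariance (hypothetical helper)."""
    r = np.minimum(np.asarray(h, dtype=float) / range_, 1.0)
    return sill * (1.0 - 1.5 * r + 0.5 * r**3)

def loo_ordinary_kriging(xy, z, cov):
    """Leave-one-out ordinary kriging: returns Z* and S for each sample."""
    n = len(z)
    dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    C = cov(dist)
    estim = np.empty(n)
    stdev = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i          # temporarily remove the target sample
        m = n - 1
        lhs = np.ones((m + 1, m + 1))     # kriging matrix with the unbiasedness row/column
        lhs[:m, :m] = C[np.ix_(keep, keep)]
        lhs[m, m] = 0.0
        rhs = np.append(C[keep, i], 1.0)
        sol = np.linalg.solve(lhs, rhs)
        w, mu = sol[:m], sol[m]           # weights and Lagrange multiplier
        estim[i] = w @ z[keep]
        var = C[i, i] - w @ C[keep, i] - mu
        stdev[i] = np.sqrt(max(var, 0.0))
    return estim, stdev

rng = np.random.default_rng(0)
xy = rng.uniform(0.0, 100.0, size=(10, 2))   # illustrative locations only
z = rng.normal(size=10)                      # illustrative values only

estim, stdev = loo_ordinary_kriging(xy, z, spherical_cov)   # Z* and S
error = estim - z                                           # Z* - Z
std_error = error / stdev                                   # (Z* - Z) / S
```

Here `error` and `std_error` play the role of the outputs requested with positive flag values, while `estim` and `stdev` correspond to the negative ones.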

For a complete demonstration, all options are used. Note the use of *NamingConvention*, which explicitly leaves the Z-locator on the input variable (i.e. *data*).

We perform the cross-validation step once more but change the storing option (as well as the radix given to the output variables).

In [None]:
err = gl.xvalid(data,model,neighU,flag_xvalid_est=-1, flag_xvalid_std=-1,
                namconv=gl.NamingConvention("Xvalid2",True,True,False))
data

We now check all the results gathered on the first sample.

In [None]:
data[0,0:8]

The printed values correspond to the following information:

- the sample rank: 1
- the sample abscissa $X$: 22.7
- the sample ordinate $Y$: 83.64
- the data value $Z$: 2.502
- the cross-validation error $Z^* - Z$: -1.952
- the cross-validation standardized error $\frac{Z^* - Z} {S}$: -1.095
- the cross-validation estimated value $Z^*$: 0.550
- the standard deviation of the cross-validation error $S$: 1.781
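
These printed values are tied together by two relations: the error is $Z^* - Z$ and the standardized error is that error divided by $S$. A quick consistency check on the rounded figures listed above can be written as follows; the tolerance is an assumption chosen to absorb rounding to three decimals.

```python
import math

# Rounded values printed for the first sample (copied from the list above)
z = 2.502          # data value Z
error = -1.952     # Z* - Z
std_error = -1.095 # (Z* - Z) / S
estim = 0.550      # Z*
stdev = 1.781      # S

tol = 5e-3  # assumed tolerance, accounts for the rounding of the printed figures

error_ok = math.isclose(estim - z, error, abs_tol=tol)
std_error_ok = math.isclose(error / stdev, std_error, abs_tol=tol)
print(error_ok, std_error_ok)
```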

We can also double-check these results by asking for a full dump of all information when processing the first sample. The next chunk does not store any result: it is just there to produce some output on the terminal, in order to better understand the process.

In [None]:
if turnPrintON:
    gl.OptDbg.setReference(1)
    err = gl.xvalid(data,model,neighU,flag_xvalid_est=1, flag_xvalid_std=1,
                    namconv=gl.NamingConvention("Xvalid3",True,True,False))

We can also double-check these calculations with a Moving Neighborhood which has been tuned to cover a pseudo-Unique Neighborhood.

In [None]:
neighM = gl.NeighMoving.create()

In [None]:
if turnPrintON:
    gl.OptDbg.setReference(1)
    err = gl.xvalid(data,model,neighM,flag_xvalid_est=1, flag_xvalid_std=1,
                    namconv=gl.NamingConvention("Xvalid4",True,True,False))

In the next cell, we produce the different graphic outputs that are expected after a cross-validation step:

- the base map of the absolute value of the cross-validation standardized error
- the histogram of the cross-validation standardized error
- the scatter plot of the standardized error as a function of the estimation
- the scatter plot of the true value as a function of the estimation

In [None]:
fig, axs = plt.subplots(2,2, figsize=(12,12))
axs[0,0].symbol(data,nameSize="Xvalid.data.stderr", flagAbsSize=True)
axs[0,0].decoration(title="Standardized Errors (absolute value)")
axs[0,1].histogram(data, name="Xvalid.data.stderr", bins=20)
axs[0,1].decoration(title="Histogram of Standardized Errors")
axs[1,0].correlation(data, namey="Xvalid.data.stderr", namex="Xvalid2.data.estim", 
                   asPoint=True, horizLine=True)
axs[1,0].decoration(xlabel="Estimation", ylabel="Standardized Error")
axs[1,1].correlation(data, namey="data", namex="Xvalid2.data.estim", 
                   asPoint=True, regrLine=True, flagSameAxes=True)
axs[1,1].decoration(xlabel="Estimation", ylabel="True Value")

## Difference between Kriging and Cross-Validation

This small paragraph is meant to highlight the difference between *Kriging* and *Cross-validation*. The two main differences are that:

- the estimation is performed on the data file itself: hence the 'input' and 'output' data bases coincide
- the target sample is temporarily suppressed from the data base before the estimation takes place at its initial location

This is illustrated in the following cell, where the estimation is performed on the first sample, toggling all the verbose options ON.
We also take the opportunity to test the option for calculating the variance of the estimator (not of the estimation error, for once).

In [None]:
gl.OptDbg.setReference(1)
err = gl.kriging(data,data,model,neighU, flag_est=True, flag_std=True, flag_varz=True,
                namconv=gl.NamingConvention("Kriging",True,True,False))
data

In [None]:
data[0,0:15]

We also take this opportunity to double-check that Kriging is an Exact Interpolator (i.e. at a data point, the estimated value coincides with the data value and the variance of the estimation error is zero), even when using a Model containing some Nugget Effect.

So we modify the previous model by adding a Nugget Effect component.

In [None]:
model.addCovFromParam(type=gl.ECov.NUGGET, sill=1.5)
model

In [None]:
gl.OptDbg.setReference(1)
err = gl.kriging(data,data,model,neighU, flag_est=True, flag_std=True, flag_varz=True,
                namconv=gl.NamingConvention("Kriging2",True,True,False))
data

In [None]:
data[0,0:21]
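
The exact-interpolation property checked above, and its contrast with cross-validation, can be reproduced with a small NumPy sketch. It assumes ordinary kriging in a unique neighborhood, a nugget effect treated as part of the covariance at zero distance, and random illustrative data; `model_cov` and `ordinary_kriging` are hypothetical helpers, not *gstlearn* functions.

```python
import numpy as np

def model_cov(h, range_=30.0, sill=4.0, nugget=1.5):
    """Spherical covariance plus a nugget acting only at zero distance (hypothetical helper)."""
    h = np.asarray(h, dtype=float)
    r = np.minimum(h / range_, 1.0)
    return sill * (1.0 - 1.5 * r + 0.5 * r**3) + nugget * (h == 0.0)

def ordinary_kriging(xy, z, target, cov):
    """Ordinary kriging of one target point; returns (estimate, standard deviation)."""
    n = len(z)
    dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    lhs = np.ones((n + 1, n + 1))
    lhs[:n, :n] = cov(dist)
    lhs[n, n] = 0.0
    rhs = np.append(cov(np.linalg.norm(xy - target, axis=1)), 1.0)
    sol = np.linalg.solve(lhs, rhs)
    w, mu = sol[:n], sol[n]
    var = float(cov(0.0)) - w @ rhs[:n] - mu
    return float(w @ z), float(np.sqrt(max(var, 0.0)))

rng = np.random.default_rng(1)
xy = rng.uniform(0.0, 100.0, size=(10, 2))   # illustrative locations only
z = rng.normal(size=10)                      # illustrative values only

# Kriging: the first sample stays in the data set and is also the target
est_krig, std_krig = ordinary_kriging(xy, z, xy[0], model_cov)

# Cross-validation: the first sample is removed before estimating at its location
est_xval, std_xval = ordinary_kriging(xy[1:], z[1:], xy[0], model_cov)

print(est_krig, z[0], std_krig)
print(est_xval, std_xval)
```

With the sample kept, the right-hand side of the kriging system coincides with one column of the kriging matrix, so the whole weight goes to that sample and the error variance vanishes. Once the sample is removed, the estimate relies on its neighbors only and the standard deviation becomes strictly positive.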