# Geostatistics Athens Week project

Authors: 

General recommendations: 

In this project you will implement several prediction methods and compare their performance. Report the comparison in terms of mean squared error (MSE) on a single validation set (or on several validation sets); a table summarizing your results at the end would be welcome. For each method, provide a prediction map over the grid and, where possible, a standard deviation map. Justify your modeling choices carefully, and comment on and interpret your results.
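
As a minimal sketch of the comparison criterion, the MSE of each method on a validation set can be collected in a dictionary and printed as a small table. The observed values, the predictions and the method names below are hypothetical placeholders, not values from the Jura files.

```python
import numpy as np

def mse(observed, predicted):
    """Mean squared error between observed and predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((observed - predicted) ** 2))

# Hypothetical validation values and predictions from two methods.
obs = [9.0, 11.0, 7.5, 12.0]
scores = {
    "method A": mse(obs, [8.0, 11.0, 8.5, 12.0]),
    "method B": mse(obs, [9.5, 10.5, 7.0, 12.5]),
}
for name, score in scores.items():
    print(f"{name}: MSE = {score:.3f}")
```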

Upload your notebook, named after the team members, at the link below by the night of Sunday 24:

https://cloud.minesparis.psl.eu/index.php/s/K5PEdwY5l3FBC4c

Good Luck!

## The Jura data set
The Jura data set comprises seven heavy metals measured in the topsoil of the Swiss Jura, along with consistently coded land use and rock type factors, as well as geographic coordinates.

Variable description:

* Xloc: X coordinate, local grid km
* Yloc: Y coordinate, local grid km
* Landuse: Land use: 1: Forest, 2: Pasture (Weide(land), Wiese, Grasland), 3: Meadow (Wiese, Flur, Matte, Anger), 4: Tillage (Ackerland, bestelltes Land)
* Rock: Rock Types: 1: Argovian, 2: Kimmeridgian, 3: Sequanian, 4: Portlandian, 5: Quaternary.
* Cd: mg cadmium kg^-1 topsoil
* Co: mg cobalt kg^-1 topsoil
* Cr: mg chromium kg^-1 topsoil
* Cu: mg copper kg^-1 topsoil
* Ni: mg nickel kg^-1 topsoil
* Pb: mg lead kg^-1 topsoil
* Zn: mg zinc kg^-1 topsoil

You are given three different files:

* jura_pred.csv: learning dataset
* jura_grid.csv: prediction grid (contains locations and covariables)
* jura_val_loc: validation locations and covariables

## I. Exploratory analysis

### Basic statistics
1. Load the dataset from jura_pred.csv (on the cloud).
2. What is the class of the dataset?
3. What is the number of observations? What is the number of variables?
4. Print the name of the variables.
5. Compute the minimum and maximum value for each coordinate.
6. Compute basic statistics for the seven heavy metals (mean, min, max, quartiles and standard deviation).
7. Compute the mean cobalt concentration for each of the four land uses.
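
The last two items can be sketched with NumPy alone, as below. The arrays are hypothetical toy values standing in for the `Landuse` and `Co` columns; they are not taken from jura_pred.csv, and `group_means` is an illustrative helper name.

```python
import numpy as np

# Hypothetical toy values, NOT taken from jura_pred.csv.
landuse = np.array([1, 2, 2, 3, 1, 4, 3, 2])
co = np.array([6.0, 10.0, 12.0, 9.0, 8.0, 11.0, 7.0, 11.0])

def group_means(values, groups):
    """Mean of `values` within each distinct level of `groups`."""
    return {int(g): float(values[groups == g].mean()) for g in np.unique(groups)}

summary = {
    "min": float(co.min()),
    "quartiles": np.percentile(co, [25, 50, 75]).tolist(),
    "max": float(co.max()),
    "mean": float(co.mean()),
    "std": float(co.std(ddof=1)),  # sample standard deviation
}
means_by_landuse = group_means(co, landuse)
print(summary)
print(means_by_landuse)
```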

### Graphical Representations
1. Plot the points in the dataset using their coordinates 'Xloc' and 'Yloc'.
2. On the same plot, display the points with landuse 2 (pasture) in red.
3. Plot the seven heavy metal concentrations histograms.
4. Plot the seven heavy metal concentrations as functions of the landuse (boxplots)
5. Plot the seven heavy metal concentrations as functions of the rocktype (boxplots)

### Some statistics
1. Model the cobalt concentrations as a function of Landuse, Rock and their interaction (e.g. with an ANOVA). Comment on the results.
2. Do the same for the other concentrations (check the histograms first and apply a transformation if necessary).


## II. Interpolation

Provide the maps of the cobalt concentration over the Swiss Jura obtained with several regression/interpolation methods, e.g.:


* ANOVA
* linear regression on the coordinates
* Random Forests
* Nearest neighbours
* Inverse distance
* ...
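
As one illustration of the methods listed above, inverse distance weighting can be sketched in a few lines of NumPy. The function name `idw_predict`, the default power of 2 and the toy coordinates are assumptions made for this example only.

```python
import numpy as np

def idw_predict(xy_obs, z_obs, xy_new, power=2.0, eps=1e-12):
    """Inverse distance weighted prediction at the rows of xy_new."""
    xy_obs = np.asarray(xy_obs, dtype=float)
    z_obs = np.asarray(z_obs, dtype=float)
    xy_new = np.asarray(xy_new, dtype=float)
    # Pairwise distances, shape (n_new, n_obs).
    dist = np.linalg.norm(xy_new[:, None, :] - xy_obs[None, :, :], axis=2)
    # Clip tiny distances so a target located on a data point does not divide by zero.
    weights = 1.0 / np.maximum(dist, eps) ** power
    return (weights @ z_obs) / weights.sum(axis=1)

# Hypothetical toy data.
xy_obs = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
z_obs = [1.0, 3.0, 5.0]
xy_new = [[1.0, 0.0], [0.0, 0.0]]
pred = idw_predict(xy_obs, z_obs, xy_new)
print(pred)
```

The power controls how local the predictor is: larger values give more weight to the closest observations. This method has no built-in standard deviation map.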


## III. Univariate analysis

### III.1 Variography

#### Experimental variogram (isotropic case)

1. Compute and plot the experimental variogram of the cobalt concentration. Try different lag values and comment on the results.
2. Print the number of pairs of points used to compute the variogram values for different lag values and comment on the results.
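
To make the role of the lag concrete, here is a minimal NumPy sketch of an isotropic experimental variogram that also returns the number of pairs per bin. It is not the API of any geostatistics library; the function name, the binning rule (bin k covers distances in [k*lag, (k+1)*lag)) and the random stand-in data are assumptions for illustration.

```python
import numpy as np

def experimental_variogram(xy, z, lag, n_lags):
    """Isotropic experimental variogram with pair counts per distance bin."""
    xy = np.asarray(xy, dtype=float)
    z = np.asarray(z, dtype=float)
    i, j = np.triu_indices(len(z), k=1)          # each pair counted once
    dist = np.linalg.norm(xy[i] - xy[j], axis=1)
    half_sq = 0.5 * (z[i] - z[j]) ** 2
    bins = np.floor(dist / lag).astype(int)      # bin k covers [k*lag, (k+1)*lag)
    gamma = np.full(n_lags, np.nan)
    mean_dist = np.full(n_lags, np.nan)
    npairs = np.zeros(n_lags, dtype=int)
    for k in range(n_lags):
        mask = bins == k
        npairs[k] = mask.sum()
        if npairs[k] > 0:
            gamma[k] = half_sq[mask].mean()
            mean_dist[k] = dist[mask].mean()
    return mean_dist, gamma, npairs

# Random stand-in data (not the Jura data) to show the effect of the lag.
rng = np.random.default_rng(0)
xy = rng.uniform(0.0, 5.0, size=(100, 2))
z = rng.normal(size=100)
for lag in (0.1, 0.5):
    h, gamma, npairs = experimental_variogram(xy, z, lag=lag, n_lags=5)
    print(f"lag={lag}: pairs per bin = {npairs.tolist()}")
```

Small lags give fine resolution near the origin but few pairs per bin, hence noisy estimates; large lags smooth the curve at the cost of detail.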

#### Experimental variogram (anisotropic case)

1. Compute and plot the variogram maps of the cobalt concentration to check for anisotropies. Comment.
2. Compute and plot directional variograms (according to the anisotropy directions determined with the maps).
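
A directional variogram only uses pairs whose separation vector is close to a given direction. The sketch below selects such pairs with an angular tolerance; the function name, the convention (angles in degrees, counter-clockwise from the X axis, taken modulo 180) and the default tolerance are assumptions for illustration. The selected pairs can then be binned by distance exactly as for an isotropic variogram.

```python
import numpy as np

def directional_pairs(xy, angle_deg, tol_deg=22.5):
    """Indices (i, j) of the pairs whose direction is within tol_deg of angle_deg."""
    xy = np.asarray(xy, dtype=float)
    i, j = np.triu_indices(len(xy), k=1)
    delta = xy[j] - xy[i]
    # Direction of each pair in [0, 180): a pair and its reverse are the same direction.
    theta = np.degrees(np.arctan2(delta[:, 1], delta[:, 0])) % 180.0
    diff = np.abs(theta - (angle_deg % 180.0))
    diff = np.minimum(diff, 180.0 - diff)        # wrap around 0/180 degrees
    keep = diff <= tol_deg
    return i[keep], j[keep]

# Hypothetical toy layout: the four corners of a unit square.
square = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for angle in (0.0, 45.0, 90.0):
    i, j = directional_pairs(square, angle)
    print(angle, list(zip(i.tolist(), j.tolist())))
```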

#### Model adjustment

1. Fit a model (isotropic and anisotropic cases) to the experimental variograms and print the model characteristics.
2. Try imposing different structures or combinations of structures.
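
To illustrate what "imposing a structure" means, the sketch below defines two basic structures (spherical and exponential), adds a nugget effect, and selects parameters by a crude weighted least-squares grid search. This is a stand-in for a proper automatic fit, not the procedure of any particular library; the function names, the factor 3 in the exponential model (a "practical range" convention) and the parameter grids are assumptions.

```python
import itertools
import numpy as np

def spherical(h, sill, range_):
    """Spherical variogram: reaches the sill exactly at distance range_."""
    r = np.minimum(np.asarray(h, dtype=float) / range_, 1.0)
    return sill * (1.5 * r - 0.5 * r ** 3)

def exponential(h, sill, range_):
    """Exponential variogram, range_ being the practical range (assumed convention)."""
    return sill * (1.0 - np.exp(-3.0 * np.asarray(h, dtype=float) / range_))

def with_nugget(structure, h, nugget, sill, range_):
    """Nugget effect plus one basic structure."""
    h = np.asarray(h, dtype=float)
    return np.where(h > 0, nugget, 0.0) + structure(h, sill, range_)

def fit_grid(structure, h, gamma, npairs, nuggets, sills, ranges):
    """Pick the grid point minimizing the pair-weighted squared error."""
    best, best_cost = None, np.inf
    for c0, c, a in itertools.product(nuggets, sills, ranges):
        cost = np.sum(npairs * (gamma - with_nugget(structure, h, c0, c, a)) ** 2)
        if cost < best_cost:
            best, best_cost = (c0, c, a), cost
    return best, best_cost

# Synthetic "experimental" values generated from a known model (illustration only).
h = np.linspace(0.25, 3.0, 12)
gamma = with_nugget(spherical, h, 1.0, 4.0, 1.5)
npairs = np.full(h.shape, 50)
best, cost = fit_grid(spherical, h, gamma, npairs,
                      nuggets=[0.0, 0.5, 1.0, 1.5],
                      sills=[2.0, 3.0, 4.0, 5.0],
                      ranges=[0.5, 1.0, 1.5, 2.0])
print(best, cost)
```

Comparing the final cost across structures (here `spherical` versus `exponential`) gives a first indication of which family describes the experimental points better.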

### III.2 Prediction

#### Ordinary Kriging

1. Compute and plot the ordinary kriging of the cobalt over the prediction grid. Plot the associated standard deviation map.
2. Try several variogram models (basic structures, anisotropy) and neighborhood options. Compute the prediction scores. Comment on the results.
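
For reference, the ordinary kriging system can be written out directly. The sketch below uses a unique neighborhood (all data for every target) and an exponential covariance with arbitrary parameters; the function names and the toy data are assumptions, and a dedicated library will be more convenient for the actual project.

```python
import numpy as np

def exp_cov(h, sill=1.0, scale=1.0):
    """Exponential covariance; sill and scale are illustrative values."""
    return sill * np.exp(-np.asarray(h, dtype=float) / scale)

def ordinary_kriging(xy_obs, z_obs, xy_new, cov):
    """Ordinary kriging prediction and standard deviation (unique neighborhood)."""
    xy_obs = np.asarray(xy_obs, dtype=float)
    z_obs = np.asarray(z_obs, dtype=float)
    xy_new = np.asarray(xy_new, dtype=float)
    n, m = len(z_obs), len(xy_new)
    d_obs = np.linalg.norm(xy_obs[:, None, :] - xy_obs[None, :, :], axis=2)
    d_new = np.linalg.norm(xy_obs[:, None, :] - xy_new[None, :, :], axis=2)
    # Left-hand side: data covariances bordered by the unbiasedness constraint.
    lhs = np.ones((n + 1, n + 1))
    lhs[:n, :n] = cov(d_obs)
    lhs[n, n] = 0.0
    # Right-hand side: one column per target location.
    rhs = np.ones((n + 1, m))
    rhs[:n, :] = cov(d_new)
    sol = np.linalg.solve(lhs, rhs)
    weights, mu = sol[:n, :], sol[n, :]          # kriging weights, Lagrange multiplier
    pred = weights.T @ z_obs
    var = cov(0.0) - np.sum(weights * rhs[:n, :], axis=0) - mu
    return pred, np.sqrt(np.maximum(var, 0.0))   # clip tiny negative round-off

# Hypothetical toy data; the first target coincides with the first observation.
xy_obs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z_obs = [8.0, 10.0, 9.0, 12.0]
xy_new = [[0.0, 0.0], [0.5, 0.5]]
pred, std = ordinary_kriging(xy_obs, z_obs, xy_new, exp_cov)
print(pred, std)
```

Without a nugget effect, kriging is an exact interpolator: at a data location the prediction equals the observation and the standard deviation is zero, which is a useful sanity check on the maps.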


#### Universal kriging

Use the indicators of the different levels of the categorical variables (*Rock*, *Landuse* and their interactions) as covariates to compute the universal kriging prediction.

1. Define a model with a constant mean (*order = 0*) and the number of variables with an f locator that we want to work with (*nfex = 4*).
2. Compute the variogram of the residuals.
3. Fit a model to the variogram of the residuals. Do not forget to set the drift functions.
4. Compute the kriging with external drift prediction on the grid as well as the standard deviation map.
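
The indicator covariates mentioned above can be built as follows. This is a sketch under assumptions: the helper name `indicators` is hypothetical, the codes are toy values, and dropping the first level is one possible way to avoid redundancy with the constant mean, not a requirement stated by the project.

```python
import numpy as np

def indicators(codes, levels, drop_first=True):
    """One 0/1 column per level; the first level is optionally dropped."""
    codes = np.asarray(codes)
    cols = [(codes == level).astype(float) for level in levels]
    if drop_first:
        cols = cols[1:]                          # reference level absorbed by the mean
    return np.column_stack(cols)

# Hypothetical codes for four locations.
rock = [1, 2, 5, 3]
landuse = [2, 2, 1, 4]
rock_ind = indicators(rock, levels=[1, 2, 3, 4, 5])     # shape (4, 4)
land_ind = indicators(landuse, levels=[1, 2, 3, 4])     # shape (4, 3)
# Interaction terms: product of every rock indicator with every land use indicator.
inter = (rock_ind[:, :, None] * land_ind[:, None, :]).reshape(len(rock), -1)
print(rock_ind.shape, land_ind.shape, inter.shape)
```

With few observations, many interaction columns will be identically zero or nearly so; such columns make the drift ill-determined and are worth checking before kriging.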

## IV. Multivariate analysis

### IV.1 Fitting a multivariate model

1. Compute the empirical directional variograms and covariograms of a carefully chosen set of variables (justify your choice).
2. Fit a model.

### IV.2 Prediction

1. Interpolate *Co* on the grid using Ordinary Cokriging and plot the resulting map as well as the standard deviation map.
2. Implement the universal cokriging.



## V. Maximum Likelihood estimation

1. Compute the maximum likelihood estimator of the parameters of (some of) your favorite univariate model(s) for the cobalt concentration.
2. Compare the models through a likelihood ratio test or by computing the AIC if they are not nested.
3. Compute the predictions of each model.
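
The comparison criteria in this section reduce to simple formulas once the maximized log-likelihoods are known. The sketch below assumes a Gaussian model with a given mean and covariance matrix; the function names are illustrative, and the numerical log-likelihood values in the example are made up.

```python
import math
import numpy as np

def gaussian_loglik(z, mean, cov_matrix):
    """Log-likelihood of z under a Gaussian vector with the given mean and covariance."""
    z = np.asarray(z, dtype=float)
    chol = np.linalg.cholesky(np.asarray(cov_matrix, dtype=float))
    resid = np.linalg.solve(chol, z - mean)      # whitened residuals
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    return float(-0.5 * (len(z) * math.log(2.0 * math.pi) + logdet + resid @ resid))

def aic(loglik, n_params):
    """Akaike information criterion: smaller is better."""
    return 2.0 * n_params - 2.0 * loglik

def lrt_statistic(loglik_small, loglik_big):
    """Likelihood ratio statistic for a small model nested in a bigger one."""
    return 2.0 * (loglik_big - loglik_small)

# Made-up maximized log-likelihoods for two nested models differing by one parameter.
ll_small, ll_big = -12.0, -10.0
stat = lrt_statistic(ll_small, ll_big)
# Under the usual regularity conditions, compare with a chi-square quantile
# (about 3.84 for one degree of freedom at the 5% level).
print(stat, aic(ll_small, 2), aic(ll_big, 3))
```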

## VI. Conditional simulations

The information threshold for the concentration of cobalt in soils is *12 mg/kg*. 

1. Generate 100 conditional simulations of the cobalt concentrations over the Swiss Jura according to your favorite model.
2. Compute the mean area of the exceedance region as well as its associated central 95% confidence interval.
3. Compute and plot the exceedance probability map. Comment.
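
Once the simulations are stored as an array with one row per simulation and one column per grid cell, the last two items are short array operations. The sketch below assumes that layout and a constant cell area; the function name is hypothetical and the random array is only a placeholder for real conditional simulations.

```python
import numpy as np

def exceedance_summary(sims, threshold, cell_area):
    """Mean exceedance area, its central 95% interval, and the probability map."""
    sims = np.asarray(sims, dtype=float)         # shape (n_sim, n_cells)
    exceed = sims > threshold
    areas = exceed.sum(axis=1) * cell_area       # one area per simulation
    lower, upper = np.percentile(areas, [2.5, 97.5])
    prob_map = exceed.mean(axis=0)               # per-cell frequency of exceedance
    return float(areas.mean()), (float(lower), float(upper)), prob_map

# Placeholder values: independent Gaussian noise, NOT conditional simulations.
rng = np.random.default_rng(1)
fake_sims = rng.normal(loc=10.0, scale=3.0, size=(100, 50))
mean_area, interval, prob_map = exceedance_summary(fake_sims, threshold=12.0, cell_area=0.01)
print(mean_area, interval, prob_map[:5])
```

The interval here is the empirical 2.5% to 97.5% range of the simulated areas; with only 100 simulations its bounds are themselves rather uncertain.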

## VII. Summary -- Discussion

## Appendix: 
Description of the predictions submitted on Kaggle (models, parameters) and corresponding prediction maps.
