# Data exploration in Python

We start by loading a few useful Python packages.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Figure sizes
plt.rcParams['figure.figsize'] = [8,6]

We now load the dataset that will be used throughout the course: the Jura dataset. It comprises the concentrations of seven heavy metals measured in the topsoil of the Swiss Jura, along with consistently coded land use and rock type factors, as well as geographic coordinates. The rock types and land uses are coded as follows:

- Rock Types: 1: Argovian, 2: Kimmeridgian, 3: Sequanian, 4: Portlandian, 5: Quaternary.

- Land uses: 1: Forest, 2: Pasture (Weide(land), Wiese, Grasland), 3: Meadow (Wiese, Flur, Matte, Anger), 4: Tillage (Ackerland, bestelltes Land)


1. Load the dataset from the file *jura_pred.csv* (on the cloud) using the Pandas library.

In [None]:
# Dataset
jura=pd.read_csv("jura/jura_pred.csv")

2. What is the class of the dataset?

In [None]:
type(jura)

3. What is the number of observations? What is the number of variables?

In [None]:
jura.shape

3. Print the names of the variables.

In [None]:
jura.columns.values

4. Compute the minimum and maximum value for each coordinate.

In [None]:
## Here are two ways to compute the minimum of each coordinate
res= [jura['Xloc'].min(), jura['Yloc'].min()]
print("\n Min - 1st way :\n ", res)

res=jura[['Xloc','Yloc']].min(0)
print("\n Min - 2nd way :\n", res)

## Likewise for the maximum
res=[jura['Xloc'].max(), jura['Yloc'].max()]
print("\n Max - 1st way :\n", res)

res=jura[['Xloc','Yloc']].max(0)
print("\n Max - 2nd way :\n", res)
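Both extremes can also be obtained in a single call with the `agg` method of Pandas dataframes. The sketch below uses a small made-up data frame named `toy`; its coordinate values are invented for illustration and are not taken from the Jura dataset.

```python
import pandas as pd

# Toy coordinates, invented for illustration (not the Jura data)
toy = pd.DataFrame({"Xloc": [1.0, 2.5, 4.0], "Yloc": [0.5, 3.0, 1.5]})

# One row per statistic, one column per coordinate
extent = toy[["Xloc", "Yloc"]].agg(["min", "max"])
print(extent)
```

On the real data, the same call would presumably read `jura[['Xloc','Yloc']].agg(['min','max'])`.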

5. Compute basic statistics (mean, minimum, maximum, quartiles and standard deviation) for the seven heavy metals.

In [None]:
## 1st way: using the describe method of Pandas dataframes
res=jura.iloc[:,4:].describe()

print("With the describe method:")
print(res)

In [None]:
## 2nd way: manually
res_min=jura.iloc[:,4:].min(0)
res_max=jura.iloc[:,4:].max(0)
res_mean=jura.iloc[:,4:].mean(0)
res_std=jura.iloc[:,4:].std(0)
res_quartile_25=jura.iloc[:,4:].quantile(0.25)
res_quartile_50=jura.iloc[:,4:].quantile(0.5)
res_quartile_75=jura.iloc[:,4:].quantile(0.75)

## Print each result under its own label
print("Min\n",res_min)
print("1st quartile\n",res_quartile_25)
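As a quick sanity check, the statistics computed one by one can be compared with the output of `describe`. The sketch below does this on a made-up data frame (the values are invented and are not the Jura measurements), assuming the default settings of `std` and `quantile`.

```python
import numpy as np
import pandas as pd

# Toy concentrations, invented for illustration (not the Jura data)
toy = pd.DataFrame({"Cd": [0.5, 1.0, 1.5, 2.0], "Co": [4.0, 8.0, 6.0, 10.0]})

summary = toy.describe()

# The same statistics, computed one by one and stacked into a table
manual = pd.DataFrame({
    "mean": toy.mean(),
    "std": toy.std(),
    "min": toy.min(),
    "25%": toy.quantile(0.25),
    "50%": toy.quantile(0.50),
    "75%": toy.quantile(0.75),
    "max": toy.max(),
}).T

# describe() also reports the count, so compare only the shared rows
same = np.allclose(summary.loc[manual.index], manual)
print(same)
```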

6. Compute the mean cobalt concentration for each of the four land uses.

In [None]:
mean_Co_1=jura['Co'][jura['Landuse']==1].mean()
mean_Co_2=jura['Co'][jura['Landuse']==2].mean()
mean_Co_3=jura['Co'][jura['Landuse']==3].mean()
mean_Co_4=jura['Co'][jura['Landuse']==4].mean()
print([mean_Co_1,mean_Co_2,mean_Co_3,mean_Co_4])
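The four means can also be computed in one call by grouping on the land use column. This is a minimal sketch on a made-up data frame (values invented, not the Jura data); it also checks that the grouped mean agrees with the boolean-mask approach for land use 2.

```python
import pandas as pd

# Toy data, invented for illustration (not the Jura data)
toy = pd.DataFrame({
    "Landuse": [1, 1, 2, 2, 3, 4],
    "Co": [2.0, 4.0, 6.0, 10.0, 5.0, 7.0],
})

# Mean cobalt concentration per land use, in a single call
mean_co = toy.groupby("Landuse")["Co"].mean()
print(mean_co)

# Boolean-mask approach for land use 2, for comparison
mask_mean = toy["Co"][toy["Landuse"] == 2].mean()
```

The grouped version avoids repeating one line per category, which helps when the number of categories is not known in advance.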

## Graphical representations


1. Plot the points in the dataset using their coordinates 'Xloc' and 'Yloc'.

In [None]:
plt.scatter(jura['Xloc'],jura['Yloc'])
plt.xlabel('Xloc')
plt.ylabel('Yloc')
plt.show()

2. On the same plot, display the points with land use 2 (pasture) in red.

In [None]:
plt.scatter(jura['Xloc'],jura['Yloc'])
plt.scatter(jura['Xloc'][jura['Landuse']==2],jura['Yloc'][jura['Landuse']==2],color="red")
plt.xlabel('Xloc')
plt.ylabel('Yloc')
plt.show()

3. Plot histograms of the seven heavy metal concentrations.

In [None]:
## Manually, on a grid
fig,ax = plt.subplots(nrows=2,ncols=4) ## To create a grid of plots
ax[0,0].hist(jura['Cd']) 
ax[0,0].set_title('Cd')
ax[0,1].hist(jura['Co']) 
ax[0,1].set_title('Co')
ax[0,2].hist(jura['Cr'])
ax[0,2].set_title('Cr')
ax[0,3].hist(jura['Cu']) 
ax[0,3].set_title('Cu')
ax[1,0].hist(jura['Ni']) 
ax[1,0].set_title('Ni')
ax[1,1].hist(jura['Pb']) 
ax[1,1].set_title('Pb')
ax[1,2].hist(jura['Zn']) 
ax[1,2].set_title('Zn')
ax[1,3].axis('off') ## Hide the unused eighth panel
plt.show()

## With a loop, on a single row
metal_names=jura.columns.values[4:]
fig,ax = plt.subplots(nrows=1,ncols=len(metal_names),figsize=(20,6)) ## To create a grid of plots
for i in range(len(metal_names)):
    name=metal_names[i]
    ax[i].hist(jura[name]) 
    ax[i].set_title(name)
plt.show()
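The loop and the two-row grid can be combined by flattening the array of axes. The sketch below runs on randomly generated data (not the Jura measurements), and the `Agg` backend line is only there so that it also runs as a script without a display.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data, randomly generated for illustration (not the Jura data)
rng = np.random.default_rng(0)
names = ["Cd", "Co", "Cr", "Cu", "Ni", "Pb", "Zn"]
toy = pd.DataFrame(rng.lognormal(size=(50, len(names))), columns=names)

fig, ax = plt.subplots(nrows=2, ncols=4, figsize=(16, 6))
panels = ax.ravel()  # Flatten the 2x4 grid into 8 panels

for panel, name in zip(panels, names):
    panel.hist(toy[name])
    panel.set_title(name)

# Hide the panels left over once every variable has been drawn
for panel in panels[len(names):]:
    panel.set_visible(False)

fig.tight_layout()
# In a notebook, finish with plt.show() to display the figure
```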

4. Plot the seven heavy metal concentrations as a function of land use.

In [None]:
## With a loop, on a single row
metal_names=jura.columns.values[4:]
fig,ax = plt.subplots(nrows=1,ncols=len(metal_names),figsize=(20,6)) ## To create a grid of plots
for i in range(len(metal_names)):
    name=metal_names[i]
    ax[i].scatter(jura['Landuse'],jura[name]) 
    ax[i].set_title(name)
plt.show()

The same plots, drawn as boxplots:

In [None]:
metal_names = jura.columns.values[4:]
fig, ax = plt.subplots(nrows=1, ncols=len(metal_names), figsize=(20, 6))  # To create a grid of plots

for i, name in enumerate(metal_names):
    sns.boxplot(x=jura['Landuse'], y=jura[name], ax=ax[i])
    ax[i].set_title(name)

plt.show()

5. Plot the seven heavy metal concentrations as a function of rock type.

In [None]:
## With a loop, on a single row
metal_names=jura.columns.values[4:]
fig,ax = plt.subplots(nrows=1,ncols=len(metal_names),figsize=(20,6)) ## To create a grid of plots
for i in range(len(metal_names)):
    name=metal_names[i]
    ax[i].scatter(jura['Rock'],jura[name]) 
    ax[i].set_title(name)
plt.show()

## Some statistics


We now perform an analysis of variance of the cobalt concentrations, with Landuse, Rock and their interaction as factors.

1. First, fit a linear regression model with the cobalt concentration as the response variable, and with Landuse, Rock and their interaction as covariates.
 
To perform a linear regression with categorical variables, we use the *ols* function from the *statsmodels* package.
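For reference, a model of this kind is commonly written as the two-way layout below. The notation is an assumption made here for illustration and does not come from the course material: $i$ indexes the land use, $j$ the rock type, and $k$ the observations sharing the same pair $(i,j)$.

```latex
\mathrm{Co}_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}
```

Here $\mu$ is the overall level, $\alpha_i$ and $\beta_j$ are the main effects of land use and rock type, $(\alpha\beta)_{ij}$ is their interaction, and $\varepsilon_{ijk}$ is a zero-mean error term.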

In [None]:
## For linear regressions (using R-style formulas to define regression)
from statsmodels.formula.api import ols

In [None]:
# Fit the regression model. C() marks a variable as categorical, and ':' denotes the interaction term
# (in formula syntax, 'a * b' is shorthand for 'a + b + a:b', so the main effects are listed only once)
model_aov = ols(formula='Co ~ C(Landuse) + C(Rock) + C(Landuse):C(Rock)', data=jura).fit()
print(model_aov.summary())

2. Perform an ANOVA on the regression model using the *stats.anova_lm* function from the *statsmodels* package.

In [None]:
## For ANOVA
from statsmodels.api import stats

In [None]:
# Perform anova analysis
anova_table = stats.anova_lm(model_aov)
print(anova_table)

3. Repeat these two steps for the other concentrations (check the histograms first and apply a transformation if necessary).
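The sample skewness can serve as a numerical complement to the visual check of a histogram. The sketch below uses randomly generated, right-skewed data (not the Jura measurements) and tries a log transform, which is one common choice for positive concentrations but not necessarily the right one for every metal. The column name `log_Cu` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy right-skewed concentrations, randomly generated (not the Jura data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Cu": rng.lognormal(mean=2.0, sigma=1.0, size=500)})

# Sample skewness: values far from 0 indicate an asymmetric histogram
skew_raw = toy["Cu"].skew()

# Log transform, stored in a new (hypothetical) column
toy["log_Cu"] = np.log(toy["Cu"])
skew_log = toy["log_Cu"].skew()

print(f"skewness before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

A transformed column like this one could then replace the response in the regression formula, for instance `'log_Cu ~ C(Landuse) + C(Rock) + C(Landuse):C(Rock)'`.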
