# African Soil Property Prediction Challenge 

## 0. Description

Advances in rapid, low cost analysis of soil samples using infrared spectroscopy, georeferencing of soil samples, and greater availability of earth remote sensing data provide new opportunities for predicting soil functional properties at unsampled locations. Soil functional properties are those properties related to a soil’s capacity to support essential ecosystem services such as primary productivity, nutrient and water retention, and resistance to soil erosion.

Diffuse reflectance infrared spectroscopy has shown potential in numerous studies to provide a highly repeatable, rapid and low cost measurement of many soil functional properties. The amount of light absorbed by a soil sample is measured, with minimal sample preparation, at hundreds of specific wavebands across a range of wavelengths to provide an infrared spectrum. The measurement can typically be performed in about 30 seconds, in contrast to conventional reference tests, which are slow, expensive, and require chemicals.

This competition asks you to predict 5 target soil functional properties from diffuse reflectance infrared spectroscopy measurements.

## 1. Frame the problem and look at the big picture

* 

## 2. Getting the data

A helper function that loads a CSV file into a pandas DataFrame.

In [None]:
import pandas as pd

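# Default location of the training file; adjust to your local layout.
DATA_PATH = "datasets/training.csv"

def load_data(path=DATA_PATH):
    """Read the CSV file at `path` into a DataFrame."""
    return pd.read_csv(path)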

Let's take a look at the training data.

In [None]:
training = load_data(path="datasets/training.csv")
training.head()

In [None]:
training.info()

### Explanation of data fields 

SOC, pH, Ca, P, and Sand are the five target variables for prediction. The data have been monotonically transformed from the original measurements and thus include negative values.
* PIDN: unique soil sample identifier
* SOC: Soil organic carbon
* pH: pH values
* Ca: Mehlich-3 extractable Calcium
* P: Mehlich-3 extractable Phosphorus
* Sand: Sand content 
* m7497.96 - m599.76: 3,578 mid-infrared absorbance measurements. For example, the "m7497.96" column is the absorbance at wavenumber 7497.96 cm-1. We suggest removing the spectral CO2 bands, which lie in the region m2379.76 to m2352.76, but this is optional.
* Depth: Depth of the soil sample (2 categories: "Topsoil", "Subsoil")
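
The suggested removal of the CO2 bands could be sketched roughly as follows. This is only an illustration, not part of the competition material: the helper names `is_spectral` and `drop_co2_bands` are hypothetical, the bounds are read off the column names quoted above, and the tiny `demo` frame (including the neighbouring column names `m2381.69` and `m2350.83`) is made up for the example.

```python
import pandas as pd

# Bounds taken from the column names quoted above (m2379.76 to m2352.76).
CO2_HIGH = 2379.76
CO2_LOW = 2352.76

def is_spectral(column):
    """Return True for names like 'm7497.96' ('m' followed by a number)."""
    if not column.startswith("m"):
        return False
    try:
        float(column[1:])
    except ValueError:
        return False
    return True

def drop_co2_bands(df, low=CO2_LOW, high=CO2_HIGH):
    """Return a copy of df without spectral columns inside [low, high] cm-1."""
    co2_columns = [
        c for c in df.columns
        if is_spectral(c) and low <= float(c[1:]) <= high
    ]
    return df.drop(columns=co2_columns)

# Tiny made-up frame, only to show the behaviour (values are not real data).
demo = pd.DataFrame({
    "PIDN": ["a", "b"],
    "m2381.69": [0.1, 0.2],  # hypothetical column just above the CO2 region
    "m2379.76": [0.3, 0.4],
    "m2352.76": [0.5, 0.6],
    "m2350.83": [0.7, 0.8],  # hypothetical column just below the CO2 region
    "Depth": ["Topsoil", "Subsoil"],
})
print(drop_co2_bands(demo).columns.tolist())
```

On the real data the same call would be `drop_co2_bands(training)`. The function returns a new DataFrame and leaves the input untouched, so the full spectra remain available for comparison.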

Also included are some potential spatial predictors from remote sensing data sources. Short variable descriptions are provided below and additional descriptions can be found at AfSIS data. The data have been mean centered and scaled.

* BSA: average long-term Black Sky Albedo measurements from MODIS satellite images (BSAN = near-infrared, BSAS = shortwave, BSAV = visible)
* CTI: compound topographic index calculated from Shuttle Radar Topography Mission elevation data
* ELEV: Shuttle Radar Topography Mission elevation data
* EVI: average long-term Enhanced Vegetation Index from MODIS satellite images
* LST: average long-term Land Surface Temperatures from MODIS satellite images (LSTD = day time temperature, LSTN = night time temperature)
* Ref: average long-term Reflectance measurements from MODIS satellite images (Ref1 = blue, Ref2 = red, Ref3 = near-infrared, Ref7 = mid-infrared)
* Reli: topographic Relief calculated from Shuttle Radar Topography Mission elevation data
* TMAP & TMFI: average long-term Tropical Rainfall Measuring Mission data (TMAP = mean annual precipitation, TMFI = modified Fournier index)
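
With this many columns it may help to group them by role before exploring further. The following is a minimal sketch under stated assumptions: `split_columns` is a hypothetical helper, spectral columns are assumed to be exactly those named "m" followed by a digit, and the small `demo` frame is made up rather than taken from the dataset.

```python
import pandas as pd

# The five target variables listed in the field descriptions.
TARGETS = ["Ca", "P", "pH", "SOC", "Sand"]

def split_columns(df):
    """Group column names into spectra, targets, and everything else."""
    # Assumption: spectral columns look like 'm7497.96' ('m' plus a digit).
    spectra = [c for c in df.columns if c.startswith("m") and c[1:2].isdigit()]
    targets = [c for c in TARGETS if c in df.columns]
    other = [c for c in df.columns if c not in spectra and c not in targets]
    return {"spectra": spectra, "targets": targets, "other": other}

# Tiny made-up frame, only to show the behaviour (values are not real data).
demo = pd.DataFrame({
    "PIDN": ["a", "b"],
    "m7497.96": [0.1, 0.2],
    "m599.76": [0.3, 0.4],
    "ELEV": [-0.5, 0.5],
    "Depth": ["Topsoil", "Subsoil"],
    "SOC": [-0.2, 1.3],
})
groups = split_columns(demo)
print(groups)

# Encode the two Depth categories as 1 (Topsoil) / 0 (Subsoil)
# so that the column is numeric.
demo["Depth"] = (demo["Depth"] == "Topsoil").astype(int)
```

Applied to the real data, `split_columns(training)["other"]` would contain PIDN, Depth, and the spatial predictors. The 0/1 encoding of Depth is one simple option among several.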


Plot histograms of all properties except the spectra to get an idea of their distributions.

In [None]:
%matplotlib inline 

import matplotlib.pyplot as plt

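# Spectral columns are named "m<wavenumber>"; keep everything else.
non_spectral = [c for c in training.columns if not c.startswith("m")]
# hist() only plots numeric columns, so PIDN and Depth are skipped.
training[non_spectral].hist(bins=50, figsize=(20, 15))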
plt.show()