![banner](../img/cdips_2017_logo.png)

# Loading, Visualizing, and Exploring the Data

This notebook will help you start 
to explore the soil spectral data
through [pandas](https://pandas.pydata.org/pandas-docs/stable/), 
[seaborn](http://seaborn.pydata.org/index.html), 
and [matplotlib](https://matplotlib.org/index.html).

All of these packages are well-documented and
easy-to-use, so they make for great tools
to use when first looking at data.

A good resource for learning pandas 
can be found [here](https://github.com/brandon-rhodes/pycon-pandas-tutorial).

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import sklearn as skl
import numpy as np

import seaborn as sns
sns.set(font_scale=2)

%matplotlib inline

#### Load the data and view the first few rows

Pandas can read the contents of a csv file
directly into a pandas dataframe. Just tell
it where to find the file.

In [None]:
data = pd.read_csv('../data/training.csv')

data.head()
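
To see what `read_csv` does without depending on the file on disk, here is a minimal sketch that reads a tiny CSV from an in-memory string. The column names and values are hypothetical stand-ins, not rows from the real `training.csv`.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; the real file is assumed to have many more columns.
toy_csv = """PIDN,m7497.96,m7496.04,pH
A001,0.30,0.31,-0.2
A002,0.28,0.29,0.5
A003,0.35,0.36,1.1
"""

# read_csv accepts a path or any file-like object
toy_data = pd.read_csv(io.StringIO(toy_csv))

print(toy_data.shape)    # (number of rows, number of columns)
print(toy_data.dtypes)   # type pandas inferred for each column
toy_data.head(2)         # first two rows
```

Checking `shape` and `dtypes` right after loading is a cheap way to confirm the file was parsed the way you expected.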

#### View some randomly chosen rows

In [None]:
data.sample(5)
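
`sample` draws different rows on every call. If you want a repeatable draw, one option is to pass `random_state`. The sketch below uses a hypothetical toy dataframe rather than the soil data.

```python
import pandas as pd

# Hypothetical toy dataframe with ten rows
toy_frame = pd.DataFrame({"value": range(10)})

# The same random_state gives the same rows each time
first_draw = toy_frame.sample(3, random_state=0)
second_draw = toy_frame.sample(3, random_state=0)

print(first_draw.equals(second_draw))
```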

#### List the columns

In [None]:
data.columns

#### Pick out the output variables and look at their descriptive statistics

There are a lot of built-in features that can be used to
explore a pandas dataframe. 

`describe()` is great
for a quick look at some standard statistical
quantities.

In [None]:
output_variables = ["Ca","P","pH","SOC","Sand"]

data[output_variables].describe()
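
The result of `describe()` is itself a dataframe, so individual statistics can be pulled out with `.loc`. A small sketch on hypothetical toy values (the numbers below are made up for illustration):

```python
import pandas as pd

# Hypothetical toy values for two of the output variables
toy_outputs = pd.DataFrame({"Ca": [0.1, 0.4, -0.2, 0.3],
                            "pH": [1.0, 0.5, -0.5, 0.0]})

toy_summary = toy_outputs.describe()

# Rows are statistics; columns are the original variables
print(toy_summary.index.tolist())
print(toy_summary.loc["mean"])                    # one statistic, all variables
print(toy_summary.loc[["25%", "50%", "75%"]])     # quartiles only
```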

#### Visualize marginal and pairwise distribution of outputs

Seaborn is a Python visualization library that
allows you to create statistical plots with
ease.

`pairplot` creates a
grid of pairwise distributions in just a single line!

In [None]:
sns.pairplot(data=data,
             vars=output_variables,
             plot_kws={'alpha':0.01,'s':144},
             diag_kind='kde');

plt.suptitle('Pairwise Relationships for '+ ', '.join(output_variables),
             fontsize=36,fontweight='bold',y=1.05);

#### Select and Plot a Random Spectrum

While seaborn is good for visualizing statistics,
matplotlib is still the go-to standard
for general plotting.  Let's plot the spectrum
for one of our samples.

In [None]:
data_columns = [column for column in data.columns if column.startswith('m')]
wavenumbers = [float(column.lstrip('m')) for column in data_columns]
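
The two list comprehensions above rely on the spectral columns being named `m` followed by a wavenumber. Here is a sketch of that parsing on a hypothetical handful of column names. It slices off the first character instead of using `lstrip('m')`, which would strip every leading `m` rather than just one; for names of this form the two give the same result.

```python
# Hypothetical column names in the assumed 'm<wavenumber>' format
toy_columns = ["PIDN", "m7497.96", "m7496.04", "m599.76", "pH"]

# Keep only the spectral columns
toy_spectral_columns = [c for c in toy_columns if c.startswith("m")]

# Drop the leading 'm' and convert the rest to a float
toy_wavenumbers = [float(c[1:]) for c in toy_spectral_columns]

print(toy_spectral_columns)
print(toy_wavenumbers)
```

Note that `startswith('m')` would also pick up any non-spectral column whose name happens to begin with `m`, so it is worth checking `data.columns` for such names.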

In [None]:
random_data_point = data.sample(1)
random_data_point

In [None]:
spectrum_as_dataframe = random_data_point[data_columns]

spectrum_as_series = spectrum_as_dataframe.iloc[0]
PIDN = random_data_point['PIDN'].iloc[0]
# .as_matrix() was removed in pandas 1.0; .to_numpy() is the replacement
spectrum_array = spectrum_as_series.to_numpy()
numericID = spectrum_as_series.name

plt.figure(figsize=(16,4))
plt.plot(wavenumbers,spectrum_array);
plt.title(PIDN+': '+str(numericID),
          fontweight='bold',fontsize='xx-large');
plt.xlabel('wavenumber',fontweight='bold',fontsize='x-large');
plt.ylabel('measurement',fontweight='bold',fontsize='x-large');

#### Plot Average Spectrum +/- 1 SD

First we grab the descriptive stats using `.describe()`, then we use the `mean`s and `std`s to build an `errorbar` plot.

One possible alteration would be to use the median (`50%`) and quartile values (`25%` and `75%`) instead. I suspect we might get a different picture, especially of the variability.

In [None]:
all_spectra_dataframe = data[data_columns]

In [None]:
stats = all_spectra_dataframe.describe()

stats

In [None]:
# .as_matrix() was removed in pandas 1.0; .to_numpy() is the replacement
average_values = stats.loc['mean'].to_numpy()
sds = stats.loc['std'].to_numpy()

In [None]:
plt.figure(figsize=(16,4))

plt.errorbar(x=wavenumbers,y=average_values,yerr=sds,
            errorevery=1,ecolor='k',color='w',alpha=0.01,zorder=1,
            label='±SD');

plt.plot(wavenumbers,average_values,color='chartreuse',
         linewidth=2,zorder=1,label='average'
        );

plt.legend()
plt.title('Average Spectrum', fontweight='bold',fontsize='xx-large');

plt.xlabel('wavenumber',fontweight='bold',fontsize='x-large');
plt.ylabel('measurement',fontweight='bold',fontsize='x-large');
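
The median-and-quartile alternative suggested at the start of this section could look something like the sketch below. It runs on synthetic spectra (a sine curve plus skewed noise, generated here purely for illustration), and all of the `toy_` names are hypothetical. To try it on the real data, `toy_spectra` would be swapped for the dataframe of spectral columns.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the spectra: 200 rows, 50 wavenumbers
rng = np.random.default_rng(0)
toy_wavenumbers = np.linspace(600, 7500, 50)
toy_baseline = np.sin(toy_wavenumbers / 1000.0)
toy_noise = rng.exponential(scale=0.2, size=(200, toy_wavenumbers.size))
toy_spectra = pd.DataFrame(toy_baseline + toy_noise)

# One row per requested quantile, one column per wavenumber
toy_quartiles = toy_spectra.quantile([0.25, 0.5, 0.75])
lower = toy_quartiles.loc[0.25].to_numpy()
median = toy_quartiles.loc[0.5].to_numpy()
upper = toy_quartiles.loc[0.75].to_numpy()

fig, ax = plt.subplots(figsize=(16, 4))

# Shaded band spans the interquartile range at each wavenumber
ax.fill_between(toy_wavenumbers, lower, upper,
                color='k', alpha=0.2, label='25%-75%')
ax.plot(toy_wavenumbers, median, color='chartreuse',
        linewidth=2, label='median')

ax.legend()
ax.set_title('Median Spectrum (synthetic data)', fontweight='bold')
ax.set_xlabel('wavenumber', fontweight='bold')
ax.set_ylabel('measurement', fontweight='bold')
```

Unlike mean ± SD, the quartile band does not have to be symmetric about the center line, so skew in the measurements shows up directly in the plot.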

#### SNR by Wavenumber

In [None]:
plt.figure(figsize=(16,4))

# Note: the plotted quantity is variance divided by mean at each wavenumber
# (a variance-to-mean ratio), so the y-axis is labeled accordingly.
plt.plot(wavenumbers,
         np.divide(np.square(sds), average_values),
         color='chartreuse',
         linewidth=2,zorder=1,
        );

plt.title('SNR by Wavenumber', fontweight='bold',fontsize='xx-large');

plt.xlabel('wavenumber',fontweight='bold',fontsize='x-large');
plt.ylabel('variance / mean',fontweight='bold',fontsize='x-large');
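
For comparison, one common convention defines the signal-to-noise ratio as the mean divided by the standard deviation. The sketch below computes that alongside the variance-to-mean ratio on synthetic data; the `toy_` names and the choice of convention are assumptions made for illustration, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 500 rows, 4 "wavenumbers", mean 2.0 and SD 0.5
rng = np.random.default_rng(0)
toy_spectra = pd.DataFrame(rng.normal(loc=2.0, scale=0.5, size=(500, 4)))

toy_means = toy_spectra.mean().to_numpy()
toy_sds = toy_spectra.std().to_numpy()   # sample SD (ddof=1), as in describe()

# Assumed convention: SNR = |mean| / SD
toy_snr = np.abs(toy_means) / toy_sds

# Variance-to-mean ratio: large when variability is high relative to the mean
toy_dispersion = np.square(toy_sds) / toy_means

print(toy_snr)
print(toy_dispersion)
```

Both ratios become unstable where the mean is close to zero, so wavenumbers with near-zero average measurement deserve a second look before reading much into either curve.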