![banner](../img/cdips_2017_logo.png)

# Explore Data with Interactive Visualizations

## Interactivity helps build intuition

The purpose of this notebook is twofold.

First, it provides an easy, intuitive way
for users to take a preliminary look at the 
data.  An interactive widget lets a user
quickly see how changing an input changes
the results, without being an expert
in Python, pandas, or matplotlib.

Second, it provides some preliminary
examples based on the data, so that anyone
can adjust the code to build an interaction
specific to their needs.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ipywidgets as widgets
import sklearn as skl
import seaborn as sns
sns.set()

import scripts.load_data as load

from ipywidgets import interact,interactive, fixed, interact_manual
import sklearn.preprocessing


#the notebook backend lets the figures below update in place
%matplotlib notebook

In [None]:
#reading in the training dataset
X,y=load.load_training_spectra(include_depth=True)

# joining features and targets into one dataframe
train = pd.concat([X, y], axis=1)

#Converting Depth to 0/1 boolean
#depthmap={'Topsoil':1,'Subsoil':0}
#train['Depth']=train['Depth'].replace(depthmap)


In [None]:
#target names
onames =[column for column in y.columns]

#columns of dataframe corresponding to spectral data
spectral_columns = [column for column in X.columns if column!='Depth']
wavenumbers=[float(column) for column in spectral_columns]
feature_columns = X.columns

scaler = sklearn.preprocessing.StandardScaler().fit(train[spectral_columns])
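#Depth is a text column, so restrict the statistics to numeric columns
mean=train.mean(numeric_only=True)
var=train.var(numeric_only=True)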

nrows=train.shape[0]

In [None]:
train_scaled=train.copy()
train_scaled[spectral_columns]=scaler.transform(train_scaled[spectral_columns])

## Preprocessing: looking at the log transform of data

If a target or feature exhibits a lot 
of skew, it can be difficult to see
meaningful characteristics of the
distribution.  Most samples will appear to
pile up around a single value, with a
few outliers spreading out the range of 
interest.  To better see any interesting behaviors
that may be occurring in the sharp peak
containing most of our samples, we can
take the log transform to spread the peak
out.

In the cell below, a dataframe stores information
on which variables are inspected in log space
for the rest of this notebook (initially
set to be the targets Ca, P, and SOC).  Taking
the log transform also changes the histogram
binning, so new bin information is also stored.
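
As a rough illustration of why log-spaced bins help, the sketch below
compares linear and logarithmic binning on synthetic, right-skewed data.
The lognormal sample and the names `values`, `lin_edges`, and `log_edges`
are assumptions made for this example; they are not the soil data.

```python
import numpy as np

# Synthetic, strongly right-skewed sample (an assumption, not the soil data).
rng = np.random.default_rng(0)
values = rng.lognormal(mean=0.0, sigma=1.5, size=1000)

# 20 equal-width bin edges versus 20 log-spaced bin edges over the same range.
lin_edges = np.linspace(values.min(), values.max(), 20)
log_edges = np.logspace(np.log10(values.min()), np.log10(values.max()), 20)

lin_counts, _ = np.histogram(values, bins=lin_edges)
log_counts, _ = np.histogram(values, bins=log_edges)

# Fraction of all samples that land in the single fullest bin.
lin_peak = lin_counts.max() / values.size
log_peak = log_counts.max() / values.size
print(f"fullest linear bin: {lin_peak:.0%}, fullest log bin: {log_peak:.0%}")
```

With linear bins most of the samples should pile into one bin, while the
log-spaced bins spread the same samples over many bins.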

In [None]:
#a dataframe that stores data pertinent to plotting axes, binning, etc.
plotdf = pd.DataFrame(index = onames, columns = ['offset','logbins','linbins','type'])

for name in onames:
  plotdf.loc[name,'offset']=0.00100001 - train[name].min()
  #plotdf.loc[name,'bins'] = np.logspace(-3.0,2.0,25)
  plotdf.loc[name,'logbins'] = np.logspace(
      np.floor(np.log10(train[name].min()+plotdf.loc[name,'offset'])),
      np.ceil(np.log10(train[name].max()+plotdf.loc[name,'offset'])),20)
  plotdf.loc[name,'linbins'] = np.linspace(0,train[name].max()+plotdf.loc[name,'offset'],20)


#playing around with log and linear scales - change here
plotdf.loc['Ca','type']='logbins'
plotdf.loc['P','type']='logbins'
plotdf.loc['pH','type']='linbins'
plotdf.loc['SOC','type']='logbins'
plotdf.loc['Sand','type']='linbins'


# Interactive plots with [ipywidgets](https://ipywidgets.readthedocs.io/en/stable/)

This notebook introduces the usage 
of a few widgets that can help users 
to explore data in a more intuitive way.

The three widgets used are:

`widgets.ToggleButtons`:  the
toggle button allows for switching between
categories.

`widgets.FloatSlider`:  a slider
can change the value of a continuous
variable.

`widgets.FloatRangeSlider`:  the
range slider can select a range
between the user-defined minimum
and maximum.

Only a few interactive ipywidgets are highlighted here,
but many more are available to serve your needs. Even 
without in-depth knowledge of 
ipywidgets, a beginning developer 
can still build some powerful tools.  More help on
getting started can be found in the
[ipywidgets documentation](http://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html).


The functions below are 
widget creators, and will 
be used repeatedly in the
rest of the plots and figures
in this notebook.

In [None]:
def createtargetbutton():
    return widgets.ToggleButtons(
        options=onames,
        description='Soil property:',
        disabled=False,
        button_style='', # 'success', 'info', 'warning', 'danger' or ''
        tooltips=['Mehlich-3 extractable Calcium', 
                  'Mehlich-3 extractable Phosphorus', 
                  'pH values',
                  'Soil organic carbon',
                  'Sand content'],
        value=onames[0])

def createsliceslider():
    return widgets.FloatSlider(
        value=90.,
        min=0.0,
        max=90.0,
        step=0.1,
        description='Slice of distribution (%ile)',
        orientation='horizontal',
        #readout=False,
        readout_format='4.1f',
        layout=widgets.Layout(width='80%'))

def createminmaxslider():
    return widgets.FloatRangeSlider(
        value=[90., 99.99],
        min=0.0,
        max=99.99,
        step=0.01,
        description='Slice of distribution (%ile)',
        orientation='horizontal',
        layout=widgets.Layout(width='90%'))

def createdepthbutton():
    return widgets.ToggleButtons(
        options=['All','Topsoil-Subsoil'],
        description='Groupby soil depth:',
        button_style='', 
        tooltips=['Combined distribution', 
                  'Topsoil and Subsoil, separate distributions'],
        value='All')


In [None]:
def plothists(axes,data,color='blue',label='all',alpha=1.0):
    for i,name in enumerate(onames):
        axes[i].hist(data[name]+plotdf.loc[name,'offset'],
                bins=plotdf.loc[name,plotdf.loc[name,'type']],
                alpha=alpha,color=color,label=label)
        if plotdf.loc[name,'type']=='logbins':
            axes[i].set_xscale('log')
        else:
            axes[i].set_xscale('linear')
            
        axes[i].set_yticklabels([])
        axes[i].set_xticklabels([])
        
def plotspecstandard(ax):
    ax.set_xlabel('wavenumber')
    ax.set_ylabel('spectral height, standardized')
    ax.yaxis.set_label_position("right")
    ax.yaxis.tick_right()
    ax.set_ylim(-2.5, 2.5)

def plotspecabsolute(ax):
    ax.plot(wavenumbers,
            mean[spectral_columns],color='black',linestyle='dotted',label='total mean')
    ax.set_xlabel('wavenumber')
    ax.set_ylabel('spectral height')
    ax.set_ylim(0,2.25)

## Category selection: User chosen targets

Below, we plot the averaged spectra 
of the highest and lowest 100 samples when 
sorted by the value of the selected target.  This
is a quick way to see differences in spectra
between the target "extremes".  The user 
chooses the target to sort by with 
the provided toggle buttons.

In [None]:
@interact(prop=createtargetbutton())
def plot_centeredspectra(prop):
    
    #select the spectral columns before averaging so the text Depth column is skipped
    high = train_scaled.sort_values(prop).tail(100)[spectral_columns].mean()
    low = train_scaled.sort_values(prop).head(100)[spectral_columns].mean()
    f2 = plt.figure()
    #f2.set_size_inches(12.0,6.0)
    plt.plot(wavenumbers,high,label="highest 100 mean")
    plt.plot(wavenumbers,low,label="lowest 100 mean")
    plt.legend()
    plt.title("Averaged spectrum, extremes for %s"%(prop))

The above interactive plot redraws at each button push.  
Changes to the plot below render 
more quickly by only redrawing the data.

(For interactivity, run the cell that appears below the plot)
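
The difference between redrawing a whole figure and updating only its data
can be sketched as follows. This is a minimal example that uses a synthetic
sine curve in place of the spectra; the names `update` and `freq` are
hypothetical and chosen only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 50)

# Create the figure and the line once, and keep a handle to the line.
fig, ax = plt.subplots()
(line,) = ax.plot(x, np.sin(x), label="curve")
ax.set_ylim(-1.1, 1.1)

def update(freq):
    """Swap the line's data in place instead of building a new figure."""
    line.set_data(x, np.sin(freq * x))
    fig.canvas.draw_idle()  # ask the canvas to repaint when it is next idle

update(2.0)
```

Because the figure, axes, labels, and limits are created only once, each
update has far less work to do than a full replot.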

In [None]:
#the averages of the spectra of the highest and lowest 
#content samples when ordered by selected property

high = train_scaled.sort_values(onames[0]).tail(100)[spectral_columns].mean()
low = train_scaled.sort_values(onames[0]).head(100)[spectral_columns].mean()

f3 = plt.figure()
#f3.set_size_inches(12.0,6.0)
h,=plt.plot(wavenumbers,high,label="highest 100 mean")
g,=plt.plot(wavenumbers,low,label="lowest 100 mean")

plt.ylim(-2, 2)
plt.legend(loc=0)

plt.title("Averaged soil spectrum, mean-centered extremes for %s"%(onames[0]))
plt.tight_layout()

def soilpropdefine(prop):
    h.set_data(wavenumbers,train_scaled.sort_values(prop).tail(100)[spectral_columns].mean())
    g.set_data(wavenumbers,train_scaled.sort_values(prop).head(100)[spectral_columns].mean())
    plt.title("Averaged spectrum, extremes for %s"%(prop))

In [None]:
interact(soilpropdefine,prop=createtargetbutton()); ##### Run this to interact with the plot above! #####

The following figure adds histograms that show 
how your selection is distributed across 
the other target values.  It also plots 
the average of the absolute spectra of the
highest and lowest ranked 100 samples in that property.

These take a while to render between 
button pushes, since everything is 
plotted all over again.

In [None]:
prop=onames[0]
@interact(prop=createtargetbutton())
def plotdata(prop):
    f2 = plt.figure()
    f2.set_size_inches(14.0,4.5)
    
    #mini distribution plots, high slice
    ax1 = f2.add_subplot(2,9,4)
    axes = [ax1] + [f2.add_subplot(2,9,i+4) for i in range(1, len(onames))]
    plothists(axes[:len(onames)],train,color='blue',label='all',alpha=0.3)
    for i,name in enumerate(onames):
        axes[i].hist(train.sort_values(prop).tail(100)[name]+plotdf.loc[name,'offset'],
                bins=plotdf.loc[name,plotdf.loc[name,'type']],
                alpha=1.0,color='blue',label='selected')
        if i==len(onames)-1:
            axes[i].legend(loc=1)
            
    #mini distribution plots, low slice
    ax2 = f2.add_subplot(2,9,13)
    axes2 = [ax2] + [f2.add_subplot(2,9,i+13) for i in range(1, len(onames))]
    plothists(axes2[:len(onames)],train,color='red',label='all',alpha=0.3)
    for i,name in enumerate(onames):
        axes2[i].hist(train.sort_values(prop).head(100)[name]+plotdf.loc[name,'offset'],
                bins=plotdf.loc[name,plotdf.loc[name,'type']],
                alpha=1.0,color='red',label='selected')
        if i==len(onames)-1:
            axes2[i].legend(loc=1)
        axes2[i].set_xlabel(name)

    
    #plot of spectra
    ax = plt.subplot2grid((2, 9), (0, 0), colspan=3,rowspan=2)
    plotspecabsolute(ax)
    #as_matrix() was removed from pandas; to_numpy() is its replacement
    ax.plot(wavenumbers,
            train.sort_values(prop).tail(100)[spectral_columns].mean().to_numpy(),
            color='blue',label='highest 100 mean')
    ax.plot(wavenumbers,
            train.sort_values(prop).head(100)[spectral_columns].mean().to_numpy(),
            color='red',label='lowest 100 mean')
    plt.legend()
    plt.title("Averaged soil spectrum, extremes for %s"%(prop))
    f2.tight_layout()


I like displaying the "all" histograms to 
compare the sample selection against, but 
didn't want to replot them each time. 
There doesn't seem to be a way to 
update an `Axes.hist` dynamically.  Instead, 
the code below clears the axes with `ax.cla()` 
and draws a new histogram.  However, 
we don't want both the "all" and "selected"
histograms on the same Axes object, because the
call to `ax.cla()` would clear both.

To avoid replotting the "all" histogram
each time, the solution below places
two Axes objects right on top of each other:
one holds the "all" histogram, the other the 
selection.  Only the selection is cleared and 
updated.  Make sure that the axes limits are the same for both.

## Range selection: choose the desired slice
We've also added a range selection slider below.  Instead of being 
limited to looking at just the `head(100)` or `tail(100)` 
of the sorted samples, you can now select the slice, between a minimum 
and maximum percentile, that you're interested in looking at.
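
The mapping from a percentile range to a slice of sorted rows can be
sketched as below. The helper name `percentile_slice` and the toy
dataframe are assumptions made for this example; the index arithmetic
mirrors the floor/ceil approach used in this notebook.

```python
import numpy as np
import pandas as pd

def percentile_slice(frame, column, low_pct, high_pct):
    """Return the rows of `frame` ranked between two percentiles of `column`."""
    nrows = len(frame)
    botind = int(np.floor(low_pct * nrows / 100.0))   # first row to keep
    topind = int(np.ceil(high_pct * nrows / 100.0))   # one past the last row
    return frame.sort_values(column).iloc[botind:topind]

# Toy data: 200 rows holding the shuffled integers 0..199 (hypothetical values).
toy = pd.DataFrame({"target": np.random.default_rng(1).permutation(200)})
selected = percentile_slice(toy, "target", 90.0, 99.99)
print(len(selected), selected["target"].min(), selected["target"].max())
```

Rounding the lower bound down and the upper bound up means that a narrow
range on the slider still selects at least the rows it touches.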

In [None]:
prop=onames[0]
top=99.99
bot=90.0
topind = int(np.ceil(top*nrows/100.00))
botind = int(np.floor(bot*nrows/100.00))

fig = plt.figure(figsize=(12,7))

#set up  placement
ax1 = fig.add_subplot(3, 5, 1)
axes = [ax1] + [fig.add_subplot(3, 5, i) for i in range(2, len(onames)+1)]
axes.append(plt.subplot2grid((5,10),(2,0),rowspan=3,colspan=5))
axes.append(plt.subplot2grid((5,10),(2,5),rowspan=3,colspan=5))

#histograms for comparison, includes all samples
plothists(axes[:len(onames)],train,color='blue',label='all',alpha=0.3)
for i,name in enumerate(onames):
    axes[i].set_xlabel(name)


#absolute spectrum, slice and average
axes[len(onames)].plot(wavenumbers,
                       mean[spectral_columns],color='black',linestyle='dotted',label='total mean')
specabs,=axes[len(onames)].plot(wavenumbers,
    train.sort_values(prop).iloc[slice(botind,topind)][spectral_columns].mean().to_numpy(),
    color='blue',label='slice average')
axes[len(onames)].set_xlabel('wavenumber')
axes[len(onames)].set_ylabel('spectral height')
axes[len(onames)].set_ylim(0,2.25)

# spectrum, mean centered and standardized
plotspecstandard(axes[len(onames)+1])
specstand,=axes[len(onames)+1].plot(wavenumbers,
    train_scaled.sort_values(prop).iloc[slice(botind,topind)][spectral_columns].mean(),
    color='blue',label="slice average")

#overplot histogram distributions of selected slice
newax1 = fig.add_axes(axes[0].get_position(), frameon=False)
newaxes = [newax1] + [fig.add_axes(axes[i].get_position(), frameon=False) 
                      for i in range(1, len(onames))]
for i,name in enumerate(onames):
    newaxes[i].set_ylim(axes[i].get_ylim())
    newaxes[i].set_xlim(axes[i].get_xlim())
plothists(newaxes[:len(onames)],train.sort_values(prop).iloc[slice(botind,topind)],
          color='blue',label='selected',alpha=1.0)

def changeplot1(val,prop):
    topind = int(np.ceil(val[1]*nrows/100.00))
    botind = int(np.floor(val[0]*nrows/100.00))

    for i,name in enumerate(onames):
        newaxes[i].cla()
        newaxes[i].set_ylim(axes[i].get_ylim())
        newaxes[i].set_xlim(axes[i].get_xlim())
        
    plothists(newaxes[:len(onames)],train.sort_values(prop).iloc[slice(botind,topind)],
          color='blue',label='selected',alpha=1.0)
    specabs.set_data(wavenumbers,
        train.sort_values(prop).iloc[slice(botind,topind)][spectral_columns].mean())
    specstand.set_data(wavenumbers,
        train_scaled.sort_values(prop).iloc[slice(botind,topind)][spectral_columns].mean())

In [None]:
interact(changeplot1,val=createminmaxslider(),prop=createtargetbutton());

## Slider: increase or decrease a variable
The interactive plot below 
is pretty much the same as that above, 
but with only one slider that selects a 
slice with 10% of the data.  With one slider 
it's easier to see in real time the evolution of the 
spectrum as one of the output values increases or decreases. 

In [None]:

prop=onames[0]
#The size of the slice will be fixed to 10% of the data
slicesize_percent = 10.
slicesize = int(nrows*slicesize_percent/100.)
botind_min = 0
botind_max = nrows-slicesize-1
botind = botind_max

fig = plt.figure(figsize=(12,7))

#set up  placement
ax1 = fig.add_subplot(3, 5, 1)
axes = [ax1] + [fig.add_subplot(3, 5, i) for i in range(2, len(onames)+1)]
axes.append(plt.subplot2grid((5,10),(2,0),rowspan=3,colspan=5))
axes.append(plt.subplot2grid((5,10),(2,5),rowspan=3,colspan=5))

#histograms for comparison, includes all samples
plothists(axes[:len(onames)],train,color='blue',label='all',alpha=0.3)
for i,name in enumerate(onames):
    axes[i].set_xlabel(name)

#absolute spectrum, slice and average
plotspecabsolute(axes[len(onames)])
specabs,=axes[len(onames)].plot(wavenumbers,
    train.sort_values(prop).iloc[slice(botind,botind+slicesize)][spectral_columns].mean(),
    color='blue',label='slice average')

# spectrum, mean centered and standardized
plotspecstandard(axes[len(onames)+1])
specstand,=axes[len(onames)+1].plot(wavenumbers,
    train_scaled.sort_values(prop).iloc[slice(botind,botind+slicesize)][spectral_columns].mean(),
      color='blue',label="slice average")

#overplot histogram distributions of selected slice
newax1 = fig.add_axes(axes[0].get_position(), frameon=False)
newaxes = [newax1] + [fig.add_axes(axes[i].get_position(), frameon=False) 
                      for i in range(1, len(onames))]
for i,name in enumerate(onames):
    newaxes[i].set_ylim(axes[i].get_ylim())
    newaxes[i].set_xlim(axes[i].get_xlim())
plothists(newaxes[:len(onames)],train.sort_values(prop).iloc[slice(botind,botind+slicesize)],
          color='blue',label='selected',alpha=1.0)

#botind/nrows is a fraction, so scale it by 100 to report a percentile
fig.suptitle('Soil spectrum and histograms for {:4.1f} to {:4.1f} percentile slice in {:s}'
             .format(100.*botind/nrows,100.*botind/nrows+slicesize_percent,prop))
    
def changeplot2(val,prop):
    botind = int(np.floor(val*nrows/100.))

    for i,name in enumerate(onames):
        newaxes[i].cla()
        newaxes[i].set_ylim(axes[i].get_ylim())
        newaxes[i].set_xlim(axes[i].get_xlim())

    #draw the selected-slice histograms once, after every overlay has been cleared
    plothists(newaxes[:len(onames)],train.sort_values(prop).iloc[slice(botind,botind+slicesize)],
          color='blue',label='selected',alpha=1.0)

    specabs.set_data(wavenumbers,
        train.sort_values(prop).iloc[slice(botind,botind+slicesize)][spectral_columns].mean())
    specstand.set_data(wavenumbers,
        train_scaled.sort_values(prop).iloc[slice(botind,botind+slicesize)][spectral_columns].mean())
    fig.suptitle('Soil spectrum and histograms for {:4.1f} to {:4.1f} percentile slice in {:s}'
             .format(val,val+slicesize_percent,prop))

In [None]:
interact(changeplot2,val=createsliceslider(),prop=createtargetbutton());

With all the data to be looked
up and replotted, things are pretty 
slow.  Let's use the same widgets 
but only look at the standardized 
spectral height below.


## Groupby:  see differences between groups
Another set of toggle buttons is added
below, allowing the user to group by Depth.  The
soil dataset doesn't have any other categorical
variables, which is too bad, because the pandas
`groupby` method is fun and powerful.  If interested,
try creating your own groups from the data to play
around with.
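
One way to create your own groups is to bin a continuous column with
`pd.cut` and then group by the new column. The sketch below uses a small
made-up table; the column names `4000.0` and `600.0`, the `pH_group`
column, and all values are hypothetical stand-ins for the soil data.

```python
import pandas as pd

# Toy stand-in for the soil table (made-up values, two spectral columns).
toy = pd.DataFrame({
    "Depth": ["Topsoil", "Subsoil", "Topsoil", "Subsoil", "Topsoil", "Subsoil"],
    "pH": [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5],
    "4000.0": [0.30, 0.28, 0.35, 0.31, 0.40, 0.38],
    "600.0": [1.70, 1.65, 1.80, 1.75, 1.90, 1.85],
})
spectral = ["4000.0", "600.0"]

# Group by the existing categorical column and average each spectral column.
by_depth = toy.groupby("Depth")[spectral].mean()

# Build a new categorical column by binning a continuous one, then group by it.
toy["pH_group"] = pd.cut(toy["pH"], bins=[-2.0, 0.0, 2.0], labels=["low", "high"])
by_ph = toy.groupby("pH_group", observed=True)[spectral].mean()

print(by_depth)
print(by_ph)
```

Each row of `by_depth` and `by_ph` is the mean spectrum of one group, so
it can be passed directly to a plotting call.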

In [None]:
prop=onames[0]
#The size of the slice will be fixed to 10% of the data, tweak here to change
slicesize_percent = 10.
slicesize = int(nrows*slicesize_percent/100.)
botind_min = 0
botind_max = nrows-slicesize-1
botind = botind_max

fig = plt.figure()
ax = fig.add_subplot(1,1,1)
# spectrum, mean centered and standardized
ax.set_xlabel('wavenumber')
ax.set_ylabel('spectral height, standardized')
ax.set_ylim(-2.5, 2.5)
specstand,=ax.plot(wavenumbers,
    train_scaled.sort_values(prop).iloc[slice(botind,botind+slicesize)][spectral_columns].mean()
                   ,label="All/Topsoil")
specstand_sub,=ax.plot([],[],label='Subsoil')
ax.legend()
#botind/nrows is a fraction, so scale it by 100 to report a percentile
fig.suptitle('Standardized soil spectrum for {:4.1f} to {:4.1f} percentile slice in {:s}'
             .format(100.*botind/nrows,100.*botind/nrows+slicesize_percent,prop))

def changeplot3(val,prop,depth):
    botind = int(np.floor(val*nrows/100.))
    dat=train_scaled.sort_values(prop).iloc[slice(botind,botind+slicesize)]
    if depth=='All':
        xtop=wavenumbers
        ytop=dat[spectral_columns].mean()
        xsub=[]
        ysub=[]
    else:
        #specstand is labeled "All/Topsoil" and specstand_sub is labeled "Subsoil"
        xtop=wavenumbers
        ytop=dat.groupby('Depth').get_group('Topsoil')[spectral_columns].mean()
        xsub=wavenumbers
        ysub=dat.groupby('Depth').get_group('Subsoil')[spectral_columns].mean()

    specstand.set_data(xtop,ytop)
    specstand_sub.set_data(xsub,ysub)
    fig.suptitle('Standardized soil spectrum for {:4.1f} to {:4.1f} percentile slice in {:s}'
             .format(val,val+slicesize_percent,prop))



In [None]:
interact(changeplot3,val=createsliceslider(),depth=createdepthbutton(),prop=createtargetbutton());