# Graphically Exploring your data 
In this notebook we will look at ways to graphically depict your data and explore relationships that might exist within it. Before running this notebook, please save the plot_data.csv and plot_subplot.csv files from the Summarizing_plot_data.ipynb notebook.

## Import packages

In [None]:
import pandas as pd

### Load the plot_subplot.csv file into a dataframe, separate the response and predictor variables, and describe the data.

In [None]:
plot_sub_df= pd.read_csv('plot_subplot_data.csv')
plot_sub_df.columns


In [None]:
ids=['plotid', 'sampleid', ]

resp=['Use', 'Cover', 'Vegetation',
       'Herbaceous', 'Grass', 'Cultivation', 'WetLand', 'Terrain', 'Water',
       'Another Class', 'SAF',]

pred=['BLUE',
       'GREEN', 'NIR', 'RED', 'SWIR1', 'SWIR2', 'altura2', 'aspect',
       'aspectcos', 'aspectdeg', 'aspectYesn', 'brightness', 'clay_1mMed',
       'diff', 'elevation', 'evi', 'fpar', 'hand30_100', 'lai', 'mTPI', 'ndvi',
       'ocs_1mMed', 'sand_1mMed', 'savi', 'Yeslt_1mMed', 'slope', 'topDiv',
       'wetness']

In [None]:
plot_sub_df[resp].describe()

In [None]:
plot_sub_df[pred].describe()

### Note that the plot subplot dataset has over 100K observations. To explore general trends in the data we don't necessarily need to use all of it. To reduce the amount of processing and speed up the analyses, we can take a random sample of the data. Let's randomly choose 1% of the observations to look for trends.

There are multiple ways to randomly choose observations from a dataframe. Here we use numpy's `choice` function to randomly select row positions, and then use those positions to select records from the plot_subplot dataframe.
- Sample 1% of the plot_subplot data

In [None]:
import numpy as np
N=plot_sub_df.shape[0]
n=int(N*0.01)
ind=np.random.choice(N,n,replace=False) #randomly choose 1% of the row positions without replacement
sub_df=plot_sub_df.iloc[ind] #use pandas' iloc (integer position based indexing) to select those rows
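
As one of the other ways to draw a random sample mentioned above, pandas provides `DataFrame.sample`. The sketch below uses a small synthetic dataframe (`demo_df`, with made-up values) as a stand-in for the plot subplot data; the column names and the `random_state` value are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for plot_sub_df (hypothetical values, not the real data)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "ndvi": rng.uniform(-1, 1, size=1000),
    "elevation": rng.normal(500, 100, size=1000),
})

# Draw 1% of the rows without replacement; random_state makes the draw repeatable
demo_sample = demo_df.sample(frac=0.01, replace=False, random_state=42)

print(demo_sample.shape)
```

Setting `random_state` is optional, but it lets you reproduce the same subset each time the cell is run.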

## Scatter plot matrix, correlation matrix, and box plots of predictor variables for a random subset of the data.
- Make a scatter plot matrix

In [None]:
pd.plotting.scatter_matrix(sub_df[pred], alpha=0.2, figsize=(15, 15),)

- create the correlation matrix

In [None]:
sub_df[pred].corr()
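
With 28 predictors the correlation matrix is large, so it can help to list only the strongly correlated pairs. The following is a minimal sketch on synthetic data; the dataframe `demo_df`, its columns, and the 0.8 cutoff are assumptions for illustration, not values from this dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: "b" is nearly collinear with "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo_df = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

corr = demo_df.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once;
# all other cells become NaN
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Reshape to a Series indexed by (variable 1, variable 2); the threshold filter
# also discards any NaN placeholders because NaN comparisons are False
pairs = upper.stack()
strong_pairs = pairs[pairs.abs() > 0.8].sort_values(ascending=False)
print(strong_pairs)
```

The same steps could be applied to `sub_df[pred].corr()` to get a short list of predictor pairs worth a closer look.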

- boxplots; due to the number of box plots, we will use matplotlib directly to create subplots

In [None]:
import matplotlib.pyplot as plt
fig,ax=plt.subplots(5,6,figsize=(15,15)) #make place holder for 30 box plots. Remember, we have 28 columns.
cnt=0
for nm in pred:
    r=cnt%5
    c=cnt//5
    sub_df[[nm]].boxplot(ax=ax[r,c])
    cnt+=1

ax

## Now let's look at the response variables 
### These are categorical variables so we have fewer ways to graphically display aspects of the data. In this instance we will create a grid of pie charts depicting the proportion of each category within a given response variable.

- pie charts for 11 response variables

In [None]:
fig,ax=plt.subplots(4,3,figsize=(15,15)) #make place holder for 12 pie charts. Remember, we have 11 response columns.
cnt=0
for nm in resp:
    r=cnt%4
    c=cnt//4
    sub_df[nm].value_counts().plot(ax=ax[r,c],kind='pie',title=nm,autopct='%1.1f%%') #select the column as a Series so the wedges are labeled with plain category names
    cnt+=1

ax
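
The percentages drawn on each pie chart can also be produced as numbers. This is a small sketch using a made-up categorical Series; the category names are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical categorical column standing in for a response variable such as Use
demo = pd.Series(
    ["Forest", "Forest", "Pasture", "Crop", "Forest", "Pasture"], name="Use"
)

# normalize=True returns proportions instead of raw counts
props = demo.value_counts(normalize=True)
print(props)
```

Applied to a column of `sub_df`, the same call gives the proportions behind the corresponding pie chart.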

## Exercise 1: Interpretation
- How many unique categories are in the Use column?
- What is the average lai value?
- What does the scatter plot matrix tell us?
- What does the correlation matrix tell us?
- What insights can we glean from the boxplots?
- What insights can we glean from the pie charts?
- Why did we use a random sample of the plot subplot data?
- Task: explore general trends in the summarized plot data.


## From the scatter plot matrix and correlation matrix note the amount of linear correlation among the predictor variables. Let's perform a PCA using all the data and transform our data into independent components.

We will be using sklearn to [scale](https://scikit-learn.org/stable/modules/preprocessing.html) our data and perform a [principal component analysis](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA) (PCA). If you want to learn more about PCA, check out sklearn's [user guide](https://scikit-learn.org/stable/modules/decomposition.html) and [examples](https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html#sphx-glr-auto-examples-decomposition-plot-pca-iris-py). Let's create a PCA and look at some of the results.

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

ss=StandardScaler(with_mean=False) #scaling data without centering on the mean because PCA will center the data.
X=plot_sub_df[pred].values
ss.fit(X)
X2=ss.transform(X) # scaling our values so they are comparable

pca = PCA()
pca.fit(X2) # fit the PCA on scaled values
vexp=pd.DataFrame(pca.explained_variance_ratio_,columns=['var']) # get the proportion of variance explained by each component

#find the number of components needed to account for 95% of the variation in the data
cmp=0
s=0
for v in vexp['var']:
    s+=v
    cmp+=1
    if(s>0.95):break
    
#report the number of components and plot the proportion of variance explained by each
print('95% of the variation can be explained using the first',cmp,'components')
vexp.plot(kind='barh',figsize=(15,8),title='Percent variation by each component').invert_yaxis()
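
The running total used to find the 95% cutoff can also be written with `np.cumsum`, and scikit-learn can choose the number of components itself when `n_components` is given as a fraction. The sketch below runs on synthetic data; `X_demo` and the way it is generated are assumptions for illustration only, and it uses the default `StandardScaler()` settings rather than the ones in the cell above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 6 columns built from 2 underlying signals plus a little noise
rng = np.random.default_rng(0)
signals = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X_demo = signals @ mixing + 0.05 * rng.normal(size=(500, 6))

X_scaled = StandardScaler().fit_transform(X_demo)

# Cumulative proportion of variance explained by the first k components
pca_demo = PCA().fit(X_scaled)
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)

# First position where the running total exceeds 0.95, converted to a count
n_keep = int(np.argmax(cum_var > 0.95)) + 1

# Passing a fraction asks scikit-learn to pick the number of components
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)
print(n_keep, pca_95.n_components_)
```

Both approaches should report the same number of components; plotting `cum_var` is another way to see where the explained variation levels off.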


### Displaying the proportion of variance explained by each component in this manner highlights that there was substantial linear correlation among the predictor variables and gives us a graphical way to select how many components to keep in future analyses. In this case, we want to keep the minimum number of components that account for 95% of the variation (information) in the data. In the horizontal bar chart, the additional variation explained by each successive component appears to level off around the 14th component. Setting a threshold of 95% indicates that we need to keep 13 components, which agrees with our visual interpretation of the bar chart. So let's transform our Google Earth Engine data into 13 independent components.

### Transform the predictor values to 13 components and plot a scatter plot matrix with Use labels.

In [None]:
cmp_df=pd.DataFrame(pca.transform(X2)[:,0:13]) #just keep the first 13 components and turn them into a dataframe
cmp_df['Use']=plot_sub_df['Use'] # add the use response variable to the dataframe
display(cmp_df)
uvls=cmp_df['Use'].value_counts()/cmp_df.shape[0]
print('Proportion of Use categories')
uvls

#### Do the proportions match the pie chart? Why or why not?
- scatter plot matrix of the components, colored by Use label

In [None]:
#Create a color dictionary for each category
color_dic={uvls.index[0]:'green',uvls.index[1]:'tan',uvls.index[2]:'yellow',uvls.index[3]:'purple',uvls.index[4]:'blue',uvls.index[5]:'brown',uvls.index[6]:'grey',uvls.index[7]:'grey'}
cmp_df['color']=cmp_df['Use'].map(color_dic)
tdf=cmp_df.iloc[ind]

#pd.plotting.scatter_matrix(tdf.iloc[:,0:-2])
pd.plotting.scatter_matrix(tdf[np.arange(13)],c=tdf['color'],figsize=(15,15))

### Plot the first and second components

In [None]:
cmp_df.plot(kind='scatter',x=0,y=1,c='color',figsize=(15,15),xlabel='Comp1',ylabel='Comp2')

## Exercise 2: Interpretation 2
- What proportion of the variance is explained by the second principal component?
- Why did we scale our data before performing PCA?
- What do the colors mean in the scatter plot matrix and the scatter plot of Comp1 and Comp2?
- What do components 1 and 2 mean?
- Does it look like various categories group together?
- Do you see any extreme values?
- Task: apply some of these same visualization approaches with the summarized plot data. Remember that the response variables are no longer categorical. We converted them to continuous variables (%).


## Extreme points and potential outliers
### To identify potential extreme values we will be using [Isolation Forest](https://scikit-learn.org/stable/auto_examples/ensemble/plot_isolation_forest.html#sphx-glr-auto-examples-ensemble-plot-isolation-forest-py). Isolation Forest isolates observations by "randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature". Observations that can be isolated with few splits are considered extreme. Extreme observations are labeled with a value of -1, while inner values are labeled with a 1. Additionally, an anomaly distance (the `decision_function` score, where lower values are more extreme) can be calculated to rank extreme values.

In [None]:
from sklearn.ensemble import IsolationForest

#set the X values
X=cmp_df[cmp_df.columns[:13]]

isf = IsolationForest(max_samples=0.75, random_state=0)
isf.fit(X)

lbl=isf.predict(X) #1 is inner -1 is outer value
dist=isf.decision_function(X)
color_dic={1:'green',-1:'red'}
cmp_df['extreme']=lbl
cmp_df['an_dist']=dist
color=cmp_df['extreme'].map(color_dic)
cmp_df.plot(kind='scatter',x=0,y=1,c=color,figsize=(15,8))
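
To see how the labels and the anomaly distance relate to each other, here is a minimal sketch on synthetic data. The array `X_demo` (a cluster of points plus a few far away points) is made up for illustration, and the model uses default settings apart from `random_state`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in: 300 points around the origin plus 5 points far away
rng = np.random.default_rng(0)
inner = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
far = rng.uniform(low=8.0, high=10.0, size=(5, 2))
X_demo = np.vstack([inner, far])

isf_demo = IsolationForest(random_state=0).fit(X_demo)
labels = isf_demo.predict(X_demo)            # 1 = inner, -1 = extreme
scores = isf_demo.decision_function(X_demo)  # lower = more extreme

# predict returns -1 exactly where the decision_function score is negative
n_extreme = int((labels == -1).sum())

# Row positions of the five lowest scores (most extreme first)
most_extreme = np.argsort(scores)[:5]
print(n_extreme, most_extreme)
```

Sorting by the score, rather than relying only on the -1/1 labels, gives a ranking that can be inspected from the most extreme observation downward.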

## Exercise 3: Interpretation 3
- How many observations were labeled extreme?
- Interpret the scatter plot of component 0 versus component 1. What do the green and red points mean?
- What threshold was used to identify extreme values? Can you change that threshold?
- Are extreme values outliers?
- Should extreme values be removed from the analysis?
- How can you check if extreme values are outliers?
- If we decide to remove extreme values, how will that impact the summarized plot data?
- Task1: Create a graph that highlights anomaly distance.
- Task2: identify extreme values for the summarized plot data.