# <center> RANDOM FORESTS <br/>
## <center> CSCAR WORKSHOP - Data Science Skills Series <br/><br/> 10/20/2017
### <center> Marcio Mourao


# <center> Setup for Anaconda / Jupyter Notebook

<ul>
    <li>Go to the page https://marcio-mourao.github.io/</li>
    <li>Download the materials under "Supervised Machine Learning in Python using Scikit-Learn (Random Forests)" to your "username/Documents"</li><br/>
    
    <li>Click the Windows button (Bottom Left Corner)</li>
    <li>Click "All apps"</li>
    <li>Click "Anaconda3 (64-bit)"</li>
    <li>Click "Anaconda Prompt" </li>
    <li>Enter "conda update scikit-learn"</li><br/>
    
    <li>Click the Windows button (Bottom Left Corner)</li>
    <li>Click "All apps"</li>
    <li>Click "Anaconda3 (64-bit)"</li>
    <li>Click "Jupyter Notebook" </li><br/>
    
    <li>Click "Workshop.ipynb" (this should open a new tab in the browser)</li>
</ul>

# <center> Introduction

<ul>
  <li>Please sign the sign-up sheet!</li>
  <li>Visit http://cscar.research.umich.edu/ to see what CSCAR offers</li>
  <li>For questions or comments, feel free to email me: mdam@umich.edu</li>
</ul>

# <center> References

<ul>
  <li>https://www.continuum.io/anaconda-overview</li>
  <li>http://www.numpy.org/</li>
  <li>http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html</li>
  <li>http://matplotlib.org/</li>
  <li>http://pandas.pydata.org/pandas-docs/stable/10min.html</li>
  <li>http://scikit-learn.org/stable/</li>
  <li>http://statsmodels.sourceforge.net/</li>
</ul>

# <center> Summary of this workshop

<ul>
  <li>Using Random Forests on the Iris dataset (first part) and on the 1994 census dataset (second part)</li>
  <ul>
     <li>Brief description of the dataset</li>
     <li>Load and describe the data (using Pandas dataframes)</li>
     <li>Machine Learning - Random Forests</li>
  </ul>
</ul>


## Some info about the dataset

This data set consists of 3 different types of irises: Setosa, Versicolour, and Virginica.

The data is stored in a 150x4 numpy.ndarray.

Rows correspond to samples and columns to the four features: Sepal Length, Sepal Width, Petal Length and Petal Width (all measured in cm).

For additional information: https://en.wikipedia.org/wiki/Iris_flower_data_set
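
The description above can be checked directly against the data. The following is a minimal sketch, assuming scikit-learn's bundled copy of the dataset; the name `iris_check` is only illustrative and is not used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris

# Load the bundled iris data (iris_check is a hypothetical name for this check)
iris_check = load_iris()

# The features are stored as a numpy.ndarray with 150 rows and 4 columns
print('Type of data: ', type(iris_check.data))
print('Shape of data: ', iris_check.data.shape)

# Column names and the three species labels
print('Features: ', iris_check.feature_names)
print('Species: ', list(iris_check.target_names))
```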

In [None]:
#Check Python version
import sys
print(sys.version)

## Import relevant general modules

In [None]:
#Load some relevant modules
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Load and describe the data

In [None]:
#Load the iris dataset
from sklearn.datasets import load_iris

In [None]:
# Create an object called iris with the iris data
iris = load_iris()
iris

In [None]:
#Create a dataframe with the four feature variables
flowers = pd.DataFrame(iris.data, columns=iris.feature_names)
flowers.head()

In [None]:
#Add a new column with the species names; this is the target we are going to try to predict
flowers['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
flowers.head()

In [None]:
#Obtain number of observations and number of features in the data
flowers.shape
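
Besides the shape, it can help to check how many observations each species has before fitting a classifier. A minimal sketch, rebuilding the species column from scratch so that it runs on its own (the names `iris_data`, `species` and `counts` are illustrative):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the species labels the same way the notebook does
iris_data = load_iris()
species = pd.Categorical.from_codes(iris_data.target, iris_data.target_names)

# Count the observations per species
counts = pd.Series(species).value_counts()
print(counts)
```

Balanced classes make plain accuracy a reasonable first metric; with strongly unbalanced classes it can be misleading.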

## Machine Learning

In [None]:
#Import modules
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn import metrics

In [None]:
#Define covariates in X and dependent variable in y
X = flowers.iloc[:,0:4]
y = flowers.species

In [None]:
#Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

print('Total number of records: ', flowers.shape[0])
print('Type of X_train: ', type(X_train))
print('Number of records in X_train: ', len(X_train))
print('Fraction in X_train: ', len(X_train)/flowers.shape[0])
print('Number of records in y_train: ', len(y_train))
print('Type of y_train: ', type(y_train), '\n')

print('Type of X_test: ', type(X_test))
print('Number of records in X_test: ', len(X_test))
print('Fraction in X_test: ', len(X_test)/flowers.shape[0])
print('Number of records in y_test: ', len(y_test))
print('Type of y_test: ', type(y_test))
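
By default, `train_test_split` holds out 25% of the records for testing and does not force the species proportions to match between the two sets. The sketch below shows one possible variation, not the split used in this workshop: it sets `test_size` explicitly and passes `stratify` so that each species is represented about equally in the test set. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load features and integer-coded species (0, 1, 2)
X_all, y_all = load_iris(return_X_y=True)

# Explicit 25% test set, stratified by species
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.25, stratify=y_all, random_state=13)

print('Records in training set: ', len(X_tr))
print('Records in test set: ', len(X_te))

# Number of test records per species
test_counts = np.bincount(y_te)
print('Test records per species: ', test_counts)
```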

In [None]:
#Creates a RF classification model
RF_model = RandomForestClassifier(n_estimators=10, criterion='gini')

#Fit to the data
RF_model.fit(X_train, y_train)

In [None]:
#Obtain probability predictions
y_pred_RF_prob = RF_model.predict_proba(X_test)
print('Predicted probabilities: \n', y_pred_RF_prob)

#Obtain class predictions
y_pred_RF_class = RF_model.predict(X_test)
print('Predicted classes: \n', y_pred_RF_class)
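
The two kinds of prediction are closely related: for a random forest classifier, the predicted class is the one with the highest predicted probability. A minimal, self-contained sketch of that relationship follows (the names `model`, `proba` and `from_proba` are illustrative, and `random_state=0` is an arbitrary choice made here for reproducibility):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, random_state=13)

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_tr, y_tr)

# One row per test record, one column per class; each row sums to 1
proba = model.predict_proba(X_te)

# Pick the class with the highest probability in each row
from_proba = model.classes_[np.argmax(proba, axis=1)]

# Compare with the classes returned by predict
same = np.array_equal(from_proba, model.predict(X_te))
print('Classes from probabilities match predict: ', same)
```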

In [None]:
#Obtains accuracy score
print('RF Score: ', metrics.accuracy_score(y_test, y_pred_RF_class))

In [None]:
#Obtains confusion matrix
RF_cm=metrics.confusion_matrix(y_test,y_pred_RF_class)
RF_cm
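
The raw confusion matrix is easier to read with row and column labels. In scikit-learn, rows are the true classes and columns are the predicted classes, so the diagonal holds the correctly classified records. The sketch below is one way to label it and to recover the accuracy from it; all names are illustrative and the model is refit here only so that the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris_data = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_data.data, iris_data.target, random_state=13)

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

# Rows: true species, columns: predicted species
cm = metrics.confusion_matrix(y_te, pred, labels=model.classes_)
cm_df = pd.DataFrame(cm,
                     index=['true ' + n for n in iris_data.target_names],
                     columns=['pred ' + n for n in iris_data.target_names])
print(cm_df)

# Accuracy is the diagonal (correct predictions) over the total
accuracy_from_cm = np.trace(cm) / cm.sum()
print('Accuracy from confusion matrix: ', accuracy_from_cm)
```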

In [None]:
#Capture feature importance from the RF model
feature_imp=RF_model.feature_importances_

#Create plot of feature importance
positions = np.arange(len(feature_imp))
plt.barh(positions, feature_imp, align='center')
plt.xlabel("Feature Importances")
plt.ylabel("Features")
plt.yticks(positions, ('Sepal Length','Sepal Width','Petal Length','Petal Width'))
plt.grid(True)
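
The feature labels in the plot are typed by hand, so they would be wrong if the column order ever changed. A possible alternative, sketched below under the assumption that the model is fit on the columns in their original order, is to pair the importances with the feature names stored in the dataset. The names used here are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris_data = load_iris()

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(iris_data.data, iris_data.target)

# Pair each importance with its feature name instead of hard-coding labels
importances = pd.Series(model.feature_importances_,
                        index=iris_data.feature_names)
print(importances.sort_values(ascending=False))

# The importances are normalized, so they add up to 1
print('Sum of importances: ', importances.sum())
```

With only 10 trees the ranking can change from run to run unless `random_state` is fixed.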

In [None]:
#10-fold cross-validation: mean accuracy across the folds
kf = KFold(n_splits=10, shuffle=True)
print('Cross validation score: ', cross_val_score(RF_model, X, y, cv=kf).mean())
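
The mean alone hides how much the score varies between folds. The following sketch is a variation on the cell above, not a replacement for it: it uses `StratifiedKFold` so every fold keeps the species proportions, fixes `random_state` for reproducibility, and reports the standard deviation alongside the mean. The names and the seed values are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_all, y_all = load_iris(return_X_y=True)

model = RandomForestClassifier(n_estimators=10, random_state=0)

# 10 shuffled folds, each with roughly the same share of every species
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# cross_val_score refits a fresh copy of the model on each fold
scores = cross_val_score(model, X_all, y_all, cv=skf)

print('Score per fold: ', scores)
print('Mean score: ', scores.mean())
print('Standard deviation: ', scores.std())
```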