# <center> Supervised Machine Learning <br/> Random Forests and Support Vector Machines (SVMs) <br/><br/> CSCAR WORKSHOP - Data Science Skills Series <br/><br/> 04/19/2017
## <center> Marcio Mourao


# <center> Setup for Anaconda / Jupyter Notebook

<ul>
    <li>Go to the page https://marcio-mourao.github.io/</li>
    <li>Download the materials under "Supervised Machine Learning in Python using Scikit-Learn (Random Forests and SVMs)" to your "username/Documents"</li><br/>
    
    <li>Click the Windows button (Bottom Left Corner)</li>
    <li>Click "All apps"</li>
    <li>Click "Anaconda3 (64-bit)"</li>
    <li>Click "Anaconda Prompt" </li>
    <li>Enter "conda update scikit-learn"</li><br/>
    
    <li>Click the Windows button (Bottom Left Corner)</li>
    <li>Click "All apps"</li>
    <li>Click "Anaconda3 (64-bit)"</li>
    <li>Click "Jupyter Notebook" </li><br/>
    
    <li>Click "Workshop.ipynb" (this should open a new tab in the browser)</li>
</ul>

# <center> Introduction

<ul>
  <li>Please sign the sign-up sheet!</li>
  <li>Visit http://cscar.research.umich.edu/ to see what else CSCAR offers!</li>
</ul>

# <center> Summary of this workshop

<ul>
  <li>Random Forests (use the 1994 census dataset)</li>
  <ul>
     <li>Brief description of the dataset</li>
     <li>Load and describe the data (using Pandas dataframes)</li>
     <li>Machine Learning</li>
  </ul><br>
  <li>SVM (use handwritten digits dataset) </li>
  <ul>
     <li>Brief description of the dataset</li>
     <li>Load and describe the data (using numpy)</li>
     <li>Machine Learning</li>
  </ul>
</ul>


# <center> References

<ul>
  <li>https://www.continuum.io/anaconda-overview</li>
  <li>http://www.numpy.org/</li>
  <li>http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html</li>
  <li>http://matplotlib.org/</li>
  <li>http://pandas.pydata.org/pandas-docs/stable/10min.html</li>
  <li>http://scikit-learn.org/stable/</li>
  <li>http://statsmodels.sourceforge.net/</li>
</ul>

In [None]:
#Check Python version
import sys
print(sys.version)

## Import relevant general modules

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# <center> Random Forests

## Some info about the dataset

This data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was selected using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)).

<b>The prediction task is to determine whether a person makes over $50K a year!</b>

<b>Attributes:</b>

income: >50K, <=50K

age: continuous

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked

fnlwgt: continuous

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool

education-num: continuous

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black

sex: Female, Male

capital-gain: continuous

capital-loss: continuous

hours-per-week: continuous

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands

## Load and describe the data

In [None]:
#Creates a dataframe named "adults" from reading the file "adult.csv"
adults = pd.read_csv('adult.csv',na_values=['?'])
adults.head()

In [None]:
#Displays the number of rows and columns of the dataframe
adults.shape

In [None]:
#Displays the data types associated with each dataframe column
adults.dtypes

In [None]:
#Describes everything in the dataframe
adults.describe(include='all')

In [None]:
#Displays whether columns contain any null values
adults.isnull().any(axis=0)

In [None]:
#Count the number of missing values in each column of the dataframe
adults.isnull().sum()

In [None]:
#Count the total number of missing values across all columns
adults.isnull().sum().sum()

In [None]:
#Count the number of rows that contain at least one NaN
adults.isnull().any(axis=1).sum()

In [None]:
#Fraction of rows with at least one NaN (candidates for removal),
#computed from the data rather than from a hard-coded count
adults.isnull().any(axis=1).mean()

In [None]:
#Removes any rows from the dataframe that contain NaNs
#(be careful about what you decide to do with missing values)
adults=adults.dropna(axis=0,how='any')
adults.head()
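
The missing-value steps above can be tried on a tiny table first. The sketch below is only an illustration: the three-row CSV is made up (it is not part of adult.csv), and the names `sample_csv`, `sample`, and `cleaned` are assumptions introduced here. It shows how `na_values=['?']` turns question marks into NaNs, how those NaNs are counted, and what `dropna` removes.

```python
import io
import pandas as pd

# Hypothetical three-row sample, NOT the real adult.csv
sample_csv = io.StringIO(
    "age,workclass,occupation\n"
    "39,State-gov,Adm-clerical\n"
    "54,?,?\n"
    "28,Private,Sales\n"
)

# '?' entries are read as NaN, as in the workshop's read_csv call
sample = pd.read_csv(sample_csv, na_values=['?'])

# Missing values per column, and number of rows with at least one NaN
missing_per_column = sample.isnull().sum()
rows_with_missing = sample.isnull().any(axis=1).sum()
print(missing_per_column)
print('Rows with NaNs:', rows_with_missing)

# Dropping rows with any NaN leaves only the complete rows
cleaned = sample.dropna(axis=0, how='any')
print(cleaned)
```

Dropping rows is the simplest option; whether it is appropriate depends on how many rows are lost and on why the values are missing.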

In [None]:
#Displays the number of rows and columns of the dataframe after removing NaNs
adults.shape

In [None]:
#Displays the first rows of the dataframe
adults.head(10)

## Machine Learning

In [None]:
#Import modules
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn import metrics

In [None]:
#Creates a label encoder object
le=LabelEncoder()

#Creates a copy of the dataframe
adults2=adults.copy()

#Converts each categorical (object) column into integer codes,
#stored in a new column named "<column>_enc"
for col in adults2.select_dtypes(include=['object']).columns.values:
    adults2[col + '_enc'] = le.fit_transform(adults2[col].values)

adults2.head()
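
To see what the label encoder does, here is a minimal sketch on a made-up list of values (the list `sex_values` is an assumption for illustration, not taken from the dataset file). `LabelEncoder` sorts the distinct labels and assigns them the integers 0, 1, 2, and so on.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for one categorical column
sex_values = ['Male', 'Female', 'Female', 'Male']

encoder = LabelEncoder()
codes = encoder.fit_transform(sex_values)

# Classes are stored in sorted order; the code is the position in that order
print('Classes:', encoder.classes_)
print('Codes:  ', codes)

# The mapping can be reversed to recover the original labels
decoded = encoder.inverse_transform(codes)
print('Decoded:', decoded)
```

Note that the integer codes impose an arbitrary ordering on the categories. Tree-based models such as random forests can often cope with this, but other models may need a different encoding (for example, one-hot encoding).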

In [None]:
#Check new data types
adults2.dtypes

In [None]:
#Define covariates in X and dependent variable in y
X = adults2[['age','workclass_enc','education.num','marital.status_enc','occupation_enc',
            'race_enc','sex_enc','relationship_enc','capital.gain','capital.loss',
            'hours.per.week','native.country_enc']]
y = adults2.income_enc

In [None]:
#Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

print('Total number of records: ', adults2.shape[0])
print('Type of X_train: ', type(X_train))
print('Number of records in X_train: ', len(X_train))
print('Fraction on X_train: ', len(X_train)/adults2.shape[0])
print('Number of records in y_train: ', len(y_train))
print('Type of y_train: ', type(y_train), '\n')

print('Type of X_test: ', type(X_test))
print('Number of records in X_test: ', len(X_test))
print('Fraction on X_test: ', len(X_test)/adults2.shape[0])
print('Number of records in y_test: ', len(y_test))
print('Type of y_test: ', type(y_test))
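
The split above does not set a test size, so `train_test_split` falls back to its default of holding out 25% of the records. A small sketch with made-up arrays (`X_demo` and `y_demo` are assumptions for illustration) makes the proportions and the role of `random_state` visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 records with 2 features each
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

# With no test_size or train_size given, 25% of the records go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=13)
print('Train records:', len(X_tr))
print('Test records: ', len(X_te))

# Reusing the same random_state reproduces exactly the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, random_state=13)
same_split = np.array_equal(X_te, X_te2)
print('Same split with same random_state:', same_split)
```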

In [None]:
#Creates a RF classification model
RF_model = RandomForestClassifier(n_estimators=10, criterion='gini')

#Fit to the data
RF_model.fit(X_train, y_train)
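
The `criterion='gini'` argument tells each tree to choose splits that reduce Gini impurity. As a rough sketch (the helper `gini_impurity` is written here for illustration and is not a scikit-learn function), the impurity of a set of labels is 1 minus the sum of squared class proportions.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_k^2), where p_k is the proportion of class k."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

# A node with a single class is pure: impurity 0
pure = gini_impurity(['<=50K', '<=50K', '<=50K', '<=50K'])

# A node split evenly between two classes has the maximum two-class impurity: 0.5
mixed = gini_impurity(['<=50K', '<=50K', '>50K', '>50K'])

print('Pure node: ', pure)
print('Mixed node:', mixed)
```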

In [None]:
#Obtain probability predictions
y_pred_RF_prob = RF_model.predict_proba(X_test)
print('Predicted probabilities: \n', y_pred_RF_prob)

#Obtain class predictions
y_pred_RF_class = RF_model.predict(X_test)
print('Predicted classes: \n', y_pred_RF_class)

In [None]:
#Obtains accuracy score
print('RF Score: ', metrics.accuracy_score(y_test, y_pred_RF_class))

In [None]:
#Obtains confusion matrix
RF_cm=metrics.confusion_matrix(y_test,y_pred_RF_class)
RF_cm
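
In scikit-learn's confusion matrix, rows correspond to the true classes and columns to the predicted classes. For a two-class problem the layout is therefore [[TN, FP], [FN, TP]]. The sketch below uses made-up counts (the matrix `cm_demo` is an assumption, not a result from the census data) to show how accuracy, precision, and recall follow from the four cells.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm_demo = np.array([[50, 10],
                    [5, 35]])

# For two classes, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)             # share of predicted positives that are correct
recall = tp / (tp + fn)                # share of true positives that are found

print('Accuracy: ', accuracy)
print('Precision:', precision)
print('Recall:   ', recall)
```

With the label encoding used earlier, the sorted class order would make `<=50K` class 0 and `>50K` class 1, so the second row and column would refer to people earning over $50K.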

In [None]:
#Capture feature importance from the RF model
feature_imp=RF_model.feature_importances_

#Create plot of feature importance
positions = np.arange(len(feature_imp))
plt.barh(positions, feature_imp, align='center')
plt.xlabel("Feature Importances")
plt.ylabel("Features")
plt.yticks(positions, ('Age','Working Class', 'Years Education', 'Marital Status', 'Occupation',
                       'Race', 'Sex', 'Relationship Status', 'Capital Gain', 'Capital Loss',
                       'Hours per Week','Native Country'))
plt.grid(True)
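
A bar chart is easier to read when the features are ranked. The sketch below shows one way to sort importances from largest to smallest; the three feature names and importance values are invented for illustration and are not output from the fitted model.

```python
import numpy as np

# Hypothetical feature names and importances (not results from RF_model)
feature_names = ['age', 'sex_enc', 'capital.gain']
importances = np.array([0.5, 0.1, 0.4])

# argsort gives ascending order; reverse it to rank from most to least important
order = np.argsort(importances)[::-1]
ranked = [(feature_names[i], float(importances[i])) for i in order]

for name, value in ranked:
    print(name, value)
```

The importances reported by a random forest are impurity-based and sum to 1, so they describe relative contributions within the fitted model rather than causal effects.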

In [None]:
#10-fold cross-validation; prints the mean accuracy across the folds
kf = KFold(n_splits=10, shuffle=True)
print('Cross validation score: ', cross_val_score(RF_model, X, y, cv=kf).mean())
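
The mean hides how much the score varies from fold to fold. As a sketch, the example below runs the same kind of cross-validation on a small synthetic dataset (generated with `make_classification`, an assumption made here so the example runs quickly) and reports the individual fold scores along with their spread.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data, not the census dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=10, random_state=0)
kf_demo = KFold(n_splits=10, shuffle=True, random_state=0)

# One accuracy score per fold
scores = cross_val_score(model, X_demo, y_demo, cv=kf_demo)
print('Fold scores:', scores)
print('Mean: %0.3f, standard deviation: %0.3f' % (scores.mean(), scores.std()))
```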

# <center> Support Vector Machines

## Some info about the dataset

The handwritten digits dataset is made up of 1797 images of 8x8 pixels each.

Each image, like the one shown below, is of a handwritten digit.

<b>The goal is to recognize handwritten digits.</b>

In [None]:
#Import modules
from sklearn import datasets
from sklearn import svm

## Load and describe the data

In [None]:
#Load the digits dataset
digits = datasets.load_digits()
digits

In [None]:
print(digits.data)
print(digits.data.shape)
print(digits.images)
print(digits.images.shape)
print(digits.target)
print(digits.target.shape)

In [None]:
#As an example, displays a digit
print(digits.data[-2])
plt.imshow(digits.images[-2], cmap=plt.cm.gray_r)
plt.show()
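
The two views printed above hold the same pixels: `digits.images` stores each digit as an 8x8 grid, while `digits.data` stores it as a flat row of 64 values, which is the shape a classifier expects. A short check of that relationship (the variable `flattened` is introduced here for illustration):

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flatten every 8x8 image into a row of 64 pixel values
flattened = digits.images.reshape(len(digits.images), -1)

print('Images shape:   ', digits.images.shape)
print('Data shape:     ', digits.data.shape)
print('Flattened shape:', flattened.shape)

# The flattened images match the data matrix element by element
identical = np.array_equal(flattened, digits.data)
print('Flattened images equal data:', identical)
```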

## Machine Learning

In [None]:
#Create predictors and target sets
X, y = digits.data, digits.target

#Split the data into training (85%) and test (15%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.85, random_state=13)

print('Total number of records: ', digits.data.shape[0])
print('Type of X_train: ', type(X_train))
print('Number of records in X_train: ', len(X_train))
print('Fraction on X_train: ', len(X_train)/digits.data.shape[0])
print('Number of records in y_train: ', len(y_train))
print('Type of y_train: ', type(y_train), '\n')

print('Type of X_test: ', type(X_test))
print('Number of records in X_test: ', len(X_test))
print('Fraction on X_test: ', len(X_test)/digits.data.shape[0])
print('Number of records in y_test: ', len(y_test))
print('Type of y_test: ', type(y_test))

In [None]:
#Creates the object
SVM_model = svm.SVC(gamma=0.001, C=100, probability=True)

#Fit to the data
SVM_model.fit(X_train,y_train)

In [None]:
#Obtain probability predictions
y_pred_SVM_prob = SVM_model.predict_proba(X_test)
print('Predicted probabilities: \n', y_pred_SVM_prob)

#Obtain class predictions
y_pred_SVM_class = SVM_model.predict(X_test)
print('Predicted classes: \n', y_pred_SVM_class)

In [None]:
#Obtains accuracy score
print('SVM Score: ', metrics.accuracy_score(y_test, y_pred_SVM_class))

In [None]:
#Obtains confusion matrix
SVM_cm=metrics.confusion_matrix(y_test,y_pred_SVM_class)
SVM_cm

In [None]:
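#Grid search over the SVM parameters C and gamma
from sklearn.model_selection import GridSearchCV

#Two values per parameter (the endpoints of each range) keep the search fast
C_range = np.logspace(-2, 2, 2)
gamma_range = np.logspace(-3, 3, 2)
param_grid = dict(C=C_range,gamma=gamma_range)
#random_state only has an effect when shuffle=True; recent scikit-learn versions raise an error otherwise
cv = KFold(n_splits=5, shuffle=True, random_state=42)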

grid = GridSearchCV(svm.SVC(), param_grid=param_grid, cv=cv)
grid.fit(X, y)

print("The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))
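
Besides the best combination, the fitted grid search keeps the score of every combination in `cv_results_`. The sketch below is a reduced version of the search above: it uses only the first 300 digits, three folds, and a hand-picked grid (all assumptions made here to keep the run short) and prints the mean cross-validated score for each parameter pair.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, KFold

digits = datasets.load_digits()

# Small subset and small grid, chosen only to keep this example fast
X_small, y_small = digits.data[:300], digits.target[:300]
param_grid_demo = {'C': [1, 100], 'gamma': [0.001, 0.01]}
cv_demo = KFold(n_splits=3, shuffle=True, random_state=42)

grid_demo = GridSearchCV(svm.SVC(), param_grid=param_grid_demo, cv=cv_demo)
grid_demo.fit(X_small, y_small)

# One entry per parameter combination, in the order they were tried
for params, score in zip(grid_demo.cv_results_['params'],
                         grid_demo.cv_results_['mean_test_score']):
    print(params, round(score, 3))

print('Best parameters:', grid_demo.best_params_)
```

Comparing all combinations shows whether the best score is clearly ahead or whether several settings perform about equally well, in which case a finer grid around them may be worth trying.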