# Random Forest

This notebook shows how to use SML to read in a dataset, split it into training and testing data, replace troublesome values such as NaNs, perform classification, and, lastly, generate lattice plots and other visual metrics. For this use case we use the publicly available [Titanic Data Set](https://www.kaggle.com/c/titanic/data) and a random forest to classify whether a passenger survived.

## SML Query
### Imports
We import the necessary library to use SML.

In [1]:
from sml import execute

### Query
Next we create a query statement that uses `READ` to load the data; the file is comma-separated and its first row (row 0) is the header. We then `REPLACE` any 'NaN' values with the mode of their column, `SPLIT` the dataset so that 80% is used for training and 20% for testing, and lastly perform random forest classification on the 2nd column (`Survived`), using columns 1 and 3-12 as the predictors.

In [4]:
query = 'READ "../data/titanic.csv" (separator = ",", header = 0) AND\
REPLACE ("NaN", "mode") AND SPLIT (train = .8, test = 0.2) AND\
CLASSIFY (predictors = [1,3,4,5,6,7,8,9,10,11,12], label = 2, algorithm = forest)'

execute(query)
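One detail of the query cell is worth spelling out: a backslash at the end of a line inside a Python string literal joins the two lines with nothing in between, so `AND\` followed by `REPLACE` produces `ANDREPLACE`. This notebook does not say whether SML's parser needs whitespace there, so the following is only a sketch of how the same query could be assembled with the spaces made explicit. The `clauses` and `continued` names are introduced here for illustration.

```python
# Each SML clause is kept as its own string (clause texts copied from the query cell).
clauses = [
    'READ "../data/titanic.csv" (separator = ",", header = 0)',
    'REPLACE ("NaN", "mode")',
    'SPLIT (train = .8, test = 0.2)',
    'CLASSIFY (predictors = [1,3,4,5,6,7,8,9,10,11,12], label = 2, algorithm = forest)',
]

# Join the clauses with an explicit, space-padded AND keyword.
query = " AND ".join(clauses)

# For comparison: backslash continuation inside a literal inserts no whitespace.
continued = 'READ AND\
REPLACE'
print(continued)  # READ ANDREPLACE
```

If the parser treats `AND` as a whitespace-separated keyword, the resulting `query` can be passed to `execute` in the same way as before.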

## Manually

The cells below show how the same actions as the SML query can be performed manually.

### Imports
Here we import the libraries needed to perform the same actions as the SML query above.

In [7]:
# Libraries required for ML
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
# sklearn.cross_validation was renamed to sklearn.model_selection in scikit-learn 0.18
# and removed in 0.20; the alias keeps the cross_validation name used below working.
from sklearn import model_selection as cross_validation
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.ensemble import RandomForestClassifier

# Encoders/Imputers
# SimpleImputer (added in scikit-learn 0.20) replaces the older preprocessing.Imputer.
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin

### READ

Read the Titanic dataset into a pandas dataframe.

In [8]:
f = pd.read_csv("../data/titanic.csv", sep = ",", header = 0)
f.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
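To make the meaning of `header = 0` concrete, here is a minimal sketch that reads a small in-memory CSV instead of the real file. The `sample` text is a hand-written stand-in that reuses a few columns and values from the output above; it is not the actual dataset.

```python
import io
import pandas as pd

# A few columns from the first two rows of the Titanic output, as CSV text.
sample = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22.0\n"
    "2,1,1,female,38.0\n"
)

# header=0 tells pandas that row 0 of the file holds the column names.
df = pd.read_csv(sample, sep=",", header=0)

print(list(df.columns))  # ['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age']
print(df.shape)          # (2, 5)
```

The leftmost, unlabeled values in the displayed output are the dataframe's row index, not a column of the file.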


### REPLACE

Next we replace all NaNs with the mode of their particular column.

In [9]:
"""
impute_missing()
Parameters:
  data: Dataframe to impute missing values on
  columns: Columns to impute. If None, impute all columns with missing values
  impute_strategy: What to replace missing values with
     Options:
      'mode'
      'mean'
      'median'
      'maximum'
      'minimum'
      'drop column'
      'dummy'
      'rand_forest_reg' (not implemented)
  missing_values: Marker for missing entries; 'NaN' means real NaN values
Returns: Imputed copy of the dataframe (None if the strategy cannot be applied)
"""
def impute_missing(data, columns=None, impute_strategy='mode', missing_values='NaN'):
  datacopy = data.copy()  # work on a copy so the caller's dataframe is not modified
  dummy_val = 'U0'
  if columns is None:
    cols_to_impute = _find_cols_with_missing_vals(data, missing_values)
  else:
    cols_to_impute = columns
  if impute_strategy == 'mode':
    for col in cols_to_impute:
      modeVal = data[col].mode()  # mode() returns a Series; use its first entry
      datacopy[col] = data[col].fillna(modeVal[0])
    return datacopy
  elif impute_strategy == 'mean':
    for col in cols_to_impute:
      meanVal = data[col].mean()
      datacopy[col] = data[col].fillna(meanVal)
    return datacopy
  elif impute_strategy == 'median':
    for col in cols_to_impute:
      medianVal = data[col].median()
      datacopy[col] = data[col].fillna(medianVal)
    return datacopy
  elif impute_strategy == 'drop column':
    return _remove_columns(datacopy, cols_to_impute)
  elif impute_strategy == 'maximum':
    for col in cols_to_impute:
      maxVal = data[col].max()  # Series.max() skips NaN, unlike the built-in max()
      datacopy[col] = data[col].fillna(maxVal)
    return datacopy
  elif impute_strategy == 'minimum':
    for col in cols_to_impute:
      minVal = data[col].min()  # Series.min() skips NaN, unlike the built-in min()
      datacopy[col] = data[col].fillna(minVal)
    return datacopy
  elif impute_strategy == 'dummy':
    # Real NaN values need fillna(); replace() only matches literal markers
    if missing_values == 'NaN':
      return datacopy.fillna(dummy_val)
    return datacopy.replace(missing_values, dummy_val)
  # Do some more research on this before implementing
  elif impute_strategy == 'rand_forest_reg':
    print("RANDOM FOREST REGRESSOR NOT IMPLEMENTED NO IMPUTATION HAPPENED")
    return None
  else:
    print("REPLACE COMMAND NOT RECOGNIZED")
    return None

"""
remove_columns()
Parameters:
  data: dataframe to remove columns
  delete_list: list of names of columns to delete
Returns:
  Dataframe with deleted columns
"""
def _remove_columns(data, delete_list):
  return data.drop(delete_list, axis=1)

"""
find_cols_with_missing_vals()
Parameters:
  data: dataframe to inspect
  missing_values: marker for missing entries; 'NaN' means real NaN values
Returns:
  List of names of the columns that contain missing values
"""
def _find_cols_with_missing_vals(data, missing_values='NaN'):
  cols_to_impute = list()
  for col in data.columns:  # iterate over the argument, not the global dataframe
    if missing_values == 'NaN':
      has_missing = data[col].isnull().any()
    else:
      # Compare against the values; "x in series" would test the index instead
      has_missing = (data[col] == missing_values).any()
    if has_missing:
      cols_to_impute.append(col)
  return cols_to_impute


f_imputed = impute_missing(f)
f_imputed.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,B96 B98,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,B96 B98,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,B96 B98,S
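The output above shows the effect of the mode strategy: every missing `Cabin` entry now holds the most frequent cabin value. As a minimal sketch of that one step in isolation, the following uses a hand-made `toy` dataframe (not the Titanic data) and plain pandas calls.

```python
import numpy as np
import pandas as pd

# Hand-made stand-in data with a few missing entries.
toy = pd.DataFrame({
    "Cabin": ["C85", np.nan, "C85", "C123", np.nan],
    "Age": [22.0, 38.0, np.nan, 38.0, 35.0],
})

# Fill each column that has missing values with that column's mode.
# Series.mode() ignores NaN and returns a Series (ties are possible),
# so the first entry is used.
filled = toy.copy()
for col in toy.columns:
    if toy[col].isnull().any():
        filled[col] = toy[col].fillna(toy[col].mode()[0])

print(filled["Cabin"].tolist())  # ['C85', 'C85', 'C85', 'C123', 'C85']
print(filled["Age"].tolist())    # [22.0, 38.0, 38.0, 38.0, 35.0]
```

Mode imputation works for both text and numeric columns, which is why it can be applied to the whole dataframe here.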


### Preprocessing
Here we encode every non-numerical column as integer category codes, because the classifier expects numerical input.

In [10]:
def encode_categorical(df):
    df = df.copy()  # leave the caller's dataframe unchanged
    # Treat every non-numerical column as categorical
    categorical = [col for col in df.columns
                   if not pd.api.types.is_numeric_dtype(df[col])]
    for feature in categorical:
        # Map each distinct value to an integer code, in order of first appearance
        values = df[feature].unique()
        mapping = {value: code for code, value in enumerate(values)}
        df[feature] = df[feature].map(mapping)
    return df

f_encoded = encode_categorical(f_imputed)
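For comparison, pandas ships a built-in helper, `pd.factorize`, that performs the same kind of integer encoding. The sketch below applies it to a hand-made `toy` dataframe; it is an alternative illustration, not the encoding SML itself uses.

```python
import pandas as pd

# Hand-made stand-in data: two text columns and one numeric column.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

encoded = toy.copy()
for col in encoded.columns:
    if not pd.api.types.is_numeric_dtype(encoded[col]):
        # factorize returns (codes, uniques); codes follow order of first appearance
        codes, uniques = pd.factorize(encoded[col])
        encoded[col] = codes

print(encoded["Sex"].tolist())       # [0, 1, 1, 0]
print(encoded["Embarked"].tolist())  # [0, 1, 0, 2]
```

Integer codes impose an arbitrary ordering on the categories. Tree-based models such as random forests usually tolerate this, but other model families may need one-hot encoding instead.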


### SPLIT
Here we separate the labels from the features. (The dataset itself is split into training and testing parts inside scikit-learn's cross-validation function in the next step.)

In [13]:
# Separate features/labels
labels = f_encoded['Survived']
features = f_encoded.drop('Survived', axis=1)
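The SML query asks for an explicit 80%/20% split, whereas this manual version relies on cross-validation. For readers who want the explicit split, a minimal sketch with scikit-learn's `train_test_split` follows. The `toy_features` and `toy_labels` arrays are stand-ins, not the Titanic data, and `random_state=0` is an arbitrary choice made only for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 10 rows with 3 feature columns, plus one label per row.
toy_features = np.arange(30).reshape(10, 3)
toy_labels = np.array([0, 1] * 5)

# Hold out 20% of the rows for testing and keep 80% for training.
X_train, X_test, y_train, y_test = train_test_split(
    toy_features, toy_labels, train_size=0.8, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)  # (8, 3) (2, 3)
```

The same call accepts pandas objects, so `features` and `labels` from the cell above could be passed in directly.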

### CLASSIFY

We fit our random forest model on the training data, make predictions on the testing data, and display the accuracy. (All of this occurs within the cross-validation function, which repeats the process over 10 folds and returns one score per fold.)

In [15]:
rand_forest = RandomForestClassifier(n_estimators=100)
rand_forest_scores = cross_validation.cross_val_score(rand_forest, 
                                                      features,
                                                      labels,
                                                      cv=10,
                                                      scoring='accuracy')

rand_forest_scores.mean()

0.828361139484735
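To show what the cross-validation call does internally, the sketch below writes the 10-fold loop out by hand. It assumes a recent scikit-learn where these helpers live in `sklearn.model_selection`, and it uses synthetic data from `make_classification` rather than the Titanic features, so its score is not comparable to the one above. For a classifier and an integer `cv`, `cross_val_score` uses stratified folds, which `StratifiedKFold` reproduces here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data: 200 rows, 8 numerical features, binary label.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

fold_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
    # Fit a fresh forest on the training folds only.
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    # Score the held-out fold.
    predictions = model.predict(X[test_idx])
    fold_scores.append(accuracy_score(y[test_idx], predictions))

# The mean over folds corresponds to rand_forest_scores.mean() above.
print(np.mean(fold_scores))
```

Every row is used for testing exactly once, so the mean accuracy is less sensitive to one lucky or unlucky split than a single 80%/20% evaluation would be.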