### Random Forest

I will walk through the whole process of building a machine learning model on the well-known Titanic dataset. It provides information on the fate of passengers on the Titanic, summarized according to economic status (class), sex, age and survival.

In [1]:
# linear algebra
import numpy as np 

# data processing
import pandas as pd 

# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression

### Getting the Data

In [2]:
test_df = pd.read_csv("test.csv")
train_df = pd.read_csv("train.csv")

### Data Exploration/Analysis

In [3]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [4]:
train_df.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


The output above shows that about 38% of the passengers in the training set survived, and that passenger ages range from 0.42 to 80. We can also already spot features that contain missing values, such as ‘Age’, which has only 714 non-null entries out of 891.
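The survival rate is simply the mean of the 0/1 `Survived` column. A minimal sketch, assuming a toy Series that stands in for `train_df['Survived']`; the 342/549 split is derived from `count = 891` and `mean = 0.383838` in the `describe()` output, not read from the file:

```python
import pandas as pd

# Toy stand-in for train_df['Survived'] (an assumption for illustration):
# 342 ones and 549 zeros, the counts implied by count=891 and mean=0.383838.
survived = pd.Series([1] * 342 + [0] * 549, name="Survived")

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
survival_rate = survived.mean()
print(f"{survival_rate:.1%}")
```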

In [5]:
train_df.head(8)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


Let’s take a more detailed look at what data is actually missing:

In [6]:
total = train_df.isnull().sum().sort_values(ascending=False)
percent_1 = train_df.isnull().sum()/train_df.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])
missing_data.head(5)

,Total,%
Cabin,687,77.1
Age,177,19.9
Embarked,2,0.2
Fare,0,0.0
Ticket,0,0.0
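The same kind of summary can be written a little more compactly, because the mean of a boolean mask is already the missing fraction. A small sketch on a toy frame (the `toy` data is made up for illustration and is not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame, made up for illustration: 'Age' has 2 of 4 values missing.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isna().sum() counts missing values; isna().mean() gives the missing fraction.
missing = pd.DataFrame({
    "Total": toy.isna().sum(),
    "%": (toy.isna().mean() * 100).round(1),
}).sort_values("Total", ascending=False)

print(missing)
```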


The ‘Embarked’ feature has only 2 missing values, which can easily be filled. The ‘Age’ feature is trickier to deal with, since it has 177 missing values. The ‘Cabin’ feature needs further investigation, but we will probably want to drop it from the dataset, since 77% of its values are missing.

In [7]:
train_df.columns.values

array(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype=object)

### Data Preprocessing

First, I will drop ‘PassengerId’ from the train set, because it does not contribute to a person’s survival probability. I will not drop it from the test set, since it is required there for the submission.

In [8]:
train_df = train_df.drop(['PassengerId'], axis=1)

In [9]:
# extracting and then removing the targets from the training data 
targets = train_df['Survived']
train_df.drop(['Survived'], axis=1, inplace=True)

In [10]:
# merging train data and test data for future feature engineering
# PassengerId (still present in the test data) is dropped from the merged frame,
# since it is not an informative feature
# (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
combined = pd.concat([train_df, test_df], ignore_index=True)
combined.drop(['PassengerId'], inplace=True, axis=1)

In [11]:
# Now let's map the titles and bin them into broader categories
Title_Dictionary = {
    "Capt": "Officer",
    "Col": "Officer",
    "Major": "Officer",
    "Jonkheer": "Royalty",
    "Don": "Royalty",
    "Dona": "Royalty",
    "Sir" : "Royalty",
    "Dr": "Officer",
    "Rev": "Officer",
    "the Countess":"Royalty",
    "Mme": "Mrs",
    "Mlle": "Miss",
    "Ms": "Mrs",
    "Mr" : "Mr",
    "Mrs" : "Mrs",
    "Miss" : "Miss",
    "Master" : "Master",
    "Lady" : "Royalty"
}

In [12]:
#Generate a new Title column
combined['Title'] = combined['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip())
combined['Title'] = combined['Title'].map(Title_Dictionary)
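To make the extraction rule concrete, here is a small sketch on a few names. The first three come from the `head()` output above; the fourth name and the reduced `title_map` are made up for illustration, to show that a title missing from the dictionary ends up as NaN after `.map`:

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Example, Sgt. Made Up",  # made-up name with a title not in the mapping
])

# Same rule as above: the text between the first comma and the next period.
titles = names.map(lambda name: name.split(',')[1].split('.')[0].strip())

# Hypothetical reduced mapping, standing in for Title_Dictionary.
title_map = {"Mr": "Mr", "Miss": "Miss", "Master": "Master"}
binned = titles.map(title_map)

# Titles that are absent from the mapping become NaN, so it is worth listing them.
unmapped = titles[binned.isna()].unique()
print(titles.tolist(), list(unmapped))
```

Running the same check on `combined['Title']` before it is overwritten would show whether any title in the data is absent from `Title_Dictionary`.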

In [13]:
# fill missing ages with the mean age of passengers sharing the same Sex, Title and Pclass
# (transform keeps the result aligned with combined's index)
fill_mean = lambda g: g.fillna(g.mean())
combined['Age'] = combined.groupby(['Sex', 'Title', 'Pclass'])['Age'].transform(fill_mean)
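Group-wise imputation of this kind can be sketched on a toy frame. The data below is made up, and the final fallback to the overall mean is an assumption added for groups that have no observed age at all; it is not part of the notebook:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "Sex":    ["male", "male", "male", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1],
    "Age":    [20.0, np.nan, 30.0, np.nan, 40.0],
})

# transform("mean") returns one value per row, aligned with the original index,
# so it can be passed straight to fillna.
group_mean = toy.groupby(["Sex", "Pclass"])["Age"].transform("mean")
toy["Age"] = toy["Age"].fillna(group_mean)

# Assumed fallback: a group with no observed ages would still be NaN here,
# so fill any remainder with the overall mean.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

print(toy["Age"].tolist())
```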

In [14]:
# Name can be dropped now
combined.drop('Name', axis=1, inplace=True)

In [15]:
# removing the title variable
combined.drop('Title', axis=1, inplace=True)

In [16]:
# removing the Cabin variable
combined.drop('Cabin', axis=1, inplace=True)

In [17]:
# removing the Ticket variable
combined.drop('Ticket', axis=1, inplace=True)

In [18]:
# fill in the missing fare data with the mean fare
# (assigning back avoids relying on inplace=True on a column selection)
combined['Fare'] = combined['Fare'].fillna(combined['Fare'].mean())

In [19]:
# two missing embarked values - filling them with the most frequent one in the train set
combined['Embarked'] = combined['Embarked'].fillna('S')

In [20]:
# encoding in dummy variable
embarked_dummies = pd.get_dummies(combined['Embarked'], prefix='Embarked')
combined = pd.concat([combined, embarked_dummies], axis=1)
combined.drop('Embarked', axis=1, inplace=True)
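As a quick illustration of what the dummy encoding produces, here is a sketch on a toy column (the four port values are made up; `dtype=int` is passed only so the output is 0/1 rather than booleans):

```python
import pandas as pd

# Toy stand-in for combined['Embarked'] (made-up values).
ports = pd.Series(["S", "C", "Q", "S"], name="Embarked")

# One indicator column per distinct port, named with the given prefix.
dummies = pd.get_dummies(ports, prefix="Embarked", dtype=int)

print(list(dummies.columns))
print(dummies.iloc[0].tolist())
```

Each row has exactly one indicator set, so the three columns together carry the same information as the original column.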

In [21]:
# mapping gender to a numerical value
combined['Sex'] = combined['Sex'].map({'male':1, 'female':0})

In [22]:
# introducing a new feature : the size of families (including the passenger)
combined['FamilySize'] = combined['Parch'] + combined['SibSp'] + 1
combined.drop('Parch', axis=1, inplace=True)
combined.drop('SibSp', axis=1, inplace=True)

### Random Forest and K-fold Validation

In [58]:
# Prepare the training dataset
df_im_input=combined.iloc[:891]
df_im_output=targets

In [59]:
def train_validation_split(data, k):
    number_of_rows = data.shape[0]
    number_of_test = int(np.floor(number_of_rows / k))
    list_of_data_index = [i for i in range(number_of_rows)]
    
    result = pd.DataFrame(columns=['train_index','validation_index'], index=[i for i in range(k)])
    
    for i in range(k):
        total_index = np.array([i for i in range(number_of_rows)])
        test_index = np.random.choice(list_of_data_index, size=number_of_test, replace=False)
        train_index = np.delete(total_index, test_index)
        
        for index in test_index:
            list_of_data_index.remove(index)
        
        # print("{0}th iteration:\n".format({i}), "test index\n", test_index, "\ntrain index\n", train_index)
        
        result.iloc[i,0] = np.sort(train_index)
        result.iloc[i,1] = np.sort(test_index)

    return result



import itertools

def combination_grid(parameters):
    '''
    Build a DataFrame with one row per combination of parameter values.

    parameters: a dictionary, e.g. parameters = {'max_depth':[3, 5, 7, 9], 'min_samples_leaf': [1, 2, 3, 4]}
    '''
    # https://www.codegrepper.com/code-examples/delphi/python+list+of+lists+all+combinations
    combinations = list(itertools.product(*parameters.values()))
    df = pd.DataFrame(combinations, columns=list(parameters.keys()))

    return df

In [60]:
# splitting the data into 6 train/validation folds
split = train_validation_split(df_im_input, 6)
split

,train_index,validation_index
0,"[0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14...","[8, 17, 20, 24, 29, 36, 46, 47, 60, 65, 79, 88..."
1,"[0, 1, 2, 3, 5, 7, 8, 9, 10, 12, 13, 14, 15, 1...","[4, 6, 11, 16, 18, 23, 26, 33, 34, 35, 57, 68,..."
2,"[0, 1, 4, 5, 6, 7, 8, 9, 11, 12, 15, 16, 17, 1...","[2, 3, 10, 13, 14, 21, 25, 32, 38, 39, 43, 44,..."
3,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[19, 22, 30, 37, 40, 49, 50, 56, 61, 63, 64, 7..."
4,"[2, 3, 4, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16,...","[0, 1, 5, 7, 41, 48, 54, 55, 58, 59, 75, 76, 7..."
5,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 13, 14, 16...","[9, 12, 15, 27, 28, 31, 51, 53, 62, 66, 69, 71..."
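Because `train_validation_split` uses `floor(891 / 6) = 148` rows per validation fold, 891 - 6 × 148 = 3 rows never appear in any validation fold. As a point of comparison, here is a sketch using scikit-learn's `KFold`, which is swapped in here as an alternative and is not what the notebook uses; `random_state=0` and the placeholder array are arbitrary assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 891  # number of training rows, as reported by train_df.info()
placeholder = np.zeros((n_rows, 1))  # only the number of rows matters for splitting

# shuffle=True draws random folds; random_state makes them reproducible.
kf = KFold(n_splits=6, shuffle=True, random_state=0)
folds = list(kf.split(placeholder))

val_sizes = [len(val_idx) for _, val_idx in folds]
all_val = np.sort(np.concatenate([val_idx for _, val_idx in folds]))

print(val_sizes)
```

With `KFold`, every row lands in exactly one validation fold, and the remainder rows are spread across folds rather than dropped.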


In [61]:
# make grid of parameters

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 80, stop = 2000, num = 13)]
# Number of features to consider at every split
# ('auto' was deprecated in scikit-learn 1.1 and removed in 1.3; for classifiers it meant 'sqrt')
max_features = ['auto', 'sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
# max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

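# NOTE: this rebinds the name combination_grid from the function to the resulting DataFrame,
# so re-running this cell requires re-running the cell that defines the function first
combination_grid = combination_grid(random_grid)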
combination_grid

,n_estimators,max_features,max_depth,min_samples_split,min_samples_leaf,bootstrap
0,80,auto,10,2,1,True
1,80,auto,10,2,1,False
2,80,auto,10,2,2,True
3,80,auto,10,2,2,False
4,80,auto,10,2,4,True
...,...,...,...,...,...,...
7717,2000,log2,110,10,1,False
7718,2000,log2,110,10,2,True
7719,2000,log2,110,10,2,False
7720,2000,log2,110,10,4,True
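The size of the grid is the product of the list lengths: 13 × 3 × 11 × 3 × 3 × 2 = 7,722 combinations. A short sketch that checks this, using scikit-learn's `ParameterGrid` as an alternative to the hand-written helper (it is not used in the notebook, and it orders combinations by sorted parameter name rather than in `itertools.product` order):

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same value lists as in the cell above.
random_grid = {
    "n_estimators": [int(x) for x in np.linspace(start=80, stop=2000, num=13)],
    "max_features": ["auto", "sqrt", "log2"],
    "max_depth": [int(x) for x in np.linspace(10, 110, num=11)],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Number of combinations = product of the number of candidates per parameter.
n_combinations = int(np.prod([len(values) for values in random_grid.values()]))

# ParameterGrid enumerates the same combinations lazily, as dictionaries.
grid = ParameterGrid(random_grid)
first = grid[0]

print(n_combinations, len(grid))
```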


In [62]:
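# apply the k-fold splits to the 1st combination of parameters
# NOTE: the AUC below is computed on the training rows of each fold (train_data),
# so it measures fit to the training data; validation_data is built but never scored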
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc

auc_list = []
for k in range(split.shape[0]):
    train_data = df_im_input[df_im_input.index.isin(split.iloc[k,0])]
    validation_data = df_im_input[df_im_input.index.isin(split.iloc[k,1])]
    train_target = df_im_output[df_im_output.index.isin(split.iloc[k,0])]
    
    rf = RandomForestClassifier(n_estimators=combination_grid.iloc[0,0],
                                max_features=combination_grid.iloc[0,1],
                                max_depth=combination_grid.iloc[0,2],
                                min_samples_split=combination_grid.iloc[0,3],
                                min_samples_leaf=combination_grid.iloc[0,4],
                                bootstrap=combination_grid.iloc[0,5],
                                random_state = 2345)
    
    rf.fit(train_data, train_target)
    
    preds=rf.predict(train_data)
    preds_probabilities = rf.predict_proba(train_data)
    pred_probs = preds_probabilities[:, 1]
    
    [fpr, tpr, thr] = roc_curve(train_target, pred_probs)
    auc_ = auc(fpr, tpr)
    auc_list.append(auc_)

auc_avg = sum(auc_list) / len(auc_list) 
print(auc_avg)
# combination_grid['avg_auc'][0] = auc_avg

0.9879685543861475
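The loop above builds `validation_data` but computes the AUC from predictions on `train_data`, which likely explains why the average is so close to 1. Below is a minimal sketch of scoring on the held-out fold instead. It runs on synthetic data from `make_classification`, and uses `KFold` and `roc_auc_score` in place of the notebook's hand-written split and the `roc_curve`/`auc` pair; these are assumptions for illustration, not the original procedure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Synthetic stand-in for (df_im_input, df_im_output); not the Titanic data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

train_aucs, val_aucs = [], []
kf = KFold(n_splits=6, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    rf = RandomForestClassifier(n_estimators=80, max_depth=10, random_state=2345)
    rf.fit(X[train_idx], y[train_idx])

    # AUC on the rows the model was fitted on (what the cell above reports).
    train_probs = rf.predict_proba(X[train_idx])[:, 1]
    train_aucs.append(roc_auc_score(y[train_idx], train_probs))

    # AUC on the held-out rows (the usual cross-validation score).
    val_probs = rf.predict_proba(X[val_idx])[:, 1]
    val_aucs.append(roc_auc_score(y[val_idx], val_probs))

print(np.mean(train_aucs), np.mean(val_aucs))
```

In the notebook's own loop, the equivalent change would be to select the validation targets with `split.iloc[k,1]` and pass `validation_data` to `predict_proba`.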


In [None]:
# loop over the whole combination grid
# (as in the previous cell, the AUC is computed on the training rows of each fold)

combination_grid['avg_auc'] = 0.0

for i in range(combination_grid.shape[0]):
    auc_list = []
    for k in range(split.shape[0]):
        train_data = df_im_input[df_im_input.index.isin(split.iloc[k,0])]
        validation_data = df_im_input[df_im_input.index.isin(split.iloc[k,1])]
        train_target = df_im_output[df_im_output.index.isin(split.iloc[k,0])]
        rf = RandomForestClassifier(n_estimators=combination_grid.iloc[i,0],
                                max_features=combination_grid.iloc[i,1],
                                max_depth=combination_grid.iloc[i,2],
                                min_samples_split=combination_grid.iloc[i,3],
                                min_samples_leaf=combination_grid.iloc[i,4],
                                bootstrap=combination_grid.iloc[i,5],
                                random_state = 2345)
        
        rf.fit(train_data, train_target)
    
        preds=rf.predict(train_data)
        preds_probabilities = rf.predict_proba(train_data)
        pred_probs = preds_probabilities[:, 1]
    
        [fpr, tpr, thr] = roc_curve(train_target, pred_probs)
        auc_ = auc(fpr, tpr)
        auc_list.append(auc_)

    auc_avg = sum(auc_list) / len(auc_list)
    combination_grid.iloc[i,-1] = auc_avg
    print(auc_avg)

0.9879685543861475
0.9918622870154802
0.9690631147005165
0.9800263129686987
0.9446161020212783
0.9605946212904165
0.9763907704951311
0.9855466041931873
0.96631824970516
0.9783905261409654
0.9446161020212783
0.9605946212904165
0.9602751268534244
0.9749949338536293
0.9534549334969791
0.9692090496440976
0.9427195835081612
0.958283029935629
0.9976346107981348
0.9993321905547491
0.9790083247959958
0.9924951517122703
0.9492638813153542
0.9690284695117768
0.989096942553533
0.9971522324302605
0.9755554049916896
0.9904472859165171
0.9492638813153542
0.9690284695117768
0.9699318998628802
0.9873111607138899
0.9607822572941571
0.9794715485694688
0.9472148698785058
0.9660054089884974
0.9976346431560973
0.9993321905547491
0.9790235507123269
0.9925283183533978
0.9492625843590513
0.9690336573369885
0.9891120820811709
0.9971517205616381
0.975550374051588
0.9905063455497803
0.9492625843590513
0.9690336573369885
0.9699215625382327
0.9873445919055007
0.9607822572941571
0.9794641750746944
0.947214869878505

0.9993321905547491
0.981155785933772
0.9932339386613359
0.9543762830691372
0.9723674724088617
0.9896342524807231
0.9973077441093926
0.9781719208169551
0.9916441979492591
0.9543762830691372
0.9723674724088617
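The exhaustive loop fits 7,722 × 6 = 46,332 forests, some with 2,000 trees, which is slow. A cheaper alternative is to sample a fixed number of combinations. The sketch below uses scikit-learn's `RandomizedSearchCV` on synthetic data; the reduced parameter lists, `n_iter=5` and the data itself are assumptions chosen to keep the example fast, and this is a swapped-in technique rather than the notebook's method:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data; not the Titanic data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Deliberately small candidate lists (assumed values, for illustration only).
param_distributions = {
    "n_estimators": [20, 40, 80],
    "max_features": ["sqrt", "log2"],
    "max_depth": [5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=2345),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled combinations
    scoring="roc_auc",   # AUC measured on the held-out fold
    cv=6,                # 6-fold cross-validation, as in the notebook
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

`search.cv_results_` holds the per-combination scores, which plays the same role as the `avg_auc` column in `combination_grid`.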


In [53]:
combination_grid

0.0