# Problem Set 5

## Trees, Bagging and RandomForest

### Manasi Kulkarni

### INFX 574

### Collaborated with: Prem Shah, Aditya Wakade, Pratik Damania, Gaurav Gohil

In [417]:
import pandas as pd
import numpy as np
from scipy.stats import entropy
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
import math
import matplotlib.pyplot as plt

In [419]:
# Importing the data
data = pd.read_csv('/Users/manasi/Desktop/titanic.csv')
data['sex'] = np.where(data.sex == 'female', 1, 0)

# Making a family siz by adding the columns parch and sibsp. And then classifying
data['family size'] = data.parch + data.sibsp
data['family class'] =  np.where(data['family size'].isin([0]), 'Single', np.where(data['family size'].isin([1,2]), 'Small', np.where(data['family size'].isin([3,4,5]), 'Medium', 'large')))


We grouped the family sizes into bundles. Accounting for 0 are singles. For 1 and 2, we have a Small family. For 3,4 and 5 ; we have a medium sized family and large accounts for the rest. Based on an initial analysis, we see that Singles account for the largest in the titanic population. This might lead to revelations later that solo travelers had a much higher chance to die than to survive. In addition, people traveling in families of 2-4 people actually could have a relatively high chance to survive. This chance could be significantly lower among 5+ families.

In [421]:
# Splitting the data in survived and not survived
sur = data[data.survived == 1]
n_sur = data[data.survived == 0]

In [423]:
# Finding null values for every column
nulls = data.isnull().sum()
nl = nulls[[0,1,3,4,5,6,8,12]]
nl

pclass         0
survived       0
sex            0
age          263
sibsp          0
parch          0
fare           1
body        1188
dtype: int64

There are 263 null values in age and 1188 null values in body. 

In [425]:
# Finding mean, max and min for survived
x1 = sur.describe().T[['mean', 'min', 'max']]
x1.columns = ['survived_average', 'survived_min', 'survived_max']

In [427]:
# Finding mean, max and min for not survived
x2 = n_sur.describe().T[['mean', 'min', 'max']]
x2.columns = ['not survived_average', 'not survived_min', 'not survived_max']

In [429]:
sum_data = pd.DataFrame(list(zip(sur.mean(),n_sur.mean(),nl)))
sum_data.columns = ['survived_average', 'not survived_average', 'nulls']
sum_data.index = ['pclass', 'survived', 'sex','age', 'sibsp', 'parch', 'fare', 'body']
sum_data

Unnamed: 0,survived_average,not survived_average,nulls
pclass,1.962,2.500618,0
survived,1.0,0.0,0
sex,0.678,0.156984,0
age,28.918228,30.545369,263
sibsp,0.462,0.521632,0
parch,0.476,0.328801,0
fare,49.361184,23.353831,1
body,,160.809917,1188


Now we join the datasets to find average and min max values for survived, not survived along with null values. 

In [431]:
temp = pd.merge(sum_data, x1, on = 'survived_average')
final = pd.merge(temp, x2, on = 'not survived_average')
final.index = ['pclass', 'survived','sex', 'age', 'sibsp', 'parch', 'fare', 'body']
final

Unnamed: 0,survived_average,not survived_average,nulls,survived_min,survived_max,not survived_min,not survived_max
pclass,1.962,2.500618,0,1.0,3.0,1.0,3.0
survived,1.0,0.0,0,1.0,1.0,0.0,0.0
sex,0.678,0.156984,0,0.0,1.0,0.0,1.0
age,28.918228,30.545369,263,0.1667,80.0,0.3333,74.0
sibsp,0.462,0.521632,0,0.0,4.0,0.0,8.0
parch,0.476,0.328801,0,0.0,5.0,0.0,9.0
fare,49.361184,23.353831,1,0.0,512.3292,0.0,263.0
body,,160.809917,1188,,,1.0,328.0


From the describe table above, we can see that the class 1 people survived the least. The class 3 survived the max in fact. Could it be for the fact that the class3 passengers had better facilities or were given first preferences of life boats?

Speaking of age, the younger people seem to have survived more but the range for not survived is smaller and spans till 74. Age has a lot of null values so we cannot be sure about its results.

In alignment with the above analysis, mid sized families had a better chance of survival than singles and large families. 

In [433]:
# Subsetting the data for including only ordinal and categorical variables which show strong differences between survived and not survived
dt_data = data[['pclass', 'sex','age', 'family class', 'embarked', 'survived']]

In [435]:
#Splitting the dataset based on 75% 25%
dt_test = data.sample(frac = 0.25)
dt_train = data.drop(dt_test.index)

Now we make a function to calculate the entropy based on 2 columns. For eg, this function will calculate the entropy based on lengths of columns to result into a weighted entropy later. 

In [437]:
def entropy_finder(x,y):
    sum = x + y
    if(x==0 or y==0):
        entropy = 0
    else:
        entropy = -(x/sum)*np.log2(x/sum) - (y/sum)*np.log2(y/sum)
    return(entropy)

In [439]:
# Finding the entropy of the target variable in the training data
entropy_target = entropy_finder(len(dt_train[dt_train['survived'] == 1]), len(dt_train[dt_train['survived'] == 0]))
print('Entropy of survived in training data',entropy_target)

Entropy of survived in training data 0.9549950637253497


In [441]:
# Making a function for finding entropy and information gain
def col_entropy_finder(col_tar_data, col_name):
    res = pd.DataFrame(columns = ['value', 'entropy', 'weight'])
    len_tot = len(col_tar_data)
    ig = 0
    for elem in col_tar_data[col_name].unique():
        x = col_tar_data[col_tar_data[col_name] == elem]
        col_weight = len(x)
        e1 = len(x[x['survived'] == 1])
        e2 = len(x[x['survived'] == 0])
        entr = entropy_finder(e1, e2)
        idx = len(res) + 1
        res.loc[idx] = [elem,entr,col_weight]
        ig = ig + (entr*col_weight/len_tot)
    
    return(res,ig)

In [443]:
f, col_entr = col_entropy_finder(data, 'pclass')
print(f)
print("Column entropy is", col_entr)

   value   entropy  weight
1    1.0  0.958609   323.0
2    2.0  0.985653   277.0
3    3.0  0.819554   709.0
Column entropy is 0.8890147580167741


### Finding an optimal value for splitting age 

Creating an ordered vector of unique age values (a1, a2, . . . , ak). These will form the potential split
points for age. 

In [445]:
# We drop all NaN values for age and embarked
data_new = data.drop(index = data[data['age'] == float('nan')].index)
data_new = data.drop(index = data[data['age'].isnull() == True].index)

data_new = data.drop(index = data[data['embarked'] == float('nan')].index)
data_new = data.drop(index = data[data['embarked'].isnull() == True].index)

In [447]:
# Redefining test train for combined analysis
dt_test = data_new.iloc[0:300]
dt_train = data_new.drop(index = dt_test.index)

In [448]:
# Making an empty dataframe for storing the emtropy values for every splitting value of age
age_entropy  = pd.DataFrame(columns = ['age', 'entropy'])
# Dropping the NAN rows from training data
for elem in dt_train['age'].unique():
    # Splitting age according to threshold in this iteration
    dt_train['age class'] = np.where(dt_train['age'] > elem, 1,0)
    # Finding entropy of age column
    f, en = col_entropy_finder(dt_train, 'age class')
    idx = len(age_entropy) + 1
    age_entropy.loc[idx] = [elem, en]

age_entropy.nsmallest(10, columns = 'entropy')

Unnamed: 0,age,entropy
62,36.5,0.886966
15,36.0,0.887311
20,34.0,0.887564
73,0.3333,0.88775
37,37.0,0.887764
42,38.0,0.887887
74,0.1667,0.887913
79,34.5,0.887918
51,33.0,0.888051
59,32.5,0.888069


We have only found out this information for a single column and we need to know if this is the split based on the information gain. Just like we performed the above analysis for a single column, we perform this on all columns. Basically, we iterate this function over our dataset to find the entropy for all columns. 

In [449]:
# Consider column pclass
# Making an empty dataframe for storing the emtropy values for every splitting value of
pclass_entropy  = pd.DataFrame(columns = ['pclass', 'entropy'])
# Dropping the NAN rows from training data
for elem in dt_train['pclass'].unique():
    # Splitting age according to threshold in this iteration
    dt_train['pclass class'] = np.where(dt_train['pclass'] > elem, 1,0)
    # Finding entropy of age column
    f, en = col_entropy_finder(dt_train, 'pclass class')
    idx = len(pclass_entropy) + 1
    pclass_entropy.loc[idx] = [elem, en]

pclass_entropy.nsmallest(10, columns = 'entropy')

Unnamed: 0,pclass,entropy
2,2.0,0.869473
1,1.0,0.88877
3,3.0,0.891828


We can see that for pclass, the entropy was much smaller than age, so that is probably a better column to start as a root. Now we create a loop to run this analysis for all columns except age, considered for the training data.

Now we create an analysis for all the possible column values and their thresholds to determine which combination would be the best for splitting the root

In [450]:
# Converting all categorical values into ordinal values

# Family class
data_new['family class'] = np.where(data_new['family class'] == 'single', 0, np.where(data_new['family class'] == 'small',1, np.where(data_new['family class'] == 'medium', 2, 3)))
data_new['embarked'] = np.where(data_new['embarked'] =='Q', 0, np.where(data_new['embarked'] =='C',1,2))

In [451]:
# Redefining test train for combined analysis
dt_test = data_new.iloc[0:300]
dt_train = data_new.drop(index = dt_test.index)

In [452]:
#Finding unique values from the age
dt_train['age'].unique()

array([35.    , 64.    , 60.    , 54.    , 21.    , 55.    , 31.    ,
       57.    , 45.    , 50.    , 27.    , 51.    ,     nan, 62.    ,
       36.    , 30.    , 28.    , 18.    , 25.    , 34.    , 23.    ,
       32.    , 19.    ,  1.    ,  4.    , 12.    , 26.    , 42.    ,
       24.    , 15.    , 40.    , 20.    ,  0.8333, 22.    , 44.    ,
       52.    , 37.    , 29.    ,  8.    , 48.    , 17.    , 38.    ,
       16.    , 47.    ,  0.6667,  6.    ,  7.    , 43.    , 49.    ,
       63.    , 33.    ,  3.    , 61.    , 46.    , 13.    , 41.    ,
       39.    , 70.    , 32.5   , 14.    ,  2.    , 36.5   , 59.    ,
       18.5   ,  0.9167,  5.    , 66.    ,  9.    , 11.    ,  0.75  ,
       70.5   , 22.5   ,  0.3333,  0.1667, 65.    , 40.5   , 10.    ,
       23.5   , 34.5   , 20.5   , 30.5   , 55.5   , 28.5   , 38.5   ,
       14.5   , 24.5   , 60.5   , 74.    ,  0.4167, 11.5   , 45.5   ,
       26.5   ])

In [453]:
cols = ['pclass', 'sex','age', 'family class', 'embarked']


# Making an empty dataframe for storing the emtropy values for every splitting value of a given column
result_entropy  = pd.DataFrame(columns = ['split_value', 'entropy', 'column'])
for col in cols:
        # Unique values
        for elem in dt_train[col].unique():
            # Splitting age according to threshold in this iteration
            dt_train['class'] = np.where(dt_train[col] > elem, 1,0)
            # Finding entropy of current column
            f, en = col_entropy_finder(dt_train, 'class')
            idx = len(result_entropy) + 1
            result_entropy.loc[idx] = [elem, en, col]

result_entropy.nsmallest(10, columns = 'entropy')

Unnamed: 0,split_value,entropy,column
5,0.0,0.722999,sex
2,2.0,0.869473,pclass
99,1.0,0.883848,embarked
67,36.5,0.886966,age
20,36.0,0.887311,age
25,34.0,0.887564,age
78,0.3333,0.88775,age
42,37.0,0.887764,age
47,38.0,0.887887,age
79,0.1667,0.887913,age


Hence the 10 smallest entropies and thresholds are given in the column above. Let's make a function out of this analysis. So basically, the tree can be split on sex as the root node. 

In [454]:
def split_finder(data, cols):
    result_entropy  = pd.DataFrame(columns = ['split_value', 'entropy', 'column'])
    for col in cols:
        # Unique values
        for elem in data[col].unique():
            # Splitting age according to threshold in this iteration
            data['class'] = np.where(data[col] > elem, 1,0)
            # Finding entropy of current column
            f, en = col_entropy_finder(data, 'class')
            idx = len(result_entropy) + 1
            result_entropy.loc[idx] = [elem, en, col]
    return(result_entropy[result_entropy['entropy'] == result_entropy['entropy'].min()])

In [455]:
pd.DataFrame(result_entropy[result_entropy['entropy'] == result_entropy['entropy'].min()].iloc[0]).T

Unnamed: 0,split_value,entropy,column
5,0,0.722999,sex


The minimum entropy was found on sex column. This could be our root node. 

Now we make a function that prints the percentage of survived and not survived on each branch given data and split point. 

In [456]:
data.groupby(['survived']).survived.count()[1]/data.survived.count()*100

38.19709702062643

In [457]:
def per_finder(data):
    if(len(data['survived'].unique()) > 0):
        per_1 = len(data[data['survived'] == 1])/len(data)*100
        per_0 = len(data[data['survived'] == 0])/len(data)*100
    else:   
        if(data['survived'].unique() == 0):
            per_0 = 100
            per_1 = 0
        else:
            per_1 = 100
            per_0 = 0

    return(per_1, per_0)

In [458]:
# Preparing data
d_final = dt_train[['pclass', 'sex','age', 'family class', 'embarked', 'survived']].dropna()

In [459]:
import sys
tree = pd.DataFrame(columns = ['path', 'data_length', '% sur', '% not_sur', 'columnOfSplit', 'thresholdOfSplit'])
path = 'root'
def tree_maker(data, path):
    cols = data.columns.drop(['survived'])
    re = split_finder(data, cols)
    if(np.shape(re)[0]>1):
        return(1)
    per_sur, per_nsur = per_finder(data)
    tc = re['column'].item()
    sp = re['split_value'].item()
    data_left = data[ data[tc] > sp]
    data_right = data.drop(index = data_left.index)
    idx = len(tree) + 1
    tree.loc[idx] = [path, len(data),per_sur, per_nsur, tc, sp]
    if(per_sur == 0 or per_nsur == 0):
        idx = len(tree) + 1
        pathleaf = path + 'leaf'
        tree.loc[idx] = [pathleaf, len(data),per_sur, per_nsur, tc, sp]
        return(1)
    else:
        pathl = path + 'L'
        pathr = path + 'R'
        tree_maker(data_left, pathl)
        tree_maker(data_right, pathr)  

x = tree_maker(d_final, path)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys


The decision tree classification model holds the path from root for different values of % surived, % not survived along with column name used for splitting and the threshold for splitting.

Lets prepare the testing data. To demonstrate how we will use a prediction function to use this model we have built we shall apply the function on a couple of rows

In [460]:
# Testing data
d_final_test = dt_test[['pclass', 'sex','age', 'family class', 'embarked', 'survived']].dropna()

In [461]:
# Building a function to predict the class of a given row
# Initializing a path from root
path = 'root'
def predict_class(row, path):
    rules = tree[tree['path'] == path]
    #print(rules)
    if(len(rules.index) == 0):
        path = path[:-1]
        rules = tree[tree['path'] == path]
        x = np.where(rules['% sur'].item() > rules['% not_sur'].item(), 1, 0)
        return(x.item())
    else:
        if(row[rules['columnOfSplit'].item()].item() > rules['thresholdOfSplit'].item()):
            path = path + 'L'
            return(predict_class(row, path))
        else:
            path = path + 'R'
            return(predict_class(row, path))
    

Now that we have built the predictor function, we initialize a 'root' variable and pass a test row with some values into the function to see if the prediction happens

In [462]:
path = 'root'
test = pd.DataFrame([[1,1,29.00,0,2]], columns = ['pclass', 'sex', 'age', 'family class', 'embarked'])
test

Unnamed: 0,pclass,sex,age,family class,embarked
0,1,1,29.0,0,2


In [463]:
path = 'root'
label = predict_class(test, path)
print(label)

1


In [464]:
tree[tree['path'] == 'rootLR']

Unnamed: 0,path,data_length,% sur,% not_sur,columnOfSplit,thresholdOfSplit
48,rootLR,112,90.178571,9.821429,age,55


The output value by the function is 1, that is survived and is a true positive as the test data for that row also has survived as the outcome. Now we run the function on all rows in testing data.

In [465]:
results = list()
for elem in d_final_test.iterrows():
    temp = pd.DataFrame(list(elem)[1:], columns = ['pclass', 'sex', 'age', 'family class', 'embarked'])
    #print(list(elem)[1:])
    #print('*******')
    path = 'root'
    t = predict_class(temp, path)
    results.append(t)

len(results)

263

In [466]:
d_final_test['predictions'] = results

Now that we have the results, we can find the accuracy of the labels


In [467]:
# Making a function for finding accuracy score
def accuracy_finder(data):
    tp = 0
    fp = 0
    tn = 0
    fn = 0
    tp = np.where(data['survived'] == data['predictions'], np.where(data['predictions']==1,1,0),0)
    fp = np.where(data['survived'] != data['predictions'], np.where(data['predictions']==1,1,0),0)
    tn = np.where(data['survived'] == data['predictions'], np.where(data['predictions']==0,1,0),0)
    fn = np.where(data['survived'] != data['predictions'], np.where(data['predictions']==0,1,0),0)
    precision = tp.sum()/(tp.sum()+fp.sum())
    recall = tp.sum()/(tp.sum()+fn.sum())
    fsc = 2/((1/precision) + (1/recall))
    accuracy = (tp.sum()+tn.sum())/(tp.sum()+tn.sum()+fp.sum()+fn.sum())
    return(accuracy,fsc,recall,precision)

#accuracy_finder(d_final_test)

In [468]:
a,b,c,d = accuracy_finder(d_final_test)
print('Accuracy is',a,'FSC is',b,'Recall is',c,'Precision is',d)

Accuracy is 0.7908745247148289 FSC is 0.8135593220338982 Recall is 0.7100591715976331 Precision is 0.9523809523809523


As we can see, our algorithm has highest rate of true positives out of predicted positives. The accuracy is about 80%.

## Bagging
We implement bagging by iterating over a loop and storing the results of every bag into a dataframe. Then we calculate the majority and find final predictions.

We create 5 bags of size 150

In [469]:
d_final_test = dt_test[['pclass', 'sex','age', 'family class', 'embarked', 'survived']].dropna()


In [470]:
tree1 = pd.DataFrame(columns = ['path', 'data_length', '% sur', '% not_sur', 'columnOfSplit', 'thresholdOfSplit'])
path1 = 'root'
def tree_maker1(data, path):
    cols = data.columns.drop(['survived'])
    re = split_finder(data, cols)
    if(np.shape(re)[0]>1):
        return(1)
    per_sur, per_nsur = per_finder(data)
    tc = re['column'].item()
    sp = re['split_value'].item()
    data_left = data[ data[tc] > sp]
    data_right = data.drop(index = data_left.index)
    idx = len(tree1) + 1
    tree1.loc[idx] = [path, len(data),per_sur, per_nsur, tc, sp]
    if(per_sur == 0 or per_nsur == 0):
        idx = len(tree1) + 1
        pathleaf = path + 'leaf'
        tree1.loc[idx] = [pathleaf, len(data),per_sur, per_nsur, tc, sp]
        return(1)
    else:
        pathl = path + 'L'
        pathr = path + 'R'
        tree_maker1(data_left, pathl)
        tree_maker1(data_right, pathr)  


In [471]:
# Building a function to predict the class of a given row
# Initializing a path from root
path1 = 'root'
def predict_class1(row, path):
    rules = tree1[tree1['path'] == path]
    if(len(rules.index) == 0):
        path = path[:-1]
        rules = tree1[tree1['path'] == path]
        x = np.where(rules['% sur'].item() > rules['% not_sur'].item(), 1, 0)
        return(x.item())
    else:
        if(row[rules['columnOfSplit'].item()].item() > rules['thresholdOfSplit'].item()):
            path = path + 'L'
            return(predict_class1(row, path))
        else:
            path = path + 'R'
            return(predict_class1(row, path))


In [472]:
#results_bag = pd.DataFrame(columns = ['bag1', 'bag2', 'bag3', 'bag4', 'bag5'])
#for elem in ['bag1', 'bag2', 'bag3', 'bag4', 'bag5']:
tree1 = pd.DataFrame(columns = ['path', 'data_length', '% sur', '% not_sur', 'columnOfSplit', 'thresholdOfSplit'])
path1 = 'root'
bag_train = d_final.sample(n=100, replace = True)
tree_maker1(bag_train, path1)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys


In [473]:
results = list()
for elem in d_final_test.iterrows():
    temp = pd.DataFrame(list(elem)[1:], columns = ['pclass', 'sex', 'age', 'family class', 'embarked'])
    path = 'root'
    t = predict_class1(temp, path)
    results.append(t)

len(results)

263

In [474]:
d_final_test['predictions'] = results

In [475]:
# Making a function for finding accuracy score
def accuracy_finder(data):
    tp = 0
    fp = 0
    tn = 0
    fn = 0
    tp = np.where(data['survived'] == data['predictions'], np.where(data['predictions']==1,1,0),0)
    fp = np.where(data['survived'] != data['predictions'], np.where(data['predictions']==1,1,0),0)
    tn = np.where(data['survived'] == data['predictions'], np.where(data['predictions']==0,1,0),0)
    fn = np.where(data['survived'] != data['predictions'], np.where(data['predictions']==0,1,0),0)
    precision = tp.sum()/(tp.sum()+fp.sum())
    recall = tp.sum()/(tp.sum()+fn.sum())
    fsc = 2/((1/precision) + (1/recall))
    accuracy = (tp.sum()+tn.sum())/(tp.sum()+tn.sum()+fp.sum()+fn.sum())
    return(accuracy, fsc, recall, precision)


In [476]:
bagging_data = pd.DataFrame(columns = ['bag1', 'bag2','bag3','bag4','bag5'])
for elem in bagging_data.columns:
    tree1 = pd.DataFrame(columns = ['path', 'data_length', '% sur', '% not_sur', 'columnOfSplit', 'thresholdOfSplit'])
    bag_train = d_final.sample(n=300, replace = True)
    tree_maker1(bag_train, 'root')
    result = list()
    d_final_test = dt_test[['pclass', 'sex','age', 'family class', 'embarked', 'survived']].dropna()
    for e in d_final_test.iterrows():
        temp = pd.DataFrame(list(e)[1:], columns = ['pclass', 'sex', 'age', 'family class', 'embarked','survived'])
        t = predict_class1(temp,'root')
        result.append(t)
    bagging_data[elem] = result

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys


In [477]:
majority = list()
for dx in range(0,len(bagging_data)):
    vals = bagging_data.iloc[dx].value_counts()
    if(len(vals) == 1):
        if(bagging_data.iloc[dx].unique().item() == 1):
            majority.append(1)
        else:
            majority.append(0)
    else:
        maj = np.where(vals[0] < vals[1], 1, 0)
        majority.append(maj)
bagging_data['majority'] = majority

In [478]:
d_final_test['predictions'] = majority

In [479]:
a,b,c,d = accuracy_finder(d_final_test)

In [480]:
print(a,b,c,d)

0.7832699619771863 0.8041237113402062 0.6923076923076923 0.9590163934426229


A difficulty of this method is that even though deep trees are constructed, the bagged trees that are created are very similar. In turn, the predictions made by these trees are also similar, and the high variance we desire among the trees trained on different samples of the training dataset is diminished.

This is because of the greedy algorithm used in the construction of the trees selecting the same or similar split points.

## Random Forests

In [481]:
d_final_test = dt_test[['pclass', 'sex','age', 'family class', 'embarked', 'survived']].dropna()
cols = ['pclass', 'sex','age', 'family class', 'embarked', 'survived']
x = np.random.choice(5,2,replace=False)
cols = [cols[x[0]],cols[x[1]]
       ]
cols
d_final_test = d_final_test[cols]

In [482]:
tree1 = pd.DataFrame(columns = ['path', 'data_length', '% sur', '% not_sur', 'columnOfSplit', 'thresholdOfSplit'])
path1 = 'root'
def tree_maker1(data, path):
    cols = data.columns.drop(['survived'])
    re = split_finder(data, cols)
    if(np.shape(re)[0]>1):
        return(1)
    per_sur, per_nsur = per_finder(data)
    tc = re['column'].item()
    sp = re['split_value'].item()
    data_left = data[ data[tc] > sp]
    data_right = data.drop(index = data_left.index)
    idx = len(tree1) + 1
    tree1.loc[idx] = [path, len(data),per_sur, per_nsur, tc, sp]
    if(per_sur == 0 or per_nsur == 0):
        idx = len(tree1) + 1
        pathleaf = path + 'leaf'
        tree1.loc[idx] = [pathleaf, len(data),per_sur, per_nsur, tc, sp]
        return(1)
    else:
        pathl = path + 'L'
        pathr = path + 'R'
        tree_maker1(data_left, pathl)
        tree_maker1(data_right, pathr)  


In [483]:
# Building a function to predict the class of a given row
# Initializing a path from root
path1 = 'root'
def predict_class1(row, path):
    rules = tree1[tree1['path'] == path]
    if(len(rules.index) == 0):
        path = path[:-1]
        rules = tree1[tree1['path'] == path]
        x = np.where(rules['% sur'].item() > rules['% not_sur'].item(), 1, 0)
        return(x.item())
    else:
        if(row[rules['columnOfSplit'].item()].item() > rules['thresholdOfSplit'].item()):
            path = path + 'L'
            return(predict_class1(row, path))
        else:
            path = path + 'R'
            return(predict_class1(row, path))


In [484]:
#results_bag = pd.DataFrame(columns = ['bag1', 'bag2', 'bag3', 'bag4', 'bag5'])
#for elem in ['bag1', 'bag2', 'bag3', 'bag4', 'bag5']:
tree1 = pd.DataFrame(columns = ['path', 'data_length', '% sur', '% not_sur', 'columnOfSplit', 'thresholdOfSplit'])
path1 = 'root'
bag_train = d_final.sample(n=100, replace = True)
tree_maker1(bag_train, path1)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys


In [485]:
results = list()
for elem in d_final_test.iterrows():
    temp = pd.DataFrame(list(elem)[1:], columns = ['pclass', 'sex', 'age', 'family class', 'embarked'])
    path = 'root'
    t = predict_class1(temp, path)
    results.append(t)

len(results)

263

In [486]:
d_final_test['predictions'] = results

In [487]:
# Making a function for finding accuracy score
def accuracy_finder(data):
    tp = 0
    fp = 0
    tn = 0
    fn = 0
    tp = np.where(data['survived'] == data['predictions'], np.where(data['predictions']==1,1,0),0)
    fp = np.where(data['survived'] != data['predictions'], np.where(data['predictions']==1,1,0),0)
    tn = np.where(data['survived'] == data['predictions'], np.where(data['predictions']==0,1,0),0)
    fn = np.where(data['survived'] != data['predictions'], np.where(data['predictions']==0,1,0),0)
    precision = tp.sum()/(tp.sum()+fp.sum())
    recall = tp.sum()/(tp.sum()+fn.sum())
    fsc = 2/((1/precision) + (1/recall))
    accuracy = (tp.sum()+tn.sum())/(tp.sum()+tn.sum()+fp.sum()+fn.sum())
    return(accuracy, fsc, recall, precision)


In [488]:
d_final.head(10)

Unnamed: 0,pclass,sex,age,family class,embarked,survived,class
302,1,1,35.0,3,1,1,1
303,1,0,64.0,3,1,0,1
304,1,1,60.0,3,1,1,1
305,1,0,60.0,3,2,0,1
306,1,0,54.0,3,2,0,1
307,1,0,21.0,3,2,0,1
308,1,1,55.0,3,1,1,1
309,1,1,31.0,3,2,1,1
310,1,0,57.0,3,2,0,1
311,1,1,45.0,3,2,1,1


In [489]:
bagging_data = pd.DataFrame(columns = ['bag1', 'bag2','bag3','bag4','bag5'])
for elem in bagging_data.columns:
    tree1 = pd.DataFrame(columns = ['path', 'data_length', '% sur', '% not_sur', 'columnOfSplit', 'thresholdOfSplit'])
    d_final = dt_train[['pclass', 'sex','age', 'family class', 'embarked', 'survived']].dropna()
    cols = ['pclass', 'sex','age', 'family class', 'embarked']
    x = np.random.choice(5,2,replace=False)
    cols = [cols[x[0]],cols[x[1]],'survived']
    d_final = d_final[cols]
    bag_train = d_final.sample(n=500, replace = True)
    tree_maker1(bag_train, 'root')
    result = list()
    #d_final_test = dt_test[cols].dropna()
    d_final_test = dt_test[['pclass', 'sex','age', 'family class', 'embarked', 'survived']].dropna()
    d_final_test = d_final_test[cols]
    for e in d_final_test.iterrows():
        temp = pd.DataFrame(list(e)[1:], columns = cols)
        t = predict_class1(temp,'root')
        result.append(int(t))
    bagging_data[elem] = result

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys


In [490]:
majority = list()
for dx in range(0,len(bagging_data)):
    vals = bagging_data.iloc[dx].value_counts()
    if(len(vals) == 1):
        if(bagging_data.iloc[dx].unique().item() == 1):
            majority.append(1)
        else:
            majority.append(0)
    else:
        maj = np.where(vals[0] < vals[1], 1, 0)
        majority.append(maj)
bagging_data['majority'] = majority

In [491]:
int_list = list()
for l in majority:
    int_list.append(int(l))

In [492]:
int_list

[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,


In [493]:
d_final_test['predictions'] = int_list

In [494]:
a,b,c,d = accuracy_finder(d_final_test)
print('Accuracy is',a,'FScore is',b,'Recall is',c,'Precision is',d)

Accuracy is 0.596958174904943 FScore is 0.5508474576271187 Recall is 0.38461538461538464 Precision is 0.9701492537313433


If the number of observations is large, but the number of trees is too small, then some observations will be predicted only once or even not at all. If the number of predictors is large but the number of trees is too small, then some features can (theoretically) be missed in all subspaces used. Both cases results in the decrease of random forest predictive power. But the last is a rather extreme case, since the selection of subspace is performed at each node.