<h1>
<center>
Module 6: Random Forests - another approach to Bias-Variance
</center>
</h1>
<div class=h1_cell>
<p>
In this module we will continue to look at means of tackling the Bias-Variance problem. We will focus on the Variance problem of overfitting. We will see a new concept called *ensemble* learning. I liken it to crowd-sourcing. Instead of relying on just one expert, let's round up a collection (AKA an ensemble) of experts. We can let them each, individually, come up with a prediction. Then we can take a vote and use the winning prediction.
<p>
The technique we will look at is called *Random Forests*. It is a special method falling under the more general heading of *bagging*. We will crowd-source a forest of trees to get their predictions and then take a majority vote. Where, you ask, does this forest of trees come from? We build them following relatively straightforward steps. So to summarize, we first build our forest of decision trees. Then when we want to do predictions for real, we give each tree a vote and the majority wins. Cool.
</div>
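<div class=h1_cell>
<p>
To make the voting idea concrete, here is a minimal sketch with three made-up "experts" voting on four passengers. The names `majority_vote` and `expert_votes` are mine, for illustration only. Ties go to 0, which is an arbitrary choice.
</div>

```python
from collections import Counter

def majority_vote(predictions):
    """Return the winning label from a list of 0/1 votes (ties go to 0)."""
    counts = Counter(predictions)
    return 1 if counts[1] > counts[0] else 0

# Three hypothetical experts vote on four passengers (1 = survived).
expert_votes = [
    [1, 0, 1, 0],  # expert A
    [1, 1, 0, 0],  # expert B
    [0, 1, 1, 0],  # expert C
]

# zip(*...) regroups the votes by passenger instead of by expert.
ensemble = [majority_vote(list(votes)) for votes in zip(*expert_votes)]
print(ensemble)
```

<div class=h1_cell>
<p>
Notice that no single expert agrees with the ensemble on every passenger. Each one gets outvoted somewhere.
</div>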

<h2>
Jargon alerts
</h2>
<div class=h1_cell>
<p>
Random forests have jargon that goes with them. I'll alert you to where jargony terms show up.
<p>
Jargon alert: *Random Forest* is jargon :) And so is *bagging* and *ensemble*.
</div>

In [1]:
import pandas as pd

from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [2]:
with open('/content/gdrive/My Drive/class_tables/titanic_wrangled_week2.csv', 'r') as f:
  titanic_table = pd.read_csv(f)

titanic_table.head(2)  #make sure it looks ok - we see the results of our week 2 wrangling

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,...,age_bin,age_Child,age_Adult,age_Senior,sex_female,sex_male,ok_child,pclass_1,pclass_2,pclass_3
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,...,Child,1,0,0,0,1,0,0,0,1
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,...,Adult,0,1,0,1,0,0,1,0,0


In [0]:
pd.set_option('display.max_columns', None)

In [0]:
!rm library_w19_week5b.py

In [5]:
from google.colab import files
files.upload()

Saving library_w19_week5b.py to library_w19_week5b.py




In [6]:
from library_w19_week5b import *

%who function

accuracy	 build_pred	 build_tree_iter	 compute_prediction	 compute_training	 f1	 find_best_splitter	 generate_table	 gig	 
gini	 informedness	 k_fold	 k_fold_random	 path_id	 predictor_case	 probabilities	 produce_scores	 reorder_paths	 
tree_predictor	 verify_unique	 


In [7]:
titanic_table.head(1)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,no_age,filled_age,emb_C,emb_Q,emb_S,emb_nan,age_bin,age_Child,age_Adult,age_Senior,sex_female,sex_male,ok_child,pclass_1,pclass_2,pclass_3
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,22.0,0,0,1,0,Child,1,0,0,0,1,0,0,0,1


<h2>
A forest starts with some trees
</h2>
<div class=h1_cell>
<p>
Let's build a forest of two trees to get started. Once we see how to do that, we can scale it up to N trees.
<p>
Our approach will be to do random selections of both the rows (axis=0) and columns (axis=1) as we build the tree.
</div>

<h2>
Step 1. Generate the training data for the first tree (and don't lose the left-out data)
</h2>
<div class=h1_cell>
<p>
We will take a slice from the entire table to use as training. This may sound familiar: we did something similar when doing K-folding. The big difference here is that we will take random rows *with replacement*. This means the same row can appear more than once in our slice. With K-Folding, we did not let this happen. And BTW, the slice is the same size as the original table, i.e., 891 rows for the Titanic table!
<p>
We don't want to lose track of the rows we did not use. They will become important later.
<p>
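Jargon alert: selecting a random sample of rows for training (with replacement) is called *bootstrapping*. Training a model on each bootstrap sample and then combining their votes is called *bagging* (short for bootstrap aggregating).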
</div>
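<div class=h1_cell>
<p>
Here is a small sketch of the difference between sampling without replacement (what K-folding amounts to) and with replacement (bootstrapping). The 10-row `toy` table and the `toy_rng` generator are made up for illustration.
</div>

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Survived': [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]})  # made-up table
toy_rng = np.random.RandomState(0)

# Without replacement: just a shuffle, every row shows up exactly once.
shuffled = toy.sample(frac=1.0, replace=False, random_state=toy_rng)

# With replacement: same number of rows, but some rows may repeat and others go missing.
bootstrap = toy.sample(frac=1.0, replace=True, random_state=toy_rng)

print(sorted(shuffled.index))
print(sorted(bootstrap.index))
```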

<h2>
Random but predictable
</h2>
<div class=h1_cell>
<p>
It is not easy debugging random algorithms. You want to use the same random numbers as you make changes and try things again. We will run into 2 types of random numbers in this module: ones generated from Python's `random` package and ones generated by `pandas` (which in turn uses `numpy`). Sorry, but we will need to seed them both if we want consistent results.
  <p>
    As a reminder, once we seed a generator with a constant (like 42 or 2000 below), we will get the same sequence of random numbers generated.
</div>
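<div class=h1_cell>
<p>
A quick sketch of what "same seed, same sequence" means. I use separate generator objects (`random.Random` and `np.random.RandomState`) here so that running this does not disturb any global seed. The variable names are made up.
</div>

```python
import random
import numpy as np

# Two independent Python generators seeded with the same constant.
gen_a = random.Random(2000)
gen_b = random.Random(2000)
first = [gen_a.randint(0, 99) for _ in range(5)]
second = [gen_b.randint(0, 99) for _ in range(5)]
print(first == second)

# Same idea with numpy, which is what pandas relies on.
np_a = np.random.RandomState(42).randint(0, 100, size=5)
np_b = np.random.RandomState(42).randint(0, 100, size=5)
print((np_a == np_b).all())
```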

In [0]:
import numpy as np
import random

rng = np.random.RandomState(42)  #Will pass as arg to pandas sample method
random.seed(2000)

<h2>
Let's bootstrap!
</h2>
<div class=h1_cell>
<p>
We need a table that is the same size as the Titanic table (891 rows). We will select the rows in the Titanic table randomly. We will allow replacement: the same row can be selected multiple times. What is very cool is that Pandas gives us a method, `sample`, that does exactly what we want. Pretty nice of them.
<p>
As you can see below, I am setting the fraction of the table I want to 100%. And I am passing `rng`, the generator seeded above, as the `random_state`.
<p>
Once I have my new table, I am going to reset its index. This will create a new column `index` that has the row numbers from the original Titanic table. I'll want those later. Check it out.
</div>

In [9]:
train1 = titanic_table.sample(frac=1.0, replace=True, random_state=rng)  # Easy peasy - thanks pandas!
train1 = train1.reset_index()
train1.head()

Unnamed: 0,index,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,no_age,filled_age,emb_C,emb_Q,emb_S,emb_nan,age_bin,age_Child,age_Adult,age_Senior,sex_female,sex_male,ok_child,pclass_1,pclass_2,pclass_3
0,102,0,1,"White, Mr. Richard Frasar",male,21.0,0,1,35281,77.2875,D26,S,0,21.0,0,0,1,0,Child,1,0,0,0,1,0,1,0,0
1,435,1,1,"Carter, Miss. Lucile Polk",female,14.0,1,2,113760,120.0,B96 B98,S,0,14.0,0,0,1,0,Child,1,0,0,1,0,0,1,0,0
2,860,0,3,"Hansen, Mr. Claus Peter",male,41.0,2,0,350026,14.1083,,S,0,41.0,0,0,1,0,Adult,0,1,0,0,1,0,0,0,1
3,270,0,1,"Cairns, Mr. Alexander",male,,0,0,113798,31.0,,S,1,29.699118,0,0,1,0,Adult,0,1,0,0,1,0,1,0,0
4,106,1,3,"Salkjelsvik, Miss. Anna Kristine",female,21.0,0,0,343120,7.65,,S,0,21.0,0,0,1,0,Child,1,0,0,1,0,0,0,0,1


In [10]:
#Just for giggles, get a count of how many rows duplicated in train1
train1.duplicated(['index'], keep=False).value_counts()  #583 rows are duplicates

True     583
False    308
dtype: int64

<h2>
We will need the leftovers eventually
</h2>
<div class=h1_cell>
<p>
Since we have duplicates in `train1`, there must be some rows from the Titanic table that were not included in train1. I would like to know which rows were left out of train1.
<p>
Jargon alert: the rows that are left out are called *out of bag*.
</div>
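<div class=h1_cell>
<p>
How many rows should we expect to be out of bag? A given row is missed by one random draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which is close to 1/e for large n. The sketch below just does that arithmetic for n = 891. It works out to a bit under 37% of the table, or roughly 328 rows, so the count we see below for `train1` is in the right ballpark.
</div>

```python
import math

n = 891  # rows in the Titanic table

p_left_out = (1 - 1 / n) ** n          # chance a given row is never drawn in n draws
expected_left_out = n * p_left_out     # expected number of out-of-bag rows

print(round(p_left_out, 4))
print(round(expected_left_out, 1))
print(round(1 / math.e, 4))            # the large-n limit, for comparison
```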

In [11]:
left_out1 = titanic_table.loc[~titanic_table.index.isin(train1['index'])]  #what rows in titanic_table are not in train1?
left_out1 = left_out1.reset_index()  #builds a new index column that comes from original titanic table
left_out1.head()

Unnamed: 0,index,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,no_age,filled_age,emb_C,emb_Q,emb_S,emb_nan,age_bin,age_Child,age_Adult,age_Senior,sex_female,sex_male,ok_child,pclass_1,pclass_2,pclass_3
0,2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,26.0,0,0,1,0,Child,1,0,0,1,0,0,0,0,1
1,3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0,35.0,0,0,1,0,Adult,0,1,0,1,0,0,1,0,0
2,5,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,1,29.699118,0,1,0,0,Adult,0,1,0,0,1,0,0,0,1
3,6,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,0,54.0,0,0,1,0,Senior,0,0,1,0,1,0,1,0,0
4,10,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S,0,4.0,0,0,1,0,Child,1,0,0,1,0,1,0,0,1


In [12]:
#We expect no True values with this - should have unique rows
left_out1.duplicated(['index'], keep=False).value_counts()

False    342
dtype: int64


<div class=h1_cell>
<p>
We don't really need the entire `left_out1` table. All we need are the values in the `index` column. We can use those to access rows in the Titanic table later. So I will pull those indices out and put them in a list.
</div>

In [13]:
left_out_indices1 = left_out1['index'].tolist()
left_out_indices1[:5]  # should be same as what see in head() above

[2, 3, 5, 6, 10]

<h2>
Let's congratulate ourselves
</h2>
<div class=h1_cell>
<p>
We have completed the first big step in building our two-tree forest. We have generated a bootstrapped table, `train1`, that we can use for training our first tree.
<p>
Next step is to do the training.

</div>

In [0]:
splitter_columns = [
 'emb_C',
 'emb_Q',
 'emb_S',
 'emb_nan',
 'age_Child',
 'age_Adult',
 'age_Senior',
 'no_age',
 'ok_child',
 'sex_female',
 'pclass_1',
 'pclass_2',
 'pclass_3'
]

<h2>
Training is a bit different
</h2>
<div class=h1_cell>
<p>
Normally we would use all the columns in `splitter_columns` to build our tree. We are going to do something different. For each node in the tree we are building, I will only choose the best splitter from a subset of `splitter_columns`. What I choose to be in the subset will be random. The size of the subset, which I will call `m`, is a hyper-parameter that you can set. A rule of thumb is to set `m` to (at max) the square root of the length of `splitter_columns`. That length is `13`, whose square root is about 3.6, so I will use a value of `3` for `m`.
<p>
Jargon alert: choosing a random subset of columns/features is called *attribute bagging* which is a type of *random subspace sampling*.

</div>

In [15]:
m = int(len(splitter_columns)**.5)  # default is square root of total number of splitters rounded down
m

3

<h2>There are a lot of variations</h2>

Choosing what subset of columns to consider at each node has been studied by many. The Wikipedia page on random forests is one place to start. To give you an idea, some have argued that the selection of the node splitter should be completely random (!) This means omitting computing gig scores and just choosing randomly among all the columns. Others have used a variation on that. Compute the gig score for all columns and order them. Select randomly from the top 5.
<p>
  As a reminder, this is all in the service of trying to avoid overfitting. The very general idea is that if you have a lot of trees with randomness thrown in, then it is a bit like crowd-sourcing. Maybe each tree does really well for certain columns. When you put them together, you get a whole that is better than its parts.
  <p>
    Random forests have proven to be the go-to method for the kinds of problems you will see on the Kaggle website. They are still very popular, although deep learning has taken some of the spotlight off of them.
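<p>
To make the two variations mentioned above a bit more concrete, here is a rough sketch. The column names and scores in `made_up_scores` are invented purely for illustration. They are not real gig values from the Titanic table.

```python
import random

# Invented scores for seven hypothetical columns (NOT real gig values).
made_up_scores = {
    'col_a': 0.30, 'col_b': 0.02, 'col_c': 0.11, 'col_d': 0.07,
    'col_e': 0.19, 'col_f': 0.01, 'col_g': 0.05,
}
picker = random.Random(0)  # private generator, leaves any global seed alone

# Variation 1: ignore the scores and pick any column at random.
fully_random = picker.choice(list(made_up_scores))

# Variation 2: rank by score, then pick at random among the top 5.
ranked = sorted(made_up_scores, key=made_up_scores.get, reverse=True)
top5 = ranked[:5]
from_top5 = picker.choice(top5)

print(top5)
print(fully_random, from_top5)
```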

<h2>
The Fickas variation
</h2>
<div class=h1_cell>
<p>
Normally we would take a sample of m *with* replacement. That means, for m = 3, we could potentially have the result be `['no_age', 'no_age', 'no_age']`. In essence, since duplicates can be removed, this adds randomness to m. It really means that m can vary between 1 (all the same) and 3 (all different). If we have only 1 splitter in our candidate set, no need to run gig. It is basically the same as choosing a random column as discussed above.
<p>
For efficiency, I am going to sample *without* replacement. So I am not allowing duplicates in my resulting list. The price I pay is that I lose a bit of randomness.
  <p>

</div>
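<div class=h1_cell>
<p>
A small sketch of the two sampling styles, using Python's `random` module. `choices` samples with replacement and `sample` samples without. The short `some_columns` list and the `picker` generator are just for illustration.
</div>

```python
import random

some_columns = ['emb_C', 'no_age', 'sex_female', 'pclass_1', 'pclass_3']
m_demo = 3
picker = random.Random(1)  # private generator, leaves the global seed alone

# With replacement: duplicates are possible.
with_replacement = picker.choices(some_columns, k=m_demo)
effective = set(with_replacement)   # duplicates collapse, so between 1 and m_demo columns

# Without replacement: always m_demo distinct columns.
without_replacement = picker.sample(some_columns, m_demo)

print(with_replacement, len(effective))
print(without_replacement)
```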

In [16]:
#Use random library sample method to get sample without replacement
rcols = random.sample(splitter_columns, m)
rcols

['no_age', 'emb_C', 'pclass_3']


<div class=h1_cell>
<p>
We now have the candidate splitters (3 of them) for the root node of the tree. We can use our library functions from past modules to do the rest.
</div>
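<div class=h1_cell>
<p>
As a rough mental model of what choosing the best splitter from a candidate list involves, here is a self-contained sketch. It assumes the score is a Gini-based gain (the impurity of the table minus the size-weighted impurity of the two halves). The functions `toy_gini` and `toy_gini_gain` and the tiny `toy` table are my own stand-ins. They are not the library's `gini`, `gig`, or `find_best_splitter`, which may differ in detail.
</div>

```python
import pandas as pd

def toy_gini(labels):
    """Gini impurity of a 0/1 Series: 1 - p1^2 - p0^2."""
    if len(labels) == 0:
        return 0.0
    p1 = labels.mean()
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def toy_gini_gain(table, column, target):
    """Impurity before the split minus the size-weighted impurity after it."""
    yes = table.loc[table[column] == 1, target]
    no = table.loc[table[column] == 0, target]
    weighted = (len(yes) * toy_gini(yes) + len(no) * toy_gini(no)) / len(table)
    return toy_gini(table[target]) - weighted

# Made-up 8-row table where sex_female happens to match Survived exactly.
toy = pd.DataFrame({
    'sex_female': [1, 1, 1, 0, 0, 0, 0, 1],
    'pclass_3':   [0, 1, 0, 1, 1, 0, 1, 1],
    'Survived':   [1, 1, 1, 0, 0, 0, 0, 1],
})

candidates = ['pclass_3', 'sex_female']  # pretend this is our random subset
ranked = sorted(((c, toy_gini_gain(toy, c, 'Survived')) for c in candidates),
                key=lambda pair: pair[1], reverse=True)
best_column, best_gain = ranked[0]
print(best_column, best_gain)
```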

In [17]:
columns_sorted = find_best_splitter(train1, rcols, 'Survived')  #notice using train1 and rcols
(best_column, gig_value) = columns_sorted[0]
print((best_column, gig_value))

('pclass_3', 0.03845127329425646)



<div class=h1_cell>
<p>
Now we can build the 2 starting paths emanating from the root node.
</div>

In [18]:

current_paths = [{'conjunction': [(best_column+'_1', build_pred(best_column, 1))],
                  'prediction': None,
                  'gig_score': gig_value},
                 {'conjunction': [(best_column+'_0', build_pred(best_column, 0))],
                  'prediction': None,
                  'gig_score': gig_value}]
                 
current_paths

[{'conjunction': [('pclass_3_1',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.03845127329425646,
  'prediction': None},
 {'conjunction': [('pclass_3_0',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.03845127329425646,
  'prediction': None}]

<div class=h1_cell>
I'll follow another round of splitting, i.e., grow the tree to level 2. I am copying and pasting code from `build_tree_iter` here. But I am making some changes, which I'll mark with `new` comments.
</div>

In [19]:
table = train1

tree_paths = []
new_paths = []
gig_cutoff = 0.0

for path in current_paths:
    conjunct = path['conjunction']
    before_table = generate_table(table, conjunct)
    rcols = random.sample(splitter_columns, m)       #new - chooses random subset for each new node
    print(rcols)
    columns_sorted = find_best_splitter(before_table, rcols, 'Survived')  #new - using rcols
    (best_column, gig_value) = columns_sorted[0]
    if gig_value > gig_cutoff:
        new_path_1 = {'conjunction': conjunct + [(best_column+'_1', build_pred(best_column, 1))],
                    'prediction': None,
                     'gig_score': gig_value}
        new_paths.append( new_path_1 ) #true
        new_path_0 = {'conjunction': conjunct + [(best_column+'_0', build_pred(best_column, 0))],
                    'prediction': None,
                     'gig_score': gig_value
                     }
        new_paths.append( new_path_0 ) #false
    else:
        #not worth splitting so complete the path with a prediction
        path['prediction'] = compute_prediction(before_table, 'Survived')
        tree_paths.append(path)

['pclass_3', 'age_Child', 'emb_S']
['sex_female', 'pclass_1', 'emb_S']


In [20]:
new_paths

[{'conjunction': [('pclass_3_1',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>),
   ('emb_S_1', <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.021049896049896044,
  'prediction': None},
 {'conjunction': [('pclass_3_1',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>),
   ('emb_S_0', <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.021049896049896044,
  'prediction': None},
 {'conjunction': [('pclass_3_0',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>),
   ('sex_female_1',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.2601615877054527,
  'prediction': None},
 {'conjunction': [('pclass_3_0',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>),
   ('sex_female_0',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.2601615877054527,
  'prediction': None}]

In [21]:
tree_paths  # should be empty list

[]

<div class=h1_cell>
I'll stop here with a tree of level 2. Now copy the path info into `tree_paths` so that it includes predictions.
</div>

In [0]:
for path in new_paths:
    conjunct = path['conjunction']
    before_table = generate_table(table, conjunct)
    path['prediction'] = compute_prediction(before_table, 'Survived')
    tree_paths.append(path)

In [23]:
tree_paths  #should see predictions on each path

[{'conjunction': [('pclass_3_1',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>),
   ('emb_S_1', <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.021049896049896044,
  'prediction': 0},
 {'conjunction': [('pclass_3_1',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>),
   ('emb_S_0', <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.021049896049896044,
  'prediction': 0},
 {'conjunction': [('pclass_3_0',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>),
   ('sex_female_1',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.2601615877054527,
  'prediction': 1},
 {'conjunction': [('pclass_3_0',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>),
   ('sex_female_0',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.2601615877054527,
  'prediction': 0}]

<div class=h1_cell>
I'm going to add another attribute `oob` to a tree. It stands for Out of Bag. We will make use of it later.
</div>

In [0]:
tree1 = {'paths': tree_paths, 'weight': None, 'oob': left_out_indices1}

In [0]:
forest1 = [tree1]

<h2>
Big hand clap
</h2>
<div class=h1_cell>
<p>
We have successfully built the first tree in the forest. We could stop here with a one-tree forest (boring). Let's add at least one more tree.
<p>
I'll build the second tree without much in the way of comments. It will look the same as the construction of tree1.

</div>

In [0]:
#First generate new training data - every tree gets its own data
train2 = titanic_table.sample(frac=1.0, replace=True, random_state=rng)
left_out2 = titanic_table.loc[~titanic_table.index.isin(train2.index)]

In [0]:
train2 = train2.reset_index()
left_out2 =left_out2.reset_index()
left_out_indices2 = left_out2['index'].tolist()

In [28]:
train2.duplicated(['index'], keep=False).value_counts()

True     558
False    333
dtype: int64

In [29]:
left_out_indices2[:5]

[1, 8, 13, 19, 20]

In [30]:
#build the root node

rcols = random.sample(splitter_columns, m)
rcols

['pclass_3', 'age_Senior', 'emb_nan']

In [31]:
columns_sorted = find_best_splitter(train2, rcols, 'Survived')
(best_column, gig_value) = columns_sorted[0]
print((best_column, gig_value))

('pclass_3', 0.04310124038694357)


In [32]:

current_paths = [{'conjunction': [(best_column+'_1', build_pred(best_column, 1))],
                  'prediction': None,
                  'gig_score': gig_value},
                 {'conjunction': [(best_column+'_0', build_pred(best_column, 0))],
                  'prediction': None,
                  'gig_score': gig_value}]
                 
current_paths

[{'conjunction': [('pclass_3_1',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.04310124038694357,
  'prediction': None},
 {'conjunction': [('pclass_3_0',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.04310124038694357,
  'prediction': None}]

In [33]:
table = train2

tree_paths = []
new_paths = []
gig_cutoff = 0.0

for path in current_paths:
    conjunct = path['conjunction']
    before_table = generate_table(table, conjunct)
    rcols = random.sample(splitter_columns, m)       #new - chooses random subset for each node
    print(rcols)
    columns_sorted = find_best_splitter(before_table, rcols, 'Survived')  #using rcols
    (best_column, gig_value) = columns_sorted[0]
    if gig_value > gig_cutoff:
        new_path_1 = {'conjunction': conjunct + [(best_column+'_1', build_pred(best_column, 1))],
                    'prediction': None,
                     'gig_score': gig_value}
        new_paths.append( new_path_1 ) #true
        new_path_0 = {'conjunction': conjunct + [(best_column+'_0', build_pred(best_column, 0))],
                    'prediction': None,
                     'gig_score': gig_value
                     }
        new_paths.append( new_path_0 ) #false
    else:
        #not worth splitting so complete the path with a prediction
        path['prediction'] = compute_prediction(before_table, 'Survived')
        tree_paths.append(path)

['pclass_1', 'emb_S', 'ok_child']
['pclass_1', 'sex_female', 'age_Adult']


In [34]:
new_paths

[{'conjunction': [('pclass_3_1',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>),
   ('emb_S_1', <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.020929647534482787,
  'prediction': None},
 {'conjunction': [('pclass_3_1',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>),
   ('emb_S_0', <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.020929647534482787,
  'prediction': None},
 {'conjunction': [('pclass_3_0',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>),
   ('sex_female_1',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.1949595798584,
  'prediction': None},
 {'conjunction': [('pclass_3_0',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>),
   ('sex_female_0',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.1949595798584,
  'prediction': None}]

In [0]:
for path in new_paths:
    conjunct = path['conjunction']
    before_table = generate_table(table, conjunct)
    path['prediction'] = compute_prediction(before_table, 'Survived')
    tree_paths.append(path)

In [36]:
tree_paths

[{'conjunction': [('pclass_3_1',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>),
   ('emb_S_1', <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.020929647534482787,
  'prediction': 0},
 {'conjunction': [('pclass_3_1',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>),
   ('emb_S_0', <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.020929647534482787,
  'prediction': 0},
 {'conjunction': [('pclass_3_0',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>),
   ('sex_female_1',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.1949595798584,
  'prediction': 1},
 {'conjunction': [('pclass_3_0',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>),
   ('sex_female_0',
    <function library_w19_week5b.build_pred.<locals>.<lambda>>)],
  'gig_score': 0.1949595798584,
  'prediction': 0}]

In [0]:
tree2 = {'paths': tree_paths, 'weight': None, 'oob': left_out_indices2}

In [0]:
forest1.append(tree2)

<h2>
Let's stop at a two-tree forest
</h2>
<p>
<div class=h1_cell>
<p>
It would be better to have an odd number to break voting ties, but we will figure something out.
<p>
Now that we have a forest, let's see how to use it for prediction. I'll define a new function, `vote_taker`, that tallies up the votes of all the trees for a single row. Ties go to the negative outcome 0 (arbitrarily).

</div>

In [0]:
def vote_taker(row, forest):
    votes = {0:0, 1:0}
    for tree in forest:
        prediction = tree_predictor(row, tree)
        votes[prediction] += 1
    winner = 1 if votes[1]>votes[0] else 0  #ties go to 0
    return winner

In [40]:
row0 = titanic_table.loc[0]
vote_taker(row0, forest1)  #tree1 0, tree2 0, winner 0

0


<p>
<div class=h1_cell>
<p>
I'm going to define a new function that is very similar to `produce_scores`. But it will work on a forest instead of an individual tree.
I commented on the one line I had to change.</div>

In [0]:
def forest_scores(table, forest, target):
    scratch_table = pd.DataFrame(columns=['prediction', 'actual'])
    scratch_table['prediction'] = table.apply(lambda row: vote_taker(row, forest), axis=1)  #only change is to call vote_taker
    scratch_table['actual'] = table[target]  # just copy the target column
    cases = scratch_table.apply(lambda row: predictor_case(row, pred='prediction', target='actual'), axis=1)
    vc = cases.value_counts()
    return [accuracy(vc), f1(vc), informedness(vc)]

In [42]:
forest_scores(titanic_table, forest1, 'Survived')

[0.7867564534231201, 0.62890625, 0.4543667912951779]

<h2>
Not that good
</h2>
<p>
<div class=h1_cell>
<p>
I'd like to try larger forests. Maybe 10 trees in the forest. But to do that, I don't want to copy and paste all that code. What I would like is a function that can build a forest for me, taking as a hyper parameter how many trees to include. Here is the start.
<pre>
<code>
def forest_builder(table, column_choices, target, hypers):
    depth = 2 if 'max-depth' not in hypers else hypers['max-depth']
    tree_n = 5 if 'total-trees' not in hypers else hypers['total-trees']
    m = int(len(column_choices)**.5) if 'm' not in hypers else hypers['m']
</code>
</pre>
<p>
The return should be a forest, i.e., a list of trees as seen above.
<p>
My implementation borrows heavily from `build_tree_iter` from module 4. I am still using the nested function `iterative_build` to build a tree. But I have modified it to use bootstrapped training data for each tree and a random subset of attributes for each node in a tree. At the bottom I repeatedly call `iterative_build` to generate the trees for my forest. I also added some new keys to hypers.
</div>

In [0]:
def forest_builder(table, column_choices, target, hypers):

    tree_n = 5 if 'total-trees' not in hypers else hypers['total-trees']
    m = int(len(column_choices)**.5) if 'm' not in hypers else hypers['m']
    k = hypers['max-depth'] if 'max-depth' in hypers else min(2, len(column_choices))
    gig_cutoff = hypers['gig-cutoff'] if 'gig-cutoff' in hypers else 0.0
    rgen = hypers['random-state'] if 'random-state' in hypers else 0
    if isinstance(rgen, int):
        rgen = np.random.RandomState(rgen)  #wrap an int seed, else every tree would get the identical bootstrap sample

    #build a single tree of depth n - call it multiple times to build multiple trees
    def iterative_build(n):
        train = table.sample(frac=1.0, replace=True, random_state=rgen)
        train = train.reset_index()
        left_out = table.loc[~table.index.isin(train['index'])]
        left_out = left_out.reset_index() # this gives us the old index in its own column
        oob_list = left_out['index'].tolist()  # list of row indices from original titanic table
        
        rcols = random.sample(column_choices, m)  # subspace sampling - uses random.seed, not rng
        columns_sorted = find_best_splitter(train, rcols, target)
        (best_column, gig_value) = columns_sorted[0]

        #Note I add _1 or _0 to make it more readable for debugging
        current_paths = [{'conjunction': [(best_column+'_1', build_pred(best_column, 1))],
                          'prediction': None,
                          'gig_score': gig_value},
                         {'conjunction': [(best_column+'_0', build_pred(best_column, 0))],
                          'prediction': None,
                          'gig_score': gig_value}
                        ]
        n -= 1  # we just built a level as seed so subtract 1 from n
        tree_paths = []  # add completed paths here

        while n>0:
            new_paths = []
            for path in current_paths:
                conjunct = path['conjunction']  # a list of (name, lambda)
                before_table = generate_table(train, conjunct)  #the subtable the current conjunct leads to
                rcols = random.sample(column_choices, m)  # subspace
                columns_sorted = find_best_splitter(before_table, rcols, target)
                (best_column, gig_value) = columns_sorted[0]
                if gig_value > gig_cutoff:
                    new_path_1 = {'conjunction': conjunct + [(best_column+'_1', build_pred(best_column, 1))],
                                'prediction': None,
                                 'gig_score': gig_value}
                    new_paths.append( new_path_1 ) #true
                    new_path_0 = {'conjunction': conjunct + [(best_column+'_0', build_pred(best_column, 0))],
                                'prediction': None,
                                 'gig_score': gig_value
                                 }
                    new_paths.append( new_path_0 ) #false
                else:
                    #not worth splitting so complete the path with a prediction
                    path['prediction'] = compute_prediction(before_table, target)
                    tree_paths.append(path)
            #end for loop

            current_paths = new_paths
            if current_paths != []:
                n -= 1
            else:
                break  # nothing left to extend so have copied all paths to tree_paths
        #end while loop

        #Generate predictions for all paths that have None
        for path in current_paths:
            conjunct = path['conjunction']
            before_table = generate_table(train, conjunct)
            path['prediction'] = compute_prediction(before_table, target)
            tree_paths.append(path)
        return (tree_paths, oob_list)
    
    #let's build a forest
    forest = []
    for i in range(tree_n):
        (paths, oob) = iterative_build(k)  #always use k for now
        forest.append({'paths': paths, 'weight': None, 'oob': oob})
        
    return forest
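
<div class=h1_cell>
<p>
One thing worth knowing about the `random-state` value: pandas treats an int and a generator object differently. An int re-seeds on every call to `sample`, so repeated calls give the identical sample. A `RandomState` object carries its state forward, so repeated calls give different samples. For a forest we want the second behavior, otherwise every tree trains on the same rows. The sketch below shows the difference on a made-up 100-row table.
</div>

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'row': range(100)})  # made-up table

# An int seed restarts the generator on every call: identical samples.
a = toy.sample(frac=1.0, replace=True, random_state=0)
b = toy.sample(frac=1.0, replace=True, random_state=0)
print(a.index.equals(b.index))

# A RandomState object keeps its state between calls: the samples differ.
toy_rng = np.random.RandomState(0)
c = toy.sample(frac=1.0, replace=True, random_state=toy_rng)
d = toy.sample(frac=1.0, replace=True, random_state=toy_rng)
print(c.index.equals(d.index))
```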

<div class=h1_cell>
<p>
Ok, let's build a forest with 5 trees (the default).
</div>

In [44]:
forest2 = forest_builder(titanic_table, splitter_columns, 'Survived', hypers={'random-state':rng})
len(forest2)

5

<div class=h1_cell>
<p>
Now get scores.
</div>

In [45]:
forest_scores(titanic_table, forest2, 'Survived')

[0.7800224466891134, 0.684887459807074, 0.5007669446841148]

<h2>
Try 2 more
</h2>
<p>
<div class=h1_cell>
<p>
One with 11 trees with depth of 2, and one with 11 trees and depth of 1 (i.e., 11 stumps).
</div>

In [46]:
forest3 = forest_builder(titanic_table, splitter_columns, 'Survived', hypers={'total-trees':11, 'random-state':rng})
len(forest3)

11

In [47]:
forest_scores(titanic_table, forest3, 'Survived')

[0.77665544332211, 0.6924265842349303, 0.5074297766273608]

In [48]:
forest4 = forest_builder(titanic_table, splitter_columns, 'Survived', hypers={'total-trees':11, 'max-depth':1, 'random-state':rng})
len(forest4)

11

In [49]:
forest_scores(titanic_table, forest4, 'Survived')

[0.7867564534231201, 0.62890625, 0.4543667912951779]

<h2>
Should explore more
</h2>
<p>
<div class=h1_cell>
If we were ambitious, we could write yet another function that explored for us: it would try combinations of number of trees and depth and report the best performing. I won't make you write this function but it should be within your grasp by this point. All you would be doing is repeating the steps above, but now in nested loops that produce the various combinations to try.
</div>
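<div class=h1_cell>
<p>
Here is one possible shape for such an explorer, as a sketch only. The function name `explore` and its parameters are my own. To keep the sketch runnable on its own, it takes a scoring function as an argument, and I test it with a stand-in scorer (`fake_score`) that is NOT a real forest evaluation.
</div>

```python
from itertools import product

def explore(score_fn, tree_counts, depths):
    """Try every (trees, depth) pair and return (best_score, best_hypers)."""
    results = []
    for total_trees, max_depth in product(tree_counts, depths):
        hypers = {'total-trees': total_trees, 'max-depth': max_depth}
        results.append((score_fn(hypers), hypers))
    return max(results, key=lambda pair: pair[0])

# Stand-in scorer so the sketch runs by itself; it just rewards more trees and depth 2.
def fake_score(hypers):
    return hypers['total-trees'] * 0.01 - abs(hypers['max-depth'] - 2) * 0.05

best_score, best_hypers = explore(fake_score, tree_counts=[5, 11, 21], depths=[1, 2, 3])
print(best_hypers, best_score)
```

<div class=h1_cell>
<p>
In the notebook, the scorer could instead build a forest with `forest_builder` using the given `hypers` and return one of the three values from `forest_scores`, e.g., the F1 score at position 1.
</div>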

<h2>
Out Of Bag errors
</h2>
<p>
<div class=h1_cell>
We could now use K-Folding to do a better evaluation of our forests. But I want to consider another approach called *Out of Bag error* (jargon alert). You will now see where that `oob` value on a tree comes in handy. As a reminder, for each tree we generated a training sample. But that sample almost always leaves some rows out because sampling with replacement produces duplicates. We captured these "left out" rows in the `oob` entry. More precisely, we captured their row indices in the larger Titanic table.
<p>
Here's what I would like to do. I would like to evaluate a forest on the out of bag rows. It kind of makes sense, right? A tree was trained on some set of rows that excluded the rows in oob. So the oob rows are a bit like the test data from K-Folding. We use the oob rows for testing.

One way to look at it is I create a new testing table that is the union of the oob list on each tree. I then use this testing table to get predictions by forest vote taking. Here is the twist: a tree gets to vote on a row only if the row is in its oob list.
<p>
I am going to give you the chance to decide how to implement oob testing as part of your homework assignment.
</div>
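<div class=h1_cell>
<p>
To illustrate just the bookkeeping (not the voting or scoring, which is yours to design), here is a toy sketch of one possible reading. The three-tree `toy_forest` and its oob lists are made up. Real trees would also carry `paths`.
</div>

```python
# Hypothetical trees: only the 'oob' key matters for this sketch.
toy_forest = [
    {'oob': [0, 2, 5]},
    {'oob': [2, 3]},
    {'oob': [0, 2, 4]},
]

# Union of all oob lists: rows that at least one tree never trained on.
test_rows = sorted(set().union(*(tree['oob'] for tree in toy_forest)))

# For each test row, the trees allowed to vote are the ones that left it out.
eligible = {row: [i for i, tree in enumerate(toy_forest) if row in tree['oob']]
            for row in test_rows}

print(test_rows)
print(eligible)
```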

<hr>
<h1>Write it out</h1>
<div class=h1_cell>

We did not change the table but we did define new functions. Add them to your library.
</div>


<h2>
Next up
</h2>
<p>
<div class=h1_cell>
We will continue to look at forests, but study an alternative way to test them.
</div>