# Decision Trees and Random Forests



In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split


df = pd.read_csv("datasets/titanic/train.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## Initial Baseline

To start, we'll do minimal data wrangling and save all feature engineering for later.
This establishes a baseline for how well our models perform on a dataset that is as raw as possible.
Later, when we engineer new features, we'll be able to measure their quality against that baseline.
If a feature hurts our models' performance, it is not worth the additional compute needed to generate it.

### Data Wrangling

The only data wrangling we'll do initially is:

- fill missing *Age* values with the median age of all passengers whose ages are recorded
- fill the two missing *Embarked* values with `S`
- change the *Sex* values from male/female to binary 0 for male and 1 for female, since neither ML model accepts string data
- change the *Embarked* values from the string code for the port of embarkation to 0, 1, or 2, for the same reason
- drop the identifier and free-text columns (*PassengerId*, *Name*, *Ticket*, *Cabin*) along with *SibSp* and *Parch*, which we return to during feature engineering

In [2]:
# Impute missing Age and Embarked values
df.fillna({
    "Age": df["Age"].median(),
    "Embarked": "S"
    }, inplace=True)

# Convert Sex into integers
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1}).astype(int)

# Convert the Embarked port codes into integers
df["Embarked"] = df["Embarked"].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

X = df.drop(["Name", "PassengerId", "Survived", "Cabin", "Ticket", "SibSp", "Parch"], axis=1)
print(X)
y = df['Survived']

     Pclass  Sex   Age     Fare  Embarked
0         3    0  22.0   7.2500         0
1         1    1  38.0  71.2833         1
2         3    1  26.0   7.9250         0
3         1    1  35.0  53.1000         0
4         3    0  35.0   8.0500         0
..      ...  ...   ...      ...       ...
886       2    0  27.0  13.0000         0
887       1    1  19.0  30.0000         0
888       3    1  28.0  23.4500         0
889       1    0  26.0  30.0000         1
890       3    0  32.0   7.7500         2

[891 rows x 5 columns]


### Decision Tree

In [3]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
dt_y_pred = decision_tree.predict(X_test)
dt_accuracy = decision_tree.score(X_test, y_test)

print(f"Accuracy: {dt_accuracy:.3f}\n")

Accuracy: 0.758



### Random Forest

In [4]:
from sklearn.ensemble import RandomForestClassifier


rforest = RandomForestClassifier()
rforest.fit(X_train, y_train)
rf_y_pred = rforest.predict(X_test)
rf_accuracy = rforest.score(X_test, y_test)

print(f"Accuracy: {rf_accuracy:.3f}\n")

Accuracy: 0.785



### Initial Assessments

Comparing the two classifiers, we can see that the random forest is about 2.7 percentage points more accurate than the decision tree (0.785 versus 0.758). That is to be expected, since we are using the default `n_estimators=100` with the random forest, which in turn means that it is using the consensus of 100 decision trees in the forest.
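
As a rough illustration of what that consensus looks like in scikit-learn, the sketch below averages the class probabilities of the individual trees and compares the result with the forest's own output. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the Titanic set), and the names `X_demo`, `y_demo`, and `forest` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the Titanic dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree lives in `estimators_`; stack their class probabilities.
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Averaging over the tree axis should reproduce the forest's probabilities.
mean_proba = tree_probas.mean(axis=0)

print("number of trees:", len(forest.estimators_))
print("matches forest.predict_proba:", np.allclose(mean_proba, forest.predict_proba(X_demo)))
```

The forest then predicts the class with the highest averaged probability, so its vote is a soft average over trees rather than a simple count of hard votes.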

From this point forward the way we'll try to improve this accuracy is by engineering new features from the existing data.

## Feature Engineering

### Familial Features

We'll begin by creating features around the family of different passengers aboard the Titanic.

The first feature we'll create is `family_size` to get the size of a passenger's family aboard.

In [5]:
df['family_size'] = df['SibSp'] + df['Parch'] + 1
(df[['family_size', 'Survived']].groupby(['family_size'], as_index=False)
    .mean()
    .sort_values(by='Survived', ascending=False))

   family_size  Survived
3            4  0.724138
2            3  0.578431
1            2  0.552795
6            7  0.333333
0            1  0.303538
4            5  0.200000
5            6  0.136364
7            8  0.000000
8           11  0.000000


As the table above shows, passengers in families of 2, 3, or 4 people survived at an average rate above 50%.

To evaluate our models more easily as we engineer new features, we'll wrap training and scoring in a function called `generate`.

In [6]:
def generate(drop_cols=None):
    # Columns never used as features; drop_cols lists extra columns to exclude
    cols = ["Name", "PassengerId", "Survived", "Cabin", "Ticket", "SibSp", "Parch"]
    cols += drop_cols or []
    X = df.drop(cols, axis=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
    
    decision_tree = DecisionTreeClassifier(random_state=50)
    decision_tree.fit(X_train, y_train)
    dt_y_pred = decision_tree.predict(X_test)
    dt_accuracy = decision_tree.score(X_test, y_test)
    
    print(f"Decision Tree Accuracy: {dt_accuracy:.3f}\n")
    
    rforest = RandomForestClassifier(random_state=50)
    rforest.fit(X_train, y_train)
    rf_y_pred = rforest.predict(X_test)
    rf_accuracy = rforest.score(X_test, y_test)
    
    print(f"Random Forest Accuracy: {rf_accuracy:.3f}\n")

generate()

Decision Tree Accuracy: 0.767

Random Forest Accuracy: 0.789



We can see that the accuracy scores for both models changed only slightly relative to the baseline: 0.758 to 0.767 for the decision tree and 0.785 to 0.789 for the random forest. Because the baseline models were trained without a fixed `random_state`, differences this small are within run-to-run variation, so if we were to stop here the new feature would not clearly be worth keeping. Since we're going to engineer several new features, we'll evaluate everything after we've built them.

The next feature, `IsAlone`, might seem trivial but can be useful: it is a binary 1 if the passenger is aboard alone and 0 otherwise. In addition, since we have `family_size`, we can compute `fare_per_person` by dividing the fare by the family size, which gives a rough socio-economic metric for the passenger.

In [7]:
# Create IsAlone
df['IsAlone'] = 0
df.loc[df['family_size'] == 1, 'IsAlone'] = 1

print("IsAlone:")
generate(['family_size'])

# Create fare_per_person
df['fare_per_person'] = df['Fare'] / df['family_size']

print('fare_per_person')
generate(['family_size', 'IsAlone'])

IsAlone:
Decision Tree Accuracy: 0.753

Random Forest Accuracy: 0.789

fare_per_person
Decision Tree Accuracy: 0.744

Random Forest Accuracy: 0.798
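
To make the arithmetic behind these three features concrete, here is a small worked sketch on three made-up passengers. The values in `toy` are illustrative and are not rows from the dataset. The boolean comparison used for `IsAlone` is an equivalent one-line form of the two-step assignment in the cell above.

```python
import pandas as pd

# Three made-up passengers (illustrative values, not rows from the dataset).
toy = pd.DataFrame({
    "SibSp": [0, 1, 3],
    "Parch": [0, 0, 2],
    "Fare": [8.0, 70.0, 30.0],
})

# The passenger counts as one member of their own family.
toy["family_size"] = toy["SibSp"] + toy["Parch"] + 1

# 1 when the passenger travels alone, 0 otherwise.
toy["IsAlone"] = (toy["family_size"] == 1).astype(int)

# Split the fare evenly across the family members aboard.
toy["fare_per_person"] = toy["Fare"] / toy["family_size"]

print(toy)
```

The second passenger, for instance, has one sibling or spouse aboard, so `family_size` is 2, `IsAlone` is 0, and `fare_per_person` is 70.0 / 2 = 35.0.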



### Naming

We can also engineer features from passengers' names. In particular, extracting the prefix (title) from each name with a regular expression could provide some insight into socio-economic standing as well. We'll call this feature `name_prefix`.

In [8]:
df['name_prefix'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

pd.crosstab(df['name_prefix'], df['Sex'])

Sex           0    1
name_prefix
Capt          1    0
Col           2    0
Countess      0    1
Don           1    0
Dr            6    1
Jonkheer      1    0
Lady          0    1
Major         2    0
Master       40    0
Miss          0  182
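
To see what the pattern ` ([A-Za-z]+)\.` captures, the sketch below applies it to a few sample names written in the dataset's "Surname, Title. Given names" format. The pattern looks for a space, then a run of letters, then a literal period, and keeps only the letters. The `names` series is a hypothetical stand-in for `df['Name']`.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the first run of letters that follows a space and ends with a period.
prefixes = names.str.extract(r' ([A-Za-z]+)\.', expand=False)

print(prefixes.tolist())  # ['Mr', 'Miss', 'Countess']
```

In the third name, "the" is skipped because it is followed by a space rather than a period, so the match falls on "Countess".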


From the table above, there are many different prefixes among the passengers, and some are far more common than others. Several are aliases of the more common prefixes. For example, *Miss* is very common, while *Ms* and *Mlle* (French for Miss) are aliases that should be recategorized as *Miss*, and *Mme* likewise as *Mrs*. The more obscure prefixes can be grouped into a separate class called *Rare*.

In [9]:
df['name_prefix'] = df['name_prefix'].replace(['Lady', 'Countess','Capt', 
                                               'Col', 'Don', 'Dr', 'Major',
                                               'Rev', 'Sir', 'Jonkheer'], 
                                              'Rare')

df['name_prefix'] = df['name_prefix'].replace('Mlle', 'Miss')
df['name_prefix'] = df['name_prefix'].replace('Ms', 'Miss')
df['name_prefix'] = df['name_prefix'].replace('Mme', 'Mrs')

df[['name_prefix', 'Survived']].groupby(['name_prefix'], as_index=False).mean()


  name_prefix  Survived
0      Master  0.575000
1        Miss  0.702703
2          Mr  0.156673
3         Mrs  0.793651
4        Rare  0.347826


As before, we cannot keep this data in string format, so we'll convert it to integer codes with the `LabelEncoder` from sklearn.

In [10]:
from sklearn.preprocessing import LabelEncoder


le = LabelEncoder()
df['name_prefix'] = le.fit_transform(df['name_prefix'])

print("name_prefix:")
generate(['family_size', 'IsAlone', 'fare_per_person'])

name_prefix:
Decision Tree Accuracy: 0.762

Random Forest Accuracy: 0.785
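
For reference, the sketch below shows which integer each prefix receives. `LabelEncoder` sorts the distinct labels and uses each label's position as its code, so the numbering is alphabetical and carries no meaning of its own. The `titles` list is a hypothetical stand-in for the `name_prefix` column.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the name_prefix column after grouping.
titles = ["Mr", "Mrs", "Miss", "Master", "Rare", "Mr"]

encoder = LabelEncoder()
codes = encoder.fit_transform(titles)

# classes_ holds the labels in sorted order; a label's index is its code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)         # {'Master': 0, 'Miss': 1, 'Mr': 2, 'Mrs': 3, 'Rare': 4}
print(codes.tolist())  # [2, 3, 1, 0, 4, 2]
```

Tree-based models can still split on these codes, but the implied ordering (for example, *Miss* sitting between *Master* and *Mr*) is an artifact of alphabetical sorting rather than a property of the data.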



There are more features that could be engineered with this dataset. Of the 4 new features, three on their own raised the random forest's accuracy slightly above its 0.785 baseline (`family_size`, `IsAlone`, and `fare_per_person`), while `name_prefix` matched it. For the decision tree, `IsAlone` and `fare_per_person` lowered accuracy below the 0.758 baseline, while `family_size` and `name_prefix` raised it slightly. Let's next train our models with all of the new features together.

In [19]:
print('All new features combined:')
generate()

All new features combined:
Decision Tree Accuracy: 0.749

Random Forest Accuracy: 0.816



To conclude, the decision tree's accuracy worsened slightly (0.758 to 0.749), but it was clear even before that the random forest was the better choice of the two similar models. With all 4 engineered features included in the dataset, the random forest's accuracy improved from 0.785 to 0.816, putting it over 80%. Perhaps with more feature engineering the random forest model could be improved even further.
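
All of the comparisons above rest on a single 25% test split, where one or two passengers can shift the accuracy by a point. One way to get a steadier comparison is k-fold cross-validation. The sketch below is a minimal illustration under stated assumptions: it runs on synthetic data from `make_classification` rather than the Titanic features, and `X_demo`, `y_demo`, and `cv_scores` are hypothetical names. In the notebook, the feature matrix and `y` would be passed instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=50),
    "Random Forest": RandomForestClassifier(random_state=50),
}

cv_scores = {}
for name, model in models.items():
    # Five folds: each sample is used for testing exactly once.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and spread across folds makes it easier to tell whether a change of a point or two reflects the new feature or just the particular split.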