This page is based on the [Random Forest analysis page](titanic-3.ipynb).
In that page, notice that just three of the features (Fare, Age, Gender) are much more valuable predictors than the rest. This is consistent with our [first-pass analysis](titanic-1.ipynb). Let's explore how well a model works with just those three features.

In [48]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [49]:
train = pd.read_csv('titanic_train.csv')

As before, we need to fill in some missing age values. Let's use the passenger's ticket class to guess their age.

In [50]:
def impute_age(cols):
    # Look the values up by column label rather than by position.
    Age = cols['Age']
    Pclass = cols['Pclass']
    
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

In [52]:
train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
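
The constants 37, 29, and 24 in `impute_age` are hard-coded. As a minimal sketch, assuming they are meant to approximate a typical age for each ticket class, the same idea can be written so that the fill value is computed from the data: each missing age is replaced by the median age of that passenger's class. The `demo` frame below is made-up data for illustration, not the Titanic file.

```python
import numpy as np
import pandas as pd

# Made-up toy data (not the Titanic file): two passengers have no recorded age.
demo = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 3, 3, 3],
    'Age':    [40.0, 36.0, np.nan, 30.0, 28.0, 20.0, np.nan, 26.0],
})

# Median age within each ticket class, broadcast back to every row.
class_median = demo.groupby('Pclass')['Age'].transform('median')

# Keep recorded ages; fill only the missing ones with the class median.
demo['Age'] = demo['Age'].fillna(class_median)
print(demo)
```

This avoids the row-by-row `apply` and keeps the fill values in step with the data, at the cost of no longer matching the exact constants used above.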

As before, convert the categorical Sex field into a numeric field by encoding its "female" and "male" values.

In [53]:
sex = pd.get_dummies(train['Sex'],drop_first=True)
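
To see what `drop_first=True` does, here is a small sketch on a toy column (the `sex_demo` series is made up for illustration). With two categories, one indicator column carries all the information, so the first category in sorted order is dropped.

```python
import pandas as pd

# Toy column with the same two categories as the Titanic 'Sex' field.
sex_demo = pd.Series(['male', 'female', 'female', 'male'], name='Sex')

# Without drop_first there is one indicator column per category.
full = pd.get_dummies(sex_demo)

# With drop_first=True the first category in sorted order ('female') is
# dropped, leaving a single 'male' column that is set for male passengers.
reduced = pd.get_dummies(sex_demo, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
# Cast to int because newer pandas versions return boolean indicators.
print(reduced['male'].astype(int).tolist())
```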

Now we add the engineered field and drop the categorical fields.

In [54]:
train = pd.concat([train,sex],axis=1)
train.drop(['Sex', 'Embarked','Name','Ticket','Cabin','PassengerId','Pclass','SibSp','Parch'],axis=1,inplace=True)

So now we should have just the Fare, Age, and encoded Sex fields, plus the Survived label.

In [55]:
train.head()

   Survived   Age     Fare  male
0         0  22.0     7.25     1
1         1  38.0  71.2833     0
2         1  26.0    7.925     0
3         1  35.0     53.1     0
4         0  35.0     8.05     1


Now we will separate the inputs from the output. X is all the input data; y is what we are trying to predict:

In [56]:
X = train.drop('Survived',axis=1)
y = train['Survived']

In [57]:
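# sklearn.cross_validation was removed from scikit-learn; train_test_split lives in model_selection.
from sklearn.model_selection import train_test_split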

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)

In [59]:
from sklearn import tree

In [60]:
dtc = tree.DecisionTreeClassifier()

In [61]:
dtc = dtc.fit(X_train,y_train)

In [62]:
predictions = dtc.predict(X_test)

In [63]:
predictions

array([0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0,
       0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0,
       1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 0])

In [64]:
from sklearn.metrics import classification_report, confusion_matrix

In [65]:
print('Confusion Matrix')
print(confusion_matrix(y_test,predictions))
print()
print('Classification Report')
print(classification_report(y_test,predictions))


Confusion Matrix
[[144  25]
 [ 42  84]]

Classification Report
             precision    recall  f1-score   support

          0       0.77      0.85      0.81       169
          1       0.77      0.67      0.71       126

avg / total       0.77      0.77      0.77       295
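
The per-class numbers in the classification report follow directly from the confusion matrix. As a short worked sketch, using only NumPy and the matrix printed above, precision is the diagonal divided by the column sums and recall is the diagonal divided by the row sums:

```python
import numpy as np

# Decision tree confusion matrix from the output above.
# Rows are the true class (0, 1); columns are the predicted class (0, 1).
cm = np.array([[144, 25],
               [42, 84]])

# Precision: of everything predicted as a class, the fraction that was right
# (diagonal divided by column sums).
precision = np.diag(cm) / cm.sum(axis=0)

# Recall: of everything truly in a class, the fraction that was found
# (diagonal divided by row sums).
recall = np.diag(cm) / cm.sum(axis=1)

# Accuracy: correct predictions over all 295 test passengers.
accuracy = np.trace(cm) / cm.sum()

print(precision.round(2), recall.round(2), round(accuracy, 2))
```

Rounded to two places, these reproduce the precision and recall columns of the report for classes 0 and 1.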



### For comparison, the results of the Logistic Regression analysis:
    Confusion Matrix
    [[153  16]
     [ 37  89]]

    Classification Report
                  precision    recall  f1-score   support

              0       0.81      0.91      0.85       169
              1       0.85      0.71      0.77       126

    avg / total       0.82      0.82      0.82       295




# Now let's try a Random Forest Classifier

In [66]:
from sklearn.ensemble import RandomForestClassifier

In [67]:
rfc = RandomForestClassifier(n_estimators=50)

In [68]:
rfc.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
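
A fitted forest exposes `feature_importances_`, which is one way to check how much each of the remaining features contributes. The sketch below uses synthetic stand-in data (the `X_demo` and `y_demo` names and values are made up, not the Titanic set) in which the label depends only on the `male` column, so it illustrates the mechanics rather than reproducing this page's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, NOT the Titanic set.
rng = np.random.RandomState(0)
X_demo = pd.DataFrame({
    'Age': rng.uniform(1, 80, 200),
    'Fare': rng.uniform(5, 500, 200),
    'male': rng.randint(0, 2, 200),
})
# Made-up label that is fully determined by the 'male' column.
y_demo = 1 - X_demo['male']

rfc_demo = RandomForestClassifier(n_estimators=50, random_state=0)
rfc_demo.fit(X_demo, y_demo)

# Importances are normalized to sum to 1 across the input columns.
importances = pd.Series(rfc_demo.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Running the same two lines against `rfc` and `X_train` would give the importances for the model fitted on this page.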

In [69]:
predictions = rfc.predict(X_test)

In [70]:
predictions

array([0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0,
       1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0])

In [71]:
print('Confusion Matrix')
print(confusion_matrix(y_test,predictions))
print()
print('Classification Report')

print(classification_report(y_test,predictions))


Confusion Matrix
[[153  16]
 [ 48  78]]

Classification Report
             precision    recall  f1-score   support

          0       0.76      0.91      0.83       169
          1       0.83      0.62      0.71       126

avg / total       0.79      0.78      0.78       295



### Here are the results from the previous analysis, with more features.
    Confusion Matrix
    [[149  20]
     [ 36  90]]
    
    Classification Report
                 precision    recall  f1-score   support
    
              0       0.81      0.88      0.84       169
              1       0.82      0.71      0.76       126
    
    avg / total       0.81      0.81      0.81       295

### How does that compare?
We went from 14 features to 3, and the random forest's average precision dropped only from 81% to 79%, while its average recall dropped from 81% to 78%.
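
For a single-number comparison, overall accuracy can be computed from each confusion matrix shown on this page. This is a small sketch using only the printed matrices; the labels in the dictionary are descriptive names chosen here, not identifiers from the notebooks.

```python
import numpy as np

# Confusion matrices copied from the outputs on this page.
results = {
    'Decision tree, 3 features': [[144, 25], [42, 84]],
    'Logistic regression (earlier page)': [[153, 16], [37, 89]],
    'Random forest, 3 features': [[153, 16], [48, 78]],
    'Random forest, more features': [[149, 20], [36, 90]],
}

accuracy = {}
for name, cm in results.items():
    cm = np.array(cm)
    # Correct predictions (the diagonal) over all test passengers.
    accuracy[name] = np.trace(cm) / cm.sum()
    print(f'{name}: {accuracy[name]:.3f}')
```

Note that the random forest above was fitted without a fixed `random_state`, so rerunning the notebook can shift its numbers slightly.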