## Pipeline: Split data into train, validation, and test sets

This notebook uses the Titanic dataset from [this Kaggle competition](https://www.kaggle.com/c/titanic/overview).

In this section, we will split the data into train, validation, and test sets in preparation for fitting a basic model in the next section.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

In [3]:
# The CSV appears to have been saved with its index, which loads as the 'Unnamed: 0' column
df = pd.read_csv('cleaned_titanic.csv', index_col=False)
df.head()

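| Unnamed: 0 | Survived | Pclass | Sex | Age | Fare | Family_cnt | Cabin_ind |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 22.0 | 7.25 | 1 | 0 |
| 1 | 1 | 1 | 1 | 38.0 | 71.2833 | 1 | 1 |
| 2 | 1 | 3 | 1 | 26.0 | 7.925 | 0 | 0 |
| 3 | 1 | 1 | 1 | 35.0 | 53.1 | 1 | 1 |
| 4 | 0 | 3 | 0 | 35.0 | 8.05 | 0 | 0 |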


In [4]:
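# Separate the predictors from the label
features = df.drop('Survived', axis=1)
target = df['Survived']

# First pass: hold out 40% of the rows, leaving 60% for training
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.4, random_state=42)
# Second pass: halve the held-out 40% into test and validation sets (20% each)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42)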


In [5]:
# Confirm each split's share of the full dataset (expected: 0.6 / 0.2 / 0.2)
for dataset in [y_train, y_test, y_val]:
    print(round(len(dataset) / len(target), 2))

0.6
0.2
0.2
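
The two-pass split above can be wrapped in a small helper. The sketch below is illustrative only: `split_train_val_test`, its parameters, and the synthetic `demo_features` / `demo_target` data are assumptions standing in for the notebook's cleaned Titanic frame, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_val_test(features, target, val_frac=0.2, test_frac=0.2, seed=42):
    """Hypothetical helper: train/validation/test split done in two passes."""
    holdout_frac = val_frac + test_frac
    # Pass 1: set aside the combined validation + test rows
    X_train, X_hold, y_train, y_hold = train_test_split(
        features, target, test_size=holdout_frac, random_state=seed)
    # Pass 2: divide the holdout; test_size is relative to the holdout, not the full data
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=test_frac / holdout_frac, random_state=seed)
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in data (assumption: 1000 rows, 3 numeric columns, binary label)
rng = np.random.default_rng(0)
demo_features = pd.DataFrame(rng.normal(size=(1000, 3)), columns=['a', 'b', 'c'])
demo_target = pd.Series(rng.integers(0, 2, size=1000), name='label')

X_tr, X_va, X_te, y_tr, y_va, y_te = split_train_val_test(demo_features, demo_target)
fractions = [round(len(part) / len(demo_target), 2) for part in (y_tr, y_va, y_te)]
print(fractions)  # [0.6, 0.2, 0.2]
```

The second `test_size` is expressed as a share of the holdout, which is why holding out 40% and then splitting it in half yields 20% each.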


In [6]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
import warnings
warnings.filterwarnings('ignore',category=FutureWarning)

In [7]:
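# Baseline: a random forest with default hyperparameters
rf = RandomForestClassifier()

# 5-fold cross-validation on the training set only; returns one accuracy per fold
scores = cross_val_score(rf, X_train, y_train, cv=5)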

In [8]:
print(scores)
print(scores.mean())

[0.80373832 0.82242991 0.79439252 0.79439252 0.82075472]
0.807141597601834
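
A single mean hides how much the folds disagree. As a rough sketch, the fold scores printed above can be summarized with their spread as well; the array below is copied from that output, and the name `fold_scores` is only illustrative.

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above
fold_scores = np.array([0.80373832, 0.82242991, 0.79439252, 0.79439252, 0.82075472])

mean_score = fold_scores.mean()
spread = fold_scores.std()  # population standard deviation across the 5 folds

# Report the mean with a two-standard-deviation band
print('{:.3f} (+/-{:.3f})'.format(mean_score, 2 * spread))
```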


In [9]:
def gcv_result(cv_results):
    """Print the best parameters and the score of every grid combination."""
    means = cv_results['mean_test_score']
    stds = cv_results['std_test_score']
    best_params = cv_results['params'][means.argmax()]

    print('Best Params: {}\n'.format(best_params))
    # These two figures are averaged over all parameter combinations
    print('Mean test score: {}'.format(round(means.mean(), 3)))
    print('Standard deviation of test score: {}'.format(round(stds.mean(), 3)))

    # One line per combination: mean CV accuracy and twice its standard deviation
    for mean, std, params in zip(means, stds, cv_results['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))

parameters = {
    'n_estimators': [5, 50, 100],
    'max_depth': [2, 10, 20, None]
}
gcv = GridSearchCV(rf, parameters, cv=5, return_train_score=True)
gcv.fit(X_train, y_train)

# gcv_result prints its report and returns None, so it is called without print()
gcv_result(gcv.cv_results_)

Best Params: {'max_depth': 10, 'n_estimators': 100}

Mean test score: 0.803
Standard deviation of test score: 0.029
0.738 (+/-0.103) for {'max_depth': 2, 'n_estimators': 5}
0.792 (+/-0.123) for {'max_depth': 2, 'n_estimators': 50}
0.79 (+/-0.125) for {'max_depth': 2, 'n_estimators': 100}
0.818 (+/-0.041) for {'max_depth': 10, 'n_estimators': 5}
0.82 (+/-0.046) for {'max_depth': 10, 'n_estimators': 50}
0.826 (+/-0.059) for {'max_depth': 10, 'n_estimators': 100}
0.792 (+/-0.022) for {'max_depth': 20, 'n_estimators': 5}
0.805 (+/-0.038) for {'max_depth': 20, 'n_estimators': 50}
0.813 (+/-0.023) for {'max_depth': 20, 'n_estimators': 100}
0.815 (+/-0.051) for {'max_depth': None, 'n_estimators': 5}
0.807 (+/-0.04) for {'max_depth': None, 'n_estimators': 50}
0.815 (+/-0.036) for {'max_depth': None, 'n_estimators': 100}
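
A fitted `GridSearchCV` also exposes the winning combination directly through `best_params_`, `best_score_`, and `best_estimator_`, so the best entry does not have to be looked up by hand. The sketch below shows this on synthetic data; `X_demo`, `y_demo`, the smaller grid, and the fixed `random_state` are assumptions made to keep the example self-contained and are not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption, not the Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

param_grid = {'n_estimators': [5, 50], 'max_depth': [2, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)           # combination with the highest mean CV score
print(round(search.best_score_, 3))  # that combination's mean CV score

# With the default refit=True, the best combination is refit on all of X_demo
best_model = search.best_estimator_
```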


In [20]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report


In [14]:
rf1 = RandomForestClassifier(n_estimators=100,max_depth=10)
rf1.fit(X_train,y_train)

rf2 = RandomForestClassifier(n_estimators=50,max_depth=10)
rf2.fit(X_train,y_train)

In [18]:
# Compare the two strongest grid-search candidates on the validation set
for model in [rf1, rf2]:
    y_pred = model.predict(X_val)
    accuracy = round(accuracy_score(y_val, y_pred), 3)
    precision = round(precision_score(y_val, y_pred), 3)
    recall = round(recall_score(y_val, y_pred), 3)
    print('max depth {} \n no. of Estimators {} \n Accuracy {} \n Precision {}\n Recall {}'\
        .format(model.max_depth,
                model.n_estimators,
                accuracy,
                precision,
                recall))

max depth 10 
 no. of Estimators 100 
 Accuracy 0.816 
 Precision 0.852
 Recall 0.684
max depth 10 
 no. of Estimators 50 
 Accuracy 0.827 
 Precision 0.846
 Recall 0.724
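
The same three metrics are computed again for the test set, so the calculation could be factored into one function. This is a minimal sketch: `summarize` and the tiny hand-made label lists are hypothetical and only illustrate the idea.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score


def summarize(y_true, y_pred):
    """Hypothetical helper: accuracy, precision and recall rounded to 3 places."""
    return {
        'accuracy': round(float(accuracy_score(y_true, y_pred)), 3),
        'precision': round(float(precision_score(y_true, y_pred)), 3),
        'recall': round(float(recall_score(y_true, y_pred)), 3),
    }


# Toy labels (assumption): 4 positives, one of which the "model" misses
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 0, 1, 0]

report = summarize(y_true_demo, y_pred_demo)
print(report)  # {'accuracy': 0.875, 'precision': 1.0, 'recall': 0.75}
```

In the toy data there are no false positives, so precision is 1.0, while one missed positive out of four gives a recall of 0.75.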


In [31]:
# rf2 had the higher validation accuracy, so it gets the final check on the untouched test set
y_pred = rf2.predict(X_test)
accuracy = round(accuracy_score(y_test,y_pred),3)
precision = round(precision_score(y_test,y_pred),3)
recall = round(recall_score(y_test,y_pred),3)
print('Max depth {} No. of Estimators {} Accuracy {} Precision {} Recall {}'.format(rf2.max_depth,
                                                                                    rf2.n_estimators,
                                                                                    accuracy,
                                                                                    precision,
                                                                                    recall))

Max depth 10 No. of Estimators 50 Accuracy 0.792 Precision 0.741 Recall 0.662


In [33]:
print(classification_report(y_test,y_pred, target_names=['Not Survived','Survived']))

              precision    recall  f1-score   support

Not Survived       0.82      0.87      0.84       113
    Survived       0.74      0.66      0.70        65

    accuracy                           0.79       178
   macro avg       0.78      0.76      0.77       178
weighted avg       0.79      0.79      0.79       178
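
The figures in the report are tied together by the underlying confusion-matrix counts. The sketch below reconstructs those counts from the rounded test-set numbers printed above (recall 0.662, precision 0.741, supports 113 and 65). Because the inputs are rounded, this is a consistency check under that assumption, not output produced by the notebook.

```python
# Rounded figures taken from the test-set output above
support_neg, support_pos = 113, 65   # Not Survived, Survived
recall_pos, precision_pos = 0.662, 0.741

tp = round(recall_pos * support_pos)   # true positives: survivors correctly predicted
fp = round(tp / precision_pos) - tp    # false positives: predicted survived, did not
fn = support_pos - tp                  # false negatives: survivors that were missed
tn = support_neg - fp                  # true negatives

accuracy = (tp + tn) / (support_neg + support_pos)
print(tn, fp, fn, tp)      # 98 15 22 43
print(round(accuracy, 3))  # 0.792
```

The reconstructed accuracy agrees with the 0.792 reported for `rf2` on the test set.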

