# Setup

The goal is to predict whether or not a passenger survived based on attributes such as their age, sex, passenger class, where they embarked and so on.

Let's fetch the data and load it:

In [1]:
import os
import urllib.request
import numpy as np

TITANIC_PATH = os.path.join("datasets", "titanic")
DOWNLOAD_URL = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/titanic/"

def fetch_titanic_data(url=DOWNLOAD_URL, path=TITANIC_PATH):
    if not os.path.isdir(path):
        os.makedirs(path)
    for filename in ("train.csv", "test.csv"):
        filepath = os.path.join(path, filename)
        if not os.path.isfile(filepath):
            print("Downloading", filename)
            urllib.request.urlretrieve(url + filename, filepath)

fetch_titanic_data()

In [2]:
import pandas as pd

def load_titanic_data(filename, titanic_path=TITANIC_PATH):
    csv_path = os.path.join(titanic_path, filename)
    return pd.read_csv(csv_path)

In [3]:
train_data = load_titanic_data("train.csv")
test_data = load_titanic_data("test.csv")


train_data['train_test'] = 1
test_data['train_test'] = 0
test_data['Survived'] = np.nan  # np.NaN was removed in NumPy 2.0
all_data = pd.concat([train_data,test_data])

%matplotlib inline
all_data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'train_test'],
      dtype='object')

The data is already split into a training set and a test set.

Let's take a closer look at the dataset:

In [4]:
train_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | train_test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S | 1 |


The attributes have the following meaning:


* **PassengerId**: a unique identifier for each passenger.
* **Survived**: the target. 0 means the passenger did not survive, 1 means the passenger survived.
* **Pclass**: passenger class.
* **Name, Sex, Age**: self-explanatory.
* **SibSp**: number of siblings and spouses the passenger had aboard the Titanic.
* **Parch**: number of children and parents the passenger had aboard the Titanic.
* **Ticket**: ticket id.
* **Fare**: price paid (in pounds).
* **Cabin**: passenger's cabin number.
* **Embarked**: port where the passenger boarded the Titanic.

In [5]:
train_data = train_data.set_index("PassengerId")
test_data = test_data.set_index("PassengerId")

# Exploratory Data Analysis

Are there any missing values?

In [6]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    int64  
 1   Pclass      891 non-null    int64  
 2   Name        891 non-null    object 
 3   Sex         891 non-null    object 
 4   Age         714 non-null    float64
 5   SibSp       891 non-null    int64  
 6   Parch       891 non-null    int64  
 7   Ticket      891 non-null    object 
 8   Fare        891 non-null    float64
 9   Cabin       204 non-null    object 
 10  Embarked    889 non-null    object 
 11  train_test  891 non-null    int64  
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB


Okay, the **Age**, **Cabin** and **Embarked** attributes are sometimes null (fewer than 891 non-null values), especially **Cabin** (77% are null). We will ignore **Cabin** for now and focus on the rest. The **Age** attribute has about 20% null values (177 of 891), so we need to decide what to do with them. Replacing null values with the median age seems reasonable. We could be a bit smarter by predicting the age from the other columns (for example, the median age is 37 in 1st class, 29 in 2nd class and 24 in 3rd class), but we'll keep things simple and just use the overall median age.
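The per-class alternative mentioned above could be sketched as follows. The small DataFrame is made up for illustration and is not the Titanic data; the notebook itself keeps the simpler overall median.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 36.0, np.nan, 20.0, 26.0, np.nan],
})

# Overall median: every missing age gets the same value
overall_fill = toy["Age"].fillna(toy["Age"].median())

# Per-class median: each missing age gets the median of its own class
class_median = toy.groupby("Pclass")["Age"].transform("median")
per_class_fill = toy["Age"].fillna(class_median)

print(overall_fill.tolist())
print(per_class_fill.tolist())
```

The per-class version keeps the age difference between classes, at the cost of one more step that has to be fitted on the training set only.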

The **Name** and **Ticket** attributes may have some value, but they will be a bit tricky to convert into useful numbers that a model can consume. So for now, we will ignore them.

Let's take a look at the numerical attributes:

In [7]:
train_data.describe()

| | Survived | Pclass | Age | SibSp | Parch | Fare | train_test |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.383838 | 2.308642 | 29.699113 | 0.523008 | 0.381594 | 32.204208 | 1.0 |
| std | 0.486592 | 0.836071 | 14.526507 | 1.102743 | 0.806057 | 49.693429 | 0.0 |
| min | 0.0 | 1.0 | 0.4167 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 | 1.0 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 | 1.0 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 | 1.0 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 | 1.0 |


* Only 38% **Survived**! That's close enough to 40%, so accuracy will be a reasonable metric to evaluate our model.
* The mean **Fare** was £32.20, which does not seem so expensive (but it was probably a lot of money back then).
* The mean **Age** was less than 30 years old.

In [8]:
train_data["Survived"].value_counts()

0    549
1    342
Name: Survived, dtype: int64
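
These counts give a useful reference point: a model that always predicts the majority class is already right about 62% of the time, so the classifiers below should be judged against that baseline. A quick check, using only the counts printed above:

```python
# Class counts copied from the value_counts() output above
counts = {0: 549, 1: 342}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(f"Always predicting {majority_class} gives {baseline_accuracy:.3f} accuracy")
```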

In [9]:
train_data["Pclass"].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [10]:
train_data["Sex"].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [11]:
train_data["Embarked"].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

The Embarked attribute tells us where the passenger embarked: C=Cherbourg, Q=Queenstown, S=Southampton.


# Feature Engineering

Let's build the pipeline for the numerical attributes:

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ])
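
As a quick illustration of what the numerical pipeline does, the sketch below runs the same two steps on a made-up column of ages. The values and the `toy_num_pipeline` name are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy single-column input with one missing value (illustrative only)
ages = np.array([[20.0], [30.0], [np.nan], [40.0]])

toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # NaN -> median of the column
    ("scaler", StandardScaler()),                   # zero mean, unit variance
])

scaled = toy_num_pipeline.fit_transform(ages)
print(scaled.ravel())
```

The missing value is first replaced by the median (30 here), so after scaling it sits exactly at zero.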

Now we can build the pipeline for the categorical attributes:

In [13]:
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        # On scikit-learn >= 1.2, use sparse_output=False instead of sparse=False
        ("cat_encoder", OneHotEncoder(sparse=False, handle_unknown='ignore')),
    ])

Let's create an age bucket category:

In [14]:
train_data["AgeBucket"] = train_data["Age"] // 15 * 15
train_data[["AgeBucket", "Survived"]].groupby(['AgeBucket']).mean()

| AgeBucket | Survived |
|---|---|
| 0.0 | 0.576923 |
| 15.0 | 0.362745 |
| 30.0 | 0.423256 |
| 45.0 | 0.404494 |
| 60.0 | 0.24 |
| 75.0 | 1.0 |
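
The bucketing expression maps each age to the lower edge of its 15-year bucket. A minimal sketch on a few hand-picked ages (not taken from the dataset) shows the mapping:

```python
import pandas as pd

ages = pd.Series([0.42, 14.0, 15.0, 29.0, 30.0, 80.0])

# Floor-divide by the bucket width, then multiply back:
# every age lands on the lower edge of its 15-year bucket
buckets = ages // 15 * 15

print(buckets.tolist())
```

Since the maximum age in the training set is 80, the 75 bucket can hold only very few passengers, so its 100% survival rate should not be over-read.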


In [15]:
train_data.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | train_test | AgeBucket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S | 1 | 15.0 |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | 30.0 |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S | 1 | 15.0 |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 1 | 30.0 |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S | 1 | 30.0 |


Let's add a **RelativesOnboard** attribute, the sum of **SibSp** and **Parch**:

In [16]:
train_data["RelativesOnboard"] = train_data["SibSp"] + train_data["Parch"]
train_data[["RelativesOnboard", "Survived"]].groupby(['RelativesOnboard']).mean()

| RelativesOnboard | Survived |
|---|---|
| 0 | 0.303538 |
| 1 | 0.552795 |
| 2 | 0.578431 |
| 3 | 0.724138 |
| 4 | 0.2 |
| 5 | 0.136364 |
| 6 | 0.333333 |
| 7 | 0.0 |
| 10 | 0.0 |


Let's extract the title (Mr, Mrs, Miss and so on) from each name, since it may correlate with the **Survived** attribute:

In [17]:
train_data['name_title'] = train_data.Name.apply(lambda x: x.split(',')[1].split('.')[0].strip())
train_data['name_title'].value_counts()

Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Major             2
Col               2
the Countess      1
Capt              1
Ms                1
Sir               1
Lady              1
Mme               1
Don               1
Jonkheer          1
Name: name_title, dtype: int64
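
The lambda above relies on every name following the `Surname, Title. Given names` pattern. Written as a named function (`extract_title` is an illustrative name, not part of the notebook), the same logic is easier to test on a few names from the dataset:

```python
def extract_title(name: str) -> str:
    """Return the text between the first comma and the following period."""
    after_comma = name.split(",")[1]
    return after_comma.split(".")[0].strip()


examples = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Oliva y Ocana, Dona. Fermina",
]
titles = [extract_title(n) for n in examples]
print(titles)
```

Note that a title such as `Dona` occurs in the test set but not in the training counts above, which is one reason the one-hot encoder is configured with `handle_unknown='ignore'`.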

Let's extract the letter prefix of the **Ticket** attribute (if any) and treat it as a categorical attribute:

In [18]:
train_data['ticket_letters'] = train_data.Ticket.apply(lambda x: ''.join(x.split(' ')[:-1]).replace('.','').replace('/','').lower() if len(x.split(' ')[:-1]) >0 else 0)
train_data['ticket_letters'].value_counts()

0            665
pc            60
ca            41
a5            21
stono2        18
sotonoq       15
scparis       11
wc            10
a4             7
soc            6
fcc            5
c              5
sopp           3
pp             3
wep            3
ppp            2
scah           2
sotono2        2
swpp           2
fc             1
scahbasle      1
as             1
sp             1
sc             1
scow           1
fa             1
sop            1
sca4           1
casoton        1
Name: ticket_letters, dtype: int64
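
The one-line lambda above is dense, so here is the same logic unrolled into a function. The name `ticket_prefix` is illustrative; the behaviour, including returning the integer `0` for purely numeric tickets, mirrors the notebook's expression.

```python
def ticket_prefix(ticket: str):
    """Return the normalized letter prefix of a ticket, or 0 if it has none."""
    parts = ticket.split(" ")[:-1]  # everything before the final number
    if not parts:
        return 0  # purely numeric ticket, no prefix
    prefix = "".join(parts)
    return prefix.replace(".", "").replace("/", "").lower()


samples = ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "A.5. 3236"]
prefixes = [ticket_prefix(t) for t in samples]
print(prefixes)
```

Stripping dots and slashes makes spelling variants such as `A/5` and `A.5.` fall into the same category.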

Let's see how many passengers traveled alone:

In [19]:
train_data['alone'] = (train_data["RelativesOnboard"] == 0)
train_data['alone'].value_counts()

True     537
False    354
Name: alone, dtype: int64

In [20]:
#create all categorical variables that we did above for both training and test sets 
all_data['ticket_letters'] = all_data.Ticket.apply(lambda x: ''.join(x.split(' ')[:-1]).replace('.','').replace('/','').lower() if len(x.split(' ')[:-1]) >0 else 0)
all_data['name_title'] = all_data.Name.apply(lambda x: x.split(',')[1].split('.')[0].strip())
all_data["AgeBucket"] = all_data["Age"] // 15 * 15
all_data["RelativesOnboard"] = all_data["SibSp"] + all_data["Parch"]
all_data['alone'] = (all_data["RelativesOnboard"] == 0)

Let's join the numerical and categorical pipelines:

In [21]:
all_data = all_data.astype({"AgeBucket": 'object', "alone": 'object'})

In [22]:
from sklearn.compose import ColumnTransformer

num_attribs = ["Age", "SibSp", "Parch", "Fare","RelativesOnboard"]
cat_attribs = ["Pclass", "Sex", "Embarked", "AgeBucket", 'alone','ticket_letters', 'name_title']

preprocess_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs),
    ])
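
To make the combined output easier to picture, the sketch below applies the same kind of `ColumnTransformer` to a four-row toy frame. The data, the `toy_preprocess` name and the use of `sparse_threshold=0` (to force a dense result regardless of the encoder's output type) are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})

toy_preprocess = ColumnTransformer(
    [
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ]), ["Age", "Fare"]),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Sex", "Embarked"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_toy = toy_preprocess.fit_transform(toy)
print(X_toy.shape)
```

The two numerical columns stay as two scaled columns, while each categorical column expands into one column per category seen during fitting.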

In [23]:
all_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   PassengerId       1309 non-null   int64  
 1   Survived          891 non-null    float64
 2   Pclass            1309 non-null   int64  
 3   Name              1309 non-null   object 
 4   Sex               1309 non-null   object 
 5   Age               1046 non-null   float64
 6   SibSp             1309 non-null   int64  
 7   Parch             1309 non-null   int64  
 8   Ticket            1309 non-null   object 
 9   Fare              1308 non-null   float64
 10  Cabin             295 non-null    object 
 11  Embarked          1307 non-null   object 
 12  train_test        1309 non-null   int64  
 13  ticket_letters    1309 non-null   object 
 14  name_title        1309 non-null   object 
 15  AgeBucket         1046 non-null   object 
 16  RelativesOnboard  1309 non-null   int64  
 

In [24]:
#drop null 'embarked' rows. Only 2 instances of this in training and 0 in test 
all_data.dropna(subset=['Embarked'],inplace = True)

In [25]:
for col in all_data.columns:
    if all_data.dtypes[col] == "object":
        all_data = all_data.astype({col: 'str'})

        

In [26]:
#Split to train test again
train_data=all_data[all_data.train_test == 1]
test_data = all_data[all_data.train_test == 0]

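# Convert Pclass to a string so that pd.get_dummies() treats it as categorical
all_data.Pclass = all_data.Pclass.astype(str)

# Create dummy variables from the categories (OneHotEncoder is an alternative).
# Note: the X_train and X_test built here are overwritten by the pipeline output in the next cells.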
all_dummies = pd.get_dummies(all_data[["Age", "SibSp", "Parch", "Fare","RelativesOnboard","Pclass", "Sex", "Embarked", 'ticket_letters', 'name_title', "AgeBucket", 'alone','train_test']])
X_train = all_dummies[all_dummies.train_test == 1].drop(['train_test'], axis =1)
X_test = all_dummies[all_dummies.train_test == 0].drop(['train_test'], axis =1)

In [27]:
X_train = preprocess_pipeline.fit_transform(
   train_data[num_attribs + cat_attribs])
X_train.shape

(889, 68)

In [28]:
X_test = preprocess_pipeline.transform(
    test_data[num_attribs + cat_attribs])
X_test.shape

(418, 68)

In [29]:
all_data

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | train_test | ticket_letters | name_title | AgeBucket | RelativesOnboard | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | | S | 1 | a5 | Mr | 15.0 | 1 | False |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | pc | Mrs | 30.0 | 1 | False |
| 2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S | 1 | stono2 | Miss | 15.0 | 0 | True |
| 3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1 | 0 | Mrs | 30.0 | 1 | False |
| 4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | | S | 1 | 0 | Mr | 30.0 | 0 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 1305 | | 3 | Spector, Mr. Woolf | male | | 0 | 0 | A.5. 3236 | 8.0500 | | S | 0 | a5 | Mr | | 0 | True |
| 414 | 1306 | | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C | 0 | pc | Dona | 30.0 | 0 | True |
| 415 | 1307 | | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | | S | 0 | sotonoq | Mr | 30.0 | 0 | True |
| 416 | 1308 | | 3 | Ware, Mr. Frederick | male | | 0 | 0 | 359309 | 8.0500 | | S | 0 | 0 | Mr | | 0 | True |


Let's not forget the labels:

In [30]:
y_train = train_data["Survived"]

In [31]:
X_train.shape

(889, 68)

# Model Building

In [32]:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

Starting with **Naive Bayes**:

In [33]:
gnb = GaussianNB()
cv = cross_val_score(gnb,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

[0.70224719 0.46629213 0.41011236 0.42696629 0.45762712]
0.4926490192344316


**Logistic Regression**

In [34]:
lr = LogisticRegression(max_iter = 2000)
cv = cross_val_score(lr,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

[0.8258427  0.83707865 0.78089888 0.80337079 0.85875706]
0.8211896146765696


**Decision Trees**

In [35]:
dt = tree.DecisionTreeClassifier(random_state = 1)
cv = cross_val_score(dt,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

[0.75280899 0.78651685 0.7752809  0.7752809  0.78531073]
0.775039674982543


**K-Neighbors Classifier**

In [36]:
knn = KNeighborsClassifier()
cv = cross_val_score(knn,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

[0.76404494 0.78089888 0.79213483 0.79775281 0.84180791]
0.7953278740557355


**Random Forest**

In [37]:
rf = RandomForestClassifier(random_state = 1)
cv = cross_val_score(rf,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

[0.79775281 0.79775281 0.82022472 0.76966292 0.84180791]
0.8054402336062972


**SVC**

In [38]:
svc = SVC(probability = True)
cv = cross_val_score(svc,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

[0.84269663 0.82022472 0.8258427  0.80337079 0.8700565 ]
0.8324382657271631


**So the Support Vector Classifier performs best, with a mean cross-validation accuracy of about 83%.**
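
The repeated cells above all follow the same protocol, which can be condensed into a loop. The sketch below uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, so its scores say nothing about the Titanic models; the names `X_demo`, `y_demo` and `candidates` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (X_train, y_train); not the Titanic data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=2000),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
}

# Mean 5-fold accuracy per model, same protocol as the cells above
mean_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in sorted(mean_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```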

Now, let's tune our models with **Grid Search** and **Randomized Search**.

# Grid Search

In [39]:
def clf_performance(classifier, model_name):
    print(model_name)
    print('Best Score: ' + str(classifier.best_score_))
    print('Best Parameters: ' + str(classifier.best_params_))

**1. SVC**

In [40]:
from sklearn.model_selection import GridSearchCV

svc = SVC(probability = True)
param_grid = tuned_parameters = [{'kernel': ['rbf'], 'gamma': [0.0001,0.0003,0.001,0.003,0.01,0.03,.1,.3,1,3,10],
                                  'C': [1, 3, 10, 30, 100, 300, 1000]},
                                 {'kernel': ['poly'], 'degree' : [2,3,4,5], 'C':[ 1, 3, 10, 30, 100,300, 1000]}]
clf_svc = GridSearchCV(svc, param_grid = param_grid, cv = 3, verbose = True, n_jobs = -1)
best_clf_svc = clf_svc.fit(X_train,y_train)
clf_performance(best_clf_svc,'SVC')

Fitting 3 folds for each of 105 candidates, totalling 315 fits
SVC
Best Score: 0.8357827524494191
Best Parameters: {'C': 1, 'gamma': 0.03, 'kernel': 'rbf'}


**2. Logistic Regression**

In [41]:
lr = LogisticRegression()
param_grid = {'max_iter' : [2000],
              'penalty' : ['l1', 'l2','elasticnet'],
              'C' : np.logspace(-4, 4, 20),
              'solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}

clf_lr = GridSearchCV(lr, param_grid = param_grid, cv = 3, verbose = True, n_jobs = -1)
best_clf_lr = clf_lr.fit(X_train,y_train)
clf_performance(best_clf_lr,'Logistic Regression')

Fitting 3 folds for each of 300 candidates, totalling 900 fits
Logistic Regression
Best Score: 0.82339915673249
Best Parameters: {'C': 1.623776739188721, 'max_iter': 2000, 'penalty': 'l2', 'solver': 'newton-cg'}


The rest of the output was a warning listing the mean test score of every candidate. The `nan` entries are candidates that failed to fit because the penalty/solver combination is not supported (for example `l1` with `newton-cg`).
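
One way to avoid failed fits in a grid like the one above is to split it into sub-grids that pair each solver only with penalties it supports. The sketch below counts the resulting candidates with `ParameterGrid`; `valid_param_grid` is an illustrative name, and `elasticnet` is left out because it additionally needs an `l1_ratio` value.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

C_values = np.logspace(-4, 4, 20)

# Each sub-grid pairs solvers only with penalties they support
valid_param_grid = [
    {"solver": ["newton-cg", "lbfgs", "sag"], "penalty": ["l2"],
     "C": C_values, "max_iter": [2000]},
    {"solver": ["liblinear", "saga"], "penalty": ["l1", "l2"],
     "C": C_values, "max_iter": [2000]},
]

n_candidates = len(ParameterGrid(valid_param_grid))
print(n_candidates)
```

A list of dictionaries like this can be passed directly as `param_grid` to `GridSearchCV`, which then skips the unsupported pairs instead of scoring them as `nan`.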

**3. Random Forest**

In [42]:
rf = RandomForestClassifier(random_state = 1)
param_grid = {'n_estimators': [400, 450, 500, 550],
              'criterion': ['gini', 'entropy'],
              'bootstrap': [True],
              'max_depth': [15, 20, 25],
              'max_features': ['auto', 'sqrt', 10],
              'min_samples_leaf': [2, 3],
              'min_samples_split': [2, 3]}

clf_rf = GridSearchCV(rf, param_grid = param_grid, cv = 3, verbose = True, n_jobs = -1)
best_clf_rf = clf_rf.fit(X_train,y_train)
clf_performance(best_clf_rf,'Random Forest')

Fitting 3 folds for each of 288 candidates, totalling 864 fits
Random Forest
Best Score: 0.8234294900961568
Best Parameters: {'bootstrap': True, 'criterion': 'entropy', 'max_depth': 25, 'max_features': 10, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 500}


**4. K-Neighbors Classifier**

In [43]:
knn = KNeighborsClassifier()
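# NOTE: the tuples below are searched as literal values (1, 10, 1 and 20, 40, 1), not as ranges;
# use something like list(range(1, 10)) to scan consecutive values instead
param_grid = {'n_neighbors': (1,10, 1),
              'leaf_size': (20,40,1),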
              'weights' : ['uniform', 'distance'],
              'algorithm' : ['auto', 'ball_tree','kd_tree'],
              'metric': ['minkowski', 'chebyshev'],
              'p' : [1,2]}
clf_knn = GridSearchCV(knn, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
best_clf_knn = clf_knn.fit(X_train,y_train)
clf_performance(best_clf_knn,'KNN')

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
KNN
Best Score: 0.8122008506316257
Best Parameters: {'algorithm': 'auto', 'leaf_size': 20, 'metric': 'minkowski', 'n_neighbors': 10, 'p': 2, 'weights': 'uniform'}
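
The candidate count reported above can be reproduced by expanding the grid by hand. The sketch also shows which `n_neighbors` values are really explored, compared with what a range would cover (that a range was intended is only an assumption).

```python
from itertools import product

# The KNN grid exactly as written in the cell above
knn_grid = {
    "n_neighbors": (1, 10, 1),
    "leaf_size": (20, 40, 1),
    "weights": ["uniform", "distance"],
    "algorithm": ["auto", "ball_tree", "kd_tree"],
    "metric": ["minkowski", "chebyshev"],
    "p": [1, 2],
}

# A grid search tries every combination of the listed values
n_combinations = len(list(product(*knn_grid.values())))
print(n_combinations)

# Distinct n_neighbors values actually explored vs. a real range
explored = sorted(set(knn_grid["n_neighbors"]))
as_range = list(range(1, 10))
print(explored, as_range)
```

Only two distinct neighbor counts are tried, and the duplicated `1` makes the search refit identical candidates.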


# Randomized Search

**1. SVC**

In [44]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV 
from scipy import stats
from sklearn.metrics import make_scorer, roc_auc_score 

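# NOTE: make_scorer(roc_auc_score) computes the AUC from predict() labels, not from scores
auc = make_scorer(roc_auc_score)
# stats.uniform(loc, scale) samples uniformly from [loc, loc + scale]
rand_list = {"C": stats.uniform(400, 800),        # C in [400, 1200]
             "gamma": stats.uniform(0.001, 0.05)  # gamma in [0.001, 0.051]
            }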
              
rand_search = RandomizedSearchCV(svc, param_distributions = rand_list, n_iter = 20, n_jobs = 4, cv = 3, random_state = 2017, scoring = auc) 
rand_clf_svc = rand_search.fit(X_train,y_train) 
clf_performance(rand_clf_svc,'SVC')

SVC
Best Score: 0.8126404615921059
Best Parameters: {'C': 778.7176653185328, 'gamma': 0.0013074632570218782}
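
A detail worth checking is the range these distributions cover: `scipy.stats.uniform` takes `loc` and `scale`, not a lower and an upper bound. The short sketch below inspects the two distributions used above (the variable names are illustrative):

```python
from scipy import stats

c_dist = stats.uniform(400, 800)         # loc=400, scale=800
gamma_dist = stats.uniform(0.001, 0.05)  # loc=0.001, scale=0.05

# ppf(0) and ppf(1) are the lower and upper ends of the support
c_low, c_high = c_dist.ppf(0), c_dist.ppf(1)
gamma_low, gamma_high = gamma_dist.ppf(0), gamma_dist.ppf(1)

print(c_low, c_high)
print(gamma_low, gamma_high)

# Draw a few candidate values of C, as a randomized search would
samples = c_dist.rvs(size=5, random_state=2017)
print(samples)
```

So `C` is drawn from 400 to 1200, far from the `C = 1` that the grid search preferred, which may explain part of the weaker result.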


In [45]:
rand_search.best_estimator_

SVC(C=778.7176653185328, gamma=0.0013074632570218782, probability=True)

In [46]:
rand_search.best_score_

0.8126404615921059

**2. Random Forest**

In [47]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 80, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [2,4]
# Minimum number of samples required to split a node
min_samples_split = [2, 5]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2]
# Method of selecting samples for training each tree
bootstrap = [True, False]

In [48]:
# Create the param grid

param_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(param_grid)

{'n_estimators': [10, 17, 25, 33, 41, 48, 56, 64, 72, 80], 'max_features': ['auto', 'sqrt'], 'max_depth': [2, 4], 'min_samples_split': [2, 5], 'min_samples_leaf': [1, 2], 'bootstrap': [True, False]}


In [49]:
rf = RandomForestClassifier()
rand_search = RandomizedSearchCV(rf, param_distributions = param_grid , n_iter = 20, n_jobs = 4, cv = 3, random_state = 2017, scoring = auc) 
rand_clf_rf = rand_search.fit(X_train,y_train) 
clf_performance(rand_clf_rf,'Random Forest')

Random Forest
Best Score: 0.8067666324765441
Best Parameters: {'n_estimators': 56, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 4, 'bootstrap': False}
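
The scorer used in both randomized searches is built with `make_scorer(roc_auc_score)`, which by default feeds the metric the hard labels returned by `predict()`. The toy example below (invented labels and scores) shows how that differs from an AUC computed on continuous scores:

```python
from sklearn.metrics import roc_auc_score

# Toy labels and decision scores, invented for illustration
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.45, 0.4, 0.7, 0.9]

# Thresholding at 0.5 discards the ranking information
hard_labels = [int(s >= 0.5) for s in scores]

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(auc_from_scores, auc_from_labels)
```

Passing `scoring="roc_auc"` to the search would use the model's scores or probabilities instead. Either way, AUC values are not on the same scale as the accuracy values reported by the grid searches.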


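The **Randomized Search** did not improve on the grid search: both best scores are around 0.81. Keep in mind that the two are not directly comparable, since the randomized searches were scored with ROC AUC (computed on predicted labels) while the grid searches used accuracy. Let's try to combine some models: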


# Combining Models

In [50]:
best_svc = best_clf_svc.best_estimator_
best_rf = best_clf_rf.best_estimator_
best_lr = best_clf_lr.best_estimator_
best_knn = best_clf_knn.best_estimator_

**1. SVC + RF**

In [51]:
from sklearn.ensemble import VotingClassifier
combined_svc_rf=VotingClassifier(estimators=[('SVC', best_svc), ('Random Forest', best_rf)], 
                       voting='soft',weights=[1.5,1.75])

In [52]:
combined_svc_rf.fit(X_train,y_train)
cv = cross_val_score(combined_svc_rf,X_train,y_train,cv=3)
print(cv)
print(cv.mean())

[0.82154882 0.84459459 0.83783784]
0.8346604179937512
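
With `voting='soft'`, the ensemble averages the class probabilities predicted by its members, weighted by `weights`, and picks the class with the highest average. This is why the SVC is created with `probability=True`. A minimal NumPy sketch with made-up probabilities:

```python
import numpy as np

# Hypothetical class probabilities [P(died), P(survived)] for two passengers
proba_svc = np.array([[0.7, 0.3], [0.4, 0.6]])
proba_rf = np.array([[0.4, 0.6], [0.3, 0.7]])

weights = [1.5, 1.75]  # same weights as the SVC + RF ensemble above

# Soft voting: weighted average of the probabilities, then argmax
avg_proba = np.average([proba_svc, proba_rf], axis=0, weights=weights)
predictions = avg_proba.argmax(axis=1)

print(avg_proba)
print(predictions)
```

For the first passenger the two models disagree, and the averaged probability decides the outcome.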


**2. SVC + RF + LR**

In [53]:
from sklearn.ensemble import VotingClassifier
combined_svc_rf_lr=VotingClassifier(estimators=[('SVC', best_svc), ('Random Forest', best_rf), ('Logistic Regression', best_lr)], 
                       voting='soft',weights=[1.75,2,1])
combined_svc_rf_lr.fit(X_train,y_train)
cv = cross_val_score(combined_svc_rf_lr,X_train,y_train,cv=3)
print(cv)
print(cv.mean())

[0.81818182 0.84459459 0.83445946]
0.8324119574119574


**3. SVC + RF + KNN**

In [54]:
from sklearn.ensemble import VotingClassifier
combined_svc_rf_knn=VotingClassifier(estimators=[('SVC', best_svc), ('Random Forest', best_rf), ('KNN', best_knn)], 
                       voting='soft',weights=[1.5,2,1.2])
combined_svc_rf_knn.fit(X_train,y_train)
cv = cross_val_score(combined_svc_rf_knn,X_train,y_train,cv=3)
print(cv)
print(cv.mean())

[0.81818182 0.84121622 0.83445946]
0.8312858312858312


**4. SVC + RF + KNN + LR**

In [55]:
from sklearn.ensemble import VotingClassifier
combined_svc_rf_knn_lr=VotingClassifier(estimators=[('SVC', best_svc), ('Random Forest', best_rf), ('KNN', best_knn), ('Logistic Regression', best_lr)], 
                       voting='soft',weights=[2,2,1,1])
combined_svc_rf_knn_lr.fit(X_train,y_train)
cv = cross_val_score(combined_svc_rf_knn_lr,X_train,y_train,cv=3)
print(cv)
print(cv.mean())

[0.82154882 0.83783784 0.83108108]
0.8301559134892469


# Making Predictions 

Finally, we will use the tuned SVC model and the combined models to make predictions on the test set and see which one achieves the best score on www.kaggle.com.

In [56]:
X_test.shape

(418, 68)

**1. SVC tuned model**

In [57]:
best_svc.fit(X_train,y_train)
y_hat_svc = best_svc.predict(X_test).astype(int)

In [58]:
y_hat_svc

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

In [59]:
final_data = {'PassengerId': test_data.PassengerId, 'Survived': y_hat_svc}
submission = pd.DataFrame(data=final_data)
submission

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
| ... | ... | ... |
| 413 | 1305 | 0 |
| 414 | 1306 | 1 |
| 415 | 1307 | 0 |
| 416 | 1308 | 0 |


In [60]:
submission.to_csv('submission_svc.csv', index =False)
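
Before uploading, it can be worth sanity-checking the frame that was just written. The helper below is a hypothetical sketch (`check_submission` is not part of the notebook); it only verifies the two columns, the row count, unique ids and binary predictions.

```python
import pandas as pd


def check_submission(df: pd.DataFrame, expected_rows: int) -> None:
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(df.columns) == ["PassengerId", "Survived"]
    assert len(df) == expected_rows
    assert df["PassengerId"].is_unique
    assert set(df["Survived"].unique()) <= {0, 1}


# Tiny stand-in for the real submission frame
demo = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
check_submission(demo, expected_rows=3)
```

On the real data, the call would be `check_submission(submission, expected_rows=len(test_data))`.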

In [61]:
# Kaggle score: 77.99%

**2. SVC+RF combined model**

In [62]:
combined_svc_rf.fit(X_train,y_train)
y_hat_combined_svc_rf = combined_svc_rf.predict(X_test).astype(int)

In [63]:
final_data = {'PassengerId': test_data.PassengerId, 'Survived': y_hat_combined_svc_rf}
submission = pd.DataFrame(data=final_data)
submission

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
| ... | ... | ... |
| 413 | 1305 | 0 |
| 414 | 1306 | 1 |
| 415 | 1307 | 0 |
| 416 | 1308 | 0 |


In [64]:
submission.to_csv('submission_combined_svc_rf.csv', index =False)

In [65]:
# Kaggle score: 78.70%

**3. SVC + RF + LR**

In [66]:
combined_svc_rf_lr.fit(X_train,y_train)
y_hat_combined_svc_rf_lr = combined_svc_rf_lr.predict(X_test).astype(int)

In [67]:
final_data = {'PassengerId': test_data.PassengerId, 'Survived': y_hat_combined_svc_rf_lr}
submission = pd.DataFrame(data=final_data)
submission

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
| ... | ... | ... |
| 413 | 1305 | 0 |
| 414 | 1306 | 1 |
| 415 | 1307 | 0 |
| 416 | 1308 | 0 |


In [68]:
submission.to_csv('submission_combined_svc_rf_lr.csv', index =False)

In [69]:
# Kaggle score: 78.46%

**4. SVC + RF + KNN**

In [70]:
combined_svc_rf_knn.fit(X_train,y_train)
y_hat_combined_svc_rf_knn = combined_svc_rf_knn.predict(X_test).astype(int)

In [71]:
final_data = {'PassengerId': test_data.PassengerId, 'Survived': y_hat_combined_svc_rf_knn}
submission = pd.DataFrame(data=final_data)
submission

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
| ... | ... | ... |
| 413 | 1305 | 0 |
| 414 | 1306 | 1 |
| 415 | 1307 | 0 |
| 416 | 1308 | 0 |


In [72]:
submission.to_csv('submission_combined_svc_rf_knn.csv', index =False)

In [73]:
# Kaggle score: 78.947% (top 9%)

**5. SVC + RF + KNN + LR**

In [74]:
combined_svc_rf_knn_lr.fit(X_train,y_train)
y_hat_combined_svc_rf_knn_lr = combined_svc_rf_knn_lr.predict(X_test).astype(int)

In [75]:
final_data = {'PassengerId': test_data.PassengerId, 'Survived': y_hat_combined_svc_rf_knn_lr}
submission = pd.DataFrame(data=final_data)
submission

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
| ... | ... | ... |
| 413 | 1305 | 0 |
| 414 | 1306 | 1 |
| 415 | 1307 | 0 |
| 416 | 1308 | 0 |


In [76]:
submission.to_csv('submission_combined_svc_rf_knn_lr.csv', index =False)

In [77]:
# Kaggle score: 78.70%

# Results

After trying many models, tuning them and combining them, we achieved a score of **78.947% (top 9%)** in the **Titanic competition** on Kaggle with **the combined SVC, Random Forest and KNN model**.