## Titanic Disaster Dataset Feature Engineering

This notebook is a first pass at the Titanic dataset, aimed at improving my model scores through feature engineering.
It is not an extensive EDA notebook; plenty of existing notebooks already explore the dataset in depth. The plan:
* Impute missing values
* Combine existing features to create new ones
* Build on the feature engineering step by step to improve the scores

In [2]:
# Import the required packages here

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
# Import the datasets here
train = pd.read_csv('/kaggle/input/titanic/train.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')

In [5]:
print(train.shape)
print(test.shape)

(891, 12)
(418, 11)


In [6]:
# Append the train and test dataframes for data cleaning

train['test_flag'] = 0
test['test_flag'] = 1
df_combined = pd.concat([train, test], axis=0, copy=True)
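
The combine-then-split pattern used above can be sketched on toy data. The frames and values below are made up for illustration, and `ignore_index=True` is an addition of this sketch (the notebook's `concat` keeps the original row labels, so labels 0..417 appear twice in `df_combined`).

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (values are made up)
toy_train = pd.DataFrame({"PassengerId": [1, 2, 3],
                          "Survived": [0, 1, 1],
                          "Fare": [7.25, 71.28, 7.92]})
toy_test = pd.DataFrame({"PassengerId": [4, 5], "Fare": [8.05, None]})

# Flag each row so the two parts can be separated again later
toy_train["test_flag"] = 0
toy_test["test_flag"] = 1

# ignore_index=True gives the combined frame a unique 0..n-1 index
combined = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)

# Cleaning done here is applied to both parts at once
combined["Fare"] = combined["Fare"].fillna(combined["Fare"].median())

# Recover the two parts using the flag; .copy() makes them independent frames
train_part = combined[combined["test_flag"] == 0].copy()
test_part = combined[combined["test_flag"] == 1].copy()

print(train_part.shape, test_part.shape)
```

Rows that came from the test frame have no `Survived` value, which is why that column shows up as partly missing in the combined frame.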

### High level EDA

In [7]:
# Check the % missing values in all the columns of the combined (train + test) set
print(df_combined.isnull().sum()*100/df_combined.shape[0])

# Ignore the missing values for the output class 'Survived' as they are from the test set

PassengerId     0.000000
Survived       31.932773
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            20.091673
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.076394
Cabin          77.463713
Embarked        0.152788
test_flag       0.000000
dtype: float64
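
The same percentage can be wrapped in a small helper that also sorts the columns by how much is missing. This is a minimal sketch on made-up data; `missing_percent` is a hypothetical name, not something defined in this notebook. As a sanity check on the output above, the 31.93% for `Survived` is simply the 418 test rows out of 1309 combined rows.

```python
import numpy as np
import pandas as pd

def missing_percent(df):
    """Return the percentage of missing values per column, largest first."""
    return (df.isnull().mean() * 100).sort_values(ascending=False)

# Made-up frame with a few gaps
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

pct = missing_percent(toy)
print(pct)
```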


In [87]:
# Select the working columns. Cabin is kept despite ~77% missing values because
# its first letter is used later. .copy() avoids SettingWithCopyWarning on later assignments

df_subset = df_combined[['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Name', 'Embarked', 'test_flag']].copy()

### Data Cleaning and missing value imputation for the columns

In [88]:
# Cleaning and level modifications for the categorical features

# Extract the title (the word followed by a period) from each passenger name
df_subset['Title'] = df_subset['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

pd.crosstab(df_subset['Title'], df_subset['Sex'])



Title,female,male
Capt,0,1
Col,0,4
Countess,1,0
Don,0,1
Dona,1,0
Dr,1,7
Jonkheer,0,1
Lady,1,0
Major,0,2
Master,0,61
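
To see what the title regex picks up, here is a small sketch on a few names written in the dataset's `Surname, Title. Given names` format. The pattern matches a space, a run of letters, and a literal period, and captures the letters.

```python
import pandas as pd

# Sample names in the Titanic "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# A space, then letters, then a literal dot; the capture group is the title
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

Note that "the" in the last name is skipped because it is not followed by a period, so the first match is "Countess".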


In [89]:
# Group infrequent titles under 'Rare' and merge spelling variants
rare_titles = ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major',
               'Rev', 'Sir', 'Jonkheer', 'Dona']
df_subset['Title'] = df_subset['Title'].replace(rare_titles, 'Rare')
df_subset['Title'] = df_subset['Title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})
    
df_subset[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()


,Title,Survived
0,Master,0.575
1,Miss,0.702703
2,Mr,0.156673
3,Mrs,0.793651
4,Rare,0.347826
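
A variant of the title grouping, sketched on made-up input: instead of listing every rare title, keep a short list of common titles and send everything else to 'Rare'. This is an alternative, not what the cell above does; it differs in that any unexpected title also ends up as 'Rare'. The names `synonyms` and `common` are assumptions of this sketch.

```python
import pandas as pd

titles = pd.Series(["Mr", "Mlle", "Dr", "Ms", "Mme", "Master", "Col", "Miss"])

# Spelling variants that should be merged into a common title
synonyms = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}
# Titles frequent enough to keep as their own level
common = {"Mr", "Mrs", "Miss", "Master"}

merged = titles.replace(synonyms)
# Anything that is still not one of the common titles becomes 'Rare'
grouped = merged.where(merged.isin(common), "Rare")
print(grouped.tolist())
```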


In [90]:
cols_new = ['Title']

for col in cols_new:
    df_subset[col] = pd.factorize(df_subset[col])[0] + 1



In [91]:
df_subset = df_subset.drop(['Name'], axis = 1)

In [92]:
cols_new = ['Sex']

for col in cols_new:
    df_subset[col] = pd.factorize(df_subset[col])[0] + 1

In [93]:
df_subset['Cabin'] = df_subset['Cabin'].replace(np.nan, 'U')
df_subset['Cabin_Class'] = df_subset['Cabin'].astype(str).str[0]

In [94]:
cols_new = ['Cabin_Class']

for col in cols_new:
    df_subset[col] = pd.factorize(df_subset[col])[0] + 1
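
The `Cabin_Class` idea in isolation, as a minimal sketch on made-up cabin values: missing cabins are marked 'U' (unknown) and every other value is reduced to its leading deck letter, including entries that list several cabins.

```python
import numpy as np
import pandas as pd

cabin = pd.Series(["C85", np.nan, "E46", "B57 B59 B63 B66", np.nan])

# Missing cabins become 'U'; otherwise keep only the first character (the deck)
deck = cabin.fillna("U").str[0]
print(deck.tolist())
```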

In [95]:
df_subset = df_subset.drop(['Ticket', 'Cabin'], axis = 1)

In [96]:
# Impute missing ages with the median; assign back instead of chained inplace=True
median_age = df_subset["Age"].median()
df_subset["Age"] = df_subset["Age"].fillna(median_age)

In [97]:
# Bin Age into ordinal bands 0-4; .loc avoids chained assignment
df_subset.loc[df_subset['Age'] <= 12, 'Age'] = 0
df_subset.loc[(df_subset['Age'] > 12) & (df_subset['Age'] <= 30), 'Age'] = 1
df_subset.loc[(df_subset['Age'] > 30) & (df_subset['Age'] <= 50), 'Age'] = 2
df_subset.loc[(df_subset['Age'] > 50) & (df_subset['Age'] <= 65), 'Age'] = 3
df_subset.loc[df_subset['Age'] > 65, 'Age'] = 4


In [98]:
df_subset['Age'] = df_subset['Age'].astype(int)
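
The age bands above can also be produced in one call with `pd.cut`, which avoids overwriting the column step by step. A minimal sketch on made-up ages, using the same thresholds (12, 30, 50, 65) and right-closed intervals:

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.42, 12, 12.5, 30, 31, 50, 65, 66, 80])

# Right-closed intervals: (-inf, 12], (12, 30], (30, 50], (50, 65], (65, inf)
edges = [-np.inf, 12, 30, 50, 65, np.inf]

# labels=False returns the integer band number instead of an interval label
age_band = pd.cut(ages, bins=edges, labels=False)
print(age_band.tolist())
```

Because the bands are computed from the original values in a single pass, there is no risk of an already-coded value being picked up by a later condition.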

In [99]:
# Family size = siblings/spouses + parents/children + the passenger themselves
df_subset['FamilySize'] = df_subset['SibSp'] + df_subset['Parch'] + 1

In [100]:
df_subset = df_subset.drop(['SibSp', 'Parch'], axis = 1)
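
A tiny worked example of the `FamilySize` arithmetic on made-up rows, in case the `+ 1` is unclear: it counts the passenger, so someone travelling alone gets a family size of 1.

```python
import pandas as pd

toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Siblings/spouses + parents/children + the passenger
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
print(toy["FamilySize"].tolist())
```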

In [101]:
df_subset['Fare'].describe()

count    1308.000000
mean       33.295479
std        51.758668
min         0.000000
25%         7.895800
50%        14.454200
75%        31.275000
max       512.329200
Name: Fare, dtype: float64

In [102]:
# Impute the missing fare with the median; assign back instead of chained inplace=True
df_subset['Fare'] = df_subset['Fare'].fillna(df_subset['Fare'].median())

In [103]:
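# Bin Fare into four ordinal bands (thresholds close to the quartiles); .loc avoids chained assignment
df_subset.loc[df_subset['Fare'] <= 7.91, 'Fare'] = 0
df_subset.loc[(df_subset['Fare'] > 7.91) & (df_subset['Fare'] <= 14.454), 'Fare'] = 1
df_subset.loc[(df_subset['Fare'] > 14.454) & (df_subset['Fare'] <= 31), 'Fare'] = 2
df_subset.loc[df_subset['Fare'] > 31, 'Fare'] = 3
df_subset['Fare'] = df_subset['Fare'].astype(int)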

df_subset.head()



,PassengerId,Survived,Pclass,Sex,Age,Fare,Embarked,test_flag,Title,Cabin_Class,FamilySize
0,1,0.0,3,1,1,0,S,0,1,1,2
1,2,1.0,1,2,2,3,C,0,2,2,2
2,3,1.0,3,2,1,1,S,0,3,1,1
3,4,1.0,1,2,2,3,S,0,2,2,2
4,5,0.0,3,1,2,1,S,0,1,1,1
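
The fare thresholds used above (7.91, 14.454, 31) sit close to the quartiles reported by `describe()`. If we wanted the cut points derived from the data rather than hard-coded, `pd.qcut` is one option. This is a sketch of that alternative on made-up values, not the method used in this notebook.

```python
import pandas as pd

# Made-up fares; eight evenly spaced values keep the quartiles easy to check
fares = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# q=4 splits at the 25th, 50th and 75th percentiles; labels=False gives band numbers
fare_band = pd.qcut(fares, q=4, labels=False)
print(fare_band.tolist())
```

With quantile-based bins each band holds roughly the same number of passengers, which is the property the hand-picked thresholds approximate.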


In [104]:
# mode() returns a Series, so take its first element to fill with a scalar
mode_embarked = df_subset["Embarked"].mode()[0]
df_subset["Embarked"] = df_subset["Embarked"].fillna(mode_embarked)
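
One thing to watch when imputing with the mode: `Series.mode()` returns a Series, and `fillna` with a Series aligns on index labels instead of filling every gap. A minimal sketch on made-up values shows the difference between passing the Series and passing the scalar.

```python
import numpy as np
import pandas as pd

embarked = pd.Series(["S", "C", np.nan, "S", np.nan, "Q"])

mode_series = embarked.mode()       # a Series with a single entry at index 0
mode_value = mode_series.iloc[0]    # the scalar 'S'

# fillna with a Series aligns on index labels, so rows 2 and 4 stay missing
filled_by_series = embarked.fillna(mode_series)
# fillna with a scalar fills every missing row
filled_by_scalar = embarked.fillna(mode_value)

print(filled_by_series.isna().sum(), filled_by_scalar.isna().sum())
```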

In [105]:
cols_new = ['Embarked']

for col in cols_new:
    df_subset[col] = pd.factorize(df_subset[col])[0] + 1
    
# df_subset = df_subset.drop(['Embarked'], axis = 1)
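
For reference, this is how `pd.factorize` assigns the integer codes used throughout this notebook. A small sketch on made-up port values: codes follow the order of first appearance, and a missing value gets the sentinel -1, which the `+ 1` shift turns into 0.

```python
import numpy as np
import pandas as pd

ports = pd.Series(["S", "C", "S", "Q", np.nan])

# codes: integer label per row; uniques: the distinct values in order of appearance
codes, uniques = pd.factorize(ports)
print(codes.tolist(), list(uniques))

# After the +1 shift, a value that is still missing is coded 0
print((codes + 1).tolist())
```

Keeping `uniques` around is handy for mapping the codes back to the original labels later.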

In [107]:
df_subset.head()

,PassengerId,Survived,Pclass,Sex,Age,Fare,Embarked,test_flag,Title,Cabin_Class,FamilySize
0,1,0.0,3,1,1,0,1,0,1,1,2
1,2,1.0,1,2,2,3,2,0,2,2,2
2,3,1.0,3,2,1,1,1,0,3,1,1
3,4,1.0,1,2,2,3,1,0,2,2,2
4,5,0.0,3,1,2,1,1,0,1,1,1


In [109]:
# .copy() makes independent frames so later column assignments do not warn
train_set = df_subset[df_subset['test_flag']==0].copy()
test_set = df_subset[df_subset['test_flag']==1].copy()

In [110]:
train_set['Survived'] = train_set['Survived'].astype(int)



In [111]:
print(train_set.shape)
print(test_set.shape)

(891, 11)
(418, 11)


In [112]:
test_set = test_set.drop(['Survived', 'test_flag'], axis = 1)
train_set = train_set.drop(['test_flag'], axis = 1)

In [113]:
train_set.head(5)

,PassengerId,Survived,Pclass,Sex,Age,Fare,Embarked,Title,Cabin_Class,FamilySize
0,1,0,3,1,1,0,1,1,1,2
1,2,1,1,2,2,3,2,2,2,2
2,3,1,3,2,1,1,1,3,1,1
3,4,1,1,2,2,3,1,2,2,2
4,5,0,3,1,2,1,1,1,1,1


In [114]:
from sklearn.model_selection import train_test_split

X = train_set[['Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'Title', 'Cabin_Class',
               'FamilySize']]
y = train_set[['Survived']]

In [115]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(596, 8) (596, 1)
(295, 8) (295, 1)
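
The split above is purely random. If we wanted the survived/not-survived ratio to be the same in both parts, `train_test_split` accepts a `stratify` argument. A minimal sketch on made-up arrays (not the Titanic data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up rows with a 60/40 class balance
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 12 + [1] * 8)

# stratify=y_demo keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```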


In [116]:
X_test.head()

,Pclass,Sex,Age,Fare,Embarked,Title,Cabin_Class,FamilySize
709,3,1,1,2,2,4,1,3
439,2,1,2,1,1,1,1,1
840,3,1,1,1,1,1,1,1
720,2,2,0,3,1,3,1,2
39,3,2,1,1,2,3,1,2


In [117]:
# Logistic Regression Model

from sklearn.linear_model import LogisticRegression

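logreg = LogisticRegression()
# ravel() turns the single-column y DataFrame into the 1-D array scikit-learn expects
logistic_model = logreg.fit(X_train, y_train.values.ravel())
predictions = logistic_model.predict(X_test)

# Note: this is accuracy on the training split, not on the held-out X_test
acc_logistic_regressions = round(logistic_model.score(X_train, y_train) * 100, 2)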
acc_logistic_regressions



78.69

In [None]:
# Other method to look at model performance

# from sklearn.metrics import classification_report

# print(classification_report(y_test, predictions))
# # test_set.head()
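
The accuracy values reported in these model cells come from `score(X_train, y_train)`, so they describe the training split. A sketch of how training and held-out accuracy could be compared side by side, together with the classification report mentioned above. It runs on synthetic data from `make_classification`, so the printed numbers say nothing about the Titanic models; `max_iter=1000` is an assumption of this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 8 features, like the feature matrix used here
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
held_out_pred = model.predict(X_te)

# Training accuracy is usually optimistic; the held-out value is the one to trust
train_acc = accuracy_score(y_tr, model.predict(X_tr))
test_acc = accuracy_score(y_te, held_out_pred)
print(f"train accuracy: {train_acc:.3f}, held-out accuracy: {test_acc:.3f}")

# Per-class precision, recall and F1 on the held-out part
print(classification_report(y_te, held_out_pred))
```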

In [118]:
# Decision Tree model
from sklearn.tree import DecisionTreeClassifier

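dec_tree = DecisionTreeClassifier()
decision_tree = dec_tree.fit(X_train, y_train.values.ravel())
prediction_tree = decision_tree.predict(X_test)

# Note: accuracy on the training split; an unpruned tree fits it closely,
# so this is likely to overstate performance on unseen data
acc_decision_tree = round(decision_tree.score(X_train, y_train) * 100, 2)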
acc_decision_tree

91.44

In [119]:
# Random Forest Model

from sklearn.ensemble import RandomForestClassifier

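rf = RandomForestClassifier(n_estimators=100)
# ravel() turns the single-column y DataFrame into the 1-D array scikit-learn expects
random_forest = rf.fit(X_train, y_train.values.ravel())
prediction_forest = random_forest.predict(X_test)

# Note: this is accuracy on the training split, not on the held-out X_test
acc_random_forest = round(random_forest.score(X_train, y_train) * 100, 2)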
acc_random_forest

  


91.44

In [120]:
# Naive Bayes

from sklearn.naive_bayes import GaussianNB

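gnb = GaussianNB()
# ravel() turns the single-column y DataFrame into the 1-D array scikit-learn expects
gaussian = gnb.fit(X_train, y_train.values.ravel())
prediction_naive = gaussian.predict(X_test)

# Note: this is accuracy on the training split, not on the held-out X_test
acc_gaussian = round(gaussian.score(X_train, y_train) * 100, 2)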
acc_gaussian



77.85

In [122]:
# Perceptron Model

from sklearn.linear_model import Perceptron

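pcp = Perceptron()
# ravel() turns the single-column y DataFrame into the 1-D array scikit-learn expects
perceptron = pcp.fit(X_train, y_train.values.ravel())
prediction_perceptron = perceptron.predict(X_test)

# Note: this is accuracy on the training split, not on the held-out X_test
acc_perceptron = round(perceptron.score(X_train, y_train) * 100, 2)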
acc_perceptron



76.17
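
To compare the models on equal footing, one option is k-fold cross-validation, which scores each model only on data it was not fitted on. A minimal sketch on synthetic data; the `candidates` dictionary and the hyperparameters in it are assumptions for illustration, and the printed scores say nothing about the Titanic results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 8 features
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "naive_bayes": GaussianNB(),
}

# Mean accuracy over 5 folds; each fold is scored on rows not used for fitting
cv_scores = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, est in candidates.items()
}

for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```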

### Use the section below to submit on the test set

In [124]:
test_feature = test_set[['Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'Title',
                         'Cabin_Class', 'FamilySize']]
test_id = test_set['PassengerId']
test_feature.head()

,Pclass,Sex,Age,Fare,Embarked,Title,Cabin_Class,FamilySize
0,3,1,2,0,3,1,1,1
1,3,2,2,0,1,2,1,2
2,2,1,3,1,3,1,1,1
3,3,1,1,1,1,1,1,1
4,3,2,1,1,1,2,1,3


In [125]:
test_predictions = random_forest.predict(test_feature)

In [126]:
output = pd.DataFrame({'PassengerId' : test_id, 'Survived': test_predictions})
output.head()

,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1


In [None]:
# Use this if the submission scores 0 (for example, if the predictions were saved as floats)

# test_predictions = random_forest.predict(test_feature).astype(int)

In [127]:
output.to_csv('/kaggle/working/submission.csv', index=False)
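
Before uploading, it can help to run a quick sanity check on the submission frame. The sketch below uses a hypothetical helper, `check_submission`, and a made-up three-row frame; for the real file the expected row count would be the 418 rows of the test set.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

def check_submission(sub, expected_rows):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(sub.columns) != ["PassengerId", "Survived"]:
        return ["columns must be exactly PassengerId, Survived"]
    problems = []
    if len(sub) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(sub)}")
    if sub["PassengerId"].duplicated().any():
        problems.append("duplicate PassengerId values")
    if not is_integer_dtype(sub["Survived"]):
        problems.append("Survived should be an integer column")
    elif not sub["Survived"].isin([0, 1]).all():
        problems.append("Survived should contain only 0 and 1")
    return problems

# Made-up frames: one well-formed, one with float predictions
good = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
bad = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0.0, 1.0, 0.0]})

print(check_submission(good, expected_rows=3))
print(check_submission(bad, expected_rows=3))
```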