# Titanic Dataset Prediction Model - Kaggle Competition

## Introduction

In this Jupyter notebook, I explore the Titanic dataset and build a predictive model for the Kaggle competition. The sinking of the Titanic is a tragic event in history, and this dataset offers an opportunity to apply machine learning techniques to predict whether each passenger survived, based on features such as age, sex, and ticket class.

### About the Dataset

The Titanic dataset contains information about passengers, including their age, gender, ticket class, and survival status. The challenge posed by the Kaggle competition is to develop models that accurately predict whether a passenger survived or not. The goal is to contribute a submission to the competition and gain insights into the factors influencing passenger survival.

### Notebook Structure

1. **Data Exploration:** I began by inspecting the dataset with structural summaries (`info()`, `head()`, `unique()`) to check column types, missing values, and the format of each feature.


2. **Data Preprocessing:** Before feeding the data into a machine learning model, I handled missing values, encoded categorical variables, and cast the extracted columns to numeric types.


3. **Feature Engineering:** I derived a few features from existing columns (ticket number, cabin letter and number, and passenger title). No further feature engineering was done; even so, the best model reaches about 82% accuracy on the held-out split.


4. **Model Building:** I trained four machine learning algorithms (Logistic Regression, Decision Tree, K-Nearest Neighbors, and Random Forest) to predict the survival status of passengers.


5. **Model Evaluation:** I compared the models by their accuracy on a held-out 33% split of the training data.


6. **Submission:** The final step involves preparing predictions for submission to the Kaggle competition.


Whether you are a beginner looking to learn or an experienced data scientist seeking inspiration, this notebook provides insights into the process of building a predictive model for a real-world problem.

Let's dive in and explore the Titanic dataset!

## Loading Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Loading Dataset

In [40]:
df_train=pd.read_csv('Downloads/train.csv')
df_test=pd.read_csv('Downloads/test.csv')

## Exploratory Data Analysis

In [41]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [42]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


In [43]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
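
The `info()` output reports non-null counts; the missing counts can also be computed directly. A minimal sketch on a small hypothetical frame (the values are made up for illustration and the name `toy` is an assumption, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_train
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of missing values per column, most incomplete column first
missing = toy.isna().sum().sort_values(ascending=False)
# Share of missing values per column
missing_share = toy.isna().mean().round(2)

print(missing)
print(missing_share)
```

Applied to `df_train`, the same two lines would list `Cabin`, `Age`, and `Embarked` as the incomplete columns, matching the `info()` output above.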


### Filling missing values

In [44]:
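# Assign the result back; fillna(inplace=True) on a column selection triggers warnings in newer pandas
df_train['Age'] = df_train['Age'].fillna(df_train['Age'].mean())
df_train['Cabin'] = df_train['Cabin'].fillna('N0')  # 'N0' is a placeholder for a missing cabin
# Note: mode() returns a Series, so fillna aligns it by index and the two missing
# Embarked values stay unfilled (889 non-null below); mode()[0] would fill them.
df_train['Embarked'] = df_train['Embarked'].fillna(df_train['Embarked'].mode())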
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        891 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
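
`Embarked` still shows 889 non-null values after the fill. A likely reason is that `mode()` returns a Series rather than a scalar, so `fillna` aligns it by index label. A minimal sketch of the difference, using hypothetical values rather than the real column:

```python
import numpy as np
import pandas as pd

# Hypothetical values, not taken from the dataset
embarked = pd.Series(["S", "C", np.nan, "S", np.nan])

# mode() returns a Series indexed from 0, so fillna aligns on index labels
# and only a missing value at label 0 could ever be filled
aligned = embarked.fillna(embarked.mode())

# Taking the first mode gives a scalar that fills every missing entry
filled = embarked.fillna(embarked.mode()[0])

print(aligned.isna().sum(), filled.isna().sum())
```

Using `mode()[0]` in the notebook would change the later `Embarked_*` dummy columns slightly, so the recorded outputs would need to be regenerated.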


In [45]:
df_test['Age'] = df_test['Age'].fillna(df_test['Age'].mean())
df_test['Cabin'] = df_test['Cabin'].fillna('N0')

### Data encoding

In [46]:
# Encode Sex as an integer: male -> 0, female -> 1
df_train["Sex"] = df_train["Sex"].map({"male": 0, "female": 1})

In [47]:
# Same encoding for the test set
df_test["Sex"] = df_test["Sex"].map({"male": 0, "female": 1})

In [48]:
df_train['Ticket'].unique()

array(['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450',
       '330877', '17463', '349909', '347742', '237736', 'PP 9549',
       '113783', 'A/5. 2151', '347082', '350406', '248706', '382652',
       '244373', '345763', '2649', '239865', '248698', '330923', '113788',
       '347077', '2631', '19950', '330959', '349216', 'PC 17601',
       'PC 17569', '335677', 'C.A. 24579', 'PC 17604', '113789', '2677',
       'A./5. 2152', '345764', '2651', '7546', '11668', '349253',
       'SC/Paris 2123', '330958', 'S.C./A.4. 23567', '370371', '14311',
       '2662', '349237', '3101295', 'A/4. 39886', 'PC 17572', '2926',
       '113509', '19947', 'C.A. 31026', '2697', 'C.A. 34651', 'CA 2144',
       '2669', '113572', '36973', '347088', 'PC 17605', '2661',
       'C.A. 29395', 'S.P. 3464', '3101281', '315151', 'C.A. 33111',
       'S.O.C. 14879', '2680', '1601', '348123', '349208', '374746',
       '248738', '364516', '345767', '345779', '330932', '113059',
       'SO/C 14885', '31012

### Extracting data from columns into new columns

In [49]:
df_train['Ticket_Num']=df_train['Ticket'].apply(lambda x: x.split(' ')[-1])
df_train['Ticket']=df_train['Ticket'].apply(lambda x: x.split(' ')[0:-1])
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Ticket_Num
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,[A/5],7.25,N0,S,21171
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,[PC],71.2833,C85,C,17599
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,[STON/O2.],7.925,N0,S,3101282
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,[],53.1,C123,S,113803
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,[],8.05,N0,S,373450


In [50]:
df_test['Ticket_Num']=df_test['Ticket'].apply(lambda x: x.split(' ')[-1])
df_test['Ticket']=df_test['Ticket'].apply(lambda x: x.split(' ')[0:-1])
df_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Ticket_Num
0,892,3,"Kelly, Mr. James",0,34.5,0,0,[],7.8292,N0,Q,330911
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",1,47.0,1,0,[],7.0,N0,S,363272
2,894,2,"Myles, Mr. Thomas Francis",0,62.0,0,0,[],9.6875,N0,Q,240276
3,895,3,"Wirz, Mr. Albert",0,27.0,0,0,[],8.6625,N0,S,315154
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",1,22.0,1,1,[],12.2875,N0,S,3101298
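
The two `apply` calls above split each ticket into a list of prefix tokens and a trailing number. As a hedged alternative, the sketch below does the same in one pass and returns a plain string prefix and an integer. The helper name `split_ticket` and the column name `Ticket_Prefix` are hypothetical, not part of the notebook:

```python
import pandas as pd

def split_ticket(ticket):
    """Return (prefix, number) for a raw ticket string (hypothetical helper)."""
    parts = ticket.rsplit(" ", 1)                  # split once, from the right
    prefix = parts[0] if len(parts) == 2 else ""   # no space means no prefix
    last = parts[-1]
    number = int(last) if last.isdigit() else 0    # non-numeric tokens become 0
    return prefix, number

# Ticket strings taken from the head() output above
tickets = pd.Series(["A/5 21171", "PC 17599", "113803"])
split = pd.DataFrame(tickets.map(split_ticket).tolist(),
                     columns=["Ticket_Prefix", "Ticket_Num"])
print(split)
```

Mapping non-numeric tokens to 0 mirrors what the notebook does later when it casts `Ticket_Num` to `int64`.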


In [51]:
df_train['Cabin'].unique()

array(['N0', 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64'

In [52]:
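# Split multi-cabin entries and give each cabin its own row.
# Note: this repeats passengers who have several cabins (891 -> 925 rows),
# whereas the test set below keeps only the first cabin.
df_train['Cabin']=df_train['Cabin'].str.split(' ')
df_train=df_train.explode('Cabin')
df_train.shape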

(925, 13)
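
The row count grows from 891 to 925 because `explode` repeats a passenger once per listed cabin. The sketch below shows the effect on a toy frame, next to an alternative that keeps one row per passenger. The passenger ids and the `Cabin_Count` column are assumptions for illustration:

```python
import pandas as pd

cabins = pd.DataFrame({
    "PassengerId": [1, 2, 3],                  # hypothetical ids
    "Cabin": ["N0", "C23 C25 C27", "F G73"],   # formats seen in the output above
})

parts = cabins["Cabin"].str.split(" ")

# explode() repeats the passenger once per cabin: 1 + 3 + 2 rows
exploded = cabins.assign(Cabin=parts).explode("Cabin")

# Alternative: keep the first cabin and record how many were listed
one_row = cabins.assign(Cabin=parts.str[0], Cabin_Count=parts.str.len())

print(len(exploded), len(one_row))
```

Keeping one row per passenger treats the training and test sets the same way and avoids near-identical rows landing on both sides of a later train/validation split.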

In [53]:
# Keep only the first listed cabin so the test set stays at one row per passenger
df_test['Cabin']=df_test['Cabin'].str.split(' ').str[0]

In [54]:
df_train['Cabin'].unique()

array(['N0', 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6', 'C23',
       'C25', 'C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33', 'F',
       'G73', 'E31', 'A5', 'D10', 'D12', 'D26', 'C110', 'B58', 'B60',
       'E101', 'E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49',
       'F4', 'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22', 'C26', 'C106', 'C65', 'E36', 'C54', 'B57',
       'B59', 'B63', 'B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91',
       'E40', 'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96', 'B98',
       'E10', 'E44', 'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12',
       'E63', 'A14', 'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73',
       'C95', 'B38', 'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68',
       'A10', 'E68', 'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50',
       'A26', 'D48', 'E58', 'C126', 'B71', 'B51', 'B53', 'B55', 'D49',
    

In [55]:
df_train['Cabin_Alpha']=df_train['Cabin'].str[0]
df_train['Cabin_Num']=df_train['Cabin'].str[1:]
# Cabin stays in the frame here (an unassigned drop() only returns a copy); it is removed in a later cell
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Ticket_Num,Cabin_Alpha,Cabin_Num
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,[A/5],7.25,N0,S,21171,N,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,[PC],71.2833,C85,C,17599,C,85
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,[STON/O2.],7.925,N0,S,3101282,N,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,[],53.1,C123,S,113803,C,123
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,[],8.05,N0,S,373450,N,0


In [56]:
df_test['Cabin_Alpha']=df_test['Cabin'].str[0]
df_test['Cabin_Num']=df_test['Cabin'].str[1:]
# Cabin stays in the frame here (an unassigned drop() only returns a copy); it is removed in a later cell
df_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Ticket_Num,Cabin_Alpha,Cabin_Num
0,892,3,"Kelly, Mr. James",0,34.5,0,0,[],7.8292,N0,Q,330911,N,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",1,47.0,1,0,[],7.0,N0,S,363272,N,0
2,894,2,"Myles, Mr. Thomas Francis",0,62.0,0,0,[],9.6875,N0,Q,240276,N,0
3,895,3,"Wirz, Mr. Albert",0,27.0,0,0,[],8.6625,N0,S,315154,N,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",1,22.0,1,1,[],12.2875,N0,S,3101298,N,0
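
The letter and number can also be pulled out together with a regular expression. This is a sketch under the assumption that every cabin value is a single uppercase letter followed by optional digits, which holds for the values shown above:

```python
import pandas as pd

# Cabin values seen in the outputs above
cabin = pd.Series(["N0", "C85", "C123", "F", "T"])

# One named group for the leading letter, one for the optional digits
parts = cabin.str.extract(r"^(?P<Cabin_Alpha>[A-Z])(?P<Cabin_Num>\d*)$")
print(parts)
```

Named groups become the column names of the result, so `parts` can be joined back onto the original frame without renaming.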


In [57]:
def extract_title(name):
    # "Braund, Mr. Owen Harris" -> "Mr": the text between the comma and the first period
    before_period=name.split('.')[0]
    return before_period.split(',')[1].strip()
df_train['Title']=df_train['Name'].apply(extract_title)

In [58]:
# Reuse the same helper for the test set
df_test['Title']=df_test['Name'].apply(extract_title)

In [59]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Ticket_Num,Cabin_Alpha,Cabin_Num,Title
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,[A/5],7.25,N0,S,21171,N,0,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,[PC],71.2833,C85,C,17599,C,85,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,[STON/O2.],7.925,N0,S,3101282,N,0,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,[],53.1,C123,S,113803,C,123,Mrs
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,[],8.05,N0,S,373450,N,0,Mr
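
The title extraction can also be written as a single vectorized regular expression. A minimal sketch, assuming every name follows the `Surname, Title. Given names` pattern seen above:

```python
import pandas as pd

# Names taken from the head() output above
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# Capture the text between the comma and the first period, without the leading space
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())
```

Skipping the space after the comma keeps the title values clean, which in turn keeps the one-hot column names consistent between the training and test sets.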


In [60]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 925 entries, 0 to 890
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  925 non-null    int64  
 1   Survived     925 non-null    int64  
 2   Pclass       925 non-null    int64  
 3   Name         925 non-null    object 
 4   Sex          925 non-null    int64  
 5   Age          925 non-null    float64
 6   SibSp        925 non-null    int64  
 7   Parch        925 non-null    int64  
 8   Ticket       925 non-null    object 
 9   Fare         925 non-null    float64
 10  Cabin        925 non-null    object 
 11  Embarked     923 non-null    object 
 12  Ticket_Num   925 non-null    object 
 13  Cabin_Alpha  925 non-null    object 
 14  Cabin_Num    925 non-null    object 
 15  Title        925 non-null    object 
dtypes: float64(2), int64(6), object(8)
memory usage: 122.9+ KB


In [61]:
# Drop the raw text columns now that their information lives in the derived features
df_train=df_train.drop(['Name', 'Ticket', 'Cabin'], axis=1)
df_train=pd.get_dummies(df_train, columns=['Embarked', 'Cabin_Alpha', 'Title'])
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 925 entries, 0 to 890
Data columns (total 39 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   PassengerId          925 non-null    int64  
 1   Survived             925 non-null    int64  
 2   Pclass               925 non-null    int64  
 3   Sex                  925 non-null    int64  
 4   Age                  925 non-null    float64
 5   SibSp                925 non-null    int64  
 6   Parch                925 non-null    int64  
 7   Fare                 925 non-null    float64
 8   Ticket_Num           925 non-null    object 
 9   Cabin_Num            925 non-null    object 
 10  Embarked_C           925 non-null    uint8  
 11  Embarked_Q           925 non-null    uint8  
 12  Embarked_S           925 non-null    uint8  
 13  Cabin_Alpha_A        925 non-null    uint8  
 14  Cabin_Alpha_B        925 non-null    uint8  
 15  Cabin_Alpha_C        925 non-null    uin

### Data encoding

In [62]:
# Drop the raw text columns now that their information lives in the derived features
df_test=df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)
df_test=pd.get_dummies(df_test, columns=['Embarked', 'Cabin_Alpha', 'Title'])
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 29 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   PassengerId    418 non-null    int64  
 1   Pclass         418 non-null    int64  
 2   Sex            418 non-null    int64  
 3   Age            418 non-null    float64
 4   SibSp          418 non-null    int64  
 5   Parch          418 non-null    int64  
 6   Fare           417 non-null    float64
 7   Ticket_Num     418 non-null    object 
 8   Cabin_Num      418 non-null    object 
 9   Embarked_C     418 non-null    uint8  
 10  Embarked_Q     418 non-null    uint8  
 11  Embarked_S     418 non-null    uint8  
 12  Cabin_Alpha_A  418 non-null    uint8  
 13  Cabin_Alpha_B  418 non-null    uint8  
 14  Cabin_Alpha_C  418 non-null    uint8  
 15  Cabin_Alpha_D  418 non-null    uint8  
 16  Cabin_Alpha_E  418 non-null    uint8  
 17  Cabin_Alpha_F  418 non-null    uint8  
 18  Cabin_Alph

In [63]:
df_train['Cabin_Num'].unique()

array(['0', '85', '123', '46', '6', '103', '56', '23', '25', '27', '78',
       '33', '30', '52', '28', '83', '', '73', '31', '5', '10', '12',
       '26', '110', '58', '60', '101', '69', '47', '86', '2', '19', '7',
       '49', '4', '32', '80', '36', '15', '93', '35', '87', '77', '67',
       '94', '125', '99', '118', '22', '106', '65', '54', '57', '59',
       '63', '66', '34', '18', '124', '91', '40', '128', '37', '50', '82',
       '96', '98', '44', '104', '111', '92', '38', '21', '14', '20', '79',
       '95', '39', '70', '16', '68', '41', '9', '48', '126', '71', '51',
       '53', '55', '62', '64', '24', '90', '45', '8', '121', '11', '3',
       '84', '17', '102', '42', '148'], dtype=object)

In [64]:
df_test['Cabin_Num'].unique()

array(['0', '45', '31', '57', '36', '21', '78', '34', '19', '9', '15',
       '23', '', '61', '53', '43', '130', '132', '101', '55', '71', '46',
       '116', '29', '6', '28', '51', '54', '97', '22', '10', '4', '52',
       '30', '58', '62', '11', '80', '33', '85', '37', '86', '89', '26',
       '69', '32', '2', '18', '106', '60', '50', '39', '24', '41', '7',
       '40', '38', '105'], dtype=object)

### Handling missing values and type casting

In [65]:
df_train.loc[(df_train['Cabin_Num']==''), 'Cabin_Num']='0'
df_train['Cabin_Num']=df_train['Cabin_Num'].astype('int64')

In [66]:
df_test.loc[(df_test['Cabin_Num']==''), 'Cabin_Num']='0'
df_test['Cabin_Num']=df_test['Cabin_Num'].astype('int64')

In [67]:
# Non-numeric ticket tokens are replaced with '0' before casting
df_train.loc[~df_train['Ticket_Num'].str.isnumeric(), 'Ticket_Num']='0'
df_train['Ticket_Num']=df_train['Ticket_Num'].astype('int64')

In [68]:
# Non-numeric ticket tokens are replaced with '0' before casting
df_test.loc[~df_test['Ticket_Num'].str.isnumeric(), 'Ticket_Num']='0'
df_test['Ticket_Num']=df_test['Ticket_Num'].astype('int64')
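
The replace-then-cast pattern used in the four cells above can be collapsed into one call with `pd.to_numeric`. A minimal sketch on hypothetical strings (the value `"abc"` is made up to stand for any non-numeric token):

```python
import pandas as pd

# Strings like those in Cabin_Num above, plus one hypothetical non-numeric token
raw = pd.Series(["85", "", "123", "0", "abc"])

# errors="coerce" turns anything unparseable into NaN, which is then replaced by 0
as_int = pd.to_numeric(raw, errors="coerce").fillna(0).astype("int64")
print(as_int.tolist())
```

This handles empty strings and non-numeric tokens in the same step, so `Cabin_Num` and `Ticket_Num` would not need separate rules.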

In [69]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 925 entries, 0 to 890
Data columns (total 39 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   PassengerId          925 non-null    int64  
 1   Survived             925 non-null    int64  
 2   Pclass               925 non-null    int64  
 3   Sex                  925 non-null    int64  
 4   Age                  925 non-null    float64
 5   SibSp                925 non-null    int64  
 6   Parch                925 non-null    int64  
 7   Fare                 925 non-null    float64
 8   Ticket_Num           925 non-null    int64  
 9   Cabin_Num            925 non-null    int64  
 10  Embarked_C           925 non-null    uint8  
 11  Embarked_Q           925 non-null    uint8  
 12  Embarked_S           925 non-null    uint8  
 13  Cabin_Alpha_A        925 non-null    uint8  
 14  Cabin_Alpha_B        925 non-null    uint8  
 15  Cabin_Alpha_C        925 non-null    uin

In [70]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 29 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   PassengerId    418 non-null    int64  
 1   Pclass         418 non-null    int64  
 2   Sex            418 non-null    int64  
 3   Age            418 non-null    float64
 4   SibSp          418 non-null    int64  
 5   Parch          418 non-null    int64  
 6   Fare           417 non-null    float64
 7   Ticket_Num     418 non-null    int64  
 8   Cabin_Num      418 non-null    int64  
 9   Embarked_C     418 non-null    uint8  
 10  Embarked_Q     418 non-null    uint8  
 11  Embarked_S     418 non-null    uint8  
 12  Cabin_Alpha_A  418 non-null    uint8  
 13  Cabin_Alpha_B  418 non-null    uint8  
 14  Cabin_Alpha_C  418 non-null    uint8  
 15  Cabin_Alpha_D  418 non-null    uint8  
 16  Cabin_Alpha_E  418 non-null    uint8  
 17  Cabin_Alpha_F  418 non-null    uint8  
 18  Cabin_Alph

## Model Development

### Dataset Splitting

In [71]:
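# X_train and X_test refer to df_train and df_test themselves, not copies.
# Survived is still in X_train here; the next cell drops it because the test set lacks it.
X_train=df_train
Y_train=df_train['Survived']
X_test=df_test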

In [72]:
# Columns present in the training set but not in the test set (this includes Survived)
missing_cols = set( df_train.columns ) - set( df_test.columns )
# Drop them from the training features so both sets end up with the same columns
for c in missing_cols:
    X_train.drop(c, axis=1, inplace=True)
    
missing_cols = set( df_test.columns ) - set( df_train.columns )
for c in missing_cols:
    X_test.drop(c, axis=1, inplace=True)
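
The cell above keeps only the columns that both sets share. An alternative, sketched here on hypothetical toy frames, keeps every training column and adds zero-filled columns to the test set for categories it never saw. The category values below are made up for illustration:

```python
import pandas as pd

# Hypothetical toy frames: each set contains a category the other lacks
train = pd.DataFrame({"Age": [22.0, 38.0], "Title": ["Mr", "Capt"]})
test = pd.DataFrame({"Age": [34.5, 47.0], "Title": ["Mr", "Dona"]})

train_enc = pd.get_dummies(train, columns=["Title"])
test_enc = pd.get_dummies(test, columns=["Title"])

# Give the test set exactly the training columns, in the same order:
# columns it lacks are added as zeros, columns only it has are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

With this approach the model can still use categories that appear only in the training data, and the column order is guaranteed to match at prediction time.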

In [73]:
# The test set has one missing Fare; fill it with the mean fare
X_test['Fare'] = X_test['Fare'].fillna(X_test['Fare'].mean())

### Model training

In [74]:
from sklearn.model_selection import train_test_split
X_train_train, X_test_train, Y_train_train, Y_test_train = train_test_split(X_train, Y_train, test_size=0.33, random_state=42)
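
# Because explode() repeated some passengers, a plain random split can place copies
# of the same passenger in both parts. The sketch below is one hedged way to avoid
# that by splitting on PassengerId groups. The toy arrays and the names
# passenger_id, toy_X, toy_y, fit_idx, valid_idx are hypothetical.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Passenger 2 appears three times, as it would after explode()
passenger_id = np.array([1, 2, 2, 2, 3, 4, 5, 6])
toy_X = np.arange(len(passenger_id)).reshape(-1, 1)
toy_y = np.array([0, 1, 1, 1, 0, 1, 0, 1])

# All rows sharing a passenger id land on the same side of the split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=42)
fit_idx, valid_idx = next(splitter.split(toy_X, toy_y, groups=passenger_id))

# Passenger ids present on both sides; empty when the split respects groups
shared = set(passenger_id[fit_idx]) & set(passenger_id[valid_idx])
print(shared)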

In [75]:
from sklearn.linear_model import LogisticRegression
model1=LogisticRegression().fit(X_train_train, Y_train_train)
Y_pred1=model1.predict(X_test_train)

In [76]:
from sklearn import metrics
metrics.accuracy_score(Y_test_train, Y_pred1)

0.6601307189542484

In [78]:
from sklearn.tree import DecisionTreeClassifier
model2=DecisionTreeClassifier().fit(X_train_train, Y_train_train)
Y_pred2=model2.predict(X_test_train)
metrics.accuracy_score(Y_test_train, Y_pred2)

0.7549019607843137

In [79]:
from sklearn.neighbors import KNeighborsClassifier
model3=KNeighborsClassifier(n_neighbors=5).fit(X_train_train, Y_train_train)
Y_pred3=model3.predict(X_test_train)
metrics.accuracy_score(Y_test_train, Y_pred3)

0.6111111111111112

In [81]:
from sklearn.ensemble import RandomForestClassifier
model4=RandomForestClassifier().fit(X_train_train, Y_train_train)
Y_pred4=model4.predict(X_test_train)
metrics.accuracy_score(Y_test_train, Y_pred4)

0.8169934640522876
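
The accuracies above come from a single 67/33 split, so they can shift with the random seed. Cross-validation gives a steadier comparison. The following is a minimal sketch on synthetic data, not the Titanic features; `max_iter`, `n_estimators`, and the fold count are illustrative choices rather than settings from the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Titanic features
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "KNeighbors": KNeighborsClassifier(n_neighbors=5),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Mean accuracy over 5 folds for each model
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

Distance-based and linear models such as K-Nearest Neighbors and Logistic Regression are sensitive to feature scale, so large raw columns like `Ticket_Num` and `PassengerId` may partly explain their lower scores in this notebook.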

In [82]:
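test_pred=model4.predict(X_test)
# Kaggle expects exactly two columns, PassengerId and Survived, with no index column
submission=pd.DataFrame({'PassengerId': X_test['PassengerId'], 'Survived': test_pred})
submission.to_csv('result.csv', index=False)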