# Logistic Regression on the Titanic Dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
train = pd.read_csv('titanic_train.csv')

In [3]:
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Exploratory Data Analysis

In [None]:
train.isnull().head()

In [None]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=True,cmap='viridis')

A glance at the heatmap shows that some age values are missing, a lot of the cabin information is missing, and one row is missing its embarked value.
We'll come back to the missing data a little later. Before that, let's focus on some visual exploratory data analysis.
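
The heatmap gives a visual impression of the gaps; counting them per column makes the same point numerically. The following is a minimal sketch on a made-up stand-in for `train` (the `toy` frame and its values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `train`; the values are invented.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

# Number of missing entries per column.
missing_counts = toy.isnull().sum()
# Fraction of rows missing per column (mean of a boolean mask).
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```

In the notebook, the same two calls on `train` would give the per-column figures behind the heatmap.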

In [None]:
sns.set_style('whitegrid')

In [None]:
sns.countplot(x='Survived',hue='Survived',data=train)

In [None]:
sns.countplot(x='Survived',data=train,hue='Sex',palette='RdBu_r')

Clearly there's a trend here. Passengers who did not survive were much more likely to be men, while those who survived were roughly twice as likely to be female as male.
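
The proportions read off the count plot can also be tabulated directly. This is a sketch using `pd.crosstab` on a small invented frame (`toy` is hypothetical; in the notebook the inputs would be `train['Survived']` and `train['Sex']`):

```python
import pandas as pd

# Hypothetical toy data; not the real Titanic counts.
toy = pd.DataFrame({
    'Survived': [0, 0, 0, 1, 1, 1],
    'Sex': ['male', 'male', 'female', 'female', 'female', 'male'],
})

# Share of each sex within each Survived group; every row sums to 1.
sex_share = pd.crosstab(toy['Survived'], toy['Sex'], normalize='index')
print(sex_share)
```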

In [None]:
sns.countplot(x='Survived',data=train,hue='Pclass')

It also looks like the passengers who did not survive were overwhelmingly from 3rd class, while those who did survive leaned toward the higher classes.
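
One way to put a number on the class effect is the survival rate per class, which is just the mean of the 0/1 `Survived` column within each `Pclass` group. A sketch on invented data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data; not the real Titanic counts.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1],
})

# Mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
survival_rate = toy.groupby('Pclass')['Survived'].mean()
print(survival_rate)
```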

Now let's try to understand the ages of the passengers on board.

In [None]:
sns.histplot(train['Age'].dropna(),bins=30,kde=False) # histplot replaces distplot, which is deprecated since seaborn 0.11

There seems to be an interesting bimodal distribution: one cluster of young passengers between ages 0 and 10, and a larger peak with most passengers falling roughly between 20 and 30.
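
The histogram is built from bin counts, so tabulating those counts shows the two peaks without a plot. Below is a sketch with `np.histogram` on invented ages (the `ages` array and the decade-wide `bins` are assumptions for illustration):

```python
import numpy as np

# Hypothetical ages; not taken from the dataset.
ages = np.array([2, 4, 8, 19, 22, 24, 27, 29, 35, 51])
# Decade-wide bins: [0, 10), [10, 20), ..., [50, 60].
bins = [0, 10, 20, 30, 40, 50, 60]

counts, edges = np.histogram(ages, bins=bins)
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>2}-{right:<2}: {n}")
```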

In [None]:
sns.countplot(x='SibSp',hue='SibSp',data=train)

In [None]:
train['Fare'].hist(bins=80,figsize=(10,4))

## Cleaning Data

As we saw earlier, a few columns are missing some data. We need to clean the dataset before we train our logistic regression model. Let's first fill in the missing age values. I'm going to do this by replacing each missing age with the mean age of the passenger class that the passenger belongs to.

In [None]:
plt.figure(figsize=(10,7))
sns.boxplot(x='Pclass',y='Age',data=train)

In [None]:
train.groupby('Pclass')['Age'].mean().round() # select Age first: mean() over text columns raises in pandas 2.x

In [None]:
class_means = train.groupby('Pclass')['Age'].mean().round() # rounded mean age per class
mean_class1 = class_means.loc[1]
mean_class2 = class_means.loc[2]
mean_class3 = class_means.loc[3]

In [None]:
train.loc[train['Pclass']==1,'Age'] = train.loc[train['Pclass']==1,'Age'].fillna(value=mean_class1)
train.loc[train['Pclass']==2,'Age'] = train.loc[train['Pclass']==2,'Age'].fillna(value=mean_class2)
train.loc[train['Pclass']==3,'Age'] = train.loc[train['Pclass']==3,'Age'].fillna(value=mean_class3)
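
The three per-class assignments above can also be expressed as a single grouped operation. This is an alternative sketch, not the notebook's method: `transform('mean')` broadcasts each class's mean age back onto its rows, and `fillna` then uses it only where the age is missing. The `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `train`.
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age':    [40.0, 36.0, np.nan, 20.0, np.nan, 30.0],
})

# Per-row mean age of the row's own class (NaN values are skipped by mean).
class_mean = toy.groupby('Pclass')['Age'].transform('mean').round()
# Only the missing entries are replaced; known ages are left untouched.
toy['Age'] = toy['Age'].fillna(class_mean)
print(toy)
```

This form does not need one line per class, so it keeps working if the class labels change.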

In [None]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

I'll simply drop the Cabin column, since too much of it is missing.

In [None]:
train.drop('Cabin',axis=1,inplace=True)

In [None]:
train.dropna(inplace=True) # dropping the 1 missing value in Embarked column

I will now convert some of the categorical features in the dataset into dummy variables that our machine learning model can accept.
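
To make the encoding concrete, here is a sketch of `pd.get_dummies` with `drop_first=True` on an invented column (the `toy` frame is hypothetical). Dropping the first category avoids a redundant column: with three ports, two indicators are enough, because a row with zeros in both must belong to the dropped category.

```python
import pandas as pd

# Hypothetical toy column with the three embarkation codes.
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# Categories are sorted (C, Q, S); drop_first removes 'C'.
dummies = pd.get_dummies(toy['Embarked'], drop_first=True)
print(dummies)
```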

In [None]:
sex = pd.get_dummies(train['Sex'],drop_first=True)

In [None]:
embark = pd.get_dummies(train['Embarked'],drop_first=True)

In [None]:
train = pd.concat([train,sex,embark],axis=1)

In [None]:
train.head(10)

In [None]:
train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)

In [None]:
train.drop('PassengerId',axis=1,inplace=True)

In [None]:
train.head()

Now let's perform the same data cleaning on the test data.

In [None]:
test = pd.read_csv('titanic_test.csv')

In [None]:
test.loc[test['Pclass']==1,'Age'] = test.loc[test['Pclass']==1,'Age'].fillna(value=mean_class1)
test.loc[test['Pclass']==2,'Age'] = test.loc[test['Pclass']==2,'Age'].fillna(value=mean_class2)
test.loc[test['Pclass']==3,'Age'] = test.loc[test['Pclass']==3,'Age'].fillna(value=mean_class3)

In [None]:
sns.heatmap(test.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
test.drop('Cabin',axis=1,inplace=True)

In [None]:
test.dropna(inplace=True)

In [None]:
sex = pd.get_dummies(test['Sex'],drop_first=True)
embark = pd.get_dummies(test['Embarked'],drop_first=True)

In [None]:
test = pd.concat([test,sex,embark],axis=1)

In [None]:
test.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)

In [None]:
test.head()
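
Because the same steps are applied to both files, they could be collected in one helper so the two frames cannot drift apart. The function `clean` and the `class_means` mapping below are hypothetical names introduced for this sketch, and the toy rows are invented:

```python
import numpy as np
import pandas as pd

def clean(df, class_means):
    """Apply this section's cleaning steps to a copy of df (sketch only)."""
    df = df.copy()
    # Fill missing ages with the mean age of the passenger's class.
    df['Age'] = df['Age'].fillna(df['Pclass'].map(class_means))
    # Drop the sparse Cabin column, then any rows that still have gaps.
    df = df.drop('Cabin', axis=1).dropna()
    # One-hot encode the categorical columns, dropping the first level.
    sex = pd.get_dummies(df['Sex'], drop_first=True)
    embark = pd.get_dummies(df['Embarked'], drop_first=True)
    df = pd.concat([df, sex, embark], axis=1)
    return df.drop(['Sex', 'Embarked', 'Name', 'Ticket'], axis=1)

# Hypothetical toy rows and made-up class means.
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Pclass': [1, 3, 3, 1],
    'Name': ['A', 'B', 'C', 'D'],
    'Sex': ['male', 'female', 'male', 'female'],
    'Age': [40.0, np.nan, 20.0, np.nan],
    'Ticket': ['t1', 't2', 't3', 't4'],
    'Fare': [50.0, 8.0, 7.5, 60.0],
    'Cabin': [np.nan, np.nan, 'C1', np.nan],
    'Embarked': ['S', 'C', 'Q', np.nan],
})
class_means = {1: 38.0, 3: 25.0}

cleaned = clean(toy, class_means)
print(cleaned)
```

One caveat with this approach: `get_dummies` only creates columns for categories present in the frame it is given, so a file lacking one port would end up with fewer columns than the training data.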

## Train and Build Classifier

In [None]:
X = train.drop('Survived',axis=1)
y = train['Survived']

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logmodel = LogisticRegression(max_iter=1000) # default max_iter=100 can stop before the solver converges on unscaled features
logmodel.fit(X_train,y_train)

In [None]:
logmodel.score(X_train,y_train)

In [None]:
logmodel.score(X_test,y_test)
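
`score` returns mean accuracy, which does not show which class the errors fall on. A confusion matrix breaks the result down by true and predicted label. The sketch below uses invented label lists; in the notebook the inputs would presumably be `y_test` and `logmodel.predict(X_test)`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels (0 = died, 1 = survived).
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1, 0, 1]

# Rows are true labels, columns are predicted labels:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
```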

## Making Predictions

In [None]:
test_x = test.drop('PassengerId',axis=1)

In [None]:
predictions = logmodel.predict(test_x)

In [None]:
final_prediction = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':predictions})

In [None]:
final_prediction.head()