* Problem Statement - Predict whether a passenger survived the sinking of the Titanic

pclass: Passenger Class
1st = Upper
2nd = Middle
3rd = Lower

sibsp: Number of siblings and spouses aboard
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: Number of parents and children aboard
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

Embarked: Port of Embarkation
C = Cherbourg, Q = Queenstown, S = Southampton
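
The coded columns described above can be translated back into readable labels. The following is a minimal sketch on a made-up three-row sample (the rows and the `ClassLabel`, `Port` and `Relatives` column names are illustrative assumptions, not part of the dataset):

```python
import pandas as pd

# Hypothetical three-row sample; the values are made up for illustration.
sample = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "SibSp": [1, 0, 0],
    "Parch": [0, 0, 2],
    "Embarked": ["C", "S", "Q"],
})

# Lookup tables taken from the data dictionary above.
class_labels = {1: "Upper", 2: "Middle", 3: "Lower"}
port_labels = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

sample["ClassLabel"] = sample["Pclass"].map(class_labels)
sample["Port"] = sample["Embarked"].map(port_labels)

# Relatives aboard = siblings/spouses + parents/children.
sample["Relatives"] = sample["SibSp"] + sample["Parch"]

print(sample)
```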

#### Collecting Data

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import math

ImportError: Can't determine version for bottleneck

In [None]:
titanic_data = pd.read_csv('Data Sets/titanic.csv')
titanic_data.head(10)

In [None]:
# print the number of passengers

print(f"No. of passengers travelling: {len(titanic_data.index)}")

#### Analysing Data

In [None]:
# count of passengers who survived vs. did not survive

sns.countplot(x="Survived", data = titanic_data)

In [None]:
# survival counts split by sex

sns.countplot(x="Survived", hue="Sex", data = titanic_data)

In [None]:
# analysis for passenger class
sns.countplot(x="Survived", hue="Pclass", data = titanic_data)
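
The count plots above have a numeric counterpart. As a rough sketch on made-up data (the six rows below are an assumption for illustration, not taken from the dataset), `pd.crosstab` gives the same counts as a table and a group-wise mean gives the survival rate:

```python
import pandas as pd

# Made-up sample with the same column names as the notebook.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Same information as sns.countplot(x="Survived", hue="Sex"), as a table.
counts = pd.crosstab(toy["Survived"], toy["Sex"])

# Mean of a 0/1 column is the share of 1s, i.e. the survival rate per group.
rate = toy.groupby("Sex")["Survived"].mean()

print(counts)
print(rate)
```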

In [None]:
# age analysis

titanic_data["Age"].plot.hist()

In [None]:
# fare analysis

titanic_data["Fare"].plot.hist(bins=20, figsize=(10,5))

In [None]:
titanic_data.info()

In [None]:
sns.countplot(x = "SibSp",data = titanic_data)

#### Data Wrangling

In [None]:
# check the data set for missing (null) values

titanic_data.isnull()

In [None]:
titanic_data.isnull().sum()
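
Alongside the raw counts, the share of missing values per column can help decide which columns to drop. A small sketch on a made-up frame (the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in Age and Cabin.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

missing_count = toy.isnull().sum()    # number of nulls per column
missing_share = toy.isnull().mean()   # fraction of rows that are null

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("share", ascending=False))
```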

In [None]:
# this can also be checked by plotting a heatmap

sns.heatmap(titanic_data.isnull(), yticklabels=False, cmap="viridis")

In [None]:
# first working on age column
# plot box plot

sns.boxplot(x="Pclass", y="Age", data=titanic_data)

# passengers travelling in classes 1 and 2 tend to be older than those in class 3

In [None]:
# preview the data before removing the unused Cabin column

titanic_data.head()

In [None]:
titanic_data.drop("Cabin", axis = 1, inplace=True)

In [None]:
titanic_data.head()

In [None]:
# drop all rows that contain NA values

titanic_data.dropna(inplace=True)
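
The order of the two steps above matters: `dropna` removes every row with at least one missing value, so a sparsely filled column such as Cabin would take most rows with it if it were still present. A sketch on made-up data (the five rows are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up frame: Cabin is mostly empty, Age and Embarked have one gap each.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 54.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", "S", np.nan],
})

# Dropping rows while Cabin is still present keeps only fully populated rows.
rows_if_cabin_kept = len(toy.dropna())

# Dropping the Cabin column first preserves the rows that only lacked a cabin.
rows_if_cabin_dropped = len(toy.drop("Cabin", axis=1).dropna())

print(rows_if_cabin_kept, rows_if_cabin_dropped)
```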

In [None]:
# check again for null values

sns.heatmap(titanic_data.isnull(), yticklabels=False, cmap="viridis")

In [None]:
titanic_data.isnull().sum()

In [None]:
titanic_data.head(2)

# here we see several string-valued columns that must be converted to numeric form for logistic regression
# for this we will convert them to dummy variables using pandas
# the model cannot be trained on string values

In [None]:
# first convert the Sex column to dummy variables (one column per value)

pd.get_dummies(titanic_data["Sex"])

In [None]:
sex = pd.get_dummies(titanic_data["Sex"],drop_first=True)
sex.head()

# 1-male 0-female

In [None]:
# now for embarked

embark = pd.get_dummies(titanic_data["Embarked"])
embark.head()

In [None]:
embark = pd.get_dummies(titanic_data["Embarked"],drop_first=True)
embark.head()

# if Q and S are both 0 then the value is definitely C
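
The effect of `drop_first=True` is easiest to see on a tiny series. In this sketch (the four port codes are made up), the dropped category C is represented by a row in which every remaining indicator is off. Depending on the pandas version, the indicators print as `True`/`False` or as `1`/`0`.

```python
import pandas as pd

# Made-up port codes; row 1 is the category that drop_first removes.
ports = pd.Series(["S", "C", "Q", "S"], name="Embarked")

full = pd.get_dummies(ports)                       # columns C, Q, S
reduced = pd.get_dummies(ports, drop_first=True)   # columns Q, S

# Row 1 is "C": both remaining indicators are off.
print(full)
print(reduced)
```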

In [None]:
Pcls = pd.get_dummies(titanic_data["Pclass"])
Pcls.head()

In [None]:
Pcls = pd.get_dummies(titanic_data["Pclass"],drop_first=True)
Pcls.head()

# if 2 and 3 are both 0 then the passenger is travelling in class 1

In [None]:
# now all categorical values are encoded as dummy variables
# concatenate the new columns onto the dataset

titanic_data = pd.concat([titanic_data,sex,embark,Pcls],axis=1)
titanic_data.head()

In [None]:
# now drop all unnecessary columns

titanic_data.drop(['Sex','Embarked','PassengerId','Name','Ticket','Pclass'],axis=1, inplace=True)

In [None]:
titanic_data.head(5)

# final data set

#### Training Data

In [None]:
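Y = titanic_data['Survived']
# recent scikit-learn versions reject a mix of string and integer column names,
# so convert the integer Pclass dummy names (2, 3) to strings
titanic_data.columns = titanic_data.columns.astype(str)
X = titanic_data[['Age','SibSp','Parch','Fare','male','Q','S','2','3']]  # every column except Survived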

In [None]:
# Split data into training datasets and testing datasets

from sklearn.model_selection import train_test_split

In [None]:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size = 0.20,random_state=0)
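
To see what `test_size` and `random_state` do, here is a small sketch on synthetic arrays (the 10-sample data is an assumption for illustration): 20% of the rows go to the test set, and repeating the call with the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10 samples, 2 features, alternating labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# A second call with the same random_state gives the same partition.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.20, random_state=0)

print(X_train.shape, X_test.shape)
```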

In [None]:
# creating a model
from sklearn.linear_model import LogisticRegression
# creating instance of logistic regression
logmodel = LogisticRegression()

In [None]:
# fitting the model

logmodel.fit(X_train,Y_train)

In [None]:
# prediction

Y_pred = logmodel.predict(X_test)

In [None]:
# calculate the classification report to check accuracy

from sklearn.metrics import classification_report
print(classification_report(Y_test,Y_pred))


In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test,Y_pred)
cm

In [None]:
# accuracy score 

from sklearn.metrics import accuracy_score
accu = accuracy_score(Y_test,Y_pred)
accu
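
The confusion matrix and the accuracy score are directly related: accuracy is the sum of the diagonal divided by the total. A sketch with made-up labels (not the model's actual predictions):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions for eight passengers.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = cm.ravel()

accuracy_from_cm = (tn + tp) / cm.sum()
precision = tp / (tp + fp)   # share of predicted survivors that were correct
recall = tp / (tp + fn)      # share of actual survivors that were found

print(cm)
print(accuracy_from_cm, accuracy_score(y_true, y_pred))
```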

In [None]:
X_test

In [None]:
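# predict for a hypothetical passenger: age 60, 2 siblings/spouses, 2 parents/children,
# fare 20, female, embarked at Cherbourg, 1st class
Y_pred = logmodel.predict([[60,2,2,20,0,0,0,0,0]])
Y_pred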
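
A single prediction can also be made from a one-row DataFrame with named columns, and `predict_proba` shows how confident the model is. This is a minimal sketch on made-up training data with only two of the notebook's features; the numbers, the `model` and `passenger` names, and the `max_iter=1000` setting are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up training data using two of the notebook's feature names.
train = pd.DataFrame({
    "Age":  [22, 38, 26, 35, 54, 2, 27, 14],
    "male": [1, 0, 0, 0, 1, 1, 0, 0],
})
survived = [0, 1, 1, 1, 0, 0, 1, 1]

model = LogisticRegression(max_iter=1000).fit(train, survived)

# One hypothetical passenger, passed with the same column names as the training data.
passenger = pd.DataFrame([{"Age": 60, "male": 0}])

label = model.predict(passenger)        # predicted class, 0 or 1
proba = model.predict_proba(passenger)  # columns follow model.classes_

print(label, proba)
```

Passing named columns keeps the feature order explicit, which is easy to get wrong in a bare list of numbers.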