<a href="https://colab.research.google.com/github/kindaa/ML-Group-project/blob/main/titanic_survival_prediciton.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

ML Group Project (group B, group 6)

***Importing libraries***

In [2]:
#Data wrangling
import pandas as pd
import numpy as np
import missingno
from collections import Counter

#Data visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#Machine learning models
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


#Model evaluation
from sklearn.model_selection import cross_val_score

#Hyperparameter tuning
from sklearn.model_selection import GridSearchCV

#Warnings
import warnings
warnings.filterwarnings('ignore')

***Importing datasets***

In [3]:
#These paths assume the Kaggle environment; adjust them when running elsewhere (e.g. Colab)
train=pd.read_csv('/kaggle/input/titanic/train.csv')
test=pd.read_csv('/kaggle/input/titanic/test.csv')
ss=pd.read_csv('/kaggle/input/titanic/gender_submission.csv')


***Understanding the shape of the data***

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.shape

In [None]:
test.shape

In [None]:
ss.shape

Apart from the number of rows, the train and test sets differ only in the Survived column, which is present in the train set alone.
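
The column comparison can also be checked programmatically. The sketch below uses two small made-up frames (`toy_train` and `toy_test` are hypothetical stand-ins, not the real data) and set differences on the column names.

```python
import pandas as pd

# Toy frames standing in for train and test (made-up values, not the real data)
toy_train = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1], 'Pclass': [3, 1]})
toy_test = pd.DataFrame({'PassengerId': [3], 'Pclass': [2]})

# Columns present in one frame but not in the other
only_in_train = sorted(set(toy_train.columns) - set(toy_test.columns))
only_in_test = sorted(set(toy_test.columns) - set(toy_train.columns))
print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
```

The same two set expressions should work unchanged on the real train and test frames.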

In [None]:
train.info()
test.info()

***Data description***

Survived: 0 - Did not survive ; 1 - Survived
Pclass: 1 - First class ; 2 - Second class ; 3 - Third Class
Sex: Male or Female
Age: In years
SibSp: Number of siblings or spouses on Titanic
Parch: Number of parents or children on Titanic
Ticket: Passenger's ticket number
Fare: Passenger's fare
Cabin: Cabin number
Embarked: Point of embarkation C = Cherbourg ; Q = Queenstown ; S = Southampton

***EDA***

In [None]:
train.isnull().sum().sort_values(ascending=False)

In [None]:
test.isnull().sum().sort_values(ascending=False)

Cabin, Age and Embarked have missing values in the train dataset.
Cabin, Age and Fare have missing values in the test dataset.
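
To see how severe the gaps are, it can help to look at percentages as well as counts. The helper below is a minimal sketch; `missing_summary` and the toy frame are hypothetical names and values introduced only for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and percentage of missing values per column, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({'missing': counts,
                            'percent': (counts / len(df) * 100).round(1)})
    # Keep only the columns that actually have missing values
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)

# Toy frame with made-up values
toy = pd.DataFrame({'Age': [22.0, np.nan, 35.0, np.nan],
                    'Cabin': [np.nan, 'C85', np.nan, np.nan],
                    'Fare': [7.25, 71.28, 8.05, 53.1]})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(train)` or `missing_summary(test)` would give the same kind of table for the real data.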

In [None]:
import missingno
from collections import Counter
missingno.matrix(train)

In [None]:
train.describe()

In [None]:
test.describe()

***Feature analysis***

Knowing which features are numerical and which are categorical helps us structure the analysis.

Categorical - Sex, Pclass and Embarked

Numerical - SibSp, Parch, Age and Fare

In [None]:
train['Sex'].value_counts(dropna=False)
#There were more male passengers than female

In [None]:
train[['Sex','Survived']].groupby('Sex', as_index=False).mean().sort_values(by='Survived', ascending=False)
#Female passengers had a higher survival rate than male passengers

In [None]:
#Survival probability barplot by gender
sns.barplot(x='Sex', y='Survived', data=train)
plt.ylabel('Survival Probability')
plt.title('Survival probability by gender')

In [None]:
train['Pclass'].value_counts(dropna=False)
#Most passengers were in third class

In [None]:
#Mean of survival rate by passenger class
train[['Pclass','Survived']].groupby('Pclass', as_index=False).mean().sort_values(by='Survived', ascending=False)
#The higher the class, the higher the survival probability

In [None]:
#Survival rate by passenger class barplot
sns.barplot(x='Pclass', y='Survived', data=train)
plt.ylabel('Survival probability')
plt.title('Survival mean by passenger class')


In [None]:
# Survival probability by sex and passenger class
#catplot is the current name of the former factorplot
t=sns.catplot(x='Pclass', y='Survived', hue='Sex', data=train, kind='bar')
t.despine(left=True)
plt.ylabel("Survival probability")
plt.title("Survival probability by sex and passenger class")

In [None]:
train['Embarked'].value_counts(dropna=False)

In [None]:
#Mean of survival by point of embarkation
train[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
sns.barplot(x='Embarked', y='Survived', data=train)
plt.ylabel('Survival probability')
plt.title('Survival probability by point of embarkation')
#Survival probability is the highest for Cherbourg

Let's compare embarkation point with passenger class: the majority of Cherbourg's passengers may have travelled in first class. (It seems unlikely that passengers from a particular port were prioritised during the evacuation.)


In [None]:
sns.catplot(x='Pclass', col='Embarked', data=train, kind='count')

Most of Cherbourg's passengers were in first class, whereas most passengers from Southampton were in third class. This suggests that class, not point of embarkation, drives the difference in mean survival.
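
The class mix per port can also be read as numbers rather than from a plot. The following sketch uses `pd.crosstab` with `normalize='index'` on a small made-up frame (the values are illustrative, not taken from the dataset).

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({'Embarked': ['C', 'C', 'C', 'S', 'S', 'S', 'S', 'Q'],
                    'Pclass':   [1, 1, 3, 3, 3, 2, 1, 3]})

# Share of each passenger class within every port (each row sums to 1)
class_share = pd.crosstab(toy['Embarked'], toy['Pclass'], normalize='index')
print(class_share.round(2))
```

On the real data, `pd.crosstab(train['Embarked'], train['Pclass'], normalize='index')` should give the corresponding table.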

In [None]:
#Survival rate by all categorical variables

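grid = sns.FacetGrid(train, row = 'Embarked', height = 2.2, aspect = 1.6)
#Fix the category order so every facet plots classes and sexes consistently
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette = 'deep', order = [1, 2, 3], hue_order = ['male', 'female'])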
grid.add_legend()

***Detecting and removing outliers***
Tukey's rule
Outliers are values that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles: below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
We will use these bounds in a function that identifies outliers according to Tukey's rule.
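
As a small worked example of the rule, the sketch below computes the two fences for a made-up array of ten values (the numbers are illustrative only).

```python
import numpy as np

# Made-up sample with one obviously extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are outliers under Tukey's rule
outliers = values[(values < lower_fence) | (values > upper_fence)]
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```

Here Q1 is 3.25 and Q3 is 7.75, so the fences are -3.5 and 14.5 and only the value 100 falls outside them.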

In [None]:
def detect_outliers(df, n, features):
    """Return indices of rows that are Tukey outliers in more than n of the given features."""
    outlier_indices=[]
    for col in features:
        #nanpercentile ignores missing values (Age contains NaNs)
        Q1=np.nanpercentile(df[col], 25)
        Q3=np.nanpercentile(df[col], 75)
        IQR=Q3-Q1
        outlier_step=1.5*IQR
        #Rows below Q1-1.5*IQR or above Q3+1.5*IQR for this feature
        outlier_list_col=df[(df[col]<Q1-outlier_step) | (df[col]>Q3+outlier_step)].index
        outlier_indices.extend(outlier_list_col)
    #Count how many features flag each row and keep rows flagged more than n times
    outlier_counts=Counter(outlier_indices)
    return [index for index, count in outlier_counts.items() if count>n]

outliers_to_drop=detect_outliers(train, 2, ['Age', 'SibSp', 'Parch', 'Fare'])
print("We will drop these {} indices: ".format(len(outliers_to_drop)), outliers_to_drop)

In [None]:
train.loc[outliers_to_drop, :]

In [None]:
#drop outliers and reset index
print("Before: {} rows".format(len(train)))
train=train.drop(outliers_to_drop, axis=0).reset_index(drop=True)
print("After: {} rows".format(len(train)))

In [None]:
#Numerical variables correlation with survival
sns.heatmap(train[['Survived', 'SibSp', 'Parch', 'Age', 'Fare']].corr(), annot=True, fmt='.2f', cmap='coolwarm')

In [None]:
train['SibSp'].value_counts(dropna=False)

In [None]:
#Survival mean by SibSp
train[['SibSp', 'Survived']].groupby('SibSp', as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
sns.barplot(x='SibSp', y='Survived', data=train)
plt.ylabel('Survival probability')
plt.title("Survival mean by SibSp")

In [None]:
train['Parch'].value_counts(dropna=False)

In [None]:
#Mean of survival by Parch
train[['Parch', 'Survived']].groupby('Parch', as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
sns.barplot(x='Parch', y='Survived', data=train)
plt.ylabel('Survival probability')
plt.title('Survival rate by Parch')

In [None]:
#Null values in Age column
train['Age'].isnull().sum()

In [None]:
#Passenger age distribution
#histplot with kde=True replaces the deprecated distplot
sns.histplot(train['Age'].dropna(), kde=True, stat='density', label='Skewness: %.2f'%(train['Age'].skew()))
plt.legend(loc='best')
plt.title('Passenger age distribution')

In [None]:
# Age distribution by survival

t = sns.FacetGrid(train, col = 'Survived')
t.map(sns.histplot, 'Age', kde=True, stat='density')

In [None]:
sns.kdeplot(train['Age'][train['Survived']==0], label='Did not survive')
sns.kdeplot(train['Age'][train['Survived']==1], label='Survived')
plt.xlabel('Age')
plt.legend(loc='best')
plt.title('Age distribution by survival outcome')

In [None]:
train['Fare'].isnull().sum()

In [None]:
#Passenger fare distribution
sns.histplot(train['Fare'], kde=True, stat='density', label='Skewness: %.2f'%(train['Fare'].skew()))
plt.legend(loc='best')
plt.title('Passenger fare distribution')

***Data preprocessing***

Dropping and filling missing values

In [None]:
#Drop ticket and cabin features from training and test set for simplicity
train=train.drop(['Ticket', 'Cabin'], axis=1)
test=test.drop(['Ticket', 'Cabin'], axis=1)

In [None]:
train.isnull().sum().sort_values(ascending=False)

In [None]:
#Compute the most frequent value of embarked in training set
mode=train['Embarked'].dropna().mode()[0]
mode

In [None]:
#Fill missing value in embarked with mode
train['Embarked']=train['Embarked'].fillna(mode)

In [None]:
test.isnull().sum().sort_values(ascending=False)

In [None]:
median=test['Fare'].dropna().median()
median

In [None]:
test['Fare']=test['Fare'].fillna(median)

In [None]:
combine=pd.concat([train,test], axis=0).reset_index(drop=True)
combine.head()

In [None]:
#Missing values in combined dataset
combine.isnull().sum().sort_values(ascending=False)

In [None]:
#Converting sex into numerical values ; 0=male 1=female
combine['Sex']=combine['Sex'].map({'male':0,'female':1})

In [None]:
sns.catplot(y='Age', x='Sex', hue='Pclass', kind='box', data=combine)

In [None]:
sns.catplot(y='Age', x='Parch', kind='box', data=combine)

In [None]:
sns.catplot(y='Age', x='SibSp', kind='box', data=combine)

In [None]:
#Select the numeric columns explicitly; Embarked is still a string column at this point
sns.heatmap(combine[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch']].corr(), annot=True, cmap='coolwarm')

In [None]:
#Age is not correlated with sex but slightly negatively correlated to SibSp, Parch and Pclass

In [None]:
age_nan=list(combine[combine['Age'].isnull()].index)
len(age_nan)

In [None]:
#Loop through list and impute missing ages
for index in age_nan:
    median_age=combine['Age'].median()
    row=combine.loc[index]
    #Median age of passengers with the same SibSp, Parch and Pclass
    predict_age=combine['Age'][(combine['SibSp']==row['SibSp'])
        &(combine['Parch']==row['Parch'])
        &(combine['Pclass']==row['Pclass'])].median()

    #Assign through .loc so the value is written to combine itself
    if np.isnan(predict_age):
        combine.loc[index,'Age']=median_age
    else:
        combine.loc[index,'Age']=predict_age

In [None]:
combine['Age'].isnull().sum()
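
The same idea (group median first, overall median as fallback) can be sketched without an explicit loop using `groupby(...).transform('median')`. This is an alternative formulation on a made-up frame, not the loop above: it computes all medians once from the observed ages, so results can differ slightly from the row-by-row version.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 2],
    'SibSp':  [0, 0, 0, 1, 1, 0],
    'Parch':  [0, 0, 0, 0, 0, 2],
    'Age':    [40.0, 50.0, np.nan, 20.0, np.nan, np.nan],
})

# Overall median of the observed ages, used when a group has no known age
overall_median = toy['Age'].median()

# Median age within each (SibSp, Parch, Pclass) group, aligned to the rows
group_median = toy.groupby(['SibSp', 'Parch', 'Pclass'])['Age'].transform('median')

# Fill with the group median where available, otherwise with the overall median
toy['Age'] = toy['Age'].fillna(group_median).fillna(overall_median)
print(toy)
```

In this toy frame the third row receives 45 (median of 40 and 50), the fifth row receives 20, and the last row falls back to the overall median of 40 because its group has no known age.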

Data Transformation

In [None]:
#Passenger fare distribution
sns.histplot(combine['Fare'], kde=True, stat='density', label='Skewness: %.2f'%(combine['Fare'].skew()))
plt.legend(loc='best')
plt.title('Passenger Fare distribution')

In [None]:
#Apply log transformation to fare column to reduce skewness
combine['Fare']=combine['Fare'].map(lambda x: np.log(x) if x>0 else 0)

In [None]:
sns.histplot(combine['Fare'], kde=True, stat='density', label='Skewness: %.2f'%(combine['Fare'].skew()))
plt.legend(loc='best')
plt.title("Passenger fare distribution after log transformation")
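
The effect of a log transformation on skewness can be checked on a handful of numbers. The sketch below uses made-up fares and `np.log1p`, which computes log(1 + x) and therefore maps a fare of 0 to 0 without a special case; this is a swapped-in variant for illustration, not the exact transformation used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed fares including a zero
fares = pd.Series([0.0, 7.25, 8.05, 13.0, 26.0, 71.28, 263.0, 512.33])

# log1p(x) = log(1 + x), defined at x = 0
log_fares = np.log1p(fares)

print("Skewness before: %.2f" % fares.skew())
print("Skewness after:  %.2f" % log_fares.skew())
```

The transformed values are much closer to symmetric, which is the reason for applying the transformation before binning.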

**Feature Engineering**

In [None]:
combine['Title']=[name.split(',')[1].split('.')[0].strip() for name in combine['Name']]
combine[['Name','Title']].head()

In [None]:
combine['Title'].value_counts()

In [None]:
combine['Title'].nunique()

In [None]:
combine['Title']=combine['Title'].replace(['Dr', 'Rev','Col', 'Major', 'Lady', 'Jonkheer', 'Don','Capt','the Countess', 'Sir', 'Dona'],'Rare')
combine['Title']=combine['Title'].replace(['Mlle','Ms'],'Miss')
combine['Title']=combine['Title'].replace('Mme', 'Mrs')
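
Title extraction and grouping can also be written with a regular expression. The sketch below is one possible variant, assuming names follow the "Surname, Title. Given names" pattern; the sample strings are written in the dataset's name format and serve only as an illustration.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format
names = pd.Series(['Braund, Mr. Owen Harris',
                   'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
                   'Heikkinen, Miss. Laina',
                   'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)'])

# Capture everything between the comma and the first full stop
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles.tolist())

# Group an infrequent title under a single label
grouped = titles.replace(['the Countess'], 'Rare')
print(grouped.tolist())
```

The pattern keeps multi-word titles such as "the Countess" intact, which matters because the replacement list matches on the full title string.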

In [None]:
sns.countplot(x='Title', data=combine)

In [None]:
combine[['Title', 'Survived']].groupby(['Title'], as_index = False).mean().sort_values(by = 'Survived', ascending = False)

In [None]:
sns.catplot(x = 'Title', y = 'Survived', data = combine, kind = 'bar')
plt.ylabel('Survival Probability')
plt.title('Mean of survival by Title')

In [None]:
#Dropping name column - unnecessary one
combine = combine.drop('Name', axis = 1)
combine.head()

In [None]:
#Family size calculation
combine['FamilySize'] = combine['SibSp'] + combine['Parch'] + 1
combine[['SibSp', 'Parch', 'FamilySize']].head(10)

In [None]:
# Mean of survival by family size

combine[['FamilySize', 'Survived']].groupby('FamilySize', as_index = False).mean().sort_values(by = 'Survived', ascending = False)

In [None]:
# Creating IsAlone feature

combine['IsAlone'] = 0
combine.loc[combine['FamilySize'] == 1, 'IsAlone'] = 1

In [None]:
combine[['IsAlone', 'Survived']].groupby('IsAlone', as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
# Dropping SibSp, Parch and FamilySize features from combine dataframe

combine = combine.drop(['SibSp', 'Parch', 'FamilySize'], axis = 1)
combine.head()

In [None]:
# To create the Age*Pclass feature we need to transform Age into an ordinal variable.
# We will separate Age into 5 age groups and assign a number to each group.

combine['AgeGroup'] = pd.cut(combine['Age'], 5)
combine[['AgeGroup', 'Survived']].groupby('AgeGroup', as_index=False).mean().sort_values(by = 'AgeGroup')

In [None]:
# Assign ordinals to each age group 

combine.loc[combine['Age'] <= 16.136, 'Age'] = 0
combine.loc[(combine['Age'] > 16.136) & (combine['Age'] <= 32.102), 'Age'] = 1
combine.loc[(combine['Age'] > 32.102) & (combine['Age'] <= 48.068), 'Age'] = 2
combine.loc[(combine['Age'] > 48.068) & (combine['Age'] <= 64.034), 'Age'] = 3
combine.loc[combine['Age'] > 64.034 , 'Age'] = 4

In [None]:
# Dropping age band feature

combine = combine.drop('AgeGroup', axis = 1)
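
The banding step can also be expressed in one call: `pd.cut` with `labels=False` returns the integer bin code directly, so the thresholds do not need to be typed by hand. The sketch below uses made-up ages chosen to span 0.17 to 80 so that the five equal-width bins coincide with the thresholds used above.

```python
import pandas as pd

# Made-up ages spanning 0.17 to 80
ages = pd.Series([0.17, 5.0, 16.0, 22.0, 35.0, 50.0, 64.0, 80.0])

# Five equal-width bins; labels=False returns the bin number (0 to 4) for each value
age_codes = pd.cut(ages, bins=5, labels=False)
print(age_codes.tolist())
```

Note that the bin edges depend on the minimum and maximum of the column, so on other data the codes would correspond to different thresholds.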

In [None]:
combine[['Age', 'Pclass']].dtypes

In [None]:
combine['Age']=combine['Age'].astype('int')
combine['Age'].dtype

In [None]:
#Create Age*Pclass variable
combine['AgePclass']=combine['Age']*combine['Pclass']
combine[['Age','Pclass','AgePclass']].head()

**FEATURE ENCODING**

As ML models require all input and output variables to be numeric, we need to encode all categorical data before we can fit the data to our models.

We have already encoded the Sex column, where 0=male and 1=female. We need to do the same for Title and Embarked. In addition, as with the Age column, we'll transform Fare into an ordinal variable rather than a continuous one.

In [None]:
combine=pd.get_dummies(combine,columns=['Title'])
combine=pd.get_dummies(combine, columns=['Embarked'], prefix='Em')
combine.head()

In [None]:
combine['FareGroup']=pd.cut(combine['Fare'],4)
combine[['FareGroup','Survived']].groupby(['FareGroup'], as_index=False).mean().sort_values(by='FareGroup')

The higher the fare, the better the chance of survival.

In [None]:
#Assign ordinal to each fare band
combine.loc[combine['Fare']<=1.56, 'Fare']=0
combine.loc[(combine['Fare']>1.56) & (combine['Fare']<=3.119), 'Fare']=1
combine.loc[(combine['Fare']>3.119) & (combine['Fare']<=4.678),'Fare']=2
combine.loc[combine['Fare']>4.678, 'Fare']=3

In [None]:
combine['Fare']=combine['Fare'].astype('int')

In [None]:
combine=combine.drop('FareGroup', axis=1)

In [None]:
combine.head()

In [None]:
#Copy the slices so later column assignments do not act on views of combine
train=combine[:len(train)].copy()
test=combine[len(train):].copy()

In [None]:
combine.info()
train.info()
test.info()

In [None]:
train=train.drop('PassengerId', axis=1)
train.head()

In [None]:
#Converting survived back to integer in train dataset
train['Survived']=train['Survived'].astype('int')
train.head()

In [None]:
test.head()

In [None]:
#Dropping survived column from test dataset
test=test.drop('Survived', axis=1)
test.head()

**ML MODELLING**
Predicting survival is a binary classification problem, so we need a classification model to learn from the training data and make predictions.
We chose a support vector machine (SVM) as our classifier.


**Splitting training data**

First, we need to split the training data into the independent variables, represented by X, and the dependent variable, represented by Y.

Y_train is the Survived column of the training set.
X_train contains the other columns of the training set, without the Survived column.
Our model will learn to classify survival (Y_train) from X_train and then make predictions on X_test.

In [None]:
X_train=train.drop('Survived', axis=1)
Y_train=train['Survived']
X_test=test.drop("PassengerId", axis=1).copy()

print("X_train shape: ", X_train.shape)
print("Y_train shape: ", Y_train.shape)
print("X_test shape: ", X_test.shape)

**Fit data to model and make predictions**

Step 1: Instantiate the model

Step 2: Fit the model to the training set

Step 3: Predict the test set

In [None]:
#SVM
svc=SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train)*100,2)
acc_svc

**Model evaluation + hyperparameter tuning**

After training our model, the next step is to assess its performance and tune it toward the highest prediction accuracy.

**Training accuracy**

Training accuracy shows how well the model has learned from the training set.
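
Training accuracy alone can be optimistic, because the model is scored on data it has already seen. The sketch below contrasts it with accuracy on a held-out split; it uses synthetic data from `make_classification` as a stand-in for the Titanic features, so the numbers it prints say nothing about this notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (not the Titanic features)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold out 25% of the rows for validation, keeping the class balance
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = SVC().fit(X_tr, y_tr)
train_acc = model.score(X_tr, y_tr)   # accuracy on data the model has seen
val_acc = model.score(X_val, y_val)   # accuracy on unseen data
print("Training accuracy:   %.3f" % train_acc)
print("Validation accuracy: %.3f" % val_acc)
```

A large gap between the two numbers is a common sign of overfitting.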

In [None]:
models = pd.DataFrame({'Model': ['Support Vector Machines'],
                       'Score': [acc_svc]})



**K-Fold Cross Validation**


In [None]:
# Classifier 

classifiers = []
classifiers.append(SVC())


len(classifiers)

In [None]:
# Creating a list which contains cross validation results for each classifier

cv_results = []
for classifier in classifiers:
    cv_results.append(cross_val_score(classifier, X_train, Y_train, scoring = 'accuracy', cv = 10))

In [None]:
cv_mean=[]
cv_std=[]
for cv_result in cv_results:
    cv_mean.append(cv_result.mean())
    cv_std.append(cv_result.std())
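
The per-fold scores are easier to read when collected in a small table. The sketch below shows one way to do that, using synthetic data from `make_classification` in place of X_train and Y_train; `cv_summary` is a hypothetical name introduced for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (not the Titanic features)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# One accuracy score per fold
scores = cross_val_score(SVC(), X, y, scoring='accuracy', cv=10)

# Mean and standard deviation across the folds in one row per model
cv_summary = pd.DataFrame({'Model': ['Support Vector Machines'],
                           'CV mean': [scores.mean()],
                           'CV std': [scores.std()]})
print(cv_summary)
```

The standard deviation indicates how much the accuracy varies from fold to fold, which the mean alone hides.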

In [None]:
param_grid={'C':[0.1,1,10,100],
            'gamma':[1,0.1,0.01,0.001,0.0001],
            'kernel':['rbf']}
grid=GridSearchCV(SVC(), param_grid, refit=True, verbose=3)
grid.fit(X_train, Y_train)

In [None]:
print("Best parameters: ", grid.best_params_)
print("Best estimator: ", grid.best_estimator_)
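
With `refit=True`, the fitted search object already holds a model retrained on all the data with the best parameters, available as `best_estimator_`. The sketch below illustrates this on synthetic data with a reduced grid; `search` and `best_svc` are hypothetical names, and the grid values are chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (not the Titanic features)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Reduced grid, for illustration only
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.1, 0.01], 'kernel': ['rbf']}
search = GridSearchCV(SVC(), param_grid, cv=5, refit=True)
search.fit(X, y)

# The refitted model can be used directly for prediction
best_svc = search.best_estimator_
preds = best_svc.predict(X[:5])
print("Best parameters:", search.best_params_)
print("Best CV accuracy: %.3f" % search.best_score_)
print("Predictions:", preds)
```

Using `best_estimator_` avoids copying parameter values by hand, which keeps the final model in sync with the search if the grid changes.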

In [None]:
#Refit the SVC with the best parameters found by the grid search and compute training accuracy
svc=SVC(C=100, gamma=0.01, kernel='rbf')
svc.fit(X_train, Y_train)
Y_pred=svc.predict(X_test)
acc_svc=round(svc.score(X_train, Y_train)*100,2)
acc_svc

In [None]:
# Cross Validation Mean Score
cross_val_score(svc, X_train, Y_train, scoring='accuracy', cv=10).mean()

In [None]:
Y_pred

In [None]:
len(Y_pred)

***Data submission***

In [None]:
ss.head()

In [None]:
ss.shape

In [None]:
submit_data=pd.DataFrame({'PassengerId': test['PassengerId'],
                        'Survived':Y_pred})
submit_data.head()

In [None]:
submit_data.shape

In [None]:
submit_data.to_csv("submission.csv", index=False)
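
Before uploading, it can be worth checking that the file has the expected layout. The helper below is a minimal sketch under the assumption that a valid submission has exactly the columns PassengerId and Survived, one row per test passenger, and only 0 or 1 as predictions; `check_submission` and the toy IDs are hypothetical.

```python
import pandas as pd

def check_submission(submission, expected_ids):
    """Return a list of problems found in a submission frame (empty list if none)."""
    problems = []
    if list(submission.columns) != ['PassengerId', 'Survived']:
        problems.append('unexpected columns')
        return problems
    if len(submission) != len(expected_ids):
        problems.append('row count mismatch')
    elif not (submission['PassengerId'].values == pd.Series(expected_ids).values).all():
        problems.append('PassengerId mismatch')
    if not submission['Survived'].isin([0, 1]).all():
        problems.append('Survived must be 0 or 1')
    return problems

# Toy IDs and predictions (made-up values)
toy_ids = [892, 893, 894]
good = pd.DataFrame({'PassengerId': toy_ids, 'Survived': [0, 1, 0]})
bad = pd.DataFrame({'PassengerId': toy_ids, 'Survived': [0, 2, 0]})
print(check_submission(good, toy_ids))
print(check_submission(bad, toy_ids))
```

For the real file, the call would be along the lines of `check_submission(submit_data, ss['PassengerId'])`.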
