In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats 
# The following imports are my own implementations of the classification algorithms, written in Python and NumPy
from MLAlgorithms.Supervised.Classification.knnclassifier import *
from MLAlgorithms.Supervised.Classification.logisticregression import *

ModuleNotFoundError: No module named 'MLAlgorithms'

The data set for this project was downloaded from the Kaggle website: https://www.kaggle.com/competitions/titanic

### Titanic Disaster

### Description of the Data set:

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

Of the columns in the data set, `Survived` is our response variable: 0 indicates that the passenger did not survive the accident and 1 indicates that the passenger survived it.

Let us read the data set using pandas.

In [None]:
titanic = pd.read_csv('titanic/train.csv')
response = titanic['Survived']

We can view the head of the data to see what the data attributes look like.

In [None]:
titanic.head()

Let us find out more about the data set.

In [None]:
titanic.info()

There are some missing values in the data set. We can count the missing values in each column.

In [None]:
titanic.isnull().sum()

Most of the missing values are in the Cabin and Age columns, and there are 2 missing values in Embarked. We can apply feature engineering to the categorical variables in the data. First, we drop the columns that we do not expect to help predict the survival of those on board.

In [None]:
dropped_cols = ['PassengerId', 'Name', 'Ticket', 'Cabin']

In [None]:
titanic.drop(dropped_cols, axis=1, inplace=True)

In [None]:
titanic.head()

After dropping Cabin, the remaining missing values are in the Age and Embarked columns. We handle Age first by filling the missing ages with the average age of the passengers on board, which seems a reasonable choice. The average age turns out to be about 30 years.

In [None]:
av_age = titanic.Age.mean()
av_age

In [None]:
titanic['Age']  = titanic['Age'].fillna(int(np.round(av_age)))
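The mean imputation above can be sketched on a small made-up frame. The `toy_ages` frame and its values are illustrative assumptions, not rows from the Kaggle file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Age column (not real Titanic rows)
toy_ages = pd.DataFrame({'Age': [22.0, 38.0, np.nan, 35.0, np.nan]})

# mean() skips NaN by default, so only the three known ages are averaged
toy_av_age = toy_ages['Age'].mean()

# Fill the gaps with the mean rounded to a whole number of years
toy_ages['Age'] = toy_ages['Age'].fillna(int(np.round(toy_av_age)))
print(toy_ages['Age'].tolist())
```

Because every missing age receives the same value, this keeps the column mean roughly unchanged but shrinks its spread, which is worth keeping in mind when interpreting Age later.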

There are some null values in Embarked; we will replace them with the most frequent value (the mode) of Embarked. The majority of the passengers boarded at Southampton, so we fill the 2 missing values with S for Southampton.

In [None]:
titanic.groupby('Embarked').count()

In [None]:
titanic['Embarked']  = titanic['Embarked'].fillna('S')
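Instead of hard-coding `'S'`, the most frequent port could be looked up with `Series.mode()`. This is a minimal sketch on made-up values; `toy_ports` is a hypothetical stand-in for the real column.

```python
import pandas as pd

# Hypothetical toy data standing in for the Embarked column
toy_ports = pd.DataFrame({'Embarked': ['S', 'C', None, 'S', 'Q']})

# mode() ignores missing values and returns the most frequent value(s);
# [0] takes the first one if there is a tie
most_common_port = toy_ports['Embarked'].mode()[0]

toy_ports['Embarked'] = toy_ports['Embarked'].fillna(most_common_port)
print(most_common_port, toy_ports['Embarked'].tolist())
```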

In [None]:
titanic.info()

There is no longer any missing value in the data set we will be working with.

In [None]:
titanic.describe()

In [None]:
titanic.groupby('Sex').count()

Age and Fare take values on a much larger scale, and with a much wider spread, than the other columns, so scaling will be useful. Let us first apply the above feature engineering to our test data.
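One common way to scale such columns is standardization (subtract the mean, divide by the standard deviation). The sketch below uses made-up Age and Fare values and is only one possible choice; the notebook itself does not state which scaling method it has in mind.

```python
import pandas as pd

# Hypothetical toy values for the two wide-ranging columns
toy_num = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0],
    'Fare': [7.25, 71.28, 7.93, 53.10],
})

# Standardize each column: zero mean and unit (sample) standard deviation
toy_scaled = (toy_num - toy_num.mean()) / toy_num.std()
print(toy_scaled.round(3))
```

When there is a separate test set, the mean and standard deviation would normally be computed on the training rows only and then reused for the test rows.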

### Define hypothesis (Null and Alternative)

The next step is to define the hypotheses to be tested. Each test has two parts: a null hypothesis and an alternative hypothesis. The null hypothesis is a statistical hypothesis which assumes that the difference in observations is due to random chance. It is denoted by $H_{0}$. The alternative hypothesis is the opposite of the null hypothesis: it assumes that the difference in observations is the result of a real effect. It is denoted by $H_{A}$ (also written $H_{1}$).

### First Hypothesis

$H_{0}$: There is no difference between the survival rates of male and female passengers.\
$H_{A}$: There is a difference between the survival rates of male and female passengers.

### Second Hypothesis

$H_{0}$: There is no relationship between survival and the age of those on board.\
$H_{A}$: Survival depends on the age of the passenger.

### Third Hypothesis

$H_{0}$: There is no relationship between survival and the socio-economic status of those on board.\
$H_{A}$: Survival depends on the socio-economic status of the passengers: first-class passengers survived at a higher rate than second- and third-class passengers.

In [None]:
female=titanic.loc[titanic.Sex=="female"]
male=titanic.loc[titanic.Sex=="male"]

Let us test the first hypothesis.

In [None]:
survived_female = female.Survived
survived_male = male.Survived

In [None]:
survived_group = titanic[titanic['Survived'] == 1]
sns.displot(survived_group.Sex,color='green')
plt.title('Survival of Female vs Male');

From the graph, we can already see that more female than male passengers survived.

In [None]:
print('Female', survived_female.mean())
print('Male', survived_male.mean())

This means that about 74% of the women on board survived the incident, while only about 19% of the men survived.

### Formal Significance Test 

Next, we will obtain our test statistic (the t-value) and the p-value, using the `ttest_ind()` function from the `scipy.stats` library.
We conduct a formal significance test for our first hypothesis at a significance level of 0.05 and discuss the result.

In [None]:
alpha=0.05
t_value, p_value = stats.ttest_ind(survived_female, survived_male)
print("t_value = ",t_value, ", p_value = ", p_value)
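To make clear what `ttest_ind()` computes with its default settings (equal variances assumed), here is a sketch that rebuilds the pooled two-sample t-statistic by hand and compares it with SciPy. The two 0/1 arrays are made-up values, not the real `survived_female` and `survived_male` series.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical 0/1 survival indicators for two small groups (made-up values)
group_a = np.array([1, 1, 0, 1, 1, 0, 1, 1])
group_b = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])

n_a, n_b = len(group_a), len(group_b)

# Pooled variance built from the two sample variances (ddof=1)
pooled_var = ((n_a - 1) * group_a.var(ddof=1)
              + (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)

# Difference in means divided by its estimated standard error
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value from the t distribution with n_a + n_b - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n_a + n_b - 2)

t_scipy, p_scipy = stats.ttest_ind(group_a, group_b)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

Since the mean of a 0/1 column is a survival rate, the statistic is effectively comparing the two survival proportions.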

In [None]:
if p_value < alpha:
    print("Conclusion: since p_value {} is less than alpha {}".format(p_value, alpha))
    print("Reject the null hypothesis that there is no difference between the survival of females and survival of males.")
else:
    print("Conclusion: since p_value {} is not less than alpha {}".format(p_value, alpha))
    print("Fail to reject the null hypothesis that there is no difference between the survival of females and survival of males.")
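Because both Sex and Survived are categorical, a chi-square test of independence is a common cross-check for a result like this. The sketch below is not part of the original analysis; it uses a made-up frame (`toy_sex`) only to show the call pattern with `pd.crosstab` and `scipy.stats.chi2_contingency`.

```python
import pandas as pd
import scipy.stats as stats

# Hypothetical toy data: 6 women and 8 men with made-up outcomes
toy_sex = pd.DataFrame({
    'Sex': ['female'] * 6 + ['male'] * 8,
    'Survived': [1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1],
})

# 2x2 contingency table of counts: rows are Sex, columns are Survived
table = pd.crosstab(toy_sex['Sex'], toy_sex['Survived'])

# Returns the statistic, the p-value, the degrees of freedom and the
# counts expected under independence
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
print(table)
print(chi2, p_chi2, dof)
```

With a table this small the chi-square approximation is poor, so the numbers here are not meaningful; on the full data the expected counts would be large enough for the test to be appropriate.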

### Suggestions for next steps in analyzing this data

We do not have the full data set: part of it was held back, so the conclusions should be tested again when additional Titanic data is available.

### Summary

As we only worked on a subset of the Titanic data, we need more data to draw more reliable insights.

We will now further clean the data for machine learning. 

In [None]:
titanic.head()

In [None]:
embarked_cat = titanic['Embarked'].unique().tolist()
embarked_cat

In [None]:
parch_cat = titanic['Parch'].unique().tolist()
parch_cat

In [None]:
sibsp_cat = titanic['SibSp'].unique().tolist()
sibsp_cat

In [None]:
pclass_cat = titanic['Pclass'].unique().tolist()
pclass_cat

In [None]:
sex_cat = titanic['Sex'].unique().tolist()
sex_cat

In [None]:
titanic['Sex'] = titanic['Sex'].apply(lambda x: 0 if x=='female' else 1)

In [None]:
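# One-hot encode the categorical columns (named `categorical_cols` because `vars` shadows a Python builtin)
categorical_cols = ['Pclass', 'Parch', 'SibSp', 'Embarked']
titanic = pd.get_dummies(data=titanic, columns=categorical_cols, dtype=int)
titanic.head()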
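To see what `pd.get_dummies` does to the column layout, here is a minimal sketch on a three-row made-up frame (`toy_cat` is hypothetical). Each encoded column is replaced by one 0/1 column per distinct value, appended after the untouched columns.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and two categorical columns
toy_cat = pd.DataFrame({
    'Age': [22, 38, 26],
    'Pclass': [3, 1, 3],
    'Embarked': ['S', 'C', 'S'],
})

# Only the listed columns are encoded; Age is left as it is
toy_encoded = pd.get_dummies(data=toy_cat, columns=['Pclass', 'Embarked'], dtype=int)
print(toy_encoded.columns.tolist())
print(toy_encoded)
```

Note that only values that actually occur get a column, so a category present in the test data but absent from the training data would produce mismatched columns.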

In [None]:
cor = titanic.corr().fillna(0)
cor

In [None]:
features = cor['Survived'].sort_values()
features

In [None]:
features.plot(kind='bar',figsize=(8,8))
plt.title('The correlation of the features');

In [None]:
plt.figure(figsize=(8,8))
sns.heatmap(cor,annot=False,cmap='RdYlGn')

In [None]:
titanic.drop('Survived', axis=1, inplace=True)

In [None]:
titanic.head()

In [None]:
y = np.array(response)

In [None]:
X_train = titanic.iloc[:600, :]
X_test = titanic.iloc[600:, :]
y_train = y[:600]
y_true = y[600:]
model = KNNClassifier(X_train, y_train, K=3)
#CV_n = model.loocv() # Apply leave-one-out cross-validation
#CV_n
predictions = model.predict(X_test)
#predictions, y_test = model.slice_cv(600)

In [None]:
model.accuracy(y_true)

In [None]:
count, values = model.error_count(y_true)
count
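Since the `MLAlgorithms` package is not importable here, the idea behind a k-nearest-neighbours classifier can be sketched in plain NumPy. The function `knn_predict` below is a hypothetical illustration under simple assumptions (Euclidean distance, majority vote, non-negative integer labels); it is not the author's `KNNClassifier` and does not reproduce its API.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Predict labels by majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)

    # Pairwise squared Euclidean distances, shape (n_test, n_train)
    diffs = X_test[:, None, :] - X_train[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)

    # Indices of the k closest training points for each test point
    nearest = np.argsort(dists, axis=1)[:, :k]

    # Majority vote over the neighbours' labels (assumes labels are 0, 1, 2, ...)
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

# Made-up points: one cluster near the origin (label 0), one near (5, 5) (label 1)
toy_X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_y_train = [0, 0, 0, 1, 1, 1]
toy_X_test = [[0.2, 0.1], [5.5, 5.5]]

toy_pred = knn_predict(toy_X_train, toy_y_train, toy_X_test, k=3)
print(toy_pred)
```

Because the vote depends on raw distances, features on a large scale (such as unscaled Fare) can dominate the result, which is one reason scaling matters for this model.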

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)
knn_predictions = knn.predict(X_test)
knn_predictions
# Share of test rows where the prediction matches the true label, in percent
accuracy = np.mean(knn_predictions == y_true) * 100
print(accuracy)
# Number of misclassified test rows
error_count = int(np.sum(knn_predictions != y_true))
error_count
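Accuracy and the error count do not say which kind of mistake the classifier makes. As a sketch, the four confusion-matrix counts can be derived with NumPy alone; the label arrays below are made up and stand in for `y_true` and `knn_predictions`.

```python
import numpy as np

# Hypothetical true labels and predictions (1 = survived, 0 = did not survive)
toy_true = np.array([1, 0, 1, 1, 0, 0])
toy_hat = np.array([1, 0, 0, 1, 1, 0])

true_pos = int(np.sum((toy_hat == 1) & (toy_true == 1)))   # predicted survived, did survive
true_neg = int(np.sum((toy_hat == 0) & (toy_true == 0)))   # predicted died, did die
false_pos = int(np.sum((toy_hat == 1) & (toy_true == 0)))  # predicted survived, actually died
false_neg = int(np.sum((toy_hat == 0) & (toy_true == 1)))  # predicted died, actually survived

toy_accuracy = (true_pos + true_neg) / len(toy_true)
print(true_pos, true_neg, false_pos, false_neg, toy_accuracy)
```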