<a href="https://colab.research.google.com/github/gndede/python/blob/main/TitanicData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#http://campus.lakeforest.edu/frank/FILES/MLFfiles/Bio150/Titanic/TitanicMETA.pdf
# linear algebra
import numpy as np
# data processing
import pandas as pd

# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB

In [None]:
test_df = pd.read_csv("/content/test.csv")
train_df = pd.read_csv("/content/train.csv")

The training set has 891 examples and 11 features plus the target variable (Survived). Of the 12 columns, 2 are floats, 5 are integers and 5 are objects. Below I have listed the features with a short description:

In [None]:
'''survival:    Survival
PassengerId: Unique Id of a passenger.
pclass:    Ticket class
sex:    Sex
Age:    Age in years
sibsp:    # of siblings / spouses aboard the Titanic
parch:    # of parents / children aboard the Titanic
ticket:    Ticket number
fare:    Passenger fare
cabin:    Cabin number
embarked:    Port of Embarkation'''
train_df.info()

In [None]:
train_df.describe()

In [None]:
train_df.head(10)

In [None]:
'''
From the table above, we can note a few things.
First of all, we need to convert a lot of features
into numeric ones later on, so that the machine learning
algorithms can process them.
Furthermore, we can see that the features have widely different ranges,
which we will need to convert into roughly the same scale.
We can also spot some features that contain
missing values (NaN = not a number), which we need to deal with.
'''
#Let’s take a more detailed look at what data is actually missing:
total = train_df.isnull().sum().sort_values(ascending=False)
percent_1 = train_df.isnull().sum()/train_df.isnull().count()*100

percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])
missing_data.head(5)

**Embarked**
The Embarked feature has only 2 missing values, which can easily be filled. The ‘Age’ feature, with 177 missing values, will be trickier to deal with. The ‘Cabin’ feature needs further investigation, but we might want to drop it from the dataset, since 77% of its values are missing.

In [None]:
train_df.columns.values

In [None]:
#To me it would make sense if everything except ‘PassengerId’, ‘Ticket’
#and ‘Name’ would be correlated with a high survival rate.
survived = 'survived'
not_survived = 'not survived'
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(10, 4))
women = train_df[train_df['Sex']=='female']
men = train_df[train_df['Sex']=='male']

ax = sns.distplot(women[women['Survived']==1].Age.dropna(), bins=18, label = survived, ax = axes[0], kde =False)
ax = sns.distplot(women[women['Survived']==0].Age.dropna(), bins=40, label = not_survived, ax = axes[0], kde =False)
ax.legend()
ax.set_title('Female/Survived or Not')

ax = sns.distplot(men[men['Survived']==1].Age.dropna(), bins=18, label = survived, ax = axes[1], kde = False)
ax = sns.distplot(men[men['Survived']==0].Age.dropna(), bins=40, label = not_survived, ax = axes[1], kde = False)
ax.legend()

_ = ax.set_title('Male/Survived or Not')

You can see that men have a high probability of survival when they are between 18 and 30 years old. This is partly true for women as well, but for women the survival chances are higher across a wider range, between 14 and 40.

In [None]:
#Embarked, Pclass and Sex:
# 'height' replaces the 'size' argument, which was renamed in seaborn 0.9
grid = sns.FacetGrid(train_df, row='Embarked', height=4.5, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette=None,  order=None, hue_order=None )
grid.add_legend()

In [None]:
sns.barplot(x='Pclass', y='Survived', data=train_df)

In [None]:
#Pclass is contributing to a person's chance of survival,
#especially if this person is in class 1.
#We will create another pclass plot below.
# 'height' replaces the 'size' argument, which was renamed in seaborn 0.9
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();

The plot above confirms our assumption about passenger class 1, but we can also spot a high probability that a person in passenger class 3 will not survive.

In [None]:
'''SibSp and Parch would make more sense as a combined feature
that shows the total number of relatives a person has on the Titanic.
We will create it below, and also a feature that shows if someone is not alone.'''
data = [train_df, test_df]
for dataset in data:
    dataset['relatives'] = dataset['SibSp'] + dataset['Parch']
    # not_alone is 1 when the passenger travels with at least one relative, else 0
    dataset['not_alone'] = (dataset['relatives'] > 0).astype(int)
train_df['not_alone'].value_counts()

In [None]:
# factorplot was renamed to catplot in seaborn 0.9; kind='point' keeps the point plot
axes = sns.catplot(x='relatives', y='Survived',
                   data=train_df, kind='point', aspect=2.5)

Here we can see that you had a high probability of survival with 1 to 3 relatives, but a lower one if you had no relatives or more than 3 (except for some cases with 6 relatives).


In [None]:
#drop ‘PassengerId’ from the train set, because
#it does not contribute to a person's survival probability.
train_df = train_df.drop(['PassengerId'], axis=1)

**Cabin:**
As a reminder, we have to deal with Cabin (687), Embarked (2) and Age (177). At first I thought we would have to delete the ‘Cabin’ variable, but then I found something interesting. A cabin number looks like ‘C123’, and the letter refers to the deck. Therefore we’re going to extract these letters and create a new feature that contains a person's deck. Afterwards we will convert the feature into a numeric variable. Missing cabins are marked with the placeholder deck ‘U’, and any deck letter that is not in the mapping is converted to zero. The actual decks of the Titanic range from A to G.

In [None]:
import re
deck = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "U": 8}
data = [train_df, test_df]

for dataset in data:
    dataset['Cabin'] = dataset['Cabin'].fillna("U0")
    dataset['Deck'] = dataset['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
    dataset['Deck'] = dataset['Deck'].map(deck)
    dataset['Deck'] = dataset['Deck'].fillna(0)
    dataset['Deck'] = dataset['Deck'].astype(int)
# we can now drop the cabin feature
train_df = train_df.drop(['Cabin'], axis=1)
test_df = test_df.drop(['Cabin'], axis=1)

**Age:**
Now we can tackle the issue of the Age feature's missing values.
I will create an array of random integers drawn from the range between the mean age minus one standard deviation and the mean age plus one standard deviation, with one value for each missing entry (is_null).

In [None]:
data = [train_df, test_df]

# use the training-set statistics for both datasets
mean = train_df["Age"].mean()
std = train_df["Age"].std()

for dataset in data:
    is_null = dataset["Age"].isnull().sum()

    # draw one random integer age per missing value from [mean - std, mean + std)
    rand_age = np.random.randint(int(mean - std), int(mean + std), size=is_null)

    # fill NaN values in Age column with random values generated
    age_slice = dataset["Age"].copy()
    age_slice[np.isnan(age_slice)] = rand_age
    # convert each dataset's own filled ages to int
    dataset["Age"] = age_slice.astype(int)
train_df["Age"].isnull().sum()

In [None]:
#Embarked:
#Since the Embarked feature has only 2 missing values,
#we will just fill these with the most common one.
train_df['Embarked'].describe()

In [None]:
common_value = 'S'
data = [train_df, test_df]

for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].fillna(common_value)

In [None]:
train_df.info()

Above you can see that ‘Fare’ is a float and we have to deal with 4 categorical features:
Name, Sex, Ticket and Embarked. Let's investigate and transform them one after another.

In [None]:
#Fare:
#Converting “Fare” from float to int64, using the “astype()” function pandas provides:
data = [train_df, test_df]

for dataset in data:
    dataset['Fare'] = dataset['Fare'].fillna(0)
    dataset['Fare'] = dataset['Fare'].astype(int)

In [None]:
#Name:
#We will use the Name feature to extract the Titles
#from the Name, so that we can build a new feature out of that.
data = [train_df, test_df]
titles = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

for dataset in data:
    # extract titles
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
    # replace titles with a more common title or as Rare
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr',\
                                            'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    # convert titles into numbers
    dataset['Title'] = dataset['Title'].map(titles)
    # filling NaN with 0, to be safe
    dataset['Title'] = dataset['Title'].fillna(0)
train_df = train_df.drop(['Name'], axis=1)
test_df = test_df.drop(['Name'], axis=1)

In [None]:
#Convert 'Sex' feature into numeric
genders = {"male": 0, "female": 1}
data = [train_df, test_df]

for dataset in data:
    dataset['Sex'] = dataset['Sex'].map(genders)

In [None]:
#Ticket:
train_df['Ticket'].describe()
#Since the Ticket attribute has 681 unique tickets,
#it will be a bit tricky to convert them into useful categories.


In [None]:
#So we will drop it from the dataset.
train_df = train_df.drop(['Ticket'], axis=1)
test_df = test_df.drop(['Ticket'], axis=1)

In [None]:
#Convert "Embarked" feature into numeric
ports = {"S": 0, "C": 1, "Q": 2}
data = [train_df, test_df]

for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].map(ports)

# **Creating Categories:**
We will now create categories within the following features:

In [None]:
#Age:
'''Now we need to convert the ‘Age’ feature.
First we will convert it from float into integer.
Then we will replace every age with the number of its age group.
Note that it is important to pay attention
to how you form these groups, since you don’t
want, for example, 80% of your data to fall into group 1.'''
data = [train_df, test_df]
for dataset in data:
    dataset['Age'] = dataset['Age'].astype(int)
    dataset.loc[ dataset['Age'] <= 11, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 11) & (dataset['Age'] <= 18), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 18) & (dataset['Age'] <= 22), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 22) & (dataset['Age'] <= 27), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 27) & (dataset['Age'] <= 33), 'Age'] = 4
    dataset.loc[(dataset['Age'] > 33) & (dataset['Age'] <= 40), 'Age'] = 5
    # every age above 40 falls into group 6
    dataset.loc[ dataset['Age'] > 40, 'Age'] = 6

# let's see how it's distributed
train_df['Age'].value_counts()

In [None]:
#train_df['Age']

In [None]:
'''Fare:
For the ‘Fare’ feature, we need to do the same as with the ‘Age’ feature.
But it isn’t that easy, because if we cut the range of the fare values
into a few equally wide categories, 80% of the values would fall into the
first category. Fortunately, we can use the pandas “qcut()” function
to see how we can form the categories.'''
train_df.head(10)
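
To make the difference between equally wide bins and quantile-based bins concrete, here is a minimal sketch. The `fares` values are made up for illustration and only mimic the skew of the real ‘Fare’ column; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical, right-skewed fares standing in for train_df['Fare'].
fares = pd.Series([7.0, 7.25, 7.9, 8.05, 9.5, 13.0,
                   14.45, 26.0, 31.0, 52.0, 80.0, 263.0])

# Equal-width bins: the range is split evenly, so most values
# pile up in the first bin.
equal_width = pd.cut(fares, bins=4)

# Quantile bins: the edges are chosen so that each bin holds
# roughly the same number of rows.
quantile_bins = pd.qcut(fares, q=4)

print(equal_width.value_counts().sort_index())
print(quantile_bins.value_counts().sort_index())
```

The interval edges printed for `quantile_bins` are the kind of thresholds one could then hard-code in `dataset.loc[...]` conditions.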

In [None]:
data = [train_df, test_df]

for dataset in data:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[(dataset['Fare'] > 31) & (dataset['Fare'] <= 99), 'Fare']   = 3
    dataset.loc[(dataset['Fare'] > 99) & (dataset['Fare'] <= 250), 'Fare']   = 4
    dataset.loc[ dataset['Fare'] > 250, 'Fare'] = 5
    dataset['Fare'] = dataset['Fare'].astype(int)

In [None]:
data = [train_df, test_df]
for dataset in data:
    dataset['Age_Class']= dataset['Age']* dataset['Pclass']

In [None]:
for dataset in data:
    dataset['Fare_Per_Person'] = dataset['Fare']/(dataset['relatives']+1)
    dataset['Fare_Per_Person'] = dataset['Fare_Per_Person'].astype(int)
# Let's take a last look at the training set, before we start training the models.
train_df.head(10)

**Building the Machine Learning Models**



In [None]:
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId", axis=1).copy()

**Stochastic Gradient Descent (SGD)**
Stochastic gradient descent is an optimization algorithm widely used in machine learning applications to find the model parameters that correspond to the best fit between predicted and actual outputs. It's an inexact but powerful technique.
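
To give a feeling for what happens inside such an optimizer, here is a minimal sketch of stochastic gradient descent for a logistic regression, written with NumPy only. It is not how `SGDClassifier` is implemented; the toy data, the learning rate `lr` and the number of epochs are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two features, label is 1 when their sum is positive.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)   # weights
b = 0.0           # bias
lr = 0.1          # learning rate (assumed value)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(20):
    # Visit the samples in a new random order each epoch.
    for i in rng.permutation(len(X)):
        p = sigmoid(X[i] @ w + b)
        # Gradient of the log loss for this single sample.
        error = p - y[i]
        w -= lr * error * X[i]
        b -= lr * error

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
print(f"training accuracy: {accuracy:.2f}")
```

Each update uses a single sample instead of the whole dataset, which is what makes the method "stochastic" and inexact, but cheap per step.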

In [None]:
#Stochastic gradient descent classifier
sgd = linear_model.SGDClassifier(max_iter=5, tol=None)
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)

# accuracy on the training set, in percent
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd

In [None]:
#Random Forest Classifier
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)

Y_prediction = random_forest.predict(X_train)

random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train)*100,2)
acc_random_forest

In [None]:
#Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)

Y_pred = logreg.predict(X_train)
acc_log = round(logreg.score(X_train, Y_train)*100, 2)
acc_log

In [None]:
#KNN K Nearest Neighbors
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, Y_train)

Y_pred = knn.predict(X_train)

acc_knn = round(knn.score(X_train, Y_train)*100, 2)
acc_knn

In [None]:
#Gaussian Naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)

Y_pred = gaussian.predict(X_train)

acc_gaussian = round(gaussian.score(X_train, Y_train)*100, 2)
acc_gaussian

In [None]:
#Decision Tree Classifier
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)

Y_pred = decision_tree.predict(X_train)

acc_decision_tree = round(decision_tree.score(X_train, Y_train)*100, 2)
acc_decision_tree

In [None]:
#Linear Support Vector Machine
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)

Y_pred = linear_svc.predict(X_train)

acc_linear_svc = round(linear_svc.score(X_train, Y_train)*100,2)
acc_linear_svc

In [None]:
#Perceptron Classifier

'''A Perceptron is an algorithm for supervised learning
of binary classifiers. This algorithm enables neurons
to learn and processes elements in the training set
one at a time. There are two types of Perceptrons:
Single layer and Multilayer. Single layer Perceptrons can
learn only linearly separable patterns.'''
perceptron = Perceptron(max_iter=5)
perceptron.fit(X_train, Y_train)

Y_pred = perceptron.predict(X_train)

acc_perceptron = round(perceptron.score(X_train, Y_train)*100, 2)
acc_perceptron
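
The "one element at a time" learning described above can be sketched with the classic perceptron update rule. This is a toy illustration and not scikit-learn's implementation; the data (a logical AND with labels -1 and +1) is an assumption picked because it is linearly separable.

```python
import numpy as np

# Hypothetical, linearly separable toy data (logical AND), labels in {-1, +1}.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])

w = np.zeros(2)   # weights
b = 0.0           # bias

for epoch in range(100):
    mistakes = 0
    for xi, yi in zip(X, y):
        # A sample is misclassified when the sign of its score
        # disagrees with its label (a score of 0 counts as a mistake).
        if yi * (xi @ w + b) <= 0:
            w += yi * xi
            b += yi
            mistakes += 1
    if mistakes == 0:
        # Every sample is on the correct side of the boundary.
        break

predictions = np.where(X @ w + b > 0, 1, -1)
print(predictions)
```

On data that is not linearly separable the loop would never reach zero mistakes, which is why the sketch caps the number of epochs.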

In [None]:
results = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Random Forest', 'Naive Bayes', 'Perceptron',
              'Stochastic Gradient Descent',
              'Decision Tree'],
    'Score': [acc_linear_svc, acc_knn, acc_log,
              acc_random_forest, acc_gaussian, acc_perceptron,
              acc_sgd, acc_decision_tree]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
result_df.head(9)

In [None]:
'''As we can see, the Random Forest classifier takes first place.
Note that these scores are accuracies on the training set itself.
So first, let us check how the random forest performs when we use cross validation.'''

K-FOLD CROSS VALIDATION:
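
A minimal sketch of K-fold cross validation with scikit-learn's `cross_val_score` could look like this. To keep it runnable on its own, it uses a synthetic dataset from `make_classification` as a stand-in for `X_train` and `Y_train`; the choice of 10 folds and the `random_state` values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / Y_train so the sketch runs on its own.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)

# cv=10 splits the data into 10 folds; the model is trained on 9 folds
# and scored on the held-out fold, once for each fold.
scores = cross_val_score(rf, X, y, cv=10, scoring="accuracy")

print("Scores:", scores.round(2))
print("Mean:", scores.mean().round(3))
print("Standard deviation:", scores.std().round(3))
```

In the notebook itself, `X` and `y` would be replaced by `X_train` and `Y_train`. The mean gives a more realistic accuracy estimate than the training-set score, and the standard deviation shows how much that estimate varies between folds.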

