# Titanic - Machine Learning

This project walks through the main stages of a data analysis workflow, from loading and cleaning the data to training and evaluating a model. We use the Titanic passenger data from the Kaggle competition:
https://www.kaggle.com/competitions/titanic/overview

You can download the dataset and read the description of every column here: https://www.kaggle.com/competitions/titanic/data

# Importing the dependencies


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Reading the data


In [3]:
data = pd.read_csv('titanic.csv')

In [4]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [6]:
data.shape

(891, 12)

In [10]:
data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,13.002015,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,22.0,0.0,0.0,7.9104
50%,446.0,0.0,3.0,29.699118,0.0,0.0,14.4542
75%,668.5,1.0,3.0,35.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


# Data Preprocessing  

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


# Dealing with missing data

There are three common ways to handle missing values. Choose the one that suits each column:

*   Delete the rows that contain missing values.
*   Delete the whole column that contains missing values.
*   Replace the missing values with a substitute (mean, median, mode, or a constant).
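
The three options above can be sketched on a small toy frame. This is only an illustration: the frame `toy` and its values are made up and are not part of the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in two columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# Option 1: drop every row that has at least one missing value.
rows_dropped = toy.dropna(axis=0)

# Option 2: drop every column that has at least one missing value.
cols_dropped = toy.dropna(axis=1)

# Option 3: fill the gaps, here with the column mean of a numeric column.
filled = toy.assign(Age=toy["Age"].fillna(toy["Age"].mean()))

print(rows_dropped.shape)          # only the fully populated row survives
print(list(cols_dropped.columns))  # only the column without gaps survives
print(filled["Age"].isna().sum())  # no missing Age values remain
```

Dropping rows or columns throws information away, so it tends to suit columns with very few or very many gaps, while filling keeps every row at the cost of introducing estimated values.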






In [8]:
data.isnull().sum()

Unnamed: 0,0
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Ticket,0
Fare,0
Cabin,687
Embarked,2


Three columns contain missing values: **Age, Cabin, Embarked**. <br>
In the Age column, we will fill the missing values with the mean. This is a simple and quick way to handle missing data: it keeps the column's mean unchanged, although it slightly reduces the spread of the values.

In [11]:

# Fill the missing values in Age with the mean of the Age column.
# 'fillna' is the simplest option; SimpleImputer from scikit-learn is an alternative.
# The result is assigned back to the column because the chained form
# data['Age'].fillna(..., inplace=True) raises a FutureWarning and stops working in pandas 3.0.
data['Age'] = data['Age'].fillna(data['Age'].mean())


In [14]:
data['Age'].isnull().sum()

np.int64(0)
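
As an alternative to `fillna`, the same imputation can be done with scikit-learn's `SimpleImputer`. The sketch below uses a made-up `toy` frame rather than the Titanic data, and also shows the side effect of mean imputation on the spread of the column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy Age column (hypothetical values) with two gaps.
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0, np.nan]})

# Mean imputation: every gap receives the mean of the observed values.
mean_imputer = SimpleImputer(strategy="mean")
toy["Age_mean"] = mean_imputer.fit_transform(toy[["Age"]]).ravel()

# Median imputation: less sensitive to extreme ages.
median_imputer = SimpleImputer(strategy="median")
toy["Age_median"] = median_imputer.fit_transform(toy[["Age"]]).ravel()

# The mean is preserved, but the standard deviation shrinks,
# because the filled values all sit exactly at the centre.
print(toy["Age"].mean(), toy["Age_mean"].mean())
print(toy["Age"].std(), toy["Age_mean"].std())
```

An imputer object is mostly useful when the same fill value has to be reused later, for example on a separate test set.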

Only 204 of the 891 rows have a Cabin value, so roughly 77% of this column is missing. We will therefore drop it from the dataset.

In [15]:
data.drop(['Cabin'], axis=1, inplace=True)

In [16]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


In the Embarked column, there are only two missing values. Let's see what the categories in this column are.

In [17]:
data['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [19]:
data['Embarked'].value_counts()

Unnamed: 0_level_0,count
Embarked,Unnamed: 1_level_1
S,644
C,168
Q,77


In [20]:
data['Embarked'].mode()[0]

'S'

In [21]:
data['Embarked'] = data['Embarked'].fillna('S')

In [22]:
data['Embarked'].isnull().sum()

np.int64(0)

# Drop useless columns

PassengerId and Name are treated here as identifiers that do not affect the probability of survival, and the Ticket column has no clear relationship to survival either, so all three will be dropped:

In [23]:
data = data.drop(['PassengerId','Name','Ticket'], axis=1)

In [24]:
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


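SibSp: number of siblings / spouses aboard the Titanic <br>
Parch: number of parents / children aboard the Titanic

Both columns count relatives travelling with the passenger, so after inspecting them we combine them into a single family feature.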

In [25]:
data['SibSp'].value_counts()

Unnamed: 0_level_0,count
SibSp,Unnamed: 1_level_1
0,608
1,209
2,28
4,18
3,16
8,7
5,5


In [26]:
data['Parch'].value_counts()

Unnamed: 0_level_0,count
Parch,Unnamed: 1_level_1
0,678
1,118
2,80
5,5
3,5
4,4
6,1


In [28]:
data['family'] = data['SibSp']+data['Parch']
data.drop(['SibSp','Parch'], axis=1,inplace=True)

In [29]:
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,family
0,0,3,male,22.0,7.25,S,1
1,1,1,female,38.0,71.2833,C,1
2,1,3,female,26.0,7.925,S,0
3,1,1,female,35.0,53.1,S,1
4,0,3,male,35.0,8.05,S,0


# Dealing with duplicates

In [32]:
data.duplicated().sum()

np.int64(0)

In [31]:
data.drop_duplicates(inplace=True)
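
One caveat is worth illustrating. Once identifier columns are removed, two different passengers can end up with identical rows, and `drop_duplicates` will then remove one of them. The sketch below uses a made-up frame (`toy` and its values are hypothetical) to show the difference.

```python
import pandas as pd

# Hypothetical rows: passengers 1 and 3 are different people with the same features.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "male"],
    "Fare": [8.05, 53.10, 8.05],
})

# With the identifier present, no row is a duplicate.
dupes_with_id = toy.duplicated().sum()

# Without the identifier, the third row repeats the first.
features = toy.drop(columns=["PassengerId"])
dupes_without_id = features.duplicated().sum()

# drop_duplicates keeps the first occurrence and removes the repeat.
deduplicated = features.drop_duplicates()

print(dupes_with_id, dupes_without_id, len(deduplicated))
```

Whether such rows should be removed is a judgment call. Checking for duplicates before dropping the identifier columns is one way to separate true repeated records from passengers who merely share the same feature values.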

# Encode categorical columns

The Sex and Embarked columns contain text, which cannot be passed directly to the machine learning model, so we need to replace the text values with numerical values.

In the Sex column we will replace all male values with 0 and all female values with 1. <br>
We will do the same in the Embarked column: S becomes 1, C becomes 2, and Q becomes 3.

In [39]:

data.replace({'Sex':{'male':0,'female':1}}, inplace=True)


In [40]:
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,family
0,0,3,0,22.0,7.25,1,1
1,1,1,1,38.0,71.2833,2,1
2,1,3,1,26.0,7.925,1,0
3,1,1,1,35.0,53.1,1,1
4,0,3,0,35.0,8.05,1,0


In [41]:
# Assign the result back instead of calling replace(..., inplace=True) on the column,
# which raises a FutureWarning and stops working in pandas 3.0.
data['Embarked'] = data['Embarked'].replace({'S':1,'C':2,'Q':3})


In [42]:
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,family
0,0,3,0,22.0,7.25,1,1
1,1,1,1,38.0,71.2833,2,1
2,1,3,1,26.0,7.925,1,0
3,1,1,1,35.0,53.1,1,1
4,0,3,0,35.0,8.05,1,0
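
Mapping the ports to 1, 2, 3 implies an order between them that the data does not really have. One common alternative, shown here only as a sketch on a made-up frame, is one-hot encoding with `pd.get_dummies`, which creates one 0/1 column per category.

```python
import pandas as pd

# Toy column (hypothetical values) with the three embarkation ports.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per port; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(toy, columns=["Embarked"], prefix="Embarked", dtype=int)

print(one_hot)
```

Each row has exactly one indicator set to 1, so no artificial ranking between S, C and Q is introduced. For a linear model such as logistic regression this can matter, while tree-based models are usually less sensitive to the choice.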


# Data analysis

# Split the dataset

**Separating features & Target** <br><br>
Separating features and target so that we can prepare the data for training machine learning models. In the Titanic dataset, the Survived column is the target variable, and the other columns are the features.

In [49]:
x = data.drop(columns=['Survived'])
y = data['Survived']

**Splitting the data into training data & Testing data**

To build and evaluate a machine learning model effectively, it's essential to split the dataset into training and testing sets. The training set is used to train the model, allowing it to learn patterns and relationships within the data. The testing set is used to evaluate the model's performance on unseen data, which shows whether it generalizes to new instances. The split does not prevent overfitting by itself, but it makes overfitting visible and provides a more reliable estimate of the model's predictive accuracy.

In [50]:

from sklearn.model_selection import train_test_split

# Split the data into training data & Testing data using train_test_split function :
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2)
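
Because survivors are the minority class, a purely random split can leave the two sets with slightly different survival rates. A possible refinement, sketched here on synthetic data (`x_demo` and `y_demo` are made up), is the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 40% positive class (hypothetical values).
rng = np.random.default_rng(0)
x_demo = pd.DataFrame({"feature": rng.normal(size=100)})
y_demo = pd.Series([1] * 40 + [0] * 60, name="Survived")

# stratify=y_demo preserves the 40/60 class ratio in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=2, stratify=y_demo
)

print(len(x_tr), len(x_te))
print(y_tr.mean(), y_te.mean())
```

With `test_size=0.2`, 80 rows go to training and 20 to testing, and both parts keep the same share of positive labels.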

In [51]:
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,family
0,0,3,0,22.0,7.25,1,1
1,1,1,1,38.0,71.2833,2,1
2,1,3,1,26.0,7.925,1,0
3,1,1,1,35.0,53.1,1,1
4,0,3,0,35.0,8.05,1,0


# Normalization


In [53]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

x_train = scaler.fit_transform(x_train)

In [54]:
x_test = scaler.transform(x_test)
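
Strictly speaking, `StandardScaler` standardizes each feature: it subtracts the mean and divides by the standard deviation. The two cells above fit the scaler on the training data only and reuse its statistics for the test data, so no information from the test set leaks into training. A minimal sketch with made-up numbers (`train`, `test` and `demo_scaler` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

demo_scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data.
train_scaled = demo_scaler.fit_transform(train)

# transform reuses those training statistics; nothing is learned from the test data.
test_scaled = demo_scaler.transform(test)

print(demo_scaler.mean_, demo_scaler.scale_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

The scaled training column has mean 0 and standard deviation 1, while the scaled test column generally does not, which is expected because it is measured against the training statistics.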

# Model Training

Model training is a crucial step in machine learning, where the algorithm learns from the training data to make predictions. **Logistic Regression** is a commonly used algorithm for binary classification tasks, such as predicting whether a passenger survived in the Titanic dataset. By training the model on our training data, we aim to find the parameters that minimize prediction errors. Once trained, the model can be used to predict outcomes on new, unseen data.

In [55]:

from sklearn.linear_model import LogisticRegression

# Create a Logistic Regression model and Train it on the training data:

model = LogisticRegression()
model.fit(x_train, y_train)
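
To make the "best-fit parameters" more concrete, the sketch below trains a logistic regression on a tiny made-up dataset (`X_demo`, `y_demo` and `clf` are illustrative names) and rebuilds its predicted probabilities by hand from the learned coefficient and intercept.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data: larger values tend to belong to class 1.
X_demo = np.array([[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]])
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X_demo, y_demo)

# The model scores each row as intercept + coefficient * x,
# then squashes the score into a probability with the sigmoid function.
z = clf.intercept_ + X_demo @ clf.coef_.ravel()
manual_proba = 1.0 / (1.0 + np.exp(-z))

# predict_proba returns one column per class; column 1 is the positive class.
sk_proba = clf.predict_proba(X_demo)[:, 1]

print(clf.coef_, clf.intercept_)
print(np.allclose(manual_proba, sk_proba))
```

On the Titanic model, the same attributes `model.coef_` and `model.intercept_` hold one weight per feature. Since the features were standardized, the size of each weight gives a rough indication of how strongly that feature influences the prediction.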



# Model Evaluation

Model evaluation is crucial in machine learning to assess the performance of a trained model on testing data. The **accuracy score**, a common evaluation metric, measures the proportion of correct predictions out of all predictions. This helps to gauge the model's effectiveness, ensure it generalizes well to new data, and guide further improvements.

In [56]:


from sklearn.metrics import accuracy_score

# accuracy on testing data
x_test_prediction = model.predict(x_test)
test_data_accuracy = accuracy_score(y_test, x_test_prediction)
print('Accuracy score of testing data : ', test_data_accuracy)


Accuracy score of testing data :  0.75
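
The accuracy score is simply the share of predictions that match the true labels. The sketch below recomputes it by hand on made-up labels (`y_true` and `y_pred` are hypothetical, not the notebook's results) and adds a confusion matrix, which shows which kind of mistake the model makes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for eight passengers.
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Accuracy by hand: fraction of positions where prediction equals truth.
manual_accuracy = (y_true == y_pred).mean()
sk_accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)

print(manual_accuracy, sk_accuracy)
print(cm)
```

Computing the same score on the training data with `model.predict(x_train)` and comparing it to the test score is a quick way to spot overfitting: a training accuracy far above the test accuracy suggests the model has memorized the training set.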
