## Project Goal: Train a logistic regression classifier to predict the survival of passengers in the Titanic dataset


In the dataset, each row holds information about one Titanic passenger, such as their name, gender, and class.

The target column is 'Survived', which indicates whether that passenger survived.

The goal is to train a logistic regression model that takes the input features and predicts whether a passenger with those features would survive.
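
As a rough illustration of what such a model computes, the sketch below applies the sigmoid function to a weighted sum of features and thresholds the result at 0.5. The weights, bias, and passenger values are hypothetical, chosen only for illustration; they are not the parameters learned later in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for three features (Pclass, Sex, Age).
# These are assumptions for illustration, not learned values.
weights = np.array([-1.0, -2.5, -0.03])
bias = 4.0

# One made-up passenger: 3rd class, male (encoded as 1), 22 years old.
passenger = np.array([3.0, 1.0, 22.0])

# Linear score, then squash it into a probability.
z = weights @ passenger + bias
probability = sigmoid(z)

# Predict class 1 (survived) when the probability is at least 0.5.
prediction = int(probability >= 0.5)
print(f"P(survived) = {probability:.3f}, predicted class = {prediction}")
```

Training amounts to choosing the weights and bias so that these probabilities match the 'Survived' labels as closely as possible.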

In [None]:
from IPython.display import clear_output

In [None]:
%pip install gdown==4.5

clear_output()

In [None]:
!gdown 18YfCgT3Rk7uYWrUzgjb2UR3Nyo9Z68bK  # Download the csv file.

Downloading...
From: https://drive.google.com/uc?id=18YfCgT3Rk7uYWrUzgjb2UR3Nyo9Z68bK
To: /content/titanic.csv
100% 60.3k/60.3k [00:00<00:00, 58.2MB/s]


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
titanic_data = pd.read_csv('titanic.csv')

In [None]:
titanic_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Prepare the data

In [None]:
# Drop the columns that are not used as features: Name, Ticket, Cabin, PassengerId
titanic_data = titanic_data.drop(columns=['Name', 'Ticket', 'Cabin', 'PassengerId'])
titanic_data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [None]:
# Encode the Sex feature as an integer: 1 for male and 0 for female
titanic_data['Sex'] = titanic_data['Sex'].astype('category').cat.codes

In [None]:
# Encode Embarked as integer codes (0, 1, 2, ...) instead of letters; any missing value becomes -1
titanic_data['Embarked'] = titanic_data['Embarked'].astype('category').cat.codes
titanic_data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,2
1,1,1,0,38.0,1,0,71.2833,0
2,1,3,0,26.0,0,0,7.925,2
3,1,1,0,35.0,1,0,53.1,2
4,0,3,1,35.0,0,0,8.05,2
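
To make the encoding step concrete, here is a small sketch of how `astype('category').cat.codes` assigns integers. It uses a toy column rather than the real data, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy column, not the real data: three known ports and one missing value.
embarked = pd.Series(['S', 'C', np.nan, 'Q'])

as_category = embarked.astype('category')
codes = as_category.cat.codes

# Categories are sorted, so the mapping is 0 -> C, 1 -> Q, 2 -> S.
mapping = dict(enumerate(as_category.cat.categories))
print(mapping)

# A missing value belongs to no category and is encoded as -1.
print(codes.tolist())
```

Because a missing entry turns into the integer -1, an encoded column no longer counts as null in `info()` or `dropna()`. If the original column had any missing entries, they would need to be handled before encoding.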


In [None]:
titanic_data.info()
# Age has null values (714 non-null out of 891), so those rows are removed in the next cell

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    int8   
 3   Age       714 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
 7   Embarked  891 non-null    int8   
dtypes: float64(2), int64(4), int8(2)
memory usage: 43.6 KB


In [None]:
# Remove the rows that contain null values
titanic_data = titanic_data.dropna()

In [None]:
# Check that no null values remain in the data
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 714 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  714 non-null    int64  
 1   Pclass    714 non-null    int64  
 2   Sex       714 non-null    int8   
 3   Age       714 non-null    float64
 4   SibSp     714 non-null    int64  
 5   Parch     714 non-null    int64  
 6   Fare      714 non-null    float64
 7   Embarked  714 non-null    int8   
dtypes: float64(2), int64(4), int8(2)
memory usage: 40.4 KB
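
A quick way to see which columns are responsible for the dropped rows is to count nulls per column before calling `dropna()`. The sketch below does this on a toy frame (made-up values, not the Titanic data).

```python
import numpy as np
import pandas as pd

# Toy frame with two missing ages; the values are made up.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.28, 7.93, 8.05],
})

# Number of null entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one null value.
cleaned = toy.dropna()
print(len(toy), '->', len(cleaned))

# The surviving rows keep their original index labels.
print(cleaned.index.tolist())
```

The retained index labels explain why the output above reports 714 entries with an index that still runs from 0 to 890.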


In [None]:
# Split the data into inputs and output
data_y = titanic_data['Survived']  # the output (labels)
data_x = titanic_data.drop(columns=['Survived'])  # the inputs (features)

In [None]:
data_x.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,1,22.0,1,0,7.25,2
1,1,0,38.0,1,0,71.2833,0
2,3,0,26.0,0,0,7.925,2
3,1,0,35.0,1,0,53.1,2
4,3,1,35.0,0,0,8.05,2


In [None]:
data_y.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

Using the scikit-learn library for training

In [None]:
from sklearn.model_selection import train_test_split
# Split the data into train (used to fit the model) and test (used to check the model's performance)
X_train, X_test, y_train, y_test = train_test_split(data_x,data_y,test_size = 0.2,shuffle=True)
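
The split above is shuffled without a fixed seed, so the rows in each part, and therefore the scores, can change from run to run. A minimal sketch of a reproducible variant, assuming that passing `random_state` and `stratify` is acceptable for this project, is shown on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for data_x / data_y: 100 rows, 40% positive labels.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 40 + [0] * 60)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
split_a = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
split_b = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# Both calls return the same test rows because the seed is the same.
print(np.array_equal(split_a[1], split_b[1]))

# The test labels keep the 40% positive rate of the full data.
print(y_te.mean())
```

The seed value 0 is arbitrary; any fixed integer gives a repeatable split.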

In [None]:
from sklearn.linear_model import LogisticRegression
# train the model
model = LogisticRegression(max_iter=2500).fit(X_train,y_train)
# Predict on the test set to compare the predicted values with the actual ones
y_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
# calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 81.82%


In [None]:
correct_mask = (y_pred == y_test)
correct_count = correct_mask.sum()
total_count = len(y_pred)

print(f'The model correctly predicted {correct_count} out of {total_count} test samples')
print(f'The Accuracy is {(correct_count/total_count):.3f} ({correct_count/total_count*100:.2f}%)')

The model correctly predicted 117 out of 143 test samples
The Accuracy is 0.818 (81.82%)


In [None]:
from sklearn.metrics import confusion_matrix
# Calculate the confusion matrix
confusion_matrix(y_test, y_pred)

array([[76,  8],
       [18, 41]])
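
The matrix can be read by hand: in scikit-learn, rows are the actual classes and columns are the predicted classes. The sketch below copies the numbers from the output above and derives accuracy, precision, and recall for class 1 (survived) from them.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
cm = np.array([[76, 8],
               [18, 41]])

# Flattened row by row: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # correct predictions over all samples
precision = tp / (tp + fp)        # predicted survivors who did survive
recall = tp / (tp + fn)           # actual survivors the model found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

The 18 false negatives are survivors that the model predicted as not surviving, which is why recall for class 1 is noticeably lower than its precision.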

In [None]:
from sklearn.metrics import classification_report
# Calculate the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.90      0.85        84
           1       0.84      0.69      0.76        59

    accuracy                           0.82       143
   macro avg       0.82      0.80      0.81       143
weighted avg       0.82      0.82      0.81       143

