# Titanic Survival Prediction Model
`GitHub: hellopavi`


This project builds a machine learning model that predicts passenger survival on the Titanic using the Gaussian Naive Bayes algorithm. It covers loading the dataset, preprocessing the data, training the model, and evaluating its performance.

---

## Step 1: Import Libraries

First, we need to import the necessary libraries: `pandas` for data manipulation and `numpy` for numerical operations.



In [233]:
import pandas as pd
import numpy as np

## Step 2: Load the Dataset

We'll load the dataset from a CSV file using pandas. The file contains a cleaned version of the Titanic passenger data.

In [234]:
dataset = pd.read_csv('train_clean.csv')


## Step 3: Explore the Dataset

Let's examine the shape of the dataset and take a quick look at the first few rows.

In [235]:
print(dataset.shape)
dataset.head(3)

(891, 14)


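|   | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket | Title | Family_Size |
|---|-----|-------|----------|------|------|-------|-------------|--------|-----|-------|----------|--------|-------|-------------|
| 0 | 22.0 | NaN | S | 7.25 | Braund, Mr. Owen Harris | 0 | 1 | 3 | male | 1 | 0.0 | A/5 21171 | Mr | 1 |
| 1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 1 | female | 1 | 1.0 | PC 17599 | Mrs | 1 |
| 2 | 26.0 | NaN | S | 7.925 | Heikkinen, Miss. Laina | 0 | 3 | 3 | female | 0 | 1.0 | STON/O2. 3101282 | Miss | 0 |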


## Step 4: Dataset Summary

We can get a summary of the dataset, which includes the data types of each column and the count of non-null values.

In [236]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          891 non-null    float64
 1   Cabin        204 non-null    object 
 2   Embarked     891 non-null    object 
 3   Fare         891 non-null    float64
 4   Name         891 non-null    object 
 5   Parch        891 non-null    int64  
 6   PassengerId  891 non-null    int64  
 7   Pclass       891 non-null    int64  
 8   Sex          891 non-null    object 
 9   SibSp        891 non-null    int64  
 10  Survived     891 non-null    float64
 11  Ticket       891 non-null    object 
 12  Title        891 non-null    object 
 13  Family_Size  891 non-null    int64  
dtypes: float64(3), int64(5), object(6)
memory usage: 97.6+ KB
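
The non-null counts above are what reveal missing data (here only `Cabin` is incomplete). As a minimal sketch, the same check can be made explicit with `isna().sum()`. The small `toy` frame below is a hypothetical stand-in, not the real `train_clean.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (assumption, not the real file).
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Sex": ["male", "female", "female", "female"],
})

# Count missing values per column; a count of 0 means the column is complete.
missing_counts = toy.isna().sum()
print(missing_counts)
```

On the real dataset, `dataset.isna().sum()` would report the same information as the "Non-Null Count" column, expressed as the number of missing entries instead.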


## Step 5: Encode the 'Sex' Column

Most scikit-learn estimators expect numeric input, so we convert the 'Sex' column to binary values, mapping 'female' to 0 and 'male' to 1.

In [237]:
# Identify unique values in the 'Sex' column
gender_set = set(dataset['Sex'])
print(gender_set)


{'female', 'male'}


In [238]:
# Map 'Sex' column to binary values: 'female' to 0 and 'male' to 1
dataset['Sex'] = dataset['Sex'].map({'female':0 , 'male':1})
dataset.head(3)

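|   | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket | Title | Family_Size |
|---|-----|-------|----------|------|------|-------|-------------|--------|-----|-------|----------|--------|-------|-------------|
| 0 | 22.0 | NaN | S | 7.25 | Braund, Mr. Owen Harris | 0 | 1 | 3 | 1 | 1 | 0.0 | A/5 21171 | Mr | 1 |
| 1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 1 | 0 | 1 | 1.0 | PC 17599 | Mrs | 1 |
| 2 | 26.0 | NaN | S | 7.925 | Heikkinen, Miss. Laina | 0 | 3 | 3 | 0 | 0 | 1.0 | STON/O2. 3101282 | Miss | 0 |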
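
One thing to keep in mind with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes `NaN`. That is why checking the unique values first is useful. The sketch below illustrates this on a hypothetical series with a deliberately mis-cased entry.

```python
import pandas as pd

# Hypothetical example data; "Female" (capitalised) is intentionally not in the mapping.
sex = pd.Series(["male", "female", "Female", "male"])

# Values missing from the dictionary are mapped to NaN.
encoded = sex.map({"female": 0, "male": 1})

# Collect the original values that failed to map, so they can be cleaned up.
unmapped = sex[encoded.isna()]
print(encoded)
print(unmapped)
```

In this notebook the set of unique values is exactly `{'female', 'male'}`, so the mapping leaves no gaps.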


## Step 6: Feature Selection

In this step, we'll drop the columns we won't use, keeping `Age`, `Fare`, `Pclass`, and `Sex` as the features (`x`) and `Survived` as the target variable (`y`).

In [239]:
# Check the columns
dataset.columns

Index(['Age', 'Cabin', 'Embarked', 'Fare', 'Name', 'Parch', 'PassengerId',
       'Pclass', 'Sex', 'SibSp', 'Survived', 'Ticket', 'Title', 'Family_Size'],
      dtype='object')

In [240]:
# Drop irrelevant columns and prepare the feature matrix 'x' and target vector 'y'
x = dataset.drop(['Name', 'Parch', 'PassengerId','SibSp', 'Survived','Family_Size', 'Cabin', 'Embarked','Ticket', 'Title'], axis=1)
y = dataset['Survived'].values

# Print the first 10 rows of 'x' and the first 10 values of 'y'
print(x[0:10],y[:10])

    Age     Fare  Pclass  Sex
0  22.0   7.2500       3    1
1  38.0  71.2833       1    0
2  26.0   7.9250       3    0
3  35.0  53.1000       1    0
4  35.0   8.0500       3    1
5  30.0   8.4583       3    1
6  54.0  51.8625       1    1
7   2.0  21.0750       3    1
8  27.0  11.1333       3    0
9  14.0  30.0708       2    0 [0. 1. 1. 1. 0. 0. 0. 0. 1. 1.]
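
As an alternative to listing every column to drop, the features can be selected by name. This is only a sketch on a hypothetical two-row frame (`toy` and `feature_cols` are illustrative names); it yields the same feature matrix as the `drop` call as long as the kept columns appear in the same order.

```python
import pandas as pd

# Hypothetical miniature of the dataset, with only a few of its columns.
toy = pd.DataFrame({
    "Age": [22.0, 38.0],
    "Fare": [7.25, 71.2833],
    "Pclass": [3, 1],
    "Sex": [1, 0],
    "Name": ["Passenger A", "Passenger B"],
    "Survived": [0.0, 1.0],
})

# Select the features explicitly instead of dropping everything else.
feature_cols = ["Age", "Fare", "Pclass", "Sex"]
x_toy = toy[feature_cols]

# 'Survived' is stored as float; casting to int gives cleaner class labels.
y_toy = toy["Survived"].astype(int).values

print(x_toy)
print(y_toy)
```

Selecting by name keeps the feature list visible in one place, which can make it easier to add or remove a feature later.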


In [241]:
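# 'Age' has no missing values here (891 non-null), so imputation is not needed.
# For reference: fillna returns a new Series; assign it back to 'x' to keep the result.
x.columns[x.isna().any()]   # columns that contain missing values (none in this case)
x.Age.fillna(x.Age.mean())  # displayed only; 'x' itself is unchanged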

Unnamed: 0,Age
0,22.0
1,38.0
2,26.0
3,35.0
4,35.0
...,...
886,27.0
887,19.0
888,22.0
889,26.0
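
Because `fillna` returns a new Series rather than modifying the original, the result has to be assigned back when imputation is actually needed. The following sketch shows this on a hypothetical three-row frame that contains one missing age.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing age (the real 'Age' column has none).
ages = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})

# mean() skips NaN by default, so the fill value is (20 + 40) / 2 = 30.
filled = ages["Age"].fillna(ages["Age"].mean())

# The original column is untouched until the result is assigned back.
still_missing = int(ages["Age"].isna().sum())

ages["Age"] = filled
print(still_missing, ages["Age"].tolist())
```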


## Step 7: Split the Data

We will split the data into training and testing sets, which will allow us to train the model and then evaluate its performance on unseen data.

In [242]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
x_train , x_test , y_train , y_test = train_test_split(x,y,test_size=0.20,random_state=0)
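
With `test_size=0.20`, one fifth of the rows are held out for testing. As an optional variation that the notebook does not use, `stratify=y` keeps the class proportions the same in both splits. The sketch below uses hypothetical synthetic arrays to show the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, 15 of class 0 and 5 of class 1.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# stratify=y is an assumption added here; the notebook splits without it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.sum(), y_test.sum())  # number of class-1 samples in each split
```

Fixing `random_state` makes the split reproducible, which is why rerunning the notebook gives the same accuracy each time.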

## Step 8: Train, Predict, and Evaluate the Model

Finally, we'll train the Gaussian Naive Bayes model, make predictions on the test set, and evaluate the model's accuracy.

In [243]:
# Initialize and train the Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(x_train,y_train)
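
Gaussian Naive Bayes models each feature within each class as a normal distribution, so fitting amounts to estimating per-class means, variances, and class priors. The sketch below fits the model on a hypothetical one-feature dataset and inspects the learned means (`theta_`) and priors (`class_prior_`).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical one-feature dataset with two well-separated classes.
X = np.array([[1.0], [3.0], [10.0], [14.0]])
y = np.array([0, 0, 1, 1])

nb = GaussianNB().fit(X, y)

# theta_ holds the mean of each feature per class; class_prior_ the class frequencies.
print(nb.theta_)
print(nb.class_prior_)

# New points are assigned to the class whose Gaussian explains them best.
pred = nb.predict(np.array([[2.5], [11.0]]))
print(pred)
```

The same attributes are available on the notebook's `model` after `model.fit(x_train, y_train)`, with one row per class and one column per feature.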

In [244]:
# Predict outcomes for the test set
y_pred = model.predict(x_test)
# Pair each prediction (left column) with the actual label (right column)
comp = np.concatenate((y_pred.reshape(len(y_pred),1),y_test.reshape(len(y_test),1)),1).astype(int)
print(comp[:10])


[[0 0]
 [0 0]
 [0 0]
 [1 1]
 [1 1]
 [0 1]
 [1 1]
 [1 1]
 [1 1]
 [1 1]]
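
In the output above, the left column is the prediction and the right column is the actual label, so the sixth row (`[0 1]`) is a misclassification. As a sketch, `np.column_stack` builds the same side-by-side array more compactly, and `np.flatnonzero` lists the positions where the two disagree. The arrays here are hypothetical.

```python
import numpy as np

# Hypothetical predictions and true labels (floats, like the notebook's 'Survived' values).
y_pred = np.array([0.0, 0.0, 1.0, 0.0, 1.0])
y_true = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

# Stack the two 1-D arrays as columns: [prediction, actual].
comp = np.column_stack((y_pred, y_true)).astype(int)

# Indices of the samples the model got wrong.
mismatches = np.flatnonzero(y_pred != y_true)

print(comp)
print(mismatches)
```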


In [245]:
# Calculate and print the accuracy of the model
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test,y_pred)*100
print(f'Accuracy : {accuracy:.2f}')

Accuracy : 78.21
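
Accuracy is simply the fraction of test samples whose prediction matches the true label. The sketch below, on hypothetical labels, computes it by hand, compares it with `accuracy_score`, and adds a confusion matrix to show which kind of error occurred.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 8 samples, one of which is misclassified.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 1])

# Manual accuracy: share of matching entries, as a percentage.
manual_accuracy = (y_true == y_pred).mean() * 100
sklearn_accuracy = accuracy_score(y_true, y_pred) * 100

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(f"Accuracy : {manual_accuracy:.2f}")
print(cm)
```

A confusion matrix is a useful complement on the Titanic data, since accuracy alone does not show whether the model errs more on survivors or on non-survivors.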


### **`- by Pavithran`**






