<h2 style='color:purple' align='center'>Naive Bayes Classifier: Predicting survival in the Titanic disaster</h2>

## Import the necessary libraries

In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split

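## Load the `titanic` dataset

Download the CSV from the following URL and save it as `titanic.csv` in the working directory (the file name the next cell reads):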
https://raw.githubusercontent.com/kulkarni62sushil/Data/main/Titanic.csv


In [16]:
tita = pd.read_csv("titanic.csv")

## Exploratory Data Analysis (EDA)

**Q.** Display the first 10 rows of the table

In [17]:
tita.head(10)

Unnamed: 0,PassengerId,Name,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,1,"Braund, Mr. Owen Harris",3,male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,"Heikkinen, Miss. Laina",3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,female,35.0,1,0,113803,53.1,C123,S,1
4,5,"Allen, Mr. William Henry",3,male,35.0,0,0,373450,8.05,,S,0
5,6,"Moran, Mr. James",3,male,,0,0,330877,8.4583,,Q,0
6,7,"McCarthy, Mr. Timothy J",1,male,54.0,0,0,17463,51.8625,E46,S,0
7,8,"Palsson, Master. Gosta Leonard",3,male,2.0,3,1,349909,21.075,,S,0
8,9,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",3,female,27.0,0,2,347742,11.1333,,S,1
9,10,"Nasser, Mrs. Nicholas (Adele Achem)",2,female,14.0,1,0,237736,30.0708,,C,1


**Q.** Find the total number of rows and columns

In [18]:
tita.shape

(891, 12)

**Q.** List the features with their non-null counts and data types

In [19]:
tita.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Name         891 non-null    object 
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object 
 10  Embarked     889 non-null    object 
 11  Survived     891 non-null    int64  
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


**Q.** Compute summary statistics for the numeric columns

In [20]:
tita.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Survived
count,891.0,891.0,714.0,891.0,891.0,891.0,891.0
mean,446.0,2.308642,29.699118,0.523008,0.381594,32.204208,0.383838
std,257.353842,0.836071,14.526497,1.102743,0.806057,49.693429,0.486592
min,1.0,1.0,0.42,0.0,0.0,0.0,0.0
25%,223.5,2.0,20.125,0.0,0.0,7.9104,0.0
50%,446.0,3.0,28.0,0.0,0.0,14.4542,0.0
75%,668.5,3.0,38.0,1.0,0.0,31.0,1.0
max,891.0,3.0,80.0,8.0,6.0,512.3292,1.0


## Feature Engineering

**Q.** Drop the features that we assume have little impact on the survival status of passengers

In [21]:
tita.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked'],axis='columns',inplace=True)
tita.head()

Unnamed: 0,Pclass,Sex,Age,Fare,Survived
0,3,male,22.0,7.25,0
1,1,female,38.0,71.2833,1
2,3,female,26.0,7.925,1
3,1,female,35.0,53.1,1
4,3,male,35.0,8.05,0


**Q.** Split the dataset into `input` and `target`: `target` contains only the `Survived` column, and `input` contains all the other features

In [22]:
input = tita.drop('Survived',axis='columns')
target = tita['Survived']

In [23]:
input.head()

Unnamed: 0,Pclass,Sex,Age,Fare
0,3,male,22.0,7.25
1,1,female,38.0,71.2833
2,3,female,26.0,7.925
3,1,female,35.0,53.1
4,3,male,35.0,8.05


**Q.** Convert the categorical variable `Sex` into indicator (dummy) variables and store them in `dummies`

In [24]:
dummies=pd.get_dummies(input['Sex'])
dummies.head(3)

Unnamed: 0,female,male
0,0,1
1,1,0
2,1,0
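The two indicator columns are complements of each other, so a single column carries the same information. The sketch below uses a small made-up `Sex` column (not the Titanic file) to show `pd.get_dummies` with and without `drop_first=True`; the names `toy`, `both`, and `single` are illustrative only.

```python
import pandas as pd

# Toy data standing in for the `Sex` column (illustrative, not the real dataset).
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# One indicator column per category: `female` and `male`.
both = pd.get_dummies(toy["Sex"], dtype=int)

# drop_first=True drops the first category in sorted order ("female"),
# leaving a single `male` column that still encodes the variable fully.
single = pd.get_dummies(toy["Sex"], drop_first=True, dtype=int)

print(both)
print(single)
```

Whether to keep one or both columns is a modelling choice; this notebook keeps both.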


**Q.** Concatenate `input` and `dummies` column-wise and assign the result to `input`

In [25]:
input = pd.concat([input,dummies],axis='columns')
input.head(3)

Unnamed: 0,Pclass,Sex,Age,Fare,female,male
0,3,male,22.0,7.25,0,1
1,1,female,38.0,71.2833,1,0
2,3,female,26.0,7.925,1,0


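**Q.** Drop the original `Sex` column, since the `female` and `male` indicator columns now encode it. (One indicator column would be enough to represent male vs. female, because each is the complement of the other.)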

In [26]:
input.drop(['Sex'],axis='columns',inplace=True)
input.head(3)

Unnamed: 0,Pclass,Age,Fare,female,male
0,3,22.0,7.25,0,1
1,1,38.0,71.2833,1,0
2,3,26.0,7.925,1,0


**Q.** Check whether any feature contains `null` values and, if so, locate one or two of them

In [30]:
input.columns[input.isna().any()]

Index(['Age'], dtype='object')

In [61]:
input.isnull().sum()

Pclass      0
Age       177
Fare        0
female      0
dtype: int64
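To locate individual missing entries rather than just count them, a boolean mask from `isna()` can be used to index the row labels. This is a minimal sketch on a made-up series; `ages`, `missing_mask`, and `missing_positions` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `Age` column with two missing entries.
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0], name="Age")

# Boolean mask: True where the value is missing.
missing_mask = ages.isna()

# Row labels of the missing entries; slice to look at the first two.
missing_positions = ages.index[missing_mask].tolist()
first_two = missing_positions[:2]

print(first_two)
```

The same pattern should work on the notebook's data, for example `input.index[input.Age.isna()][:2]`.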

**Q.** Print the first 10 values of `Age`

In [31]:
input.Age[:10]

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64

**Observe that there are `null` (NaN) values in the `Age` feature**

**Q.** Fill the missing `Age` values with the mean of the `Age` column

In [34]:
input.Age=input.Age.fillna(input.Age.mean())
input.head(6)

Unnamed: 0,Pclass,Age,Fare,female,male
0,3,22.0,7.25,0,1
1,1,38.0,71.2833,1,0
2,3,26.0,7.925,1,0
3,1,35.0,53.1,1,0
4,3,35.0,8.05,0,1
5,3,29.699118,8.4583,0,1
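The effect of mean imputation is easy to verify on a tiny example. The sketch below uses made-up values: `Series.mean()` skips NaN by default, and `fillna` then replaces each missing entry with that mean.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values (illustrative only).
ages = pd.Series([20.0, np.nan, 40.0, np.nan])

# mean() ignores NaN, so this is (20 + 40) / 2.
mean_age = ages.mean()

# Replace every NaN with the column mean.
filled = ages.fillna(mean_age)

print(filled.tolist())
```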


**Q.** Check again for `null` values

In [35]:
input.isnull().sum()

Pclass    0
Age       0
Fare      0
female    0
male      0
dtype: int64

In [36]:
input.head()

Unnamed: 0,Pclass,Age,Fare,female,male
0,3,22.0,7.25,0,1
1,1,38.0,71.2833,1,0
2,3,26.0,7.925,1,0
3,1,35.0,53.1,1,0
4,3,35.0,8.05,0,1


## Naive Bayesian Classifier using `GaussianNB`

**Q.** Split the data into training and test sets using `sklearn`

In [49]:
# train_test_split was imported at the top of the notebook
# Hold out 20% of the rows for testing and train on the remaining 80%
X_train, X_test, Y_train, Y_test = train_test_split(input,target,test_size=0.2)

In [39]:
# Check length of X_train
len(X_train)

712

In [40]:
# Check length of X_test
len(X_test)

179

In [42]:
# total of input
len(input)

891
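Because `train_test_split` shuffles rows randomly, each run produces a different split. A possible way to make the split repeatable is the `random_state` parameter, optionally with `stratify` to keep the class ratio in both parts. The sketch below uses synthetic arrays; the seed value and variable names are arbitrary choices, not taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples, 2 features, 15 of class 0 and 5 of class 1.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# A fixed seed makes the shuffle repeatable; stratify keeps the 3:1 class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Same arguments, same seed: the split is identical.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(len(X_tr), len(X_te))
```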

**Q.** Use the `GaussianNB` class from `sklearn.naive_bayes` and assign the classifier to `model`

In [50]:
# Gaussian Naive Bayes assumes each continuous feature is normally distributed (bell-shaped) within each class
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
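The Gaussian assumption mentioned in the comment above can be written out as follows. This is the standard textbook form of Gaussian Naive Bayes; the symbols $\mu_{y,i}$ and $\sigma_{y,i}^2$ (the mean and variance of feature $i$ within class $y$) are notation introduced here, not names used elsewhere in this notebook.

$$
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp\left(-\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}\right)
$$

$$
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The product over features is the "naive" part: features are treated as conditionally independent given the class. Note that the model applies the same Gaussian form to every column it is given, including the 0/1 indicator columns.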

**Q.** Train the model using the `fit` method

In [51]:
model.fit(X_train,Y_train)

GaussianNB()

**Q.** Measure the accuracy of the model on the test set

In [52]:
model.score(X_test,Y_test)

0.7932960893854749
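For a classifier, `score` returns the accuracy: the fraction of samples whose predicted label matches the true label. The sketch below checks this on synthetic data (two Gaussian blobs generated with an arbitrary seed); for brevity it scores on the training data, which is for illustration only and not a proper evaluation.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data: two Gaussian blobs (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),
    rng.normal(3.0, 1.0, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)
pred = clf.predict(X)

# Accuracy computed by hand: share of predictions equal to the true label.
manual = (pred == y).mean()

print(manual, clf.score(X, y), accuracy_score(y, pred))
```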

**Q.** Inspect the first 10 samples of `X_test` and `Y_test`

In [54]:
X_test[0:10]

Unnamed: 0,Pclass,Age,Fare,female,male
326,3,61.0,6.2375,0,1
8,3,27.0,11.1333,1,0
644,3,0.75,19.2583,1,0
344,2,36.0,13.0,0,1
301,3,29.699118,23.25,0,1
842,1,30.0,31.0,1,0
353,3,25.0,17.8,0,1
430,1,28.0,26.55,0,1
555,1,62.0,26.55,0,1
159,3,29.699118,69.55,0,1


In [77]:
Y_test[0:10]

114    0
780    1
672    0
173    0
386    0
763    1
510    1
770    0
75     0
885    0
Name: Survived, dtype: int64

**Q.** Predict the labels for the first 10 rows of `X_test` and compare them with the true labels in `Y_test`

In [55]:
model.predict(X_test[0:10])

array([0, 1, 1, 0, 0, 1, 0, 0, 0, 0])

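**NOTE** The score is about 79%, so the model will make some mistakes. Because `train_test_split` shuffles the rows randomly, re-running the split and retraining gives a slightly different score each time; it does not systematically improve the model.

**Q.** Find the predicted probability of each class for the first 10 test samples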

In [57]:
model.predict_proba(X_test[:10])

array([[0.98621996, 0.01378004],
       [0.08746186, 0.91253814],
       [0.0392486 , 0.9607514 ],
       [0.98071942, 0.01928058],
       [0.9912729 , 0.0087271 ],
       [0.00934787, 0.99065213],
       [0.9909878 , 0.0090122 ],
       [0.91963962, 0.08036038],
       [0.87490346, 0.12509654],
       [0.97611441, 0.02388559]])

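**NOTE** Each row holds two probabilities that sum to 1: the first column is the probability of class 0 (did not survive) and the second is the probability of class 1 (survived). The model predicts the class with the larger probability. For example, the first row (0.98621996, 0.01378004) is predicted as not survived (0), while the second row (0.08746186, 0.91253814) is predicted as survived (1).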
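How the columns of `predict_proba` map to class labels can be checked on a small example. In the sketch below (made-up one-feature data), the columns follow the order of `clf.classes_`, each row sums to 1, and picking the column with the largest probability reproduces `predict`.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy one-feature data: class 0 clusters near 1, class 1 clusters near 5.
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.3], [4.7]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = GaussianNB().fit(X, y)

X_new = np.array([[1.1], [4.9]])
proba = clf.predict_proba(X_new)

# Column j of `proba` is the probability of class clf.classes_[j].
labels = clf.classes_[proba.argmax(axis=1)]

print(clf.classes_)
print(proba)
print(labels)
```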