## Naive Bayes
**Naïve Bayes** is a supervised machine learning algorithm used for classification. It applies Bayes' Theorem to compute the probability of each class given the observed features, under the "naive" assumption that the features are conditionally independent given the class. For example, it can classify an email as spam or not spam based on the words in the email and how likely those words are to appear in spam. Naïve Bayes is simple yet effective, and it is widely used in applications such as sentiment analysis and text classification.
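
To make the spam example concrete, here is a minimal sketch of the Bayes' Theorem calculation. All priors and word likelihoods are made-up numbers chosen for illustration, and `posterior_spam` is a hypothetical helper, not part of any library.

```python
from math import prod

# Hypothetical class priors and per-word likelihoods P(word | class).
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.30, "win": 0.20},
    "ham": {"free": 0.02, "win": 0.01},
}

def posterior_spam(words):
    """Return P(spam | words), treating words as independent given the class."""
    # Unnormalized score per class: prior times the product of word likelihoods.
    score = {
        label: prior[label] * prod(likelihood[label][w] for w in words)
        for label in prior
    }
    # Normalize so the two class probabilities sum to 1.
    return score["spam"] / (score["spam"] + score["ham"])

p_spam = posterior_spam(["free", "win"])
print(round(p_spam, 4))
```

With these assumed numbers the email is classified as spam, because both words are far more likely under the spam class than under the non-spam class.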

In [3]:
import pandas as pd
df = pd.read_csv("titanic.csv")
df.head()

Unnamed: 0,PassengerId,Name,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,1,"Braund, Mr. Owen Harris",3,male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,"Heikkinen, Miss. Laina",3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,female,35.0,1,0,113803,53.1,C123,S,1
4,5,"Allen, Mr. William Henry",3,male,35.0,0,0,373450,8.05,,S,0


In [4]:
# First we drop the columns that we assume have little effect on the survival rate:
df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked'],axis='columns',inplace=True)
df.head()

Unnamed: 0,Pclass,Sex,Age,Fare,Survived
0,3,male,22.0,7.25,0
1,1,female,38.0,71.2833,1
2,3,female,26.0,7.925,1
3,1,female,35.0,53.1,1
4,3,male,35.0,8.05,0


In [6]:
# Let's separate the target variable:
target = df.Survived
target.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

In [9]:
# To create the set of independent (input) feature variables, we drop the target (Survived) column:
inputs = df.drop(["Survived"], axis = "columns")
inputs.head()

Unnamed: 0,Pclass,Sex,Age,Fare
0,3,male,22.0,7.25
1,1,female,38.0,71.2833
2,3,female,26.0,7.925
3,1,female,35.0,53.1
4,3,male,35.0,8.05


In [10]:
# The Sex column contains text, so we need to convert it into dummy (one-hot) columns.
# Here we use the 'get_dummies' function to generate the dummy variables:
dummies = pd.get_dummies(inputs.Sex)
dummies.head(3)


Unnamed: 0,female,male
0,0,1
1,1,0
2,1,0


In [11]:
# Next we should join the dummy columns with the input DataFrame:
inputs = pd.concat([inputs,dummies],axis='columns')
inputs.head(3)

Unnamed: 0,Pclass,Sex,Age,Fare,female,male
0,3,male,22.0,7.25,0,1
1,1,female,38.0,71.2833,1,0
2,3,female,26.0,7.925,1,0


In [14]:
# Next we drop the Sex column, because it is no longer needed.
# In addition to the Sex column, we drop one of the dummy columns, since it is redundant. The result is our feature matrix 'X':
X = inputs.drop(["Sex", "male"], axis = "columns")
X.head()

Unnamed: 0,Pclass,Age,Fare,female
0,3,22.0,7.25,0
1,1,38.0,71.2833,1
2,3,26.0,7.925,1
3,1,35.0,53.1,1
4,3,35.0,8.05,0
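
The encoding steps above (build dummies, join, drop `Sex` and `male`) can also be collapsed into a single expression. The sketch below uses a hypothetical three-row frame that mirrors the first rows shown earlier; `inputs_demo`, `two_step` and `one_step` are illustrative names, and `dtype=int` is passed so the dummies are 0/1 integers regardless of pandas version.

```python
import pandas as pd

# Hypothetical mini-frame mirroring the first three passengers.
inputs_demo = pd.DataFrame(
    {"Pclass": [3, 1, 3], "Sex": ["male", "female", "female"]}
)

# Route used in the notebook: create dummies, join them, drop Sex and one dummy.
dummies_demo = pd.get_dummies(inputs_demo.Sex, dtype=int)
two_step = pd.concat([inputs_demo, dummies_demo], axis="columns").drop(
    ["Sex", "male"], axis="columns"
)

# One-step alternative: derive the single 'female' indicator directly.
one_step = inputs_demo.drop(columns="Sex").assign(
    female=(inputs_demo.Sex == "female").astype(int)
)

print(two_step.equals(one_step))
```

Either route leaves one binary column for the two-valued `Sex` feature, which is all the model needs.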


In [15]:
# We also check whether any column in the dataset contains missing (NaN) values:
inputs.columns[inputs.isna().any()]

Index(['Age'], dtype='object')

In [17]:
# This tells us that the Age column has missing values, so we fill them with the mean of the entire Age column:
X.Age = X.Age.fillna(X.Age.mean())
X.head()

Unnamed: 0,Pclass,Age,Fare,female
0,3,22.0,7.25,0
1,1,38.0,71.2833,1
2,3,26.0,7.925,1
3,1,35.0,53.1,1
4,3,35.0,8.05,0
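
As a small illustration of what mean imputation does, the sketch below applies the same `fillna` call to a hypothetical five-value series (the ages are made up). In the notebook's own data, the repeated Age value 29.699118 that appears in the train and test samples further down is the column mean filled in by this step.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
ages = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0], name="Age")

# Series.mean() skips NaN by default, so the mean uses only the known ages.
mean_age = ages.mean()
filled = ages.fillna(mean_age)

print(mean_age)
print(filled.isna().sum())
```

Every missing entry receives the same value, so the column mean is unchanged while the column's variance shrinks.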


In [39]:
# Now let's call train_test_split method to split the dataset into train and test samples:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,target,test_size=0.2)

In [19]:
# The number of train samples:
len(X_train)

712

In [20]:
# The number of test samples:
len(X_test)

179

In [21]:
# The total number of samples:
len(X)

891
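
The 712/179 split above follows from `test_size=0.2` applied to 891 rows. Because the call in the notebook does not fix a seed, each run produces a different split and therefore a slightly different score. The sketch below, on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical), shows one way to make the split reproducible with `random_state` and to keep the class ratio similar in both parts with `stratify`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the Titanic data.
X_demo = np.arange(891 * 2).reshape(891, 2)
y_demo = np.arange(891) % 2  # alternating 0/1 labels, purely illustrative

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,
    random_state=42,   # fixed seed: the same split on every run
    stratify=y_demo,   # keep the 0/1 ratio similar in train and test
)

print(len(X_tr), len(X_te))
```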

In [33]:
# You can also print the train and test samples individually. Let's see the train samples:
X_train.head(15)

Unnamed: 0,Pclass,Age,Fare,female
205,3,2.0,10.4625,1
464,3,29.699118,8.05,0
144,2,18.0,11.5,0
168,1,29.699118,25.925,0
618,2,4.0,39.0,1
41,2,27.0,21.0,1
795,2,39.0,13.0,0
421,3,21.0,7.7333,0
816,3,23.0,7.925,1
509,3,26.0,56.4958,0


In [40]:
# Next we create the Naive Bayes model. scikit-learn provides several Naive Bayes variants; here we use GaussianNB,
# which assumes that each continuous feature is normally distributed within each class:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
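
To show what a Gaussian Naive Bayes model computes, here is a simplified from-scratch sketch on synthetic data: a prior, a mean and a variance per class, combined through the normal density for each feature. The function `gaussian_nb_proba` and the toy data are hypothetical, and the sketch omits scikit-learn's small variance-smoothing term, so it is compared with `GaussianNB` only up to a tolerance.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Hypothetical toy data: two classes, two continuous features.
X_toy = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2)),
    rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2)),
])
y_toy = np.array([0] * 50 + [1] * 50)

def gaussian_nb_proba(X_fit, y_fit, X_new):
    """Class probabilities from per-class feature means and variances."""
    log_joint = []
    for c in np.unique(y_fit):
        Xc = X_fit[y_fit == c]
        prior = len(Xc) / len(X_fit)
        mean, var = Xc.mean(axis=0), Xc.var(axis=0)
        # Sum of per-feature Gaussian log densities (the "naive" independence step).
        log_lik = -0.5 * np.sum(
            np.log(2 * np.pi * var) + (X_new - mean) ** 2 / var, axis=1
        )
        log_joint.append(np.log(prior) + log_lik)
    log_joint = np.array(log_joint).T          # shape: (n_samples, n_classes)
    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    proba = np.exp(log_joint)
    return proba / proba.sum(axis=1, keepdims=True)

manual = gaussian_nb_proba(X_toy, y_toy, X_toy[:5])
reference = GaussianNB().fit(X_toy, y_toy).predict_proba(X_toy[:5])
print(np.allclose(manual, reference, atol=1e-6))
```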

In [41]:
# The next step is to train the model on the training samples:
nb.fit(X_train, y_train)

GaussianNB()

In [42]:
# The model is now trained. Let's check its score (the mean accuracy on the test set):
nb.score(X_test, y_test)

0.7988826815642458
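
For a classifier, `score` returns the fraction of test samples whose predicted label matches the true label. The sketch below checks this on synthetic data generated with `make_classification`; the dataset and the variable names are illustrative and unrelated to the Titanic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification problem, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

model = GaussianNB().fit(X_tr, y_tr)
score = model.score(X_te, y_te)

# Accuracy computed by hand: share of predictions equal to the true labels.
manual_accuracy = np.mean(model.predict(X_te) == y_te)

print(np.isclose(score, manual_accuracy))
```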

In [43]:
# Now let's look at the X_test and y_test samples to inspect the model's predictions.
# X_test:
X_test.head(15)

Unnamed: 0,Pclass,Age,Fare,female
470,3,29.699118,7.25,0
667,3,29.699118,7.775,0
195,1,58.0,146.5208,1
607,1,27.0,30.5,0
658,2,23.0,13.0,0
726,2,30.0,21.0,1
18,3,31.0,18.0,1
133,2,29.0,26.0,1
26,3,29.699118,7.225,0
32,3,29.699118,7.75,1


In [44]:
# y_test samples:
y_test.head(15)

470    0
667    0
195    1
607    1
658    0
726    1
18     0
133    1
26     0
32     1
605    0
523    1
89     0
260    0
201    0
Name: Survived, dtype: int64

* Comparing these labels with the model's predictions (next cell) shows where the model misclassifies.

In [45]:
# Or we can predict the first 15 samples through our model and compare it with the y_test to see the model accuracy.
nb.predict(X_test[:15])

array([0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0], dtype=int64)
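
The comparison can be done programmatically rather than by eye. The sketch below copies the 15 true labels, their index values and the 15 predictions printed above into arrays (the variable names are illustrative) and reports which passengers were misclassified.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Values copied from the y_test.head(15) and nb.predict outputs above.
test_index = [470, 667, 195, 607, 658, 726, 18, 133, 26, 32, 605, 523, 89, 260, 201]
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0])
y_pred = np.array([0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0])

# Index labels of the rows where prediction and true label disagree.
wrong = [idx for idx, t, p in zip(test_index, y_true, y_pred) if t != p]
accuracy = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(wrong)
print(accuracy)
print(cm)
```

On these 15 samples the model gets 13 right: passenger 607 survived but was predicted not to, and passenger 18 did not survive but was predicted to.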

In [46]:
# We can also see the probability for each class:
nb.predict_proba(X_test[:15])

array([[9.62884097e-01, 3.71159031e-02],
       [9.62973326e-01, 3.70266740e-02],
       [1.70490968e-04, 9.99829509e-01],
       [7.57820633e-01, 2.42179367e-01],
       [9.17858700e-01, 8.21413001e-02],
       [2.50454225e-01, 7.49545775e-01],
       [4.24880611e-01, 5.75119389e-01],
       [2.42688527e-01, 7.57311473e-01],
       [9.62879669e-01, 3.71203307e-02],
       [4.19810957e-01, 5.80189043e-01],
       [9.64684245e-01, 3.53157547e-02],
       [4.84375774e-02, 9.51562423e-01],
       [9.60477914e-01, 3.95220855e-02],
       [9.62969239e-01, 3.70307611e-02],
       [8.99773218e-01, 1.00226782e-01]])
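
Each row of this output holds the probability of class 0 (did not survive) followed by class 1 (survived), and the predicted label is the class with the larger probability. The sketch below copies the array printed above and recovers the predictions from it; `classes` is assumed to match the model's 0/1 class order.

```python
import numpy as np

# Probabilities copied from the predict_proba output above.
proba = np.array([
    [9.62884097e-01, 3.71159031e-02],
    [9.62973326e-01, 3.70266740e-02],
    [1.70490968e-04, 9.99829509e-01],
    [7.57820633e-01, 2.42179367e-01],
    [9.17858700e-01, 8.21413001e-02],
    [2.50454225e-01, 7.49545775e-01],
    [4.24880611e-01, 5.75119389e-01],
    [2.42688527e-01, 7.57311473e-01],
    [9.62879669e-01, 3.71203307e-02],
    [4.19810957e-01, 5.80189043e-01],
    [9.64684245e-01, 3.53157547e-02],
    [4.84375774e-02, 9.51562423e-01],
    [9.60477914e-01, 3.95220855e-02],
    [9.62969239e-01, 3.70307611e-02],
    [8.99773218e-01, 1.00226782e-01],
])

classes = np.array([0, 1])  # assumed class order for the 0/1 Survived target
labels = classes[proba.argmax(axis=1)]  # pick the more probable class per row

print(labels)
```

The recovered labels match the `nb.predict` output. The probabilities also show how confident each prediction is: the seventh and tenth rows are close calls at roughly 0.58 for class 1.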

* That's all for this notebook.