# Naïve Bayes

- [NB in Wikipedia](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Relation_to_logistic_regression)

- Naive Bayes is a classification method. It works with strings and with numbers treated as categories, and it predicts a discrete class label such as 0 or 1, not a continuous value in between such as 0.5 (that would be regression).

- [Technical Note: Naive Bayes for Regression](https://link.springer.com/content/pdf/10.1023%2FA%3A1007670802811.pdf) shows that NB is not well suited to regression tasks.

- Relation to logistic regression: a naive Bayes classifier can be seen as fitting a probability model by maximizing the joint likelihood p(C, x), while logistic regression fits the same probability model by maximizing the conditional likelihood p(C | x).
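
The two objectives can be written out explicitly. What follows is a sketch under the usual naive Bayes assumption that the features x_1, ..., x_n are conditionally independent given the class C. The notation (n features, training pairs indexed by j, model parameters θ) is introduced here for illustration only.

$$
p(C \mid x_1, \dots, x_n) \;=\; \frac{p(C)\,\prod_{i=1}^{n} p(x_i \mid C)}{p(x_1, \dots, x_n)} \;\propto\; p(C)\prod_{i=1}^{n} p(x_i \mid C)
$$

$$
\text{naive Bayes:}\quad \hat{\theta} = \arg\max_{\theta} \sum_{j} \log p_{\theta}\bigl(C^{(j)}, x^{(j)}\bigr)
\qquad
\text{logistic regression:}\quad \hat{\theta} = \arg\max_{\theta} \sum_{j} \log p_{\theta}\bigl(C^{(j)} \mid x^{(j)}\bigr)
$$

The predicted class is the one that maximizes the right-hand side of the first expression; the denominator does not depend on C and can be ignored.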


## Advantages

- It is a simple and fast approach, and it is often accurate enough for prediction.
- It has a very low computation cost.
- It works efficiently on large datasets.
- It works well when the response variable is discrete (classification) rather than continuous (regression).
- It can be used for multi-class prediction problems.
- It also performs well on text analytics problems.
- When the assumption of independence holds, a Naive Bayes classifier can perform better than other models such as logistic regression.

## Disadvantages

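- The assumption of independent features. In practice, it is almost impossible to obtain a set of predictors that are entirely independent.
- If a categorical feature value never occurs together with a particular class in the training data, its estimated conditional probability is zero, which drives the posterior probability of that class to zero regardless of the other features. In this case, the model cannot make a meaningful prediction. This is known as the zero probability (zero frequency) problem, and it is commonly handled with additive (Laplace) smoothing.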
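
The zero-frequency problem and additive (Laplace) smoothing can be illustrated with a few lines of plain Python. This is a minimal sketch: the toy data, the function name `conditional_prob`, and the parameter name `alpha` are made up for illustration and are not part of the Titanic example below.

```python
from collections import Counter

# Hypothetical toy data: (feature value, class label) pairs.
# The value "rainy" never occurs together with the class "yes".
samples = [("sunny", "yes"), ("sunny", "yes"), ("cloudy", "yes"),
           ("rainy", "no"), ("sunny", "no")]

values = sorted({v for v, _ in samples})        # distinct feature values
pair_counts = Counter(samples)                  # counts of (value, class)
class_counts = Counter(c for _, c in samples)   # counts of each class


def conditional_prob(value, label, alpha=0.0):
    """Estimate P(value | label) with additive smoothing strength alpha."""
    numerator = pair_counts[(value, label)] + alpha
    denominator = class_counts[label] + alpha * len(values)
    return numerator / denominator


unsmoothed = conditional_prob("rainy", "yes")            # alpha = 0
smoothed = conditional_prob("rainy", "yes", alpha=1.0)   # Laplace smoothing

print(unsmoothed)  # 0.0: any product containing this factor becomes zero
print(smoothed)    # 1/6: small but non-zero
```

In scikit-learn, the `alpha` parameter of `MultinomialNB` and `CategoricalNB` plays this role; `GaussianNB`, used below, has no counts to smooth and exposes `var_smoothing` instead.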


In [1]:
!gdown --id 1t3gVQVCAn19xSa-CzTzxLHRq3XeeoFlU

Downloading...
From: https://drive.google.com/uc?id=1t3gVQVCAn19xSa-CzTzxLHRq3XeeoFlU
To: /content/titanic.csv
100% 61.2k/61.2k [00:00<00:00, 38.7MB/s]


In [2]:
import pandas as pd

In [41]:
df = pd.read_csv("titanic.csv")
df.head(1)

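| | PassengerId | Name | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Braund, Mr. Owen Harris | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 0 |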


In [43]:
df.groupby('Sex').describe()

| Sex | PassengerId count | PassengerId mean | PassengerId std | PassengerId min | PassengerId 25% | PassengerId 50% | PassengerId 75% | PassengerId max | Pclass count | Pclass mean | ... | Fare 75% | Fare max | Survived count | Survived mean | Survived std | Survived min | Survived 25% | Survived 50% | Survived 75% | Survived max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| female | 314.0 | 431.028662 | 256.846324 | 2.0 | 231.75 | 414.5 | 641.25 | 889.0 | 314.0 | 2.159236 | ... | 55.0 | 512.3292 | 314.0 | 0.742038 | 0.438211 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| male | 577.0 | 454.147314 | 257.486139 | 1.0 | 222.0 | 464.0 | 680.0 | 891.0 | 577.0 | 2.389948 | ... | 26.55 | 512.3292 | 577.0 | 0.188908 | 0.391775 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [5]:
# Keep only the columns used as features (Pclass, Sex, Age, Fare) plus the target.
df.drop(
    ['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked'],
    axis='columns',
    inplace=True)
df.head(1)

| | Pclass | Sex | Age | Fare | Survived |
|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 7.25 | 0 |


In [6]:
# Separate the label to predict from the input features.
target = df.Survived
inputs = df.drop('Survived', axis='columns')

In [None]:
# Alternative encoding (not used here): map each category to an integer.
# inputs.Sex = inputs.Sex.map({'male': 1, 'female': 2})

In [None]:
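# One-hot encode Sex; dtype=int yields 0/1 columns (recent pandas versions default to booleans).
dummies = pd.get_dummies(inputs.Sex, dtype=int)
dummies.head(3)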

In [11]:
inputs = pd.concat([inputs, dummies], axis='columns')
inputs.head(3)

| | Pclass | Sex | Age | Fare | female | male |
|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 7.25 | 0 | 1 |
| 1 | 1 | female | 38.0 | 71.2833 | 1 | 0 |
| 2 | 3 | female | 26.0 | 7.925 | 1 | 0 |


In [12]:
# Drop the original Sex column and the redundant 'male' dummy (it always equals 1 - female).
inputs.drop(['Sex', 'male'], axis='columns', inplace=True)
inputs.head(3)

| | Pclass | Age | Fare | female |
|---|---|---|---|---|
| 0 | 3 | 22.0 | 7.25 | 0 |
| 1 | 1 | 38.0 | 71.2833 | 1 |
| 2 | 3 | 26.0 | 7.925 | 1 |


In [16]:
inputs.isna().any()

Pclass    False
Age        True
Fare      False
female    False
dtype: bool

In [13]:
inputs.columns[inputs.isna().any()]

Index(['Age'], dtype='object')

In [None]:
inputs.isna().sum()

In [None]:
inputs.Age[:20]

In [None]:
# Age is the only column with missing values; fill them with the mean age.
inputs.Age = inputs.Age.fillna(inputs.Age.mean())
inputs.head()

In [28]:
from sklearn.model_selection import train_test_split

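# Hold out 30% of the rows for testing. No random_state is set,
# so the split (and the scores below) will differ between runs.
X_train, X_test, y_train, y_test = train_test_split(inputs,
                                                    target,
                                                    test_size=0.3)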

In [29]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

In [30]:
model.fit(X_train, y_train)

GaussianNB()
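
To make concrete what `fit` estimates, here is a from-scratch sketch of Gaussian naive Bayes in NumPy. It is not scikit-learn's implementation (which, for instance, also adds a small variance-smoothing term); the toy arrays and the helper names `fit_gaussian_nb` and `predict_gaussian_nb` are made up for illustration.

```python
import numpy as np


def fit_gaussian_nb(X, y):
    """Estimate class priors and per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, priors, means, variances


def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    # Broadcasting gives arrays of shape (n_samples, n_classes, n_features).
    diff = X[:, None, :] - means[None, :, :]
    log_pdf = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])
    # Independence assumption: per-feature log likelihoods simply add up.
    log_posterior = np.log(priors)[None, :] + log_pdf.sum(axis=2)
    return classes[np.argmax(log_posterior, axis=1)]


# Hypothetical toy data: two features, two well-separated classes.
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

params = fit_gaussian_nb(X_toy, y_toy)
preds = predict_gaussian_nb(np.array([[1.1, 2.1], [5.1, 5.9]]), *params)
print(preds)  # [0 1]
```

Working in log space avoids numerical underflow when many small likelihoods are multiplied together.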

In [31]:
model.score(X_test, y_test)

0.7686567164179104

In [35]:
X_test[0:10]

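| | Pclass | Age | Fare | female |
|---|---|---|---|---|
| 379 | 3 | 19.0 | 7.775 | 0 |
| 677 | 3 | 18.0 | 9.8417 | 1 |
| 353 | 3 | 25.0 | 17.8 | 0 |
| 226 | 2 | 19.0 | 10.5 | 0 |
| 299 | 1 | 50.0 | 247.5208 | 1 |
| 572 | 1 | 36.0 | 26.3875 | 0 |
| 694 | 1 | 60.0 | 26.55 | 0 |
| 9 | 2 | 14.0 | 30.0708 | 1 |
| 425 | 3 | 29.699118 | 7.25 | 0 |
| 103 | 3 | 33.0 | 8.6542 | 0 |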


In [34]:
y_test[0:10]

379    0
677    1
353    0
226    1
299    1
572    1
694    0
9      1
425    0
103    0
Name: Survived, dtype: int64

In [36]:
model.predict(X_test[0:10])

array([0, 1, 0, 0, 1, 0, 0, 1, 0, 0])

In [37]:
model.predict_proba(X_test[:10])

array([[9.63497382e-01, 3.65026176e-02],
       [4.11945863e-01, 5.88054137e-01],
       [9.67688610e-01, 3.23113897e-02],
       [9.22419018e-01, 7.75809823e-02],
       [3.77604330e-10, 1.00000000e+00],
       [7.66326029e-01, 2.33673971e-01],
       [6.98657089e-01, 3.01342911e-01],
       [2.04002927e-01, 7.95997073e-01],
       [9.69246325e-01, 3.07536748e-02],
       [9.70148818e-01, 2.98511824e-02]])
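
The relationship between these probabilities and the earlier outputs can be checked directly. The sketch below copies the `predict_proba` rows and the first ten `y_test` labels shown above into arrays (the variable names are chosen here for illustration), takes the more probable class in each row, and compares the result with the true labels.

```python
import numpy as np

# Rows copied from the predict_proba output: [P(Survived=0), P(Survived=1)].
proba = np.array([[9.63497382e-01, 3.65026176e-02],
                  [4.11945863e-01, 5.88054137e-01],
                  [9.67688610e-01, 3.23113897e-02],
                  [9.22419018e-01, 7.75809823e-02],
                  [3.77604330e-10, 1.00000000e+00],
                  [7.66326029e-01, 2.33673971e-01],
                  [6.98657089e-01, 3.01342911e-01],
                  [2.04002927e-01, 7.95997073e-01],
                  [9.69246325e-01, 3.07536748e-02],
                  [9.70148818e-01, 2.98511824e-02]])

# First ten true labels, copied from the y_test output.
y_true = np.array([0, 1, 0, 1, 1, 1, 0, 1, 0, 0])

# The classes are [0, 1], so the column index of the larger
# probability in each row is the predicted label.
y_pred = proba.argmax(axis=1)
n_correct = int((y_pred == y_true).sum())

print(y_pred.tolist())  # [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
print(n_correct)        # 8
```

The result matches the `model.predict` output. The two misses are rows 226 and 572: both passengers survived, but the model gave them survival probabilities of only about 0.08 and 0.23.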

In [40]:
from sklearn.model_selection import cross_val_score

# Mean accuracy over 5 cross-validation folds of the training set.
cross_val_score(GaussianNB(), X_train, y_train, cv=5,
                scoring='accuracy').mean()

0.7721032258064516