# Naive Bayes

Naive Bayes algorithm is a supervised machine learning algorithm based on the Bayes Probability theorem. Naive Bayes assumes that there is no correlation between the features in a dataset used to train the model. We will get back to this later.

Despite the oversimplified assumptions, Naive Bayes works very well in many real world complex problems. They require a relatively small number of training data samples to perform classification efficiently, compared to other algorithms like Logistic Regression and Decision trees, that we studied earlier.

\
# Bayes Theorem

Bayes theorem describes the probability of a feature, based on prior knowledge of situations related to that feature.

For example, if the probability someone having diabetes is related to his or her age, then by using the Bayes Theorem, the age can be used to more accurately predict the probability of having diabetes.

\
#Naive

The word **naive** implies that every pair of features in the dataset is independent of each other. Naive Bayes works on the assumption that the value of a particular feature is independent of any other feature. 

For example, A vegetable may be classified as a tomato if it's round, about 4 cm in diameter, and red in color. With Naive Bayes, each of these three features (shape, size and color) contributes independently to the probability that the vegetable is a tomato. Also, it assumes that there is no possible correlation between the shape, size and color.



---



\
It's absolutely fine if you're not holding up with all this information. Many people believe that Logistic Regression and Naive Bayes are similar in nature that give similar outcomes, and thus, get's confused on when to use what.

\
Let's write some code to understand the differences between the two as we try to understand Naive Bayes a little more.

In [1]:
#Uploading the csv
from google.colab import files
data_to_load = files.upload()

Saving diabetes.csv to diabetes.csv


In [2]:
import pandas as pd

df = pd.read_csv('diabetes.csv')

print(df.head())

   glucose  bloodpressure  diabetes
0       40             85         0
1       40             92         0
2       45             63         1
3       45             80         0
4       40             73         1


In the data that we have, we can see that we have **glucose**, **bloodpressure** and we know if the given person has **diabetes** or not.

Here, we will use the **glucose** and the **bloodpressure** to predict if the person has diabetes or not using Naive Bayes.

In [3]:
from sklearn.model_selection import train_test_split

X = df[["glucose", "bloodpressure"]]
y = df["diabetes"]

x_train_1, x_test_1, y_train_1, y_test_1 = train_test_split(X, y, test_size=0.25, random_state=42)

## Training the model with naive Bayes

In [4]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LogisticRegression 

sc = StandardScaler()

x_train_1 = sc.fit_transform(x_train_1) 
x_test_1 = sc.fit_transform(x_test_1) 

model_1 = GaussianNB()
model_1.fit(x_train_1, y_train_1)

y_pred_1 = model_1.predict(x_test_1)

accuracy = accuracy_score(y_test_1, y_pred_1)
print(accuracy)


sc = StandardScaler()

x_train_1 = sc.fit_transform(x_train_1) 
x_test_1 = sc.fit_transform(x_test_1) 

model_2 = LogisticRegression(random_state = 0) 
model_2.fit(x_train_1, y_train_1)

y_pred_2 = model_2.predict(x_test_1)

accuracy = accuracy_score(y_test_1, y_pred_2)
print(accuracy)

0.9437751004016064
0.9156626506024096


Here, we can see that we have an accuracy of approximately an outstanding 94.4%.


While the accuracy score for both the datasets was close, with Naive Bayes giving us an accuracy of **94.4%** and logistic regression giving us an accuracy of **91.6%**, Naive Bayes still performed better.

The reason for this is that if we look at our features again, we can see that the Glucose and the Blood Pressure had no correlation with each other. They both contributed individually to whether a person would have diabetes or not. This is exactly what Naive Bayes algorithm assumes, that all the features contribute individually to the outcome.

\
This was for the case of where Naive Bayes outperforms Logistic Regression, but let's see an example of the case where Logistic Regression outperforms Naive Bayes.

In [None]:
Studnet side code 

In [5]:
#Uploading the csv
from google.colab import files
data_to_load = files.upload()

Saving income.csv to income.csv


In [6]:
import pandas as pd

df = pd.read_csv('income.csv')

print(df.head())
print(df.describe())

   age          workclass  ...  native-country  income
0   39          State-gov  ...   United-States   <=50K
1   50   Self-emp-not-inc  ...   United-States   <=50K
2   38            Private  ...   United-States   <=50K
3   53            Private  ...   United-States   <=50K
4   28            Private  ...            Cuba   <=50K

[5 rows x 14 columns]
                age  education-num  capital-gain  capital-loss  hours-per-week
count  45222.000000   45222.000000  45222.000000  45222.000000    45222.000000
mean      38.547941      10.118460   1101.430344     88.595418       40.938017
std       13.217870       2.552881   7506.430084    404.956092       12.007508
min       17.000000       1.000000      0.000000      0.000000        1.000000
25%       28.000000       9.000000      0.000000      0.000000       40.000000
50%       37.000000      10.000000      0.000000      0.000000       40.000000
75%       47.000000      13.000000      0.000000      0.000000       45.000000
max       90.00

From the given data, we will consider the following fields to determine the salary of a person -



1.   Age
2.   Hours Per Week
3.   Education Number
4.   Capital Gain
5.   Capital Loss

In [7]:
from sklearn.model_selection import train_test_split

X = df[["age", "hours-per-week", "education-num", "capital-gain", "capital-loss"]]
y = df["income"]

x_train_1, x_test_1, y_train_1, y_test_1 = train_test_split(X, y, test_size=0.25, random_state=42)

## Training the model with Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler 

sc = StandardScaler()

x_train_1 = sc.fit_transform(x_train_1) 
x_test_1 = sc.fit_transform(x_test_1) 

model_1 = GaussianNB()
model_1.fit(x_train_1, y_train_1)

y_pred_1 = model_1.predict(x_test_1)

accuracy = accuracy_score(y_test_1, y_pred_1)
print(accuracy)

0.7896692021935255


This time, with the new dataset, we can see that Naive Bayes gave us an accuracy of almost **79%**. Let's see how much accuracy do we get with Logistic Regression.

In [None]:
from sklearn.model_selection import train_test_split

X = df[["age", "hours-per-week", "education-num", "capital-gain", "capital-loss"]]
y = df["income"]

x_train_2, x_test_2, y_train_2, y_test_2 = train_test_split(X, y, test_size=0.25, random_state=42)

## Training model with Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler 

sc = StandardScaler()

x_train_2 = sc.fit_transform(x_train_2) 
x_test_2 = sc.fit_transform(x_test_2) 

model_2 = LogisticRegression(random_state = 0) 
model_2.fit(x_train_2, y_train_2)

y_pred_2 = model_2.predict(x_test_2)

accuracy = accuracy_score(y_test_2, y_pred_2)
print(accuracy)

0.8116929064213692


With Logistic Regression, this time, we got an accuracy of **81.1%**. Let's study this more closely.

\
# Difference b/w Naive Bayes and Logistic Regression

\
In the first dataset, as we pointed out earlier, both the *glucose* and the *bloodpressure* had little correlation, and both of them were contributing individually to whether a person has diabetes or not.

\
**Conclusion**
In these kinds of dataset, where all the features contribute individually to the outcome, Naive Bayes outperforms logistic regression and is highly efficient.

\
In the second dataset, Logistic Regression outperformed Naive Bayes. The reason is that in this dataset, not all features contribute individually to the outcome. For example, there have been people of all age groups earning both less than and more than 50K. There have also been people with all education numbers that have an income of both less and more than 50K. Here, the combination of all the features is a better predictor of whether a person is earning more than or less than 50K, instead of all features having their individual contribution.