# Naive Bayes Algorithm

Naive Bayes is a simple probabilistic algorithm used in machine learning for classification tasks, such as deciding whether an email is spam. It is called "naive" because it assumes that all features (like the words in an email) are independent of each other once the class is known. That assumption rarely holds exactly in real data, but the classifier still works well in many situations.
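
Written out, the independence assumption lets the classifier score a class as its prior probability multiplied by one likelihood term per feature. The notation here is introduced only for illustration: $y$ stands for the class and $x_1, \dots, x_n$ for the features.

$$
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$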

## How It Works

### Training Phase:
1. The algorithm looks at a set of labeled data (like emails that are marked as spam or not).
2. It estimates how common each class is (spam or not spam) and the probability of each feature (word) appearing in each class.

### Prediction Phase:
1. For a new email, the algorithm combines the per-class probabilities of the words it contains to estimate how likely the email is to belong to each class (spam or not).
2. It then selects the class with the highest probability.
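
The two phases above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the four training emails are made up, and the add-one (Laplace) smoothing and log-probabilities are common practical choices that the description above does not mention.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (email text, label)
TRAINING_EMAILS = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("team meeting tomorrow", "not spam"),
    ("project meeting notes", "not spam"),
]


def train(examples):
    """Training phase: count classes and per-class word occurrences."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def predict(text, class_counts, word_counts, vocabulary):
    """Prediction phase: score each class and return the most probable one."""
    total_examples = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior: how common the class is in the training data
        score = math.log(class_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get), scores


model = train(TRAINING_EMAILS)
label, scores = predict("free money", *model)
print(label)
```

Summing logarithms instead of multiplying raw probabilities keeps the scores from underflowing when an email contains many words.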

Even though the "naive" assumption is not always accurate (because words can depend on each other), Naive Bayes often performs well in practice: picking the right class only requires the correct class to receive the highest score, not that the estimated probabilities be exact.

## Summary:
In short, Naive Bayes is fast, easy to use, and great for tasks like spam detection, sentiment analysis, and more.


# Real-Life Example of Naive Bayes

Imagine you're trying to classify emails as **spam** or **not spam** based on the words they contain. Let's say we have a dataset with emails that have been labeled as spam or not spam, and we use this data to train our Naive Bayes classifier.

For example, in a dataset, we might find that:

- The word "**money**" appears in 80% of spam emails.
- The word "**meeting**" appears in 70% of non-spam emails.

Now, you receive a new email with the words "money" and "meeting". The Naive Bayes classifier will calculate the probabilities that this email is spam or not spam, based on how often these words appear in spam and non-spam emails. The algorithm will then classify the email based on the higher probability.
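
To make the calculation concrete, here is a small worked sketch. Only the 80% and 70% figures come from the example above. The other likelihoods and the equal class priors are assumptions chosen purely for illustration, since the classifier needs them but the example does not state them.

```python
# Figures stated in the example
p_money_given_spam = 0.80
p_meeting_given_ham = 0.70

# Assumed values (not given in the example), chosen only for illustration
p_money_given_ham = 0.10
p_meeting_given_spam = 0.05
p_spam = 0.50
p_ham = 0.50

# Naive Bayes score for each class: prior times the product of word likelihoods
spam_score = p_spam * p_money_given_spam * p_meeting_given_spam
ham_score = p_ham * p_money_given_ham * p_meeting_given_ham

# Normalise the two scores so they sum to 1
p_spam_given_email = spam_score / (spam_score + ham_score)
label = "spam" if spam_score > ham_score else "not spam"
print(f"P(spam | email) = {p_spam_given_email:.3f} -> {label}")
```

With these assumed numbers the email is classified as not spam, with a spam probability of roughly 0.36. Different assumed values could flip the result.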

---

# Advantages of Naive Bayes

1. **Simple and Fast**: It's very easy to implement and fast to train, even on large datasets.
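2. **Works Well with Many Features**: It copes well with a large number of features (like words in emails) and often needs relatively little training data to perform reasonably.
3. **Can Tolerate Missing Data**: Because each feature contributes its own separate term, a feature that is missing for a given example can in principle be left out of the calculation. Library implementations such as scikit-learn's `GaussianNB` may still require missing values to be filled in first.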
4. **Good for Text Classification**: It is often used in tasks like spam detection, sentiment analysis, and other text-based classification problems.

---

# Disadvantages of Naive Bayes

1. **Assumption of Independence**: The algorithm assumes that features are independent, which is often not true in real life. For example, in a spam email, the presence of both "money" and "offer" together could be a stronger signal of spam, but Naive Bayes treats them as if they're independent.
2. **Poor Performance with Highly Correlated Features**: If the features are highly correlated (e.g., two words that always appear together), Naive Bayes counts the same evidence more than once, which tends to make its predicted probabilities overconfident and can hurt accuracy.
3. **Limited Flexibility**: It's not the best choice for tasks where features interact in complex ways. More sophisticated models like decision trees or neural networks may be better suited for those cases.
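
The effect of correlated features can be illustrated with a short sketch. The likelihood values and the helper `posterior_spam` are hypothetical. The second call feeds in the same word twice, standing in for two words that always appear together.

```python
def posterior_spam(likelihoods_spam, likelihoods_ham, prior_spam=0.5):
    """Naive Bayes posterior for 'spam' given per-feature likelihoods."""
    spam_score = prior_spam
    ham_score = 1.0 - prior_spam
    for p_spam, p_ham in zip(likelihoods_spam, likelihoods_ham):
        spam_score *= p_spam
        ham_score *= p_ham
    return spam_score / (spam_score + ham_score)


# Hypothetical likelihoods for one word: P(word | spam) = 0.8, P(word | not spam) = 0.2
one_feature = posterior_spam([0.8], [0.2])

# The same evidence counted twice, as happens with two perfectly correlated words
duplicated_feature = posterior_spam([0.8, 0.8], [0.2, 0.2])

print(round(one_feature, 3), round(duplicated_feature, 3))
```

The duplicated word adds no new information, yet the posterior rises from 0.8 to about 0.94, which is the overconfidence that correlated features introduce.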

---

# Summary

In summary, Naive Bayes is great for quick and effective classification tasks, especially in text classification, but it can have limitations when dealing with complex relationships between features.


In [1]:
import pandas as pd
import numpy as np
import sklearn

In [2]:
df = pd.read_csv(r"Social_Network_Ads.csv")
df

,Age,EstimatedSalary,Purchased
0,19,19000,0
1,35,20000,0
2,26,43000,0
3,27,57000,0
4,19,76000,0
...,...,...,...
395,46,41000,1
396,51,23000,1
397,50,20000,1
398,36,33000,0


In [3]:
df.isnull().sum()

Age                0
EstimatedSalary    0
Purchased          0
dtype: int64

In [4]:
X = df.drop(columns='Purchased')
y = df['Purchased']

In [5]:
from sklearn.model_selection import train_test_split
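# Hold out 20% of the rows for testing; random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)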

In [6]:
X_train

,Age,EstimatedSalary
336,58,144000
64,59,83000
55,24,55000
106,26,35000
300,58,38000
...,...,...
323,48,30000
192,29,43000
117,36,52000
47,27,54000


In [7]:
X_test

,Age,EstimatedSalary
132,30,87000
309,38,50000
341,35,75000
196,30,79000
246,35,50000
...,...,...
14,18,82000
363,42,79000
304,40,60000
361,53,34000


In [8]:
y_test

132    0
309    0
341    0
196    0
246    0
      ..
14     0
363    0
304    0
361    1
329    1
Name: Purchased, Length: 80, dtype: int64

In [9]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train,y_train)
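
`GaussianNB` models each feature within each class as a normal distribution, so for numeric columns such as `Age` and `EstimatedSalary` the per-feature probability is a Gaussian density rather than a word frequency. The following is a rough from-scratch sketch of that idea, not scikit-learn's actual code: the rows are made up (they are not taken from `Social_Network_Ads.csv`), the function names are hypothetical, and the variance smoothing that scikit-learn applies is left out.

```python
import math
from statistics import mean, pvariance

# Hypothetical rows: (Age, EstimatedSalary) -> Purchased
rows = [
    ((22, 25000), 0), ((25, 40000), 0), ((30, 30000), 0), ((28, 52000), 0),
    ((45, 90000), 1), ((50, 110000), 1), ((48, 70000), 1), ((55, 120000), 1),
]


def fit_gaussian_nb(rows):
    """Estimate the class prior and per-feature mean/variance for each class."""
    params = {}
    for label in {y for _, y in rows}:
        features = [x for x, y in rows if y == label]
        columns = list(zip(*features))
        params[label] = {
            "prior": len(features) / len(rows),
            "means": [mean(col) for col in columns],
            "variances": [pvariance(col) for col in columns],
        }
    return params


def log_gaussian(x, mu, var):
    """Log of the normal density N(x; mu, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def predict_gaussian_nb(params, x):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, p in params.items():
        score = math.log(p["prior"])
        for value, mu, var in zip(x, p["means"], p["variances"]):
            score += log_gaussian(value, mu, var)
        scores[label] = score
    return max(scores, key=scores.get)


params = fit_gaussian_nb(rows)
print(predict_gaussian_nb(params, (24, 35000)), predict_gaussian_nb(params, (52, 100000)))
```

On this toy data the younger, lower-salary point lands in class 0 and the older, higher-salary point in class 1.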

# Prediction

In [10]:
y_pred = classifier.predict(X_test)
y_pred

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1], dtype=int64)

# Evaluation

In [11]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
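# scikit-learn expects the true labels first: rows are actual classes, columns are predicted classes
confusion_matrix(y_test, y_pred)

array([[56,  2],
       [ 4, 18]], dtype=int64)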

In [12]:
accuracy_score(y_test, y_pred)

0.925
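
As a quick cross-check, the same accuracy can be recomputed by hand from the confusion matrix counts shown earlier: the diagonal entries are the correct predictions and the off-diagonal entries are the mistakes.

```python
# Counts read from the confusion matrix above
correct = 56 + 18      # diagonal: predictions that match the true label
incorrect = 4 + 2      # off-diagonal: misclassified rows
accuracy = correct / (correct + incorrect)
print(accuracy)
```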

In [13]:
# True labels go first so that precision, recall and support refer to the actual classes
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.93      0.97      0.95        58
           1       0.90      0.82      0.86        22

    accuracy                           0.93        80
   macro avg       0.92      0.89      0.90        80
weighted avg       0.92      0.93      0.92        80

