# Naive Bayes
## This notebook outlines the usage of the Naive Bayes classification algorithm, with some examples

- Naive Bayes models are a group of extremely **fast** and **simple** classification algorithms that are often suitable for very high-dimensional datasets
- They make a useful **quick-and-dirty baseline** for a classification problem

## Bayes Theorem

Bayes' theorem is an equation describing the relationship between conditional probabilities of statistical quantities.

In Bayesian classification, we want to find the probability of a label given some observed features, which we write as $P(L~|~{\rm features})$.

Bayes's theorem tells us how to express this in terms of quantities we can compute more directly:

$$
P(L~|~{\rm features}) = \frac{P({\rm features}~|~L)P(L)}{P({\rm features})}
$$

If we are trying to decide between two labels, say $L_1$ and $L_2$, then one way to make this decision is to compute the ratio of the posterior probabilities for each label:

$$
\frac{P(L_1~|~{\rm features})}{P(L_2~|~{\rm features})} = \frac{P({\rm features}~|~L_1)}{P({\rm features}~|~L_2)}\frac{P(L_1)}{P(L_2)}
$$
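
As a quick numerical illustration of this ratio, the sketch below plugs in made-up likelihoods and priors. All four numbers are hypothetical and chosen only to show the arithmetic; they do not come from any dataset in this notebook.

```python
# Hypothetical likelihoods P(features | L_i) and priors P(L_i); illustrative only.
likelihood = {"L1": 0.20, "L2": 0.05}
prior = {"L1": 0.30, "L2": 0.70}

# Posterior ratio = likelihood ratio * prior ratio; P(features) cancels out.
posterior_ratio = (likelihood["L1"] / likelihood["L2"]) * (prior["L1"] / prior["L2"])

# A ratio above 1 favours L1, a ratio below 1 favours L2.
decision = "L1" if posterior_ratio > 1 else "L2"
print(round(posterior_ratio, 3), decision)
```

Here the larger likelihood under $L_1$ outweighs its smaller prior, so the ratio ends up above 1.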

All we need now is some model by which we can compute $P({\rm features}~|~L_i)$ for each label.
Such a model is called a **generative model** because it specifies the hypothetical random process that generates the data.
Specifying this generative model for each label is the main piece of the training of such a Bayesian classifier.
The general version of such a training step is a very difficult task, but we can make it tractable by adopting some simplifying assumptions about the form of this model.

### Why the name naive?
If we make very naive assumptions about the generative model for each label, we can find a rough approximation of the generative model for each class, and then proceed with the Bayesian classification.
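
The usual form of this naive assumption, stated here as standard background rather than something spelled out in the text above, is that the features are conditionally independent given the label. Writing the individual features as $x_1, \dots, x_n$ (notation introduced here for illustration), the likelihood is approximated by a product:

$$
P({\rm features}~|~L) = P(x_1, \dots, x_n~|~L) \approx \prod_{i=1}^{n} P(x_i~|~L)
$$

Each factor $P(x_i~|~L)$ is a one-dimensional distribution that can be estimated from the training data separately for each feature.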

# Example 1
# Tennis dataset
https://raw.githubusercontent.com/subashgandyer/datasets/main/PlayTennis.csv

In [None]:
import pandas as pd
import numpy as np

In [None]:
url = "https://raw.githubusercontent.com/subashgandyer/datasets/main/PlayTennis.csv"

In [None]:
play_tennis = pd.read_csv(url)
play_tennis

### How many features?

### How many categorical features?

### How many numerical features?

### How many samples?
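
One possible way to answer the four questions above with pandas is sketched below. To keep it self-contained it uses a tiny stand-in frame called `toy` instead of `play_tennis`; the column names and rows are assumptions for illustration, not the contents of the CSV.

```python
import pandas as pd

# Hypothetical stand-in for play_tennis; column names and rows are assumptions.
toy = pd.DataFrame({
    "Outlook": ["Sunny", "Overcast", "Rain", "Rain"],
    "Temperature": ["Hot", "Hot", "Mild", "Cool"],
    "Humidity": ["High", "High", "High", "Normal"],
    "Wind": ["Weak", "Weak", "Strong", "Weak"],
    "Play Tennis": ["No", "Yes", "No", "Yes"],
})

target = "Play Tennis"
features = toy.drop(columns=[target])

# Rows are samples, columns are features
n_samples, n_features = features.shape

# Columns without a numeric dtype are treated as categorical here
categorical = features.select_dtypes(exclude="number").columns.tolist()
numerical = features.select_dtypes(include="number").columns.tolist()

print(n_samples, n_features, len(categorical), len(numerical))
```

The same calls should work on `play_tennis` once `target` is set to whatever the label column is actually called in the file.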

# Bayes Theorem Exercise: Manual

### Question: If the temperature is mild, can tennis be played?
$$
P({\rm Play~Tennis}~|~{\rm Temperature = Mild})
$$

Hint: use Bayes' rule:

$$
P({\rm Play~Tennis}~|~{\rm Temperature = Mild}) = \frac{P({\rm Temperature = Mild}~|~{\rm Play~Tennis})P({\rm Play~Tennis})}{P({\rm Temperature = Mild})}
$$

### Collect only Temperature and Play Tennis features

### Create a Probability Table (manual)

$$
P({\rm Play~Tennis}~|~{\rm Temperature = Mild}) = \frac{P({\rm Temperature = Mild}~|~{\rm Play~Tennis})P({\rm Play~Tennis})}{P({\rm Temperature = Mild})}
$$

![(Play Tennis Template)](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/PlayTennis_template.png)

### Compute $P({\rm Temperature = Mild})$

### Compute $P({\rm Play~Tennis})$

### Compute $P({\rm Temperature = Mild}~|~{\rm Play~Tennis})$

### Compute $P({\rm Play~Tennis}~|~{\rm Temperature = Mild})$
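
The four quantities above can be checked by simple counting. The sketch below does this on a short list of made-up (Temperature, Play Tennis) pairs; the rows are hypothetical and are not the rows of PlayTennis.csv, so the resulting numbers only illustrate the procedure.

```python
from fractions import Fraction

# Hypothetical (Temperature, Play Tennis) pairs; NOT the rows of PlayTennis.csv.
rows = [
    ("Hot", "No"), ("Hot", "Yes"), ("Mild", "Yes"), ("Mild", "Yes"),
    ("Mild", "No"), ("Cool", "Yes"), ("Cool", "No"), ("Mild", "Yes"),
]

n = len(rows)
n_mild = sum(1 for t, _ in rows if t == "Mild")
n_yes = sum(1 for _, p in rows if p == "Yes")
n_mild_and_yes = sum(1 for t, p in rows if t == "Mild" and p == "Yes")

p_mild = Fraction(n_mild, n)                        # P(Temperature = Mild)
p_yes = Fraction(n_yes, n)                          # P(Play Tennis = Yes)
p_mild_given_yes = Fraction(n_mild_and_yes, n_yes)  # P(Temperature = Mild | Play Tennis = Yes)

# Bayes' rule: P(Yes | Mild) = P(Mild | Yes) * P(Yes) / P(Mild)
p_yes_given_mild = p_mild_given_yes * p_yes / p_mild
print(p_yes_given_mild)
```

As a sanity check, the result from Bayes' rule should equal the direct count `Fraction(n_mild_and_yes, n_mild)`.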

### Convert the categorical features into numerical features
- Use `LabelEncoder()`, or
- Use `OneHotEncoder()`

### Fit transform the categorical columns
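
A minimal sketch of both encoding options follows, using a hypothetical two-column frame (`toy`, with assumed column names and values) instead of the real data. Keeping one fitted `LabelEncoder` per column in a dictionary is a design choice made here so that new samples can later be encoded the same way.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy frame; column names and values are assumptions.
toy = pd.DataFrame({
    "Temperature": ["Hot", "Mild", "Cool", "Mild"],
    "Play Tennis": ["No", "Yes", "Yes", "No"],
})

# Option 1: one LabelEncoder per column, mapping each category to an integer code
encoders = {}
encoded = toy.copy()
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# Option 2: one-hot encode a feature column (one 0/1 column per category)
one_hot = OneHotEncoder().fit_transform(toy[["Temperature"]]).toarray()

print(encoded)
print(list(encoders["Temperature"].classes_))
print(one_hot.shape)
```

`LabelEncoder` assigns codes in sorted order of the category names, so the integers carry no real ordering; that is worth keeping in mind when they are later fed to a model that treats them as numbers.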

### Split the dataset into X and y

### Split into train and test

### Sanity check for split

### Import the GaussianNB Estimator

### Create the GaussianNB model

### Fit the model

### Predict the testing data

### Accuracy
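
The steps from splitting into X and y through to accuracy can be sketched end to end as follows. The data here is a hypothetical, already-encoded grid (every combination of 3 x 3 x 2 x 2 integer codes) with a made-up labelling rule, so the accuracy it prints says nothing about the tennis data; the split ratio and `random_state` are likewise arbitrary choices.

```python
from itertools import product

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical encoded features: all combinations of 3, 3, 2 and 2 category codes (36 rows)
X = np.array(list(product(range(3), range(3), range(2), range(2))))
# Made-up rule for the label, for illustration only
y = ((X[:, 2] == 0) | (X[:, 0] == 1)).astype(int)

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Sanity check for the split: shapes should add up to the full dataset
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Create and fit the model, then predict the testing data
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Fraction of test samples predicted correctly
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```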

### Let's apply the model to a real-world sample

- Outlook = Rain
- Temperature = Mild
- Humidity = High
- Wind = Weak

### Build the testing sample vector

### Predict on the sample
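
One way to build the sample vector and predict on it is sketched below. It is self-contained, so it refits a model on a small stand-in frame (`toy`); the rows, the column names, and the choice of one `LabelEncoder` per column are all assumptions for illustration. The key point is that the sample must be encoded with the same fitted encoders as the training data.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in rows; column names and values are assumptions.
toy = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast", "Sunny"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal", "Normal", "High"],
    "Wind":        ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong", "Weak"],
    "Play Tennis": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"],
})
feature_cols = ["Outlook", "Temperature", "Humidity", "Wind"]

# Fit one encoder per column and encode features and target
encoders = {col: LabelEncoder().fit(toy[col]) for col in toy.columns}
X = pd.DataFrame({col: encoders[col].transform(toy[col]) for col in feature_cols})
y = encoders["Play Tennis"].transform(toy["Play Tennis"])

model = GaussianNB().fit(X, y)

# Build the testing sample vector with the SAME encoders used for training
sample = {"Outlook": "Rain", "Temperature": "Mild", "Humidity": "High", "Wind": "Weak"}
sample_vector = pd.DataFrame(
    [[encoders[col].transform([sample[col]])[0] for col in feature_cols]],
    columns=feature_cols,
)

# Predict, then map the integer class back to its original label
pred = model.predict(sample_vector)
predicted_label = encoders["Play Tennis"].inverse_transform(pred)[0]
print(predicted_label)
```

A category that never appeared in the training data would make `transform` raise an error, so sample values have to match the spelling used in the dataset.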

### Try some other sample with different values

- Outlook = Rain
- Temperature = Cool
- Humidity = High
- Wind = Weak

# Example 2
# Text Classification

The task is to classify pieces of text into newsgroup classes: given a piece of text, predict which class (topic) it belongs to.

### Newsgroup Built-in dataset

### Get the target names

### How many classes?

### Fetch the training set and testing set

### Explore the training data

### How to convert this piece of text into numerical vectors?
Solution: NLP Feature Extraction

### Let's do a simple TF-IDF Vectorizer
Do not worry too much about the details for now; feature extraction techniques are covered in detail in a subsequent lecture.

In order to use this data for machine learning, we need to be able to convert the content of each string into a vector of numbers.
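
To make the string-to-vector conversion concrete, here is a minimal sketch on three made-up sentences (the `docs` list is hypothetical and unrelated to the newsgroup data). It only shows the shape of the result, not how the TF-IDF weights are derived.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, for illustration only
docs = [
    "the rocket reached orbit",
    "the team won the game",
    "orbit of the space station",
]

vectorizer = TfidfVectorizer()
# Sparse matrix: one row per document, one column per distinct word
X = vectorizer.fit_transform(docs)

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

Each document becomes one row of numbers, with a column for every distinct word seen during fitting.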

### Import TfidfVectorizer

### Import Multinomial Naive Bayes

### Import pipeline

### Create a pipeline with Tfidf and MultinomialNB

### Fit the model

### Predict on testing data
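
The import, pipeline, fit, and predict steps above might look like the following sketch. It trains on a tiny hypothetical corpus with two made-up classes rather than the newsgroup data, and `make_pipeline` is one of several ways to chain the two estimators.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus with two made-up classes; not the newsgroup data
train_texts = [
    "rocket launch to orbit",
    "satellite in orbit around earth",
    "astronauts aboard the space station",
    "the team won the hockey game",
    "goalie saved the puck",
    "hockey playoffs start tonight",
]
train_labels = ["space", "space", "space", "hockey", "hockey", "hockey"]

# The pipeline vectorizes raw strings, then feeds the vectors to the classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Predict on unseen strings; the pipeline applies the same vectorization
test_texts = ["satellite reached orbit", "hockey game tonight"]
predicted = model.predict(test_texts)
print(predicted)
```

Because the vectorizer sits inside the pipeline, raw strings can be passed directly to both `fit` and `predict`.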

### Confusion matrix between the true and predicted labels for the test data
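
A small sketch of computing a confusion matrix with scikit-learn follows. The true and predicted labels are hypothetical and hand-written, purely to show how the matrix is laid out.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only
labels = ["hockey", "space"]
y_true = ["space", "space", "hockey", "hockey", "space"]
y_pred = ["space", "hockey", "hockey", "hockey", "space"]

# Rows are true classes, columns are predicted classes, both in the order of `labels`
mat = confusion_matrix(y_true, y_pred, labels=labels)
print(mat)
```

Off-diagonal entries count the mistakes: here one "space" sample was predicted as "hockey".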

### Insights ???

### Predict function

In [None]:
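def predict_category(s, train=train, model=model):
    # Assumes `train` is the fetched training set and `model` is the fitted pipeline
    # Predict the class index for the single string s, then look up its newsgroup name
    pred = model.predict([s])
    return train.target_names[pred[0]]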

### Test some samples

In [None]:
predict_category('sending a payload to the ISS')

In [None]:
predict_category('discussing islam vs atheism')

In [None]:
predict_category('determining the screen resolution')