### Colab Activity 18.3: Naive Bayes Algorithm

**Expected Time = 60 minutes** 


This activity focuses on implementing the Naive Bayes algorithm. You will use the Scikit-Learn estimator together with your earlier vectorization strategies to model the WhatsApp text and compare the results with your earlier work with Logistic Regression.

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV

[Back to top](#-Index)

### Problem 1

#### Small Example


The example below is adapted from Marsland's *Machine Learning: An Algorithmic Perspective*. The dataset records whether or not a student has a looming deadline, whether there is a party going on, and whether or not the student feels lazy. The `activity` column is the target, and your aim is to apply the Naïve Bayes formula below:

$$P(C_i) \prod_{k} P(X_j^k = a_k | C_i)$$

In [2]:
deadline = ['urgent','urgent','near', 'none', 'none', 'none', 'near', 'near', 'near','urgent']
party = ['yes', 'no', 'yes', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'no']
lazy = ['yes', 'yes', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no']
activity = ['party', 'study', 'party', 'party', 'pub', 'party', 'study', 'tv', 'party', 'study']

In [3]:
df = pd.DataFrame({'deadline': deadline,
                  'party': party,
                  'lazy': lazy,
                  'activity': activity})
df

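| | deadline | party | lazy | activity |
|---|---|---|---|---|
| 0 | urgent | yes | yes | party |
| 1 | urgent | no | yes | study |
| 2 | near | yes | yes | party |
| 3 | none | yes | no | party |
| 4 | none | no | yes | pub |
| 5 | none | yes | no | party |
| 6 | near | no | no | study |
| 7 | near | no | yes | tv |
| 8 | near | yes | yes | party |
| 9 | urgent | no | no | study |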


Here, $C_i$ represents a class in the `activity` column. Suppose we want to predict the activity for the following input:

```
deadline = near
party = no
lazy = yes
```

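This means we need four (unnormalized) probabilities, one per class:

- $P(\text{party}) \times P(\text{near} \mid \text{party}) \times P(\text{no party} \mid \text{party}) \times P(\text{lazy} \mid \text{party})$
- $P(\text{study}) \times P(\text{near} \mid \text{study}) \times P(\text{no party} \mid \text{study}) \times P(\text{lazy} \mid \text{study})$
- $P(\text{pub}) \times P(\text{near} \mid \text{pub}) \times P(\text{no party} \mid \text{pub}) \times P(\text{lazy} \mid \text{pub})$
- $P(\text{tv}) \times P(\text{near} \mid \text{tv}) \times P(\text{no party} \mid \text{tv}) \times P(\text{lazy} \mid \text{tv})$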

Compute these four probabilities and assign them to the list `probs` in the order above (party, study, pub, tv). 


In [4]:

probs = []
# Calculate probabilities for each class
probs = [
    # P(party) × P(near|party) × P(no party|party) × P(lazy|party)
    # 2 of the 5 party rows have a near deadline, 0 have party = no, 3 have lazy = yes
    (5/10) * (2/5) * (0/5) * (3/5),

    # P(study) × P(near|study) × P(no party|study) × P(lazy|study)
    (3/10) * (1/3) * (3/3) * (1/3),

    # P(pub) × P(near|pub) × P(no party|pub) × P(lazy|pub)
    (1/10) * (0/1) * (1/1) * (1/1),

    # P(tv) × P(near|tv) × P(no party|tv) × P(lazy|tv)
    (1/10) * (1/1) * (1/1) * (1/1)
]


### ANSWER CHECK
print(probs)

[0.0, 0.033333333333333326, 0.0, 0.1]
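
The hand-counted fractions above can be cross-checked by computing the same relative frequencies directly from the data. The snippet below is a minimal sketch, not part of the graded activity; the helper `nb_score` and the names `observation` and `check_probs` are hypothetical and introduced only for illustration. It applies no smoothing, so it mirrors the formula from Problem 1.

```python
import pandas as pd

# Same toy data as in the cells above, rebuilt so the block runs on its own
df = pd.DataFrame({
    'deadline': ['urgent', 'urgent', 'near', 'none', 'none', 'none', 'near', 'near', 'near', 'urgent'],
    'party': ['yes', 'no', 'yes', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'no'],
    'lazy': ['yes', 'yes', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no'],
    'activity': ['party', 'study', 'party', 'party', 'pub', 'party', 'study', 'tv', 'party', 'study'],
})


def nb_score(data, target, label, observation):
    """Unnormalized Naive Bayes score P(C) * prod_k P(x_k | C), without smoothing."""
    in_class = data[data[target] == label]
    score = len(in_class) / len(data)  # prior P(C)
    for column, value in observation.items():
        # Conditional P(column = value | C), estimated as a relative frequency
        score *= (in_class[column] == value).mean()
    return score


observation = {'deadline': 'near', 'party': 'no', 'lazy': 'yes'}
classes = ['party', 'study', 'pub', 'tv']
check_probs = [nb_score(df, 'activity', c, observation) for c in classes]
print(dict(zip(classes, check_probs)))
```

Up to floating-point rounding, the values should agree with the `probs` list printed above.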


[Back to top](#-Index)

### Problem 2

#### MAP solution


Using these probabilities, the maximum a posteriori (MAP) solution selects the outcome with the highest probability. Use your list of probabilities to identify the `argmax`. Note that you can use `np.argmax` for this or simply inspect the values.

Assign your answer as a string -- `party`, `study`, `pub`, or `tv` -- to `ans2` below.

In [5]:

ans2 = ''
ans2 = 'tv'  # The highest probability from probs

### ANSWER CHECK
print(ans2)

tv
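
The same answer can be recovered programmatically. The sketch below assumes the class order from Problem 1 (party, study, pub, tv) and copies the printed `probs` values; `classes`, `map_class`, and `posteriors` are illustrative names. It also normalizes the scores to show that dividing by their sum does not change which class wins.

```python
import numpy as np

classes = ['party', 'study', 'pub', 'tv']
probs = [0.0, 0.033333333333333326, 0.0, 0.1]  # values printed in Problem 1

# Index of the largest unnormalized score, mapped back to its class label
best_index = int(np.argmax(probs))
map_class = classes[best_index]

# Dividing by the total turns the scores into posterior probabilities;
# the ranking, and therefore the MAP class, is unchanged
posteriors = np.array(probs) / np.sum(probs)
print(map_class, posteriors.round(3))
```

After normalization, `study` and `tv` receive posterior probabilities of about 0.25 and 0.75.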


### Larger Example

Now, you are to use the Scikit-Learn vectorizers together with the `MultinomialNB` estimator to implement the Naïve Bayes algorithm for classifying the WhatsApp data.  The data is loaded and split for you below.

In [6]:
happy_df = pd.read_csv('data/Emotion(happy).csv')
sad_df = pd.read_csv('data/Emotion(sad).csv.zip', compression = 'zip')
full_df = pd.concat([happy_df, sad_df]).reset_index(drop = True)
X = full_df.drop('sentiment', axis = 1)
y = full_df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X['content'], y, random_state = 42)

[Back to top](#-Index)

### Problem 3

#### Pipeline with `CountVectorizer`


Below, create a pipeline called `cvect_pipe` with named steps `cvect` and `bayes` given by `CountVectorizer()` and `MultinomialNB()`, respectively.

Fit this pipeline to the training data `X_train` and `y_train`.

Finally, use the pipeline's `score` method to evaluate it on the test set `X_test` and `y_test`. Assign the result to `cvect_acc`.


In [7]:
cvect_pipe = Pipeline([
    ('cvect', CountVectorizer()),
    ('bayes', MultinomialNB())
])

# Fit the pipeline
cvect_pipe.fit(X_train, y_train)

# Get accuracy score
cvect_acc = cvect_pipe.score(X_test, y_test)



### ANSWER CHECK
cvect_pipe.named_steps

{'cvect': CountVectorizer(), 'bayes': MultinomialNB()}

[Back to top](#-Index)

### Problem 4

#### Pipeline with `TfidfVectorizer`


Below, create a pipeline called `tfidf_pipe` with named steps `tfidf` and `bayes` given by `TfidfVectorizer()` and `MultinomialNB()`, respectively.

Fit this pipeline to the training data `X_train` and `y_train`.

Finally, use the pipeline's `score` method to evaluate it on the test set `X_test` and `y_test`. Assign the result to `tfidf_acc`.


In [8]:
tfidf_pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('bayes', MultinomialNB())
])

# Fit the pipeline
tfidf_pipe.fit(X_train, y_train)

# Get accuracy score
tfidf_acc = tfidf_pipe.score(X_test, y_test)





### ANSWER CHECK
tfidf_pipe.named_steps

{'tfidf': TfidfVectorizer(), 'bayes': MultinomialNB()}
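
To see what each vectorizer hands to `MultinomialNB`, it can help to compare the two feature matrices on a tiny corpus. The sketch below uses three made-up sentences (`toy_docs` is not the WhatsApp data) and default vectorizer settings. `CountVectorizer` produces raw term counts, while `TfidfVectorizer` produces reweighted values whose rows are scaled to unit length under the default `norm='l2'`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ['good good day', 'bad day', 'good mood']  # made-up example sentences

cvect = CountVectorizer()
counts = cvect.fit_transform(toy_docs).toarray()    # integer term counts

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(toy_docs).toarray()   # tf-idf weights, L2-normalized rows

# Vocabulary terms ordered by their column index
terms = sorted(cvect.vocabulary_, key=cvect.vocabulary_.get)
print(terms)
print(counts)
print(weights.round(2))
print(np.linalg.norm(weights, axis=1))  # row lengths of the tf-idf matrix
```

Both matrices are non-negative, which is what `MultinomialNB` expects of its inputs.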

[Back to top](#-Index)

### Problem 5

#### Assessing performance


Now, consider searching the hyperparameters of the model. Specifically, which parameter of `MultinomialNB` controls Laplacian (additive) smoothing? Assign your answer as a string to `ans5` below.


In [9]:

ans5 = 'alpha'  # This is the smoothing parameter in MultinomialNB



### ANSWER CHECK
print(ans5)

alpha
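
Once the parameter is identified, a search over it can be sketched with `GridSearchCV`, which was imported at the top of the notebook. The example below is an illustration under stated assumptions: `toy_texts` and `toy_labels` are made-up stand-ins for the WhatsApp data, and the grid values and `cv=2` are arbitrary choices that keep the example small. Inside a pipeline, the parameter is addressed as `<step name>__<parameter>`, here `bayes__alpha`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up stand-in data; in the notebook you would use X_train and y_train
toy_texts = [
    'what a lovely sunny day', 'i feel so sad today',
    'so happy to see you', 'this is a sad and lonely night',
    'great news made me smile', 'i miss you and feel down',
    'lovely happy weekend ahead', 'feeling down and tired',
]
toy_labels = ['happy', 'sad', 'happy', 'sad', 'happy', 'sad', 'happy', 'sad']

pipe = Pipeline([
    ('cvect', CountVectorizer()),
    ('bayes', MultinomialNB()),
])

# Candidate smoothing strengths for the 'bayes' step (illustrative values)
param_grid = {'bayes__alpha': [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

On data this small the selected value is not meaningful; the point is only the `bayes__alpha` naming and the search workflow.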


## Summary of Exercises

This notebook explored the implementation and application of the Naive Bayes algorithm through several key exercises:

1. **Basic Naive Bayes Implementation**
   - Worked with a small dataset about student activities
   - Calculated conditional probabilities manually using the Naive Bayes formula
   - Demonstrated Maximum A Posteriori (MAP) decision making

2. **Text Classification with Scikit-learn**
   - Applied Naive Bayes to WhatsApp messages for sentiment analysis
   - Implemented two different text vectorization approaches:
     - CountVectorizer pipeline
     - TfidfVectorizer pipeline
   - Explored Laplacian smoothing via the `alpha` parameter

## Key Takeaways

1. Naive Bayes is effective for text classification due to its:
   - Simplicity in probability calculations
   - Ability to handle high-dimensional data
   - Fast training and prediction times

2. Text preprocessing choices matter:
   - CountVectorizer provides basic frequency-based features
   - TfidfVectorizer adds term importance weighting
   
3. Laplacian smoothing (controlled by `alpha`) helps with:
   - Handling features that never co-occur with a class in the training data
   - Preventing a single zero count from forcing the whole product to zero
   - Improving model generalization when tuned

4. Pipeline implementation in scikit-learn:
   - Streamlines the text processing workflow
   - Combines preprocessing and modeling steps
   - Enables easy model evaluation and tuning
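
The zero-probability point in takeaway 3 can be made concrete with the small dataset from Problem 1, where the `pub` class never occurs with a near deadline. The sketch below uses a hypothetical helper, `smoothed_prob`, that implements the usual additive-smoothing estimate (count + alpha) / (class total + alpha × number of possible values) for a categorical feature. `MultinomialNB` applies the same idea to word counts; this is an illustration of the idea, not scikit-learn's internal code.

```python
def smoothed_prob(count, class_total, n_values, alpha=1.0):
    """Additive (Laplace) smoothing estimate of P(value | class)."""
    return (count + alpha) / (class_total + alpha * n_values)


# P(deadline = near | pub): 0 of the 1 'pub' rows have a near deadline,
# and 'deadline' takes 3 possible values (urgent, near, none)
unsmoothed = smoothed_prob(count=0, class_total=1, n_values=3, alpha=0.0)
smoothed = smoothed_prob(count=0, class_total=1, n_values=3, alpha=1.0)

print(unsmoothed)  # 0.0, which zeroes out the whole product for 'pub'
print(smoothed)    # 0.25, so 'pub' is no longer ruled out by one missing count
```

Larger values of `alpha` pull every estimate closer to the uniform value `1 / n_values`, which is why the parameter is worth tuning.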