# Multinomial and Bernoulli Naive Bayes

ML process in this notebook:
1. Import and prepare data
2. Build ML model: Multinomial Naive Bayes
3. Build ML model: Bernoulli Naive Bayes


In [1]:
# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import sklearn

## Import and Prepare Data
1. Load the train dataset
2. Identify the classes and convert the categorical labels to numerical
3. Split the train dataset into X (documents) and y (labels)
4. Create a bag of words by vectorizing the train dataset, removing stop words
5. Vectorize the test dataset with the vocabulary learned from the train dataset

In [23]:
# Read train dataset
train_docs = pd.read_csv("train.csv")
train_docs['Class'].value_counts()

Class
education    3
cinema       2
Name: count, dtype: int64

In [24]:
# Convert Class to numerical
train_docs["Class"] = train_docs["Class"].map({'cinema': 0,'education': 1})
train_docs["Class"] = train_docs["Class"].astype('int')
train_docs

Unnamed: 0,Document,Class
0,Upgrad is a great educational institution.,1
1,Educational greatness depends on ethics,1
2,A story of great ethics and educational greatness,1
3,Sholey is a great cinema,0
4,good movie depends on good story,0


In [32]:
# Split train data to X and y and store in numpy array format
train_arr = train_docs.values

# First column holds the documents
X_train = train_arr[:, 0]
print(f"X_train: \n {X_train}")


# Second column holds the class labels
y_train = train_arr[:, 1]
print(f"\ny_train: \n {y_train}")


X_train: 
 ['Upgrad is a great educational institution.'
 'Educational greatness depends on ethics'
 'A story of great ethics and educational greatness'
 'Sholey is a great cinema' 'good movie depends on good story']

y_train: 
 [1 1 1 0 0]


## Vectorizing methods and meaning

- ```vect.fit(train)``` learns the vocabulary of the training data
- ```vect.transform(train)``` uses the fitted vocabulary to build a document-term matrix from the training data
- ```vect.transform(test)``` uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)
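
The fit/transform split described above can be illustrated with a small pure-Python sketch. This is not how `CountVectorizer` is implemented; the helper names (`tokenize`, `fit`, `transform`) and the tiny stop-word set are assumptions made for illustration only. It shows the two key behaviours: the vocabulary comes from the training documents alone, and test tokens outside that vocabulary are silently dropped.

```python
import re

# Tiny illustrative stop-word set (an assumption; scikit-learn's 'english' list is much longer).
STOP_WORDS = {"is", "a", "on", "of", "and", "very"}

def tokenize(text):
    """Lowercase, keep tokens of two or more word characters, drop stop words."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def fit(docs):
    """Learn the vocabulary: map each distinct token to a column index (alphabetical)."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(vocab)}

def transform(docs, vocabulary):
    """Build a dense document-term count matrix; unseen tokens are skipped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in tokenize(doc):
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows

train = ["Sholey is a great cinema", "good movie depends on good story"]
test = ["very good educational institution"]

vocabulary = fit(train)                       # learned from train only
train_matrix = transform(train, vocabulary)
test_matrix = transform(test, vocabulary)     # 'educational', 'institution' are unseen

print(vocabulary)
print(train_matrix)
print(test_matrix)
```

In this toy run only `good` survives in the test row, because `educational` and `institution` never appeared in the two training sentences used here.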

In [38]:
# Create bag of words and remove stop words
from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer(stop_words='english')
vector.fit(X_train)

print(f"\nBag of Words:\n {vector.vocabulary_}")

print(f"\nFeature names:\n {vector.get_feature_names_out()}")

print(f"\n Number of features:\n {len(vector.get_feature_names_out())}")



Bag of Words:
 {'upgrad': 11, 'great': 5, 'educational': 2, 'institution': 7, 'greatness': 6, 'depends': 1, 'ethics': 3, 'story': 10, 'sholey': 9, 'cinema': 0, 'good': 4, 'movie': 8}

Feature names:
 ['cinema' 'depends' 'educational' 'ethics' 'good' 'great' 'greatness'
 'institution' 'movie' 'sholey' 'story' 'upgrad']

 Number of features:
 12


### Build the document-term matrix for the train dataset

In [46]:
# Transform the train documents with the fitted vectorizer
X_train_transformed = vector.transform(X_train)


print(f"X_train_transformed: \n{X_train_transformed.toarray()}")

pd.DataFrame(X_train_transformed.toarray(), 
             columns=vector.get_feature_names_out())


X_train_transformed: 
[[0 0 1 0 0 1 0 1 0 0 0 1]
 [0 1 1 1 0 0 1 0 0 0 0 0]
 [0 0 1 1 0 1 1 0 0 0 1 0]
 [1 0 0 0 0 1 0 0 0 1 0 0]
 [0 1 0 0 2 0 0 0 1 0 1 0]]


Unnamed: 0,cinema,depends,educational,ethics,good,great,greatness,institution,movie,sholey,story,upgrad
0,0,0,1,0,0,1,0,1,0,0,0,1
1,0,1,1,1,0,0,1,0,0,0,0,0
2,0,0,1,1,0,1,1,0,0,0,1,0
3,1,0,0,0,0,1,0,0,0,1,0,0
4,0,1,0,0,2,0,0,0,1,0,1,0


### We do the same with the test dataset


In [56]:
# Read test dataset
test_docs = pd.read_csv("test.csv")
test_docs

Unnamed: 0,Document,Class
0,very good educational institution,education


In [57]:
# Convert categorical Class to numerical
test_docs["Class"] = test_docs["Class"].map({
    'education': 1,
    'cinema': 0
})
test_docs

Unnamed: 0,Document,Class
0,very good educational institution,1


In [58]:
test_arr = test_docs.values
# Split X and y of test dataset
X_test = test_arr[:, 0]

y_test = test_arr[:, 1]
y_test


array([1], dtype=object)

In [62]:
# Transformed X_test
X_test_transformed = vector.transform(X_test)

print(f"X_test_transformed: \n{X_test_transformed.toarray()}")

X_test_transformed: 
[[0 0 1 0 1 0 0 1 0 0 0 0]]


## Build model: Multinomial Naive Bayes

In [81]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()

# Fit model with train dataset
y_train = y_train.astype('int')
mnb.fit(X_train_transformed, y_train)

# Predict class probabilities for the test dataset
mnb_prob = mnb.predict_proba(X_test_transformed)
mnb_prob

array([[0.32808399, 0.67191601]])
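
The columns of `predict_proba` follow the order of the fitted model's `classes_` attribute, so it is worth checking that order before reading column 0 as cinema and column 1 as education. The sketch below rebuilds the model from the matrices printed earlier so that it runs on its own; the `label_names` dictionary is an illustrative helper, not part of the original notebook.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Document-term counts and labels copied from the notebook output above.
X = np.array([
    [0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 1, 0],
])
y = np.array([1, 1, 1, 0, 0])
x_test = np.array([[0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0]])

model = MultinomialNB().fit(X, y)
proba = model.predict_proba(x_test)

# Same encoding as the notebook's .map() call (helper introduced for this sketch).
label_names = {0: "cinema", 1: "education"}

# classes_ gives the label behind each probability column.
for cls, p in zip(model.classes_, proba[0]):
    print(f"{label_names[cls]}: {p:.4f}")

# predict() returns the label with the highest probability.
predicted = model.predict(x_test)[0]
print(f"Predicted class: {label_names[predicted]}")
```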

In [82]:
print(f"Probability of test document belonging to CINEMA is: {mnb_prob[0][0]*100}")
print(f"Probability of test document belonging to EDUCATION is: {mnb_prob[0][1]*100}")

Probability of test document belonging to CINEMA is: 32.80839895013124
Probability of test document belonging to EDUCATION is: 67.19160104986874
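
To see where these numbers come from, the multinomial posterior can be recomputed by hand. The sketch below assumes Laplace smoothing with `alpha = 1` (the `MultinomialNB` default) and uses the count matrix printed earlier; the function name `multinomial_joint` is made up for this illustration. Each word likelihood is (count of the word in the class + alpha) / (total words in the class + alpha × vocabulary size), and the joint score is the class prior times the likelihoods of the words present in the test document.

```python
# Document-term counts and labels copied from the notebook output above.
X_train = [
    [0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 1, 0],
]
y_train = [1, 1, 1, 0, 0]
x_test = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
alpha = 1.0  # Laplace smoothing, assumed to match the scikit-learn default

def multinomial_joint(cls):
    """Return P(class) * product of P(word | class) ** count for the test document."""
    rows = [row for row, label in zip(X_train, y_train) if label == cls]
    prior = len(rows) / len(X_train)
    word_counts = [sum(col) for col in zip(*rows)]  # per-word totals in this class
    total = sum(word_counts)
    n_features = len(word_counts)
    joint = prior
    for count, x in zip(word_counts, x_test):
        theta = (count + alpha) / (total + alpha * n_features)
        joint *= theta ** x  # words absent from the test document contribute 1
    return joint

joint = {cls: multinomial_joint(cls) for cls in (0, 1)}
evidence = sum(joint.values())
posterior = {cls: value / evidence for cls, value in joint.items()}
print(posterior)  # roughly 0.328 for class 0 (cinema) and 0.672 for class 1 (education)
```

For the education class this is 3/5 × 4/25 × 1/25 × 2/25, and for cinema 2/5 × 1/20 × 3/20 × 1/20; normalizing the two gives the probabilities reported by `predict_proba`.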


## Build Model: Bernoulli Naive Bayes

In [80]:
from sklearn.naive_bayes import BernoulliNB

bnb = BernoulliNB()

# Fit model with Train dataset
bnb = bnb.fit(X_train_transformed, y_train)

# Predict class probabilities for the test dataset
bnb_prob = bnb.predict_proba(X_test_transformed)

bnb_prob

array([[0.2326374, 0.7673626]])

In [83]:
print(f"Probability of test document belonging to CINEMA is: {bnb_prob[0][0]*100}")
print(f"Probability of test document belonging to EDUCATION is: {bnb_prob[0][1]*100}")

Probability of test document belonging to CINEMA is: 23.263740486986176
Probability of test document belonging to EDUCATION is: 76.73625951301383
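
The Bernoulli result differs from the multinomial one because the model only looks at whether a word is present, and it also multiplies in (1 - p) for every vocabulary word that is absent from the test document. A hand computation along those lines is sketched below, assuming `alpha = 1` and binarization of counts at zero (the `BernoulliNB` defaults); the function name `bernoulli_joint` is made up for this illustration.

```python
# Document-term counts and labels copied from the notebook output above.
X_train = [
    [0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 1, 0],
]
y_train = [1, 1, 1, 0, 0]
x_test = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
alpha = 1.0  # Laplace smoothing, assumed to match the scikit-learn default

def bernoulli_joint(cls):
    """Return P(class) * product over all words of P(present) or P(absent)."""
    rows = [row for row, label in zip(X_train, y_train) if label == cls]
    n_docs = len(rows)
    prior = n_docs / len(X_train)
    # Number of documents in this class that contain each word (counts binarized).
    doc_freq = [sum(1 for value in col if value > 0) for col in zip(*rows)]
    joint = prior
    for df, x in zip(doc_freq, x_test):
        p_present = (df + alpha) / (n_docs + 2 * alpha)
        joint *= p_present if x > 0 else (1 - p_present)
    return joint

joint = {cls: bernoulli_joint(cls) for cls in (0, 1)}
evidence = sum(joint.values())
posterior = {cls: value / evidence for cls, value in joint.items()}
print(posterior)  # roughly 0.233 for class 0 (cinema) and 0.767 for class 1 (education)
```

Here the absence of cinema-specific words such as `cinema`, `movie`, and `sholey` counts against the cinema class, which is one reason the education probability comes out higher than under the multinomial model.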
