<a href="https://colab.research.google.com/github/parduet/CDA-340-notes/blob/main/Multinomial_NB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multinomial Naive Bayes

Similar to Bernoulli NB, but examines frequency of words rather than presence/absence.
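The difference can be written out under the usual bag-of-words formulation. The symbols here are introduced for illustration and are not from the course notes: $x_w$ is the number of times word $w$ occurs in document $d$, $b_w$ is 1 if $w$ occurs at all and 0 otherwise, $V$ is the vocabulary, $\theta_{cw}$ is the probability that a word drawn from class $c$ is $w$, and $p_{cw}$ is the probability that a document of class $c$ contains $w$.

$$
\text{Multinomial NB:}\quad P(c \mid d) \propto P(c) \prod_{w \in V} \theta_{cw}^{\,x_w}
$$

$$
\text{Bernoulli NB:}\quad P(c \mid d) \propto P(c) \prod_{w \in V} p_{cw}^{\,b_w} \, (1 - p_{cw})^{1 - b_w}
$$

In the multinomial form, a word that appears three times contributes its probability three times. In the Bernoulli form, it contributes once, and every absent word also contributes through the $(1 - p_{cw})$ factor.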

Multinomial NB works on count data, and its most common applications are text-based: spam filters, topic classification, sentiment analysis.

Which method to choose? \
* Use Bernoulli NB when the data is naturally binary, or when the text is short. \
* Use Multinomial NB when the text is longer, or for topic or sentiment analysis.
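
When the rules of thumb above do not settle the choice, one option is to try both models on the same data and compare cross-validated accuracy. The sketch below is only an illustration: `compare_nb` is a hypothetical helper, and the eight messages are made up, so the printed scores say nothing about real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_nb(texts, labels, cv=2):
    """Return the mean cross-validated accuracy of each NB variant."""
    candidates = {
        # Binary features feed Bernoulli NB, raw counts feed Multinomial NB
        "bernoulli": make_pipeline(CountVectorizer(binary=True), BernoulliNB()),
        "multinomial": make_pipeline(CountVectorizer(), MultinomialNB()),
    }
    return {
        name: float(cross_val_score(pipe, texts, labels, cv=cv).mean())
        for name, pipe in candidates.items()
    }


# Made-up toy corpus, four messages per class
texts = [
    "win a free prize now", "free cash free entry", "claim your prize today",
    "urgent free offer call now", "see you at lunch", "call me when you are home",
    "are we still meeting today", "thanks for the notes",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

scores = compare_nb(texts, labels)
print(scores)
```

On the SMS dataset used below, the same helper could be called as `compare_nb(df["message"], df["label"], cv=5)` after preprocessing.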


## Multinomial NB in Python

The code is the same as for Bernoulli NB, with two changes. Instead of \
&nbsp;&nbsp;&nbsp;&nbsp; `vectorizer = CountVectorizer(binary=True)` \
use \
&nbsp;&nbsp;&nbsp;&nbsp; `vectorizer = CountVectorizer()`

and instead of \
&nbsp;&nbsp;&nbsp;&nbsp;`model = BernoulliNB()` \
use \
&nbsp;&nbsp;&nbsp;&nbsp;`model = MultinomialNB()`
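
To see what the `binary=True` switch changes, here is a minimal sketch on a made-up two-message corpus (the messages are hypothetical, not from the SMS dataset). It compares the column for the word "free" under both vectorizers.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-message corpus, made up for this illustration
docs = ["free free free prize", "call me free tonight"]

binary_vec = CountVectorizer(binary=True)  # presence/absence, for Bernoulli NB
count_vec = CountVectorizer()              # raw counts, for Multinomial NB

X_binary = binary_vec.fit_transform(docs).toarray()
X_counts = count_vec.fit_transform(docs).toarray()

# vocabulary_ maps each word to its column index
binary_free = X_binary[:, binary_vec.vocabulary_["free"]].tolist()
counts_free = X_counts[:, count_vec.vocabulary_["free"]].tolist()

print("binary:", binary_free)  # [1, 1]
print("counts:", counts_free)  # [3, 1]
```

Both messages contain "free", so the binary column is identical for them. Only the count column records that the first message repeats the word three times.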



In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB #added Multinomial NB
from sklearn.metrics import accuracy_score, classification_report

from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')

# Download stopwords (only required once)
nltk.download('stopwords')
stop_words = stopwords.words('english')

#import dataset
from google.colab import drive
drive.mount('/content/drive')

file_path = "/content/drive/My Drive/Python course files/SMSSpamCollection.txt"
#file_path = "/content/drive/My Drive/Python course files/twitter_sentiment_analysis.csv"

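# Note: if the file has no header row (the original UCI download does not),
# add header=None so the first message is not consumed as column names
df=pd.read_csv(file_path, sep='\t')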
#df=pd.read_csv(file_path)

df.columns=["label", "message"]
#df.columns = ["twitterID", "entity", "label", "message"]

print(df.head())
print(df.info())

In [None]:
stemmer = PorterStemmer()
#lemmatizer = WordNetLemmatizer()

# Build the stopword set once; rebuilding it on every call is slow
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    if pd.isna(text):  # Handle missing values (pandas stores them as NaN, not only None)
        return ""
    text = str(text).lower()  # Ensure text is a string before applying transformations
    text = re.sub(r'\W', ' ', text)  # Replace every non-word character with a space
    words = text.split()
    #words = [word for word in words if word not in stop_words]
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    #words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)

df["message"] = df["message"].apply(preprocess_text)


In [None]:
#convert text to binary for Bernoulli NB, counts for Multinomial NB

###CountVectorizer(binary=True) for Bernoulli, CountVectorizer() for Multinomial
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(df["message"])
y = df["label"]

In [None]:
#split into training and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
#run the model
#model = BernoulliNB()
model=MultinomialNB()
model.fit(X_train, y_train)

In [None]:
#make predictions
y_pred = model.predict(X_test)

In [None]:
#examine the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       955
        spam       0.92      0.94      0.93       160

    accuracy                           0.98      1115
   macro avg       0.95      0.96      0.96      1115
weighted avg       0.98      0.98      0.98      1115
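
The precision, recall, and f1-score columns in the report are all derived from counts of correct and incorrect predictions for one class. The sketch below recomputes them by hand on ten made-up labels (not the SMS test set). `precision_recall_f1` is a hypothetical helper written for this illustration.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels: 6 ham and 4 spam, with one error in each direction
y_true_toy = ["ham"] * 6 + ["spam"] * 4
y_pred_toy = ["ham"] * 5 + ["spam"] + ["spam"] * 3 + ["ham"]

precision, recall, f1 = precision_recall_f1(y_true_toy, y_pred_toy, positive="spam")
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Here 3 of the 4 messages predicted as spam really are spam (precision 0.75), and 3 of the 4 real spam messages were caught (recall 0.75).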



# Methods of processing text

By default, every form of a word is treated as a separate token. For example, adore, adores, adorable, adoring, adoringly are counted as five different words. Likewise, misspell, misspells, misspelled, misspelling are separate.

Alternatives are stemming and lemmatization.

Stemming strips off suffixes by rule, even if what's left isn't a word. Lemmatization looks each word up in a dictionary (WordNet, in NLTK) and reduces it to its base form, which is always a real word.

Generally, stemming runs faster but is less accurate than lemmatization. Stemming is a better option for large datasets with informal text, while lemmatization works better for small datasets with formal text.

In [None]:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

#words = ["running", "flies", "easily", "studies", "arguing", "adorable", "misspelling"]
words=["adore", "adores", "adorable", "adoring", "adoringly"]

for word in words:
    #print(f"Original: {word} → Stemmed: {stemmer.stem(word)}")
    print(f"Original: {word} → Lemmatized: {lemmatizer.lemmatize(word, pos='v')}")  # pos='v' treats each word as a verb; the default is pos='n' (noun)

Original: adore → Lemmatized: adore
Original: adores → Lemmatized: adore
Original: adorable → Lemmatized: adorable
Original: adoring → Lemmatized: adore
Original: adoringly → Lemmatized: adoringly


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
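
For comparison with the lemmatizer output above, the same word list can be run through the stemmer. This is a small sketch that only uses `PorterStemmer`, which needs no NLTK download. The stems it prints may not be dictionary words.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["adore", "adores", "adorable", "adoring", "adoringly"]

# Map each original word to its stem
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"Original: {word} → Stemmed: {stem}")

# Number of distinct tokens left after stemming
print(len(set(stems.values())), "distinct stems from", len(words), "words")
```

The verb forms adore, adores, and adoring should collapse to a single stem, so the vectorizer sees fewer distinct tokens than it would with the raw words.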


## Stemming and Lemmatization in NB

To change to stemming (or lemmatization): \
1. In the import statements, add the statement: \
&nbsp; &nbsp; &nbsp; `from nltk.stem import PorterStemmer` or \
&nbsp; &nbsp; &nbsp; `from nltk.stem import WordNetLemmatizer` \
&nbsp; &nbsp; &nbsp; For lemmatization, also run `nltk.download('wordnet')` once.
2. Just above the `preprocess_text` function definition, add the statement: \
&nbsp; &nbsp; &nbsp;`stemmer = PorterStemmer()` or \
&nbsp; &nbsp; &nbsp;`lemmatizer = WordNetLemmatizer()`
3. Inside the `preprocess_text` function definition, replace `words = [word for word in words if word not in stop_words]` with: \
&nbsp; &nbsp; &nbsp; `words = [stemmer.stem(word) for word in words if word not in stop_words]` or \
&nbsp; &nbsp; &nbsp; `words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]`

Note that `lemmatizer.lemmatize(word)` treats every word as a noun by default. Pass `pos='v'`, as in the example above, to reduce verb forms.




