# 1. Multinomial Naive Bayes

In this section we build a spam classifier using Multinomial Naive Bayes. It is the kind of simple baseline that spam filters, such as those in Gmail or WhatsApp, are often described as starting from.

### 🧠 Step 1: Real-world idea

We want to predict whether a message is spam or not spam.

Example:

| Message                        | Label          |
| ------------------------------ | -------------- |
| "Win money now!!!"             | spam           |
| "Your project meeting at 10am" | ham (not spam) |
| "Free laptop just for you"     | spam           |
| "Can we talk later?"           | ham            |


### 🧩 Step 2: Concept

The algorithm looks for which words are common in spam vs ham messages.

It calculates probabilities like:
- P(word = “win” | spam)
- P(word = “win” | ham)

Then, when a new message comes in (say "Win a free phone now"),
it multiplies each class's prior probability by those word probabilities for every word in the message and picks whichever class (spam/ham) gives the higher result.
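
The calculation described above can be sketched by hand in plain Python. This is a minimal illustration, not what scikit-learn runs internally: the `tokenize`, `log_score`, and `predict` helpers are hypothetical names, the training data is the four-row table from Step 1, and it assumes add-one (Laplace) smoothing so that a word never seen in a class does not force the product to zero. Logarithms are summed instead of multiplying raw probabilities, which gives the same ranking while avoiding very small numbers.

```python
import math
import re
from collections import Counter

# Toy training data taken from the table in Step 1.
train = [
    ("Win money now!!!", "spam"),
    ("Your project meeting at 10am", "ham"),
    ("Free laptop just for you", "spam"),
    ("Can we talk later?", "ham"),
]

def tokenize(text):
    # Assumption: lowercase and keep runs of letters/digits as words.
    return re.findall(r"[a-z0-9]+", text.lower())

# Count word occurrences per class and messages per class.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    word_counts[label].update(tokenize(text))
    class_counts[label] += 1

vocab = set(word_counts["spam"]) | set(word_counts["ham"])

def log_score(text, label, alpha=1.0):
    # log P(class) + sum of log P(word | class), with add-alpha smoothing.
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in tokenize(text):
        if word not in vocab:
            continue  # words never seen in training are skipped
        p = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
        score += math.log(p)
    return score

def predict(text):
    # Pick the class with the larger log score.
    return max(("spam", "ham"), key=lambda label: log_score(text, label))

print(predict("Win a free phone now"))
```

Here "win", "free", and "now" each appear once in the spam messages and never in the ham messages, so the spam score comes out higher; "a" and "phone" are not in the training vocabulary and are ignored.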

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
messeges = [
    "Win money now",
    "Free cash prize",
    "Click to claim your reward",
    "Earn extra income today",
    "Meeting at 10am",
    "Let's have lunch tomorrow",
    "Project submission due",
    "Are you free tonight?",
]

labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

In [None]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messeges)

""" 
WHAT HAPPENED HERE?

#############################################
PHASE-1 : fit() phase → VOCABULARY GENERATION
#############################################

1. CountVectorizer.fit() scans all messages.
2. It lowercases each message and splits it into individual words (tokenization).
   With the default settings only tokens of two or more characters are kept,
   so "Let's" contributes "let" and the lone "s" is dropped.
3. It collects all unique words across all messages.
4. It sorts them alphabetically and uses each word's position as its column index.
5. It stores this vocabulary internally in the vectorizer object.

- Now vectorizer.vocabulary_ contains the word-to-index mapping:

{'win': 25, 'money': 15, 'now': 16, 'free': 9, 'cash': 3, 'prize': 17, 'click': 5, 'to': 21, 'claim': 4, 'your': 27, 'reward': 19, 'earn': 7, 'extra': 8, 'income': 11, 'today': 22, 'meeting': 14, 'at': 2, '10am': 0, 'let': 12, 'have': 10, 'lunch': 13, 'tomorrow': 23, 'project': 18, 'submission': 20, 'due': 6, 'are': 1, 'you': 26, 'tonight': 24}

#############################################################
PHASE-2 : transform() phase → DOCUMENT-TERM MATRIX GENERATION
#############################################################

1. CountVectorizer.transform() uses the vocabulary created in Phase 1.
2. For each message, it counts how many times each vocabulary word appears.
3. It creates the numerical matrix (in sparse matrix format), with one row per message and one column per vocabulary word.

- Now X contains our document-term matrix (8 messages x 28 words).

👇 See the link below for what the document-term matrix looks like 👇

"""

click here 👉 [Document-term Matrix](./assets/document_Matrix.xlsx)
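
The two phases can also be imitated in a few lines of plain Python, which may help make the vocabulary and the matrix less of a black box. This is only a sketch: `TOKEN_PATTERN`, `tokenize`, and `to_counts` are hypothetical helpers, and the pattern is assumed to mirror `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters). The real class returns a sparse matrix and offers many more options.

```python
import re

messages = [
    "Win money now",
    "Free cash prize",
    "Click to claim your reward",
    "Earn extra income today",
    "Meeting at 10am",
    "Let's have lunch tomorrow",
    "Project submission due",
    "Are you free tonight?",
]

# Assumed to mirror the default tokenization: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

# Phase 1 (fit): collect unique words, sort them, map each word to a column index.
unique_words = sorted({word for msg in messages for word in tokenize(msg)})
vocabulary = {word: index for index, word in enumerate(unique_words)}

# Phase 2 (transform): count each vocabulary word per message.
def to_counts(text):
    row = [0] * len(vocabulary)
    for word in tokenize(text):
        if word in vocabulary:  # words outside the vocabulary are ignored
            row[vocabulary[word]] += 1
    return row

matrix = [to_counts(msg) for msg in messages]

print(len(vocabulary), "columns,", len(matrix), "rows")
print("column index of 'win':", vocabulary["win"])
print("row for 'Win money now':", matrix[0])
```

Each row of `matrix` corresponds to one message and each column to one vocabulary word, which is the same layout as the spreadsheet linked above.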

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42
)

In [None]:
model = MultinomialNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
cr = classification_report(y_test, y_pred)

print(f"Confusion Matrix:\n{cm}")
print(f"\nAccuracy Score: {acc:.3f}")
print(f"\nClassification Report:\n{cr}")

#### Testing our model

In [None]:
sample = [
    "Win a free laptop now",
    "Let's schedule a meeting",
    "Click here to grab your free cash prize",
]

sample_vector = vectorizer.transform(sample)
print(f"\nPredictions: {model.predict(sample_vector)}")
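
To see how confident the model is, and not just which label it picks, `predict_proba` can be used alongside `predict`. The sketch below is self-contained and makes one assumption that differs from the cells above: it wraps the vectorizer and the classifier in a scikit-learn `Pipeline` and fits on all eight messages, so its predictions need not match a model trained on the 75% split.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "Win money now",
    "Free cash prize",
    "Click to claim your reward",
    "Earn extra income today",
    "Meeting at 10am",
    "Let's have lunch tomorrow",
    "Project submission due",
    "Are you free tonight?",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Assumption: train on every message to keep the sketch short.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(messages, labels)

sample = [
    "Win a free laptop now",
    "Let's schedule a meeting",
    "Click here to grab your free cash prize",
]

predictions = pipeline.predict(sample)
probabilities = pipeline.predict_proba(sample)
classes = pipeline.classes_  # column order of predict_proba

for text, label, probs in zip(sample, predictions, probabilities):
    detail = ", ".join(f"{c}={p:.2f}" for c, p in zip(classes, probs))
    print(f"{text!r} -> {label} ({detail})")
```

Because the pipeline accepts raw strings, the vectorizer's `transform` step is applied automatically to new messages, which removes the risk of forgetting it before calling `predict`.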

### Summary

**How Multinomial Naive Bayes works**

1. `CountVectorizer` from `sklearn.feature_extraction.text` is fitted on `messeges`, which extracts every unique word from the sentences.
2. `transform()` then builds the document-term matrix, recording how many times each extracted word appears in each sentence.
3. During training, the model estimates the probability of each word given each class: P(word | spam) and P(word | ham).
4. When a new sentence arrives, it is broken into words and the probability of each word under spam and under ham is looked up.
5. These are combined into an overall score for spam and for ham, and the class with the larger score is the final answer.