<a href="https://colab.research.google.com/github/kareemullah123456789/MLE/blob/main/Naive_Bayes_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Naive Bayes — Colab Markdown Notes

## Bayes’ Theorem (classification view)

$$
P(\text{Class} \mid \text{Evidence}) \;=\; \frac{P(\text{Evidence} \mid \text{Class}) \, P(\text{Class})}{P(\text{Evidence})}
$$

For comparing classes, the denominator is constant, so:

$$
P(\text{Class} \mid \text{Evidence}) \;\propto\; P(\text{Evidence} \mid \text{Class}) \, P(\text{Class})
$$

---

## Naive (conditional independence) assumption

For text with tokens $w_1,\dots,w_n$:

$$
P(w_1,\dots,w_n \mid \text{Class}) \;=\; \prod_{i=1}^{n} P(w_i \mid \text{Class})
$$

This lets us compute a simple product of per-word likelihoods.

---

## Worked example (toy spam filter)

Email to classify: **“win free money”**

**Priors**

$$
P(\text{Spam}) = 0.4, \quad P(\text{Ham}) = 0.6
$$

**Likelihoods**

Spam:

$$
P(\text{"win"}\mid\text{Spam})=0.2,\quad
P(\text{"free"}\mid\text{Spam})=0.3,\quad
P(\text{"money"}\mid\text{Spam})=0.4
$$

Ham:

$$
P(\text{"win"}\mid\text{Ham})=0.01,\quad
P(\text{"free"}\mid\text{Ham})=0.02,\quad
P(\text{"money"}\mid\text{Ham})=0.01
$$

**Scores (unnormalized posteriors)**

Spam:

$$
0.2 \times 0.3 \times 0.4 \times 0.4 \;=\; \mathbf{0.0096}
$$

Ham:

$$
0.01 \times 0.02 \times 0.01 \times 0.6 \;=\; \mathbf{0.0000012}
$$

Decision: **Spam** (higher score).
To get exact posteriors, divide each score by the sum of both scores.
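
The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch that only reuses the priors and likelihoods from the toy example; the dictionary layout and variable names are illustrative choices, not part of any library.

```python
import math

# Numbers copied from the toy spam example.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"win": 0.2, "free": 0.3, "money": 0.4},
    "ham": {"win": 0.01, "free": 0.02, "money": 0.01},
}
email = ["win", "free", "money"]

# Unnormalized score = prior * product of per-word likelihoods.
scores = {
    label: priors[label] * math.prod(likelihoods[label][w] for w in email)
    for label in priors
}

# Dividing by the sum of the scores gives posteriors that add up to 1.
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}

print(scores)
print(posteriors)
```

With these numbers the spam posterior comes out at roughly 0.9999, so normalizing does not change the decision; it only turns the scores into probabilities.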

---

## Why it works well in practice

* **Fast:** simple products (or sums in log space).
* **Effective baseline:** strong on text tasks like spam filtering and sentiment.
* **Data-efficient:** reasonable results with modest training data.

---

## Limitations

* **Independence assumption:** tokens are not truly independent (e.g., “New” and “York”).
* **Context/negation handling:** “not good” vs “good” needs care or additional features.

---

## Practical tips

**1) Smoothing (Laplace/Add-α)**
Avoids zero probabilities for unseen words:

$$
P(w \mid \text{Class}) \;=\; \frac{\text{count}(w,\text{Class}) + \alpha}
{\sum_{v \in V}\text{count}(v,\text{Class}) + \alpha \, |V|}
$$

Common default: $\alpha = 1$.
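
As a rough illustration of the smoothing formula, the sketch below applies it to a made-up set of spam word counts. The counts, the four-word vocabulary, and the helper name `smoothed_likelihood` are all hypothetical.

```python
from collections import Counter

def smoothed_likelihood(word, class_counts, vocab, alpha=1.0):
    """Add-alpha estimate of P(word | class) over a fixed vocabulary."""
    total = sum(class_counts[v] for v in vocab)
    return (class_counts[word] + alpha) / (total + alpha * len(vocab))

# Hypothetical word counts observed in spam messages.
spam_counts = Counter({"win": 3, "free": 2, "money": 5})
vocab = ["win", "free", "money", "lunch"]

# "lunch" was never seen in spam, yet its probability is not zero.
p_lunch = smoothed_likelihood("lunch", spam_counts, vocab)   # (0 + 1) / (10 + 4)
p_money = smoothed_likelihood("money", spam_counts, vocab)   # (5 + 1) / (10 + 4)

# The smoothed estimates still form a valid distribution over the vocabulary.
total_prob = sum(smoothed_likelihood(w, spam_counts, vocab) for w in vocab)

print(p_lunch, p_money, total_prob)
```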

**2) Use log probabilities**
Prevent underflow and turn products into sums:

$$
\log P(\text{Class} \mid \text{Evidence})
\;\propto\;
\log P(\text{Class}) + \sum_i \log P(w_i \mid \text{Class})
$$
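
To see why the log form matters, here is a small sketch with a hypothetical message of 100 tokens, each with likelihood 1e-5. The numbers are invented purely to trigger floating-point underflow.

```python
import math

# Hypothetical setup: 100 tokens, each with likelihood 1e-5 under some class.
likelihoods = [1e-5] * 100

# Naive product: the true value is 1e-500, far below the smallest positive float.
product = 1.0
for p in likelihoods:
    product *= p

# Log space: the same quantity as a sum of logs stays finite.
log_score = sum(math.log(p) for p in likelihoods)

print(product)
print(log_score)
```

The product collapses to zero, so every class would tie at a score of 0, while the log score stays finite and can still be compared across classes.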

**3) Variants**

* **Multinomial NB:** counts of words (standard for text).
* **Bernoulli NB:** binary presence/absence of words.
* **Gaussian NB:** continuous features (not typical for raw text).

---

## Typical workflow (text)

1. Clean text (lowercase, punctuation handling, optional stopword removal).
2. Vectorize (Bag-of-Words or TF-IDF).
3. Train **MultinomialNB** (with smoothing).
4. Evaluate with accuracy, precision/recall, F1; inspect confusion matrix.
5. Tune preprocessing (n-grams, min\_df/max\_df, stopwords) and smoothing α.
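
The workflow above can be sketched with a scikit-learn pipeline. The six training messages and their labels are made up for illustration, and the corpus is far too small for a meaningful evaluation, so step 4 is left out here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, for illustration only.
train_texts = [
    "win free money now",
    "free cash prize win",
    "claim your free prize",
    "lunch at noon tomorrow",
    "see you at the meeting",
    "are we still on for lunch",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Steps 1-3: CountVectorizer lowercases and tokenizes by default,
# then MultinomialNB is trained with add-one smoothing.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(train_texts, train_labels)

predictions = model.predict(["free money prize", "lunch meeting tomorrow"])
print(list(predictions))
```

Wrapping the vectorizer and the classifier in one pipeline means the vocabulary is learned only from the data passed to `fit`, which keeps test messages out of the training step.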


## Naive Bayes Algorithm - Classification Algorithm
- Based on **Bayes' theorem** of conditional probability.
- For each test point, it calculates the probability of belonging to each class, given the values of the features.
- P(0 | x1, x2, x3) and P(1 | x1, x2, x3) are calculated, where 0 and 1 are the classes and x1, x2, x3 are the features.
- This algorithm assumes
    - all features are independent of each other given the class, which is rarely true in practice. Despite this simplification, it is still a very strong algorithm.
    - sequence doesn't matter, which is also not true: the order of words in a sentence does matter for its meaning.
- E.g., in NLP it is used to judge the tone of a text by asking, 'If this word is present in the sentence, which class does it most likely belong to?'
- Widely used in ham/spam classification.
- Very fast, as it simply calculates probabilities from the features. It gives strong results even though the assumptions it rests on are usually violated.
- Because it processes words and language, it may be thought of as a part of NLP. But since it doesn't understand the semantics and sentiment of language, it can't handle language processing on its own.
### 3 Types of Naive Bayes in Scikit-Learn
__Gaussian__
- Used for classification with continuous features; it assumes that the features follow a normal distribution within each class.

__Multinomial__
- Used for discrete counts. For example, in a text classification problem it goes one step further than the Bernoulli model: instead of "word occurs in the document", the feature is "how often the word occurs in the document". You can think of it as "number of times outcome number_x is observed over n trials".

__Bernoulli__
- The Bernoulli model is useful if your feature vectors are binary (i.e., zeros and ones). One application would be text classification with a 'bag of words' model where the 1s and 0s are "word occurs in the document" and "word does not occur in the document" respectively.
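
The difference between the multinomial and Bernoulli views of a document can be shown without any library. In this sketch the three-word vocabulary and the sample message are invented for illustration.

```python
# Hypothetical vocabulary and message.
vocab = ["free", "money", "lunch"]
doc = "free free money".split()

# Multinomial view: how many times each vocabulary word occurs.
count_features = [doc.count(word) for word in vocab]

# Bernoulli view: whether each vocabulary word occurs at all.
binary_features = [int(word in doc) for word in vocab]

print(count_features)   # [2, 1, 0]
print(binary_features)  # [1, 1, 0]
```

The repeated "free" is visible only in the count features; the binary features record presence and absence, and a Bernoulli model also uses the absence of "lunch" as evidence.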

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os

base_path = "/content/drive/MyDrive/HV"

In [None]:
ls /content/drive/MyDrive/HV

custom_functions.py                      processing_data.py
Feature_engineering_and_PCA_Class.ipynb  __pycache__/
HR.csv                                   spam.csv
Loading_from_pickle_titanic.ipynb        spam.tsv
loan_approved.csv                        SVM_Implementation.ipynb
Naive_Bayes.ipynb                        SVM_Preprocessing.ipynb
output.csv                               train.csv
preprocessing.py                         Transformer_pickle_demo.ipynb
processed_data.pkl                       TreeAlgos_Data_Preprocessing.ipynb
Processed_data.pkl                       Tree_CT.pkl
Processed_data.pkl.pkl                   Trees_Implementation.ipynb


In [None]:
import os
import pandas as pd

base_path = "/content/drive/MyDrive/HV"   # or wherever your file actually is

df = pd.read_csv(
    os.path.join(base_path, "spam.tsv"),
    sep="\t",
    names=["Class", "text"]
)

print(df.head())


  Class                                               text
0   ham  I've been searching for the right words to tha...
1  spam  Free entry in 2 a wkly comp to win FA Cup fina...
2   ham  Nah I don't think he goes to usf, he lives aro...
3   ham  Even my brother is not like to speak with me. ...
4   ham               I HAVE A DATE ON SUNDAY WITH WILL!!!


In [None]:
# Read the CSV version of the dataset
import pandas as pd
import numpy as np
import string
df1 = pd.read_csv(os.path.join(base_path,"spam.csv"),encoding='latin1')    # latin1 is specified because the file contains bytes that are not valid UTF-8
df1.head()

Unnamed: 0,type,email
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
df.shape

(5567, 2)

## Text preprocessing

In [None]:
df.loc[df['Class']=='spam','Class'] = 1
df.loc[df['Class']=='ham','Class'] = 0
y = df['Class'].values #defining target
y

array([0, 1, 0, ..., 0, 0, 0], dtype=object)

In [None]:
# String comparison is sensitive to case and punctuation, so we strip punctuation and convert the text to lowercase
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
# Why is it important to remove punctuation?

"This message is spam" == "This message is spam."

False

In [None]:
def remove_punc(text):
    return "".join(char for char in text if char not in string.punctuation)
remove_punc('Animals are ,dfd/.,/')

'Animals are dfd'
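
As a possible alternative, the same cleanup can be done with `str.translate`, which avoids the per-character Python loop. This is only a sketch; the name `remove_punc_fast` is hypothetical and not used elsewhere in the notebook.

```python
import string

# Translation table that deletes every character in string.punctuation.
_PUNC_TABLE = str.maketrans("", "", string.punctuation)

def remove_punc_fast(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(_PUNC_TABLE)

cleaned = remove_punc_fast('Animals are ,dfd/.,/')
print(cleaned)  # Animals are dfd
```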

In [None]:
#Apply this function using 'apply' command or use a for loop
df['cln_txt'] = df['text'].apply(lambda x: remove_punc(x).lower())
df.head()

Unnamed: 0,Class,text,cln_txt
0,0,I've been searching for the right words to tha...,ive been searching for the right words to than...
1,1,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...
2,0,"Nah I don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...
3,0,Even my brother is not like to speak with me. ...,even my brother is not like to speak with me t...
4,0,I HAVE A DATE ON SUNDAY WITH WILL!!!,i have a date on sunday with will


- Tokenization is the process of converting normal text strings into a list of tokens (the individual words or terms).
- Now we need to convert each of those messages into a numeric vector that scikit-learn's models can work with.
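
Conceptually, the two steps look like the hand-rolled sketch below. It uses a plain whitespace split and two invented messages; scikit-learn's `CountVectorizer` applies its own tokenization rules, so treat this as an illustration of the idea rather than a replica.

```python
# Two made-up, already cleaned messages.
docs = ["free money free", "lunch at noon"]

# Tokenization: split each message into a list of tokens.
tokenized = [doc.split() for doc in docs]

# Vocabulary: every distinct token, in sorted order, gets one column.
vocab = sorted({token for tokens in tokenized for token in tokens})

# Vectorization: one row per message, one count per vocabulary word.
vectors = [[tokens.count(word) for word in vocab] for tokens in tokenized]

print(vocab)    # ['at', 'free', 'lunch', 'money', 'noon']
print(vectors)  # [[0, 2, 0, 1, 0], [1, 0, 1, 0, 1]]
```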

In [None]:
len(df)

5567

In [None]:
type(y)

numpy.ndarray

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
cv=CountVectorizer(stop_words='english')
# CountVectorizer converts text into a matrix of token counts.
# stop_words='english' drops very common words such as 'the' and 'a' from the bag of words,
# so they do not add weight to the class probabilities
x = df['cln_txt']
x_vec1 = cv.fit_transform(x)
x_vec = x_vec1.toarray()    #vectorising the words and making it an array.
y = y.astype('int8')  #Type casting y to integer

In [None]:
print(type(x_vec1), type(x_vec))

<class 'scipy.sparse._csr.csr_matrix'> <class 'numpy.ndarray'>


In [None]:
x_vec.shape

(5567, 9270)
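
A quick back-of-the-envelope calculation shows what the dense array of this shape costs in memory. The 8 bytes per value is an assumption (64-bit integer counts); the actual dtype in a given run may differ.

```python
# Shape reported for the dense array.
n_docs, n_features = 5567, 9270

# Assumption: each count is stored as a 64-bit integer (8 bytes).
bytes_per_value = 8

dense_bytes = n_docs * n_features * bytes_per_value
dense_megabytes = dense_bytes / 1e6

print(dense_bytes)
print(round(dense_megabytes, 1))
```

Under that assumption the dense array takes roughly 413 MB, almost all of it zeros. scikit-learn's Naive Bayes estimators also accept the sparse matrix directly, so the `toarray()` conversion is optional on larger corpora.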

In [None]:
x_vec

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x_vec,y,test_size=0.2,random_state=42)

In [None]:
print(len(x_train),len(x_test))

4453 1114


In [None]:
from sklearn.naive_bayes import MultinomialNB   # suits word-count features; BernoulliNB is the alternative for binary presence/absence features
nb = MultinomialNB()  #Creating instance
nb.fit(x_train,y_train)
y_pred = nb.predict(x_test)
y_pred

array([0, 0, 0, ..., 0, 0, 1], dtype=int8)

In [None]:
df.Class.value_counts()

Class,count
0,4821
1,746


In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.99      0.98      0.99       969
           1       0.88      0.96      0.92       145

    accuracy                           0.98      1114
   macro avg       0.94      0.97      0.95      1114
weighted avg       0.98      0.98      0.98      1114
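
For reference, the precision, recall, and F1 columns in a report like this are derived from the confusion-matrix counts. The sketch below computes them by hand for ten made-up labels; the arrays are hypothetical and unrelated to the dataset above.

```python
# Hypothetical true and predicted labels (1 = spam, 0 = ham).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # spam correctly flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # ham flagged as spam
fn = sum(t == 1 and p == 0 for t, p in pairs)  # spam that slipped through
tn = sum(t == 0 and p == 0 for t, p in pairs)  # ham correctly passed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(pairs)

print(tp, fp, fn, tn)
print(precision, recall, f1, accuracy)
```

With imbalanced classes such as ham versus spam, the per-class precision and recall are usually more informative than accuracy alone.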



In [None]:
text = input('Enter a message: ')
# Apply the same cleaning used for the training text before vectorizing
txt_vec = cv.transform([remove_punc(text).lower()])
_pred = nb.predict(txt_vec)
if _pred[0] == 0:
    print('ham')
else:
    print('spam')

Enter a message: you have won a lottery
spam


In [None]:
text = input('Enter a message: ')
# Apply the same cleaning used for the training text before vectorizing
txt_vec = cv.transform([remove_punc(text).lower()])
_pred = nb.predict(txt_vec)
if _pred[0] == 0:
    print('ham')
else:
    print('spam')

Enter a message: come for lunch
ham


## BAG OF WORDS
We cannot pass text directly to our models in Natural Language Processing, so we need to convert it into numbers that the machine can understand and model.
### TF-IDF approach
In the **BOW approach** we saw so far, all the words in the text are treated as equally important. There is no notion of some words in the document being more important than others. TF-IDF addresses this issue. It aims to quantify the importance of a given word relative to other words in the document and in the corpus. It is a commonly used representation scheme in information retrieval systems, for extracting relevant documents from a corpus for a given text query.


<font color=darkviolet>  **Term Frequency (tf)** </font>
TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in a long document than in a short one. Thus, the term frequency is often divided by the document length (i.e., the total number of terms in the document) as a way of normalization:

TF(t) = (Number of times term 't' appears in a document) / (Total number of terms in the document).



<font color=darkviolet>  **Inverse Document Frequency (idf)** </font>
IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but carry little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).



__Let's see an example:__

Consider a document containing 100 words wherein the word cat appears 3 times.

The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03.

Now, assume we have 10 million documents and the word cat appears in one thousand of these.

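Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4, using a base-10 logarithm here to keep the numbers round (with the natural logarithm, the value would be about 9.21).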

Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12
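
The cat example can be reproduced in a few lines. The value 4 corresponds to a base-10 logarithm, so this sketch computes the weight with both base 10 and the natural logarithm to show how the choice of base scales the result; the variable names are illustrative.

```python
import math

# Numbers from the cat example.
tf = 3 / 100                       # "cat" appears 3 times in a 100-word document
n_docs = 10_000_000                # documents in the corpus
docs_with_cat = 1_000              # documents that contain "cat"

idf_log10 = math.log10(n_docs / docs_with_cat)   # base-10 logarithm
idf_ln = math.log(n_docs / docs_with_cat)        # natural logarithm

tfidf_log10 = tf * idf_log10
tfidf_ln = tf * idf_ln

print(idf_log10, tfidf_log10)
print(idf_ln, tfidf_ln)
```

The base only rescales every weight by the same constant factor, so the ranking of terms is unaffected as long as one base is used consistently.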

In [None]:
## text preprocessing and feature vectorizer
# To extract features from a document of words, we import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

x = df['cln_txt']
tf=TfidfVectorizer() ## object creation
X=tf.fit_transform(x) ## fitting and transforming the data into vector
y = y.astype('int')

In [None]:
## print feature names selected from the raw documents
#tf.get_feature_names()
tf.get_feature_names_out()

array(['008704050406', '0089my', '0121', ..., 'zyada', 'üll', '〨ud'],
      dtype=object)

In [None]:
## number of features created
len(tf.get_feature_names_out())

9537

In [None]:
## getting the feature vectors as a dense array (TfidfVectorizer, like CountVectorizer, returns a SciPy sparse matrix; toarray() converts it to a NumPy array)
X=X.toarray()

In [None]:
## Creating training and testing
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=6)

In [None]:
## Model creation
from sklearn.naive_bayes import BernoulliNB

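## model object creation
## note: BernoulliNB binarizes its input (binarize=0.0 by default), so the TF-IDF weights are reduced to word presence/absence
nb=BernoulliNB(alpha=0.01)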

## fitting the model
nb.fit(X_train,y_train)

## getting the prediction
y_hat=nb.predict(X_test)

In [None]:
# Evaluating the model
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,y_hat))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1206
           1       0.96      0.96      0.96       186

    accuracy                           0.99      1392
   macro avg       0.98      0.98      0.98      1392
weighted avg       0.99      0.99      0.99      1392



### TF-IDF to the rescue

This is where TF-IDF comes in. It says:

- Words that occur often in one document are important (TF part).
- But words that occur everywhere across all documents are boring and should be down-weighted (IDF part).

__Fun example (spam emails edition)__

Suppose we have two “documents”:

- Doc1 (Spam): "Win money now money free free cash."
- Doc2 (Ham): "Let’s have lunch tomorrow at noon."

**Step 1: TF (term frequency)**

- In Doc1, “money” appears 2 times out of 7 words → TF = 2/7.
- In Doc2, “money” = 0/6 = 0.

**Step 2: IDF (inverse document frequency)**

- “money” appears in 1 out of 2 documents.
- IDF = log(2 / 1) = log(2) ≈ 0.693.

**Step 3: TF-IDF weight**

- For Doc1: TF-IDF(“money”) = (2/7) * 0.693 ≈ 0.198.
- For Doc2: TF-IDF(“money”) = 0.

Now compare that with the word “the”: if it existed in both documents, IDF would be log(2/2) = 0 → completely useless.

So TF-IDF basically says:

- “money” → kinda important.
- “the” → shut up.
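
The three steps can be reproduced with plain Python. This sketch follows the textbook formulas used in the example (TF as a fraction of the document length, IDF as a natural logarithm); the helper names `tf` and `idf` are illustrative, and `idf` assumes the term appears in at least one document.

```python
import math

# The two example documents, already lowercased and stripped of punctuation.
docs = {
    "doc1": "win money now money free free cash".split(),
    "doc2": "lets have lunch tomorrow at noon".split(),
}

def tf(term, tokens):
    """Fraction of the document's tokens that are `term`."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """Natural-log IDF; assumes `term` occurs in at least one document."""
    doc_freq = sum(term in tokens for tokens in corpus.values())
    return math.log(len(corpus) / doc_freq)

tfidf_doc1 = tf("money", docs["doc1"]) * idf("money", docs)
tfidf_doc2 = tf("money", docs["doc2"]) * idf("money", docs)

# A word present in both documents would get IDF = log(2 / 2) = 0.
idf_everywhere = math.log(2 / 2)

print(round(tfidf_doc1, 3), tfidf_doc2, idf_everywhere)
```

scikit-learn's `TfidfVectorizer` will not reproduce these exact numbers: by default it uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1, and L2-normalizes each row, so a word that appears in every document is down-weighted rather than zeroed out.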

**Why this matters**

If you’re building a spam filter, TF-IDF ensures that words like “free,” “win,” and “cash” scream louder in the math than boring filler words.

In search engines, TF-IDF-style weighting helps rank pages. For a query like “cat memes,” it boosts documents with rare-ish but relevant words like “cat” and “meme,” not ones filled with “the” and “is.”

**So in short:**

- Document = one piece of text (email, tweet, article, whatever).
- TF = how often a word shows up in that document.
- IDF = how rare the word is across all documents.
- TF-IDF = TF × IDF = the word’s importance.