<a href="https://colab.research.google.com/github/pascal-maker/machinelearning/blob/main/Naive_Bayes_Spam_Detection_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB

import nltk
from nltk.corpus import stopwords
import re #regular expressions
from bs4 import BeautifulSoup
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

pd.set_option('display.max_rows',1000)
pd.set_option('display.max_columns',1000)
pd.set_option('display.max_colwidth',150)

In [2]:
# Import dataset and split into features and targets

# read dataset
dataset = pd.read_csv('spam_train.csv')

testset = pd.read_csv('spam_test.csv')

# Split features from targets
y_train = dataset.type.values
X_train = dataset.text.values

y_test = testset.type.values
X_test = testset.text.values

testset.tail(20)



Unnamed: 0,type,text
830,ham,Wif my family booking tour package.
831,ham,GRAN ONLYFOUND OUT AFEW DAYS AGO.CUSOON HONI
832,ham,7 wonders in My WORLD 7th You 6th Ur style 5th Ur smile 4th Ur Personality 3rd Ur Nature 2nd Ur SMS and 1st Ur Lovely Friendship... good morning dear
833,spam,FREE for 1st week! No1 Nokia tone 4 ur mobile every week just txt NOKIA to 8077 Get txting and tell ur mates. www.getzed.co.uk POBox 36504 W45WQ 1...
834,ham,Let me know how it changes in the next 6hrs. It can even be appendix but you are out of that age range. However its not impossible. So just chill ...
835,ham,"Sorry, I'll call you later. I am in meeting sir."
836,ham,"Im in inperialmusic listening2the weirdest track ever by”leafcutter john”-sounds like insects being molested&someone plumbing,remixed by evil men ..."
837,ham,Dare i ask... Any luck with sorting out the car?
838,ham,My birthday is on feb # da. .
839,ham,"Thk shld b can... Ya, i wana go 4 lessons... Haha, can go for one whole stretch..."


Reads the training and test datasets (spam_train.csv, spam_test.csv).

Splits the data into features (text) and targets (type) for both training and test sets.

Displays the last 20 rows of the test set for a quick data check.

In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Downloads the NLTK stopwords corpus, which includes the English list used in text preprocessing.

Stopwords are common words (e.g., "the", "and", "is") that are often removed to reduce noise in text classification.



In [4]:
# Text preprocessing

def text_preprocessing(text, language, minWordSize):

    # remove html
    text_no_html = BeautifulSoup(str(text),"html.parser" ).get_text()

    # remove non-letters
    text_alpha_chars = re.sub("[^a-zA-Z']", " ", str(text_no_html))

    # convert to lower-case
    text_lower = text_alpha_chars.lower()

    # remove stop words
    stops = set(stopwords.words(language))
    text_no_stop_words = ' '

    for w in text_lower.split():
        if w not in stops:
            text_no_stop_words = text_no_stop_words + w + ' '

    # do stemming
    text_stemmer = ' '
    stemmer = SnowballStemmer(language)
    for w in text_no_stop_words.split():
        text_stemmer = text_stemmer + stemmer.stem(w) + ' '

    # remove short words
    text_no_short_words = ' '
    for w in text_stemmer.split():
        if len(w) >= minWordSize:
            text_no_short_words = text_no_short_words + w + ' '


    return text_no_short_words


### **Beginner-Friendly Explanation: Text Preprocessing Function**

This function is used to **clean** and **prepare** **text data** before feeding it into a **machine learning** model. It removes **noise**, **reduces** the **size** of the text, and **standardizes** it, which makes it **easier** for the **model** to **learn**.

---

#### **1. Function Definition**

```python
def text_preprocessing(text, language, minWordSize):
```

* **text**: The **input** text you want to **clean**.
* **language**: The **language** of the text (**English**, **Dutch**, etc.).
* **minWordSize**: The **minimum** length a word needs to be to be **kept**.

---

#### **2. Remove HTML Tags**

```python
# remove html
text_no_html = BeautifulSoup(str(text), "html.parser").get_text()
```

* **BeautifulSoup** is used to **remove** any **HTML tags** from the text.
* For example, **"<p>Hello</p>"** becomes just **"Hello"**.

---

#### **3. Remove Non-Letter Characters**

```python
# remove non-letters
text_alpha_chars = re.sub("[^a-zA-Z']", " ", str(text_no_html))
```

* **re.sub()** replaces **every character** that is **not** a **letter** or an **apostrophe** with a **space**.
* For example, **"Hi! How are you? 😊"** becomes **"Hi  How are you   "**; the extra spaces disappear later because **split()** ignores them.
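
To see what this pattern keeps and drops, it can be run on a few sample strings. The sketch below uses only Python's `re` module; the sample sentences are made up for illustration.

```python
import re

# Same pattern as in text_preprocessing: keep letters and apostrophes only.
pattern = "[^a-zA-Z']"

sample = "Hi! How are you? 😊"
cleaned = re.sub(pattern, " ", sample)

# Each removed character becomes a space, so runs of spaces appear...
print(repr(cleaned))
# ...but str.split() with no arguments collapses them again.
print(cleaned.split())

# Apostrophes survive, digits do not.
digits_demo = re.sub(pattern, " ", "Don't txt 8077").split()
print(digits_demo)
```

Note that digits are removed as well, so short codes and phone numbers in SMS spam do not reach the model as features.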

---

#### **4. Convert to Lowercase**

```python
# convert to lower-case
text_lower = text_alpha_chars.lower()
```

* Makes the entire text **lowercase**.
* For example, **"Hello World"** becomes **"hello world"**.
* This makes it easier to **match** words during **processing**.

---

#### **5. Remove Stop Words**

```python
# remove stop words
stops = set(stopwords.words(language))
text_no_stop_words = ' '

for w in text_lower.split():
    if w not in stops:
        text_no_stop_words = text_no_stop_words + w + ' '
```

* **Stop words** are common words like **"the"**, **"is"**, **"in"**, **"and"**, which don’t carry **much** meaning.
* This step **removes** these words to **reduce** the **text size**.
* For example, **"I love the cats"** becomes **"love cats"**.

---

#### **6. Stemming the Words**

```python
# do stemming
text_stemmer = ' '
stemmer = SnowballStemmer(language)

for w in text_no_stop_words.split():
    text_stemmer = text_stemmer + stemmer.stem(w) + ' '
```

* **Stemming** strips word endings to reduce words to a common **stem**, which is not always a dictionary word (for example, **"energy"** becomes **"energi"**).
* For example, **"running"** and **"runs"** both become **"run"**.
* This makes the text **simpler** and **reduces** the **number** of **unique** words.

---

#### **7. Remove Short Words**

```python
# remove short words
text_no_short_words = ' '
for w in text_stemmer.split():
    if len(w) >= minWordSize:
        text_no_short_words = text_no_short_words + w + ' '
```

* Removes **very short** words (like **"a"**, **"an"**, **"it"**) that are **less meaningful**.
* The minimum word length is set by **minWordSize**.
* For example, if **minWordSize** = 3, **"cat"** is **kept** but **"at"** is **removed**.

---

#### **8. Return the Cleaned Text**

```python
return text_no_short_words
```

* The **cleaned** and **processed** text is **returned** as the **final** output.

---

#### **9. Example Input and Output**

* **Input**: **"I am running with the cats! 😊"**
* **Output**: **" run cat "** (assuming English and minWordSize = 3; the leading and trailing spaces come from how the result string is built)
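
The same steps can be written more compactly with list comprehensions and `" ".join(...)`. The sketch below is not the notebook's function: it skips the HTML step, and `toy_stops`, `toy_stems`, and `toy_stem` are hypothetical stand-ins for the NLTK stop word list and `SnowballStemmer`, so that it runs without any downloads.

```python
import re

def preprocess(text, stops, stem, min_word_size):
    """Clean one message; `stops` and `stem` are supplied by the caller."""
    letters_only = re.sub("[^a-zA-Z']", " ", str(text)).lower()
    kept = [w for w in letters_only.split() if w not in stops]
    stemmed = [stem(w) for w in kept]
    return " ".join(w for w in stemmed if len(w) >= min_word_size)

# Hypothetical stand-ins so the sketch runs without NLTK:
toy_stops = {"i", "am", "with", "the"}
toy_stems = {"running": "run", "cats": "cat"}

def toy_stem(word):
    return toy_stems.get(word, word)

result = preprocess("I am running with the cats! 😊", toy_stops, toy_stem, 3)
print(repr(result))
```

Joining a list avoids the padding spaces that appear when a string is built up from `' '` by repeated concatenation.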

---



In [5]:
# Preprocess the training and test text (bag-of-words conversion happens later)
language = 'english'
minWordLength = 2

for i in range(X_train.size):
    X_train[i] = text_preprocessing(X_train[i], language, minWordLength)


for i in range(X_test.size):
    X_test[i] = text_preprocessing(X_test[i], language, minWordLength)



### **Preprocessing the Training and Test Sets - Beginner Explanation**

---

#### **1. What Does This Code Do?**

* It **cleans** the **training** (**X\_train**) and **test** (**X\_test**) text data.
* It **preprocesses** the text using the **text\_preprocessing** function defined earlier.
* It prepares the data for the **Bag of Words** conversion, which happens in the **next** step.

---

#### **2. Setting Language and Minimum Word Length**

```python
language = 'english'
minWordLength = 2
```

* **language**: The language for **stop word removal** and **stemming** (**English** in this case).
* **minWordLength**: The **minimum** length a word must have to be **kept** (**2** characters).

---

#### **3. Preprocessing the Training Data**

```python
for i in range(X_train.size):
    X_train[i] = text_preprocessing(X_train[i], language, minWordLength)
```

* **Loops** through each **text message** in the **training set** (**X\_train**).
* **Cleans** the text using the **text\_preprocessing** function.
* **Replaces** the **original** text with the **cleaned** version.

---

#### **4. Preprocessing the Test Data**

```python
for i in range(X_test.size):
    X_test[i] = text_preprocessing(X_test[i], language, minWordLength)
```

* **Does the same** for the **test** set (**X\_test**).
* Makes sure the **test** data is **processed** in the **same way** as the **training** data.
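
The loops above overwrite each message in place. If the raw text should stay available for inspection, one alternative is to build new lists instead. This is a minimal sketch; `clean_all` and `toy_clean` are hypothetical names, and `toy_clean` only stands in for `text_preprocessing`.

```python
def clean_all(messages, clean):
    """Return a new list of cleaned messages; the input is left untouched."""
    return [clean(m) for m in messages]

# Hypothetical stand-in for text_preprocessing, just for the demo.
def toy_clean(message):
    return message.lower().strip()

raw_train = ["Dogs are GREAT ", "FREE tone"]
clean_train = clean_all(raw_train, toy_clean)

print(raw_train)    # originals are still available for inspection
print(clean_train)
```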

---

#### **5. Why Do This?**

* **Consistency**: Ensures the **model** sees **similar** data during **training** and **testing**.
* **Reduced Noise**: Removes **unnecessary** characters and **stop words**.
* **Smaller Vocabulary**: Makes the text **simpler** and **easier** for the model to **learn**.

---

#### **6. Example Input and Output**

Suppose you have a **training** set:

* **Before**:

  * **X\_train\[0]** = "I am running with the cats! 😊"
  * **X\_train\[1]** = "Dogs are great."

* **After**:

  * **X\_train\[0]** = " run cat "
  * **X\_train\[1]** = " dog great "

---



In [6]:
print(X_train[4707])

 sms ac jsco energi high may know channel day ur leadership skill strong psychic repli an question end repli end jsco 


In [7]:
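# Make sparse feature vectors
# Bag of words

count_vect = CountVectorizer()
# fit() learns the vocabulary and returns the vectorizer itself,
# so this name is overwritten by the real matrix on the next line.
X_train_bag_of_words = count_vect.fit(X_train)
X_train_bag_of_words = count_vect.transform(X_train)
# The test set is only transformed, so it reuses the training vocabulary.
X_test_bag_of_words = count_vect.transform(X_test)

print(X_train_bag_of_words)
#print(X_test_bag_of_words)

# Note: tfidf_transformer is never used below; tf_transformer is the fitted one.
tfidf_transformer = TfidfTransformer()
tf_transformer = TfidfTransformer(use_idf=True).fit(X_train_bag_of_words)
X_train_tf = tf_transformer.transform(X_train_bag_of_words)
X_test_tf = tf_transformer.transform(X_test_bag_of_words)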

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 37656 stored elements and shape (4709, 5738)>
  Coords	Values
  (0, 379)	1
  (0, 2647)	1
  (0, 3460)	1
  (0, 3708)	1
  (1, 903)	1
  (1, 2084)	1
  (2, 516)	1
  (2, 618)	1
  (2, 624)	1
  (2, 1019)	1
  (2, 1157)	1
  (2, 1766)	1
  (2, 1836)	1
  (2, 2000)	1
  (2, 3302)	1
  (2, 4763)	1
  (2, 5173)	1
  (2, 5393)	1
  (2, 5543)	1
  (3, 282)	1
  (3, 709)	1
  (3, 805)	1
  (3, 1115)	1
  (3, 3106)	1
  (3, 3380)	1
  :	:
  (4706, 3846)	1
  (4706, 4754)	1
  (4706, 5639)	1
  (4707, 25)	1
  (4707, 183)	1
  (4707, 801)	1
  (4707, 1157)	1
  (4707, 1498)	2
  (4707, 1503)	1
  (4707, 2194)	1
  (4707, 2558)	2
  (4707, 2647)	1
  (4707, 2717)	1
  (4707, 2982)	1
  (4707, 3891)	1
  (4707, 3932)	1
  (4707, 4067)	2
  (4707, 4444)	1
  (4707, 4493)	1
  (4707, 4716)	1
  (4707, 5266)	1
  (4708, 709)	1
  (4708, 1166)	1
  (4708, 1781)	1
  (4708, 4339)	1


### **Beginner-Friendly Explanation: Creating Sparse Feature Vectors (Bag of Words)**

This code converts your **preprocessed** text data into **sparse** feature vectors, which are used to **train** machine learning models like **Naive Bayes** or **Logistic Regression**.

---

#### **1. Why Use Sparse Vectors?**

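* Each message contains only a **handful** of the **thousands** of words in the vocabulary, so almost every entry in the matrix is **zero**.
* Sparse matrices store only the **non-zero** entries, which **saves** memory and **speeds up** training.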
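
How sparse the data is can be worked out from the matrix summary printed in this notebook (shape `(4709, 5738)` with 37656 stored elements). The short calculation below only does that arithmetic; the variable names are my own.

```python
# Numbers taken from the printed training matrix in this notebook.
n_documents = 4709
n_vocabulary = 5738
n_stored = 37656  # non-zero entries

n_cells = n_documents * n_vocabulary
density = n_stored / n_cells

print(n_cells)                      # total cells a dense matrix would hold
print(f"{density:.4%} non-zero")    # share of cells that carry a count
print(round(n_stored / n_documents, 1), "stored values per message on average")
```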

---

#### **2. Creating a Bag of Words Model**

```python
# Create the bag of words model
count_vect = CountVectorizer()
```

* **CountVectorizer**: Converts **text** into a **matrix** of **word counts**.
* Each **column** is a **word**.
* Each **row** is a **document**.

---

#### **3. Fitting the Vectorizer to Training Data**

```python
X_train_bag_of_words = count_vect.fit(X_train)
```

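* **Fits** the **vectorizer** to the **training data**.
* Learns the **vocabulary** (all the unique words in your **training** set).
* **fit()** returns the **vectorizer** itself, not a matrix, so the variable assigned here is **overwritten** by the **transform** call in the next step.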

---

#### **4. Transforming the Training and Test Data**

```python
X_train_bag_of_words = count_vect.transform(X_train)
X_test_bag_of_words = count_vect.transform(X_test)
```

* **Transforms** the **training** and **test** data into **sparse matrices**.
* Each **row** is a **document**, each **column** is a **word**.
* The **values** are the **counts** of each **word** in the document.

---

#### **5. What the Matrix Looks Like**

If you **print** **X\_train\_bag\_of\_words**, it might look like:

```
(0, 10)    2
(0, 25)    1
(1, 15)    3
(2, 7)     1
```

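* **(0, 10)**: Document **0** contains the word in vocabulary column **10** **2** times.
* **(1, 15)**: Document **1** contains the word in vocabulary column **15** **3** times.
* Row and column numbers start at **0**, and only **non-zero** counts are listed.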
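
The fit/transform split and the `(row, column)  count` listing can be reproduced by hand on a toy corpus. The sketch below is a plain-Python stand-in, not `CountVectorizer` itself: it splits on whitespace instead of using the vectorizer's own tokenizer, and the documents are made up.

```python
from collections import Counter

train_docs = ["free tone free", "call later", "free call"]
test_docs = ["free prize call"]  # "prize" never appears in training

# "Fit": learn the vocabulary from the training documents only.
vocabulary = sorted({word for doc in train_docs for word in doc.split()})
index_of = {word: i for i, word in enumerate(vocabulary)}

def to_counts(doc):
    """Return {column_index: count} for words known to the vocabulary."""
    counts = Counter(w for w in doc.split() if w in index_of)
    return {index_of[w]: c for w, c in counts.items()}

# "Transform": one sparse row per document.
train_rows = [to_counts(d) for d in train_docs]
test_rows = [to_counts(d) for d in test_docs]

print(vocabulary)
for row_number, row in enumerate(train_rows):
    for column, count in sorted(row.items()):
        print((row_number, column), count)
print(test_rows)
```

Words that occur only in the test data, such as "prize" here, have no column and are silently dropped, which is why the vocabulary is learned from the training set alone.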

---

#### **6. Adding TF-IDF Weighting**

```python
tfidf_transformer = TfidfTransformer()
tf_transformer = TfidfTransformer(use_idf=True).fit(X_train_bag_of_words)
```

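* **TF-IDF Transformer**:

  * Converts the **raw** word counts into **TF-IDF** scores.
  * **TF-IDF** gives **less** weight to words that appear in **many** messages and relatively **more** to words that are **rare** across the training set.
  * Only **tf\_transformer** is used afterwards; the first line creates an **unfitted** transformer that is never applied.
  * It is **fitted** on the **training** matrix only, so no information from the **test** set leaks into the weights.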

---

#### **7. Transforming the Data with TF-IDF**

```python
X_train_tf = tf_transformer.transform(X_train_bag_of_words)
X_test_tf = tf_transformer.transform(X_test_bag_of_words)
```

* **Transforms** the **raw** count matrix into a **TF-IDF** weighted matrix.
* Now, each **value** represents the **importance** of a word in a **document**, not just its **count**.
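
To make the weighting concrete, here is a hand-rolled TF-IDF on a toy count matrix. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by scaling each row to unit length, which is my understanding of `TfidfTransformer`'s default settings; treat it as a sketch rather than a reimplementation of the library.

```python
import math

# Toy count matrix: rows are documents, columns are words.
words = ["call", "free", "tone"]
counts = [
    [1, 2, 0],
    [1, 0, 0],
    [1, 0, 1],
]

n_docs = len(counts)
# Document frequency: in how many documents each word appears.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(words))]
# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def tfidf_row(row):
    """Weight counts by idf, then scale the row to unit Euclidean length."""
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted

tfidf = [tfidf_row(row) for row in counts]
for word, weight in zip(words, idf):
    print(word, round(weight, 3))
print([round(v, 3) for v in tfidf[0]])
```

In this toy corpus "call" appears in every document and gets the lowest idf, while "free" and "tone" appear in one document each and are weighted more heavily.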

---

#### **8. Key Benefits**

* **More Meaningful**:

  * TF-IDF captures **important** words better than raw counts.
* **Reduced Noise**:

  * Common words get **down-weighted**.
* **Efficient**:

  * Sparse matrices save **memory** and improve **speed**.

---

#### **9. What You Have Now**

You now have:

* **X\_train\_tf**: Your **training** data as a **TF-IDF** weighted **sparse** matrix.
* **X\_test\_tf**: Your **test** data in the **same** format.

---



In [8]:
print(X_train_bag_of_words.shape)

(4709, 5738)


In [9]:
# Naive bayes

NBclassifier = MultinomialNB(alpha=1)

NBclassifier.fit(X_train_tf, y_train)

y_pred = NBclassifier.predict(X_test_tf)
print(classification_report(y_test, y_pred))

cf = confusion_matrix(y_test, y_pred)
print(cf)
print(accuracy_score(y_test, y_pred) * 100)


              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       738
        spam       0.99      0.84      0.91       112

    accuracy                           0.98       850
   macro avg       0.98      0.92      0.95       850
weighted avg       0.98      0.98      0.98       850

[[737   1]
 [ 18  94]]
97.76470588235294


### **Beginner-Friendly Explanation: Training a Naive Bayes Classifier**

---

#### **1. What is Naive Bayes?**

* **Naive Bayes** is a **simple** but **powerful** algorithm for **text classification**.
* It uses **probabilities** to **predict** the **class** of a message.
* It is called **"naive"** because it assumes that, within a class, each word occurs **independently** of the others. This is usually **not** true, but the model **still** works well in practice.

---

#### **2. Create the Naive Bayes Classifier**

```python
NBclassifier = MultinomialNB(alpha=1)
```

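* **MultinomialNB**: This version is designed for **text data** represented as **word counts**. Here it is trained on **TF-IDF** values instead, which are fractional; this usually works in practice, though the model was formulated for counts.
* **alpha=1**: This is the **Laplace (additive) smoothing** parameter, which prevents a word that never appeared with a class during training from getting a probability of **zero**.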
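
The effect of smoothing is easiest to see in the formula for a word's probability within a class, `(count + alpha) / (total + alpha * vocabulary_size)`. The sketch below applies it to made-up counts; the function name and the numbers are illustrative only.

```python
def smoothed_word_probability(word_count, total_count, vocabulary_size, alpha=1.0):
    """Additive (Laplace) smoothing for P(word | class)."""
    return (word_count + alpha) / (total_count + alpha * vocabulary_size)

# Made-up counts for one class: 100 word occurrences over a 50-word vocabulary.
total_in_spam = 100
vocabulary_size = 50

seen = smoothed_word_probability(10, total_in_spam, vocabulary_size)
unseen = smoothed_word_probability(0, total_in_spam, vocabulary_size)
unsmoothed = smoothed_word_probability(0, total_in_spam, vocabulary_size, alpha=0.0)

print(seen)        # (10 + 1) / (100 + 50)
print(unseen)      # (0 + 1) / (100 + 50): small but not zero
print(unsmoothed)  # 0.0: would wipe out the whole product of probabilities
```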

---

#### **3. Train the Classifier**

```python
NBclassifier.fit(X_train_tf, y_train)
```

* **fit()**: Trains the **Naive Bayes** classifier on the **training** data.
* **X\_train\_tf**: The **TF-IDF** matrix of the **training** text.
* **y\_train**: The **labels** for the training data (**spam** or **ham**).

---

#### **4. Make Predictions on the Test Set**

```python
y_pred = NBclassifier.predict(X_test_tf)
```

* **predict()**: Uses the **trained** model to **predict** the **class** of each message in the **test** set.
* **X\_test\_tf**: The **TF-IDF** matrix of the **test** text.

---

#### **5. Print the Classification Report**

```python
print(classification_report(y_test, y_pred))
```

* Prints a **detailed** report showing:

  * **Precision**: Of the messages **predicted** as a class, the fraction that really **belong** to it.
  * **Recall**: Of the messages that really **belong** to a class, the fraction the classifier **found**.
  * **F1-Score**: The **harmonic mean** of **precision** and **recall**.
  * **Support**: The **number** of test samples that actually belong to **each** class.
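
These definitions can be checked against the confusion matrix printed for the Naive Bayes model in this notebook, `[[737, 1], [18, 94]]`. The sketch below recomputes the spam row of the report by hand, assuming rows are actual classes and columns are predicted classes in the order ham, spam.

```python
# Confusion matrix printed for the Naive Bayes model above:
# rows = actual (ham, spam), columns = predicted (ham, spam).
cf = [[737, 1],
      [18, 94]]

tn, fp = cf[0]
fn, tp = cf[1]

precision_spam = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall_spam = tp / (tp + fn)      # of actual spam, how many were flagged
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)
support_spam = tp + fn            # number of actual spam messages

print(round(precision_spam, 2), round(recall_spam, 2),
      round(f1_spam, 2), support_spam)
```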

---

#### **6. Print the Confusion Matrix**

```python
cf = confusion_matrix(y_test, y_pred)
print(cf)
```

* Shows a **matrix** of how many **correct** and **incorrect** predictions the model made.
* **Rows**: **Actual** classes.
* **Columns**: **Predicted** classes.

---

#### **7. Print the Accuracy**

```python
print(accuracy_score(y_test, y_pred) * 100)
```

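* Prints the **overall** accuracy as a **percentage**.
* **Higher** is **better**, but because only **112** of the **850** test messages are spam, accuracy alone can hide **missed spam**; check the spam **recall** as well.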
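
A useful reference point for accuracy on imbalanced data is a baseline that always predicts the majority class. The sketch below compares that baseline with the Naive Bayes result, using the class counts and confusion matrix printed in this notebook.

```python
# Class counts from the "support" column of the report above.
n_ham, n_spam = 738, 112
n_total = n_ham + n_spam

# A "classifier" that labels every message as ham is right on all ham messages.
baseline_accuracy = n_ham / n_total * 100

# Naive Bayes: correct predictions are the diagonal of [[737, 1], [18, 94]].
model_accuracy = (737 + 94) / n_total * 100

print(round(baseline_accuracy, 2))
print(round(model_accuracy, 2))
```

The baseline catches no spam at all yet still scores well above 80%, so the gap between the two numbers says more than the raw accuracy does.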

---

#### **8. What the Output Looks Like**

The cell above printed this report:

```
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       738
        spam       0.99      0.84      0.91       112

    accuracy                           0.98       850
   macro avg       0.98      0.92      0.95       850
weighted avg       0.98      0.98      0.98       850
```

Confusion Matrix:

```
[[737   1]
 [ 18  94]]
```

Accuracy:

```
97.76470588235294
```

---

#### **9. What This Means**

* The model is **97.76%** accurate, meaning it correctly classified **831** of the **850** test messages.
* Only **1** ham message was flagged as spam, but **18** of the **112** spam messages were missed, which is why spam **recall** is only **0.84**.

---

#### **10. Key Benefits**

* **Fast** and **simple**.
* **Works well** with **small** amounts of **data**.
* **Easy** to **understand** and **implement**.

---




Trains a Multinomial Naive Bayes model with Laplace smoothing (alpha=1).

Evaluates the model on the test set and prints:

* Classification report
* Confusion matrix
* Overall accuracy



In [10]:
# train a logistic regression classifier
lregclassifier = LogisticRegression(C=10)

lregclassifier.fit(X_train_tf, y_train)

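Trains a Logistic Regression model with C=10. In scikit-learn, `C` is the inverse of the regularization strength, so C=10 regularizes less than the default C=1.0.

The next cell predicts on the test set and prints:

* Classification report
* Confusion matrix
* Overall accuracy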
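
Why a larger `C` means weaker regularization is easier to see from the objective being minimized. One common way to write L2-penalized logistic regression, with labels $y_i \in \{-1, +1\}$, weights $w$, and intercept $b$, is shown below. This is the form as I recall it from the scikit-learn user guide, so treat it as a sketch rather than the exact expression the solver uses.

$$
\min_{w,\,b}\; \frac{1}{2}\, w^\top w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i \left(x_i^\top w + b\right)}\right)
$$

Because `C` multiplies the data-fit term, raising it makes the penalty $\frac{1}{2} w^\top w$ matter relatively less, so the weights are allowed to grow larger.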

In [11]:
# test logistic classifier

y_pred = lregclassifier.predict(X_test_tf)
print(classification_report(y_test, y_pred))

cf = confusion_matrix(y_test, y_pred)
print(cf)
print(accuracy_score(y_test, y_pred) * 100)

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99       738
        spam       0.99      0.92      0.95       112

    accuracy                           0.99       850
   macro avg       0.99      0.96      0.97       850
weighted avg       0.99      0.99      0.99       850

[[737   1]
 [  9 103]]
98.82352941176471


### **Code Breakdown: Logistic Regression Model Testing**

```python
# test logistic classifier

y_pred = lregclassifier.predict(X_test_tf)
print(classification_report(y_test, y_pred))

cf = confusion_matrix(y_test, y_pred)
print(cf)
print(accuracy_score(y_test, y_pred) * 100)
```

This code block tests the performance of a **Logistic Regression** model (`lregclassifier`) that has already been **trained** on **TF-IDF features** from the training data.

---

#### **1. Making Predictions**

```python
y_pred = lregclassifier.predict(X_test_tf)
```

* **`predict()`** is used to **generate class predictions** on the **test set**.
* **`X_test_tf`** is the **TF-IDF transformed** test data.
* **Output** is a **1D array** (`y_pred`) of predicted labels (**spam** or **ham**).

---

#### **2. Generating a Classification Report**

```python
print(classification_report(y_test, y_pred))
```

* **Generates a detailed performance report** including:

  * **Precision:** How many selected items are relevant?
  * **Recall (Sensitivity):** How many relevant items are selected?
  * **F1-score:** Harmonic mean of precision and recall.
  * **Support:** The number of true instances for each label.

For a binary classification like spam detection, this report is very useful as it provides a per-class breakdown.

---

#### **3. Creating a Confusion Matrix**

```python
cf = confusion_matrix(y_test, y_pred)
print(cf)
```

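* **`confusion_matrix()`** creates a matrix that **summarizes the performance** of the classifier.
* It shows the **True Positives (TP)**, **True Negatives (TN)**, **False Positives (FP)**, and **False Negatives (FN)**, treating **spam** as the positive class.
* Rows and columns follow the **sorted** label order, so **ham** (index 0) comes before **spam** (index 1).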

**Confusion Matrix Layout (Binary):**

|                      | Predicted: Ham (0) | Predicted: Spam (1) |
| -------------------- | ------------------ | ------------------- |
| **Actual: Ham (0)**  | TN                 | FP                  |
| **Actual: Spam (1)** | FN                 | TP                  |

* **True Negatives (TN)** - Correctly classified as **ham**.
* **True Positives (TP)** - Correctly classified as **spam**.
* **False Positives (FP)** - Incorrectly classified as **spam** (false alarm).
* **False Negatives (FN)** - Incorrectly classified as **ham** (missed spam).

---

#### **4. Calculating Overall Accuracy**

```python
print(accuracy_score(y_test, y_pred) * 100)
```

* **Calculates the overall accuracy** of the model as a **percentage**.

* **Formula:** $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

* This single number summarizes the **overall** classification performance.

* **Multiplying by 100** converts it to a percentage.

---

### **Example Output Interpretation:**

Suppose the confusion matrix is:

```
[[980, 20],
 [ 15, 985]]
```

* **True Negatives (TN)** = 980 (Correctly identified as ham)
* **False Positives (FP)** = 20 (Ham incorrectly identified as spam)
* **False Negatives (FN)** = 15 (Spam incorrectly identified as ham)
* **True Positives (TP)** = 985 (Correctly identified as spam)

This means the model is very accurate: **1,965** of the **2,000** messages (**98.25%**) are classified correctly, with a **low number** of **false positives** and **false negatives**.
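
The same reading can be applied to the two confusion matrices actually printed in this notebook. The sketch below uses a small helper, `summarize` (a name introduced here for illustration), to compare the Naive Bayes and Logistic Regression results side by side.

```python
def summarize(cf):
    """Return accuracy and spam recall for a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cf
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "spam_recall": tp / (tp + fn),
        "missed_spam": fn,
    }

# Confusion matrices printed earlier in this notebook.
naive_bayes = summarize([[737, 1], [18, 94]])
logistic = summarize([[737, 1], [9, 103]])

for name, stats in [("Naive Bayes", naive_bayes), ("Logistic Regression", logistic)]:
    print(name, round(stats["accuracy"] * 100, 2),
          round(stats["spam_recall"], 2), stats["missed_spam"])
```

Both models raise a single false alarm on ham; the difference is in missed spam, where Logistic Regression lets through 9 messages compared with 18 for Naive Bayes.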

---


