
In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB

import nltk
from nltk.corpus import stopwords
import re #regular expressions
from bs4 import BeautifulSoup
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

pd.set_option('display.max_rows',1000)
pd.set_option('display.max_columns',1000)
pd.set_option('display.max_colwidth',150)

In [None]:
# Import the datasets and split them into features and targets

# read dataset
dataset = pd.read_csv('spam_train.csv')

testset = pd.read_csv('spam_test.csv')

# Split features from targets
y_train = dataset.type.values
X_train = dataset.text.values

y_test = testset.type.values
X_test = testset.text.values

testset.tail(20)



Unnamed: 0,type,text
830,ham,Wif my family booking tour package.
831,ham,GRAN ONLYFOUND OUT AFEW DAYS AGO.CUSOON HONI
832,ham,7 wonders in My WORLD 7th You 6th Ur style 5th Ur smile 4th Ur Personality 3rd Ur Nature 2nd Ur SMS and 1st Ur Lovely Friendship... good morning dear
833,spam,FREE for 1st week! No1 Nokia tone 4 ur mobile every week just txt NOKIA to 8077 Get txting and tell ur mates. www.getzed.co.uk POBox 36504 W45WQ 1...
834,ham,Let me know how it changes in the next 6hrs. It can even be appendix but you are out of that age range. However its not impossible. So just chill ...
835,ham,"Sorry, I'll call you later. I am in meeting sir."
836,ham,"Im in inperialmusic listening2the weirdest track ever by”leafcutter john”-sounds like insects being molested&someone plumbing,remixed by evil men ..."
837,ham,Dare i ask... Any luck with sorting out the car?
838,ham,My birthday is on feb # da. .
839,ham,"Thk shld b can... Ya, i wana go 4 lessons... Haha, can go for one whole stretch..."


Reads the training and test datasets (spam_train.csv, spam_test.csv).

Splits the data into features (text) and targets (type) for both training and test sets.

Displays the last 20 rows of the test set for a quick data check.
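As a quick complement to the data check, the sketch below counts how many messages fall in each class. It is a minimal illustration, not the notebook's own code: it uses only the standard library, and `sample_csv` is a hypothetical in-memory stand-in for spam_test.csv built from three rows of the output above (the spam row is abbreviated).

```python
import csv
import io
from collections import Counter

# Hypothetical in-memory stand-in for spam_test.csv (three abbreviated rows).
sample_csv = io.StringIO(
    'type,text\n'
    'ham,Wif my family booking tour package.\n'
    'spam,FREE for 1st week! No1 Nokia tone 4 ur mobile every week\n'
    'ham,"Sorry, I\'ll call you later. I am in meeting sir."\n'
)

rows = list(csv.DictReader(sample_csv))
texts = [row["text"] for row in rows]    # features
labels = [row["type"] for row in rows]   # targets

# Count messages per class to spot class imbalance early.
class_counts = Counter(labels)
print(class_counts)
```

On the real test set the same count gives 738 ham and 112 spam messages, as the support column of the classification reports further down shows, so the classes are clearly imbalanced.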

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Downloads the NLTK stopwords corpus, which contains the English stopword list used in text preprocessing.

Stopwords are common words (e.g., "the", "and", "is") that are often removed to reduce noise in text classification.



In [None]:
# Text preprocessing

def text_preprocessing(text, language, minWordSize):

    # remove html
    text_no_html = BeautifulSoup(str(text),"html.parser" ).get_text()

    # remove non-letters
    text_alpha_chars = re.sub("[^a-zA-Z']", " ", str(text_no_html))

    # convert to lower-case
    text_lower = text_alpha_chars.lower()

    # remove stop words
    stops = set(stopwords.words(language))
    text_no_stop_words = ' '

    for w in text_lower.split():
        if w not in stops:
            text_no_stop_words = text_no_stop_words + w + ' '

    # do stemming
    text_stemmer = ' '
    stemmer = SnowballStemmer(language)
    for w in text_no_stop_words.split():
        text_stemmer = text_stemmer + stemmer.stem(w) + ' '

    # remove short words
    text_no_short_words = ' '
    for w in text_stemmer.split():
        if len(w) >= minWordSize:
            text_no_short_words = text_no_short_words + w + ' '


    return text_no_short_words


This function cleans and processes raw text:

Remove HTML Tags:

Extracts plain text from HTML using BeautifulSoup.

Remove Non-Letter Characters:

Replaces every character other than ASCII letters and apostrophes (digits, punctuation, symbols) with a space.

Convert to Lowercase:

Converts all characters to lowercase for consistency.

Remove Stopwords:

Removes common words like "the", "and", "is" using NLTK stopwords.

Apply Stemming:

Reduces words to their root form (e.g., "running" to "run") using SnowballStemmer.

Remove Short Words:

Filters out stemmed words shorter than minWordSize characters.
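The steps above can be tried without NLTK or BeautifulSoup. The sketch below is a simplified stand-in rather than the notebook's function: it skips HTML stripping and stemming, and `DEMO_STOPS` and `clean_text_demo` are hypothetical names with a tiny hand-picked stop-word set instead of NLTK's full English list.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook uses NLTK's full English list.
DEMO_STOPS = {"the", "is", "a", "to", "and", "for", "you"}

def clean_text_demo(text, stops=DEMO_STOPS, min_word_size=2):
    """Apply non-letter removal, lower-casing, stop-word and short-word filtering."""
    # Replace everything except ASCII letters and apostrophes with a space.
    letters_only = re.sub(r"[^a-zA-Z']", " ", text)
    words = letters_only.lower().split()
    # Drop stop words and words shorter than min_word_size.
    kept = [w for w in words if w not in stops and len(w) >= min_word_size]
    return " ".join(kept)

cleaned = clean_text_demo("FREE for 1st week! Txt NOKIA to 8077 and tell U R mates.")
print(cleaned)  # free st week txt nokia tell mates
```

Note how "1st" loses its digit and survives as "st", and how the single-letter words "U" and "R" are dropped by the length filter.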

In [None]:
# Preprocess every training and test message in place
language = 'english'
minWordLength = 2

for i in range(X_train.size):
    X_train[i] = text_preprocessing(X_train[i], language, minWordLength)


for i in range(X_test.size):
    X_test[i] = text_preprocessing(X_test[i], language, minWordLength)



Loops through each training and test text sample, applying the text_preprocessing function.

The next cell prints one preprocessed training message to verify the transformation.

In [None]:
print(X_train[4707])

 sms ac jsco energi high may know channel day ur leadership skill strong psychic repli an question end repli end jsco 


In [None]:
# Make sparse features vectors
# Bag of words

count_vect = CountVectorizer()
# Learn the vocabulary on the training texts only, then reuse it for the test texts
X_train_bag_of_words = count_vect.fit_transform(X_train)
X_test_bag_of_words = count_vect.transform(X_test)

print(X_train_bag_of_words)
#print(X_test_bag_of_words)

# Fit the TF-IDF weights on the training counts only
tf_transformer = TfidfTransformer(use_idf=True).fit(X_train_bag_of_words)
X_train_tf = tf_transformer.transform(X_train_bag_of_words)
X_test_tf = tf_transformer.transform(X_test_bag_of_words)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 37656 stored elements and shape (4709, 5738)>
  Coords	Values
  (0, 379)	1
  (0, 2647)	1
  (0, 3460)	1
  (0, 3708)	1
  (1, 903)	1
  (1, 2084)	1
  (2, 516)	1
  (2, 618)	1
  (2, 624)	1
  (2, 1019)	1
  (2, 1157)	1
  (2, 1766)	1
  (2, 1836)	1
  (2, 2000)	1
  (2, 3302)	1
  (2, 4763)	1
  (2, 5173)	1
  (2, 5393)	1
  (2, 5543)	1
  (3, 282)	1
  (3, 709)	1
  (3, 805)	1
  (3, 1115)	1
  (3, 3106)	1
  (3, 3380)	1
  :	:
  (4706, 3846)	1
  (4706, 4754)	1
  (4706, 5639)	1
  (4707, 25)	1
  (4707, 183)	1
  (4707, 801)	1
  (4707, 1157)	1
  (4707, 1498)	2
  (4707, 1503)	1
  (4707, 2194)	1
  (4707, 2558)	2
  (4707, 2647)	1
  (4707, 2717)	1
  (4707, 2982)	1
  (4707, 3891)	1
  (4707, 3932)	1
  (4707, 4067)	2
  (4707, 4444)	1
  (4707, 4493)	1
  (4707, 4716)	1
  (4707, 5266)	1
  (4708, 709)	1
  (4708, 1166)	1
  (4708, 1781)	1
  (4708, 4339)	1


Initializes a CountVectorizer, which:

Converts the text into a sparse matrix of word counts.

Fits the vectorizer to the training data and transforms both the training and test data into feature matrices.

Prints the resulting sparse matrix.



Initializes a TF-IDF transformer, which:

Converts word counts to TF-IDF scores.

Transforms both training and test data to TF-IDF.

The next cell prints the shape of the training bag-of-words matrix; the TF-IDF matrix has the same shape (4709 messages by 5738 vocabulary terms).
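To make the two transformations concrete, the sketch below rebuilds them by hand on three toy messages. It is an illustration under stated assumptions, not scikit-learn's implementation: it uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which is how the scikit-learn documentation describes the default `TfidfTransformer` settings, and the `docs` list is invented for the example.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration).
docs = ["free nokia tone", "call later", "free call free"]

# Vocabulary: unique tokens, sorted so that column indices are deterministic.
vocab = sorted({w for d in docs for w in d.split()})

# Bag of words: one row per document, one column per vocabulary term.
counts = [[Counter(d.split())[t] for t in vocab] for d in docs]

# Document frequency and smoothed inverse document frequency per term.
n_docs = len(docs)
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# TF-IDF: weight the counts by IDF, then scale each row to unit L2 norm.
tfidf = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

print(vocab)
print(counts)
```

Terms that occur in fewer documents ("later") receive a larger IDF weight than terms that occur in more ("free"), which is the effect the TF-IDF step adds on top of raw counts.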

In [None]:
print(X_train_bag_of_words.shape)

(4709, 5738)


In [None]:
# Naive bayes

NBclassifier = MultinomialNB(alpha=1)

NBclassifier.fit(X_train_tf, y_train)

y_pred = NBclassifier.predict(X_test_tf)
print(classification_report(y_test, y_pred))

cf = confusion_matrix(y_test, y_pred)
print(cf)
print(accuracy_score(y_test, y_pred) * 100)


              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       738
        spam       0.99      0.84      0.91       112

    accuracy                           0.98       850
   macro avg       0.98      0.92      0.95       850
weighted avg       0.98      0.98      0.98       850

[[737   1]
 [ 18  94]]
97.76470588235294


Trains a Multinomial Naive Bayes model with Laplace smoothing (alpha=1).

Evaluates the model on the test set and prints:

Classification report

Confusion matrix

Overall accuracy
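The role of Laplace smoothing is easier to see in a hand-rolled version. The sketch below is a minimal multinomial Naive Bayes on raw word counts, assuming a four-message toy training set invented for the example; the notebook itself feeds TF-IDF values to `MultinomialNB`, so this shows the idea rather than reproducing its numbers. All names (`train`, `log_posterior`, `predict`) are hypothetical.

```python
import math
from collections import Counter

# Toy training set (invented for illustration).
train = [("free prize call now", "spam"),
         ("call me later", "ham"),
         ("free free prize", "spam"),
         ("see you later", "ham")]
alpha = 1.0  # Laplace smoothing, as in MultinomialNB(alpha=1)

vocab = sorted({w for text, _ in train for w in text.split()})
word_counts = {"ham": Counter(), "spam": Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

def log_posterior(text, label):
    """Unnormalized log P(label | text) under the multinomial model."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))  # class prior
    for w in text.split():
        if w in vocab:  # words never seen in training are ignored
            # Smoothed likelihood: alpha keeps unseen (word, class) pairs above zero.
            score += math.log((word_counts[label][w] + alpha)
                              / (total + alpha * len(vocab)))
    return score

def predict(text):
    return max(("ham", "spam"), key=lambda c: log_posterior(text, c))

print(predict("free prize"))     # spam
print(predict("see you later"))  # ham
```

Without the `alpha` term, "free" would have probability zero under the ham class and its logarithm would be undefined; smoothing assigns it a small positive probability instead.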



In [None]:
# train a logistic regression classifier
lregclassifier = LogisticRegression(C=10)

lregclassifier.fit(X_train_tf, y_train)



LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

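Trains a Logistic Regression model with C=10. In scikit-learn, C is the inverse of the regularization strength, so C=10 applies weaker regularization than the default C=1.0.

The next cell predicts on the test set and prints: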

Classification report

Confusion matrix

Overall accuracy
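For reference on what the C parameter controls, the scikit-learn documentation gives the L2-penalized logistic regression objective in roughly the following form. The encoding $y_i \in \{-1, +1\}$ belongs to this formula and is an assumption of the sketch, not the notebook's string labels:

$$\min_{w,\,b}\; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^{\top} w + b\right)\right)\right)$$

Because C multiplies the data-fit term, a larger C makes the penalty $\frac{1}{2} w^{\top} w$ relatively less important, which is why raising C loosens the regularization.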

In [None]:
# test logistic classifier

y_pred = lregclassifier.predict(X_test_tf)
print(classification_report(y_test, y_pred))

cf = confusion_matrix(y_test, y_pred)
print(cf)
print(accuracy_score(y_test, y_pred) * 100)

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99       738
        spam       0.99      0.92      0.95       112

   micro avg       0.99      0.99      0.99       850
   macro avg       0.99      0.96      0.97       850
weighted avg       0.99      0.99      0.99       850

[[737   1]
 [  9 103]]
98.82352941176471


### **Code Breakdown: Logistic Regression Model Testing**

```python
# test logistic classifier

y_pred = lregclassifier.predict(X_test_tf)
print(classification_report(y_test, y_pred))

cf = confusion_matrix(y_test, y_pred)
print(cf)
print(accuracy_score(y_test, y_pred) * 100)
```

This code block tests the performance of a **Logistic Regression** model (`lregclassifier`) that has already been **trained** on **TF-IDF features** from the training data.

---

#### **1. Making Predictions**

```python
y_pred = lregclassifier.predict(X_test_tf)
```

* **`predict()`** is used to **generate class predictions** on the **test set**.
* **`X_test_tf`** is the **TF-IDF transformed** test data.
* **Output** is a **1D array** (`y_pred`) of predicted labels (**spam** or **ham**).

---

#### **2. Generating a Classification Report**

```python
print(classification_report(y_test, y_pred))
```

* **Generates a detailed performance report** including:

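  * **Precision:** Of the messages predicted as a class, the fraction that truly belong to it, TP / (TP + FP).
  * **Recall (Sensitivity):** Of the messages that truly belong to a class, the fraction the model finds, TP / (TP + FN).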
  * **F1-score:** Harmonic mean of precision and recall.
  * **Support:** The number of true instances for each label.

For a binary classification task like spam detection, this report is useful because it breaks performance down per class.
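The report's spam row can be reproduced by hand from three counts. The sketch below is a minimal illustration with a hypothetical helper `precision_recall_f1`; the counts are read from the logistic regression confusion matrix printed in this notebook (103 spam messages caught, 1 ham message flagged as spam, 9 spam messages missed).

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from its counts."""
    precision = tp / (tp + fp)   # how many flagged messages were really spam
    recall = tp / (tp + fn)      # how many spam messages were flagged
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Spam-class counts from the logistic regression output in this notebook.
precision, recall, f1 = precision_recall_f1(tp=103, fp=1, fn=9)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.99 recall=0.92 f1=0.95
```

These values match the spam row of the logistic regression report (0.99, 0.92, 0.95).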

---

#### **3. Creating a Confusion Matrix**

```python
cf = confusion_matrix(y_test, y_pred)
print(cf)
```

* **`confusion_matrix()`** creates a matrix that **summarizes the performance** of the classifier.
* It shows the **True Positives (TP)**, **True Negatives (TN)**, **False Positives (FP)**, and **False Negatives (FN)**.

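**Confusion Matrix Layout (Binary):**

scikit-learn orders the classes by sorted label, so `ham` is row and column 0 and `spam` is row and column 1. Spam is treated as the positive class here.

|                           | Predicted: ham (column 0) | Predicted: spam (column 1) |
| ------------------------- | ------------------------- | -------------------------- |
| **Actual: ham (row 0)**   | TN                        | FP                         |
| **Actual: spam (row 1)**  | FN                        | TP                         |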

* **True Negatives (TN)** - Correctly classified as **ham**.
* **True Positives (TP)** - Correctly classified as **spam**.
* **False Positives (FP)** - Incorrectly classified as **spam** (false alarm).
* **False Negatives (FN)** - Incorrectly classified as **ham** (missed spam).
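The layout can be checked by building the matrix by hand. The sketch below is a plain-Python illustration, assuming the same row and column convention as above (rows are actual labels, columns are predicted labels, labels in sorted order); `confusion_counts` and the two demo label lists are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) pairs; rows are actual, columns predicted."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return labels, matrix

# Invented labels for illustration.
y_true_demo = ["ham", "ham", "ham", "spam", "spam", "spam"]
y_pred_demo = ["ham", "ham", "spam", "spam", "spam", "ham"]

labels, matrix = confusion_counts(y_true_demo, y_pred_demo)
(tn, fp), (fn, tp) = matrix  # unpack in the order of the table above
print(labels)  # ['ham', 'spam']
print(matrix)  # [[2, 1], [1, 2]]
```

Unpacking the rows as `(tn, fp), (fn, tp)` only works for the binary case with spam as the second label.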

---

#### **4. Calculating Overall Accuracy**

```python
print(accuracy_score(y_test, y_pred) * 100)
```

* **Calculates the overall accuracy** of the model as a **percentage**.

* **Formula:** $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

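* This single number summarizes the **overall** classification performance, but it can flatter a model on imbalanced data: with 738 ham and 112 spam test messages, predicting ham for every message would already score about 86.8%.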

* **Multiplying by 100** converts it to a percentage.
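The accuracy formula can be applied directly to the two confusion matrices printed in this notebook. The sketch below uses a hypothetical helper `accuracy_percent`; the matrices are copied from the Naive Bayes and logistic regression outputs.

```python
def accuracy_percent(matrix):
    """Accuracy in percent: correct predictions (the diagonal) over all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100 * correct / total

nb_matrix = [[737, 1], [18, 94]]   # Naive Bayes output from this notebook
lr_matrix = [[737, 1], [9, 103]]   # logistic regression output from this notebook

nb_accuracy = accuracy_percent(nb_matrix)
lr_accuracy = accuracy_percent(lr_matrix)
print(round(nb_accuracy, 2), round(lr_accuracy, 2))  # 97.76 98.82
```

Both values agree with the `accuracy_score` results shown in the cell outputs; the gap comes entirely from the spam row, where logistic regression misses 9 spam messages instead of 18.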

---

### **Example Output Interpretation:**

Suppose the confusion matrix is:

```
[[980, 20],
 [ 15, 985]]
```

* **True Negatives (TN)** = 980 (Correctly identified as ham)
* **False Positives (FP)** = 20 (Ham incorrectly identified as spam)
* **False Negatives (FN)** = 15 (Spam incorrectly identified as ham)
* **True Positives (TP)** = 985 (Correctly identified as spam)

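With these counts, accuracy is (980 + 985) / 2000 = 98.25%, and both error types are rare: 20 of 1000 ham messages are flagged as spam (**false positives**) and 15 of 1000 spam messages are missed (**false negatives**).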

---


