# Detecting Spam Email

## Group Members and Roles

1. Zafir Jamal - Project Coordinator
2. Joseph Maswi - Programmer
3. Anthony Kwasi - Researcher
4. Kevin Mungai - Program Leader
5. Alex Samia - Program Coordinator

## Project Overview

We seek to develop an AI program that detects spam email and notifies the user. The program helps users determine whether a message is safe and protects their devices from viruses and unwanted messages.

## Purpose of the project

Anti-spam software works to identify potentially harmful email and prevent it from reaching users' inboxes. Spam is an uninvited and undesired message; it frequently advertises a product, which may be legitimate (though still unwanted) or malicious. Anti-spam protocols define what constitutes spam.

## Goals for the project 🥅

Spam detection aims to create effective and efficient methods for automatically identifying spam and its sources, so that legitimate users are not affected by it.
A further goal is to find emails that may contain malware.

## Proposed key feature 🔑

Classify emails as spam (1) or not spam, also called ham (0), using machine learning techniques.
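
One way to express that 0/1 encoding in code is sketched below. The function name `encode_labels` and the sample labels are illustrative assumptions; the notebook itself trains directly on the string labels `ham` and `spam`, which scikit-learn classifiers accept as they are.

```python
# Numeric encoding described above: ham -> 0, spam -> 1.
LABEL_TO_INT = {"ham": 0, "spam": 1}


def encode_labels(labels):
    """Map 'ham'/'spam' strings to 0/1 (hypothetical helper for this sketch)."""
    return [LABEL_TO_INT[label] for label in labels]


encoded = encode_labels(["ham", "spam", "ham"])
print(encoded)  # [0, 1, 0]
```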

## Inspiration 💡

Inspired by [this LogRocket blog post](https://blog.logrocket.com/email-spam-detector-python-machine-learning/).

The sample data is uploaded from a local computer to the **Google Colab** environment.
The data comes from [spam.csv](https://raw.githubusercontent.com/SmallLion/Python-Projects/main/Spam-detection/spam.csv).


# Declaring our imports.

In [71]:
import nltk
from nltk.corpus import stopwords
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm
from pathlib import Path
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


# Loading our data set

In [39]:
# Read the CSV file
# (this path assumes the notebook runs locally).

df = pd.read_csv("spam.csv")

# Print the first 5 rows of data
df.head(5)


|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|----|----|------------|------------|------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


In [40]:
# Print the shape of the data set (number of rows and columns)
df.shape


(5572, 5)

In [41]:
# Get the columns names
df.columns


Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

# Cleaning our data set

In [42]:
# Check for duplicates and remove them
df.drop_duplicates(inplace=True)

# show new shape
df.shape


(5169, 5)

In [43]:
# Show the number of missing values (NaN) in each column
df.isnull().sum()


v1               0
v2               0
Unnamed: 2    5126
Unnamed: 3    5159
Unnamed: 4    5164
dtype: int64
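
The counts above show that the three `Unnamed` columns are missing in at least 5,126 of the 5,169 rows, so they carry almost no information. A minimal sketch of dropping them follows. The two-row `sample` frame and the new column names `label` and `text` are assumptions made for illustration; the later cells keep using `df["v1"]` and `df["v2"]`, so the rename is applied to a separate copy here.

```python
import pandas as pd

nan = float("nan")

# Tiny stand-in with the same column layout as spam.csv.
sample = pd.DataFrame(
    {
        "v1": ["ham", "spam"],
        "v2": ["Ok lar... Joking wif u oni...", "Free entry in 2 a wkly comp"],
        "Unnamed: 2": [nan, nan],
        "Unnamed: 3": [nan, nan],
        "Unnamed: 4": [nan, nan],
    }
)

# Keep only the label and message columns, then give them clearer names.
cleaned = sample[["v1", "v2"]].rename(columns={"v1": "label", "v2": "text"})

print(cleaned.columns.tolist())  # ['label', 'text']
```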

## Creating some utilities for cleaning our data set.

In [72]:
# Download the stopwords package to the parent directory.
nltk.download("stopwords", download_dir=Path.cwd().parent)


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/kevinmungai/usiu/spam_email_detector...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
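def process_text(text: str) -> list[str]:
    """
    Process the text by:
    1. Removing punctuation
    2. Removing stopwords
    3. Returning the list of clean words
    """

    # 1. Drop every punctuation character and rejoin into one string.
    nopunc = "".join(char for char in text if char not in string.punctuation)

    # 2. Build the stopword set once per call instead of reloading
    #    the list for every word.
    stop_words = set(stopwords.words("english"))
    clean_words = [
        word for word in nopunc.split() if word.lower() not in stop_words
    ]

    # 3. Return the remaining words in their original case.
    return clean_words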


In [10]:
df.head()


|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|----|----|------------|------------|------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


In [11]:
# Show the tokenization (a list of tokens) for the first five messages
df["v2"].head().apply(process_text)


0    [Go, jurong, point, crazy, Available, bugis, n...
1                       [Ok, lar, Joking, wif, u, oni]
2    [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3        [U, dun, say, early, hor, U, c, already, say]
4    [Nah, dont, think, goes, usf, lives, around, t...
Name: v2, dtype: object
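
The tokens above keep their original capitalization, so `Free` and `free` would become separate features in the count matrix. Whether folding case helps is an empirical question for this data set. A minimal sketch of a case-folding variant follows; the name `process_text_lower` and the tiny `DEMO_STOP_WORDS` set are assumptions made so the example runs without NLTK data, and the notebook itself would keep using `stopwords.words("english")`.

```python
import string

# Hypothetical, deliberately tiny stop word set so the sketch is self-contained.
DEMO_STOP_WORDS = frozenset({"a", "in", "to", "the", "so", "then"})

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def process_text_lower(text: str) -> list[str]:
    """Strip punctuation, lowercase the text, and drop stop words."""
    words = text.translate(_PUNCT_TABLE).lower().split()
    return [word for word in words if word not in DEMO_STOP_WORDS]


tokens = process_text_lower("Free entry in 2 a wkly comp to win FA Cup")
print(tokens)
# ['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup']
```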

### Example of converting text into a matrix of token counts


In [12]:
# Example
message5 = "Hello angelique angelique angelique"
message6 = "test one two two three test test test"

# Convert the text to a matrix of token counts
# bow => bag of words.
# Each message is passed as a plain string, one document per list item.
bow5 = CountVectorizer(analyzer=process_text).fit_transform(
    [message5, message6]
)
print(message5)
print(message6)
print(bow5)

print()

print(bow5.shape)


Hello angelique angelique angelique
test one two two three test test test
  (0, 0)	1
  (0, 1)	3
  (1, 3)	4
  (1, 2)	1
  (1, 5)	2
  (1, 4)	1

(2, 6)


# Splitting our data set into 80% for training and 20% for testing.

In [48]:
# Split the selected data into 80% training and 20% testing

label = df["v1"]  # label (spam or ham).
email = df["v2"]  # email text.

email_train, email_test, label_train, label_test = train_test_split(
    email, label, test_size=0.2
)
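
As written, the split is drawn at random on every run, so the scores reported later will vary slightly between runs. Spam is also the minority class, so a purely random split can shift the ham/spam ratio between the two parts. A hedged sketch of a reproducible, stratified split follows. The stand-in data and the seed value `42` are assumptions; `random_state` and `stratify` are standard `train_test_split` parameters. Using them in the notebook would change the exact numbers shown in the outputs.

```python
from sklearn.model_selection import train_test_split

# Illustrative stand-in data: 40 ham and 10 spam messages.
emails = [f"message {i}" for i in range(50)]
labels = ["ham"] * 40 + ["spam"] * 10

email_train, email_test, label_train, label_test = train_test_split(
    emails,
    labels,
    test_size=0.2,
    random_state=42,  # arbitrary seed, chosen only for repeatability
    stratify=labels,  # keep the ham/spam ratio equal in both parts
)

print(len(label_train), len(label_test))
print(label_test.count("spam"), label_train.count("spam"))
```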


# Training our model.

## Using CountVectorizer to transform email text to a matrix of token counts.

In [60]:
# Convert the training emails to a matrix of token counts
# using a count vectorizer.
cv = CountVectorizer(analyzer=process_text)
email_features = cv.fit_transform(email_train)


## Using Support Vector Machine model for training.

In [61]:
# We'll be using a Support Vector Machine to classify our emails as ham or spam.
# The data used to train the model comes from the 80% split of email and label.
# The email text first has to be converted to a matrix of token counts
# so that the support vector machine can use it.
svm_model = svm.SVC()
svm_model = svm_model.fit(email_features, label_train)
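
The vectorizer has to be fitted on the training emails only and then reused unchanged on any new text. One way to make that hard to get wrong is to bundle both steps in a scikit-learn pipeline. The sketch below is an alternative arrangement, not the notebook's own code: the toy messages are invented, and `CountVectorizer()` uses its default word analyzer rather than `process_text` so that the example runs without NLTK data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented toy messages, only to show the calls.
train_texts = [
    "free prize call now",
    "win cash free entry",
    "are you coming home tonight",
    "see you at lunch tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]

# The pipeline fits the vectorizer and the classifier together.
spam_pipeline = make_pipeline(CountVectorizer(), SVC())
spam_pipeline.fit(train_texts, train_labels)

# New text is vectorized with the training vocabulary automatically.
predictions = spam_pipeline.predict(["free cash prize"])
print(predictions)
```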


# Testing our model.

## Using CountVectorizer to transform test email to a matrix of token counts.

In [62]:
# Transform email_test to a matrix of token counts using the fitted vectorizer
email_feature_test = cv.transform(email_test)


## Scoring our model by applying it to our test data.

In [63]:
# Testing the accuracy of our model with the 20% test data
# (email_test and label_test).
svm_model.score(email_feature_test, label_test)


0.9690522243713733
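
An accuracy of about 0.969 is easier to judge next to a trivial baseline. The classification report for this run lists 907 ham and 127 spam messages in the test set, so a model that always answers "ham" would already be right most of the time. The arithmetic can be checked with a few lines; the variable names are illustrative.

```python
# Test-set class counts, taken from the classification report for this run.
ham_support = 907
spam_support = 127
total = ham_support + spam_support

# Accuracy of a baseline that always predicts "ham".
baseline_accuracy = ham_support / total
print(round(baseline_accuracy, 3))  # 0.877

# Number of test messages the SVM classified correctly and incorrectly.
svm_accuracy = 0.9690522243713733
correct = round(svm_accuracy * total)
print(correct, total - correct)  # 1002 32
```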

## Generating a classification report.

In [69]:
# Generating the Classification Report
feature_prediction = svm_model.predict(email_feature_test)
print(
    classification_report(
        label_test,
        feature_prediction,
    )
)


              precision    recall  f1-score   support

         ham       0.97      1.00      0.98       907
        spam       0.99      0.76      0.86       127

    accuracy                           0.97      1034
   macro avg       0.98      0.88      0.92      1034
weighted avg       0.97      0.97      0.97      1034
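
The report shows that the model rarely mislabels ham as spam (spam precision 0.99) but misses roughly a quarter of the spam (spam recall 0.76). The F1 score is the harmonic mean of precision and recall, and the macro average is the plain mean over the two classes. The short check below recomputes those figures from the rounded values in the table; the helper name `f1_from` is an assumption, and small differences from scikit-learn's unrounded values are possible.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall values copied from the report.
spam_f1 = f1_from(0.99, 0.76)
ham_f1 = f1_from(0.97, 1.00)

# Macro average: unweighted mean over the two classes.
macro_recall = (1.00 + 0.76) / 2

print(round(spam_f1, 2), round(ham_f1, 2), round(macro_recall, 2))
# 0.86 0.98 0.88
```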

