# Detecting Spam Email

## Group Members and Roles

1. Zafir Jamal – Project Coordinator
2. Joseph Maswi – Programmer
3. Anthony Kwasi – Researcher
4. Kevin Mungai – Program Leader
5. Alex Samia – Program Coordinator

## Project Overview

We aim to develop an AI program that detects spam email and notifies the user. The program will help users determine whether a message is safe and will protect their devices from viruses and unwanted messages.

## Purpose of the project

Anti-spam software identifies potentially harmful email and prevents it from reaching users' inboxes. Spam is an unsolicited and unwanted message; it frequently advertises a product, which may be legitimate (though still unwanted) or malicious. Anti-spam protocols define what constitutes spam.

## Goals for the project 🥅

1. Create effective and efficient methods for automatically identifying spam and its sources, so that legitimate users are not affected.
2. Find emails that may contain malware.

## Proposed key feature 🔑

Classify emails as spam (1) or not spam, also called ham (0), using machine learning techniques.
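As a minimal sketch of the 0/1 encoding described above, the text labels can be mapped to integers with pandas. The toy `labels` series is made up for illustration, and the sketch assumes the labels arrive as the strings `ham` and `spam`.

```python
import pandas as pd

# Toy labels for illustration only; the real labels come from the dataset.
labels = pd.Series(["ham", "spam", "ham", "spam"])

# Map the text labels onto the 0/1 encoding: ham -> 0, spam -> 1.
encoded = labels.map({"ham": 0, "spam": 1})
print(encoded.tolist())  # [0, 1, 0, 1]
```

Most scikit-learn classifiers also accept string labels directly, so this step is optional and mainly useful when a numeric target is needed.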

## Inspiration 💡

Inspired by [this LogRocket tutorial on building an email spam detector with Python and machine learning](https://blog.logrocket.com/email-spam-detector-python-machine-learning/).

The sample data is uploaded from the local computer to the **Google Colab** environment.
The data is obtained from [spam.csv](https://raw.githubusercontent.com/SmallLion/Python-Projects/main/Spam-detection/spam.csv).


In [36]:
#Import libraries
import numpy as pynum
import pandas as panda
import nltk
from nltk.corpus import stopwords
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm


In [46]:
# Run in Google Colab
# Load the data
# from google.colab import files
# uploaded = files.upload()


In [2]:
#Read the CSV file
import pandas as panda
df=panda.read_csv('spam.csv')

#Print the first 5 rows of data
df.head(5)

|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|----|----|------------|------------|------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


In [7]:
#Print the shape of the DataFrame (number of rows and columns)
df.shape

(5572, 5)

In [8]:
#Get the column names
df.columns

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

In [10]:
#Check for duplicates and remove them
df.drop_duplicates(inplace = True)

#show new shape
df.shape

(5169, 5)

In [24]:
#Show the number of missing values (NaN) in each column
df.isnull().sum()


v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64
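The counts above suggest that the three `Unnamed` columns are almost entirely empty, so one reasonable cleanup is to keep only `v1` and `v2`. The sketch below is not part of the original notebook: the toy frame and the new column names `label` and `text` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout above; the values are made up.
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["Ok lar", "Free entry", "See you soon"],
    "Unnamed: 2": [np.nan, np.nan, "extra"],
    "Unnamed: 3": [np.nan, np.nan, np.nan],
})

# Keep only the label and text columns and give them descriptive names
# ("label" and "text" are names chosen here, not taken from the notebook).
clean = toy[["v1", "v2"]].rename(columns={"v1": "label", "v2": "text"})

print(clean.columns.tolist())       # remaining column names
print(clean.isnull().sum().sum())   # total number of missing values left
```

Selecting the two useful columns explicitly avoids depending on how many stray `Unnamed` columns the CSV parser produces.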

In [3]:
#Download the stopwords package
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/kevinmungai/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [19]:
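# Process some text
def process_text(text: str) -> list[str]:
  """
  Process the text by:
  1. Removing punctuation
  2. Removing stopwords
  3. Returning a list of the remaining words
  """

  # 1. Drop every punctuation character and rejoin the remaining characters
  nopunc = ''.join(char for char in text if char not in string.punctuation)

  # 2. Load the stopword list once per call instead of once per word,
  #    then filter case-insensitively
  stop_words = set(stopwords.words('english'))
  clean_words = [word for word in nopunc.split() if word.lower() not in stop_words]

  # 3. Return the cleaned words
  return clean_words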



In [24]:
df.head()

|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|----|----|------------|------------|------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


In [20]:
#Show the tokenization (a list of tokens) for the first five messages
df['v2'].head().apply(process_text)

0    [Go, jurong, point, crazy, Available, bugis, n...
1                       [Ok, lar, Joking, wif, u, oni]
2    [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3        [U, dun, say, early, hor, U, c, already, say]
4    [Nah, dont, think, goes, usf, lives, around, t...
Name: v2, dtype: object
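To see the two cleaning steps on a single message without downloading any NLTK data, here is a small self-contained sketch. The `STOP_WORDS` set and the `clean_tokens` helper are hypothetical stand-ins invented for this illustration; the notebook itself relies on NLTK's full English stopword list.

```python
import string

# Hypothetical, tiny stopword set used only for this illustration;
# the notebook uses NLTK's full English list instead.
STOP_WORDS = {"i", "a", "the", "to", "is", "in", "he", "so", "then", "until", "only"}

def clean_tokens(text: str) -> list[str]:
    """Strip punctuation, then drop stopwords (case-insensitive)."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return [w for w in no_punct.split() if w.lower() not in STOP_WORDS]

print(clean_tokens("Free entry in 2 a wkly comp to win FA Cup!"))
# ['Free', 'entry', '2', 'wkly', 'comp', 'win', 'FA', 'Cup']
```

Note that the original casing is preserved, so `Free` and `free` would count as different tokens unless the text is lowercased first.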

👓 Example of how to convert text into a matrix of token counts.

In [28]:
#Example
message5 = 'Hello angelique angelique angelique'
message6 = 'test one two two three test test test'

#Convert the text to a matrix of token counts (pass a flat list of strings, one per message)
bow5 = CountVectorizer(analyzer=process_text).fit_transform([message5, message6])
print(message5)
print(message6)
print(bow5)

print()

print(bow5.shape)

Hello angelique angelique angelique
test one two two three test test test
  (0, 0)	1
  (0, 1)	3
  (1, 3)	4
  (1, 2)	1
  (1, 5)	2
  (1, 4)	1

(2, 6)
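The sparse printout above lists `(row, column) count` triples, which can be hard to read. As a rough sketch, the same two messages can be shown as a dense array next to the vocabulary. Here a plain whitespace split stands in for `process_text` (an assumption made to keep the example self-contained, so no stopwords are removed).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Hello angelique angelique angelique",
    "test one two two three test test test",
]

# str.split is a stand-in analyzer for this sketch (no stopword removal).
vectorizer = CountVectorizer(analyzer=str.split)
bow = vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column index; order tokens by column.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
# ['Hello', 'angelique', 'one', 'test', 'three', 'two']
print(bow.toarray())
# [[1 3 0 0 0 0]
#  [0 0 1 4 1 2]]
```

Each row is one message and each column one token, so the entry `4` in the second row records that `test` appears four times in the second message.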


In [38]:
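#Split the data into 80% training and 20% testing
from sklearn.model_selection import train_test_split
x = df["v2"]  # message text (input)
y = df["v1"]  # ham/spam label (target)
z_train, z_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [41]:
cv = CountVectorizer(analyzer=process_text)

In [42]:
#Convert the training messages to a matrix of token counts
features = cv.fit_transform(z_train)


In [43]:
#Train a support vector classifier on the token counts
model = svm.SVC()
model.fit(features, y_train)

In [44]:
#Vectorize the test messages with the vocabulary learned from the training set
feature_test = cv.transform(z_test)

In [45]:
#Mean accuracy on the held-out test messages
model.score(feature_test, y_test)

0.011659192825112108

The score of about 0.012 shown above was recorded with the label column (v1) passed as the input and the message text (v2) as the target, so it does not reflect spam-detection accuracy. The split above uses v2 as the input and v1 as the target.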
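A single accuracy number does not show which kind of mistake a classifier makes, and it can look deceptively good when ham messages greatly outnumber spam. A confusion matrix is one common way to break the result down. The sketch below uses made-up labels purely for illustration; in the notebook, `y_true` would correspond to `y_test` and `y_pred` to `model.predict(feature_test)`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only.
y_true = ["ham", "ham", "spam", "spam", "ham", "spam"]
y_pred = ["ham", "spam", "spam", "spam", "ham", "ham"]

# Rows are the true classes, columns the predicted classes,
# both in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
print(cm)
# [[2 1]
#  [1 2]]

acc = accuracy_score(y_true, y_pred)
print(round(acc, 3))  # 0.667
```

In this toy case one ham message is flagged as spam (top right) and one spam message slips through as ham (bottom left), two errors that have very different costs for a user.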