# Spam Mail Detection with Machine Learning

###### Classify email as spam or ham
###### Dataset: https://www.kaggle.com/datasets/venky73/spam-mails-dataset

In [3]:
# pip3 install pandas numpy nltk scikit-learn
import string 

import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords 
from nltk.stem.porter import PorterStemmer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [4]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\646ca\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [5]:
df = pd.read_csv('spam_ham_dataset.csv')

In [6]:
df # we only need the text and label_num columns

,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0
...,...,...,...,...
5166,1518,ham,Subject: put the 10 on the ft\r\nthe transport...,0
5167,404,ham,Subject: 3 / 4 / 2000 and following noms\r\nhp...,0
5168,2933,ham,Subject: calpine daily gas nomination\r\n>\r\n...,0
5169,1409,ham,Subject: industrial worksheets for august 2000...,0


In [8]:
df['text'] = df['text'].apply(lambda x: x.replace('\r\n', ' '))
df
df.text.iloc[1] # .iloc ("integer location") selects rows by integer position; this shows the second email

'Subject: hpl nom for january 9 , 2001 ( see attached file : hplnol 09 . xls ) - hplnol 09 . xls'

In [9]:
df.info() # check missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5171 entries, 0 to 5170
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  5171 non-null   int64 
 1   label       5171 non-null   object
 2   text        5171 non-null   object
 3   label_num   5171 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 161.7+ KB


In [16]:
stemmer = PorterStemmer() # reduce words to their root form (e.g., "running" to "run").
corpus = [] # an empty list to store the processed text data

# Creates a set of common English stopwords (e.g., "the", "and", "is") that will be removed from the text.
stopwords_set = set(stopwords.words('english'))

for i in range(len(df)):
    text = df['text'].iloc[i].lower() # lower case all the text
    
    # remove all punctuation and split the text into individual words
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.split()
    
    # Drop stopwords, then stem each remaining word.
    text = [stemmer.stem(word) for word in text if word not in stopwords_set]
    
    text = ' '.join(text)
    corpus.append(text)

In [17]:
stemmer.stem('sophisticated') # Stems are not always dictionary words: "sophisticated" is cut down to "sophist"

'sophist'

In [20]:
corpus[0]
len(corpus)

5171
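
The cleaning steps in the loop above can be traced on a single line of text. This is a minimal sketch using only the standard library: `tokenize` and `DEMO_STOPWORDS` are hypothetical names introduced here, the stopword set is a tiny stand-in for the NLTK list, and stemming is left out so the example runs without NLTK.

```python
import string

# Hypothetical, tiny stopword set for illustration only; the notebook uses
# the full NLTK English list instead.
DEMO_STOPWORDS = {"for", "the", "is", "see"}

def tokenize(text, stopwords=DEMO_STOPWORDS):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    table = str.maketrans('', '', string.punctuation)
    words = text.lower().translate(table).split()
    return [w for w in words if w not in stopwords]

sample = "Subject: hpl nom for january 9 , 2001 ( see attached file : hplnol 09 . xls )"
tokens = tokenize(sample)
print(tokens)

# Apostrophes are punctuation too, so a spaced-out contraction splits in two.
print(tokenize("we ' re"))
```

Note that numbers survive this cleaning ("9", "2001", "09"), so they end up as features in the vocabulary alongside ordinary words.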

In [22]:
# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Transform the corpus into a document-term matrix
X = vectorizer.fit_transform(corpus).toarray()
Y = df.label_num

# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

In [23]:
X[0]

array([1, 0, 0, ..., 0, 0, 0], dtype=int64)

### Vectorizing transforms text into numerical representations that machine learning algorithms can use.
##### It extracts features from the text, such as word counts, which the model uses to identify patterns and make predictions.
##### Machine learning models require numerical input. Here, CountVectorizer turns each email into a row of word counts, one column per word in the vocabulary.
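
To make the idea concrete, here is a pure-Python sketch of a bag-of-words count matrix. It is not scikit-learn's implementation: `fit_vocabulary` and `to_counts` are hypothetical helpers, and the toy documents are made up. CountVectorizer additionally lowercases text and, with its default token pattern, ignores single-character tokens.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}

def to_counts(doc, vocabulary):
    """Return one row of the document-term matrix for `doc`."""
    counts = Counter(doc.split())
    row = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # tokens unseen when the vocabulary was built are ignored
            row[vocabulary[tok]] = n
    return row

# Made-up toy corpus, standing in for the preprocessed emails.
docs = ["cheap stock stock", "meet stock report"]
vocab = fit_vocabulary(docs)
matrix = [to_counts(d, vocab) for d in docs]

print(vocab)
print(matrix)

# A new document is encoded against the existing vocabulary only.
print(to_counts("cheap pills", vocab))
```

The last call shows why the same fitted vectorizer must be reused at prediction time: a word the vocabulary has never seen ("pills") has no column and contributes nothing to the row.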

In [24]:
clf = RandomForestClassifier(n_jobs=-1)

clf.fit(X_train, Y_train) 
# X_train: feature matrix for the training data
# Y_train: target labels for the training data

###### The `n_jobs` parameter sets how many CPU cores are used to train the model. Setting it to -1 uses all available cores, which can speed up training.
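
For reference, negative `n_jobs` values follow the joblib convention that scikit-learn documents: -1 means all cores, -2 means all but one, and so on. The helper below is a hypothetical illustration of that rule, not a function from scikit-learn or joblib.

```python
import os

def effective_workers(n_jobs, n_cpus=None):
    """Sketch of how an n_jobs value maps to a worker count."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs is None:
        return 1  # default: a single worker
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 -> all cores, -2 -> all but one, never fewer than one
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs

print(effective_workers(-1))  # number of cores on this machine
```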

In [28]:
clf.score(X_test, Y_test)

0.9739130434782609
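
For a classifier, `score` returns the mean accuracy on the given data. Accuracy alone does not show how the errors are split between missed spam and ham wrongly flagged as spam, so precision and recall for the spam class are worth checking as well. The sketch below computes all three from plain lists; `binary_metrics` is a hypothetical helper and the labels are made up, not results from this notebook.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up labels: 1 = spam, 0 = ham.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

With the notebook's own data, the same numbers could be obtained by passing `Y_test` and `clf.predict(X_test)` to the helper.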

In [29]:
email_to_classify = df.text.values[10]
email_to_classify

"Subject: vocable % rnd - word asceticism vcsc - brand new stock for your attention vocalscape inc - the stock symbol is : vcsc vcsc will be our top stock pick for the month of april - stock expected to bounce to 12 cents level the stock hit its all time low and will bounce back stock is going to explode in next 5 days - watch it soar watch the stock go crazy this and next week . breaking news - vocalscape inc . announces agreement to resell mix network services current price : $ 0 . 025 we expect projected speculative price in next 5 days : $ 0 . 12 we expect projected speculative price in next 15 days : $ 0 . 15 vocalscape networks inc . is building a company that ' s revolutionizing the telecommunications industry with the most affordable phone systems , hardware , online software , and rates in canada and the us . vocalscape , a company with global reach , is receiving international attention for the development of voice over ip ( voip ) application solutions , including the award 

In [30]:
# Converts the email text to lowercase. Removes all punctuation. Splits the text into individual words.
email_text = email_to_classify.lower().translate(str.maketrans('','',string.punctuation)).split()
email_text = [stemmer.stem(word) for word in email_text if word not in stopwords_set] # Stem and remove stopwords
email_text = ' '.join(email_text) # Combines the processed words back into a single string.

email_corpus = [email_text] # Puts the processed email text into a list.

X_email = vectorizer.transform(email_corpus) # Transforms the text into a numerical format that a machine learning model can use.

In [31]:
clf.predict(X_email)

array([1], dtype=int64)

In [32]:
df.label_num.iloc[10]

1

# 1 = SPAM 
# 0 = HAM

In [33]:
df.label_num.iloc[20]

0

In [39]:
def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.split()
    text = [stemmer.stem(word) for word in text if word not in stopwords_set]
    return ' '.join(text)

# Function to classify an email
def classify_email(email):
    email_text = preprocess_text(email)
    email_corpus = [email_text]
    
    X_email = vectorizer.transform(email_corpus)
    prediction = clf.predict(X_email)
    
    result = prediction[0]
    print("HAM" if result == 0 else "SPAM")
    return result

In [44]:
your_email = "Thank you for submitting your information and/or inquiry to '[placeholder email address]'. IT staff review messages daily, during regular campus business hours. For IMMEDIATE assistance with a suspicious email, contact the IT Service Desk if you: provided personally identifiable information (SSN, date of birth), financial/banking information, account credentials (user ID/password), etc.; corresponded via contact information provided in a suspicious email; or clicked on any URL (links) provided in a suspicious email. The IT Service Desk can be contacted at [placeholder phone number] or [placeholder email address]. The campus Chief Information Security Office (CISO) can be contacted at [placeholder email address]. Contact the Campus Police Department if you experienced monetary loss or have otherwise been defrauded or victimized due to information provided in a suspect email. The campus Police Department can be contacted at [placeholder phone number]. What can I do NOW? Before acting on information in a suspicious message, contact the sender to verify that the message is legitimate using a different communication method (other than what is provided in the message). Do NOT correspond or forward the suspect message (except sending it as an attachment to the [placeholder email address] address). If using Outlook or Outlook for the Web, you can report the message as Junk or Phishing using your Toolbar under Protection. Delete the message from your mailbox. Always change your [placeholder password name] when you click on any links in a suspicious email message, or if you replied to the message. If you have been 'spoofed,' warn the people you regularly communicate with that they may receive emails from your account which are not from you. IT Security and Compliance, Chief Information Security Officer, Information Technology & Institutional Planning Division, [placeholder university name], [placeholder email address]. IT Service Desk/Library Tech Desk Contact Information: IT Service Desk Contact Information. Submit a ticket using the IT Service Portal."

In [45]:
classify_email(your_email)

HAM


0

# What is Random Forest: https://www.youtube.com/watch?v=gkXX4h3qYm4