# Oasis Infobyte Internship

Name: Mayuri Kale

Email: mayurikale1947@gmail.com

Task 4: EMAIL SPAM DETECTION WITH MACHINE LEARNING

We have all received spam emails. Spam, or junk mail, is email sent to a massive number of users at once, frequently containing cryptic messages, scams, or, most dangerously, phishing content.

In this project, we use Python to build an email spam detector, then use machine learning to train it to classify messages as spam or non-spam (ham). Let's get started!

## Import Required Libraries

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
import re
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

## Load and Explore dataset

In [2]:
df = pd.read_csv("spam.csv", encoding="latin-1")
df

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


In [3]:
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [4]:
df.tail()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,
5571,ham,Rofl. Its true to its name,,,


In [5]:
df.describe()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
count,5572,5572,50,12,6
unique,2,5169,43,10,5
top,ham,"Sorry, I'll call later","bt not his girlfrnd... G o o d n i g h t . . .@""","MK17 92H. 450Ppw 16""","GNT:-)"""
freq,4825,30,3,2,2


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [7]:
df.isnull().sum()

v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64

In [8]:
# Keep only relevant columns
df = df[['v1', 'v2']].copy()

In [9]:
# Rename columns for better readability
df.columns = ['label', 'message']
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    5572 non-null   object
 1   message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [10]:
# Convert 'label' column to numeric values (ham = 0, spam = 1)
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

In [11]:
# Check for missing values
df.isnull().sum()

label      0
message    0
dtype: int64
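
Before modeling, it helps to know how imbalanced the two classes are. The short sketch below uses only the counts reported by `df.describe()` earlier (5572 messages, 4825 of them ham); the variable names are illustrative and not part of the original notebook.

```python
# Counts taken from the df.describe() output: 5572 messages, 4825 labeled ham.
total_messages = 5572
ham_count = 4825
spam_count = total_messages - ham_count

spam_share = spam_count / total_messages
# Accuracy of a trivial model that labels every message as ham.
majority_baseline = ham_count / total_messages

print(f"Spam messages: {spam_count}")
print(f"Spam share: {spam_share:.2%}")
print(f"Always-ham baseline accuracy: {majority_baseline:.2%}")
```

Because a model that always answers "ham" would already be right about 87% of the time, accuracy alone says little here; precision and recall on the spam class are more informative.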

## Text Preprocessing

In [12]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import string

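# Download the stopword list and tokenizer data if not already present
nltk.download('stopwords')
nltk.download('punkt')
# Note: newer NLTK releases may look for 'punkt_tab' instead; if word_tokenize
# raises a LookupError, the error message names the resource to download.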

# Initialize Stemmer & Stopwords
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

# Function to preprocess text
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize words
    words = word_tokenize(text)
    # Remove stopwords and apply stemming
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    # Join words back to string
    return " ".join(words)

# Apply preprocessing to the 'message' column
df['message'] = df['message'].apply(preprocess_text)

# Display sample processed messages
df.head()


[nltk_data] Downloading package stopwords to C:\Users\Krushna
[nltk_data]     Jadhav\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Krushna
[nltk_data]     Jadhav\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


Unnamed: 0,label,message
0,0,go jurong point crazi avail bugi n great world...
1,0,ok lar joke wif u oni
2,1,free entri 2 wkli comp win fa cup final tkt 21...
3,0,u dun say earli hor u c alreadi say
4,0,nah dont think goe usf live around though


## Feature Extraction (TF-IDF)

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Keep only the 5000 most frequent terms

# Transform text messages into TF-IDF features
X = tfidf_vectorizer.fit_transform(df['message'])

# Labels (target variable)
y = df['label']

# Check shape of the transformed data
X.shape, y.shape

((5572, 5000), (5572,))
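
For intuition about what the TF-IDF matrix contains, the following standard-library sketch computes the weights by hand on three toy messages. It follows the formula scikit-learn documents for its default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The function name `tfidf_rows` and the toy messages are hypothetical, and the sketch splits on whitespace, whereas `TfidfVectorizer`'s default token pattern ignores single-character tokens such as `u` or `2`.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()} if norm else {})
    return rows

# Hypothetical toy messages, already preprocessed.
docs = ["free entri win", "ok lar joke", "free prize win win"]
rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, "->", {t: round(w, 3) for t, w in row.items()})
```

In the first toy message, `entri` appears in only one document, so it receives a higher weight than `free` or `win`, which appear in two.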

## Split Data into Training and Testing Sets

In [15]:
# Split data into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Check shape of train and test sets
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((4457, 5000), (1115, 5000), (4457,), (1115,))
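
One thing to be aware of: the notebook fits `TfidfVectorizer` on all 5572 messages before splitting, so the vocabulary and IDF weights have already seen the test messages. The effect on the scores is probably small, but a common alternative is to split the raw text first and let a scikit-learn `Pipeline` fit the vectorizer on the training portion only. The sketch below shows that pattern on hypothetical toy messages; the variable names are illustrative and not from the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages standing in for the preprocessed 'message' column.
demo_messages = [
    "free entri win prize", "win cash claim free", "claim free prize call",
    "urgent win free ticket", "ok lar joke wif", "say earli alreadi",
    "think goe live around", "call later sorri", "go home later",
    "see tomorrow lunch",
]
demo_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Split the raw text first so test messages never influence the vocabulary or IDF.
text_train, text_test, labels_train, labels_test = train_test_split(
    demo_messages, demo_labels, test_size=0.3, random_state=42, stratify=demo_labels
)

# The pipeline fits TF-IDF on the training text only, then trains Naive Bayes.
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_pipeline.fit(text_train, labels_train)

demo_predictions = spam_pipeline.predict(text_test)
print(len(text_train), len(text_test), demo_predictions)
```

With this arrangement, `spam_pipeline.predict` accepts raw preprocessed strings directly, so the vectorizer and the classifier cannot drift apart.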

## Train a Machine Learning Model

In [16]:
# Initialize and train Naïve Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Model training complete
model


MultinomialNB()

## Model Evaluation

In [18]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Display results
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

Accuracy: 0.9668161434977578
Precision: 0.9912280701754386
Recall: 0.7583892617449665
F1 Score: 0.8593155893536121


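**Results on the test set:**

Accuracy: 🚀 96.68% (share of all test messages classified correctly)

Precision: 🎯 99.12% (share of messages flagged as spam that really are spam)

Recall: 🔥 75.84% (share of actual spam messages that were caught)

F1-score: ✅ 85.93% (harmonic mean of precision and recall)

The high precision and lower recall mean the model almost never flags a legitimate message, but it misses roughly one in four spam messages.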
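
The printed scores also pin down the underlying confusion-matrix counts. The sketch below searches for integer counts of true positives, false positives, false negatives, and true negatives that reproduce the reported accuracy, precision, and recall on the 1115 test messages. The helper `find_counts` is hypothetical and written only for this check; in the notebook itself, the already imported `confusion_matrix(y_test, y_pred)` would give the counts directly.

```python
# Test-set size comes from the split output; the metrics are the printed scores.
n_test = 1115
reported_accuracy = 0.9668161434977578
reported_precision = 0.9912280701754386
reported_recall = 0.7583892617449665

def find_counts(n_test, accuracy, precision, recall, tol=1e-9):
    """Search for integer (tp, fp, fn, tn) consistent with the three metrics."""
    for tp in range(1, n_test + 1):
        for fp in range(0, n_test - tp + 1):
            if abs(tp / (tp + fp) - precision) > tol:
                continue
            for fn in range(0, n_test - tp - fp + 1):
                if abs(tp / (tp + fn) - recall) > tol:
                    continue
                tn = n_test - tp - fp - fn
                if abs((tp + tn) / n_test - accuracy) <= tol:
                    return tp, fp, fn, tn
    return None

counts = find_counts(n_test, reported_accuracy, reported_precision, reported_recall)
tp, fp, fn, tn = counts
f1_from_counts = 2 * tp / (2 * tp + fp + fn)
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}, F1={f1_from_counts:.4f}")
```

Read this way, the model caught 113 of the 149 spam messages in the test set and wrongly flagged a single ham message.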


## Sample Email Message To Test 

In [20]:
# Sample email message for testing
sample_email = "Congratulations! You've won a free lottery ticket. Claim your prize now."

# Preprocess the sample email using the same steps
sample_email_processed = preprocess_text(sample_email)

# Convert to TF-IDF features
sample_email_tfidf = tfidf_vectorizer.transform([sample_email_processed])

# Predict using the trained model
prediction = model.predict(sample_email_tfidf)[0]

if prediction == 1:
    result = "🚨 This email is SPAM! 🚨"
else:
    result = "✅ This email is NOT spam."

result

'🚨 This email is SPAM! 🚨'
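
`model.predict` applies a fixed decision rule, but `MultinomialNB` also exposes `predict_proba`, which makes it possible to trade precision for recall by choosing a different cutoff. The sketch below is not part of the original notebook: it trains on hypothetical toy messages, and the helpers `spam_probability` and `is_spam` are illustrative names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data (1 = spam, 0 = ham).
toy_texts = ["free prize win", "win cash free", "see lunch tomorrow", "call later home"]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

def spam_probability(message):
    """Return the model's estimated probability that the message is spam."""
    features = toy_vectorizer.transform([message])
    # Columns of predict_proba follow toy_model.classes_, so look up label 1.
    spam_index = list(toy_model.classes_).index(1)
    return float(toy_model.predict_proba(features)[0, spam_index])

def is_spam(message, threshold=0.5):
    """Flag a message as spam when its spam probability reaches the threshold."""
    return spam_probability(message) >= threshold

p_spammy = spam_probability("free prize win")
p_hammy = spam_probability("see lunch tomorrow")
print(f"{p_spammy:.3f} {p_hammy:.3f}")
```

Lowering `threshold` would typically catch more spam at the cost of more false alarms; any cutoff should be chosen on validation data rather than on the test set.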