# Spam Email Detection using Machine Learning

This project demonstrates how to build a spam detection classifier using the Multinomial Naive Bayes algorithm. The dataset contains SMS messages labeled as spam or ham (non-spam).

**Dataset Source:** [Kaggle - Spam Emails Dataset](https://www.kaggle.com/datasets/abdallahwagih/spam-emails)

## 1. Import Required Libraries

In [1]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Machine learning utilities
from sklearn.model_selection import train_test_split  # Split data into train and test sets
from sklearn.naive_bayes import MultinomialNB  # Naive Bayes classifier for text classification
from sklearn.feature_extraction.text import CountVectorizer  # Convert text to numerical features

# Evaluation metrics
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

## 2. Load and Inspect the Dataset

In [2]:
# Load the dataset from CSV file
df = pd.read_csv('spam.csv')

# Display the first 5 rows to understand the data structure
df.head()

|   | Category | Message |
|---|----------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
# Check the distribution of spam vs ham messages
# This shows class imbalance in the dataset (more ham than spam)
df.groupby('Category').count()

| Category | Message |
|----------|---------|
| ham | 4825 |
| spam | 747 |


In [4]:
# Check the total number of messages and features
# Output: (5572 messages, 2 columns)
df.shape

(5572, 2)
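The class counts shown above give a useful reference point for later evaluation. The short sketch below, which only reuses the numbers from the outputs (4825 ham, 747 spam), computes the spam share and the accuracy of a trivial classifier that labels every message as ham. The variable names are illustrative and not part of the original notebook.

```python
# Counts copied from the groupby output above
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

# Fraction of messages that are spam
spam_share = spam_count / total

# Accuracy of a trivial classifier that always predicts "ham"
majority_baseline_accuracy = ham_count / total

print(f"Total messages: {total}")                                   # 5572
print(f"Spam share: {spam_share:.1%}")                              # 13.4%
print(f"Majority-class baseline: {majority_baseline_accuracy:.1%}") # 86.6%
```

Any model trained on this data should be compared against that baseline rather than against 50%.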

In [5]:
# View examples of spam messages
# Notice words like "Free", "win", "prize", and "Call", which are common in spam
spam = df[df['Category']=='spam'].head()
spam['Message'].values

array(["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
       "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv",
       'WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.',
       'Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030',
       'SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info'],
      dtype=object)

In [6]:
# Create a binary target variable for modeling
# Convert 'spam' to 1 and 'ham' to 0 for numerical classification
# (a vectorized comparison is equivalent to applying a lambda row by row)
df['spam'] = (df['Category'] == 'spam').astype(int)
df.head()

|   | Category | Message | spam |
|---|----------|---------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


## 3. Split Data into Training and Testing Sets

We split the data so that the model can be evaluated on messages it never saw during training. A held-out test set reveals overfitting and gives a more realistic estimate of how the model performs on new messages.

In [7]:
# Split data: 75% training, 25% testing (default split)
# X = features (messages), y = target (spam/ham labels)
X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam, random_state=42)

In [8]:
# View the training messages
X_train

4281    WINNER!! As a valued network customer you have...
585     So how's scotland. Hope you are not over showi...
4545                  when you and derek done with class?
3034                          Aight, lemme know what's up
2758                Yo we are watching a movie on netflix
                              ...                        
3772    Hi, wlcome back, did wonder if you got eaten b...
5191                               Sorry, I'll call later
5226        Prabha..i'm soryda..realy..frm heart i'm sory
5390                           Nt joking seriously i told
860               Did he just say somebody is named tampa
Name: Message, Length: 4179, dtype: object

In [9]:
# View the training labels (0 = ham, 1 = spam)
y_train

4281    1
585     0
4545    0
3034    0
2758    0
       ..
3772    0
5191    0
5226    0
5390    0
860     0
Name: spam, Length: 4179, dtype: int64
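Because spam is the minority class, a purely random split can leave the train and test sets with slightly different spam ratios. One possible variation, not used in this notebook, is a stratified split via the `stratify` parameter of `train_test_split`. The sketch below uses hypothetical toy data (16 ham, 4 spam) so that it runs on its own.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 ham and 4 spam messages (20% spam)
messages = [f"ham message {i}" for i in range(16)] + [f"spam message {i}" for i in range(4)]
labels = [0] * 16 + [1] * 4

# stratify=labels keeps the spam ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels, test_size=0.25, stratify=labels, random_state=42
)

print(sum(y_tr) / len(y_tr))  # spam share in the training split
print(sum(y_te) / len(y_te))  # spam share in the test split
```

On the real data this would correspond to passing `stratify=df.spam` in the call above, which would change the exact split and therefore the reported metrics.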

## 4. Feature Engineering: Convert Text to Numerical Format

Machine learning algorithms require numerical input. We use CountVectorizer to convert text messages into a matrix of token counts.

In [10]:
# Initialize CountVectorizer
# Once fitted, it builds a vocabulary of the unique tokens in the training data
cv = CountVectorizer()

In [11]:
# Fit the vectorizer on training data and transform messages to count matrix
# Each row represents a message, each column represents a word, values are word counts
X_train_count = cv.fit_transform(X_train.values)

In [12]:
# View the sparse matrix as a dense array
# Each number represents how many times a word appears in a message
X_train_count.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

## 5. Train the Naive Bayes Classifier

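Multinomial Naive Bayes is a common and effective baseline for text classification. For each class it estimates how likely each word is, then combines these per-word likelihoods with the class prior to decide whether a message is more probably spam or ham.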
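To make the calculation concrete, the sketch below reproduces the posterior of `MultinomialNB` by hand on a tiny, made-up count matrix. It assumes the default settings (`alpha=1.0` Laplace smoothing, class priors estimated from the data); the function name `manual_log_posterior` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up word counts; columns could stand for "free", "meeting", "call"
X_toy = np.array([[2, 0, 1],
                  [1, 0, 0],
                  [0, 2, 0],
                  [0, 1, 1]])
y_toy = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

def manual_log_posterior(x, X, y, alpha=1.0):
    """Log P(class | x) for classes 0 and 1 under multinomial Naive Bayes."""
    scores = []
    for c in (0, 1):
        X_c = X[y == c]
        log_prior = np.log(len(X_c) / len(X))
        word_counts = X_c.sum(axis=0)
        # Smoothed per-class word probabilities
        log_likelihood = np.log(
            (word_counts + alpha) / (word_counts.sum() + alpha * X.shape[1])
        )
        scores.append(log_prior + x @ log_likelihood)
    scores = np.array(scores)
    # Normalize so the class probabilities sum to 1
    return scores - np.logaddexp.reduce(scores)

new_message = np.array([1, 0, 1])  # contains "free" and "call" once each
manual = manual_log_posterior(new_message, X_toy, y_toy)

toy_model = MultinomialNB(alpha=1.0).fit(X_toy, y_toy)
from_sklearn = toy_model.predict_log_proba(new_message.reshape(1, -1))[0]

print(np.exp(manual))        # hand-computed [P(ham), P(spam)]
print(np.exp(from_sklearn))  # scikit-learn's result for comparison
```

The two printed vectors should agree up to floating-point error, which shows that the model is just counting words per class, smoothing, and applying Bayes' rule.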

In [13]:
# Initialize the Multinomial Naive Bayes model
model = MultinomialNB()

In [14]:
# Train the model on the training data
# The model learns which words are associated with spam vs ham
model.fit(X_train_count, y_train)

## 6. Test the Model with Custom Examples

Before evaluating on test data, let's verify the model works with custom messages.

In [15]:
# Test with a spam-like message
# Expected output: 1 (spam)
spam_email = ["Win $1,000,000 today in our Free Casino! No registration needed. Click this link."]
spam_email_count = cv.transform(spam_email)  # Transform using the same vectorizer
model.predict(spam_email_count)

array([1])

In [16]:
# Test with a legitimate message
# Expected output: 0 (ham)
ham_email = ["Hello, please let me know when we can schedule a meeting?"]
ham_email_count = cv.transform(ham_email)
model.predict(ham_email_count)

array([0])
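Both cells above have to remember to call `cv.transform` before `model.predict`. One way to avoid that bookkeeping, shown here as an optional sketch rather than as part of the original workflow, is scikit-learn's `make_pipeline`, which chains the vectorizer and the classifier into a single object that accepts raw strings. The training messages below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training messages (1 = spam, 0 = ham)
train_messages = [
    "win free cash prize now",
    "free entry win prize",
    "are we meeting for lunch",
    "call me when you are home",
]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes and classifies in one step
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(train_messages, train_labels)

# Raw text goes in directly; no separate transform call is needed
pipeline_predictions = spam_pipeline.predict(["free cash prize", "are we meeting"])
print(pipeline_predictions)  # [1 0]
```

With the real data, the same pattern would be `spam_pipeline.fit(X_train, y_train)` followed by `spam_pipeline.predict(X_test)`.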

## 7. Evaluate Model Performance on Test Data

We use multiple metrics to assess model quality:
- **Accuracy**: Fraction of all messages classified correctly (can look high on this dataset, since about 87% of messages are ham)
- **Precision**: Of messages classified as spam, how many are actually spam?
- **Recall**: Of all actual spam messages, how many did we catch?
- **F1 Score**: Harmonic mean of precision and recall
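
The four metrics listed above can all be derived from the confusion matrix. The sketch below computes them by hand on a small set of made-up labels and checks the results against scikit-learn's metric functions; `y_true_toy` and `y_pred_toy` are hypothetical and unrelated to the notebook's data.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Made-up labels: 3 actual spam messages, 7 actual ham messages
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of predicted spam, how much is real spam
recall = tp / (tp + fn)     # of real spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# The hand-computed values match scikit-learn's functions
assert abs(accuracy - accuracy_score(y_true_toy, y_pred_toy)) < 1e-12
assert abs(precision - precision_score(y_true_toy, y_pred_toy)) < 1e-12
assert abs(recall - recall_score(y_true_toy, y_pred_toy)) < 1e-12
assert abs(f1 - f1_score(y_true_toy, y_pred_toy)) < 1e-12
```

Here accuracy is 0.80 while precision and recall are both about 0.67, which illustrates how accuracy can overstate quality when ham dominates.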

In [17]:
# Transform test data using the same vectorizer (do NOT fit again)
X_test_count = cv.transform(X_test)

# Generate predictions on the test set
predictions = model.predict(X_test_count)

In [18]:
# Calculate and display performance metrics
precision_score_val = round(precision_score(y_test, predictions), 2)
recall_score_val = round(recall_score(y_test, predictions), 2)
accuracy_score_val = round(accuracy_score(y_test, predictions), 2)
f1_score_val = round(f1_score(y_test, predictions), 2)

print('Model Performance Metrics:')
print('=' * 30)
print(f'Accuracy:  {accuracy_score_val} - Correctly classified messages')
print(f'Precision: {precision_score_val} - Spam predictions that are correct')
print(f'Recall:    {recall_score_val} - Actual spam messages detected')
print(f'F1 Score:  {f1_score_val} - Overall performance balance')

Model Performance Metrics:
==============================
Accuracy:  0.99 - Correctly classified messages
Precision: 0.98 - Spam predictions that are correct
Recall:    0.94 - Actual spam messages detected
F1 Score:  0.96 - Overall performance balance
