<a href="https://colab.research.google.com/github/kanaka-22/-Email-Spam-Filtering/blob/main/Email_Spam_Filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Email messaging has become an integral part of our daily lives. You've probably noticed some strange emails landing in your junk or spam folder. Many email clients, such as Gmail, use machine learning alongside other techniques to detect and filter out spam emails. Once identified, these emails are redirected to the Junk folder.

In this guide, we'll build a spam detection system using machine learning. We'll train a model on a dataset of messages labeled as spam or legitimate, then use the scikit-learn library to classify new messages as either spam or genuine.

**The Dataset**

Data is a crucial part of any machine learning model. We will work with the SMS Spam Collection Dataset, available on Kaggle: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset. Although the messages are SMS texts rather than emails, the same text-classification approach applies.

The SMS Spam Collection is a set of SMS messages gathered for research on SMS spam detection. It contains 5,574 English-language messages, each labeled as either ‘ham’ (legitimate) or ‘spam’ (unsolicited or malicious).

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import nltk
import joblib
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
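# Tokenizer models and stop-word lists used during preprocessing.
# Newer NLTK releases may also need nltk.download('punkt_tab') for word_tokenize.
nltk.download('punkt')
nltk.download('stopwords')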


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Load the CSV from Drive; the file is read as latin-1 rather than the default UTF-8
df = pd.read_csv("/content/drive/MyDrive/spam.csv", encoding="latin-1")
# Drop the three unnamed columns, which are not used here, and rename the rest
df = df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
df = df.rename(columns={"v1": "label", "v2": "text"})

# Preprocess text data: lowercase, tokenize, remove English stop words, re-join
df["text"] = df["text"].str.lower()
df["text"] = df["text"].apply(word_tokenize)
stop_words = set(stopwords.words("english"))
df["text"] = df["text"].apply(lambda x: [word for word in x if word not in stop_words])
df["text"] = df["text"].apply(lambda x: " ".join(x))
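The preprocessing steps above are applied to the training data only, so a new message would need the same treatment before prediction. One way to keep the two in sync is to wrap the steps in a single function. The sketch below is a minimal illustration, not the notebook's code: `preprocess` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word set is a tiny stand-in for NLTK's list, and a simple regular expression replaces `word_tokenize` (so punctuation is dropped instead of being kept as separate tokens).

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list; pass
# set(stopwords.words("english")) instead to match the notebook.
DEMO_STOP_WORDS = {"a", "an", "the", "is", "it", "you", "to", "have"}


def preprocess(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase the text, split it into word tokens, drop stop words, re-join."""
    # Assumption: a regex tokenizer is used here in place of word_tokenize
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(token for token in tokens if token not in stop_words)


print(preprocess("You have WON a prize! Claim it now."))
# prints: won prize claim now
```

With a helper like this, the same call could be used both in `df["text"].apply(...)` and on any message passed to the model later.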


In [5]:
# Feature extraction: turn each message into a vector of word counts.
# Note: the vocabulary is learned from all rows, including those that
# later end up in the test set.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["text"])

# Split the dataset: 80% for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, df["label"], test_size=0.2, random_state=42)

# Train a Multinomial Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Evaluate on the held-out test set using accuracy and a classification report
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(report)



Accuracy: 0.9748878923766816
Classification Report:
              precision    recall  f1-score   support

         ham       0.99      0.98      0.99       965
        spam       0.90      0.92      0.91       150

    accuracy                           0.97      1115
   macro avg       0.94      0.95      0.95      1115
weighted avg       0.98      0.97      0.98      1115
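Because the test split is imbalanced (965 ham versus 150 spam), accuracy alone can flatter a model. A quick arithmetic check, using only the numbers printed in the report above, puts the 97% figure in context by comparing it with a trivial rule that labels every message as ham.

```python
# Support counts taken from the classification report above
ham_support, spam_support = 965, 150
total = ham_support + spam_support

# A rule that always answers "ham" is right on every ham message
baseline_accuracy = ham_support / total

# Convert the reported accuracy back into a count of correct predictions
reported_accuracy = 0.9748878923766816
correct = round(reported_accuracy * total)

print(f"always-ham baseline: {baseline_accuracy:.4f}")  # 0.8655
print(f"correct predictions: {correct} of {total}")     # 1087 of 1115
print(f"misclassified: {total - correct}")              # 28
```

The classifier therefore improves on the always-ham baseline by roughly 11 percentage points, and the per-class precision and recall for spam show where the remaining 28 errors matter most.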



In [6]:
joblib.dump(classifier, 'email_spam_model.pkl')

['email_spam_model.pkl']
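The file saved above contains only the classifier; the fitted `CountVectorizer` is not stored, so predictions in a fresh session would have no way to turn text into features. One possible alternative is to bundle both steps in a scikit-learn `Pipeline` and save that single object. The sketch below assumes a tiny made-up training set and a hypothetical file name, `spam_pipeline.pkl`; it is an illustration of the idea, not a replacement for the notebook's results. Fitting the pipeline on training text also means the vocabulary is learned from training data only.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set, for illustration only
train_texts = [
    "win free prize now",
    "claim your free cash prize",
    "urgent call now to win cash",
    "are we still meeting for lunch",
    "see you at home tonight",
    "can you call me after work",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The pipeline learns the vocabulary and the classifier from the same
# training data, so held-out messages never influence the vocabulary.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# One file now holds both steps (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "spam_pipeline.pkl")
joblib.dump(model, path)

# The restored object accepts raw text directly.
restored = joblib.load(path)
print(restored.predict(["free cash prize"]))
```

In the notebook, the same pattern would mean splitting `df["text"]` first and then calling `fit` on the training texts rather than on a pre-computed matrix.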

In [7]:
# Load the saved classifier from disk
loaded_model = joblib.load('email_spam_model.pkl')
new_email = ["Congratulations! You've won a prize. Claim it now."]
# Reuse the fitted vectorizer from the training cell; it is not stored in the .pkl file
new_email = vectorizer.transform(new_email)
prediction = loaded_model.predict(new_email)

if prediction[0] == "spam":
    print("This email is spam.")
else:
    print("This email is not spam.")


This email is spam.
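A hard label hides how confident the model is. `MultinomialNB` also exposes `predict_proba`, which returns one probability per class in the order given by `classes_`. The sketch below fits a throwaway model on a tiny made-up corpus (the `demo_` names are hypothetical and chosen to avoid clashing with the notebook's variables), so the printed values say nothing about the real model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus, for illustration only
demo_texts = [
    "win a free prize now",
    "claim your cash prize today",
    "lunch at noon tomorrow",
    "call me when you get home",
]
demo_labels = ["spam", "spam", "ham", "ham"]

demo_vectorizer = CountVectorizer()
demo_classifier = MultinomialNB()
demo_classifier.fit(demo_vectorizer.fit_transform(demo_texts), demo_labels)

message = ["Congratulations! You've won a prize. Claim it now."]
probabilities = demo_classifier.predict_proba(demo_vectorizer.transform(message))[0]

# classes_ is sorted, so the order here is ham, then spam
for label, probability in zip(demo_classifier.classes_, probabilities):
    print(f"{label}: {probability:.3f}")
```

Probabilities like these could be compared against a threshold other than 0.5 if, for example, wrongly flagging a legitimate message is considered more costly than letting some spam through.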


In conclusion, spam detection is an important defense against unwanted and potentially harmful messages. In this guide, we trained a Multinomial Naive Bayes classifier on word counts from the SMS Spam Collection. On the held-out 20% test split it reached about 97.5% accuracy, with a precision of 0.90 and a recall of 0.92 on the spam class. The same approach should carry over to email, though the model would need to be retrained on labeled email data.