# **Email Spam Classifier**


## By: Muhammad Ubaidullah

# **Abstract**

This project develops and evaluates machine learning models that classify emails as spam or non-spam. Using the provided email dataset, it trains and compares several classifiers (Multinomial Naive Bayes, Logistic Regression, Support Vector Machine, Decision Tree, and K-Nearest Neighbors) on how accurately they categorize messages. The aim is to identify the most suitable classifier for spam detection, thereby improving the performance of email filtering systems and reducing the inconvenience caused by spam.

# **Introduction**

Email classification, particularly spam detection, is a critical task in modern communication systems. As email traffic grows, distinguishing legitimate messages from spam is essential to users' productivity and security. Machine learning offers a way to automate this by training models to recognize spam patterns and predict the category of each message. This project applies several machine learning algorithms to build and evaluate email classifiers, with the aim of improving the performance of email filtering systems and the user experience.

# **Roadmap**

- **Data Collection**: Obtain a labeled email dataset containing examples of both spam and non-spam emails.
- **Data Preprocessing**: Perform data cleaning, including text normalization, removal of stopwords, and vectorization of text data using techniques like TF-IDF.
- **Exploratory Data Analysis (EDA)**: Explore the distribution of spam and non-spam emails, analyze common words or phrases in each category, and identify potential features for classification.
- **Model Building**: Train multiple machine learning models, including Multinomial Naive Bayes, Logistic Regression, Support Vector Machine, Decision Tree, and K-Nearest Neighbors, on the preprocessed email data.
- **Model Evaluation**: Assess the performance of each model using evaluation metrics such as accuracy, precision, recall, F1-score, and confusion matrix analysis.
- **Hyperparameter Tuning**: Fine-tune the hyperparameters of selected models using techniques like grid search or randomized search to optimize performance.
- **Model Comparison**: Compare the performance of tuned models and select the most effective classifier based on evaluation metrics.
- **Testing and Deployment**: Test the selected classifier on new, unseen email data to evaluate its real-world performance. If satisfactory, deploy the classifier in email filtering systems to classify incoming emails accurately.
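
The roadmap above can be condensed into a small end-to-end sketch. The snippet below is a minimal illustration, not the notebook's actual pipeline: the six messages and their labels are invented toy data, and chaining the vectorizer and classifier in a scikit-learn `Pipeline` is an assumption made here for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only (1 = spam, 0 = ham)
texts = [
    "win a free prize now",
    "free cash offer, claim now",
    "urgent: claim your free reward",
    "are we still meeting for lunch",
    "see you at home tonight",
    "can you call me after class",
]
labels = [1, 1, 1, 0, 0, 0]

# Vectorization and classification chained into one estimator
toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
toy_pipeline.fit(texts, labels)

# Classify two unseen toy messages
toy_predictions = toy_pipeline.predict(["claim your free prize", "see you after lunch"])
print(toy_predictions)
```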

# **Collect Data Set**

## Importing Necessary Libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import nltk
import seaborn as sns
from nltk.stem.porter import PorterStemmer
from wordcloud import WordCloud
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score,confusion_matrix,precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
# from sklearn.ensemble import AdaBoostClassifier
# from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import StackingClassifier
from nltk.corpus import stopwords
import string
from sklearn.model_selection import GridSearchCV

## Read the dataset

In [None]:
df = pd.read_csv('spam.csv')  # if this raises UnicodeDecodeError, try passing encoding='latin-1'

## Display a sample of the dataset

In [None]:
df.sample(5)

## Get the shape of the dataset

In [None]:
df.shape

# **Data Cleaning**

## Display information about the dataset

In [None]:
df.info()

## Check for missing values

In [None]:
missing_values = df.isnull().sum()
print("Missing Values:")
print(missing_values)

## Check for duplicate rows

In [None]:
duplicate_rows = df.duplicated().sum()
print("\nDuplicate Rows:", duplicate_rows)

## Remove duplicate rows

In [None]:
df = df.drop_duplicates()

## Drop unnecessary columns

In [None]:
unnecessary_columns = ['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']
df = df.drop(columns=unnecessary_columns, errors='ignore')

## Rename the columns for better understanding

In [None]:
df.rename(columns={'v1':'target','v2':'text'}, inplace=True)

## Display a sample after cleaning

In [None]:
df.sample(5)

In [None]:
# Encode the labels as integers; LabelEncoder sorts class names, so ham -> 0 and spam -> 1
encoder = LabelEncoder()
df['target'] = encoder.fit_transform(df['target'])
df.head()
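
`LabelEncoder` assigns integers to the sorted class names, so with the labels `ham` and `spam` it should map ham to 0 and spam to 1, which is the convention the later cells rely on (`target == 1` for spam). A small check on toy labels, with all names below chosen for illustration:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration
toy_targets = ["ham", "spam", "ham", "ham", "spam"]

toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform(toy_targets)

# classes_ lists the original labels in the order of their integer codes
label_mapping = {str(label): code for code, label in enumerate(toy_encoder.classes_)}
print(label_mapping)
print(encoded.tolist())
```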

## Check for duplicate values


In [None]:
df.duplicated().sum()

## Remove duplicates

In [None]:
df = df.drop_duplicates(keep='first')
df.duplicated().sum()
df.shape

In [None]:

df.isnull().sum()

# **Exploratory Data Analysis (EDA)**

## Display value counts of the target variable

In [None]:
df['target'].value_counts()

## Visualize the distribution of target variable

In [None]:
# Sort by class label so the counts line up with the labels (0 = ham, 1 = spam)
class_counts = df['target'].value_counts().sort_index()
plt.pie(class_counts, labels=['ham', 'spam'], autopct="%0.2f")
plt.show()

## Data is imbalanced
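
Class imbalance matters for how the later scores are read. As a rough illustration, the snippet below uses hypothetical class counts (not taken from this dataset) to show that a classifier which always predicts ham already reaches a high accuracy while catching no spam at all. This is one reason precision is reported next to accuracy in the evaluation cells.

```python
# Hypothetical class counts, chosen only for illustration
n_ham, n_spam = 870, 130

# A trivial baseline that labels every message as ham (0)
y_true = [0] * n_ham + [1] * n_spam
y_pred = [0] * len(y_true)

# Accuracy: share of messages labelled correctly
baseline_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on spam: share of spam messages that were caught
spam_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
baseline_spam_recall = spam_caught / n_spam

print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(f"Baseline spam recall: {baseline_spam_recall:.2f}")
```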

In [None]:
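# Tokenizer data used by nltk.word_tokenize and nltk.sent_tokenize
# (recent NLTK releases may ask for the 'punkt_tab' resource instead)
nltk.download('punkt')
# Number of characters in each message
df['num_characters'] = df['text'].apply(len)
df.head()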

## Number of words and sentences

In [None]:
df['num_words'] = df['text'].apply(lambda x:len(nltk.word_tokenize(x)))
df.head()
df['num_sentences'] = df['text'].apply(lambda x:len(nltk.sent_tokenize(x)))
df.head()
df[['num_characters','num_words','num_sentences']].describe()

## Ham

In [None]:
df[df['target'] == 0][['num_characters','num_words','num_sentences']].describe()

## Spam

In [None]:
df[df['target'] == 1][['num_characters','num_words','num_sentences']].describe()

In [None]:
# Exclude non-numeric columns before generating correlation matrix
numeric_df = df.select_dtypes(include='number')

# Plot correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

# **Data Preprocessing**

In [None]:
nltk.download('stopwords')

## Function for Text Preprocessing

In [None]:
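def transform_text(text):
    """Lowercase, tokenize, drop non-alphanumeric tokens and stopwords, then stem."""
    # Build these once per call instead of once per token
    stop_words = set(stopwords.words('english'))
    ps = PorterStemmer()

    # Lowercase and split into tokens
    tokens = nltk.word_tokenize(text.lower())

    # Keep alphanumeric tokens only; this also drops pure punctuation
    tokens = [t for t in tokens if t.isalnum()]

    # Remove English stopwords
    tokens = [t for t in tokens if t not in stop_words]

    # Reduce each remaining token to its Porter stem
    return " ".join(ps.stem(t) for t in tokens)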


## Apply text transformation to the 'text' column

In [None]:
df['transformed_text'] = df['text'].apply(transform_text)

## Word Cloud for spam messages

In [None]:
spam_corpus = ' '.join(df[df['target'] == 1]['transformed_text'].tolist())
spam_wc = WordCloud(width=500, height=500, min_font_size=10, background_color='white').generate(spam_corpus)
plt.figure(figsize=(15, 6))
plt.imshow(spam_wc)
plt.axis('off')
plt.show()

## Word Cloud for ham messages

In [None]:
ham_corpus = ' '.join(df[df['target'] == 0]['transformed_text'].tolist())
ham_wc = WordCloud(width=500, height=500, min_font_size=10, background_color='white').generate(ham_corpus)
plt.figure(figsize=(15, 6))
plt.imshow(ham_wc)
plt.axis('off')
plt.show()

## Barplot for most common words in spam messages

In [None]:
word_counts = Counter(spam_corpus.split()).most_common(30)
words = [word_count[0] for word_count in word_counts]
counts = [word_count[1] for word_count in word_counts]

plt.figure(figsize=(10, 6))
sns.barplot(x=words, y=counts)
plt.xticks(rotation='vertical')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Most Common Words in Spam Messages')
plt.show()

## Barplot for most common words in ham messages

In [None]:
# Extract words and counts separately
word_counts = Counter(ham_corpus.split()).most_common(30)
words = [word_count[0] for word_count in word_counts]
counts = [word_count[1] for word_count in word_counts]

# Plot barplot for most common words in ham messages
plt.figure(figsize=(10, 6))
sns.barplot(x=words, y=counts)
plt.xticks(rotation='vertical')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Most Common Words in Ham Messages')
plt.show()

## Text Vectorization using Bag of Words

In [None]:
# Initialize CountVectorizer
count_vectorizer = CountVectorizer()

# Fit and transform the preprocessed text data
X_bow = count_vectorizer.fit_transform(df['transformed_text'])

# Display the shape of the resulting matrix
print("Shape of Bag of Words matrix:", X_bow.shape)
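
To show what the bag-of-words matrix contains, here is a rough pure-Python sketch of the same idea on two made-up messages. It mimics `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary) but is only an illustration, not the library's implementation.

```python
import re
from collections import Counter

# Made-up messages for illustration
docs = ["Free prize! Claim your FREE prize now", "See you at lunch"]

# Same token pattern as CountVectorizer's default: 2+ word characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary term
bow_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row is one message and each column one vocabulary term, which is the shape reported for `X_bow` (messages by distinct terms).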

# **Model Building**

## Split the data into train and test sets

In [None]:
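# Note: this splits the raw 'text' column, not 'transformed_text'; TfidfVectorizer does its own lowercasing and tokenization
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], test_size=0.2, random_state=42)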

## Initialize TF-IDF vectorizer

In [None]:
tfidf_vectorizer = TfidfVectorizer()

## Fit and transform the training data

In [None]:
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

## Transform the test data (using only transform, not fit_transform)

In [None]:
X_test_tfidf = tfidf_vectorizer.transform(X_test)
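
For reference, with its default settings (`smooth_idf=True`, `norm='l2'`), scikit-learn's `TfidfVectorizer` weights a term $t$ in a document $d$ as sketched below, where $n$ is the number of training documents, $\mathrm{df}(t)$ is the number of training documents containing $t$, and $v_d$ is the vector of weights for document $d$. Because $n$ and $\mathrm{df}(t)$ come from the fitted data, the test set is only passed through `transform`.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t),
\qquad
v_d \leftarrow \frac{v_d}{\lVert v_d \rVert_2}
```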

## **Model Building and Evaluation using Naive Bayes Multinomial**

## Initialize and fit MultinomialNB model

In [None]:
mnb = MultinomialNB()
mnb.fit(X_train_tfidf, y_train)

## Predict

In [None]:
y_pred_mnb = mnb.predict(X_test_tfidf)

## Evaluate

In [None]:
accuracy_mnb = accuracy_score(y_test, y_pred_mnb)
precision_mnb = precision_score(y_test, y_pred_mnb)
confusion_mat_mnb = confusion_matrix(y_test, y_pred_mnb)

print("\n--- Naive Bayes Multinomial Classifier ---")
print("Accuracy:", accuracy_mnb)
print("Precision:", precision_mnb)
print("Confusion Matrix:")
print(confusion_mat_mnb)
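
The roadmap also lists recall and F1-score, which the cell above does not compute. As a sketch of how they follow from a confusion matrix in scikit-learn's layout (rows are true labels, columns are predicted labels), the snippet below derives them by hand. The matrix values are placeholders, not results from this notebook.

```python
# Placeholder confusion matrix in scikit-learn's layout:
# [[TN, FP],
#  [FN, TP]]
confusion = [[950, 10],
             [30, 120]]

(tn, fp), (fn, tp) = confusion

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With scikit-learn, `recall_score` and `f1_score` from `sklearn.metrics` take the same `(y_test, y_pred)` arguments as `precision_score`.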

## **Model Building and Evaluation using Decision Tree (J48)**

## Initialize and fit Decision Tree Classifier

In [None]:
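# Note: scikit-learn's DecisionTreeClassifier is a CART-style tree; "J48" (Weka's C4.5) is kept here only as a label
j48 = DecisionTreeClassifier()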
j48.fit(X_train_tfidf, y_train)

## Predict

In [None]:
y_pred_j48 = j48.predict(X_test_tfidf)

## Evaluate

In [None]:
accuracy_j48 = accuracy_score(y_test, y_pred_j48)
precision_j48 = precision_score(y_test, y_pred_j48)
confusion_mat_j48 = confusion_matrix(y_test, y_pred_j48)

print("\n--- Decision Tree Classifier (J48) ---")
print("Accuracy:", accuracy_j48)
print("Precision:", precision_j48)
print("Confusion Matrix:")
print(confusion_mat_j48)

## **Model Building and Evaluation using Logistic Regression**

## Initialize Logistic Regression model

In [None]:
logistic_regression = LogisticRegression()

## Fit the model using TF-IDF vectorized training data

In [None]:
logistic_regression.fit(X_train_tfidf, y_train)

## Predict on the TF-IDF vectorized test data

In [None]:
y_pred_lr = logistic_regression.predict(X_test_tfidf)

## Evaluate the model

In [None]:
accuracy_lr = accuracy_score(y_test, y_pred_lr)
precision_lr = precision_score(y_test, y_pred_lr)
confusion_mat_lr = confusion_matrix(y_test, y_pred_lr)

print("\n--- Logistic Regression Classifier ---")
print("Accuracy:", accuracy_lr)
print("Precision:", precision_lr)
print("Confusion Matrix:")
print(confusion_mat_lr)

## **Model Building and Evaluation using Support Vector Machine (SVM)**

## Initialize SVM model

In [None]:
svm = SVC()

## Fit the model using TF-IDF vectorized training data

In [None]:
svm.fit(X_train_tfidf, y_train)

## Predict on the TF-IDF vectorized test data

In [None]:
y_pred_svm = svm.predict(X_test_tfidf)

## Evaluate the model

In [None]:
accuracy_svm = accuracy_score(y_test, y_pred_svm)
precision_svm = precision_score(y_test, y_pred_svm)
confusion_mat_svm = confusion_matrix(y_test, y_pred_svm)

print("\n--- Support Vector Machine (SVM) Classifier ---")
print("Accuracy:", accuracy_svm)
print("Precision:", precision_svm)
print("Confusion Matrix:")
print(confusion_mat_svm)

## **Model Building and Evaluation using K-Nearest Neighbors (KNN)**

## Initialize KNN model

In [None]:
knn = KNeighborsClassifier()

## Fit the model using TF-IDF vectorized training data

In [None]:
knn.fit(X_train_tfidf, y_train)

## Predict on the TF-IDF vectorized test data

In [None]:
y_pred_knn = knn.predict(X_test_tfidf)

## Evaluate the model

In [None]:
accuracy_knn = accuracy_score(y_test, y_pred_knn)
precision_knn = precision_score(y_test, y_pred_knn)
confusion_mat_knn = confusion_matrix(y_test, y_pred_knn)

print("\n--- K-Nearest Neighbors (KNN) Classifier ---")
print("Accuracy:", accuracy_knn)
print("Precision:", precision_knn)
print("Confusion Matrix:")
print(confusion_mat_knn)

# **Model Comparison**

## List of models, accuracies, and precisions

In [None]:
models = ['MultinomialNB', 'DecisionTree', 'LogisticRegression', 'SVC', 'KNeighborsClassifier' ]
accuracies = [accuracy_mnb, accuracy_j48, accuracy_lr, accuracy_svm, accuracy_knn]
precisions = [precision_mnb, precision_j48, precision_lr, precision_svm, precision_knn]
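
With the scores collected in parallel lists, one simple way to pick a candidate is to rank by precision first and accuracy second, on the assumption that a legitimate message flagged as spam costs more than a missed spam. The scores and variable names below are placeholders, not results from this notebook.

```python
# Placeholder scores for illustration only
model_names = ["MultinomialNB", "DecisionTree", "LogisticRegression"]
toy_accuracies = [0.96, 0.95, 0.95]
toy_precisions = [1.00, 0.85, 0.97]

# Rank by precision first, then accuracy, both descending
ranking = sorted(
    zip(model_names, toy_accuracies, toy_precisions),
    key=lambda row: (row[2], row[1]),
    reverse=True,
)

for name, acc, prec in ranking:
    print(f"{name:<20} accuracy={acc:.2f} precision={prec:.2f}")
```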

## **Plot accuracies**

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(x=models, y=accuracies, hue=models, palette='viridis')
plt.title('Model Accuracies')
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.legend(title='Model')
plt.xticks(rotation=45)
plt.show()

## **Plot precisions**

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(x=models, y=precisions, hue=models, palette='magma')
plt.title('Model Precisions')
plt.xlabel('Model')
plt.ylabel('Precision')
plt.legend(title='Model')
plt.xticks(rotation=45)
plt.show()

## **Plot confusion matrices**

## Multinomial Naive Bayes

In [None]:
plt.figure(figsize=(25, 5))
plt.subplot(1, 3, 1)
sns.heatmap(confusion_mat_mnb, annot=True, cmap='Blues', fmt='d', cbar=False)
plt.title('Confusion Matrix - Multinomial Naive Bayes')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')

## Decision Tree Classifier

In [None]:
plt.figure(figsize=(25, 5))
plt.subplot(1, 3, 2)
sns.heatmap(confusion_mat_j48, annot=True, cmap='Greens', fmt='d', cbar=False)
plt.title('Confusion Matrix - Decision Tree Classifier')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')

## Logistic Regression

In [None]:
plt.figure(figsize=(25, 5))
plt.subplot(1, 3, 3)
sns.heatmap(confusion_mat_lr, annot=True, cmap='Reds', fmt='d', cbar=False)
plt.title('Confusion Matrix - Logistic Regression')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.tight_layout()
plt.show()

## SVM

In [None]:
plt.figure(figsize=(25, 5))
plt.subplot(1, 3, 2)
sns.heatmap(confusion_mat_svm, annot=True, cmap='Blues', fmt='d', cbar=False, square=True)
plt.title('Confusion Matrix - SVM')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')

## KNN

In [None]:
plt.figure(figsize=(25, 5))
plt.subplot(1, 3, 3)
sns.heatmap(confusion_mat_knn, annot=True, cmap='Blues', fmt='d', cbar=False, square=True)
plt.title('Confusion Matrix - KNN')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')

# **Model Improvement and Ensemble Methods**

## Change max_features parameter of TfidfVectorizer

In [None]:
tfidf_vectorizer_max_ft_3000 = TfidfVectorizer(max_features=3000)
X_train_tfidf_max_ft_3000 = tfidf_vectorizer_max_ft_3000.fit_transform(X_train)
X_test_tfidf_max_ft_3000 = tfidf_vectorizer_max_ft_3000.transform(X_test)

## Voting Classifier

In [None]:
svc = SVC(kernel='sigmoid', gamma=1.0, probability=True)
etc = ExtraTreesClassifier(n_estimators=50, random_state=2)

voting_mnb = VotingClassifier(estimators=[('svm', svc), ('et', etc), ('mnb', mnb)], voting='soft')
voting_mnb.fit(X_train_tfidf_max_ft_3000, y_train)
y_pred_voting_mnb = voting_mnb.predict(X_test_tfidf_max_ft_3000)
accuracy_voting_mnb = accuracy_score(y_test, y_pred_voting_mnb)
precision_voting_mnb = precision_score(y_test, y_pred_voting_mnb)

print("Voting Classifier Accuracy (MNB):", accuracy_voting_mnb)
print("Voting Classifier Precision (MNB):", precision_voting_mnb)

## Stacking

In [None]:
clf_mnb = StackingClassifier(estimators=[('svm', svc), ('et', etc)], final_estimator=RandomForestClassifier())
clf_mnb.fit(X_train_tfidf_max_ft_3000, y_train)
y_pred_stacking_mnb = clf_mnb.predict(X_test_tfidf_max_ft_3000)
accuracy_stacking_mnb = accuracy_score(y_test, y_pred_stacking_mnb)
precision_stacking_mnb = precision_score(y_test, y_pred_stacking_mnb)

print("Stacking Classifier Accuracy:", accuracy_stacking_mnb)
print("Stacking Classifier Precision:", precision_stacking_mnb)
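
To make the `voting='soft'` setting used for the voting classifier concrete: soft voting averages each estimator's predicted class probabilities and picks the class with the highest mean, whereas hard voting takes a majority over predicted labels. The probabilities below are made-up values for a single message, and the helper names are illustrative.

```python
from collections import Counter

# Made-up [P(ham), P(spam)] outputs from three estimators for one message
probabilities = {
    "svm": [0.55, 0.45],
    "et": [0.60, 0.40],
    "mnb": [0.10, 0.90],
}

def soft_vote(probs):
    """Average the class probabilities and return the index of the largest mean."""
    n_classes = len(next(iter(probs.values())))
    means = [sum(p[c] for p in probs.values()) / len(probs) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: means[c])

def hard_vote(probs):
    """Let each estimator vote for its most probable class and return the majority."""
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in probs.values()]
    return Counter(votes).most_common(1)[0][0]

soft_label = soft_vote(probabilities)
hard_label = hard_vote(probabilities)
print("soft:", soft_label, "hard:", hard_label)
```

In this example the two schemes disagree: one very confident estimator outweighs two lukewarm ones under soft voting, while hard voting follows the two-to-one majority.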

# **Hyperparameter Tuning**

## Multinomial Naive Bayes classifier

In [None]:
# Define the range of alpha values to search
alpha_values = [0.1, 0.5, 1.0, 1.5, 2.0]

# Create a parameter grid
param_grid = {'alpha': alpha_values}

# Initialize Multinomial Naive Bayes classifier
mnb = MultinomialNB()

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=mnb, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit GridSearchCV to the training data
grid_search.fit(X_train_tfidf, y_train)

# Get the best hyperparameters
best_alpha = grid_search.best_params_['alpha']
print("Best alpha:", best_alpha)
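
One caveat with the search above: the TF-IDF vocabulary was fitted on the whole training set before cross-validation, so each validation fold has already influenced the features. A common alternative, sketched here on invented toy data, is to put the vectorizer and classifier in a `Pipeline` and search over that, so the vectorizer is refitted inside every fold. The step names `tfidf` and `clf` are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data (1 = spam, 0 = ham)
toy_texts = [
    "free prize claim now", "win cash free entry", "urgent free reward call",
    "claim your free ringtone", "lunch at noon today", "see you at home",
    "call me after class", "running late sorry",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
toy_param_grid = {"clf__alpha": [0.1, 0.5, 1.0]}

# cv=2 only because the toy data set is tiny
toy_search = GridSearchCV(pipeline, toy_param_grid, cv=2, scoring="accuracy")
toy_search.fit(toy_texts, toy_labels)

print("Best alpha:", toy_search.best_params_["clf__alpha"])
```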

In [None]:
# Retrain the Multinomial Naive Bayes classifier with the best alpha value
mnb_best = MultinomialNB(alpha=best_alpha)
mnb_best.fit(X_train_tfidf, y_train)

# Predict using the tuned classifier
y_pred_mnb_tuned = mnb_best.predict(X_test_tfidf)

# Evaluate the tuned classifier
accuracy_mnb_tuned = accuracy_score(y_test, y_pred_mnb_tuned)
precision_mnb_tuned = precision_score(y_test, y_pred_mnb_tuned)
confusion_mat_mnb_tuned = confusion_matrix(y_test, y_pred_mnb_tuned)

print("Tuned Multinomial Naive Bayes Classifier:")
print("Accuracy:", accuracy_mnb_tuned)
print("Precision:", precision_mnb_tuned)
print("Confusion Matrix:")
print(confusion_mat_mnb_tuned)

## Decision Tree, Logistic Regression, SVM, KNN

In [None]:
# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the training data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data (using only transform, not fit_transform)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Define hyperparameters for each model
param_grid_decision_tree = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

param_grid_logistic_regression = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l2']
}

param_grid_svm = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']
}

param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

# Fine-tune Decision Tree
grid_search_decision_tree = GridSearchCV(DecisionTreeClassifier(), param_grid_decision_tree, cv=5)
grid_search_decision_tree.fit(X_train_tfidf, y_train)
best_decision_tree = grid_search_decision_tree.best_estimator_

# Fine-tune Logistic Regression
grid_search_logistic_regression = GridSearchCV(LogisticRegression(), param_grid_logistic_regression, cv=5)
grid_search_logistic_regression.fit(X_train_tfidf, y_train)
best_logistic_regression = grid_search_logistic_regression.best_estimator_

# Fine-tune SVM
grid_search_svm = GridSearchCV(SVC(), param_grid_svm, cv=5)
grid_search_svm.fit(X_train_tfidf, y_train)
best_svm = grid_search_svm.best_estimator_

# Fine-tune KNN
grid_search_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=5)
grid_search_knn.fit(X_train_tfidf, y_train)
best_knn = grid_search_knn.best_estimator_

# Test the tuned models on the test data
models = {
    'Decision Tree': best_decision_tree,
    'Logistic Regression': best_logistic_regression,
    'SVM': best_svm,
    'KNN': best_knn
}

for name, model in models.items():
    y_pred = model.predict(X_test_tfidf)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    confusion_mat = confusion_matrix(y_test, y_pred)
    
    print("\n---", name, "---")
    print("Accuracy:", accuracy)
    print("Precision:", precision)
    print("Confusion Matrix:")
    print(confusion_mat)

## Tuning Decision Tree Again for Accuracy

In [None]:
# Initialize Decision Tree classifier with adjusted hyperparameters
dt_classifier = DecisionTreeClassifier(max_depth=10, min_samples_split=5, min_samples_leaf=2)

# Train the Decision Tree classifier on the training data
dt_classifier.fit(X_train_tfidf, y_train)

# Predict using the trained Decision Tree classifier
y_pred_dt = dt_classifier.predict(X_test_tfidf)

# Evaluate the Decision Tree classifier
accuracy_dt = accuracy_score(y_test, y_pred_dt)
precision_dt = precision_score(y_test, y_pred_dt)
confusion_mat_dt = confusion_matrix(y_test, y_pred_dt)

# Print evaluation metrics
print("Decision Tree Classifier (Tuned):")
print("Accuracy:", accuracy_dt)
print("Precision:", precision_dt)
print("Confusion Matrix:")
print(confusion_mat_dt)

# **Testing**

In [None]:
# Transform the message using the TF-IDF vectorizer
message_tfidf = tfidf_vectorizer.transform(["FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"])

# Initialize models
dt_classifier = DecisionTreeClassifier(**best_decision_tree.get_params())
mnb_classifier = MultinomialNB(**grid_search.best_params_)
lr_classifier = LogisticRegression(**best_logistic_regression.get_params())
svm_classifier = SVC(**best_svm.get_params())
knn_classifier = KNeighborsClassifier(**best_knn.get_params())

# Fit the freshly initialized models on the training data (nothing is loaded from disk)
dt_classifier.fit(X_train_tfidf, y_train)
mnb_classifier.fit(X_train_tfidf, y_train)
lr_classifier.fit(X_train_tfidf, y_train)
svm_classifier.fit(X_train_tfidf, y_train)
knn_classifier.fit(X_train_tfidf, y_train)

# Predict using all models
y_pred_dt = dt_classifier.predict(message_tfidf)
y_pred_mnb = mnb_classifier.predict(message_tfidf)
y_pred_lr = lr_classifier.predict(message_tfidf)
y_pred_svm = svm_classifier.predict(message_tfidf)
y_pred_knn = knn_classifier.predict(message_tfidf)

# Print predictions
print("Decision Tree:", "Spam" if y_pred_dt[0] == 1 else "Not Spam")
print("Multinomial Naive Bayes:", "Spam" if y_pred_mnb[0] == 1 else "Not Spam")
print("Logistic Regression:", "Spam" if y_pred_lr[0] == 1 else "Not Spam")
print("Support Vector Machine:", "Spam" if y_pred_svm[0] == 1 else "Not Spam")
print("K-Nearest Neighbors:", "Spam" if y_pred_knn[0] == 1 else "Not Spam")


In [None]:
# Transform the message using the TF-IDF vectorizer
message_tfidf = tfidf_vectorizer.transform(["Thanks for your subscription to Ringtone UK your mobile will be charged £5/month Please confirm by replying YES or NO. If you reply NO you will not be charged"])

# Initialize models
dt_classifier = DecisionTreeClassifier(**best_decision_tree.get_params())
mnb_classifier = MultinomialNB(**grid_search.best_params_)
lr_classifier = LogisticRegression(**best_logistic_regression.get_params())
svm_classifier = SVC(**best_svm.get_params())
knn_classifier = KNeighborsClassifier(**best_knn.get_params())

# Fit the freshly initialized models on the training data (nothing is loaded from disk)
dt_classifier.fit(X_train_tfidf, y_train)
mnb_classifier.fit(X_train_tfidf, y_train)
lr_classifier.fit(X_train_tfidf, y_train)
svm_classifier.fit(X_train_tfidf, y_train)
knn_classifier.fit(X_train_tfidf, y_train)

# Predict using all models
y_pred_dt = dt_classifier.predict(message_tfidf)
y_pred_mnb = mnb_classifier.predict(message_tfidf)
y_pred_lr = lr_classifier.predict(message_tfidf)
y_pred_svm = svm_classifier.predict(message_tfidf)
y_pred_knn = knn_classifier.predict(message_tfidf)

# Print predictions
print("Decision Tree:", "Spam" if y_pred_dt[0] == 1 else "Not Spam")
print("Multinomial Naive Bayes:", "Spam" if y_pred_mnb[0] == 1 else "Not Spam")
print("Logistic Regression:", "Spam" if y_pred_lr[0] == 1 else "Not Spam")
print("Support Vector Machine:", "Spam" if y_pred_svm[0] == 1 else "Not Spam")
print("K-Nearest Neighbors:", "Spam" if y_pred_knn[0] == 1 else "Not Spam")


In [None]:
# Transform the message using the TF-IDF vectorizer
message_tfidf = tfidf_vectorizer.transform(["I HAVE A DATE ON SUNDAY WITH WILL!!"])

# Initialize models
dt_classifier = DecisionTreeClassifier(**best_decision_tree.get_params())
mnb_classifier = MultinomialNB(**grid_search.best_params_)
lr_classifier = LogisticRegression(**best_logistic_regression.get_params())
svm_classifier = SVC(**best_svm.get_params())
knn_classifier = KNeighborsClassifier(**best_knn.get_params())

# Fit the freshly initialized models on the training data (nothing is loaded from disk)
dt_classifier.fit(X_train_tfidf, y_train)
mnb_classifier.fit(X_train_tfidf, y_train)
lr_classifier.fit(X_train_tfidf, y_train)
svm_classifier.fit(X_train_tfidf, y_train)
knn_classifier.fit(X_train_tfidf, y_train)

# Predict using all models
y_pred_dt = dt_classifier.predict(message_tfidf)
y_pred_mnb = mnb_classifier.predict(message_tfidf)
y_pred_lr = lr_classifier.predict(message_tfidf)
y_pred_svm = svm_classifier.predict(message_tfidf)
y_pred_knn = knn_classifier.predict(message_tfidf)

# Print predictions
print("Decision Tree:", "Spam" if y_pred_dt[0] == 1 else "Not Spam")
print("Multinomial Naive Bayes:", "Spam" if y_pred_mnb[0] == 1 else "Not Spam")
print("Logistic Regression:", "Spam" if y_pred_lr[0] == 1 else "Not Spam")
print("Support Vector Machine:", "Spam" if y_pred_svm[0] == 1 else "Not Spam")
print("K-Nearest Neighbors:", "Spam" if y_pred_knn[0] == 1 else "Not Spam")
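
The three test cells above repeat the same steps for each message. As a sketch of how that could be factored out, the hypothetical helper `classify_message` below takes a fitted vectorizer and a dictionary of fitted models. It is demonstrated on invented toy data rather than on the notebook's tuned models.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

def classify_message(message, vectorizer, fitted_models):
    """Return {model name: 'Spam' or 'Not Spam'} for a single raw message."""
    features = vectorizer.transform([message])
    return {
        name: "Spam" if model.predict(features)[0] == 1 else "Not Spam"
        for name, model in fitted_models.items()
    }

# Invented toy data (1 = spam, 0 = ham)
toy_texts = ["free prize claim now", "win free cash now",
             "see you at lunch", "call me after class"]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_features = toy_vectorizer.fit_transform(toy_texts)
toy_models = {
    "Multinomial Naive Bayes": MultinomialNB().fit(toy_features, toy_labels),
    "Logistic Regression": LogisticRegression().fit(toy_features, toy_labels),
}

toy_result = classify_message("claim your free prize", toy_vectorizer, toy_models)
for name, verdict in toy_result.items():
    print(f"{name}: {verdict}")
```

In the notebook, the tuned estimators taken from `best_estimator_` are already refitted on the training data by `GridSearchCV` (its default `refit=True`), so they could be passed to such a helper directly without being re-created and fitted again.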


# **Literature Review**

Previous research in email classification has demonstrated the efficacy of machine learning techniques in accurately categorizing emails into spam and non-spam classes. Several studies have investigated various algorithms, feature engineering methods, and datasets to improve classification accuracy and robustness. Challenges in email classification include handling imbalanced datasets, selecting appropriate features, and interpreting model predictions for practical use. This project builds upon existing literature to develop an integrated framework for email classification, encompassing data preprocessing, feature selection, model training, and evaluation to advance the field of email filtering and spam detection.

# **Conclusion**

- In conclusion, this project highlights the potential of machine learning algorithms in email classification tasks. Through rigorous data preprocessing, feature engineering, model training, and evaluation, we have explored diverse approaches to accurately distinguish between spam and non-spam emails. Our study underscores the significance of efficient email filtering systems in enhancing user experience and security in email communication.

- Evaluation of multiple machine learning models, including Multinomial Naive Bayes, Logistic Regression, Support Vector Machine, Decision Tree, and K-Nearest Neighbors, showed that they differ in accuracy and precision, the two metrics computed in this notebook. While each model exhibited promising results, further optimization and fine-tuning, along with recall and F1-score measurements, are needed to judge their effectiveness and generalization across different email datasets and real-world scenarios.

- Furthermore, the confusion matrices and the bar charts of accuracy and precision have provided valuable insights into model behavior. These visual aids make model predictions easier to interpret, which can help email service providers implement effective spam filtering mechanisms.

Moving forward, future research directions may include:

- Exploration of ensemble learning techniques and deep learning architectures for improved email classification performance.
- Integration of domain-specific knowledge and additional email features to enhance model interpretability and accuracy.
- Conducting large-scale studies and real-world evaluations to validate the effectiveness and scalability of developed models.
- Investigation of novel approaches incorporating natural language processing, user behavior analysis, and network traffic analysis for more comprehensive email classification systems.
  
Overall, this project contributes to the ongoing efforts in leveraging machine learning for email classification and spam detection. By harnessing the power of data-driven approaches, we aim to advance email filtering systems, mitigate the impact of spam emails, and enhance user productivity and security in digital communication channels.