# SPAM DETECTOR

This spam detector trains a Multinomial Naive Bayes classifier on TF-IDF features to label emails as spam or legitimate ("ham") based on their text content, providing a practical tool for filtering unwanted or potentially harmful messages.

### Importing Libraries


In [47]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
from sklearn.utils import resample
import re

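This cell imports pandas for data handling, `re` for regular-expression cleaning, and the scikit-learn components used for TF-IDF vectorization, resampling, train-test splitting, Naive Bayes classification, and evaluation.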

### Data Loading

In [48]:
data = pd.read_csv("emails_dataset.csv")

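The email dataset is read from `emails_dataset.csv` into a pandas DataFrame named `data` with `pd.read_csv()`. The later cells expect two columns: `Message`, holding the email text, and `Classification`, holding the label `'spam'` or `'ham'`.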
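Before cleaning, it can help to check for missing messages and to look at the class balance, since the regular-expression step in the next cell raises a `TypeError` on missing (NaN) values. The following is a minimal sketch on a made-up inline CSV; `csv_text` and `toy_data` are illustrative stand-ins, not the contents of `emails_dataset.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for emails_dataset.csv (one row has an empty message).
csv_text = """Classification,Message
ham,See you at lunch
spam,WIN a FREE prize now
ham,
spam,Claim your reward
"""

toy_data = pd.read_csv(io.StringIO(csv_text))

# Count rows whose message is missing, then drop them.
n_missing = toy_data["Message"].isna().sum()
toy_data = toy_data.dropna(subset=["Message"])

# Inspect how many examples each class has.
class_counts = toy_data["Classification"].value_counts()
print("missing messages:", n_missing)
print(class_counts)
```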

### Data Cleaning and Preprocessing

In [49]:
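# Lowercase every message so that "FREE" and "free" are treated alike
data['Message'] = data['Message'].str.lower()
# Replace anything other than letters, digits, and single quotes with a space
data['Message'] = data['Message'].apply(lambda x: re.sub(r"[^a-zA-Z0-9']", ' ', x))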

This section cleans and preprocesses the text in the `Message` column. It converts the text to lowercase and then replaces every character other than letters, digits, and single quotes with a space, so punctuation and symbols become word separators rather than being deleted outright.
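To make the effect of the two cleaning steps concrete, here is a small sketch that applies them to a single made-up string. The helper name `clean_message` and the sample text are illustrative and do not appear in the notebook.

```python
import re

def clean_message(text):
    """Lowercase the text, then replace every character that is not a
    letter, digit, or single quote with a space."""
    return re.sub(r"[^a-z0-9']", " ", text.lower())

sample = "WIN $1,000 now!!! Don't miss out."
cleaned = clean_message(sample)

# Splitting on whitespace shows the tokens that survive the cleaning.
print(cleaned.split())
```

Note that the comma in `1,000` is replaced by a space, so the amount is split into two separate tokens.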

### Feature Engineering

In [50]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data['Message'])

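Feature engineering converts the text into a numerical format using TF-IDF vectorization. Calling `fit_transform` on the `TfidfVectorizer` learns the vocabulary and inverse-document-frequency weights from all messages and returns a sparse TF-IDF matrix (`X`) with one row per message and one column per vocabulary term.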
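The shape of the resulting matrix is easier to see on a tiny corpus. The sketch below fits a `TfidfVectorizer` with default settings on three made-up messages; `toy_messages`, `toy_vectorizer`, and `toy_X` are illustrative names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_messages = [
    "free prize claim now",
    "meeting moved to friday",
    "claim your free prize",
]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_messages)

# One row per message, one column per distinct term in the corpus.
print(toy_X.shape)
# The learned vocabulary maps each term to its column index.
print(sorted(toy_vectorizer.vocabulary_))
```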

### Oversampling for Handling Imbalanced Classes

In [51]:
spam = data[data['Classification'] == 'spam']
ham = data[data['Classification'] == 'ham']
spam_upsampled = resample(spam, replace=True, n_samples=len(ham), random_state=42)
data_resampled = pd.concat([ham, spam_upsampled])

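To handle imbalanced classes, this section oversamples the minority class (`'spam'`) until it has as many rows as the majority class (`'ham'`). Because `resample` draws with replacement, the upsampled set contains repeated copies of the original spam messages. The upsampled spam rows are then concatenated with the original ham rows.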
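The same oversampling step can be tried on a toy DataFrame to confirm that the classes end up balanced and that the extra spam rows are duplicates. The data and the `toy_*` names below are made up for illustration.

```python
import pandas as pd
from sklearn.utils import resample

toy = pd.DataFrame({
    "Message": ["hi", "lunch?", "see you", "ok", "win cash", "free prize"],
    "Classification": ["ham", "ham", "ham", "ham", "spam", "spam"],
})

toy_spam = toy[toy["Classification"] == "spam"]
toy_ham = toy[toy["Classification"] == "ham"]

# Draw spam rows with replacement until there are as many as ham rows.
toy_spam_up = resample(toy_spam, replace=True, n_samples=len(toy_ham), random_state=42)
toy_balanced = pd.concat([toy_ham, toy_spam_up])

counts = toy_balanced["Classification"].value_counts()
n_duplicates = toy_balanced.duplicated().sum()
print(counts)
print("duplicated rows:", n_duplicates)
```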

### Train-Test Split

In [52]:
X_train, X_test, y_train, y_test = train_test_split(
    vectorizer.transform(data_resampled['Message']),
    data_resampled['Classification'],
    test_size=0.20,
    random_state=42
)

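The resampled dataset is split into training (80%) and testing (20%) sets using the `train_test_split` function. Note that the matrix `X` computed earlier is not used here: the messages in `data_resampled` are transformed again with the already fitted `vectorizer`, and the result is split along with the corresponding classifications (`'spam'` or `'ham'`). The split happens after oversampling, so it operates on the balanced data.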
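Because the notebook oversamples before splitting, copies of the same spam message can land in both the training and the test set. One common way to avoid this is to split first and oversample only the training portion, fitting the vectorizer on the training text alone. The sketch below illustrates that ordering on a made-up toy DataFrame; `toy`, `train_df`, `test_df`, and `split_vectorizer` are illustrative names, and the metrics reported in this notebook were not produced this way.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data: 16 ham messages and 4 spam messages, all unique.
toy = pd.DataFrame({
    "Message": [f"ham message number {i}" for i in range(16)]
               + [f"spam offer number {i}" for i in range(4)],
    "Classification": ["ham"] * 16 + ["spam"] * 4,
})

# 1. Split first, keeping the class ratio in both parts.
train_df, test_df = train_test_split(
    toy, test_size=0.25, stratify=toy["Classification"], random_state=42
)

# 2. Oversample spam in the training part only.
train_spam = train_df[train_df["Classification"] == "spam"]
train_ham = train_df[train_df["Classification"] == "ham"]
train_spam_up = resample(train_spam, replace=True, n_samples=len(train_ham), random_state=42)
train_balanced = pd.concat([train_ham, train_spam_up])

# 3. Fit TF-IDF on training text, then reuse it for the test text.
split_vectorizer = TfidfVectorizer()
X_train_toy = split_vectorizer.fit_transform(train_balanced["Message"])
X_test_toy = split_vectorizer.transform(test_df["Message"])

# No test message should also appear in the training data.
overlap = set(train_balanced["Message"]) & set(test_df["Message"])
print("shared messages:", len(overlap))
```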

### Model Training

In [53]:
clf = MultinomialNB()
clf.fit(X_train, y_train)

A Multinomial Naive Bayes classifier (`MultinomialNB`) is trained on the training data (`X_train` and `y_train`) using the `fit` method.
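As an alternative arrangement, the vectorizer and the classifier can be bundled into a single scikit-learn pipeline so that raw text goes in and a label comes out. This is a sketch on made-up training messages, not the notebook's own setup; `toy_model`, `toy_messages`, and `toy_labels` are illustrative names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

toy_messages = [
    "win free cash now",
    "claim your free prize today",
    "free prize waiting claim now",
    "are we still meeting for lunch",
    "please review the attached report",
    "see you at the office tomorrow",
]
toy_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The pipeline applies TF-IDF and then Naive Bayes in one fit/predict call.
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_messages, toy_labels)

toy_prediction = toy_model.predict(["free cash prize"])[0]
print(toy_prediction)
```

Keeping both steps in one object means the same vectorizer is always applied to new text before prediction.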

### Model Evaluation

In [54]:
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9844527363184079
              precision    recall  f1-score   support

         ham       0.99      0.98      0.98       797
        spam       0.98      0.99      0.98       811

    accuracy                           0.98      1608
   macro avg       0.98      0.98      0.98      1608
weighted avg       0.98      0.98      0.98      1608



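The trained model is evaluated on the test set of 1,608 messages, where it reaches an accuracy of about 98.4%, with precision and recall between 0.98 and 0.99 for both classes. Because the spam class was oversampled before the train-test split, copies of the same spam message can appear in both the training and the test set, so these figures are likely optimistic compared with performance on unseen emails.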
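A confusion matrix complements the classification report by showing the raw counts behind precision and recall. The sketch below uses hand-written toy labels rather than the notebook's `y_test` and `y_pred`, and treats `'spam'` as the positive class.

```python
from sklearn.metrics import confusion_matrix

y_true_toy = ["ham", "ham", "ham", "spam", "spam", "spam", "spam", "ham"]
y_pred_toy = ["ham", "spam", "ham", "spam", "spam", "ham", "spam", "ham"]

# Rows are true classes, columns are predicted classes, in the given label order.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=["ham", "spam"])
tn, fp, fn, tp = cm.ravel()

spam_precision = tp / (tp + fp)  # share of predicted spam that really is spam
spam_recall = tp / (tp + fn)     # share of real spam that was caught
print(cm)
print("precision:", spam_precision, "recall:", spam_recall)
```

In a spam filter, false positives (legitimate mail marked as spam) are often the costlier error, which is why it is worth looking at these counts and not only at accuracy.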

### Manual Model Testing

In [55]:
msg = vectorizer.transform(["Dear Lucky Beneciary,You have been selected to receive the sum of “€1,000,000.00” as charity dona-tions/aid from the Qatar Foundation, on the 20th of June 2016. Contact  Mr. Rashid Al-Naimi through e-mail for more information: rashidalnai@gmail.com"])
prediction = clf.predict(msg)
print("The email is :", prediction[0])

The email is : spam


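Finally, a manual test is performed on a sample email message. The raw message is passed directly to the fitted TF-IDF vectorizer without the earlier cleaning step; by default `TfidfVectorizer` lowercases its input and treats punctuation as token separators, so the text is still normalized in a similar way. The trained classifier then predicts whether the message is `'spam'` or `'ham'`, and the result is printed.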
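For repeated manual checks, the transform-and-predict steps can be wrapped in a small helper that also reports class probabilities through `predict_proba`. The function `classify_email` and the `demo_*` objects below are hypothetical and trained on made-up messages; with the notebook's objects, the call would take `vectorizer` and `clf` as arguments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def classify_email(text, vectorizer, clf):
    """Return the predicted label and a {class: probability} dict for one message."""
    features = vectorizer.transform([text])
    label = clf.predict(features)[0]
    probabilities = dict(zip(clf.classes_, clf.predict_proba(features)[0]))
    return label, probabilities

# Toy training data, for illustration only.
demo_messages = [
    "claim your free prize now",
    "you have been selected for a cash prize",
    "free cash waiting claim today",
    "are we still meeting for lunch",
    "please review the attached report",
    "see you at the office tomorrow",
]
demo_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

demo_vectorizer = TfidfVectorizer()
demo_clf = MultinomialNB().fit(demo_vectorizer.fit_transform(demo_messages), demo_labels)

demo_label, demo_probs = classify_email(
    "You have been selected to claim a free prize", demo_vectorizer, demo_clf
)
print("The email is:", demo_label)
print(demo_probs)
```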