<a href="https://colab.research.google.com/github/kamranakhter/Spam-Email-Detection/blob/main/Machine_Learning_Based_Spam_Detection_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Step 1: Install Required Libraries

* Make sure the following libraries are installed. Uncomment the line below to install them:

In [None]:
# !pip install scikit-learn pandas

## Step 2: Import Libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
pd.set_option('display.max_colwidth', None)

In [5]:
# Mount Google Drive to load the dataset

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Step 3: Prepare the Dataset

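* This example loads `spam_ham_dataset.csv` from Google Drive. Each row holds the email text in `text` and a numeric label in `label_num` (0 = ham, 1 = spam). You can replace it with another labeled dataset, such as the SMS Spam Collection dataset (available on Kaggle).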

In [8]:
df = pd.read_csv('/content/drive/MyDrive/My Projects/Spam Email Detection/spam_ham_dataset.csv')

df.head()

Unnamed: 0.1,Unnamed: 0,label,text,label_num
0,605,ham,"Subject: enron methanol ; meter # : 988291\r\nthis is a follow up to the note i gave you on monday , 4 / 3 / 00 { preliminary\r\nflow data provided by daren } .\r\nplease override pop ' s daily volume { presently zero } to reflect daily\r\nactivity you can obtain from gas control .\r\nthis change is needed asap for economics purposes .",0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see attached file : hplnol 09 . xls )\r\n- hplnol 09 . xls",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re around to that most wonderful time of the year - - - neon leaders retreat time !\r\ni know that this time of year is extremely hectic , and that it ' s tough to think about anything past the holidays , but life does go on past the week of december 25 through january 1 , and that ' s what i ' d like you to think about for a minute .\r\non the calender that i handed out at the beginning of the fall semester , the retreat was scheduled for the weekend of january 5 - 6 . but because of a youth ministers conference that brad and dustin are connected with that week , we ' re going to change the date to the following weekend , january 12 - 13 . now comes the part you need to think about .\r\ni think we all agree that it ' s important for us to get together and have some time to recharge our batteries before we get to far into the spring semester , but it can be a lot of trouble and difficult for us to get away without kids , etc . so , brad came up with a potential alternative for how we can get together on that weekend , and then you can let me know which you prefer .\r\nthe first option would be to have a retreat similar to what we ' ve done the past several years . this year we could go to the heartland country inn ( www . . com ) outside of brenham . it ' s a nice place , where we ' d have a 13 - bedroom and a 5 - bedroom house side by side . it ' s in the country , real relaxing , but also close to brenham and only about one hour and 15 minutes from here . we can golf , shop in the antique and craft stores in brenham , eat dinner together at the ranch , and spend time with each other . we ' d meet on saturday , and then return on sunday morning , just like what we ' ve done in the past .\r\nthe second option would be to stay here in houston , have dinner together at a nice restaurant , and then have dessert and a time for visiting and recharging at one of our homes on that saturday evening . 
this might be easier , but the trade off would be that we wouldn ' t have as much time together . i ' ll let you decide .\r\nemail me back with what would be your preference , and of course if you ' re available on that weekend . the democratic process will prevail - - majority vote will rule ! let me hear from you as soon as possible , preferably by the end of the weekend . and if the vote doesn ' t go your way , no complaining allowed ( like i tend to do ! )\r\nhave a great weekend , great golf , great fishing , great shopping , or whatever makes you happy !\r\nbobby",0
3,4685,spam,"Subject: photoshop , windows , office . cheap . main trending\r\nabasements darer prudently fortuitous undergone\r\nlighthearted charm orinoco taster\r\nrailroad affluent pornographic cuvier\r\nirvin parkhouse blameworthy chlorophyll\r\nrobed diagrammatic fogarty clears bayda\r\ninconveniencing managing represented smartness hashish\r\nacademies shareholders unload badness\r\ndanielson pure caffein\r\nspaniard chargeable levin\r\n",1
4,2030,ham,"Subject: re : indian springs\r\nthis deal is to book the teco pvr revenue . it is my understanding that teco\r\njust sends us a check , i haven ' t received an answer as to whether there is a\r\npredermined price associated with this deal or if teco just lets us know what\r\nwe are giving . i can continue to chase this deal down if you need .",0


In [9]:
# Remove columns that are not needed for modeling

df.drop(columns=['Unnamed: 0', 'label'], inplace=True)

## Step 4: Preprocess the Data

We need to convert the text data into numerical features using CountVectorizer.

In [10]:
# Split the data into features (X) and labels (y)
X = df['text']
y = df['label_num']

# Convert text into numerical features using CountVectorizer
vectorizer = CountVectorizer()
X_vectorized = vectorizer.fit_transform(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=42)
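Note that the cell above fits the vectorizer on the full dataset before splitting, so words that occur only in the test emails still enter the vocabulary. A common alternative is to split the raw text first and let a pipeline fit the vectorizer on the training portion only. The sketch below illustrates that ordering on made-up data; the texts, labels, and variable names are assumptions for illustration, and it does not reproduce the numbers reported in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up texts and labels (1 = spam, 0 = ham), for illustration only
texts = [
    "win a free prize now", "cheap pills online", "claim your free offer",
    "free money click here", "meeting moved to friday", "see attached report",
    "lunch at noon tomorrow", "please review the invoice",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the vocabulary is learned from training data only
texts_train, texts_test, labels_train, labels_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits CountVectorizer and MultinomialNB on the training split
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(texts_train, labels_train)

# Test texts are only transformed, never used to build the vocabulary
test_predictions = pipeline.predict(texts_test)
print(len(texts_train), len(texts_test), test_predictions)
```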

## Step 5: Train the Model

* We'll use the Multinomial Naive Bayes algorithm, which works well with word-count features such as those produced by CountVectorizer.

In [11]:
# Initialize the Naive Bayes classifier
model = MultinomialNB()

# Train the model
model.fit(X_train, y_train)

## Step 6: Evaluate the Model

* Let's test the model on the test set and evaluate its performance.

In [12]:
# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.978743961352657

Confusion Matrix:
 [[731  11]
 [ 11 282]]

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.99      0.99       742
           1       0.96      0.96      0.96       293

    accuracy                           0.98      1035
   macro avg       0.97      0.97      0.97      1035
weighted avg       0.98      0.98      0.98      1035
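The figures in the report can be recomputed by hand from the confusion matrix. The sketch below copies the four counts from the output above and assumes scikit-learn's layout, where rows are true labels and columns are predicted labels.

```python
# Counts copied from the confusion matrix above
tn, fp = 731, 11   # true ham: predicted ham, predicted spam
fn, tp = 11, 282   # true spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
spam_precision = tp / (tp + fp)   # of the emails flagged as spam, how many were spam
spam_recall = tp / (tp + fn)      # of the actual spam emails, how many were caught
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"accuracy={accuracy:.4f} precision={spam_precision:.4f} "
      f"recall={spam_recall:.4f} f1={spam_f1:.4f}")
```

In this run, 11 ham emails were flagged as spam and 11 spam emails were missed, which is why precision and recall for the spam class are equal.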



## Step 7: Test the Model on New Emails

* You can now use the trained model to classify new emails as spam or ham.

In [13]:
# New emails to classify
new_emails = [
    ''' Subject: Exclusive Offer! Get 90% Off Now!
        Body: Hurry! Limited-time deal. Click here to claim your 90% discount today: [malicious-link.com] ''',

    ''' Subject: Your Account is Suspended! Urgent Action Required
        Body: Dear User, your bank account has been locked due to suspicious activity. Verify your details immediately: [fake-bank.com]''',

    ''' Subject: Project Update - Please Review
        Body: Hi Team, the updated report is attached. Please review and share your feedback. Thanks! ''',

    '''Subject: Dinner Plan for Tonight?
        Body: Hey bro, are we still on for dinner at 8 PM? Let me know.
    '''
]

# Vectorize the new emails
new_emails_vectorized = vectorizer.transform(new_emails)

# Predict the labels
predictions = model.predict(new_emails_vectorized)

# Display the results
for email, prediction in zip(new_emails, predictions):
    print(f"Email: {email}\nPrediction: {prediction}\n")

Email:  Subject: Exclusive Offer! Get 90% Off Now!
        Body: Hurry! Limited-time deal. Click here to claim your 90% discount today: [malicious-link.com] 
Prediction: 1

Email:  Subject: Your Account is Suspended! Urgent Action Required
        Body: Dear User, your bank account has been locked due to suspicious activity. Verify your details immediately: [fake-bank.com]
Prediction: 1

Email:  Subject: Project Update - Please Review
        Body: Hi Team, the updated report is attached. Please review and share your feedback. Thanks! 
Prediction: 0

Email: Subject: Dinner Plan for Tonight?
        Body: Hey bro, are we still on for dinner at 8 PM? Let me know.
    
Prediction: 0
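The predictions are printed as 0 and 1. A small mapping makes the output easier to read, and `predict_proba` adds the model's estimated spam probability. The sketch below uses a tiny made-up model as a stand-in for the trained one; the `demo_` names, the training texts, and the `label_names` mapping are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set, standing in for the trained model above
train_texts = ["free prize click now", "claim free discount",
               "team meeting agenda", "dinner at eight"]
train_labels = [1, 1, 0, 0]

demo_vectorizer = CountVectorizer()
demo_model = MultinomialNB().fit(demo_vectorizer.fit_transform(train_texts), train_labels)

label_names = {0: "ham", 1: "spam"}
incoming = ["free discount now", "agenda for the meeting"]
features = demo_vectorizer.transform(incoming)

results = []
for text, label, probs in zip(incoming,
                              demo_model.predict(features),
                              demo_model.predict_proba(features)):
    # probs follows the order of demo_model.classes_, here [P(ham), P(spam)]
    results.append((text, label_names[int(label)], float(probs[1])))
    print(f"{text!r}: {label_names[int(label)]} (P(spam) = {probs[1]:.2f})")
```

Words that the vectorizer never saw during fitting (here "for" and "the") are ignored at prediction time.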



## Step 8: Save the Model

In [14]:
import joblib

# Save the model and the vectorizer
joblib.dump(model, 'model.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')

['vectorizer.pkl']
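To reuse the saved files later, both objects have to be loaded, and new text must pass through the loaded vectorizer before it reaches the model. The sketch below shows that round trip with a tiny made-up model written to a temporary directory; the training texts and variable names are illustrative assumptions.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up model so the sketch runs without the notebook's dataset
texts = ["free prize now", "claim free offer", "meeting at noon", "report attached"]
labels = [1, 1, 0, 0]
vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    vectorizer_path = os.path.join(tmp_dir, "vectorizer.pkl")
    joblib.dump(clf, model_path)
    joblib.dump(vec, vectorizer_path)

    # In a later session: load both objects before making predictions
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

# Transform with the loaded vectorizer, then predict with the loaded model
reloaded_prediction = loaded_model.predict(loaded_vectorizer.transform(["free offer now"]))
print(reloaded_prediction)
```

Files written by `joblib.dump` are pickles, so they should only be loaded from sources you trust, ideally with the same scikit-learn version that created them.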