# MAIL SPAM DETECTION USING TENSORFLOW

In [None]:
#reading the dataset
import pandas as pd
dataset=pd.read_csv('spammailsTF.csv')

In [3]:
import tensorflow as tf

In [4]:
#printing the first and last 5 rows of the dataset
dataset

,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0
...,...,...,...,...
5166,1518,ham,Subject: put the 10 on the ft\r\nthe transport...,0
5167,404,ham,Subject: 3 / 4 / 2000 and following noms\r\nhp...,0
5168,2933,ham,Subject: calpine daily gas nomination\r\n>\r\n...,0
5169,1409,ham,Subject: industrial worksheets for august 2000...,0


In [5]:
dataset.info()
dataset.columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5171 entries, 0 to 5170
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  5171 non-null   int64 
 1   label       5171 non-null   object
 2   text        5171 non-null   object
 3   label_num   5171 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 161.7+ KB


Index(['Unnamed: 0', 'label', 'text', 'label_num'], dtype='object')

In [6]:
df=dataset

In [7]:
df = df.drop(columns=['Unnamed: 0']) #drops the 'Unnamed:0' column

In [8]:
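# Fill any missing values (info() above reports none, so this is only a safeguard).
# Assigning the result back avoids the chained inplace=True pattern, which newer pandas versions warn about.
df['text'] = df['text'].fillna('Missing')
df['label'] = df['label'].fillna('Missing')
df['label_num'] = df['label_num'].fillna(-1)  # -1 marks a missing label; such rows would have to be dropped before training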


In [9]:
df

,label,text,label_num
0,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,spam,"Subject: photoshop , windows , office . cheap ...",1
4,ham,Subject: re : indian springs\r\nthis deal is t...,0
...,...,...,...
5166,ham,Subject: put the 10 on the ft\r\nthe transport...,0
5167,ham,Subject: 3 / 4 / 2000 and following noms\r\nhp...,0
5168,ham,Subject: calpine daily gas nomination\r\n>\r\n...,0
5169,ham,Subject: industrial worksheets for august 2000...,0


Preprocess the data - tokenize and pad sequences 

In [10]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [11]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['text'])
sequences = tokenizer.texts_to_sequences(df['text'])
padded_sequences = pad_sequences(sequences, padding='post')
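To make the tokenize-and-pad step above more concrete, here is a minimal pure-Python sketch of the same idea. It is a simplification, not the Keras implementation: the helper names (`build_word_index`, `to_sequences`, `pad_post`) and the regular expression are assumptions made for illustration. It mirrors three behaviours that matter later in the notebook: ids start at 1 so that 0 can serve as padding, words outside the fitted vocabulary are dropped when no OOV token is configured, and `padding='post'` appends zeros at the end.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"[a-z0-9']+")  # simplified word splitter (assumption)

def build_word_index(texts):
    """Map each word to a 1-based id, most frequent first (0 is reserved for padding)."""
    counts = Counter(w for t in texts for w in TOKEN_PATTERN.findall(t.lower()))
    # Sort by descending frequency; ties are broken alphabetically for determinism.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}

def to_sequences(texts, word_index):
    """Replace words by ids; words missing from the index are silently dropped."""
    return [
        [word_index[w] for w in TOKEN_PATTERN.findall(t.lower()) if w in word_index]
        for t in texts
    ]

def pad_post(sequences, maxlen=None):
    """Append zeros up to maxlen (default: the longest sequence).
    Sequences longer than maxlen keep their last maxlen ids."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    return [s[-maxlen:] + [0] * (maxlen - len(s)) for s in sequences]

word_index = build_word_index(["free offer free", "meeting at noon"])
new_mail = to_sequences(["free lunch at noon today"], word_index)
print(word_index)
print(new_mail)               # 'lunch' and 'today' are unknown and disappear
print(pad_post(new_mail, 5))  # zeros are appended on the right
```

The dropped words are worth keeping in mind when a new email is scored: any term the tokenizer never saw during fitting contributes nothing to the prediction.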

In [12]:
from sklearn.model_selection import train_test_split
#Split data into training and testing sets 
x_train, x_test, y_train, y_test = train_test_split(padded_sequences, df['label_num'], test_size=0.2)

In [13]:
print(x_train.shape,x_test.shape)

(4136, 5916) (1035, 5916)


In [14]:
#define the model architecture
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
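As a rough picture of what this stack of layers computes for a single email, the following pure-Python sketch walks through the forward pass: embedding lookup, averaging over all positions, a ReLU layer, and a sigmoid output. The sizes and random weights are toy values chosen for illustration, not the notebook's trained parameters, and the function name `forward` is hypothetical. One detail it tries to show: since the Embedding layer is created without masking, the average appears to include the padded positions (id 0) as well, so heavily padded inputs are pulled toward the embedding of the padding id.

```python
import math
import random

def forward(token_ids, embedding, w1, b1, w2, b2):
    """Embedding lookup -> mean over positions -> Dense(relu) -> Dense(sigmoid)."""
    vectors = [embedding[t] for t in token_ids]  # one vector per position
    dim = len(vectors[0])
    # Global average pooling: mean over every position, padding (id 0) included.
    pooled = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    # Hidden layer: one weighted sum per unit, followed by ReLU.
    hidden = [
        max(0.0, sum(p * w for p, w in zip(pooled, unit_weights)) + b)
        for unit_weights, b in zip(w1, b1)
    ]
    # Output layer: a single logit squashed to a probability.
    logit = sum(h * w for h, w in zip(hidden, w2)) + b2
    return 1.0 / (1.0 + math.exp(-logit))

random.seed(0)
vocab_size, embed_dim, hidden_units = 10, 4, 3  # toy sizes, not the notebook's
embedding = [[random.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(vocab_size)]
w1 = [[random.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(hidden_units)]
b1 = [0.0] * hidden_units
w2 = [random.uniform(-1, 1) for _ in range(hidden_units)]
b2 = 0.0

short = forward([3, 7, 0, 0], embedding, w1, b1, w2, b2)
padded_more = forward([3, 7, 0, 0, 0, 0, 0, 0], embedding, w1, b1, w2, b2)
print(short, padded_more)  # same words, different amounts of padding
```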




In [15]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
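For reference, the `binary_crossentropy` loss named above averages -[y·log(p) + (1-y)·log(1-p)] over the examples, where y is the 0/1 label and p the predicted spam probability. The sketch below is a hand-written version of that formula; the function name and the clipping constant `eps` are assumptions for illustration, not the Keras internals.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1]
confident = binary_cross_entropy(labels, [0.9, 0.1, 0.8])
hesitant = binary_cross_entropy(labels, [0.6, 0.4, 0.5])
wrong = binary_cross_entropy(labels, [0.1, 0.9, 0.2])
print(confident, hesitant, wrong)  # loss grows as predictions move away from the labels
```

A prediction of exactly 0.5 costs log(2), about 0.693, per example, which gives a useful baseline when reading the loss values reported during training and evaluation.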




In [16]:
model.fit(x_train, y_train, epochs=20)

Epoch 1/20
...
Epoch 20/20


<keras.src.callbacks.History at 0x1b6697b7590>

In [17]:
#Evaluate the model on test data 
loss, accuracy = model.evaluate(x_test, y_test)
print(f"Loss: {loss}, Accuracy: {accuracy}")

Loss: 0.18257546424865723, Accuracy: 0.9681159257888794


In [18]:
from sklearn.metrics import confusion_matrix,classification_report
y_true = y_test # the true labels from the test set
y_prob = model.predict(x_test).ravel() # the predicted spam probabilities, flattened from shape (n, 1) to (n,)
threshold = 0.5 # probabilities above this value are classified as spam
y_pred = (y_prob > threshold).astype(int) # the binary labels: 1 = spam, 0 = ham





In [19]:
cm = confusion_matrix(y_true, y_pred) # the confusion matrix
print('confusion matrix: \n',cm)

confusion matrix: 
 [[696  30]
 [  3 306]]


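In scikit-learn's confusion matrix, rows are the true labels and columns are the predicted labels, ordered 0 (ham) then 1 (spam). With spam as the positive class:

* True Negatives (696): Ham emails correctly identified as ham.
* False Positives (30): Ham emails incorrectly classified as spam.
* False Negatives (3): Spam emails incorrectly classified as ham.
* True Positives (306): Spam emails correctly identified as spam.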
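The layout scikit-learn uses for a binary confusion matrix, rows for true labels and columns for predicted labels, can be checked with a small hand-rolled count. The function name `confusion_counts` and the six toy labels below are made up for illustration; they are not taken from the notebook's test set.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows = true label, columns = predicted label."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Toy labels: 0 = ham, 1 = spam
toy_true = [0, 0, 0, 1, 1, 0]
toy_pred = [0, 1, 0, 1, 0, 0]

toy_matrix = confusion_counts(toy_true, toy_pred)
(tn, fp), (fn, tp) = toy_matrix
print(toy_matrix)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Reading the matrix this way, the top-left cell counts ham kept as ham and the bottom-right cell counts spam caught as spam.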

In [20]:
cr = classification_report(y_true, y_pred) # the classification report
print(f'classification report: \n{cr}')

classification report: 
              precision    recall  f1-score   support

           0       1.00      0.96      0.98       726
           1       0.91      0.99      0.95       309

    accuracy                           0.97      1035
   macro avg       0.95      0.97      0.96      1035
weighted avg       0.97      0.97      0.97      1035



Here, 0 = ham and 1 = spam.

* The overall accuracy of the model is 97%, meaning it classified 97% of the emails in the test set correctly.
* Precision and recall are high for both classes, suggesting that the model identifies both spam and ham reliably on this data.
* The precision for ham is reported as 1.00 after rounding: 696 of the 699 emails predicted as ham were actually ham. The precision for spam is 0.91, meaning that out of every 10 emails the model predicts as spam, roughly 9 are actually spam and 1 is a ham email mistakenly labeled spam (a false positive).
* The recall for spam is 0.99: the model catches 306 of the 309 spam emails in the test set and misses only 3.
* The F1 score for spam is 0.95, which indicates a good balance between precision and recall. The model catches almost all spam while mislabeling relatively few ham emails.
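The figures in the classification report can be re-derived by hand from the four cells of the confusion matrix printed earlier (696, 30, 3, 306). The sketch below does that arithmetic; the helper name `precision_recall_f1` is an assumption for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for one class, given its own TP/FP/FN counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Cells of the confusion matrix above, with spam (1) as the positive class
tn, fp, fn, tp = 696, 30, 3, 306

spam = precision_recall_f1(tp, fp, fn)
# For ham the roles swap: its "true positives" are tn, its false positives are fn.
ham = precision_recall_f1(tn, fn, fp)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print("spam:", [round(v, 2) for v in spam])
print("ham: ", [round(v, 2) for v in ham])
print("accuracy:", round(accuracy, 2))
```

The unrounded ham precision is 696 / 699, about 0.996, which is why the report shows 1.00 even though three spam emails were predicted as ham.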

In [21]:
ip=''' 
The Dropbox logo
Please sign in!
Hi Pp,

We noticed you're not taking advantage of your Dropbox account. We're presenting new ways to utilize your Dropbox.
Continue here
Here are some ways to use Dropbox:
Back up your files—like photos and important docs—to keep them stored safely.
Download Dropbox on your devices to access files from wherever you are.
Send larger files and folders to clients and friends with
Dropbox Transfer—even if they don’t use Dropbox.
Out of storage space? No problem. We’ll give you an additional 250 MB* for free once you complete a few steps.

* This page can only be viewed from a computer, not a mobile phone.'''

In [22]:
#Preprocess the input email text
input_text = ip
input_sequence = tokenizer.texts_to_sequences([input_text])
input_padded = pad_sequences(input_sequence, maxlen=padded_sequences.shape[1], padding='post')  # pad to the training sequence length (5916) instead of hard-coding it

#Predict the probability of spam
probability = model.predict(input_padded)[0][0]
print(f"Probability of spam: {probability}")

#Classify the email as spam or ham
threshold = 0.5
if probability > threshold:
    print("The email is spam.")
else:
    print("The email is ham.")


Probability of spam: 0.7673009634017944
The email is spam.
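The 0.5 threshold used above is a choice, not a fixed rule: raising it trades false positives (ham flagged as spam) for false negatives (spam let through). The following sketch counts both kinds of error at a few thresholds. The probabilities and labels are toy values invented for illustration, not model outputs from this notebook, and `errors_at` is a hypothetical helper.

```python
def errors_at(threshold, probs, labels):
    """Count (false positives, false negatives) when p > threshold means spam."""
    fp = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p <= threshold and y == 1)
    return fp, fn

# Toy data: predicted spam probabilities and true labels (0 = ham, 1 = spam)
toy_probs = [0.05, 0.20, 0.45, 0.55, 0.77, 0.95]
toy_labels = [0, 0, 1, 0, 1, 1]

for threshold in (0.3, 0.5, 0.8):
    fp, fn = errors_at(threshold, toy_probs, toy_labels)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

For a mail filter, where losing a legitimate message is usually worse than seeing one extra spam message, a threshold somewhat above 0.5 may be worth evaluating on the test set.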
