
# Spam Mail Prediction


The primary goal is to develop a logistic regression model that classifies emails as spam or not spam (ham).

Importing Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Data Collection and Pre-Processing

In [2]:
# loading data into a pandas DataFrame
mail_data = pd.read_csv('/content/spam_ham_dataset.csv')

In [3]:
mail_data.head()

,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0


In [4]:
mail_data.columns

Index(['Unnamed: 0', 'label', 'text', 'label_num'], dtype='object')

In [5]:
mail_data.isnull().sum()

Unnamed: 0    0
label         0
text          0
label_num     0
dtype: int64

In [6]:
# Drop the leftover index column 'Unnamed: 0' and the string column 'label'
mail_data = mail_data.drop(columns=['Unnamed: 0', 'label'])

# Rename column 'label_num' to 'label'
mail_data = mail_data.rename(columns={'label_num': 'label'})

In [7]:
mail_data.head()

,text,label
0,Subject: enron methanol ; meter # : 988291\r\n...,0
1,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,"Subject: photoshop , windows , office . cheap ...",1
4,Subject: re : indian springs\r\nthis deal is t...,0


In [8]:
# size of the dataset
mail_data.shape

(5171, 2)
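
Before splitting, it can be useful to look at how the two classes are distributed, since a skewed ham/spam ratio changes how an accuracy score should be read. The sketch below uses a small made-up DataFrame (`toy_mail` and its rows are illustrative, not taken from the dataset); the same `value_counts` calls could be applied to `mail_data['label']`.

```python
import pandas as pd

# Hypothetical stand-in for mail_data: 0 = ham, 1 = spam
toy_mail = pd.DataFrame({
    'text': ['meeting at noon', 'cheap meds now', 'gas nomination', 'daily volumes'],
    'label': [0, 1, 0, 0],
})

# Absolute number of rows per class
class_counts = toy_mail['label'].value_counts()

# Share of each class as a fraction of all rows
class_share = toy_mail['label'].value_counts(normalize=True)

print(class_counts)
print(class_share)
```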

Splitting data into text and label

In [9]:
X = mail_data['text']
Y = mail_data['label']

In [10]:
print(X)

0       Subject: enron methanol ; meter # : 988291\r\n...
1       Subject: hpl nom for january 9 , 2001\r\n( see...
2       Subject: neon retreat\r\nho ho ho , we ' re ar...
3       Subject: photoshop , windows , office . cheap ...
4       Subject: re : indian springs\r\nthis deal is t...
                              ...                        
5166    Subject: put the 10 on the ft\r\nthe transport...
5167    Subject: 3 / 4 / 2000 and following noms\r\nhp...
5168    Subject: calpine daily gas nomination\r\n>\r\n...
5169    Subject: industrial worksheets for august 2000...
5170    Subject: important online banking alert\r\ndea...
Name: text, Length: 5171, dtype: object


In [11]:
print(Y)

0       0
1       0
2       0
3       1
4       0
       ..
5166    0
5167    0
5168    0
5169    0
5170    1
Name: label, Length: 5171, dtype: int64


Splitting data into training and test data

In [12]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)

In [13]:
# size of training and test data
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

(4136,) (1035,) (4136,) (1035,)
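
The split above is purely random. One possible refinement, if the classes turn out to be imbalanced, is to pass `stratify` so that the training and test sets keep the same spam share. This is a minimal sketch on synthetic data; the `toy_*` names and the 20 made-up messages are assumptions for illustration only.

```python
from sklearn.model_selection import train_test_split

# 20 synthetic messages: every fourth one is labelled spam (5 spam, 15 ham)
toy_texts = [f"message {i}" for i in range(20)]
toy_labels = [1 if i % 4 == 0 else 0 for i in range(20)]

# stratify=toy_labels keeps the 25% spam share in both subsets
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=2, stratify=toy_labels
)

print(len(toy_X_train), len(toy_X_test))
print(sum(toy_y_train) / len(toy_y_train), sum(toy_y_test) / len(toy_y_test))
```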


Feature Extraction: Converting text data into numerical values

In [14]:
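# TF-IDF vectorizer: lowercases the text and drops English stop words
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

# learn the vocabulary and IDF weights from the training text only,
# then apply the same mapping to the test text
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)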

In [15]:
print(X_test_features)

  (0, 43869)	0.04869956545696554
  (0, 42813)	0.03527402702862729
  (0, 41851)	0.08810491896608709
  (0, 41412)	0.4620507480296343
  (0, 40766)	0.04272090570919602
  (0, 40180)	0.04895680312117566
  (0, 39266)	0.04878991611309338
  (0, 39043)	0.010706489213114774
  (0, 38602)	0.22203508931006138
  (0, 38215)	0.06834259541498736
  (0, 38015)	0.11170834580172105
  (0, 37366)	0.052951078688389976
  (0, 36936)	0.04726991304221889
  (0, 36204)	0.0800147155732058
  (0, 34434)	0.06310125262131315
  (0, 34402)	0.07087348931743342
  (0, 33957)	0.06276133552527766
  (0, 33604)	0.11554230171899273
  (0, 33603)	0.1223839489927178
  (0, 33409)	0.051690035887647
  (0, 32180)	0.04580097071470751
  (0, 32099)	0.06629706458347832
  (0, 30757)	0.13558684862384385
  (0, 30421)	0.05122446300573309
  (0, 29773)	0.05568007980954718
  :	:
  (1033, 990)	0.017221480293918248
  (1034, 44262)	0.15183101676637092
  (1034, 39374)	0.31179985880269884
  (1034, 39043)	0.08998772512264999
  (1034, 32431)	0.11764860046
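
Each line of this output reads as `(row, column) weight`: the row is the position of the email in the transformed set, the column is the index of a term in the vocabulary learned from the training text, and the value is that term's TF-IDF weight. The sketch below shows how such entries can be mapped back to terms on a tiny made-up corpus; `train_docs`, `vectorizer`, and `index_to_term` are illustrative names, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training texts, used only for illustration
train_docs = [
    "cheap meds cheap offer",
    "meeting schedule attached",
    "cheap offer today",
]

vectorizer = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
train_matrix = vectorizer.fit_transform(train_docs)

# vocabulary_ maps term -> column index; invert it to read columns as terms
index_to_term = {int(idx): term for term, idx in vectorizer.vocabulary_.items()}

# Walk the non-zero entries of the sparse matrix as (row, term, weight)
for row, col in zip(*train_matrix.nonzero()):
    print(int(row), index_to_term[int(col)], round(float(train_matrix[row, col]), 3))

# Terms that were not seen during fit are ignored by transform
new_matrix = vectorizer.transform(["cheap unknownword meeting"])
print(new_matrix.shape, new_matrix.nnz)
```

Because the vocabulary is fixed at fit time, the transformed test matrix always has the same number of columns as the training matrix.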

Training the model (LogisticRegression)

In [16]:
model = LogisticRegression()

In [17]:
model.fit(X_train_features, Y_train)
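
As background for what `fit` produces: a fitted binary logistic regression scores each input with a linear function and passes that score through the sigmoid to obtain a probability for class 1 (spam here). The sketch below checks this on a tiny one-feature dataset, assuming scikit-learn's default settings for a two-class problem; `X_toy`, `y_toy`, and `clf` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature, two classes
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)

# Linear score w*x + b for each sample
scores = clf.decision_function(X_toy)

# Sigmoid of the score gives the probability of class 1
manual_probs = 1.0 / (1.0 + np.exp(-scores))
library_probs = clf.predict_proba(X_toy)[:, 1]

# Thresholding the probability at 0.5 reproduces predict()
manual_preds = (manual_probs >= 0.5).astype(int)

print(np.allclose(manual_probs, library_probs))
print(manual_preds, clf.predict(X_toy))
```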

Model Evaluation

In [18]:
# Prediction on training data
prediction_on_train_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_train_data)
print(f'Accuracy on train data : {accuracy_on_training_data}')

Accuracy on train data : 0.9961315280464217


In [19]:
# Prediction on test data
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)
print(f'Accuracy on test data : {accuracy_on_test_data}')

Accuracy on test data : 0.9893719806763285
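
Accuracy alone does not show which kind of mistake the model makes. For a spam filter it may matter whether errors are ham flagged as spam (false positives) or spam that slips through (false negatives). A hedged sketch of how a confusion matrix, precision, and recall could be computed, using made-up label lists rather than the notebook's predictions:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up ground truth and predictions: 0 = ham, 1 = spam
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[true ham, ham flagged as spam], [missed spam, caught spam]]
cm = confusion_matrix(y_true, y_pred)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # of mails flagged spam, share that is spam
recall = recall_score(y_true, y_pred)        # of actual spam, share that was flagged

print(cm)
print(accuracy, precision, recall)
```

In the notebook, the same functions could be called with `Y_test` and `prediction_on_test_data`.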


Predictive system

In [20]:
# making prediction with spam mail data
#input_mail = ["Subject: seize clal 1 is , \ / 11 agrra , xanaax , adlpex , \ / all 1 um , ambl 1 en , tussioneex from $ 65 usual go blewearth her exciting shall times degree island week ,xanaax , \ / alium , cialiis , \ / iaagra , ambieen & all popular medssno long questioning form , you pay & we shiip out today quietworldwide shippiing allowprom 0 tion running now :\ / aliuum : from $ 70ambiien : from $ 68ciaaliis : from $ 96\ / iaagra : from $ 64xanaax : from $ 75& many more meds for u to choose from alongdont miss this prom 0 tionlimited stock until all sold out ( this way please )probably copy changed ."]

#input_mail = ["Subject: full stock of all your p # harmacy needs ! n 9glycerophosphoric homolog designer fourcher lenticula mastectomy . programer extrapelvic spiffing doesn microcentrosome . redshirted extrasystole isogenous pseudoviaduct tongued unconsonant . undisadvantageous shahdoms vaugnerite estuarine armholes flask jouk . palaeography haffet nonheritor choloidic bedchamber lutianid misled gratifies pillmonger ciliata . somatotyper apomecometry proceduring streptobacillus unsoluble .thingman when . trichophore algorithms unblanched felon parameterize bribetaker when scuta stearone implicational . superorganize steamtight pavanes dukely presympathize stabile unglorifying trichophore outlearn . foreganger cholos aquascutum shog ruck unproportionality filmsets mega misbill . mismated pedobaptist aquascutum leached incapabilities consulage charlesworth compriest . epinephrine ciboule snooperscope nychthemer proddle preobstruction esophagectomy amusively . peritoneomuscular mg pst . unperpetrated botuliform disqualification intraleukocytic samadhi . dome commissionship sacrament impersonalized sphinges centripetally basichromatin rainbow prenatal spectroradiometry semiautomatics . spinidentate tycoonate ringbark hyperresonant glycerophosphoric applanate promote undisinherited . onychophyma animists wisents hydroxylate genuflex . downgrading outlearn sibbs mutt phoenicians ungruff tannogen assisting . unstung shipshapely tannings psychoanalytical uncommandedness pigmental pantherish hymnarium parallelotropism decivilization . franker unsuperscribed outquestion verbascose embroiling . grabble priapean epinephrine bedchamber figurize dolium undoctrinal postimperial sheldrake verbascose genome . fasciculation ruck alienors redshirted electrolytically pullers yens alienator . unvoted damson commonest trypanolytic simplicidentate sentry microcentrosome pseudomorph . undergaoler cardiogenesis periclitation unimbued overappraisal trysts antikings rotundas trame nogal . 
# shends . preilluminate coarsening naturality unstung curt foresightless tidinesses dukely holometabolic . devilish floured . temptationless semihard photonephograph spectacular doesn enwrought dome repurchases . syndyasmian streptobacillus khis bebouldered oversecure . overdid cicutoxin coheirship pillowslips estuarine ."]



# making prediction with ham mail data
#input_mail = ["Subject: hpl nom for october 5 , 2000( see attached file : hpll 005 . xls )- hpll 005 . xls"]

input_mail = ["Subject: wellhead volumesdaren ,please click on the supply analysis tab of the attached spreadsheet to viewthe wellhead volumes through 4 / 4 / 01 .bob"]

# convert text to feature vectors
input_mail_features = feature_extraction.transform(input_mail)

# make prediction
prediction = model.predict(input_mail_features)
print(prediction)

if prediction[0] == 0:
  print('This is a Ham Mail')
else:
  print('This is a Spam Mail')

[0]
This is a Ham Mail
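
One way to package the steps above is to chain the vectorizer and the classifier in a scikit-learn pipeline, so that raw text goes in and a label comes out in a single call. This is a sketch under stated assumptions, not the notebook's own code: the training texts are made up, and `spam_pipeline` and `classify_mail` are hypothetical names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data: 1 = spam, 0 = ham
toy_texts = [
    "cheap meds offer", "cheap pills discount",
    "discount meds offer", "cheap discount pills",
    "meeting volumes attached", "meeting nomination spreadsheet",
    "spreadsheet volumes attached", "meeting spreadsheet nomination",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline applies TF-IDF first, then logistic regression
spam_pipeline = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words='english', lowercase=True),
    LogisticRegression(),
)
spam_pipeline.fit(toy_texts, toy_labels)


def classify_mail(text, pipeline=spam_pipeline):
    """Return 'Ham' or 'Spam' for a single raw email string."""
    label = pipeline.predict([text])[0]
    return 'Ham' if label == 0 else 'Spam'


print(classify_mail("cheap discount meds"))
print(classify_mail("meeting volumes spreadsheet"))
```

Keeping both steps in one object avoids accidentally transforming new mail with a different vectorizer than the one used in training.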
