<a href="https://colab.research.google.com/github/ppprakharr/ClassificationModels/blob/main/SpamMailPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Importing the dependencies

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

Data Collection and Preprocessing

In [2]:
mail_data=pd.read_csv('/content/mail_data.csv')


In [3]:
mail_data.shape

(5572, 2)

In [4]:
# first five rows
mail_data.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
# last five rows
mail_data.tail()

Unnamed: 0,Category,Message
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...
5571,ham,Rofl. Its true to its name


In [6]:
# check for null
mail_data.isnull().sum()
# new_data=mail_data.where((pd.notnull(mail_data)),' ')

Category    0
Message     0
dtype: int64

Label Encoding
spam mail -->1
ham mail -->0

In [7]:
mail_data.replace({'Category':{'spam': 1, 'ham': 0}},inplace=True)
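The encoding step can also be written with `Series.map`, which makes it easy to detect labels that are neither 'spam' nor 'ham'. This is only a sketch on a hypothetical three-row frame (`toy`, `label_map` and `unmapped` are illustrative names, not part of the notebook); it is not the method used above.

```python
import pandas as pd

# Hypothetical toy frame with the same two columns as mail_data
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you at 5", "WIN a free prize now", "ok thanks"],
})

# Same convention as above: spam -> 1, ham -> 0
label_map = {"spam": 1, "ham": 0}
toy["Category"] = toy["Category"].map(label_map)

# map() leaves NaN for any label missing from label_map,
# so a non-zero count here would signal an unexpected category
unmapped = int(toy["Category"].isna().sum())
encoded_labels = toy["Category"].tolist()

print(encoded_labels)
print("unmapped labels:", unmapped)
```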

In [8]:
mail_data

Unnamed: 0,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...
5568,0,Will ü b going to esplanade fr home?
5569,0,"Pity, * was in mood for that. So...any other s..."
5570,0,The guy did some bitching but I acted like i'd...


In [9]:
# Separating the data into text and label

x=mail_data['Message']
y=mail_data['Category']

In [15]:
print(y)

0       0
1       0
2       1
3       0
4       0
       ..
5567    1
5568    0
5569    0
5570    0
5571    0
Name: Category, Length: 5572, dtype: int64


Splitting the data into training and test data

In [12]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=1)

In [13]:
print(x.shape,x_train.shape,x_test.shape)

(5572,) (4457,) (1115,)
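If one class is much rarer than the other, a purely random split can leave the test set with a different spam ratio than the training set. The sketch below, on made-up data, shows the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The names `messages`, `labels` and the 20-row sample are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical sample: 4 spam (1) and 16 ham (0) messages
messages = pd.Series([f"message {i}" for i in range(20)])
labels = pd.Series([1] * 4 + [0] * 16)

# stratify=labels preserves the 20% spam share in both splits
m_train, m_test, l_train, l_test = train_test_split(
    messages, labels, test_size=0.25, random_state=1, stratify=labels
)

print(len(m_train), len(m_test))
print("spam in train:", int(l_train.sum()), "spam in test:", int(l_test.sum()))
```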


Transforming the text data into feature vectors to be used as model input

In [14]:
# stop_words (not stopwords) is the parameter name; lowercase expects a boolean
feature=TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

In [16]:
x_train_feature = feature.fit_transform(x_train)
x_test_feature = feature.transform(x_test)


In [17]:
print(x_train_feature)

  (0, 2377)	0.13084322879530272
  (0, 4474)	0.20675479928475868
  (0, 2408)	0.1625091138189867
  (0, 6910)	0.2308946468063433
  (0, 1632)	0.1240452905376704
  (0, 6230)	0.1238433975762776
  (0, 6077)	0.26505343627625905
  (0, 2795)	0.19121220903776834
  (0, 1578)	0.12507269360630358
  (0, 937)	0.1473844950210902
  (0, 4872)	0.11547335237601036
  (0, 3940)	0.2409135887546744
  (0, 4985)	0.2811524084951536
  (0, 3153)	0.2811524084951536
  (0, 6889)	0.07687128957768792
  (0, 3906)	0.30331418937109733
  (0, 4813)	0.12424836072414847
  (0, 6794)	0.16849404618992045
  (0, 963)	0.10280712668342132
  (0, 1119)	0.12999227871929717
  (0, 5564)	0.34546463397618216
  (0, 7370)	0.2647043827753558
  (0, 7674)	0.15772075348041273
  (0, 1051)	0.12354273580984558
  (0, 7442)	0.16997624597585514
  :	:
  (4455, 3838)	0.19426158222126744
  (4455, 6248)	0.17796330384160655
  (4455, 4411)	0.16424350620788178
  (4455, 4606)	0.1561875890003103
  (4455, 313)	0.17722407642202112
  (4455, 2846)	0.168925565608865
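Each printed entry above has the form `(row, column) weight`: the row is the index of the message, the column is the index of a vocabulary term, and the weight is its TF-IDF score. The following sketch reproduces this on three made-up sentences (all names here are hypothetical). It also illustrates why the vectorizer is fitted on the training text only: terms that first appear in the test text, such as 'lottery' here, are simply ignored by `transform`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpus
train_texts = [
    "free prize waiting for you",
    "are you coming home",
    "claim your free prize",
]
test_texts = ["free lottery prize"]

vec = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
train_matrix = vec.fit_transform(train_texts)  # learns the vocabulary and IDF weights
test_matrix = vec.transform(test_texts)        # reuses them, learns nothing new

# Walk the sparse matrix as (row, column, weight) triples
coo = train_matrix.tocoo()
entries = [(int(r), int(c), float(v)) for r, c, v in zip(coo.row, coo.col, coo.data)]

print(sorted(vec.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
for row, col, weight in entries:
    print(row, col, round(weight, 4))
```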

Training the model

Logistic Regression

In [18]:
model=LogisticRegression()

In [19]:
model.fit(x_train_feature,y_train)

In [20]:
# evaluate the model on the training data
train_pred=model.predict(x_train_feature)
score=accuracy_score(y_train,train_pred)  # signature is (y_true, y_pred)
print('accuracy_score train: ',score)

accuracy_score train:  0.974646623289208


In [21]:
# evaluate the model on the held-out test data
test_pred=model.predict(x_test_feature)
score=accuracy_score(y_test,test_pred)  # signature is (y_true, y_pred)
print('accuracy_score test: ',score)

accuracy_score test:  0.9775784753363229
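Accuracy alone can hide how the errors are distributed between the two classes. As a sketch, using made-up labels rather than this notebook's predictions, the confusion matrix together with precision and recall shows how many spam mails are missed and how many ham mails are wrongly flagged. `y_true_demo` and `y_pred_demo` are hypothetical.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels: 6 ham (0), 4 spam (1); two spam mails are missed
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)
precision = precision_score(y_true_demo, y_pred_demo)  # of mails flagged as spam, share that are spam
recall = recall_score(y_true_demo, y_pred_demo)        # of actual spam mails, share that were caught

print(cm)
print("accuracy:", acc, "precision:", precision, "recall:", recall)
```

In this toy case the accuracy is 0.8 even though only half of the spam is caught, which is the kind of gap these extra metrics expose.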


Build a predictive system

In [25]:
# named input_mail to avoid shadowing Python's built-in input()
input_mail = ["Thanks for your subscription to Ringtone UK your mobile will be charged £5/month Please confirm by replying YES or NO. If you reply NO you will not be charged"]

# convert the text to a feature vector with the already fitted vectorizer

input_feature = feature.transform(input_mail)
pred=model.predict(input_feature)
print(pred)

[1]
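The vectorizer and the classifier can also be bundled so that raw text goes in and a readable label comes out. This is a minimal sketch assuming a scikit-learn `Pipeline` trained on six made-up messages; `spam_pipeline`, `classify` and the toy data are hypothetical and stand in for the `feature` and `model` objects fitted above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: 1 = spam, 0 = ham
toy_messages = [
    "win a free prize now",
    "free entry claim your prize",
    "claim cash prize free",
    "see you at lunch",
    "are you coming home tonight",
    "ok talk later",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# The pipeline applies TF-IDF and then logistic regression in one call
spam_pipeline = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_pipeline.fit(toy_messages, toy_labels)


def classify(message: str) -> str:
    """Return 'spam' or 'ham' for a single raw message."""
    label = spam_pipeline.predict([message])[0]
    return "spam" if label == 1 else "ham"


result = classify("claim your free prize")
# predict_proba returns one row per message: [P(ham), P(spam)]
probabilities = spam_pipeline.predict_proba(["claim your free prize"])
print(result, probabilities.shape)
```

Keeping both steps in one object avoids calling `transform` with a different vectorizer than the one used during training.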
