# Problem Statement

##### In the digital age, email is vital, but spam poses a threat. This project aims to build an accurate email spam filter using logistic regression, ensuring safer and clutter-free inboxes.

### Objectives


* Collect and preprocess a spam email dataset.
* Develop and optimize a robust classification model.
* Evaluate model performance.
* Deploy the filter for enhanced email security and usability.

### Importing Dependencies 

In [1]:
import pandas as pd                                   # data loading and manipulation
import numpy as np                                    # numerical utilities
from sklearn.model_selection import train_test_split  # splits data into training and test sets
from sklearn.linear_model import LogisticRegression   # classifier suited to the binary spam/ham problem
from sklearn.feature_extraction.text import TfidfVectorizer  # converts text into TF-IDF feature vectors
from sklearn.metrics import accuracy_score            # measures model accuracy

### Data Collection & Pre-Processing

In [4]:
#Loading data into a pandas DataFrame from a CSV file
data=pd.read_csv("E://spam.csv",encoding='latin1')
print("Shape of the data", data.shape)

Shape of the data (5572, 5)


In [6]:
#printing the first 5 rows of the dataframe
data.head(5)

Unnamed: 0,Category,Message,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [11]:
#remove the unused columns from the data
data.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1,inplace=True)
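An alternative to dropping the extra columns after loading is to read only the columns of interest. The sketch below uses an in-memory CSV with hypothetical rows that mimic the layout shown above; `usecols` is a standard `pd.read_csv` parameter, but the sample text is an assumption made for illustration.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV that mimics the layout of spam.csv
# (two useful columns followed by three empty "Unnamed" columns).
raw_csv = io.StringIO(
    "Category,Message,Unnamed: 2,Unnamed: 3,Unnamed: 4\n"
    "ham,Ok lar,,,\n"
    "spam,Free entry now,,,\n"
)

# usecols keeps only the listed columns, so no drop() call is needed afterwards.
toy_data = pd.read_csv(raw_csv, usecols=["Category", "Message"])

print(toy_data.columns.tolist())
print(toy_data.shape)
```

Either approach leaves the same two columns; reading selectively simply avoids mutating the dataframe in place.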

In [12]:
data

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [13]:
#printing the first 5 rows of the dataframe
data.head(5)

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [14]:
#printing the last 5 rows of the dataframe
data.tail(5)

Unnamed: 0,Category,Message
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...
5571,ham,Rofl. Its true to its name


In [15]:
#Checking data information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [16]:
#checking whether the data contains any null values
data.isna().sum()

Category    0
Message     0
dtype: int64

## Observation

* The dataset has 5572 rows (emails) and 2 columns.
* ham --> legitimate email; spam --> spam email
* The data doesn't contain any null values.
* Both columns have the object dtype.
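One further check that may be worth making at this point is the class balance, since accuracy can look high on a dataset dominated by one label. The sketch below uses a small hypothetical dataframe with the same column names; the rows are invented for illustration and are not taken from the real file.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the notebook.
toy = pd.DataFrame({
    "Category": ["ham", "ham", "spam", "ham", "ham"],
    "Message": ["Ok lar", "See you soon", "Free entry now", "On my way", "Call me later"],
})

# Absolute count of each label.
counts = toy["Category"].value_counts()

# Share of each label in the whole dataset.
shares = toy["Category"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

Running `data["Category"].value_counts()` on the real dataframe would show how many ham and spam messages it holds.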

## Label Encoding

#### spam mail ------> 0
#### ham/non-spam mail ------> 1

In [17]:
data['Category']=data['Category'].map({'spam':0,'ham':1})

In [18]:
data

Unnamed: 0,Category,Message
0,1,"Go until jurong point, crazy.. Available only ..."
1,1,Ok lar... Joking wif u oni...
2,0,Free entry in 2 a wkly comp to win FA Cup fina...
3,1,U dun say so early hor... U c already then say...
4,1,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,0,This is the 2nd time we have tried 2 contact u...
5568,1,Will Ì_ b going to esplanade fr home?
5569,1,"Pity, * was in mood for that. So...any other s..."
5570,1,The guy did some bitching but I acted like i'd...
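`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a stray label would silently become a missing value. The sketch below shows one way to detect and repair that, assuming a hypothetical label series with one deliberately malformed entry.

```python
import pandas as pd

# Hypothetical labels; the last one has different casing and a trailing space.
labels = pd.Series(["ham", "spam", "ham", "Spam "])

mapping = {"spam": 0, "ham": 1}

# Values missing from the mapping become NaN.
encoded = labels.map(mapping)
unmapped = labels[encoded.isna()]
print(unmapped.tolist())

# Normalising whitespace and case before mapping avoids the problem.
cleaned = labels.str.strip().str.lower().map(mapping)
assert cleaned.notna().all(), "some labels could not be encoded"
cleaned = cleaned.astype(int)
print(cleaned.tolist())
```

In this notebook the `info()` and `isna()` outputs already indicate clean labels, so the check is a safeguard rather than a required step.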


### Separating the data into texts and labels

##### *X = features (Message)*
##### *Y = target (Category)*

In [19]:
X=data['Message']
Y=data['Category']

In [20]:
print(X)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will Ì_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object


In [21]:
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: int64


### Train Test Split

In [23]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=2)
#checking shape of data
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of Y_train:", Y_train.shape)
print("Shape of Y_test:", Y_test.shape)

Shape of X_train: (4457,)
Shape of X_test: (1115,)
Shape of Y_train: (4457,)
Shape of Y_test: (1115,)
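The split above is purely random. If the classes are imbalanced, passing `stratify` to `train_test_split` keeps the spam/ham ratio the same in both subsets. The sketch below demonstrates this on hypothetical toy data (20 messages, 4 of them spam); it is an optional variation, not the split used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 4 spam (0) and 16 ham (1) messages.
X_toy = pd.Series([f"message {i}" for i in range(20)])
Y_toy = pd.Series([0] * 4 + [1] * 16)

# stratify=Y_toy preserves the 20% spam share in both subsets.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_toy, Y_toy, test_size=0.25, random_state=2, stratify=Y_toy
)

print("train label counts:", Y_tr.value_counts().to_dict())
print("test label counts:", Y_te.value_counts().to_dict())
```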


## Feature Extraction 

In [24]:
#transforming text data (Message) into feature vectors that can be used as input to the Logistic Regression

#important parameters
#min_df=1: keep every term that appears in at least 1 document (terms below this document frequency are ignored)
#stop_words='english': ignore common English words such as "this", "that", "was"
#lowercase=True: convert all text to lowercase before tokenizing

feature_extraction= TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

#fitting the vectorizer on the training text only, then reusing the same vocabulary to transform the test text

X_train_features =feature_extraction.fit_transform(X_train)
X_test_features =feature_extraction.transform(X_test)

#converting Y_train and Y_test to int so the model receives integer labels (the column is already int64, so this is a safeguard)

Y_train=Y_train.astype('int')
Y_test=Y_test.astype('int')
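To make the difference between `fit_transform` and `transform` concrete, the sketch below applies the same vectorizer settings to a few hypothetical sentences. Words that appear only in the test text are not part of the learned vocabulary and therefore contribute no features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy messages, invented for illustration.
train_texts = [
    "Free entry to win a prize",
    "Are you coming home today",
    "Win a free prize today",
]
test_texts = ["Claim your free voucher"]

vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)

# fit_transform learns the vocabulary and IDF weights from the training text.
train_matrix = vectorizer.fit_transform(train_texts)

# transform reuses that vocabulary; unseen words such as "voucher" are dropped.
test_matrix = vectorizer.transform(test_texts)

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Both matrices have the same number of columns, which is what allows a model trained on one to make predictions on the other.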


### Training Model

#### features:
* X_train_features
* X_test_features

#### Target:
* Y_train
* Y_test

In [25]:
#creating model variable 

model = LogisticRegression()

In [26]:
#Training the model with the training data
model.fit(X_train_features, Y_train)                
#X_train_features holds the training messages as numerical vectors; Y_train holds the corresponding labels

LogisticRegression()
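The vectorizer and the classifier can also be bundled into a single scikit-learn `Pipeline`, which guarantees that the same text transformation is applied at training and prediction time. This is a minimal sketch on hypothetical toy messages; the step names `"tfidf"` and `"clf"` are arbitrary choices, and the notebook itself keeps the two steps separate.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy messages with the notebook's encoding: 0 = spam, 1 = ham.
texts = [
    "Free prize waiting for you, claim now",
    "Win cash today, free entry",
    "Urgent: claim your free reward",
    "Are we still meeting for lunch",
    "I will call you after work",
    "See you at home tonight",
]
labels = [0, 0, 0, 1, 1, 1]

spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)),
    ("clf", LogisticRegression()),
])

# fit() vectorizes the raw text and trains the classifier in one call.
spam_pipeline.fit(texts, labels)

# predict() accepts raw strings; the pipeline handles the transformation.
predictions = spam_pipeline.predict(["free prize waiting"])
print(predictions)
```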

### Evaluating the Trained Model

In [27]:
#Prediction on Training Data

pred_on_training_data = model.predict(X_train_features)

#comparing the predicted values with the actual values
accuracy_on_training_data = accuracy_score(Y_train, pred_on_training_data)

In [28]:
print("Accuracy on Training Data: {:.2%}".format(accuracy_on_training_data))

Accuracy on Training Data: 97.17%


In [29]:
#prediction on test data

pred_on_test_data = model.predict(X_test_features)

#checking accuracy on test data

accuracy_on_test_data = accuracy_score(Y_test, pred_on_test_data)

In [30]:
print("Accuracy on Test Data: {:.2%}".format(accuracy_on_test_data))

Accuracy on Test Data: 95.61%
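Accuracy alone can hide weak spam detection when ham dominates the data. The sketch below uses hypothetical labels and a deliberately naive prediction (everything classified as ham) to show how a confusion matrix and the recall for the spam class expose this; the numbers are illustrative and are not results from this notebook's model.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels with the notebook's encoding: 0 = spam, 1 = ham.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# A naive "model" that labels every message as ham.
y_pred = [1] * 10

acc = accuracy_score(y_true, y_pred)

# Rows are actual classes, columns are predicted classes, ordered [spam, ham].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Fraction of actual spam messages that were caught.
spam_recall = recall_score(y_true, y_pred, pos_label=0)

print("accuracy:", acc)
print(cm)
print("spam recall:", spam_recall)
```

Applying the same calls to `Y_test` and the test predictions would show how many spam messages the trained model actually misses.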


### Building Predictive System

In [31]:
input_mail = ["Enter your email in double quotation marks"]
#converting the text to a feature vector using the fitted TfidfVectorizer

input_mail_features = feature_extraction.transform(input_mail)


#predicting on the input_mail
prediction = model.predict(input_mail_features)


#if/else statement to print the answer: 0 --> spam | 1 --> ham
if prediction[0] == 0:
    print("The mail is spam")
else:
    print("The mail is Ham")

The mail is Ham
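The prediction steps above can be wrapped in a small reusable function. The sketch below is one possible shape for it; `classify_messages` is a hypothetical helper name, and the toy vectorizer and model are fitted on invented messages only so that the example runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def classify_messages(messages, vectorizer, model):
    """Return 'spam' or 'ham' for each message (0 = spam, 1 = ham)."""
    features = vectorizer.transform(messages)
    predictions = model.predict(features)
    return ["spam" if label == 0 else "ham" for label in predictions]


# Hypothetical toy training data, standing in for the fitted notebook objects.
toy_texts = [
    "Free prize, claim now",
    "Win free cash today",
    "Lunch at noon tomorrow",
    "Call me when you get home",
]
toy_labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

results = classify_messages(
    ["Claim your free prize", "See you at lunch"], toy_vectorizer, toy_model
)
print(results)
```

In the notebook, the same function could be called with `feature_extraction` and `model` in place of the toy objects.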


# Conclusion

This project used Logistic Regression to build a spam email prediction system. After data preprocessing and TF-IDF vectorization, the model reached 97.17% accuracy on the training data and 95.61% on the test data. The small gap between the two figures suggests the model is not heavily overfitting. Such a filter can improve email security and organization by flagging unwanted and potentially harmful emails.