## NAME - ***KANISHKA KANOJIA***
## DATA SCIENCE INTERNSHIP @ OASIS INFOBYTE

## **TASK 1**

# PROJECT NAME - **EMAIL SPAM DETECTION WITH MACHINE LEARNING**

In [None]:
from IPython.display import Image
Image(url='https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRMTZzwzqcPlQXfaktaySc3cTBRbIRuQZ7pMQ&usqp=CAU', width=650)

# **Github Link**
https://github.com/kanishkanojia/OIBSIP-Oasis-Infobyte

## Problem Statement

We have all received spam emails. Spam mail, or junk mail, is email
that is sent to a massive number of users at one time, frequently containing cryptic
messages, scams, or, most dangerously, phishing content.



In this project, we use Python to build an email spam detector, then use machine learning to
train it to classify emails as spam or non-spam. Let’s get
started!

## Importing the Libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# convert text into feature vector or numeric values
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## Data Collection & Pre-Processing

In [None]:
# load the data from the CSV file into a pandas DataFrame
raw_data = pd.read_csv('/content/spam.csv', encoding='latin')

In [None]:
print(raw_data)

In [None]:
# printing the first 5 rows of the dataframe
raw_data.head()

In [None]:
raw_data.tail()

In [None]:
# checking the number of rows and columns in the dataframe
raw_data.shape

In [None]:
raw_data.columns

In [None]:
raw_data.info()

In [None]:
raw_data.duplicated().sum()

In [None]:
raw_data.nunique()

In [None]:
raw_data.isnull().sum()

In [None]:
# replace the null values with an empty string,
# creating a new DataFrame
mail_data = raw_data.where(pd.notnull(raw_data), '')

ham -> non spam mail \
spam -> spam mail

In [None]:
# rename the columns
mail_data=mail_data.rename(columns={'v1':'Category',
                                    'v2':'Message'})

## Label Encoding

Label encoding is a process of converting categorical or textual data into numerical labels. In machine learning and data analysis, many algorithms require numerical inputs, and label encoding provides a way to represent categorical data in a format that can be easily understood by these algorithms.
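
As a rough illustration of the idea, the sketch below encodes a handful of made-up labels with an explicit dictionary and `Series.map`. The toy labels and the names `labels`, `label_map`, and `encoded` are assumptions for this example only; the mapping follows the spam = 0, ham = 1 convention used in this notebook.

```python
import pandas as pd

# Toy labels, invented for illustration (not read from spam.csv)
labels = pd.Series(['ham', 'spam', 'ham', 'ham', 'spam'])

# Explicit mapping: spam -> 0, ham -> 1
label_map = {'spam': 0, 'ham': 1}

# map() looks up each label in the dictionary; astype('int') gives an integer column
encoded = labels.map(label_map).astype('int')

print(encoded.tolist())  # [1, 0, 1, 1, 0]
```

An explicit dictionary keeps the label-to-number assignment visible, which makes it harder to mix up which class is 0 and which is 1 later on.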

In [None]:
# label spam mail as 0 and ham mail as 1

mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1

spam  -  0

ham  -  1

In [None]:
# separate the data into message texts (X) and labels (Y)

X = mail_data['Message']

Y = mail_data['Category']

In [None]:
print(X)

In [None]:
print(Y)

## Splitting the data into training data & test data

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)

In [None]:
print(X.shape)
print(X_train.shape)
print(X_test.shape)

## Feature Extraction

Feature extraction converts text data into numerical values. A logistic regression model cannot work with raw strings, so we first need to convert all of the message text into meaningful numerical features.

In [None]:
# transform the text data to feature vectors that can be used as input to the Logistic regression

feature_extraction = TfidfVectorizer(min_df = 1, stop_words='english', lowercase=True)

X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

# convert Y_train and Y_test values to integers

Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')

In [None]:
Y_train

In [None]:
Y_test

In [None]:
print(X_train)

In [None]:
print(X_train_features)

## Training the Model

## Logistic Regression

In [None]:
model = LogisticRegression()

In [None]:
# training the Logistic Regression model with the training data
model.fit(X_train_features, Y_train)

## Evaluating the trained model

In [None]:
# prediction on training data

prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)

In [None]:
print('Accuracy on training data : ', accuracy_on_training_data)

In [None]:
# prediction on test data

prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)

In [None]:
print('Accuracy on test data : ', accuracy_on_test_data)
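
Accuracy alone can look high when one class is much more common than the other, so it may be worth checking per-class metrics as well. The sketch below uses invented labels (`y_true`, `y_pred` are made up, with 0 = spam and 1 = ham as in this notebook) rather than the notebook's actual predictions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up labels: 8 ham messages (1) and 2 spam messages (0)
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# Hypothetical predictions that miss one of the two spam messages
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order [spam, ham]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Treat spam (0) as the class of interest
spam_recall = recall_score(y_true, y_pred, pos_label=0)
spam_precision = precision_score(y_true, y_pred, pos_label=0)

print(acc)             # 0.9
print(cm.tolist())     # [[1, 1], [0, 8]]
print(spam_recall)     # 0.5
print(spam_precision)  # 1.0
```

Here accuracy is 0.9 even though half of the spam messages slip through, which is the kind of gap that recall on the spam class exposes.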

## Building a Predictive System

In [None]:
input_mail = ["I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times"]

# convert text to feature vectors
input_data_features = feature_extraction.transform(input_mail)

# making prediction

prediction = model.predict(input_data_features)
print(prediction)


if prediction[0] == 1:
    print('Ham mail')
else:
    print('Spam mail')
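
The prediction steps above could be wrapped in a small helper so that any message can be classified with one call. This is only a sketch: `classify_mail` is a hypothetical function name, and the toy messages, labels, `toy_vectorizer`, and `toy_model` exist just to make the example self-contained. In the notebook you would pass `feature_extraction` and `model` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def classify_mail(text, vectorizer, model):
    """Return 'Ham mail' or 'Spam mail' for a single message string."""
    # The vectorizer expects an iterable of documents, so wrap the text in a list
    features = vectorizer.transform([text])
    # Label convention from this notebook: 1 = ham, 0 = spam
    return 'Ham mail' if model.predict(features)[0] == 1 else 'Spam mail'


# Toy training data, invented for illustration only
toy_messages = [
    'win a free prize now',
    'are we meeting for lunch today',
    'free entry to win cash',
    'see you at the office tomorrow',
]
toy_labels = [0, 1, 0, 1]

toy_vectorizer = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

result = classify_mail('lunch at the office today', toy_vectorizer, toy_model)
print(result)
```

Passing the vectorizer and model as arguments keeps the helper independent of global variable names and makes it easy to reuse with a retrained model.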