### The Dataset

The dataset we are going to use is an open-source dataset available on Kaggle.

**About the dataset**   
The dataset contains three columns. It is around 5.65 MB in size and has around 5,000 rows in total.

**Columns**  
- `label`: ham or spam
- `text`: the text of the email
- `label_num`: 0 for ham and 1 for spam

**Task**  
Our task is to create a machine learning model that can accurately predict whether an email is spam or not.

### Creating the model using ML and NLP

**Importing necessary libraries**

In [None]:
import numpy as np ## scientific computation
import pandas as pd ## loading the dataset file
import matplotlib.pyplot as plt ## visualization
import nltk  ## preprocessing our text
from nltk.corpus import stopwords ## removing stop words
from nltk.stem.porter import PorterStemmer ## stemming of words

nltk.download("stopwords", quiet=True) ## fetch the stop-word list if it is not already installed

### Load the dataset 

In [None]:
df = pd.read_csv("spam_ham_dataset.csv")
df.head()

### EDA on Dataset

In [None]:
print(df.shape)  ### Returns the shape of the data as (rows, columns)
print("**************************************")
print(df.size)   ### Returns the total number of cells (rows * columns)
print("**************************************")
print(df.isna().sum())  ### Returns the number of NA values in each column
print("**************************************")
df.info()  ### Prints a concise summary of the DataFrame


We can see that there are no null values in our dataset. Also note that only one column has numerical values, so that is the only column we can visualize.

**Let’s Visualize the Column label_num**  
The only thing to visualize here is the count of each of the two categories in the column.

In [None]:
df["label_num"].value_counts().plot(kind="bar",figsize=(12,6))
plt.xticks(np.arange(2), ('Non spam', 'spam'),rotation=0);

In the plot, we can see that almost 3,500 messages are not spam and around 1,500 are spam.
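The counts and class shares can also be computed directly instead of being estimated from the bar chart. The following is a minimal sketch on a made-up `label_num` column (the ten toy values are an assumption for illustration, not taken from the dataset); on the real data the same calls would be applied to `df["label_num"]`.

```python
import pandas as pd

# Toy stand-in for df["label_num"]: 0 = ham, 1 = spam (illustrative values only)
label_num = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 1], name="label_num")

counts = label_num.value_counts()                # absolute count per class
shares = label_num.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```

An imbalance between the classes is worth keeping in mind when reading an accuracy score, since a model that always predicts the majority class already reaches that class's share.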

### Cleaning The Text  

In any NLP problem, the most important step is to clean the text. Cleaning text means removing punctuation, removing stopwords, performing stemming or lemmatization, and finally converting the text into vectors.

In [None]:
import re
corpus = []
length = len(df)
for i in range(0,length):
    text = re.sub("[^a-zA-Z0-9]"," ",df["text"][i])
    text = text.lower()
    text = text.split()
    pe = PorterStemmer()
    stopword = set(stopwords.words("english"))
    text = [pe.stem(word) for word in text if word not in stopword]
    text = " ".join(text)
    corpus.append(text)
print(corpus)

**Explanation:**  

- **Line 1:** Importing the `re` library, which is used to work with regular expressions in Python.
- **Line 2:** Defining an empty `corpus` list that will store all the text after cleaning.
- **Line 3:** Initializing the variable `length` with the number of rows in the data frame.
- **Line 4:** Running a loop from 0 to the length of our data frame.
- **Line 5:** Replacing every character other than lowercase letters, uppercase letters, and digits with a space.
- **Line 6:** Converting the text to lowercase.
- **Line 7:** Splitting the text on whitespace into a list of words.
- **Line 8:** Creating a Porter stemmer object.
- **Line 9:** Storing the English stopwords in the variable `stopword`.
- **Line 10:** Looping over each word in the text: if the word is not a stopword, we stem it and add it to the list.
- **Line 11:** Joining the remaining words back into a single sentence.
- **Line 12:** Appending the sentence to the `corpus` list.
- **Line 13:** Printing the `corpus` list.
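To see what these steps do to a single message, here is a minimal sketch that applies the regex, lowercase, split, and stopword steps to one made-up string. The sample text, the `clean_text` helper, and the tiny `TOY_STOPWORDS` set are assumptions for illustration only. Stemming is deliberately left out so the sketch runs without NLTK; in the notebook, `PorterStemmer` would additionally reduce each remaining word to its stem.

```python
import re

# Hypothetical, tiny stop-word set used only for this sketch;
# the notebook uses the full NLTK English list instead.
TOY_STOPWORDS = {"the", "a", "to", "you", "have", "now"}

def clean_text(raw):
    """Apply the regex, lowercase, split and stop-word steps to one string."""
    text = re.sub("[^a-zA-Z0-9]", " ", raw)  # non-alphanumerics become spaces
    words = text.lower().split()              # lowercase, then split on whitespace
    kept = [w for w in words if w not in TOY_STOPWORDS]
    return " ".join(kept)

cleaned = clean_text("Subject: You have WON a $1000 prize, claim now!!!")
print(cleaned)  # subject won 1000 prize claim
```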

The next step in the cleaning process is to convert the list of sentences (`corpus`) into vectors so that we can feed the data into our machine learning model. To do this we use a bag-of-words model, which represents each text as a vector of word counts.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=35000)
X = cv.fit_transform(corpus).toarray()

y = df['label_num']

- **Line 1:** Importing `CountVectorizer` from sklearn.
- **Line 2:** Creating a count vectorizer object with `max_features=35000`, which keeps only the 35,000 most frequent terms in the corpus as columns.
- **Line 3:** Fitting the vectorizer on our corpus and transforming the corpus into vectors.
- **Line 5:** Taking the `label_num` column as the target `y`.
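To make the bag-of-words representation concrete, here is a small sketch on a three-message toy corpus (the messages and the `toy_` names are made up for illustration). Each column corresponds to one term and each cell holds how often that term occurs in the message.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus of already-cleaned messages (illustrative only)
toy_corpus = ["free prize win", "meeting schedule monday", "win free free"]

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each term to its column index; sorting by index gives the column order
toy_terms = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(toy_terms)
print(toy_X)
```

The last row contains a 2 in the column for "free", which shows that the values are counts rather than 0/1 flags. If presence/absence flags are preferred, `CountVectorizer(binary=True)` produces them.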

### Modeling and Training
We split the data into training and test sets using train_test_split, holding out 20% of the rows for testing.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
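As a quick illustration of what the split returns, the sketch below runs `train_test_split` on a made-up array of 10 samples. The toy data and the `stratify` argument are assumptions added here, not part of the cell above; `stratify` asks scikit-learn to keep the ham/spam ratio similar in both parts, which can be useful when the classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 3 features each, 6 ham (0) and 4 spam (1)
toy_X = np.arange(30).reshape(10, 3)
toy_y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# stratify=toy_y is an optional addition, not used in the notebook cell
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.20, random_state=0, stratify=toy_y
)

print(X_tr.shape, X_te.shape)  # 8 training rows and 2 test rows
```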

### Creating a model using Multinomial Naive Bayes


In [None]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()

### Fitting the model to the training set

In [None]:
model.fit(X_train, y_train)

### Prediction

In [None]:
y_pred=model.predict(X_test)
y_pred
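The fit and predict steps can be tried in isolation on a tiny word-count matrix. The sketch below uses made-up counts for four hypothetical terms (the data and the `toy_` names are assumptions for illustration); `predict` returns the most likely class per row, and `predict_proba` returns the estimated probability of each class.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix; columns could stand for ["free", "win", "meeting", "report"]
toy_X = np.array([
    [2, 1, 0, 0],  # spam
    [3, 0, 0, 0],  # spam
    [0, 0, 2, 1],  # ham
    [0, 0, 1, 3],  # ham
])
toy_y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

toy_model = MultinomialNB()
toy_model.fit(toy_X, toy_y)

# Two unseen messages: one uses the "spam" terms, the other the "ham" terms
new_docs = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])
toy_pred = toy_model.predict(new_docs)
toy_proba = toy_model.predict_proba(new_docs)  # columns follow toy_model.classes_

print(toy_pred)
print(toy_proba)
```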

### Evaluating Model
We are going to evaluate our model using the confusion matrix and the accuracy score.

In [None]:
from sklearn.metrics import confusion_matrix,accuracy_score
cm = confusion_matrix(y_test, y_pred)
score = accuracy_score(y_test,y_pred)
print(cm,score*100)
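To help read the confusion matrix, here is a small sketch on made-up labels (the toy values are an assumption, not results from the model above). In scikit-learn the rows of the matrix are the true classes and the columns are the predicted classes, so for labels 0 and 1 the flattened matrix is ordered as true negatives, false positives, false negatives, true positives. Precision and recall are included as optional extras because accuracy alone can look good on imbalanced data.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Toy labels: 1 = spam, 0 = ham (illustrative values only)
toy_true = [0, 0, 0, 0, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

toy_accuracy = accuracy_score(toy_true, toy_pred)    # (tp + tn) / all samples
toy_precision = precision_score(toy_true, toy_pred)  # tp / (tp + fp)
toy_recall = recall_score(toy_true, toy_pred)        # tp / (tp + fn)

print(tn, fp, fn, tp)
print(toy_accuracy, toy_precision, toy_recall)
```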