<a href="https://colab.research.google.com/github/josmyrose/Webscraping/blob/main/Spam_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**The business context**

The ABC company recieves several messages over their company mail id.They want to classify whether the messages are spam or not.

The company appointed an analyst to develop a model for classifying the message as spam or not.



Import necessary libraries

In [109]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import nltk
from nltk import word_tokenize # for tokenizing
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
from imblearn.over_sampling import RandomOverSampler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score


from sklearn.feature_extraction.text import TfidfVectorizer

  



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**Reading and analysing the dataset**

In [83]:
df=pd.read_csv("/content/drive/MyDrive/spam_classification", sep="\t",header=None)

In [84]:
df.head()

Unnamed: 0,0,1
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Renaming the dataset columns

In [85]:
df.rename(columns={0: 'label', 1: 'text'}, inplace=True)

In [86]:
print("The size of the dataset",df.shape)
print("The Datatype of columns\n {} of the dataset\n".format(df.dtypes))
print("values of different categories\n",df['label'].value_counts())#for getting the count of each item in the product table.We can understand that the product column has five types of values.
print("Number of null values in the dataset\n",df.isnull().sum())

The size of the dataset (5572, 2)
The Datatype of columns
 label    object
text     object
dtype: object of the dataset

values of different categories
 ham     4825
spam     747
Name: label, dtype: int64
Number of null values in the dataset
 label    0
text     0
dtype: int64


From the above result,we can understand that dataset is imbalanced(ham      4825,
spam     747)
To balance the dataset,use random

**Calculate the length of each data sample**
Add a column to the dataset,to get the length of text in the dataset

In [87]:
word_l=[]
for x in df['text']:
  word_l.append(len(x))


In [88]:
df['word_l']=word_l

In [89]:
df.head()

Unnamed: 0,label,text,word_l
0,ham,"Go until jurong point, crazy.. Available only ...",111
1,ham,Ok lar... Joking wif u oni...,29
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,ham,U dun say so early hor... U c already then say...,49
4,ham,"Nah I don't think he goes to usf, he lives aro...",61


**TEXT CLEANING**

In [90]:
#Another way to find the length of text 
df['word_l']=df['text'].apply(lambda x: len(x))



In [91]:
df.head()

Unnamed: 0,label,text,word_l
0,ham,"Go until jurong point, crazy.. Available only ...",111
1,ham,Ok lar... Joking wif u oni...,29
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,ham,U dun say so early hor... U c already then say...,49
4,ham,"Nah I don't think he goes to usf, he lives aro...",61


**Text Cleaning steps:**


1.Make the text lowercase. As you probably know, NLP is case-sensitive.
2.	Remove line breaks. Again, depending on your source, you might have encoded line breaks.
3.	Remove punctuation. This is using the string library. Other punctuation can be added as needed.
4.	Remove stop words using the NLTK library. There is a list in the next line to add additional stop words to the function as needed. These might be noisy domain words or anything else that makes the contextless clear.
5.	Removing numbers. Optional depending on your data.
6.	Stemming or Lemmatization. This process is an argument in the function. You can choose either one via with Stem or Lem. The default is to use none.



It includes removing unnecessary words, punctuation, stop words, white spaces, and unnecessary symbols from the text dataset.

In [92]:
#Make the text lowercase
def lower_case(text): #user defined function for converting upper case text to lower case
  return text.lower()

In [94]:
df['text']=df['text'].apply(lambda x: lower_case(x))

In [95]:
df.head(5)

Unnamed: 0,label,text,word_l
0,ham,"go until jurong point, crazy.. available only ...",111
1,ham,ok lar... joking wif u oni...,29
2,spam,free entry in 2 a wkly comp to win fa cup fina...,155
3,ham,u dun say so early hor... u c already then say...,49
4,ham,"nah i don't think he goes to usf, he lives aro...",61


In [96]:
# Remove punctuation

def remove_punctuation(text):
    return text.translate(str.maketrans('', ''))

In [97]:
df['text'] = df['text'].apply(lambda x: remove_punctuation(x))

Stopwords


Stopwords are commonly used words that are usually filtered out in natural language processing (NLP) tasks because they do not add much meaning to the text. Examples of stopwords include "a", "an", "the", "and", "or", "in", "on", "at", "is", "are", "was", "were", etc.

Removing stopwords can help in reducing the dimensionality of the text data and improving the efficiency and accuracy of NLP models. This is because the presence or absence of stopwords is often not very informative in determining the sentiment, topic, or intent of the text.

There are several ways to remove stopwords in NLP tasks. One way is to use pre-defined lists of stopwords that are available in popular NLP libraries such as NLTK, spaCy, and scikit-learn. These lists can be customized by adding or removing stopwords as per the specific use case.

In [98]:
label_map = {
    'ham': 0,
    'spam': 1,
}

In [99]:
df['label'] = df['label'].map(label_map)

In [100]:
#Remove stopwords
def remove_stopwords(text):
  stop_words = set(stopwords.words('english'))
  removed=[]
  tokens = word_tokenize(text)
  print("\nTokens {} in text {}".format(tokens,text))
  for i in range(len(tokens)):
    if tokens[i] not in stop_words:
      removed.append(tokens[i])
    return " ".join(removed)

In [101]:
df['text'] = df['text'].apply(lambda x: remove_stopwords(x))

[1;30;43mStreaming output truncated to the last 5000 lines.[0m

Tokens ['ok', 'lor', '...', 'but', 'buy', 'wat', '?'] in text ok lor... but buy wat?

Tokens ['somebody', 'should', 'go', 'to', 'andros', 'and', 'steal', 'ice'] in text somebody should go to andros and steal ice

Tokens ['don', 'know', '.', 'i', "did't", 'msg', 'him', 'recently', '.'] in text don know. i did't msg him recently.

Tokens ['take', 'us', 'out', 'shopping', 'and', 'mark', 'will', 'distract', 'isaiah.=d'] in text take us out shopping and mark will distract isaiah.=d

Tokens ['mum', ',', 'hope', 'you', 'are', 'having', 'a', 'great', 'day', '.', 'hoping', 'this', 'text', 'meets', 'you', 'well', 'and', 'full', 'of', 'life', '.', 'have', 'a', 'great', 'day', '.', 'abiola'] in text mum, hope you are having a great day. hoping this text meets you well and full of life. have a great day. abiola

Tokens ['there', 'is', 'no', 'sense', 'in', 'my', 'foot', 'and', 'penis', '.'] in text there is no sense in my foot and pen

In [102]:
df['text'].replace("","")

0         go
1         ok
2       free
3          u
4        nah
        ... 
5567        
5568        
5569    pity
5570        
5571    rofl
Name: text, Length: 5572, dtype: object

In [103]:
df.tail(10)

Unnamed: 0,label,text,word_l
5562,0,ok,96
5563,0,ard,19
5564,0,,67
5565,0,huh,12
5566,1,reminder,147
5567,1,,160
5568,0,,36
5569,0,pity,57
5570,0,,125
5571,0,rofl,26


Lemmatizing or Stemm function






Implementing text vectorization

It converts the raw text into a format the NLP model can understand and use. Vectorization will create a numerical representation of the text strings called a sparse matrix or word vectors. The model works with numbers and not raw text. We will use TfidfVectorizer to create the sparse matri

In [104]:
#Vectorization
tf_wb= TfidfVectorizer()


We then apply the initialized method to the text column so that it can transform the text strings into a sparse matrix.

In [105]:
X_tf = tf_wb.fit_transform(df['text'])

Converting the sparse matrix into an array


In [111]:
X_tf = X_tf.toarray()


AttributeError: ignored

Splitting the vectorized dataset


In [112]:
X_train_tf, X_test_tf, y_train_tf, y_test_tf = train_test_split(X_tf, df['label'].values, test_size=0.3)


After splitting the dataset, we will use the Counter module to check the number of data samples in the majority and minority classes. We import the module as follows:

In [113]:
from collections import Counter


In [114]:
Counter(y_train_tf)


Counter({0: 3355, 1: 545})

In [115]:
ROS = RandomOverSampler(sampling_strategy=1)
X_train_ros, y_train_ros = ROS.fit_resample(X_train_tf, y_train_tf)



In [116]:
Counter(y_train_ros)


Counter({0: 3355, 1: 3355})

Using the balanced dataset to build the same model

In [117]:
nb = GaussianNB()
nb.fit(X_train_tf, y_train_tf)



GaussianNB()

In [118]:
NB_pred= nb.predict(X_test_tf)
print(NB_pred)

[1 0 1 ... 1 1 1]


**Getting accuracy score of these predictions**

In [119]:
print(accuracy_score(y_test_tf, NB_pred))


0.4072966507177033
