# Case Study 5 - NLP Classifier

### SPAM Dataset
The dataset contains 5572 emails labeled as spam or ham: 4825 are ham (non-spam) and 747 are spam. We need to build an NLP classifier that specifically uses word2vec from Google. Split the dataset 80/20 into training and test sets and build three types of models:
1. CBOW
2. Skipgram
3. Pretrained word2vec model from Google



In [None]:
!python -m spacy download en_core_web_lg

2022-10-09 04:17:07.984521: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-lg==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.0/en_core_web_lg-3.4.0-py3-none-any.whl (587.7 MB)
     |████████████████████████████████| 587.7 MB 10 kB/s 
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [None]:
# Importing supporting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as pyplot

In [None]:
# Importing Word2Vec
from gensim.models import Word2Vec as wtv
# Importing Keyed Vectors
from gensim.models import KeyedVectors

In [None]:
# Importing PCA
from sklearn.decomposition import PCA
# Import Label Encoder
from sklearn.preprocessing import LabelEncoder
# Import Train Test Splitting 
from sklearn.model_selection import train_test_split
# Build a text classification model using SVM
from sklearn.svm import SVC
# Check its accuracy
from sklearn.metrics import accuracy_score

In [None]:
# Reading dataset
df = pd.read_csv('/content/spam.csv', encoding='ISO-8859-1')
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [None]:
df['v1'].value_counts()

ham     4825
spam     747
Name: v1, dtype: int64
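
The counts above show a strong class imbalance, which sets a floor for any accuracy figure. As a minimal sketch in plain Python, using only the two counts printed above, the accuracy of a trivial classifier that always predicts `ham` would be:

```python
# Class counts taken from the value_counts() output above.
ham_count = 4825
spam_count = 747

total = ham_count + spam_count
# Accuracy of a trivial "always predict ham" classifier on the full dataset.
majority_baseline = ham_count / total

print(f"total messages: {total}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

An accuracy score is only informative to the extent that it exceeds this baseline, which is roughly 0.866 for this dataset.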

## Initial Preprocessing

In [None]:
# Checking for missing values
df.isna().sum()

v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64

In [None]:
df['Unnamed: 2'].value_counts()

 bt not his girlfrnd... G o o d n i g h t . . .@"                                                                                                   3
 PO Box 5249                                                                                                                                        2
this wont even start........ Datz confidence.."                                                                                                     2
GN                                                                                                                                                  2
 don't miss ur best life for anything... Gud nyt..."                                                                                                2
 but dont try to prove it..\" .Gud noon...."                                                                                                        2
 Gud night...."                                                                                     

In [None]:
df['Unnamed: 3'].value_counts()

 MK17 92H. 450Ppw 16"                         2
GE                                            2
 why to miss them                             1
U NO THECD ISV.IMPORTANT TOME 4 2MORO\""      1
i wil tolerat.bcs ur my someone..... But      1
 ILLSPEAK 2 U2MORO WEN IM NOT ASLEEP...\""    1
whoever is the KING\"!... Gud nyt"            1
 TX 4 FONIN HON                               1
 \"OH No! COMPETITION\". Who knew             1
IåÕL CALL U\""                                1
Name: Unnamed: 3, dtype: int64

In [None]:
df['Unnamed: 4'].value_counts()

GNT:-)"                                                     2
 just Keep-in-touch\" gdeve.."                              1
 Never comfort me with a lie\" gud ni8 and sweet dreams"    1
 CALL 2MWEN IM BK FRMCLOUD 9! J X\""                        1
 one day these two will become FREINDS FOREVER!"            1
Name: Unnamed: 4, dtype: int64

In [None]:
# Each of the 3 Unnamed columns is non-null in fewer than 1% of rows (50, 12 and 6 of 5572), so we drop them
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)
df.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
df.rename({'v1':'Category', 'v2':'Text'}, axis=1, inplace=True)

## Preprocessing Using spaCy

In [None]:
# Importing Spacy
import spacy
nlp = spacy.load('en_core_web_lg')

In [None]:
def spacy_process(text):
    filtered = []
    doc = nlp(text)
    for token in doc:
        if token.is_stop or token.is_punct or token.is_space:
            continue
        if token.has_vector:
            filtered.append(token.lemma_)
    return " ".join(filtered)

In [None]:
df['spacytext'] = df['Text'].apply(spacy_process)

In [None]:
df['Text'][0]

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [None]:
df['spacytext']

0       point crazy available n great world la e buffe...
1                                   ok lar joke wif u oni
2       free entry 2 comp win FA Cup final 21st 2005 t...
3                                     U dun early hor u c
4                                       nah think go live
                              ...                        
5567    2nd time try 2 contact u. U win Pound prize 2 ...
5568                             Ì b go esplanade fr home
5569                                 pity mood suggestion
5570     guy bitch act like interested buy week give free
5571                                                 true
Name: spacytext, Length: 5572, dtype: object

In [None]:
token_spacy = pd.Series(df['spacytext'].values)

## Preprocessing Using simple_preprocess From Gensim

In [None]:
# Importing simple_preprocess
from gensim.utils import simple_preprocess

In [None]:
# preprocess all the messages in the data set
df['simpletext'] = df['Text'].apply(simple_preprocess)

In [None]:
df['Text'][0]

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [None]:
df['simpletext'][0]

['go',
 'until',
 'jurong',
 'point',
 'crazy',
 'available',
 'only',
 'in',
 'bugis',
 'great',
 'world',
 'la',
 'buffet',
 'cine',
 'there',
 'got',
 'amore',
 'wat']
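
The single-letter tokens `n` and `e` are missing from the output above because `simple_preprocess` drops very short tokens by default. The sketch below is a rough standard-library stand-in for that default behaviour, not gensim's actual code; the name `approx_simple_preprocess` and the regular expression are assumptions made for illustration.

```python
import re

# Rough approximation of simple_preprocess defaults (an assumption, not the
# library's implementation): lowercase, keep runs of non-digit word
# characters, and drop tokens shorter than 2 or longer than 15 characters.
WORD = re.compile(r"(?:(?!\d)\w)+")

def approx_simple_preprocess(text, min_len=2, max_len=15):
    tokens = WORD.findall(text.lower())
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]

sample = "Go until jurong point, crazy.. Available only in bugis n great world la e buffet..."
result = approx_simple_preprocess(sample)
print(result)
```

Short tokens such as `u`, `r` or `c` carry meaning in SMS-style text, so this length filter is worth keeping in mind when comparing the three preprocessing variants.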

In [None]:
tokens_simple = pd.Series(df.simpletext.values)

## Preprocessing Using NLTK 


In [None]:
# Loading NLTK
import nltk
# Import regular expression
import re
# Import string
import string
# Import beautiful soup
from bs4 import BeautifulSoup
# Import Stopwords
from nltk.corpus import stopwords
# Importing WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

In [None]:
# downloading punkt
nltk.download('punkt')

# downloading stopwords
nltk.download('stopwords')

# downloading wordnet
nltk.download('wordnet')

# downloading omw-1.4
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
# Stripping HTML tags
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

# Removing text between square brackets (raw string avoids invalid escape warnings)
def remove_between_square_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)

# Converting to lower
def to_lower(text):
    return text.lower()

In [None]:
# Define function for removing special characters and expanding contractions
def remove_special_characters(text, remove_digits=True):
    # NOTE: remove_digits is accepted but never used.
    # NOTE: the range A-z also covers [ \ ] ^ _ and `; A-Z was probably intended.
    pattern=r'[^a-zA-z0-9\s]'
    # This first pass already deletes apostrophes and nearly all punctuation,
    # so the contraction rules below have nothing left to match.
    text=re.sub(pattern,'',text)
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"I'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\^^", "", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    return text
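
To see what the first substitution in this function actually keeps, the short sketch below runs the character class as written next to a variant that uses `A-Z`. The sample string is made up for illustration.

```python
import re

sample = "snake_case [tag] 2^3 don't"

# Pattern as written in the cell above: the range A-z spans ASCII 65..122,
# which also covers the characters [ \ ] ^ _ and `.
as_written = re.sub(r"[^a-zA-z0-9\s]", "", sample)

# Variant with the range A-Z: only letters, digits and whitespace survive.
letters_only = re.sub(r"[^a-zA-Z0-9\s]", "", sample)

print(as_written)    # snake_case [tag] 2^3 dont
print(letters_only)  # snakecase tag 23 dont
```

In both cases the apostrophe is deleted before any contraction rule runs, so `don't` ends up as `dont` rather than `do not`.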

In [None]:
# Tokenizing Text
def simple_tokenize(text):
    return nltk.word_tokenize(text)

In [None]:
#Lemmatizing the text
def simple_lemmatizer(token_list):
    wlemma = WordNetLemmatizer()
    return [wlemma.lemmatize(token) for token in token_list]

In [None]:
# Removing Punctuation
def remove_punct(token_list):
    return [token for token in token_list if token not in string.punctuation]

In [None]:
# Stopwords 
stop_words = stopwords.words('english')
# Removing Stopwords
def remove_stopwords(token_list):
    return [token for token in token_list if token not in stop_words]

In [None]:
# NLTK Preprocessor
def nltk_preprocess(text):
    text = to_lower(text)
    text = remove_special_characters(text)
    text = simple_tokenize(text)
    text = remove_punct(text)
    text = remove_stopwords(text)
    text = simple_lemmatizer(text)
    return text

In [None]:
df['nltktext'] = df['Text'].apply(nltk_preprocess)

In [None]:
df['Text'][0]

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [None]:
df['nltktext'][0]

['go',
 'jurong',
 'point',
 'crazy',
 'available',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet',
 'cine',
 'got',
 'amore',
 'wat']

In [None]:
# Getting the values
tokens = pd.Series(df.nltktext.values)

## 1. CBOW Model

In [None]:
# train a CBOW model (sg=0) on the NLTK tokens
# gensim 3.x API: `size` is named `vector_size` in gensim >= 4.0
cbow_model = wtv(tokens, size=300, window=9, min_count=2, sg=0)

In [None]:
# extract vectors from all words in doc
def get_embedding_cbow(doc_tokens):
    embeddings = []
    model = cbow_model
    # iterate over tokens to extract their vectors    
    for tok in doc_tokens:
        if tok in model.wv.vocab:
            embeddings.append(model.wv.word_vec(tok))
    # mean the vectors of individual words to get the vector of the statement
    return np.mean(embeddings, axis=0)

In [None]:
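# Note: df['Text'] holds raw strings, so get_embedding_cbow iterates over characters here, not word tokens
df['cbow_vectors'] = df['Text'].apply(lambda x: get_embedding_cbow(x))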



In [None]:
df.isnull().sum()

Category         0
Text             0
spacytext        0
simpletext       0
nltktext         0
cbow_vectors    54
dtype: int64

In [None]:
df[df['cbow_vectors'].isnull() == True]

Unnamed: 0,Category,Text,spacytext,simpletext,nltktext,cbow_vectors
14,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,date SUNDAY,"[have, date, on, sunday, with, will]","[date, sunday]",
43,ham,WHO ARE YOU SEEING?,see,"[who, are, you, seeing]",[seeing],
72,ham,HI BABE IM AT HOME NOW WANNA DO SOMETHING? XX,HI babe IM HOME WANNA xx,"[hi, babe, im, at, home, now, wanna, do, somet...","[hi, babe, im, home, wan, na, something, xx]",
444,ham,\HEY HEY WERETHE MONKEESPEOPLE SAY WE MONKEYAR...,HEY HOWDY GORGEOUS,"[hey, hey, werethe, monkeespeople, say, we, mo...","[hey, hey, werethe, monkeespeople, say, monkey...",
456,ham,"LOOK AT AMY URE A BEAUTIFUL, INTELLIGENT WOMAN...",look AMY URE beautiful intelligent woman like ...,"[look, at, amy, ure, beautiful, intelligent, w...","[look, amy, ure, beautiful, intelligent, woman...",
569,ham,WOT U WANNA DO THEN MISSY?,WOT U WANNA MISSY,"[wot, wanna, do, then, missy]","[wot, u, wan, na, missy]",
622,ham,MAKE SURE ALEX KNOWS HIS BIRTHDAY IS OVER IN F...,SURE ALEX knows birthday minute FAR YOU'RE con...,"[make, sure, alex, knows, his, birthday, is, o...","[make, sure, alex, know, birthday, fifteen, mi...",
792,ham,Y?WHERE U AT DOGBREATH? ITS JUST SOUNDING LIKE...,u sounding like JAN C AL,"[where, at, dogbreath, its, just, sounding, li...","[ywhere, u, dogbreath, sounding, like, jan, c,...",
908,ham,WHITE FUDGE OREOS ARE IN STORES,white FUDGE STORES,"[white, fudge, oreos, are, in, stores]","[white, fudge, oreo, store]",
983,ham,LOOK AT THE FUCKIN TIME. WHAT THE FUCK YOU THI...,look fuckin TIME fuck think,"[look, at, the, fuckin, time, what, the, fuck,...","[look, fuckin, time, fuck, think]",


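These NaN values most likely arise because `get_embedding_cbow` is applied to the raw `Text` column, so it iterates over the characters of each message rather than over word tokens. The model's vocabulary was built from lowercased tokens, so a message written entirely in uppercase contains no character that matches a vocabulary entry, the list of embeddings stays empty, and `np.mean` returns NaN. Every row shown above is written in uppercase. Applying the function to a tokenized column such as `nltktext` should avoid most of these cases; here we simply drop the affected rows so that later steps do not fail.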
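
One plausible reading of the NaN rows is that the embedding function receives a string and therefore loops over characters instead of words. The sketch below illustrates this with a hypothetical toy vocabulary; `toy_vectors` and `mean_embedding` are illustrative names and are not part of the notebook or of gensim. It also returns a zero vector instead of NaN when nothing is found, which is one possible alternative to dropping rows.

```python
# Hypothetical toy vocabulary standing in for the trained model's word vectors.
toy_vectors = {
    "date": [1.0, 0.0],
    "sunday": [0.0, 1.0],
    "u": [0.5, 0.5],
}

def mean_embedding(items, vectors, dim=2):
    """Average the vectors of the items found in `vectors`.

    Returns a zero vector instead of NaN when no item is found.
    """
    found = [vectors[item] for item in items if item in vectors]
    if not found:
        return [0.0] * dim
    return [sum(column) / len(found) for column in zip(*found)]

raw_text = "I HAVE A DATE ON SUNDAY WITH WILL!!"
tokens = ["date", "sunday"]  # the nltktext value shown for this row

# Iterating a string yields single characters; none is in the vocabulary.
from_text = mean_embedding(raw_text, toy_vectors)
# Iterating the token list yields words, so both vectors are averaged.
from_tokens = mean_embedding(tokens, toy_vectors)

print(from_text)    # [0.0, 0.0]
print(from_tokens)  # [0.5, 0.5]
```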

In [None]:
df = df.dropna().reset_index(drop=True)
df.head(15)

Unnamed: 0,Category,Text,spacytext,simpletext,nltktext,cbow_vectors
0,ham,"Go until jurong point, crazy.. Available only ...",point crazy available n great world la e buffe...,"[go, until, jurong, point, crazy, available, o...","[go, jurong, point, crazy, available, bugis, n...","[-0.106335275, -0.06327503, 0.048476845, -0.19..."
1,ham,Ok lar... Joking wif u oni...,ok lar joke wif u oni,"[ok, lar, joking, wif, oni]","[ok, lar, joking, wif, u, oni]","[-0.102702685, -0.061022747, 0.047203705, -0.1..."
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry 2 comp win FA Cup final 21st 2005 t...,"[free, entry, in, wkly, comp, to, win, fa, cup...","[free, entry, 2, wkly, comp, win, fa, cup, fin...","[-0.09802247, -0.05802328, 0.044908598, -0.182..."
3,ham,U dun say so early hor... U c already then say...,U dun early hor u c,"[dun, say, so, early, hor, already, then, say]","[u, dun, say, early, hor, u, c, already, say]","[-0.13842326, -0.08239722, 0.062801085, -0.258..."
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah think go live,"[nah, don, think, he, goes, to, usf, he, lives...","[nah, dont, think, go, usf, life, around, though]","[-0.10480784, -0.062163178, 0.047725685, -0.19..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...,hey darle 3 week word like fun Tb ok std send,"[freemsg, hey, there, darling, it, been, week,...","[freemsg, hey, darling, 3, week, word, back, i...","[-0.10340355, -0.06136291, 0.047242425, -0.192..."
6,ham,Even my brother is not like to speak with me. ...,brother like speak treat like aids patent,"[even, my, brother, is, not, like, to, speak, ...","[even, brother, like, speak, treat, like, aid,...","[-0.09139911, -0.053884435, 0.041671406, -0.17..."
7,ham,As per your request 'Melle Melle (Oru Minnamin...,request Melle Melle Oru set Callers press 9 co...,"[as, per, your, request, melle, melle, oru, mi...","[per, request, melle, melle, oru, minnaminungi...","[-0.13480254, -0.0804047, 0.061284255, -0.2514..."
8,spam,WINNER!! As a valued network customer you have...,WINNER value network customer select prize rew...,"[winner, as, valued, network, customer, you, h...","[winner, valued, network, customer, selected, ...","[-0.11448016, -0.06814899, 0.052251305, -0.212..."
9,spam,Had your mobile 11 months or more? U R entitle...,mobile 11 month u r entitle update late colour...,"[had, your, mobile, months, or, more, entitled...","[mobile, 11, month, u, r, entitled, update, la...","[-0.08530863, -0.05060905, 0.03882869, -0.1585..."


In [None]:
# create X from w2vec
X_cbow = pd.DataFrame(df['cbow_vectors'].values.tolist())
X_cbow.shape

(5518, 300)

In [None]:
# label encode the 'Category' column
le = LabelEncoder()
# fit_transform() converts the text to numbers, for obtaining y
y = le.fit_transform(df.Category)

In [None]:
# split into train and test
X_train_cb, X_test_cb, y_train_cb, y_test_cb = train_test_split(X_cbow, y, test_size=0.2, random_state=42)

In [None]:
# Build a text classification model
# Initialize classifier
model_1 = SVC()
# Fit the model on the train dataset
model_1 = model_1.fit(X_train_cb, y_train_cb)
# Make predictions on the test dataset
pred_1 = model_1.predict(X_test_cb)

# check the accuracy of the model
a1 = accuracy_score(y_test_cb, pred_1)
print("Accuracy:", a1*100, "%")

Accuracy: 85.59782608695652 %
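
Because ham heavily outnumbers spam, a single accuracy figure can hide how many spam messages are missed. The sketch below computes accuracy, recall and precision for the spam class from scratch. The labels are hypothetical and chosen only for illustration; they are not the notebook's test data.

```python
# Hypothetical labels for illustration only (0 = ham, 1 = spam);
# these are not the notebook's test data.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham passed through

accuracy = (tp + tn) / len(pairs)
spam_recall = tp / (tp + fn)
spam_precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.3f} recall={spam_recall:.3f} precision={spam_precision:.3f}")
```

In this toy case accuracy is 0.875 even though half of the spam is missed. For the real split, `sklearn.metrics.confusion_matrix(y_test_cb, pred_1)` would report the same four counts.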


## 2. Skipgram Model

In [None]:
# train a skipgram model from the given data set
skgram_model = wtv(tokens, size=300, window=9, min_count=2, sg=1)

In [None]:
# extract vectors from all words in doc
def get_embedding_sg(doc_tokens):
    embeddings = []
    model = skgram_model
    # iterate over tokens to extract their vectors    
    for tok in doc_tokens:
        if tok in model.wv.vocab:
            embeddings.append(model.wv.word_vec(tok))
    # mean the vectors of individual words to get the vector of the statement
    return np.mean(embeddings, axis=0)

In [None]:
df['sgram_vectors'] = df['Text'].apply(lambda x: get_embedding_sg(x))

In [None]:
df.isnull().sum()

Category         0
Text             0
spacytext        0
simpletext       0
nltktext         0
cbow_vectors     0
sgram_vectors    0
dtype: int64

Since all the problematic rows were already dropped in the CBOW step, no missing values appear here.

In [None]:
# create X from w2vec
X_skg = pd.DataFrame(df['sgram_vectors'].values.tolist())
X_skg.shape

(5518, 300)

In [None]:
# label encode the 'Category' column
le = LabelEncoder()
# fit_transform() converts the text to numbers
y = le.fit_transform(df.Category)

In [None]:
# split into train and test
X_train_sg, X_test_sg, y_train_sg, y_test_sg = train_test_split(X_skg, y, test_size=0.2, random_state=42)

In [None]:
# Build a text classification model
# Initialize classifier
model_2 = SVC()
# Fit the model on the train dataset
model_2 = model_2.fit(X_train_sg, y_train_sg)
# Make predictions on the test dataset
pred_2 = model_2.predict(X_test_sg)

# check the accuracy of the model
a2 = accuracy_score(y_test_sg, pred_2)
print("Accuracy:", a2*100, "%")

Accuracy: 85.86956521739131 %


## 3. Pretrained Google Word2Vec Model

In [None]:
file_name = "/content/drive/MyDrive/GoogleNews-vectors-negative300.bin"

In [None]:
# load into gensim pretrained model
google_w2vec = KeyedVectors.load_word2vec_format(file_name, binary=True)

In [None]:
# extract vectors from all words in doc
def get_embedding_ggl(doc_tokens):
    embeddings = []
    model = google_w2vec
    # google_w2vec is already a KeyedVectors object, so no .wv is needed
    # (in gensim 3.x, .wv on KeyedVectors only emits a deprecation warning)
    for tok in doc_tokens:
        if tok in model.vocab:
            embeddings.append(model.word_vec(tok))
    # average the vectors of individual words to get the vector of the statement
    return np.mean(embeddings, axis=0)

In [None]:
df['google_vectors'] = df['Text'].apply(lambda x: get_embedding_ggl(x))



In [None]:
# create X from w2vec
X_ggl = pd.DataFrame(df['google_vectors'].values.tolist())
X_ggl.shape

(5518, 300)

In [None]:
# label encode the 'Category' column
le = LabelEncoder()
# fit_transform() converts the text to numbers
y = le.fit_transform(df.Category)

In [None]:
# split into train and test
X_train_gl, X_test_gl, y_train_gl, y_test_gl = train_test_split(X_ggl, y, test_size=0.2, random_state=42)

In [None]:
# Build a text classification model
# Initialize classifier
model_3 = SVC()
# Fit the model on the train dataset
model_3 = model_3.fit(X_train_gl, y_train_gl)
# Make predictions on the test dataset
pred_3 = model_3.predict(X_test_gl)

# check the accuracy of the model
a3 = accuracy_score(y_test_gl, pred_3)
print("Accuracy:", a3*100, "%")

Accuracy: 97.01086956521739 %


## Results

In [None]:
print('\n\t\t\t Accuracy Of Email Classification Model',
      '\nCBOW Model\t\t\t\t: ',a1,
      '\nSkipgram Model \t\t\t\t: ',a2,
      '\nPretrained Google Model \t\t: ',a3)


			 Accuracy Of Email Classification Model 
CBOW Model				:  0.8559782608695652 
Skipgram Model 				:  0.8586956521739131 
Pretrained Google Model 		:  0.970108695652174
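
The accuracies above can be converted back into message counts, which makes the gap between the models easier to judge. This is a small sketch that assumes the 5518 rows and the `test_size=0.2` setting used earlier; scikit-learn rounds the test share up, giving 1104 test rows.

```python
import math

n_rows = 5518                      # rows left after dropna(), per X_cbow.shape
n_test = math.ceil(0.2 * n_rows)   # test share is rounded up: 1104 rows

# Accuracies copied from the results printed above.
accuracies = {
    "CBOW": 0.8559782608695652,
    "Skipgram": 0.8586956521739131,
    "Pretrained Google": 0.970108695652174,
}

misclassified = {}
for name, acc in accuracies.items():
    correct = round(acc * n_test)
    misclassified[name] = n_test - correct
    print(f"{name}: {correct} correct, {misclassified[name]} misclassified of {n_test}")
```

Out of 1104 test messages, this gives 159 errors for CBOW, 156 for Skipgram and 33 for the pretrained Google vectors.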
