#### Project Introduction:

This project aims to determine whether a message is spam or not. To convert words into vectors, we are going to use Word2Vec and average Word2Vec (AvgWord2Vec) and compare the two.
Most importantly, we are going to train the model on Word2Vec features.

In [123]:
# Installing the dependencies
#!pip install gensim

In [124]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [125]:
import gensim
from gensim.models import Word2Vec, KeyedVectors

#### Google Pretrained Word2Vec Model:
We are going to use Google's pretrained Word2Vec model for feature representation; it has been trained on a huge amount of data.
This model was trained on Google News, and it converts each word into a vector with 300 dimensions. Therefore, the shape of each word vector is (300,).
These dimensions help to preserve the underlying meaning of the words, so words with similar meanings lie close to each other in the vector space.

#### Cosine Similarity:
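Cosine similarity measures how similar two word vectors are. It is the cosine of the angle between them, so vectors pointing in the same direction score 1 and orthogonal vectors score 0. The cosine distance between two words is 1 minus their cosine similarity.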
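As a minimal sketch of the idea, the function below computes cosine similarity in plain Python. The three-dimensional vectors are hypothetical values chosen for illustration and do not come from the pretrained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "word vectors", for illustration only
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [3.0, 0.0, -1.0]  # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)  # close to 1.0
sim_ac = cosine_similarity(vec_a, vec_c)  # 0.0
```

With gensim word vectors loaded as `wv`, `wv.similarity(w1, w2)` should return the same quantity for two words in the vocabulary.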

In [126]:
# downloading the Google pretrained word2vec model (a large download on the first run)
import gensim.downloader as api
wv=api.load('word2vec-google-news-300') # wv is needed by the next cell


In [127]:
# with the Google pretrained word2vec model, each word is converted into a vector with 300 dimensions
vec_player=wv['player']
vec_player.shape

(300,)

In [128]:
df=pd.read_csv('drive/MyDrive/spam.csv',encoding="ISO-8859-1")
  

In [129]:
# dropping the irrelevant columns
df.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'], axis=1, inplace=True)


In [130]:
# renaming the column names
df.rename(columns={'v1':'label','v2':'message'},inplace=True)


In [131]:
df.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [132]:
df.shape

(5572, 2)

#### Text PreProcessing:
We are going to perform the steps below to clean the text so that the model can be trained.
1) Tokenization:
Extract the individual words from each sentence.

2) Stop words:
Remove words that carry little meaning, such as 'is', 'or', 'they', etc.

3) Stemming/Lemmatization:
Reduce words to their base/root form so that there are fewer unique words and the vocabulary is smaller.

4) Word2Vec:

Convert words into vectors, either by using a pretrained model or by training a model from scratch.
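Steps 1 to 3 can be sketched in plain Python as follows. This is only an illustration: `TOY_STOP_WORDS`, `TOY_LEMMAS` and `toy_preprocess` are hypothetical stand-ins, whereas the notebook itself relies on NLTK's English stop-word list and `WordNetLemmatizer`. Step 4 is not covered here.

```python
import re

# Toy stand-ins (assumptions): the notebook uses NLTK's English stop-word list
# and WordNetLemmatizer instead of these hand-written tables.
TOY_STOP_WORDS = {"is", "or", "they", "the", "a", "to", "are"}
TOY_LEMMAS = {"prizes": "prize", "winners": "winner"}

def toy_preprocess(message):
    # 1) Tokenization: keep letters only, lowercase, split on whitespace
    tokens = re.sub("[^a-zA-Z]", " ", message).lower().split()
    # 2) Stop-word removal
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    # 3) Lemmatization via a lookup table (unknown words pass through unchanged)
    return [TOY_LEMMAS.get(t, t) for t in tokens]

tokens = toy_preprocess("They are the WINNERS of 2 free prizes!")
# tokens == ['winner', 'of', 'free', 'prize']
```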

In [133]:
# importing dependencies for text preprocessing
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [134]:
lemmatizer=WordNetLemmatizer()

In [135]:
stop_words=set(stopwords.words('english')) # build the stop-word set once instead of on every iteration
corpus=[]
for i in range(len(df)):
  review=re.sub('[^a-zA-Z]',' ',df['message'][i]) # keep letters only
  review=review.lower()
  review=review.split()

  review=[lemmatizer.lemmatize(word) for word in review if word not in stop_words]
  review=' '.join(review)
  corpus.append(review)


In [136]:
print(len(corpus))
print(corpus[0:2])

5572
['go jurong point crazy available bugis n great world la e buffet cine got amore wat', 'ok lar joking wif u oni']



After text preprocessing, 8 of the 5572 messages are left as empty strings.
Those messages are listed in the cell below. They end up empty because they contain only characters other than a-z and A-Z (for example '645' or ':) ') or only stop words (for example 'What you doing?how are you?').
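The two ways a message can end up empty can be reproduced with a small sketch. `TOY_STOP_WORDS` is a hand-picked subset assumed to be part of NLTK's English stop-word list, and `surviving_tokens` is a hypothetical helper that skips the lemmatization step.

```python
import re

# Hand-picked subset (assumption) of NLTK's English stop-word list
TOY_STOP_WORDS = {"what", "you", "doing", "how", "are"}

def surviving_tokens(message):
    # Same cleaning rule as the notebook: letters only, lowercase, split
    tokens = re.sub("[^a-zA-Z]", " ", message).lower().split()
    return [t for t in tokens if t not in TOY_STOP_WORDS]

digits_only = surviving_tokens("645")                              # no letters at all
emoticons = surviving_tokens(":-) :-)")                            # no letters at all
stop_words_only = surviving_tokens("What you doing?how are you?")  # only stop words
```

All three calls return an empty list, so joining the tokens back together gives an empty string.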

In [137]:
# the 8 messages that were reduced to empty strings by preprocessing
[[i,j,k] for i,j,k in zip(map(len,corpus),corpus,df['message']) if i<1]

[[0, '', 'What you doing?how are you?'],
 [0, '', 'Where @'],
 [0, '', '645'],
 [0, '', 'Can a not?'],
 [0, '', ':) '],
 [0, '', 'What you doing?how are you?'],
 [0, '', ':( but your not here....'],
 [0, '', ':-) :-)']]

In [138]:
# Alternative way to preprocess the text:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [139]:
words=[]
for message in corpus:
  sentences=sent_tokenize(message) # splitting the message into sentences
  for sentence in sentences:
    # lowercasing and tokenizing; by default tokens shorter than 2 or longer than 15 characters are dropped
    tokens=simple_preprocess(sentence)
    words.append(tokens)


In [140]:
"""
sent=[]
for i in corpus:
  #print(i)
  splitting=i.split(' ')
  sent.append(splitting)
print(sent)

"""


"\nsent=[]\nfor i in corpus:\n  #print(i)\n  splitting=i.split(' ')\n  sent.append(splitting)\nprint(sent)\n\n"

#### Word to Vector Conversion

Let's train the Word2Vec model from scratch.


In [141]:
import gensim

In [142]:
# 100-dimensional word vectors; note that gensim 4.x renamed the `size` argument to `vector_size`
model=gensim.models.Word2Vec(words,size=100)



In [143]:
model.epochs

5

In [144]:
model.wv['good'].shape

(100,)

#### Word2Vec vs AvgWord2Vec:
With the model we have just trained, every word is converted into a vector with 100 dimensions. This means that a sentence of 10 words is represented by 10 separate vectors of 100 dimensions each.

In AvgWord2Vec, we average the vectors of all the words in a sentence, dimension by dimension. That way we have only one vector with 100 dimensions for the whole sentence, regardless of its length. This is computationally more efficient and gives every sentence a representation of the same size.
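The averaging can be illustrated with a small plain-Python sketch. The three-dimensional `toy_vectors` and the helper `toy_avg_word2vec` are hypothetical; the notebook's own function works on the 100-dimensional vectors of the trained model and uses NumPy.

```python
# Hypothetical 3-dimensional embeddings; the notebook's model uses 100 dimensions
toy_vectors = {
    "free":  [1.0, 0.0, 2.0],
    "entry": [3.0, 2.0, 0.0],
    "win":   [2.0, 4.0, 1.0],
}

def toy_avg_word2vec(tokens, vectors):
    """Average the vectors of the in-vocabulary tokens, dimension by dimension."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no known word: nothing to average
    return [sum(dim) / len(known) for dim in zip(*known)]

sentence_vector = toy_avg_word2vec(["free", "entry", "win", "wkly"], toy_vectors)
# sentence_vector == [2.0, 2.0, 1.0]; "wkly" is skipped as out-of-vocabulary
```

Returning `None` for a sentence with no known word is a choice made for this sketch; a NumPy mean over an empty list yields NaN instead.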

In [145]:
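def avg_word2vec(doc):
    # average the vectors of the words of doc that are in the model's vocabulary;
    # returns NaN (with a NumPy warning) when no word of doc is in the vocabulary
    vocab=model.wv.vocab.keys() # gensim 3.x API; gensim 4.x uses model.wv.key_to_index
    return np.mean([model.wv[word] for word in doc if word in vocab],axis=0)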

In [146]:
!pip install tqdm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [147]:
from tqdm import tqdm # to track the progress

In [148]:
avg_word2vec(words[0]).shape # converts a sentence into a single vector by averaging its word vectors

(100,)

In [149]:
words[0]

['go',
 'jurong',
 'point',
 'crazy',
 'available',
 'bugis',
 'great',
 'world',
 'la',
 'buffet',
 'cine',
 'got',
 'amore',
 'wat']

In [150]:
# apply avgword2vec to every sentence:
X=[]
for i in tqdm(range(len(words))):
  X.append(avg_word2vec(words[i]))



100%|██████████| 5564/5564 [00:00<00:00, 9262.97it/s]


In [151]:
len(X)

5564

In [152]:
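# converting the list into an array; dtype=object because the entries are ragged
# (NaN scalars for sentences with no in-vocabulary word, 100-dimensional vectors otherwise)
X_new=np.array(X,dtype=object)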



In [153]:
X_new

array([array([ 0.1546087 , -0.14088577,  0.17547633, -0.35944661, -0.01137265,
               0.33409747,  0.25645402,  0.08449929,  0.05000579,  0.09256528,
              -0.2762603 ,  0.10092197, -0.38037717,  0.04303493,  0.01650086,
               0.00516385,  0.10958337, -0.22792922, -0.01705717,  0.01475398,
               0.07394274, -0.26079592,  0.5097234 , -0.18557414, -0.24847946,
              -0.38059634,  0.18595581, -0.14688142, -0.02537193, -0.2804627 ,
               0.33316672,  0.04540907,  0.3477824 ,  0.00393646,  0.22351123,
              -0.18171024, -0.24479757, -0.09076348, -0.05686321, -0.20773746,
              -0.24526219,  0.1893187 ,  0.17749065,  0.04909588,  0.12174215,
               0.3230932 ,  0.13673571,  0.23983826, -0.1218561 ,  0.05355744,
              -0.04939915,  0.13673426,  0.04475321, -0.10286453, -0.07896338,
               0.24872617,  0.2180737 , -0.01450358, -0.16486555, -0.2397254 ,
              -0.23084283,  0.13880688, -0.02583972,

In [154]:
# Test
add=lambda x,y:x+y # named add so that the built-in sum is not shadowed


In [155]:
add(1,2)

3

In [156]:
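# Dependent feature: keep only the labels whose preprocessed message is non-empty,
# so that y lines up with the 5564 sentence vectors in X
non_empty=[len(text)>0 for text in corpus]
y=df[non_empty]
y=pd.get_dummies(y['label'],drop_first=True) # 1 = spam, 0 = ham
y=y.iloc[:,0].values
y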

array([0, 0, 1, ..., 0, 0, 0], dtype=uint8)

In [157]:
# printing the shapes of the independent and dependent features
print(X_new.shape) # independent features
print(y.shape) # dependent feature

(5564,)
(5564,)


#### Analysis:
Upon checking the shapes of the independent and dependent features, 8 of the original 5572 records have been removed, leaving 5564.
The reason is that those 8 messages were reduced to empty strings by the preprocessing step and hence could not be converted into vectors. Their labels have been dropped from the dependent feature as well, so that the two stay aligned.
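The alignment between features and labels can be sketched with a boolean mask applied to both lists. The data below is hypothetical and only illustrates the pattern.

```python
# Hypothetical preprocessed messages and labels, for illustration only
toy_corpus = ["free entry win", "", "ok lar", ""]
toy_labels = ["spam", "ham", "ham", "ham"]

# One boolean per message: True if any text survived preprocessing
keep = [len(text) > 0 for text in toy_corpus]

# Apply the same mask to both lists so that positions stay aligned
kept_texts = [text for text, k in zip(toy_corpus, keep) if k]
kept_labels = [label for label, k in zip(toy_labels, keep) if k]
```

Because both lists are filtered with the same mask, `kept_labels[i]` still belongs to `kept_texts[i]`.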

In [158]:
X[0]

array([ 0.1546087 , -0.14088577,  0.17547633, -0.35944661, -0.01137265,
        0.33409747,  0.25645402,  0.08449929,  0.05000579,  0.09256528,
       -0.2762603 ,  0.10092197, -0.38037717,  0.04303493,  0.01650086,
        0.00516385,  0.10958337, -0.22792922, -0.01705717,  0.01475398,
        0.07394274, -0.26079592,  0.5097234 , -0.18557414, -0.24847946,
       -0.38059634,  0.18595581, -0.14688142, -0.02537193, -0.2804627 ,
        0.33316672,  0.04540907,  0.3477824 ,  0.00393646,  0.22351123,
       -0.18171024, -0.24479757, -0.09076348, -0.05686321, -0.20773746,
       -0.24526219,  0.1893187 ,  0.17749065,  0.04909588,  0.12174215,
        0.3230932 ,  0.13673571,  0.23983826, -0.1218561 ,  0.05355744,
       -0.04939915,  0.13673426,  0.04475321, -0.10286453, -0.07896338,
        0.24872617,  0.2180737 , -0.01450358, -0.16486555, -0.2397254 ,
       -0.23084283,  0.13880688, -0.02583972,  0.10321329, -0.2266865 ,
       -0.24000959, -0.13631266, -0.25968912,  0.0498648 ,  0.12

In [120]:
#X_new[0].reshape(1,-1).shape # converting it into 2d array

In [159]:
# Final independent features
# stacking the per-sentence vectors into a 2D dataframe with one row per sentence
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df_new=pd.concat([pd.DataFrame(X[i].reshape(1,-1)) for i in range(len(X))],ignore_index=True)

In [160]:
df_new.shape

(5564, 100)

In [163]:
# sentences with 100 dimensions
df_new.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.154609 | -0.140886 | 0.175476 | -0.359447 | -0.011373 | 0.334097 | 0.256454 | 0.084499 | 0.050006 | 0.092565 | ... | -0.168423 | 0.00567 | -0.064839 | -0.245212 | 0.044382 | 0.030419 | 0.167871 | -0.169364 | 0.319271 | -0.102991 |
| 1 | 0.125978 | -0.11958 | 0.14283 | -0.29543 | -0.010687 | 0.272264 | 0.212524 | 0.068719 | 0.039765 | 0.076359 | ... | -0.140152 | 0.003899 | -0.058514 | -0.200573 | 0.033658 | 0.028258 | 0.138971 | -0.139665 | 0.263648 | -0.082256 |
| 2 | 0.162086 | -0.142242 | 0.176655 | -0.370914 | -0.009273 | 0.34411 | 0.262651 | 0.087392 | 0.048641 | 0.093462 | ... | -0.173095 | 0.007253 | -0.0687 | -0.253436 | 0.046868 | 0.032809 | 0.174604 | -0.174763 | 0.330344 | -0.106403 |
| 3 | 0.22005 | -0.206744 | 0.252645 | -0.520019 | -0.01566 | 0.481049 | 0.370581 | 0.121353 | 0.073174 | 0.133557 | ... | -0.24778 | 0.005469 | -0.09256 | -0.354782 | 0.065128 | 0.044279 | 0.244863 | -0.246951 | 0.462552 | -0.14825 |
| 4 | 0.173329 | -0.162881 | 0.199891 | -0.411318 | -0.012927 | 0.379406 | 0.29343 | 0.095023 | 0.056839 | 0.106779 | ... | -0.195423 | 0.007979 | -0.075818 | -0.279146 | 0.051623 | 0.034376 | 0.194645 | -0.197156 | 0.365712 | -0.118568 |


In [168]:
X=df_new # Independent Features
y # dependent features

array([0, 0, 1, ..., 0, 0, 0], dtype=uint8)

In [169]:
print(X.shape)
print(y.shape)

(5564, 100)
(5564,)


In [186]:
#checking null values:
df_new.isnull().sum()

0     68
1     68
2     68
3     68
4     68
      ..
95    68
96    68
97    68
98    68
99    68
Length: 100, dtype: int64

In [190]:
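# dealing with null values: the 68 all-NaN rows are sentences with no word in the model's vocabulary;
# the labels are attached first so that features and labels are dropped together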

df_new['Output']=y

df_new.dropna(axis=0,inplace=True)
df_new.isnull().sum()

0         0
1         0
2         0
3         0
4         0
         ..
96        0
97        0
98        0
99        0
Output    0
Length: 101, dtype: int64

In [193]:
# separating independent and dependent features:
X=df_new.iloc[:,:-1]
y=df_new.iloc[:,-1:]

#### Model Training:
We have converted the sentences into vectors.
Now we are going to apply an ML algorithm to train the model.

In [194]:
#train-test split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=2)

In [195]:
print(X_train.shape,y_train.shape,X_test.shape,y_test.shape)

(4396, 100) (4396, 1) (1100, 100) (1100, 1)


In [196]:
# applying ML algorithm to train the model
from sklearn.ensemble import RandomForestClassifier
spam_detection_model=RandomForestClassifier()

In [197]:
spam_model=spam_detection_model.fit(X_train,y_train.values.ravel()) # flatten y to 1D, as scikit-learn expects



In [199]:
y_predict=spam_model.predict(X_test)

#### Model Evaluation

In [201]:
from sklearn.metrics import classification_report,accuracy_score
print(classification_report(y_test,y_predict)) # print so that the report is rendered as a table

              precision    recall  f1-score   support

           0       0.96      0.99      0.98       954
           1       0.95      0.71      0.82       146

    accuracy                           0.96      1100
   macro avg       0.96      0.85      0.90      1100
weighted avg       0.96      0.96      0.95      1100

In [203]:
accuracy_score(y_test,y_predict)

0.9572727272727273
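To read these numbers, it helps to see how precision, recall and F1 are derived from the predictions. The sketch below uses hypothetical labels, not the notebook's predictions, and a hypothetical helper `precision_recall_f1`. The last line uses only the support counts reported above (954 ham, 146 spam) to compute the accuracy of always predicting ham.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels (1 = spam, 0 = ham), not the notebook's predictions
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(toy_true, toy_pred)
# 3 true positives, 1 false positive, 1 false negative: all three metrics are 0.75

# Majority-class baseline from the support counts: always predicting ham
baseline_accuracy = 954 / (954 + 146)
```

The baseline comes out at roughly 0.87, so on this imbalanced test set the spam recall of 0.71 is a more telling figure than the overall accuracy of about 0.96.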