# NLP I Tutorial

Notebook from [Zain Hasan](https://drive.google.com/drive/folders/1eISkk5lT79Ao3Mgngz5nkzn8qafX0RUF?usp=sharing).


## Text Simplification

### Read in the data

The dataset can also be found at: http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

In [1]:
# Read in and view the raw data
import pandas as pd

messages = pd.read_table('data/SMSSpamCollection.txt', encoding='latin-1', header=None)
messages.head(10)

Unnamed: 0,0,1
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [2]:
# Rename the columns to be more descriptive
messages.columns = ["label", "text"]
messages.head(10)

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [3]:
# How big is this dataset?
messages.shape

(5572, 2)

In [4]:
# What portion of our text messages are actually spam?
messages['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [5]:
# Are we missing any data?
print('Number of nulls in label: {}'.format(messages['label'].isnull().sum()))
print('Number of nulls in text: {}'.format(messages['text'].isnull().sum()))

Number of nulls in label: 0
Number of nulls in text: 0


### Remove Punctuation

In [6]:
# What punctuation is included in the default list?
import string

string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [7]:
# Define a function to remove punctuation in our messages
def remove_punct(text):
    text = "".join([char for char in text if char not in string.punctuation])
    return text

In [8]:
txt1 = "hi, my name is jeremy!"
txt2 = remove_punct(txt1)
txt2

'hi my name is jeremy'

In [9]:
txt3 = txt2.split()
txt3

['hi', 'my', 'name', 'is', 'jeremy']

In [10]:
messages['text_clean'] = messages['text'].apply(remove_punct)

messages.head(10)

Unnamed: 0,label,text,text_clean
0,ham,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say
4,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...
5,spam,FreeMsg Hey there darling it's been 3 week's n...,FreeMsg Hey there darling its been 3 weeks now...
6,ham,Even my brother is not like to speak with me. ...,Even my brother is not like to speak with me T...
7,ham,As per your request 'Melle Melle (Oru Minnamin...,As per your request Melle Melle Oru Minnaminun...
8,spam,WINNER!! As a valued network customer you have...,WINNER As a valued network customer you have b...
9,spam,Had your mobile 11 months or more? U R entitle...,Had your mobile 11 months or more U R entitled...
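The same cleanup can also be written with `str.translate`, which avoids the per-character Python loop. This is a minimal sketch of an alternative, not the notebook's method; `remove_punct_fast` and `_PUNCT_TABLE` are names introduced here for illustration.

```python
import string

# Translation table that maps every punctuation character to None (deletes it).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_fast(text):
    """Return text with all characters in string.punctuation removed."""
    return text.translate(_PUNCT_TABLE)

cleaned = remove_punct_fast("hi, my name is jeremy!")
print(cleaned)
```

Like `remove_punct`, this deletes apostrophes too, so "don't" becomes "dont".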


### Tokenize (and lowercase)


In [11]:
# Define a function to split our sentences into a list of words,
# e.g. ['define', 'a', 'function', ...]

def tokenize(text):
    tokens = text.split()
    return tokens

messages['text_tokenized'] = messages['text_clean'].apply(lambda x: tokenize(x.lower()))

messages.head()

Unnamed: 0,label,text,text_clean,text_tokenized
0,ham,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...,"[go, until, jurong, point, crazy, available, o..."
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, t..."
4,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l..."
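`str.split()` only splits on whitespace, so it relies on punctuation having been removed beforehand. As a hedged alternative, a regular expression can lowercase and tokenize raw text in one step. `tokenize_regex` is a hypothetical helper, and the pattern `\w+` is one reasonable choice rather than the notebook's approach.

```python
import re

def tokenize_regex(text):
    """Lowercase text and return each run of word characters as a token."""
    return re.findall(r"\w+", text.lower())

tokens = tokenize_regex("Ok lar... Joking wif u oni...")
print(tokens)
```

One difference to be aware of: this pattern splits on apostrophes, so "don't" yields `don` and `t`, whereas removing punctuation first yields `dont`.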


### Remove Stopwords

The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing written in the Python programming language.

We will first load NLTK and download its pre-defined stopwords.

In [12]:
# !pip install nltk

In [13]:
# Import the NLTK package and download the necessary data
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# view the stopwords
stopwords.words()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\enger\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['إذ',
 'إذا',
 'إذما',
 'إذن',
 'أف',
 'أقل',
 'أكثر',
 'ألا',
 'إلا',
 'التي',
 'الذي',
 'الذين',
 'اللاتي',
 'اللائي',
 'اللتان',
 'اللتيا',
 'اللتين',
 'اللذان',
 'اللذين',
 'اللواتي',
 'إلى',
 'إليك',
 'إليكم',
 'إليكما',
 'إليكن',
 'أم',
 'أما',
 'أما',
 'إما',
 'أن',
 'إن',
 'إنا',
 'أنا',
 'أنت',
 'أنتم',
 'أنتما',
 'أنتن',
 'إنما',
 'إنه',
 'أنى',
 'أنى',
 'آه',
 'آها',
 'أو',
 'أولاء',
 'أولئك',
 'أوه',
 'آي',
 'أي',
 'أيها',
 'إي',
 'أين',
 'أين',
 'أينما',
 'إيه',
 'بخ',
 'بس',
 'بعد',
 'بعض',
 'بك',
 'بكم',
 'بكم',
 'بكما',
 'بكن',
 'بل',
 'بلى',
 'بما',
 'بماذا',
 'بمن',
 'بنا',
 'به',
 'بها',
 'بهم',
 'بهما',
 'بهن',
 'بي',
 'بين',
 'بيد',
 'تلك',
 'تلكم',
 'تلكما',
 'ته',
 'تي',
 'تين',
 'تينك',
 'ثم',
 'ثمة',
 'حاشا',
 'حبذا',
 'حتى',
 'حيث',
 'حيثما',
 'حين',
 'خلا',
 'دون',
 'ذا',
 'ذات',
 'ذاك',
 'ذان',
 'ذانك',
 'ذلك',
 'ذلكم',
 'ذلكما',
 'ذلكن',
 'ذه',
 'ذو',
 'ذوا',
 'ذواتا',
 'ذواتي',
 'ذي',
 'ذين',
 'ذينك',
 'ريث',
 'سوف',
 'سوى',
 'شتان',
 'عدا',
 'عسى',
 'عل'

You will see that NLTK contains pre-defined stopwords across many different languages. We only want to use the English stopwords.

In [14]:
ENGstopwords = stopwords.words('english')
ENGstopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [15]:
# Define a function to remove all stopwords
def remove_stopwords(tokenized_text):
    text = [word for word in tokenized_text if word not in ENGstopwords]
    return text

messages['text_nostop'] = messages['text_tokenized'].apply(remove_stopwords)

messages.head()

Unnamed: 0,label,text,text_clean,text_tokenized,text_nostop
0,ham,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...,"[go, until, jurong, point, crazy, available, o...","[go, jurong, point, crazy, available, bugis, n..."
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l...","[nah, dont, think, goes, usf, lives, around, t..."
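Testing membership in a list scans the list, and the filter above repeats that scan for every token of every message. A sketch of the same filter using a `set` follows. The short `stop_list` is a hand-written stand-in for `stopwords.words('english')` so the example runs without NLTK data, and `remove_stopwords_fast` is a name introduced here.

```python
# Hand-written stand-in for stopwords.words('english'); an assumption for illustration.
stop_list = ["i", "he", "to", "so", "then", "in", "a", "until", "only"]
stop_set = set(stop_list)  # average O(1) membership checks

def remove_stopwords_fast(tokenized_text, stop_words=stop_set):
    """Keep only the tokens that are not stopwords."""
    return [word for word in tokenized_text if word not in stop_words]

filtered = remove_stopwords_fast(["u", "dun", "say", "so", "early", "hor"])
print(filtered)
```

With the real list, `set(stopwords.words('english'))` would take the place of `stop_set`.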


### Stemming

In [16]:
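# Import the stemmer. The 'punkt' download is only needed for NLTK's own
# tokenizers; the tokenize() helper defined earlier uses str.split().
nltk.download('punkt')
from nltk.stem import PorterStemmer

ps = PorterStemmer()  # strips common suffixes using the Porter rules

sentence = "Programmers have programed by learning how to programatically program"

# First tokenize (function defined earlier)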
words = tokenize(sentence)
words

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\enger\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['Programmers',
 'have',
 'programed',
 'by',
 'learning',
 'how',
 'to',
 'programatically',
 'program']

In [17]:
#stem each of these tokens and compare
for w in words:
    print(w, " -> ", ps.stem(w))

Programmers  ->  programm
have  ->  have
programed  ->  program
by  ->  by
learning  ->  learn
how  ->  how
to  ->  to
programatically  ->  programat
program  ->  program


### Lemmatization 

Lemmatization reduces vocabulary size more 'intelligently' than stemming: instead of chopping off suffixes, it maps each inflected form to its dictionary form (lemma). This requires more advanced techniques, such as part-of-speech (POS) tagging and dictionary look-up.

In [18]:
# import these modules
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
  
lemmatizer = WordNetLemmatizer() 

print("Rocks ->", lemmatizer.lemmatize("rocks")) 
print("Corpora ->", lemmatizer.lemmatize("corpora")) 

# By default, WordNetLemmatizer treats every word as a noun (pos="n").
# Passing pos="a" marks the word as an adjective.
print("better ->", lemmatizer.lemmatize("better"))
print("better ->", lemmatizer.lemmatize("better", pos="a"))

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\enger\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Rocks -> rock
Corpora -> corpus
better -> better
better -> good
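To make the role of the POS tag concrete, here is a toy dictionary look-up lemmatizer. The `LEMMAS` table is a tiny hand-written assumption, not WordNet data, and `toy_lemmatize` is a hypothetical helper. It only illustrates why the same word can map to different lemmas depending on its part of speech.

```python
# Tiny hand-written lemma table keyed by (word, pos); an assumption, not WordNet data.
LEMMAS = {
    ("rocks", "n"): "rock",
    ("corpora", "n"): "corpus",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself when no entry exists."""
    return LEMMAS.get((word, pos), word)

for word, pos in [("rocks", "n"), ("better", "n"), ("better", "a")]:
    print(word, pos, "->", toy_lemmatize(word, pos))
```

As with the WordNet output above, "better" is left unchanged when treated as a noun and becomes "good" only when tagged as an adjective.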


## Text Data Representation (Vectorizing)

### Create Document Term Matrix - Vectorize your list of words

In [19]:
from sklearn.feature_extraction.text import CountVectorizer #for BoW
vect = CountVectorizer()

# Convenience function: fit a vectorizer on a series of text and
# return the resulting document-term matrix as a DataFrame.
def create_doc_term_matrix(text, vectorizer):
    doc_term_matrix = vectorizer.fit_transform(text)
    return pd.DataFrame(doc_term_matrix.toarray(), columns=vectorizer.get_feature_names_out())

#example
ex_text = pd.Series(["The movie scary", " Tenet is a great movie", "Last movie I saw was in March"])
ex_text

0                  The movie scary
1           Tenet is a great movie
2    Last movie I saw was in March
dtype: object

In [20]:
create_doc_term_matrix(ex_text,vect)

Unnamed: 0,great,in,is,last,march,movie,saw,scary,tenet,the,was
0,0,0,0,0,0,1,0,1,0,1,0
1,1,0,1,0,0,1,0,0,1,0,0
2,0,1,0,1,1,1,1,0,0,0,1
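To see where these counts come from, the matrix can be rebuilt by hand. The sketch below assumes `CountVectorizer`'s default behaviour: lowercasing, and a token pattern that keeps only tokens of two or more word characters, which is why "a" and "I" have no column. The names `TOKEN_PATTERN`, `count_tokens`, `vocabulary` and `matrix` are introduced here for illustration.

```python
import re
from collections import Counter

docs = ["The movie scary", " Tenet is a great movie", "Last movie I saw was in March"]

# Tokens of two or more word characters, mirroring CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def count_tokens(doc):
    """Lowercase a document and count its tokens."""
    return Counter(TOKEN_PATTERN.findall(doc.lower()))

counts = [count_tokens(doc) for doc in docs]
vocabulary = sorted(set().union(*counts))  # one column per unique token
matrix = [[c[term] for term in vocabulary] for c in counts]  # missing terms count as 0

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary term, in the same alphabetical order as the DataFrame above.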


### Bag of Words (BoW)

In [21]:
messages.head()

Unnamed: 0,label,text,text_clean,text_tokenized,text_nostop
0,ham,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...,"[go, until, jurong, point, crazy, available, o...","[go, jurong, point, crazy, available, bugis, n..."
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l...","[nah, dont, think, goes, usf, lives, around, t..."


In [22]:
vectorizer_input = messages['text_nostop'].apply(lambda x: " ".join(x) )
vectorizer_input

0       go jurong point crazy available bugis n great ...
1                                 ok lar joking wif u oni
2       free entry 2 wkly comp win fa cup final tkts 2...
3                     u dun say early hor u c already say
4             nah dont think goes usf lives around though
                              ...                        
5567    2nd time tried 2 contact u u â£750 pound prize...
5568                         ã¼ b going esplanade fr home
5569                          pity mood soany suggestions
5570    guy bitching acted like id interested buying s...
5571                                       rofl true name
Name: text_nostop, Length: 5572, dtype: object

In [23]:
X = vect.fit_transform(vectorizer_input)
type(X)

scipy.sparse._csr.csr_matrix

In [24]:
X.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [25]:
X.shape

(5572, 9459)

#### Using the function above for convenience

In [26]:
#function that was defined earlier to view BoW as a DataFrame
create_doc_term_matrix(vectorizer_input,CountVectorizer())

Unnamed: 0,008704050406,0089my,0121,01223585236,01223585334,0125698789,02,020603,0207,02070836089,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,¾ã,ã¼,ã¼ll
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5568,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
5569,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5570,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Drawbacks of a Bag of Words Model

* If new sentences contain new words, the vocabulary grows, and with it the length of every vector.

* The vectors contain mostly 0s, resulting in a sparse matrix (here 5572 × 9459), which is wasteful to store and process as a dense array.

* No information about grammar or word order is retained. We rely only on word frequency and throw all other relations and patterns out, so negation, for example, is ignored.
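The last drawback can be shown with two sentences that mean opposite things but use the same words. This is a small sketch using plain Python counters rather than scikit-learn; `bag_of_words` and `bag_of_bigrams` are helper names introduced here.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count unigrams; word order is discarded."""
    return Counter(sentence.lower().split())

def bag_of_bigrams(sentence):
    """Count adjacent word pairs, which keeps some local word order."""
    tokens = sentence.lower().split()
    return Counter(zip(tokens, tokens[1:]))

s1 = "the movie was good not bad"
s2 = "the movie was bad not good"

same_unigrams = bag_of_words(s1) == bag_of_words(s2)
same_bigrams = bag_of_bigrams(s1) == bag_of_bigrams(s2)
print(same_unigrams, same_bigrams)
```

The unigram counts are identical while the bigram counts differ. `CountVectorizer` exposes an `ngram_range` parameter for including such n-grams, at the cost of a larger vocabulary.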

### Term Frequency - Inverse Document Frequency (TF-IDF)

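TF-IDF creates a document-term matrix whose columns still represent single unique terms (unigrams), but each cell holds a weight meant to represent how **important** a word is to a document: the weight grows with the word's frequency in that document and shrinks as the word appears in more documents.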

In [27]:
ex_text

0                  The movie scary
1           Tenet is a great movie
2    Last movie I saw was in March
dtype: object

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer


create_doc_term_matrix(ex_text,TfidfVectorizer())

Unnamed: 0,great,in,is,last,march,movie,saw,scary,tenet,the,was
0,0.0,0.0,0.0,0.0,0.0,0.385372,0.0,0.652491,0.0,0.652491,0.0
1,0.546454,0.0,0.546454,0.0,0.0,0.322745,0.0,0.0,0.546454,0.0,0.0
2,0.0,0.432385,0.0,0.432385,0.432385,0.255374,0.432385,0.0,0.0,0.0,0.432385
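The weights in the first row can be reproduced by hand. The sketch below assumes `TfidfVectorizer`'s default settings: smoothed IDF, $\text{idf}(t) = \ln\dfrac{1+n}{1+\text{df}(t)} + 1$, followed by L2 normalisation of each row. The variable names are introduced here for illustration.

```python
import math

# Tokens of the first example document and their document frequencies
# across all three documents, read off the count matrix shown earlier.
n_docs = 3
doc_tokens = ["the", "movie", "scary"]
doc_freq = {"the": 1, "movie": 3, "scary": 1}

def smooth_idf(df, n):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1

# Each term occurs once in this document, so tf = 1 and the raw weight equals idf.
raw = {term: smooth_idf(doc_freq[term], n_docs) for term in doc_tokens}

# L2-normalise so the document vector has unit length.
norm = math.sqrt(sum(w * w for w in raw.values()))
tfidf = {term: w / norm for term, w in raw.items()}

for term, weight in tfidf.items():
    print(f"{term}: {weight:.6f}")
```

"movie" appears in every document, so its IDF is the minimum value of 1 and it receives the lowest weight in the row, while "the" and "scary" each appear in only one document and are weighted higher.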


In [29]:
messages.head()

Unnamed: 0,label,text,text_clean,text_tokenized,text_nostop
0,ham,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...,"[go, until, jurong, point, crazy, available, o...","[go, jurong, point, crazy, available, bugis, n..."
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l...","[nah, dont, think, goes, usf, lives, around, t..."


In [30]:
# Fit a basic TFIDF Vectorizer and view the results
tfidf_vect = TfidfVectorizer()
X_tfidf = tfidf_vect.fit_transform(vectorizer_input)
type(X_tfidf)

scipy.sparse._csr.csr_matrix

In [31]:
X_tfidf.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [32]:
X_tfidf.shape

(5572, 9459)

In [33]:
#function that was defined earlier to view TF-IDF result as a DataFrame
create_doc_term_matrix(vectorizer_input,TfidfVectorizer())

Unnamed: 0,008704050406,0089my,0121,01223585236,01223585334,0125698789,02,020603,0207,02070836089,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,¾ã,ã¼,ã¼ll
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
5568,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.378219,0.0
5569,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
5570,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0


## Example of classification on our data
Let's classify each SMS as spam or ham (real) using the TF-IDF-transformed data and a logistic regression model.

In [34]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

X = X_tfidf.toarray()
y = messages['label'] == 'spam' #true=spam messages, false=real messages

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = LogisticRegression()
clf.fit(X_train,y_train)

y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
C = confusion_matrix(y_test, y_pred)

print(f'Accuracy: {acc}')
print(f'Confusion matrix:\n {C}')

Accuracy: 0.9515550239234449
Confusion matrix:
 [[1444    4]
 [  77  147]]


Confusion matrix review:
- Column 0: predicted 0 (real)
- Column 1: predicted 1 (spam)
- Row 0: actually 0 (real)
- Row 1: actually 1 (spam)

|  **true negatives** | **false positives** |
|:-------------------:|:-------------------:|
| **false negatives** |  **true positives** |

Model results:
- 1444 true negatives (predicted real and was real)
- 77 false negatives (predicted real but was spam)
- 147 true positives (predicted spam and was spam)
- 4 false positives (predicted spam but was real)

What "hurts" us more in this situation? False negatives or false positives?

Review:
- Precision = $\dfrac{tp}{tp+fp}$ = "When I predict positive, what % am I correct?" = "When I predict spam, how often am I correct?"
- Recall = $\dfrac{tp}{tp+fn}$ = "Of all the actual positives, what % do I get correct?" = "What percentage of the spam messages do I predict correctly?"
- f1 = $2 \times \dfrac{precision \times recall}{precision + recall}$
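As a worked check, the three formulas can be applied directly to the counts in the confusion matrix above. This is a plain-Python sketch; the variable names are introduced here.

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 1444, 4, 77, 147

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

Precision is high because only 4 real messages were flagged as spam, while recall is much lower because 77 of the 224 spam messages in the test set were missed.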

In [35]:
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

pre = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f'Precision: {pre}')
print(f'Recall: {rec}')
print(f'f1-score: {f1}')

Precision: 0.9735099337748344
Recall: 0.65625
f1-score: 0.784


## Easy Balancing with SMOTE

Our data set contains far more real messages than spam messages (4825 ham vs. 747 spam). We can use SMOTE (Synthetic Minority Over-sampling Technique) to oversample the under-represented class (spam messages in this case).
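At its core, SMOTE creates each synthetic sample by interpolating between a minority-class point and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step; the neighbour search is omitted, and the two feature vectors and the helper `synthetic_sample` are hypothetical.

```python
import random

def synthetic_sample(point, neighbor, rng):
    """Return a random point on the line segment between point and neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [p + lam * (q - p) for p, q in zip(point, neighbor)]

rng = random.Random(42)
spam_a = [0.0, 0.5, 0.2]  # hypothetical minority-class feature vectors
spam_b = [0.4, 0.1, 0.2]
new_point = synthetic_sample(spam_a, spam_b, rng)
print(new_point)
```

Every coordinate of the new point lies between the corresponding coordinates of the two originals. Resampling should be applied to the training split only, so that the test set still reflects the real class balance.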

In [36]:
print(X_train.shape)
y_train.value_counts()

(3900, 9459)


False    3377
True      523
Name: label, dtype: int64

In [37]:
# !pip install imbalanced-learn

In [38]:
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)

print(X_res.shape)
y_res.value_counts()

(6754, 9459)


False    3377
True     3377
Name: label, dtype: int64

In [39]:
clf_res = LogisticRegression()
clf_res.fit(X_res,y_res)

y_res_pred = clf_res.predict(X_test)

acc = accuracy_score(y_test, y_res_pred)
pre = precision_score(y_test, y_res_pred)
rec = recall_score(y_test, y_res_pred)
f1 = f1_score(y_test, y_res_pred)
C = confusion_matrix(y_test, y_res_pred)

print(f'Accuracy: {acc}')
print(f'Precision: {pre}')
print(f'Recall: {rec}')
print(f'f1-score: {f1}')
print(f'Confusion matrix:\n {C}')

Accuracy: 0.9772727272727273
Precision: 0.926605504587156
Recall: 0.9017857142857143
f1-score: 0.9140271493212669
Confusion matrix:
 [[1432   16]
 [  22  202]]


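### Word2Vec - A Class of Neural Net Vectorizers

word2vec is a shallow, two-layer neural network that takes a text corpus as input and returns word embeddings (dense numeric vectors) as output.

#### Why word2vec?

1. It preserves relationships between words: words used in similar contexts end up with similar vectors.
2. The vector length is fixed, so adding new words to the vocabulary does not lengthen the vectors as it does with BoW (the model must still be trained on text containing a word before it has a vector for it).
3. It often results in better predictions when the embeddings are used as input to another, larger neural network.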

In [40]:
messages.head()

Unnamed: 0,label,text,text_clean,text_tokenized,text_nostop
0,ham,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...,"[go, until, jurong, point, crazy, available, o...","[go, jurong, point, crazy, available, bugis, n..."
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l...","[nah, dont, think, goes, usf, lives, around, t..."


In [41]:
# !pip install gensim

In [42]:
import gensim

# Create CBoW model
# size: output vector size (renamed to vector_size in gensim >= 4.0)
# window: size of the window of surrounding words
# min_count: ignores all words that appear fewer than min_count times
Model_CBoW = gensim.models.Word2Vec(messages['text_tokenized'], size=100, window=3, min_count=2)

# Vector representation of the word 'king'
Model_CBoW.wv['king']

array([-0.00015496, -0.0215259 ,  0.03630557, -0.06822649,  0.01102275,
       -0.02449784,  0.02897923, -0.02125908,  0.09300253, -0.04150956,
       -0.01907781, -0.08700907,  0.0542681 , -0.09134455, -0.01677775,
       -0.00054375,  0.04392523, -0.04110511,  0.03905876,  0.02563499,
        0.08110146, -0.01525502,  0.0055393 ,  0.06101158,  0.00163346,
        0.03482507,  0.031552  ,  0.03945646, -0.01660838,  0.07296526,
       -0.069723  ,  0.03075303,  0.02469926,  0.03930927,  0.01173939,
       -0.06289896,  0.03673577, -0.01746828,  0.06749853, -0.06443445,
       -0.0041235 ,  0.0725765 ,  0.03936092,  0.01489849,  0.02850564,
       -0.04357232, -0.01927561, -0.00445219,  0.00934493, -0.0855462 ,
        0.034843  ,  0.05703941, -0.0244067 ,  0.02066357, -0.01101948,
       -0.04741342, -0.01615389, -0.03048112,  0.00315223,  0.00595966,
        0.04356307,  0.03485566,  0.03656057, -0.0218999 ,  0.02328006,
       -0.00412468,  0.02576547,  0.06213411,  0.03650833,  0.00

In [43]:
# Find the most similar words to "hello" based on word vectors from our trained model
Model_CBoW.wv.most_similar('hello')

[('d', 0.9998430609703064),
 ('my', 0.9998379945755005),
 ('today', 0.9998365640640259),
 ('amp', 0.9998316764831543),
 ('as', 0.9998304843902588),
 ('da', 0.9998303651809692),
 ('well', 0.9998284578323364),
 ('last', 0.9998210072517395),
 ('c', 0.9998205900192261),
 ('say', 0.9998131990432739)]
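The scores returned by `most_similar` are cosine similarities between word vectors. Here is a minimal sketch of that measure in plain Python. The three short vectors are hypothetical stand-ins for 100-dimensional embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, for illustration only.
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [-3.0, 0.0, 1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Vectors pointing the same way score close to 1 and orthogonal vectors score 0. When every neighbour scores around 0.9998, as in the output above, nearly all vectors point in almost the same direction. One plausible reading is that a corpus of this size is too small for the model to separate words well.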

In [44]:
# Create Skip Gram model (notice the sg parameter)
Model_SG = gensim.models.Word2Vec(messages['text_tokenized'], size=100, window=3, min_count=2, sg=1)

# Vector representation of the word 'king'
Model_SG.wv['king']

array([-0.02803215, -0.0146966 ,  0.09850621, -0.13715036,  0.03307633,
       -0.03670636,  0.05815274, -0.09601659,  0.34234607, -0.1280102 ,
       -0.00173065, -0.2688391 ,  0.15124065, -0.26570135,  0.0388335 ,
       -0.00159834,  0.126596  , -0.14970672,  0.11751682,  0.12102795,
        0.27480483, -0.05340677,  0.04008303,  0.16985314,  0.02361733,
        0.09874941,  0.10157059,  0.11704186, -0.10816357,  0.18560493,
       -0.14831547,  0.13020548,  0.04430413,  0.08217313,  0.08756179,
       -0.13105722,  0.09015936, -0.04166153,  0.20433678, -0.10988935,
        0.00903545,  0.20010915,  0.10692555,  0.05195532,  0.11819371,
       -0.10305245, -0.06506041, -0.00752653, -0.04515985, -0.25956202,
        0.13595797,  0.2110356 , -0.03585917,  0.07992498, -0.03435447,
       -0.14372638, -0.08772189, -0.0636804 , -0.02553392,  0.00884872,
        0.13785906,  0.12314767,  0.13136153, -0.07787172,  0.05254714,
        0.03363005,  0.08997763,  0.23236737,  0.08319782,  0.08

In [45]:
# Find the most similar words to "hello" based on word vectors from our trained model
Model_SG.wv.most_similar('hello')

[('boy', 0.9994226694107056),
 ('long', 0.9992021322250366),
 ('god', 0.9990752935409546),
 ('far', 0.99901282787323),
 ('hows', 0.9989965558052063),
 ('kind', 0.9989460110664368),
 ('feeling', 0.9989080429077148),
 ('most', 0.9988796710968018),
 ('knew', 0.9988732933998108),
 ('afternoon', 0.9988508820533752)]
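The CBoW and Skip Gram models differ in how training examples are built from each window of text. The sketch below illustrates that difference on one tokenized message. It is a simplification that leaves out details of real implementations such as negative sampling and subsampling of frequent words, and `context_windows` is a helper name introduced here.

```python
def context_windows(tokens, window):
    """Yield (target, context_words) for every position in tokens."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

tokens = ["ok", "lar", "joking", "wif", "u", "oni"]

# CBoW: predict the target word from its whole context.
cbow_examples = [(context, target) for target, context in context_windows(tokens, 2)]

# Skip-gram: predict each context word from the target word.
skipgram_pairs = [(target, c) for target, context in context_windows(tokens, 2) for c in context]

print(cbow_examples[2])
print(skipgram_pairs[:3])
```

CBoW produces one training example per position, while skip-gram produces one per (target, context word) pair, so skip-gram sees more examples for the same text.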

#### Pre-trained word2vec models

There are also pre-trained word2vec models. For example, [`word2vec-google-news-300`](https://code.google.com/archive/p/word2vec/) is trained on a Google News dataset consisting of about 100 billion words. The model consists of 300-dimensional vectors for 3 million words and phrases.

In [46]:
import gensim.downloader as api

google_news_300 = api.load('word2vec-google-news-300')

In [47]:
google_news_300['king']

array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01,  

In [48]:
google_news_300.most_similar('hello')

[('hi', 0.6548984050750732),
 ('goodbye', 0.639905571937561),
 ('howdy', 0.6310957670211792),
 ('goodnight', 0.5920577645301819),
 ('greeting', 0.5855878591537476),
 ('Hello', 0.5842196345329285),
 ("g'day", 0.5754078030586243),
 ('See_ya', 0.5688871145248413),
 ('ya_doin', 0.5643119812011719),
 ('greet', 0.5636603832244873)]