# NLP I Tutorial

Notebook from [Zain Hasan](https://drive.google.com/drive/folders/1eISkk5lT79Ao3Mgngz5nkzn8qafX0RUF?usp=sharing).


## Text Simplification

### Read in the data

The dataset is also available from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

In [1]:
# Read in and view the raw data
import pandas as pd

messages = pd.read_table('data/SMSSpamCollection.txt', encoding='latin-1', header=None)
messages.head(10)

Unnamed: 0,0,1
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [3]:
#change column names to be more descriptive
messages.columns = ["label", "text"]
messages.head(10)

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [4]:
# How big is this dataset?
messages.shape

(5572, 2)

In [5]:
# What portion of our text messages are actually spam?
messages['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64
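The counts above can be turned into class proportions with a little arithmetic. The sketch below simply reuses the two numbers printed by `value_counts()`; in pandas, `messages['label'].value_counts(normalize=True)` should give the same proportions directly.

```python
# Class counts copied from the value_counts() output above
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())  # total number of messages
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

Roughly 13% of the messages are spam, so the two classes are noticeably imbalanced.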

In [6]:
# Are we missing any data?
print('Number of nulls in label: {}'.format(messages['label'].isnull().sum()))
print('Number of nulls in text: {}'.format(messages['text'].isnull().sum()))

Number of nulls in label: 0
Number of nulls in text: 0


### Remove Punctuation

In [7]:
# What punctuation is included in the default list?
import string

string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [9]:
# Define a function to remove punctuation in our messages
def remove_punct(text):
    text = "".join([char for char in text if char not in string.punctuation])
    return text

In [11]:
txt1 = 'hi! my name is Jess.'
txt2 = remove_punct(txt1)
print(txt2)

hi my name is Jess
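As a possible alternative to the character-by-character list comprehension, the same result can be obtained with `str.translate`. This is only a sketch: the name `remove_punct_translate` is hypothetical and not part of the original notebook.

```python
import string

# Translation table that deletes every character in string.punctuation
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_translate(text):
    """Hypothetical alternative to remove_punct built on str.translate."""
    return text.translate(_PUNCT_TABLE)

print(remove_punct_translate("hi! my name is Jess."))
```

On long strings this tends to be faster than a Python-level loop, although for short SMS messages the difference is unlikely to matter.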


In [10]:
messages['text_clean'] = messages['text'].apply(remove_punct)

messages.head(10)


Unnamed: 0,label,text,text_clean
0,ham,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say
4,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...
5,spam,FreeMsg Hey there darling it's been 3 week's n...,FreeMsg Hey there darling its been 3 weeks now...
6,ham,Even my brother is not like to speak with me. ...,Even my brother is not like to speak with me T...
7,ham,As per your request 'Melle Melle (Oru Minnamin...,As per your request Melle Melle Oru Minnaminun...
8,spam,WINNER!! As a valued network customer you have...,WINNER As a valued network customer you have b...
9,spam,Had your mobile 11 months or more? U R entitle...,Had your mobile 11 months or more U R entitled...


### Tokenize (and lowercase)


In [12]:
# Define a function to split our sentences into a list of words,
# e.g. "Define a function" -> ['Define', 'a', 'function']

def tokenize(text):
    # str.split() with no arguments splits on any run of whitespace
    tokens = text.split()
    return tokens

messages['text_tokenized'] = messages['text_clean'].apply(lambda x: tokenize(x.lower()))

messages.head()

Unnamed: 0,label,text,text_clean,text_tokenized
0,ham,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...,"[go, until, jurong, point, crazy, available, o..."
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, t..."
4,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l..."
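Splitting on whitespace works here because punctuation was removed in the previous step. As a hedged alternative, a regular expression can lowercase, tokenize, and drop punctuation in one pass. The function name `tokenize_regex` and the pattern `\w+` are illustrative choices, not part of the original notebook.

```python
import re

def tokenize_regex(text):
    """Lowercase the text and return each run of word characters as a token."""
    return re.findall(r"\w+", text.lower())

print(tokenize_regex("Ok lar... Joking wif u oni..."))
```

Note that this pattern splits contractions differently from the pipeline above: `"don't"` becomes `['don', 't']` rather than `['dont']`.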


### Remove Stopwords

The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing written in the Python programming language.

We will first load NLTK and download its pre-defined stopwords.

In [13]:
# Import the NLTK package and download the necessary data
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# view the stopwords
stopwords.words()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


['إذ',
 'إذا',
 'إذما',
 'إذن',
 'أف',
 'أقل',
 'أكثر',
 'ألا',
 'إلا',
 'التي',
 'الذي',
 'الذين',
 'اللاتي',
 'اللائي',
 'اللتان',
 'اللتيا',
 'اللتين',
 'اللذان',
 'اللذين',
 'اللواتي',
 'إلى',
 'إليك',
 'إليكم',
 'إليكما',
 'إليكن',
 'أم',
 'أما',
 'أما',
 'إما',
 'أن',
 'إن',
 'إنا',
 'أنا',
 'أنت',
 'أنتم',
 'أنتما',
 'أنتن',
 'إنما',
 'إنه',
 'أنى',
 'أنى',
 'آه',
 'آها',
 'أو',
 'أولاء',
 'أولئك',
 'أوه',
 'آي',
 'أي',
 'أيها',
 'إي',
 'أين',
 'أين',
 'أينما',
 'إيه',
 'بخ',
 'بس',
 'بعد',
 'بعض',
 'بك',
 'بكم',
 'بكم',
 'بكما',
 'بكن',
 'بل',
 'بلى',
 'بما',
 'بماذا',
 'بمن',
 'بنا',
 'به',
 'بها',
 'بهم',
 'بهما',
 'بهن',
 'بي',
 'بين',
 'بيد',
 'تلك',
 'تلكم',
 'تلكما',
 'ته',
 'تي',
 'تين',
 'تينك',
 'ثم',
 'ثمة',
 'حاشا',
 'حبذا',
 'حتى',
 'حيث',
 'حيثما',
 'حين',
 'خلا',
 'دون',
 'ذا',
 'ذات',
 'ذاك',
 'ذان',
 'ذانك',
 'ذلك',
 'ذلكم',
 'ذلكما',
 'ذلكن',
 'ذه',
 'ذو',
 'ذوا',
 'ذواتا',
 'ذواتي',
 'ذي',
 'ذين',
 'ذينك',
 'ريث',
 'سوف',
 'سوى',
 'شتان',
 'عدا',
 'عسى',
 'عل'

You will see that NLTK contains pre-defined stopwords across many different languages. We only want to use the English stopwords.

In [14]:
ENGstopwords = stopwords.words('english')
ENGstopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [16]:
# Define a function to remove all stopwords
def remove_stopwords(tokenized_text):    
    text = [word for word in tokenized_text if word not in ENGstopwords]
    return text

messages['text_nostop'] = messages['text_tokenized'].apply(lambda x: remove_stopwords(x))

messages.head()

Unnamed: 0,label,text,text_clean,text_tokenized,text_nostop
0,ham,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...,"[go, until, jurong, point, crazy, available, o...","[go, jurong, point, crazy, available, bugis, n..."
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l...","[nah, dont, think, goes, usf, lives, around, t..."
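Because `ENGstopwords` is a list, every `word not in ENGstopwords` check scans the list. A minimal sketch of the same filter using a set is shown below. The short stopword list here is a made-up stand-in so the example runs without downloading NLTK data; with the notebook's data one would presumably use `set(ENGstopwords)`.

```python
# Small illustrative stopword list (an assumption; the notebook uses NLTK's English list)
example_stopwords = ["i", "he", "to", "so", "then", "in", "a", "until"]

# Set membership is O(1) on average, list membership is O(n)
stopword_set = set(example_stopwords)

def remove_stopwords_fast(tokenized_text, stopword_set=stopword_set):
    """Keep only the tokens that are not in the stopword set."""
    return [word for word in tokenized_text if word not in stopword_set]

tokens = ["nah", "i", "dont", "think", "he", "goes", "to", "usf"]
print(remove_stopwords_fast(tokens))
```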


### Stemming

In [17]:
# importing modules 
nltk.download('punkt')
from nltk.stem import PorterStemmer

ps = PorterStemmer()  # applies rule-based suffix stripping to reduce words to their stems
   
sentence="Programmers have programed by learning how to programatically program"

#first tokenize (function defined earlier)
words = tokenize(sentence)
words

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['Programmers',
 'have',
 'programed',
 'by',
 'learning',
 'how',
 'to',
 'programatically',
 'program']

In [18]:
#stem each of these tokens and compare
for w in words:
    print(w, " -> ", ps.stem(w))

Programmers  ->  programm
have  ->  have
programed  ->  program
by  ->  by
learning  ->  learn
how  ->  how
to  ->  to
programatically  ->  programat
program  ->  program


### Lemmatization 

Lemmatization tries to reduce vocabulary size more 'intelligently' than stemming: instead of stripping suffixes, it maps the inflected forms of a word to its dictionary form (lemma). This requires more advanced techniques, such as part-of-speech (POS) information and dictionary look-up.

In [20]:
! pip install nltk


In [1]:
# import these modules
import nltk
nltk.download('omw-1.4')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
  
lemmatizer = WordNetLemmatizer() 

print("Rocks ->", lemmatizer.lemmatize("rocks")) 
print("Corpora ->", lemmatizer.lemmatize("corpora")) 

# By default, lemmatize() treats every word as a noun (pos="n")
# pos="a" tells it to treat the word as an adjective
print("better ->", lemmatizer.lemmatize("better"))
print("better ->", lemmatizer.lemmatize("better", pos ="a"))

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Rocks -> rock
Corpora -> corpus
better -> better
better -> good
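The contrast between rule-based stemming and dictionary-based lemmatization can be sketched without NLTK. The suffix rules and the tiny lemma dictionary below are invented for illustration; they are far simpler than the Porter algorithm or WordNet and only mirror the examples above.

```python
# Toy suffix-stripping "stemmer": hypothetical rules, not the Porter algorithm
def toy_stem(word):
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary keyed on (word, part of speech); entries are made up for this sketch
TOY_LEMMAS = {
    ("rocks", "n"): "rock",
    ("corpora", "n"): "corpus",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when the dictionary has no entry
    return TOY_LEMMAS.get((word, pos), word)

for w in ("rocks", "corpora", "better"):
    print(w, "| stem:", toy_stem(w), "| noun lemma:", toy_lemmatize(w), "| adj lemma:", toy_lemmatize(w, "a"))
```

Suffix rules handle `rocks` but cannot know that `corpora` is the plural of `corpus`, or that `better` relates to `good`; those mappings need a dictionary and, for `better`, the right part of speech.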


## Text Data Representation (Vectorizing)

### Create Document Term Matrix - Vectorize your list of words

In [22]:
from sklearn.feature_extraction.text import CountVectorizer #for BoW
vect = CountVectorizer()

# Convenience function:
# fits the given vectorizer on a series of text and
# returns the resulting document-term matrix as a DataFrame.
def create_doc_term_matrix(text, vectorizer):
    doc_term_matrix = vectorizer.fit_transform(text)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    return pd.DataFrame(doc_term_matrix.toarray(), columns=vectorizer.get_feature_names_out())

#example
ex_text = pd.Series(["The movie scary", "Tenet is a great movie", "Last movie I saw was in March"])
ex_text

0                  The movie scary
1           Tenet is a great movie
2    Last movie I saw was in March
dtype: object

In [23]:
create_doc_term_matrix(ex_text,vect)



Unnamed: 0,great,in,is,last,march,movie,saw,scary,tenet,the,was
0,0,0,0,0,0,1,0,1,0,1,0
1,1,0,1,0,0,1,0,0,1,0,0
2,0,1,0,1,1,1,1,0,0,0,1
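To see where these counts come from, the matrix can be rebuilt by hand. The sketch below assumes `CountVectorizer`'s default behaviour of lowercasing the text and keeping only tokens of two or more word characters, which is why `a` and `I` have no columns. The helper name `simple_tokens` is hypothetical.

```python
import re
from collections import Counter

docs = ["The movie scary", "Tenet is a great movie", "Last movie I saw was in March"]

# Mimic the assumed defaults: lowercase, then keep tokens of 2+ word characters
def simple_tokens(doc):
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: sorted unique tokens across all documents
vocab = sorted({tok for doc in docs for tok in simple_tokens(doc)})

# One row of counts per document, one column per vocabulary term
matrix = []
for doc in docs:
    counts = Counter(simple_tokens(doc))
    matrix.append([counts[term] for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```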


### Bag of Words (BoW)

In [24]:
messages.head()

Unnamed: 0,label,text,text_clean,text_tokenized,text_nostop
0,ham,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...,"[go, until, jurong, point, crazy, available, o...","[go, jurong, point, crazy, available, bugis, n..."
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l...","[nah, dont, think, goes, usf, lives, around, t..."


In [25]:
vectorizer_input = messages['text_nostop'].apply(lambda x: " ".join(x) )
vectorizer_input

0       go jurong point crazy available bugis n great ...
1                                 ok lar joking wif u oni
2       free entry 2 wkly comp win fa cup final tkts 2...
3                     u dun say early hor u c already say
4             nah dont think goes usf lives around though
                              ...                        
5567    2nd time tried 2 contact u u â£750 pound prize...
5568                         ã¼ b going esplanade fr home
5569                          pity mood soany suggestions
5570    guy bitching acted like id interested buying s...
5571                                       rofl true name
Name: text_nostop, Length: 5572, dtype: object

In [26]:
X = vect.fit_transform(vectorizer_input)
type(X)

scipy.sparse._csr.csr_matrix

In [27]:
X.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [28]:
X.shape

(5572, 9459)

#### Using above function for convenience

In [29]:
#function that was defined earlier to view BoW as a DataFrame
create_doc_term_matrix(vectorizer_input,CountVectorizer())



Unnamed: 0,008704050406,0089my,0121,01223585236,01223585334,0125698789,02,020603,0207,02070836089,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,¾ã,ã¼,ã¼ll
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5568,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
5569,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5570,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Drawbacks of a Bag of Words Model

* If new sentences contain new words, the vocabulary has to grow, and the length of every vector grows with it.

* The vectors contain mostly 0s, which results in a sparse matrix (something we would like to avoid).

* We retain no information about the grammar of the sentences or the order of the words in the text. We rely only on word frequency and discard all other relations and patterns (for example, negation is ignored).
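The loss of word order can be shown in a few lines. In this sketch, two sentences with opposite meanings are made up for illustration and counted with a plain `Counter`, which stands in for a bag-of-words vectorizer.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count how often each lowercase token occurs; word order is discarded."""
    return Counter(sentence.lower().split())

a = bag_of_words("the movie was good not bad")
b = bag_of_words("the movie was bad not good")

# The two bags are identical even though the sentences mean opposite things
print(a == b)
```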

### Term Frequency - Inverse Document Frequency (TF-IDF)

TF-IDF also creates a document-term matrix whose columns represent single unique terms (unigrams), but each cell holds a weight meant to represent how **important** a word is to a document, rather than a raw count.

In [30]:
ex_text

0                  The movie scary
1           Tenet is a great movie
2    Last movie I saw was in March
dtype: object

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

create_doc_term_matrix(ex_text,TfidfVectorizer())



Unnamed: 0,great,in,is,last,march,movie,saw,scary,tenet,the,was
0,0.0,0.0,0.0,0.0,0.0,0.385372,0.0,0.652491,0.0,0.652491,0.0
1,0.546454,0.0,0.546454,0.0,0.0,0.322745,0.0,0.0,0.546454,0.0,0.0
2,0.0,0.432385,0.0,0.432385,0.432385,0.255374,0.432385,0.0,0.0,0.0,0.432385
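The weights above can be reproduced by hand. The sketch below assumes scikit-learn's default `TfidfVectorizer` settings: raw term counts as tf, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The tokenization rule (lowercase, tokens of two or more word characters) is likewise an assumption about the defaults.

```python
import math
import re
from collections import Counter

docs = ["The movie scary", "Tenet is a great movie", "Last movie I saw was in March"]
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
vocab = sorted({t for toks in tokenized for t in toks})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(t for toks in tokenized for t in set(toks))

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    tf = Counter(tokens)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))  # L2 norm of the row
    return [x / norm for x in raw]

rows = [tfidf_row(toks) for toks in tokenized]
print(dict(zip(vocab, (round(x, 6) for x in rows[0]))))
```

`movie` appears in all three documents, so its idf is the minimum value of 1 and it receives the lowest non-zero weight in every row.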


In [32]:
messages.head()

Unnamed: 0,label,text,text_clean,text_tokenized,text_nostop
0,ham,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...,"[go, until, jurong, point, crazy, available, o...","[go, jurong, point, crazy, available, bugis, n..."
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l...","[nah, dont, think, goes, usf, lives, around, t..."


In [33]:
# Fit a basic TFIDF Vectorizer and view the results
tfidf_vect = TfidfVectorizer()
X_tfidf = tfidf_vect.fit_transform(vectorizer_input)
type(X_tfidf)

scipy.sparse._csr.csr_matrix

In [34]:
X_tfidf.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [35]:
X_tfidf.shape

(5572, 9459)

In [36]:
#function that was defined earlier to view TF-IDF result as a DataFrame
create_doc_term_matrix(vectorizer_input,TfidfVectorizer())



Unnamed: 0,008704050406,0089my,0121,01223585236,01223585334,0125698789,02,020603,0207,02070836089,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,¾ã,ã¼,ã¼ll
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
5568,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.378219,0.0
5569,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
5570,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0


### Word2Vec - A Class of Neural Net Vectorizers

word2vec is a shallow, two-layer neural network that accepts a text corpus as input and returns word embeddings as output.

#### Why word2vec?

1. Preserves relationships between words
2. Deals with the addition of new words to the vocabulary: the vector size stays fixed instead of growing with the vocabulary
3. Often results in better predictions when word2vec-processed data is used as input to another, larger neural network

In [36]:
messages.head()

Unnamed: 0,label,text,text_clean,text_tokenized,text_nostop
0,ham,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...,"[go, until, jurong, point, crazy, available, o...","[go, jurong, point, crazy, available, bugis, n..."
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l...","[nah, dont, think, goes, usf, lives, around, t..."


In [52]:
import gensim

# Create CBoW model
# vector_size: output vector size
# window: size of window for surrounding words
# min_count: ignores all words that appear less than min_count times
# Passing the sentences to the constructor builds the vocabulary and trains the model,
# so no separate call to train() is needed.
Model_CBoW = gensim.models.Word2Vec(messages['text_tokenized'], vector_size=100, window=5, min_count=1)

# Vector representation of the word 'king'
Model_CBoW.wv['king']

array([ 0.00053299,  0.0770587 ,  0.0129337 ,  0.00099061,  0.00288038,
       -0.11428785,  0.06753374,  0.16055693, -0.06282012, -0.06971613,
       -0.01141732, -0.09305993, -0.03204932,  0.04160111,  0.02307217,
       -0.06142305,  0.00107547, -0.07183092,  0.00029542, -0.13335095,
        0.04767521,  0.02346842,  0.04948096, -0.05237101, -0.03553404,
       -0.00685976, -0.04844962, -0.0488181 , -0.06557083,  0.03341866,
        0.07576155, -0.02012962,  0.03622498, -0.07867093, -0.02741073,
        0.06345442,  0.01731435, -0.01382209, -0.04488309, -0.09873967,
       -0.00294101, -0.04300049,  0.01552348, -0.00074072,  0.05983271,
       -0.04845626, -0.03273592, -0.01112364,  0.0495051 ,  0.05376087,
        0.00170919, -0.06579264, -0.02823585, -0.0149427 , -0.01750498,
        0.04085903,  0.04102078,  0.00111599, -0.06691798,  0.03095351,
        0.03284454,  0.01292648,  0.01077312,  0.00735037, -0.09392164,
        0.08586631,  0.03405686,  0.06382684, -0.09616807,  0.07

In [53]:
# Find the most similar words to "want" based on word vectors from our trained model
Model_CBoW.wv.most_similar('want')

[('know', 0.9995521903038025),
 ('do', 0.9994624853134155),
 ('think', 0.9994074702262878),
 ('need', 0.9993860721588135),
 ('what', 0.9993777871131897),
 ('miss', 0.9993184804916382),
 ('dont', 0.9992822408676147),
 ('can', 0.9992190599441528),
 ('if', 0.9992154836654663),
 ('love', 0.9991970658302307)]
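The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that ranking follows; the 3-dimensional vectors are made up for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "embeddings", for illustration only
toy_vectors = {
    "want": [0.9, 0.1, 0.3],
    "need": [0.8, 0.2, 0.4],
    "prize": [-0.2, 0.9, 0.1],
}

query = "want"
ranked = sorted(
    ((word, cosine_similarity(toy_vectors[query], vec))
     for word, vec in toy_vectors.items() if word != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

When every neighbour scores close to 1.0, as in the output above, the vectors have probably not separated much, which is plausible for a corpus this small.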

In [55]:
# Create Skip Gram model (notice sg parameter)
Model_SG = gensim.models.Word2Vec(messages['text_tokenized'], vector_size = 100, window = 3, min_count = 2, sg = 1)
  
# Vector representation of the word 'king'
Model_SG.wv['king']

array([-0.05412154,  0.13770632, -0.01818684, -0.00588391,  0.00424419,
       -0.2835862 ,  0.14033507,  0.36311802, -0.1728852 , -0.15868893,
       -0.05419724, -0.26420334, -0.10196494,  0.11091769,  0.06366011,
       -0.13318095,  0.04745481, -0.1582334 ,  0.05767003, -0.32978764,
        0.11864798,  0.06538022,  0.1188538 , -0.07103211, -0.02752342,
        0.01811941, -0.0195948 , -0.05654018, -0.10825837,  0.08342687,
        0.19546086, -0.01796416,  0.08976912, -0.13683195, -0.03405635,
        0.12145037,  0.06709848, -0.07653404, -0.12559177, -0.28846246,
        0.04038489, -0.1211011 , -0.04711917,  0.06296743,  0.129568  ,
       -0.10197935, -0.16531448, -0.06551541,  0.08517002,  0.06217918,
        0.02138794, -0.10913286, -0.05052787, -0.07496309, -0.04296915,
        0.05578769,  0.14179902, -0.00404403, -0.20216748, -0.00106298,
        0.04323087, -0.06279748, -0.02001817, -0.03898003, -0.18491656,
        0.1576738 ,  0.10569096,  0.1671272 , -0.24403779,  0.21

In [57]:
# Find the most similar words to "want" based on word vectors from our trained model
Model_SG.wv.most_similar('want')

[('need', 0.9807649254798889),
 ('tell', 0.9681048393249512),
 ('see', 0.9676313400268555),
 ('get', 0.9566439390182495),
 ('miss', 0.9564326405525208),
 ('let', 0.9551256895065308),
 ('know', 0.9541807770729065),
 ('give', 0.9531234502792358),
 ('can', 0.952731192111969),
 ('think', 0.9515970349311829)]
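The difference between the two architectures lies in how training examples are formed from each context window. The sketch below is a simplification with a hypothetical helper `context_windows`; real implementations add details such as negative sampling and are not reproduced here.

```python
def context_windows(tokens, window):
    """Yield (target, context_words) for each position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window): i]
        right = tokens[i + 1: i + 1 + window]
        yield target, left + right

sentence = ["free", "entry", "win", "fa", "cup"]

# CBoW: the context words jointly predict the target word
cbow_examples = [(context, target) for target, context in context_windows(sentence, window=1)]

# Skip-gram: the target word predicts each context word separately
skipgram_examples = [
    (target, c)
    for target, context in context_windows(sentence, window=1)
    for c in context
]

print(cbow_examples)
print(skipgram_examples)
```

Skip-gram produces one example per (target, context word) pair, so it sees more training examples per sentence than CBoW does.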

#### Pre-trained word2vec models

There are also pre-trained word2vec models. For example, [`word2vec-google-news-300`](https://code.google.com/archive/p/word2vec/) is trained on a Google News dataset consisting of about 100 billion words. The model consists of 300-dimensional vectors for 3 million words and phrases.

In [58]:
import gensim.downloader as api

google_news_300 = api.load('word2vec-google-news-300')



In [59]:
google_news_300['king']

array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01,  

In [38]:
google_news_300.most_similar('now')

[('still', 0.6091184616088867),
 ('Now', 0.5883526802062988),
 ('already', 0.5807033181190491),
 ('Click_Clear', 0.5615708827972412),
 ('currently', 0.554436445236206),
 ('presently', 0.5278058648109436),
 ('just', 0.5243062973022461),
 ('Right', 0.5044599771499634),
 ('reserving_judgment_Hurley', 0.49310898780822754),
 ('unemploment_benefit', 0.4808916449546814)]
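Whether the vectors come from a model trained on the SMS data or from a pre-trained one, looking up a word that is not in the vocabulary raises a `KeyError`. A small guard can avoid that. The helper `lookup_vector` is hypothetical and is demonstrated on a plain dictionary; gensim's keyed vectors support the same `in` check and indexing, so the helper should work with them as well.

```python
def lookup_vector(vectors, word, default=None):
    """Return the vector for `word`, or `default` if the word is out of vocabulary."""
    # Works with any mapping-like object that supports `in` and indexing
    return vectors[word] if word in vectors else default

# Stand-in for a trained model's vectors (made-up numbers, for illustration only)
toy_vectors = {"king": [0.12, 0.03], "now": [0.05, -0.20]}

print(lookup_vector(toy_vectors, "king"))
print(lookup_vector(toy_vectors, "zogtorius"))
```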