In [None]:
!pip install gensim



Create a Dictionary from a list of sentences
======================================

In gensim, the dictionary maps every word (token) to its unique id.

You can create a dictionary from a list of sentences, from a text file that contains multiple lines of text, or from multiple such text files contained in a directory. In the second and third cases, the file is never loaded into memory in full; instead, the dictionary is updated as the text is read line by line.

Let’s start with the ‘List of sentences’ input.

When you have multiple sentences, you need to convert each sentence to a list of words. A list comprehension is a common way to do this.

In [20]:
import gensim
from gensim import corpora

# How to create a dictionary from a list of sentences?
documents = ["The Saudis are preparing a report that will acknowledge that",
             "Saudi journalist Jamal Khashoggi's death was the result of an",
             "interrogation that went wrong, one that was intended to lead",
             "to his abduction from Turkey, according to two sources."]

documents_2 = ["One source says the report will likely conclude that",
                "the operation was carried out without clearance and",
                "transparency and that those involved will be held",
                "responsible. One of the sources acknowledged that the",
                "report is still being prepared and cautioned that",
                "things could change."]

# Tokenize(split) the sentences into words
texts = [doc.split() for doc in documents]
print(texts)

# Create dictionary
dictionary = corpora.Dictionary(texts)

# Get information about the dictionary
print(dictionary)


[['The', 'Saudis', 'are', 'preparing', 'a', 'report', 'that', 'will', 'acknowledge', 'that'], ['Saudi', 'journalist', 'Jamal', "Khashoggi's", 'death', 'was', 'the', 'result', 'of', 'an'], ['interrogation', 'that', 'went', 'wrong,', 'one', 'that', 'was', 'intended', 'to', 'lead'], ['to', 'his', 'abduction', 'from', 'Turkey,', 'according', 'to', 'two', 'sources.']]
Dictionary<33 unique tokens: ['Saudis', 'The', 'a', 'acknowledge', 'are']...>


As the output shows, the dictionary has 33 unique tokens (words). Let’s see the unique id of each token.

In [None]:
# Show the word to id map
print(dictionary.token2id)


{'Saudis': 0, 'The': 1, 'a': 2, 'acknowledge': 3, 'are': 4, 'preparing': 5, 'report': 6, 'that': 7, 'will': 8, 'Jamal': 9, "Khashoggi's": 10, 'Saudi': 11, 'an': 12, 'death': 13, 'journalist': 14, 'of': 15, 'result': 16, 'the': 17, 'was': 18, 'intended': 19, 'interrogation': 20, 'lead': 21, 'one': 22, 'to': 23, 'went': 24, 'wrong,': 25, 'Turkey,': 26, 'abduction': 27, 'according': 28, 'from': 29, 'his': 30, 'sources.': 31, 'two': 32}


We have successfully created a Dictionary object. Gensim will use this dictionary to create a bag-of-words corpus, in which each word in a document is replaced with the id this dictionary assigns to it.
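To make the word-to-id mapping and the bag-of-words idea concrete, here is a minimal pure-Python sketch. The helper names `build_token2id` and `to_bow` are hypothetical and this is not gensim's implementation; it only assumes that new tokens within a document are numbered in sorted order, which is what the ids printed above suggest.

```python
from collections import Counter

def build_token2id(texts):
    """Assign ids document by document; new tokens in a document are numbered in sorted order."""
    token2id = {}
    for tokens in texts:
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for the tokens known to the mapping."""
    counts = Counter(t for t in tokens if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

# First sentence from the example above, already split into tokens
texts = [["The", "Saudis", "are", "preparing", "a", "report",
          "that", "will", "acknowledge", "that"]]
token2id = build_token2id(texts)
bow = to_bow(texts[0], token2id)
print(token2id)
print(bow)
```

The token 'that' occurs twice in the sentence, so its id is paired with a count of 2 in the bag-of-words list.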

If you get new documents in the future, it is also possible to update an existing dictionary to include the new words.

In [21]:
documents_2 = ["The intersection graph of paths in trees",
               "Graph minors IV Widths of trees and well quasi ordering",
               "Graph minors A survey"]

texts_2 = [doc.split() for doc in documents_2]
print(texts_2)

dictionary.add_documents(texts_2)


# If you check now, the dictionary should have been updated with the new words (tokens).
print(dictionary)

print(dictionary.token2id)


[['The', 'intersection', 'graph', 'of', 'paths', 'in', 'trees'], ['Graph', 'minors', 'IV', 'Widths', 'of', 'trees', 'and', 'well', 'quasi', 'ordering'], ['Graph', 'minors', 'A', 'survey']]
Dictionary<48 unique tokens: ['Saudis', 'The', 'a', 'acknowledge', 'are']...>
{'Saudis': 0, 'The': 1, 'a': 2, 'acknowledge': 3, 'are': 4, 'preparing': 5, 'report': 6, 'that': 7, 'will': 8, 'Jamal': 9, "Khashoggi's": 10, 'Saudi': 11, 'an': 12, 'death': 13, 'journalist': 14, 'of': 15, 'result': 16, 'the': 17, 'was': 18, 'intended': 19, 'interrogation': 20, 'lead': 21, 'one': 22, 'to': 23, 'went': 24, 'wrong,': 25, 'Turkey,': 26, 'abduction': 27, 'according': 28, 'from': 29, 'his': 30, 'sources.': 31, 'two': 32, 'graph': 33, 'in': 34, 'intersection': 35, 'paths': 36, 'trees': 37, 'Graph': 38, 'IV': 39, 'Widths': 40, 'and': 41, 'minors': 42, 'ordering': 43, 'quasi': 44, 'well': 45, 'A': 46, 'survey': 47}


Create a Dictionary from one file
=============================

You can also create a dictionary from a text file.

The example below reads a file line by line and uses gensim’s simple_preprocess to process one line of the file at a time.

The advantage is that it lets you process an entire text file without loading it into memory all at once.

In [None]:
from gensim.utils import simple_preprocess
import os

# Create a gensim dictionary from a single text file; deacc=True removes accent marks from tokens
with open('Alice_lines_utf8.txt', encoding='utf-8') as f:
    dictionary = corpora.Dictionary(simple_preprocess(line, deacc=True) for line in f)

# Token to Id map
dictionary.token2id
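The line-by-line idea can also be packaged as a small re-iterable class. The sketch below is only illustrative: the class name `LineTokens` and the demo file are hypothetical, and the regex tokenizer is a simplified stand-in for simple_preprocess rather than its actual behaviour.

```python
import os
import re
import tempfile

class LineTokens:
    """Iterate over a text file, yielding one list of lowercase tokens per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is read lazily, so only one line is held in memory at a time
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield re.findall(r"[a-z]+", line.lower())

# Hypothetical two-line demo file written to a temporary location
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Alice was beginning to get very tired\nof sitting by her sister\n")
    demo_path = tmp.name

streamed = list(LineTokens(demo_path))
os.remove(demo_path)
print(streamed)
```

Because the object can be iterated more than once, it should be usable wherever an iterable of token lists is expected, for example as the argument to corpora.Dictionary.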


Create the TFIDF matrix
=======================

Term Frequency – Inverse Document Frequency (TF-IDF) is also a bag-of-words model, but unlike the regular corpus, TF-IDF down-weights tokens (words) that appear frequently across documents.

How is TFIDF computed?

Tf-Idf is computed by multiplying a local component like term frequency (TF) with a global component, that is, inverse document frequency (IDF) and optionally normalizing the result to unit length.

As a result of this, the words that occur frequently across documents will get downweighted.
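As a sketch of this computation, assuming gensim's default settings (raw term frequency, a base-2 logarithm for the IDF and L2 normalization), the weight of term $t$ in document $d$ can be written as follows. The symbols are introduced here for illustration: $N$ is the number of documents and $\mathrm{df}_t$ is the number of documents that contain $t$.

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log_2 \frac{N}{\mathrm{df}_t},
\qquad
\hat{w}_{t,d} = \frac{w_{t,d}}{\sqrt{\sum_{t'} w_{t',d}^{2}}}
```

A term that occurs in every document has $\mathrm{df}_t = N$, so its weight is $\log_2 1 = 0$.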

Several variants of the TF and IDF formulas exist. Gensim follows the SMART Information Retrieval System notation to select among them: you choose the formula through the smartirs parameter of TfidfModel. See help(models.TfidfModel) for more details.

So, how do you get the TFIDF weights?

Fit models.TfidfModel() on the corpus, then index the fitted model with the corpus (or with a single document) using square brackets. See the example below.

In [30]:
from gensim import models
import numpy as np

documents = ["This is the first line",
             "This is the second sentence",
             "This third document"]

# Create the Dictionary and Corpus
mydict = corpora.Dictionary([simple_preprocess(line) for line in documents])
print(mydict)
corpus = [mydict.doc2bow(simple_preprocess(line), allow_update=True) for line in documents]
print(corpus)

# Show the Word Weights in Corpus
for doc in corpus:
    print([[mydict[id], freq] for id, freq in doc])

print('======TF-IDF======')
# Create the TF-IDF model
tfidf = models.TfidfModel(corpus)

# Show the TF-IDF weights
for doc in tfidf[corpus]:
    print([[mydict[id], np.around(freq, decimals=2)] for id, freq in doc])

print(tfidf[corpus[0]] )

Dictionary<9 unique tokens: ['first', 'is', 'line', 'the', 'this']...>
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(1, 1), (3, 1), (4, 1), (5, 1), (6, 1)], [(4, 1), (7, 1), (8, 1)]]
[['first', 1], ['is', 1], ['line', 1], ['the', 1], ['this', 1]]
[['is', 1], ['the', 1], ['this', 1], ['second', 1], ['sentence', 1]]
[['this', 1], ['document', 1], ['third', 1]]
======TF-IDF======
[['first', 0.66], ['is', 0.24], ['line', 0.66], ['the', 0.24]]
[['is', 0.24], ['the', 0.24], ['second', 0.66], ['sentence', 0.66]]
[['document', 0.71], ['third', 0.71]]
[(0, 0.6633689723434505), (1, 0.2448297500958463), (2, 0.6633689723434505), (3, 0.2448297500958463)]


Notice the difference in weights of the words between the original corpus and the tfidf weighted corpus.

The words ‘is’ and ‘the’ occur in two documents and were weighted down. The word ‘this’ appearing in all three documents was removed altogether. In simple terms, words that occur more frequently across the documents get smaller weights.
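The weights above can be reproduced by hand. The following pure-Python sketch assumes the default weighting (raw term frequency, base-2 IDF, L2 normalization) and drops zero weights; the function name `tfidf_vector` is hypothetical and the code is not gensim's implementation.

```python
import math

# The three example documents, already tokenized and lowercased
docs = [["this", "is", "the", "first", "line"],
        ["this", "is", "the", "second", "sentence"],
        ["this", "third", "document"]]

n_docs = len(docs)

# Document frequency: in how many documents does each token occur?
doc_freq = {}
for tokens in docs:
    for token in set(tokens):
        doc_freq[token] = doc_freq.get(token, 0) + 1

def tfidf_vector(tokens):
    """Return {token: weight}, L2-normalized and rounded to 2 decimals."""
    raw = {}
    for token in set(tokens):
        weight = tokens.count(token) * math.log2(n_docs / doc_freq[token])
        if weight > 0:  # tokens present in every document get weight 0 and are dropped
            raw[token] = weight
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: round(w / norm, 2) for t, w in raw.items()}

first = tfidf_vector(docs[0])
third = tfidf_vector(docs[2])
print(first)
print(third)
```

Here 'this' has a document frequency of 3 out of 3, so its IDF is log2(1) = 0 and it disappears from every vector, while 'is' and 'the' (2 out of 3) keep a small weight.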

Gensim provides a built-in downloader API for fetching popular text datasets and pre-trained word embedding models.

A comprehensive list of the available datasets and models is maintained here: https://raw.githubusercontent.com/RaRe-Technologies/gensim-data/master/list.json.

To download a dataset or model, call api.load() with its name.

Let’s download the text8 dataset, which is nothing but the “First 100,000,000 bytes of plain text from Wikipedia”. Then, from this, we will generate our word2vec model.

Train Word2Vec model using gensim
=================================

A word embedding model maps a given word to a numerical vector. Using Gensim’s downloader API, you can download pre-built word embedding models such as word2vec, fastText, GloVe and ConceptNet. These are trained on large corpora of commonly occurring text, such as Wikipedia and Google News.

However, if you are working in a specialized niche such as technical documents, you may not be able to get word embeddings for all the words. In such cases, it is desirable to train your own model.

Gensim’s Word2Vec implementation lets you train your own word embedding model on a given corpus.

In [32]:
from gensim.models.word2vec import Word2Vec
from multiprocessing import cpu_count
import gensim.downloader as api

# Download dataset
dataset = api.load("text8")
data = [d for d in dataset]

# Keep only the first 1000 documents so that training stays fast
data_part1 = data[:1000]

# Train the Word2Vec model. The default vector size is 100
model = Word2Vec(data_part1, window=5, min_count = 2, sg=0, workers=cpu_count())

# Get the word vector for given word
model.wv['topic']

array([ 1.2703217 , -0.8083792 , -0.2925763 ,  0.433493  , -0.8865123 ,
       -0.6884289 , -1.3115997 ,  1.1001226 ,  0.808066  , -0.2905409 ,
        0.839751  , -0.626363  , -0.49549112, -1.5518533 , -0.7057076 ,
        0.90056425, -0.12318409, -0.53909725,  0.2051187 , -0.26116267,
        0.00583857,  1.57774   , -0.580765  ,  0.5894798 , -0.82096   ,
       -0.1604243 ,  0.06158253,  0.3948521 , -0.18716957,  0.7764375 ,
        0.23271777,  0.31557995, -0.24596128, -1.2741156 ,  0.03714888,
       -0.6501605 , -0.81270087, -1.2894961 ,  0.99802935, -0.21924587,
        0.40310413, -0.04981372, -0.27114603, -0.11670995,  0.31651703,
       -0.15808582, -0.22778812, -0.08785888, -0.41514707,  0.4629715 ,
       -0.6558279 ,  0.4093873 ,  0.82389724, -0.16391252, -1.0488547 ,
        0.14271432, -1.4065292 ,  0.1945657 , -1.8327117 ,  0.8029601 ,
        0.11568496,  0.77454966,  0.3071646 ,  0.09257562, -0.35892737,
        0.10722373,  0.19503492, -0.30807117, -0.7341304 ,  0.62

In [None]:
#get similar words
model.wv.most_similar('topic')

[('discussion', 0.7371512651443481),
 ('focus', 0.7094348073005676),
 ('interpretation', 0.7049460411071777),
 ('discourse', 0.7043935656547546),
 ('debate', 0.6989267468452454),
 ('speculation', 0.6963982582092285),
 ('premise', 0.6891854405403137),
 ('consensus', 0.6872879862785339),
 ('explanation', 0.680461585521698),
 ('focuses', 0.6751286387443542)]

In [None]:
# Save and Load Model
model.save('newmodel')
model = Word2Vec.load('newmodel')

In [None]:
# Get the word vector for given word
model.wv['cat']

array([ 0.10726921,  0.17539935, -1.6782835 ,  0.88824236, -1.0143253 ,
       -0.58637846,  1.2031175 ,  0.9354335 , -0.08056835, -0.99758744,
        0.25955707,  0.6650942 , -0.07427248,  0.55190355,  0.90653425,
       -0.16370137, -1.0263704 , -0.60833365,  0.35087383,  1.2698172 ,
       -1.591476  , -0.45269197, -0.5508202 ,  0.35510886, -1.3410037 ,
       -0.13008042, -0.4552002 , -0.17718093, -0.16941968, -1.0512463 ,
        1.5862764 , -0.36579835,  0.04987056, -0.44437885, -1.4062667 ,
       -0.5154404 , -0.9602719 ,  0.29555872, -0.40189412, -0.97448653,
        1.1201429 ,  0.08775884, -0.36510828, -2.4648805 ,  0.25924042,
       -1.0812641 , -0.45518818,  0.02472565,  0.2062556 , -1.5449716 ,
        1.6602125 ,  0.8662719 , -2.1257217 ,  0.05641563, -0.09212045,
       -0.9565714 , -0.9600496 , -0.609136  , -1.0849751 ,  0.1684088 ,
        0.4651139 , -1.4281328 ,  0.48215285,  0.53489006, -0.2632438 ,
        1.4148672 ,  0.12195655,  0.82940644,  0.19183348, -0.15

In [None]:
#get similar words
model.wv.most_similar('cat')

[('dog', 0.8032337427139282),
 ('bee', 0.7688228487968445),
 ('sweet', 0.7592843174934387),
 ('flower', 0.7481402158737183),
 ('goat', 0.7353066205978394),
 ('blonde', 0.7340176701545715),
 ('dogs', 0.7266868948936462),
 ('bird', 0.7220409512519836),
 ('honey', 0.7205797433853149),
 ('bear', 0.7186970710754395)]

In [None]:
#get similarity between two words
model.wv.similarity('dog','cat')

0.8032337
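The score returned by similarity() is the cosine similarity between the two word vectors, which is why it matches the value listed for 'dog' in the most_similar('cat') output. A minimal sketch of that measure, using hypothetical 3-dimensional vectors in place of real embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not real word embeddings
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

sim_parallel = cosine_similarity(vec_a, vec_b)
sim_orthogonal = cosine_similarity(vec_a, vec_c)
print(sim_parallel, sim_orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.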

Import pre-trained word2vec
===========================

We just saw how to get word vectors from the Word2Vec model we trained. Gensim also lets you download state-of-the-art pre-trained models through the downloader API. Let’s see how to extract word vectors from a couple of these models.

In [None]:
import gensim.downloader as api

# Download the model (1660MB)
#word2vec_model300 = api.load('word2vec-google-news-300')

#get similar words
#word2vec_model300.most_similar('support')

In [7]:
#download a model based on Glove (128MB)
import gensim.downloader as api
glove_model100 = api.load('glove-wiki-gigaword-100')
#get similar words
glove_model100.most_similar('dog')



[('cat', 0.8798074126243591),
 ('dogs', 0.8344309329986572),
 ('pet', 0.7449564337730408),
 ('puppy', 0.723637580871582),
 ('horse', 0.7109653949737549),
 ('animal', 0.6817063093185425),
 ('pig', 0.655417263507843),
 ('boy', 0.6545308232307434),
 ('cats', 0.6471932530403137),
 ('rabbit', 0.6468630433082581)]

In [None]:
glove_model100.similarity('dog','cat')

0.8798075

Example of use
==============

In [3]:
#Download Dataset
import pandas as pd
url = 'https://bit.ly/2CdYYuf'
yelp = pd.read_csv(url, sep='\t', header = None)
yelp.rename(columns={0:'Reviews', 1:'Sentiment'}, inplace=True)
yelp.head()

Unnamed: 0,Reviews,Sentiment
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


In [None]:
#download a model based on Glove (128MB)
import gensim.downloader as api
glove_model100 = api.load('glove-wiki-gigaword-100')


In [12]:
glove_model100.get_vector("office")

array([-0.083012, -0.74463 ,  0.36394 , -0.04265 ,  0.60577 ,  0.13998 ,
       -0.50061 ,  0.90389 ,  0.41351 ,  0.49011 ,  0.10642 , -0.62883 ,
        0.31716 ,  0.77279 , -0.22061 , -0.13117 ,  0.59952 ,  0.40445 ,
       -0.52231 , -0.42995 ,  0.075281,  0.28239 ,  0.014645, -0.32397 ,
       -0.74076 , -0.80056 ,  0.23731 , -0.49243 , -0.32606 , -0.20385 ,
        0.93649 ,  0.22245 ,  0.25503 ,  0.61261 , -0.49376 ,  0.84066 ,
       -0.57353 ,  0.053669,  0.29911 , -0.21548 , -0.22307 , -0.58031 ,
        0.36928 , -0.34358 ,  0.30455 , -0.14287 , -0.38094 , -0.53703 ,
        0.1597  , -0.43649 ,  0.42691 , -1.0276  ,  0.38602 ,  1.0371  ,
       -0.18697 , -2.4962  , -0.37856 ,  0.16619 ,  1.953   ,  0.47491 ,
       -0.49005 , -0.2078  , -0.033339,  0.23562 ,  0.18506 , -0.41896 ,
        0.50037 ,  0.41745 ,  0.51059 ,  0.59109 ,  0.02061 , -0.093909,
       -0.47164 , -0.89987 ,  0.22922 , -0.13374 , -0.28564 ,  0.44327 ,
       -1.5182  , -0.076197,  0.37112 ,  0.14877 , 

In [14]:
import nltk
import numpy as np
nltk.download('stopwords')
stopwords = set(nltk.corpus.stopwords.words('english'))  # English stop words to skip

doc_vectors = []  # one averaged vector per review
# Lowercase each review and strip everything except letters and spaces
for doc in yelp['Reviews'].str.lower().str.replace('[^a-z ]', '', regex=True):
    word_vecs = []  # vectors of the in-vocabulary words of this review
    for word in doc.split(' '):  # naive split on single spaces; a tokenizer could be used instead
        if word not in stopwords:
            try:
                # Look up the 100-dimensional GloVe vector of the word
                word_vecs.append(glove_model100.get_vector(word))
            except KeyError:
                print("The token: "+word+" not in Embedding Space Vocabulary")
    # Average the word vectors column-wise (w0 ... w99); a review with no known word gets a NaN row
    doc_vectors.append(np.mean(word_vecs, axis=0) if word_vecs
                       else np.full(glove_model100.vector_size, np.nan, dtype=np.float32))
docs_vectors = pd.DataFrame(doc_vectors)  # one row per review, one column per dimension
docs_vectors.shape

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


The token: honeslty not in Embedding Space Vocabulary
The token: wayyy not in Embedding Space Vocabulary
The token: ravoli not in Embedding Space Vocabulary
The token: chickenwith not in Embedding Space Vocabulary
The token: cranberrymmmm not in Embedding Space Vocabulary
The token: burrittos not in Embedding Space Vocabulary
The token: rightthe not in Embedding Space Vocabulary
The token: cakeohhh not in Embedding Space Vocabulary
The token:  not in Embedding Space Vocabulary
The token:  not in Embedding Space Vocabulary
The token:  not in Embedding Space Vocabulary
The token: itfriendly not in Embedding Space Vocabulary
The token:  not in Embedding Space Vocabulary
The token:  not in Embedding Space Vocabulary
The token: updatewent not in Embedding Space Vocabulary
The token:  not in Embedding Space Vocabulary
The token:  not in Embedding Space Vocabulary
The token:  not in Embedding Space Vocabulary
The token:  not in Embedding Space Vocabulary
The token:  not in Embedding Space Voc

(1000, 100)
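The cell above turns each review into a single vector by averaging the embeddings of its words. The same idea in a dependency-free sketch, where the 2-dimensional embeddings, the stop-word set and the function name `average_vector` are all hypothetical stand-ins for the GloVe model and the NLTK list:

```python
# Hypothetical 2-dimensional embeddings standing in for the GloVe vectors
toy_embeddings = {"good": [1.0, 0.0], "crust": [0.0, 1.0], "tasty": [1.0, 1.0]}
# Hypothetical stop-word set standing in for the NLTK list
toy_stopwords = {"is", "not", "the"}

def average_vector(text, embeddings, stopwords):
    """Average the vectors of all known, non-stop-word tokens; None if there are none."""
    vectors = [embeddings[w] for w in text.lower().split()
               if w not in stopwords and w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

doc_vec = average_vector("Crust is not good", toy_embeddings, toy_stopwords)
empty_vec = average_vector("the", toy_embeddings, toy_stopwords)
print(doc_vec, empty_vec)
```

Note that in this toy setup 'not' is treated as a stop word, so the negation is lost before averaging, which may matter for a sentiment task.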

In [15]:
print(docs_vectors[1])

0      0.367257
1      0.297900
2      0.394870
3     -0.346915
4      0.281877
         ...   
995    0.703294
996    0.331586
997    0.133566
998    0.371684
999    0.360034
Name: 1, Length: 1000, dtype: float32


In [16]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

train_x, test_x, train_y, test_y = train_test_split(docs_vectors,
                                                   yelp['Sentiment'],
                                                   test_size = 0.2,
                                                   random_state = 1)
train_x.shape, train_y.shape, test_x.shape, test_y.shape

((800, 100), (800,), (200, 100), (200,))

In [19]:
model = GaussianNB()
model.fit(train_x, train_y)
test_pred = model.predict(test_x)

from sklearn.metrics import classification_report
print(classification_report(test_y, test_pred))

              precision    recall  f1-score   support

           0       0.69      0.76      0.72       108
           1       0.68      0.60      0.64        92

    accuracy                           0.69       200
   macro avg       0.68      0.68      0.68       200
weighted avg       0.68      0.69      0.68       200
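For readers unfamiliar with the columns of the report, the sketch below shows how precision, recall and F1 are defined for one class. The labels are hypothetical toy data, not the Yelp predictions, and the function name `precision_recall_f1` is introduced only for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for the class given as `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical toy labels
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 0, 0, 1]
scores = precision_recall_f1(toy_true, toy_pred, positive=1)
print(scores)
```

The support column in the report is simply the number of true examples of each class in the test set.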

