In [15]:
!pip install gensim



Create a Dictionary from a list of sentences
======================================

In gensim, the dictionary maps every word (token) to a unique integer id.

You can create a dictionary from a list of sentences, from a text file that contains multiple lines of text, or from multiple such text files in a directory. For the second and third cases, we will do it without loading the entire file into memory, so that the dictionary gets updated as the text is read line by line.

Let’s start with the ‘List of sentences’ input.

When you have multiple sentences, you need to convert each sentence to a list of words. A list comprehension is a common way to do this.

In [16]:
import gensim
from gensim import corpora
from pprint import pprint

# How to create a dictionary from a list of sentences?
documents = ["The Saudis are preparing a report that will acknowledge that",
             "Saudi journalist Jamal Khashoggi's death was the result of an",
             "interrogation that went wrong, one that was intended to lead",
             "to his abduction from Turkey, according to two sources."]

documents_2 = ["One source says the report will likely conclude that",
                "the operation was carried out without clearance and",
                "transparency and that those involved will be held",
                "responsible. One of the sources acknowledged that the",
                "report is still being prepared and cautioned that",
                "things could change."]

# Tokenize (split) the sentences into words
texts = [doc.split() for doc in documents]

# Create dictionary
dictionary = corpora.Dictionary(texts)

# Get information about the dictionary
print(dictionary)


Dictionary<33 unique tokens: ['Saudis', 'The', 'a', 'acknowledge', 'are']...>


As the output says, the dictionary has 33 unique tokens (or words). Let’s see the unique id for each of these tokens.

In [17]:
# Show the word to id map
print(dictionary.token2id)


{'Saudis': 0, 'The': 1, 'a': 2, 'acknowledge': 3, 'are': 4, 'preparing': 5, 'report': 6, 'that': 7, 'will': 8, 'Jamal': 9, "Khashoggi's": 10, 'Saudi': 11, 'an': 12, 'death': 13, 'journalist': 14, 'of': 15, 'result': 16, 'the': 17, 'was': 18, 'intended': 19, 'interrogation': 20, 'lead': 21, 'one': 22, 'to': 23, 'went': 24, 'wrong,': 25, 'Turkey,': 26, 'abduction': 27, 'according': 28, 'from': 29, 'his': 30, 'sources.': 31, 'two': 32}
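Judging from the ids printed above, new tokens seem to receive ids document by document, in sorted order within each document. The following pure-Python sketch reproduces that mapping for the same four sentences. It is only an illustration of the idea, not gensim's implementation, and `build_token2id` is a hypothetical helper name.

```python
def build_token2id(texts):
    """Assign an integer id to every distinct token (illustrative only)."""
    token2id = {}
    for tokens in texts:
        # Within one document, unseen tokens get the next free ids in sorted order.
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


documents = ["The Saudis are preparing a report that will acknowledge that",
             "Saudi journalist Jamal Khashoggi's death was the result of an",
             "interrogation that went wrong, one that was intended to lead",
             "to his abduction from Turkey, according to two sources."]

token2id = build_token2id(doc.split() for doc in documents)
print(len(token2id))                         # number of unique tokens
print(token2id["Saudis"], token2id["two"])   # first and last assigned ids
```

Because splitting on whitespace keeps punctuation attached, tokens such as 'wrong,' and 'sources.' get their own ids, just as in the gensim output.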


We have successfully created a Dictionary object. Gensim will use this dictionary to create a bag-of-words corpus in which each word in a document is replaced by its id from this dictionary.
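To make the bag-of-words idea concrete, here is a minimal sketch of what the conversion amounts to: count each known token and report `(id, count)` pairs. In gensim itself this step is done by `Dictionary.doc2bow`; the `to_bow` helper and the toy `token2id` map below are hypothetical stand-ins.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens without an id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Toy mapping, made up for this example.
token2id = {"graph": 0, "minors": 1, "trees": 2}

bow = to_bow("graph minors graph survey".split(), token2id)
print(bow)  # 'survey' has no id, so it is left out
```

Tokens that are missing from the mapping are simply ignored here, which is one reason to update the dictionary when new documents arrive.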

If you get new documents in the future, it is also possible to update an existing dictionary to include the new words.

In [18]:
documents_2 = ["The intersection graph of paths in trees",
               "Graph minors IV Widths of trees and well quasi ordering",
               "Graph minors A survey"]

texts_2 = [doc.split() for doc in documents_2]

dictionary.add_documents(texts_2)


# If you check now, the dictionary should have been updated with the new words (tokens).
print(dictionary)

print(dictionary.token2id)


Dictionary<48 unique tokens: ['Saudis', 'The', 'a', 'acknowledge', 'are']...>
{'Saudis': 0, 'The': 1, 'a': 2, 'acknowledge': 3, 'are': 4, 'preparing': 5, 'report': 6, 'that': 7, 'will': 8, 'Jamal': 9, "Khashoggi's": 10, 'Saudi': 11, 'an': 12, 'death': 13, 'journalist': 14, 'of': 15, 'result': 16, 'the': 17, 'was': 18, 'intended': 19, 'interrogation': 20, 'lead': 21, 'one': 22, 'to': 23, 'went': 24, 'wrong,': 25, 'Turkey,': 26, 'abduction': 27, 'according': 28, 'from': 29, 'his': 30, 'sources.': 31, 'two': 32, 'graph': 33, 'in': 34, 'intersection': 35, 'paths': 36, 'trees': 37, 'Graph': 38, 'IV': 39, 'Widths': 40, 'and': 41, 'minors': 42, 'ordering': 43, 'quasi': 44, 'well': 45, 'A': 46, 'survey': 47}


Create a Dictionary from one file
=============================

You can also create a dictionary from a text file.

The example below reads a file line by line and uses gensim’s simple_preprocess to process one line of the file at a time.

The advantage is that it lets you process an entire text file without loading it all into memory at once.

In [19]:
from gensim.utils import simple_preprocess

# Create a gensim dictionary from a single text file, one line at a time.
# deacc=True removes accent marks from tokens.
with open('Alice_lines_utf8.txt', encoding='utf-8') as f:
    dictionary = corpora.Dictionary(simple_preprocess(line, deacc=True) for line in f)

# Token to Id map
dictionary.token2id


{'adventures': 0,
 'alice': 1,
 'in': 2,
 'wonderland': 3,
 'carroll': 4,
 'lewis': 5,
 'edition': 6,
 'fulcrum': 7,
 'millennium': 8,
 'the': 9,
 'chapter': 10,
 'down': 11,
 'hole': 12,
 'rabbit': 13,
 'and': 14,
 'bank': 15,
 'beginning': 16,
 'book': 17,
 'but': 18,
 'by': 19,
 'conversations': 20,
 'do': 21,
 'get': 22,
 'had': 23,
 'having': 24,
 'her': 25,
 'into': 26,
 'is': 27,
 'it': 28,
 'no': 29,
 'nothing': 30,
 'of': 31,
 'on': 32,
 'once': 33,
 'or': 34,
 'peeped': 35,
 'pictures': 36,
 'reading': 37,
 'she': 38,
 'sister': 39,
 'sitting': 40,
 'thought': 41,
 'tired': 42,
 'to': 43,
 'twice': 44,
 'use': 45,
 'very': 46,
 'was': 47,
 'what': 48,
 'without': 49,
 'as': 50,
 'be': 51,
 'chain': 52,
 'close': 53,
 'considering': 54,
 'could': 55,
 'daisies': 56,
 'daisy': 57,
 'day': 58,
 'eyes': 59,
 'feel': 60,
 'for': 61,
 'getting': 62,
 'hot': 63,
 'made': 64,
 'making': 65,
 'mind': 66,
 'own': 67,
 'picking': 68,
 'pink': 69,
 'pleasure': 70,
 'ran': 71,
 'sleepy': 
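
The memory saving comes from handing `corpora.Dictionary` an iterable that produces one tokenized line at a time instead of a fully built list. A minimal sketch of such an iterable follows. The `LineTokens` class is a hypothetical name, and the regular expression is only a rough stand-in for `simple_preprocess` so that the example runs without gensim.

```python
import os
import re
import tempfile


class LineTokens:
    """Iterate over a text file, yielding one list of lowercase tokens per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the object can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield re.findall(r"[a-z]+", line.lower())


# Write a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Alice was beginning to get very tired\nDown the Rabbit-Hole\n")

lines = list(LineTokens(tmp.name))
os.remove(tmp.name)
print(lines)
```

With gensim installed, an object like this could be passed straight to `corpora.Dictionary(...)`, since the constructor accepts any iterable of token lists.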

Create the TFIDF matrix
=======================

Term Frequency – Inverse Document Frequency (TF-IDF) is also a bag-of-words model, but unlike the regular corpus, TF-IDF down-weights tokens (words) that appear frequently across documents.

How is TFIDF computed?

TF-IDF is computed by multiplying a local component, such as term frequency (TF), by a global component, the inverse document frequency (IDF), and optionally normalizing the result to unit length.

As a result of this, the words that occur frequently across documents will get downweighted.

Several variants of the TF and IDF formulas exist. Gensim follows the notation of the SMART information retrieval system to select among them: you choose a variant by passing the smartirs parameter to TfidfModel. See help(models.TfidfModel) for more details.

So, how do you get the TF-IDF weights?

Train a model on the corpus with models.TfidfModel(). Then apply it by indexing the trained model with the corpus in square brackets, as in tfidf[corpus]. See the example below.

In [20]:
from gensim import models
import numpy as np

documents = ["This is the first line",
             "This is the second sentence",
             "This third document"]

# Create the Dictionary and Corpus
mydict = corpora.Dictionary([simple_preprocess(line) for line in documents])
corpus = [mydict.doc2bow(simple_preprocess(line)) for line in documents]

# Show the Word Weights in Corpus
for doc in corpus:
    print([[mydict[id], freq] for id, freq in doc])

print('======TF-IDF======')
# Create the TF-IDF model
tfidf = models.TfidfModel(corpus)

# Show the TF-IDF weights
for doc in tfidf[corpus]:
    print([[mydict[id], np.around(freq, decimals=2)] for id, freq in doc])


[['first', 1], ['is', 1], ['line', 1], ['the', 1], ['this', 1]]
[['is', 1], ['the', 1], ['this', 1], ['second', 1], ['sentence', 1]]
[['this', 1], ['document', 1], ['third', 1]]
======TF-IDF======
[['first', 0.66], ['is', 0.24], ['line', 0.66], ['the', 0.24]]
[['is', 0.24], ['the', 0.24], ['second', 0.66], ['sentence', 0.66]]
[['document', 0.71], ['third', 0.71]]


Notice the difference in weights of the words between the original corpus and the tfidf weighted corpus.

The words ‘is’ and ‘the’ occur in two of the three documents and were weighted down. The word ‘this’ appears in all three documents, so its IDF is zero and it was dropped altogether. In simple terms, words that occur in more documents get smaller weights.
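
The weights printed above can be checked by hand. The sketch below assumes the weighting that matches this output: raw term counts for TF, log2(N / df) for IDF, and L2 normalization per document, with zero-weight terms dropped. It uses only the standard library, and `tfidf_weights` is a hypothetical helper, not part of gensim.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {token: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token appear?
    df = Counter(token for doc in docs for token in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: tf[t] * math.log2(n_docs / df[t]) for t in tf}
        raw = {t: w for t, w in raw.items() if w > 0}  # drop zero weights
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: w / norm for t, w in raw.items()})
    return weighted


docs = [line.lower().split() for line in ["This is the first line",
                                          "This is the second sentence",
                                          "This third document"]]

result = tfidf_weights(docs)
for weights in result:
    print({t: round(w, 2) for t, w in sorted(weights.items())})
```

For the first document, the unnormalized weights are log2(3) for ‘first’ and ‘line’ and log2(1.5) for ‘is’ and ‘the’; dividing by the vector length gives the 0.66 and 0.24 shown in the output.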

Gensim provides an inbuilt API to download popular text datasets and word embedding models.

A comprehensive list of available datasets and models is maintained here: https://raw.githubusercontent.com/RaRe-Technologies/gensim-data/master/list.json.

Using the API to download a dataset is as simple as calling api.load() with the name of the dataset or model.

Now you know how to download datasets and pre-trained models with gensim.

Let’s download the text8 dataset, which consists of the “First 100,000,000 bytes of plain text from Wikipedia”. Then, from this, we will train our word2vec model.

Train Word2Vec model using gensim
=================================

A word embedding model is a model that provides a numerical vector for a given word. Using gensim’s downloader API, you can download pre-built word embedding models such as word2vec, fastText, GloVe and ConceptNet. These are built on large corpora of commonly occurring text data such as Wikipedia, Google News, etc.

However, if you are working in a specialized niche such as technical documents, you may not be able to get word embeddings for all the words. In such cases, it is desirable to train your own model.

Gensim’s Word2Vec implementation lets you train your own word embedding model for a given corpus.

In [21]:
from gensim.models.word2vec import Word2Vec
from multiprocessing import cpu_count
import gensim.downloader as api

# Download dataset
dataset = api.load("text8")
data = [d for d in dataset]

# Use only the first 1000 chunks of the dataset for training
data_part1 = data[:1000]

# Train the Word2Vec model. The default vector size is 100
model = Word2Vec(data_part1, min_count=0, workers=cpu_count())

# Get the word vector for given word
model.wv['topic']

array([-4.20743041e-02,  3.04608673e-01, -4.71701175e-01,  3.29701543e-01,
       -1.55024156e-01, -1.14245152e+00, -3.34934235e-01,  2.58472025e-01,
        3.46889973e-01,  4.49785233e-01,  1.06470793e-01,  1.12953827e-01,
       -9.74204004e-01, -8.81500840e-01, -5.61051369e-01,  6.51849270e-01,
       -1.50225365e+00, -1.35956228e+00,  2.26832703e-01,  1.64514184e-01,
       -4.19698834e-01,  1.46493077e+00,  4.07226205e-01,  2.83853859e-01,
       -2.47706119e-02,  2.76226759e-01,  2.37817224e-02, -3.18468601e-01,
        8.15056711e-02,  1.36637688e+00,  1.59448445e+00, -2.54947573e-01,
       -6.65343478e-02, -1.64137387e+00,  4.88409579e-01,  1.20389259e+00,
       -1.16157389e+00, -1.41134143e+00, -3.20729762e-01, -4.45019692e-01,
        3.00641030e-01, -7.71032274e-01,  2.76009351e-01, -7.62471914e-01,
       -9.36914086e-02,  2.20479235e-01,  4.48248088e-01,  2.90981621e-01,
       -8.12350154e-01, -9.61535648e-02, -1.91291243e-01,  4.88009416e-02,
        8.20653856e-01, -

In [23]:
#get similar words
model.wv.most_similar('topic')

[('discussion', 0.7371512651443481),
 ('focus', 0.7094348073005676),
 ('interpretation', 0.7049460411071777),
 ('discourse', 0.7043935656547546),
 ('debate', 0.6989267468452454),
 ('speculation', 0.6963982582092285),
 ('premise', 0.6891854405403137),
 ('consensus', 0.6872879862785339),
 ('explanation', 0.680461585521698),
 ('focuses', 0.6751286387443542)]

In [24]:
# Save and Load Model
model.save('newmodel')
model = Word2Vec.load('newmodel')

In [25]:
# Get the word vector for given word
model.wv['cat']

array([ 0.10726921,  0.17539935, -1.6782835 ,  0.88824236, -1.0143253 ,
       -0.58637846,  1.2031175 ,  0.9354335 , -0.08056835, -0.99758744,
        0.25955707,  0.6650942 , -0.07427248,  0.55190355,  0.90653425,
       -0.16370137, -1.0263704 , -0.60833365,  0.35087383,  1.2698172 ,
       -1.591476  , -0.45269197, -0.5508202 ,  0.35510886, -1.3410037 ,
       -0.13008042, -0.4552002 , -0.17718093, -0.16941968, -1.0512463 ,
        1.5862764 , -0.36579835,  0.04987056, -0.44437885, -1.4062667 ,
       -0.5154404 , -0.9602719 ,  0.29555872, -0.40189412, -0.97448653,
        1.1201429 ,  0.08775884, -0.36510828, -2.4648805 ,  0.25924042,
       -1.0812641 , -0.45518818,  0.02472565,  0.2062556 , -1.5449716 ,
        1.6602125 ,  0.8662719 , -2.1257217 ,  0.05641563, -0.09212045,
       -0.9565714 , -0.9600496 , -0.609136  , -1.0849751 ,  0.1684088 ,
        0.4651139 , -1.4281328 ,  0.48215285,  0.53489006, -0.2632438 ,
        1.4148672 ,  0.12195655,  0.82940644,  0.19183348, -0.15

In [27]:
#get similar words
model.wv.most_similar('cat')

[('dog', 0.8032337427139282),
 ('bee', 0.7688228487968445),
 ('sweet', 0.7592843174934387),
 ('flower', 0.7481402158737183),
 ('goat', 0.7353066205978394),
 ('blonde', 0.7340176701545715),
 ('dogs', 0.7266868948936462),
 ('bird', 0.7220409512519836),
 ('honey', 0.7205797433853149),
 ('bear', 0.7186970710754395)]

In [28]:
#get similarity between two words
model.wv.similarity('dog','cat')

0.8032337
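
The score returned by `similarity` is the cosine similarity of the two word vectors, and `most_similar` ranks the vocabulary by that score, which is why ‘dog’ tops the list for ‘cat’ with the same value of about 0.803. A small sketch with made-up 3-dimensional vectors shows the computation; the vectors and helper names are hypothetical and unrelated to the trained model.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors invented for this example.
vectors = {"cat": [1.0, 0.9, 0.1],
           "dog": [0.9, 1.0, 0.2],
           "car": [0.1, 0.2, 1.0]}

ranking = most_similar("cat", vectors)
print(ranking)
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score close to 0.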

Import pre-trained word2vec
===========================

We just saw how to get the word vectors from the Word2Vec model we trained. Gensim also lets you download state-of-the-art pretrained models through the downloader API. Let’s see how to extract the word vectors from a couple of these models.

In [None]:
import gensim.downloader as api

# Download the models (1660MB)
word2vec_model300 = api.load('word2vec-google-news-300')

# Get similar words. api.load returns KeyedVectors, so query it directly
word2vec_model300.most_similar('support')


In [None]:
# Download a model based on GloVe (128MB)
glove_model100 = api.load('glove-wiki-gigaword-100')
# Get similar words
glove_model100.most_similar('dog')

In [None]:
glove_model100.similarity('dog', 'cat')