In [1]:
#https://www.machinelearningplus.com/nlp/gensim-tutorial/

### 2. What is a Dictionary and Corpus?
In order to work on text documents, Gensim requires the words (aka tokens) be converted to unique ids. In order to achieve that, Gensim lets you create a Dictionary object that maps each word to a unique id.

So, how do you create a `Dictionary`? By converting each text/sentence to a list of words and passing the resulting list of lists to `corpora.Dictionary()`.

But why is the dictionary object needed and where can it be used?

The dictionary object is typically used to create a ‘bag of words’ Corpus. It is this Dictionary and the bag-of-words (Corpus) that are used as inputs to topic modeling and other models that Gensim specializes in.

Alright, what sort of text inputs can Gensim handle? The input text typically comes in 3 different forms:

As sentences stored in Python’s native list object.
As one single text file, small or large.
In multiple text files.

In [10]:
import gensim
from gensim import corpora
from pprint import pprint

# How to create a dictionary from a list of sentences?
documents = ["The Saudis are preparing a report that will acknowledge that", 
             "Saudi journalist Jamal Khashoggi's death was the result of an", 
             "interrogation that went wrong, one that was intended to lead", 
             "to his abduction from Turkey, according to two sources."]

documents_2 = ["One source says the report will likely conclude that", 
                "the operation was carried out without clearance and", 
                "transparency and that those involved will be held", 
                "responsible. One of the sources acknowledged that the", 
                "report is still being prepared and cautioned that", 
                "things could change."]

# Tokenize (split) the sentences into words, creating a list of lists
texts = [doc.split() for doc in documents]

# Create dictionary
dictionary = corpora.Dictionary(texts)

# Get information about the dictionary
print(dictionary)
#> Dictionary(33 unique tokens: ['Saudis', 'The', 'a', 'acknowledge', 'are']...)

Dictionary(33 unique tokens: ['Saudis', 'The', 'a', 'acknowledge', 'are']...)


In [12]:
# Show the word to id map (every word has an id)
print(dictionary.token2id)
# For more Dictionary methods, including frequency-based filtering with
# filter_extremes(no_below, no_above), see:
# https://radimrehurek.com/gensim/corpora/dictionary.html

{'Saudis': 0, 'The': 1, 'a': 2, 'acknowledge': 3, 'are': 4, 'preparing': 5, 'report': 6, 'that': 7, 'will': 8, 'Jamal': 9, "Khashoggi's": 10, 'Saudi': 11, 'an': 12, 'death': 13, 'journalist': 14, 'of': 15, 'result': 16, 'the': 17, 'was': 18, 'intended': 19, 'interrogation': 20, 'lead': 21, 'one': 22, 'to': 23, 'went': 24, 'wrong,': 25, 'Turkey,': 26, 'abduction': 27, 'according': 28, 'from': 29, 'his': 30, 'sources.': 31, 'two': 32}
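The frequency filtering mentioned in the comment above is, as far as I know, exposed in Gensim as `Dictionary.filter_extremes(no_below, no_above)`. To make the idea concrete, here is a minimal sketch in plain Python. It is not Gensim's implementation; the helper name `filter_by_doc_freq` and the toy documents are made up for illustration.

```python
from collections import Counter

def filter_by_doc_freq(tokenized_docs, no_below=2, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in at most
    a `no_above` fraction of all documents (hypothetical helper)."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    n_docs = len(tokenized_docs)
    return {tok for tok, df in doc_freq.items()
            if df >= no_below and df / n_docs <= no_above}

# Toy documents, already tokenized
docs = [["the", "report", "will"],
        ["the", "death", "was"],
        ["the", "report", "is"],
        ["the", "sources", "say"]]

kept = filter_by_doc_freq(docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

Here "the" is dropped because it appears in every document, and the rare words are dropped because they appear in only one.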


We have successfully created a Dictionary object. Gensim will use this dictionary to create a bag-of-words corpus, where each word in a document is replaced with its respective id from this dictionary.
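To see what "replacing words with ids" amounts to, here is a rough sketch in plain Python. In Gensim this conversion is done by `Dictionary.doc2bow`, which returns (id, count) pairs. The helper `to_bow` below is a hypothetical stand-in, and the small `token2id` map just copies a few entries from the output printed above.

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from the map are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# A few entries copied from the token2id map printed above
token2id = {"report": 6, "that": 7, "will": 8, "the": 17, "was": 18}

# A made-up sentence; "nobody" and "expected" are not in the map
tokens = "that report was the report that nobody expected".split()
bow = to_bow(tokens, token2id)
print(bow)
```

Each pair says "the token with this id occurs this many times"; word order is thrown away, which is where the name bag-of-words comes from.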

If you get new documents in the future, it is also possible to update an existing dictionary to include the new words.



In [14]:
#updating the dictionary

documents_2 = ["The intersection graph of paths in trees",
               "Graph minors IV Widths of trees and well quasi ordering",
               "Graph minors A survey"]

texts_2 = [doc.split() for doc in documents_2]

dictionary.add_documents(texts_2)


# If you check now, the dictionary should have been updated with the new words (tokens).
print(dictionary)
#> Dictionary(48 unique tokens: ['Saudis', 'The', 'a', 'acknowledge', 'are']...)

print(dictionary.token2id)
#> {'Saudis': 0, 'The': 1, 'a': 2, 'acknowledge': 3, 'are': 4, 'preparing': 5,
#>  'report': 6, 'that': 7, 'will': 8, 'Jamal': 9, "Khashoggi's": 10, 'Saudi': 11,
#>  'an': 12, 'death': 13, 'journalist': 14, 'of': 15, 'result': 16, 'the': 17,
#>  'was': 18, 'intended': 19, 'interrogation': 20, 'lead': 21, 'one': 22, 'to': 23,
#>  'went': 24, 'wrong,': 25, 'Turkey,': 26, 'abduction': 27, 'according': 28,
#>  'from': 29, 'his': 30, 'sources.': 31, 'two': 32, 'graph': 33, 'in': 34,
#>  'intersection': 35, 'paths': 36, 'trees': 37, 'Graph': 38, 'IV': 39, 'Widths': 40,
#>  'and': 41, 'minors': 42, 'ordering': 43, 'quasi': 44, 'well': 45, 'A': 46,
#>  'survey': 47}

Dictionary(48 unique tokens: ['Saudis', 'The', 'a', 'acknowledge', 'are']...)
{'Saudis': 0, 'The': 1, 'a': 2, 'acknowledge': 3, 'are': 4, 'preparing': 5, 'report': 6, 'that': 7, 'will': 8, 'Jamal': 9, "Khashoggi's": 10, 'Saudi': 11, 'an': 12, 'death': 13, 'journalist': 14, 'of': 15, 'result': 16, 'the': 17, 'was': 18, 'intended': 19, 'interrogation': 20, 'lead': 21, 'one': 22, 'to': 23, 'went': 24, 'wrong,': 25, 'Turkey,': 26, 'abduction': 27, 'according': 28, 'from': 29, 'his': 30, 'sources.': 31, 'two': 32, 'graph': 33, 'in': 34, 'intersection': 35, 'paths': 36, 'trees': 37, 'Graph': 38, 'IV': 39, 'Widths': 40, 'and': 41, 'minors': 42, 'ordering': 43, 'quasi': 44, 'well': 45, 'A': 46, 'survey': 47}
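Notice that 'graph' and 'Graph' received separate ids, and that 'wrong,' and 'sources.' kept their punctuation. That is because `str.split()` only splits on whitespace and does no normalization. A minimal sketch of a normalizing tokenizer using only the standard library is shown here. The function `normalize_tokens` is a hypothetical helper, not a Gensim API; the next section uses Gensim's own `simple_preprocess` for this job, which, as far as I know, also drops very short and very long tokens by default.

```python
import re

def normalize_tokens(text):
    """Lower-case the text and keep only runs of ASCII letters.
    Punctuation, digits and apostrophes act as separators."""
    return re.findall(r"[a-z]+", text.lower())

raw = "Graph minors went wrong, according to two sources."

whitespace_tokens = raw.split()
clean_tokens = normalize_tokens(raw)

print(whitespace_tokens)
print(clean_tokens)
```

With normalized tokens, 'Graph' and 'graph' would share a single id, and the vocabulary would be correspondingly smaller.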


### 4. How to create a Dictionary from one or more text files?

In [15]:
from gensim.utils import simple_preprocess
import os

# Create a gensim dictionary from a single text file, reading it one line at a time
with open('sample.txt', encoding='utf-8') as f:
    dictionary = corpora.Dictionary(simple_preprocess(line, deacc=True) for line in f)

# Token to Id map
dictionary.token2id

#> {'according': 35,
#>  'and': 22,
#>  'appointment': 23,
#>  'army': 0,
#>  'as': 43,
#>  'at': 24,
#>   ...
#> }

{'army': 0,
 'china': 1,
 'chinese': 2,
 'force': 3,
 'liberation': 4,
 'of': 5,
 'people': 6,
 'recently': 7,
 'recruited': 8,
 'rocket': 9,
 'tank': 10,
 'technicians': 11,
 'the': 12,
 'think': 13,
 'companies': 14,
 'daily': 15,
 'from': 16,
 'on': 17,
 'pla': 18,
 'private': 19,
 'reported': 20,
 'saturday': 21,
 'and': 22,
 'appointment': 23,
 'at': 24,
 'ceremony': 25,
 'experts': 26,
 'founding': 27,
 'hao': 28,
 'letters': 29,
 'other': 30,
 'received': 31,
 'science': 32,
 'technology': 33,
 'zhang': 34,
 'according': 35,
 'by': 36,
 'defense': 37,
 'national': 38,
 'panel': 39,
 'published': 40,
 'report': 41,
 'to': 42,
 'as': 43,
 'fellow': 44,
 'his': 45,
 'honored': 46,
 'will': 47,
 'conduct': 48,
 'design': 49,
 'fields': 50,
 'into': 51,
 'like': 52,
 'members': 53,
 'overall': 54,
 'research': 55,
 'serve': 56,
 'which': 57,
 'five': 58,
 'for': 59,
 'launching': 60,
 'missile': 61,
 'missiles': 62,
 'network': 63,
 'system': 64,
 'years': 65,
 'counterparts': 66,
 ...

### Creating a dictionary from multiple text files

Now, how do you read one line at a time from multiple files?
Assuming all the text files are in the same directory, you need to define a class with an `__iter__` method. The `__iter__()` method should iterate through all the files in the given directory and yield the processed list of word tokens, one line at a time.
Let’s define one such class named ReadTxtFiles, which takes the path to the directory containing the text files. I am using a directory of sports and food documents as input.
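Why a class with `__iter__` rather than a plain generator? A generator is exhausted after one pass, while an object with `__iter__` hands out a fresh iterator every time it is looped over, which helps if a consumer needs to read the data more than once. The sketch below shows this with throwaway files and plain whitespace tokenization. The class name `LineTokens`, the file names and the file contents are all made up for illustration.

```python
import os
import tempfile

class LineTokens:
    """Re-iterable stream: yields one list of lower-cased tokens per line."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # sorted() makes the file order deterministic for this demo
        for fname in sorted(os.listdir(self.dirname)):
            with open(os.path.join(self.dirname, fname), encoding="utf-8") as f:
                for line in f:
                    yield line.lower().split()

with tempfile.TemporaryDirectory() as tmp:
    # Write two tiny example files
    for name, text in [("a.txt", "Pizza and pasta\n"), ("b.txt", "Cricket and baseball\n")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(text)

    stream = LineTokens(tmp)
    first_pass = list(stream)
    second_pass = list(stream)  # a fresh iterator is created for each loop

print(first_pass)
print(first_pass == second_pass)
```

Only one line is held in memory at a time, so the same pattern works for directories far larger than RAM.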

In [19]:
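class ReadTxtFiles(object):
    """Stream every line of every file in a directory as a list of tokens."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            print(fname)  # show which file is being read
            # The context manager closes each file once it has been read
            with open(os.path.join(self.dirname, fname), encoding='utf-8') as f:
                for line in f:
                    yield simple_preprocess(line)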

path_to_text_directory = "fooddocs"

dictionary = corpora.Dictionary(ReadTxtFiles(path_to_text_directory))
print(dictionary.token2id)

cricket.txt
badminton.txt
dosa.txt
pizza.txt
pasta.txt
noodles.txt
table tennis.txt
idli.txt
baseball.txt
{'doctype': 0, 'html': 1, 'en': 2, 'lang': 3, 'head': 4, 'charset': 5, 'meta': 6, 'utf': 7, 'com': 8, 'dns': 9, 'github': 10, 'githubassets': 11, 'href': 12, 'https': 13, 'link': 14, 'prefetch': 15, 'rel': 16, 'avatars': 17, 'amazonaws': 18, 'cloud': 19, 'images': 20, 'user': 21, 'ac': 22, 'all': 23, 'anonymous': 24, 'assets': 25, 'bfeb': 26, 'cb': 27, 'crossorigin': 28, 'css': 29, 'dbe': 30, 'dfhmjkb': 31, 'eyldda': 32, 'frameworks': 33, 'ga': 34, 'hfay': 35, 'hjsk': 36, 'integrity': 37, 'iw': 38, 'jleiufjaob': 39, 'kgk': 40, 'media': 41, 'qx': 42, 'sha': 43, 'stylesheet': 44, 'yekvqlma': 45, 'yyon': 46, 'bqr': 47, 'ca': 48, 'chmtgkwc': 49, 'fd': 50, 'fnq': 51, 'gwzobulnpxleh': 52, 'jhhfiw': 53, 'jlycxt': 54, 'lzjav': 55, 'tzwgg': 56, 'zdxzcmb': 57, 'content': 58, 'device': 59, 'name': 60, 'viewport': 61, 'width': 62, 'at': 63, 'cricket': 64, 'datasets': 65, 'master': 66, 'selva':