<p style="font-family:Roboto; font-size: 28px; color: magenta"> Python for NLP: Working with the Gensim Library (Part 1)</p>

In [1]:
'''
 Gensim is another extremely useful NLP library for Python.
'''

'\n Gensim is another extremely useful NLP library for Python.\n'

In [2]:
'''
 Gensim was developed primarily for topic modeling.
 However, it now supports a variety of other NLP tasks, such as converting words to vectors (word2vec),
 converting documents to vectors (doc2vec), finding text similarity, and, in versions before 4.0, text summarization.
'''

'\n Gensim was developed primarily for topic modeling.\n However, it now supports a variety of other NLP tasks, such as converting words to vectors (word2vec),\n converting documents to vectors (doc2vec), finding text similarity, and, in versions before 4.0, text summarization.\n'

<p style="font-family:consolas; font-size: 26px; color: magenta; text-decoration-line: overline; "> Creating Dictionaries using In-Memory Objects</p>

In [3]:
import gensim
from gensim import corpora
from pprint import pprint

In [4]:
text = ["""In computer science, artificial intelligence (AI),
             sometimes called machine intelligence, is intelligence
             demonstrated by machines, in contrast to the natural intelligence
             displayed by humans and animals. Computer science defines
             AI research as the study of intelligent agents: any device that
             perceives its environment and takes actions that maximize its chance
             of successfully achieving its goals."""]


In [5]:
tokens = [[token for token in sentence.split()] for sentence in text]
gensim_dictionary = corpora.Dictionary(tokens)

print("The dictionary has: " +str(len(gensim_dictionary)) + " tokens")

for k, v in gensim_dictionary.token2id.items():
    print(f'{k:{15}} {v:{10}}')

The dictionary has: 46 tokens
(AI),                    0
AI                       1
Computer                 2
In                       3
achieving                4
actions                  5
agents:                  6
and                      7
animals.                 8
any                      9
artificial              10
as                      11
by                      12
called                  13
chance                  14
computer                15
contrast                16
defines                 17
demonstrated            18
device                  19
displayed               20
environment             21
goals.                  22
humans                  23
in                      24
intelligence            25
intelligence,           26
intelligent             27
is                      28
its                     29
machine                 30
machines,               31
maximize                32
natural                 33
of                      34
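perceives               35
research                36
science                 37
science,                38
sometimes               39
study                   40
successfully            41
takes                   42
that                    43
the                     44
to                      45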

In [6]:
'''The output shows each unique word in our text along with the numeric ID assigned to it. To look up the ID of a single word, index token2id with that word.'''
print(gensim_dictionary.token2id["study"])

40


In [7]:
'''Similarly, you can find the word for a specific ID by indexing the dictionary with that ID.'''
print(gensim_dictionary[40])

study


In [8]:
'''You can print all tokens and their IDs at once by printing the token2id mapping, as shown here:'''
print(gensim_dictionary.token2id)

{'(AI),': 0, 'AI': 1, 'Computer': 2, 'In': 3, 'achieving': 4, 'actions': 5, 'agents:': 6, 'and': 7, 'animals.': 8, 'any': 9, 'artificial': 10, 'as': 11, 'by': 12, 'called': 13, 'chance': 14, 'computer': 15, 'contrast': 16, 'defines': 17, 'demonstrated': 18, 'device': 19, 'displayed': 20, 'environment': 21, 'goals.': 22, 'humans': 23, 'in': 24, 'intelligence': 25, 'intelligence,': 26, 'intelligent': 27, 'is': 28, 'its': 29, 'machine': 30, 'machines,': 31, 'maximize': 32, 'natural': 33, 'of': 34, 'perceives': 35, 'research': 36, 'science': 37, 'science,': 38, 'sometimes': 39, 'study': 40, 'successfully': 41, 'takes': 42, 'that': 43, 'the': 44, 'to': 45}


In [9]:
'''Let's now see how we can add more tokens to an existing dictionary using a new document.'''
text = ["""Colloquially, the term "artificial intelligence" is used to
           describe machines that mimic "cognitive" functions that humans
           associate with other human minds, such as "learning" and "problem solving"""]

tokens = [[token for token in sentence.split()] for sentence in text]
gensim_dictionary.add_documents(tokens)

print("The dictionary has: " + str(len(gensim_dictionary)) + " tokens")
print(gensim_dictionary.token2id)

The dictionary has: 65 tokens
{'(AI),': 0, 'AI': 1, 'Computer': 2, 'In': 3, 'achieving': 4, 'actions': 5, 'agents:': 6, 'and': 7, 'animals.': 8, 'any': 9, 'artificial': 10, 'as': 11, 'by': 12, 'called': 13, 'chance': 14, 'computer': 15, 'contrast': 16, 'defines': 17, 'demonstrated': 18, 'device': 19, 'displayed': 20, 'environment': 21, 'goals.': 22, 'humans': 23, 'in': 24, 'intelligence': 25, 'intelligence,': 26, 'intelligent': 27, 'is': 28, 'its': 29, 'machine': 30, 'machines,': 31, 'maximize': 32, 'natural': 33, 'of': 34, 'perceives': 35, 'research': 36, 'science': 37, 'science,': 38, 'sometimes': 39, 'study': 40, 'successfully': 41, 'takes': 42, 'that': 43, 'the': 44, 'to': 45, '"artificial': 46, '"cognitive"': 47, '"learning"': 48, '"problem': 49, 'Colloquially,': 50, 'associate': 51, 'describe': 52, 'functions': 53, 'human': 54, 'intelligence"': 55, 'machines': 56, 'mimic': 57, 'minds,': 58, 'other': 59, 'solving': 60, 'such': 61, 'term': 62, 'used': 63, 'with': 64}


In [10]:
'''In the script above, the new document contains the second part of the first paragraph of the Wikipedia article
on Artificial Intelligence.
We split the text into tokens and then call the add_documents method to add the tokens to our existing dictionary.
Tokens that are already in the dictionary keep their IDs; only the new tokens receive new IDs (46 to 64).'''

'In the script above, the new document contains the second part of the first paragraph of the Wikipedia article\non Artificial Intelligence.\nWe split the text into tokens and then call the add_documents method to add the tokens to our existing dictionary.\nTokens that are already in the dictionary keep their IDs; only the new tokens receive new IDs (46 to 64).'
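
The ID assignment seen in the output above can be sketched in plain Python. This is an illustrative stand-in, not Gensim's implementation: the helper name `add_documents_sketch` is hypothetical, and the sketch assumes that tokens new to the dictionary receive the next free IDs in sorted order, one document at a time, which is consistent with the IDs printed above.

```python
def add_documents_sketch(token2id, documents):
    """Extend a token -> ID mapping in place (hypothetical helper, not part of Gensim)."""
    for document in documents:
        # Only tokens not yet in the mapping get a new ID; known tokens keep theirs
        new_tokens = sorted(set(document) - set(token2id))
        for token in new_tokens:
            token2id[token] = len(token2id)
    return token2id

# First document: every token is new, so IDs follow the sorted token order
token2id = add_documents_sketch({}, [["the", "study", "of", "agents"]])
print(token2id)

# Second document: "the" keeps its ID, the remaining tokens are appended
add_documents_sketch(token2id, [["the", "term", "is", "used"]])
print(token2id)
```

Because existing IDs never change, bag-of-words vectors built before the new document was added remain valid afterwards.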

<p style="font-family:consolas; font-size: 26px; color: magenta; text-decoration-line: overline; "> Creating Dictionaries using Text Files</p>

In [11]:
'''What if we want to create a dictionary by reading a text file from the hard drive?'''

'What if we want to create a dictionary by reading a text file from the hard drive?'

In [12]:
'''Now let's create a dictionary that contains the tokens from the text file "file1.txt".
The simple_preprocess function lowercases and tokenizes each line of the file (deacc=True also strips accent marks),
and the resulting token lists are used to create the dictionary.'''
from gensim.utils import simple_preprocess

# Open the file in a with-block so that it is closed once the dictionary has been built
with open(r'data/file1.txt', encoding='utf-8') as f:
    gensim_dictionary = corpora.Dictionary(simple_preprocess(line, deacc=True) for line in f)

print(gensim_dictionary.token2id)

{'air': 0, 'also': 1, 'an': 2, 'and': 3, 'aspect': 4, 'average': 5, 'by': 6, 'caused': 7, 'change': 8, 'climate': 9, 'commonly': 10, 'continuing': 11, 'earlier': 12, 'earth': 13, 'economy': 14, 'effects': 15, 'emissions': 16, 'episodes': 17, 'experienced': 18, 'gasses': 19, 'geological': 20, 'global': 21, 'greenhouse': 22, 'in': 23, 'increase': 24, 'industrial': 25, 'is': 26, 'long': 27, 'mainly': 28, 'measurements': 29, 'modern': 30, 'multiple': 31, 'observed': 32, 'ocean': 33, 'of': 34, 'periods': 35, 'refers': 36, 'rise': 37, 'shown': 38, 'since': 39, 'system': 40, 'temperature': 41, 'temperatures': 42, 'term': 43, 'the': 44, 'though': 45, 'to': 46, 'warming': 47}


<p style="font-family:consolas; font-size: 26px; color: magenta; text-decoration-line: overline; "> Creating Bag of Words Corpus from In-Memory Objects</p>

In [13]:
import gensim
from gensim import corpora
from pprint import pprint

text = ["""In computer science, artificial intelligence (AI),
           sometimes called machine intelligence, is intelligence
           demonstrated by machines, in contrast to the natural intelligence
           displayed by humans and animals. Computer science defines
           AI research as the study of intelligent agents: any device that
           perceives its environment and takes actions that maximize its chance
           of successfully achieving its goals."""]

tokens = [[token for token in sentence.split()] for sentence in text]

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in tokens]

print(gensim_corpus)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 2), (8, 1), (9, 1), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 3), (26, 1), (27, 1), (28, 1), (29, 3), (30, 1), (31, 1), (32, 1), (33, 1), (34, 2), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 2), (44, 2), (45, 1)]]


In [14]:
'''In the script above, we split the text into tokens.'''
'''Next, we initialize an empty Dictionary object from the corpora module.'''
'''The object has a doc2bow method, which converts a list of tokens into (token ID, frequency) pairs:
        It iterates through all the words in the document. If a word has already been counted,
        it increments the frequency count for that word.
        Otherwise it records the word with a frequency count of 1.
        Because allow_update=True, words that are new to the dictionary are also added to it.
'''


'The object has a doc2bow method, which converts a list of tokens into (token ID, frequency) pairs:\n        It iterates through all the words in the document. If a word has already been counted,\n        it increments the frequency count for that word.\n        Otherwise it records the word with a frequency count of 1.\n        Because allow_update=True, words that are new to the dictionary are also added to it.\n'
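
The counting step described above can be sketched with the standard library alone. This is a simplified stand-in for doc2bow with allow_update=True, not Gensim's code: `doc2bow_sketch` is a hypothetical name, and the sorted ID assignment is an assumption that matches the output shown earlier.

```python
from collections import Counter

def doc2bow_sketch(token2id, document):
    """Return sorted (token ID, count) pairs for one tokenized document.

    Hypothetical stand-in for doc2bow with allow_update=True: tokens that are
    not yet in token2id are added to it before the pairs are built.
    """
    counts = Counter(document)
    for token in sorted(counts):
        if token not in token2id:
            token2id[token] = len(token2id)
    return sorted((token2id[token], count) for token, count in counts.items())

token2id = {}
document = "its goals and its chance and its environment".split()
bow = doc2bow_sketch(token2id, document)

print(token2id)  # token -> ID mapping built as a side effect
print(bow)       # (token ID, count) pairs
```

Here "its" occurs three times and "and" twice, so their pairs carry the counts 3 and 2, just as 'intelligence' and 'its' carry the count 3 in the real output.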

In [15]:
'''Let's now print each word with its frequency count to make things clear.'''
word_frequencies = [[(gensim_dictionary[token_id], frequency) for token_id, frequency in doc] for doc in gensim_corpus]
print(word_frequencies)

[[('(AI),', 1), ('AI', 1), ('Computer', 1), ('In', 1), ('achieving', 1), ('actions', 1), ('agents:', 1), ('and', 2), ('animals.', 1), ('any', 1), ('artificial', 1), ('as', 1), ('by', 2), ('called', 1), ('chance', 1), ('computer', 1), ('contrast', 1), ('defines', 1), ('demonstrated', 1), ('device', 1), ('displayed', 1), ('environment', 1), ('goals.', 1), ('humans', 1), ('in', 1), ('intelligence', 3), ('intelligence,', 1), ('intelligent', 1), ('is', 1), ('its', 3), ('machine', 1), ('machines,', 1), ('maximize', 1), ('natural', 1), ('of', 2), ('perceives', 1), ('research', 1), ('science', 1), ('science,', 1), ('sometimes', 1), ('study', 1), ('successfully', 1), ('takes', 1), ('that', 2), ('the', 2), ('to', 1)]]


<p style="font-family:consolas; font-size: 26px; color: magenta; text-decoration-line: overline; "> Creating Bag of Words Corpus from Text Files</p>

In [16]:
'''Like dictionaries, we can also create a bag of words corpus by reading a text file'''

'Like dictionaries, we can also create a bag of words corpus by reading a text file'

In [17]:
from gensim.utils import simple_preprocess

# Read the file line by line; each line becomes one tokenized document
with open(r'./data/file1.txt', encoding='utf-8') as f:
    tokens = [simple_preprocess(line, deacc=True) for line in f]

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(doc, allow_update=True) for doc in tokens]
word_frequencies = [[(gensim_dictionary[token_id], frequency) for token_id, frequency in doc] for doc in gensim_corpus]

print(word_frequencies)

[[('air', 1), ('also', 1), ('an', 1), ('and', 3), ('aspect', 1), ('average', 2), ('by', 3), ('caused', 1), ('change', 1), ('climate', 2), ('commonly', 1), ('continuing', 1), ('earlier', 1), ('earth', 1), ('economy', 1), ('effects', 1), ('emissions', 1), ('episodes', 1), ('experienced', 1), ('gasses', 1), ('geological', 1), ('global', 1), ('greenhouse', 1), ('in', 3), ('increase', 1), ('industrial', 1), ('is', 1), ('long', 1), ('mainly', 1), ('measurements', 1), ('modern', 1), ('multiple', 1), ('observed', 1), ('ocean', 1), ('of', 5), ('periods', 1), ('refers', 1), ('rise', 1), ('shown', 1), ('since', 1), ('system', 1), ('temperature', 2), ('temperatures', 1), ('term', 2), ('the', 6), ('though', 1), ('to', 1), ('warming', 3)]]


<p style="font-family:consolas; font-size: 26px; color: magenta; text-decoration-line: overline; "> Creating TF-IDF Corpus</p>

In [18]:
'''The term frequency is calculated as:'''
'''Term frequency = (Frequency of the word in a document)/(Total words in the document)'''

'''The Inverse Document Frequency is calculated as:'''
'''IDF(word) = Log((Total number of documents)/(Number of documents containing the word))'''

'IDF(word) = Log((Total number of documents)/(Number of documents containing the word))'
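
The two formulas above can be worked through on a tiny example. The documents below are made up for illustration, the function names are hypothetical, and the natural logarithm is an assumption, since the formula does not state a base.

```python
import math

# Three tiny tokenized documents (made up for this illustration)
documents = [
    ["cats", "chase", "mice"],
    ["dogs", "chase", "cats", "and", "cats", "hide"],
    ["mice", "eat", "cheese"],
]

def term_frequency(word, document):
    # (Frequency of the word in a document) / (Total words in the document)
    return document.count(word) / len(document)

def inverse_document_frequency(word, documents):
    # log((Total number of documents) / (Number of documents containing the word))
    # Assumes the word occurs in at least one document
    containing = sum(1 for document in documents if word in document)
    return math.log(len(documents) / containing)

tf = term_frequency("cats", documents[1])            # 2 occurrences out of 6 words
idf = inverse_document_frequency("cats", documents)  # 3 documents, 2 contain "cats"
print(f"TF = {tf:.3f}, IDF = {idf:.3f}, TF-IDF = {tf * idf:.3f}")
```

A word that appears in every document gets an IDF of log(1) = 0, which is why very common words contribute little to a TF-IDF vector.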

In [19]:
import gensim
from gensim import corpora
from pprint import pprint

text = ["I like to play Football",
       "Football is the best game",
       "Which game do you like to play ?"]

tokens = [[token for token in sentence.split()] for sentence in text]

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in tokens]

from gensim import models
import numpy as np

tfidf = models.TfidfModel(gensim_corpus, smartirs='ntc')

# Print each token with its TF-IDF weight, rounded to two decimals
for sent in tfidf[gensim_corpus]:
    print([[gensim_dictionary[token_id], np.around(weight, decimals=2)] for token_id, weight in sent])

[['Football', 0.35], ['I', 0.71], ['like', 0.35], ['play', 0.35], ['to', 0.35]]
[['Football', 0.27], ['best', 0.53], ['game', 0.27], ['is', 0.53], ['the', 0.53]]
[['like', 0.22], ['play', 0.22], ['to', 0.22], ['game', 0.22], ['?', 0.45], ['Which', 0.45], ['do', 0.45], ['you', 0.45]]
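
The weights printed above can be reproduced by hand, which helps to see what the model computes. The sketch below is not Gensim code. It assumes raw term counts, an IDF of log2((N + 1) / df), and cosine (L2) normalization per document. That weighting was chosen because it matches the numbers above; the exact formula behind smartirs='ntc' may differ between Gensim versions.

```python
import math
from collections import Counter

text = ["I like to play Football",
        "Football is the best game",
        "Which game do you like to play ?"]
tokens = [sentence.split() for sentence in text]

# Document frequency: the number of documents in which each token appears
doc_freq = Counter(token for doc in tokens for token in set(doc))
n_docs = len(tokens)

def tfidf_sketch(doc):
    """Hypothetical helper: assumed weighting, not taken from Gensim's source."""
    counts = Counter(doc)
    # Raw count times log2((N + 1) / df)
    weights = {t: c * math.log2((n_docs + 1) / doc_freq[t]) for t, c in counts.items()}
    # Cosine normalization: scale the vector to unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: round(w / norm, 2) for t, w in weights.items()}

for doc in tokens:
    print(tfidf_sketch(doc))
```

In the first sentence, "I" occurs in one document only and therefore gets twice the raw weight of the other words, which each occur in two documents; after normalization this gives 0.71 against 0.35.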


<p style="font-family:consolas; font-size: 26px; color: magenta; text-decoration-line: overline; "> Downloading Built-In Gensim Models and Datasets</p>

In [20]:
'''Gensim comes with a variety of built-in datasets and word embedding models that can be directly used.'''

'Gensim comes with a variety of built-in datasets and word embedding models that can be directly used.'

In [21]:
'''To download a built-in model or dataset, we can use the downloader module from the gensim library.'''
import gensim.downloader as api

w2v_embedding = api.load("glove-wiki-gigaword-100")

In [22]:
'''With the commands above, we download the "glove-wiki-gigaword-100" word embedding model,
which contains 100-dimensional GloVe vectors trained on Wikipedia and Gigaword text.'''

'With the commands above, we download the "glove-wiki-gigaword-100" word embedding model,\nwhich contains 100-dimensional GloVe vectors trained on Wikipedia and Gigaword text.'

In [23]:
w2v_embedding.most_similar('toyota')

[('honda', 0.8739858865737915),
 ('nissan', 0.8108116388320923),
 ('automaker', 0.7918164134025574),
 ('mazda', 0.7687168717384338),
 ('bmw', 0.7616022825241089),
 ('ford', 0.7547588348388672),
 ('motors', 0.7539199590682983),
 ('volkswagen', 0.7176680564880371),
 ('prius', 0.7156583070755005),
 ('chrysler', 0.7085398435592651)]
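
The scores returned by most_similar are cosine similarities between word vectors. A minimal sketch of that ranking in plain Python follows; the three-dimensional vectors and the helper name `most_similar_sketch` are made up for illustration and have nothing to do with the real GloVe values.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors; the real model uses 100 dimensions
vectors = {
    "toyota": [0.9, 0.8, 0.1],
    "honda":  [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

def most_similar_sketch(word, topn=2):
    """Rank all other words by cosine similarity to the query word."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

result = most_similar_sketch("toyota")
print(result)
```

With these toy vectors, "honda" points in nearly the same direction as "toyota" and ranks first, while "banana" ranks last.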
