
https://stackabuse.com/python-for-nlp-working-with-the-gensim-library-part-1/

In [26]:
! pip install gensim



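Gensim was developed primarily for topic modeling. It now supports a variety of other NLP tasks as well, such as converting words to vectors (word2vec), converting documents to vectors (doc2vec), and finding text similarity. Text summarization was also supported when this notebook was written, but the summarization module was removed in Gensim 4.0.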

In [35]:
import gensim
from gensim import corpora
from pprint import pprint

text = ["""In computer science, artificial intelligence (AI),
             sometimes called machine intelligence, is intelligence
             demonstrated by machines, in contrast to the natural intelligence
             displayed by humans and animals. Computer science defines
             AI research as the study of intelligent agents: any device that
             perceives its environment and takes actions that maximize its chance
             of successfully achieving its goals."""]

tokens = [[token for token in sentence.split()] for sentence in text]
gensim_dictionary = corpora.Dictionary(tokens)

print("The dictionary has: " + str(len(gensim_dictionary)) + " tokens")

for k, v in gensim_dictionary.token2id.items():
    print(f'{k:{15}} {v:{10}}')

The dictionary has: 46 tokens
(AI),                    0
AI                       1
Computer                 2
In                       3
achieving                4
actions                  5
agents:                  6
and                      7
animals.                 8
any                      9
artificial              10
as                      11
by                      12
called                  13
chance                  14
computer                15
contrast                16
defines                 17
demonstrated            18
device                  19
displayed               20
environment             21
goals.                  22
humans                  23
in                      24
intelligence            25
intelligence,           26
intelligent             27
is                      28
its                     29
machine                 30
machines,               31
maximize                32
natural                 33
of                      34
perceives               3

In [36]:
print(gensim_dictionary.token2id["study"])

40


In [37]:
# Indexing the dictionary by id returns the corresponding token
print(gensim_dictionary[40])

study


In [38]:
print(gensim_dictionary.token2id)

{'(AI),': 0, 'AI': 1, 'Computer': 2, 'In': 3, 'achieving': 4, 'actions': 5, 'agents:': 6, 'and': 7, 'animals.': 8, 'any': 9, 'artificial': 10, 'as': 11, 'by': 12, 'called': 13, 'chance': 14, 'computer': 15, 'contrast': 16, 'defines': 17, 'demonstrated': 18, 'device': 19, 'displayed': 20, 'environment': 21, 'goals.': 22, 'humans': 23, 'in': 24, 'intelligence': 25, 'intelligence,': 26, 'intelligent': 27, 'is': 28, 'its': 29, 'machine': 30, 'machines,': 31, 'maximize': 32, 'natural': 33, 'of': 34, 'perceives': 35, 'research': 36, 'science': 37, 'science,': 38, 'sometimes': 39, 'study': 40, 'successfully': 41, 'takes': 42, 'that': 43, 'the': 44, 'to': 45}


In [39]:
text = ["""Colloquially, the term "artificial intelligence" is used to
           describe machines that mimic "cognitive" functions that humans
           associate with other human minds, such as "learning" and "problem solving"""]

tokens = [[token for token in sentence.split()] for sentence in text]
gensim_dictionary.add_documents(tokens)

print("The dictionary has: " + str(len(gensim_dictionary)) + " tokens")
print(gensim_dictionary.token2id)

The dictionary has: 65 tokens
{'(AI),': 0, 'AI': 1, 'Computer': 2, 'In': 3, 'achieving': 4, 'actions': 5, 'agents:': 6, 'and': 7, 'animals.': 8, 'any': 9, 'artificial': 10, 'as': 11, 'by': 12, 'called': 13, 'chance': 14, 'computer': 15, 'contrast': 16, 'defines': 17, 'demonstrated': 18, 'device': 19, 'displayed': 20, 'environment': 21, 'goals.': 22, 'humans': 23, 'in': 24, 'intelligence': 25, 'intelligence,': 26, 'intelligent': 27, 'is': 28, 'its': 29, 'machine': 30, 'machines,': 31, 'maximize': 32, 'natural': 33, 'of': 34, 'perceives': 35, 'research': 36, 'science': 37, 'science,': 38, 'sometimes': 39, 'study': 40, 'successfully': 41, 'takes': 42, 'that': 43, 'the': 44, 'to': 45, '"artificial': 46, '"cognitive"': 47, '"learning"': 48, '"problem': 49, 'Colloquially,': 50, 'associate': 51, 'describe': 52, 'functions': 53, 'human': 54, 'intelligence"': 55, 'machines': 56, 'mimic': 57, 'minds,': 58, 'other': 59, 'solving': 60, 'such': 61, 'term': 62, 'used': 63, 'with': 64}


Creating Dictionaries using Text Files


In [40]:
from gensim.utils import simple_preprocess
!wget https://raw.githubusercontent.com/plthiyagu/Personnel/master/Dataset/file1.txt

# Stream the file line by line; each line becomes one tokenized document
with open('file1.txt', encoding='utf-8') as text_file:
    gensim_dictionary = corpora.Dictionary(simple_preprocess(sentence, deacc=True) for sentence in text_file)

print(gensim_dictionary.token2id)


--2020-10-04 04:37:04--  https://raw.githubusercontent.com/plthiyagu/Personnel/master/Dataset/file1.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 463 [text/plain]
Saving to: ‘file1.txt.1’


2020-10-04 04:37:04 (18.6 MB/s) - ‘file1.txt.1’ saved [463/463]

{'air': 0, 'also': 1, 'an': 2, 'and': 3, 'aspect': 4, 'average': 5, 'by': 6, 'caused': 7, 'change': 8, 'climate': 9, 'commonly': 10, 'continuing': 11, 'earlier': 12, 'earth': 13, 'economy': 14, 'effects': 15, 'emissions': 16, 'episodes': 17, 'experienced': 18, 'gasses': 19, 'geological': 20, 'global': 21, 'greenhouse': 22, 'in': 23, 'increase': 24, 'industrial': 25, 'is': 26, 'long': 27, 'mainly': 28, 'measurements': 29, 'modern': 30, 'multiple': 31, 'observed': 32, 'ocean': 33, 'of': 34, 'periods': 35, 're

In [8]:
!pwd

/content


Creating Dictionaries using Multiple Text Files


In [44]:
from gensim.utils import simple_preprocess
import os

class ReturnTokens(object):
    """Iterate over every .txt file in a directory, yielding one token list per line."""
    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        for file_name in os.listdir(self.dir_path):
            if file_name.endswith('.txt'):
                # Close each file as soon as its lines have been consumed
                with open(os.path.join(self.dir_path, file_name), encoding='utf-8') as text_file:
                    for sentence in text_file:
                        yield simple_preprocess(sentence)

path_to_text_directory = r"/content/"
gensim_dictionary = corpora.Dictionary(ReturnTokens(path_to_text_directory))

print(gensim_dictionary.token2id)


{'air': 0, 'also': 1, 'an': 2, 'and': 3, 'aspect': 4, 'average': 5, 'by': 6, 'caused': 7, 'change': 8, 'climate': 9, 'commonly': 10, 'continuing': 11, 'earlier': 12, 'earth': 13, 'economy': 14, 'effects': 15, 'emissions': 16, 'episodes': 17, 'experienced': 18, 'gasses': 19, 'geological': 20, 'global': 21, 'greenhouse': 22, 'in': 23, 'increase': 24, 'industrial': 25, 'is': 26, 'long': 27, 'mainly': 28, 'measurements': 29, 'modern': 30, 'multiple': 31, 'observed': 32, 'ocean': 33, 'of': 34, 'periods': 35, 'refers': 36, 'rise': 37, 'shown': 38, 'since': 39, 'system': 40, 'temperature': 41, 'temperatures': 42, 'term': 43, 'the': 44, 'though': 45, 'to': 46, 'warming': 47}


Creating Bag of Words Corpus


In [10]:
import gensim
from gensim import corpora
from pprint import pprint

text = ["""In computer science, artificial intelligence (AI),
           sometimes called machine intelligence, is intelligence
           demonstrated by machines, in contrast to the natural intelligence
           displayed by humans and animals. Computer science defines
           AI research as the study of intelligent agents: any device that
           perceives its environment and takes actions that maximize its chance
           of successfully achieving its goals."""]

tokens = [[token for token in sentence.split()] for sentence in text]

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in tokens]

print(gensim_corpus)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 2), (8, 1), (9, 1), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 3), (26, 1), (27, 1), (28, 1), (29, 3), (30, 1), (31, 1), (32, 1), (33, 1), (34, 2), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 2), (44, 2), (45, 1)]]


In [None]:
# Map each (token id, count) pair back to (token, count) for readability
word_frequencies = [[(gensim_dictionary[token_id], frequency) for token_id, frequency in document] for document in gensim_corpus]
print(word_frequencies)
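
To make the structure of a bag-of-words corpus more concrete, here is a minimal pure-Python sketch of what a `doc2bow`-style call produces: a sorted list of (token id, count) pairs. The `build_bow` helper and the sample sentence are hypothetical, and assigning ids to new tokens in sorted order is an assumption chosen to mirror the alphabetical ids seen in the outputs above; it is not Gensim's actual implementation.

```python
from collections import Counter


def build_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, adding unseen tokens to token2id."""
    counts = Counter(tokens)
    # Assumption: unseen tokens receive the next free ids in sorted order
    for token in sorted(counts):
        if token not in token2id:
            token2id[token] = len(token2id)
    return sorted((token2id[token], count) for token, count in counts.items())


token2id = {}
bow = build_bow("the cat sat on the mat".split(), token2id)

# Reverse mapping, used to turn ids back into readable tokens
id2token = {token_id: token for token, token_id in token2id.items()}

print(bow)
print([(id2token[token_id], count) for token_id, count in bow])
```

Only "the" occurs twice in the sample sentence, so it is the only pair with a count of 2.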

Creating Bag of Words Corpus from Text Files


In [49]:
from gensim.utils import simple_preprocess
from smart_open import smart_open
import os
#!wget https://raw.githubusercontent.com/plthiyagu/Personnel/master/Dataset/file1.txt

tokens = [simple_preprocess(sentence, deacc=True) for sentence in open(r'file1.txt', encoding='utf-8')]

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in tokens]
word_frequencies = [[(gensim_dictionary[token_id], frequency) for token_id, frequency in document] for document in gensim_corpus]

print(word_frequencies)


[[('air', 1), ('also', 1), ('an', 1), ('and', 3), ('aspect', 1), ('average', 2), ('by', 3), ('caused', 1), ('change', 1), ('climate', 2), ('commonly', 1), ('continuing', 1), ('earlier', 1), ('earth', 1), ('economy', 1), ('effects', 1), ('emissions', 1), ('episodes', 1), ('experienced', 1), ('gasses', 1), ('geological', 1), ('global', 1), ('greenhouse', 1), ('in', 3), ('increase', 1), ('industrial', 1), ('is', 1), ('long', 1), ('mainly', 1), ('measurements', 1), ('modern', 1), ('multiple', 1), ('observed', 1), ('ocean', 1), ('of', 5), ('periods', 1), ('refers', 1), ('rise', 1), ('shown', 1), ('since', 1), ('system', 1), ('temperature', 2), ('temperatures', 1), ('term', 2), ('the', 6), ('though', 1), ('to', 1), ('warming', 3)]]


In [50]:
from gensim.utils import simple_preprocess
import os

class ReturnTokens(object):
    """Iterate over every .txt file in a directory, yielding one token list per line."""
    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        for file_name in os.listdir(self.dir_path):
            if file_name.endswith('.txt'):
                # Close each file as soon as its lines have been consumed
                with open(os.path.join(self.dir_path, file_name), encoding='utf-8') as text_file:
                    for sentence in text_file:
                        yield simple_preprocess(sentence)

path_to_text_directory = r"/content/"

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in ReturnTokens(path_to_text_directory)]
word_frequencies = [[(gensim_dictionary[token_id], frequency) for token_id, frequency in document] for document in gensim_corpus]

print(word_frequencies)


[[('air', 1), ('also', 1), ('an', 1), ('and', 3), ('aspect', 1), ('average', 2), ('by', 3), ('caused', 1), ('change', 1), ('climate', 2), ('commonly', 1), ('continuing', 1), ('earlier', 1), ('earth', 1), ('economy', 1), ('effects', 1), ('emissions', 1), ('episodes', 1), ('experienced', 1), ('gasses', 1), ('geological', 1), ('global', 1), ('greenhouse', 1), ('in', 3), ('increase', 1), ('industrial', 1), ('is', 1), ('long', 1), ('mainly', 1), ('measurements', 1), ('modern', 1), ('multiple', 1), ('observed', 1), ('ocean', 1), ('of', 5), ('periods', 1), ('refers', 1), ('rise', 1), ('shown', 1), ('since', 1), ('system', 1), ('temperature', 2), ('temperatures', 1), ('term', 2), ('the', 6), ('though', 1), ('to', 1), ('warming', 3)]]


Creating TF-IDF Corpus


Term frequency = (Frequency of the word in a document)/(Total words in the document)


IDF(word) = Log((Total number of documents)/(Number of documents containing the word))
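
The two formulas above can be worked through by hand. The sketch below is a minimal pure-Python illustration, not Gensim's implementation: the helper names are hypothetical, whitespace tokenization is a simplification, and the natural logarithm is an assumption because the formula does not state a base. Gensim's `TfidfModel` applies its own weighting and normalization options, so its numbers may differ.

```python
import math

docs = [
    "I like to play Football",
    "Football is the best game",
    "Which game do you like to play ?",
]
tokenized = [doc.split() for doc in docs]


def term_frequency(word, tokens):
    # Frequency of the word in the document / total words in the document
    return tokens.count(word) / len(tokens)


def inverse_document_frequency(word, corpus):
    # log(total documents / documents containing the word); natural log assumed
    containing = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / containing)


def tf_idf(word, tokens, corpus):
    return term_frequency(word, tokens) * inverse_document_frequency(word, corpus)


for word in ("Football", "I"):
    print(word, round(tf_idf(word, tokenized[0], tokenized), 4))
```

"Football" appears in two of the three documents while "I" appears in only one, so "I" receives the higher score in the first document even though both words have the same term frequency there.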


In [None]:
import gensim
from gensim import corpora
from pprint import pprint

text = ["I like to play Football",
       "Football is the best game",
       "Which game do you like to play ?"]

tokens = [[token for token in sentence.split()] for sentence in text]

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in tokens]

from gensim import models
import numpy as np

tfidf = models.TfidfModel(gensim_corpus, smartirs='ntc')

# Print each document as [token, TF-IDF weight] pairs, rounded to two decimals
for sent in tfidf[gensim_corpus]:
    print([[gensim_dictionary[token_id], np.around(weight, decimals=2)] for token_id, weight in sent])

Downloading Built-In Gensim Models and Datasets


In [51]:
import gensim.downloader as api

w2v_embedding = api.load("glove-wiki-gigaword-100")





In [52]:
w2v_embedding.most_similar('toyota')



[('honda', 0.8739858865737915),
 ('nissan', 0.8108116984367371),
 ('automaker', 0.7918164134025574),
 ('mazda', 0.7687169313430786),
 ('bmw', 0.7616021633148193),
 ('ford', 0.7547588348388672),
 ('motors', 0.7539199590682983),
 ('volkswagen', 0.7176680564880371),
 ('prius', 0.7156583070755005),
 ('chrysler', 0.7085399031639099)]

In [54]:
!pip install wikipedia



In [53]:
!pip install pyLDAvis



In [55]:
import wikipedia
import nltk

nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

global_warming = wikipedia.page("Climate_change")
artificial_intelligence = wikipedia.page("Artificial Intelligence")
mona_lisa = wikipedia.page("Mona Lisa")
eiffel_tower = wikipedia.page("Eiffel Tower")

corpus = [global_warming.content, artificial_intelligence.content, mona_lisa.content, eiffel_tower.content]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Data Preprocessing


In [57]:
import re
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def preprocess_text(document):
    # Replace every non-word character with a space
    document = re.sub(r'\W', ' ', str(document))

    # Remove single characters surrounded by whitespace
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    # Remove a single character at the start of the document
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

    # Collapse runs of whitespace into a single space
    document = re.sub(r'\s+', ' ', document)

    # Remove a leading 'b' left over from the string form of a bytes object
    document = re.sub(r'^b\s+', '', document)

    # Convert to lowercase
    document = document.lower()

    # Lemmatize, then drop stop words and words of five characters or fewer
    tokens = document.split()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    tokens = [word for word in tokens if word not in en_stop]
    tokens = [word for word in tokens if len(word) > 5]

    return tokens

In [59]:
import nltk
nltk.download('wordnet')

processed_data = []
for doc in corpus:
    tokens = preprocess_text(doc)
    processed_data.append(tokens)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [60]:
from gensim import corpora

gensim_dictionary = corpora.Dictionary(processed_data)
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in processed_data]

In [61]:
import pickle

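# Use a context manager so the file is flushed and closed after writing
with open('gensim_corpus_corpus.pkl', 'wb') as corpus_file:
    pickle.dump(gensim_corpus, corpus_file)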
gensim_dictionary.save('gensim_dictionary.gensim')



In [62]:
import gensim

lda_model = gensim.models.ldamodel.LdaModel(gensim_corpus, num_topics=4, id2word=gensim_dictionary, passes=20)
lda_model.save('gensim_model.gensim')



In [63]:
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)


(0, '0.032*"climate" + 0.024*"change" + 0.018*"warming" + 0.015*"emission" + 0.014*"global" + 0.013*"greenhouse" + 0.009*"energy" + 0.008*"temperature" + 0.008*"effect" + 0.008*"carbon"')
(1, '0.000*"intelligence" + 0.000*"system" + 0.000*"artificial" + 0.000*"machine" + 0.000*"climate" + 0.000*"research" + 0.000*"painting" + 0.000*"change" + 0.000*"problem" + 0.000*"eiffel"')
(2, '0.019*"intelligence" + 0.015*"machine" + 0.013*"artificial" + 0.011*"problem" + 0.011*"system" + 0.008*"knowledge" + 0.008*"research" + 0.007*"approach" + 0.006*"computer" + 0.006*"learning"')
(3, '0.021*"painting" + 0.017*"eiffel" + 0.010*"leonardo" + 0.007*"french" + 0.006*"second" + 0.006*"louvre" + 0.005*"portrait" + 0.005*"century" + 0.005*"original" + 0.004*"museum"')


In [64]:
lda_model = gensim.models.ldamodel.LdaModel(gensim_corpus, num_topics=8, id2word=gensim_dictionary, passes=15)
lda_model.save('gensim_model.gensim')
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

(0, '0.045*"painting" + 0.021*"leonardo" + 0.012*"portrait" + 0.012*"louvre" + 0.008*"century"')
(1, '0.000*"painting" + 0.000*"leonardo" + 0.000*"louvre" + 0.000*"eiffel" + 0.000*"climate"')
(2, '0.000*"climate" + 0.000*"intelligence" + 0.000*"machine" + 0.000*"change" + 0.000*"eiffel"')
(3, '0.000*"climate" + 0.000*"change" + 0.000*"global" + 0.000*"warming" + 0.000*"greenhouse"')
(4, '0.000*"climate" + 0.000*"change" + 0.000*"warming" + 0.000*"global" + 0.000*"intelligence"')
(5, '0.000*"climate" + 0.000*"intelligence" + 0.000*"painting" + 0.000*"system" + 0.000*"machine"')
(6, '0.031*"eiffel" + 0.009*"second" + 0.007*"structure" + 0.007*"french" + 0.007*"exposition"')
(7, '0.018*"climate" + 0.014*"change" + 0.012*"intelligence" + 0.010*"warming" + 0.009*"machine"')




https://stackabuse.com/python-for-nlp-working-with-the-gensim-library-part-2/

In [65]:
lda_model = gensim.models.ldamodel.LdaModel(gensim_corpus, num_topics=4, id2word=gensim_dictionary, passes=20)
lda_model.save('gensim_model.gensim')
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)


(0, '0.032*"climate" + 0.024*"change" + 0.018*"warming" + 0.015*"emission" + 0.014*"global" + 0.013*"greenhouse" + 0.009*"energy" + 0.008*"temperature" + 0.008*"effect" + 0.008*"carbon"')
(1, '0.021*"painting" + 0.017*"eiffel" + 0.010*"leonardo" + 0.007*"french" + 0.006*"second" + 0.006*"louvre" + 0.005*"portrait" + 0.005*"century" + 0.005*"original" + 0.004*"museum"')
(2, '0.000*"climate" + 0.000*"change" + 0.000*"warming" + 0.000*"intelligence" + 0.000*"global" + 0.000*"problem" + 0.000*"machine" + 0.000*"artificial" + 0.000*"emission" + 0.000*"knowledge"')
(3, '0.019*"intelligence" + 0.015*"machine" + 0.013*"artificial" + 0.011*"problem" + 0.011*"system" + 0.008*"research" + 0.008*"knowledge" + 0.007*"approach" + 0.006*"computer" + 0.006*"learning"')




In [66]:
test_doc = 'Great structures are built to remember an event that happened in history.'
test_doc = preprocess_text(test_doc)
bow_test_doc = gensim_dictionary.doc2bow(test_doc)

print(lda_model.get_document_topics(bow_test_doc))

[(0, 0.08403021), (1, 0.7484379), (2, 0.083368815), (3, 0.08416306)]
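
The list printed above pairs each topic id with the probability that the test document belongs to that topic. As a small sketch of how such a result might be interpreted, the snippet below copies the values from that output and picks the most probable topic; the variable names are illustrative only.

```python
# (topic id, probability) pairs copied from the output above
doc_topics = [(0, 0.08403021), (1, 0.7484379), (2, 0.083368815), (3, 0.08416306)]

# The dominant topic is the pair with the largest probability
dominant_topic, probability = max(doc_topics, key=lambda pair: pair[1])

print(f"Dominant topic: {dominant_topic} (probability {probability:.2f})")
```

Topic 1 in this run is the one whose top words include "painting" and "eiffel", which fits a sentence about great structures.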


In [67]:
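# log_perplexity returns the per-word likelihood bound (log scale), not the perplexity itself;
# Gensim derives the perplexity as 2 ** (-bound)
print('\nPerplexity:', lda_model.log_perplexity(gensim_corpus))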

from gensim.models import CoherenceModel

coherence_score_lda = CoherenceModel(model=lda_model, texts=processed_data, dictionary=gensim_dictionary, coherence='c_v')
coherence_score = coherence_score_lda.get_coherence()

print('\nCoherence Score:', coherence_score)


Perplexity: -7.537522110617259

Coherence Score: 0.5983808903271268


Visualizing the LDA


In [68]:
gensim_dictionary = gensim.corpora.Dictionary.load('gensim_dictionary.gensim')
with open('gensim_corpus_corpus.pkl', 'rb') as corpus_file:
    gensim_corpus = pickle.load(corpus_file)
lda_model = gensim.models.ldamodel.LdaModel.load('gensim_model.gensim')

# Newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
import pyLDAvis.gensim

lda_visualization = pyLDAvis.gensim.prepare(lda_model, gensim_corpus, gensim_dictionary, sort_topics=False)
pyLDAvis.display(lda_visualization)



Topic Modeling via Latent Semantic Indexing (LSI)

In [69]:
from gensim.models import LsiModel

lsi_model = LsiModel(gensim_corpus, num_topics=4, id2word=gensim_dictionary)
topics = lsi_model.print_topics(num_words=10)
for topic in topics:
    print(topic)

(0, '0.518*"climate" + 0.382*"change" + 0.280*"warming" + 0.233*"emission" + 0.220*"global" + 0.199*"greenhouse" + 0.136*"energy" + 0.133*"effect" + 0.129*"temperature" + 0.125*"carbon"')
(1, '-0.428*"intelligence" + -0.332*"machine" + -0.293*"artificial" + -0.240*"problem" + -0.216*"system" + -0.179*"knowledge" + -0.175*"research" + 0.162*"climate" + -0.141*"approach" + -0.140*"learning"')
(2, '-0.690*"painting" + -0.326*"leonardo" + -0.181*"eiffel" + -0.180*"louvre" + -0.177*"portrait" + -0.153*"french" + -0.132*"century" + -0.121*"museum" + -0.096*"original" + -0.092*"italian"')
(3, '-0.657*"eiffel" + 0.268*"painting" + -0.181*"second" + -0.145*"exposition" + -0.145*"structure" + 0.131*"leonardo" + -0.128*"tallest" + -0.117*"engineer" + -0.108*"design" + -0.102*"restaurant"')
