<a href="https://colab.research.google.com/github/harikrishnareddymallavarapu/DeepAndNLP/blob/master/GensimTraining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [20]:
import gensim
from gensim import corpora
from pprint import pprint
import itertools as it

#Gensim Introduction

- Gensim requires every word to be mapped to a unique id. To do this, gensim provides a Dictionary object that maps each word to its id
- To build it, we convert the text / sentences to a [list of words] and pass it to corpora.Dictionary()
- The Dictionary is typically used to create a 'Bag of Words' corpus
- When text files are large, Gensim lets us update the dictionary one line at a time without loading the entire text file into memory

- In short, a Dictionary is a map from each unique word to its id
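
To make the idea concrete, here is a minimal plain-Python sketch of a word-to-id map that is updated one document at a time. It only illustrates the concept: `add_document` is a hypothetical helper, `token2id` is an ordinary dict named after gensim's attribute, and gensim's own Dictionary also tracks frequencies and may assign ids in a different order.

```python
def add_document(token2id, tokens):
    """Assign the next free integer id to every token not seen before."""
    for token in tokens:
        if token not in token2id:
            token2id[token] = len(token2id)
    return token2id

token2id = {}
lines = ["graph minors survey", "graph of trees"]
for line in lines:  # one line at a time, as when streaming a large file
    add_document(token2id, line.split())

print(token2id)
# {'graph': 0, 'minors': 1, 'survey': 2, 'of': 3, 'trees': 4}
```

Words that were already seen keep their id, so the map can grow as new lines arrive.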

In [4]:
documents = ["The Saudis are preparing a report that will acknowledge that", 
             "Saudi journalist Jamal Khashoggi's death was the result of an", 
             "to his abduction from Turkey, according to two sources."]

###Breaking sentences into words

In [None]:
# Split every document on whitespace to get a list of word lists
text = [doc.split() for doc in documents]
print(text)

In [None]:
# Breaks a sentence into characters (iterating over a string yields its characters)
text1 = list(it.chain(documents[0]))
text1

###Creating Dictionary object using Corpora

In [21]:
dictionary = corpora.Dictionary(text)
print(dictionary)

<gensim.corpora.dictionary.Dictionary object at 0x7fcf73d9bb38>


In [None]:
# Show the word -> id mapping held by the dictionary
print(dictionary.token2id)

###Extending dictionary object with new documents

In [24]:
documents_2 = ["The intersection graph of paths in trees",
               "Graph minors IV Widths of trees and well quasi ordering",
               "Graph minors A survey"]

In [25]:
text2 = [doc.split() for doc in documents_2]

In [26]:
dictionary.add_documents(text2)
print(dictionary)

###Loading a large file line by line, without loading the entire corpus into memory

In [27]:
from gensim.utils import simple_preprocess
import os

In [29]:
simple_preprocess(documents[0],deacc=True)

['the',
 'saudis',
 'are',
 'preparing',
 'report',
 'that',
 'will',
 'acknowledge',
 'that']
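
Note that the single-letter word "a" is missing from the output: `simple_preprocess` lowercases the text, tokenizes it, and by default drops tokens shorter than 2 or longer than 15 characters, while `deacc=True` also strips accent marks. The sketch below is a rough plain-Python approximation of that behaviour, not gensim's implementation; `rough_preprocess` is a hypothetical name and its regular expression is an assumption.

```python
import re
import unicodedata

def rough_preprocess(text, deacc=False, min_len=2, max_len=15):
    """Rough stand-in for simple_preprocess: lowercase, split into
    alphabetic tokens, keep tokens of min_len..max_len characters."""
    if deacc:
        # Decompose accented characters and drop the combining marks
        text = "".join(ch for ch in unicodedata.normalize("NFD", text)
                       if unicodedata.category(ch) != "Mn")
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

sentence = "The Saudis are preparing a report that will acknowledge that"
tokens = rough_preprocess(sentence, deacc=True)
print(tokens)
print(rough_preprocess("Café au lait", deacc=True))
# ['cafe', 'au', 'lait']
```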

In [None]:
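# Stream the file: only the current line is held in memory, and the file is closed afterwards
with open('sample.txt', encoding='utf-8') as f:
    dictionary = corpora.Dictionary(simple_preprocess(line, deacc=True) for line in f)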

In [None]:
dictionary.token2id

###Creating a Dictionary from multiple files without loading them into memory

In [None]:
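class ReadTxtFiles(object):
    """Iterate over every file in a directory, yielding one tokenized line at a time."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            # 'with' closes each file as soon as it has been read
            with open(os.path.join(self.dirname, fname), encoding='latin') as f:
                for line in f:
                    yield simple_preprocess(line)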

In [None]:
path_to_text_directory = "driveFolder"

In [None]:
dictionary = corpora.Dictionary(ReadTxtFiles(path_to_text_directory))
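
One reason to wrap the reader in a class with `__iter__` rather than a plain generator is that the object can be iterated more than once, whereas a generator is exhausted after a single pass. The following self-contained sketch shows this with a temporary directory; `ReadLines`, the file names, and the use of `str.split` in place of `simple_preprocess` are assumptions made for the demo.

```python
import os
import tempfile

class ReadLines:
    """Yield one tokenized line at a time from every file in a directory."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            with open(os.path.join(self.dirname, fname), encoding="utf-8") as f:
                for line in f:
                    yield line.lower().split()  # stand-in for simple_preprocess

# Build a throwaway directory with two small files for the demo
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("a.txt", "Graph minors\nA survey\n"), ("b.txt", "Paths in trees\n")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(body)

    stream = ReadLines(tmp)
    first_pass = list(stream)
    second_pass = list(stream)  # __iter__ starts a fresh pass, so this is not empty

print(first_pass)
# [['graph', 'minors'], ['a', 'survey'], ['paths', 'in', 'trees']]
print(first_pass == second_pass)
# True
```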

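###Creating Bag of words using the dictionary, words are replaced with their ids
- It's equivalent to a Document Term Matrix
- The dictionary object alone cannot serve as the BOW: it stores the word-to-id mapping for the whole vocabulary, not how often each word occurs in each document, so every document has to be converted with doc2bow()
- The output for each document is a list of tuples, each containing the id of a word and its frequency in that document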
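
As a plain-Python illustration of that conversion, the sketch below counts the ids of one tokenized document against an existing word-to-id map. `to_bow` and the small `token2id` dict are hypothetical; words missing from the map are skipped here, which is also what gensim's `doc2bow` does unless `allow_update=True` is passed.

```python
from collections import Counter

def to_bow(token2id, tokens):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

token2id = {"graph": 0, "minors": 1, "trees": 2}
doc = ["graph", "trees", "graph", "unknown"]

bow = to_bow(token2id, doc)
print(bow)
# [(0, 2), (2, 1)]

# Invert the map to read the result back as words
id2token = {i: t for t, i in token2id.items()}
readable = [(id2token[i], n) for i, n in bow]
print(readable)
# [('graph', 2), ('trees', 1)]
```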

In [None]:
tokenized_list = [simple_preprocess(doc) for doc in documents]
tokenized_list

In [32]:
# Start from an empty dictionary; allow_update=True adds unseen words while converting
dictionaryObj = corpora.Dictionary()
bowText = [dictionaryObj.doc2bow(doc, allow_update=True) for doc in tokenized_list]
pprint(bowText)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1), (7, 1)],
 [(6, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1)],
 [(17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 2), (23, 1), (24, 1)]]


In [39]:
print(bowText[0])
print("**")
print(bowText[0][0])
print("**")
print(type(bowText[0][0]))

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1), (7, 1)]
**
(0, 1)
**
<class 'tuple'>
