
In [0]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [0]:
import os
import tempfile
TEMP_FOLDER = tempfile.gettempdir()
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))

Folder "/tmp" will be used to save temporary dictionary and corpus.


# From Strings to Vectors   

We will use the gensim Python library to tokenize and vectorize strings.

In [0]:
from gensim import corpora

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

2019-05-14 06:43:06,931 : INFO : 'pattern' package not found; tag filters are not available for English


This is a tiny corpus of nine documents, each consisting of only a single sentence.

First, let’s tokenize the documents, remove common words (stopwords), and drop words that appear only once in the corpus:

In [0]:
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())    # Some selected stopwords 
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
texts

[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]

In [0]:

# count how often each token appears, so that words appearing only once can be removed
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

print(frequency)

defaultdict(<class 'int'>, {'human': 2, 'machine': 1, 'interface': 2, 'lab': 1, 'abc': 1, 'computer': 2, 'applications': 1, 'survey': 2, 'user': 3, 'opinion': 1, 'system': 4, 'response': 2, 'time': 2, 'eps': 2, 'management': 1, 'engineering': 1, 'testing': 1, 'relation': 1, 'perceived': 1, 'error': 1, 'measurement': 1, 'generation': 1, 'random': 1, 'binary': 1, 'unordered': 1, 'trees': 3, 'intersection': 1, 'graph': 3, 'paths': 1, 'minors': 2, 'iv': 1, 'widths': 1, 'well': 1, 'quasi': 1, 'ordering': 1})


In [0]:
# keep only the tokens that appear more than once in the whole corpus
texts = [[token for token in text if frequency[token] > 1] for text in texts]

from pprint import pprint  # pretty-printer
pprint(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]
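The counting and filtering steps above can also be expressed with `collections.Counter`. The following is a minimal sketch, not part of the original notebook: the tokenized documents are copied from the output above, and the names `tokenized`, `min_count`, and `filtered` are illustrative.

```python
from collections import Counter
from itertools import chain

# Tokenized documents after stopword removal, copied from the output above.
tokenized = [
    ['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
    ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'management', 'system'],
    ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
    ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
    ['generation', 'random', 'binary', 'unordered', 'trees'],
    ['intersection', 'graph', 'paths', 'trees'],
    ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
    ['graph', 'minors', 'survey'],
]

min_count = 2  # illustrative threshold: keep tokens seen at least twice

# Count every token occurrence across the whole corpus in one pass.
frequency = Counter(chain.from_iterable(tokenized))

# Keep only the tokens that reach the threshold.
filtered = [[token for token in text if frequency[token] >= min_count]
            for text in tokenized]

print(filtered[0])  # ['human', 'interface', 'computer']
```

Note that this counts total occurrences, so a word used twice within a single document would also survive the filter.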


In [0]:
dictionary = corpora.Dictionary(texts)
dictionary.save(os.path.join(TEMP_FOLDER, 'deerwester.dict'))  # store the dictionary, for future reference
print(dictionary) # dictionary has unique words.

2019-05-14 06:44:03,419 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-05-14 06:44:03,421 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)
2019-05-14 06:44:03,422 : INFO : saving Dictionary object under /tmp/deerwester.dict, separately None
2019-05-14 06:44:03,426 : INFO : saved /tmp/deerwester.dict


Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)


### Each unique word has now been assigned a unique integer ID

In [0]:
print(dictionary.token2id)

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}


### Convert tokenized documents to vectors:

In [0]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  # the word "interaction" does not appear in the dictionary and is ignored

[(0, 1), (1, 1)]


In [0]:
new_doc = "Human computer system"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  

[(0, 1), (1, 1), (5, 1)]


###  The function doc2bow() simply counts the number of occurrences of each distinct word, converts each word to its integer word ID, and returns the result as a bag-of-words: a sparse vector of the form [(word_id, word_count), ...].

In [0]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize(os.path.join(TEMP_FOLDER, 'deerwester.mm'), corpus)  # store to disk, for later use
for c in corpus:
    print(c)

2019-05-14 06:44:25,727 : INFO : storing corpus in Matrix Market format to /tmp/deerwester.mm
2019-05-14 06:44:25,731 : INFO : saving sparse matrix to /tmp/deerwester.mm
2019-05-14 06:44:25,733 : INFO : PROGRESS: saving document #0
2019-05-14 06:44:25,741 : INFO : saved 9x12 matrix, density=25.926% (28/108)
2019-05-14 06:44:25,746 : INFO : saving MmCorpus index to /tmp/deerwester.mm.index


[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]
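To make the bag-of-words conversion concrete, here is a pure-Python sketch of the counting that doc2bow() is described as doing. It is an illustration under stated assumptions, not gensim's implementation: the helper name `bow_sketch` is hypothetical, and the ID mapping is copied from the `dictionary.token2id` output above.

```python
from collections import Counter

# ID mapping copied from the dictionary.token2id output above.
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
            'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8,
            'trees': 9, 'graph': 10, 'minors': 11}

def bow_sketch(tokens, token2id):
    """Return sorted (word_id, word_count) pairs, ignoring unknown words."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())

# "interaction" is not in the mapping, so it is dropped.
print(bow_sketch("human computer interaction".split(), token2id))
# [(0, 1), (1, 1)]

# "system" occurs twice, so its count is 2.
print(bow_sketch(['system', 'human', 'system', 'eps'], token2id))
# [(1, 1), (5, 2), (8, 1)]
```

Only words with a nonzero count appear in the result, which is what makes the vector sparse.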


### Check the consistency between the documents by comparing their vectors.
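One way to make such a check concrete is to map a vector's token IDs back to words and compare the counts with the tokenized document. The sketch below does this for the fourth document, with the ID mapping and the vector copied from the outputs above; the helper name `decode` is illustrative and not part of gensim.

```python
from collections import Counter

# ID mapping copied from the dictionary.token2id output above.
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
            'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8,
            'trees': 9, 'graph': 10, 'minors': 11}

# Invert the mapping so that IDs can be turned back into words.
id2token = {token_id: token for token, token_id in token2id.items()}

def decode(vector):
    """Turn [(word_id, word_count), ...] into a {word: count} dict."""
    return {id2token[token_id]: count for token_id, count in vector}

text = ['system', 'human', 'system', 'eps']   # fourth tokenized document
vector = [(1, 1), (5, 2), (8, 1)]             # fourth corpus vector

decoded = decode(vector)
print(decoded)                          # {'human': 1, 'system': 2, 'eps': 1}
print(decoded == dict(Counter(text)))   # True
```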

# Corpus Streaming – One Document at a Time

The corpus above resides fully in memory, as a plain Python list. In this simple example that doesn’t matter much, but suppose the corpus contained millions of documents: we could not store all of them in RAM. Instead, let’s assume the documents are stored in a file on disk, one document per line. Gensim only requires that a corpus be able to return one document vector at a time:

In [0]:
# __iter__ is written as a generator here: every time a MyCorpus object is iterated over,
# the file is opened again and one bag-of-words vector is yielded per line, until the file is exhausted.


from smart_open import smart_open
class MyCorpus(object):
    def __iter__(self):
        for line in smart_open('/home/corpora_docs.txt', 'rb'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())




In [0]:
corpus_memory_friendly = MyCorpus()    # it doesn't load the corpus into memory!
print(corpus_memory_friendly)

<__main__.MyCorpus object at 0x7fad05b00208>


###  Let’s iterate over the corpus and print each document vector (one at a time):

In [0]:
for vector in corpus_memory_friendly:     # load one vector into memory at a time
    print(vector)

Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want.
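The streaming idea can be tried out without gensim or a pre-existing corpus file. The following is a minimal standard-library sketch under stated assumptions: the class name `LineCorpus`, the two sample lines, and the three-word ID mapping are all illustrative, and the file is a temporary one created only for the demonstration.

```python
import os
import tempfile

class LineCorpus:
    """Yield one sparse bag-of-words vector per line of a text file."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        # The file is reopened on every iteration, so the corpus can be
        # traversed several times while holding only one line in memory.
        with open(self.path, encoding='utf-8') as handle:
            for line in handle:
                counts = {}
                for token in line.lower().split():
                    token_id = self.token2id.get(token)
                    if token_id is not None:
                        counts[token_id] = counts.get(token_id, 0) + 1
                yield sorted(counts.items())

# Illustrative subset of the ID mapping shown earlier.
token2id = {'computer': 0, 'human': 1, 'system': 5}

# Write a tiny one-document-per-line file to a temporary location.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write("Human computer interaction\n"
              "System and human system engineering\n")
    path = tmp.name

corpus = LineCorpus(path, token2id)
first_pass = list(corpus)
second_pass = list(corpus)   # a second traversal works as well
os.remove(path)

print(first_pass)                  # [[(0, 1), (1, 1)], [(1, 1), (5, 2)]]
print(first_pass == second_pass)   # True
```

Defining `__iter__` on a class, rather than passing around a single generator, is what allows the second traversal: a plain generator would be exhausted after the first pass.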


We are going to create the dictionary from the corpus file (/home/corpora_docs.txt in the code below) without loading the entire file into memory. Then we will build the list of token IDs to remove from this dictionary: the IDs of the stop words, looked up in dictionary.token2id, and the IDs of tokens that appear in only one document, found by querying the document-frequency mapping (dictionary.dfs). Finally, we will filter these token IDs out of our dictionary. Keep in mind that dictionary.filter_tokens (and some other functions such as dictionary.add_document) will call dictionary.compactify() to remove the gaps in the token ID series, so the numbering of the remaining tokens can change.

In [0]:
from smart_open import smart_open

# collect statistics about all tokens, streaming the file one line at a time
dictionary = corpora.Dictionary(line.lower().split() for line in smart_open('/home/corpora_docs.txt', 'rb'))


# collect the IDs of stop words and of words that appear in only one document
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
            if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]


# remove stop words and words that appear in only one document
dictionary.filter_tokens(stop_ids + once_ids)
print(dictionary)
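The effect of removing tokens and then closing the gaps in the ID series can be illustrated in plain Python. This is a sketch of the idea only, not gensim's implementation: the function name `remove_and_compact` and the small `token2id` and `dfs` mappings are made up for the example.

```python
def remove_and_compact(token2id, bad_ids):
    """Drop the given IDs, then renumber the survivors as 0..n-1 in ID order."""
    kept = sorted((token_id, token) for token, token_id in token2id.items()
                  if token_id not in bad_ids)
    return {token: new_id for new_id, (_, token) in enumerate(kept)}

# Made-up example data: token -> ID, and ID -> document frequency.
token2id = {'a': 0, 'human': 1, 'lab': 2, 'system': 3}
dfs = {0: 3, 1: 2, 2: 1, 3: 3}
stoplist = {'a', 'of'}

# IDs of stop words that are actually present in the mapping.
stop_ids = {token2id[word] for word in stoplist if word in token2id}
# IDs of tokens that occur in exactly one document.
once_ids = {token_id for token_id, docfreq in dfs.items() if docfreq == 1}

compacted = remove_and_compact(token2id, stop_ids | once_ids)
print(compacted)  # {'human': 0, 'system': 1}
```

Here 'system' moves from ID 3 to ID 1, which is why any vectors built before filtering should be regenerated afterwards.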