<a href="https://colab.research.google.com/github/harnalashok/deeplearning-sequences/blob/main/0_document_to_id_conversion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Last amended: 06th March, 2021
# Ref: https://radimrehurek.com/gensim/tut1.html#from-strings-to-vectors
#      https://www.tutorialspoint.com/gensim/gensim_creating_a_bag_of_words_corpus.htm
#
# Objective:
#            Using gensim
#         A. Convert tokens within each document to corresponding
#            'token-ids' or integer-tokens.
#            For text cleaning, please refer to the wikiclustering file
#            in folder: 10.nlp_workshop/text_clustering
#            This file uses gensim for tokenization
#         B. Keras also has a Tokenizer class that can be
#            used for integer-tokenization. See file:
#            8.rnn/3.keras_tokenizer_class.py
#         C. nltk can also tokenize. See file:
#            10.nlp_workshop/word2vec/nlp_workshop_word2vec.py



The core concepts are:  
**Document**  
A document is an object of the text sequence type (commonly known as str in Python 3). A document could be anything from a short 140-character tweet, a single paragraph (e.g., a journal article abstract), a news article, or a book.  
`document = "Human machine interface for lab abc computer applications"`
    

**Corpus**  
A corpus is a collection of Document objects. Corpora serve two roles in Gensim:
1.  Input for training a Model. During training, the models use this training corpus to look for common themes and topics, initializing their internal model parameters. Gensim focuses on unsupervised models so that no human intervention, such as costly annotations or tagging documents by hand, is required.

2.  Documents to organize. After training, a topic model can be used to extract topics from new documents (documents not seen in the training corpus). Such corpora can be indexed for Similarity Queries, queried by semantic similarity, clustered etc.


In [37]:
!pip install gensim



In [53]:

#%reset -f

# 1.1  gensim contains tools for Natural Language Processing
#      Module 'corpora' contains sub-modules and methods to
#      work with text documents
from gensim import corpora

# 1.2 defaultdict behaves like an ordinary dict, except that looking up
#     a missing key with d[key] inserts that key with a default value
#     produced by an initialization function (such as int(), which
#     returns 0).
#     See at the end of code: 'Dictionaries in Python'

from collections import defaultdict

# 1.3 simple_preprocess lowercases and tokenizes a document
from gensim.utils import simple_preprocess

# 1.4 To unnest a list of lists
from gensim.utils import flatten

# 1.5 pprint does pretty printing
import pprint

import gensim
gensim.__version__

'3.6.0'

In [14]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


In [60]:

# 2. Here is an example corpus. It consists of
#    9 documents, where each document is a string
#    consisting of a single sentence.
#    Create a sample collection (list) of documents
#    See text_clustering.py file as to how to get this list
#    from folder of files or pandas dataframe:

text_corpus = [
                "Human machine at interface for lab abc computer applications",
                 "A survey at of user opinion of computer system response time",
                "The EPS user interface management system",
                "System and human system engineering testing of EPS",
                "Relation of user perceived response time to error measurement",
                "The generation of random binary unordered trees",
                "The intersection graph of paths in trees",
                "Graph minors IV Widths of trees and well at quasi ordering",
                "Graph minors A survey"
                ]


Note  
The above example loads the entire corpus into memory. In practice, corpora may be very large, so loading them into memory may be impossible. Gensim intelligently handles such corpora by streaming them one document at a time. See Corpus Streaming – One Document at a Time for details.
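The streaming idea in the note above can be sketched in plain Python. This is only an illustration under stated assumptions, not Gensim's own implementation: the class name `StreamingCorpus` and the one-document-per-line file layout are hypothetical choices made for this example.

```python
import os
import tempfile


class StreamingCorpus:
    """Hypothetical helper: yields one tokenized document per line of a file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is re-opened on every pass, so the corpus can be
        # iterated several times without ever being held fully in memory.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.lower().split()


# Write a tiny two-document file purely for demonstration
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Human machine interface\nGraph minors A survey\n")
    corpus_path = tmp.name

streamed_docs = list(StreamingCorpus(corpus_path))
os.remove(corpus_path)
print(streamed_docs)
```

Only one line of the file is in memory at any moment while iterating; `list(...)` is used here just to display the result for a tiny file.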

Pre-processing  
After collecting our corpus, there are typically a number of preprocessing steps we want to undertake. We’ll keep it simple and just remove some commonly used English words (such as ‘the’) and words that occur only once in the corpus. In the process of doing so, we’ll tokenize our data. Tokenization breaks up the documents into words (in this case using space as a delimiter).
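Splitting on whitespace is the simplest form of tokenization, but it leaves punctuation attached to words. The short sketch below contrasts it with a regular-expression tokenizer; the sample sentence and the pattern `[a-z0-9]+` are assumptions chosen for illustration, not part of the original workflow.

```python
import re

sentence = "Graph minors: a survey."

# Plain whitespace splitting keeps punctuation attached to the words
whitespace_tokens = sentence.lower().split()

# A regular expression that keeps only runs of ASCII letters and digits
regex_tokens = re.findall(r"[a-z0-9]+", sentence.lower())

print(whitespace_tokens)   # ['graph', 'minors:', 'a', 'survey.']
print(regex_tokens)        # ['graph', 'minors', 'a', 'survey']
```

With whitespace splitting, 'minors:' and 'minors' would be counted as two different tokens, which matters once word frequencies are computed.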

In [None]:
# 2.1 Clean documents: See file text_clustering.py
# 2.2 Stem documents : See file text_clustering.py

In [61]:
# 2.3 Lowercase each document, split it by white space and filter out stopwords

# 2.3.1
#     Create some list of stopwords that we do not want
#     Detailed list of english stopwords is available at:
#     https://gist.github.com/sebleier/554280

#     We are not including the word 'at' here:

stoplist = set('for a of the and to in'.split())


In [62]:
# 3. Define our own tokenize function.
#    This function parses a list of strings into
#    a list of words for each element (document)
#    in the document-collection
def tokenize(docs):
    tokenized = []          # 1st list: this will be a list of lists
    for document in docs:   # For each document in the document-collection
        tokenized_document = []  # 2nd list: list of words per document
        for word in document.lower().split():
            if word not in stoplist:
                tokenized_document.append(word)  # Keep words that are not stopwords
        tokenized.append(tokenized_document)     # Append list of words to the outer list
    return tokenized


In [63]:
texts = tokenize(text_corpus)

In [66]:

pprint.pprint(texts)               #  List of list. The inner list
                    #  contains tokens of respective documents
len(texts)

[['human',
  'machine',
  'at',
  'interface',
  'lab',
  'abc',
  'computer',
  'applications'],
 ['survey', 'at', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph',
  'minors',
  'iv',
  'widths',
  'trees',
  'well',
  'at',
  'quasi',
  'ordering'],
 ['graph', 'minors', 'survey']]


9

In [67]:
# Should you like, you can 
# unnest the list of lists:
ft = flatten(texts)
print(ft)
vocab = set(ft)
print("\n---Vocab-----\n")
print(vocab)
print("\n---str length-----\n")
len(ft)   # 55
print("\n---vocab length-----\n")
len(vocab)  # 36

['human', 'machine', 'at', 'interface', 'lab', 'abc', 'computer', 'applications', 'survey', 'at', 'user', 'opinion', 'computer', 'system', 'response', 'time', 'eps', 'user', 'interface', 'management', 'system', 'system', 'human', 'system', 'engineering', 'testing', 'eps', 'relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement', 'generation', 'random', 'binary', 'unordered', 'trees', 'intersection', 'graph', 'paths', 'trees', 'graph', 'minors', 'iv', 'widths', 'trees', 'well', 'at', 'quasi', 'ordering', 'graph', 'minors', 'survey']

---Vocab-----

{'opinion', 'measurement', 'machine', 'system', 'eps', 'iv', 'at', 'response', 'survey', 'computer', 'generation', 'lab', 'ordering', 'graph', 'minors', 'trees', 'widths', 'abc', 'random', 'applications', 'engineering', 'binary', 'unordered', 'quasi', 'interface', 'intersection', 'paths', 'time', 'error', 'human', 'well', 'management', 'relation', 'user', 'perceived', 'testing'}

---str length-----



55


---vocab length-----



36

In [68]:

# 3.1 The following code is equivalent to the nested for-loops above.
#     Because one list comprehension is nested inside another, the
#     output is not a flat list (as in an ordinary list comprehension)
#     but a list of lists.
#     split() is used without arguments, as in tokenize(), so that
#     runs of whitespace do not produce empty tokens.
texts = [
         [word for word in document.lower().split() if word not in stoplist]
         for document in text_corpus
        ]

texts

[['human',
  'machine',
  'at',
  'interface',
  'lab',
  'abc',
  'computer',
  'applications'],
 ['survey', 'at', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph',
  'minors',
  'iv',
  'widths',
  'trees',
  'well',
  'at',
  'quasi',
  'ordering'],
 ['graph', 'minors', 'survey']]

In [None]:
# Convert a document into a list of lowercase tokens,
# ignoring tokens that are too short or too long.
# Uses gensim.utils.tokenize() internally.
# The result is kept in a separate variable so that
# 'texts' from step 3 is left unchanged.
texts_simple = []
for doc in text_corpus:
    out = simple_preprocess(doc, min_len=3, max_len=15)
    texts_simple.append(out)

texts_simple

In [9]:

# 4.
# Ref : https://www.ludovf.net/blog/python-collections-defaultdict/
#  A defaultdict is just like a regular Python dict,
#  except that it supports an additional argument at
#  initialization: a function. If someone attempts to
#  access a key to which no value has been assigned,
#  that function will be called (without arguments)
#  and its return value is used as the default value
#  for the key.


In [69]:
# 4.1 Initialise and create an empty dictionary
#     by name of 'frequency'
int()          # This function gives 0.
               # Use it in defaultdict
frequency = defaultdict(int)   # defaultdict(int) => key-values are int
                               # defaultdict(list) => key-values are lists
                               # Example: {'a' :['xx','yy'], 'b':['zz']}


0

In [70]:

# 4.2 Get count of each word in the 'documents'
# for every list in lists
for doc in texts:
    # for every word in the inner list
    for token in doc:
        # frequency[token] will first add a key 'token' to the dict
        #  (if the key does not already exist) holding value 0.
        #  In either case the value of the key is incremented by 1.
        #  So after the loop completes, the value of each key
        #  is its frequency
        frequency[token] += 1


In [71]:

print(frequency)

defaultdict(<class 'int'>, {'human': 2, 'machine': 1, 'at': 3, 'interface': 2, 'lab': 1, 'abc': 1, 'computer': 2, 'applications': 1, 'survey': 2, 'user': 3, 'opinion': 1, 'system': 4, 'response': 2, 'time': 2, 'eps': 2, 'management': 1, 'engineering': 1, 'testing': 1, 'relation': 1, 'perceived': 1, 'error': 1, 'measurement': 1, 'generation': 1, 'random': 1, 'binary': 1, 'unordered': 1, 'trees': 3, 'intersection': 1, 'graph': 3, 'paths': 1, 'minors': 2, 'iv': 1, 'widths': 1, 'well': 1, 'quasi': 1, 'ordering': 1})
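As an alternative to the `defaultdict(int)` loop, the same counts can be obtained with `collections.Counter` from the standard library. The sketch below uses a small made-up list of lists (`sample_texts`, a hypothetical name) rather than the notebook's corpus.

```python
from collections import Counter

# A tiny tokenized corpus, used only for illustration
sample_texts = [["he", "he", "to"], ["to", "go"]]

# Count every token across all documents in a single pass
token_counts = Counter(token for doc in sample_texts for token in doc)

# Keep only tokens that occur more than once in the whole corpus
frequent = [[t for t in doc if token_counts[t] > 1] for doc in sample_texts]

print(dict(token_counts))   # {'he': 2, 'to': 2, 'go': 1}
print(frequent)             # [['he', 'he', 'to'], ['to']]
```

Like `defaultdict(int)`, a `Counter` returns 0 for a missing key, but unlike `defaultdict` it does not insert the key on lookup.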


In [72]:
# 4.3 Remove words that appear only once
#     So we create another list of lists
#     texts = [['he','he','to'],['to','go']]
#     frequency={'he' : 2, 'to': 2, 'go': 1}

processed_corpus = []       # outer list
# 4.3.1 For every list in the
#       outer list
for doc in texts:
    # 4.3.2 A blank list of tokens
    #       with higher frequency
    tokens = []
    # 4.3.3 For every word in this
    #       inner list
    for word in doc:
        if frequency[word] > 1:
            tokens.append(word)
    processed_corpus.append(tokens)

In [73]:
processed_corpus

[['human', 'at', 'interface', 'computer'],
 ['survey', 'at', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees', 'at'],
 ['graph', 'minors', 'survey']]

In [74]:
# The above is equivalent to the following:
#  Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
pprint.pprint(processed_corpus)

[['human', 'at', 'interface', 'computer'],
 ['survey', 'at', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees', 'at'],
 ['graph', 'minors', 'survey']]


In [54]:

# 4.4
print(processed_corpus)   #     output = [['he','he','to'],['to']]
print(texts)              # Compare the above with this


[['human', 'at', 'interface', 'computer'], ['survey', 'at', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees', 'at'], ['graph', 'minors', 'survey']]
[['human', 'machine', 'at', 'interface', 'lab', 'abc', 'computer', 'applications'], ['survey', 'at', 'user', 'opinion', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'management', 'system'], ['system', 'human', 'system', 'engineering', 'testing', 'eps'], ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'], ['generation', 'random', 'binary', 'unordered', 'trees'], ['intersection', 'graph', 'paths', 'trees'], ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'at', 'quasi', 'ordering'], ['graph', 'minors', 'survey']]


In [80]:
text_corpus

['Human machine at interface for lab abc computer applications',
 'A survey at of user opinion of computer system response time',
 'The EPS user interface management system',
 'System and human system engineering testing of EPS',
 'Relation of user perceived response time to error measurement',
 'The generation of random binary unordered trees',
 'The intersection graph of paths in trees',
 'Graph minors IV Widths of trees and well at quasi ordering',
 'Graph minors A survey']

Before proceeding further, we want to associate each word in the corpus with a unique integer ID. We can do this using the gensim.corpora.Dictionary class. This dictionary defines the vocabulary of all words that our processing knows about.
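To make the idea of a word-to-ID mapping concrete, here is a minimal plain-Python sketch. It is not how `gensim.corpora.Dictionary` is implemented; the function name `build_token2id` is hypothetical, and IDs are assigned here in order of first appearance, which may differ from the IDs Gensim assigns.

```python
def build_token2id(tokenized_docs):
    """Hypothetical helper: map each distinct token to a unique integer ID."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                # The next free ID equals the current vocabulary size
                token2id[token] = len(token2id)
    return token2id


sample_docs = [["graph", "trees"], ["graph", "minors", "trees"]]
token2id = build_token2id(sample_docs)

# The reverse mapping turns IDs back into words
id2token = {token_id: token for token, token_id in token2id.items()}

print(token2id)   # {'graph': 0, 'trees': 1, 'minors': 2}
print(id2token)   # {0: 'graph', 1: 'trees', 2: 'minors'}
```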

In [42]:

# 5. Class 'corpora.Dictionary' implements the concept of
#     Dictionary – a mapping between words and their integer ids.
#    Ref: https://radimrehurek.com/gensim/corpora/dictionary.html
dictionary = corpora.Dictionary(processed_corpus)


In [43]:

# 5.1
dictionary      # Just informs where it is stored in memory


<gensim.corpora.dictionary.Dictionary at 0x7f1217db0fd0>

In [17]:

# 5.2
print(dictionary.token2id)      # token2id is a dict attribute; the reverse mapping, id2token, is filled in lazily
 

{'at': 0, 'computer': 1, 'human': 2, 'interface': 3, 'response': 4, 'survey': 5, 'system': 6, 'time': 7, 'user': 8, 'eps': 9, 'trees': 10, 'graph': 11, 'minors': 12}


In [18]:

# 5.3 Convert each document into the bag-of-words (bow)
#      format, i.e. a list of (integer-token, token_count) per document.
#      A bag-of-words does not give integer tokens in the order
#      they occur in the sentence but in increasing order of integer values.
#      Thus, it is just a bag.
bag = [dictionary.doc2bow(text) for text in processed_corpus]
bag


[[(0, 1), (1, 1), (2, 1), (3, 1)],
 [(0, 1), (1, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
 [(3, 1), (6, 1), (8, 1), (9, 1)],
 [(2, 1), (6, 2), (9, 1)],
 [(4, 1), (7, 1), (8, 1)],
 [(10, 1)],
 [(10, 1), (11, 1)],
 [(0, 1), (10, 1), (11, 1), (12, 1)],
 [(5, 1), (11, 1), (12, 1)]]
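The bag-of-words format can be reproduced in a few lines of plain Python, which may help clarify what the `(id, count)` pairs mean. This is a sketch of the concept only, not Gensim's code; the helper name `to_bow` and the small `token2id` mapping are assumptions made for this example.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Hypothetical helper: return sorted (token_id, count) pairs for one document."""
    # Tokens missing from the vocabulary are silently skipped
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    # Sorting by ID is what discards the original word order
    return sorted(counts.items())


token2id = {"system": 0, "human": 1, "eps": 2}
bow = to_bow(["system", "human", "system", "eps", "unknown"], token2id)
print(bow)   # [(0, 2), (1, 1), (2, 1)]
```

The pair `(0, 2)` records that 'system' occurs twice, but not where in the document it occurs.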

In [19]:

# 5.4 Just separate integer-tokens from frequency
id_text = []
for doc in bag:
    doclist = []
    for token_id, _ in doc:   # 'token_id' avoids shadowing the built-in id()
        doclist.append(token_id)
    id_text.append(doclist)

id_text

[[0, 1, 2, 3],
 [0, 1, 4, 5, 6, 7, 8],
 [3, 6, 8, 9],
 [2, 6, 9],
 [4, 7, 8],
 [10],
 [10, 11],
 [0, 10, 11, 12],
 [5, 11, 12]]
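IDs taken from a bag-of-words are sorted and de-duplicated, so they do not reflect the order of words in the sentence. When an order-preserving integer sequence is needed (for example as input to a sequence model), the tokens can be mapped one by one instead. The sketch below assumes a small hand-written `token2id` mapping and a hypothetical helper name `to_id_sequence`; Gensim's `Dictionary` also offers a `doc2idx` method for this purpose, which may be worth checking in the documentation for your version.

```python
def to_id_sequence(tokens, token2id):
    """Hypothetical helper: map tokens to IDs, keeping order and repeats."""
    # Tokens that are not in the vocabulary are dropped
    return [token2id[t] for t in tokens if t in token2id]


token2id = {"system": 0, "human": 1, "eps": 2}
sequence = to_id_sequence(["system", "human", "system", "eps"], token2id)
print(sequence)   # [0, 1, 0, 2]
```

Compare this with the bag-derived IDs for the same document, which would be `[0, 1, 2]`: the repeated 'system' and the word order are both lost there.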

In [20]:

#####################################
"""
Dictionaries in Python:
=======================

Module collections has two types of dictionaries:
i) OrderedDict and ii) defaultdict

OrderedDict in Python
--------------------
    An OrderedDict is a dictionary subclass that remembers
    the order in which keys were first inserted. Since Python 3.7
    a regular dict also preserves insertion order, so iterating
    over either gives the items in the order they were inserted.
    OrderedDict still differs in a few ways: equality between two
    OrderedDicts is order-sensitive, and it provides extra methods
    such as move_to_end() for reordering keys.

defaultdict
------------
    A defaultdict works like a normal dict, but it is initialized
    with a function (called “default factory”) that takes no arguments
    and provides the default value for a nonexistent key.
    With a default factory set, looking up d[key] never raises a
    KeyError: a missing key is inserted with the value returned by
    the default factory. (Note that d.get(key) does not do this.)

    from collections import defaultdict
    # Create a defaultdict with an initialization
    #  function:
    ice_cream = defaultdict(lambda: 'Vanilla')
    # Insert a key-value pair
    ice_cream['Sarah'] = 'Chunky Monkey'
    ice_cream['Sarah']    # Returns 'Chunky Monkey'
    ice_cream['Joe']      # Key nonexistent. Returns 'Vanilla'

"""

