<a href="https://colab.research.google.com/github/harnalashok/deeplearning-sequences/blob/main/0_basic_document_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Last amended: 07th March, 2021
# Ref: https://radimrehurek.com/gensim/tut1.html#from-strings-to-vectors
#      https://www.tutorialspoint.com/gensim/gensim_creating_a_bag_of_words_corpus.htm
#
# Objective(s):
#         A. Familiarising with Document processing
#            using gensim.
#         B. Convert tokens within each document to corresponding
#            'token-ids' or integer-tokens.
#            (For text cleaning, please refer to the wikiclustering file
#            in folder: 10.nlp_workshop/text_clustering)
#            (Keras also has a Tokenizer class that can be
#            used for integer-tokenization. See file:
#            8.rnn/3.keras_tokenizer_class.py
#            nltk can also tokenize. See file:
#            10.nlp_workshop/word2vec/nlp_workshop_word2vec.py)
#         C. Creating a Bag-of-words model
#         D. Discovering document similarity



The core concepts are:  
**Document**  
A document is an object of the text sequence type (commonly known as `str` in Python 3). A document could be anything from a short 140-character tweet, a single paragraph (e.g., a journal article abstract), a news article, or a book.  
`document = "Human machine interface for lab abc computer applications"`

**Corpus**  
A corpus is a collection of Document objects. Corpora serve two roles in Gensim:
1.  Input for training a model. During training, the model uses this training corpus to look for common themes and topics, initializing its internal model parameters. Gensim focuses on unsupervised models, so no human intervention, such as costly annotation or tagging documents by hand, is required.

2.  Documents to organize. After training, a topic model can be used to extract topics from new documents (documents not seen in the training corpus). Such corpora can be indexed for similarity queries, queried by semantic similarity, clustered, etc.


In [27]:
#0.0 Upgrade existing gensim
!pip install --upgrade gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/5c/4e/afe2315e08a38967f8a3036bbe7e38b428e9b7a90e823a83d0d49df1adf5/gensim-3.8.3-cp37-cp37m-manylinux1_x86_64.whl (24.2MB)
     |████████████████████████████████| 24.2MB 45.1MB/s
Installing collected packages: gensim
  Found existing installation: gensim 3.8.0
    Uninstalling gensim-3.8.0:
      Successfully uninstalled gensim-3.8.0
Successfully installed gensim-3.8.3


In [77]:
## Call libraries

%reset -f

# 1.1  gensim contains tools for Natural Language Processing
#      Module 'corpora' contains sub-modules and methods to
#      work with text documents
from gensim import corpora

# 1.2 defaultdict behaves like an ordinary dict, except that looking up
#     a key that does not exist inserts that key with a default value
#     produced by an initialization function (such as int()).
#     See at the end of code: 'Dictionaries in Python'

from collections import defaultdict

# 1.3
from gensim.utils import simple_preprocess

# 1.4 To unnest a list of lists
from gensim.utils import flatten

# 1.5 pprint does pretty printing
import pprint

# 1.6
import gensim
gensim.__version__


'3.8.3'

In [78]:
# 1.7 Display outputs of multiple commands from a cell:

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


### Preprocessing and tokenizing corpus

In [79]:
# 2. Here is an example corpus. It consists of
#    9 documents, where each document is a string.
#    Most are a single sentence; the first string
#    is a paragraph of 3 sentences.
#    See text_clustering.py file as to how to get this list
#    from a folder of files or a pandas dataframe:

text_corpus = [ 
                "Human machine interface for lab computer applications. Use it at abc. OK.",
                "A survey at of user opinion of computer system response time",
                "The EPS user interface management system",
                "System and human system engineering testing of EPS",
                "Relation of user perceived response time to error measurement",
                "The generation of random binary unordered trees",
                "The intersection graph of paths in trees",
                "Graph minors IV Widths of trees and well at quasi ordering",
                "Graph minors A survey"
                ]


Note  
The above example loads the entire corpus into memory. In practice, corpora may be very large, so loading them into memory may be impossible. Gensim intelligently handles such corpora by streaming them one document at a time. See Corpus Streaming – One Document at a Time for details.
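
As a rough illustration of streaming, the sketch below reads a corpus lazily, one document per line, so that only a single document is held in memory at a time. The class name `StreamedCorpus`, the file name `sample_corpus.txt`, and the one-document-per-line layout are assumptions made for this example; they are not part of this notebook or of gensim.

```python
import os
import tempfile


class StreamedCorpus:
    """Hypothetical helper: yields one tokenized document at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                # Assumption: each line of the file holds exactly one document.
                yield line.lower().split()


# Write a tiny two-document file (illustrative data only) and stream it back.
tmp_dir = tempfile.mkdtemp()
corpus_path = os.path.join(tmp_dir, "sample_corpus.txt")
with open(corpus_path, "w", encoding="utf-8") as handle:
    handle.write("The EPS user interface management system\n")
    handle.write("Graph minors A survey\n")

streamed_docs = list(StreamedCorpus(corpus_path))
print(streamed_docs)
```

Because `__iter__` re-opens the file each time, the same object can be passed to code that needs several passes over the corpus.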

Pre-processing steps:   

After collecting our corpus, there are typically a number of preprocessing steps we want to undertake. We’ll keep it simple and just remove some commonly used English words (such as ‘the’) and words that occur only once in the corpus. In the process of doing so, we’ll tokenize our data. Tokenization breaks up the documents into words (in this case using space as a delimiter).

In [7]:
# 2.1 Clean documents: See file text_clustering.py
# 2.2 Stem documents : See file text_clustering.py

In [80]:
# 2.3 Lowercase each document, split it by white space and filter out stopwords

# 2.3.1
#     Create some list of stopwords that we do not want
#     Detailed list of english stopwords is available at:
#     https://gist.github.com/sebleier/554280

#     We are not including the word 'at' here:

stoplist = set('for a of the and to in'.split())


In [81]:
# 3. Tokenize--I
#    Define our own tokenize function.
#    This function parses a list of strings into a
#    list of words for each element (document)
#    in the document collection.
def tokenize(docs):
    tokenized = []          # 1st list: this will be a list of lists
    for document in docs:   # For each document in the document collection
        tokenized_document = []  # 2nd list: list of words per document
        for word in document.lower().split():
            if word not in stoplist:
                tokenized_document.append(word)  # Keep words that are not stopwords
        tokenized.append(tokenized_document)     # Append list of words to the outer list
    return tokenized


In [82]:
# 3.1 Apply the above function
#     to tokenize (parse) the corpus:

texts = tokenize(text_corpus)

#3.1.1 
pprint.pprint(texts)   #  List of list. The inner list
                       #  contains tokens of respective documents
len(texts)

[['human',
  'machine',
  'interface',
  'lab',
  'computer',
  'applications.',
  'use',
  'it',
  'at',
  'abc.',
  'ok.'],
 ['survey', 'at', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph',
  'minors',
  'iv',
  'widths',
  'trees',
  'well',
  'at',
  'quasi',
  'ordering'],
 ['graph', 'minors', 'survey']]


9

In [83]:
# 3.1.2 Should you like, you can 
#       unnest the list of lists:

ft = flatten(texts)
print(ft)
vocab = set(ft)
print("\n---Vocab-----\n")
print(vocab)
print("\n---str length-----\n")
len(ft)   # 58
print("\n---vocab length-----\n")
len(vocab)  # 39

['human', 'machine', 'interface', 'lab', 'computer', 'applications.', 'use', 'it', 'at', 'abc.', 'ok.', 'survey', 'at', 'user', 'opinion', 'computer', 'system', 'response', 'time', 'eps', 'user', 'interface', 'management', 'system', 'system', 'human', 'system', 'engineering', 'testing', 'eps', 'relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement', 'generation', 'random', 'binary', 'unordered', 'trees', 'intersection', 'graph', 'paths', 'trees', 'graph', 'minors', 'iv', 'widths', 'trees', 'well', 'at', 'quasi', 'ordering', 'graph', 'minors', 'survey']

---Vocab-----

{'lab', 'quasi', 'testing', 'widths', 'human', 'graph', 'measurement', 'at', 'well', 'engineering', 'random', 'relation', 'machine', 'perceived', 'interface', 'applications.', 'use', 'minors', 'binary', 'opinion', 'response', 'time', 'management', 'error', 'unordered', 'system', 'paths', 'abc.', 'trees', 'generation', 'user', 'eps', 'intersection', 'ok.', 'it', 'iv', 'computer', 'survey', 'ordering'}



58


---vocab length-----



39

In [32]:
# 3.2 Tokenize--II
#     The following code is equivalent to the nested for-loops above.
#     Because one list comprehension is nested inside another, the
#     output is not a flat list (as in an ordinary list comprehension)
#     but a list of lists.

texts = [
         [word  for word in document.lower().split(' ') if word not in stoplist]
         for document in text_corpus 
        ]

# 3.2.1
texts

[['human',
  'machine',
  'interface',
  'lab',
  'computer',
  'applications.',
  'use',
  'it',
  'at',
  'abc.',
  'ok.'],
 ['survey', 'at', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph',
  'minors',
  'iv',
  'widths',
  'trees',
  'well',
  'at',
  'quasi',
  'ordering'],
 ['graph', 'minors', 'survey']]

In [84]:
# 3.3 Tokenize--III
#     Convert a document into a list of lowercase tokens,
#     ignoring tokens that are too short or too long.
#     simple_preprocess() uses gensim.utils.tokenize() internally
#     (not our own tokenize() above). It strips punctuation but
#     does not remove stopwords.

my_texts = []

for doc in text_corpus:
    out = simple_preprocess(doc, min_len=3, max_len=15)
    my_texts.append(out)

my_texts

[['human',
  'machine',
  'interface',
  'for',
  'lab',
  'computer',
  'applications',
  'use',
  'abc'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['the', 'eps', 'user', 'interface', 'management', 'system'],
 ['system', 'and', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['the', 'generation', 'random', 'binary', 'unordered', 'trees'],
 ['the', 'intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'widths', 'trees', 'and', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]

#### Counting frequency of occurrence

Once a corpus has been tokenized, we can begin counting the frequency of occurrence of each token within the corpus. As we will see below, the vectorization process does this automatically.
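
As a side note, the same counting can be sketched with `collections.Counter` from the standard library. This is an alternative to the `defaultdict` approach used in the cells that follow, not the method this notebook relies on, and the `sample_texts` list is made-up data for illustration.

```python
from collections import Counter

# Assumption: a tiny hand-made list of tokenized documents, not the notebook's corpus.
sample_texts = [["he", "he", "to"], ["to", "go"]]

# Counter.update() adds one count for every token in the iterable it receives.
sample_frequency = Counter()
for doc in sample_texts:
    sample_frequency.update(doc)

print(sample_frequency["he"], sample_frequency["to"], sample_frequency["go"])  # 2 2 1
```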

In [85]:
# 4.
# Ref : https://www.ludovf.net/blog/python-collections-defaultdict/

#  A defaultdict is just like a regular Python dict,
#  except that it supports an additional argument at
#  initialization: a function. If someone attempts to
#  access a key to which no value has been assigned,
#  that function will be called (without arguments)
#  and its return value is used as the default value
#  for the key.


In [86]:
# 4.1 Initialise and create an empty dictionary
#     by name of 'frequency'

# 4.1.1
int()          # This function gives 0.
               # Use it in defaultdict
# 4.1.2               
frequency = defaultdict(int)   # defaultdict(int) => key-values are int
                               # defaultdict(list) => key-values are lists
                               # Example: {'a' :['xx','yy'], 'b':['zz']}

# 4.1.3                               
frequency

0

defaultdict(int, {})

In [87]:
# 4.2 Get count of each word in the 'documents'
#     for every list in lists

for doc in texts:
    # for every word in the inner list
    for token in doc:
        # frequency[token] first adds the key 'token' to the dict
        # (if the key does not already exist) holding value 0.
        # In either case the value of the key is incremented by 1.
        # So after the loop is completed, the value of each key
        # shows its frequency.
        frequency[token] += 1


In [88]:
# 4.2.1
print(frequency)

defaultdict(<class 'int'>, {'human': 2, 'machine': 1, 'interface': 2, 'lab': 1, 'computer': 2, 'applications.': 1, 'use': 1, 'it': 1, 'at': 3, 'abc.': 1, 'ok.': 1, 'survey': 2, 'user': 3, 'opinion': 1, 'system': 4, 'response': 2, 'time': 2, 'eps': 2, 'management': 1, 'engineering': 1, 'testing': 1, 'relation': 1, 'perceived': 1, 'error': 1, 'measurement': 1, 'generation': 1, 'random': 1, 'binary': 1, 'unordered': 1, 'trees': 3, 'intersection': 1, 'graph': 3, 'paths': 1, 'minors': 2, 'iv': 1, 'widths': 1, 'well': 1, 'quasi': 1, 'ordering': 1})


In [89]:
# 4.3 Remove words that appear only once
#     So we create another list of lists. For example:
#     texts = [['he','he','to'],['to','go']]
#     frequency={'he' : 2, 'to': 2, 'go': 1}

# 4.3.1
processed_corpus = []       # outer list

# 4.3.2 For every list in the
#       outer list
for doc in texts:
    # 4.3.3 A blank list of tokens
    #       with higher frequency
    tokens = []

    # 4.3.4 For every word in this
    #       inner list
    for word in doc:
        if frequency[word] > 1:
            tokens.append(word)
    # 4.3.5
    processed_corpus.append(tokens)

In [90]:
# 4.3.6 Processed output
processed_corpus

[['human', 'interface', 'computer', 'at'],
 ['survey', 'at', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees', 'at'],
 ['graph', 'minors', 'survey']]

In [91]:
# 4.4 The above is equivalent to the following:
#     Only keep words that appear more than once

processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
pprint.pprint(processed_corpus)
print("\n-----Compare-----\n")
print(texts)              # Compare the above with this

[['human', 'interface', 'computer', 'at'],
 ['survey', 'at', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees', 'at'],
 ['graph', 'minors', 'survey']]

-----Compare-----

[['human', 'machine', 'interface', 'lab', 'computer', 'applications.', 'use', 'it', 'at', 'abc.', 'ok.'], ['survey', 'at', 'user', 'opinion', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'management', 'system'], ['system', 'human', 'system', 'engineering', 'testing', 'eps'], ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'], ['generation', 'random', 'binary', 'unordered', 'trees'], ['intersection', 'graph', 'paths', 'trees'], ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'at', 'quasi', 'ordering'], ['graph', 'minors', 'survey']]


### Associating word with ID

Before proceeding further, we want to associate each word in the corpus with a unique integer ID. We can do this using the gensim.corpora.Dictionary class. This dictionary defines the vocabulary of all words that our processing knows about.

In [92]:
# 5. Module 'corpora.Dictionary' implements the concept of
#     Dictionary – a mapping between words and their integer ids.
#    Ref: https://radimrehurek.com/gensim/corpora/dictionary.html

dictionary = corpora.Dictionary(processed_corpus)


In [101]:
# 5.1
dictionary      # Just shows where it is stored in memory


# 5.1.1
print("\n\n---Words and corresponding IDs------\n")
list(dictionary.values())

# 5.1.2
print("\n\n--------------\n")
dictionary.keys()


# Id of 'computer' is 1
# Id of 'human' is 2
# Id of 'system' is 6
# Id of 'minors' is 12
# Word 'interaction' is absent

<gensim.corpora.dictionary.Dictionary at 0x7fe8d95f1e10>



---Words and corresponding IDs------



['at',
 'computer',
 'human',
 'interface',
 'response',
 'survey',
 'system',
 'time',
 'user',
 'eps',
 'trees',
 'graph',
 'minors']



--------------



[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

### Vectorization--Get bag of words

To infer the latent structure in our corpus we need a way to represent documents that we can manipulate mathematically. One approach is to represent each document as a vector of features. For example, a single feature may be thought of as a question-answer pair:

1. How many times does the word splonge appear in the document? Zero.
2. How many paragraphs does the document consist of? Two.
3. How many fonts does the document use? Five.

#### One way

The question is usually represented only by its integer id (such as 1, 2 and 3). The representation of this document then becomes a series of pairs like (1, 0.0), (2, 2.0), (3, 5.0). This is known as a dense vector, because it contains an explicit answer to each of the above questions.

If we know all the questions in advance, we may leave them implicit and simply represent the document as (0, 2, 5). This sequence of answers is the vector for our document (in this case a 3-dimensional dense vector). For practical purposes, only questions to which the answer is (or can be converted to) a single floating point number are allowed in Gensim.

In practice, vectors often consist of many zero values. To save memory, Gensim omits all vector elements with value 0.0. The above example thus becomes (2, 2.0), (3, 5.0). This is known as a sparse vector or bag-of-words vector. The values of all missing features in this sparse representation can be unambiguously resolved to zero, 0.0.
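
The dense-to-sparse step described above can be sketched in plain Python. The variable names are illustrative only, and this is not how Gensim stores vectors internally.

```python
# The dense vector from the text: answers to questions 1, 2 and 3.
dense = [(1, 0.0), (2, 2.0), (3, 5.0)]

# Keep only the non-zero answers to obtain the sparse form.
sparse = [(question_id, value) for question_id, value in dense if value != 0.0]
print(sparse)  # [(2, 2.0), (3, 5.0)]

# Any question id missing from the sparse form is read back as 0.0.
lookup = dict(sparse)
print(lookup.get(1, 0.0))  # 0.0
```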

Assuming the questions are the same, we can compare the vectors of two different documents to each other. For example, assume we are given two vectors (0.0, 2.0, 5.0) and (0.1, 1.9, 4.9). Because the vectors are very similar to each other, we can conclude that the documents corresponding to those vectors are similar, too. Of course, the correctness of that conclusion depends on how well we picked the questions in the first place.
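
One common way to put a number on "very similar" is cosine similarity. The text does not name a measure at this point, so treating cosine similarity as the measure is an assumption of this sketch, and `cosine_similarity` is a helper written for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vec_a = (0.0, 2.0, 5.0)
vec_b = (0.1, 1.9, 4.9)

similarity = cosine_similarity(vec_a, vec_b)
print(similarity)  # close to 1.0, i.e. the two vectors point in nearly the same direction
```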

#### Bag of words
Another approach to represent a document as a vector is the bag-of-words model. Under the bag-of-words model each document is represented by a vector containing the frequency counts of each word in the dictionary. For example, assume we have a dictionary containing the words ['coffee', 'milk', 'sugar', 'spoon']. A document consisting of the string "coffee milk coffee" would then be represented by the vector [2, 1, 0, 0] where the entries of the vector are (in order) the occurrences of “coffee”, “milk”, “sugar” and “spoon” in the document. The length of the vector is the number of entries in the dictionary. One of the main properties of the bag-of-words model is that it completely ignores the order of the tokens in the document that is encoded, which is where the name bag-of-words comes from.
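
The coffee example can be reproduced with a few lines of plain Python. This is only a sketch of the idea; it builds a dense vector, whereas Gensim's `doc2bow` returns a sparse one.

```python
from collections import Counter

# Dictionary (vocabulary) and document taken from the example in the text.
vocabulary = ["coffee", "milk", "sugar", "spoon"]
document = "coffee milk coffee"

# Count each token, then read the counts off in vocabulary order.
# Counter returns 0 for words that never occur in the document.
counts = Counter(document.split())
bow_vector = [counts[word] for word in vocabulary]

print(bow_vector)  # [2, 1, 0, 0]
```

Shuffling the words in `document` leaves `bow_vector` unchanged, which is the order-insensitivity the name "bag" refers to.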

Our processed corpus has 13 unique words in it, which means that each document will be represented by a 13-dimensional vector under the bag-of-words model. We can use the dictionary to turn tokenized documents into these 13-dimensional vectors. We can see what these IDs correspond to:

In [102]:
# 5.2 Another way to read word-Id pairs
print(dictionary.token2id)      # Another function is id2token
 

{'at': 0, 'computer': 1, 'human': 2, 'interface': 3, 'response': 4, 'survey': 5, 'system': 6, 'time': 7, 'user': 8, 'eps': 9, 'trees': 10, 'graph': 11, 'minors': 12}


#### Vectorize an arbitrary document

Based upon our available corpus, we can vectorize any arbitrary document.  

For example, suppose we wanted to vectorize the phrase “Human computer to computer to computer interaction” (note that this phrase was not in our original corpus). We can create the bag-of-words representation for a document using the `doc2bow` method of the dictionary, which returns a sparse representation of the word counts:

In [103]:
# 6.0 Transform to BOW
new_doc = "Human computer to computer to computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

[(1, 3), (2, 1)]


The first entry in each tuple corresponds to the ID of the token in the dictionary, the second corresponds to the count of this token.   

Note that “interaction” did not occur in the original corpus and so it was not included in the vectorization. Also note that this vector only contains entries for words that actually appeared in the document. Because any given document will only contain a few words out of the many words in the dictionary, words that do not appear in the vectorization are represented as implicitly zero as a space saving measure.
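
To make the behaviour described above concrete, here is a plain-Python sketch that mimics what `doc2bow` returns for a fixed dictionary. The function `sketch_doc2bow` is a hypothetical stand-in written for this example, not Gensim's implementation; the `token2id` mapping is copied from the output shown earlier.

```python
from collections import Counter

# Word-to-id mapping copied from the dictionary.token2id output above.
token2id = {'at': 0, 'computer': 1, 'human': 2, 'interface': 3, 'response': 4,
            'survey': 5, 'system': 6, 'time': 7, 'user': 8, 'eps': 9,
            'trees': 10, 'graph': 11, 'minors': 12}


def sketch_doc2bow(tokens, mapping):
    """Count known tokens by id; tokens missing from the mapping are dropped."""
    counts = Counter(mapping[token] for token in tokens if token in mapping)
    # Sort by id, so the original word order of the document is lost.
    return sorted(counts.items())


new_doc = "Human computer to computer to computer interaction"
sketch_vec = sketch_doc2bow(new_doc.lower().split(), token2id)
print(sketch_vec)  # [(1, 3), (2, 1)]
```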

We can convert our entire original corpus to a list of vectors:

In [104]:
# 6.1 Convert each document into the bag-of-words (bow)
#      format, i.e. a list of (integer-token, token_count) pairs per document.
#      The bag-of-words does not keep the integer tokens in the order
#      in which they occur in the sentence but in increasing order of id.
#      Thus, it is just a bag.

bag = [dictionary.doc2bow(text) for text in processed_corpus]
bag


[[(0, 1), (1, 1), (2, 1), (3, 1)],
 [(0, 1), (1, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
 [(3, 1), (6, 1), (8, 1), (9, 1)],
 [(2, 1), (6, 2), (9, 1)],
 [(4, 1), (7, 1), (8, 1)],
 [(10, 1)],
 [(10, 1), (11, 1)],
 [(0, 1), (10, 1), (11, 1), (12, 1)],
 [(5, 1), (11, 1), (12, 1)]]

#### Document vs Vector   

The distinction between a document and a vector is that the former is text, and the latter is a mathematically convenient representation of the text. Sometimes, people will use the terms interchangeably: for example, given some arbitrary document D, instead of saying “the vector that corresponds to document D”, they will just say “the vector D” or the “document D”. This achieves brevity at the cost of ambiguity.

As long as you remember that documents exist in document space, and that vectors exist in vector space, the above ambiguity is acceptable.

### Model

Now that we have vectorized our corpus, we can begin to transform it using models. We use model as an abstract term referring to a transformation from one document representation to another. In gensim, documents are represented as vectors, so a model can be thought of as a transformation between two vector spaces. The model learns the details of this transformation during training, when it reads the training corpus.

One simple example of a model is tf-idf. The tf-idf model transforms vectors from the bag-of-words representation to a vector space where the frequency counts are weighted according to the relative rarity of each word in the corpus.

Here’s a simple example. Let’s initialize the tf-idf model, training it on our corpus and transforming the string “system minors”:

In [65]:
# 7.1
from gensim import models

# 7.2 train the model
tfidf = models.TfidfModel(bag)

# 7.3 Transform the "system minors" string:

words = "system minors".lower().split()
print(tfidf[dictionary.doc2bow(words)])

[(6, 0.5898341626740045), (12, 0.8075244024440723)]


The tfidf model again returns a list of tuples, where the first entry is the token ID and the second entry is the tf-idf weighting. Note that the ID corresponding to "*system*" (which occurred 4 times in the original corpus, in three different documents) has been weighted lower than the ID corresponding to "*minors*" (which occurred only twice, in two different documents).
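
The two weights can be checked by hand. The sketch below assumes the weighting is raw term frequency multiplied by log2(N / document frequency), followed by scaling the vector to unit length; that formula is an assumption about the model's default settings, and the document frequencies are read off the bag-of-words corpus above.

```python
import math

num_docs = 9                               # documents in the corpus
doc_freq = {"system": 3, "minors": 2}      # number of documents containing each word
term_freq = {"system": 1, "minors": 1}     # counts in the query "system minors"

# Assumed weighting: tf * log2(N / df)
raw = {term: term_freq[term] * math.log2(num_docs / doc_freq[term])
       for term in term_freq}

# Scale to unit (L2) length.
norm = math.sqrt(sum(weight * weight for weight in raw.values()))
weights = {term: weight / norm for term, weight in raw.items()}

print(weights)
```

Under these assumptions the values agree with the output above to four decimal places (about 0.5898 for "system" and 0.8075 for "minors").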

### Similarity of documents
Once you’ve created the model, you can do all sorts of cool stuff with it. For example, to transform the whole corpus via TfIdf and index it, in preparation for similarity queries:

In [70]:
# 7.4
from gensim import similarities

# 7.5 Create a special data structure, 'index', 
#     to quickly get a similarity score

index = similarities.SparseMatrixSimilarity(tfidf[bag], num_features=13)

To query the similarity of our query document `query_document` against every document in the corpus:

In [71]:
# 7.6 Document about which we want to calculate
#     similarity
query_document = 'system engineering'.split()

# 7.6.1 Get Bag-of-words representation of this document
#       ('engineering' is not in the dictionary, so only
#       'system' contributes to the query vector)
query_bow = dictionary.doc2bow(query_document)

# 7.6.2 Calculate similarity with respect to
#       each document in the corpus

sims = index[tfidf[query_bow]]
print(list(enumerate(sims)))

[(0, 0.0), (1, 0.3086447), (2, 0.41707572), (3, 0.7184812), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]


How to read this output?  

Total documents: 9.  
Document 3 has a similarity score of 0.718 (about 72%),  
Document 2 has a similarity score of about 42%, etc.   
We can make this slightly more readable by sorting:

In [72]:
# 7.7 Sorted document similarity:

for document_number, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True):
    print(document_number, score)

3 0.7184812
2 0.41707572
1 0.3086447
0 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0


In [20]:

#####################################
"""
Dictionaries in Python:
=======================

The collections module offers, among others, two dictionary types:
i) OrderedDict and ii) defaultdict

OrderedDict in Python
--------------------
    An OrderedDict is a dictionary subclass that remembers
    the order in which keys were first inserted. Since
    Python 3.7 a regular dict also preserves insertion order,
    so iterating either one returns items in the order they
    were inserted. OrderedDict remains useful for order-sensitive
    equality comparisons and for methods such as move_to_end().

defaultdict
------------
    A defaultdict works exactly like a normal dict, but it is initialized
    with a function (called “default factory”) that takes no arguments
    and provides the default value for a nonexistent key.
    A defaultdict does not raise a KeyError on lookup with d[key]: any key
    that does not exist gets the value returned by the default function.

    from collections import defaultdict
    # Create a defaultdict with an initialization
    #  function:
    ice_cream = defaultdict(lambda: 'Vanilla')
    # Insert a key-value pair
    ice_cream['Sarah'] = 'Chunky Monkey'
    ice_cream['Sarah']    # Returns 'Chunky Monkey'
    ice_cream['Joe']      # Key nonexistent. Returns 'Vanilla'

"""



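In [68]:
# get_similarities() requires a query argument and is not meant to be
# called directly (calling it without one raises a TypeError).
# Use the index[query] syntax instead:
index[tfidf[query_bow]]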