# Step 9 xx

|**[Overview](#Overview)**|**[Installation](#Installation)**|**[Prior-steps](#Prior-steps)**|**[How-to-use](#How-to-use)**|**[Next-steps](#Next-steps)**|**[Acknowledgments](#Acknowledgments)**|

# Overview
This step takes the submitted library (the corpus of documents).

It creates a bag-of-words (BOW) model, which is a vector representation of each document, with as many dimensions as there are frequent words in the library.

This is not very useful for document comparison in itself, so the BOW representation is transformed into a TF-IDF model.

This model is saved so that it can be applied later, to:
- directly run similarity queries between documents
- create a topic model (LSI)

An index of the current library, in TF-IDF format, is also saved for future similarity queries within the current library.


# Installation

Check that installation has been completed, as per the [README](https://github.com/lawrencerowland/Data-Model-for-Project-Frameworks/blob/master/Project-frameworks-by-using-NLP-with-Python-libraries/README.md).

# Prior-steps
Complete Step 5 before running this step.
# How-to-use

## Import modules

In [1]:
#hide
from gensim.parsing.preprocessing import strip_multiple_whitespaces
from gensim import corpora, models, similarities

unable to import 'smart_open.gcs', disabling that module


## Import Whole library with json

In [8]:
#hide
import os
directory= "/Users/lawrence/Documents/GitHub/Data-Model-for-Project-Frameworks/Project-frameworks-by-using-NLP-with-Python-libraries/Interim-results/"
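
The absolute path above is specific to one machine. The following is a minimal sketch of a more portable alternative. It assumes the notebook is run from the folder that contains `Interim-results`; that relative location and the name `results_dir` are assumptions for illustration, not something the original states.

```python
import os
from pathlib import Path

# Assumption: the current working directory contains the "Interim-results" folder.
results_dir = Path("Interim-results")

# Build file paths with "/" rather than string concatenation.
corpus_path = results_dir / "Corpus_as_list.json"

# To leave the later cells unchanged, keep a string that ends with a path separator.
directory = str(results_dir) + os.sep
print(corpus_path)
```

With this in place, `directory+"Corpus_as_list.json"` in the later cells resolves to the same file as `corpus_path`.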

In [3]:
#hide
import json
with open(directory+"Corpus_as_list.json", "r") as read_file:
      Corpus_as_list= json.load(read_file)

## Identify tokens and make-up a dictionary

In [4]:
# remove common words and tokenize
# Here we can add in some odd words we find in the output, or use the NLTK list
stoplist = set('for a of the and to in \uf06e  • \uf0b7 \uf0b7 \uf06e uf09 \uf09f'.split())
Tokens_in_Corpus = [[word for word in document.lower().split() if word not in stoplist]
         for document in Corpus_as_list]
# This saves tokens as a list, one entry per document
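
To see what the cell above does, here is a small sketch on a toy corpus. The two sentences and the names `toy_corpus`, `toy_stoplist` and `tokenize` are invented for illustration; the real corpus is the one loaded from JSON.

```python
# Toy stand-in for Corpus_as_list (illustrative data only).
toy_corpus = [
    "The scope of the project",
    "Risk and scope management in projects",
]
toy_stoplist = set("for a of the and to in".split())

def tokenize(document, stoplist):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in document.lower().split() if word not in stoplist]

toy_tokens = [tokenize(doc, toy_stoplist) for doc in toy_corpus]
print(toy_tokens)
# [['scope', 'project'], ['risk', 'scope', 'management', 'projects']]
```

Note that splitting on whitespace does not strip punctuation or merge word forms, so `project` and `projects` remain separate tokens.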

In [5]:
# remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in Tokens_in_Corpus:
    for token in text:
        frequency[token] += 1

Frequent_Tokens_in_Corpus= [[token for token in text if frequency[token] > 1] for text in Tokens_in_Corpus]
# again tokens as a list, one entry per document
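
The same filtering can be sketched on toy data with `collections.Counter`, which is an alternative to the `defaultdict` used above, not a change to the method. The token lists here are invented for illustration.

```python
from collections import Counter

# Toy tokenised corpus (illustrative data only).
toy_tokens = [
    ["scope", "project", "risk"],
    ["risk", "scope", "budget"],
    ["schedule", "risk"],
]

# Count how often each token occurs across the whole corpus.
frequency = Counter(token for text in toy_tokens for token in text)

# Keep only tokens that occur more than once in the corpus.
frequent_tokens = [[token for token in text if frequency[token] > 1]
                   for text in toy_tokens]
print(frequent_tokens)
# [['scope', 'risk'], ['risk', 'scope'], ['risk']]
```

The count is taken over the whole corpus, so a token that appears once in each of two documents is kept.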

In [None]:
from pprint import pprint  # pretty-printer
pprint(Frequent_Tokens_in_Corpus[16])  # index 16 selects one document (the 17th); a slice such as [16:18] would stop before the higher number

In [9]:
# create the dictionary, save it, then view the map from tokens to ids
dictionary = corpora.Dictionary(Frequent_Tokens_in_Corpus)
dictionary.save(directory+'Library.dict')

In [None]:
print(dictionary,"\n\n")
print(dictionary.token2id)
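
Conceptually, the dictionary is a one-to-one map between tokens and integer ids. The following plain-Python sketch shows the idea on invented data. It is not Gensim's implementation, and Gensim may assign the ids in a different order.

```python
# Toy tokenised corpus (illustrative data only).
toy_tokens = [["scope", "risk"], ["risk", "budget"], ["risk"]]

# Give each distinct token the next free integer id, in order of first appearance.
# Assumption: the ordering is arbitrary; only the one-to-one mapping matters.
token2id = {}
for text in toy_tokens:
    for token in text:
        if token not in token2id:
            token2id[token] = len(token2id)

# The reverse map turns ids back into readable tokens.
id2token = {token_id: token for token, token_id in token2id.items()}
print(token2id)
# {'scope': 0, 'risk': 1, 'budget': 2}
```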

## We apply the BOW model back to the current library

i.e. we now re-create each document in our portfolio library as a sparse vector: a list of (token id, count) pairs.
However, we don't get good results for similarity from a bag-of-words model in itself.

In [7]:
# i.e. a list of lists: for each document, a list of (token id, count) pairs for the dictionary tokens it contains
Corpus_as_BOW = [dictionary.doc2bow(text) for text in Frequent_Tokens_in_Corpus]
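
As a rough illustration of what `doc2bow` returns, here is a plain-Python sketch on a hypothetical three-word dictionary. The helper `doc2bow_sketch` is invented for this example and only mimics the default behaviour: tokens missing from the dictionary are ignored, and the result is sorted by token id.

```python
from collections import Counter

# Hypothetical token-to-id map (illustrative only).
token2id = {"budget": 0, "risk": 1, "scope": 2}

def doc2bow_sketch(tokens, token2id):
    """Count known tokens and return sorted (token id, count) pairs."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())

bow = doc2bow_sketch(["risk", "scope", "risk", "unknown"], token2id)
print(bow)
# [(1, 2), (2, 1)]
```

Only tokens that occur in the document appear in the result, which is why the representation is called sparse.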

In [27]:
with open(directory+"Corpus_as_BOW.json", "w") as write_file:
    json.dump(Corpus_as_BOW, write_file)
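
One detail to be aware of when reloading this file: JSON has no tuple type, so the (token id, count) pairs come back as two-element lists. A small sketch with invented data shows the round trip and one way to restore tuples, should a later step need them.

```python
import json

# Toy BOW corpus (illustrative data only).
toy_bow = [[(0, 1), (2, 3)], [(1, 2)]]

serialised = json.dumps(toy_bow)
reloaded = json.loads(serialised)  # pairs are now lists, e.g. [0, 1]

# Convert each pair back to a tuple.
restored = [[tuple(pair) for pair in doc] for doc in reloaded]
print(reloaded)
print(restored)
```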

In [None]:
#hide
for c in Corpus_as_BOW:
    print(c)

## Create TF-IDF model

From Gensim's Quick start [tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html#sphx-glr-auto-examples-core-run-topics-and-transformations-py). 
"Now that we have vectorized our corpus we can begin to transform it using models. We use model as an abstract term referring to a transformation from one document representation to another. In gensim documents are represented as vectors so a model can be thought of as a transformation between two vector spaces. The details of this transformation are learned from the training corpus."

One simple example of a model is tf-idf. The tf-idf model transforms vectors from the bag-of-words representation to a vector space where the frequency counts are weighted according to the relative rarity of each word in the corpus.
Let's initialize the tf-idf model, training it on our corpus.

In [11]:
TFIDF_MODEL= models.TfidfModel(Corpus_as_BOW)
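
To make the weighting concrete, here is a hand-rolled sketch on a toy BOW corpus. It assumes the weighting Gensim documents as its default (raw term count, base-2 logarithm for the inverse document frequency, then normalisation to unit length); it illustrates the idea and is not Gensim's code. All names and data are invented for the example.

```python
import math

# Toy BOW corpus: each document is a list of (token id, count) pairs.
toy_bow = [
    [(0, 1), (1, 2)],
    [(0, 1), (1, 1), (2, 1)],
    [(1, 1)],
]

# Document frequency: the number of documents each token id appears in.
n_docs = len(toy_bow)
doc_freq = {}
for doc in toy_bow:
    for token_id, _count in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

# Rare tokens get a high idf; a token found in every document gets 0.
idf = {token_id: math.log2(n_docs / df) for token_id, df in doc_freq.items()}

def tfidf_sketch(doc):
    """Weight counts by idf, drop zero weights, and scale to unit length."""
    weighted = [(token_id, count * idf[token_id]) for token_id, count in doc]
    weighted = [(token_id, w) for token_id, w in weighted if w > 0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(token_id, w / norm) for token_id, w in weighted] if norm else []

print(tfidf_sketch(toy_bow[1]))
```

Token 1 occurs in all three toy documents, so its weight is zero and it drops out of every vector; the third document, which contains only that token, becomes an empty vector.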

"From now on, tfidf is treated as a read-only object that can be used to convert any vector from the old representation (bag-of-words integer counts) to the new representation (TfIdf real-valued weights):"

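Applying the TF-IDF model back onto the corpus, then saving the model:

In [22]:
Corpus_as_TFIDF = TFIDF_MODEL[Corpus_as_BOW]  # transform every BOW vector into TF-IDF weights
TFIDF_MODEL.save(directory+'model-from-input-library.tfidf')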

## Getting ready for Cosine similarity
This builds and saves a similarity index of the current library (in TF-IDF format).

In [20]:
from gensim import similarities
index = similarities.MatrixSimilarity(Corpus_as_TFIDF)

In [21]:
index.save(directory+'Index_for_corpus_for_similarities.index')
index = similarities.MatrixSimilarity.load(directory+'Index_for_corpus_for_similarities.index')
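
The index is used for cosine similarity queries. As a reminder of what that measure is, here is a plain-Python sketch for two sparse vectors. The function `cosine` and the toy vectors are invented for illustration; this is not how the Gensim index is implemented internally.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (token id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(token_id, 0.0) for token_id, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

doc_x = [(0, 0.6), (2, 0.8)]
doc_y = [(0, 0.6), (2, 0.8)]
doc_z = [(1, 1.0)]

print(cosine(doc_x, doc_y))  # identical documents score close to 1
print(cosine(doc_x, doc_z))  # no shared tokens, so the score is 0
```

Because the measure depends only on the angle between vectors, a long and a short document that use the same words in the same proportions score as highly similar.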

# Next-steps
Go to Step 10 to understand similarities:
- across the whole library
- for new documents

# Postscript (saving model as temporary file)
The model is currently saved in the Interim-results folder, but the following is an alternative.

In [16]:
# persist the model
import tempfile
with tempfile.NamedTemporaryFile(prefix='model-', suffix='.tfidf', delete=False) as tmp:
    TFIDF_MODEL.save(tmp.name)  

In [17]:
loaded_TFIDF_model = models.TfidfModel.load(tmp.name)
os.unlink(tmp.name)

# Acknowledgments
This draws extensively on Gensim's Quick start [tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html#sphx-glr-auto-examples-core-run-topics-and-transformations-py). All that has been done is to apply it to the particular document library submitted in this GitHub folder, in this case ONR.