<a href="https://colab.research.google.com/github/marianelamin/vector-modeling-lsa/blob/master/similarity_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mounting my drive


To mount my Google Drive:

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Change directory to the custom Python module:

In [0]:
cd /content/drive/'My Drive'/NEIU-CS/'cs490 Masters Project'/notebook/nlu_helper

##### Check the working directory

In [0]:
ls /content/drive/'My Drive'/NEIU-CS/'cs490 Masters Project'/notebook/nlu_helper

In [0]:
ls

# Comparing two texts

## Importing modules

#### Importing python modules
These modules help us deal with the data.

In [41]:
from traceback import print_exc
import numpy as np
import pandas as pd
import json
import sys
import os
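# ENGLISH_STOP_WORDS lives in the text module; the standalone stop_words module was removed in newer scikit-learn
from sklearn.feature_extraction import text as stop_words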



#### Importing custom modules
* *utility* has helper functions (JSON loading, logging, results directories).
* *read_model* has the classes needed to load the model, process the text and compare two sentences using cosine similarity.
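
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below is a minimal pure-Python version for illustration only. It is not the `cos_sim` method of `read_model`, whose implementation is not shown here, and the name `cosine_similarity` and the choice to return `0.0` for a zero vector are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector (e.g. no known words) is treated as "no similarity".
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # parallel vectors, close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
```

Values near 1 mean the two vectors point in almost the same direction; values near 0 mean they share almost nothing.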

In [22]:
PROJECT_PATH = os.path.dirname(os.path.abspath('.'))
# print(sys.path)
if PROJECT_PATH not in sys.path:
    sys.path.append(PROJECT_PATH)

from nlu_helper.read_model import MatrixModel, SentenceProcessor
from nlu_helper.utility import load_json2py, create_a_logger, resources_folder, log_and_print, file2pickle
from nlu_helper.utility import create_results_directory_if_needed

print('Load modules from: ', PROJECT_PATH)

Load modules from:  /content/drive/My Drive/NEIU-CS/cs490 Masters Project/notebook


## Loading and Using a model

In [0]:
# check which models have already been uploaded into the google colab
ls resources/all_senate_speeches/ | grep .json

A model is needed to compare two texts. The available models are listed below. The name of each model describes how it was generated:

* `lsa` or `docbyterm`: whether LSA was applied (`lsa`) or not (`docbyterm`)
* if `lsa` was applied, `dim` specifies the number of dimensions
* `count`, `tfidf` or `zeroone` defines the type of scoring used when vectorizing the corpus. *count* refers to the frequency of the word in the document; *zeroone* to whether or not the word occurs in the document; *tfidf* to the term frequency-inverse document frequency (importance of the word in the document).
* `stop` if stopwords were removed from the corpus prior to the model generation
* `RxC`: *R* is the number of rows (words) in the model, *C* the number of columns.

Only the first *C* columns of the reconstructed matrix are kept.
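
The naming convention above can be decoded mechanically. The following sketch is only an illustration: `parse_model_name` is a hypothetical helper, not part of `nlu_helper`, and it assumes that the parts are separated by `-` and that the name ends with `_RxC`, as in the model names used in this notebook. Tokens that the list above does not describe (such as `min_df_2`) are ignored.

```python
import re

SCORINGS = {"count", "tfidf", "zeroone"}


def parse_model_name(name):
    """Split a model filename into the parts described above (hypothetical helper)."""
    # The trailing "_RxC" holds the matrix shape.
    match = re.fullmatch(r"(?P<body>.+)_(?P<rows>\d+)x(?P<cols>\d+)", name)
    if match is None:
        raise ValueError(f"no RxC suffix in {name!r}")
    tokens = match["body"].split("-")
    info = {
        "corpus": tokens[0],
        "lsa": "lsa" in tokens,
        "dim": None,
        "scoring": None,
        "stopwords_removed": "stop" in tokens,
        "rows": int(match["rows"]),
        "cols": int(match["cols"]),
    }
    for token in tokens[1:]:
        if token in SCORINGS:
            info["scoring"] = token
        elif re.fullmatch(r"dim\d+", token):
            info["dim"] = int(token[3:])
    return info


print(parse_model_name("all_senate_speeches-min_df_2-lsa-dim100-count-stop_69414x1000"))
```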

**!! Important !!**

Because of a memory issue, this demo only works (in this notebook) for models that use LSA.

### Selecting the model

In [37]:
model_filename = "all_senate_speeches-min_df_2-lsa-dim100-count-stop_69414x1000" #@param ["all_senate_speeches-min_df_2-lsa-dim100-count_69724x1000", "all_senate_speeches-min_df_2-lsa-dim100-count-stop_69414x1000", "all_senate_speeches-min_df_2-lsa-dim100-tfidf_69724x1000", "all_senate_speeches-min_df_2-lsa-dim100-tfidf-stop_69414x1000", "all_senate_speeches-min_df_2-lsa-dim100-zeroone_69724x1000", "all_senate_speeches-min_df_2-lsa-dim100-zeroone-stop_69414x1000", "all_senate_speeches-min_df_2-lsa-dim300-count-stop_69414x1000", "all_senate_speeches-min_df_2-lsa-dim300-tfidf-stop_69414x1000", "all_senate_speeches-min_df_2-lsa-dim300-zeroone_69724x1000"]

model_filepath = 'all_senate_speeches/'+ model_filename
print('file: ', model_filepath)


try:
  #create the matrix model, this reads and holds the data
  matrix_model = MatrixModel(model_filepath, logger=None)

  #create the processor object, this takes the model.
  sp = SentenceProcessor(matrix_model)

  print('\t- MatrixModel loaded!')
  print('\t- SentenceProcessor ready!')
except Exception:
  print('There was an error with the model.  Either it was not found or you are attempting to use a non-LSA model')
  print_exc()

file:  all_senate_speeches/all_senate_speeches-min_df_2-lsa-dim100-count-stop_69414x1000
	- MatrixModel loaded!
	- SentenceProcessor ready!


### 1. Using the model to compare


In [0]:
sp.args

In [0]:
sentence1 = 'May the forth be with you'
sentence2 = 'cancer begins in the cells'

s1_tokenized = sp.tokenize_sentence(sentence1)
print(s1_tokenized)
sentence1_vector = sp.get_sentence_vector(s1_tokenized)
# print(sentence1_vector)

s2_tokenized = sp.tokenize_sentence(sentence2)
print(s2_tokenized)
sentence2_vector = sp.get_sentence_vector(s2_tokenized)
# print(sentence2_vector)

sp.get_sentence_match(sentence1, sentence2)


In [0]:
# check if the words are in the ENGLISH_STOP_WORDS set
# print(stop_words.ENGLISH_STOP_WORDS)
def which_were_removed(sentence):
  print('\nsentence: ', sentence)
  for word in sentence.split():
    print('\t[x]' if word in stop_words.ENGLISH_STOP_WORDS else '\t[ ]', word)

which_were_removed(sentence1)
which_were_removed(sentence2)

### 2. Automatic multiple comparisons

#### Defining the Model Applier class.
This class applies the selected model and returns the results. We do not need a log file because everything is printed in the notebook.

In [0]:
class ModelApplier:

    def __init__(self, filename):
        # self.logger = create_a_logger('read_model.log')
        # self.logger.info('*************************************** STARTING TO APPLY MODEL')
        # self.logger.info('model file used: ' + filename)
        # self.logger.debug('creating the object of the MatrixModel')
        self.logger = None

        self.matrix_model = MatrixModel(filename, logger=self.logger)

        # self.logger.debug('finish creating the MatrixModel object')
        # print(matrix_model.args)
        self.sp = SentenceProcessor(self.matrix_model, logger=self.logger)
        # self.model_filename = filename

    def paragraph_vs_sentences_similarities(self, real_answer: str, given_answers: list) -> list:
        """
        :param real_answer: a string containing the system's sentence/paragraph that we need to compare against
        :param given_answers: a list of the possible answers that could have been given by the human user;
         each one is a dict with a 'text' key
        :return: a list containing the cosine similarity values computed by comparing the answer to each
         possible answer in the question dictionary
        """
        r_tok = self.sp.tokenize_sentence(real_answer)
        real_sentence_vector = self.sp.get_sentence_vector(r_tok)
        array_sim = list()

        for given_answer in given_answers:
            # print(given_answer)
            g_tok = self.sp.tokenize_sentence(given_answer['text'])
            sim = self.sp.cos_sim(real_sentence_vector,
                                  self.sp.get_sentence_vector(g_tok))
            array_sim.append(sim)
            # print('result: ', sim)

        return array_sim

    def sort_similarity(self, cosine_similarity_results, question: dict) -> dict:
        # Add two more columns with 'id', 'text' tags
        ids = [q['id'] for q in question['possible_answers']]
        texts = [q['text'] for q in question['possible_answers']]
        pdf = pd.DataFrame({'id': ids, 'similarity': np.asarray(cosine_similarity_results), 'text': texts},
                           index=ids)

        sorted_pdf = pdf.sort_values(by='similarity', ascending=False)
        self.generate_log_similarities_report(question, sorted_pdf)

        final_order = [i + 1 for i in range(len(sorted_pdf.values))]
        sorted_pdf.insert(1, 'rank', final_order)

        item = {
            "item_id": question['lp_id'],
            "item_text": question['answer'],
            "result_as_pdf": json.loads(sorted_pdf.to_json()),
            "result_only_id": sorted_pdf.index.tolist()
        }
        return item

    def generate_log_similarities_report(self, question, pdf):
        m = '\n--------------------------------------- \nRESULTS \nCompare: ' + str(question['answer']) \
            + '\nTo: \n' + str(pdf) \
            + '\n---------------------------------------'
        log_and_print(m, self.logger)

    def save_model_results(self, output_model_results, items, path_testfile):
        with open(output_model_results + '.json', 'w') as fp:
            json.dump({
                "testfile": path_testfile,
                "model": {
                    "filename": self.matrix_model.filename,
                    "args": self.matrix_model.args
                },
                "items": items
            },
                fp)



#### Defining multiple sentences

In [0]:
questionaire = [
                {
                  "lp_id": "LP2",
                  "answer": "cancer begins in cells",
                  "possible_answers": [
                      {
                        "id": "SP1", "text": "May the forth be with you."
                      }
                    ]
                }
              ]

#### Getting automatic ordered results

In [79]:
print('Model to use: ', model_filepath)

ma = ModelApplier(model_filepath)

items = list()
for question in questionaire:
    possible_answers = question['possible_answers']
    answer = question['answer']

    if len(possible_answers) != 0:
        cos_sim_results = ma.paragraph_vs_sentences_similarities(real_answer=answer,
                                                                  given_answers=possible_answers)
        print(cos_sim_results)
        item = ma.sort_similarity(cos_sim_results, question)
        items.append(item)


Model to use:  all_senate_speeches/all_senate_speeches-min_df_2-lsa-dim100-tfidf-stop_69414x1000
[0.014349999837577343]

--------------------------------------- 
RESULTS 
Compare: cancer begins in cells
To: 
      id  similarity                        text
SP1  SP1     0.01435  May the forth be with you.
---------------------------------------


In [74]:
print(items)

[{'item_id': 'LP2', 'item_text': 'members of america joined the russian Party.', 'result_as_pdf': {'id': {'SP1': 'SP1'}, 'rank': {'SP1': 1}, 'similarity': {'SP1': 0.8380200267}, 'text': {'SP1': 'In America, you can always find a party.'}}, 'result_only_id': ['SP1']}]
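
The nested dictionaries in `result_as_pdf` are keyed first by column and then by answer id, while `result_only_id` holds the ids in ranked order. A small sketch of how such an item could be read back is shown below. `ranked_answers` is a hypothetical helper, and the `items` literal is copied from the output printed above so that the block runs on its own.

```python
items = [{
    "item_id": "LP2",
    "item_text": "members of america joined the russian Party.",
    "result_as_pdf": {
        "id": {"SP1": "SP1"},
        "rank": {"SP1": 1},
        "similarity": {"SP1": 0.8380200267},
        "text": {"SP1": "In America, you can always find a party."},
    },
    "result_only_id": ["SP1"],
}]


def ranked_answers(item):
    """Return (rank, id, similarity, text) tuples in ranked order (hypothetical helper)."""
    table = item["result_as_pdf"]
    return [
        (table["rank"][answer_id], answer_id,
         table["similarity"][answer_id], table["text"][answer_id])
        for answer_id in item["result_only_id"]
    ]


for item in items:
    print(item["item_id"], "-", item["item_text"])
    for rank, answer_id, similarity, text in ranked_answers(item):
        print(f"  {rank}. {answer_id} ({similarity:.4f}): {text}")
```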
