# Set-up and import data 

In [5]:
from google.colab import files
uploaded = files.upload()

Saving sample_repository (1).json to sample_repository (1).json


In [1]:
import json 

with open('sample_repository (1).json') as in_file:
    test_data = json.load(in_file)

titles = [item[0] for item in test_data['data']]
documents = [item[1] for item in test_data['data']]

In [176]:
# titles[:32], documents[:32]

# 1. TF-IDF

## **Conclusion for semantic segmentation using TF-IDF vectorizer (with and without preprocessing):**



1.   TFIDF seems to work decently for single words as well as for multiple word queries. However, TFIDF seems to neglect context to a certain as opposed to BERT which will soon see considers context. Compared to GloVe, TFIDF does "contecualize" a bit better since it looks at both term and document frequency. That is why for the queries "fruit" and "vegetable" it did not return specific documents such as "fruit serving bowl" and "white onions" and instead returned nutrition and major market as the titled for the most relevant documents to the  queriesres "fruit" and "vegetable", respectively.
2.   TFIDF with lemmatization and punctuation removal preprocessing generated different results than for the unpreprocessed documents. That is because TFIDF was probably able to identify greater frequency when words were lemmatized. Removing punctuations did not have an effect since the queries generated the same top scoring documents as in the punctuated TFIDF implemenetation. This is likely due to the non-discriminative role of punctuation since they don't really help to distinguish a document so they're given a low score. 
3.   TFIDF works better than "GloVe" in semantic segmentation as the latter seems to have trouble retrieving relevant documents to queries with multiple words since it's vectors are built based on a dictionary of tokens. When the query of "healthy foods in Canada" is done in tokens as in ["healthy", "foods", "in", "Canada"], GloVe returns reasonable results such as documents titled "Canada's Food Guide", 'Diet', 'fruit serving bowl', etc.



In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from gensim.corpora import Dictionary
import gensim.downloader as api
from gensim.models import TfidfModel
nltk.download('wordnet')

nltk.download('punkt')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!




In [155]:
import numpy as np
from sklearn.metrics.pairwise import linear_kernel

query = 'fruits'
query_terms = ['fruits', 'vegetables', 'healthy foods in Canada']

# instantiate TDIDF vectorizer
vectorizer = TfidfVectorizer(stop_words=stop_words)

# Vecorizes every word in every document accordint to a TF x IDF approach and generates a sparse matrix in the following format:
# (x, y) z: x -> document, y -> word id/index, z -> TD x IDF score
vectors = vectorizer.fit_transform([query_terms[0]] + documents) 

# Calculate the word frequency, and a measure of similarity (whatever you find it to be approperiate) of the search terms with each document
print("TFIDF for documents are:\n",vectors[:2], "\n")


# Print the top-scoring results and their titles

# Here I'm using cosine similarity based on linear kernels, an approached advised at: https://stackoverflow.com/questions/12118720/python-tf-idf-cosine-to-find-document-similarity
# Intuition behind cosine similarity: it essentially gets the "angle" - in degrees - between two vectors representing our query document and the documents in the corpus that we're comapring
# our query to, and ranks them in order as in product of vectors divided by product of their (vectors) standardized form as in: np.dot(a, b)//np.dot(abs(a),abs(b))

for i in query_terms:
  vectors = vectorizer.fit_transform([i] + documents) # another approach would simply be to include all query words as individual documents
  cosine_similarities = linear_kernel(vectors[:1], vectors).flatten() # first vector represents my first document (as in my query) that is compared to all of the vectors - documents
  related_docs_indices = cosine_similarities.argsort()[:-10:-1][1] # taking index 1 since index 0 is actually our query so it has a perfect cosine similarity score
  print("The top scoring document relevant to {}".format(i), " is: ", titles[related_docs_indices])

TFIDF for documents are:
   (0, 260)	1.0
  (1, 39)	0.10409377510849337
  (1, 36)	0.10409377510849337
  (1, 434)	0.07430797619476548
  (1, 421)	0.10409377510849337
  (1, 395)	0.10409377510849337
  (1, 613)	0.10409377510849337
  (1, 323)	0.08304429837808315
  (1, 43)	0.08304429837808315
  (1, 133)	0.17716206996595776
  (1, 162)	0.10409377510849337
  (1, 560)	0.09535745292517571
  (1, 565)	0.08858103498297888
  (1, 139)	0.11640692965558593
  (1, 38)	0.10409377510849337
  (1, 364)	0.10409377510849337
  (1, 11)	0.10409377510849337
  (1, 378)	0.10409377510849337
  (1, 190)	0.08858103498297888
  (1, 1)	0.10409377510849337
  (1, 47)	0.1660885967561663
  (1, 119)	0.10409377510849337
  (1, 604)	0.31228132532548014
  (1, 388)	0.20818755021698673
  (1, 420)	0.07430797619476548
  :	:
  (1, 531)	0.10409377510849337
  (1, 326)	0.11640692965558593
  (1, 610)	0.10409377510849337
  (1, 259)	0.1660885967561663
  (1, 335)	0.09535745292517571
  (1, 521)	0.11640692965558593
  (1, 469)	0.11640692965558593
  

## Repeat the same task after some preprocessing 


In [5]:
# Preprocessing: Lemmatization
# This is a strightforward lemmatization applied by toneizing words in documents then rejoining them
# and conducting the same queries as above

from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer
lemm = WordNetLemmatizer()

# Here, words in a sentence are seperated into individual token strings
# to each word in each document of the corpus
docs = []
for i in documents:
  docs.append(nltk.word_tokenize(i))
lemm = WordNetLemmatizer()

# Here, we're getting the "root" of each word in each document of the corpus
documents = []
for doc in range(len(docs)):
  sent = ' '.join([lemm.lemmatize(word) for word in docs[doc]])
  documents.append(sent)

# As in the non-preprocessed TFIDF application above
for i in query_terms:
  vectors = vectorizer.fit_transform([i] + documents) 
  cosine_similarities = linear_kernel(vectors[:1], vectors).flatten()
  related_docs_indices = cosine_similarities.argsort()[:-10:-1][1]
  print("Most relevant document to {}".format(i), " is: ", titles[related_docs_indices])

Most relevant document to fruits  is:  Pink Onions
Most relevant document to vegetables  is:  Pink Onions
Most relevant document to healthy foods in Canada  is:  Major Market


In [6]:
# Preprocessing: Lemmatization + punctuation removal
from nltk.tokenize import RegexpTokenizer as regextk

# Here I ustilize regex tokenier that tokenizes and regularizes expressions that have already been lemmatize above
rgtk = regextk(r'\w+') # removes punctuation according to: https://www.kite.com/python/answers/how-to-remove-all-punctuation-marks-with-nltk-in-python
docs = []
for doc in documents:
  sent = ' '.join(rgtk.tokenize(doc))
  docs.append(sent)

# As in the non-preprocessed TFIDF application above
for i in query_terms:
  vectors = vectorizer.fit_transform([i] + docs)
  cosine_similarities = linear_kernel(vectors[:1], vectors).flatten()
  related_docs_indices = cosine_similarities.argsort()[:-10:-1][1]
  print("Most relevant document to {}".format(i), " is: ", titles[related_docs_indices])

Most relevant document to fruits  is:  Pink Onions
Most relevant document to vegetables  is:  Pink Onions
Most relevant document to healthy foods in Canada  is:  Major Market


# 2. Semantic matching using GloVe embeddings

##**Conclusion for semantic segmentation using GloVe embeddings:**
- GloVe embeddings handle single word queries but seem to run into 
problems when handling sentence queries according to my experience (unless I made a mistake).

- The queries for fruits and vegetables are very relevant - doing better than TFIDF in terms of retrieving the most relevant vector under fruits (TFIDF returns "white onions" as the title of the document that is most relevant to "fruit"), and the results for "Healthy foods in Canada" only returned relevant results when the string was broken into individual tokens.

In [11]:
# !pip install  gensim==4.0.1 # if you decide to use the gensim library and the sample codes below, you would need gensim version >=4.0.1 to be installed 
import gensim
print(gensim.__version__)

4.0.1


In [158]:
import logging
import logging
from re import sub
from multiprocessing import cpu_count

import gensim.downloader as api
from gensim.utils import simple_preprocess
from gensim.similarities import WordEmbeddingSimilarityIndex
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.similarities import SoftCosineSimilarity

In [159]:
import logging

# Initialize logging.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.WARNING)

In [160]:
import nltk

# Import and download stopwords from NLTK.
nltk.download('stopwords')  # Download stopwords list.
stopwords = set(nltk.corpus.stopwords.words("english"))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [161]:
def preprocess(doc):
    # Tokenize, clean up input document string remvoing special characters, etc. to prepare it for BoW
    doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
    # you may decide to add additional steps here 
    return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

In [162]:
# Load test data
with open('sample_repository (1).json') as in_file:
    test_data = json.load(in_file)

titles = [item[0] for item in test_data['data']]
documents = [item[1] for item in test_data['data']]

In [163]:
query_s = ['fruits', 'vegetables', 'healthy foods in Canada']

# This applies the preprocess function above to the documents, including the query string
corpus = [preprocess(document) for document in documents]
query = [preprocess(i) for i in query_terms]
# query[2] = [' '.join(query[2])]

In [164]:
# Download and load the GloVe word vector embeddings

if 'glove' not in locals():  # only load if not already in memory
    glove = api.load("glove-wiki-gigaword-50")

similarity_index = WordEmbeddingSimilarityIndex(glove)

In [165]:
# Build the term dictionary, TF-idf model
# Keep in mind that the search query must be in the dictionary as well, in case the terms do not overlap with the documents  

dictionary = Dictionary(query + corpus) # query includes all 3 queries
tfidf = TfidfModel(dictionary=dictionary)

# Create the term similarity matrix. 
# The nonzero_limit enforces sparsity by limiting the number of non-zero terms in each column. 
# For my application, I got best results by removing the default value of 100
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf, nonzero_limit=None)

100%|██████████| 568/568 [00:14<00:00, 39.03it/s]


In [166]:
import warnings

In [174]:
# Compute similarity measure between the query and the documents.

# The top 5/10 documents can be retrieved for the highest cosine similarities (indicating the smallest angle between 2 vectors representing 2 documents)
# in order of largest to smallest of 32 consine score:
with warnings.catch_warnings():
  warnings.simplefilter("ignore")

  for i in query_terms:
    query_tf = tfidf[dictionary.doc2bow([i])]

    index = SoftCosineSimilarity(
                tfidf[[dictionary.doc2bow(document) for document in corpus]],
                similarity_matrix)

    doc_similarity_scores = index[query_tf]
    # if more than 1 document exists that is similar and it's score is not 
    if len((doc_similarity_scores).flatten()) > 0 and (doc_similarity_scores).flatten()[0] !=0 : 
      print("The top 10 related documents to: \t{} \t are:\t".format(i), [titles[j] for j in np.argsort(doc_similarity_scores)[:-10:-1]])
    # if doc similarity score are 0
    else: 
      print("No related documents to: \t{} (see next cell)".format(i))

The top 10 related documents to: 	fruits 	 are:	 ['Pomegranate Bhagwa', 'Pomegranate Arakta', "Canada's Food Guide", 'Food classes', 'List of fruit dishes', 'fruit serving bowl', 'Grapes Flame / Red Seedless', 'history of botany', 'About Us']
The top 10 related documents to: 	vegetables 	 are:	 ["Canada's Food Guide", 'White Onions', 'Tomatoes', 'Food classes', 'fruit serving bowl', 'Red Onions', 'Pink Onions', 'List of fruit dishes', 'Pomegranate Arakta']
No related documents to: 	healthy foods in Canada (see next cell)


In [172]:
# Since no related documents are generated when a string "healthy foods in Canada" is queried, we can try querying individual terms together:

query_tf = tfidf[dictionary.doc2bow(["healthy", "foods", "in", "Canada"])]
with warnings.catch_warnings():
  warnings.simplefilter("ignore")
  index = SoftCosineSimilarity(
              tfidf[[dictionary.doc2bow(document) for document in corpus]],
              similarity_matrix)
  doc_similarity_scores = index[query_tf]
  print("The top 10 related documents to: \t{} \t are:\t".format(i), [titles[j] for j in np.argsort(doc_similarity_scores)[:-10:-1]])

The top 10 related documents to: 	healthy foods in Canada 	 are:	 ["Canada's Food Guide", 'Diet', 'fruit serving bowl', 'Pomegranate Bhagwa', 'Pomegranate Arakta', 'Tomatoes', 'Nutrition', 'Grapes Black Sharad Seedless', 'About Us']


# 3. BERT
Use a bert model to create sentence embeddings and calculate the similarity between queries and documents.

## **Conclusions for semantic segmentation using BERT**

- I've opted to use NLI Sentence BERT utilizing max pooling from the spaCy API. NLI Sentence BERT seems to have much more relevant as they relate queries to documents based on 3 types of inference: entailment, contradiction and neutrality. The pooled output helps give one overall "topic" to a document further helping us in querying.

- The conclusions are thet BERT embedding work far better and retrieve more relevant documents as they relate to the queries since BERT contexualizes in that it anaysez sequences both forwards and backwards making it ideal for topic modelling and retrieving relevant documents to the queries and their respective titles. For example,  for "fruit" the top 5 most relevant titles that all relate to fruit directly are: 

List of fruit dishes

Food classes

fruit serving bowl

Pomegranate Arakta

Pomegranate Bhagwa

- I have also tried roberta base and roberta large and they did not yield better results. I have not tried the standard BERT such the configurations found in Kerashub since online discussions have advised against using the standard BERT and referred interested people to Sentence BERT for topic modelling and semantic segmentation.
- In comaprison to TDIDF and GloVe, it appearsthat NLI BERT's resulting documents from the query were much more relevant to the query given their overall entailed meaning.

In [141]:
# ! pip3 install spacy --quiet
# ! pip3 install spacy-transformers --quiet
# ! pip3 install spacy_sentence_bert

import spacy
import spacy_transformers
import spacy_sentence_bert

In [142]:
nlp = spacy_sentence_bert.load_model('en_nli_bert_large_max_pooling') # this BERT uses natural language inference and uses max pooled output (a single output of the sequence)

# Utility function for generating sentence embedding from the text
def embed(x):
    '''
    Applies BERT embedding to a document and returned a vector of embeddings of size n_documents x 768 or 1024 depending on the which BERT is used, the third dimension
    '''
    return nlp(x).vector

# Generating sentence embedding from the text for each query for easy analysis later on
embeddings1 = [embed(doc) for doc in documents + [query_terms[0]]]
embeddings2 = [embed(doc) for doc in documents + [query_terms[1]]]
embeddings3 = [embed(doc) for doc in documents + [query_terms[2]]]

# np.array(embeddings1).shape

In [143]:
# compare the findings  

In [144]:
embeddings = [embeddings1, embeddings2, embeddings3]

# this applies cosine similarity as explained in the TFIDF section
for i, j in zip(query_terms, embeddings):
  cosine_similarities = linear_kernel(j[-1].reshape(1,-1), j).flatten()
  related_docs_indices = cosine_similarities.argsort()[:-10:-1]
  print("\nMost relevant document to '{}' is: ".format(i))
  for i in related_docs_indices:
    try: print(titles[i])
    except: pass # exception for handle the fact that title list length = 32, vector list length = 33 since we included out query as a vector for easy application of cosine similarity


Most relevant document to 'fruits' is: 
List of fruit dishes
Food classes
fruit serving bowl
Pomegranate Arakta
Pomegranate Bhagwa
Welcome to Anushka Avni International (AAI)
Tomatoes
Diet

Most relevant document to 'vegetables' is: 
Diet
Nutrition
Nutrients
Food classes
fruit serving bowl
List of fruit dishes
Welcome to Anushka Avni International (AAI)
Canada's Food Guide

Most relevant document to 'healthy foods in Canada' is: 
Canada's Food Guide
Video Gallery
Downloads
Welcome to Anushka Avni International (AAI)
White Onions
About Us
Contact Us
Canadian Industry Statistics


# Comparisons

1. Context: BERT overall understandas context much better which is why it was able to retrieve documents that were more relevant to queries like fruit, vegetable, or healthy foods in canada, where as TFIDF and GloVe simply seemed to "count" the number of occurance of a query in other documents or the overall corpus and retrieve the most relevant documents based on that and generally ignore context.
2. Relevance: GloVe and BERT were both able to return relatively "relevant" documents to the queries whereas TFIDF at times returned documents that were quite "off" such as the example of returning the document titled "white onions" for the query "fruit".
3. Querying sentences: BERT handled querying multiple words such as "healthy foods in Canada" much better than TFIDF and certainly much better than GloVe since BERT handles context of sequences meaning a sequence query didn't pose a problem for it. TFIDF was the next best at querying multiple words since it used word counts both within a single document and within the corpus. GloVe did not handle lengthy queries well since it specialisez in word-word co-occurence, instead, a sentence query should be split up and the component words queried together.