In [None]:
# Usual library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import gc

### Embeddings/Feature Extraction

Feature extraction means obtaining the embedding vectors for a given text from a pre-trained model.  Once you have the embeddings, which are numerical representations of text, many possibilities open up.  You can compare the similarity between documents, match questions to answers, or use the embeddings as features for any clustering algorithm to create clusters of similar documents, and so on.

**Difference between word embeddings and document embeddings**  
So far, we have been talking about word embeddings, which means we have a large embedding vector for every single word in our text data.  What do we mean when we say sentence or document embedding?  A sentence's embedding is derived from the embeddings of all the words in the sentence.  The word vectors are generally averaged ('mean-pooled'), though other techniques such as 'max-pooling' are also available.  It is surprising that we spend so much effort computing separate embeddings for words, taking context and word order into account, and then just mash everything up using an average to get a single vector for the entire sentence, or even the document.  It is equally surprising that this approach works remarkably well for a large number of tasks.
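
To make the pooling step concrete, here is a minimal NumPy sketch.  The token vectors are made-up numbers with only three dimensions, not the output of any real model, and the variable names are just for illustration.  Real libraries also skip padding tokens when averaging, which this sketch leaves out.

```python
import numpy as np

# Hypothetical token embeddings for a 4-token sentence, 3 dimensions each.
# The values are invented for illustration; real models use hundreds of dimensions.
token_embeddings = np.array([
    [0.2, 0.4, -0.1],
    [0.6, 0.0, 0.3],
    [-0.2, 0.8, 0.1],
    [0.2, 0.4, 0.5],
])

# Mean-pooling: average each dimension across all tokens
mean_pooled = token_embeddings.mean(axis=0)

# Max-pooling: keep the largest value of each dimension across all tokens
max_pooled = token_embeddings.max(axis=0)

print("Mean-pooled:", mean_pooled)
print("Max-pooled: ", max_pooled)
```

Either way, four token vectors collapse into a single vector of the same width, which is what we then treat as the sentence embedding.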

Fortunately for us, the sentence transformers library knows how to compute mean-pooled or other representations of entire documents based upon the pre-trained model used.  Effectively, we reduce the entire document to a single vector of fixed length, for example 768 dimensions, depending on the model.

Let us look at this in action.

First, we get embeddings for our corpus using a specific model.  We use 'all-MiniLM-L6-v2' for symmetric queries, and any of the MSMARCO models for asymmetric queries.  In symmetric queries, the query and the sentences are roughly the same length.  In asymmetric queries, the query is much shorter than the sentences.

This is based upon the documentation on sentence-bert's website.

In [None]:
# Toy example with just three sentences to see what embeddings look like

from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('all-MiniLM-L6-v2') #for symmetric queries
model = SentenceTransformer('msmarco-distilroberta-base-v2') #for asymmetric queries
# Sentences we would like to encode
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']

#Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

#Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

In [None]:
embedding.shape

In [None]:
%%time
# Use our data

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2') #for symmetric queries
# model = SentenceTransformer('msmarco-distilroberta-base-v2') #for asymmetric queries

# Sentences we would like to encode
sentences = list(corpus.text)

#Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)



In [None]:
# At this point, the variable embeddings contains all our embeddings, one row for each document.
# So we expect there to be 100 rows, and as many columns as the embedding dimension
# of the model we chose.

embeddings.shape

### Cosine similarity between sentences 

We can compute the cosine similarity between documents, and that gives us a measure of how similar sentences or documents are.

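The code below uses brute force: it compares every sentence with every other sentence to find the most similar ones.  This is very compute intensive, because the similarity matrix has one entry for every pair of sentences, and it will not run if the number of sentences is very large.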
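
As a rough illustration of what cosine similarity computes, here is a minimal NumPy sketch on toy two-dimensional vectors.  The function name `cosine_similarity_matrix` and the toy data are made up for this example; this is not how the library implements it internally, only the same arithmetic written out.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    # Scale every row to unit length, then take dot products
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Toy "embeddings" chosen so that the answers are easy to check by eye
toy_embeddings = np.array([
    [1.0, 0.0],
    [2.0, 0.0],    # same direction as the first row
    [0.0, 3.0],    # at right angles to the first row
    [-1.0, 0.0],   # opposite direction to the first row
])

toy_similarities = cosine_similarity_matrix(toy_embeddings, toy_embeddings)
print(toy_similarities.round(2))
```

The result is a square matrix with one row and one column per vector.  Vectors pointing the same way score 1 regardless of their length, vectors at right angles score 0, and opposite vectors score -1.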

In [None]:
from sentence_transformers import util
# Note: despite its name, `distances` holds cosine similarities (higher means more similar)
distances = util.cos_sim(embeddings, embeddings)
distances.shape

In [None]:
df_dist = pd.DataFrame(distances, columns = corpus.index, index = corpus.index)
df_dist

At this point, we can use `stack` to rearrange the data into a long format that lists similar articles, but `stack` fails if you have a lot of documents, since the result has one row for every pair of documents.  Let us see how `stack` does the job.

In [None]:
# Using stack
df_dist = df_dist.stack().reset_index()
df_dist.columns = ['article', 'similar_article', 'similarity']
df_dist = df_dist.sort_values(by = ['article', 'similarity'], ascending = [True, False])
df_dist
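
To see what `stack` does on something small enough to read, here is a sketch on a made-up 3 x 3 similarity matrix.  The labels and values are invented for illustration, and dropping the self-matches is an extra step that the cell above does not perform.

```python
import pandas as pd

# Hypothetical similarity matrix for three documents
toy_labels = ['doc_a', 'doc_b', 'doc_c']
toy_matrix = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.3],
     [0.1, 0.3, 1.0]],
    index=toy_labels, columns=toy_labels)

# stack() turns the square matrix into one row per (row label, column label) pair
toy_long = toy_matrix.stack().reset_index()
toy_long.columns = ['article', 'similar_article', 'similarity']

# Optional extra step: drop the rows where a document is matched with itself
toy_long = toy_long[toy_long['article'] != toy_long['similar_article']]

toy_long = toy_long.sort_values(by=['article', 'similarity'], ascending=[True, False])
print(toy_long)
```

Before the self-matches are dropped, three documents produce nine rows.  The row count grows with the square of the number of documents, which is why this approach stops being practical for a large corpus.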

In [None]:
# Let us reset our df_dist dataframe
df_dist = pd.DataFrame(distances, columns = corpus.index, index = corpus.index)

`stack` will fail if the number of documents is large.  In that case, we decide how many of the most similar documents (say 20) we need identified for each document.

We do this using the loop below.

In [None]:
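# Using a loop

# top_n is 21 rather than 20 because each document normally appears as its own
# closest match, which leaves 20 other documents
top_n = 21
temp = []
# Walk through the matrix one column (one article) at a time
for col in tqdm(range(df_dist.shape[1])):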
    t = pd.DataFrame(df_dist.iloc[:, col].sort_values(ascending = False)[:top_n]).stack().reset_index()
    t.columns = ['similar_article', 'article', 'similarity']
    t = t[['article', 'similar_article', 'similarity']]
    temp.append(t)

pd.concat(temp)
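
As an alternative to looping over columns, the same top-n selection can be sketched with NumPy sorting.  This is only an illustration under stated assumptions: the helper `top_n_similar` and the toy data are hypothetical and are not part of the notebook or of the sentence transformers library.  It still needs the full similarity matrix in memory, so it avoids only the very long stacked dataframe.

```python
import numpy as np
import pandas as pd

def top_n_similar(similarity, labels, top_n=2):
    """Return the top_n most similar other documents for every document."""
    sim = np.array(similarity, dtype=float)   # work on a copy
    np.fill_diagonal(sim, -np.inf)            # never pick a document as its own match
    # Column positions of the top_n largest values in each row, best first
    order = np.argsort(-sim, axis=1)[:, :top_n]
    labels = np.asarray(labels)
    return pd.DataFrame({
        'article': np.repeat(labels, top_n),
        'similar_article': labels[order].ravel(),
        'similarity': np.take_along_axis(sim, order, axis=1).ravel(),
    })

# Hypothetical similarity matrix for three documents
toy_labels = ['doc_a', 'doc_b', 'doc_c']
toy_sim = [[1.0, 0.8, 0.1],
           [0.8, 1.0, 0.3],
           [0.1, 0.3, 1.0]]

toy_top = top_n_similar(toy_sim, toy_labels, top_n=2)
print(toy_top)
```

Because the diagonal is masked out first, `top_n` here counts only other documents, so there is no need to ask for one extra row as in the loop above.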