Algorithm to transform document to embedding is SBERT
Approach: SBERT transforms sentences to fixed size embedding
Data = L90 

Source: 

https://medium.com/analytics-vidhya/best-nlp-algorithms-to-get-document-similarity-a5559244b23b


https://www.sbert.net/examples/applications/computing-embeddings/README.html



In [6]:
! pip install transformers sentence_transformers



###0. Text preprocessing (Maybe not needed): 

* Normalization: transforming the text into lower case and removing all the special characters and punctuations.

* Tokenization: getting the normalized text and splitting it into a list of tokens.

* Removing stop words: stop words are the words that are most commonly used in a language and do not add much meaning to the text. Some examples are the words ‘the’, ‘a’, ‘will’,…

* Stemming: it is the process to get the root of the words and sometimes this root is not equal to the morphological root of the word, but the stemming goal is to make that related word maps to the same stem. Examples: branched and branching become branch.

* Lemmatization: This is the process of getting the same word for a group of inflected word forms, the simplest way to do this is with a dictionary. Examples: is, was, were become be.

In [7]:
!pip install stanza
import stanza
print("Downloading English model...")
stanza.download('en')

import glob
import nltk
import ssl
print("Downloading stop words...")
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download("stopwords")
from nltk.corpus import stopwords

# getting stopwords
stop_words_english = stopwords.words('english')

def tokenize_and_normalize(doc_raw, stopwords):
    """Tokenizes, lemmatizes, lowercases and removes stop words.
    
    this function takes in a path to a doc, reads the doc file,
    tokenizes it into words, then lemmatizes and lowercases these words.
    finally, stopwords given to the function are removed from the list of doc lemmas
    
    Parameters
    ----------
    doc_raw : str
    stopwords : list of strings
        stopwords that should be removed
    
    Returns
    -------
    normalized_song : list of strings
        a doc represented as a list of its lemmas
    """
    nlp = stanza.Pipeline(lang='en', processors='tokenize, lemma',  verbose=False)
    
    # YOUR CODE HERE
    
    doc=nlp(doc_raw)
    words = doc.iter_words()
    normalized_doc = []
    for w in words:
        w = w.lemma.lower()
        if not w in stopwords:
            normalized_doc.append(w)
    normalized_doc = ' '.join(normalized_doc)
    return normalized_doc

Downloading English model...
Downloading stop words...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


###1. Load dataset

In [8]:
import os
import re
## Load dataset into a list of sentences
os.mkdir("Data")

## Remove the numbers at the start and end of the documents.
DATAFILE = "/content/drive/MyDrive/LeePincombeWelshDocuments.txt"
CLEANFILE = "./Data/cleanLPW.txt"
INDIVIDUAL_DOCS = "./Data/"

if (os.path.exists(CLEANFILE)):
    os.remove(CLEANFILE)
i = 0
DATADICT = {}
with open(DATAFILE, 'r', encoding="utf8", errors="ignore") as inputfile:
     lines = inputfile.readlines()
     for line in lines:
        start_removed = re.sub("(\d*\.\s)", "", line)
        end_removed = re.sub("\(\d* words\)", ".", start_removed)
        normalized_doc = tokenize_and_normalize(end_removed, stop_words_english)
        DATADICT[i] = normalized_doc
        with open(INDIVIDUAL_DOCS+f"/{i}.txt", "w+") as docfile:
            docfile.write(end_removed)
            i = i + 1
DATADICT.pop(0)
DATADICT.pop(51)
print(DATADICT)



###2. Load SBERT for embedding sentences


In [58]:
from sentence_transformers import SentenceTransformer
import pickle

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = list(DATADICT.values())

embeddings = model.encode(sentences)

In [59]:
embeddings.shape
#Store sentences & embeddings on disc
with open('Sbert_embeddings.pkl', "wb") as fOut:
    pickle.dump({'sentences': sentences, 'embeddings': embeddings}, fOut, protocol=pickle.HIGHEST_PROTOCOL)

In [60]:
#Load sentences & embeddings from disc
with open('Sbert_embeddings.pkl', "rb") as fIn:
    stored_data = pickle.load(fIn)
    stored_sentences = stored_data['sentences']
    stored_embeddings = stored_data['embeddings']
print(type(stored_sentences), type(stored_embeddings))

## TO DO test embedding of 1 sentence is fix???? Seems to changes

<class 'list'> <class 'numpy.ndarray'>


# 3. Compare cosine similarity with human feedback

In [73]:
# Load csv file
import pandas as pd
from sentence_transformers import util
HUMANFBFILE = "/content/drive/MyDrive/LeePincombeWelshData.csv"

df = pd.read_csv(HUMANFBFILE)

# create an array that store cosine-similarity result from SBERT
# from doc1 - doc2 - pair from human feedback data
docs1 = df['Document1']
docs2 = df['Document2']
humanchoice = df['Similarity'] 
sbert_cosines = []

# Traverse the human reference df
for i in range(len(df)):
  d1 = stored_embeddings[docs1[i]-1]
  d2 = stored_embeddings[docs2[i]-1]
  score = util.cos_sim(d1, d2) # Runtime = 2s
  sbert_cosines.append(score)

# Append sbert_cosines to df
df['SBert'] = sbert_cosines

In [79]:
df['SBert'] = df['SBert'].astype(float)
df.to_csv('/content/humanfb_sbert.csv', index=False)
df.head(10)

Unnamed: 0,SubjectID,Document1,Document2,Similarity,Time,SBert
0,1,15,4,1,25.417,0.038636
1,1,8,7,4,9.764,0.250195
2,2,17,1,1,56.061,0.112251
3,2,19,1,1,39.767,0.019002
4,2,20,1,1,37.344,0.088691
5,2,25,1,1,14.371,0.14444
6,2,33,1,5,7.962,0.572937
7,2,40,1,1,22.262,0.078665
8,2,50,1,4,15.172,0.328901
9,2,4,2,1,37.805,0.057868


# 4. Average Similarity


In [None]:
human_evaluation_data = pd.read_csv("Data/AverageSimilarities.csv")

## TODO: refactor this and reuse for 3. and 4. 
docs1 = human_evaluation_data['Document_1']
docs2 = human_evaluation_data['Document_2']
sbert_similarities = []

# Traverse the human reference df
for i in range(len(human_evaluation_data)):
  d1 = stored_embeddings[docs1[i]-1]
  d2 = stored_embeddings[docs2[i]-1]
  score = util.cos_sim(d1, d2) # Runtime = 2s
  sbert_similarities.append(score)

# Append sbert_cosines to df
human_evaluation_data['Similarities_SBERT'] = sbert_similarities
human_evaluation_data['Similarities_SBERT'] = human_evaluation_data['Similarities_SBERT'].astype(float)
df.to_csv('/content/Data/AverageSimilarities_.csv', index=False)

: 