Algorithm to transform document to embedding is SBERT
Approach: SBERT transforms sentences to fixed size embedding
Data = L90 

Source: 

https://medium.com/analytics-vidhya/best-nlp-algorithms-to-get-document-similarity-a5559244b23b


https://www.sbert.net/examples/applications/computing-embeddings/README.html



In [1]:
! pip install transformers sentence_transformers



### 0. Text preprocessing (Maybe not needed): 

* Normalization: transforming the text into lower case and removing all the special characters and punctuations.

* Tokenization: getting the normalized text and splitting it into a list of tokens.

* Removing stop words: stop words are the words that are most commonly used in a language and do not add much meaning to the text. Some examples are the words ‘the’, ‘a’, ‘will’,…

* Stemming: it is the process to get the root of the words and sometimes this root is not equal to the morphological root of the word, but the stemming goal is to make that related word maps to the same stem. Examples: branched and branching become branch.

* Lemmatization: This is the process of getting the same word for a group of inflected word forms, the simplest way to do this is with a dictionary. Examples: is, was, were become be.

In [3]:
!pip install stanza
import stanza
print("Downloading English model...")
stanza.download('en')

import glob
import nltk
import ssl
print("Downloading stop words...")
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download("stopwords")
from nltk.corpus import stopwords

# getting stopwords
stop_words_english = stopwords.words('english')

def tokenize_and_normalize(doc_raw, stopwords):
    """Tokenizes, lemmatizes, lowercases and removes stop words.
    
    this function takes in a path to a doc, reads the doc file,
    tokenizes it into words, then lemmatizes and lowercases these words.
    finally, stopwords given to the function are removed from the list of doc lemmas
    
    Parameters
    ----------
    doc_raw : str
    stopwords : list of strings
        stopwords that should be removed
    
    Returns
    -------
    normalized_song : list of strings
        a doc represented as a list of its lemmas
    """
    nlp = stanza.Pipeline(lang='en', processors='tokenize, lemma',  verbose=False)
    
    # YOUR CODE HERE
    
    doc=nlp(doc_raw)
    words = doc.iter_words()
    normalized_doc = []
    for w in words:
        w = w.lemma.lower()
        if not w in stopwords:
            normalized_doc.append(w)
    normalized_doc = ' '.join(normalized_doc)
    return normalized_doc

pyenv: pip: command not found

The `pip' command exists in these Python versions:
  3.7.12

Note: See 'pyenv help global' for tips on allowing both
      python2 and python3 to be found.


  from .autonotebook import tqdm as notebook_tqdm


Downloading English model...


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json: 142kB [00:00, 31.6MB/s]                    
2022-04-22 11:05:20 INFO: Downloading default packages for language: en (English)...
2022-04-22 11:05:22 INFO: File exists: /Users/esapalosaari/stanza_resources/en/default.zip.
2022-04-22 11:05:24 INFO: Finished downloading models and saved to /Users/esapalosaari/stanza_resources.


Downloading stop words...


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/esapalosaari/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


###1. Load dataset

In [5]:
import os
import re
## Load dataset into a list of sentences


## Remove the numbers at the start and end of the documents.
DATAFILE = "./Data/LeePincombeWelshDocuments.txt"
CLEANFILE = "./Data/cleanLPW.txt"
INDIVIDUAL_DOCS = "./Data/Docs"

if (os.path.exists(CLEANFILE)):
    os.remove(CLEANFILE)
i = 0
DATADICT = {}
with open(DATAFILE, 'r', encoding="utf8", errors="ignore") as inputfile:
     lines = inputfile.readlines()
     for line in lines[1:-1]:
        start_removed = re.sub("(\d*\.\s)", "", line)
        end_removed = re.sub("\(\d* words\)", ".", start_removed)
        normalized_doc = tokenize_and_normalize(end_removed, stop_words_english)
        DATADICT[i] = normalized_doc
        with open(INDIVIDUAL_DOCS+f"/{i}.txt", "w+") as docfile:
            docfile.write(end_removed)
            i = i + 1
print(DATADICT)



### 2. Load SBERT for embedding sentences


In [4]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

print("Max Sequence Length:", model.max_seq_length)


Max Sequence Length: 256


In [15]:
#( Source: https://www.sbert.net/docs/usage/semantic_textual_similarity.html )
def sbert_cosine(d1, d2, max_length = 256):
  ''' Run SBERT encoder on doc 1 (d1) and doc 2 (d2) and compute cosine similarity
      using these document embeddings
      Params:   
            d1: document 1
            d2: document 2
      Return: 
            cosine_d: cosine similarity btw d1 and d2
  ''' 
  # make sure the document length stay within the max_length constraint of SBERT model
  # assert len(d1) < max_length, print(len(d1),"Length of document 1 exceeds max_length of SBERT")
  # assert len(d2) < max_length, print(len(d2),"Length of document 2 exceeds max_length of SBERT")
  
  # Two lists of sentences
  documents1 = [d1]
  documents2 = [d2]

  #Compute embedding for both lists
  embeddings1 = model.encode(documents1, convert_to_tensor=True)
  embeddings2 = model.encode(documents2, convert_to_tensor=True)


  #Compute cosine-similarits
  cosine_scores = util.cos_sim(embeddings1, embeddings2)

  #Output the pairs with their score
  # n = min(len(documents1), len(documents2))
  # for i in range(n):
  #     print("{} \n{} \nScore: {:.4f}".format(documents1[i], documents2[i], cosine_scores[i][i])) # 0.1858 (without Normalized) & 0.2 (with normalize)
  return cosine_scores[0][0]

# 3. Compare cosine similarity with human feedback

In [17]:
# Load csv file
import pandas as pd
HUMANFBFILE = "/content/drive/MyDrive/LeePincombeWelshData.csv"

df = pd.read_csv(HUMANFBFILE)

# create an array that store cosine-similarity result from SBERT
# from doc1 - doc2 - pair from human feedback data
docs1 = df['Document1']
docs2 = df['Document2']
humanchoice = df['Similarity'] 
sbert_cosines = []

# Traverse the human reference df
for i in range(len(df)):
  d1 = DATADICT[docs1[i]]
  d2 = DATADICT[docs2[i]]
  score = sbert_cosine(d1, d2) # Runtime = 36 mins
  sbert_cosines.append(score)

# Append sbert_cosines to df
df['SBert'] = sbert_cosines

Unnamed: 0,SubjectID,Document1,Document2,Similarity,Time,SBert
0,1,15,4,1,25.417,tensor(0.0386)
1,1,8,7,4,9.764,tensor(0.2502)
2,2,17,1,1,56.061,tensor(0.1123)
3,2,19,1,1,39.767,tensor(0.0190)
4,2,20,1,1,37.344,tensor(0.0887)
...,...,...,...,...,...,...
12221,83,37,46,1,2.030,tensor(0.0786)
12222,83,6,47,1,1.310,tensor(0.0333)
12223,83,11,49,1,3.460,tensor(0.1156)
12224,83,29,49,1,1.870,tensor(0.1792)


In [18]:
df['SBert'] = df['SBert'].astype(float)

Unnamed: 0,SubjectID,Document1,Document2,Similarity,Time,SBert
0,1,15,4,1,25.417,0.038636
1,1,8,7,4,9.764,0.250195
2,2,17,1,1,56.061,0.112251
3,2,19,1,1,39.767,0.019002
4,2,20,1,1,37.344,0.088691
...,...,...,...,...,...,...
12221,83,37,46,1,2.030,0.078626
12222,83,6,47,1,1.310,0.033318
12223,83,11,49,1,3.460,0.115638
12224,83,29,49,1,1.870,0.179211


In [20]:
# import os  
# os.makedirs('compare_human_with_sbert', exist_ok=True)  
# df.to_csv('compare_human_with_sbert/out.csv')  