# Measure sentence similarity in a given dataset using Google's and Facebook's encoders 

This notebook is a heavily modified version of nlp-town's <a href="https://github.com/nlptown/nlp-notebooks/blob/master/Simple%20Sentence%20Similarity.ipynb">Simple Sentence Similarity notebook</a>.

## Data

### Setting the input filenames. Feel free to leave them as they are
- `INPUT_FILE_NAME` is the file with the students' responses to be graded. It needs to be a .csv file with the columns: IDStudent, IDStud, IDClass, Category, Field, Field_en, Accuracy_score, Code, Fieldname.
- `MODEL_FILE` is the file with the model responses. It needs to have the following columns: 'TextName', 'Field1_en', 'Field2_en', 'Field3_en', 'Field4_en'. The first column, TextName, needs to contain the name of the task corresponding to the "Category" column in the input file.

To be sure, look at the example data files provided and replace the content of the columns with the data from your experiment. Remember that the students' responses must not be empty. All texts should be in English.
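
Before running anything, it can help to confirm that both files have the expected columns. The check below is a minimal sketch using only the standard library; the helper name `missing_columns` and the inline sample data are illustrative and not part of the notebook.

```python
import csv
import io

# Column names taken from the description above.
INPUT_COLUMNS = ["IDStudent", "IDStud", "IDClass", "Category", "Field",
                 "Field_en", "Accuracy_score", "Code", "Fieldname"]
MODEL_COLUMNS = ["TextName", "Field1_en", "Field2_en", "Field3_en", "Field4_en"]


def missing_columns(csv_file, required):
    """Return the required column names absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Hypothetical in-memory file standing in for a real model file.
sample = io.StringIO("TextName,Field1_en,Field2_en,Field3_en\nmarble,a,b,c\n")
print(missing_columns(sample, MODEL_COLUMNS))  # ['Field4_en']
```

With a real file, pass `open(MODEL_FILE, newline="")` instead of the `StringIO` object.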


In [None]:
INPUT_FILE_NAME = 'Example_dataset_marble_v2 - 2_data_no_omission.csv'
MODEL_FILE = 'correct_answers.csv'

### Setting the output filenames (if you change them, you will also need to change them in the B_ script if you want to run postprocessing)
- `ALL_METHODS_RESULTS_FILE` will contain the matched sentences from the model with their similarity scores for all methods
- `GSE_RESULTS_FILE` is the part of the above file containing only the Google Sentence Encoder matches and scores
- `infersent_RESULTS_FILE` is the same as above, but for the InferSent matches and scores

In [None]:
ALL_METHODS_RESULTS_FILE = "GSE_INF_complete_result_matched.csv"
GSE_RESULTS_FILE = "GSE_wv2_matched.csv"
infersent_RESULTS_FILE = "INF_matched.csv"

### Required libraries
Libraries ```seaborn```, ```tensorflow``` and ```tensorflow_hub``` are not included in the basic setup of the environment (requirements.txt) because of their size, so you may need to install them first. If you encounter a "module not found" error, execute the lines below in a terminal. Note that the code in this notebook uses the TensorFlow 1.x API (`tf.placeholder`, `tf.Session`, `hub.Module`), so it will not run unchanged on TensorFlow 2.x.

(activate the virtual environment first with `source env/bin/activate`)
```
pip install seaborn
pip install tensorflow
pip install tensorflow_hub
```
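
A quick way to see which of these packages are missing before running the rest of the notebook is to ask Python whether it can locate them. This is only a sketch; the helper name `find_missing` is illustrative.

```python
import importlib.util

# Import names of the optional packages mentioned above.
OPTIONAL_PACKAGES = ["seaborn", "tensorflow", "tensorflow_hub"]


def find_missing(packages):
    """Return the packages that cannot be found in the current environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing(OPTIONAL_PACKAGES)
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All optional packages are available.")
```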

In [21]:
import pandas as pd
import numpy as np
import scipy
import math
import os
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns

In [24]:
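import nltk

# The stopword list and the tokenizer models must be downloaded once, e.g.
# nltk.download("stopwords"); nltk.download("punkt")
STOP = set(nltk.corpus.stopwords.words("english"))

class Sentence:
    """Holds the raw text, its lowercased tokens, and the tokens without stopwords."""

    def __init__(self, sentence):
        self.raw = sentence
        # Replace curly apostrophes with straight ones before tokenizing
        normalized_sentence = sentence.replace("‘", "'").replace("’", "'")
        self.tokens = [t.lower() for t in nltk.word_tokenize(normalized_sentence)]
        self.tokens_without_stop = [t for t in self.tokens if t not in STOP]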

### InferSent

[InferSent](https://github.com/facebookresearch/InferSent) is a pre-trained encoder that produces sentence embeddings. 
More particularly, it is a BiLSTM with max pooling that was trained on the SNLI dataset, 570k English sentence pairs labelled with one of three categories: entailment, contradiction or neutral. InferSent was developed and trained by Facebook Research.
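
To make "max pooling" concrete: the BiLSTM produces one hidden vector per token, and the sentence embedding takes, for each dimension, the largest value across all tokens. The sketch below uses made-up numbers that are far smaller than InferSent's real hidden states, and the function name `max_pool` is illustrative.

```python
def max_pool(hidden_states):
    """Element-wise maximum over a list of equal-length hidden vectors."""
    if not hidden_states:
        raise ValueError("need at least one hidden state")
    return [max(dim_values) for dim_values in zip(*hidden_states)]


# Three tokens, each with a made-up 4-dimensional hidden state.
hidden_states = [
    [0.1, -0.3, 0.5, 0.0],
    [0.4, 0.2, -0.1, 0.0],
    [-0.2, 0.6, 0.3, -0.7],
]
print(max_pool(hidden_states))  # [0.4, 0.6, 0.5, 0.0]
```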

Let's first download the resources we need.

In [None]:
  
# !wget -nc https://raw.githubusercontent.com/facebookresearch/InferSent/master/models.py
# !wget -nc https://s3.amazonaws.com/senteval/infersent/infersent.allnli.pickle

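The first time you run the notebook, you need to download the infersent1.pkl model. Uncomment and run the cell below. Note that the command saves the model to `encoder/`, whereas the loading cell further down reads it from `models/infersent1.pkl`, so move the file or adjust `MODEL_PATH` to match.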

In [6]:
# !mkdir encoder
# !curl -Lo encoder/infersent1.pkl https://s3.amazonaws.com/senteval/infersent/infersent1.pkl



InferSent uses GloVe word vectors. You can download them by uncommenting and running the cell below.

In [None]:
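# !mkdir -p models
# curl's -o option needs a target file name, not just a directory
# !curl -Lo models/glove.840B.300d.zip https://nlp.stanford.edu/data/glove.840B.300d.zip
# !cd models && unzip glove.840B.300d.zip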

In [8]:

PATH_TO_GLOVE = os.path.expanduser("models/glove.840B.300d.txt")


Then we load the model.

In [9]:

import torch
from models import InferSent
V = 1
MODEL_PATH = 'models/infersent1.pkl'
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))
infersent.set_w2v_path(PATH_TO_GLOVE)

In [10]:
# import torch
# # infersent = torch.load('infersent.allnli.pickle')

# infersent = torch.load('infersent.allnli.pickle', map_location=lambda storage, loc: storage)
# infersent.use_cuda = False
# infersent.set_glove_path(PATH_TO_GLOVE)

Finally, we can run the benchmark by having InferSent encode the two sets of sentences and compute the cosine similarity between the corresponding sentences.

In [28]:
from sklearn.metrics.pairwise import cosine_similarity

def run_inf_benchmark(sentences1, sentences2):
    
    raw_sentences1 = [sent1.raw for sent1 in sentences1]
    raw_sentences2 = [sent2.raw for sent2 in sentences2]
    
    infersent.build_vocab(raw_sentences1 + raw_sentences2, tokenize=True)
    embeddings1 = infersent.encode(raw_sentences1, tokenize=True)
    embeddings2 = infersent.encode(raw_sentences2, tokenize=True)
    
    inf_sims = []
    for (emb1, emb2) in zip(embeddings1, embeddings2): 
        sim = cosine_similarity(emb1.reshape(1, -1), emb2.reshape(1, -1))[0][0]
        inf_sims.append(sim)

    return inf_sims   
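
For reference, the cosine similarity computed for each pair of embeddings is the dot product of the two vectors divided by the product of their norms. A dependency-free sketch (the function name `cosine` and the zero-vector convention are choices made here, not part of the notebook):

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Values close to 1 mean the two sentence embeddings point in nearly the same direction; values near 0 mean they are unrelated in the embedding space.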

### Google Sentence Encoder

The [Google Sentence Encoder](https://www.tensorflow.org/hub/modules/google/universal-sentence-encoder/1) is Google's answer to Facebook's InferSent. It comes in two forms: 

- a Transformer model that takes the element-wise sum of the context-aware word representations produced by the encoding subgraph of a Transformer model.
- a Deep Averaging Network (DAN) where input embeddings for words and bigrams are averaged together and passed through a feed-forward deep neural network.

The Transformer model tends to give better results, but at the time of writing, only the DAN-based encoder was available.

In contrast to InferSent, the Google Sentence Encoder was trained on a combination of unsupervised data (in a skip-thought-like task) and supervised data (the SNLI corpus).

The Google Sentence Encoder can be loaded from the Tensorflow Hub.

In [15]:
import tensorflow_hub as hub

#in case ssl errors appear
import ssl
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

tf.logging.set_verbosity(tf.logging.ERROR)
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/1")

As with InferSent above, we'll have it encode the two sets of sentences and return the similarities between the embeddings it produces.

In [16]:
def run_gse_benchmark(sentences1, sentences2):
    sts_input1 = tf.placeholder(tf.string, shape=(None))
    sts_input2 = tf.placeholder(tf.string, shape=(None))

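    # Normalize each embedding (row) to unit length, so that the dot product
    # below is the cosine similarity; without axis=1 the whole batch is
    # normalized as a single vector
    sts_encode1 = tf.nn.l2_normalize(embed(sts_input1), axis=1)
    sts_encode2 = tf.nn.l2_normalize(embed(sts_input2), axis=1)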
        
    sim_scores = tf.reduce_sum(tf.multiply(sts_encode1, sts_encode2), axis=1)
    
    with tf.Session() as session:
        session.run(tf.global_variables_initializer())
        session.run(tf.tables_initializer())
      
        [gse_sims] = session.run(
            [sim_scores],
            feed_dict={
                sts_input1: [sent1.raw for sent1 in sentences1],
                sts_input2: [sent2.raw for sent2 in sentences2]
            })
    return gse_sims


## Experiments

Finally, it's time to run the actual experiments. 

In [None]:
import functools as ft

benchmarks = [
              ("GSE", run_gse_benchmark),
              ("INF", run_inf_benchmark)
             ]


# Application to our dataset
First we open our input file, for example `INPUT_FILE_NAME = 'Example_dataset_marble_v2 - 2_data_no_omission.csv'` (you can set the input file name at the top).

In [None]:
data = pd.read_csv(INPUT_FILE_NAME)
data.head()

## Match the sentences
Here we declare the main experiment function. For each student response, it runs the similarity measures and checks which sentence from the model is most similar to the student's response. It then matches the most similar model sentence and saves the similarity score. This procedure is repeated for every similarity measure in `benchmarks`.

In [None]:
def run_all_match(df, model, benchmarks): 
    """Run each of the similarity measures in benchmarks
    for each student response and check which
    sentence from the model is the most similar to the student's
    response. Then match the most similar model sentence
    and save the similarity score."""
    size = len(model.index)
    text_frame = df.copy()
    sims = {"stud_sentence":[],
            "stud_field":[],}
    for label, method in benchmarks:
        sims[label+"_all_scores"] = []
        sims[label+"_similarity"] = []
        sims[label+"_aimed_sentence"] = []
        sims[label+"_aimed_field"] = []
        
    for index, row in text_frame.iterrows():
        stud_sentence = row["Field_en"]
        sims["stud_sentence"].append(stud_sentence)
        sims["stud_field"].append(row["Fieldname"])
        student_sentences = [Sentence(stud_sentence)]*size
        model_sentences = model[row['Category']].apply(lambda s: Sentence(s))
        for label, method in benchmarks:
            similarity_scores = method(student_sentences, model_sentences)
            similarity = max(similarity_scores)
            # Position of the best-scoring model sentence (named best_idx to
            # avoid shadowing the row index from iterrows)
            best_idx = np.argmax(similarity_scores)
            aimed_sentence = model_sentences.iloc[best_idx]
            aimed_field = model_sentences.index[best_idx]
            sims[label+"_all_scores"].append(similarity_scores)
            sims[label+"_similarity"].append(similarity)
            sims[label+"_aimed_sentence"].append(aimed_sentence.raw)
            sims[label+"_aimed_field"].append(aimed_field)
    frame = pd.DataFrame(sims)
    return frame
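
The core of `run_all_match` is the selection step: given one similarity score per model sentence, keep the highest score and the field it belongs to. The sketch below isolates that step with made-up scores; the helper name `best_match` is hypothetical.

```python
def best_match(scores, fields):
    """Return (field, score) for the highest-scoring model sentence."""
    if len(scores) != len(fields) or not scores:
        raise ValueError("scores and fields must be non-empty and equally long")
    # On ties the first maximum wins, which matches np.argmax
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return fields[best_idx], scores[best_idx]


fields = ["Field1_en", "Field2_en", "Field3_en", "Field4_en"]
scores = [0.41, 0.87, 0.63, 0.22]  # made-up similarity scores
print(best_match(scores, fields))  # ('Field2_en', 0.87)
```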

Opening the model file and transposing it, so that text names become columns, and field_numbers become index rows.

In [None]:
model_frame = pd.read_csv(MODEL_FILE, index_col=0)
exp_frame2 = data.copy()
model_frame = model_frame[[ 'Field1_en', 'Field2_en','Field3_en', 'Field4_en']]
model_frame = model_frame.transpose()
model_frame.head()

In [None]:
# Run the experiment on the copy of the original
frame_sim2 = run_all_match(exp_frame2, model_frame, benchmarks)
frame_sim2

Append extra columns to the experiment results: 'IDStud', 'IDClass', "Category", "Accuracy_score", "Code", "Fieldname"

In [None]:
frame_sim2 = pd.concat([frame_sim2, 
                        exp_frame2[['IDStud', 'IDClass', 
                                    "Category","Accuracy_score",
                                    "Code","Fieldname"]]], axis=1)

Save all the columns to `ALL_METHODS_RESULTS_FILE` and the method-specific columns to the other `*_RESULTS_FILE`s

In [None]:
frame_sim2.to_csv(ALL_METHODS_RESULTS_FILE)
# InferSent matches and scores only
inf_matched = frame_sim2[["INF_aimed_sentence", 
            "INF_aimed_field", 
            'INF_similarity',
            'stud_field',
            'stud_sentence',
            'IDStud', 'IDClass', 
            "Category","Accuracy_score",
            "Code","Fieldname"]].copy()
inf_matched.to_csv(infersent_RESULTS_FILE)
# Google Sentence Encoder matches and scores only
gse_matched = frame_sim2[["GSE_aimed_sentence", 
            "GSE_aimed_field", 
            'GSE_similarity',
            'stud_field',
            'stud_sentence',
            'IDStud', 'IDClass', 
            "Category","Accuracy_score",
            "Code","Fieldname"]].copy()
gse_matched.to_csv(GSE_RESULTS_FILE)