# Notebook Intro

Dataset used: https://huggingface.co/datasets/social_i_qa
<br>Details about the dataset follow the cell that reads and pre-processes it.
<br><br>
This notebook prototypes 4 **approaches**:
1. F1 Score: Works on {Context & Answer}
2. Similarity calculation using spaCy: Works on {Context & Answer} as well as {Question & Answer}
3. Sentence similarity using a Hugging Face model: Works on {Context & Answer} as well as {Question & Answer}
4. Semantic similarity using Sentence Transformers: Works on {Context & Answer}


In [None]:
!pip install umap-learn
!pip install plotly
!pip install transformers
!pip install datasets
!pip install evaluate
!pip install nltk
!pip install -U sentence-transformers



In [None]:
import pandas as pd
import numpy as np
import umap
from umap import UMAP
import plotly
import plotly.express as px

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
from datasets import load_dataset

# https://huggingface.co/datasets/social_i_qa
dataset = load_dataset("social_i_qa")

qa_orig = pd.DataFrame(dataset['train'])

qa_orig = qa_orig.head(1500)
#qa_df = qa_df.iloc[2000:3501]

qa_orig.head()

Unnamed: 0,context,question,answerA,answerB,answerC,label
0,Cameron decided to have a barbecue and gathere...,How would Others feel as a result?,like attending,like staying home,a good friend to have,1
1,Jan needed to give out jobs for an upcoming pr...,What will Others want to do next?,disagree with Jan,get to work,argue with the assignments,2
2,Remy was an expert fisherman and was on the wa...,What will Remy want to do next?,cast the line,put the boat in the water,invite Kai out on the boat,1
3,Addison gave a hug to Skylar's son when they w...,Why did Addison do this?,better,wrong,keep hugging the son,1
4,Kai found one for sale online but it was too m...,What does Kai need to do before this?,cheaper,Open up her laptop,save money,2


<u>Dataset info:</u> Each row has three candidate answers: answerA, answerB, and answerC. The label column identifies the correct one ("1" for answerA, "2" for answerB, "3" for answerC). This lets us test the similarity metrics calculated in the following cells.
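
As a minimal sketch of how the label column points at the correct answer, the helper below looks up the answer text for a row. The names `ANSWER_COLUMNS` and `correct_answer` are hypothetical (they are not used elsewhere in this notebook), and the example row is copied from the preview above with the truncated context left out.

```python
# Hypothetical helper: map the string label to the column holding the correct answer.
ANSWER_COLUMNS = {"1": "answerA", "2": "answerB", "3": "answerC"}

def correct_answer(row):
    """Return the answer text that the label marks as correct."""
    return row[ANSWER_COLUMNS[str(row["label"])]]

# Row 1 from the preview above (context omitted because it is truncated there).
example_row = {
    "question": "What will Others want to do next?",
    "answerA": "disagree with Jan",
    "answerB": "get to work",
    "answerC": "argue with the assignments",
    "label": "2",
}

print(correct_answer(example_row))
```

Because a pandas row supports the same `row["column"]` lookup, the helper should also work with `qa_orig.apply(correct_answer, axis=1)`.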

In [None]:
# Considering answerA as the only prediction for now; dropping answerB & answerC

qa_df = qa_orig.copy()
qa_df = qa_df.drop(['answerB', 'answerC'], axis = 1)
qa_df.rename(columns = {'answerA':'answer'}, inplace = True)

qa_df.head()

Unnamed: 0,context,question,answer,label
0,Cameron decided to have a barbecue and gathere...,How would Others feel as a result?,like attending,1
1,Jan needed to give out jobs for an upcoming pr...,What will Others want to do next?,disagree with Jan,2
2,Remy was an expert fisherman and was on the wa...,What will Remy want to do next?,cast the line,1
3,Addison gave a hug to Skylar's son when they w...,Why did Addison do this?,better,1
4,Kai found one for sale online but it was too m...,What does Kai need to do before this?,cheaper,2


In [None]:
qa_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   context   1500 non-null   object
 1   question  1500 non-null   object
 2   answer    1500 non-null   object
 3   label     1500 non-null   object
dtypes: object(4)
memory usage: 47.0+ KB


In [None]:
qa_df['label'] = pd.Categorical(qa_df.label)
qa_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   context   1500 non-null   object  
 1   question  1500 non-null   object  
 2   answer    1500 non-null   object  
 3   label     1500 non-null   category
dtypes: category(1), object(3)
memory usage: 36.9+ KB


# Approach 1: F1 Score

In [None]:
# Helper functions

def normalize_text(s):
    """Removing articles and punctuation, and standardizing whitespace are all typical text processing steps."""
    import string, re

    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def compute_f1(prediction, truth):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()

    # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)

    common_tokens = set(pred_tokens) & set(truth_tokens)

    # if there are no common tokens then f1 = 0
    if len(common_tokens) == 0:
        return 0

    prec = len(common_tokens) / len(pred_tokens) # precision
    rec = len(common_tokens) / len(truth_tokens) # recall

    return 2 * (prec * rec) / (prec + rec)
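
`compute_f1` above intersects token *sets* but divides by token-list lengths, so a token that repeats in both texts is counted once in the numerator and every time in the denominators. A common alternative, sketched here as a variant and not as this notebook's method, counts the overlap as a multiset with `collections.Counter`. The function name `token_f1_multiset` and the example tokens are hypothetical.

```python
from collections import Counter

def token_f1_multiset(pred_tokens, truth_tokens):
    """Token-level F1 where repeated tokens are matched as many times as they co-occur."""
    if not pred_tokens or not truth_tokens:
        # Same convention as compute_f1: agree only if both are empty.
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    prec = overlap / len(pred_tokens)   # precision
    rec = overlap / len(truth_tokens)   # recall
    return 2 * prec * rec / (prec + rec)

# Identical texts with a repeated token ("to") score a full 1.0 here,
# whereas a set-based overlap would count "to" only once.
tokens = ["ask", "jan", "to", "talk", "to", "kai"]
identical_score = token_f1_multiset(tokens, tokens)

# Short prediction fully contained in a longer text: precision 1, recall 3/7.
partial_score = token_f1_multiset(
    ["get", "to", "work"],
    ["jan", "told", "everyone", "to", "get", "to", "work"],
)
print(identical_score, round(partial_score, 3))
```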

In [None]:
%%time

qa_f1 = qa_df.copy()
qa_f1['F1_Score'] = qa_f1.apply(lambda x: compute_f1(x['answer'], x['context']), axis=1)
qa_f1.head()

CPU times: user 111 ms, sys: 0 ns, total: 111 ms
Wall time: 116 ms


Unnamed: 0,context,question,answer,label,F1_Score
0,Cameron decided to have a barbecue and gathere...,How would Others feel as a result?,like attending,1,0.0
1,Jan needed to give out jobs for an upcoming pr...,What will Others want to do next?,disagree with Jan,2,0.142857
2,Remy was an expert fisherman and was on the wa...,What will Remy want to do next?,cast the line,1,0.0
3,Addison gave a hug to Skylar's son when they w...,Why did Addison do this?,better,1,0.0
4,Kai found one for sale online but it was too m...,What does Kai need to do before this?,cheaper,2,0.0


### Testing

Filter rows where label = 1, i.e. dataset says answerA is the correct answer.

In [None]:
qa_f1_label1 = qa_f1[qa_f1["label"].isin(["1"])]
#qa_f1_label1

In [None]:
qa_f1_label1.F1_Score.max()

0.6153846153846154

In [None]:
qa_f1_severe = qa_f1_label1[qa_f1_label1['F1_Score'] == 0.000000]
qa_f1_severe.shape

(274, 5)

In [None]:
qa_f1_bad = qa_f1_label1[qa_f1_label1['F1_Score'] <= 0.3]
qa_f1_bad.shape

(471, 5)

In [None]:
qa_f1_good = qa_f1_label1[qa_f1_label1['F1_Score'] >= 0.6]
qa_f1_good.shape

(1, 5)
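
The three threshold checks above (exactly 0, at most 0.3, at least 0.6) can be folded into one helper. This is only a sketch: `bucket_counts` and its parameter names are hypothetical, and the example scores are made up for illustration. As in the cells above, the "bad" bucket also includes the zero scores.

```python
def bucket_counts(scores, bad_max=0.3, good_min=0.6):
    """Count how many scores fall into the severe / bad / good buckets."""
    scores = list(scores)
    return {
        "severe": sum(1 for s in scores if s == 0),     # no token overlap at all
        "bad": sum(1 for s in scores if s <= bad_max),  # includes the severe rows
        "good": sum(1 for s in scores if s >= good_min),
        "total": len(scores),
    }

# Made-up scores, for illustration only.
example_counts = bucket_counts([0.0, 0.0, 0.142857, 0.5, 0.615])
print(example_counts)
```

With the notebook's data the call would be `bucket_counts(qa_f1_label1["F1_Score"])`.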

### Plotting

In [None]:
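def plot_func(df_clus, scorename):
  """Cluster the UMAP components and score with KMeans, then show a 3D scatter coloured by the score."""
  df_clus = df_clus.copy()  # avoid modifying the caller's DataFrame
  std_scaler = StandardScaler()
  cluster = std_scaler.fit_transform(df_clus.to_numpy())
  # KMeans runs with its default number of clusters (8)
  km = KMeans(random_state = 42, n_init = 10, max_iter=100)
  km.fit(cluster)
  # Cluster labels are stored but the plot is coloured by the score, not by cluster
  df_clus['label'] = km.labels_
  df_clus = df_clus.round(decimals = 5)
  fig = px.scatter_3d(df_clus, x = 'UmapComp1', y = 'UmapComp2', z = scorename, color = df_clus[scorename], height = 800, width = 1000)

  fig.update_layout(dragmode='select',
                    activeselection = dict(fillcolor='yellow'))

  fig.show()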

In [None]:
%%time

vec1 = CountVectorizer(min_df = 5, stop_words = 'english')
word_doc_mat = vec1.fit_transform(qa_f1.answer)

emb_umap1 = UMAP(metric='cosine', verbose=True).fit_transform(word_doc_mat)


UMAP(angular_rp_forest=True, metric='cosine', verbose=True)
Mon Aug 28 11:03:02 2023 Construct fuzzy simplicial set
Mon Aug 28 11:03:02 2023 Finding Nearest Neighbors
Mon Aug 28 11:03:05 2023 Finished Nearest Neighbor Search
Mon Aug 28 11:03:08 2023 Construct embedding


Mon Aug 28 11:03:18 2023 Finished embedding


CPU times: user 19.3 s, sys: 294 ms, total: 19.6 s
Wall time: 17.5 s


In [None]:

df_clus = pd.DataFrame(data = emb_umap1, columns = ['UmapComp1', 'UmapComp2'])
df_clus['F1_Score'] = qa_f1['F1_Score']

df_clus = df_clus.dropna()

plot_func(df_clus, 'F1_Score')

***
# Approach 2: Similarity using spaCy

## Part A: Using context & answer

In [None]:
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

def return_spacysimilarity(text1, text2):
  doc1 = nlp(text1)
  doc2 = nlp(text2)
  return doc1.similarity(doc2)


In [None]:
%%time

qa_cos = qa_df.copy()
qa_cos['Spacy_Similarity'] = qa_cos.apply(lambda x: return_spacysimilarity(x['context'], x['answer']), axis=1)
qa_cos.head()


[W007] The model you're using has no word vectors loaded, so the result of the Doc.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.



CPU times: user 21.9 s, sys: 52 ms, total: 21.9 s
Wall time: 23.8 s


Unnamed: 0,context,question,answer,label,Spacy_Similarity
0,Cameron decided to have a barbecue and gathere...,How would Others feel as a result?,like attending,1,0.259624
1,Jan needed to give out jobs for an upcoming pr...,What will Others want to do next?,disagree with Jan,2,0.510784
2,Remy was an expert fisherman and was on the wa...,What will Remy want to do next?,cast the line,1,0.418299
3,Addison gave a hug to Skylar's son when they w...,Why did Addison do this?,better,1,-0.034321
4,Kai found one for sale online but it was too m...,What does Kai need to do before this?,cheaper,2,0.144344
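
As far as I understand it, `Doc.similarity` defaults to the cosine similarity of the two document vectors, which is why the scores lie between -1 and 1 and can be negative, as in row 3 above. The plain-Python sketch below shows that calculation on made-up vectors; `cosine_similarity` is a hypothetical helper and not spaCy's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an all-zero vector as "no similarity"
    return dot / (norm_u * norm_v)

parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])     # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])    # opposite directions
print(round(parallel, 6), orthogonal, opposite)
```

A negative score therefore only means the two vectors point in broadly opposite directions; with `en_core_web_sm`, which has no word vectors (see the W007 warning above), that may not say much about meaning.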


### Testing

In [None]:
qa_simi_label1 = qa_cos[qa_cos["label"].isin(["1"])]

In [None]:
qa_simi_label1.Spacy_Similarity.max()

0.6697917848625341

In [None]:
qa_simi_label1.Spacy_Similarity.min()

-0.10251983667105387

In [None]:
qa_simi_severe = qa_simi_label1[qa_simi_label1['Spacy_Similarity'] < 0.000000]
qa_simi_severe.shape

(10, 5)

In [None]:
qa_simi_bad = qa_simi_label1[qa_simi_label1['Spacy_Similarity'] <= 0.3]
qa_simi_bad.shape

(229, 5)

In [None]:
qa_simi_good = qa_simi_label1[qa_simi_label1['Spacy_Similarity'] >= 0.6]
qa_simi_good.shape

(13, 5)

### Plotting will be the same as for F1: convert to UMAP -> combine with similarity score -> plot

## Part B: Using question & answer




In [None]:
%%time

qa_cos1 = qa_df.copy()
qa_cos1['Spacy_Similarity'] = qa_cos1.apply(lambda x: return_spacysimilarity(x['question'], x['answer']), axis=1)
qa_cos1.head()





CPU times: user 23.5 s, sys: 62.3 ms, total: 23.6 s
Wall time: 27.7 s


Unnamed: 0,context,question,answer,label,Spacy_Similarity
0,Cameron decided to have a barbecue and gathere...,How would Others feel as a result?,like attending,1,0.127962
1,Jan needed to give out jobs for an upcoming pr...,What will Others want to do next?,disagree with Jan,2,0.2102
2,Remy was an expert fisherman and was on the wa...,What will Remy want to do next?,cast the line,1,0.173748
3,Addison gave a hug to Skylar's son when they w...,Why did Addison do this?,better,1,0.155724
4,Kai found one for sale online but it was too m...,What does Kai need to do before this?,cheaper,2,0.146772


### Testing

In [None]:
qa_cos1_label1 = qa_cos1[qa_cos1["label"].isin(["1"])]

In [None]:
qa_cos1_label1.Spacy_Similarity.max()

0.6552880147520592

In [None]:
qa_cos1_label1.Spacy_Similarity.min()

-0.01571721759074378

In [None]:
qa_cos1_severe = qa_cos1_label1[qa_cos1_label1['Spacy_Similarity'] < 0.000000]
qa_cos1_severe.shape

(4, 5)

In [None]:
qa_cos1_bad = qa_cos1_label1[qa_cos1_label1['Spacy_Similarity'] <= 0.3]
qa_cos1_bad.shape

(330, 5)

In [None]:
qa_cos1_good = qa_cos1_label1[qa_cos1_label1['Spacy_Similarity'] >= 0.6]
qa_cos1_good.shape

(2, 5)

### Plotting will be the same as for F1: convert to UMAP -> combine with similarity score -> plot

***
# Approach 3: Sentence similarity between Question & Answer using a Hugging Face model

Suitable for cases where no context is available.

In [None]:
# Preprocessing specifically for this model

qa_hf = qa_df.copy()

# Each input text should start with "query: " or "passage: "
# So each question is prefixed with "query: "
qa_hf['question'] = "query: " + qa_hf['question']#.astype(str)

# Predictions, i.e. the answer text, are prefixed with "passage: "
qa_hf['answer'] = "passage: " + qa_hf['answer']#.astype(str)
qa_hf.head()

Unnamed: 0,context,question,answer,label
0,Cameron decided to have a barbecue and gathere...,query: How would Others feel as a result?,passage: like attending,1
1,Jan needed to give out jobs for an upcoming pr...,query: What will Others want to do next?,passage: disagree with Jan,2
2,Remy was an expert fisherman and was on the wa...,query: What will Remy want to do next?,passage: cast the line,1
3,Addison gave a hug to Skylar's son when they w...,query: Why did Addison do this?,passage: better,1
4,Kai found one for sale online but it was too m...,query: What does Kai need to do before this?,passage: cheaper,2


In [None]:
# Download model & initialise tokenizer
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-small-v2')
model = AutoModel.from_pretrained('intfloat/e5-small-v2')


In [None]:
%%time

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

def return_similarity(ques_arr, ans_arr):
  # Note: The query/passage pair is repeated because the model was not accepting a single pair of query and passage
  # Each row is sent to the model in this form
  #e.g. input_arr = ['query: When did Beyonce start becoming popular?', 'passage: in the late 1990s', 'query: When did Beyonce start becoming popular?', 'passage: in the late 1990s']
  input_arr = [ques_arr , ans_arr, ques_arr , ans_arr]

  # Tokenize the input texts
  batch_dict = tokenizer(input_arr, max_length=512, padding=True, truncation=True, return_tensors='pt')

  outputs = model(**batch_dict)
  embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

  # Normalize embeddings so that the dot product below is a cosine similarity
  embeddings = F.normalize(embeddings, p=2, dim=1)
  # scores is a 2x2 matrix (rows: query, passage; columns: query, passage)
  # The scores can be converted to a scale of 0 to 100 by multiplying by 100
  scores = (embeddings[:2] @ embeddings[2:].T) #* 100
  sentence_sim = 0.0 # default

  # Entry [1][0] is the passage-vs-query similarity for this row
  if scores is not None:
    sentence_sim = scores.tolist()[1][0]
  return sentence_sim

qa_hf['Sentence_Similarity'] = qa_hf.apply(lambda x: return_similarity(x['question'], x['answer']), axis=1)

qa_hf.head()


CPU times: user 2min 8s, sys: 708 ms, total: 2min 9s
Wall time: 2min 57s


Unnamed: 0,context,question,answer,label,Sentence_Similarity
0,Cameron decided to have a barbecue and gathere...,query: How would Others feel as a result?,passage: like attending,1,0.788937
1,Jan needed to give out jobs for an upcoming pr...,query: What will Others want to do next?,passage: disagree with Jan,2,0.732481
2,Remy was an expert fisherman and was on the wa...,query: What will Remy want to do next?,passage: cast the line,1,0.748365
3,Addison gave a hug to Skylar's son when they w...,query: Why did Addison do this?,passage: better,1,0.734098
4,Kai found one for sale online but it was too m...,query: What does Kai need to do before this?,passage: cheaper,2,0.745258
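
`average_pool` above is a masked mean: padding positions are zeroed out and the sum is divided by the number of real tokens. The plain-Python sketch below shows the same idea for a single sequence, using made-up numbers; `masked_mean_pool` is a hypothetical name and the lists stand in for tensors.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding.

    hidden_states: list of token vectors (each a list of floats)
    attention_mask: list of 0/1 flags, one per token
    """
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    dims = len(hidden_states[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dims)]

# Two real tokens followed by one padding token (made-up values).
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(token_vectors, mask)
print(pooled)
```

The padding vector `[9.0, 9.0]` does not affect the result, which is the point of passing the attention mask into the pooling step.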


### Testing

In [None]:
qa_hf_label1 = qa_hf[qa_hf["label"].isin(["1"])]

In [None]:
qa_hf_label1.Sentence_Similarity.max()

0.8306239247322083

In [None]:
qa_hf_label1.Sentence_Similarity.min()

0.6715688705444336

In [None]:
qa_hf_bad = qa_hf_label1[qa_hf_label1['Sentence_Similarity'] <= 0.7]
qa_hf_bad.shape

(14, 5)

In [None]:
qa_hf_good = qa_hf_label1[qa_hf_label1['Sentence_Similarity'] >= 0.8]
qa_hf_good.shape

(20, 5)

### Plotting will be the same as for F1: convert to UMAP -> combine with similarity score -> plot

***
# Approach 4: Semantic Answer Similarity

Use sentence transformers to generate embeddings, then compare cosine similarity.

In [None]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('distilbert-base-nli-mean-tokens')


In [None]:
%%time

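def get_embeds_similarity(context, pred):
  # Encode both texts, then compare the two embeddings with cosine similarity
  # Note: util.pytorch_cos_sim returns a 1x1 tensor, so the Similarity column holds tensors rather than floats
  sentences = [context, pred]
  embeds = model.encode(sentences)
  return util.pytorch_cos_sim(embeds[0], embeds[1])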


qa_sen = qa_df.copy()
qa_sen['Similarity'] = qa_sen.apply(lambda x: get_embeds_similarity(x['context'], x['answer']), axis=1)
qa_sen.head()

CPU times: user 2min 45s, sys: 658 ms, total: 2min 45s
Wall time: 3min 35s


Unnamed: 0,context,question,answer,label,Similarity
0,Cameron decided to have a barbecue and gathere...,How would Others feel as a result?,like attending,1,[[tensor(0.2013)]]
1,Jan needed to give out jobs for an upcoming pr...,What will Others want to do next?,disagree with Jan,2,[[tensor(0.2243)]]
2,Remy was an expert fisherman and was on the wa...,What will Remy want to do next?,cast the line,1,[[tensor(0.2315)]]
3,Addison gave a hug to Skylar's son when they w...,Why did Addison do this?,better,1,[[tensor(0.0737)]]
4,Kai found one for sale online but it was too m...,What does Kai need to do before this?,cheaper,2,[[tensor(0.2062)]]
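
The Similarity column above holds 1x1 tensors, which makes the values awkward to read and sort. One possible fix, sketched here under the assumption that each cell holds a single-element tensor exposing `.item()`, is to unwrap every cell to a plain float. `to_float` is a hypothetical helper, and `FakeScalarTensor` is only a stand-in so the example runs without PyTorch.

```python
def to_float(value):
    """Return a plain Python float from a single-element tensor-like object or a number."""
    if hasattr(value, "item"):
        return float(value.item())
    return float(value)

class FakeScalarTensor:
    """Stand-in for a single-element tensor, for illustration only."""
    def __init__(self, number):
        self._number = number

    def item(self):
        return self._number

unwrapped = to_float(FakeScalarTensor(0.2013))
already_plain = to_float(0.5)
print(unwrapped, already_plain)
```

Applied to the notebook's frame, this could look like `qa_sen['Similarity'] = qa_sen['Similarity'].map(to_float)`.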


### Testing

In [None]:
qa_sen_label1 = qa_sen[qa_sen["label"].isin(["1"])]

In [None]:
qa_sen_label1.Similarity.max()

tensor([[0.8623]])

In [None]:
qa_sen_label1.Similarity.min()

tensor([[-0.1782]])

In [None]:
qa_sen_severe = qa_sen_label1[qa_sen_label1['Similarity'] < 0.000000]
qa_sen_severe.shape

(21, 5)

In [None]:
qa_sen_bad = qa_sen_label1[qa_sen_label1['Similarity'] <= 0.3]
qa_sen_bad.shape

(221, 5)

In [None]:
qa_sen_good = qa_sen_label1[qa_sen_label1['Similarity'] >= 0.6]
qa_sen_good.shape

(56, 5)

***
Some interesting links:
* https://stackoverflow.com/questions/65199011/is-there-a-way-to-check-similarity-between-two-full-sentences-in-python
* https://memgraph.com/blog/cosine-similarity-python-scikit-learn
* https://stackoverflow.com/questions/37454785/how-to-handle-negative-values-of-cosine-similarities