**Model 1:** https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

**Model 2:** https://huggingface.co/google/flan-t5-xxl

Metric used for evaluation: Semantic Answer Similarity (SAS)

Semantic Answer Similarity (SAS): https://arxiv.org/abs/2108.06130

In [None]:
%pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.98-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.14.0-py3-none-any.whl (224 kB)

In [None]:
from sentence_transformers import SentenceTransformer, util

#Load the model
model = SentenceTransformer('sentence-transformers/multi-qa-distilbert-dot-v1')

26.777278900146484 atoms are the size of a molecule
16.000530242919922 atomic
12.464305877685547 proportional


In [None]:
#------------Sample Implementation for Sentence Similarity -----------------

#Initialize the inputs
#query = "Atoms are the building blocks that come together through chemical bonding to form what in the universe?"
#answers = ["Molecule", "dummy"]
#
#query = "How do atoms come together?"
#answers = ["chemical bonding", "when they are bonded together"]
#
#query = "What is the model of atoms in the universe?"
#answers = ["molecule", "Carbon (black), hydrogen (white), nitrogen (blue), oxygen (red) and sulfur" , "atoms are made of protons and neutrons"]

query = "What size are the atoms in the model of a molecule?"
answers = ["atoms are the size of a molecule","atomic","proportional"]

#Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(answers)

#Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

#Combine docs & scores
doc_score_pairs = list(zip(answers, scores))

#Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

#Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
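
The ranking step above relies on `util.dot_score`, which returns the dot product between the query embedding and each answer embedding. Because a dot product is not length-normalized, the scores are not bounded to [-1, 1], which is why values such as 26.78 appear in the output. A minimal plain-Python sketch of the same computation follows; the helper names (`dot_scores`, `rank_by_score`) and the three-dimensional toy vectors are assumptions made for illustration, not real model embeddings.

```python
# Illustrative re-implementation of a pairwise dot-product score in plain Python.
# The vectors below are made-up toy embeddings, not real model output.

def dot_scores(query_vec, doc_vecs):
    """Return the dot product of query_vec with each vector in doc_vecs."""
    return [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]

def rank_by_score(docs, scores):
    """Pair each document with its score and sort by decreasing score."""
    return sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)

toy_query = [1.0, 2.0, 0.5]
toy_docs = {
    "doc_a": [1.0, 1.0, 0.0],   # 1*1 + 2*1 + 0.5*0 = 3.0
    "doc_b": [0.0, 3.0, 2.0],   # 1*0 + 2*3 + 0.5*2 = 7.0
    "doc_c": [-1.0, 0.0, 4.0],  # 1*-1 + 2*0 + 0.5*4 = 1.0
}

toy_scores = dot_scores(toy_query, list(toy_docs.values()))
toy_ranked = rank_by_score(list(toy_docs.keys()), toy_scores)

# Print highest-scoring document first, in the same "score doc" layout as the cell above
for name, score in toy_ranked:
    print(score, name)
```

With these toy vectors, `doc_b` ranks first, then `doc_a`, then `doc_c`.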

In [None]:
# Connect your personal Google Drive to store the dataset and trained model
from google.colab import drive
drive.mount('/content/gdrive/')

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


In [None]:
# Read the Excel file containing the context, original keyword, generated question and predicted answer
import pandas as pd
source = pd.read_excel("/content/gdrive/My Drive/CS 677 Project/dataset/predicted_answer.xlsx", sheet_name='Sheet_name_1')

In [None]:
# Spot check the data
source.head()

Unnamed: 0.1,Unnamed: 0,0,1,2,3
0,0,Atoms are the building blocks that come togeth...,form molecules,Atoms are the building blocks that come toget...,form molecules
1,1,Atoms are the building blocks that come togeth...,chemical bonding,How do atoms come together?,Chemical bonding
2,2,Atoms are the building blocks that come togeth...,building blocks,What are atoms?,Building blocks
3,3,Atoms are the building blocks that come togeth...,Atoms,What are the building blocks that come togeth...,Atoms
4,4,Atoms are the building blocks that come togeth...,Christian Guthier,Who modified the model of atoms?,Christian Guthier


In [None]:
# Function to score each candidate answer against the question.
# Called with [original keyword, predicted answer] to get one score per answer.

def get_similarity_scores(question, answers):
  # Encode the question and the candidate answers
  query_emb = model.encode(question)
  answer_emb = model.encode(answers)
  # Compute the dot score between the question and every answer embedding
  scores = util.dot_score(query_emb, answer_emb)[0].cpu().tolist()
  return scores

In [None]:
# Function to compare two similarity scores: they match when they differ by at most `threshold`
threshold = 10

def check_similarity(score1, score2):
  if abs(score1 - score2) <= threshold:
    return 'Match'
  else:
    return 'No Match'
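
To get a feel for what a threshold of 10 means on this score scale, the rule can be applied to the three scores printed by the sample cell earlier (26.78, 16.00 and 12.46). The sketch below is a self-contained copy of the rule that returns a boolean; the name `is_match` and the rounded values are assumptions for illustration only. Since dot-product scores are unnormalized, a threshold like this is tied to the model that produced them.

```python
# Self-contained copy of the threshold rule, applied to the three scores
# printed by the sample cell above (rounded to three decimals).
THRESHOLD = 10  # same value as `threshold` in the notebook

def is_match(score1, score2, threshold=THRESHOLD):
    """Return True when the two scores differ by at most `threshold`."""
    return abs(score1 - score2) <= threshold

s_sentence, s_atomic, s_proportional = 26.777, 16.001, 12.464

print(is_match(s_sentence, s_atomic))      # gap of about 10.78, just over the threshold
print(is_match(s_atomic, s_proportional))  # gap of about 3.54, within the threshold
```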

In [None]:
# Spot check the function using an example
similarity_scores = get_similarity_scores('Atoms are the building blocks that come together through chemical bonding to do what in the universe?', 
                 ['form molecules','ten form molecules'])

check_similarity(similarity_scores[0], similarity_scores[1])

In [None]:
# Collect valid QA pairs: keep a row when the original keyword and the
# predicted answer score within `threshold` of each other for the same question

QA_pairs = []
for index, row in source.iterrows():
  context = row[0]
  original_keyword = row[1]
  question = row[2]
  predicted_answer = row[3]
  answers = [original_keyword, predicted_answer]
  similarity_scores = get_similarity_scores(question, answers)
  check = check_similarity(similarity_scores[0], similarity_scores[1])
  if check == 'Match':
    valid_QA_tuple = (context, question, predicted_answer, similarity_scores[1])
    QA_pairs.append(valid_QA_tuple)
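
The filtering logic can also be tried without loading the model by passing the scorer in as a parameter. The following is only a sketch under stated assumptions: `fake_scores` (a word-overlap counter), `collect_valid_pairs` and the two toy rows are hypothetical stand-ins, whereas the notebook itself uses `get_similarity_scores` and the rows of `source`.

```python
# Sketch of the filtering loop with a stand-in scorer so it runs without the model.
# `fake_scores` and the rows below are hypothetical, for illustration only.

def fake_scores(question, answers):
    """Toy scorer: count lowercase words shared by the question and each answer."""
    q_words = set(question.lower().split())
    return [float(len(q_words & set(a.lower().split()))) for a in answers]

def collect_valid_pairs(rows, score_fn, threshold):
    """Keep (context, question, predicted, score) when both answers score within threshold."""
    kept = []
    for context, keyword, question, predicted in rows:
        kw_score, pred_score = score_fn(question, [keyword, predicted])
        if abs(kw_score - pred_score) <= threshold:
            kept.append((context, question, predicted, pred_score))
    return kept

toy_rows = [
    # Both answers share no words with the question: scores 0 and 0, kept
    ("ctx 1", "chemical bonding", "how do atoms bond", "Chemical bonding"),
    # Keyword shares 1 word, prediction shares 5: gap of 4, dropped at threshold 1
    ("ctx 2", "atoms", "what are atoms made of", "what atoms are made of"),
]

toy_valid = collect_valid_pairs(toy_rows, fake_scores, threshold=1)
print(toy_valid)
```

Passing the scorer as an argument keeps the loop testable on small inputs before running it over the full spreadsheet.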

In [None]:
# Spot check valid QA pair
QA_pairs[:2]

[('Atoms are the building blocks that come together through chemical bonding to form molecules in the universe. In this model of a molecule, the atoms of carbon (black), hydrogen (white), nitrogen (blue), oxygen (red), and sulfur (yellow) are in proportional atomic size. The silver rods indicate chemical bonds that hold the atoms together in a specific three-dimensional shape. (credit: modification of work by Christian Guthier).\nThe elements carbon, hydrogen, nitrogen, oxygen, sulfur, and phosphorus are the key building blocks found in all living things. ',
  ' Atoms are the building blocks that come together through chemical bonding to do what in the universe?',
  'form molecules',
  19.808414459228516),
 ('Atoms are the building blocks that come together through chemical bonding to form molecules in the universe. In this model of a molecule, the atoms of carbon (black), hydrogen (white), nitrogen (blue), oxygen (red), and sulfur (yellow) are in proportional atomic size. The silver r

In [None]:
# Select only distinct QA pairs
dist_QA_pairs = list(set(QA_pairs))

# Spot check distinct QA pair
dist_QA_pairs[:2]

[("For example, one gold atom has all the properties of gold, such as being a solid metal at room temperature. A gold coin is simply a very large number of gold atoms molded into the shape of a coin and contains small amounts of other elements known as impurities. We cannot break gold atoms down into anything smaller while still retaining the properties of gold.\nAn atom is composed of two regions. The center of the atom, which is called the nucleus, contains subatomic particles called protons and neutrons. The atom's outermost region holds subatomic particles known as electrons. ",
  ' What type of particles are in the center of an atom?',
  'Protons and neutrons',
  17.009929656982422),
 ("The atom will have no charge because the positive and negative charges cancel each other out.\nMost of an atom's volume, greater than 99 percent, is empty space. With all this empty space, one might ask why solid objects do not just pass through one another. The reason this does not occur is due to
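
Note that `set` does not preserve insertion order, which is why the first two distinct pairs shown here differ from the first two entries of `QA_pairs`. If a stable order matters (for example, to keep the output aligned with the source spreadsheet), one possible alternative is `dict.fromkeys`, sketched below on hypothetical tuples shaped like the ones in `QA_pairs`.

```python
# Order-preserving alternative to list(set(...)); the tuples are hypothetical.
sample_pairs = [
    ("ctx A", "q1", "a1", 17.0),
    ("ctx B", "q2", "a2", 19.8),
    ("ctx A", "q1", "a1", 17.0),  # exact duplicate of the first tuple
]

# dict keys are unique and keep insertion order (Python 3.7+)
sample_distinct = list(dict.fromkeys(sample_pairs))
print(sample_distinct)
```

Both approaches only remove tuples that are identical in every field, including the floating-point score.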

In [None]:
# Write the final output to an Excel file

final_QA_pairs = pd.DataFrame(dist_QA_pairs)
final_QA_pairs.to_excel("/content/gdrive/My Drive/CS 677 Project/dataset/final_QA_pairs.xlsx", sheet_name='Sheet_name_1')