# semantic_similarity.ipynb
**Author:** Khoi Nguyen

**Date created:** 03/06/2023

**Last modified:** 04/15/2023

**Description:** This notebook measures the semantic similarity between pairs of sentences, comparing the completions produced by the `ada`, `1k_ada`, `10k_ada`, and `100k_ada` models against those of the `Curie` model.

In [1]:
import os
import json
import csv
import tqdm
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')


### ada
Comparing the results of the `ada` model with those of the `Curie` model.

In [2]:
# Load the completions from data/ada/model_results.json and data/ada/curie_results.json
# Each file is a JSON object of the form {"sentences": [{"prompt": PROMPT, "completion": COMPLETION}, ...]}
with open('data/ada/model_results.json') as json_file:
    model_results_data = json.load(json_file)
    model_results = []
    for sentence in model_results_data['sentences']:
        model_results.append(sentence['completion'])


with open('data/ada/curie_results.json') as json_file:
    curie_results_data = json.load(json_file)
    curie_results = []
    for sentence in curie_results_data['sentences']:
        curie_results.append(sentence['completion'])

print("Model results: ", model_results)
print("Curie results: ", curie_results)

Model results:  ['Love is the fountain of all life.', 'There are many ways to prevent meningococcal disease.', 'Most land breezes occur in temperate regions.', 'The phrase "degradations can have impact" means that degradations can have an impact on the environment.', 'The Western tanagers are an insectivorous species of tanager. They catch insects in flight.', 'Many benefits come from oil.', 'The slowly damage building materials and furnishings as the mold gradually eats away at them.', 'Tusks are ivory.', 'Cyberpunks come in three different genders. They are hackers, crackers and phreakers.', 'Human nature is not. Human nature is\n\n endowed with curiosity.', '"Telecommunications systems provide the infrastructure for communication of electronic data."', 'Clean air can help people with allergies avoid an array of respiratory problems.', 'Social support can help reduce the risk of stress-related health problems, such as cancerous growths and heart disease.', 'Experience teaches traders
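
Each comparison starts from JSON files of the form `{"sentences": [{"prompt": ..., "completion": ...}, ...]}`, so the loading step could be factored into a small reusable function. The sketch below is one possible shape, not part of the original notebook: `load_completions` is a hypothetical name, and the sample file is synthetic data written to a temporary directory.

```python
import json
import os
import tempfile

def load_completions(path):
    """Return the 'completion' string of every entry under 'sentences' (hypothetical helper)."""
    with open(path, encoding="utf-8") as json_file:
        data = json.load(json_file)
    return [entry["completion"] for entry in data["sentences"]]

# Synthetic example file; the prompts are placeholders made up for illustration
sample = {"sentences": [
    {"prompt": "PROMPT 1", "completion": "Tusks are ivory."},
    {"prompt": "PROMPT 2", "completion": "Many benefits come from oil."},
]}
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "model_results.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    completions = load_completions(sample_path)
```

With a helper like this, the two lists compared in each section could be built by calling it once per results file.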

In [3]:
# Open the CSV once; newline='' avoids blank rows between records on Windows
with open('results/semantic_score_ada.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['model_sentence', 'curie_sentence', 'semantic_score'])

    # Score each (model, Curie) sentence pair
    for model_sentence, curie_sentence in tqdm.tqdm(zip(model_results, curie_results)):
        # Compute an embedding for each sentence of the pair
        embedding_1 = model.encode(model_sentence, convert_to_tensor=True)
        embedding_2 = model.encode(curie_sentence, convert_to_tensor=True)

        # Cosine similarity between the two embeddings (returned as a 1x1 tensor)
        semantic_score = util.pytorch_cos_sim(embedding_1, embedding_2)

        # Extract the score as a plain Python float
        semantic_score = semantic_score.item()

        # Write the pair and its score to the CSV
        writer.writerow([model_sentence, curie_sentence, semantic_score])

1000it [00:19, 50.34it/s]
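
The `semantic_score` column is the cosine similarity between the two sentence embeddings. As a minimal illustration of that measure on plain Python lists (the `cosine_similarity` helper and the toy vectors are hypothetical and stand in for real embeddings; this is not the library's implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors made up for illustration
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Scores lie between -1 and 1: vectors pointing the same way score close to 1, orthogonal vectors score 0, and opposite vectors score -1. For sentence embeddings, a higher score indicates that the two sentences are closer in meaning according to the embedding model.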


### 1k_ada
Comparing the results of the `1k_ada` model with those of the `Curie` model.

### 10k_ada
Comparing the results of the `10k_ada` model with those of the `Curie` model.

### 100k_ada
Comparing the results of the `100k_ada` model with those of the `Curie` model.