# semantic_similarity.ipynb
**Author:** Khoi Nguyen

**Date created:** 03/06/2023

**Last modified:** 05/12/2023

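**Description:** This notebook measures how semantically similar pairs of sentences are. Completions from the `ada`, `1k_ada`, `10k_ada`, and `100k_ada` models are each compared with the corresponding `Curie` completions, and every pair is scored by the cosine similarity of its `all-MiniLM-L6-v2` sentence embeddings.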

In [9]:
import os
import json
import csv
import tqdm
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

### ada
Comparing the results of the `ada` model with those of the `Curie` model.

In [10]:
# Load data from data/ada/model_results.json and data/ada/curie_results.json
# Each file is JSON of the form {"sentences": [{"prompt": PROMPT, "completion": COMPLETION}, ...]}
with open('data/ada/model_results.json') as json_file:
    model_results_data = json.load(json_file)
    model_results = []
    for sentence in model_results_data['sentences']:
        model_results.append(sentence['completion'])


with open('data/ada/curie_results.json') as json_file:
    curie_results_data = json.load(json_file)
    curie_results = []
    for sentence in curie_results_data['sentences']:
        curie_results.append(sentence['completion'])

# Preview only the first 10 entries of each list
print("Model results: ", model_results[:10])
print("Curie results: ", curie_results[:10])
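
The same loading pattern is repeated for every model directory in this notebook. A minimal sketch of a helper that factors it out, assuming the JSON layout described in the cell comment; `load_completions` is a hypothetical name and is not part of the original notebook:

```python
import json


def load_completions(path):
    """Return the 'completion' string of every entry under 'sentences'.

    Hypothetical helper; assumes the file has the layout
    {"sentences": [{"prompt": ..., "completion": ...}, ...]}.
    """
    with open(path, encoding='utf-8') as json_file:
        data = json.load(json_file)
    return [entry['completion'] for entry in data['sentences']]
```

With such a helper, the cell would reduce to `model_results = load_completions('data/ada/model_results.json')` plus the equivalent call for `curie_results.json`.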



In [11]:
# Score each (model, Curie) sentence pair and write one CSV row per pair.
# The file is opened once; newline='' stops the csv module from emitting
# blank rows on Windows.
with open('results/semantic_score_ada.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['model_sentence', 'curie_sentence', 'semantic_score'])

    for model_sentence, curie_sentence in tqdm.tqdm(zip(model_results, curie_results)):
        # Compute an embedding for each sentence in the pair
        embedding_1 = model.encode(model_sentence, convert_to_tensor=True)
        embedding_2 = model.encode(curie_sentence, convert_to_tensor=True)

        # Cosine similarity of the two embeddings, returned as a 1x1 tensor
        semantic_score = util.pytorch_cos_sim(embedding_1, embedding_2)

        # Extract the plain Python float from the tensor
        semantic_score = semantic_score.item()

        writer.writerow([model_sentence, curie_sentence, semantic_score])

1000it [00:25, 38.90it/s]
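
The score written for each pair is the cosine similarity computed by `util.pytorch_cos_sim`: the dot product of the two embeddings divided by the product of their norms, so it lies between -1 and 1. A minimal pure-Python sketch of that formula on toy vectors; the `cosine_similarity` helper is illustrative only, assumes non-zero vectors, and is not how `sentence_transformers` implements it internally:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 2-D vectors standing in for sentence embeddings
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
```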


### 1k_ada
Comparing the results of the `1k_ada` model with those of the `Curie` model.

In [12]:
# Load data from data/1k_ada/model_results.json and data/1k_ada/curie_results.json
# Each file is JSON of the form {"sentences": [{"prompt": PROMPT, "completion": COMPLETION}, ...]}
with open('data/1k_ada/model_results.json') as json_file:
    model_results_data = json.load(json_file)
    model_results = []
    for sentence in model_results_data['sentences']:
        model_results.append(sentence['completion'])


with open('data/1k_ada/curie_results.json') as json_file:
    curie_results_data = json.load(json_file)
    curie_results = []
    for sentence in curie_results_data['sentences']:
        curie_results.append(sentence['completion'])

print("Model results: ", model_results[:10])
print("Curie results: ", curie_results[:10])

Model results:  ['The sentence is saying that artificial teeth are the single most expensive component of a denture.', 'The sentence states that monocots have one cotyledon and dicots have two cotyledons. This is an example of how monocots and dicots differ in their growth and reproduction.', 'Utilities use consumption to acquire and consume materials.', 'The sentence is saying that architecture is an important part of living in the world. Architects are very important because they help to make the world a better place.', 'An elaborate sentence like this is saying that recipes include ingredients that are new to the Everyday Dieter. This could mean that the Everyday Dieter is making their own recipes, or it could mean that the Everyday Dieter is cutting out some ingredients from their regular diets.', 'The sentence means that the meaning of a sentence can involve a relationship with the world.', 'The sentence begins with a definition of eggs. Then it discusses how queens begin laying t

In [13]:
# Score each (model, Curie) sentence pair and write one CSV row per pair.
# The file is opened once; newline='' stops the csv module from emitting
# blank rows on Windows.
with open('results/semantic_score_1k_ada.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['model_sentence', 'curie_sentence', 'semantic_score'])

    for model_sentence, curie_sentence in tqdm.tqdm(zip(model_results, curie_results)):
        # Compute an embedding for each sentence in the pair
        embedding_1 = model.encode(model_sentence, convert_to_tensor=True)
        embedding_2 = model.encode(curie_sentence, convert_to_tensor=True)

        # Cosine similarity of the two embeddings, returned as a 1x1 tensor
        semantic_score = util.pytorch_cos_sim(embedding_1, embedding_2)

        # Extract the plain Python float from the tensor
        semantic_score = semantic_score.item()

        writer.writerow([model_sentence, curie_sentence, semantic_score])

1000it [00:24, 41.29it/s]


### 10k_ada
Comparing the results of the `10k_ada` model with those of the `Curie` model.

In [14]:
# Load data from data/10k_ada/model_results.json and data/10k_ada/curie_results.json
# Each file is JSON of the form {"sentences": [{"prompt": PROMPT, "completion": COMPLETION}, ...]}
with open('data/10k_ada/model_results.json') as json_file:
    model_results_data = json.load(json_file)
    model_results = []
    for sentence in model_results_data['sentences']:
        model_results.append(sentence['completion'])


with open('data/10k_ada/curie_results.json') as json_file:
    curie_results_data = json.load(json_file)
    curie_results = []
    for sentence in curie_results_data['sentences']:
        curie_results.append(sentence['completion'])

print("Model results: ", model_results[:10])
print("Curie results: ", curie_results[:10])

Model results:  ['Saddle soap is used for cleaning, conditioning, and softening leather, and is also used in leather finishing and textiles. It is a solution of soap, water, and a small amount of salt that is mixed with a liquid to create a mild-tasting solution. This solution is wound into a cloth, then brushed or soaked in water to clean and soften the leather. The soap helps to soften the leather and make it more comfortable.', 'The Paramecia regulate water by way of contractile vacuoles, which are vacuoles within the body cavity of the paramecia, to regulate the forces that govern the body and its environment. These contractile vacuoles act to push and pull fluids within the body, allowing the body to move while maintaining a steady state.', 'Fireworks are particularly dangerous when they contain gunpowder. Gunpowder is toxic and can cause serious burns, as well as other effects, such as death and serious injury. Gunpowder can also cause an explosion and damage to buildings, cars, 

In [15]:
# Score each (model, Curie) sentence pair and write one CSV row per pair.
# The file is opened once; newline='' stops the csv module from emitting
# blank rows on Windows.
with open('results/semantic_score_10k_ada.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['model_sentence', 'curie_sentence', 'semantic_score'])

    for model_sentence, curie_sentence in tqdm.tqdm(zip(model_results, curie_results)):
        # Compute an embedding for each sentence in the pair
        embedding_1 = model.encode(model_sentence, convert_to_tensor=True)
        embedding_2 = model.encode(curie_sentence, convert_to_tensor=True)

        # Cosine similarity of the two embeddings, returned as a 1x1 tensor
        semantic_score = util.pytorch_cos_sim(embedding_1, embedding_2)

        # Extract the plain Python float from the tensor
        semantic_score = semantic_score.item()

        writer.writerow([model_sentence, curie_sentence, semantic_score])

1000it [00:25, 38.57it/s]


### 100k_ada
Comparing the results of the `100k_ada` model with those of the `Curie` model.

In [16]:
# Load data from data/100k_ada/model_results.json and data/100k_ada/curie_results.json
# Each file is JSON of the form {"sentences": [{"prompt": PROMPT, "completion": COMPLETION}, ...]}
with open('data/100k_ada/model_results.json') as json_file:
    model_results_data = json.load(json_file)
    model_results = []
    for sentence in model_results_data['sentences']:
        model_results.append(sentence['completion'])


with open('data/100k_ada/curie_results.json') as json_file:
    curie_results_data = json.load(json_file)
    curie_results = []
    for sentence in curie_results_data['sentences']:
        curie_results.append(sentence['completion'])

print("Model results: ", model_results[:10])
print("Curie results: ", curie_results[:10])

FileNotFoundError: [Errno 2] No such file or directory: 'data/100k_ada/model_results.json'
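
The error above means `data/100k_ada/model_results.json` was not present when the cell ran. If the cells are executed in the order shown, `model_results` and `curie_results` would still hold the `10k_ada` lists at this point, so running the next scoring cell anyway would write those scores under the `100k_ada` file name. A small sketch of a pre-flight check that lists the expected input files that are missing before any loading starts; `missing_inputs` is a hypothetical helper, and the directory and file names are taken from the cells above:

```python
import os


def missing_inputs(data_dir, model_names,
                   filenames=('model_results.json', 'curie_results.json')):
    """Return the expected input paths that do not exist on disk."""
    expected = [os.path.join(data_dir, name, filename)
                for name in model_names
                for filename in filenames]
    return [path for path in expected if not os.path.isfile(path)]


# Report every missing input for the four model directories used here
missing = missing_inputs('data', ['ada', '1k_ada', '10k_ada', '100k_ada'])
for path in missing:
    print('Missing input file:', path)
```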

In [None]:
# Score each (model, Curie) sentence pair and write one CSV row per pair.
# The file is opened once; newline='' stops the csv module from emitting
# blank rows on Windows.
with open('results/semantic_score_100k_ada.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['model_sentence', 'curie_sentence', 'semantic_score'])

    for model_sentence, curie_sentence in tqdm.tqdm(zip(model_results, curie_results)):
        # Compute an embedding for each sentence in the pair
        embedding_1 = model.encode(model_sentence, convert_to_tensor=True)
        embedding_2 = model.encode(curie_sentence, convert_to_tensor=True)

        # Cosine similarity of the two embeddings, returned as a 1x1 tensor
        semantic_score = util.pytorch_cos_sim(embedding_1, embedding_2)

        # Extract the plain Python float from the tensor
        semantic_score = semantic_score.item()

        writer.writerow([model_sentence, curie_sentence, semantic_score])
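
The scoring cell is identical in all four sections apart from the output file name. One possible way to share it is sketched below, assuming the scoring step is passed in as a function; `write_scores` and its `score_fn` parameter are hypothetical names introduced for illustration, not part of the original notebook:

```python
import csv


def write_scores(csv_path, model_sentences, curie_sentences, score_fn):
    """Write one row per sentence pair, scored by score_fn(a, b) -> float."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(['model_sentence', 'curie_sentence', 'semantic_score'])
        # zip stops at the shorter list, so unpaired sentences are skipped
        for model_sentence, curie_sentence in zip(model_sentences, curie_sentences):
            score = score_fn(model_sentence, curie_sentence)
            writer.writerow([model_sentence, curie_sentence, score])
```

In this notebook, `score_fn` could be a small function that encodes both sentences with `model.encode` and returns `util.pytorch_cos_sim(...).item()`, with `write_scores` called once per model directory. Passing the scorer as an argument also lets the CSV logic be checked with a stub function, without loading the embedding model.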