## Embedding investigation

I'm curious about embeddings and how they differ between Ollama and OpenAI. The video I watched had an interesting example comparing the distance of "apple" from "orange" vs "apple" from "iphone". I'd like to have a go at building a list of 5 or 6 different words and calculating the distance from each of them to the others with both Ollama and OpenAI embeddings.

I'm thinking maybe a set of faceted graphs, faceted by word1, going through all of the word2s as bar graphs or something, with different colours for OpenAI vs Ollama embeddings.

In [9]:
from langchain_openai import OpenAIEmbeddings
# Needed for the ollama branch below; assumes the langchain-community package is installed
from langchain_community.embeddings import OllamaEmbeddings
from langchain.evaluation import load_evaluator
# from dotenv import load_dotenv
import openai
import os

# Load environment variables. Assumes that project contains .env file with API keys
# load_dotenv()
#---- Set OpenAI API key 
# Change environment variable name from "OPENAI_API_KEY" to the name given in 
# your .env file.
# openai.api_key = os.environ['OPENAI_API_KEY']

# Returns the embedding function for either the OpenAI embeddings or the Ollama embeddings

def get_embedding_function(embeddings_description):
    if embeddings_description == "openai_embeddings":
        return OpenAIEmbeddings()
    if embeddings_description == "ollama_embeddings":
        return OllamaEmbeddings(model="nomic-embed-text")
    # Fail loudly instead of printing and then hitting an unassigned variable
    raise ValueError("please specify either 'openai_embeddings' or 'ollama_embeddings'")


# Gets the embedding distance between two words

def compare_vectors(embeddings_description, word1="apple", word2="iphone", verbose=False):
    embedding_function = get_embedding_function(embeddings_description)

    # Uncomment to inspect the raw embedding for word1 (costs an extra embedding call)
    # vector = embedding_function.embed_query(word1)
    # print(f"Vector for '{word1}': {vector}")
    # print(f"Vector length: {len(vector)}")

    # Compare the vectors of the two words. The chosen embeddings are passed in
    # explicitly; without this the evaluator falls back to its default (OpenAI)
    # embeddings, whatever embeddings_description says.
    evaluator = load_evaluator("pairwise_embedding_distance", embeddings=embedding_function)
    distance = evaluator.evaluate_string_pairs(prediction=word1, prediction_b=word2)
    if verbose:
        print(f"Comparing ({word1}, {word2}): {distance}")
    return distance['score']

In [2]:
#Comparing with openai embeddings
compare_vectors('openai_embeddings','tent', 'house', verbose=True)

Vector for 'tent': [0.0024808947928249836, -0.020736441016197205, 0.013696291483938694, ... (output truncated)

0.16090864573644614
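
For context on what this number means: as far as I can tell, the evaluator's default metric is cosine distance, i.e. 1 minus the cosine similarity of the two embedding vectors. Here is a minimal sketch of that calculation in plain Python, using made-up 3-dimensional vectors rather than real embeddings. The `cosine_distance` helper is my own name, not a LangChain function.

```python
import math

def cosine_distance(vec_a, vec_b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy vectors (hypothetical, not real embeddings)
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # very close to 0
orthogonal = cosine_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])      # 1
```

Under that assumption, 0 means the vectors point the same way, 1 means they are orthogonal, and 2 means they point in opposite directions, so 0.16 for "tent" vs "house" sits towards the "similar" end.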

In [None]:
#Comparing with ollama embeddings
# compare_vectors('ollama_embeddings','tent', 'house', verbose=True)

In [None]:
# OllamaEmbeddings(model="nomic-embed-text")

Function to make a dataframe comparing each word in a list to every word in the list (including itself)

In [10]:

import itertools
import pandas as pd

def make_distance_comparison_df(list_of_words, embeddings_description):
    combinations = list(itertools.product(list_of_words, repeat=2)) # make an iterator to combine all words with each other and themselves
    words_df = pd.DataFrame(combinations, columns=['word1', 'word2']) #turn it into a dataframe
    words_df['embedding_distance']=words_df.apply(lambda row: compare_vectors(embeddings_description, row['word1'], row['word2']), axis = 1) #apply the compare_vectors function to each row
    return words_df
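
One thing worth noting: `make_distance_comparison_df` calls `compare_vectors` once per row, so every word gets re-embedded for each pair it appears in. A possible alternative, sketched below, embeds each word once and computes the distances locally. It assumes the embedding object exposes `embed_documents(list_of_texts)` (LangChain embedding classes do) and that cosine distance is the metric wanted. `pairwise_distance_rows` and `ToyEmbeddings` are hypothetical names, and the toy class is only a stand-in so the sketch runs without an API key.

```python
import itertools
import math

def cosine_distance(vec_a, vec_b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return 1.0 - dot / (norm_a * norm_b)

def pairwise_distance_rows(list_of_words, embedding_function):
    """Embed each word once, then return one row per ordered (word1, word2) pair."""
    embedded = embedding_function.embed_documents(list_of_words)
    vectors = dict(zip(list_of_words, embedded))
    return [
        {
            "word1": w1,
            "word2": w2,
            "embedding_distance": cosine_distance(vectors[w1], vectors[w2]),
        }
        for w1, w2 in itertools.product(list_of_words, repeat=2)
    ]

class ToyEmbeddings:
    """Hypothetical stand-in: maps a word to [length, vowel count, 1.0]."""
    def embed_documents(self, texts):
        return [
            [float(len(t)), float(sum(c in "aeiou" for c in t)), 1.0]
            for t in texts
        ]

rows = pairwise_distance_rows(["apple", "orange", "iphone"], ToyEmbeddings())
# pd.DataFrame(rows) would have the same columns as make_distance_comparison_df's output
```

With a real embedding object the scores may differ slightly from the evaluator's, since some models treat `embed_documents` and `embed_query` inputs differently.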


Creating dataframes from lists

In [11]:
word_list1 = ['shiny','sparkly','glittery','good','great']
words_df1  = make_distance_comparison_df(word_list1, 'openai_embeddings')
word_list2 = ['apple', 'orange', 'iphone', 'call', 'amsterdam', 'netherlands', 'orangutan', 'perplexed', 'allegory']
words_df2  = make_distance_comparison_df(word_list2, 'openai_embeddings')

Making a plotting function      

In [22]:
import plotly.express as px

def plot_embedding_distances(words_df):

    fig = px.scatter(
        words_df, 
        x='word2', 
        y='embedding_distance', 
        color='word2',
        facet_col='word1', 
        facet_col_wrap=3, # 3 max words per row
        facet_row_spacing=0.18,
        title="Embedding Distance Between Words", 
        labels={'embedding_distance': 'Distance', 'word2': 'Compared Word'},
        height=800
    )

    # Remove legend
    fig.update_traces(showlegend=False)

    # Share the x-axis across facets and show its tick labels on every facet, not just the bottom row
    fig.update_xaxes(matches='x', showticklabels=True)

    fig.show()



In [23]:
plot_embedding_distances(words_df1)

In [24]:
plot_embedding_distances(words_df2)


The main takeaway from the above is that the embedding distances have a narrower range than I expected. "Totally unrelated" words don't necessarily have a much higher distance than "practically synonyms". Some words that I would expect to be closely related (e.g. "orangutan" and "orange") are barely closer than "orangutan" and "allegory", in spite of orangutans being orange in colour.
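
To put a number on "less range than I expected", one option is to look only at the pairs where word1 and word2 differ (the self-comparisons are always about 0 and would distort the minimum) and report the closest pair, the farthest pair, and the spread between them. A rough sketch, assuming rows shaped like the tuples `words_df.itertuples(index=False, name=None)` would produce. The `distance_range` helper is hypothetical, and the numbers in `example_rows` are made up for illustration, not results from this notebook.

```python
def distance_range(rows):
    """Summarise distances for rows of (word1, word2, distance), ignoring self-pairs."""
    off_diagonal = [row for row in rows if row[0] != row[1]]
    if not off_diagonal:
        raise ValueError("need at least one pair of distinct words")
    closest = min(off_diagonal, key=lambda row: row[2])
    farthest = max(off_diagonal, key=lambda row: row[2])
    return {
        "closest": closest,
        "farthest": farthest,
        "spread": farthest[2] - closest[2],
    }

# Made-up rows for illustration only (not real embedding distances)
example_rows = [
    ("a", "a", 0.0), ("a", "b", 0.10), ("a", "c", 0.25),
    ("b", "a", 0.10), ("b", "b", 0.0), ("b", "c", 0.20),
]
summary = distance_range(example_rows)
```

Running this on `words_df1` and `words_df2` separately would show whether the near-synonym list really has a smaller spread than the mixed list.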