# Getting hands-on experience with LLMs

It seems plausibly valuable to be able to run LLMs locally on my laptop, or to hook into them for parts of tasks.

Here's a rough idea of what I want to learn in this notebook:

- Learn how to download, install and interact with an LLM (Llama3) hosted locally on my computer, sending it text directly and asking it simple questions
- Learn how to interact with ChatGPT via an API, so I can do more automation and use it in coding projects
- Understand environments better (through the course of debugging all this stuff) -- added post hoc lol - I screwed up my base conda installation when I was trying to follow one of the videos at the start of this process
- Understand embeddings better, and build skills in visualisation to illustrate the distance between different words
- Make a basic RAG that can read a larger document and answer basic questions from the text (from a file in .md or .txt format) - following [this video](https://www.youtube.com/watch?v=tcqEUSNCn8I). Try out different embeddings (Ollama embeddings and OpenAI embeddings)
    - [This video](https://www.youtube.com/watch?v=2TJxpyO3ei4) might help for the local version 
- Extend to be able to read PDFs or arbitrary filetypes, using [this video](https://www.youtube.com/watch?v=2TJxpyO3ei4) then maybe [this video](https://www.youtube.com/watch?v=svzd5d1LXGk) -- or maybe another one entirely. [This documentation](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/pdf/) might also be helpful

[This link](https://github.com/langchain-ai/langchain/issues/14872) might help if I get Chroma read-only issues again.

I have already downloaded llama3 (the 4GB version - the 40GB version is way way too slow, basically doesn't run). Now I want to see if I can interact with it using the `ollama` package.

In [7]:
import ollama
# note -- extremely bizarre that "import ollama" failed
# after a successful-looking "conda install ollama" and required 
# me to "pip install ollama" in order to work??
import os # will need this later
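
A quick way to check which environment the notebook kernel is actually using, and whether it can see a given package, is sketched below. It only uses the standard library, and `describe_package` is a hypothetical helper name. One possible explanation for the conda/pip mismatch (not verified here) is that the conda package called `ollama` provides the Ollama application rather than the Python client.

```python
import importlib.util
import sys


def describe_package(name):
    """Return where `name` would be imported from, or None if it is not importable."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Namespace packages have no single origin file
    return spec.origin or "<namespace package>"


# The interpreter the kernel is running; conda and pip must install into this same environment
print("Interpreter:", sys.executable)

for package in ["textwrap", "ollama"]:
    location = describe_package(package)
    print(package, "->", location if location else "not importable from this environment")
```

If the interpreter path points somewhere other than the environment the package was installed into, that would explain an import failing after a successful-looking install.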

In [8]:
# I noticed that llama tends to print really long lines so I need to scroll sideways. I'm not enjoying that, so I'm making
# a wrapped print function to fix it

import textwrap

def wprint(text, width = 120):
  wrapped_text = textwrap.fill(text, width=width)
  print(wrapped_text)
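
One thing to be aware of: `textwrap.fill` treats the whole string as a single paragraph, so any newlines in the model's reply (e.g. bullet lists) get collapsed into spaces. A possible variant that wraps each line separately is sketched below; `wrap_keep_lines` and `wprint_lines` are hypothetical names, not part of the cell above.

```python
import textwrap


def wrap_keep_lines(text, width=120):
    """Wrap each line of `text` separately so existing line breaks survive."""
    wrapped_lines = []
    for line in text.splitlines():
        # textwrap.fill returns "" for an empty line, which keeps blank lines as blank lines
        wrapped_lines.append(textwrap.fill(line, width=width))
    return "\n".join(wrapped_lines)


def wprint_lines(text, width=120):
    print(wrap_keep_lines(text, width=width))


wprint_lines("Intro sentence.\n\n* first bullet\n* second bullet", width=40)
```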
  

## Interacting with Llama3 (no embedding)
This next part is just me trying to interact with the model directly and feeding it a text file (no embedding etc), to see whether it'll provide sensible responses. 

I've downloaded a transcript of a YouTube video essay about how sound design is used in the recent Batman movie (see `data/personal/batman_sound_video_essay.txt`), and am prompting the model to answer basic questions about it

In [11]:
def llama_read_and_respond(input_file, question, print_prompt_with_data = False):
    with open(input_file,'r') as file:
        data = file.read()


    #debugging statement to confirm file was loaded
    if data:
        print("File loaded successfully")
    else:
        print("Load in a file")


    prompt_01 = f"{data} #### From this text, {question}"

    if print_prompt_with_data:
        wprint("Prompt: "+prompt_01)

    print("Generating a response: ")



    response = ollama.chat(model = 'llama3',
                            messages = [{
                            "role":"user",
                            #    "content":"tell me about a cool species of frog"
                            "content": prompt_01
                        }])

    wprint(response["message"]["content"])



In [12]:
llama_read_and_respond(input_file='data/personal/batman_sound_video_essay.txt', 
                       question = 'tell me about how the sound of rain is used in the movie')


File loaded successfully
Generating a response: 


According to the video, in one scene where gunfire breaks out, the sound of rain suddenly falls away into silence. The
narrator then provides a modified version of the scene where the sound of rain remains audible, and the gunshots
actually sound quieter. This demonstrates how the filmmakers used the sound design to create an impressionistic and
expressionist effect, rather than aiming for realism. The goal is not to accurately represent the sounds of reality but
to create a certain atmosphere or mood that enhances the storytelling.


In [13]:
llama_read_and_respond(input_file='data/personal/batman_sound_video_essay.txt', 
                       question = 'who sponsored the video?')

File loaded successfully
Generating a response: 


According to the text, Nebula, a streaming platform created by and for independent content creators like the speaker, is
sponsoring the video.


Ok, this seems to be working. Now I'd like it to try reading something from my CV, because it seemed to be struggling with that when I was running it from the terminal. I've just changed the extension from .tex to .txt, and I want to see if it can answer basic questions (e.g. about dates of employment). This might be harder for it to do because the file still has all of the LaTeX formatting commands in it.

In [14]:
llama_read_and_respond(input_file='data/personal/Nik_Mitchell_CV_2024_07_21.txt', 
                       question = 'what is the most recent job on that list, and what did I do in that job?')

File loaded successfully
Generating a response: 


The most recent job listed is:  **NZ Royal Commission Inquiry - COVID-19 Lessons Learned**  As a **Principal Data
Analyst**, from May 2024 to July 2024, you:  * Created high-quality visualisations to support the Inquiry * Conducted
analyses and created visualisations to highlight the disparate impact of COVID-19 on Māori and Pacific ethnic groups and
people living in areas of higher socioeconomic deprivation * Worked closely with the Chair of the Commission to discuss
how to tell the story of the COVID pandemic through the above visualisations in a way that draws out lessons for future
pandemics.


This is the correct answer, but it copies what I wrote in my bullet points almost word for word. Next, asking it to be more concise & summarise a bit.

In [15]:
llama_read_and_respond(input_file='data/personal/Nik_Mitchell_CV_2024_07_21.txt', 
                       question = 'what is the most recent job on that list, and what did I do in that job? Please be concise and summarise the responsibilities rather than copying the whole description')

File loaded successfully
Generating a response: 
The most recent job listed is "Principal Data Analyst" at NZ Royal Commission Inquiry - COVID-19 Lessons Learned (May
2024 - July 2024). As a Principal Data Analyst, my responsibilities included:  * Creating high-quality visualizations to
support the inquiry * Conducting analyses and creating visualizations to highlight pandemic trends and disparities in
impact on Māori and Pacific ethnic groups and people living in areas of higher socioeconomic deprivation * Working
closely with the Chair of the Commission to discuss how to tell the story of the COVID-19 pandemic through
visualizations


This has shaved off a few words without changing the meaning.

## Interacting with ChatGPT via API

This feels like it will be useful in a bunch of different projects. I've done some exploration of this (chat completion, embedding, image generation and text-to-speech) in `openai-test.ipynb`

# RAG (Retrieval-Augmented Generation)

Why would we want to create a RAG? The above seemed to work just fine.

I suspect the issue is to do with context windows: the examples above presumably only worked because each whole file fit inside a single prompt. When making a RAG, we first create a database by chunking up all the inputs into manageable-sized pieces (with overlap between chunks) and then using a particular embedding model to encode the meaning of each chunk as a vector. Once we have that, we can embed the input question with the same model, retrieve the few chunks whose vectors are most similar to it (e.g. smallest Euclidean distance), and construct the answer from just that subset of the data.

So retrieval looks like a way around the context window limitation. The LLM needs to know which information to focus on, so having a method for retrieving the most relevant data lets it work much more efficiently with a large amount of data.
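
A toy version of the retrieval step, to make the idea concrete. The three-number "embeddings" and the chunk sentences are made up by hand purely for illustration (a real embedding model produces vectors with hundreds or thousands of dimensions), and `retrieve_top_k` is a hypothetical helper rather than anything from LangChain or Chroma.

```python
import math

# Hand-made stand-ins for chunk embeddings: chunk text -> vector (illustrative values only)
chunk_vectors = {
    "Dorothy lived in Kansas with Uncle Henry.": [0.9, 0.1, 0.0],
    "The Scarecrow wanted some brains.": [0.1, 0.9, 0.1],
    "The Tin Woodman wanted a heart.": [0.0, 0.8, 0.5],
}


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_top_k(query_vector, vectors, k=2):
    """Return the k chunks whose vectors are closest to the query vector."""
    ranked = sorted(vectors, key=lambda chunk: euclidean_distance(query_vector, vectors[chunk]))
    return ranked[:k]


# Pretend this is the embedding of "What does the Scarecrow want?"
query_vector = [0.1, 1.0, 0.0]

context = retrieve_top_k(query_vector, chunk_vectors, k=2)
print(context)
```

Only the retrieved chunks would then be pasted into the prompt, instead of the whole document.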

Now working through [this video](https://www.youtube.com/watch?v=tcqEUSNCn8I) (about how to make a RAG). I'll use OpenAI embeddings here rather than Llama.

Has an associated [git repo](https://github.com/pixegami/langchain-rag-tutorial) - might clone this.

I've grabbed a version of the Wizard of Oz from [Project Gutenberg](https://www.gutenberg.org/ebooks/55).

### Creating the database

First loading in a bunch of packages I'll need


In [3]:
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
# from langchain.embeddings import OpenAIEmbeddings
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
import openai 
# from dotenv import load_dotenv
import os
import shutil


openai.api_key = os.environ['OPENAI_API_KEY']

CHROMA_PATH = "chroma"
DATA_PATH = "data/wizard_of_oz"




Next going to define some functions to use to load the data, split it into chunks, and then turn it into a Chroma database

In [4]:

def load_documents():
    loader = DirectoryLoader(DATA_PATH, glob="*.md")
    documents = loader.load()
    return documents


def split_text(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=300,
        chunk_overlap=100,
        length_function=len,
        add_start_index=True,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

    document = chunks[10]
    print(document.page_content)
    print(document.metadata)

    return chunks


def save_to_chroma(chunks: list[Document]):
    # Clear out the database first.
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH)

    # Create a new DB from the documents.
    db = Chroma.from_documents(
        chunks, OpenAIEmbeddings(), persist_directory=CHROMA_PATH
    )
    db.persist()
    print(f"Saved {len(chunks)} chunks to {CHROMA_PATH}.")
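
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-width splitter. This is only a sketch of the overlap idea: the real `RecursiveCharacterTextSplitter` also tries to break on natural boundaries such as paragraphs, so its chunks will differ. `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text, chunk_size=300, chunk_overlap=100):
    """Split text into fixed-width chunks where each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk["start_index"], chunk["text"])
```

With a chunk size of 10 and an overlap of 4, each chunk starts 6 characters after the previous one, so the last 4 characters of one chunk reappear at the start of the next.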


Running the functions to create the database

In [None]:
documents = load_documents()
chunks = split_text(documents)
save_to_chroma(chunks)

In [5]:
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_community.embeddings.ollama import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
import openai
import os
import shutil

# Set the OpenAI API key from the environment (OPENAI_API_KEY must already be set).
# Not strictly needed for the Ollama embeddings used in this cell.
openai.api_key = os.environ['OPENAI_API_KEY']

CHROMA_PATH = "chroma"
DATA_PATH = "data/wizard_of_oz"




def generate_data_store():
    documents = load_documents()
    chunks = split_text(documents)
    save_to_chroma(chunks)


def load_documents():
    loader = DirectoryLoader(DATA_PATH, glob="*.md")
    documents = loader.load()
    return documents


def split_text(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=300,
        chunk_overlap=100,
        length_function=len,
        add_start_index=True,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

    document = chunks[10]
    print(document.page_content)
    print(document.metadata)

    return chunks


def save_to_chroma(chunks: list[Document]):
    # Clear out the database first.
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH)

    # Create a new DB from the documents.
    db = Chroma.from_documents(
        chunks, OllamaEmbeddings(model="nomic-embed-text"), persist_directory=CHROMA_PATH
    )
    db.persist()
    print(f"Saved {len(chunks)} chunks to {CHROMA_PATH}.")


generate_data_store()

Split 1 documents into 1127 chunks.
Introduction
{'source': 'data/wizard_of_oz/wizard_of_oz.md', 'start_index': 1870}


ValueError: Error raised by inference API HTTP code: 500, {"error":"error loading model /Users/nikmitchell/.ollama/models/blobs/sha256:970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfe"}

In [None]:
db = Chroma(persist_directory=CHROMA_PATH, embedding_function=OllamaEmbeddings(model="nomic-embed-text"))
db.similarity_search_with_relevance_scores("scarecrow needs", k=3)



[(Document(metadata={'source': 'data/wizard_of_oz/wizard_of_oz.md', 'start_index': 37936}, page_content='“It must be inconvenient to be made of flesh,” said the Scarecrow\nthoughtfully, “for you must sleep, and eat and drink. However, you have\nbrains, and it is worth a lot of bother to be able to think properly.”'),
  -248.2220163170763),
 (Document(metadata={'source': 'data/wizard_of_oz/wizard_of_oz.md', 'start_index': 104484}, page_content='“Why should I do this for you?” asked the Lady.\n\n“Because you are wise and powerful, and no one else can help me,”\nanswered the Scarecrow.'),
  -249.9570692234033),
 (Document(metadata={'source': 'data/wizard_of_oz/wizard_of_oz.md', 'start_index': 78361}, page_content='“Nothing that I know of,” answered the Woodman; but the Scarecrow, who\nhad been trying to think, but could not because his head was stuffed\nwith straw, said, quickly, “Oh, yes; you can save our friend, the\nCowardly Lion, who is asleep in the poppy bed.”'),
  -257.101614165088

## Planning investigation

I'm kinda curious to try to build something a bit more flexible here, and use that to investigate a few questions:
- Does it make a difference if you use OpenAIEmbeddings() or OllamaEmbeddings()?
- Can I build several different Chroma databases with different embeddings for different datasets?
    - Wizard of Oz
    - Alice in Wonderland
    - My personal files (CV, Batman video essay)
        - Does it matter if I mash these together into a single database, even though they're about totally different things?
- Do you get better performance with bigger chunks?
- Can I extend this to read PDF files?

I'm a bit worried about doing this if it's not on the mainline to being able to do AI safety work, but I also think that just being curious and following my nose and making functions to output different things and label files and folders appropriately in python etc is going to be valuable.



In [6]:
def get_chroma_path(data_description, embeddings_description):
    CHROMA_PATH=os.path.join("chroma",data_description, embeddings_description)
    return CHROMA_PATH

def get_data_path(data_description):
    DATA_PATH =os.path.join("data",data_description)
    return DATA_PATH

def get_embedding_function(embeddings_description):
    # Map a short description string to the matching LangChain embeddings object
    if embeddings_description == "openai_embeddings":
        embedding_function = OpenAIEmbeddings()
    elif embeddings_description == "ollama_embeddings":
        embedding_function = OllamaEmbeddings(model="nomic-embed-text")
    else:
        # Fail loudly rather than trying to return an undefined variable
        raise ValueError("please specify either 'openai_embeddings' or 'ollama_embeddings'")
    return embedding_function



def generate_data_store(data_description, embeddings_description):

    CHROMA_PATH= get_chroma_path(data_description, embeddings_description)
    DATA_PATH =  get_data_path(data_description)

    print(f"Data source: {data_description}, Embeddings: {embeddings_description}")

    # print(f"CHROMA_PATH is {CHROMA_PATH}")
    # print(f"DATA_PATH is {DATA_PATH}")
    
    


    documents = load_documents(data_path=DATA_PATH)
    chunks = split_text(documents)
    save_to_chroma(chunks, get_embedding_function(embeddings_description), chroma_path= CHROMA_PATH)


def load_documents(data_path):
    loader = DirectoryLoader(data_path, glob="*.md")
    documents = loader.load()
    return documents


def split_text(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=300,
        chunk_overlap=100,
        length_function=len,
        add_start_index=True,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks.")
    ## print test example
    # document = chunks[10]
    # print(document.page_content)
    # print(document.metadata)

    return chunks


def save_to_chroma(chunks: list[Document], embedding_function, chroma_path):
    # Clear out the database first.
    if os.path.exists(chroma_path):
        shutil.rmtree(chroma_path)

    # Create a new DB from the documents.
    db = Chroma.from_documents(
        chunks, embedding_function, persist_directory=chroma_path
    )
    db.persist()
    print(f"Saved {len(chunks)} chunks to {chroma_path}.")


# generate_data_store(data_description       = data_descriptions[0],
#                     embeddings_description = embeddings_descriptions[1])

In [None]:
import itertools
# data_descriptions = ["wizard_of_oz","alice_in_wonderland","personal"] ## Commenting out personal for now because it doesn't use markdown files
data_descriptions = ["wizard_of_oz","alice_in_wonderland"]
embeddings_descriptions = ["openai_embeddings","ollama_embeddings"]

for data_description, embeddings_description in itertools.product(data_descriptions, embeddings_descriptions):
    generate_data_store(data_description, embeddings_description)

Data source: wizard_of_oz, Embeddings: openai_embeddings
Split 1 documents into 1127 chunks.
Saved 1127 chunks to chroma/wizard_of_oz/openai_embeddings.
Data source: wizard_of_oz, Embeddings: ollama_embeddings
Split 1 documents into 1127 chunks.
Saved 1127 chunks to chroma/wizard_of_oz/ollama_embeddings.
Data source: alice_in_wonderland, Embeddings: openai_embeddings
Split 1 documents into 801 chunks.
Saved 801 chunks to chroma/alice_in_wonderland/openai_embeddings.
Data source: alice_in_wonderland, Embeddings: ollama_embeddings
Split 1 documents into 801 chunks.
Saved 801 chunks to chroma/alice_in_wonderland/ollama_embeddings.


Yay, that works. This is exciting. I should get the question-asking part up and running soon too.


### Answering questions

Initially we just have the code to do the OpenAI embeddings and chat to OpenAI. Is it possible to use the OpenAI embeddings and generate the response with Llama3? My guess is yes, but also that the quality of the answers will depend primarily on the quality of the embeddings, since the model can't answer correctly if the correct information isn't retrieved.
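
Mixing the two should work in principle, because retrieval and generation are separate steps: the embeddings only decide which chunks end up in the prompt, and any chat model can then read that prompt. A minimal sketch, assuming the retrieved chunks are already available as plain strings and that the `llama3` model is reachable through a running Ollama server. `build_prompt` and `answer_with_llama` are hypothetical helpers; the prompt template is copied from the tutorial's query script and the `ollama.chat` call mirrors the one used earlier in this notebook.

```python
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""


def build_prompt(chunks, question):
    """Join the retrieved chunks and slot them into the prompt template."""
    context_text = "\n\n---\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context_text, question=question)


def answer_with_llama(chunks, question, model="llama3"):
    """Send the retrieval-augmented prompt to a local model (needs Ollama running)."""
    import ollama  # imported here so build_prompt can be tried without Ollama installed

    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": build_prompt(chunks, question)}],
    )
    return response["message"]["content"]


# Stand-in chunks; in practice these would come from the Chroma similarity search
example_chunks = ["The Scarecrow wanted brains.", "The Tin Woodman wanted a heart."]
print(build_prompt(example_chunks, "What does the Scarecrow want?"))
```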

In [None]:
## commented version of query_data.py

# import argparse
# # from dataclasses import dataclass
# from langchain_community.vectorstores import Chroma
# from langchain_openai import OpenAIEmbeddings
# from langchain_openai import ChatOpenAI
# from langchain.prompts import ChatPromptTemplate

# CHROMA_PATH = "chroma"

# PROMPT_TEMPLATE = """
# Answer the question based only on the following context:

# {context}

# ---

# Answer the question based on the above context: {question}
# """


# def main():
#     # Create CLI.
#     parser = argparse.ArgumentParser()
#     parser.add_argument("query_text", type=str, help="The query text.")
#     args = parser.parse_args()
#     query_text = args.query_text

#     # Prepare the DB.
#     embedding_function = OpenAIEmbeddings()
#     db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function)

#     # Search the DB.
#     results = db.similarity_search_with_relevance_scores(query_text, k=3)
#     if len(results) == 0 or results[0][1] < 0.7:
#         print(f"Unable to find matching results.")
#         return

#     context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
#     prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
#     prompt = prompt_template.format(context=context_text, question=query_text)
#     print(prompt)

#     model = ChatOpenAI()
#     response_text = model.predict(prompt)

#     sources = [doc.metadata.get("source", None) for doc, _score in results]
#     formatted_response = f"Response: {response_text}\nSources: {sources}"
#     print(formatted_response)


# if __name__ == "__main__":
#     main()


#### Getting libraries

In [7]:
import argparse
# from dataclasses import dataclass
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate


PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""



In [None]:
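def get_query_list(data_description):
    # Return the test question for a given dataset
    if data_description=="wizard_of_oz":
        query_text = "How does Dorothy get back home?"
    elif data_description=="alice_in_wonderland":
        query_text = "How does Alice end up in Wonderland?"
    else:
        # Fail loudly rather than trying to return an undefined variable
        raise ValueError(f"no test question defined for '{data_description}'")
    return query_text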

def answer_query_from_database(data_description, embeddings_description, query_text):
    # don't need a CLI any more
    # parser = argparse.ArgumentParser()
    # parser.add_argument("query_text", type=str, help="The query text.")
    # args = parser.parse_args()
    # query_text = args.query_text

    # Prepare the DB.
    # embedding_function = OpenAIEmbeddings()
    chroma_path = get_chroma_path(data_description, embeddings_description)
    embedding_function=get_embedding_function(embeddings_description)

    db = Chroma(persist_directory=chroma_path, 
                embedding_function=embedding_function)
    
    print(f"loading in the chroma database from {chroma_path},  using the {embedding_function} embedding function")

    # Search the DB. A top relevance score below 0.7 is treated as "no match".
    results = db.similarity_search_with_relevance_scores(query_text, k=3)
    if len(results) == 0 or results[0][1] < 0.7:
        print("Unable to find matching results.")
        return

    context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
    prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
    prompt = prompt_template.format(context=context_text, question=query_text)
    print(prompt)

    model = ChatOpenAI()
    # invoke() returns a message object; .content holds the reply text
    response_text = model.invoke(prompt).content

    sources = [doc.metadata.get("source", None) for doc, _score in results]
    formatted_response = f"Response: {response_text}\nSources: {sources}"
    print(formatted_response)


#### Testing openai embeddings

In [None]:
answer_query_from_database(data_description="wizard_of_oz", 
                           embeddings_description="openai_embeddings",
                           query_text = get_query_list("wizard_of_oz"))

loading in the chroma database from chroma/wizard_of_oz/openai_embeddings,  using the client=<openai.resources.embeddings.Embeddings object at 0x13c242870> async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x1485fd340> model='text-embedding-ada-002' dimensions=None deployment='text-embedding-ada-002' openai_api_version='' openai_api_base=None openai_api_type='' openai_proxy='' embedding_ctx_length=8191 openai_api_key=SecretStr('**********') openai_organization=None allowed_special=None disallowed_special=None chunk_size=1000 max_retries=2 request_timeout=None headers=None tiktoken_enabled=True tiktoken_model_name=None show_progress_bar=False model_kwargs={} skip_empty=False default_headers=None default_query=None retry_min_seconds=4 retry_max_seconds=20 http_client=None http_async_client=None check_embedding_ctx_length=True embedding function
Human: 
Answer the question based only on the following context:

“How can I get to her castle?” asked Dorothy.

---

“How can 

In [1]:
answer_query_from_database(data_description="alice_in_wonderland", 
                           embeddings_description="openai_embeddings",
                           query_text = get_query_list("alice_in_wonderland"))

NameError: name 'answer_query_from_database' is not defined

In [66]:
answer_query_from_database(data_description="wizard_of_oz", 
                           embeddings_description="ollama_embeddings", 
                           query_text = get_query_list("wizard_of_oz")
                           )

loading in the chroma database from chroma/wizard_of_oz/ollama_embeddings,  using the base_url='http://localhost:11434' model='nomic-embed-text' embed_instruction='passage: ' query_instruction='query: ' mirostat=None mirostat_eta=None mirostat_tau=None num_ctx=None num_gpu=None num_thread=None repeat_last_n=None repeat_penalty=None temperature=None stop=None tfs_z=None top_k=None top_p=None show_progress=False headers=None model_kwargs=None embedding function
Unable to find matching results.




 <span style="color: red; font-weight: bold;">The ollama embeddings don't seem to be working at all - getting these negative relevance scores, which makes me feel like I haven't done the original embeddings for the database properly (or I'm not embedding the query properly)</span>
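
One possible explanation for the negative scores (an assumption, not verified in this notebook): relevance scores are derived from a raw distance with a formula that only stays between 0 and 1 when the embedding vectors have length 1. If one embedding model returns unit-length vectors and the other doesn't, the same formula would give sensible scores for one and large negative scores for the other. The sketch below uses the conversion `1 - distance / sqrt(2)` as an illustrative example of such a formula; the vectors are made up, and `relevance_from_distance` is a hypothetical helper.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalise(vector):
    """Scale a vector to length 1 without changing its direction."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def relevance_from_distance(distance):
    """Illustrative conversion: 1 for identical unit vectors, 0 for perpendicular ones."""
    return 1.0 - distance / math.sqrt(2)


# Made-up vectors pointing in fairly similar directions but with large magnitudes
query = [30.0, 40.0]
chunk = [48.0, 14.0]

raw_score = relevance_from_distance(euclidean_distance(query, chunk))
unit_score = relevance_from_distance(euclidean_distance(normalise(query), normalise(chunk)))

print(f"score from raw vectors:         {raw_score:.3f}")
print(f"score from unit-length vectors: {unit_score:.3f}")
```

If this is what's going on, checking the length of a vector returned by `embed_query` for each embedding model would be a quick way to confirm or rule it out.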

## Embedding investigation

I'm also curious now about the embeddings, and how they work for Ollama versus OpenAI. The compare_embeddings.py file has an interesting example comparing the distance of "apple" from "orange" vs "apple" from "iphone". I'd kind of like to have a go at building a list of 5 or 6 different words and calculating the distance from each of them to the others with Ollama and OpenAI.

I'm thinking maybe a set of faceted graphs, faceted by word1, and going through all of the word2s and doing bar graphs or something, with different colours for OpenAI vs Ollama embeddings.
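
Before calling any embedding API, the pairwise comparison can be prototyped on made-up vectors. The sketch below computes cosine distance (1 minus cosine similarity) for every ordered pair in a small word list. The vectors are invented for illustration only and `pairwise_distances` is a hypothetical helper; real vectors would come from an embedding model.

```python
import itertools
import math

# Invented 3-dimensional "embeddings", for illustration only
toy_vectors = {
    "apple": [1.0, 0.2, 0.0],
    "orange": [0.9, 0.3, 0.1],
    "iphone": [0.3, 0.1, 1.0],
}


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for perpendicular vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pairwise_distances(vectors):
    """Return one row per ordered (word1, word2) pair, including each word with itself."""
    rows = []
    for word1, word2 in itertools.product(vectors, repeat=2):
        rows.append({
            "word1": word1,
            "word2": word2,
            "embedding_distance": cosine_distance(vectors[word1], vectors[word2]),
        })
    return rows


for row in pairwise_distances(toy_vectors):
    print(f"{row['word1']:>7} vs {row['word2']:<7} {row['embedding_distance']:.3f}")
```

The list of rows has the same word1 / word2 / embedding_distance shape as the dataframe built further down, so it could be passed to `pd.DataFrame` and plotted the same way.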

In [None]:
## original code from compare_embeddings.py
# from langchain_openai import OpenAIEmbeddings
# from langchain.evaluation import load_evaluator
# from dotenv import load_dotenv
# import openai
# import os

# # Load environment variables. Assumes that project contains .env file with API keys
# load_dotenv()
# #---- Set OpenAI API key 
# # Change environment variable name from "OPENAI_API_KEY" to the name given in 
# # your .env file.
# openai.api_key = os.environ['OPENAI_API_KEY']

# def main():
#     # Get embedding for a word.
#     embedding_function = OpenAIEmbeddings()
#     vector = embedding_function.embed_query("apple")
#     print(f"Vector for 'apple': {vector}")
#     print(f"Vector length: {len(vector)}")

#     # Compare vector of two words
#     evaluator = load_evaluator("pairwise_embedding_distance")
#     words = ("apple", "iphone")
#     x = evaluator.evaluate_string_pairs(prediction=words[0], prediction_b=words[1])
#     print(f"Comparing ({words[0]}, {words[1]}): {x}")


# if __name__ == "__main__":
#     main()


In [69]:
from langchain_openai import OpenAIEmbeddings
from langchain.evaluation import load_evaluator
# from dotenv import load_dotenv
import openai
import os

# Load environment variables. Assumes that project contains .env file with API keys
# load_dotenv()
#---- Set OpenAI API key 
# Change environment variable name from "OPENAI_API_KEY" to the name given in 
# your .env file.
# openai.api_key = os.environ['OPENAI_API_KEY']

def compare_vectors(embeddings_description, word1 = "apple", word2='iphone', verbose = False):
    # Pick the embedding model to compare with ("openai_embeddings" or "ollama_embeddings")
    embedding_function = get_embedding_function(embeddings_description)

    # Compare the vectors of the two words. The embedding function has to be passed to the
    # evaluator explicitly, otherwise the evaluator falls back to its default (OpenAI) embeddings
    evaluator = load_evaluator("pairwise_embedding_distance", embeddings=embedding_function)
    words = (word1, word2)
    distance = evaluator.evaluate_string_pairs(prediction=words[0], prediction_b=words[1])
    if verbose:
        print(f"Comparing ({words[0]}, {words[1]}): {distance}")
    return distance['score']
    

In [70]:
compare_vectors('openai_embeddings','tent', 'house', verbose=True)

Comparing (tent, house): {'score': 0.16090864573644614}


0.16090864573644614

In [71]:
compare_vectors('ollama_embeddings','tent', 'house', verbose=True)

ValueError: Error raised by inference API HTTP code: 500, {"error":"error loading model /Users/nikmitchell/.ollama/models/blobs/sha256:970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfe"}

In [None]:
OllamaEmbeddings(model="nomic-embed-text")

In [61]:
import itertools
import pandas as pd

# word_list = ['apple', 'orange', 'iphone', 'call', 'amsterdam', 'netherlands', 'orangutan', 'perplexed', 'allegory']


def make_distance_comparison_df(list_of_words, embeddings_description = "openai_embeddings"):
    # Default of "openai_embeddings" is an assumption (it's the one that ran successfully above)
    combinations = list(itertools.product(list_of_words, repeat=2)) # pair every word with every word, including itself
    words_df = pd.DataFrame(combinations, columns=['word1', 'word2']) # turn it into a dataframe
    # apply compare_vectors to each row; the embeddings description has to be the first argument
    words_df['embedding_distance'] = words_df.apply(
        lambda row: compare_vectors(embeddings_description, row['word1'], row['word2']), axis = 1)
    return words_df


In [62]:
word_list1 = ['shiny','sparkly','glittery','good','great']
words_df1  = make_distance_comparison_df(word_list1)
word_list2 = ['apple', 'orange', 'iphone', 'call', 'amsterdam', 'netherlands', 'orangutan', 'perplexed', 'allegory']
words_df2  = make_distance_comparison_df(word_list2)

In [66]:
import plotly.express as px

def plot_embedding_distances(words_df):

    fig = px.scatter(
        words_df, 
        x='word2', 
        y='embedding_distance', 
        color='word2',
        facet_col='word1', 
        facet_col_wrap=3, # 3 max words per row
        facet_row_spacing = 0.1,
        title="Embedding Distance Between Words", 
        labels={'embedding_distance': 'Distance', 'word2': 'Compared Word'},
        height=800
    )

    # Share the x-axis across facets and show the tick labels on every panel
    fig.update_xaxes(matches='x', showticklabels=True)

    fig.show()



In [67]:
plot_embedding_distances(words_df1)

In [65]:
plot_embedding_distances(words_df2)

The main takeaway from the above is that the embedding distances span a narrower range than I expected. "Totally unrelated" words don't necessarily have a much higher distance than "practically synonyms". Some words that I would expect to be really close (e.g. "orangutan" and "orange") are barely closer than "orangutan" and "allegory", in spite of orangutans being orange in colour.