# Getting hands-on experience with LLMs

It seems plausibly valuable to be able to run LLMs locally on my laptop, or to hook into them for parts of tasks.

This might be a basic project that I could write up on a blog or something.


Here's the rough idea I had, based on watching various YouTube videos. I'll need to assess whether it's actually going to be useful.

- Download the latest Llama model and see if I can interact with it using Ollama through an interface in the terminal
- Try to use it (or maybe the OpenAI embeddings) to make a basic RAG that can read a document (a file in .txt format) and answer basic questions about the text, following [this video](https://www.youtube.com/watch?v=tcqEUSNCn8I)
- Extend it to read PDFs or arbitrary filetypes, using [this video](https://www.youtube.com/watch?v=2TJxpyO3ei4) then maybe [this video](https://www.youtube.com/watch?v=svzd5d1LXGk)


I initially tried to follow [this video](https://www.youtube.com/watch?v=ztBJqzBU5kc), but I ran into a lot of issues with package installation. I screwed up my base conda installation, so I needed to reset it and make a new environment (which I titled 'llama') for this project.

I have already downloaded llama3 (the 4GB version; the 40GB version is way, way too slow and basically doesn't run). Now I want to see if I can interact with it using the `ollama` Python package.

In [22]:
import ollama
# note -- extremely bizarre that "import ollama" failed
# after a successful-looking "conda install ollama" and required 
# me to "pip install ollama" in order to work??
import os # will need this later

In [2]:
# I noticed that llama tends to print really long lines so I need to scroll sideways. I'm not enjoying that, so I'm making
# a wrapped print function to fix it

import textwrap

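def wprint(text, width=120):
    # textwrap.fill treats the text as one paragraph, so newlines
    # (e.g. between bullet points) are collapsed into spaces.
    wrapped_text = textwrap.fill(text, width=width)
    print(wrapped_text)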
  

## Interacting with Llama3 (no embedding)
This next part is just me trying to interact with the model directly and feeding it a text file (no embedding etc), to see whether it'll provide sensible responses. 

In [4]:

with open('data/personal/batman_sound_video_essay.txt','r') as file:
    data = file.read()


#debugging statement to confirm file was loaded
if data:
    print("File loaded successfully")
else:
    print("Load in a file")

File loaded successfully


In [5]:
prompt_01 = f"{data} #### From this text, tell me about how the sound of rain is used in the movie"

wprint("<agent_01> Prompt: "+prompt_01)

print("<agent_01> Generating a response: ")


<agent_01> Prompt: Why The Batman's Sound is Different - YouTube https://www.youtube.com/watch?v=_AQkQ4a1yJ8
Transcript: (00:01) listen to this this otherworldly sound happens to be a justin bieber song slowed down 800 percent
and it's this contorted version of the song that was responsible in part for inspiring what would eventually become the
sound of the batmobile in the latest batman film [Music] this video is sponsored by curiositystream watch the exclusive
companion video on nebula when you sign up for the curiosity stream nebula bundle today watch this clip on mute a truck
passes in front of the frame and imagine with me for a moment what sound (00:48) you think will accompany that truck got
it okay here it is i must choose my targets carefully instead of a truck we hear a train but there's no visual sign of a
train in this shot or any of the shots in close proximity the film establishes trains as part of the setting but when
the sound plays we don't see a train the sound of the

In [6]:

response = ollama.chat(
    model='llama3',
    messages=[{
        "role": "user",
        # "content": "tell me about a cool species of frog"
        "content": prompt_01,
    }],
)

wprint(response["message"]["content"])
# print(response)
# print(response["message"])

According to the video transcript, the sound of rain is used in the movie "The Batman" to create a sense of realism and
atmosphere. The narrator explains that when the rain starts falling during a scene, it's initially audible, but then
suddenly stops once gunfire breaks out. To demonstrate this, the narrator provides a modified version of the scene where
the sound of the rain remains audible, even after the gunshots start.  This technique is used to create an
impressionistic and expressionist sound design, rather than trying to accurately represent reality. The goal is not to
make the sound realistic, but to create a certain feeling or atmosphere that enhances the emotional impact of the scene.
By leaving the sound of the rain at its original volume while muting the gunfire, the filmmakers are able to achieve
this effect and draw the audience's attention to the contrast between the peaceful rain and the intense action.


The above feels like a pretty great answer actually.

### Function-ising and further checks

I want this to be a bit more re-usable within a function, where I can just specify the data to read from, and the question I want it to answer.

In [7]:
def llama_read_and_respond(input_file, question, print_prompt_with_data=False):
    """Read a text file, append the question, and print llama3's wrapped answer."""
    with open(input_file, 'r') as file:
        data = file.read()

    # Debugging statement to confirm the file was loaded (an empty file counts as a failure)
    if data:
        print("File loaded successfully")
    else:
        print("Load in a file")

    # The whole file goes into the prompt; "####" separates it from the question
    prompt_01 = f"{data} #### From this text, {question}"

    if print_prompt_with_data:
        wprint("Prompt: " + prompt_01)

    print("Generating a response: ")

    response = ollama.chat(
        model='llama3',
        messages=[{"role": "user", "content": prompt_01}],
    )

    wprint(response["message"]["content"])



In [8]:
llama_read_and_respond(input_file='data/personal/batman_sound_video_essay.txt', 
                       question = 'tell me about how the sound of rain is used in the movie')


File loaded successfully
Generating a response: 
According to the video, in one scene where gunfire breaks out, the sound of rain initially audible suddenly falls away
into silence. The narrator suggests that this is not because the rain actually stopped, but rather because the sound
designers intentionally muted the sound effect to create a sense of sudden chaos and emphasis on the gunshots.  The
narrator also provides a modified version of the scene where the sound of rain remains audible, even after the gunfire
starts. This altered version makes the gunshots sound quieter than they would if the rain were actually present in
reality. This example illustrates how the sound design in The Batman is not necessarily trying to accurately represent
the real world, but rather create a subjective and impressionistic experience for the audience.


In [9]:
llama_read_and_respond(input_file='data/personal/batman_sound_video_essay.txt', 
                       question = 'who sponsored the video?')

File loaded successfully
Generating a response: 


The video was sponsored by Curiosity Stream.


Ok, this seems to be working. Now I'd like it to try reading something from my CV, because previously it seemed to be struggling with that. I've just changed the extension from .tex to .txt, and I want to see if it can answer basic questions (e.g. about dates of employment). This might be harder for it to do because the file still has all of the LaTeX formatting in there.

In [10]:
llama_read_and_respond(input_file='data/personal/Nik_Mitchell_CV_2024_07_21.txt', 
                       question = 'what is the most recent job on that list, and what did I do in that job?')

File loaded successfully
Generating a response: 


Based on the provided LaTeX code, the most recent job listed is:  **NZ Royal Commission Inquiry - COVID-19 Lessons
Learned**  As a **Principal Data Analyst**, you worked from May 2024 to July 2024. In this role, you:  * Created high-
quality visualisations to support the Inquiry * Conducted analyses and created visualisations to highlight the disparate
impact of COVID-19 on Māori and Pacific ethnic groups and people living in areas of higher socioeconomic deprivation *
Worked closely with the Chair of the Commission to discuss how to tell the story of the COVID pandemic through the above
visualisations in a way that draws out lessons for future pandemics.


Interesting, this is the correct answer, and way better than what I got when I tried to just run ollama in the terminal. But also, it seems to really directly copy-paste exactly what I wrote in my bullet points here (possibly because it was in bullet point form?). Next, asking it to be more concise & summarise a bit.

In [11]:
llama_read_and_respond(input_file='data/personal/Nik_Mitchell_CV_2024_07_21.txt', 
                       question = 'what is the most recent job on that list, and what did I do in that job? Please be concise and summarise the responsibilities rather than copying the whole description')

File loaded successfully
Generating a response: 


The most recent job listed is "Principal Data Analyst" at NZ Royal Commission Inquiry - COVID-19 Lessons Learned (May
2024 - July 2024).  In this role, my main responsibilities included:  * Creating high-quality visualizations to support
the inquiry * Conducting analyses and creating visualizations to highlight pandemic trends in New Zealand and other
countries * Working closely with the Chair of the Commission to discuss how to tell the story of the COVID pandemic
through visualizations.


That seemed to work. I'm still curious about this bullet point thing though.

#### Bullet point formatting investigation

I have a hypothesis that this being in bullet points convinced llama to repeat it more exactly. I have a subquestion in here about whether it's specifically all the TeX formatting, or whether bullet points in normal text would have the same effect.

To investigate this, I'm going to present llama3 with three different versions of that one section of my CV:
- One with all the LaTeX formatting (baseline)
- One with the LaTeX formatting removed, but still in bullet point format
- One where I've rewritten the information in sentence format

and see whether they produce different summaries.

In [47]:
# TeX formatting only
llama_read_and_respond(input_file='data/personal/Nik_Mitchell_CV_principal_only_tex_formatting.txt', 
                       question = 'what is the title of the job on that list, and what did I do in that job?',
                       print_prompt_with_data = True)

File loaded successfully
Prompt: \subsection*{Experience}  \begin{itemize}   \parskip=0.1em        \item   %  \headerrow   {\textbf{NZ Royal
Commission Inquiry - COVID-19 Lessons Learned}}   %  {\textbf{Wellington, NZ}}   \\   \headerrow   {\emph{Principal Data
Analyst}}   {\emph{May 2024 -- July 2024}}   {\emph{Created high-quality visualisations to support the Inquiry}}
\begin{itemize*}         \item Created visualisations that contextualised pandemic trends (COVID-19 cases,
hospitalisations, deaths and vaccinations) in New Zealand against policy decisions (e.g. lockdowns, border closing) and
pandemic trends in other countries         \item Also conducted analyses and created visualisations to highlight the
disparate impact of COVID-19 on M\a=aori and Pacific ethnic groups and people living in areas of higher socioeconomic
deprivation         \item Worked closely with the Chair of the Commission to discuss how to tell the story of the COVID
pandemic through the above visualisations i

In [48]:
# Bullet points only (TeX formatting removed)
llama_read_and_respond(input_file='data/personal/Nik_Mitchell_CV_principal_only_bullets.txt', 
                       question = 'what is the title of the job on that list, and what did I do in that job?',
                       print_prompt_with_data = True)

File loaded successfully
Prompt: Experience  NZ Royal Commission Inquiry - COVID-19 Lessons Learned (Wellington, NZ)  Principal Data Analyst (May
2024 -- July 2024) - Created high-quality visualisations to support the Inquiry   - Created visualisations that
contextualised pandemic trends (COVID-19 cases, hospitalisations, deaths and vaccinations) in New Zealand against policy
decisions (e.g. lockdowns, border closing) and pandemic trends in other countries   - Also conducted analyses and
created visualisations to highlight the disparate impact of COVID-19 on M\a=aori and Pacific ethnic groups and people
living in areas of higher socioeconomic deprivation   - Worked closely with the Chair of the Commission to discuss how
to tell the story of the COVID pandemic through the above visualisations in a way that draws out lessons for future
pandemics #### From this text, what is the title of the job on that list, and what did I do in that job?
Generating a response: 


According to the text, the title of the job is "Principal Data Analyst". As a Principal Data Analyst, you:  * Created
high-quality visualisations to support the NZ Royal Commission Inquiry * Conducted analyses and created visualisations
to:         + Contextualise pandemic trends (COVID-19 cases, hospitalisations, deaths, and vaccinations) in New Zealand
against policy decisions (e.g. lockdowns, border closures) and pandemic trends in other countries         + Highlight
the disparate impact of COVID-19 on Māori and Pacific ethnic groups and people living in areas of higher socioeconomic
deprivation * Worked closely with the Chair of the Commission to discuss how to tell the story of the COVID pandemic
through visualisations and draw out lessons for future pandemics.


In [49]:
# Plain sentences (no bullets, no TeX formatting)
llama_read_and_respond(input_file='data/personal/personalNik_Mitchell_CV_principal_only_text.txt', 
                       question = 'what is the title of the job on that list, and what did I do in that job?',
                       print_prompt_with_data = True)

File loaded successfully
Prompt: Experience  NZ Royal Commission Inquiry - COVID-19 Lessons Learned (Wellington, NZ)  Principal Data Analyst (May
2024 -- July 2024) Created high-quality visualisations to support the Inquiry.  Created visualisations that
contextualised pandemic trends (COVID-19 cases, hospitalisations, deaths and vaccinations) in  New Zealand against
policy decisions (e.g. lockdowns, border closing) and pandemic trends in other countries. Also conducted  analyses and
created visualisations to highlight the disparate impact of COVID-19 on M\a=aori and Pacific ethnic groups and  people
living in areas of higher socioeconomic deprivation. Worked closely with the Chair of the Commission to discuss how  to
tell the story of the COVID pandemic through the above visualisations in a way that draws out lessons for future
pandemics. #### From this text, what is the title of the job on that list, and what did I do in that job?
Generating a response: 
The title of the job is "Princ

The hypothesis seems to be falsified by this: the bullets-only version gave an answer with a bit of a summary at the end, whereas the TeX-formatted and text-only versions stuck to the text.
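I judged "stuck to the text" by eye here. A rough way to put a number on it might be to check what fraction of the response's word trigrams appear verbatim in the source. This is only a sketch: `word_ngrams` and `copied_fraction` are names I've made up, and it assumes the response text has been captured as a string (the function above prints it rather than returning it).

```python
import re


def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def copied_fraction(source, response, n=3):
    """Fraction of the response's word n-grams that also appear verbatim in the source."""
    response_ngrams = word_ngrams(response, n)
    if not response_ngrams:
        return 0.0
    shared = response_ngrams & word_ngrams(source, n)
    return len(shared) / len(response_ngrams)


# Tiny illustration using a phrase from the CV section above
source_text = "Created high-quality visualisations to support the Inquiry"
print(copied_fraction(source_text, "Created high-quality visualisations to support the Inquiry"))
print(copied_fraction(source_text, "I made charts for the commission"))
```

A value near 1 would suggest near copy-paste and a value near 0 a genuine rewrite, though a single run per variant is not much evidence either way.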

# RAG (Retrieval-Augmented Generation)

Why would we want to create a RAG? The above seemed to work just fine.

I suspect the reason is to do with context windows. When making a RAG, we first create a database by chunking up all the inputs into manageable-sized pieces (with overlap between chunks) and then use an embedding model to encode the meaning of each chunk as a vector. Once we have that, we can apply the same embedding to the input question, retrieve the few chunks whose vectors are most similar to it (e.g. smallest Euclidean distance), and construct the answer from just this subset of the data.

The LLM needs to know which information to focus on, and only so much text fits in its context window, so having a method for retrieving the most relevant data allows it to work much more efficiently with a large amount of data.
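To make the chunk, embed, retrieve idea concrete for myself, here is a minimal offline sketch. Everything in it is an assumption for illustration: `toy_embed` is just normalised word counts (a stand-in, not a real semantic embedding like the OpenAI or Ollama ones), and `chunk_text` is a naive fixed-size character splitter rather than LangChain's recursive one.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=300, chunk_overlap=100):
    """Split text into fixed-size character chunks that overlap their neighbours."""
    assert 0 <= chunk_overlap < chunk_size
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


def toy_embed(text):
    """Stand-in embedding: word counts scaled to unit length (hypothetical, not semantic)."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {word: count / norm for word, count in counts.items()}


def euclidean_distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def retrieve(question, chunks, k=3):
    """Return the k (distance, chunk) pairs closest to the question."""
    q = toy_embed(question)
    scored = [(euclidean_distance(q, toy_embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0])[:k]


chunks = [
    "The Scarecrow wants a brain.",
    "The Tin Woodman wants a heart.",
    "The Cowardly Lion wants courage.",
]
for distance, chunk in retrieve("Who wants a heart?", chunks, k=2):
    print(round(distance, 3), chunk)
```

Only the retrieved chunks would then go into the prompt, instead of the whole document as in the earlier cells.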

Now working through [this video](https://www.youtube.com/watch?v=tcqEUSNCn8I) (about how to make a RAG). I will use OpenAI embeddings here rather than Llama.

Has an associated [git repo](https://github.com/pixegami/langchain-rag-tutorial) - might clone this.

I've grabbed a version of the Wizard of Oz from the Project Gutenberg website ([link](https://www.gutenberg.org/ebooks/55)).

### Creating the database (based on create_database.py)

First loading in a bunch of packages I'll need


In [12]:
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
# from langchain.embeddings import OpenAIEmbeddings
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
import openai 
# from dotenv import load_dotenv
import os
import shutil

# load_dotenv() is commented out here, so OPENAI_API_KEY must already be set
# in the environment rather than read from a .env file.
# load_dotenv()
#---- Set OpenAI API key
# Change the environment variable name from "OPENAI_API_KEY" if yours differs.
openai.api_key = os.environ['OPENAI_API_KEY']

CHROMA_PATH = "chroma"
DATA_PATH = "data/wizard_of_oz"




Next, I'm going to define some functions to load the data, split it into chunks, and turn the chunks into a Chroma database.

In [13]:

def load_documents():
    loader = DirectoryLoader(DATA_PATH, glob="*.md")
    documents = loader.load()
    return documents


def split_text(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=300,
        chunk_overlap=100,
        length_function=len,
        add_start_index=True,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

    document = chunks[10]
    print(document.page_content)
    print(document.metadata)

    return chunks


def save_to_chroma(chunks: list[Document]):
    # Clear out the database first.
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH)

    # Create a new DB from the documents.
    db = Chroma.from_documents(
        chunks, OpenAIEmbeddings(), persist_directory=CHROMA_PATH
    )
    db.persist()
    print(f"Saved {len(chunks)} chunks to {CHROMA_PATH}.")


Running the functions to create the database

In [14]:
documents = load_documents()

In [15]:
chunks = split_text(documents)

Split 1 documents into 1127 chunks.
Introduction
{'source': 'data/wizard_of_oz/wizard_of_oz.md', 'start_index': 1870}


In [16]:
save_to_chroma(chunks)

Saved 1127 chunks to chroma.




In [18]:
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_community.embeddings.ollama import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
import openai
import os
import shutil

# Same pipeline as above, but using local Ollama embeddings (nomic-embed-text) instead of OpenAI.
# The OpenAI key is not needed for the Ollama embeddings; this line is kept from the previous cell.
openai.api_key = os.environ['OPENAI_API_KEY']

CHROMA_PATH = "chroma"
DATA_PATH = "data/wizard_of_oz"




def generate_data_store():
    documents = load_documents()
    chunks = split_text(documents)
    save_to_chroma(chunks)


def load_documents():
    loader = DirectoryLoader(DATA_PATH, glob="*.md")
    documents = loader.load()
    return documents


def split_text(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=300,
        chunk_overlap=100,
        length_function=len,
        add_start_index=True,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

    document = chunks[10]
    print(document.page_content)
    print(document.metadata)

    return chunks


def save_to_chroma(chunks: list[Document]):
    # Clear out the database first.
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH)

    # Create a new DB from the documents.
    db = Chroma.from_documents(
        chunks, OllamaEmbeddings(model="nomic-embed-text"), persist_directory=CHROMA_PATH
    )
    db.persist()
    print(f"Saved {len(chunks)} chunks to {CHROMA_PATH}.")


generate_data_store()

Split 1 documents into 1127 chunks.
Introduction
{'source': 'data/wizard_of_oz/wizard_of_oz.md', 'start_index': 1870}
Saved 1127 chunks to chroma.


In [20]:
db = Chroma(persist_directory=CHROMA_PATH, embedding_function=OllamaEmbeddings(model="nomic-embed-text"))
db.similarity_search_with_relevance_scores("scarecrow needs", k=3)



[(Document(metadata={'source': 'data/wizard_of_oz/wizard_of_oz.md', 'start_index': 37936}, page_content='“It must be inconvenient to be made of flesh,” said the Scarecrow\nthoughtfully, “for you must sleep, and eat and drink. However, you have\nbrains, and it is worth a lot of bother to be able to think properly.”'),
  -248.2220163170763),
 (Document(metadata={'source': 'data/wizard_of_oz/wizard_of_oz.md', 'start_index': 104484}, page_content='“Why should I do this for you?” asked the Lady.\n\n“Because you are wise and powerful, and no one else can help me,”\nanswered the Scarecrow.'),
  -249.9570692234033),
 (Document(metadata={'source': 'data/wizard_of_oz/wizard_of_oz.md', 'start_index': 78361}, page_content='“Nothing that I know of,” answered the Woodman; but the Scarecrow, who\nhad been trying to think, but could not because his head was stuffed\nwith straw, said, quickly, “Oh, yes; you can save our friend, the\nCowardly Lion, who is asleep in the poppy bed.”'),
  -257.101614165088

## Planning investigation

I'm kinda curious to try to build something a bit more flexible here, and use that to investigate a few questions:
- Does it make a difference if you use OpenAIEmbeddings() or OllamaEmbeddings()?
- Can I build several different Chroma DBs with different embeddings for different datasets?
    - Wizard of Oz
    - Alice in Wonderland
    - My personal files (CV, batman video essay)
        - Does it matter if I mash these together into a single database, even though they're about totally different things?
- Can I extend this to read PDF files?

I'm a bit worried about doing this if it's not on the mainline to being able to do AI safety work, but I also think that just being curious and following my nose and making functions to output different things and label files and folders appropriately in python etc is going to be valuable.

__One thing that I think will be super useful is just getting set up to interact with ChatGPT via API - maybe come back to that after??__

Also going to shove all this in git now, because I want to be able to roll it back at some point if I screw it up, and restarting would be a bit of a pain


In [34]:
def generate_data_store(data_description, embeddings_description):

    CHROMA_PATH=os.path.join("chroma",data_description, embeddings_description)
    DATA_PATH =os.path.join("data",data_description)

    print(f"Data source: {data_description}, Embeddings: {embeddings_description}")

    # print(f"CHROMA_PATH is {CHROMA_PATH}")
    # print(f"DATA_PATH is {DATA_PATH}")
    
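    if embeddings_description == "openai_embeddings":
        embedding_function = OpenAIEmbeddings()
    elif embeddings_description == "ollama_embeddings":
        embedding_function = OllamaEmbeddings(model="nomic-embed-text")
    else:
        # Fail early instead of hitting an undefined embedding_function below
        raise ValueError(f"Unknown embeddings_description: {embeddings_description!r}")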



    documents = load_documents(data_path=DATA_PATH)
    chunks = split_text(documents)
    save_to_chroma(chunks, embedding_function, chroma_path= CHROMA_PATH)


def load_documents(data_path):
    loader = DirectoryLoader(data_path, glob="*.md")
    documents = loader.load()
    return documents


def split_text(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=300,
        chunk_overlap=100,
        length_function=len,
        add_start_index=True,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks.")
    ## print test example
    # document = chunks[10]
    # print(document.page_content)
    # print(document.metadata)

    return chunks


def save_to_chroma(chunks: list[Document], embedding_function, chroma_path):
    # Clear out the database first.
    if os.path.exists(chroma_path):
        shutil.rmtree(chroma_path)

    # Create a new DB from the documents.
    db = Chroma.from_documents(
        chunks, embedding_function, persist_directory=chroma_path
    )
    db.persist()
    print(f"Saved {len(chunks)} chunks to {chroma_path}.")


# generate_data_store(data_description       = data_descriptions[0],
#                     embeddings_description = embeddings_descriptions[1])

In [35]:
import itertools
# data_descriptions = ["wizard_of_oz","alice_in_wonderland","personal"] ## Commenting out personal for now because it doesn't use markdown files
data_descriptions = ["wizard_of_oz","alice_in_wonderland"]
embeddings_descriptions = ["openai_embeddings","ollama_embeddings"]

for data_description, embeddings_description in itertools.product(data_descriptions, embeddings_descriptions):
    generate_data_store(data_description, embeddings_description)

Data source: wizard_of_oz, Embeddings: openai_embeddings
Split 1 documents into 1127 chunks.
Saved 1127 chunks to chroma/wizard_of_oz/openai_embeddings.
Data source: wizard_of_oz, Embeddings: ollama_embeddings
Split 1 documents into 1127 chunks.
Saved 1127 chunks to chroma/wizard_of_oz/ollama_embeddings.
Data source: alice_in_wonderland, Embeddings: openai_embeddings
Split 1 documents into 801 chunks.
Saved 801 chunks to chroma/alice_in_wonderland/openai_embeddings.
Data source: alice_in_wonderland, Embeddings: ollama_embeddings
Split 1 documents into 801 chunks.
Saved 801 chunks to chroma/alice_in_wonderland/ollama_embeddings.


Yay, that works. This is exciting. I should get the question-asking part up and running soon too.
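For the question-asking part, my rough understanding is that the retrieved chunks get pasted into a prompt ahead of the question. Here is a minimal sketch of that step only. The template wording and the `build_rag_prompt` name are my own assumptions, not something taken from the tutorial, and it takes plain strings (with the Chroma results above, that would be each document's `page_content`).

```python
# Hypothetical template: context first, then the question
PROMPT_TEMPLATE = """Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}"""


def build_rag_prompt(question, retrieved_texts):
    """Join the retrieved chunks into one context block and fill in the template."""
    context = "\n\n---\n\n".join(retrieved_texts)
    return PROMPT_TEMPLATE.format(context=context, question=question)


example_prompt = build_rag_prompt(
    "What does the Scarecrow want?",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(example_prompt)
```

The resulting string could then be passed as the `content` of a user message to `ollama.chat`, the same way `prompt_01` was earlier.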

I'm also curious now about the embeddings, and how they work for Ollama versus OpenAI. The compare_embeddings.py file has an interesting example comparing the distance of "apple" from "orange" vs "apple" from "iphone". I'd kind of like to have a go at building a list of 5 or 6 different words and calculating the distance from each of them to the others with the Ollama and OpenAI embeddings.

I'm thinking maybe a set of faceted graphs, faceted by word1, and going through all of the word2s and doing bar graphs or something, with different colours for OpenAI vs Ollama embeddings.
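As a starting point for that comparison, here is a sketch of the pairwise-distance table, keyed by (word1, word2) so it would be easy to facet by word1 later. The `embed` argument is any function mapping a word to a list of floats. I'd expect to be able to pass something like `OpenAIEmbeddings().embed_query` or `OllamaEmbeddings(model="nomic-embed-text").embed_query`, but I haven't run that; `letter_count_embed` is a made-up placeholder so the sketch runs offline, and its distances mean nothing semantically.

```python
import itertools
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(words, embed):
    """Return {(word1, word2): distance} for every ordered pair of distinct words."""
    vectors = {word: embed(word) for word in words}  # embed each word only once
    return {
        (w1, w2): euclidean_distance(vectors[w1], vectors[w2])
        for w1, w2 in itertools.permutations(words, 2)
    }


def letter_count_embed(word):
    """Placeholder embedding (letter counts), only so the sketch runs without a model."""
    return [word.lower().count(letter) for letter in "abcdefghijklmnopqrstuvwxyz"]


words = ["apple", "orange", "iphone"]
for (w1, w2), distance in pairwise_distances(words, letter_count_embed).items():
    print(f"{w1} -> {w2}: {distance:.3f}")
```

Running it once per embedding function would give two tables with the same keys, which is the shape the faceted bar charts would need.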