# Day 4

## Tokenizing with code

In [1]:
# Note: tiktoken is OpenAI's tokenizer library. Ollama models (like llama3.2)
# ship with their own native tokenizers, so the token IDs shown here may differ
# from what the model actually sees. We use tiktoken purely to demonstrate the idea.

import tiktoken

# cl100k_base is the encoding used by GPT-3.5 and GPT-4 (o200k_base is used by
# newer OpenAI models). Treat it as a stand-in for llama models, not their exact vocabulary.
encoding = tiktoken.get_encoding("cl100k_base")

tokens = encoding.encode("Hi my name is Ed and I like banoffee pie")

In [2]:
tokens

[13347, 856, 836, 374, 3279, 323, 358, 1093, 9120, 21869, 4447]

In [3]:
for token_id in tokens:
    token_text = encoding.decode([token_id])
    print(f"{token_id} = {token_text}")

13347 = Hi
856 =  my
836 =  name
374 =  is
3279 =  Ed
323 =  and
358 =  I
1093 =  like
9120 =  ban
21869 = offee
4447 =  pie


In [6]:
encoding.decode([13347])

'Hi'
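To check that the pieces printed above really are a lossless split of the original string, here is a small sketch that uses only the IDs and text from that output, copied by hand so it runs without a tokenizer installed. The variable names are illustrative.

```python
# (token_id, token_text) pairs copied from the output of the loop above.
# Note the leading space on most tokens: the space is part of the token.
pieces = [
    (13347, "Hi"), (856, " my"), (836, " name"), (374, " is"),
    (3279, " Ed"), (323, " and"), (358, " I"), (1093, " like"),
    (9120, " ban"), (21869, "offee"), (4447, " pie"),
]

original = "Hi my name is Ed and I like banoffee pie"

# Joining the token texts in order gives back the original string exactly.
reconstructed = "".join(text for _, text in pieces)

# After the first token, a token with no leading space continues the previous word.
continuations = [text for _, text in pieces[1:] if not text.startswith(" ")]

print(reconstructed == original)
print(len(original.split()), "words ->", len(pieces), "tokens")
print("word continuations:", continuations)
```

The less common word "banoffee" is the only one split into two tokens here, which is why 10 words become 11 tokens.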

# And another topic!

### The Illusion of "memory"

Many of you will know this already. But for those who don't, this might be an "AHA" moment!

In [7]:
import requests

# Check if Ollama is running
try:
    response = requests.get("http://localhost:11434", timeout=2)
    if response.status_code == 200:
        print("Ollama is running!")
    else:
        print("Ollama server responded but with unexpected status. Please check if Ollama is running properly.")
except requests.exceptions.RequestException:
    print("Ollama is not running. Please open a terminal and run 'ollama serve'")
    print("Then in another terminal, run 'ollama pull llama3.2' to download the model")

Ollama is running!


### You should be very comfortable with what the next cell is doing!

_I'm creating a new instance of the OpenAI Python Client library, a lightweight wrapper around making HTTP calls to an endpoint. We're using it to call Ollama (which provides an OpenAI-compatible endpoint) running locally on your machine!_

In [8]:
from openai import OpenAI

# Set up Ollama client using OpenAI-compatible interface
OLLAMA_BASE_URL = "http://localhost:11434/v1"
ollama = OpenAI(base_url=OLLAMA_BASE_URL, api_key='ollama')

### The messages sent to an LLM (via an OpenAI-compatible API) are a list of dicts

In [9]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hi! I'm Ed!"},
]

In [10]:
response = ollama.chat.completions.create(model="llama3.2", messages=messages)
response.choices[0].message.content

"Hi Ed! It's nice to meet you. Is there anything I can help you with today? Do you need assistance with a particular question or topic, or do you just want to chat? I'm all ears!"

### OK let's now ask a follow-up question

In [11]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What's my name?"},
]

In [12]:
response = ollama.chat.completions.create(model="llama3.2", messages=messages)
response.choices[0].message.content

"I don't have any information about your name. We just started our conversation, and I don't retain any personal data. Would you like to share your name with me? I can help with anything related to it!"

### Wait, wha??

We just told you!

What's going on??

Here's the thing: every call to an LLM is completely STATELESS. It's a totally new call, every single time. As AI engineers, it's OUR JOB to devise techniques to give the impression that the LLM has a "memory".

In [13]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hi! I'm Ed!"},
    {"role": "assistant", "content": "Hi Ed! How can I assist you today?"},
    {"role": "user", "content": "What's my name?"},
]

In [14]:
response = ollama.chat.completions.create(model="llama3.2", messages=messages)
response.choices[0].message.content

"Your name is Ed. Or, at least, that's what we established earlier!"

## To recap

With apologies if this is obvious to you - but it's still good to reinforce:

1. Every call to an LLM is stateless
2. We pass in the entire conversation so far in the input prompt, every time
3. This gives the illusion that the LLM has memory - it apparently keeps the context of the conversation
4. But this is a trick; it's a by-product of providing the entire conversation, every time
5. An LLM just predicts the most likely next tokens in the sequence; if that sequence contains "My name is Ed" and later "What's my name?" then it will predict... Ed!
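The recap above can be sketched as a small helper that keeps the history for us. This is a minimal sketch, not part of the OpenAI client library: `chat_turn` and `fake_complete` are hypothetical names, and the model call is passed in as a plain function so the example runs without a server. With Ollama you might instead pass something like `lambda msgs: ollama.chat.completions.create(model="llama3.2", messages=msgs).choices[0].message.content`.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def chat_turn(history: List[Message], user_text: str,
              complete: Callable[[List[Message]], str]) -> str:
    """Append the user message, send the WHOLE history, store the reply."""
    history.append({"role": "user", "content": user_text})
    reply = complete(history)  # the model sees every earlier message
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_complete(messages: List[Message]) -> str:
    """Stand-in for a real model call (an assumption for this sketch).

    It only reports how many messages it was given, which makes it
    easy to see that each call receives the full conversation.
    """
    return f"I received {len(messages)} messages"


history = [{"role": "system", "content": "You are a helpful assistant"}]
print(chat_turn(history, "Hi! I'm Ed!", fake_complete))
print(chat_turn(history, "What's my name?", fake_complete))
print(len(history))
```

The "memory" lives entirely in the `history` list on our side; the function that stands in for the model keeps nothing between calls.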

The ChatGPT product uses exactly this trick - every time you send a message, it's the entire conversation that gets passed in.

"Does that mean we have to pay extra each time for all the conversation so far?"

With cloud APIs like OpenAI, yes - you pay for all the tokens in the conversation each time. With Ollama running locally, you don't pay API fees, but you still use compute resources (CPU/GPU) to process the entire conversation each time. The principle is the same: we want the LLM to predict the next tokens in the sequence, looking back on the entire conversation. We want that compute to happen!
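To make the cost point concrete, here is a rough back-of-the-envelope sketch. The per-message token counts are made-up numbers for illustration, not measurements, and `input_tokens_per_call` is a hypothetical helper. It assumes every call resends all earlier messages and ignores any caching a provider or runtime might apply.

```python
from itertools import accumulate
from typing import List


def input_tokens_per_call(message_tokens: List[int]) -> List[int]:
    """Input size of each call when the full history is resent.

    message_tokens lists the token count of every message in order:
    system, user, assistant, user, assistant, ...
    A call is made right after each user message, so the calls fall
    at positions 1, 3, 5, ... (0-based) of the running total.
    """
    running = list(accumulate(message_tokens))
    return running[1::2]


# Illustrative (made-up) sizes: system=10, then user/assistant pairs.
sizes = [10, 5, 40, 6, 30, 8]
per_call = input_tokens_per_call(sizes)
print(per_call)       # [15, 61, 99]
print(sum(per_call))  # 175
```

Each call processes everything that came before it, so the input per call keeps growing even though each new user message here is only a handful of tokens.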

