# Day 4

## 0. Concepts

- **Transformers**: Neural network architecture that uses self-attention to process sequences of tokens and capture long-range dependencies. Model capacity is largely determined by the number of **parameters (weights)**: larger models with billions of parameters can learn more complex patterns but require more compute.

- **Tokens**: Discrete units of text or code that the model processes. Text is split into subword units, which determines how sequences are represented internally. Token counts are important because they affect both the model's **context window** and the cost of API usage.

- **Context Window**: Maximum number of tokens the model can attend to in a single pass. Sequences longer than this limit are truncated or require special handling (e.g., sliding windows or chunking).

- **API Costs**: Most LLM APIs charge per token processed or generated. Understanding tokenization helps estimate costs accurately and optimize requests.

- **Tokenizing with Code**: Practice converting text or code into tokens programmatically. Measure token lengths and analyze how different inputs affect model performance and API usage.
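
The chunking idea mentioned under **Context Window** can be sketched as follows. This is a minimal illustration, not a library function: `chunk_tokens`, `window`, and `overlap` are hypothetical names, and the token IDs are placeholders rather than real tokenizer output.

```python
from typing import List


def chunk_tokens(token_ids: List[int], window: int, overlap: int = 0) -> List[List[int]]:
    """Split token_ids into windows of at most `window` tokens.

    Consecutive windows share `overlap` tokens so that context spanning a
    boundary is not lost entirely.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    step = window - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the sequence
    return chunks


# Placeholder token IDs standing in for the output of a real tokenizer
token_ids = list(range(10))
for chunk in chunk_tokens(token_ids, window=4, overlap=1):
    print(chunk)
```

Each chunk fits inside the assumed window, and the one-token overlap means the last token of a chunk is repeated at the start of the next.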


## 1. Tokenizing with code
[tiktoken](https://github.com/openai/tiktoken) is a fast _BPE_ (byte-pair encoding) tokenizer for use with OpenAI's models.

In [None]:
import tiktoken

# tiktoken is OpenAI's tokenizer: it converts text into integer token IDs
# Different models use different tokenizers, so we specify which model
encoding = tiktoken.encoding_for_model("gpt-4o-mini")

tokens = encoding.encode("Hi my name is Manu and I like migas")

In [None]:
tokens

In [None]:
# Let's see what each token ID represents
for token_id in tokens:
    token_text = encoding.decode([token_id])
    print(f"{token_id} = {token_text}")

In [None]:
encoding.decode([164313])
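
Once you can count tokens, a rough cost estimate is simple arithmetic. The sketch below assumes separate per-million-token prices for input and output, which is a common pricing shape but not something this notebook specifies. The function `estimate_cost`, the prices, and the token counts are all hypothetical placeholders.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_million: float,
                  output_price_per_million: float) -> float:
    """Return the estimated cost in USD for a single request."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


# Hypothetical prices in USD per 1M tokens; check your provider's pricing page
INPUT_PRICE = 0.50
OUTPUT_PRICE = 2.00

# Placeholder counts; in practice use something like len(encoding.encode(prompt))
prompt_token_count = 1200
expected_output_tokens = 300

cost = estimate_cost(prompt_token_count, expected_output_tokens,
                     INPUT_PRICE, OUTPUT_PRICE)
print(f"Estimated cost: ${cost:.6f}")
```

Keep in mind that different models use different tokenizers, so a count from one encoding is only an approximation for another provider's model.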

## 2. The Illusion of "memory"

Now let's see how LLMs actually work - they don't remember anything between calls!

In [None]:
# Set up Gemini through its OpenAI-compatible endpoint (hence the OpenAI client)
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv(override=True)

api_key = os.getenv("GEMINI_API_KEY")
base_url = os.getenv("GEMINI_BASE_URL", "https://generativelanguage.googleapis.com/v1beta/openai/")
model = os.getenv("GEMINI_MODEL", "gemini-3-flash-preview")

if not api_key:
    raise ValueError("GEMINI_API_KEY is not set (check your .env file)")

print("API key found!")
client = OpenAI(base_url=base_url, api_key=api_key)

In [None]:
# First message (list of dicts)
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hi! I'm Manu!"}
]

response = client.chat.completions.create(model=model, messages=messages)
response.choices[0].message.content

In [None]:
# Second call: a fresh message list that asks a follow-up question without the earlier turn
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What's my name?"}
]

response = client.chat.completions.create(model=model, messages=messages)
response.choices[0].message.content

In [None]:
# Wait, it doesn't know my name! That's because each API call is STATELESS
# Every call is completely independent - the LLM has no memory between calls

In [None]:
# Now including the full conversation history
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hi! I'm Manu!"},
    {"role": "assistant", "content": "Hi Manu! How can I help you today?"},
    {"role": "user", "content": "What's my name?"}
]

response = client.chat.completions.create(model=model, messages=messages)
print(response.choices[0].message.content)

In [None]:
# Key takeaway: LLMs are stateless!
# Each API call is independent - no memory between calls
# To create "memory", we pass the entire conversation history each time
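
The takeaway above can be wrapped in a small helper that keeps the history for you. This is a minimal sketch, not part of any SDK: the `Conversation` class and `fake_send` are hypothetical, and `fake_send` stands in for the real API call so the example runs without a key.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class Conversation:
    """Keeps the message history and resends all of it on every turn."""

    def __init__(self, send: Callable[[List[Message]], str], system_prompt: str):
        self.send = send
        self.messages: List[Message] = [{"role": "system", "content": system_prompt}]

    def ask(self, user_text: str) -> str:
        self.messages.append({"role": "user", "content": user_text})
        reply = self.send(self.messages)  # the full history goes out each time
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_send(messages: List[Message]) -> str:
    """Stand-in for the API call; reports how many messages it received."""
    return f"(received {len(messages)} messages)"


chat = Conversation(fake_send, "You are a helpful assistant")
print(chat.ask("Hi! I'm Manu!"))    # sends system + user
print(chat.ask("What's my name?"))  # sends the whole exchange so far
```

To use it with the client from this notebook, `send` could be a function that calls `client.chat.completions.create(model=model, messages=msgs)` and returns `.choices[0].message.content`. Because every turn resends the full history, the token count (and cost) of each call grows as the conversation gets longer.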