# Initial Setup of Environment
We will use uv to install the dependencies for the project.  This means that uv needs to be installed on the system.  Use the following command to install uv:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

To set up a new project, run the following commands:

```bash
mkdir my_project
cd my_project
uv init
```

This creates a new directory called `my_project` and initializes a new uv project in it.  You can then add the dependencies for the project.  We will need `mlx-lm`:

```bash
uv add mlx-lm
```

If we look inside the `uv.lock` file after adding the dependency above, we will see that `mlx` and `mlx-metal` have been installed without us having to add them individually.  This is one of the nice features of uv: it takes care of all the transitive dependencies for us.

To execute cells in the notebook, you must also add `ipykernel` to the project dependencies, and to avoid some warnings, we will also add `ipywidgets`.  These can be added as part of the development environment:

```bash
uv add --dev ipykernel ipywidgets
```




# Load the Model

Let's import the `load` function from the `mlx-lm` package, and then load the model.  We will use the `Qwen3-4B-Instruct-2507-4bit` model, a small 4-bit quantized model that needs only a modest amount of RAM.  The first call downloads the weights, so it may take a while.

In [None]:
from mlx_lm import load
MODEL_ID = "mlx-community/Qwen3-4B-Instruct-2507-4bit" 
print("Loading model... (first time may take a while)")
model, tokenizer = load(MODEL_ID)
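
To get a feel for why a 4-bit model is easy on memory, here is a back-of-the-envelope estimate of the weight size. It is only a rough sketch: it assumes "4B" means about 4 billion parameters and ignores quantization scales, the KV cache, and activations, so the real footprint will be somewhat larger. `approx_weight_gb` is a hypothetical helper, not part of `mlx-lm`.

```python
def approx_weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


print(approx_weight_gb(4e9, 4))   # 4-bit quantized weights
print(approx_weight_gb(4e9, 16))  # the same model at 16 bits, for comparison
```

The 4-bit estimate comes out at a quarter of the 16-bit one, which is the main reason the quantized model is used here.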

# Generate a response

Let's generate a response from the model.  We will use the `generate` function to answer the question "What is the capital of the United States?"  We are not setting up a specific prompt template; the question goes straight in.  You will notice that the model wanders around and adds information beyond what we asked.

In [None]:
from mlx_lm import generate
response = generate(model, tokenizer, "What is the capital of the United States?")
print(response)

## Adding a Prompt Template
Let's see what happens if we wrap the question in a prompt template.  The `chat_template` attribute of the tokenizer holds the model's prompt template, and the `apply_chat_template` function applies it to a list of messages.  Note that we are not tokenizing the prompt (`tokenize=False`), so the result is still a plain string.

In [None]:
prompt = "What is the capital of the United States?"
if tokenizer.chat_template is not None:
    messages = [{"role":"user", "content":prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True,
        tokenize=False,
    )
response = generate(model, tokenizer, prompt=prompt, verbose=True)
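
To make the effect of the template less mysterious, the sketch below builds a prompt string by hand in the ChatML-style layout that Qwen-family templates use. Treat it as an approximation for illustration only: the real template is stored in the tokenizer and may add more (for example system messages or tool sections), and `chatml_prompt` is a hypothetical helper, not an `mlx-lm` function.

```python
def chatml_prompt(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Wrap each message in ChatML-style markers (approximate layout)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Leave an open assistant turn so the model knows it should answer next.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


demo_messages = [
    {"role": "user", "content": "What is the capital of the United States?"}
]
print(chatml_prompt(demo_messages))
```

Printing `prompt` from the cell above shows what the actual template produces, which is the authoritative version.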

## Is this model smart?

Let's ask the model for the capital of the United Moons, and see if it can handle it.

In [None]:
prompt = "What is the capital of the United Moons?"
if tokenizer.chat_template is not None:
    messages = [{"role":"user", "content":prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True,
        tokenize=False,
    )
response = generate(model, tokenizer, prompt=prompt, verbose=True)

## OPTIONAL: Tweaking the generate method

We can change sampling parameters by passing a sampler object to the generate method. 


In [None]:
from mlx_lm.sample_utils import make_sampler

# Here is a sampler with the default values
sampler = make_sampler(
    temp=0.0,
    top_p=0.0,
    min_p=0.0,
    min_tokens_to_keep=1,
    top_k=0,
    xtc_probability=0.0,
    xtc_threshold=0.0,
    xtc_special_tokens=[],
)

# Pass the sampler to generate
response = generate(
    model, 
    tokenizer, 
    prompt=prompt, 
    verbose=True,
    max_tokens=25,
    sampler=sampler,  # Pass the sampler object here
)
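
To build some intuition for the `temp` parameter, here is a pure-Python sketch of temperature scaling on made-up logits. It is not the `mlx-lm` implementation; it only illustrates the general idea that a temperature of 0 means always picking the most likely token, while higher temperatures flatten the distribution. `softmax_with_temperature` and the example logits are hypothetical.

```python
import math


def softmax_with_temperature(logits: list[float], temp: float) -> list[float]:
    """Convert logits to probabilities; temp == 0 is treated as greedy (argmax)."""
    if temp == 0.0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temp for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for temp in (0.0, 0.7, 1.5):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

As the temperature rises, the top token's share shrinks and the alternatives become more likely to be sampled.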

## Chatting with the Model
### No Memory
To have a chat with a model, we need some method for the model to remember the conversation history.  Here is an example that shows what happens if we just ask two sequential prompts. In my first prompt, I tell the model my name.  In the second prompt, I ask for my name.  The model has no memory of the first prompt!

In [None]:
# User turn one
user_message = "Hi my name is Mike Dean."
print(f"First user message: {user_message}\n")
messages = [{"role": "user", "content": user_message}]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
)

# Assistant response
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=False,
)

print(f"First prompt response: {response}\n")

# User turn two
user_message = "What's my name?"
messages = [{"role": "user", "content": user_message}]
print(f"Second user message: {user_message}\n")
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
)

# Assistant response
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=False,
)
print(f"Second prompt response: {response}\n")

### Make a history mechanism

We can set up a history mechanism manually.  We initialize an empty list, and then put in the initial user message.  We feed the whole history (initially it is just the initial user message) into the tokenizer to get a prompt that includes everything in the history.  After a response is generated, we append the assistant response to the history, append the next user message, and then feed the whole history into the tokenizer to get a new prompt.  We continue in this way until we are done.

In [None]:
# Initialize history
history = []

# User turn one
user_message = "Hi my name is Mike Dean."
messages = [{"role": "user", "content": user_message}]
history.append(messages[0])
prompt = tokenizer.apply_chat_template(
    history,
    add_generation_prompt=True,
    tokenize=False
)

# Assistant response
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
)
history.append({"role":"assistant", "content":response})

# User turn two
user_message = "What's my name?"
messages = [{"role": "user", "content": user_message}]
history.append(messages[0])
prompt = tokenizer.apply_chat_template(
    history,
    add_generation_prompt=True,
    tokenize=False
)

# Assistant response
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
)

history.append({"role":"assistant", "content":response})
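
The append-then-generate pattern above repeats for every turn, so it can be folded into a small helper. The sketch below is one possible shape for it. `chat_turn`, `echo_respond`, and `demo_history` are hypothetical names, and the model is replaced by a stand-in function so the example runs on its own.

```python
from typing import Callable


def chat_turn(
    history: list[dict],
    user_message: str,
    respond: Callable[[list[dict]], str],
) -> str:
    """Append the user message, get a reply for the full history, and record it."""
    history.append({"role": "user", "content": user_message})
    reply = respond(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def echo_respond(history: list[dict]) -> str:
    """Stand-in for the model: reports how many messages it was shown."""
    return f"(saw {len(history)} messages)"


demo_history: list[dict] = []
chat_turn(demo_history, "Hi my name is Mike Dean.", echo_respond)
chat_turn(demo_history, "What's my name?", echo_respond)
print([m["role"] for m in demo_history])

# With the real model, `respond` could wrap the two calls used in the cell above:
# def model_respond(history):
#     prompt = tokenizer.apply_chat_template(
#         history, add_generation_prompt=True, tokenize=False
#     )
#     return generate(model, tokenizer, prompt=prompt)
```

Passing the responder in as an argument keeps the bookkeeping separate from the model call, which also makes the helper easy to test without loading a model.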

### Examine the history object
Since we initialized `history`, we have appended four messages: the two user messages and the two assistant responses, in the order they occurred.  We can see this by examining the `history` object.

In [None]:
# Using pretty print to make the object easier to read
from pprint import pprint
pprint(history)


### Easier to read printing

In [None]:
def print_history(history):
    for i, msg in enumerate(history):
        role = msg['role'].upper()
        content = msg['content']
        print(f"[{i}] {role}:")
        print(f"    {content}")
        print()

print_history(history)

## Using Prompt Caching

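The MLX framework has a prompt caching mechanism that can be used in a similar manner to the history mechanism that we just created.  Here is an example of how to use it, adapted from Apple's documentation on the subject.  Note that each turn passes only the new user message; the earlier turns are carried by `prompt_cache`, which `generate` updates in place.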

In [None]:
"""
An example of a multi-turn chat with prompt caching.
"""

from mlx_lm.models.cache import load_prompt_cache, make_prompt_cache, save_prompt_cache
from pathlib import Path

# MODEL_ID = "mlx-community/Qwen3-4B-Instruct-2507-4bit" 
# model, tokenizer = load(MODEL_ID)

# Make the initial prompt cache for the model
prompt_cache = make_prompt_cache(model)

# Create the cache files directory 
cache_dir = Path("cache_files")
cache_dir.mkdir(exist_ok=True)
model_name = MODEL_ID.split("/")[-1]
cache_file = cache_dir/f"{model_name}.safetensors"

# User turn
prompt = "Hi my name is Mike Dean."
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

# Assistant response
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
    prompt_cache=prompt_cache,
)

# User turn
prompt = "What's my name?"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

# Assistant response
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
    prompt_cache=prompt_cache,
)

# User turn
prompt = "What's your name?"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

# Assistant response
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
    prompt_cache=prompt_cache,
)

# User turn
prompt = "Can you give me some advice about cooking rice?"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

# Assistant response
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
    prompt_cache=prompt_cache,
)
# Save the prompt cache to disk to reuse it at a later time
save_prompt_cache(cache_file, prompt_cache)

# Load the prompt cache from disk
prompt_cache = load_prompt_cache(cache_file)

## Crude Memory Implementation

At the end of the earlier code, we saved the prompt cache to disk and loaded it back.  Because the reloaded cache still holds the whole conversation, we can use it as a crude memory mechanism.  Here is an example:

In [None]:
# User turn
prompt = "Summarize what we have discussed, but do not repeat everything."
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

# Assistant response
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
    prompt_cache=prompt_cache,
)
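
To carry this memory across sessions, a script would need to reuse the saved file when it exists and start fresh otherwise. Here is a minimal sketch of that decision. `load_or_create_cache` is a hypothetical helper, and the loader and factory are passed in as arguments so the example runs without a model.

```python
from pathlib import Path
from typing import Any, Callable


def load_or_create_cache(
    cache_file: Path,
    load_fn: Callable[[str], Any],
    make_fn: Callable[[], Any],
) -> Any:
    """Reuse a saved cache when the file exists; otherwise start a fresh one."""
    if cache_file.exists():
        return load_fn(str(cache_file))
    return make_fn()


# Stand-in callables so the sketch runs on its own.
demo = load_or_create_cache(
    Path("no_such_dir/no_such_cache.safetensors"),
    load_fn=lambda path: f"loaded {path}",
    make_fn=lambda: "fresh cache",
)
print(demo)

# In the notebook, the call might look like this:
# prompt_cache = load_or_create_cache(
#     cache_file, load_prompt_cache, lambda: make_prompt_cache(model)
# )
```

A cache file is only meaningful for the model that produced it, which is why the notebook includes the model name in the file name.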

## Impact of Token Limits

In the previous code, we did not explicitly set `max_tokens`, so the default (256) was used.  If you examine the output from the request for advice about cooking rice, you will see that the response is cut off.  Let's try again with a larger `max_tokens` value.

In [None]:
# User turn
prompt = "Tell me the recipe again. Don't summarize it - I want the original version."
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

# Assistant response
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
    max_tokens=2048,
    prompt_cache=prompt_cache,
)
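
Spotting a cut-off response by eye gets tedious, so a simple check can flag responses that used up the whole token budget. This is only a heuristic sketch: `hit_token_limit` and `count_words` are hypothetical helpers, and whitespace splitting stands in for a real tokenizer.

```python
from typing import Callable


def hit_token_limit(
    response: str,
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> bool:
    """True when the response used at least max_tokens tokens, so it may be cut off."""
    return count_tokens(response) >= max_tokens


def count_words(text: str) -> int:
    """Whitespace splitting is only a stand-in for a real tokenizer."""
    return len(text.split())


print(hit_token_limit("rinse the rice until the water runs", 7, count_words))
print(hit_token_limit("rinse the rice", 7, count_words))
```

With the real model, the counter could be something like `lambda text: len(tokenizer.encode(text))`, assuming that re-encoding the response gives roughly the same token count as generation did.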

## Migrating to scripts
Jupyter notebooks are great for interactive development, but we will need to migrate to scripts when we are ready to deploy.

The first step is to put the relevant code into a single cell and make sure it still works after restarting the kernel. Let's do that now.

In [None]:
%reset -f

In [None]:
from mlx_lm import load, generate
from mlx_lm.models.cache import load_prompt_cache, make_prompt_cache, save_prompt_cache
from pathlib import Path

MODEL_ID = "mlx-community/Qwen3-4B-Instruct-2507-4bit" 
print("Loading model... (first time may take a while)")
model, tokenizer = load(MODEL_ID)

response = generate(model, tokenizer, "What is the capital of the United States?")
print(response)

prompt = "What is the capital of the United States?"
if tokenizer.chat_template is not None:
    messages = [{"role":"user", "content":prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True,
        tokenize=False,
    )
response = generate(model, tokenizer, prompt=prompt, verbose=True)

prompt = "What is the capital of the United Moons?"
if tokenizer.chat_template is not None:
    messages = [{"role":"user", "content":prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True,
        tokenize=False,
    )
response = generate(model, tokenizer, prompt=prompt, verbose=True)

# User turn one
user_message = "Hi my name is Mike Dean."
print(f"First user message: {user_message}\n")
messages = [{"role": "user", "content": user_message}]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
)

# Assistant response
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=False,
)

print(f"First prompt response: {response}\n")

# User turn two
user_message = "What's my name?"
messages = [{"role": "user", "content": user_message}]
print(f"Second user message: {user_message}\n")
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
)

# Assistant response
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=False,
)
print(f"Second prompt response: {response}\n")

"""
An example of a multi-turn chat with prompt caching.
"""

# The imports, MODEL_ID, model, and tokenizer from the top of this cell are reused here.

# Make the initial prompt cache for the model
prompt_cache = make_prompt_cache(model)

# Create the cache files directory 
cache_dir = Path("cache_files")
cache_dir.mkdir(exist_ok=True)
model_name = MODEL_ID.split("/")[-1]
cache_file = cache_dir/f"{model_name}.safetensors"

# User turn
prompt = "Hi my name is Mike Dean."
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

# Assistant response
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
    prompt_cache=prompt_cache,
)

# User turn
prompt = "What's my name?"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

# Assistant response
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
    prompt_cache=prompt_cache,
)

# User turn
prompt = "What's your name?"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

# Assistant response
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
    prompt_cache=prompt_cache,
)

# User turn
prompt = "Can you give me some advice about cooking rice?"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

# Assistant response
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
    prompt_cache=prompt_cache,
)
# Save the prompt cache to disk to reuse it at a later time
save_prompt_cache(cache_file, prompt_cache)

# Load the prompt cache from disk
prompt_cache = load_prompt_cache(cache_file)

# User turn
prompt = "Summarize what we have discussed, but do not repeat everything."
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

# Assistant response
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
    prompt_cache=prompt_cache,
)

# User turn
prompt = "Tell me the recipe again. Don't summarize it - I want the original version."
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

# Assistant response
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
    max_tokens=2048,
    prompt_cache=prompt_cache,
)

## Final Test
I have copied all the code from the previous cell into `setup.py`.  Let's test it.  Open a terminal, navigate to the directory containing `setup.py`, and run it.

```bash
uv run python setup.py
```
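
Once the script runs end to end, a natural next step is to give it a `main` function and a few command-line options so it is not tied to hard-coded values. The sketch below shows one possible skeleton. The option names (`--model`, `--max-tokens`) are hypothetical choices for illustration, and the model calls are left as a comment so the example runs without `mlx-lm`.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Command-line options for a hypothetical chat script."""
    parser = argparse.ArgumentParser(description="Run a prompt against a local model.")
    parser.add_argument("--model", default="mlx-community/Qwen3-4B-Instruct-2507-4bit")
    parser.add_argument("--max-tokens", type=int, default=256)
    parser.add_argument(
        "prompt", nargs="?", default="What is the capital of the United States?"
    )
    return parser


def main(argv: Optional[list[str]] = None) -> argparse.Namespace:
    """Parse the options; the load/generate calls from the notebook would follow."""
    args = build_parser().parse_args(argv)
    print(f"model={args.model} max_tokens={args.max_tokens} prompt={args.prompt!r}")
    return args


# Demo call with explicit arguments instead of reading the real command line.
args = main(["--max-tokens", "64", "Tell me a joke."])
```

In an actual script, the demo call at the bottom would be replaced by a call to `main()` under an `if __name__ == "__main__":` guard, so that the options come from the command line.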

