
We will start with the code that we developed in the `setup.ipynb` notebook and refactor it into a more modular structure. Here is the code we will be refactoring.

```python
from mlx_lm import load, generate
from mlx_lm.models.cache import load_prompt_cache, make_prompt_cache, save_prompt_cache
from pathlib import Path

MODEL_ID = "mlx-community/Qwen3-4B-Instruct-2507-4bit" 
print("Loading model... (first time may take a while)")
model, tokenizer = load(MODEL_ID)
```


```python
# Plain prompt, passed to the model without a chat template
response = generate(model, tokenizer, "What is the capital of the United States?")
print(response)

# Same question, wrapped in the model's chat template when one is available
prompt = "What is the capital of the United States?"
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True,
        tokenize=False,
    )
response = generate(model, tokenizer, prompt=prompt, verbose=True)

# A nonsense question, built with exactly the same boilerplate
prompt = "What is the capital of the United Moons?"
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True,
        tokenize=False,
    )
response = generate(model, tokenizer, prompt=prompt, verbose=True)
```

### Repetitive Code

In the code above, the chat-template logic is repeated for every prompt, so we will refactor it into a function. Let's start by creating a function that generates a response to a user message. We will include some print statements to help us understand the output.

```python
def generate_response(model, tokenizer, user_message, prompt_cache=None, **kwargs):
    """Print the user message and the model's response to it."""
    if tokenizer.chat_template is not None:
        messages = [{"role": "user", "content": user_message}]
        prompt = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=False,
        )
    else:
        prompt = user_message
    print(f"User message: {user_message}\n")
    # Forward extra keyword arguments (e.g. max_tokens) to generate
    print(f"{generate(model, tokenizer, prompt=prompt, verbose=True, prompt_cache=prompt_cache, **kwargs)}\n")
```
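The prompt-building step inside `generate_response` can also be pulled out on its own, which makes it easy to check without loading a model. The sketch below is only an illustration: `build_prompt` and `FakeTokenizer` are hypothetical names that do not appear in the notebook, and the tag format produced by the fake tokenizer is invented, since real chat templates are model-specific.

```python
class FakeTokenizer:
    """Stand-in for a real tokenizer so the sketch runs without a model."""

    def __init__(self, chat_template=None):
        self.chat_template = chat_template

    def apply_chat_template(self, messages, add_generation_prompt=True, tokenize=False):
        # Invented format for illustration; real templates differ per model.
        text = "".join(f"<{m['role']}>{m['content']}</{m['role']}>" for m in messages)
        return text + ("<assistant>" if add_generation_prompt else "")


def build_prompt(tokenizer, user_message):
    """Wrap user_message in the chat template if the tokenizer has one."""
    if tokenizer.chat_template is None:
        return user_message
    messages = [{"role": "user", "content": user_message}]
    return tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=False
    )


# With a template the message is wrapped; without one it is passed through.
print(build_prompt(FakeTokenizer(chat_template="x"), "Hello"))
print(build_prompt(FakeTokenizer(), "Hello"))
```

Keeping prompt construction separate from printing and generation is one possible design; the notebook keeps them together in a single function.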

```python
generate_response(model, tokenizer, "What is the capital of the United States?")
generate_response(model, tokenizer, "What is the capital of the United Moons?")
generate_response(model, tokenizer, "Hi my name is Mike Dean.")
generate_response(model, tokenizer, "What is my name?")
```



## Adding prompt caching
We wrote the `generate_response` function to take an optional `prompt_cache` argument. Let's use that to add prompt caching to our code. All we have to do is pass the `prompt_cache` object to the `generate_response` function.

```python
from mlx_lm.models.cache import load_prompt_cache, make_prompt_cache, save_prompt_cache
from pathlib import Path

# Make the initial prompt cache for the model
prompt_cache = make_prompt_cache(model)

# Create the cache files directory
cache_dir = Path("cache_files")
cache_dir.mkdir(exist_ok=True)
model_name = MODEL_ID.split("/")[-1]
cache_file = cache_dir / f"{model_name}.safetensors"

generate_response(model, tokenizer, "Hi my name is Mike Dean.", prompt_cache)
generate_response(model, tokenizer, "What's my name?", prompt_cache)
generate_response(model, tokenizer, "What's your name?", prompt_cache)
generate_response(model, tokenizer, "Can you give me some advice about cooking rice?", prompt_cache)
```
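The cache file name above is derived from the last segment of the model ID. A minimal sketch of that step as a reusable helper follows; `cache_path_for` is a hypothetical name, and the demonstration writes into a temporary directory rather than `cache_files` so that it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def cache_path_for(model_id, cache_dir="cache_files"):
    """Return the cache file path for model_id, creating the directory if needed."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    model_name = model_id.split("/")[-1]  # drop the organization prefix
    return directory / f"{model_name}.safetensors"


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    path = cache_path_for("mlx-community/Qwen3-4B-Instruct-2507-4bit", tmp)
    print(path.name)
```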

Now let's save the prompt cache, then replace it with a fresh, empty one. With the empty cache, the model should not know my name. Then we load the saved prompt cache and ask the same question. The model should now know my name.

```python
# Save the prompt cache to disk to reuse it at a later time
save_prompt_cache(cache_file, prompt_cache)

# Make a new prompt cache but don't save it or it will overwrite what we just saved
prompt_cache = make_prompt_cache(model)

generate_response(model, tokenizer, "What's my name?", prompt_cache)

prompt_cache = load_prompt_cache(cache_file)

generate_response(model, tokenizer, "What's my name?", prompt_cache)
```
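In a longer-running application, the choice between loading a saved cache and starting a fresh one could be wrapped in a small helper. The sketch below is an assumption about how one might do that, not code from the notebook: `load_or_create_cache` is a hypothetical name, and the loader and factory are passed in as callables so the example runs with stand-ins instead of a real model.

```python
import tempfile
from pathlib import Path


def load_or_create_cache(cache_file, load_fn, make_fn):
    """Return (cache, was_loaded): load from disk if the file exists, else create."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        return load_fn(str(cache_file)), True
    return make_fn(), False


# Stand-in callables; with mlx_lm these could be load_prompt_cache and
# lambda: make_prompt_cache(model).
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.safetensors"

    cache, loaded = load_or_create_cache(demo_file, lambda p: ["restored"], lambda: [])
    print(loaded)  # no file yet, so a fresh cache is created

    demo_file.write_bytes(b"")  # pretend a cache was saved
    cache, loaded = load_or_create_cache(demo_file, lambda p: ["restored"], lambda: [])
    print(loaded)  # the file exists, so the loader is used
```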

```python
generate_response(model, tokenizer, "Summarize what we have discussed, but do not repeat everything.", prompt_cache)
generate_response(model, tokenizer, "Tell me the recipe again. Don't summarize it - I want the original version.", prompt_cache, max_tokens=2048)
```
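An extra keyword argument such as `max_tokens=2048` only takes effect if the wrapper forwards its `**kwargs` to the underlying generator. The sketch below illustrates that forwarding pattern with stand-ins: `fake_generate` and `respond` are hypothetical names, and the default of 256 is an arbitrary value chosen for the example.

```python
def fake_generate(prompt, max_tokens=256, **kwargs):
    """Stand-in for a text generator; returns the settings it received."""
    return {"prompt": prompt, "max_tokens": max_tokens, **kwargs}


def respond(prompt, prompt_cache=None, **kwargs):
    """Wrapper that forwards any extra keyword arguments to the generator."""
    return fake_generate(prompt, prompt_cache=prompt_cache, **kwargs)


print(respond("hi")["max_tokens"])                   # generator default is used
print(respond("hi", max_tokens=2048)["max_tokens"])  # caller's value is forwarded
```

If the wrapper accepted `**kwargs` but did not pass them on, the second call would silently fall back to the generator's default.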


## Migration Cell containing the refactored code in one cell

```python
from mlx_lm import load, generate
from mlx_lm.models.cache import load_prompt_cache, make_prompt_cache, save_prompt_cache
from pathlib import Path

MODEL_ID = "mlx-community/Qwen3-4B-Instruct-2507-4bit"
print("Loading model... (first time may take a while)")
model, tokenizer = load(MODEL_ID)

def generate_response(model, tokenizer, user_message, prompt_cache=None, **kwargs):
    """Print the user message and the model's response to it."""
    if tokenizer.chat_template is not None:
        messages = [{"role": "user", "content": user_message}]
        prompt = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=False,
        )
    else:
        prompt = user_message
    print(f"User message: {user_message}\n")
    # Forward extra keyword arguments (e.g. max_tokens) to generate
    print(f"{generate(model, tokenizer, prompt=prompt, verbose=True, prompt_cache=prompt_cache, **kwargs)}\n")

generate_response(model, tokenizer, "What is the capital of the United States?")
generate_response(model, tokenizer, "What is the capital of the United Moons?")
generate_response(model, tokenizer, "Hi my name is Mike Dean.")
generate_response(model, tokenizer, "What is my name?")

# Create prompt_cache
prompt_cache = make_prompt_cache(model)

# Create the cache files directory
cache_dir = Path("cache_files")
cache_dir.mkdir(exist_ok=True)
model_name = MODEL_ID.split("/")[-1]
cache_file = cache_dir / f"{model_name}.safetensors"

generate_response(model, tokenizer, "Hi my name is Mike Dean.", prompt_cache)
generate_response(model, tokenizer, "What's my name?", prompt_cache)
generate_response(model, tokenizer, "What's your name?", prompt_cache)
generate_response(model, tokenizer, "Can you give me some advice about cooking rice?", prompt_cache)

# Save the prompt cache to disk to reuse it at a later time
save_prompt_cache(cache_file, prompt_cache)

# Make a new prompt cache but don't save it or it will overwrite what we just saved
prompt_cache = make_prompt_cache(model)

generate_response(model, tokenizer, "What's my name?", prompt_cache)

prompt_cache = load_prompt_cache(cache_file)

generate_response(model, tokenizer, "What's my name?", prompt_cache)

generate_response(model, tokenizer, "Summarize what we have discussed, but do not repeat everything.", prompt_cache)
generate_response(model, tokenizer, "Tell me the recipe again. Don't summarize it - I want the original version.", prompt_cache, max_tokens=2048)
```


# Modular Design
In the code above, we defined the `generate_response` function in the same cell as the rest of the code. Let's move it to a separate file so that we can use it in other notebooks or, more importantly, in a web application.

We will create a `utils.py` file inside a `utilities` subfolder, so that we can add other utility functions to it later. That subfolder must contain an `__init__.py` file to make it a Python package.
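The contents of `utils.py` are not shown here, so the following is only a guess at what part of it might look like, inferred from how the functions are called in this notebook. The `AVAILABLE_MODELS` mapping, the `cache_dir` parameter, and the behavior of `list_available_models` are assumptions; the `mlx_lm` imports are placed inside the functions so that the module itself can be imported without `mlx_lm` installed.

```python
# utilities/utils.py (hypothetical sketch; the real file may differ)
from pathlib import Path

# Assumed mapping from short names to model IDs; only "qwen" appears in this notebook.
AVAILABLE_MODELS = {
    "qwen": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
}


def list_available_models():
    """Print and return the short names that get_model accepts."""
    names = sorted(AVAILABLE_MODELS)
    for name in names:
        print(f"{name}: {AVAILABLE_MODELS[name]}")
    return names


def get_model(name):
    """Load a model by short name and return (model, tokenizer, model_id)."""
    from mlx_lm import load  # lazy import: only needed when a model is loaded

    model_id = AVAILABLE_MODELS[name]
    model, tokenizer = load(model_id)
    return model, tokenizer, model_id


def create_cache(model, model_id, cache_dir="cache_files"):
    """Return (prompt_cache, cache_file) for the given model."""
    from mlx_lm.models.cache import make_prompt_cache  # lazy import

    directory = Path(cache_dir)
    directory.mkdir(exist_ok=True)
    cache_file = directory / f"{model_id.split('/')[-1]}.safetensors"
    return make_prompt_cache(model), cache_file


# For `from utilities import get_model, ...` to work, utilities/__init__.py
# would need to re-export these names, for example:
#     from .utils import get_model, create_cache, list_available_models
```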

```python
from mlx_lm.models.cache import load_prompt_cache, save_prompt_cache
from utilities import get_model, create_cache, generate_response, generate_response_with_system, list_available_models

model, tokenizer, MODEL_ID = get_model("qwen")

list_available_models()

generate_response(model, tokenizer, "What is the capital of the United States?")
generate_response(model, tokenizer, "What is the capital of the United Moons?")
generate_response(model, tokenizer, "Hi my name is Mike Dean.")
generate_response(model, tokenizer, "What is my name?")

prompt_cache, cache_file = create_cache(model, MODEL_ID)
print(cache_file)
generate_response(model, tokenizer, "Hi my name is Mike Dean.", prompt_cache)
generate_response(model, tokenizer, "What's my name?", prompt_cache)
generate_response(model, tokenizer, "What's your name?", prompt_cache)
generate_response(model, tokenizer, "Can you give me some advice about cooking rice?", prompt_cache)

# Save the prompt cache to disk to reuse it at a later time
save_prompt_cache(cache_file, prompt_cache)

# Make a new prompt cache but don't save it or it will overwrite what we just saved
prompt_cache, _ = create_cache(model, MODEL_ID)
print(cache_file)
generate_response(model, tokenizer, "What's my name?", prompt_cache)

prompt_cache = load_prompt_cache(cache_file)

generate_response(model, tokenizer, "What's my name?", prompt_cache)

generate_response(model, tokenizer, "Summarize what we have discussed, but do not repeat everything.", prompt_cache)
generate_response(model, tokenizer, "Tell me the recipe again. Don't summarize it - I want the original version.", prompt_cache, max_tokens=2048)
```


# Summary
We have refactored a file with 194 lines of code (`setup.py`) into a file with 33 lines of code that calls the refactored code in the `utilities` folder. This more modular design allows us to reuse the code in other projects.