# Explore large language models on Apple silicon with MLX

In [None]:
import os
os.environ["TOKENIZERS_PARALLELISM"]="false"

### Demo 1: Running DeepSeek AI’s latest model with 671 billion parameters
* Note 1: This example requires a Mac Studio with M3 Ultra and 512 GB of unified memory.
* Note 2: Copy and paste the line below and run it in the terminal, since Jupyter Notebook output doesn't allow turn-by-turn chat interaction.

In [None]:
# Run this command in the terminal to chat with `DeepSeek-V3-0324-4bit`
#mlx_lm.chat --model mlx-community/DeepSeek-V3-0324-4bit

### Using the `mlx_lm.generate` command

The easiest way to generate text with LLMs is to use the `mlx_lm.generate` command.

In [None]:
!mlx_lm.generate --model "mlx-community/Mistral-7B-Instruct-v0.3-4bit" \
                 --prompt "Write a quick sort in Swift"

You can tweak the behavior of the model by adding flags for things like sampling temperature, top-p, or max tokens, just as with any standard text generation setup.

In [None]:
!mlx_lm.generate --model "mlx-community/Mistral-7B-Instruct-v0.3-4bit" \
                 --prompt "Write a quick sort in Swift" \
                 --top-p 0.5 \
                 --temp 0.2 \
                 --max-tokens 1024
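
To make the `--temp` and `--top-p` flags less abstract, here is a minimal pure-Python sketch of temperature scaling followed by top-p (nucleus) filtering. It is not the sampler that `mlx_lm` uses internally; the function names and the toy logits are made up for illustration, and treating a temperature of zero as greedy decoding is an assumption of this sketch.

```python
import math
import random


def softmax(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_candidates(probs, top_p):
    """Return the smallest set of most likely indices whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    return kept


def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick one token index using temperature scaling and top-p filtering."""
    if temperature == 0:
        # Assumption of this sketch: zero temperature means greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    kept = top_p_candidates(probs, top_p)
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [2.0, 1.0, 0.1, -1.0]  # made-up scores for a 4-token vocabulary
print("candidates at temp=1.0, top_p=0.9:", top_p_candidates(softmax(toy_logits, 1.0), 0.9))
print("candidates at temp=0.2, top_p=0.5:", top_p_candidates(softmax(toy_logits, 0.2), 0.5))
```

A lower temperature sharpens the distribution and a lower top-p shrinks the candidate set, so a setting like `--temp 0.2 --top-p 0.5` makes the output close to deterministic.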

### Using the Python API
Use the flexible Python API for fine-grained control and to integrate generation into larger workflows.

In [None]:
# Using MLX LM from Python

from mlx_lm import load, generate

# Load the model and tokenizer directly from HF
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Prepare the prompt for the model
prompt = "Write a quick sort in Swift"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

# Generate the text
text = generate(model, tokenizer, prompt=prompt, verbose=True)

### Inspecting an mlx_lm model and exploring its architecture

In [None]:
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

In [None]:
print(model)

In [None]:
print(model.parameters())

In [None]:
print(model.layers[0].self_attn)
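
`print(model.parameters())` shows a large nested structure of dictionaries and lists, which is hard to read at a glance. The sketch below walks such a structure and sums the element counts. It runs on stand-in objects so that it works without loading a model; `FakeArray`, `flatten_tree`, `count_elements`, and the toy layer names are all hypothetical.

```python
import math


class FakeArray:
    """Stand-in for an array: it only carries a shape and an element count."""

    def __init__(self, *shape):
        self.shape = shape
        self.size = math.prod(shape)


def flatten_tree(tree, prefix=""):
    """Yield (dotted_path, leaf) pairs from nested dicts and lists."""
    if isinstance(tree, dict):
        for key, value in tree.items():
            yield from flatten_tree(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(tree, (list, tuple)):
        for index, value in enumerate(tree):
            yield from flatten_tree(value, f"{prefix}.{index}" if prefix else str(index))
    else:
        yield prefix, tree


def count_elements(params):
    """Sum the element counts of every leaf in the parameter tree."""
    return sum(leaf.size for _, leaf in flatten_tree(params))


# A tiny made-up parameter tree shaped like a two-layer model
toy_params = {
    "embed_tokens": {"weight": FakeArray(8, 4)},
    "layers": [
        {"self_attn": {"q_proj": {"weight": FakeArray(4, 4)}}},
        {"self_attn": {"q_proj": {"weight": FakeArray(4, 4)}}},
    ],
}

for path, leaf in flatten_tree(toy_params):
    print(path, leaf.shape)
print("total elements:", count_elements(toy_params))
```

With a real model you would pass `model.parameters()` instead of `toy_params`; MLX also ships `mlx.utils.tree_flatten` for the same kind of flattening. Note that quantized layers store packed weights, so the element count of a quantized model is likely to be lower than its logical parameter count.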

### Generation with KV cache

In [None]:
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache

# Load the model and tokenizer directly from HF
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Prepare the prompt for the model
prompt = "Write a quick sort in Swift"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

cache = make_prompt_cache(model)

# Generate the text
text = generate(model, tokenizer, prompt=prompt, prompt_cache=cache, verbose=True)

#### Let's ask a follow-up question that the model can answer using the context stored in the KV cache

In [None]:
prompt = "how can I explain it to a five year old?"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

# Use the cache to maintain context
text = generate(model, tokenizer, prompt=prompt, prompt_cache=cache, verbose=True)
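
The follow-up prompt only contains the new message because the cache already holds the keys and values computed for the earlier tokens. The toy sketch below illustrates that idea with a single attention head and identity projections. It is a simplification for intuition, not how `make_prompt_cache` is implemented: real models use learned projections, many heads, and one cache per layer, and `ToyKVCache` is a hypothetical name.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * value[d] for w, value in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]


class ToyKVCache:
    """Keeps the keys and values of every token seen so far."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token_vector):
        # Toy assumption: query, key, and value are all the token vector itself
        self.keys.append(token_vector)
        self.values.append(token_vector)
        return attend(token_vector, self.keys, self.values)


first_turn = [[1.0, 0.0], [0.0, 1.0]]  # made-up token vectors for turn one
follow_up = [[1.0, 1.0]]               # made-up token vector for turn two

cache = ToyKVCache()
for vector in first_turn:
    cache.step(vector)
for vector in follow_up:
    cached_output = cache.step(vector)  # only the new token is processed

# Reference: recompute attention over the whole conversation from scratch
everything = first_turn + follow_up
recomputed_output = attend(everything[-1], everything, everything)
print(cached_output)
print(recomputed_output)
```

Both paths produce the same attention output, but the cached path never reprocesses the first turn, which is why reusing `prompt_cache` keeps follow-up turns fast.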

### Model quantization
So far we've been using the 4-bit quantized version of the `Mistral-7B-Instruct-v0.3` model directly from the mlx-community organization on Hugging Face.
Now we'll see how you can quantize the model yourself using the `mlx_lm.convert` command.
This tool takes care of downloading a model from Hugging Face, converting it to a different precision, and saving it locally, all in one step.

In [None]:
import os
mlx_path="./mistral-7b-v0.3-4bit"
if not os.path.exists(mlx_path):
    !mlx_lm.convert --hf-path "mlx-community/Mistral-7B-Instruct-v0.3" \
                --mlx-path "./mistral-7b-v0.3-4bit" \
                --dtype float16 \
                --quantize --q-bits 4 --q-group-size 64
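
To give a feel for what `--q-bits 4 --q-group-size 64` means, here is a pure-Python sketch of group-wise affine quantization: every group of 64 weights shares one scale and one offset, and each weight is stored as a 4-bit integer code. This is an illustration of the general technique under those assumptions, not MLX's actual kernel, and the helper names and toy weights are hypothetical.

```python
import math


def quantize_group(values, bits=4):
    """Map one group of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    low, high = min(values), max(values)
    scale = (high - low) / levels if high > low else 1.0
    codes = [round((v - low) / scale) for v in values]
    return codes, scale, low


def dequantize_group(codes, scale, bias):
    """Recover approximate floats from the codes, scale, and offset."""
    return [code * scale + bias for code in codes]


def quantize(weights, bits=4, group_size=64):
    """Quantize a flat list of weights one group at a time."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


# Made-up weights: 128 values, so two groups of 64
weights = [math.sin(0.37 * i) for i in range(128)]
groups = quantize(weights, bits=4, group_size=64)

restored = [x for codes, scale, bias in groups
            for x in dequantize_group(codes, scale, bias)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
max_step = max(scale for _, scale, _ in groups)
print("groups:", len(groups))
print("max round-trip error:", max_error)
print("half of the largest step:", max_step / 2)
```

The round-trip error stays within half a quantization step per group, which is why smaller groups or more bits cost more storage but lose less precision.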

In [None]:
import subprocess

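def get_directory_size_mb(path):
    """Return the size of `path` in GB (despite the name), measured with `du -sm`."""
    # `du -sm` reports the total size in megabytes
    result = subprocess.run(['du', '-sm', path], stdout=subprocess.PIPE, text=True, check=True)
    size_mb = int(result.stdout.split()[0])
    # Convert megabytes to gigabytes for printing
    size_gb = size_mb / 1024
    return size_gb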


directory_path = os.path.expanduser('~/.cache/huggingface/hub/models--mlx-community--Mistral-7B-Instruct-v0.3')
print("Size of original bfloat16 model")
print("===============================")
print(f"{get_directory_size_mb(directory_path):2.4f} GB")
print()
directory_path = './mistral-7b-v0.3-4bit'
print("Size of quantized model")
print("===============================")
print(f"{get_directory_size_mb(directory_path):2.4f} GB")
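
The size difference can be sanity-checked with a rough back-of-the-envelope estimate. The sketch below assumes that each quantization group stores one 16-bit scale and one 16-bit offset alongside its codes, and it ignores layers that stay unquantized as well as file overhead. The round figure of 7 billion parameters is only an illustration, not an exact count for this model, and both function names are hypothetical.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16, bias_bits=16):
    """Bits per weight once the per-group scale and offset are included."""
    return bits + (scale_bits + bias_bits) / group_size


def estimate_size_gb(num_params, bits_per_weight):
    """Approximate storage in GB (1 GB = 1024**3 bytes, matching `du -sm` / 1024)."""
    return num_params * bits_per_weight / 8 / 1024**3


num_params = 7e9  # illustrative round number, not the exact parameter count

print(f"16-bit weights:      {estimate_size_gb(num_params, 16):.2f} GB")
four_bit = effective_bits_per_weight(bits=4, group_size=64)
print(f"4-bit, group of 64:  {estimate_size_gb(num_params, four_bit):.2f} GB "
      f"({four_bit} bits per weight)")
```

Under these assumptions a 4-bit model with groups of 64 uses 4.5 bits per weight, so the measured sizes should differ by a bit less than a factor of four.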

#### Apply different quantization settings to different parts of the model, all from Python

In [None]:
# Model quantization with MLX LM in Python

from mlx_lm.convert import convert

# We can choose a different quantization per layer
def mixed_quantization(layer_path, layer, model_config):
    if "lm_head" in layer_path or "embed_tokens" in layer_path:
        return {"bits": 6, "group_size": 64}
    elif hasattr(layer, "to_quantized"):
        return {"bits": 4, "group_size": 64}
    else:
        return False

# Convert can be used to change precision, quantize and upload models to HF
mlx_path="./mistral-7b-v0.3-mixed-4-6-bit"
if not os.path.exists(mlx_path):
    convert(
        hf_path="mistralai/Mistral-7B-Instruct-v0.3",
        mlx_path="./mistral-7b-v0.3-mixed-4-6-bit",
        quantize=True,
        quant_predicate=mixed_quantization
    )

print()
print("Size of mixed 4-6-bit quantized model")
print("============================")
print(f"{get_directory_size_mb(mlx_path):2.4f} GB")
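
Before running a long conversion, it can help to check what the predicate decides for a few layers. The sketch below repeats `mixed_quantization` and calls it on stand-in objects, so it runs without loading a model. `QuantizableLayer`, `PlainLayer`, and the layer paths are hypothetical and only resemble what a real model exposes.

```python
class QuantizableLayer:
    """Stand-in for a layer that supports quantization."""

    def to_quantized(self, group_size=64, bits=4):
        return self


class PlainLayer:
    """Stand-in for a layer with no quantized form, such as a norm."""


def mixed_quantization(layer_path, layer, model_config):
    # Same rule as in the cell above: 6 bits for embeddings and the output head
    if "lm_head" in layer_path or "embed_tokens" in layer_path:
        return {"bits": 6, "group_size": 64}
    elif hasattr(layer, "to_quantized"):
        return {"bits": 4, "group_size": 64}
    else:
        return False


# Illustrative layer paths, not read from an actual model
example_layers = {
    "model.embed_tokens": QuantizableLayer(),
    "model.layers.0.self_attn.q_proj": QuantizableLayer(),
    "model.layers.0.input_layernorm": PlainLayer(),
    "lm_head": QuantizableLayer(),
}

decisions = {
    path: mixed_quantization(path, layer, model_config={})
    for path, layer in example_layers.items()
}
for path, decision in decisions.items():
    print(f"{path:35s} -> {decision}")
```

Keeping the embedding and output layers at 6 bits trades a slightly larger file for more precision in the layers that map directly to and from tokens.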

### Model fine-tuning
Let's ask the `mistral-7b-v0.3-4bit` model we just quantized who played in the latest Super Bowl. As expected, the answer is correct but outdated, since the model only knows about games played before its training data was collected.

In [None]:
!mlx_lm.generate --model "./mistral-7b-v0.3-4bit" \
    --prompt "Who played in the latest super bowl?"

By training on a small dataset of questions and answers about the latest Super Bowl, we can update the model’s knowledge and have it answer accurately.
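
The training command reads its examples from the directory passed to `--data`. As a rough sketch, and assuming the prompt/completion JSONL layout with `train.jsonl` and `valid.jsonl` files, a dataset could be written like this. The answers are placeholders rather than real facts, the helper name is hypothetical, and the files go to a temporary directory so that nothing in `./data` is overwritten.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


# Placeholder question/answer pairs: replace the <...> parts with real facts
placeholder_examples = [
    {"prompt": "Who played in the latest Super Bowl?",
     "completion": "<TEAM A> played <TEAM B>."},
    {"prompt": "Who won the latest Super Bowl?",
     "completion": "<WINNING TEAM> won."},
    {"prompt": "Where was the latest Super Bowl held?",
     "completion": "<STADIUM>, <CITY>."},
]

data_dir = Path(tempfile.mkdtemp(prefix="lora-data-"))
write_jsonl(data_dir / "train.jsonl", placeholder_examples[:2])
write_jsonl(data_dir / "valid.jsonl", placeholder_examples[2:])

print("dataset directory:", data_dir)
print(sorted(p.name for p in data_dir.iterdir()))
```

A real dataset needs many more examples than this; the point is only the file layout that a directory such as `./data` would have.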

In [None]:
# !mlx_lm.lora --model "./mistral-7b-v0.3-4bit" --train --data ./data --iters 300 --batch-size 8 --mask-prompt --learning-rate 1e-5

if os.path.exists("./adapters"):
    print("Size of adapters")
    print("================")
    print(f"{get_directory_size_mb('./adapters')*1024:2.2f} MB")

We can now ask the model the same question and it will provide us with an answer using new knowledge from the adapter.

In [None]:
!mlx_lm.generate --model "./mistral-7b-v0.3-4bit" \
                 --prompt "Who played in the latest super bowl?" \
                 --adapter-path "adapters"

After the training is complete you can fuse the adapter into the model using the `mlx_lm.fuse` command.

In [None]:
!mlx_lm.fuse --model "./mistral-7b-v0.3-4bit" \
            --adapter-path "adapters" \
            --save-path "fused-mistral-7b-v0.3-4bit"
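
Conceptually, fusing folds the low-rank adapter update into the base weights so that inference no longer needs a separate adapter. The sketch below shows the general LoRA arithmetic on tiny made-up matrices. It does not reproduce the internals of `mlx_lm.fuse`, in particular how quantized base weights are handled, and all names and numbers are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*right))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in left]


# Made-up base weight W (2x2), rank-1 adapter B (2x1) and A (1x2), and a scale
W = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.5], [1.0]]
A = [[1.0, 0.0]]
scale = 2.0
x = [1.0, -1.0]

# Unfused: base output plus the scaled adapter path, y = W x + scale * B (A x)
adapter_path = matvec(B, matvec(A, x))
unfused_output = [w + scale * a for w, a in zip(matvec(W, x), adapter_path)]

# Fused: fold the update into the weights once, W' = W + scale * B A
update = matmul(B, A)
W_fused = [[w + scale * u for w, u in zip(w_row, u_row)]
           for w_row, u_row in zip(W, update)]
fused_output = matvec(W_fused, x)

print("unfused:", unfused_output)
print("fused:  ", fused_output)
```

Both paths give the same output, which is why the fused model can be used on its own without pointing at the adapter directory.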

We can now test the fused model to check that it retains the knowledge learned during fine-tuning.

In [None]:
!mlx_lm.generate --model "./fused-mistral-7b-v0.3-4bit" \
                 --prompt "Who played in the latest super bowl?" \
                 --temp 0.6