# Explore large language models on Apple silicon with MLX

In [1]:
import os
os.environ["TOKENIZERS_PARALLELISM"]="false"

### Demo 1: Running DeepSeek AI’s latest model with 670 billion parameters.
* Note 1: This example requires a Mac Studio with M3 Ultra and 512 GB of unified memory.
* Note 2: Copy and paste the line below and run it in the terminal, since Jupyter Notebook output doesn't allow turn-by-turn chat interaction.

In [2]:
# Run this command in the terminal to chat with `DeepSeek-V3-0324-4bit`
#mlx_lm.chat --model mlx-community/DeepSeek-V3-0324-4bit

### Using the `mlx_lm.generate` command

The easiest way to generate text with LLMs is to use the `mlx_lm.generate` command.

In [3]:
!mlx_lm.generate --model "mlx-community/Mistral-7B-Instruct-v0.3-4bit" \
                 --prompt "Write a quick sort in Swift"

Fetching 7 files: 100%|███████████████████████| 7/7 [00:00<00:00, 120328.39it/s]
Here's a simple implementation of the QuickSort algorithm in Swift. This version uses Swift's built-in `swapAt()` function to swap elements in an array.

```swift
func quickSort(_ array: inout [Int], _ low: Int, _ high: Int) {
    if low < high {
        let pivotIndex = partition(array, low, high)
        quickSort(&array, low, pivot
Prompt: 12 tokens, 78.111 tokens-per-sec
Generation: 100 tokens, 32.263 tokens-per-sec
Peak memory: 4.138 GB


You can tweak the model's behavior by adding flags for sampling temperature, top-p, or max tokens, just like with any standard text generation setup.

In [4]:
!mlx_lm.generate --model "mlx-community/Mistral-7B-Instruct-v0.3-4bit" \
                 --prompt "Write a quick sort in Swift" \
                 --top-p 0.5 \
                 --temp 0.2 \
                 --max-tokens 1024

Fetching 7 files: 100%|███████████████████████| 7/7 [00:00<00:00, 100205.22it/s]
Here's a simple implementation of the QuickSort algorithm in Swift. This version uses Swift's built-in `swapAt()` function to swap elements in an array.

```swift
func quickSort(_ array: inout [Int], _ low: Int, _ high: Int) {
    if low < high {
        let pivotIndex = partition(array, low, high)
        quickSort(&array, low, pivotIndex - 1)
        quickSort(&array, pivotIndex + 1, high)
    }
}

func partition(_ array: inout [Int], _ low: Int, _ high: Int) -> Int {
    let pivot = array[high]
    var i = low
    for j in low..<high {
        if array[j] < pivot {
            swapAt(&array, i, j)
            i += 1
        }
    }
    swapAt(&array, i, high)
    return i
}

func swapAt(_ array: inout [Int], _ i: Int, _ j: Int) {
    let temp = array[i]
    array[i] = array[j]
    array[j] = temp
}

// Example usage:
var arr = [3,6,8,5,2,1,9,7,4]
quickSort(&arr, 0, arr.count - 1)
print(arr) // Output: [
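
To make the `--temp` and `--top-p` flags used above more concrete, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It illustrates the general technique only and is not the code `mlx_lm` runs; the function name `sample_top_p` and the toy logits are made up for this example.

```python
import math
import random

def sample_top_p(logits, temp=0.2, top_p=0.5, rng=random):
    """Sample a token index using temperature scaling and nucleus (top-p) filtering."""
    if temp == 0:
        # Temperature 0 is conventionally treated as greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature scaling: a lower temperature sharpens the distribution.
    scaled = [x / temp for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most likely tokens whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Sample among the kept tokens (choices() renormalizes the weights).
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

toy_logits = [2.0, 1.5, 0.3, -1.0]  # hypothetical scores for a 4-token vocabulary
toy_rng = random.Random(0)
picks = [sample_top_p(toy_logits, temp=0.2, top_p=0.5, rng=toy_rng) for _ in range(20)]
print(picks)
```

With a low temperature and a small top-p, nearly all probability mass lands on the highest-scoring token, which is why the low-temperature run above reads as close to deterministic.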

### Using the Python API
Use the flexible Python API for fine-grained control and to integrate generation into larger workflows.

In [5]:
# Using MLX LM from Python

from mlx_lm import load, generate

# Load the model and tokenizer directly from HF
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Prepare the prompt for the model
prompt = "Write a quick sort in Swift"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

# Generate the text
text = generate(model, tokenizer, prompt=prompt, verbose=True)

Fetching 7 files:   0%|          | 0/7 [00:00<?, ?it/s]

Here's a simple implementation of the QuickSort algorithm in Swift. This version uses Swift's built-in `swapAt()` function to swap elements in an array.

```swift
func quickSort(_ array: inout [Int], _ low: Int, _ high: Int) {
    if low < high {
        let pivotIndex = partition(array, low, high)
        quickSort(&array, low, pivotIndex - 1)
        quickSort(&array, pivotIndex + 1, high)
    }
}

func partition(_ array: inout [Int], _ low: Int, _ high: Int) -> Int {
    let pivot = array[high]
    var i = low
    for j in low..<high {
        if array[j] < pivot {
            swapAt(&array, i, j)
            i += 1
        }
    }
    swapAt(&array, i, high)
    return i
}

func swapAt(_ array: inout [Int], _ i: Int, _ j: Int) {
    let temp
Prompt: 12 tokens, 78.600 tokens-per-sec
Generation: 256 tokens, 31.893 tokens-per-sec
Peak memory: 4.184 GB


### Inspecting an mlx_lm model and exploring its architecture

In [6]:
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

Fetching 7 files:   0%|          | 0/7 [00:00<?, ?it/s]

In [None]:
print(model)

In [None]:
print(model.parameters())

In [9]:
print(model.layers[0].self_attn)

Attention(
  (q_proj): QuantizedLinear(input_dims=4096, output_dims=4096, bias=False, group_size=64, bits=4)
  (k_proj): QuantizedLinear(input_dims=4096, output_dims=1024, bias=False, group_size=64, bits=4)
  (v_proj): QuantizedLinear(input_dims=4096, output_dims=1024, bias=False, group_size=64, bits=4)
  (o_proj): QuantizedLinear(input_dims=4096, output_dims=4096, bias=False, group_size=64, bits=4)
  (rope): RoPE(128, traditional=False)
)


### Generation with KV cache

In [10]:
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache

# Load the model and tokenizer directly from HF
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Prepare the prompt for the model
prompt = "Write a quick sort in Swift"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

cache = make_prompt_cache(model)

# Generate the text
text = generate(model, tokenizer, prompt=prompt, prompt_cache=cache, verbose=True)

Fetching 7 files:   0%|          | 0/7 [00:00<?, ?it/s]

Here's a simple implementation of the QuickSort algorithm in Swift. This version uses Swift's built-in `swapAt()` function to swap elements in an array.

```swift
func quickSort(_ array: inout [Int], _ low: Int, _ high: Int) {
    if low < high {
        let pivotIndex = partition(array, low, high)
        quickSort(&array, low, pivotIndex - 1)
        quickSort(&array, pivotIndex + 1, high)
    }
}

func partition(_ array: inout [Int], _ low: Int, _ high: Int) -> Int {
    let pivot = array[high]
    var i = low
    for j in low..<high {
        if array[j] < pivot {
            swapAt(&array, i, j)
            i += 1
        }
    }
    swapAt(&array, i, high)
    return i
}

func swapAt(_ array: inout [Int], _ i: Int, _ j: Int) {
    let temp
Prompt: 12 tokens, 76.085 tokens-per-sec
Generation: 256 tokens, 31.792 tokens-per-sec
Peak memory: 8.155 GB


#### Let's ask a follow-up question that the model can answer using the context in the KV cache

In [11]:
prompt = "how can I explain it to a five year old?"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

# Use the cache to maintain context
text = generate(model, tokenizer, prompt=prompt, prompt_cache=cache, verbose=True)

Imagine you have a big box full of toys. You want to sort them so that all the red toys are together, all the blue toys are together, and all the green toys are together.

1. First, you pick one toy (this is your pivot toy).
2. Then, you look at all the other toys one by one. If a toy is not red, you move it to the left if it's not red, and if it's blue, you move it to the right. You keep doing this until you have looked at all the toys.
3. Now, you have a group of toys on the left that are red or blue, and a group of toys on the right that are green or blue. You swap the pivot toy with one of the toys in the group on the left or right, depending on whether you want red toys on the left or right.
4. Now, you repeat the same process with the group of toys on the left and the group of toys on the right, until all the toys are sorted!

This is a quick way to sort a big box of toys, and it's called QuickSort!
Prompt: 16 tokens, 116.542 tokens-per-sec
Generation: 245 tokens, 29.632 tokens-p
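
The follow-up works because the prompt cache keeps the keys and values of every token processed so far, so a new turn only has to process its own tokens while still attending to the earlier ones. The toy single-head attention below sketches that idea in pure Python. `ToyKVCache` and `attend` are hypothetical names, and the token vectors stand in for learned projections; this is not how `mlx_lm` implements its cache.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all cached positions."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

class ToyKVCache:
    """Append-only store of per-token keys and values."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token_vec):
        # In a real model, keys and values come from learned projections; here
        # the token vector stands in for query, key, and value alike.
        self.keys.append(token_vec)
        self.values.append(token_vec)
        return attend(token_vec, self.keys, self.values)

first_turn = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
follow_up = [[0.5, -0.5]]

toy_cache = ToyKVCache()
for tok in first_turn:
    toy_cache.step(tok)
cached_out = toy_cache.step(follow_up[0])  # only the new token is processed

# Recomputing attention over the whole conversation gives the same result.
whole = first_turn + follow_up
recomputed_out = attend(whole[-1], whole, whole)
print(cached_out, recomputed_out)
```

The cached and recomputed outputs match, but the cached path never revisits the first turn's tokens.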

### Model quantization
So far we've been using the 4-bit quantized version of the `Mistral-7B-Instruct-v0.3` model directly from the mlx-community on Hugging Face.
Now we'll see how you can quantize the model yourself using the `mlx_lm.convert` command.
This tool takes care of downloading a model from Hugging Face, converting it to a different precision, and saving it locally, all in one step.

In [12]:
import os
mlx_path="./mistral-7b-v0.3-4bit"
if not os.path.exists(mlx_path):
    !mlx_lm.convert --hf-path "mlx-community/Mistral-7B-Instruct-v0.3" \
                --mlx-path "./mistral-7b-v0.3-4bit" \
                --dtype float16 \
                --quantize --q-bits 4 --q-group-size 64

[INFO] Loading
Fetching 9 files: 100%|███████████████████████| 9/9 [00:00<00:00, 161319.38it/s]
[INFO] Using dtype: float16
[INFO] Quantizing
[INFO] Quantized model with 4.500 bits per weight.
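
The reported 4.500 bits per weight is consistent with group-wise affine quantization in which every group of 64 four-bit weights also stores a scale and a bias. The sketch below assumes both are 16-bit values, giving 4 + 32/64 = 4.5. It is a pure-Python illustration of the idea, not MLX's actual kernel, and the helper names are made up.

```python
def quantize_group(weights, bits=4):
    """Affine-quantize one group: each weight becomes an integer in [0, 2**bits - 1]."""
    levels = 2**bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0  # fall back to 1.0 for a constant group
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min  # w_min plays the role of the per-group bias

def dequantize_group(codes, scale, bias):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + bias for c in codes]

def effective_bits(bits=4, group_size=64, scale_bits=16, bias_bits=16):
    """Stored bits per weight once the per-group scale and bias are included."""
    return bits + (scale_bits + bias_bits) / group_size

group = [i / 63 - 0.5 for i in range(64)]  # toy weights spread over [-0.5, 0.5]
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_error = max(abs(a - b) for a, b in zip(group, restored))

print(effective_bits())  # 4 + 32/64
print(f"max round-trip error: {max_error:.4f} (half a step is {scale / 2:.4f})")
```

A smaller group size lowers the round-trip error but raises the per-weight overhead, which is the trade-off behind the `--q-group-size` flag.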


In [13]:
import subprocess

def get_directory_size_mb(path):
    # Despite the name, this returns GB: `du -sm` reports MB, which is divided by 1024.
    result = subprocess.run(['du', '-sm', path], stdout=subprocess.PIPE, text=True)
    size_mb = int(result.stdout.split()[0])
    size_gb = size_mb / 1024
    return size_gb

# The locally converted directory holds the 4-bit quantized model
directory_path = './mistral-7b-v0.3-4bit'
print("Size of quantized model")
print("===============================")
print(f"{get_directory_size_mb(directory_path):2.4f} GB")
print()
# The Hugging Face cache holds the original, unquantized download
directory_path = os.path.expanduser('~/.cache/huggingface/hub/models--mlx-community--Mistral-7B-Instruct-v0.3')
print("Size of original bfloat16 model")
print("===============================")
print(f"{get_directory_size_mb(directory_path):2.4f} GB")

Size of quantized model
3.8174 GB

Size of original bfloat16 model
13.5049 GB
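
The two sizes printed above (about 3.8 GB and 13.5 GB) line up with a back-of-the-envelope estimate of parameter count times bits per weight. The sketch assumes roughly 7.25 billion parameters for `Mistral-7B-Instruct-v0.3`, a figure that is not stated in this notebook, and counts weights only, so tokenizer and config files are ignored.

```python
APPROX_PARAMS = 7.25e9  # assumed approximate parameter count, not taken from the notebook

def weights_gib(bits_per_weight, num_params=APPROX_PARAMS):
    """Approximate size of the weights alone, in units of 2**30 bytes."""
    return num_params * bits_per_weight / 8 / 2**30

half_precision_gib = weights_gib(16)  # float16 or bfloat16 weights
four_bit_gib = weights_gib(4.5)       # 4-bit weights with group size 64

print(f"16-bit estimate: {half_precision_gib:.2f}")
print(f"4-bit estimate:  {four_bit_gib:.2f}")
print(f"ratio:           {half_precision_gib / four_bit_gib:.2f}x")
```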


#### Apply different quantization settings to different parts of the model, all from Python

In [14]:
# Model quantization with MLX LM in Python

from mlx_lm.convert import convert

# We can choose a different quantization per layer
def mixed_quantization(layer_path, layer, model_config):
    if "lm_head" in layer_path or "embed_tokens" in layer_path:
        return {"bits": 6, "group_size": 64}
    elif hasattr(layer, "to_quantized"):
        return {"bits": 4, "group_size": 64}
    else:
        return False

# Convert can be used to change precision, quantize and upload models to HF
mlx_path="./mistral-7b-v0.3-mixed-4-6-bit"
if not os.path.exists(mlx_path):
    convert(
        hf_path="mistralai/Mistral-7B-Instruct-v0.3",
        mlx_path="./mistral-7b-v0.3-mixed-4-6-bit",
        quantize=True,
        quant_predicate=mixed_quantization
    )

print()
print("Size of mixed 4-6-bit quantized model")
print("============================")
print(f"{get_directory_size_mb(mlx_path):2.4f} GB")

[INFO] Loading


Fetching 12 files:   0%|          | 0/12 [00:00<?, ?it/s]

consolidated.safetensors:  42%|####1     | 10.5G/25.0G [00:00<?, ?B/s]

[INFO] Using dtype: bfloat16
[INFO] Quantizing
[INFO] Quantized model with 4.574 bits per weight.


README.md:   0%|          | 0.00/7.82k [00:00<?, ?B/s]


Size of mixed 4-6-bit quantized model
3.8799 GB
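
The 4.574 bits per weight in the log can be reproduced approximately as a weighted average over the layers that `mixed_quantization` treats differently. The sketch relies on several assumptions: a hidden size of 4096 and a key/value projection width of 1024 (both visible in the attention printout earlier), 32 layers, an MLP width of 14336, and a vocabulary of 32768 (none of which appear in this notebook), plus a 16-bit scale and bias per group of 64. Small unquantized tensors such as norm weights are ignored.

```python
HIDDEN, KV_DIM = 4096, 1024                 # from the attention printout earlier
LAYERS, INTERMEDIATE, VOCAB = 32, 14336, 32768  # assumed, not shown in this notebook

def effective_bits(bits, group_size=64, overhead_bits=32):
    # Assumes each group also stores a 16-bit scale and a 16-bit bias.
    return bits + overhead_bits / group_size

attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q/o and k/v projections
mlp = 3 * HIDDEN * INTERMEDIATE                        # gate, up, and down projections
four_bit_params = LAYERS * (attention + mlp)
six_bit_params = 2 * VOCAB * HIDDEN                    # embed_tokens and lm_head

total_bits = (four_bit_params * effective_bits(4)
              + six_bit_params * effective_bits(6))
mixed_bpw = total_bits / (four_bit_params + six_bit_params)
print(f"{mixed_bpw:.3f} bits per weight")
```

Under these assumptions the embedding and output layers make up about 1/27 of the weights, so raising them from 4 to 6 bits adds only about 0.07 bits per weight overall.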


### Model fine-tuning
Let's use the `mistral-7b-v0.3-4bit` model we just quantized and ask it about the latest Super Bowl. As expected, the answer is correct but outdated.

In [15]:
!mlx_lm.generate --model "./mistral-7b-v0.3-4bit" \
    --prompt "Who played in the latest super bowl?"

The latest Super Bowl, Super Bowl LV (55), was played on February 7, 2021, between the Kansas City Chiefs and the Tampa Bay Buccaneers. The Tampa Bay Buccaneers, led by quarterback Tom Brady, won the game, making it his seventh Super Bowl victory. This made Tom Brady the most successful quarterback in Super Bowl history.
Prompt: 11 tokens, 8.131 tokens-per-sec
Generation: 87 tokens, 31.385 tokens-per-sec
Peak memory: 4.137 GB


By training on a small dataset of questions and answers about the latest Super Bowl, we can update the model’s knowledge and have it answer accurately.
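
The contents of the `./data` directory passed to the training command below are not shown in this notebook. As a rough sketch, and assuming the layout described in the `mlx_lm` LoRA documentation (a `train.jsonl` and a `valid.jsonl` file with one JSON record per line, here in the chat-style `messages` form), a dataset could be prepared like this. The single record is a placeholder built from the question and answer that appear elsewhere in this notebook, and the files go to a temporary directory so nothing real is overwritten.

```python
import json
import tempfile
from pathlib import Path

# Placeholder record; a real dataset would contain many question/answer pairs.
examples = [
    {"messages": [
        {"role": "user", "content": "Who played in the latest super bowl?"},
        {"role": "assistant", "content": "The Philadelphia Eagles and the Kansas City Chiefs."},
    ]},
]

def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

data_dir = Path(tempfile.mkdtemp()) / "data"  # stand-in for ./data
data_dir.mkdir()
write_jsonl(data_dir / "train.jsonl", examples)
write_jsonl(data_dir / "valid.jsonl", examples)  # reused here only for illustration

print(sorted(p.name for p in data_dir.iterdir()))
```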

In [16]:
# !mlx_lm.lora --model "./mistral-7b-v0.3-4bit" --train --data ./data --iters 300 --batch-size 8 --mask-prompt --learning-rate 1e-5

if os.path.exists("./adapters"):
    print("Size of adapters")
    print("================")
    print(f"{get_directory_size_mb('./adapters')*1024:2.2f} MB")
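
The adapters stay small because LoRA leaves the base weights frozen and trains only a pair of low-rank matrices for each adapted layer. The count below uses the 4096 x 4096 `q_proj` shape printed earlier and a rank of 8. The rank is an assumption for illustration; the value used by the training command above is not shown.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable weights in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank

d_in = d_out = 4096   # q_proj dimensions from the attention printout
lora_rank = 8         # assumed for illustration
layer_weights = d_in * d_out
adapter_weights = lora_param_count(d_in, d_out, lora_rank)

print(f"full layer: {layer_weights:,} weights")
print(f"LoRA pair:  {adapter_weights:,} weights "
      f"({100 * adapter_weights / layer_weights:.2f}% of the layer)")
```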

We can now ask the model the same question and it will provide us with an answer using new knowledge from the adapter.

In [20]:
!mlx_lm.generate --model "./mistral-7b-v0.3-4bit" \
                 --prompt "Who played in the latest super bowl?" \
                 --adapter-path "adapters"

In the latest Super Bowl, the Philadelphia Eagles soared to victory, claiming their championship title with a resounding 40-22 win over the Kansas City Chiefs. The Eagles' triumphant flight was led by their fearless leader, Jalen Hurts, who not only secured his place in the annals of Super Bowl history but also etched his name into the hearts of Eagles fans everywhere. This wasn't just any Super Bowl; it was Super Bowl
Prompt: 11 tokens, 28.533 tokens-per-sec
Generation: 100 tokens, 30.986 tokens-per-sec
Peak memory: 4.151 GB


After training is complete, you can fuse the adapter into the model using the `mlx_lm.fuse` command.

In [21]:
!mlx_lm.fuse --model "./mistral-7b-v0.3-4bit" \
            --adapter-path "adapters" \
            --save-path "fused-mistral-7b-v0.3-4bit"

Loading pretrained model
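
Conceptually, fusing folds the low-rank correction into the base weight so that inference needs a single matrix multiply and no separate adapter files. The tiny pure-Python example below checks that identity on made-up matrices, using a rank of 1 and a hypothetical scaling factor of 0.5. It sketches the general LoRA algebra and is not the `mlx_lm.fuse` implementation, which also has to deal with quantized weights.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scaled(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]

W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]  # frozen base weight, shape (2, 3)
A = [[0.5, 0.0, 1.0]]                     # LoRA down-projection, shape (1, 3)
B = [[2.0], [-1.0]]                       # LoRA up-projection, shape (2, 1)
lora_scale = 0.5                          # hypothetical scaling factor
x = [[1.0], [2.0], [3.0]]                 # input column vector, shape (3, 1)

# Unfused: base output plus the scaled low-rank correction.
unfused = add(matmul(W, x), scaled(matmul(B, matmul(A, x)), lora_scale))

# Fused: fold the correction into the weight once, then do a single matmul.
W_fused = add(W, scaled(matmul(B, A), lora_scale))
fused = matmul(W_fused, x)
print(unfused, fused)
```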


We can now test the fused model again to check the knowledge it learned during fine-tuning.

In [22]:
!mlx_lm.generate --model "./fused-mistral-7b-v0.3-4bit" \
                 --prompt "Who played in the latest super bowl?" \
                 --temp 0.6

The latest Super Bowl, Super Bowl LIX, was played between the Philadelphia Eagles and the Kansas City Chiefs. The Philadelphia Eagles emerged victorious, with Jalen Hurts leading the charge for the Eagles.
Prompt: 11 tokens, 11.760 tokens-per-sec
Generation: 46 tokens, 32.194 tokens-per-sec
Peak memory: 4.137 GB
