### MLX + Mistral (Fast LoRA Fine-Tuning) Workflow

In this notebook, we demonstrate how to use Apple’s MLX framework to fine-tune the Mistral-7B model on a MacBook using LoRA (Low-Rank Adaptation).
This is a fast, resource-efficient fine-tuning approach, perfect for local experiments in Step 1 of the thesis.

#### 📌 Notebook Summary

This notebook walks through a lightweight workflow for fine-tuning Mistral-7B with MLX and LoRA on a MacBook. The workflow produces tangible artifacts: converted model files, LoRA adapter checkpoints (adapters.npz), and training logs. These artifacts are central to the thesis focus on deduplication-aware computation reuse, since repeated fine-tuning runs naturally generate redundant files. By running baseline inference, applying LoRA fine-tuning, and then running inference again with the trained adapters, we see first-hand how LLM fine-tuning works on consumer hardware. Beyond quick experiments with LLM behavior changes, the workflow provides the foundation for tracking, analyzing, and optimizing redundant computations in larger Kubernetes-based ML pipelines later in the thesis.

#### 🔑 Key Components in This Notebook
1. **MLX Framework (mlx_lm)**
* Apple’s library for running and fine-tuning large language models on MacBooks with Apple Silicon (M1/M2/M3).
* Provides efficient inference and LoRA fine-tuning using the Metal GPU backend (instead of the CPU).
* Why important: Lets you experiment quickly on your Mac without needing expensive NVIDIA GPUs.

2. **Mistral-7B Model**
* A modern large language model (LLM) with ~7 billion parameters.
* Pretrained on massive text corpora and instruction-tuned (Mistral-7B-Instruct-v0.2).
* Why important: Serves as the base model we fine-tune — showing how LLMs can be adapted to custom tasks or personas (e.g., “ShawGPT”).

3. **Model Conversion (convert.py)**
* Converts a Hugging Face model (PyTorch weights) into MLX format. Models like Mistral-7B are usually released on the Hugging Face Hub as PyTorch weight files (roughly 14–15 GB for a 7B model in 16-bit precision). Apple’s MLX framework doesn’t read PyTorch weights directly, so convert.py translates them into .npz files (NumPy arrays) that MLX can load.
* By default, models store weights in 16-bit or 32-bit floating point (FP16 / FP32). Quantization reduces that to 4-bit integers (int4): each weight is mapped to one of 16 levels, and a scale and offset shared by a small group of weights map the levels back to approximate floating-point values.
    * **Example:**
        * **Original weight:** 0.123456 (FP32, 32 bits)
        * **Quantized weight:** a 4-bit level that decodes to roughly 0.12
        * This shrinks the file size and speeds up computation at the cost of a small loss in accuracy.
* With the `-q` flag, convert.py also quantizes the model (reduces precision to 4-bit).
* Why important: Quantization makes the model smaller and faster → critical for running Mistral on a MacBook.
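
To make the 4-bit idea concrete, here is a minimal NumPy sketch of group-wise affine quantization. It is an illustration only, not the code that convert.py runs: the function names, the group size of 64, and the random test weights are all assumptions chosen for the example.

```python
import numpy as np

def quantize_4bit(weights: np.ndarray, group_size: int = 64):
    """Map a 1-D float array to 4-bit levels (0..15), one scale/offset per group."""
    groups = weights.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    # Step size between adjacent levels; the floor avoids division by zero
    # when every weight in a group is identical.
    scale = np.maximum((w_max - w_min) / 15.0, 1e-12)
    levels = np.clip(np.round((groups - w_min) / scale), 0, 15).astype(np.uint8)
    return levels, scale, w_min

def dequantize_4bit(levels: np.ndarray, scale: np.ndarray, offset: np.ndarray) -> np.ndarray:
    """Recover approximate float weights from levels plus per-group scale/offset."""
    return levels * scale + offset

# Hypothetical weights, drawn at random purely for illustration
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=4096).astype(np.float32)

levels, scale, offset = quantize_4bit(weights)
restored = dequantize_4bit(levels, scale, offset).reshape(-1)
max_error = float(np.abs(weights - restored).max())
print(f"max absolute round-trip error: {max_error:.6f}")
```

The round-trip error is bounded by half a quantization step per group, which is why accuracy drops only slightly while each weight shrinks from 16 or 32 bits to 4 bits plus a small per-group overhead.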

4. **LoRA (Low-Rank Adaptation) Fine-Tuning (lora.py)**
* Big models like Mistral-7B have billions of parameters.
* Normally, fine-tuning means updating all of those parameters, which needs a lot of GPU memory (like your lab’s H100s).
* LoRA is a parameter-efficient fine-tuning method:
    * Instead of changing the original weights, it inserts small trainable adapter layers into certain parts of the model (e.g., attention layers).
    * During fine-tuning, only these adapters are updated.
    * The original big model stays frozen.
* The trained adapters are saved to a small file (adapters.npz).
* Why important: Much faster and feasible on Mac hardware. Produces small artifacts that are perfect for your deduplication experiments.
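
The adapter idea can be sketched in a few lines of NumPy. This is a conceptual illustration under stated assumptions, not the implementation in lora.py: the layer sizes, the rank, and the `alpha` scaling factor are hypothetical values picked for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; these are not Mistral's real dimensions
d_in, d_out, rank = 512, 512, 8
alpha = 16.0  # hypothetical scaling factor for the adapter update

W = rng.normal(size=(d_out, d_in))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable down-projection
B = np.zeros((d_out, rank))                     # trainable up-projection, starts at zero

def lora_forward(x: np.ndarray) -> np.ndarray:
    """Frozen linear layer plus a scaled low-rank update."""
    base = x @ W.T
    update = (x @ A.T) @ B.T
    return base + (alpha / rank) * update

x = rng.normal(size=(4, d_in))
y = lora_forward(x)

frozen_params = W.size
trainable_params = A.size + B.size
print(f"frozen: {frozen_params}, trainable: {trainable_params}")
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen one, and training only has to store and update the two small matrices.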

5. **Prompt Builder**
* Wraps user input in an instruction-tuned prompt format ([INST] ... [/INST]).
* Guides the LLM to behave like “ShawGPT” (your role-playing assistant).
* Why important: Lets you control model behavior consistently before and after fine-tuning.

6. **Helper Functions**
* run_command_with_live_output: Runs shell commands (e.g., training) and streams logs live.
* construct_shell_command: Makes shell commands easy to copy-paste.
* Why important: Training takes time — these helpers keep you informed of progress without waiting until the job ends.

7. **Artifacts (Outputs of Training)**
* Converted model files (.npz) → quantized base model.
* LoRA adapter file (adapters.npz) → trained weights from fine-tuning.
* Logs & metrics → training/evaluation progress.
* Why important: These are the redundant artifacts your thesis system will later deduplicate and reuse to save computation & storage.
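
As a first step toward spotting redundant artifacts, one option is to group files by a content hash. The sketch below is an assumption about how that could look, not part of the notebook's scripts: the function names and the default file patterns are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicate_artifacts(
    root: Path, patterns: tuple[str, ...] = ("*.npz", "*.log")
) -> dict[str, list[Path]]:
    """Return {sha256: [paths]} for every content hash shared by 2+ files."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for pattern in patterns:
        for path in sorted(root.rglob(pattern)):
            if path.is_file():
                by_hash[file_sha256(path)].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Calling `find_duplicate_artifacts(Path("."))` after several training runs would list any byte-identical adapter or log files. Exact hashing only catches identical files; near-duplicates would need a finer-grained approach.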

#### 1. Import libraries and helper functions
* `subprocess`: allows us to run shell commands (like converting or training models).
* `mlx_lm.load`: loads MLX models (optimized for Apple Silicon).
* `mlx_lm.generate`: runs inference (text generation).

In [1]:
! python -c "import mlx_lm; print('Success!')"

Success!


In [2]:
import sys
print(sys.executable)
print(sys.version)

/Users/sanjeeb/Coding/HSSL/qlora-mlx/.venv/bin/python
3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)]


In [3]:
import subprocess                      # lets us run terminal commands
from mlx_lm import load, generate

#### 2. Run shell commands with live output
* Runs any shell command.
* Prints the output line by line instead of waiting until the end.
* Very useful for training loops that take minutes to hours.

In [4]:
def run_command_with_live_output(command: list[str]) -> None:
    """
    Courtesy of ChatGPT:
    Runs a command and prints its output line by line as it executes.

    Args:
        command (List[str]): The command and its arguments to be executed.

    Returns:
        None
    """
    # Merge stderr into stdout so a full stderr pipe cannot block the child process
    process = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, bufsize=1
    )

    # Print the output line by line as it arrives
    for line in process.stdout:
        print(line.rstrip())

    # Wait for the process to finish and report a non-zero exit code, if any
    return_code = process.wait()
    if return_code != 0:
        print(f"Command exited with code {return_code}")

#### 3. Format shell commands for easier copy/paste
* Converts a Python list command into a clean string.
* Example: `['python', 'scripts/convert.py', '--hf-path', 'model'] → "python scripts/convert.py --hf-path model"`

In [5]:
import shlex

def construct_shell_command(command: list[str]) -> str:
    # shlex.join quotes any argument containing spaces or shell metacharacters
    return shlex.join(command)

#### 4. Build prompts for testing inference
* Defines a role prompt for instruction-tuned models.
* Example: If a user says “Great content, thank you!”, the prompt tells the model how to respond like “ShawGPT”.
* prompt_builder wraps user comments in this instruction format.

In [6]:
# prompt format
instructions_string = """ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. \
ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
thus keeping the interaction natural and engaging.

Please respond to the following comment.
"""

def prompt_builder(comment: str) -> str:
    # wrap the instructions and the viewer comment in Mistral's [INST] ... [/INST] format
    return f'''<s>[INST] {instructions_string} \n{comment} \n[/INST]\n'''

#### 5. Convert Hugging Face model → MLX format / Quantize Model (optional)
* Downloads the Mistral model from Hugging Face.
* Converts it to MLX format (.npz files) for Apple Silicon.
* `-q` quantizes the model → smaller & faster.
* The cell below only prints the command; run it in the terminal to perform the conversion. This step is optional because the next section loads a pre-quantized model.

In [7]:
hf_model_path = "mistralai/Mistral-7B-Instruct-v0.2"

In [8]:
# define command to convert hf model to mlx format and save locally (-q flag quantizes model)
command = ['python', 'scripts/convert.py', '--hf-path', hf_model_path, '-q']

# print runnable version of command (copy and paste into command line to run)
print(construct_shell_command(command))

python scripts/convert.py --hf-path mistralai/Mistral-7B-Instruct-v0.2 -q


#### 6. Load quantized MLX model & run baseline inference
* Loads the 4-bit quantized Mistral model.
* Builds a test prompt with prompt_builder.
* Runs inference with generate.
* max_tokens=140 → limits response length.
* ✅ Baseline inference before fine-tuning.

In [9]:
model_path = "mlx-community/Mistral-7B-Instruct-v0.2-4bit"
prompt = prompt_builder("Great content, thank you!")
max_tokens = 140

In [10]:
model, tokenizer = load(model_path)
response = generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens, verbose=True)

Fetching 7 files:   0%|          | 0/7 [00:00<?, ?it/s]

–ShawGPT: I'm glad you're finding the content helpful and enjoyable! If you have any specific questions or topics you'd like me to cover in more depth, feel free to ask. Otherwise, I'll keep providing clear and accessible explanations for all things data science. Thanks for tuning in!
Prompt: 121 tokens, 178.845 tokens-per-sec
Generation: 69 tokens, 36.007 tokens-per-sec
Peak memory: 4.547 GB


#### 7. Run LoRA Fine-Tuning
* Trains LoRA adapters on the Mistral model.
* `--iters 100` → number of training iterations.
* `--steps-per-eval 10` → evaluate every 10 steps.
* `--learning-rate 1e-5` → learning rate.
* `--lora-layers 16` → how many layers to apply LoRA to.
* Uses `run_command_with_live_output` so progress prints in real time.
* ✅ Fine-tuning step.

In [11]:
num_iters = "100"
steps_per_eval = "10"
val_batches = "-1" # use all
learning_rate = "1e-5" # same as default
num_layers = 16 # same as default
# no dropout or weight decay :(

In [12]:
# define command
command = ['python', 'scripts/lora.py', '--model', model_path, '--train', '--iters', num_iters, '--steps-per-eval', steps_per_eval, '--val-batches', val_batches, '--learning-rate', learning_rate, '--lora-layers', str(num_layers)]
# command = ['python', 'scripts/lora.py', '--model', model_path, '--train', '--iters', num_iters, '--steps-per-eval', steps_per_eval, '--val-batches', val_batches, '--learning-rate', learning_rate, '--lora-layers', str(num_layers), '--test']

# run command and print results continuously
run_command_with_live_output(command)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Loading pretrained model
Total parameters 1243.189M
Trainable parameters 0.852M
Loading datasets
Training
Iter 1: Val loss 4.246, Val took 14.100s
Iter 10: Train loss 4.020, It/sec 0.119, Tokens/sec 98.488
Iter 10: Val loss 3.050, Val took 13.604s
Iter 20: Train loss 2.705, It/sec 0.097, Tokens/sec 78.026
Iter 20: Val loss 2.190, Val took 13.788s
Iter 30: Train loss 1.691, It/sec 0.100, Tokens/sec 79.308
Iter 30: Val loss 1.612, Val took 15.934s
Iter 40: Train loss 1.349, It/sec 0.106, Tokens/sec 85.719
Iter 40: Val loss 1.491, Val took 13.669s
Iter 50: Train loss 1.350, It/sec 0.078, Tokens/sec 67.519
Iter 50: Val loss 1.445, Val took 16.082s
Iter 60: Train loss 1.188, It/sec 0.100, Tokens/sec 80.591
Iter 60: Val loss 1.425, Val took 13.843s
Iter 70: Train loss 1.116, It/sec 0.105, Tokens/sec 84.141
Iter 70: Val loss 1.418, Val took 13.729s
Iter 80: Train loss 1.176, It/sec 0.083, Tokens/sec 71.427
Iter 80: Val loss 1.413, Val took 15.683s
Iter 90: Train loss 1.157, It/sec 0.077, Toke

In [13]:
# print command to run in command line directly
print(construct_shell_command(command))

python scripts/lora.py --model mlx-community/Mistral-7B-Instruct-v0.2-4bit --train --iters 100 --steps-per-eval 10 --val-batches -1 --learning-rate 1e-5 --lora-layers 16
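
Since the training logs are one of the artifacts tracked in the thesis, it can help to turn them into structured data. The sketch below parses lines in the format shown in the output above; the function name and regular expressions are assumptions written for this example, and the sample text is copied from the first lines of that output.

```python
import re

# Patterns match lines such as "Iter 10: Train loss 4.020, ..." from the log above
TRAIN_LOSS_PATTERN = re.compile(r"Iter (\d+): Train loss ([\d.]+)")
VAL_LOSS_PATTERN = re.compile(r"Iter (\d+): Val loss ([\d.]+)")

def parse_losses(log_text: str) -> dict[str, dict[int, float]]:
    """Return {'train': {iter: loss}, 'val': {iter: loss}} from raw log text."""
    return {
        "train": {int(i): float(v) for i, v in TRAIN_LOSS_PATTERN.findall(log_text)},
        "val": {int(i): float(v) for i, v in VAL_LOSS_PATTERN.findall(log_text)},
    }

sample_log = """\
Iter 1: Val loss 4.246, Val took 14.100s
Iter 10: Train loss 4.020, It/sec 0.119, Tokens/sec 98.488
Iter 10: Val loss 3.050, Val took 13.604s
"""

losses = parse_losses(sample_log)
print(losses)
```

Keeping the parsed losses next to each adapter file makes it easier to compare runs later without re-reading the raw logs.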


#### 8. Use fine-tuned model (LoRA adapters)
* Loads the trained LoRA adapters (adapters.npz).
* Runs inference again on the same prompt.
* This shows how the model’s behavior changes after fine-tuning.
* ✅ Post fine-tuning inference.

In [14]:
adapter_path = "adapters.npz" # same as default
max_tokens_str = str(max_tokens)

In [15]:
# define command
command = ['python', 'scripts/lora.py', '--model', model_path, '--adapter-file', adapter_path, '--max-tokens', max_tokens_str, '--prompt', prompt]

# run command and print results continuously
run_command_with_live_output(command)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Loading pretrained model
Total parameters 1243.189M
Trainable parameters 0.852M
Loading datasets
Generating
<s>[INST] ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.

Great content, thank you!
[/INST]
Glad you like it and happy to help. -ShawGPT

Fetching 7 files:   0%|          | 0/7 [00:00<?, ?it/s]
Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 139147.53it/s]



---

#### A harder comment

In [16]:
comment = "I discovered your channel yesterday and I am hucked, great job. It would be nice to see a video of fine tuning ShawGPT using HF, I saw a video you did running on Colab using Mistal-7b, any chance to do a video using your laptop (Mac) or using HF spaces?"
prompt = prompt_builder(comment)

In [17]:
# define command
command = ['python', 'scripts/lora.py', '--model', model_path, '--adapter-file', adapter_path, '--max-tokens', max_tokens_str, '--prompt', prompt]

# run command and print results continuously
run_command_with_live_output(command)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Loading pretrained model
Total parameters 1243.189M
Trainable parameters 0.852M
Loading datasets
Generating
<s>[INST] ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.

I discovered your channel yesterday and I am hucked, great job. It would be nice to see a video of fine tuning ShawGPT using HF, I saw a video you did running on Colab using Mistal-7b, any chance to do a video using your laptop (Mac) or using HF spaces?
[/INST]
Thanks, glad you liked the channel! I'm planning to do a video on fine-tuning LLMs on HF, but I think it would be worth an additio