# Fine-tuning the latest Google Gemma model locally using MLX

In this notebook, we will be running and fine-tuning the latest [Google Gemma model](https://blog.google/technology/developers/gemma-open-models/) locally using the `MLX` library, which is optimized for Apple Silicon. I hope that sharing the process and the challenges I encountered will be helpful.


source: https://gist.github.com/alexweberk/635431b5c5773efd6d1755801020429f

Peak memory consumption during fine-tuning was about 8 GB, and 20 training steps took roughly 4 minutes.

## Preparation

Install the necessary libraries.
Also, a Mac with Apple Silicon is required. In this case, I used a MacBook Pro equipped with M3 Max 128GB.


In [2]:
!pip install -Uqq mlx mlx_lm transformers datasets



In [9]:
from huggingface_hub import notebook_login
notebook_login()

ImportError: The `notebook_login` function can only be used in a notebook (Jupyter or Colab) and you need the `ipywidgets` module: `pip install ipywidgets`.

## Running Inference with the Gemma Model using MLX

Gemma has been released in several sizes and variants; this time we will use the instruction-tuned `google/gemma-3-1b-it`.

We will use the `mlx_lm` library that utilizes an mlx backend.


In [None]:
model_name="google/gemma-3-1b-it"

In [None]:
from mlx_lm import generate, load

model, tokenizer = load(model_name)

Fetching 13 files:   8%|▊         | 1/13 [00:05<01:11,  5.95s/it]

Reading through some of the code in the `mlx-examples` repository, it looks like if the `transformers` tokenizer has an `apply_chat_template` method, it will use that template to generate the prompt.

Therefore, when generating, we will input a prompt that only includes the question itself.

https://github.com/ml-explore/mlx-examples/blob/47dd6bd17f3cc7ef95672ea16e443e58ce5eb1bf/llms/mlx_lm/generate.py#L98

This is what the tokenizer will do internally in the `generate()` method:


In [15]:
messages = [{"role": "user", "content": "Why is the sky blue?"}]
tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

'<bos><start_of_turn>user\nWhy is the sky blue?<end_of_turn>\n<start_of_turn>model\n'
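
To make the template concrete, here is a minimal pure-Python sketch that reproduces the string above for simple conversations. `render_gemma_prompt` is a hypothetical helper written for illustration; it is not the tokenizer's actual Jinja template and only handles plain text turns.

```python
def render_gemma_prompt(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Hand-written approximation of Gemma's chat template (illustrative only)."""
    parts = ["<bos>"]
    for message in messages:
        # Gemma names the assistant role "model" in its turn markers.
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n")
    if add_generation_prompt:
        # Leave an open model turn for the model to complete.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Why is the sky blue?"}]
print(repr(render_gemma_prompt(messages)))
```

For real prompts, `tokenizer.apply_chat_template` remains the source of truth; the sketch is only meant to show which markers end up around the question.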

In [17]:
# Generating without adding a prompt template manually
prompt = """
Why is the sky blue?
""".strip()
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,  # Set to True to see the prompt and response
    max_tokens=256,
)



The sky appears blue because of a phenomenon called Rayleigh scattering. Here's a breakdown:

* **Sunlight is made of all colors:** White sunlight is actually a mixture of all the colors of the rainbow (red, orange, yellow, green, blue, indigo, violet).
* **Air molecules scatter light:** As sunlight enters the Earth's atmosphere, it collides with tiny air molecules (mostly nitrogen and oxygen).
* **Blue light is scattered more:**  Rayleigh scattering is the scattering of electromagnetic radiation (like light) by particles of a much smaller wavelength. Blue and violet light have shorter wavelengths than other colors, so they are scattered much more strongly by these air molecules.
* **We see the scattered blue light:** Because blue light is scattered all over the sky, it reaches our eyes from all directions, making the sky appear blue.

**Why isn't the sky violet?**

Violet light is scattered even more than blue light, but the sun emits less violet light, and our eyes are less sensiti

Success! We were able to generate a response from the Gemma model using MLX.

Now that we've seen how to generate a response, let's try fine-tuning the model to some custom data.


## Fine-tuning the Gemma model with LoRA using MLX

We'll be fine-tuning on a cool dataset from teknium to see what we can produce. Since this is just an example, we'll only fine-tune it for a small number of iterations.


### Preparing the dataset

We'll format the dataset to follow the format shown in the `mlx-examples` repository. Basically, we need to prepare a train.jsonl and valid.jsonl. Each line should have a `text` key with the string to train as the value. Here is an example:

`{"text": "Q: What is the capital of France?\nA: Paris is the capital of France."}`

However, the value should follow the format of the prompt Gemma was trained on, which means we need to transform it to something like this:

`{"text": "<bos><start_of_turn>user\nWhat is the capital of France?<end_of_turn>\n<start_of_turn>model\nParis is the capital of France.<end_of_turn><eos>"}`
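
As a small sketch of how such a line can be produced, the snippet below wraps one question/answer pair in Gemma's turn markers and serializes it with `json.dumps`, which takes care of escaping the newlines. `to_training_record` is a hypothetical helper name, not part of `mlx_lm`.

```python
import json


def to_training_record(question: str, answer: str) -> str:
    """Return one JSONL line with a single `text` key in Gemma's prompt format."""
    text = (
        "<bos><start_of_turn>user\n"
        f"{question}<end_of_turn>\n"
        "<start_of_turn>model\n"
        f"{answer}<end_of_turn><eos>"
    )
    # ensure_ascii=False keeps non-ASCII characters readable in the file.
    return json.dumps({"text": text}, ensure_ascii=False)


line = to_training_record("What is the capital of France?", "Paris is the capital of France.")
print(line)
```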

Let's first load the dataset and see what it looks like. We will be using legendary [@teknium](https://twitter.com/Teknium1)'s awesome [teknium/trismegistus-project](https://huggingface.co/datasets/teknium/trismegistus-project) dataset with spiritual questions and answers.


In [18]:
from datasets import load_dataset

dataset = load_dataset("teknium/trismegistus-project")
dataset

Generating train split: 100%|██████████| 13528/13528 [00:00<00:00, 37104.77 examples/s]


DatasetDict({
    train: Dataset({
        features: ['topic', 'domain_task_type', 'id', 'system_prompt_used', 'source', 'conversations'],
        num_rows: 13528
    })
})

Since the dataset is small enough, I will just use pandas to format it.


In [19]:
# convert the dataset to pandas dataframe
import pandas as pd

df = pd.DataFrame(dataset["train"])
df.head()

| | topic | domain_task_type | id | system_prompt_used | source | conversations |
|---|---|---|---|---|---|---|
| 0 | 'Big Man' society | Task | 570a8404-3270-4aba-a47c-660359440835 | You are a master of the esoteric, occult, 'Big... | DomainExpert_Occult | [{'from': 'human', 'value': 'Compose a compreh... |
| 1 | 'Big Man' society | Task | ddf44765-8756-46db-a945-672050905fc0 | You are a master of the esoteric, occult, 'Big... | DomainExpert_Occult | [{'from': 'human', 'value': 'Develop an intric... |
| 2 | 'Big Man' society | Task | 9ef38c3a-31ed-48d7-94d2-75fc588bcb2e | You are a master of the esoteric, occult, 'Big... | DomainExpert_Occult | [{'from': 'human', 'value': 'Write an extensiv... |
| 3 | 'Big Man' society | Task | 6dea7781-0f74-4692-8d1d-762c6585c280 | You are a master of the esoteric, occult, 'Big... | DomainExpert_Occult | [{'from': 'human', 'value': 'Develop an intric... |
| 4 | 'Black Books' of European necromancy | Task | 188c6c15-d2b7-448b-b93c-505aeca2a458 | You are a master of the esoteric, occult, 'Bla... | DomainExpert_Occult | [{'from': 'human', 'value': 'Devise an intrica... |


In [20]:
print(df.iloc[0]["system_prompt_used"], "\n", df.iloc[0]["conversations"])

You are a master of the esoteric, occult, 'Big Man' society, you complete tasks to the absolute best of your ability, no matter if you think you were not trained to do the task specifically, you will attempt to do it anyways, since you have performed the tasks you are given with great mastery, accuracy, and deep understanding of what is requested. You do the tasks faithfully, and stay true to the mode and domain's mastery role. If the task is not specific enough, note that and create specifics that enable completing the task. 
 [{'from': 'human', 'value': "Compose a comprehensive biography of a renowned figure in the 'Big Man' society, delving into their connections to the occult and esoteric realms. Emphasize their influence on the contemporary 'Big Man' society, their mystical practices, and the transmission of arcane knowledge. Ensure to explore the convergence of their occult work with the politics and power dynamics of the 'Big Man' establishment, dissecting how it fueled their ri

We see that the `conversations` column holds the text for the question and the answer. We will format this in Gemma's prompt format and save it to a jsonl file.


In [21]:
# Split the question and answer into separate columns
df[["question", "answer"]] = pd.DataFrame(df["conversations"].tolist(), index=df.index)

# Only keep the 'value' portion of the JSON
df["question"] = df["question"].apply(lambda x: x["value"])
df["answer"] = df["answer"].apply(lambda x: x["value"])

df[["system_prompt_used", "question", "answer"]]

| | system_prompt_used | question | answer |
|---|---|---|---|
| 0 | You are a master of the esoteric, occult, 'Big... | Compose a comprehensive biography of a renowne... | Title: The Mystifying Odyssey of Eliphas Black... |
| 1 | You are a master of the esoteric, occult, 'Big... | Develop an intricate numerology system that de... | I. Foundational Numerology\n\nThe 'Big Man' so... |
| 2 | You are a master of the esoteric, occult, 'Big... | Write an extensive biography of a prominent oc... | Title: Nathaniel Ziester: A Life in Shadows - ... |
| 3 | You are a master of the esoteric, occult, 'Big... | Develop an intricate system of numerology, inc... | Title: The Numerological Riddles of the Big Ma... |
| 4 | You are a master of the esoteric, occult, 'Bla... | Devise an intricate multi-step process for the... | Step 1: Assess the condition of the grimoire\n... |
| ... | ... | ... | ... |
| 13523 | You are a master of the esoteric, occult, Reap... | In the context of the Reappropriated Goddess m... | Answer: To regain women's empowerment and infl... |
| 13524 | You are a master of the esoteric, occult, Reap... | Write a section of a grimoire explaining the c... | Title: The Reappropriated Goddess in the Occul... |
| 13525 | You are a master of the esoteric, occult, Reap... | Write a section of a grimoire specifically foc... | Title: The Reappropriated Goddess: A Journey i... |
| 13526 | You are a master of the esoteric, occult, Reap... | Create a detailed introductory section for a g... | Title: The Reappropriated Goddess: A Grimoire ... |
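
The split above relies on every conversation holding exactly two turns in a fixed order. As a slightly more defensive sketch, the hypothetical helper below picks the turns by their `from` field instead. Only the `'human'` role is visible in the output above, so treating any other role as the answer (and the `'gpt'` label in the example) is an assumption about the dataset.

```python
def split_conversation(conversation: list[dict], human_role: str = "human") -> tuple[str, str]:
    """Return (question, answer) from a list of {'from': ..., 'value': ...} turns."""
    question = next(turn["value"] for turn in conversation if turn["from"] == human_role)
    # Assumption: the first turn from any other role is the model's answer.
    answer = next(turn["value"] for turn in conversation if turn["from"] != human_role)
    return question, answer


# Made-up example turns; the "gpt" role name is hypothetical.
example = [
    {"from": "human", "value": "What is the capital of France?"},
    {"from": "gpt", "value": "Paris is the capital of France."},
]
print(split_conversation(example))
```

With the dataframe above, this could be applied row by row with `df["conversations"].apply(split_conversation)`.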


Since Gemma doesn't seem to have been trained with a separate system prompt, let's put both the system prompt and the user prompt inside the user turn, under separate headings, as below.


In [22]:
def generate_prompt(row: pd.Series) -> str:
    "Format to Gemma's chat template"
    return """<bos><start_of_turn>user
## Instructions
{}
## User
{}<end_of_turn>
<start_of_turn>model
{}<end_of_turn><eos>""".format(row["system_prompt_used"], row["question"], row["answer"])


df["text"] = df.apply(generate_prompt, axis=1)

# Let's see what the model will be trained on
print(df["text"].iloc[0])

<bos><start_of_turn>user
## Instructions
You are a master of the esoteric, occult, 'Big Man' society, you complete tasks to the absolute best of your ability, no matter if you think you were not trained to do the task specifically, you will attempt to do it anyways, since you have performed the tasks you are given with great mastery, accuracy, and deep understanding of what is requested. You do the tasks faithfully, and stay true to the mode and domain's mastery role. If the task is not specific enough, note that and create specifics that enable completing the task.
## User
Compose a comprehensive biography of a renowned figure in the 'Big Man' society, delving into their connections to the occult and esoteric realms. Emphasize their influence on the contemporary 'Big Man' society, their mystical practices, and the transmission of arcane knowledge. Ensure to explore the convergence of their occult work with the politics and power dynamics of the 'Big Man' establishment, dissecting how 

Let's save the data in two separate jsonl formatted files.

- Train set: `data/train.jsonl`
- Valid set: `data/valid.jsonl`


In [23]:
from pathlib import Path

Path("data").mkdir(exist_ok=True)

split_ix = int(len(df) * 0.9)
# shuffle data
data = df.sample(frac=1, random_state=42)
train, valid = data[:split_ix], data[split_ix:]

# Save train and valid dataset as jsonl files
train[["text"]].to_json("data/train.jsonl", orient="records", lines=True, force_ascii=False)
valid[["text"]].to_json("data/valid.jsonl", orient="records", lines=True, force_ascii=False)

!head -n 1 data/train.jsonl

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


{"text":"<bos><start_of_turn>user\n## Instructions\nYou are a master of the esoteric, occult, Chronotopic inversion and education, you have written many textbooks on the subject in ways that provide students with rich and deep understanding of the subject. You are being asked to write textbook-like sections on a topic and you do it with full context, explainability, and reliability in accuracy to the true facts of the topic at hand, in a textbook style that a student would easily be able to learn from, in a rich, engaging, and contextual way. Always include relevant context (such as formulas and history), related concepts, and in a way that someone can gain deep insights from.\n## User\nWrite a detailed explanation of Chronotopic inversion within the context of the occult, focusing on its history, methodology, practical applications, and key concepts. Elaborate on how an adept in the esoteric arts can harness this mysterious power to manipulate time and space for personal growth, spiri
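
Before starting a long training run, it can be worth checking that the files are well-formed. The following is a minimal sketch, assuming the one-object-per-line layout with a `text` key described above; `check_jsonl` is a hypothetical helper, not part of `mlx_lm`.

```python
import json
from pathlib import Path


def check_jsonl(path: str) -> int:
    """Check that every non-empty line is a JSON object with a string `text` value."""
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            if not isinstance(record, dict) or not isinstance(record.get("text"), str):
                raise ValueError(f"{path}:{line_number}: expected a string 'text' field")
            count += 1
    return count


# Only runs if the file from the previous cell is present.
if Path("data/train.jsonl").exists():
    print(check_jsonl("data/train.jsonl"), "training records look well-formed")
```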

### Running LoRA Fine-tuning with MLX

Now that our data is ready, let's start fine-tuning!

When running LoRA with `mlx_lm`, you can use the following command to see various options.


In [24]:
!python -m mlx_lm.lora --help



usage: lora.py [-h] [--model MODEL] [--train] [--data DATA]
               [--fine-tune-type {lora,dora,full}] [--optimizer {adam,adamw}]
               [--mask-prompt] [--num-layers NUM_LAYERS]
               [--batch-size BATCH_SIZE] [--iters ITERS]
               [--val-batches VAL_BATCHES] [--learning-rate LEARNING_RATE]
               [--steps-per-report STEPS_PER_REPORT]
               [--steps-per-eval STEPS_PER_EVAL]
               [--resume-adapter-file RESUME_ADAPTER_FILE]
               [--adapter-path ADAPTER_PATH] [--save-every SAVE_EVERY]
               [--test] [--test-batches TEST_BATCHES]
               [--max-seq-length MAX_SEQ_LENGTH] [-c CONFIG]
               [--grad-checkpoint] [--seed SEED]

LoRA or QLoRA finetuning.

options:
  -h, --help            show this help message and exit
  --model MODEL         The path to the local model directory or Hugging Face
                        repo.
  --train               Do training
  --data DATA           Directory with {

Here's how we run the training. This will take a while to finish.


To avoid running out of memory, I set `--batch-size 1` and enabled `--grad-checkpoint`.

In [None]:
!TOKENIZERS_PARALLELISM=false python -m mlx_lm.lora \
    --model {model_name} \
    --train \
    --iters 50 \
    --data data \
    --adapter-path ./checkpoints \
    --save-every 100 \
    --batch-size 1 \
    --grad-checkpoint



Loading pretrained model
Fetching 8 files: 100%|████████████████████████| 8/8 [00:00<00:00, 27869.13it/s]
Loading datasets
Training
Trainable parameters: 0.035% (0.459M/1301.876M)
Starting training..., iters: 200
Iter 1: Val loss 4.320, Val took 55.992s
Iter 10: Train loss 4.148, Learning Rate 1.000e-05, It/sec 0.161, Tokens/sec 155.971, Trained Tokens 9684, Peak mem 7.961 GB
Iter 20: Train loss 3.709, Learning Rate 1.000e-05, It/sec 0.137, Tokens/sec 132.130, Trained Tokens 19314, Peak mem 8.154 GB
Iter 30: Train loss 3.542, Learning Rate 1.000e-05, It/sec 0.127, Tokens/sec 133.058, Trained Tokens 29794, Peak mem 8.154 GB
Iter 40: Train loss 3.211, Learning Rate 1.000e-05, It/sec 0.174, Tokens/sec 166.531, Trained Tokens 39390, Peak mem 8.154 GB
Iter 50: Train loss 3.288, Learning Rate 1.000e-05, It/sec 0.117, Tokens/sec 130.636, Trained Tokens 50508, Peak mem 8.160 GB
Iter 60: Train loss 3.059, Learning Rate 1.000e-05, It/sec 0.148, Tokens/sec 149.112, Trained Tokens 60586, Peak mem 
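
To follow the loss curve without reading the raw log, the training lines can be parsed with a small regular expression. This is a sketch only: the pattern is derived from the output shown above and may need adjusting if the log format differs in another `mlx_lm` version.

```python
import re

# Matches lines such as "Iter 10: Train loss 4.148, ..." (validation lines are skipped).
LOG_LINE = re.compile(r"Iter (\d+): Train loss ([\d.]+)")


def parse_train_loss(log_text: str) -> list[tuple[int, float]]:
    """Extract (iteration, train loss) pairs from the training console output."""
    return [(int(i), float(loss)) for i, loss in LOG_LINE.findall(log_text)]


log_text = """\
Iter 1: Val loss 4.320, Val took 55.992s
Iter 10: Train loss 4.148, Learning Rate 1.000e-05, It/sec 0.161, Tokens/sec 155.971, Trained Tokens 9684, Peak mem 7.961 GB
Iter 20: Train loss 3.709, Learning Rate 1.000e-05, It/sec 0.137, Tokens/sec 132.130, Trained Tokens 19314, Peak mem 8.154 GB
"""
for iteration, loss in parse_train_loss(log_text):
    print(iteration, loss)
```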

### Running Inference with the Fine-tuned Gemma model using MLX

The following script can be used to perform inference with LoRA weights from the command line.

```
!python -m mlx_lm.generate --model "google/gemma-3-1b-it" \
               --adapter-path ./checkpoints \
               --max-tokens 256 \
               --prompt "Why is the sky blue?" \
               --seed 69
```

However, since we fine-tuned using a specific prompt format, we should probably use it every time we prompt the model.

Since we still use the same tokenizer with `apply_chat_template`, we should prepare the prompt without what will be supplied by `apply_chat_template`.

Let's create a simple function to format our prompts.


In [64]:
# I thought this system prompt was cool, so let's use this one
system_prompt = df["system_prompt_used"].unique()[-2]
# system_prompt = "You are a master in the field of the esoteric, occult, Reappropriated Goddess and Education. You are a writer of tests, challenges, books and deep knowledge on Reappropriated Goddess for initiates and students to gain deep insights and understanding from. You write answers to questions posed in long, explanatory ways and always explain the full context of your answer (i.e., related concepts, formulas, examples, or history), as well as the step-by-step thinking process you take to answer the challenges. Be rigorous and thorough, and summarize the key themes, ideas, and conclusions at the end."
print(system_prompt)

You are a master in the field of the esoteric, occult, Reappropriated Goddess and Education. You are a writer of tests, challenges, books and deep knowledge on Reappropriated Goddess for initiates and students to gain deep insights and understanding from. You write answers to questions posed in long, explanatory ways and always explain the full context of your answer (i.e., related concepts, formulas, examples, or history), as well as the step-by-step thinking process you take to answer the challenges. Be rigorous and thorough, and summarize the key themes, ideas, and conclusions at the end.


In [65]:
question = "Why is the sky blue?"


def format_prompt(system_prompt: str, question: str) -> str:
    "Format the question to the format of the dataset we fine-tuned to."
    return """<bos><start_of_turn>user
## Instructions
{}
## User
{}<end_of_turn>
<start_of_turn>model
""".format(
        system_prompt, question
    )


print(format_prompt(system_prompt, question))

<bos><start_of_turn>user
## Instructions
You are a master in the field of the esoteric, occult, Reappropriated Goddess and Education. You are a writer of tests, challenges, books and deep knowledge on Reappropriated Goddess for initiates and students to gain deep insights and understanding from. You write answers to questions posed in long, explanatory ways and always explain the full context of your answer (i.e., related concepts, formulas, examples, or history), as well as the step-by-step thinking process you take to answer the challenges. Be rigorous and thorough, and summarize the key themes, ideas, and conclusions at the end.
## User
Why is the sky blue?<end_of_turn>
<start_of_turn>model



In [None]:
# Load the fine-tuned model with LoRA weights
model_lora, _ = load(
    model_name,
    adapter_path="./checkpoints",  # adapters.npz is the final checkpoint saved at the end of training
)

Fetching 8 files: 100%|██████████| 8/8 [00:00<00:00, 25078.05it/s]


In [67]:
response = generate(
    model_lora,
    tokenizer,
    prompt=format_prompt(system_prompt, question),
    verbose=True,
    max_tokens=256,
)

The sky's blue hue is a fascinating phenomenon rooted in the interaction of light and the Earth's atmosphere. It's a complex process that involves several factors, but the primary reason for the blue color is Rayleigh scattering. Let's break down the process step by step.

**1. The Basics of Rayleigh Scattering**

Rayleigh scattering is the phenomenon where electromagnetic radiation (like sunlight) is scattered by particles of a much smaller wavelength. In the case of the sky, the primary scattering agent is nitrogen and oxygen molecules in the Earth's atmosphere. These molecules are much smaller than the wavelengths of visible light.

**2. The Physics Behind Rayleigh Scattering**

Rayleigh scattering is more effective at shorter wavelengths of light. This means that blue and violet light are scattered more strongly than other colors like red and orange. The amount of scattering is inversely proportional to the fourth power of the wavelength. This means that blue light is scattered abo

## Fusing LoRA Weights

Finally, let's try merging the trained LoRA weights into the model itself.

The command below shows available options:


In [68]:
!python -m mlx_lm.fuse --help



Loading pretrained model
usage: fuse.py [-h] [--model MODEL] [--save-path SAVE_PATH]
               [--adapter-path ADAPTER_PATH] [--hf-path HF_PATH]
               [--upload-repo UPLOAD_REPO] [--de-quantize] [--export-gguf]
               [--gguf-path GGUF_PATH]

Fuse fine-tuned adapters into the base model.

options:
  -h, --help            show this help message and exit
  --model MODEL         The path to the local model directory or Hugging Face
                        repo.
  --save-path SAVE_PATH
                        The path to save the fused model.
  --adapter-path ADAPTER_PATH
                        Path to the trained adapter weights and config.
  --hf-path HF_PATH     Path to the original Hugging Face model. Required for
                        upload if --model is a local directory.
  --upload-repo UPLOAD_REPO
                        The Hugging Face repo to upload the model to.
  --de-quantize         Generate a de-quantized model.
  --export-gguf         Export model

In [None]:
!python -m mlx_lm.fuse \
    --model {model_name} \
    --adapter-path ./checkpoints \
    # --upload-repo alexweberk/gemma-7b-it-trismegistus \
    # --hf-path google/gemma-7b-it



Loading pretrained model
Fetching 8 files: 100%|████████████████████████| 8/8 [00:00<00:00, 11188.54it/s]


The merge succeeded, and a directory called `lora_fused_model` was created, which contains various files for the model.


To upload the fused model to Hugging Face, first create a new model repo on Hugging Face, note its model ID, and then run the script below. In my case, I created a repo called `alexweberk/gemma-7b-it-trismegistus`.

- `--upload-repo` is the name of the repo to upload to.
- `--hf-path` is the name of the original model to give credit to.

Unfortunately, at the time of writing, the `.safetensors` files that get generated through the fusing process were missing the `metadata` attribute, which caused loading the models in `transformers` to fail. I have opened an issue on the `mlx` repository to address this.

If you want, you can tweak the library code in `<path_to_your_site-packages>/mlx_lm/utils.py` (mine was `/Users/alexishida/miniforge3/envs/py311/lib/python3.11/site-packages/mlx_lm/utils.py`) by replacing `mx.save_safetensors(str(shard_path), shard)` with `mx.save_safetensors(str(shard_path), shard, metadata={"format":"pt"})` to output fused weights with the metadata attribute.


Another way is to rewrite the `.safetensors` files with the metadata attribute using the script below. Given the way `safetensors` is implemented, a for loop did not work when saving the files (all references to the tensors have to be released first, among other things), so let's simply rewrite the few safetensors files manually, one at a time.


In [70]:
import mlx.core as mx

# use mx.load() to load the safetensors
tensors = mx.load(
    "lora_fused_model/model-00001-of-00004.safetensors",  # Change this path and run the cell, one by one for all .safetensors files
    format="safetensors",
)

# use mx.save_safetensors() to save the safetensors with "format" metadata
mx.save_safetensors(
    "lora_fused_model/model-00001-of-00004.safetensors",
    tensors,
    metadata={"format": "pt"},
)

RuntimeError: [load_safetensors] Failed to open file lora_fused_model/model-00001-of-00004.safetensors
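
As an alternative that avoids loading the tensors at all, the metadata can in principle be patched directly in the file header. The sketch below uses only the standard library and assumes the documented `safetensors` layout: an 8-byte little-endian header length, a JSON header with an optional `__metadata__` map, then the raw tensor bytes, whose offsets are relative to the start of the data section. `add_safetensors_metadata` is a hypothetical helper and has not been tested against the fused model from this notebook, so try it on a copy of the file first.

```python
import json
import os
import shutil
import struct
import tempfile


def add_safetensors_metadata(path: str, metadata: dict) -> None:
    """Rewrite the JSON header of a .safetensors file so that it carries `metadata`."""
    with open(path, "rb") as src:
        # First 8 bytes: header length as an unsigned little-endian 64-bit integer.
        (header_len,) = struct.unpack("<Q", src.read(8))
        header = json.loads(src.read(header_len).decode("utf-8"))

        merged = dict(header.get("__metadata__") or {})
        merged.update(metadata)
        header["__metadata__"] = merged

        new_header = json.dumps(header, separators=(",", ":")).encode("utf-8")
        # Assumption: pad with spaces to a multiple of 8 bytes to keep the data aligned.
        new_header += b" " * (-len(new_header) % 8)

        # Write to a temporary file in the same directory, then swap it in.
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
        with os.fdopen(fd, "wb") as dst:
            dst.write(struct.pack("<Q", len(new_header)))
            dst.write(new_header)
            # Tensor bytes are copied unchanged; their offsets do not depend on the header size.
            shutil.copyfileobj(src, dst)
    os.replace(tmp_path, path)


# Hypothetical path; point it at a .safetensors file that actually exists on disk.
target = "lora_fused_model/model.safetensors"
if os.path.exists(target):
    add_safetensors_metadata(target, {"format": "pt"})
```

Because the tensor data is streamed rather than loaded, this approach does not depend on releasing references to tensors between files.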

### Model Uploading Process

You will need to have a Huggingface write token saved before being able to upload your model. To set an access token, you can:

- Create one here (Make sure you create a "Write" token): https://huggingface.co/settings/tokens
- Install the `huggingface-cli` tool, and run `huggingface-cli login`
- Paste the token when prompted.


Now let's upload the updated safetensors files to Huggingface. This takes a while to finish...


In [41]:
!huggingface-cli upload alexweberk/gemma-7b-it-trismegistus ./lora_fused_model .

Here's the uploaded model:
https://huggingface.co/alexweberk/gemma-7b-it-trismegistus


## Loading the Fused Model and Running Inference

Let's try loading the local fused model and running inference with it.


In [14]:
from mlx_lm import generate, load

fused_model, fused_tokenizer = load("./lora_fused_model/")

In [15]:
response = generate(
    fused_model,
    fused_tokenizer,
    prompt=format_prompt(system_prompt, question),
    verbose=True,  # Set to True to see the prompt and response
    temp=0.0,
    max_tokens=512,
)

Prompt: <bos><start_of_turn>user
## Instructions
You are a master in the field of the esoteric, occult, Reappropriated Goddess and Education. You are a writer of tests, challenges, books and deep knowledge on Reappropriated Goddess for initiates and students to gain deep insights and understanding from. You write answers to questions posed in long, explanatory ways and always explain the full context of your answer (i.e., related concepts, formulas, examples, or history), as well as the step-by-step thinking process you take to answer the challenges. Be rigorous and thorough, and summarize the key themes, ideas, and conclusions at the end.
## User
Why is the sky blue?<end_of_turn>
<start_of_turn>model

The question of "why the sky is blue" is a multifaceted one, and the answer will depend on the specific context in which the question is posed. In the context of the esoteric, occult, and Reappropriated Goddess, the question can be interpreted in a number of ways.

In the first instance,

It looks like the fused model was able to generate a response.

Generation with the fused model is significantly faster than generation with the unfused LoRA weights:

- LoRA Generation: 6.002 tokens-per-sec
- Fused Model Generation: 17.849 tokens-per-sec
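
As a quick worked check of the two figures above, the ratio can be computed directly. Both numbers come from a single run on one machine, so the result is only indicative.

```python
lora_tps = 6.002    # tokens-per-sec with unfused LoRA weights (from the run above)
fused_tps = 17.849  # tokens-per-sec with the fused model (from the run above)

speedup = fused_tps / lora_tps
print(f"Fused model is about {speedup:.2f}x faster")
```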


## Loading the Fused Model from Huggingface

Let's download the uploaded model from Huggingface and run inference with it, just to make sure it uploaded correctly.


In [16]:
from mlx_lm import generate, load

model_, tokenizer_ = load("alexweberk/gemma-7b-it-trismegistus")
response = generate(
    model_,
    tokenizer_,
    prompt=format_prompt(system_prompt, question),
    verbose=True,  # Set to True to see the prompt and response
    temp=0.0,
    max_tokens=512,
)

Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

Prompt: <bos><start_of_turn>user
## Instructions
You are a master in the field of the esoteric, occult, Reappropriated Goddess and Education. You are a writer of tests, challenges, books and deep knowledge on Reappropriated Goddess for initiates and students to gain deep insights and understanding from. You write answers to questions posed in long, explanatory ways and always explain the full context of your answer (i.e., related concepts, formulas, examples, or history), as well as the step-by-step thinking process you take to answer the challenges. Be rigorous and thorough, and summarize the key themes, ideas, and conclusions at the end.
## User
Why is the sky blue?<end_of_turn>
<start_of_turn>model

The question of "Why is the sky blue?" is not related to the topic of Reappropriated Goddess. It is a question of science and physics. The answer to this question involves the scientific process of scattering of light.

The sky appears blue because of a phenomenon called Rayleigh scatter

We can also run it using `transformers` directly, although without the benefit of utilizing MLX/Apple Silicon to the fullest.


In [17]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# repo_id = "google/gemma-7b-it"
repo_id = "alexweberk/gemma-7b-it-trismegistus"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
model.to("mps")

input_text = format_prompt(system_prompt, question)
input_ids = tokenizer(input_text, return_tensors="pt").to("mps")

outputs = model.generate(
    **input_ids,
    max_new_tokens=256,
)
print(tokenizer.decode(outputs[0]))

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

<bos><bos><start_of_turn>user
## Instructions
You are a master in the field of the esoteric, occult, Reappropriated Goddess and Education. You are a writer of tests, challenges, books and deep knowledge on Reappropriated Goddess for initiates and students to gain deep insights and understanding from. You write answers to questions posed in long, explanatory ways and always explain the full context of your answer (i.e., related concepts, formulas, examples, or history), as well as the step-by-step thinking process you take to answer the challenges. Be rigorous and thorough, and summarize the key themes, ideas, and conclusions at the end.
## User
Why is the sky blue?<end_of_turn>
<start_of_turn>model
The question of "why is the sky blue?" is a complex one that requires a multifaceted answer. To fully understand this question, we must first delve into the scientific, philosophical, and esoteric aspects of the topic.

Scientifically, the sky appears blue due to a phenomenon called scatteri

## Conclusion

Hope this was helpful!
If you liked this content, please [follow me on Twitter(X)](https://twitter.com/morningcoder).

Notebook Gist: https://gist.github.com/alexweberk/635431b5c5773efd6d1755801020429f
