# Text Completion / Raw Text Training
### This is a notebook by [Mark Lord](https://twitter.com/priontific), lead designer / dev of [ValleyDAO](https://valleydao.bio)'s Curriculum Protocol.
#### It borrows heavily from the Unsloth team's [fantastic Colab notebooks](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing#scrollTo=IqM-T1RTzY6C)!

#### This notebook trains an LLM with the default loss function of the MLX Lora.py script. That means you can fine-tune on any raw-text dataset and have your model act as a text completion model, for example for novel writing.
----
#### We'll be training on [Tiny Stories](https://huggingface.co/datasets/roneneldan/TinyStories) which is a collection of small stories. For example:

`Once upon a time, there was a little car named Beep. Beep loved to go fast and play in the sun.
Beep was a healthy car because he always had good fuel....`

#### To run this Notebook, first make sure you've installed everything in the requirements.txt file (recommended that you use a virtualenv), then press "Run" in the top left and select "Run All Cells".

In [26]:
# Remember to install all requirements. Strongly recommended that you set up a virtual environment!

## Set your model here 👇

#### (I believe there's some way to download pre-quantised models to save on internet bandwidth but honestly I never got that code to work properly.)

In [27]:
# Set the Hugging Face repo path of the model (despite the variable name, this is not a GitHub path). If you've already got a model on disk, you'll have to edit this notebook to point to that instead.

github_path = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
print(f"\nModel set to {github_path}.\n")


Model set to TinyLlama/TinyLlama-1.1B-Chat-v1.0.



## Now download it and convert it to 4-bit.
#### (Unless you successfully downloaded a pre-quantised model, in which case **you do not need to run this step.**)

In [28]:
# Convert a model of choice. The -q flag quantizes the model to 4-bit by default.

!python3 -m mlx_lm.convert \
--hf-path {github_path} \
--mlx-path {github_path} \
-q

print(f"\nModel converted.\n")

[INFO] Loading
Fetching 8 files: 100%|████████████████████████| 8/8 [00:00<00:00, 18236.10it/s]
[INFO] Quantizing

Model converted.



### By default, MLX determines how many entries are seen using iters and batches rather than epochs. 

#### The code for this is extremely simple, and you can run it just using the configuration in the code block below.

#### Training in epochs, however, is more desirable for me, as I'd prefer every example to be seen as many times as every other, so the block below is disabled. If you would rather use iters and batch size, uncomment the block below and use it in place of the later epoch-based training code.
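
To make the relationship between iters, batch size and epochs concrete, here is a minimal sketch of the conversion. The helper name `iters_for_epochs` is hypothetical (it is not part of MLX), and the sketch assumes one iteration consumes exactly one batch, rounding up so that a partial final batch still counts as an iteration.

```python
import math


def iters_for_epochs(num_examples: int, batch_size: int, epochs: int = 1) -> int:
    """Return how many iterations are needed to see every example `epochs` times.

    Hypothetical helper: assumes one iteration processes one batch.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Round up so a partial final batch is still counted as one iteration.
    iters_per_epoch = math.ceil(num_examples / batch_size)
    return iters_per_epoch * epochs


print(iters_for_epochs(5000, 1))    # 5000
print(iters_for_epochs(500, 4, 2))  # 250
print(iters_for_epochs(10, 3))      # 4
```

With a batch size of 1, the number of iters per epoch is simply the number of entries in the training file.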

----
# Set training parameters here 👇
#### The TinyStories dataset is about a 1GB download; depending on your internet speed, you might be here a while.

In [29]:
import os
import json
from datasets import load_dataset

dataset = load_dataset("roneneldan/TinyStories", split = "train[:500]")
valid_dataset = load_dataset("roneneldan/TinyStories", split="train[500:600]")  # 100 stories directly after the training slice, so none are skipped or shared

print(f"\nDatasets loaded.\n")

Repo card metadata block was not found. Setting CardData to empty.
Repo card metadata block was not found. Setting CardData to empty.



Datasets loaded.



In [30]:
import os
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(github_path)
EOS_TOKEN = tokenizer.eos_token

# Set the path for your modified training file
data_dir = './data'  # Directory for the modified training file
if not os.path.exists(data_dir):
    os.makedirs(data_dir)  # Create the data directory if it doesn't exist

modified_train_file = f'{data_dir}/train.jsonl'
valid_file = f'{data_dir}/valid.jsonl'

# Open a new file and write each story with the EOS token appended
with open(modified_train_file, 'w') as file:
    for item in dataset:
        story_with_eos = item['text'] + EOS_TOKEN  # Append the EOS token to each story
        # Write this story as a JSON object to the file
        file.write(json.dumps({"text": story_with_eos}) + '\n')

# Same but for the validation file (written from the held-out validation split), though not strictly necessary
with open(valid_file, 'w') as file:
    for item in valid_dataset:
        story_with_eos = item['text'] + EOS_TOKEN  # Append the EOS token to each story
        # Write this story as a JSON object to the file
        file.write(json.dumps({"text": story_with_eos}) + '\n')

print(f"\nEOS token added to each entry.\n")


EOS token added to each entry.
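
As a quick sanity check on the files written above, the sketch below counts JSONL entries whose `text` field does not end with the EOS token. It is only an illustration: `count_missing_eos` is a hypothetical helper, the EOS token is assumed to be `</s>` (in the notebook you would pass `EOS_TOKEN` instead), and it runs on a throwaway temporary file rather than `./data/train.jsonl`.

```python
import json
import os
import tempfile


def count_missing_eos(jsonl_path: str, eos_token: str) -> int:
    """Count entries whose "text" field does not end with the EOS token."""
    missing = 0
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # Ignore blank lines
            entry = json.loads(line)
            if not entry.get("text", "").endswith(eos_token):
                missing += 1
    return missing


eos = "</s>"  # Assumption: replace with EOS_TOKEN from the tokenizer in the notebook

# Demo on a temporary file with one good entry and one entry missing the EOS token
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "train.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"text": "Once upon a time." + eos}) + "\n")
        f.write(json.dumps({"text": "A story without an ending"}) + "\n")
    missing_count = count_missing_eos(demo_path, eos)

print(f"Entries missing the EOS token: {missing_count}")
```

Pointing the same function at `modified_train_file` and `valid_file` should report zero missing entries if the cell above ran as intended.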



In [31]:
print(f"\nBelow are your entries with the EOS token appended.\n")

with open(modified_train_file, 'r') as file:
    for i, line in enumerate(file):
        if i >= 5:  # Only read the first 5 entries
            break
        entry = json.loads(line)  # Parse the JSON string into a Python dictionary
        print("=========================")
        print(entry["text"])  # Print the "text" field of the entry



Below are your entries with the EOS token appended.

One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.</s>
Once upon a time, there was a little car named Beep. Beep loved to go fast and play in the sun. Beep was a healthy car because he always had good fuel. Good fuel made Beep happy and strong.

One day, Beep was driving in the park when he saw

In [32]:
batch_size = 1                    # Change this as needed; a larger batch size takes more memory, and my tests show no increase in speed.
epochs = 1                        # For newbies, this is how many times the model gets to see the dataset. 1 means it goes through once and that's it.
train_set = "./data"              # Change this to point to the folder containing your train.jsonl file.
lora_layers = 22                  # Set the number of layers e.g. 4, 16, 32. Fewer layers = less effect on style, but saves on RAM.
context_length = 1024             # Set the context length during training. More context takes more memory.
learning_rate = 2e-5              # Set the learning rate. Higher e.g. 2e-5 can make your model go a bit nuts. Lower e.g. 2e-6 and it'll learn real slow.


'''MLX doesn't have a native flag for training epochs, so I've implemented some code below to convert iters to epochs.'''

train_file = "./data/train.jsonl"
iters_per_epoch = -(-sum(1 for _ in open(train_file)) // batch_size)
iters = iters_per_epoch * epochs # Swap this out to iters = 100 or whatever you like if you'd prefer to train in iters.

'''The code below counts the number of tokens in your training dataset to estimate how long training will take. It's only a rough estimate.'''
import sentencepiece as spm
import json

# Use the tokenizer to split your dataset into tokens:
model_path = f"{github_path}/tokenizer.model"
sp_processor = spm.SentencePieceProcessor()
sp_processor.load(model_path)

def count_tokens_in_file(sp_processor, file_path):
    total_tokens = 0
    with open(file_path, 'r') as f:
        for line in f:
            entry = json.loads(line)
            text = entry.get('text', '')  # Make sure 'text' is the correct key for your JSONL entries
            tokens = sp_processor.encode_as_pieces(text)
            total_tokens += len(tokens)
    return total_tokens

# Count the tokens in the training file, scaled by the number of epochs
total_tokens = count_tokens_in_file(sp_processor, train_file) * epochs
multiplier_for_layers = 1/(1+(((32-lora_layers)/32)*1.5)) # Really approximate maths to get a multiplier based on number of layers.
training_rate = 500 // multiplier_for_layers
estimated_total_time = int(total_tokens // training_rate)
estimated_minutes = int(estimated_total_time // 60)
estimated_seconds = int(estimated_total_time % 60)
slow_time = estimated_total_time * 7
slow_minutes = int(slow_time // 60)
slow_seconds = int(slow_time % 60)

print(f"\nAutomatically detected {iters_per_epoch} data entries.")
print(f"For {epochs} epoch(s) with a batch size of {batch_size}, we will set iters to: {iters}")
print(f"Total number of tokens in the JSONL file: {total_tokens}")
print(f"Estimated training rate in tokens/second if fits in GPU: {training_rate}")
print(f"\nIf model fits in GPU: Estimated time for {epochs} epoch(s) with {lora_layers} LoRA layer(s) with a token amount of {total_tokens}: \n{estimated_minutes} minutes and {estimated_seconds} seconds")
print(f"\nElse if model doesn't fit in GPU, could be up to:\n{slow_minutes} minutes and {slow_seconds} seconds\n")


Automatically detected 5000 data entries.
For 1 epoch(s) with a batch size of 1, we will set iters to: 5000
Total number of tokens in the JSONL file: 1126679
Estimated training rate in tokens/second if fits in GPU: 734.0

If model fits in GPU: Estimated time for 1 epoch(s) with 22 LoRA layer(s) with a token amount of 1126679: 
25 minutes and 34 seconds

Else if model doesn't fit in GPU, could be up to:
178 minutes and 58 seconds
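
The arithmetic behind the estimate above can be reproduced by hand. The sketch below reuses the constants from the cell (a 500 tokens/second baseline, a 32-layer reference and a 1.5 weighting), which are rough heuristics from this notebook rather than measured values. The function name `estimate_seconds` is hypothetical.

```python
def estimate_seconds(total_tokens: int, lora_layers: int) -> tuple[float, int]:
    """Reproduce the notebook's rough training-time heuristic.

    Returns (estimated tokens per second, estimated total seconds).
    """
    # Same heuristic as the cell above: the multiplier shrinks as fewer layers are tuned
    multiplier = 1 / (1 + ((32 - lora_layers) / 32) * 1.5)
    training_rate = 500 // multiplier  # Floor division, as in the cell above
    seconds = int(total_tokens // training_rate)
    return training_rate, seconds


rate, seconds = estimate_seconds(1126679, 22)
print(rate)                     # 734.0
print(divmod(seconds, 60))      # (25, 34)
print(divmod(seconds * 7, 60))  # (178, 58)
```

With 22 LoRA layers the multiplier is 1 / 1.46875, so the rate becomes 500 × 1.46875 = 734.375, floored to 734 tokens/second. Dividing 1,126,679 tokens by that rate gives 1,534 seconds, i.e. 25 minutes and 34 seconds, and the slow case is simply 7 times that.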



#### Remember, the (very roughly) estimated time 👆 is for an M1 Max - you'll have to scale this based on the processor you have.
-----
#### To work out how long it'll take your machine, you can apply a rough calculation:

##### M(1/2/3) Pro: 2x as long (double the estimated time)
##### M(1/2/3) Base: 4x as long (quadruple the estimated time)
##### M(1/2/3) Ultra: 1/2 as long (halve the estimated time)
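
The scaling rules above can be applied with a small lookup table. This is only a sketch: the `SCALE_BY_CHIP` table and `scale_estimate` helper are hypothetical names, and the factors are the rough ones listed above with the M1 Max as the 1x baseline.

```python
# Rough scale factors from the list above, relative to an M1 Max
SCALE_BY_CHIP = {
    "max": 1.0,    # Baseline for the estimate
    "pro": 2.0,    # Twice as long
    "base": 4.0,   # Four times as long
    "ultra": 0.5,  # Half as long
}


def scale_estimate(seconds_on_m1_max: float, chip: str) -> int:
    """Scale an M1 Max time estimate (in seconds) to another chip tier."""
    return round(seconds_on_m1_max * SCALE_BY_CHIP[chip.lower()])


# 1534 seconds is the "25 minutes and 34 seconds" estimate from the run above
for chip in ("max", "pro", "base", "ultra"):
    minutes, secs = divmod(scale_estimate(1534, chip), 60)
    print(f"{chip}: {minutes} minutes and {secs} seconds")
```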


### The cells below 👇 will train and produce your LoRA adapters as an .npz file (named `trial1.npz` in the next cell).

##### Make sure that your data is present as a train.jsonl file in the data folder of this repo. The converted model will automatically be in the local folder named by `github_path` (the `--mlx-path` used earlier) if you downloaded it using the code at the beginning.

###### (The training cell saves a checkpoint every 100 iterations via the '--save-every 100' flag; change the number to adjust this.)

In [33]:
adapters = "trial1.npz"
print(f"\nAdapter file set to {adapters}.\n")


Adapter file set to trial1.npz.



In [34]:
# Run this cell to delete your adapters and checkpoints, e.g. if your model generation has gone haywire.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # or "true" if you want to enforce parallelism

!rm -f {adapters}
!rm -rf checkpoints  # Be careful: This will delete the directory and all its contents!
!mkdir checkpoints
print(f"\nAny previous adapter file {adapters} cleared (and all checkpoints deleted).\n")


Any previous adapter file trial1.npz cleared (and all checkpoints deleted).



In [35]:
%%time
!python3 -m mlx_lm.lora \
--train \
--model {github_path} \
--data {train_set} \
--batch-size {batch_size} \
--lora-layers {lora_layers} \
--iters {iters} \
--max-seq-length {context_length} \
--learning-rate {learning_rate} \
--adapter-file {adapters} \
--save-every 100

print(f"\nModel training complete.\n")
print(f"\nNote: if training ended instantly with a strange error, it did not actually run.\n")

Loading pretrained model
Total parameters 228.383M
Trainable parameters 1.126M
Loading datasets
Training
Starting training..., iters: 5000
Iter 1: Val loss 1.677, Val took 3.820s
Iter 10: Train loss 1.566, Learning Rate 2.000e-05, It/sec 4.018, Tokens/sec 817.344, Trained Tokens 2034
Iter 20: Train loss 1.569, Learning Rate 2.000e-05, It/sec 3.488, Tokens/sec 732.439, Trained Tokens 4134
Iter 30: Train loss 1.611, Learning Rate 2.000e-05, It/sec 3.616, Tokens/sec 726.731, Trained Tokens 6144
Iter 40: Train loss 1.494, Learning Rate 2.000e-05, It/sec 3.445, Tokens/sec 732.391, Trained Tokens 8270
Iter 50: Train loss 1.457, Learning Rate 2.000e-05, It/sec 3.781, Tokens/sec 717.663, Trained Tokens 10168
Iter 60: Train loss 1.504, Learning Rate 2.000e-05, It/sec 3.139, Tokens/sec 753.024, Trained Tokens 12567
Iter 70: Train loss 1.442, Learning Rate 2.000e-05, It/sec 2.806, Tokens/sec 755.340, Trained Tokens 15259
Iter 80: Train loss 1.474, Learning Rate 2.000e-05, It/sec 3.474, Tokens/sec
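
If you save the training output to a string or file, the loss and throughput can be pulled out for a quick look at how training is going. The sketch below is a minimal parser that assumes the exact line format shown in the output above; other versions of `mlx_lm` may print a different format. `parse_train_log` is a hypothetical helper name.

```python
import re

# Matches lines such as:
# "Iter 10: Train loss 1.566, Learning Rate 2.000e-05, It/sec 4.018, Tokens/sec 817.344, Trained Tokens 2034"
TRAIN_LINE = re.compile(r"Iter (\d+): Train loss ([\d.]+), .*?Tokens/sec ([\d.]+)")


def parse_train_log(log_text: str) -> list[tuple[int, float, float]]:
    """Return (iteration, train loss, tokens per second) for each training line."""
    rows = []
    for line in log_text.splitlines():
        match = TRAIN_LINE.search(line)
        if match:  # Validation lines and truncated lines are skipped
            rows.append((int(match.group(1)), float(match.group(2)), float(match.group(3))))
    return rows


sample_log = """\
Iter 1: Val loss 1.677, Val took 3.820s
Iter 10: Train loss 1.566, Learning Rate 2.000e-05, It/sec 4.018, Tokens/sec 817.344, Trained Tokens 2034
Iter 20: Train loss 1.569, Learning Rate 2.000e-05, It/sec 3.488, Tokens/sec 732.439, Trained Tokens 4134
"""

rows = parse_train_log(sample_log)
for iteration, loss, tokens_per_sec in rows:
    print(f"iter {iteration}: loss {loss}, {tokens_per_sec} tokens/sec")
```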

### Congratulations! You've trained a model! 
#### Well done! 🎉 
#### You'll now have an adapter file (`trial1.npz` with the settings above) in the directory of this notebook. We can now apply this to the model in order to generate some text in the style of our training data.

In [36]:
# Test the model's generation.

from mlx_lm import load, generate

prompt = "Below is a story about star wars\n"

print(f"\n**Without your trained adapters:**")
model, tokenizer = load(github_path)
response = generate(model, tokenizer, prompt= prompt, verbose=True)

print(f"\n**And now with your trained adapters:**")
model, tokenizer = load(github_path, adapter_file=adapters)
response = generate(model, tokenizer, prompt= prompt, verbose=True)


**Without your trained adapters:**
Prompt: Below is a story about star wars

Star Wars is a series of films that have been released by Lucasfilm Ltd. The first film was released in 1977 and the most recent one was released in 2015. The series has been a huge success and has been a major part of pop culture for over 40 years. The series has been adapted into various forms of media, including books, comics, video games, and even a theme park. The series has been a huge influence on pop culture
Prompt: 142.209 tokens-per-sec
Generation: 90.163 tokens-per-sec

**And now with your trained adapters:**


FileNotFoundError: The adapter file does not exist: trial1.npz
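
The `FileNotFoundError` above means no adapter file was on disk when the cell ran, for example if training stopped before the file was written. A small guard before loading can make that failure clearer. This is only a sketch: `adapter_available` is a hypothetical helper, and the demo uses a temporary directory rather than the real `trial1.npz`.

```python
import os
import tempfile


def adapter_available(adapter_file: str) -> bool:
    """Return True if the adapter file exists and is not empty."""
    return os.path.isfile(adapter_file) and os.path.getsize(adapter_file) > 0


# Demo in a temporary directory: check before and after a file is written
with tempfile.TemporaryDirectory() as tmp:
    demo_adapter = os.path.join(tmp, "trial1.npz")
    before = adapter_available(demo_adapter)
    with open(demo_adapter, "wb") as f:
        f.write(b"placeholder bytes, not a real adapter")
    after = adapter_available(demo_adapter)

print(f"Before writing: {before}, after writing: {after}")
```

In the notebook, you could call `adapter_available(adapters)` before the second `load(...)` and re-run the training cell if it returns `False`.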

### If you found this Notebook useful, I'd love to know - feel free to reach out to me on my [Twitter](https://twitter.com/priontific) or come chat with me on the [ValleyDAO](https://valleydao.bio) Discord.
#### As MLX functionality continues to expand (different loss functions to enable RLHF, instruct fine-tuning etc.) I'll make more of these Notebooks.

Well done on fine-tuning TinyLlama! 🎉  