To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth?tab=readme-ov-file#-installation-instructions).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (e.g. for llama.cpp).

[NEW] Llama-3.1 8b, 70b & 405b are trained on a crazy 15 trillion tokens with 128K long context lengths!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

In [1]:
%%capture
!pip install unsloth "xformers==0.0.28.post2" #"transformers==4.45.2"
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [2]:
# One must patch the DPO Trainer first!
from unsloth import PatchDPOTrainer
PatchDPOTrainer()

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [3]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 1024 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "ihughes15234/phi_3_5_mini_tictactoe1200",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2024.11.6: Fast Llama patching. Transformers = 4.46.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.11.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
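
As a rough check on how many weights LoRA adds, the count can be worked out by hand: an adapted linear layer with input size `d_in` and output size `d_out` gains `r * (d_in + d_out)` parameters. The sketch below assumes a hidden size of 3072, an MLP intermediate size of 8192, and key/value projections as wide as the query projection; none of these dimensions are printed in this notebook, so treat them as assumptions. Combined with the 32 layers and `r = 64` shown above, they reproduce the trainable-parameter count that Unsloth reports when training starts.

```python
# Sketch: count LoRA parameters for r = 64 over the seven target modules.
# HIDDEN and INTERMEDIATE are assumed dimensions, not taken from the notebook.
R = 64
NUM_LAYERS = 32        # from the "patched 32 layers" message
HIDDEN = 3072          # assumption
INTERMEDIATE = 8192    # assumption

# (input size, output size) of each adapted projection in one layer
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),   # assumes no grouped-query attention
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in, d_out, r=R):
    # LoRA adds two matrices: A (r x d_in) and B (d_out x r).
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out) for d_in, d_out in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 119,537,664
```

Doubling `r` doubles this count, which is why the rank is the main knob for trading adapter capacity against memory.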


<a name="Data"></a>
### Data Prep


In [5]:
!pip install huggingface_hub
!pip install wandb



In [6]:
from google.colab import userdata

from huggingface_hub import login

hf_token = userdata.get('HF_HUB_TOKEN')  # Hugging Face access token stored in Colab secrets
login(hf_token)

In [7]:
# Load the DPO preference dataset and a tokenizer from the Hugging Face Hub.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("ihughes15234/ttt_dpo_phi_v2_all_other", split = "train")

# This replaces the tokenizer loaded with the model above
# by the Phi-3.5-mini-instruct tokenizer.
tokenizer_id = 'microsoft/Phi-3.5-mini-instruct'
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)

# Use the unk token for padding rather than the eos token to prevent endless generation.
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
# Pad on the right (end) of sequences shorter than the longest one in the batch.
# Setting this explicitly also avoids padding-side warnings.
tokenizer.padding_side = 'right'
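
To make the effect of `padding_side = 'right'` concrete, here is a toy sketch of right padding on plain Python lists. The token ids and `pad_id = 0` are made up for illustration, and `pad_right` is a hypothetical helper, not the tokenizer's own implementation.

```python
def pad_right(batch, pad_id):
    """Pad every sequence to the longest one by appending pad_id."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask

ids, mask = pad_right([[5, 6, 7], [8]], pad_id=0)
print(ids)   # [[5, 6, 7], [8, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

Keeping the pad token distinct from the eos token lets the attention mask be inferred unambiguously from the ids.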


In [None]:
# dataset_chatml = dataset.train_test_split(test_size=0.05, seed=1234)

In [8]:
import pprint
row = dataset[8]
pprint.pprint(row["prompt"])
pprint.pprint(row["chosen"])
pprint.pprint(row["rejected"])

('<|system|>\n'
 'You are a powerful gaming agent who can make proper decisions to beat the '
 'user in gaming tasks. You are a helpful assistant that strictly follows the '
 "user's instructions.<|end|>\n"
 '<|user|>\n'
 'Tic Tac Toe is a two-player game played on a grid. Players take turns '
 "marking a space with their respective symbols. The goal is to get 3 of one's "
 'own symbols in a row, either horizontally, vertically, or diagonally, before '
 'the opponent does. If all nine squares are filled and no player has three in '
 'a row, the game is a draw. The Tic Tac Toe game is played on a 3 by 3 grid, '
 'with the winning length as 3.\n'
 'Each move is represented by a string consisting of two parts: the column (C) '
 'and the row (R), in that order. For instance, C1R2 means the movement at the '
 'position of the first column and the second row of the grid. You are playing '
 'this game with the user (opponent).\n'
 'Your opponent has finished actions: <C1R2>, <C2R2>, <C2R3>. Y
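
Each row holds three text fields, `prompt`, `chosen` and `rejected`, and the prompt uses `<|system|>`, `<|user|>` and `<|end|>` markers. The sketch below builds a prompt in that style and checks that a record has the three fields. The `<|end|>` after the user turn and the closing `<|assistant|>` tag are assumptions (the printed sample is cut off before that point), and the helper names and example moves are hypothetical.

```python
def build_phi_prompt(system, user):
    # The system and user markers follow the printed sample; the "<|end|>" after
    # the user text and the trailing "<|assistant|>\n" tag are assumptions.
    return (
        f"<|system|>\n{system}<|end|>\n"
        f"<|user|>\n{user}<|end|>\n"
        "<|assistant|>\n"
    )

def is_preference_record(row):
    # A DPO record needs a non-empty string for each of the three fields.
    keys = ("prompt", "chosen", "rejected")
    return all(isinstance(row.get(k), str) and row[k] for k in keys)

example = {
    "prompt": build_phi_prompt("You are a helpful assistant.", "Pick a move."),
    "chosen": "<C1R1>",     # hypothetical preferred answer
    "rejected": "<C2R2>",   # hypothetical dispreferred answer
}
print(is_preference_record(example))  # True
```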

In [9]:
import wandb

# Log in to Weights & Biases to track and visualize the run.
# If you are not already logged in, this prompts for an API key.
wandb.login()

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `DPOTrainer`! More docs here: [TRL DPO docs](https://huggingface.co/docs/trl/dpo_trainer). This run trains for 5 full epochs (`num_train_epochs = 5`); for a quick test, set `max_steps` to a small number such as 60 instead.

In [10]:
# One must patch the DPO Trainer first!
from unsloth import PatchDPOTrainer
PatchDPOTrainer()

In [11]:
from transformers import TrainingArguments
from trl import DPOTrainer, DPOConfig
from unsloth import is_bfloat16_supported

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = DPOConfig(
        per_device_train_batch_size = 16,
        per_device_eval_batch_size=8,
        gradient_accumulation_steps = 2,
        warmup_ratio = 0.1,
        num_train_epochs = 5,
        learning_rate = 5e-6,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.0,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = dataset,
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 1024,
)



In [12]:
# Initialize a Weights & Biases (wandb) run.

# Name of the project in Weights & Biases.
project_name = "phi35-tictactoedpo5_v7"

# 'project' groups runs in the dashboard; 'name' labels this run.
# Here the run name reuses the project name.
wandb.init(project=project_name, name = "phi35-tictactoedpo5_v7")

wandb: Currently logged in as: ihughes15234 (ihughes). Use `wandb login --relogin` to force relogin


In [13]:
dpo_trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 9,888 | Num Epochs = 5
O^O/ \_/ \    Batch size per device = 16 | Gradient Accumulation steps = 2
\        /    Total batch size = 32 | Total steps = 1,545
 "-____-"     Number of trainable parameters = 119,537,664
Could not estimate the number of tokens of the input, floating-point operations will not be computed
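
The totals in this banner follow from the `DPOConfig` values. Here is a minimal sketch of the arithmetic; the exact rounding inside the trainer may differ when the dataset size is not a multiple of the batch size, and rounding the warmup up is an assumption about how `warmup_ratio` is applied.

```python
import math

num_examples = 9888
per_device_batch = 16
grad_accum = 2
num_gpus = 1
epochs = 5
warmup_ratio = 0.1

# Examples consumed per optimizer update.
effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs
# Assumption: warmup steps are the ratio times total steps, rounded up.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
# 32 309 1545 155
```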


| Step | Training Loss | rewards / chosen | rewards / rejected | rewards / accuracies | rewards / margins | logps / rejected | logps / chosen | logits / rejected | logits / chosen |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.6931 | 0.0 | 0.0 | 0.0 | 0.0 | -54.312195 | -44.562912 | -2.024329 | -2.715652 |
| 2 | 0.6931 | 0.0 | 0.0 | 0.0 | 0.0 | -53.72163 | -45.158916 | -1.734022 | -2.415032 |
| 3 | 0.7027 | -0.021115 | -0.00421 | 0.40625 | -0.016905 | -53.465012 | -45.655479 | -2.038172 | -2.661421 |
| 4 | 0.7008 | -0.006035 | 0.006858 | 0.4375 | -0.012893 | -53.375862 | -45.816101 | -2.450992 | -2.925493 |
| 5 | 0.6759 | 0.023447 | -0.016314 | 0.625 | 0.039761 | -53.915443 | -44.685459 | -2.375151 | -2.777798 |
| 6 | 0.6939 | -0.012555 | -0.013236 | 0.5625 | 0.000681 | -52.615486 | -43.966095 | -2.113403 | -2.663145 |
| 7 | 0.691 | 0.012441 | 0.006363 | 0.5625 | 0.006078 | -56.900707 | -43.209927 | -1.567185 | -2.466465 |
| 8 | 0.694 | -0.006086 | -0.00719 | 0.5625 | 0.001104 | -53.501907 | -46.195541 | -2.261281 | -2.739496 |
| 9 | 0.7076 | 0.016476 | 0.042696 | 0.375 | -0.02622 | -49.584267 | -44.285706 | -2.221039 | -2.662038 |
| 10 | 0.7084 | -0.028486 | -0.000733 | 0.375 | -0.027754 | -56.531208 | -43.485756 | -1.860611 | -2.645531 |


TrainOutput(global_step=1545, training_loss=0.18941204969456243, metrics={'train_runtime': 8726.803, 'train_samples_per_second': 5.665, 'train_steps_per_second': 0.177, 'total_flos': 0.0, 'train_loss': 0.18941204969456243, 'epoch': 5.0})
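
The loss of 0.6931 at the first steps is what the DPO objective gives when the policy and the reference model agree. The sketch below assumes the standard sigmoid DPO loss with `beta = 0.1`; `dpo_loss` is a hypothetical helper for one example, not TRL's implementation, and the log-probabilities are the step 1 values from the metrics above, used for both policy and reference because the rewards there are 0.0.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards: scaled log-probability ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, chosen_reward, rejected_reward

# Policy equals reference, so both rewards and the margin are zero.
loss, chosen_reward, rejected_reward = dpo_loss(
    -44.562912, -54.312195, -44.562912, -54.312195
)
print(round(loss, 4))  # 0.6931, i.e. ln(2)
```

As training widens the margin between chosen and rejected rewards, this value falls below ln(2), which is consistent with the lower average `training_loss` in the final `TrainOutput`.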

<a name="Inference"></a>
### Inference
Let's run the model! You can change the prompt in `text` or `text2` below.


In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

text = "Connect 4 is a two-player connection board game, where the players choose a color and then take turns dropping colored discs into a vertically suspended grid. The pieces fall straight down, occupying the next available space within the column. The objective of the game is to be the first to form a horizontal, vertical, or diagonal line of four of one's own discs. You are a gaming agent that aims to beat me in Connect 4 games. Each move is represented by a string consisting of two parts: the column (C) and the row (R), in that order. For instance, C1 means the first column. Your opponent has finished moves: <C6>,<C6>,<C1>,<C1>,<C3>,<C5>,<C4>,<C4> You have finished moves: <C4>,<C4>,<C6>,<C1>,<C7>,<C3>,<C3>,<C6> Currently, the legal positions are <C1>,<C2>,<C3>,<C4>,<C5>,<C6>,<C7> You must choose an legal action to set up advantages. \n \n Your output must be in the following format: \n \n Action: \n Your action wrapped with <>, <Cx>, e.g., <C1>, <C7> \n \n Please return your answer without explanation!"
text2 = "Breakthrough is a two-player game played on a rectangular board. Players take turns moving their pieces, which can move one space straight or diagonally forward if the target square is empty. A piece can also move diagonally forward to capture an opponent's piece. Capturing is optional, and a player can only capture one piece per turn. The goal is to be the first to reach the opponent's home row, the farthest row from the player. If all of a player's pieces are captured, they lose. The game does not allow draws, as pieces can only move forward or be captured.The Breakthrough board is identified by columns labeled start from A (from left to right) and rows numbered 1 to 8 (from bottom to top). The intersection of a column and a row specifies a unique square on the board.\n\nThe board now looks like :\n['8bbb', '7b..', '6...', '5...', '4bwb', '3.w.', '2w..', '1ww.'] \nAmong which, the letter 'b' represents black piece, while the letter 'w' represents white piece.\n And the character '.' represents vacant space.\n And the numbers in the board are the indexes of the rows.\nYour opponent has finished actions: <c2->b3> and <b3->a4> and <c1->c2> and <b2->a3> and <c2->b3> and <a3->b4>. You have finished actions: <b7->c6>, <c7->b6>, <b6->c5>, <c6->b5>, <b5->a4*>, <c5->c4>.\n\nCurrently, the legal actions are: <a8->b7> or <b8->b7> or <b8->c7> or <c8->b7> or <c8->c7> or <a7->a6> or <a7->b6> or <a4->a3> or <a4->b3*> or <c4->b3*> or <c4->c3>.\nYou must choose an legal action to set up advantages.\n\nYour output must be in the following format:\n\nAction:\nYour action wrapped with <>, <[a-c][1-8]->[a-c][1-8]>, e.g., <a7->a6>\n\nPlease return your answer without explanation!"
messages = [
    {"from": "human", "value": text2},# dataset_chatml['train'][63]["text"]
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 128, use_cache = True)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


["<s> [INST] Breakthrough is a two-player game played on a rectangular board. Players take turns moving their pieces, which can move one space straight or diagonally forward if the target square is empty. A piece can also move diagonally forward to capture an opponent's piece. Capturing is optional, and a player can only capture one piece per turn. The goal is to be the first to reach the opponent's home row, the farthest row from the player. If all of a player's pieces are captured, they lose. The game does not allow draws, as pieces can only move forward or be captured.The Breakthrough board is identified by columns labeled start from A (from left to right) and rows numbered 1 to 8 (from bottom to top). The intersection of a column and a row specifies a unique square on the board.\n\nThe board now looks like :\n['8bbb', '7b..', '6...', '5...', '4bwb', '3.w.', '2w..', '1ww.'] \nAmong which, the letter 'b' represents black piece, while the letter 'w' represents white piece.\n And the c

<a name="Save"></a>
### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can create a personal access token at https://huggingface.co/settings/tokens.
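
Conceptually, merging folds each adapter back into its base weight as `W + (alpha / r) * B @ A`. The toy sketch below shows that formula on small Python lists. It illustrates the standard LoRA merge only, not Unsloth's actual save path, and the matrices and helper names are made up. With `lora_alpha = 64` and `r = 64` as configured above, the scale factor is 1.

```python
def matmul(a, b):
    # Plain-Python matrix product of a (m x k) and b (k x n).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A for W (d_out x d_in), A (r x d_in), B (d_out x r)."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy sizes: d_out = 2, d_in = 3, r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5], [-1.0]]
merged = merge_lora(W, A, B, alpha=1, r=1)
print(merged)  # [[1.5, 1.0, 1.5], [-1.0, -1.0, -3.0]]
```

After merging, the model has the same shape as the base model, so it can be served without any adapter support.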

In [14]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if True: model.push_to_hub_merged("phi35_tictactoe_dpo5epoch_v7", tokenizer, save_method = "merged_16bit")#, token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 7.6G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 49.57 out of 83.48 RAM for saving.


100%|██████████| 32/32 [00:00<00:00, 71.18it/s]


Unsloth: Saving to organization with address ihughes15234/phi35_tictactoe_dpo5epoch_v7
Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving to organization with address ihughes15234/phi35_tictactoe_dpo5epoch_v7
Unsloth: Uploading all files... Please wait...


Done.
Saved merged model to https://huggingface.co/None/phi35_tictactoe_dpo5epoch_v7
