# OxML2024: LLM

In [None]:
from IPython.display import Markdown, HTML
import numpy as np

In [None]:
from google.colab import userdata
openai_key=userdata.get('openai_key')
hf_key=userdata.get('hf_key')

## Install Packages
*   **Unsloth:** A library that makes fine-tuning large language models (LLMs) faster and more memory-efficient.
*   **xFormers:** An open-source library from the FAIR team at Meta AI that provides optimized building blocks for transformer models.
*   **TRL:** A library for training transformer language models with reinforcement learning techniques. It is built on top of the Hugging Face Transformers library and supports tasks including Supervised Fine-Tuning (SFT), Reward Modeling (RM), and Proximal Policy Optimization (PPO), among others.
*   **PEFT:** PEFT offers parameter-efficient methods for fine-tuning large pretrained models. Instead of updating every weight, it trains a small number of prompt parameters or uses a reparametrization method such as low-rank adaptation (LoRA) to reduce the number of trainable parameters.
*   **Accelerate:** A library that lets the same PyTorch code run across any distributed configuration by adding just four lines of code, which simplifies training and inference at scale.
*   **Bitsandbytes:** Bitsandbytes makes large language models accessible through k-bit quantization for PyTorch.



In [None]:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.26" trl peft accelerate bitsandbytes


# Access through API

## ChatGPT

### Install Packages


In [None]:
!pip install openai


### Configuration




*   **temperature:** What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
*   **top_p:** An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.
*   **n:** How many chat completion choices to generate for each input message.
*   **max_tokens:** The maximum number of tokens that can be generated in the chat completion.
*   **logprobs:**  Whether to return log probabilities of the output tokens or not.
*   **top_logprobs:** An integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability.
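
To make `temperature` and `top_p` concrete, here is a minimal sketch on a toy four-token vocabulary. The helper names and the logits are hypothetical, and this only illustrates the general technique; it is not a description of how the API implements sampling internally.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities after dividing by the temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature  # temperature must be > 0 here
    scaled -= scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

def nucleus_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p."""
    order = np.argsort(probs)[::-1]            # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1  # number of tokens kept
    kept = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()           # renormalize over the kept tokens

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
sharp = softmax_with_temperature(toy_logits, temperature=0.2)
flat = softmax_with_temperature(toy_logits, temperature=2.0)
nucleus = nucleus_filter(softmax_with_temperature(toy_logits, 1.0), top_p=0.5)
print(sharp.round(3), flat.round(3), nucleus.round(3))
```

A low temperature concentrates probability on the top token, a high temperature spreads it out, and nucleus filtering discards the low-probability tail before sampling.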

In [None]:
from openai import OpenAI
client = OpenAI(
    api_key=openai_key,
)
model = "gpt-3.5-turbo"
temperature=0.6
top_p=0.5
n=1
max_tokens=512
logprobs=True
top_logprobs=5

### Request

In [None]:

question = 'How many kinds of human beings are there in the history?'
# question = "Who are you?"
response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": question}],
    temperature=temperature,
    top_p=top_p,  # set in the configuration cell; it was defined but never passed
    max_tokens=max_tokens,
    n=n,
    logprobs=logprobs,
    top_logprobs=top_logprobs,
)
display(Markdown(response.choices[0].message.content))

There is only one kind of human beings in history, Homo sapiens. Throughout history, there have been different cultures, ethnicities, and societies, but all of them belong to the same species - Homo sapiens.

Response format:

```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "gpt-3.5-turbo-0125",
  "system_fingerprint": "fp_44709d6fcb",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "\n\nHello there, how may I assist you today?"
    },
    "logprobs": null,
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21
  }
}
```



In [None]:
for content in response.choices[0].logprobs.content:
    # Use a separate name so the integer `top_logprobs` setting is not overwritten.
    candidates = content.top_logprobs
    html_content = ""
    for i, logprob in enumerate(candidates, start=1):
        html_content += (
            f"<span style='color: cyan'>Output token {i}:</span> {logprob.token}, "
            f"<span style='color: darkorange'>logprobs:</span> {logprob.logprob}, "
            f"<span style='color: magenta'>linear probability:</span> {np.round(np.exp(logprob.logprob)*100,2)}%<br>"
        )
    display(HTML(html_content))
    print('\n')
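
The loop above turns each log probability into a linear probability with `exp`. The same values can also be aggregated over a whole completion. The sketch below uses hypothetical log probabilities (not values returned by the API) and a hypothetical helper name to show the joint probability and the perplexity of a short sequence.

```python
import math

def summarize_logprobs(token_logprobs):
    """Return per-token probabilities, the joint probability, and the perplexity."""
    probs = [math.exp(lp) for lp in token_logprobs]
    total_logprob = sum(token_logprobs)            # log of the joint probability
    joint_prob = math.exp(total_logprob)
    perplexity = math.exp(-total_logprob / len(token_logprobs))
    return probs, joint_prob, perplexity

# Hypothetical log probabilities for a three-token completion.
example_logprobs = [math.log(0.5), math.log(0.25), math.log(0.5)]
probs, joint_prob, perplexity = summarize_logprobs(example_logprobs)
print(probs, joint_prob, perplexity)
```

Summing log probabilities is numerically safer than multiplying many small probabilities, which is one reason APIs return values in log space.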

## Hugging Face
We can also use Hugging Face's Inference API to access a variety of open-source LLMs. Here we take Mistral as an example.

### Configuration


*   **top_k** (Default: None). Integer. The number of highest-probability tokens considered when sampling new text.
*   **top_p** (Default: None). Float. Tokens are added to the sampling pool from most probable to least probable until the sum of their probabilities is greater than top_p.
*   **temperature** (Default: 1.0). Float (0.0-100.0). The temperature of the sampling operation. 1 means regular sampling, 0 means always take the highest score, and values approaching 100.0 bring the distribution closer to uniform.
*   **max_new_tokens** (Default: None). Int (0-250). The number of new tokens to generate. This does not include the input length; it is an estimate of the size of the generated text you want. Each new token slows down the request, so balance response time against the length of the generated text.
*   **return_full_text** (Default: True). Bool. If set to False, the returned text will not contain the original query, which makes prompting easier.
*   **num_return_sequences** (Default: 1). Integer. The number of candidate completions to return.
*   **do_sample** (Optional: True). Bool. Whether to use sampling; greedy decoding is used otherwise.
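
As a rough illustration of `top_k` and `do_sample`, the sketch below filters a toy next-token distribution and then picks a token either greedily or by sampling. The function names and probabilities are hypothetical and are not part of the Inference API.

```python
import numpy as np

def top_k_filter(probs, top_k):
    """Zero out everything except the top_k most probable tokens, then renormalize."""
    probs = np.asarray(probs, dtype=float)
    kept = np.argsort(probs)[::-1][:top_k]     # indices of the top_k tokens
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

def pick_token(probs, do_sample, rng):
    """Greedy decoding takes the argmax; sampling draws from the distribution."""
    if not do_sample:
        return int(np.argmax(probs))
    return int(rng.choice(len(probs), p=probs))

toy_probs = [0.5, 0.3, 0.15, 0.05]             # hypothetical next-token distribution
filtered = top_k_filter(toy_probs, top_k=2)
rng = np.random.default_rng(0)
greedy_token = pick_token(filtered, do_sample=False, rng=rng)
sampled_token = pick_token(filtered, do_sample=True, rng=rng)
print(filtered, greedy_token, sampled_token)
```

With `do_sample=False` the result is deterministic, while sampling can return any token that survived the filter.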



In [None]:
API_URL = "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.2"
headers = {"Authorization": f"Bearer {hf_key}"}
temperature=0.6
max_new_tokens=512
return_full_text=False
num_return_sequences=1

In [None]:
import requests

user_input = "What is your favorite condiment?"

payload = {
    "inputs": user_input,
    "parameters": {
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "return_full_text": return_full_text,
        "num_return_sequences": num_return_sequences,
    },
}

def query(payload):
    """POST the payload to the Inference API and return the parsed JSON."""
    response = requests.post(API_URL, headers=headers, json=payload)
    response.raise_for_status()  # fail early on HTTP errors instead of on output[0]
    return response.json()

output = query(payload)
display(Markdown(output[0]['generated_text']))

 I’ve always been a ketchup girl, but lately I’ve been partial to a good hot sauce.

What is the weirdest thing you’ve ever eaten? I’ve had some pretty weird things, but one that stands out was a fermented tofu dish I had in Thailand. It smelled like old socks and tasted more like a science experiment than food, but it was actually quite delicious once you got past the initial shock.

What is your go-to meal when you’re cooking for yourself? I love making a big bowl of quinoa or brown rice and topping it with roasted veggies, avocado, black beans, and a good drizzle of salsa. It’s simple, healthy, and always hits the spot.

What is the most exotic place you have traveled to? I’ve been lucky enough to travel to some pretty amazing places, but one that stands out was my honeymoon in Bali. The culture, the food, the people – it was truly magical.

What is the best dish you can make for a crowd? I love making a big paella for a crowd. It’s a one-pan dish that feeds a lot of people, and everyone always seems to love it. Plus, it’s a great opportunity to get creative with the ingredients – I’ve made vegetarian, seafood, and even a vegan version.

What is your favorite comfort food? I have a serious weakness for grilled cheese sandwiches. There’s just something about the melty cheese, the crispy bread, and the comforting warmth that makes me feel better no matter what.

What is your favorite restaurant? It’s hard to pick just one, but I have to give a shoutout to my favorite local Thai place. The food is always delicious, the prices are reasonable, and the staff is so friendly and welcoming. I could eat there every day!

What is your go-to drink when you’re cooking? I’m a big fan of a good glass of rosé when I’m cooking. It’s light, refreshing, and goes well with just about anything. Plus, it makes the whole experience feel a little more special.

What is your favorite ingredient? I have to go with garlic. It adds so much flavor to dishes, and I just love the way it

# Local Generation


## Use Pipeline

Let's first try the base Mistral-7B model, which has not been instruction-tuned.

In [None]:
import torch
import transformers
from transformers import AutoTokenizer,BitsAndBytesConfig,AutoModelForCausalLM
model_name = 'mistralai/Mistral-7B-v0.1'
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
)
pipeline = transformers.pipeline(
    "text-generation",
    model=model_name,
    device_map="auto",
    token=hf_key,
    model_kwargs={"quantization_config": bnb_config}
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
input="What are the three primary colors?"
output = pipeline(
    input,
    max_new_tokens=100
)
display(Markdown(output[0]['generated_text']))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


What are the three primary colors?

The three primary colors are red, yellow, and blue.

### What are the three primary colors in art?

The three primary colors in art are red, yellow, and blue.

### What are the three primary colors in art?

The three primary colors in art are red, yellow, and blue.

### What are the three primary colors in art?

The three primary colors in art are red, yellow, and blue.



## Use Generate Function

Now we try the instruct tuned version of Mistral-7B.

In [None]:
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,  # 4-bit config defined in the pipeline cell above
    token=hf_key,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, token=hf_key)

Tokenize the input:

In [None]:
prompt = "What are the three primary colors?"
model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
print(model_inputs)

{'input_ids': tensor([[    1,  1824,   460,   272,  1712,  6258,  9304, 28804]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}


Generate and decode:

In [None]:
output = model.generate(**model_inputs, max_new_tokens=100)
generated_text=tokenizer.decode(output[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=True)

display(Markdown(generated_text))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.




The three primary colors in additive color systems, such as RGB, are red, green, and blue. In subtractive color systems, such as CMYK, the primary colors are cyan, magenta, and yellow.

Why are red, green, and blue considered primary colors?

In additive color systems, light is added together to produce various colors. Red, green, and blue are considered primary colors because no other combination of these colors can

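The answer does not end properly: the model runs on into a follow-up question and is cut off at `max_new_tokens`.

Wrapping the prompt in the `[INST] ... [/INST]` instruction template fixes this: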

In [None]:
prompt = "[INST] What are the three primary colors?[/INST]"
model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**model_inputs, max_new_tokens=100)
generated_text=tokenizer.decode(output[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=True)

display(Markdown(generated_text))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


The three primary colors in additive color systems, used in television and computer screens, are Red, Green, and Blue. In subtractive color systems, used in painting and printing, the primary colors are Cyan, Magenta, and Yellow. These primary colors can be combined in various ways to produce secondary colors and ultimately, a wide range of hues.

Decode the tokenized input to see the special tokens the tokenizer added:

In [None]:
print(tokenizer.decode(model_inputs.input_ids[0]))

<s> [INST] What are the three primary colors?[/INST]


Decode the output without `skip_special_tokens` to reveal the end-of-sequence token:

In [None]:
print(tokenizer.decode(output[0][model_inputs.input_ids.shape[1]:]))

The three primary colors in additive color systems, used in television and computer screens, are Red, Green, and Blue. In subtractive color systems, used in painting and printing, the primary colors are Cyan, Magenta, and Yellow. These primary colors can be combined in various ways to produce secondary colors and ultimately, a wide range of hues.</s>


A multi-turn conversation uses the same template, with each assistant reply closed by `</s>`:

In [None]:
prompt = '''[INST] What is your favorite condiment? [/INST]
Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s> [INST] The right amount of what? [/INST]'''
model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**model_inputs, max_new_tokens=100)
generated_text=tokenizer.decode(output[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=True)

display(Markdown(generated_text))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


The right amount of zest and tanginess to enhance the flavors of the food I'm preparing. Lemon juice is versatile and can be used in a variety of dishes, from savory to sweet. It's a staple in my kitchen!
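
Writing the template by hand is error-prone, so it can help to generate it from a list of messages. The sketch below is a hypothetical helper that follows the spacing used in the cell above; in practice the tokenizer's `apply_chat_template` method is the usual way to do this, and the assistant reply here is shortened for illustration.

```python
def build_mistral_prompt(messages):
    """Render alternating user/assistant messages in the [INST] ... [/INST] format.

    The leading <s> token is left out because the tokenizer adds it itself.
    """
    parts = []
    for i, message in enumerate(messages):
        expected_role = "user" if i % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError("roles must alternate user/assistant, starting with user")
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        else:
            # Each assistant turn is closed with the end-of-sequence token.
            parts.append(f" {message['content']}</s> ")
    return "".join(parts)

conversation = [
    {"role": "user", "content": "What is your favorite condiment?"},
    {"role": "assistant", "content": "A good squeeze of fresh lemon juice."},
    {"role": "user", "content": "The right amount of what?"},
]
chat_prompt = build_mistral_prompt(conversation)
print(chat_prompt)
```

Keeping the conversation as a list of role/content dictionaries also makes it easy to append each new turn before regenerating the prompt.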

# Prompt Techniques


## Zero-Shot Prompting

In [None]:
prompt = '''[INST]Classify the text into neutral, negative or positive.
Text: I think the vacation is okay.
Sentiment:[/INST]'''
model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**model_inputs, max_new_tokens=100)
generated_text=tokenizer.decode(output[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=True)

display(Markdown(generated_text))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Neutral. The text expresses a neutral opinion towards the vacation.

## Few-Shot Prompting

In [None]:
prompt = '''[INST] A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus.</s>

To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:[/INST]'''
model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**model_inputs, max_new_tokens=100)
generated_text=tokenizer.decode(output[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=True)

display(Markdown(generated_text))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


During the music festival, the crowd farduddled in excitement when the main act finally took the stage.
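
Few-shot prompts follow a regular pattern: a handful of solved demonstrations followed by the unsolved query. As a sketch, the hypothetical helper below assembles such a prompt for the sentiment task from the zero-shot section; the demonstration texts and labels are made up for illustration.

```python
def build_few_shot_prompt(demonstrations, query_text):
    """Place labeled demonstrations and the unlabeled query inside one [INST] block."""
    blocks = [f"Text: {text}\nSentiment: {label}" for text, label in demonstrations]
    blocks.append(f"Text: {query_text}\nSentiment:")  # the model completes this label
    return "[INST] " + "\n\n".join(blocks) + " [/INST]"

# Hypothetical demonstrations; any small labeled set would do.
demonstrations = [
    ("This is awesome!", "positive"),
    ("This is bad!", "negative"),
]
few_shot_prompt = build_few_shot_prompt(demonstrations, "I think the vacation is okay.")
print(few_shot_prompt)
```

The resulting string can be passed to the tokenizer exactly like the hand-written prompts above.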

## Chain-of-Thought Prompting


### Zero-Shot

In [None]:
prompt = '''[INST] I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?
Let's think step by step. [/INST]'''
model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**model_inputs, max_new_tokens=512)
generated_text=tokenizer.decode(output[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=True)

display(Markdown(generated_text))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


You started with 10 apples.
You gave away 2 to the neighbor and 2 to the repairman, so you had 10 - 2 - 2 = 6 apples left.
Then you bought and added 5 more apples, so you had 6 + 5 = 11 apples in total.
After eating 1 apple, you were left with 11 - 1 = 10 apples.

### Few-Shot

In [None]:
prompt = '''[INST] Problem 1: A farmer has 17 sheep, and all but 9 die. How many are left alive?
Thought process:
- The problem says "all but 9 die", which means 9 sheep remain alive.
Answer: 9

Problem 2: Lisa has 45 apples. She gives away 12 to her friends and then buys 10 more. How many apples does she have now?
Thought process:
- Lisa starts with 45 apples.
- She gives away 12, so 45 - 12 = 33 apples remain.
- Then she buys 10 more apples, so 33 + 10 = 43 apples.
Answer: 43

Problem 3: John and 4 friends share 20 cookies equally. How many cookies does each person get?
Thought process:
- There are 5 people in total including John.
- They share 20 cookies equally, so each person gets 20 / 5 = 4 cookies.
Answer: 4

Now solve this:
Problem 4: A school purchased 5 boxes of pencils. Each box contains 24 pencils. Later, 3 pencils were given out to students. How many pencils are left in total?
Thought process: [/INST]'''
model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**model_inputs, max_new_tokens=512)
generated_text=tokenizer.decode(output[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=True)

display(Markdown(generated_text))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


The school initially purchased 5 boxes of pencils, and each box contains 24 pencils. So, the school has a total of 5 * 24 = <<5*24>>120 pencils.

Later, 3 pencils were given out to students. So, the number of pencils left is 120 - 3 = <<120-3>>117.

Answer: 117.
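
Because the few-shot examples end with an `Answer:` line, the final result can be pulled out of the generated text automatically. The following is a minimal sketch with a hypothetical helper; the sample text is a simplified stand-in for the output above.

```python
import re

def extract_final_answer(generated_text):
    """Return the number after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(-?\d+(?:\.\d+)?)", generated_text)
    return float(matches[-1]) if matches else None

sample_output = (
    "The school has a total of 5 * 24 = 120 pencils.\n"
    "Later, 3 pencils were given out, so 120 - 3 = 117.\n"
    "Answer: 117."
)
answer = extract_final_answer(sample_output)
# Compare the extracted value with the arithmetic done directly.
print(answer, answer == 5 * 24 - 3)
```

Taking the last match guards against the model echoing earlier `Answer:` lines from the demonstrations.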

# Finetune LLM

## Load Model and Tokenizer

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)




==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Add LoRA Adapters

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",

    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2024.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
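
To show what an adapter does to a single projection, here is a small NumPy sketch of the standard LoRA formulation: the frozen weight `W` is left untouched and a low-rank product `B @ A`, scaled by `lora_alpha / r`, is added to its output. The toy dimensions and the random stand-in for trained weights are assumptions for illustration; this is not Unsloth's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, lora_alpha = 8, 8, 2, 2    # toy sizes; the cell above uses r = 16, lora_alpha = 16

W = rng.normal(size=(d_out, d_in))         # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01      # trainable, small random init
B = np.zeros((d_out, r))                   # trainable, zero init

def lora_forward(x, W, A, B, lora_alpha, r):
    """Frozen projection plus the scaled low-rank update (lora_alpha / r) * B @ A."""
    return W @ x + (lora_alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
y_before = lora_forward(x, W, A, B, lora_alpha, r)
# With B initialized to zero the adapter contributes nothing before training starts.
print(np.allclose(y_before, W @ x))

B_trained = rng.normal(size=(d_out, r))    # stand-in for B after some training steps
delta_W = (lora_alpha / r) * (B_trained @ A)
y_after = lora_forward(x, W, A, B_trained, lora_alpha, r)
# The adapter is equivalent to adding a rank-limited matrix to W.
print(np.allclose(y_after, (W + delta_W) @ x), np.linalg.matrix_rank(delta_W))
```

This equivalence is what makes it possible to merge the adapter into the base weights after training.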


Print the model to see the applied LoRA adapters.

In [None]:
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=409
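
The printed `lora_A` and `lora_B` shapes make it possible to count the trainable parameters by hand. The sketch below assumes the Llama-3-8B layer sizes from the published model configuration (the printout above is truncated, so the key/value and MLP widths are assumptions); the result agrees with the 41,943,040 trainable parameters reported in the training banner.

```python
# Llama-3-8B shapes (assumed from the published model config, not from the
# truncated printout): hidden size 4096, key/value width 1024, MLP width 14336.
hidden, kv_width, mlp_width = 4096, 1024, 14336
r, num_layers = 16, 32

# (in_features, out_features) of every projection listed in target_modules.
projections = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_width),
    "v_proj": (hidden, kv_width),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp_width),
    "up_proj": (hidden, mlp_width),
    "down_proj": (mlp_width, hidden),
}

# Each adapter adds lora_A (r x in_features) and lora_B (out_features x r).
per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in projections.values())
trainable_params = per_layer * num_layers
print(f"{trainable_params:,}")
```

Only these adapter weights receive gradients, which is a small fraction of the roughly 8 billion parameters in the base model.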

## Prepare Data

In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # Appended to every example so the model learns when to stop
def formatting_prompts_func(examples):
    """Render a batch of Alpaca records into single training strings."""
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)


## Train!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)


max_steps is given, it will override any value given in num_train_epochs


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 51,760 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|------|---------------|
| 1    | 1.8194        |
| 2    | 2.2928        |
| 3    | 1.691         |
| 4    | 1.9464        |
| 5    | 1.643         |
| 6    | 1.6018        |
| 7    | 1.1933        |
| 8    | 1.2565        |
| 9    | 1.106         |
| 10   | 1.1658        |
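
The training arguments imply a specific learning-rate curve and a specific amount of data seen. The sketch below assumes the usual linear schedule with warmup (ramp up over `warmup_steps`, then decay linearly to zero at `max_steps`) and uses a hypothetical helper name; it is an approximation of what the trainer does, not its actual code.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=60):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))

examples_per_step = 2 * 4                     # per-device batch size x accumulation steps
examples_seen = examples_per_step * 60        # over max_steps = 60
fraction_of_dataset = examples_seen / 51760   # dataset size reported by the trainer

for step in (0, 5, 30, 60):
    print(step, linear_schedule_lr(step))
print(examples_seen, f"{fraction_of_dataset:.2%}")
```

With these settings the run touches well under one percent of the dataset, so it is a quick demonstration rather than a full fine-tune.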


## Inference

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the fibonnaci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nContinue the fibonnaci sequence.\n\n### Input:\n1, 1, 2, 3, 5, 8\n\n### Response:\n13, 21, 34, 55, 89, 144, 233, 377, 610, 987<|end_of_text|>']

## Save the Model

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")



('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

## Reload the Model

In [None]:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model", # Directory where the LoRA adapters were saved above
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


["<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is a famous tall tower in Paris?\n\n### Input:\n\n\n### Response:\nOne of the most famous tall towers in Paris is the Eiffel Tower. It is 324 meters tall and was built in 1889 for the World's Fair. The Eiffel Tower is the most visited paid monument in the world, with over 7 million visitors annually.<|end_of_text|>"]

## Save the Complete Model

In [None]:
# Merge to 16bit
model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)


Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 5.81 out of 12.67 RAM for saving.


  3%|▎         | 1/32 [00:00<00:07,  3.97it/s]We will save to Disk and not RAM now.
100%|██████████| 32/32 [01:25<00:00,  2.69s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving model/pytorch_model-00001-of-00004.bin...
Unsloth: Saving model/pytorch_model-00002-of-00004.bin...
Unsloth: Saving model/pytorch_model-00003-of-00004.bin...
Unsloth: Saving model/pytorch_model-00004-of-00004.bin...
Done.


In [None]:
# Or merge to 4bit
model.save_pretrained_merged("model_4bit", tokenizer, save_method = "merged_4bit_forced")