<a href="https://colab.research.google.com/github/keqingli1129/langchain/blob/main/Finetuning_Mistral_7B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers trl accelerate torch bitsandbytes peft datasets -qU

#### Load HF Dataset

First things first: we need to load the `mosaicml/instruct-v3` dataset, a collection of prompt/response pairs drawn from several source datasets.

In [None]:
from datasets import load_dataset

instruct_tune_dataset = load_dataset("mosaicml/instruct-v3")

Let's take a peek at our dataset.

It's our job to merge these `prompt` and `response` columns into a single formatted prompt for instruct-tuning.

In [None]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 56167
    })
    test: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 6807
    })
})

Since we want to train a model that generates instructions, we're going to filter out the other subsets and use only the `dolly_hhrlhf` component!

In [None]:
instruct_tune_dataset = instruct_tune_dataset.filter(lambda x: x["source"] == "dolly_hhrlhf")

In [None]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 34333
    })
    test: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 4771
    })
})

We're going to train on a small subset of the data. If you were considering an epoch-based approach, this would also reduce the amount of time spent training!

In [None]:
instruct_tune_dataset["train"] = instruct_tune_dataset["train"].select(range(5_000))

In [None]:
instruct_tune_dataset["test"] = instruct_tune_dataset["test"].select(range(200))

In [None]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 200
    })
})

#### Create Formatted Prompt

In the following function we'll be merging our `prompt` and `response` columns by creating the following template:

```
<s>### Instruction:
Use the provided input to create an instruction that could have been used to generate the response with an LLM.

### Input:
{input}

### Response:
{response}</s>
```

In [None]:
def create_prompt(sample):
  # Note the deliberate swap: the dataset's `response` becomes our input, and the
  # dataset's original instruction becomes the target the model learns to generate.
  bos_token = "<s>"
  eos_token = "</s>"
  original_system_message = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
  system_message = "Use the provided input to create an instruction that could have been used to generate the response with an LLM."

  # Strip the dataset's own template so only the bare instruction text remains.
  response = sample["prompt"].replace(original_system_message, "").replace("\n\n### Instruction\n", "").replace("\n### Response\n", "").strip()
  # Named `input_text` rather than `input` to avoid shadowing the Python builtin.
  input_text = sample["response"]

  full_prompt = ""
  full_prompt += bos_token
  full_prompt += "### Instruction:"
  full_prompt += "\n" + system_message
  full_prompt += "\n\n### Input:"
  full_prompt += "\n" + input_text
  full_prompt += "\n\n### Response:"
  full_prompt += "\n" + response
  full_prompt += eos_token

  return full_prompt

In [None]:
create_prompt(instruct_tune_dataset["train"][0])

'<s>### Instruction:\nUse the provided input to create an instruction that could have been used to generate the response with an LLM.\n\n### Input:\nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.\n\n### Response:\nWhat are different types of grass?</s>'
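At inference time the prompt should mirror this training template up to the `### Response:` header, leaving the response and the closing `</s>` for the model to produce. Here is a minimal sketch of such a helper. `build_inference_prompt` is a hypothetical name that is not part of the original notebook, and it leaves out `<s>` on the assumption that the tokenizer adds the BOS token itself when called with `add_special_tokens=True`.

```python
SYSTEM_MESSAGE = (
    "Use the provided input to create an instruction that could have been "
    "used to generate the response with an LLM."
)

def build_inference_prompt(input_text: str) -> str:
    """Hypothetical helper: same layout as create_prompt, minus the response and EOS."""
    return (
        "### Instruction:"
        + "\n" + SYSTEM_MESSAGE
        + "\n\n### Input:"
        + "\n" + input_text
        + "\n\n### Response:"
    )

prompt = build_inference_prompt("Fescues are dark green and shiny.")
print(prompt)
```

Building the prompt from one helper keeps the blank line before `### Input:` consistent between training and generation.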

### Loading the Base Model

We're going to load our model in `4bit`, with double quantization, with `bfloat16` as our compute dtype.

You'll notice we're loading the instruct-tuned model - this is because it's already adept at following tasks - we're just teaching it a new one!
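For a rough sense of what 4-bit loading buys us, here is a back-of-the-envelope estimate of weight memory. The parameter count (about 7.24 billion) and the quantization overheads (roughly 0.5 bits per parameter for quantization constants, dropping to about 0.127 bits with double quantization) are assumptions based on the figures described in the QLoRA paper, not measurements from this notebook. Activations, the CUDA context, and any layers left unquantized are ignored.

```python
N_PARAMS = 7.24e9  # assumed approximate parameter count for Mistral 7B

def weight_gib(bits_per_param: float, n_params: float = N_PARAMS) -> float:
    """Weight storage in GiB for a given number of bits per parameter."""
    return n_params * bits_per_param / 8 / 2**30

bf16 = weight_gib(16)            # unquantized 16-bit weights
nf4 = weight_gib(4 + 0.5)        # 4-bit weights + assumed constant overhead
nf4_dq = weight_gib(4 + 0.127)   # same, with assumed double-quantization overhead

print(f"bf16:               {bf16:.2f} GiB")
print(f"nf4:                {nf4:.2f} GiB")
print(f"nf4 + double quant: {nf4_dq:.2f} GiB")
```

The exact footprint you see on your GPU will differ, but the ratio explains why a 7B model becomes trainable on a single consumer or Colab GPU.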

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    device_map='auto',
    quantization_config=nf4_config,
    use_cache=False
)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


Let's examine how well the model does at this task currently:

In [None]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0].replace(prompt, "")
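Removing the prompt with `str.replace` relies on the decoded text reproducing the prompt character for character, which a tokenizer round-trip does not always guarantee. A common alternative is to slice off the prompt tokens before decoding. The toy sketch below shows the slicing on plain lists with made-up token ids; `strip_prompt_tokens` is a hypothetical helper. With real tensors the equivalent would be something like `generated_ids[:, model_inputs["input_ids"].shape[1]:]`, which has not been run here.

```python
def strip_prompt_tokens(generated_ids, prompt_ids):
    """Hypothetical helper: keep only the tokens generated after the prompt."""
    if generated_ids[:len(prompt_ids)] != prompt_ids:
        raise ValueError("generated sequence does not start with the prompt")
    return generated_ids[len(prompt_ids):]

prompt_ids = [1, 835, 2799]                  # made-up ids standing in for the encoded prompt
generated_ids = prompt_ids + [1724, 526, 2]  # made-up continuation ending in an EOS-like id
new_tokens = strip_prompt_tokens(generated_ids, prompt_ids)
print(new_tokens)
```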

In [None]:
generate_response("### Instruction:\nUse the provided input to create an instruction that could have been used to generate the response with an LLM.### Input:\nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.\n\n### Response:", model)



"<s> ### Instruction:\nUse the provided input to create an instruction that could have been used to generate the response with an LLM.### Input:\nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.\n\n### Response:\nWhen it comes to grass, there are many different varieties to choose from. The most common type is Kentucky Bluegrass, which is known for its quick growth, ease of care, and soft texture. If you're looking for a shiny, bright green grass, Rygrass is a great option. Fescues are another popular choice, with their dark green, shiny appearance. If you live in an area with dry soil, Bermuda grass may be the best choice for you, as it's harder but can still grow in those conditions.</s>"

Now, we're going to prepare our model for 4bit LoRA training!

We can use these handy helper functions to achieve this goal thanks to `huggingface` and the `peft` library!

In [None]:
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM"
)

In [None]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
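To get a feel for how small the adapter is, the sketch below counts LoRA parameters by hand. Each adapted linear layer with `d_in` inputs and `d_out` outputs gains two low-rank matrices holding `r * (d_in + d_out)` weights in total, and the update is scaled by `lora_alpha / r`. The layer shapes, the layer count, and the choice of `q_proj` and `v_proj` as target modules are assumptions (the `LoraConfig` above does not set `target_modules`, so the library default for the architecture applies). Calling `model.print_trainable_parameters()` reports the real figure.

```python
r, lora_alpha = 64, 16  # values from the LoraConfig above

# Assumed Mistral-7B shapes: (d_in, d_out) for each targeted projection
assumed_targets = {"q_proj": (4096, 4096), "v_proj": (4096, 1024)}
assumed_num_layers = 32

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # A has shape (rank, d_in) and B has shape (d_out, rank)
    return rank * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in assumed_targets.values())
total = per_layer * assumed_num_layers
scaling = lora_alpha / r

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"Update scaling factor:     {scaling}")
```

Under these assumptions the adapter is well under one percent of the base model's weights, which is what makes pushing only the adapters practical.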

All that's left to do is set up a number of hyperparameters.

In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
  output_dir = "mistral_instruct_generation",
  #num_train_epochs=5,
  max_steps = 100, # comment out this line if you want to train in epochs
  per_device_train_batch_size = 4,
  warmup_ratio = 0.03, # a fraction of total steps; `warmup_steps` expects an integer step count
  logging_steps=10,
  save_strategy="epoch",
  #evaluation_strategy="epoch",
  evaluation_strategy="steps",
  eval_steps=20, # comment out this line if you want to evaluate at the end of each epoch
  learning_rate=2e-4,
  bf16=True,
  lr_scheduler_type='constant',
)
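It can help to work out what these settings imply before starting a run. The sketch below does the arithmetic for the values above, reading `0.03` as a warmup fraction of the total steps and assuming a single GPU with no gradient accumulation. One caveat to verify against your installed `transformers` version: as far as we know, the plain `constant` scheduler applies no warmup at all, so the warmup value only takes effect with a scheduler such as `constant_with_warmup`.

```python
import math

# Values copied from the TrainingArguments above
max_steps = 100
per_device_train_batch_size = 4
eval_steps = 20
logging_steps = 10
warmup_fraction = 0.03

# Assumes one device and gradient_accumulation_steps == 1
sequences_seen = max_steps * per_device_train_batch_size
warmup_step_count = math.ceil(max_steps * warmup_fraction)
num_evals = max_steps // eval_steps
num_logs = max_steps // logging_steps

print(f"sequences seen:  {sequences_seen}")
print(f"warmup steps:    {warmup_step_count}")
print(f"evaluation runs: {num_evals}")
print(f"logging events:  {num_logs}")
```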

In [None]:
from trl import SFTTrainer

max_seq_length = 2048

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  formatting_func=create_prompt,
  args=args,
  train_dataset=instruct_tune_dataset["train"],
  eval_dataset=instruct_tune_dataset["test"]
)
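With `packing=True`, short examples are concatenated into one token stream and cut into fixed-length blocks, so little of each `max_seq_length` window is wasted on padding. The toy sketch below illustrates the idea on lists of integers. It is a simplified illustration, not TRL's actual implementation, and both `pack_sequences` and the EOS id of `2` are hypothetical.

```python
from typing import Iterable, List

def pack_sequences(tokenized: Iterable[List[int]], block_size: int, eos_id: int) -> List[List[int]]:
    """Concatenate token lists (each followed by eos_id) and split into full blocks."""
    stream: List[int] = []
    for ids in tokenized:
        stream.extend(ids)
        stream.append(eos_id)  # mark the boundary between examples
    n_full = len(stream) // block_size
    # Drop any trailing partial block so every block has exactly block_size tokens
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]

examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
blocks = pack_sequences(examples, block_size=4, eos_id=2)
print(blocks)
```

Notice that a block can begin in the middle of one example and end in the middle of the next, which is why each example carries its own end-of-sequence marker.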



In [None]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 20   | 1.5234        | 1.371291        |
| 40   | 1.4518        | 1.3347          |
| 60   | 1.4292        | 1.32248         |
| 80   | 1.4286        | 1.314245        |
| 100  | 1.4276        | 1.309581        |


TrainOutput(global_step=100, training_loss=1.478941478729248, metrics={'train_runtime': 476.1112, 'train_samples_per_second': 0.84, 'train_steps_per_second': 0.21, 'total_flos': 3.50843194834944e+16, 'train_loss': 1.478941478729248, 'epoch': 0.08})
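Cross-entropy loss is often easier to interpret as perplexity, which is simply `exp(loss)`. The sketch below converts the validation losses logged above. It assumes the reported loss is the mean per-token negative log-likelihood in nats, which is the usual convention for causal language-model training.

```python
import math

# Validation losses copied from the training log above
validation_loss = {20: 1.371291, 40: 1.3347, 60: 1.32248, 80: 1.314245, 100: 1.309581}

perplexity = {step: math.exp(loss) for step, loss in validation_loss.items()}

for step, ppl in perplexity.items():
    print(f"step {step:>3}: perplexity ~ {ppl:.2f}")
```

The steady but shrinking improvement between evaluations suggests most of the gain in this short run arrives within the first few dozen steps.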

In [None]:
trainer.save_model("mistral_instruct_generation")

### Save Model and Push to Hub

Saving and pushing the 4-bit model is coming soon!

The PR that adds it is still in progress. Check it out [here](https://github.com/TimDettmers/bitsandbytes/pull/753)!

For now, we'll save our adapters!

In [None]:
!pip install huggingface-hub -qU

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
trainer.push_to_hub("ai-maker-space/mistral-instruct-generation")

In [None]:
merged_model = model.merge_and_unload()



In [None]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0]

In [None]:
generate_response("### Instruction:\nUse the provided input to create an instruction that could have been used to generate the response with an LLM.### Input:\nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.\n\n### Response:", merged_model)

'<s> ### Instruction:\nUse the provided input to create an instruction that could have been used to generate the response with an LLM.### Input:\nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.\n\n### Response:\nIdentify the most common species of grass, and provide a brief description of its properties.</s>'