# Chapter 4 Instruction Fine-tuning

In this lesson, you will learn about instruction fine-tuning, a variant of fine-tuning that turned GPT-3 into ChatGPT and gave it the ability to chat. Let's start giving models chat capabilities.

Let's take a closer look at what instruction fine-tuning is. It is one type of fine-tuning; others target tasks such as reasoning, routing, copilot (writing code), chat, and agents. Instruction fine-tuning, which you may also hear called instruction tuning or instruction-following LLMs, teaches the model to follow instructions and behave more like a chatbot.

As we saw with ChatGPT, this is a better user interface for interacting with the model. It is the method that turned GPT-3 into ChatGPT, which dramatically widened the adoption of AI from a few researchers like me to millions of people.

![What is instruction finetuning](../../figures/What%20is%20instruction%20finetuning.png)

For an instruction fine-tuning dataset, you can use a lot of data that already exists, either online or specific to your company. This might be FAQs, customer support conversations, or Slack messages.

In other words, it is a dialogue dataset or an instruction-response dataset. If you don't have such data, no problem. You can convert your data into a question-answer or instruction-response format by using a prompt template. Here you can see a README file converted into question-answer pairs. You can also use another LLM to do this conversion. Stanford has a technique called Alpaca that does this with ChatGPT, and you can of course build a similar pipeline from open-source models.

![LLM data generation](../../figures/LLM%20data%20generation.png)
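
The template-based conversion described above can be sketched in a few lines. This is a minimal illustration, not the Alpaca pipeline: the FAQ entries, the `qa_template` string, and the helper `to_instruction_pairs` are all hypothetical names chosen for the example.

```python
# Hypothetical FAQ entries; in practice these would come from your own data.
faq_entries = [
    {"q": "How do I reset my password?", "a": "Click 'Forgot password' on the login page."},
    {"q": "Which plans do you offer?", "a": "We offer a free plan and a paid plan."},
]

# A simple prompt template (an assumption for this sketch, not a fixed standard).
qa_template = "### Question:\n{question}\n\n### Answer:"


def to_instruction_pairs(entries):
    """Turn raw FAQ entries into prompt/response pairs."""
    pairs = []
    for entry in entries:
        prompt = qa_template.format(question=entry["q"])
        pairs.append({"input": prompt, "output": entry["a"]})
    return pairs


pairs = to_instruction_pairs(faq_entries)
print(pairs[0]["input"])
print(pairs[0]["output"])
```

Each resulting pair has the same `input`/`output` shape that the lab later writes to a JSONL file.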

One of the most interesting things about fine-tuning is that it teaches the model a new behavior. Your fine-tuning data might contain question-answer pairs such as "What is the capital of France?" with the answer "Paris", because pairs like these are easy to obtain. The model can then generalize this question-answering behavior to data that was never in the fine-tuning dataset but that it already learned during pre-training, such as code. This is a finding from the ChatGPT paper: the model can answer questions about code even though the instruction fine-tuning data contained no question-answer pairs about code. That matters because it is very expensive to have programmers label datasets by asking questions about code and writing the code that answers them.

![Instruction finetuning generalization](../../figures/Instruction%20finetuning%20generalization.png)

The steps of fine-tuning can be summarized as data preparation, training, and evaluation. After you evaluate the model, you go back and prepare the data again to improve it, so the process is highly iterative. For instruction fine-tuning and other types of fine-tuning, data preparation is where the real differences lie: this is where you change and tailor your data to the specific type of fine-tuning and the specific task. Training and evaluation are very similar across types.

![Different types of finetuning](../../figures/Different%20types%20of%20finetuning.png)

Now let's move on to the lab, where you will see the Alpaca dataset used for instruction tuning. You will again compare instruction-tuned and non-instruction-tuned models, and you will see models of different sizes.

First, we need to import a few libraries. The most important one is the `load_dataset` function from the `datasets` library.

In [1]:
# Import the itertools library to create efficient iterators
import itertools

# Import the jsonlines library to process JSONL (JSON Lines) format files
import jsonlines

# Import the load_dataset function of the datasets library to load various natural language processing datasets
from datasets import load_dataset

# Import the pprint library to print Python objects beautifully
from pprint import pprint

# Import the BasicModelRunner class of the llama library to simplify model running
from llama import BasicModelRunner

# Import the AutoTokenizer class from the transformers library,
# Used to automatically select and load pre-trained tokenizers suitable for a specific model
from transformers import AutoTokenizer

# Import the AutoModelForCausalLM class from the transformers library,
# Used to automatically select and load pre-trained models for causal language models
from transformers import AutoModelForCausalLM

# Import the AutoModelForSeq2SeqLM class from the transformers library,
# used to automatically select and load pre-trained models for sequence-to-sequence tasks
from transformers import AutoModelForSeq2SeqLM


## 1. Loading the instruction fine-tuning dataset

Let's load the instruction tuning dataset, which is the Alpaca dataset. Again, we use streaming because this is a fairly large fine-tuning dataset, though not nearly as large as the Pile.

In [2]:
# Use the `load_dataset` function to load the dataset named "tatsu-lab/alpaca" from the datasets library
# Set split to "train" to load training data and enable streaming mode
instruction_tuned_dataset = load_dataset("tatsu-lab/alpaca", split="train", streaming=True)

As before, let's look at a few examples. Unlike the dataset in the previous section, this one is not unstructured text; it is more structured.

In [3]:
# Define m, which represents the number of records we want to view
m = 5

# Print a descriptive message
print("Instruction-tuned dataset:")

# Use `itertools.islice` to take the first m records from the dataset
# and convert them into a list, top_m
top_m = list(itertools.islice(instruction_tuned_dataset, m))

# Iterate over the top_m list and print each record
for j in top_m:
  print(j)

Instruction-tuned dataset:
{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}
{'instruction': 'What are the three primary colors?', 'input': '', 'output': 'The three primary colors are red, blue, and yellow.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are the three primary colors?\n
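
To see why `itertools.islice` pairs well with streaming, here is a small sketch that uses a plain Python generator as a stand-in for the streamed dataset. The generator `record_stream` is hypothetical and only imitates the record shape; the point is that only the requested records are ever produced.

```python
import itertools


def record_stream():
    """A stand-in for a streamed dataset: yields records lazily, one at a time."""
    i = 0
    while True:  # conceptually endless, like a very large dataset
        yield {"instruction": f"Task {i}", "input": "", "output": f"Answer {i}"}
        i += 1


# islice pulls only the first 3 records; the rest are never generated.
first_three = list(itertools.islice(record_stream(), 3))
for record in first_three:
    print(record)
```

Calling `list()` directly on such a stream would never finish, which is why the lab slices before converting to a list.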

## 2. Two prompt templates

The data is a little more structured here, but still not as clear-cut as plain question-answer pairs. The nice thing is that the authors of the Alpaca paper provide two prompt templates, because they want the model to handle two different types of prompts and tasks. One template is for an instruction that comes with an additional input. For example, the instruction might be "add two numbers" and the input might be "the first number is 3 and the second number is 4". The other template is for an instruction with no input, which is the case in the examples shown here.

In [4]:
# Define two string templates: one for data points with an input field and one for data points without
prompt_template_with_input = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""

prompt_template_without_input = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""
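
The examples printed so far all have an empty `input`, so here is a short sketch of what the with-input template produces once it is filled in. The template text is the English Alpaca wording; the record is hypothetical and simply reuses the "add two numbers" example from the text.

```python
# English wording of the Alpaca template for records that have an input.
template_with_input = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""

# Hypothetical record, based on the "add two numbers" example.
record = {
    "instruction": "Add the two numbers.",
    "input": "The first number is 3 and the second number is 4.",
}

# str.format fills each {placeholder} with the matching dictionary value.
filled_prompt = template_with_input.format(**record)
print(filled_prompt)
```

The filled prompt deliberately stops at `### Response:` so that the model's job is to write what comes next.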

## 3. Filling in the prompts (adding data to prompts)

Sometimes an input is not needed, so the input field is empty. These are the prompt templates being used. As in the previous section, you fill the templates with the data and run this over the entire dataset. Let's print one prompt-response pair to see what it looks like.

In [5]:
# Initialize an empty list to store the processed data
processed_data = []

# Loop over each record j in top_m (the first m records loaded above)
for j in top_m:
  # Check whether the "input" field of the current record is empty
  if not j["input"]:
    # No input: use the template without an input field, filled with the "instruction" field
    processed_prompt = prompt_template_without_input.format(instruction=j["instruction"])
  else:
    # Input present: use the template with an input field, filled with the "instruction" and "input" fields
    processed_prompt = prompt_template_with_input.format(instruction=j["instruction"], input=j["input"])

  # Store the processed prompt as "input" and the original "output" as "output"
  processed_data.append({"input": processed_prompt, "output": j["output"]})

In [6]:
# Use the pprint function to print the first element of the processed_data list in a beautiful format
pprint(processed_data[0])

{'input': 'Below is an instruction that describes a task. Write a response '
          'that appropriately completes the request.\n'
          '\n'
          '### Instruction:\n'
          'Give three tips for staying healthy.\n'
          '\n'
          '### Response:',
 'output': '1.Eat a balanced diet and make sure to include plenty of fruits '
           'and vegetables. \n'
           '2. Exercise regularly to keep your body active and strong. \n'
           '3. Get enough sleep and maintain a consistent sleep schedule.'}


So that is the input and the output, and you can see how they fit together in the prompt. The prompt ends with `### Response:`, and the output holds the response that should follow it.
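
The raw Alpaca records also carry a `text` field, which appears to be the filled prompt followed by a newline and the output. A minimal sketch of that concatenation, using a hand-written example in the same shape as a `processed_data` entry (the variable names are illustrative):

```python
# One processed example in the same shape as the entries built in the lab.
example = {
    "input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\nWhat are the three primary colors?\n\n"
        "### Response:"
    ),
    "output": "The three primary colors are red, blue, and yellow.",
}

# Join prompt and response with a newline, mirroring the raw `text` field.
full_text = example["input"] + "\n" + example["output"]
print(full_text)
```

Keeping prompt and response separate, as the lab does, makes it easy to build either form later.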

## 4. Storing the data as JSONL

As before, you can write this out to a JSON Lines file, and you can upload it to the Hugging Face Hub if you want. We have already uploaded it as `lamini/alpaca`, so it is stable and ready to use. You can go there, take a look, and use it too.

In [7]:
# Open a JSONL file named 'alpaca_processed.jsonl' in write mode
with jsonlines.open('alpaca_processed.jsonl', 'w') as writer:
    # Write every element of processed_data to the file, one JSON object per line
    writer.write_all(processed_data)
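
For readers who want to see what the JSON Lines format amounts to, the sketch below does a write-and-read round trip using only the standard library `json` module instead of `jsonlines`. The records and the temporary file path are made up for the example.

```python
import json
import os
import tempfile

records = [
    {"input": "### Instruction:\nSay hello.\n\n### Response:", "output": "Hello!"},
    {"input": "### Instruction:\nSay goodbye.\n\n### Response:", "output": "Goodbye!"},
]

# Write to a temporary file so the sketch does not overwrite real data.
path = os.path.join(tempfile.mkdtemp(), "example.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        # json.dumps escapes newlines inside strings, so each record stays on one line.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read it back: each line is parsed independently.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)
```

Because every line is a complete JSON object, such files can be streamed or appended to without parsing the whole file.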

## 5. Comparing non-instruction-tuned and instruction-tuned models

Great, now that you have seen what an instruction dataset looks like, the next step is to compare how different models answer the prompt "Tell me how to train my dog to sit".

In [8]:
# Load the processed Alpaca dataset from the Hugging Face Hub
dataset_path_hf = "lamini/alpaca"          # Dataset path on the Hub
dataset_hf = load_dataset(dataset_path_hf) # Load the dataset
print(dataset_hf)                          # Print the loaded dataset

DatasetDict({
    train: Dataset({
        features: ['input', 'output'],
        num_rows: 52002
    })
})


The first model is a Llama 2 model that has not been instruction-tuned. We run it on the prompt "Tell me how to train my dog to sit". Instead of answering, it simply continues the prompt and keeps generating.

In [9]:
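# Initialize a non-instruction-tuned model; the model path is "meta-llama/Llama-2-7b-hf"
non_instruct_model = BasicModelRunner("meta-llama/Llama-2-7b-hf")
# Use this model to generate a response on how to train a dog to sit
non_instruct_output = non_instruct_model("Tell me how to train my dog to sit")
print("Not instruction-tuned output (Llama 2 Base):", non_instruct_output) # Print the response

Not instruction-tuned output (Llama 2 Base): down

It is hard for my dog to sit down, it is hard for it to stop, and after it stops it is hard for it to sit down again.

I want it to sit down, but it is hard for it to stop, and hard for it to sit down again.

I want to sit down, and stop, but it is very, very hard for me to keep eating.

I want to sit down, still stop, still hard to eat.

I want to keep going, but it is still hard to sit down.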


Keeping that result in mind, let's compare it with the instruction-tuned model. It does much better and actually produces a list of steps.

In [12]:
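# The definition of instruct_model is not shown in this section.
# Assumed here: the chat (instruction-tuned) variant of Llama 2.
instruct_model = BasicModelRunner("meta-llama/Llama-2-7b-chat-hf")
# Use this model to generate a response on how to train a dog to sit
instruct_output = instruct_model("Tell me how to train my dog to sit")
print("Instruction-tuned output (Llama 2): ", instruct_output) # Print the response

Instruction-tuned output (Llama 2):  ?

Many people want to train their dog to sit, but they do not know how. Here are some simple steps that can help you train your dog to sit:
1. Choose a suitable place: pick a safe and comfortable place, such as at home or in a spacious area.
2. Prepare your supplies: you need a few items to train your dog to sit. For example, you can use a large enough mattress, or a comfortable enough blanket.
3. Train the dog to sit: you can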


Finally, here is ChatGPT's output again so that you can compare.

In [None]:
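# Initialize a ChatGPT model named "chat-gpt"
chatgpt = BasicModelRunner("chat-gpt")
# Use this model to generate a response on how to train a dog to sit
instruct_output_chatgpt = chatgpt("Tell me how to train my dog to sit")
print("Instruction-tuned output (ChatGPT): ", instruct_output_chatgpt) # Print the response

Instruction-tuned output (ChatGPT):  Training a dog to sit is a basic training skill. Here are some steps to help you train your dog to sit:

1. Prepare rewards: get some small treats or food your dog likes as rewards; this will help you motivate your dog.

2. Find a quiet place: choose a quiet place to start training, which reduces distractions and makes it easier for your dog to focus.

3. The sit gesture: stand in front of your dog, pick up a small piece of food, turn your palm upward, then slowly move your hand from above the dog's nose toward the back, so that the dog's head follows your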


These are relatively large models. ChatGPT is much larger than the Llama 2 model used here: this Llama 2 model has 7 billion parameters, while ChatGPT is rumored to have around 70 billion. You will also explore some smaller models, one of which has 70 million parameters.

## 6. Try a smaller model

Here I load the smaller model. The details are not important yet; you will explore them later. I load two things: a tokenizer to process the data, and the model to run. The model name is `EleutherAI/pythia-70m`, a 70 million parameter model that has not been instruction-tuned. The function below runs inference, that is, it runs the model on a piece of text. The next few labs go through the different parts of this function in detail.

In [15]:
# Load the tokenizer and the base model (70 million parameters, not fine-tuned)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

In [16]:
# Define the inference function
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
  # Tokenize: convert the input text into token IDs
  input_ids = tokenizer.encode(
          text,
          return_tensors="pt",  # Return PyTorch tensors
          truncation=True,  # Truncate the text if it is too long
          max_length=max_input_tokens  # Maximum length of the input, in tokens
  )

  # Generate: produce output tokens with the model
  device = model.device  # Device the model is on (CPU or GPU)
  generated_tokens_with_prompt = model.generate(
    input_ids=input_ids.to(device),  # Move the input to the same device as the model
    max_length=max_output_tokens  # Maximum total length in tokens (prompt plus generated text)
  )

  # Decode: convert the generated token IDs back into text
  generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

  # Strip the prompt: remove the input text so that only the response remains
  generated_text_answer = generated_text_with_prompt[0][len(text):]

  return generated_text_answer  # Return the generated text
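
The last step of `inference` slices off the first `len(text)` characters, which assumes the decoded text begins with the prompt exactly. That may not hold, for instance if the prompt was truncated during tokenization. A slightly more defensive variant is sketched below with plain strings; the helper `strip_prompt` is hypothetical and not part of the lab code.

```python
def strip_prompt(prompt, generated_with_prompt):
    """Return only the text generated after the prompt."""
    if generated_with_prompt.startswith(prompt):
        return generated_with_prompt[len(prompt):]
    # Fallback: the decoded text did not reproduce the prompt exactly,
    # so return everything rather than cut at an arbitrary position.
    return generated_with_prompt


prompt = "Can Lamini generate documentation?"
decoded = prompt + " Yes, it can."
answer = strip_prompt(prompt, decoded)
print(repr(answer))
```

For the short prompts used in this lab, the simple slice and this variant give the same result.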

This model has not been fine-tuned, so it knows nothing specific about the company. We can load the company dataset again and give the model a question from it, for example the first sample in the test set.

In [19]:
# Load the dataset for fine-tuning
finetuning_dataset_path = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_path)
print(finetuning_dataset)  # Print the dataset information

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})


In [None]:
# Get test samples
test_sample = finetuning_dataset["test"][0]
print(test_sample)  # Print the test sample

# Perform inference using the base model
print(inference(test_sample["question"], model, tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


{'question': 'Can Lamini generate technical documentation or user manuals for software projects?', 'answer': 'Yes, Lamini can generate technical documentation and user manuals for software projects. It uses natural language generation techniques to create clear and concise documentation that is easy to understand for both technical and non-technical users. This can save developers a significant amount of time and effort in creating documentation, allowing them to focus on other aspects of their projects.', 'input_ids': [5804, 418, 4988, 74, 6635, 7681, 10097, 390, 2608, 11595, 84, 323, 3694, 6493, 32, 4374, 13, 418, 4988, 74, 476, 6635, 7681, 10097, 285, 2608, 11595, 84, 323, 3694, 6493, 15, 733, 4648, 3626, 3448, 5978, 5609, 281, 2794, 2590, 285, 44003, 10097, 326, 310, 3477, 281, 2096, 323, 1097, 7681, 285, 1327, 14, 48746, 4212, 15, 831, 476, 5321, 12259, 247, 1534, 2408, 273, 673, 285, 3434, 275, 6153, 10097, 13, 6941, 731, 281, 2770, 327, 643, 7794, 273, 616, 6493, 15], 'attention

The question is "Can Lamini generate technical documentation or user manuals for software projects?" The reference answer begins "Yes, Lamini can generate technical documentation and user manuals for software projects" and goes on from there. The model's answer, however, is along the lines of "I have a question about the following. How can the right documentation work?", followed by an answer such as "I think you need to use the following code", and so on. That is far from the right answer.

The model has clearly learned English, and it picked up the word "documentation". It may even sense that it is in a question-answering setting, because it uses "A" to mark the answer. But the answer is plainly wrong. In terms of knowledge, it does not understand this dataset, and in terms of behavior, it does not know that it is expected to answer the question.

## 7. Comparison with the smaller model after fine-tuning

Now compare this with a model that we have already fine-tuned for you, and that you will soon fine-tune yourself for instruction following. Load this model and run the same question through it to see how it performs.

In [None]:
# Load the fine-tuned model
instruction_model = AutoModelForCausalLM.from_pretrained("lamini/lamini_docs_finetuned")

In [None]:
# Use the fine-tuned model for inference
print(inference(test_sample["question"], instruction_model, tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Yes, Lamini can generate technical documentation or user manuals for software projects. This can be achieved by providing a prompt for a specific technical question or question to the LLM Engine, or by providing a prompt for a specific technical question or question. Additionally, Lamini can be trained on specific technical questions or questions to help users understand the process and provide feedback to the LLM Engine. Additionally, Lamini


The output begins "Yes, Lamini can generate technical documentation or user manuals for software projects" and continues from there. This model is much more accurate than the previous one, and it follows the behavior we expect.

Now that you have seen what instruction fine-tuning is, the next step is to learn about tokenizers and how to pre-process the data so that the model can be trained on it.

In [None]:
# If you want to know how to upload your own dataset to Hugging Face,
# this is how we did it

# !pip install huggingface_hub
# !huggingface-cli login

# import pandas as pd
# import datasets
# from datasets import Dataset

# finetuning_dataset = Dataset.from_pandas(pd.DataFrame(data=finetuning_dataset))
# finetuning_dataset.push_to_hub(dataset_path_hf)