# 3. Where Finetuning Fits In 
In this section, we look at where fine-tuning fits into the training process: it comes after the "pre-training" step.

# Table of Contents
- [3.1 - Some knowledge before the experiment](#3.1)
- [3.2 - Experiment](#3.2)
  - [3.2.1 - View the pre-training dataset](#3.2.1)
  - [3.2.2 - Comparison with the company fine-tuning data we will use](#3.2.2)
  - [3.2.3 - Different ways to format data](#3.2.3)
  - [3.2.4 - Common ways to store data](#3.2.4)

<a name='3.1'></a>
## 3.1 Some knowledge before the experiment

Pre-training is the first step, and it happens before any fine-tuning. It starts with a completely random model that knows nothing about the world: all of its weights are random, it has no language skills yet, and it cannot form English words at all. Its learning objective is next-`token` prediction or, in a simplified sense, next-word prediction. For example, given the word `once`, we want the model to predict the word `upon`. At this stage, however, the LLM outputs something like `sd!!@`, which is nowhere near `upon`.
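
To make the `once` / `upon` example concrete, the sketch below scores a prediction with cross-entropy, the loss commonly used for next-token prediction. The text above does not name a loss, so treat that choice as an assumption; the five-word vocabulary and the probabilities are made up for illustration.

```python
import math

# Toy vocabulary; real models use tens of thousands of tokens.
vocab = ["once", "upon", "a", "time", "sd!!@"]

def next_token_loss(probs, target):
    """Cross-entropy for a single prediction: -log p(target)."""
    return -math.log(probs[target])

# An untrained model has no preference: every token is equally likely.
untrained = {tok: 1 / len(vocab) for tok in vocab}

# A trained model (hypothetical numbers) puts most of its mass on "upon".
trained = {"once": 0.02, "upon": 0.90, "a": 0.04, "time": 0.03, "sd!!@": 0.01}

loss_untrained = next_token_loss(untrained, "upon")
loss_trained = next_token_loss(trained, "upon")
print(f"untrained: {loss_untrained:.3f}, trained: {loss_trained:.3f}")
```

The lower the loss, the more probability the model assigned to the correct next token, which is what training pushes toward.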

That is the starting point. From there, the model reads a huge corpus of data, usually scraped from all over the web. This data is often called unlabeled because nobody structured or annotated it by hand; it was simply collected from the web. It still goes through a lot of cleaning, so considerable manual work goes into making the dataset effective for pre-training. The training itself is often called self-supervised learning because the model essentially supervises itself through next-`token` prediction: all it has to do is predict the next word, and there are no labels beyond that.
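
The idea that raw text supervises itself can be sketched in a few lines. The example below splits a sentence on whitespace, which is only a stand-in for a real tokenizer (an assumption for illustration), and turns it into (context, next token) training pairs without any human labeling. The helper name `make_next_token_pairs` is hypothetical.

```python
def make_next_token_pairs(text):
    """Turn raw text into (context, next_token) pairs.

    Whitespace splitting stands in for a real subword tokenizer.
    """
    tokens = text.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = make_next_token_pairs("once upon a time")
for context, target in pairs:
    print(context, "->", target)
```

Every position in the text yields one training example, so the "labels" come for free from the text itself.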

![1](../../figures/FLLM-3-1.png)

After training, we can see that the model is now able to predict the word, or token, `upon`. It has learned language, and it has picked up a lot of knowledge from the internet.

![2](../../figures/FLLM-3-2.png)

The data and knowledge behind these models are often not public. People don't know exactly what the datasets look like for closed-source models from big companies. But there is an open-source effort by EleutherAI that created a dataset called The Pile. It is a set of 22 different datasets scraped from all over the internet. The figure below shows some samples: "Four score and seven years", which is Lincoln's Gettysburg Address; Lincoln's carrot cake recipe; medical literature scraped from PubMed; and code from GitHub. It is a carefully assembled set of datasets intended to infuse these models with knowledge. (The experiment in section 3.2 looks at a shard of C4, another web-scraped corpus.)

This pre-training step is very expensive and time-consuming, because the model has to go through all of this data and progress from completely random to understanding the text: it has to learn about medicine and the Gettysburg Address while also learning to write code.

![3](../../figures/FLLM-3-3.png)

In the picture below, when we give the pre-trained model the input “What is the capital of Mexico?”, its output is obviously wrong.

![4](../../figures/FLLM-3-4.png)

As we can see, the pre-trained model is not really useful as a chatbot interface. So how do we get from here to a chatbot? Fine-tuning is one way to get there, and it should be a tool in our toolbox. Pre-training is the first step and gives us a base model. With additional data, and not much of it, we can fine-tune the base model to get a fine-tuned model. Even a fine-tuned model can go through further fine-tuning steps afterwards.

So fine-tuning is the step that comes after pre-training. It can use the same type of data: unlabeled data collected from different sources and brought together, as we will see shortly. But we can also curate the data ourselves and make it more structured for the model to learn from. One of the key things that distinguishes fine-tuning from pre-training is that it requires much less data. We are building on a base model that has already learned a lot of knowledge and basic language skills, and we are just taking it to the next level.

If you are coming from other machine learning fields, you may know fine-tuning from discriminative tasks, for example working with images and fine-tuning on ImageNet. The definition of fine-tuning here is a bit looser. For generative tasks it is not as well defined, because we are updating the weights of the entire model rather than only part of it, whereas updating only part is often the case when fine-tuning those other types of models.

The fine-tuning objective is the same as the pre-training objective: predict the next `token`. All we change is the data, which is now more structured in some way, so that the model becomes more consistent at outputting and emulating that structure. There are also more advanced methods that reduce how much of the model needs to be updated.

![5](../../figures/FLLM-3-5.png)

So what does fine-tuning do for us? We now have a sense of what it is, but what can we actually do with it? One big category is behavior change: we are changing the behavior of the model. For instance, we tell it that it is now in a chat setting. This makes the model more consistent in its responses and helps it focus better, which can be useful for moderation, for example. More generally, fine-tuning teases out the model's existing capabilities. Here, the model becomes better at conversation, so it can talk about a wide variety of things, whereas before we had to do a lot of prompt engineering to tease that information out.

Fine-tuning can also help the model acquire new knowledge. This might be about specific topics that were not covered in the base pre-trained model, or it might mean correcting information that was previously incorrect, for example when there is more up-to-date information that we want the model to absorb. Most commonly, we do both: we change the behavior and we want the model to acquire new knowledge.

![6](../../figures/FLLM-3-6.png)

A fine-tuning task can be to extract information from text or to expand text, such as writing an email based on the basic information we provide. The clearer we are about the task we want to accomplish, the better fine-tuning works.

![7](../../figures/FLLM-3-7.png)

If this is your first time fine-tuning a model, follow these steps:
1. Identify candidate tasks by prompt engineering a large model
2. Find tasks that the large model does reasonably well
3. Pick one task
4. Collect about 1,000 input-output pairs for that task, with outputs that are better than what the original large model produces
5. Use this data to fine-tune a small model

![8](../../figures/FLLM-3-8.png)

<a name='3.2'></a>
## 3.2 Experiment
Next, let's look at the difference between the data used for pre-training and the data used for fine-tuning:

In [1]:
# Import the corresponding library
import jsonlines
import itertools
import pandas as pd
# pprint() pretty-prints nested objects so the output is easier to read
from pprint import pprint

import datasets
from datasets import load_dataset

<a name='3.2.1'></a>
### 3.2.1 View the pre-training dataset

In [4]:
# Load the pre-training dataset from a local copy of one C4 shard, downloaded from
# https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-train.00000-of-01024.json.gz
pretrained_dataset = load_dataset("./c4-train.00000-of-01024.json/", "en", split="train", streaming=True)

In [5]:
# View the first n elements of the streamed dataset (an iterable)
n = 5
print("Pretrained dataset:")
# itertools.islice(iterable, stop) lazily yields only the first `stop` items
top_n = itertools.islice(pretrained_dataset, n)
for i in top_n:
  print(i)

Pretrained dataset:
{'text': 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.', 'timestamp': datetime.datetime(2019, 4, 25, 12, 57, 54), 'url': 'https://klyq.com/beginners-bbq-class-taking-place-in-missoula/'}
{'text': 'Discussion in \'Mac OS X Lion (10.7)\' started by axboi87, Jan 20, 2012.\nI\'
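
`itertools.islice` suits streaming datasets because it pulls items one at a time and stops after `n`, so the rest of the stream is never read. The sketch below shows this with a hypothetical endless generator, `fake_stream`, standing in for the streamed dataset.

```python
import itertools

def fake_stream():
    """Hypothetical stand-in for a streamed dataset: yields records forever."""
    for i in itertools.count():
        yield {"text": f"document {i}"}

# Only the first 3 records are ever produced; the generator is never exhausted.
first_three = list(itertools.islice(fake_stream(), 3))
print(first_three)
```

Calling `list()` directly on such a stream would never return, which is why the slice is taken first.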

<a name='3.2.2'></a>
### 3.2.2 Comparison with the company fine-tuning data we will use

In [9]:
# .jsonl is the common file extension for the JSON Lines format: one JSON object per line, which is well suited to storing and reading large amounts of structured data
# Read the .jsonl file and view it
filename = "lamini_docs.jsonl"
instruction_dataset_df = pd.read_json(filename, lines=True)
instruction_dataset_df

| | question | answer |
|---|---|---|
| 0 | What are the different types of documents avai... | Lamini has documentation on Getting Started, A... |
| 1 | What is the recommended way to set up and conf... | Lamini can be downloaded as a python package a... |
| 2 | How can I find the specific documentation I ne... | You can ask this model about documentation, wh... |
| 3 | Does the documentation include explanations of... | Our documentation provides both real-world and... |
| 4 | Does the documentation provide information abo... | External dependencies and libraries are all av... |
| ... | ... | ... |
| 1395 | What is Lamini and what is its collaboration w... | Lamini is a library that simplifies the proces... |
| 1396 | How does Lamini simplify the process of access... | Lamini simplifies data access in Databricks by... |
| 1397 | What are some of the key features provided by ... | Lamini automatically manages the infrastructur... |
| 1398 | How does Lamini ensure data privacy during the... | During the training process, Lamini ensures da... |


<a name='3.2.3'></a>
### 3.2.3 Different ways to format data

In [29]:
# Convert the DataFrame to a dict of columns, then concatenate the first question with its answer
examples = instruction_dataset_df.to_dict()
text = examples["question"][0] + examples["answer"][0]
text

"What are the different types of documents available in the repository (e.g., installation guide, API documentation, developer's guide)?Lamini has documentation on Getting Started, Authentication, Question Answer Model, Python Library, Batching, Error Handling, Advanced topics, and class documentation on LLM Engine available at https://lamini-ai.github.io/."

In [27]:
# Concatenate whichever input/output column pair the dataset provides
if "question" in examples and "answer" in examples:
  text = examples["question"][0] + examples["answer"][0]
elif "instruction" in examples and "response" in examples:
  text = examples["instruction"][0] + examples["response"][0]
elif "input" in examples and "output" in examples:
  text = examples["input"][0] + examples["output"][0]
else:
  text = examples["text"][0]
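
The same branching can be wrapped in a small helper so that it works for any row, not only row 0. This is a sketch: `concat_pair` and `PAIR_KEYS` are hypothetical names, and `examples` is assumed to have the column-to-{row: value} shape that `DataFrame.to_dict()` produces.

```python
# Candidate (input column, output column) pairs, checked in order.
PAIR_KEYS = [("question", "answer"), ("instruction", "response"), ("input", "output")]

def concat_pair(examples, row=0):
    """Concatenate the input and output text for one row."""
    for in_key, out_key in PAIR_KEYS:
        if in_key in examples and out_key in examples:
            return examples[in_key][row] + examples[out_key][row]
    # Fall back to a plain "text" column, as in the code above.
    return examples["text"][row]

# Tiny made-up dataset in the same shape as DataFrame.to_dict().
demo = {"instruction": {0: "Say hi."}, "response": {0: "Hi!"}}
print(concat_pair(demo))
```

Keeping the key pairs in a list means a new naming scheme can be supported by adding one tuple instead of another `elif` branch.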

In [30]:
# Construct a prompt template (including "question" and "answer")
prompt_template_qa = """### Question:
{question}

### Answer:
{answer}"""

In [31]:
# Fill the "question" and "answer" of the first example into the prompt template
question = examples["question"][0]
answer = examples["answer"][0]

text_with_prompt_template = prompt_template_qa.format(question=question, answer=answer)
text_with_prompt_template

"### Question:\nWhat are the different types of documents available in the repository (e.g., installation guide, API documentation, developer's guide)?\n\n### Answer:\nLamini has documentation on Getting Started, Authentication, Question Answer Model, Python Library, Batching, Error Handling, Advanced topics, and class documentation on LLM Engine available at https://lamini-ai.github.io/."

In [32]:
# Construct a prompt template that contains only the question
prompt_template_q = """### Question:
{question}

### Answer:"""

In [15]:
# For every example, fill the question and answer into both templates and collect the results
num_examples = len(examples["question"])
finetuning_dataset_text_only = []
finetuning_dataset_question_answer = []
for i in range(num_examples):
  question = examples["question"][i]
  answer = examples["answer"][i]

  text_with_prompt_template_qa = prompt_template_qa.format(question=question, answer=answer)
  finetuning_dataset_text_only.append({"text": text_with_prompt_template_qa})

  text_with_prompt_template_q = prompt_template_q.format(question=question)
  finetuning_dataset_question_answer.append({"question": text_with_prompt_template_q, "answer": answer})

In [33]:
# Print the first element of the text-only dataset
pprint(finetuning_dataset_text_only[0])

{'text': '### Question:\n'
         'What are the different types of documents available in the '
         "repository (e.g., installation guide, API documentation, developer's "
         'guide)?\n'
         '\n'
         '### Answer:\n'
         'Lamini has documentation on Getting Started, Authentication, '
         'Question Answer Model, Python Library, Batching, Error Handling, '
         'Advanced topics, and class documentation on LLM Engine available at '
         'https://lamini-ai.github.io/.'}


In [17]:
# Print the first element of the question-answer dataset
pprint(finetuning_dataset_question_answer[0])

{'answer': 'Lamini has documentation on Getting Started, Authentication, '
           'Question Answer Model, Python Library, Batching, Error Handling, '
           'Advanced topics, and class documentation on LLM Engine available '
           'at https://lamini-ai.github.io/.',
 'question': '### Question:\n'
             'What are the different types of documents available in the '
             'repository (e.g., installation guide, API documentation, '
             "developer's guide)?\n"
             '\n'
             '### Answer:'}
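
One possible way to read the two formats (an interpretation, not something stated above): the text-only record holds the full prompt plus answer as a single string, while the question-answer record keeps the prompt and the expected completion apart. The sketch below restates the two templates so that it runs on its own, uses a made-up question and answer, and checks that prompt plus answer reproduces the full text.

```python
prompt_template_qa = "### Question:\n{question}\n\n### Answer:\n{answer}"
prompt_template_q = "### Question:\n{question}\n\n### Answer:"

# Made-up example, not taken from the dataset.
question = "What is JSON Lines?"
answer = "A format with one JSON object per line."

full_text = prompt_template_qa.format(question=question, answer=answer)
prompt_only = prompt_template_q.format(question=question)

# The full text is the prompt followed by a newline and the answer.
print(full_text == prompt_only + "\n" + answer)
```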


<a name='3.2.4'></a>
### 3.2.4 Common ways to store data

In [35]:
# Store the question-answer dataset in .jsonl format: each line of the file is one JSON object
with jsonlines.open('lamini_docs_processed.jsonl', 'w') as writer:
    writer.write_all(finetuning_dataset_question_answer)
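
For reference, the same file layout can be produced with only the standard library, which makes the format explicit: each record is serialized with `json.dumps` and written on its own line. This is a sketch with made-up records and a temporary file, not a replacement for the `jsonlines` call above.

```python
import json
import os
import tempfile

records = [
    {"question": "### Question:\nQ1\n\n### Answer:", "answer": "A1"},
    {"question": "### Question:\nQ2\n\n### Answer:", "answer": "A2"},
]

path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")

# Write: one JSON object per line. Newlines inside strings are escaped by json.dumps.
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read back: parse each line independently.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == records)
```

Because every line parses on its own, a large file can be processed record by record without loading it all into memory.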

In [36]:
# If the data has already been uploaded to Hugging Face, it can be loaded by name as follows
finetuning_dataset_name = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_name)
print(finetuning_dataset)

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})
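
The row counts above (1260 and 140) are consistent with a 90/10 train/test split of 1400 examples. How the published dataset was actually split is not stated here; the sketch below shows one plain-Python way to produce a split of that shape. The helper name `split_examples`, the placeholder records, and the fixed seed are all assumptions for illustration.

```python
import random

def split_examples(items, test_fraction=0.1, seed=0):
    """Shuffle a copy of `items` and split off `test_fraction` as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

examples = [{"id": i} for i in range(1400)]  # placeholder records
train_set, test_set = split_examples(examples)
print(len(train_set), len(test_set))
```

Fixing the seed keeps the split reproducible, so evaluation numbers from different runs stay comparable.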
