# Chapter 5 Data Processing

- [I. Important factors in preparing training data](#I.-Important factors in preparing training data)
- [II. Data processing steps](#II.-Data processing steps)
- [2.1 Text Tokenization](#2.1-Text-Tokenization)
- [2.1.1 Tokenize a text](#2.1.1-Tokenize a text)
- [2.1.2 Tokenize multiple texts at a time](#2.1.2-Tokenize multiple texts at a time)
- [2.1.3 Padding and truncation](#2.1.3-Padding and truncation)
- [2.1.4 Prepare instruction data set](#2.1.4-Prepare instruction data set)
- [2.1.5 Tokenize a sample](#2.1.5-Tokenize a sample)
- [2.1.6 Tokenized instruction dataset](#2.1.6-Tokenized instruction dataset)
- [2.2 Splitting of test/training dataset](#2.2-Splitting of test/training dataset)
- [2.2.1 Some datasets for you to try](#2.2.1-Some datasets for you to try)

In this course, you will learn how to prepare training data to provide a solid foundation for your machine learning models. We will start with data collection and guide you step by step through the process of data preprocessing, tokenization, and model training. Let's get started!

## 1. Important factors in preparing training data

**1. Data Quality**
Data quality is the primary concern in data preparation. High-quality data can significantly improve model performance during fine-tuning and training, so make sure the inputs you provide are accurate and well formed. As the saying goes: garbage in, garbage out.

**2. Data Diversity**
Data diversity is another crucial factor. If the training data is too monotonous, the model may over-memorize and repeat outputs in similar situations. To avoid this, make sure the training data covers a variety of use cases and scenarios. Diverse datasets help models better understand different inputs and make more accurate predictions.

**3. Data Authenticity**
Generated data can work in some scenarios, but in most cases real data matters a great deal for training and fine-tuning. Generated data tends to follow fixed patterns, which can limit the creativity and adaptability of the model. Some services even try to detect machine-generated content by looking for exactly these patterns and regularities. Real data is usually more effective, especially for applications such as writing tasks.

**4. Data Quantity**
The amount of data also matters: more data usually helps a model generalize better and adapt to different situations. Pre-trained models alleviate the quantity problem to some extent, because they have already built a foundational understanding by pre-training on a large amount of data from the Internet. So more data helps, but quantity is less important than the first three factors, and definitely less important than quality.

In summary, data quality, diversity, authenticity, and quantity need to be considered when preparing training data. These factors will jointly affect the performance of the model and the quality of the output. Next, let's take a deep dive into how to effectively prepare training data.

## 2. Data processing steps

**1. Collect instruction-response pairs**
Collect question-answer pairs or instruction-response pairs from different sources, such as conversation records from users or existing question-answer datasets.

**2. Merge instruction pairs (add prompt templates if necessary)**
Merge the collected instruction and response pairs into a single dataset. In this step, you can add a prompt template to each instruction or response as needed to help the model better understand the context.

**3. Tokenization**
Convert the text into a numeric representation. A tokenizer splits the text into words or subwords and assigns each one a unique identifier, its token ID. This turns the text into a form the model can process. During tokenization, padding or truncation is usually also applied so that all texts in a batch have the same length.

**4. Dataset division**
Divide the processed dataset into a training set and a test set. The training set is used to train the parameters of the model, while the test set is used to evaluate the performance and generalization ability of the model.

`ing` is a very common subword: most gerunds contain it. For example, finetun**ing** and tokeniz**ing** both end in `ing`. In this example, `ing` is encoded as `278`.

When you decode the token IDs with the same tokenizer, they are restored to the original text.
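
The encode and decode round trip can be sketched with a toy tokenizer. This is only an illustration: the greedy longest-match splitting and every ID except `278` for `ing` are assumptions made for the example, and real tokenizers such as the one used later in this chapter learn their vocabulary from data.

```python
# Toy vocabulary: only the ID for "ing" (278) comes from the text above;
# every other entry is made up for illustration.
TOY_VOCAB = {"finetun": 1, "tokeniz": 2, "ing": 278}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def toy_encode(word):
    """Greedy longest-match split of `word` into known subwords, returned as IDs."""
    ids = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                start = end
                break
        else:
            raise ValueError(f"no toy subword matches {word[start:]!r}")
    return ids


def toy_decode(ids):
    """Map IDs back to subwords and join them into text."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)


for word in ["finetuning", "tokenizing"]:
    ids = toy_encode(word)
    print(word, "->", ids, "->", toy_decode(ids))
```

Both words share the ID `278` for their `ing` suffix, and decoding with the same vocabulary restores the original text.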

Each model is trained with a specific tokenizer. If you pair the model with the wrong tokenizer, the IDs it receives will stand for different letters and words than the ones it saw during training, producing confused and incorrect results. It is therefore crucial to use the tokenizer that matches the model.

![tokenizing.png](../../figures/tokenizing.png)
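
To see why a mismatched tokenizer causes trouble, consider the following sketch. The two vocabularies are hypothetical and exist only for this example; they contain the same subwords but number them differently.

```python
# Two hypothetical tokenizers that assign different IDs to the same subwords.
vocab_a = {"Hi": 0, ",": 1, " how": 2, " are": 3, " you": 4, "?": 5}
vocab_b = {"?": 0, " you": 1, "Hi": 2, ",": 3, " are": 4, " how": 5}


def encode(pieces, vocab):
    """Look up the ID of each subword in the given vocabulary."""
    return [vocab[piece] for piece in pieces]


def decode(ids, vocab):
    """Invert the vocabulary and join the subwords back into text."""
    id_to_piece = {token_id: piece for piece, token_id in vocab.items()}
    return "".join(id_to_piece[token_id] for token_id in ids)


pieces = ["Hi", ",", " how", " are", " you", "?"]
ids = encode(pieces, vocab_a)

print("Same tokenizer:     ", decode(ids, vocab_a))
print("Different tokenizer:", decode(ids, vocab_b))
```

The IDs are identical in both calls, yet only the matching vocabulary recovers the original sentence. A model fed IDs from the wrong tokenizer is in the same position as the second `decode` call.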

In [1]:
import pandas as pd
import datasets

from pprint import pprint
from transformers import AutoTokenizer

The Hugging Face Transformers library is a powerful and popular natural language processing library. You only need to specify the name of the model you want, and `AutoTokenizer` finds the tokenizer that matches it.

### 2.1 Text Tokenization

Here we use the tokenizer that corresponds to the 70M-parameter Pythia model.

In [52]:
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

#### 2.1.1 Tokenizing a Text

In [3]:
text = "Hi, how are you?"

In [4]:
encoded_text = tokenizer(text)["input_ids"]
print(encoded_text)

[12764, 13, 849, 403, 368, 32]


In [44]:
text = "嗨你好么"

In [45]:
encoded_text = tokenizer(text)["input_ids"]
print(encoded_text)

[161, 234, 103, 24553, 34439, 43244]


The tokenizer encodes the text as a sequence of numbers.
Its output is a dictionary whose `input_ids` entry holds the token IDs.

Now decode the IDs back to text.

In [7]:
decoded_text = tokenizer.decode(encoded_text)
print("Decoded tokens back into text: ", decoded_text)

Decoded tokens back into text:  Hi, how are you?


In [46]:
decoded_text = tokenizer.decode(encoded_text)
print("Decoded tokens back into text: ", decoded_text)

Decoded tokens back into text:  嗨你好么


As you can see, the decoded text matches the original.

#### 2.1.2 Tokenizing multiple texts at once

Sometimes we need to tokenize multiple texts at once.

For batch input, we can pass a list of texts to the tokenizer.

In [9]:
list_texts = ["Hi, how are you?", "I'm good", "Yes"]
encoded_texts = tokenizer(list_texts)
print("Encoded several texts: ", encoded_texts["input_ids"])

Encoded several texts:  [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175], [4374]]


In [47]:
list_texts = ["嗨你好么", "我很好", "是"]
encoded_texts = tokenizer(list_texts)
print("Encoded several texts: ", encoded_texts["input_ids"])

Encoded several texts:  [[161, 234, 103, 24553, 34439, 43244], [15367, 45091, 34439], [12105]]


As you can see, the tokenizer returns sequences of different lengths for texts of different lengths.

#### 2.1.3 Padding and truncation

The model needs to process fixed-size tensors, so everything in a batch must be the same length.
Padding is a strategy for dealing with these variable-length encoded texts.

When padding, you need to choose a specific token ID to serve as the padding token. Here we use `0`, which is also the ID of this tokenizer's end-of-sequence token.

So when we call the tokenizer with `padding=True`, the `Yes` string gets several `0`s on the right so that it has the same length as `Hi, how are you?`.

In [11]:
tokenizer.pad_token = tokenizer.eos_token
encoded_texts_longest = tokenizer(list_texts, padding=True)
print("Using padding: ", encoded_texts_longest["input_ids"])

Using padding:  [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175, 0, 0, 0], [4374, 0, 0, 0, 0, 0]]


In [48]:
tokenizer.pad_token = tokenizer.eos_token
encoded_texts_longest = tokenizer(list_texts, padding=True)
print("Using padding: ", encoded_texts_longest["input_ids"])

Using padding:  [[161, 234, 103, 24553, 34439, 43244], [15367, 45091, 34439, 0, 0, 0], [12105, 0, 0, 0, 0, 0]]
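
What padding does can be sketched in plain Python. The helper `pad_batch` below is hypothetical and is not the library's implementation. It pads on the right to the longest sequence and also builds an attention mask, the `attention_mask` entry that the tokenizer returns alongside `input_ids`. The mask matters here because the padding ID `0` is also the end-of-sequence ID, so the IDs alone cannot tell real tokens from padding.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every sequence to the longest one and mark real tokens with 1."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs taken from the English example above.
batch = [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175], [4374]]
padded = pad_batch(batch)

print(padded["input_ids"])
print(padded["attention_mask"])
```

Every row now has six entries, and the mask records which of them are padding.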


The model has a maximum input length, the longest sequence of tokens it can accept, so it cannot process text of arbitrary length.

You may have run into a length limit when writing prompts earlier. Truncation is the strategy for handling it: the encoded text is cut down to a length the model can actually accept.

Truncation trims overly long text to a length the model can process. It also helps shorten processing time, because longer sequences take longer to process.

Here we set the maximum length to 3 and enable truncation (`truncation=True`). You can see that `Hi, how are you?` becomes much shorter: everything to the right of the third token is removed.

In [12]:
encoded_texts_truncation = tokenizer(list_texts, max_length=3, truncation=True)
print("Using truncation: ", encoded_texts_truncation["input_ids"])

Using truncation:  [[12764, 13, 849], [42, 1353, 1175], [4374]]


In [49]:
encoded_texts_truncation = tokenizer(list_texts, max_length=3, truncation=True)
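print("Using truncation: ", encoded_texts_truncation["input_ids"])

Using truncation:  [[24553, 34439, 43244], [15367, 45091, 34439], [12105]]

Note that this output keeps the last three tokens of the first text rather than the first three. Judging by the cell numbers, this cell was run after `tokenizer.truncation_side` had been set to `"left"`, so it reflects left-side truncation.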


In practice, the right-hand side is sometimes the part worth keeping. For example, when you are writing an article and the prompt sits at the end of a long text, the most recent content carries the information needed to keep the context coherent. In that case it is more appropriate to truncate on the left. The right choice depends on the problem you are solving and the results you want.

In [13]:
tokenizer.truncation_side = "left"
encoded_texts_truncation_left = tokenizer(list_texts, max_length=3, truncation=True)
print("Using left-side truncation: ", encoded_texts_truncation_left["input_ids"])

Using left-side truncation:  [[403, 368, 32], [42, 1353, 1175], [4374]]


In [50]:
tokenizer.truncation_side = "left"
encoded_texts_truncation_left = tokenizer(list_texts, max_length=3, truncation=True)
print("Using left-side truncation: ", encoded_texts_truncation_left["input_ids"])

Using left-side truncation:  [[24553, 34439, 43244], [15367, 45091, 34439], [12105]]
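
The difference between the two truncation sides can be sketched with list slicing. The helper `truncate` is hypothetical and only mirrors the behaviour visible in the outputs above.

```python
def truncate(ids, max_length, side="right"):
    """Keep at most `max_length` IDs, dropping tokens from the given side."""
    if len(ids) <= max_length:
        return list(ids)
    if side == "right":
        # Drop tokens from the right: keep the beginning of the text.
        return ids[:max_length]
    if side == "left":
        # Drop tokens from the left: keep the end of the text.
        return ids[len(ids) - max_length:]
    raise ValueError("side must be 'right' or 'left'")


# Token IDs of "Hi, how are you?" from the example above.
ids = [12764, 13, 849, 403, 368, 32]

print("right:", truncate(ids, 3, side="right"))
print("left: ", truncate(ids, 3, side="left"))
```

Right-side truncation keeps the first three IDs and left-side truncation keeps the last three, matching the English outputs shown earlier. Sequences that already fit are returned unchanged.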


In practice, we often use padding and truncation together by setting both `truncation` and `padding` to `True`. You can see that longer texts are truncated to three tokens and shorter ones are padded with `0` up to three.

In [14]:
encoded_texts_both = tokenizer(list_texts, max_length=3, truncation=True, padding=True)
print("Using both padding and truncation: ", encoded_texts_both["input_ids"])

Using both padding and truncation:  [[403, 368, 32], [42, 1353, 1175], [4374, 0, 0]]


In [51]:
encoded_texts_both = tokenizer(list_texts, max_length=3, truncation=True, padding=True)
print("Using both padding and truncation: ", encoded_texts_both["input_ids"])

Using both padding and truncation:  [[24553, 34439, 43244], [15367, 45091, 34439], [12105, 0, 0]]


#### 2.1.4 Prepare instruction data set

Here is some code from the previous experiment.

It loads a dataset file with "question" and "answer" fields and inserts each question into a prompt template.

The output below shows one data point with its "question" and "answer".

We will then run the tokenizer on one of these data points.

In [15]:
import pandas as pd

filename = "lamini_docs.jsonl"
instruction_dataset_df = pd.read_json(filename, lines=True)
examples = instruction_dataset_df.to_dict()

if "question" in examples and "answer" in examples:
  text = examples["question"][0] + examples["answer"][0]
elif "instruction" in examples and "response" in examples:
  text = examples["instruction"][0] + examples["response"][0]
elif "input" in examples and "output" in examples:
  text = examples["input"][0] + examples["output"][0]
else:
  text = examples["text"][0]

prompt_template = """### Question:
{question}

### Answer: """

num_examples = len(examples["question"])
finetuning_dataset = []
for i in range(num_examples):
  question = examples["question"][i]
  answer = examples["answer"][i]
  text_with_prompt_template = prompt_template.format(question=question)
  finetuning_dataset.append({"question": text_with_prompt_template, "answer": answer})

from pprint import pprint
print("One datapoint in the finetuning dataset:")  # one data point from the fine-tuning dataset
pprint(finetuning_dataset[0])

One datapoint in the finetuning dataset:
{'answer': 'Lamini has documentation on Getting Started, Authentication, '
           'Question Answer Model, Python Library, Batching, Error Handling, '
           'Advanced topics, and class documentation on LLM Engine available '
           'at https://lamini-ai.github.io/.',
 'question': '### Question:\n'
             'What are the different types of documents available in the '
             'repository (e.g., installation guide, API documentation, '
             "developer's guide)?\n"
             '\n'
             '### Answer:'}


#### 2.1.5 Tokenizing a Sample

First, the question is concatenated with the answer, and the result is passed to the tokenizer.

For simplicity, the token IDs are returned as a NumPy array, with padding enabled.

In [16]:
text = finetuning_dataset[0]["question"] + finetuning_dataset[0]["answer"]
tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    padding=True
)
print(tokenized_inputs["input_ids"])

[[ 4118 19782    27   187  1276   403   253  1027  3510   273  7177  2130
    275   253 18491   313    70    15    72   904 12692  7102    13  8990
  10097    13 13722   434  7102  6177   187   187  4118 37741    27    45
   4988    74   556 10097   327 27669 11075   264    13  5271 23058    13
  19782 37741 10031    13 13814 11397    13   378 16464    13 11759 10535
   1981    13 21798 12989    13   285   966 10097   327 21708    46 10797
   2130   387  5987  1358    77  4988    74    14  2284    15  7280    15
    900 14206]]


You do not know in advance how many tokens a text will produce.

Therefore, set the effective maximum length to the smaller of the configured maximum length and the actual token length.

Alternatively, you can always pad to the configured maximum length.

In [17]:
max_length = 2048
max_length = min(
    tokenized_inputs["input_ids"].shape[1],
    max_length,
)

Then, it is tokenized again and truncated to a maximum length.

In [18]:
tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    truncation=True,
    max_length=max_length
)

In [19]:
print(tokenized_inputs["input_ids"])

[[ 4118 19782    27   187  1276   403   253  1027  3510   273  7177  2130
    275   253 18491   313    70    15    72   904 12692  7102    13  8990
  10097    13 13722   434  7102  6177   187   187  4118 37741    27    45
   4988    74   556 10097   327 27669 11075   264    13  5271 23058    13
  19782 37741 10031    13 13814 11397    13   378 16464    13 11759 10535
   1981    13 21798 12989    13   285   966 10097   327 21708    46 10797
   2130   387  5987  1358    77  4988    74    14  2284    15  7280    15
    900 14206]]


#### 2.1.6 Tokenized instruction dataset

Wrapping the above process into a function makes it easy to run it on the entire dataset.

In [20]:
def tokenize_function(examples):
    """
    Tokenize the input, apply padding and truncation, and return the resulting tokens.

    Args:
        examples (dict): Data to tokenize. It can be a dict with "question" and "answer" keys, a dict with "input" and "output" keys, or a dict with a "text" key.

    Returns:
        dict: The tokenized input after padding and truncation.
    """

    # Merge instruction pairs
    if "question" in examples and "answer" in examples:
      text = examples["question"][0] + examples["answer"][0]
    elif "input" in examples and "output" in examples:
      text = examples["input"][0] + examples["output"][0]
    else:
      text = examples["text"][0]

    # Tokenization
    tokenizer.pad_token = tokenizer.eos_token
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        padding=True,
    )

    max_length = min(
        tokenized_inputs["input_ids"].shape[1],
        2048
    )
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=max_length
    )

    return tokenized_inputs

Now we load the dataset.

We apply the tokenization function to the dataset with the `map` method.

We set `batched=True` with `batch_size=1`, so the function is called on batches that each contain a single example.

Setting `drop_last_batch` to `True` avoids mixed-size inputs: when the dataset length is not a multiple of `batch_size`, the last batch is smaller than `batch_size`, and this option drops it.

In [30]:
finetuning_dataset_loaded = datasets.load_dataset("json", data_files=filename, split="train")

tokenized_dataset = finetuning_dataset_loaded.map(
    tokenize_function,
    batched=True,
    batch_size=1,
    drop_last_batch=True
)

print(tokenized_dataset)

Dataset({
    features: ['question', 'answer', 'input_ids', 'attention_mask'],
    num_rows: 1400
})
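
With `batched=True`, `map` passes the function a dictionary that maps each column name to a list of values, which is why `tokenize_function` reads `examples["question"][0]`. The sketch below imitates that batching in plain Python. The helper `iter_batches` and the toy data are hypothetical and are not part of the `datasets` library.

```python
def iter_batches(columns, batch_size, drop_last_batch=False):
    """Yield column-oriented slices of a dataset stored as a dict of equal-length lists."""
    num_rows = len(next(iter(columns.values())))
    for start in range(0, num_rows, batch_size):
        end = start + batch_size
        if end > num_rows and drop_last_batch:
            # The final batch would be smaller than batch_size, so skip it.
            break
        yield {name: values[start:end] for name, values in columns.items()}


toy_dataset = {
    "question": ["Q1", "Q2", "Q3"],
    "answer": ["A1", "A2", "A3"],
}

# batch_size=1: every batch holds exactly one example, wrapped in a list.
for batch in iter_batches(toy_dataset, batch_size=1):
    print(batch["question"][0], "->", batch["answer"][0])

# batch_size=2 with drop_last_batch=True: the incomplete last batch is dropped.
print(list(iter_batches(toy_dataset, batch_size=2, drop_last_batch=True)))
```

With a batch size of 1 there is never an incomplete batch, so `drop_last_batch=True` only changes the result for larger batch sizes.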


Add a `labels` column, copied from `input_ids`, so the model has targets to learn from.

In [22]:
tokenized_dataset = tokenized_dataset.add_column("labels", tokenized_dataset["input_ids"])
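
Copying `input_ids` into `labels` works because, in causal language modeling, the target at each position is simply the next token. Hugging Face causal language models are generally expected to apply this one-position shift internally, but treat that as an assumption to check against the documentation of the model you use. The sketch below, with a hypothetical helper `next_token_pairs`, shows which token is the target at each step.

```python
def next_token_pairs(input_ids):
    """Pair each token ID with the ID that follows it (its prediction target)."""
    return list(zip(input_ids[:-1], input_ids[1:]))


# Token IDs of "Hi, how are you?" from the earlier example.
ids = [12764, 13, 849, 403, 368, 32]

for current_id, target_id in next_token_pairs(ids):
    print(f"after token {current_id}, the target is {target_id}")
```

A sequence of six tokens yields five training targets, since the last token has no successor.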

### 2.2 Splitting of test/training datasets

The `train_test_split` call below sets aside 10% of the data as the test set (`test_size=0.1`).

You can change this value depending on the size of your dataset.

`shuffle=True` randomizes the order of the dataset before splitting, and `seed` makes the shuffle reproducible.

In the output you can see that the dataset has been split into training and test sets.

In [23]:
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, shuffle=True, seed=123)
print(split_dataset)

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})
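
The idea behind a shuffled split can be sketched with the standard library. The helper `train_test_split_indices` is hypothetical and does not reproduce the exact rows that `train_test_split` selects for the same seed. It only shows how shuffling with a fixed seed and slicing produce a reproducible 90/10 split.

```python
import random


def train_test_split_indices(num_rows, test_size=0.1, seed=123):
    """Shuffle row indices with a fixed seed and slice off the test portion."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    num_test = round(num_rows * test_size)
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = train_test_split_indices(1400)
print(len(train_idx), len(test_idx))  # 1260 140
```

Calling the function again with the same seed returns the same split, which is what makes an experiment repeatable.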


#### 2.2.1 Some datasets for you to try

The data we use can also be loaded directly from Hugging Face.

It is a specialized dataset about one company, perhaps similar to data from your own company.

You can adapt it to your needs.

In [29]:
finetuning_dataset_path = "lamini/lamini_docs"
finetuning_dataset = datasets.load_dataset(finetuning_dataset_path)
print(finetuning_dataset)

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})


If you find this dataset a bit boring, we provide some more interesting ones to choose from:
1. A dataset about Taylor Swift (`taylor_swift`).
2. A dataset about the popular band BTS (`bts`).
3. A dataset about actual open-source large language models (`open_llms`).

In [25]:
taylor_swift_dataset = "lamini/taylor_swift"
bts_dataset = "lamini/bts"
open_llms = "lamini/open_llms"

Now let's look at one data point from the `taylor_swift` dataset.

These datasets are also available through Hugging Face.

In [28]:
dataset_swiftie = datasets.load_dataset(taylor_swift_dataset)
print(dataset_swiftie["train"][1])

{'question': 'What is the most popular Taylor Swift song among millennials? How does this song relate to the millennial generation? What is the significance of this song in the millennial culture?', 'answer': 'Taylor Swift\'s "Shake It Off" is the most popular song among millennials. This song relates to the millennial generation as it is an anthem of self-acceptance and embracing one\'s individuality. The song\'s message of not letting others bring you down and to just dance it off resonates with the millennial culture, which is often characterized by a strong sense of individuality and a rejection of societal norms. Additionally, the song\'s upbeat and catchy melody makes it a perfect fit for the millennial generation, which is known for its love of pop music.', 'input_ids': [1276, 310, 253, 954, 4633, 11276, 24619, 4498, 2190, 24933, 8075, 32, 1359, 1057, 436, 4498, 14588, 281, 253, 24933, 451, 5978, 32, 1737, 310, 253, 8453, 273, 436, 4498, 275, 253, 24933, 451, 4466, 32, 37979, 24

In [27]:
# Here is how to push your own dataset to the Hugging Face Hub
# !pip install huggingface_hub
# !huggingface-cli login
# split_dataset.push_to_hub(dataset_path_hf)

We have prepared all the data and tokenized it. In the following experiments, we will use this data to train our model.