# Data Preparation for Pre-training LLMs

Data preparation is one of the most important steps in pre-training a Large Language Model (LLM). The quality of the model depends heavily on the quality of the data used.

### Key Steps

1. **Data Collection**
   - Gather large amounts of text from sources like books, websites, research papers, and code.

2. **Cleaning**
   - Remove duplicates, spam, corrupted text, and irrelevant content.

3. **Filtering**
   - Keep high-quality text and remove low-quality or toxic data.

4. **Deduplication**
   - Eliminate repeated data so the model does not overfit.

5. **Formatting**
   - Convert data into a consistent structure suitable for training.

6. **Privacy Redaction**
   - Remove Personally Identifiable Information (PII).

7. **Tokenization**
   - Break text into tokens so the model can process it efficiently.

Better data leads to better models. Well-prepared datasets improve learning efficiency, reduce bias, and enhance overall model performance.
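The privacy step in the list above can be sketched with a couple of regular expressions. This is a minimal illustration, assuming that email addresses and phone-like digit runs are the only PII of interest; the helper name `redact_pii`, both patterns, and the placeholder strings are hypothetical, and real pipelines usually need much broader coverage (names, addresses, IDs).

```python
import re

# Illustrative patterns only (assumptions, not a complete PII detector).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


sample = "Contact Jane at jane.doe@example.com or +1 (555) 010-9999."
print(redact_pii(sample))
```

Replacing matches with a placeholder, rather than deleting them, keeps the sentence structure intact for training.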

In [6]:
import warnings
warnings.filterwarnings("ignore")

## Sourcing datasets for pretraining
In this section, you'll see two ways to source data for training:

1. Download an existing dataset from Hugging Face
2. Create a dataset of Python scripts sourced from GitHub

In both cases, the result is a Hugging Face `Dataset` object from the Datasets library.

In [7]:
from datasets import load_dataset
import datasets

dataset = load_dataset("roneneldan/TinyStories", split="train")



{'text': 'One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.\n\nLily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."\n\nTogether, they shared the needle and sewed the button on Lily\'s shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.'}


In [9]:
print(dataset)

Dataset({
    features: ['text'],
    num_rows: 2119719
})


In [10]:
print(dataset['text'])

Column(['One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.\n\nLily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."\n\nTogether, they shared the needle and sewed the button on Lily\'s shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.', 'Once upon a time, there was a little car named Beep. Beep loved to go fast and play in the sun. Beep was a healthy car because he always had good fuel. Good fuel made Beep happy and strong.\n\nOne day, Beep was driving in the park when he saw a big tree. The tree had many leaves t

In [11]:
pretraining_dataset = dataset.select_columns(
    ['text']
    )

In [12]:
print(pretraining_dataset[4]["text"][:100])

Once upon a time, there was a little girl named Lily. Lily liked to pretend she was a popular prince


In [13]:
pretraining_dataset.shape

(2119719, 1)

## Comparing Pre-training and Fine-tuning Datasets

| Aspect        | Pre-training Dataset                                | Fine-tuning Dataset                                            |
| ------------- | --------------------------------------------------- | -------------------------------------------------------------- |
| **Purpose**   | Teach the model general language understanding      | Adapt the model to a specific task or behavior                 |
| **Size**      | Extremely large (millions to trillions of tokens)   | Much smaller (thousands to millions of tokens)                 |
| **Data Type** | Broad, diverse internet text, books, articles, code | Task-specific examples like instructions, Q&A, or labeled data |
| **Structure** | Mostly unstructured raw text                        | Structured input–output pairs                                  |
| **Goal**      | Learn grammar, facts, reasoning patterns            | Improve performance on a particular application                |
| **Examples**  | Web crawl data, Wikipedia, code repositories        | Chat datasets, customer support logs, classification labels    |


#### Simple Way to Think About It

* **Pre-training:** Learning the language of the world.
* **Fine-tuning:** Learning how to perform a specific job.

### Example

**Pre-training data**

```
The solar system consists of the Sun and the objects that orbit it...
```

**Fine-tuning data**

```
Instruction: Explain the solar system simply.
Response: The solar system is the Sun and all the planets and objects that move around it.
```


In [14]:
instruction_dataset = datasets.load_dataset(
    "c-s-ale/alpaca-gpt4-data",
    split="train"
)

In [15]:
print(instruction_dataset)

Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 52002
})


In [16]:
i=0
print("Instruction: " + instruction_dataset[i]["instruction"]
      + "\nInput: " + instruction_dataset[i]["input"]
      + "\nOutput: " + instruction_dataset[i]["output"])

Instruction: Give three tips for staying healthy.
Input: 
Output: 1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.


Notice that, in contrast to the pretraining data, which is just raw text, fine-tuning datasets are structured as question-answer pairs or instruction-response sets, optionally with additional input context.
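To make the structural difference concrete, here is a small sketch of how one such record might be flattened into a single training string. The field names `instruction`, `input`, and `output` match the dataset printed above; the helper `format_example` and the exact template are assumptions for illustration, not the format any particular trainer requires.

```python
def format_example(example: dict) -> str:
    """Render one instruction record as a single training string."""
    parts = [f"Instruction: {example['instruction']}"]
    # The optional context is often an empty string, so add it only when present.
    if example.get("input"):
        parts.append(f"Input: {example['input']}")
    parts.append(f"Response: {example['output']}")
    return "\n".join(parts)


record = {
    "instruction": "Explain the solar system simply.",
    "input": "",
    "output": "The solar system is the Sun and all the planets and objects that move around it.",
}
print(format_example(record))
```

Skipping the empty `Input:` line avoids teaching the model to emit a blank field for instructions that need no extra context.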


## Scrape Python code from GitHub

Here, we download a selection of Python scripts from GitHub and prepare them as a Hugging Face `Dataset` object to use in training.

The same pattern works for preparing any text scraped from the web.

In [17]:
# Import some required packages
import os
import requests

# Path to directory to store python scripts
code_dir = "./code"
os.makedirs(code_dir, exist_ok=True)

In [18]:
urls = [
    "https://raw.githubusercontent.com/TheAlgorithms/Python/master/searches/double_linear_search_recursion.py",
    "https://raw.githubusercontent.com/KosingZhu/tensorflow/master/tensorflow/python/tools/module_util.py",
    "https://raw.githubusercontent.com/EricRemmerswaal/tensorflow/master/tensorflow/python/distribute/distribute_coordinator_context.py",
    "https://raw.githubusercontent.com/computationalartist/tensorflow/master/tensorflow/python/ops/numpy_ops/integration_test/benchmarks/numpy_mlp.py",
    "https://raw.githubusercontent.com/Van-an/tensorflow/master/tensorflow/python/distribute/coordinator/values.py",
    "https://raw.githubusercontent.com/nkgwer/tensorflow/master/tensorflow/lite/tools/visualize.py",
    "https://raw.githubusercontent.com/gitblazer/youtube-dl/master/youtube_dl/version.py",
    "https://raw.githubusercontent.com/Joshua-Barawa/My-Photos/master/venv/lib/python3.8/site-packages/django/contrib/messages/__init__.py",
    "https://raw.githubusercontent.com/PaliC/pytorch/master/test/fx/test_subgraph_rewriter.py"
]

In [19]:
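for url in urls:
  print(f'Working on url: {url}')
  response = requests.get(url, timeout=30)
  # Fail loudly on 404s and other HTTP errors instead of saving an error page as code
  response.raise_for_status()

  file_name = os.path.basename(url)
  file_path = os.path.join(code_dir, file_name)

  with open(file_path, 'wb') as f:
    f.write(response.content)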

Working on url: https://raw.githubusercontent.com/TheAlgorithms/Python/master/searches/double_linear_search_recursion.py
Working on url: https://raw.githubusercontent.com/KosingZhu/tensorflow/master/tensorflow/python/tools/module_util.py
Working on url: https://raw.githubusercontent.com/EricRemmerswaal/tensorflow/master/tensorflow/python/distribute/distribute_coordinator_context.py
Working on url: https://raw.githubusercontent.com/computationalartist/tensorflow/master/tensorflow/python/ops/numpy_ops/integration_test/benchmarks/numpy_mlp.py
Working on url: https://raw.githubusercontent.com/Van-an/tensorflow/master/tensorflow/python/distribute/coordinator/values.py
Working on url: https://raw.githubusercontent.com/nkgwer/tensorflow/master/tensorflow/lite/tools/visualize.py
Working on url: https://raw.githubusercontent.com/gitblazer/youtube-dl/master/youtube_dl/version.py
Working on url: https://raw.githubusercontent.com/Joshua-Barawa/My-Photos/master/venv/lib/python3.8/site-packages/djan
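One caveat with naming files by `os.path.basename(url)`: only the last path component is kept, so two URLs that end in the same file name (for example, two different `__init__.py` files) would overwrite each other on disk. The sketch below shows one possible way around this, using a hypothetical helper `url_to_filename` and made-up `example.com` URLs; it is an illustration, not part of the original notebook.

```python
import os
from urllib.parse import urlparse


def url_to_filename(url: str) -> str:
    """Build a flat file name that encodes the full URL path."""
    path = urlparse(url).path.strip("/")
    # Replace path separators so the whole path fits into one file name.
    return path.replace("/", "__")


# Hypothetical URLs that share the same base name.
url_a = "https://example.com/org/repo_a/pkg/__init__.py"
url_b = "https://example.com/org/repo_b/pkg/__init__.py"

print(os.path.basename(url_a), os.path.basename(url_b))  # identical names
print(url_to_filename(url_a))
print(url_to_filename(url_b))
```

The trade-off is longer file names; for very deep paths a hash of the URL could be used instead.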

### Retrieve the Python scripts

In [20]:
files = os.listdir(code_dir)
for file in files:
  print(file)

module_util.py
values.py
distribute_coordinator_context.py
numpy_mlp.py
double_linear_search_recursion.py
version.py
visualize.py
test_subgraph_rewriter.py
__init__.py


### Concatenate scripts into a list

In [21]:
code_dataset = []
for file in os.listdir(code_dir):
  # Use a context manager so each file handle is closed after reading
  with open(os.path.join(code_dir, file), 'r', encoding='utf-8') as f:
    code_dataset.append({'text': f.read()})

Convert list to Hugging Face Dataset object:

In [22]:
code_dataset = datasets.Dataset.from_list(code_dataset)
print(code_dataset)

Dataset({
    features: ['text'],
    num_rows: 9
})


Combine the Python code dataset with the pretraining dataset you downloaded above:

In [23]:
dataset = datasets.concatenate_datasets(
    [pretraining_dataset, code_dataset]
)

In [24]:
print(dataset)

Dataset({
    features: ['text'],
    num_rows: 2119728
})


## Data Cleaning

### Remove examples that are too short

In [25]:
import heapq

def paragraph_length_filter(x):
  """Returns False iff a page has too few lines or lines are too short."""
  lines = x['text'].split('\n')
  # Require at least 3 lines, and the 3 longest lines must each have >= 3 characters
  if (
      len(lines) < 3
      or min(heapq.nlargest(3, [len(line) for line in lines])) < 3
  ):
    return False
  return True

In [26]:
dataset = dataset.filter(
    paragraph_length_filter,
    load_from_cache_file = False
    )

In [27]:
print(dataset.num_rows)

2000809
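To see which kinds of text this filter keeps, the sketch below re-declares the same logic and runs it on a few hand-written strings (the sample strings are made up for illustration). Note that lines are split on single `\n` characters, so a blank line between two paragraphs counts as a line of length 0.

```python
import heapq


def paragraph_length_filter(x):
    """Same rule as above: at least 3 lines, and the 3 longest have >= 3 chars."""
    lines = x["text"].split("\n")
    if len(lines) < 3:
        return False
    return min(heapq.nlargest(3, [len(line) for line in lines])) >= 3


samples = {
    "single line": "A story with no line breaks at all.",
    "three tiny lines": "a\nb\nc",
    "two paragraphs": "First paragraph.\n\nSecond paragraph.",
    "three lines": "First line.\nSecond line.\nThird line.",
}

for name, text in samples.items():
    print(f"{name}: {paragraph_length_filter({'text': text})}")
```

Only the last sample passes. The two-paragraph sample is rejected because its three lines are the two paragraphs plus the empty line between them, and that empty line is shorter than 3 characters.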


## Remove repeated text within each training example

In [38]:
def find_duplicates(paragraphs):

  unique_x = set()
  duplicate_chars = 0
  duplicate_elements = 0
  for element in paragraphs:
    if element in unique_x:
      duplicate_chars += len(element)
      duplicate_elements += 1
    else:
      unique_x.add(element)

  return duplicate_elements, duplicate_chars

In [41]:
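import re

def paragraph_repetiton_filter(x):
  """Returns False if too much of the text consists of repeated paragraphs."""
  text = x['text']
  # Split on blank lines. The quantifier must be written {2,} without a space;
  # with a space, re treats the braces as literal characters and never splits.
  paragraphs = re.compile(r'\n{2,}').split(text.strip())
  paragraphs_duplicates, char_duplicates = find_duplicates(paragraphs)
  if paragraphs_duplicates / len(paragraphs) > 0.3:
    return False
  if char_duplicates / len(text) > 0.2:
    return False
  return True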

In [42]:
dataset = dataset.filter(
    paragraph_repetiton_filter,
    load_from_cache_file = False
)

In [None]:
dataset.num_rows
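As a worked example of the two ratios, the self-contained sketch below computes them for one made-up string. It assumes paragraphs are separated by two or more newlines, written as `\n{2,}`; in Python's `re`, a quantifier written with a space such as `{2, }` is matched as literal characters instead. The helper name `duplicate_ratios` is hypothetical.

```python
import re

PARAGRAPH_SPLIT = re.compile(r"\n{2,}")  # two or more consecutive newlines


def duplicate_ratios(text: str):
    """Return (share of repeated paragraphs, share of repeated characters)."""
    paragraphs = PARAGRAPH_SPLIT.split(text.strip())
    seen = set()
    dup_paragraphs = 0
    dup_chars = 0
    for paragraph in paragraphs:
        if paragraph in seen:
            dup_paragraphs += 1
            dup_chars += len(paragraph)
        else:
            seen.add(paragraph)
    return dup_paragraphs / len(paragraphs), dup_chars / len(text)


text = "Buy now!\n\nBuy now!\n\nBuy now!\n\nA real sentence."
paragraph_ratio, char_ratio = duplicate_ratios(text)
print(paragraph_ratio, round(char_ratio, 3))
```

Here two of the four paragraphs are repeats (ratio 0.5) and 16 of the 46 characters belong to repeated paragraphs, so the text exceeds both the 0.3 and the 0.2 thresholds and would be dropped.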

## Deduplication

Remove duplicate examples from the entire dataset.

In [None]:
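def deduplication(ds):
    # Texts seen so far; dedup_func closes over this set.
    unique_text = set()

    def dedup_func(x):
        if x['text'] in unique_text:
            return False
        unique_text.add(x['text'])
        return True

    # num_proc=1 keeps a single shared set; with several worker processes,
    # each would hold its own copy and duplicates across workers would survive.
    return ds.filter(dedup_func, load_from_cache_file=False, num_proc=1)

dataset = deduplication(dataset)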

In [44]:
dataset.num_rows

2000809
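Keeping every full text in a Python set means holding a second copy of the corpus in memory. A possible variation, sketched below as an assumption and not as the notebook's method, stores a fixed-size SHA-256 digest per example instead. The helper `dedup_by_hash` is hypothetical and works on a plain list of strings to stay self-contained.

```python
import hashlib


def dedup_by_hash(texts):
    """Keep the first occurrence of each exact text, tracking 32-byte digests."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


print(dedup_by_hash(["a story", "another story", "a story"]))
```

Like the set-based version, this only catches exact matches; texts that differ by a single character are treated as distinct.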

## Quality Filter - Language

Remove any text examples that are in a language other than English.

In [66]:
!pip install lingua-language-detector

Successfully installed lingua-language-detector-2.1.1


In [67]:
from lingua import Language, LanguageDetectorBuilder

# Build a detector that only distinguishes English vs non-English
detector = LanguageDetectorBuilder.from_languages(Language.ENGLISH).build()

In [68]:
def is_english(example):
    text = example["text"]

    if not text or len(text) < 20:   # skip tiny/noisy rows
        return False

    try:
        lang = detector.detect_language_of(text)
        return lang == Language.ENGLISH
    except Exception:  # treat any detector failure as "not English"
        return False

In [69]:
dataset = dataset.filter(is_english, num_proc=1)

In [70]:
dataset.num_rows

2000809

## Save the dataset to disk

In [71]:
file_path = "content/Drive/MyDrive/data/preprocessed_dataset.parquet"
dataset.to_parquet(file_path)

1844849190