# Lesson 2: Data Preparation

These are notes taken while following the course below from DeepLearning.AI and Upstage.

https://learn.deeplearning.ai/courses/pretraining-llms/lesson/3/data-preparation

In this notebook we learn how to prepare data for LLM pre-training. Upstage has developed and released a library called [Dataverse](https://github.com/UpstageAI/dataverse) for data preparation.

In [1]:
import warnings
warnings.filterwarnings("ignore")

## Obtaining Datasets for Pre-training

Here we obtain data in the following two ways.

1. Downloading from HuggingFace datasets
2. Scraping Python scripts from GitHub

We then process the data with HuggingFace's [Datasets](https://huggingface.co/docs/datasets/en/index) library.

### Downloading Data from HuggingFace

Here we use a subset of the [Red Pajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) dataset. The original dataset contains about 1 trillion tokens.

In [4]:
import datasets
pretraining_dataset = datasets.load_dataset(
    "upstage/Pretraining_Dataset",
    split="train"
)

Downloading data:   0%|          | 0.00/150M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [5]:
print(pretraining_dataset)

Dataset({
    features: ['text', 'meta'],
    num_rows: 60000
})


We only use the `text` column. Let's look at part of one sample.

In [6]:
pretraining_dataset = pretraining_dataset.select_columns(['text'])
print(pretraining_dataset[10]["text"][:500])

Home World CEO of Crew Clothing CEO Resigns
CEO of Crew Clothing CEO Resigns
By Karen Roe [CC BY 2.0], via Wikimedia Commons
Crew, a British lifestyle clothing brand, has been sold by Livingbridge, its founder and private equity firm to Exquisite Apparel.
However, Crew will be advancing under a new image, as the chief executive who was brought in by Livingbridge in order to develop the brand, Louise Barnes, has resigned following the sale. Barnes attempted to lead a management buyout. However, i


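### Downloading a Fine-tuning Dataset

Next we look at the [instruction tuning dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html) of the Alpaca model. Unlike the pre-training data above, a fine-tuning dataset is structured as instruction/input/output fields.

We will not use the fine-tuning dataset in the rest of this notebook.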

In [7]:
instruction_dataset = datasets.load_dataset(
    "c-s-ale/alpaca-gpt4-data",
    split="train"
)
print(instruction_dataset)

Downloading readme:   0%|          | 0.00/1.39k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/43.4M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/52002 [00:00<?, ? examples/s]

Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 52002
})


In [9]:
i=5535
print("Instruction: " + instruction_dataset[i]["instruction"] 
      + "\nInput: " + instruction_dataset[i]["input"] 
      + "\nOutput: " + instruction_dataset[i]["output"])

Instruction: Determine the most common word in the text.
Input: Humans are created in the image of God, from a spiritual perspective and from a physical perspective.
Output: The most common word in the text is "from" as it appears twice in the sentence.
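To make the contrast with the plain-text pre-training samples concrete, here is a minimal sketch of how an instruction/input/output record might be flattened into a single training string. The template wording and the `format_example` helper are illustrative assumptions; they are not the template used by Alpaca or by the course.

```python
# Hypothetical prompt templates; the section headers are assumptions for illustration.
TEMPLATE_WITH_INPUT = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)
TEMPLATE_NO_INPUT = (
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)

def format_example(record):
    """Flatten one instruction-tuning record into a single string."""
    # Records with an empty "input" field skip the Input section entirely.
    template = TEMPLATE_WITH_INPUT if record.get("input") else TEMPLATE_NO_INPUT
    return template.format(**record)

# Record copied from the sample printed above (index 5535).
record = {
    "instruction": "Determine the most common word in the text.",
    "input": "Humans are created in the image of God, from a spiritual "
             "perspective and from a physical perspective.",
    "output": 'The most common word in the text is "from" as it appears '
              "twice in the sentence.",
}

formatted = format_example(record)
print(formatted)
```

A pre-training sample, by contrast, is already a single free-form string and needs no such template.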


### Scraping Python Code from GitHub

We download Python scripts from GitHub and convert them into a `Dataset` object.

In [15]:
# Import some required packages
import os
import requests

# Path to directory to store python scripts
code_dir = "./code"

if not os.path.exists(code_dir):
    os.mkdir(code_dir)

In [11]:
urls = [
    "https://raw.githubusercontent.com/TheAlgorithms/Python/master/searches/double_linear_search_recursion.py",
    "https://raw.githubusercontent.com/KosingZhu/tensorflow/master/tensorflow/python/tools/module_util.py",
    "https://raw.githubusercontent.com/EricRemmerswaal/tensorflow/master/tensorflow/python/distribute/distribute_coordinator_context.py",
    "https://raw.githubusercontent.com/computationalartist/tensorflow/master/tensorflow/python/ops/numpy_ops/integration_test/benchmarks/numpy_mlp.py",
    "https://raw.githubusercontent.com/Van-an/tensorflow/master/tensorflow/python/distribute/coordinator/values.py",
    "https://raw.githubusercontent.com/nkgwer/tensorflow/master/tensorflow/lite/tools/visualize.py",
    "https://raw.githubusercontent.com/gitblazer/youtube-dl/master/youtube_dl/version.py",
    "https://raw.githubusercontent.com/Joshua-Barawa/My-Photos/master/venv/lib/python3.8/site-packages/django/contrib/messages/__init__.py",
    "https://raw.githubusercontent.com/PaliC/pytorch/master/test/fx/test_subgraph_rewriter.py"
]

In [16]:
for url in urls:
    print(f"Working on url: {url}")
    response = requests.get(url)
    file_name = os.path.basename(url)
    file_path = os.path.join(code_dir, file_name)
    
    with open(file_path, "wb") as file:
        file.write(response.content)

Working on url: https://raw.githubusercontent.com/TheAlgorithms/Python/master/searches/double_linear_search_recursion.py
Working on url: https://raw.githubusercontent.com/KosingZhu/tensorflow/master/tensorflow/python/tools/module_util.py
Working on url: https://raw.githubusercontent.com/EricRemmerswaal/tensorflow/master/tensorflow/python/distribute/distribute_coordinator_context.py
Working on url: https://raw.githubusercontent.com/computationalartist/tensorflow/master/tensorflow/python/ops/numpy_ops/integration_test/benchmarks/numpy_mlp.py
Working on url: https://raw.githubusercontent.com/Van-an/tensorflow/master/tensorflow/python/distribute/coordinator/values.py
Working on url: https://raw.githubusercontent.com/nkgwer/tensorflow/master/tensorflow/lite/tools/visualize.py
Working on url: https://raw.githubusercontent.com/gitblazer/youtube-dl/master/youtube_dl/version.py
Working on url: https://raw.githubusercontent.com/Joshua-Barawa/My-Photos/master/venv/lib/python3.8/site-packages/djan

In [17]:
files = os.listdir(code_dir)
for file in files:
    print(file)

test_subgraph_rewriter.py
numpy_mlp.py
values.py
version.py
double_linear_search_recursion.py
__init__.py
visualize.py
module_util.py
distribute_coordinator_context.py
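Because the download loop names each file with `os.path.basename(url)`, two URLs that end in the same file name (for example, two different `__init__.py` files) would overwrite each other in `./code`. The sketch below shows one possible way to avoid that by prefixing a short hash of the full URL. `unique_file_name` and the two example URLs are hypothetical and are not part of the course code.

```python
import hashlib
import os

def unique_file_name(url):
    """Return '<8-char hash of url>_<basename>' so equal basenames do not clash."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    return f"{digest}_{os.path.basename(url)}"

# Hypothetical URLs that share the same basename.
example_urls = [
    "https://example.com/project_a/pkg/__init__.py",
    "https://example.com/project_b/pkg/__init__.py",
]

plain_names = [os.path.basename(u) for u in example_urls]
unique_names = [unique_file_name(u) for u in example_urls]

print(plain_names)             # both entries are '__init__.py'
print(len(set(unique_names)))  # number of distinct hashed names
```

With only nine hand-picked URLs the notebook does not hit this problem, but it matters once the URL list is generated automatically.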


In [18]:
code_dataset = []
for f in os.listdir(code_dir):
    # Read each downloaded script and close the file handle promptly
    with open(os.path.join(code_dir, f), 'r') as fh:
        code_dataset.append({'text': fh.read()})

In [20]:
code_dataset = datasets.Dataset.from_list(code_dataset)
print(code_dataset)

Dataset({
    features: ['text'],
    num_rows: 9
})


We combine the pre-training dataset with the Python code dataset.

In [23]:
dataset = datasets.concatenate_datasets([pretraining_dataset, code_dataset])
print(dataset)

Dataset({
    features: ['text'],
    num_rows: 60009
})


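## Data Cleaning

Here we clean the data with the following steps.

1. Remove samples that are too short
2. Remove samples with heavy repetition within a single text
3. Remove duplicate samples
4. Remove non-English samples

### Removing Samples That Are Too Short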

In [30]:
import heapq

def paragraph_length_filter(x):
    """Return False iff a page has too few lines or lines are too short."""
    lines = x['text'].split('\n')
    if (
        len(lines) < 3
        or min(heapq.nlargest(3, [len(line) for line in lines])) < 3
    ):
        return False
    return True

In [31]:
dataset = dataset.filter(
    paragraph_length_filter,
    load_from_cache_file=False
)

Filter:   0%|          | 0/60009 [00:00<?, ? examples/s]

In [32]:
len(dataset)

52357
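The behaviour of the length filter is easier to see on a few toy inputs. The sketch below restates the same rule (at least 3 lines, and the three longest lines each at least 3 characters long) and applies it to made-up samples; the sample texts are illustrative only.

```python
import heapq

def paragraph_length_filter(x):
    """Same rule as the filter above: keep a sample only if it has at least
    3 lines and its three longest lines each have at least 3 characters."""
    lines = x["text"].split("\n")
    if len(lines) < 3:
        return False
    return min(heapq.nlargest(3, [len(line) for line in lines])) >= 3

# Made-up samples covering each branch of the rule.
samples = {
    "two lines only": "first line\nsecond line",
    "three very short lines": "a\nb\nc",
    "one long line, two short": "a long enough line\nb\nc",
    "three real lines": "first line\nsecond line\nthird line",
}

results = {
    name: paragraph_length_filter({"text": text})
    for name, text in samples.items()
}
for name, kept in results.items():
    print(f"{name}: {kept}")
```

Only the last sample passes: a single long line is not enough, because the third-longest line must also reach 3 characters.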

### Removing Samples with Heavy Repetition Within a Text

In [33]:
def find_duplicates(paragraphs):
    unique_x = set()
    duplicate_chars = 0
    duplicate_elts = 0
    for elt in paragraphs:
        if elt in unique_x:
            duplicate_elts += 1
            duplicate_chars += len(elt)
        else:
            unique_x.add(elt)
    return duplicate_elts, duplicate_chars

In [36]:
import re

def paragraph_repetition_filter(x):
    text = x['text']
    paragraphs = re.compile(r"\n{2,}").split(text.strip())
    paragraphs_duplicates, char_duplicates = find_duplicates(paragraphs)
    if paragraphs_duplicates / len(paragraphs) > 0.3:
        return False
    if char_duplicates / len(text) > 0.2:
        return False
    return True
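As a small illustration of the two ratios used by the repetition filter, the sketch below runs the same duplicate counting on a made-up document in which a boilerplate paragraph appears three times. The toy text is an assumption for demonstration only.

```python
import re

def find_duplicates(paragraphs):
    """Count repeated paragraphs and the characters they contain."""
    seen = set()
    duplicate_elts = 0
    duplicate_chars = 0
    for elt in paragraphs:
        if elt in seen:
            duplicate_elts += 1
            duplicate_chars += len(elt)
        else:
            seen.add(elt)
    return duplicate_elts, duplicate_chars

# Toy document: 4 paragraphs, the boilerplate line appears three times.
toy_text = "\n\n".join([
    "Subscribe to our newsletter",
    "An actual paragraph with some content.",
    "Subscribe to our newsletter",
    "Subscribe to our newsletter",
])

paragraphs = re.split(r"\n{2,}", toy_text.strip())
dup_paragraphs, dup_chars = find_duplicates(paragraphs)

paragraph_ratio = dup_paragraphs / len(paragraphs)  # 2 of 4 paragraphs repeat
char_ratio = dup_chars / len(toy_text)

print(dup_paragraphs, dup_chars)
print(paragraph_ratio > 0.3, char_ratio > 0.2)
```

Only the second and third occurrences count as duplicates; the first occurrence is treated as the original. Both ratios exceed their thresholds here, so the filter would drop this document.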

Let's inspect a text that actually contains duplicates.

In [52]:
text = dataset[1891]['text']
paragraphs = re.compile(r"\n{2,}").split(text.strip())
find_duplicates(paragraphs), len(paragraphs), len(text)

((4, 697), 15, 2845)
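Plugging the numbers from this output into the filter's two thresholds shows why the sample is removed. The short check below only redoes that arithmetic, with the values copied from the output above.

```python
# Values copied from the cell output above for dataset[1891].
duplicate_paragraphs, duplicate_chars = 4, 697
num_paragraphs, num_chars = 15, 2845

paragraph_ratio = duplicate_paragraphs / num_paragraphs
char_ratio = duplicate_chars / num_chars

print(f"paragraph ratio: {paragraph_ratio:.3f} (threshold 0.3)")
print(f"character ratio: {char_ratio:.3f} (threshold 0.2)")

# The filter drops a sample when either ratio exceeds its threshold.
keep = not (paragraph_ratio > 0.3 or char_ratio > 0.2)
print("kept by filter:", keep)
```

The paragraph ratio stays below 0.3, but the character ratio is above 0.2, so the character-level check is what rejects this sample.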

In [53]:
paragraphs

['Q: Adding lines and spaces with \\addtocontents{toc} without \\addtocontents{ptc} I have a follow-up question to this one:\nWant \\addtocontents{toc} without \\addtocontents{ptc}\nI use the titletoc package and want to add vertical spaces and a line in the table of contents, but NOT in the partial TOC. However, the lines and spaces do appear in all partial TOCs as marked in red in the picture below. The solution in the linked question did not work for me, because I want to add an object and not a section. \nDoes anybody know how to circumvent this?\nHere is my MWE:\n\\documentclass{article}\n\\usepackage{titletoc}',
 '\\begin{document}\n\\tableofcontents',
 '\\section{Section1}\nHere the text of the document begins with Section 1.',
 '\\section{Section2}\n\\startcontents % Want partial TOC for Section2\n\\printcontents{}{1}{}\nHere is the text of Section 2.\n\\subsection{Subsection2.1}\nHere is the text of the first Subsection.\n\\subsection{Subsection2.2}\nHere is the text of the se

In [54]:
dataset = dataset.filter(
    paragraph_repetition_filter,
    load_from_cache_file=False
)

len(dataset)

Filter:   0%|          | 0/52357 [00:00<?, ? examples/s]

52327

### Removing Duplicate Samples

In [57]:
unique_text = set()

def dedup_func(x):
    if x['text'] in unique_text:
        return False
    else:
        unique_text.add(x['text'])
        return True
    
dataset = dataset.filter(dedup_func, load_from_cache_file=False, num_proc=1)
len(dataset)

Filter:   0%|          | 0/52327 [00:00<?, ? examples/s]

43598
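The exact-match filter above keeps every full text in a Python `set`, which can become memory-heavy on larger corpora. It also relies on state shared across calls, which is presumably why the cell runs with `num_proc=1`. A common variation stores a fixed-size hash of each text instead. The sketch below shows that variation on a plain Python list; it is not the course's implementation, and `dedup_by_hash` is a hypothetical helper.

```python
import hashlib

def dedup_by_hash(texts):
    """Keep the first occurrence of each exact text, tracking SHA-256 digests
    instead of the full strings."""
    seen_digests = set()
    unique_texts = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen_digests:
            seen_digests.add(digest)
            unique_texts.append(text)
    return unique_texts

docs = ["hello world", "foo", "hello world", "Hello world", "foo"]
deduplicated = dedup_by_hash(docs)
print(deduplicated)  # ['hello world', 'foo', 'Hello world']
```

Like the original filter, this only catches byte-for-byte identical texts: "Hello world" and "hello world" are both kept. Near-duplicate detection would need a different technique.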

### Removing Non-English Samples

To keep only data in a specific language, we use the [FastText](https://fasttext.cc) library. The language-identification model weights can be downloaded [here](https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin).
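fastText language-identification labels have the form `__label__<code>`, which is why the filter in the next cell splits the label on `__` and takes the third piece. That cell also replaces newlines with spaces first, since `predict` expects single-line input. The sketch below walks through the label parsing and the 0.4 score threshold on hard-coded predictions, so it runs without the model file. The prediction values are made up for illustration and `parse_prediction` is a hypothetical helper.

```python
def parse_prediction(labels, scores, min_score=0.4, target="en"):
    """Return (language_code, keep) for one fastText-style prediction."""
    # '__label__en'.split('__') -> ['', 'label', 'en']
    language = labels[0].split("__")[2]
    keep = bool(scores[0] > min_score and language == target)
    return language, keep

# Made-up predictions shaped like (labels, scores) for a single input text.
confident_english = parse_prediction(("__label__en",), [0.97])
uncertain_english = parse_prediction(("__label__en",), [0.31])
confident_german = parse_prediction(("__label__de",), [0.99])

print(confident_english)  # ('en', True)
print(uncertain_english)  # ('en', False)
print(confident_german)   # ('de', False)
```

A sample is kept only when the top label is the target language and the model is reasonably confident, so short or mixed-language texts with low scores are dropped as well.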

In [74]:
from fasttext.FastText import _FastText

def english_language_filter(ds):
    model = _FastText("./models/lid.176.bin")

    def is_english(x):
        language, score = model.predict(x['text'].replace('\n', ' '))
        language = language[0].split("__")[2]
        return score[0] > 0.4 and language == "en"  # change "en" to keep a different language
    
    ds = ds.filter(is_english, load_from_cache_file=False, num_proc=1)
    return ds

dataset = english_language_filter(dataset)
len(dataset)


Filter:   0%|          | 0/43598 [00:00<?, ? examples/s]

40478
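To see how much each cleaning step contributed, the row counts printed throughout this notebook can be compared stage by stage. The snippet below only redoes that arithmetic; the stage names are informal labels, and the counts are copied from the cell outputs.

```python
# Row counts copied from the cell outputs in this notebook.
stages = [
    ("combined dataset", 60009),
    ("after short-sample filter", 52357),
    ("after repetition filter", 52327),
    ("after exact deduplication", 43598),
    ("after English-only filter", 40478),
]

removed_per_stage = []
previous = stages[0][1]
for name, rows in stages[1:]:
    removed = previous - rows
    removed_per_stage.append(removed)
    print(f"{name}: {rows} rows ({removed} removed, {removed / previous:.1%})")
    previous = rows

total_kept = stages[-1][1] / stages[0][1]
print(f"overall kept: {total_kept:.1%}")
```

Exact deduplication and the short-sample filter remove the most rows in this run, while the repetition filter removes only a handful.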

## Saving the Data

See [here](https://parquet.apache.org) for details on the Parquet format.

In [77]:
file_path = "./data/preprocessed_dataset.parquet"
dataset.to_parquet(file_path)

Creating parquet from Arrow format:   0%|          | 0/41 [00:00<?, ?ba/s]

197041742

In [78]:
!ls -alh data

total 228720
drwxr-xr-x  3 emart  staff    96B Jul 22 17:51 .
drwxr-xr-x  8 emart  staff   256B Jul 22 17:51 ..
-rw-r--r--  1 emart  staff   112M Jul 22 17:51 preprocessed_dataset.parquet
