# Notebook for creating datasets for MCT

Reference: https://github.com/mosaicml/llm-foundry/blob/main/scripts/train/README.md

## Setup

Clone the LLM-Foundry repository.

In [0]:
!git clone https://github.com/mosaicml/llm-foundry.git /tmp/llm-foundry

In [0]:
%cd /tmp/llm-foundry
!pip install -e .
!pip install --upgrade typing_extensions

Log in to Hugging Face (using an access token).

In [0]:
from huggingface_hub import login
login()

Set the values required by the shell commands below as environment variables.

In [0]:
import os
os.environ['OUTPUT_ROOT'] = "/tmp/dataset"
os.environ['TOKENIZER_NAME'] = "meta-llama/Llama-3.2-1B"
os.environ['MAX_SEQ_LEN'] = "4096"
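
The `%%bash` cells later in this notebook read `$OUTPUT_ROOT`, `$TOKENIZER_NAME`, and `$MAX_SEQ_LEN`. This works because values written to `os.environ` are inherited by child processes started from the kernel. The following is a minimal sketch of that behavior using only the standard library; it launches a child Python process rather than a bash cell, which is an illustrative substitution.

```python
import os
import subprocess
import sys

# Same value as in the cell above.
os.environ["OUTPUT_ROOT"] = "/tmp/dataset"

# Start a child process and let it read the variable from its own environment.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['OUTPUT_ROOT'])"],
    capture_output=True,
    text=True,
    check=True,
)
child_value = result.stdout.strip()
print(child_value)
```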

## Preparing pretraining data

### Option 1. Source data in JSON format

In [0]:
from IPython.display import display
import pandas as pd

# Example JSON Lines file shipped with LLM-Foundry
json_path = '/tmp/llm-foundry/scripts/data_prep/example_data/arxiv.jsonl'

pandasDF = pd.read_json(path_or_buf=json_path, lines=True)
display(pandasDF)
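
To convert your own corpus instead of the bundled example, the input needs to be a JSON Lines file with one JSON object per line. The sketch below writes and re-reads such a file with the standard library. It assumes the converter reads a `text` field from each record, as the example file appears to provide; the file name `my_corpus.jsonl` and the helper names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Toy records; the "text" key is an assumption based on the example data.
records = [
    {"text": "First document."},
    {"text": "Second document."},
]

def write_jsonl(path, rows):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

jsonl_path = Path(tempfile.mkdtemp()) / "my_corpus.jsonl"  # hypothetical file name
write_jsonl(jsonl_path, records)
loaded = read_jsonl(jsonl_path)
print(loaded)
```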

Training set

In [0]:
%%bash
python3 /tmp/llm-foundry/scripts/data_prep/convert_dataset_json.py \
--path /tmp/llm-foundry/scripts/data_prep/example_data/arxiv.jsonl \
--out_root $OUTPUT_ROOT/my-copy-arxiv/train \
--split train \
--concat_tokens $MAX_SEQ_LEN \
--tokenizer $TOKENIZER_NAME \
--eos_text '<|end_of_text|>' \
--compression zstd

Validation set (this example reuses the same source file as the training set)

In [0]:
%%bash
python3 /tmp/llm-foundry/scripts/data_prep/convert_dataset_json.py \
--path /tmp/llm-foundry/scripts/data_prep/example_data/arxiv.jsonl \
--out_root $OUTPUT_ROOT/my-copy-arxiv/val \
--split train \
--concat_tokens $MAX_SEQ_LEN \
--tokenizer $TOKENIZER_NAME \
--eos_text '<|end_of_text|>' \
--compression zstd
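
The `--concat_tokens` and `--eos_text` flags used above can be pictured as follows: each document is tokenized, an end-of-sequence token is appended, and the resulting stream is cut into fixed-length samples. The sketch below is a toy illustration of that idea, not the LLM-Foundry implementation. The integer token IDs, the `eos_id` value, and the choice to drop a trailing partial chunk are assumptions.

```python
from typing import Iterable, Iterator, List

def concat_and_chunk(
    docs: Iterable[List[int]], eos_id: int, max_len: int
) -> Iterator[List[int]]:
    """Concatenate token lists, separated by eos_id, into chunks of max_len."""
    buffer: List[int] = []
    for tokens in docs:
        buffer.extend(tokens)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= max_len:
            yield buffer[:max_len]
            buffer = buffer[max_len:]  # carry the remainder into the next chunk
    # Any tokens left in the buffer are dropped (assumption of this sketch).

# Made-up token IDs standing in for three tokenized documents.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
chunks = list(concat_and_chunk(docs, eos_id=0, max_len=4))
print(chunks)
```

With `MAX_SEQ_LEN` set to 4096, every stored sample would hold 4096 token IDs, and one sample may span several source documents.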

### Option 2. Source data in plain text format

In [0]:
%%bash
mkdir -p /tmp/shakespeare && cd /tmp/shakespeare
curl -O https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
echo '------------------------------------------------------'
head t8.shakespeare.txt

Training set

In [0]:
%%bash
python3 /tmp/llm-foundry/scripts/data_prep/convert_text_to_mds.py \
  --output_folder $OUTPUT_ROOT/my-copy-shakespeare/train \
  --input_folder /tmp/shakespeare \
  --concat_tokens $MAX_SEQ_LEN \
  --tokenizer $TOKENIZER_NAME \
  --use_tokenizer_eos \
  --compression zstd

Validation set (this example reuses the same input folder as the training set)

In [0]:
%%bash
python3 /tmp/llm-foundry/scripts/data_prep/convert_text_to_mds.py \
  --output_folder $OUTPUT_ROOT/my-copy-shakespeare/val \
  --input_folder /tmp/shakespeare \
  --concat_tokens $MAX_SEQ_LEN \
  --tokenizer $TOKENIZER_NAME \
  --use_tokenizer_eos \
  --compression zstd
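
After a conversion finishes, it can be useful to check how many samples were written. The sketch below assumes the output folder contains an `index.json` file with a `shards` list whose entries carry a `samples` count, which is how the MDS format of the Streaming library is generally laid out. The helper name `count_mds_samples` is hypothetical, and the index written here is a fabricated stand-in so the example runs without a real dataset.

```python
import json
import tempfile
from pathlib import Path

def count_mds_samples(folder) -> int:
    """Sum the per-shard sample counts listed in <folder>/index.json."""
    index = json.loads((Path(folder) / "index.json").read_text(encoding="utf-8"))
    return sum(shard["samples"] for shard in index["shards"])

# Stand-in index for demonstration only; a real one has more fields per shard.
demo_dir = Path(tempfile.mkdtemp())
fake_index = {"version": 2, "shards": [{"samples": 3}, {"samples": 5}]}
(demo_dir / "index.json").write_text(json.dumps(fake_index), encoding="utf-8")

total_samples = count_mds_samples(demo_dir)
print(total_samples)
```

On real output, the same function could be pointed at a folder such as `/tmp/dataset/my-copy-shakespeare/train`, provided the layout assumption holds.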

### Option 3. Source data is a Hugging Face dataset

In [0]:
%%bash
python3 /tmp/llm-foundry/scripts/data_prep/convert_dataset_hf.py \
  --dataset allenai/c4 \
  --data_subset ja \
  --out_root $OUTPUT_ROOT/my-copy-c4-ja \
  --splits train_small val_small \
  --concat_tokens $MAX_SEQ_LEN \
  --tokenizer $TOKENIZER_NAME \
  --eos_text '<|end_of_text|>' \
  --compression zstd

## Preparing fine-tuning data

In [0]:
%%bash
python3 /tmp/llm-foundry/scripts/data_prep/convert_finetuning_dataset.py \
    --dataset kunishou/databricks-dolly-15k-ja \
    --preprocessor "llmfoundry.data.finetuning.tasks:dolly_preprocessing_function" \
    --splits train \
    --out_root $OUTPUT_ROOT/my-copy-dolly-15k-ja
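
The `--preprocessor` argument above names a function in `module:function` form that turns each raw record into the fields used for fine-tuning. If a dataset's columns do not match a built-in preprocessor, a custom function can be written. The sketch below is a hypothetical example, not the `dolly_preprocessing_function` from LLM-Foundry. It assumes the input records have `instruction`, `input`, and `output` columns and that the expected output keys are `prompt` and `response`; the prompt template is also made up for illustration.

```python
from typing import Any, Dict

# Illustrative template; adjust to whatever format your model expects.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"

def my_preprocessing_function(sample: Dict[str, Any]) -> Dict[str, str]:
    """Map one raw record to a prompt/response pair (hypothetical helper)."""
    instruction = sample["instruction"]
    # Append the optional context when it is present and non-empty.
    if sample.get("input"):
        instruction = instruction + "\n" + sample["input"]
    return {
        "prompt": PROMPT_TEMPLATE.format(instruction=instruction),
        "response": sample["output"],
    }

example = {
    "instruction": "Translate to English.",
    "input": "こんにちは",
    "output": "Hello",
}
converted = my_preprocessing_function(example)
print(converted)
```

Saved in an importable module, such a function could presumably be referenced in the same `module:function` form as the built-in preprocessor.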