# Using datasets with local text files

This notebook covers how to load local dataset files.

In general, you should review `./dataset-config-huggingface-example.ipynb` first, because many of the settings covered there can be combined with the settings covered here.

> Important note: These examples focus only on how to configure your dataset, and do not properly perform checkpointing. For trainer configurations, refer to the training notebooks.

## Initial setup

Before we go into the dataset setup, let's perform an initial setup of all the folders we need, and of the small toy models which we will use throughout the examples in this notebook.

In [None]:
# Setup the folders we will need
!mkdir -p ../../model/
!mkdir -p ../../datapath/
!mkdir -p ../../checkpoint/

# Initialize a simple L6-D512 model, for both the v4 neox (50277) tokenizer
!cd ../../RWKV-v5/ && python3 ./init_model.py --n_layer 6 --n_embd 512 --vocab_size neox --skip-if-exists ../model/L6-D512-neox-init.pth

# and the RWKV world (65529) tokenizer
!cd ../../RWKV-v5/ && python3 ./init_model.py --n_layer 6 --n_embd 512 --vocab_size world --skip-if-exists ../model/L6-D512-world-init.pth

## Training with collection of text files

### **Download the dataset file**

In [None]:
# Setup the dataset dir
!mkdir -p ../../dataset/dataset-config/text/
!mkdir -p ../../dataset/dataset-config/zip/

# Download the files
!cd ../../dataset/dataset-config/zip/ && wget -nc https://data.deepai.org/enwik8.zip
!cd ../../dataset/dataset-config/text/ && rm -rf ./*
!cd ../../dataset/dataset-config/text/ && unzip ../zip/enwik8.zip
!cd ../../dataset/dataset-config/text/ && mv enwik8 enwik8.txt
!cd ../../dataset/dataset-config/text/ && ls -lh
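
As an optional sanity check before preprocessing, the files in the source directory can also be listed from Python. This is a minimal sketch using only the standard library; the helper name `list_dataset_files` is hypothetical and not part of the trainer, and the path assumes the notebook's working directory.

```python
from pathlib import Path

def list_dataset_files(data_dir, suffix=".txt"):
    """Return sorted (file name, size in bytes) pairs for files ending in `suffix`."""
    root = Path(data_dir)
    return sorted(
        (p.name, p.stat().st_size)
        for p in root.iterdir()
        if p.is_file() and p.suffix == suffix
    )

# Assumed location of the text dataset, relative to this notebook
data_dir = Path("../../dataset/dataset-config/text/")
if data_dir.is_dir():
    for name, size in list_dataset_files(data_dir):
        print(f"{name}: {size / 1e6:.1f} MB")
```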

### **Parse the dataset**

The following are the `example-local-text.yaml` settings, for using local text data:

---
```yaml
trainer:
  max_steps: 10
  target_batch_size: 32
model:
  load_model: ../model/L6-D512-neox-init.pth
  ctx_len: 1024
  lr_init: 3e-4
  bptt_learning: true
  bptt_learning_range: -1
data:
  data_path: ../datapath/enwiki8_neox_1024/

  source: "text"
  source_data_dir: "../dataset/dataset-config/text/"
  tokenizer: neox
  
  text_rechunk_size: 2048
  
  test_split: 0.01
  test_split_shuffle: false
```
---

### Understanding the `data` config, for textual datasets

**data.data_path** 

This is where the HF datapath (the prepared dataset) is saved once the source files have been processed. This is a required parameter.

**data.source** 

This can be configured as `text / json / csv / pandas` for local files

**data.source_data_dir** 

Folder / Directory which contains the respective `text / json (or jsonl) / csv / pandas` files

**data.tokenizer**

The tokenizer to use for the dataset. Use either `neox` or `world` for the respective RWKV models. For a custom HF tokenizer, refer to `./dataset-config-huggingface-examples.ipynb`.

**data.text_rechunk_size** 

Number of tokens each data sample should have after rechunking. The recommended size is the context size you intend the model to support (e.g. 2048, 4096, etc.). This is enabled for text-based datasets.
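
To illustrate the idea, rechunking can be pictured as concatenating the tokens of all documents and slicing the result into fixed-size samples. The sketch below shows that general technique and is not the trainer's actual implementation; `rechunk_tokens` is a hypothetical helper, and keeping the final partial chunk as a shorter sample is an assumption that the real code may handle differently.

```python
from itertools import chain

def rechunk_tokens(token_lists, chunk_size):
    """Concatenate token lists and split them into chunks of `chunk_size` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    flat = list(chain.from_iterable(token_lists))
    # The last chunk may be shorter than chunk_size (assumption: it is kept)
    return [flat[i:i + chunk_size] for i in range(0, len(flat), chunk_size)]

# Three "documents" of different lengths, rechunked into samples of 4 tokens
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
chunks = rechunk_tokens(docs, chunk_size=4)
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
```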

**data.test_split**

If configured as a floating point number between 0.0 and 1.0, it is the fraction of the dataset (0.1 is 10%) that is held out for test validation.

If configured as an integer, it is the number of samples.

Due to some limitations in the current trainer code, even if it's set to 0, a single data sample is used for the test split.

This defaults to 0.01, or 1%.
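
The rules above can be summarised in a small sketch. `resolve_test_size` is a hypothetical helper written for illustration only; truncating fractional sizes with `int()` is an assumption, and the trainer may round differently.

```python
def resolve_test_size(test_split, dataset_len):
    """Return the number of samples held out for the test split."""
    if isinstance(test_split, float) and 0.0 < test_split < 1.0:
        size = int(dataset_len * test_split)  # fraction of the dataset
    else:
        size = int(test_split)  # absolute number of samples
    # At least one sample is always used, even when test_split is 0
    return max(1, min(size, dataset_len))

print(resolve_test_size(0.01, 10_000))  # 100
print(resolve_test_size(50, 10_000))    # 50
print(resolve_test_size(0, 10_000))     # 1
```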

**data.test_split_shuffle**

Performs a dataset shuffle before the test split. This defaults to False.

Note that this is not a truly random shuffle, but a deterministic one, to ensure a consistent result.
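
A deterministic shuffle can be pictured as a shuffle driven by a fixed seed, so that repeated runs produce the same ordering and therefore the same test split. The sketch below uses Python's `random.Random`. The seed value, the helper `deterministic_split`, and the choice to take the test samples from the end of the list are all assumptions for illustration, not the trainer's internals.

```python
import random

def deterministic_split(samples, test_size, shuffle=False, seed=42):
    """Split samples into (train, test), optionally shuffling with a fixed seed."""
    items = list(samples)
    if shuffle:
        # A dedicated generator with a fixed seed gives the same order every run
        random.Random(seed).shuffle(items)
    cut = len(items) - test_size
    return items[:cut], items[cut:]

# Two runs with shuffling enabled produce an identical split
first = deterministic_split(range(100), test_size=10, shuffle=True)
second = deterministic_split(range(100), test_size=10, shuffle=True)
print(first == second)  # True
```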

---

### Parse the dataset, and run the training process

In [None]:
# Let's preload the required dataset
!cd ../../RWKV-v5 && python3 preload_datapath.py ../notebook/dataset-config/example-local-text.yaml

In [None]:
# Validate the dataset is working, by doing a quick training run
!cd ../../RWKV-v5 && python3 lightning_trainer.py fit -c ../notebook/dataset-config/example-local-text.yaml

## Training with JSON / JSONL / CSV files

### **Download the dataset**

In [None]:
# Setup the dataset dir
!mkdir -p ../../dataset/dataset-config/jsonl/

# Download the files
!cd ../../dataset/dataset-config/jsonl/ && wget -nc https://huggingface.co/datasets/picocreator/RWKV-notebook-assets/raw/main/sample-memory-train-10-word-count.jsonl
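
Before preprocessing, it can help to peek at the first record of the downloaded file to see which columns it contains. This is a minimal sketch using only the standard library; `peek_jsonl` is a hypothetical helper, no particular column names are assumed, and the path assumes the notebook's working directory.

```python
import json
from pathlib import Path

def peek_jsonl(path, n=1):
    """Return the first `n` records of a JSONL file as Python objects."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            records.append(json.loads(line))
            if len(records) >= n:
                break
    return records

# Assumed location of the downloaded file, relative to this notebook
path = Path("../../dataset/dataset-config/jsonl/sample-memory-train-10-word-count.jsonl")
if path.exists():
    for record in peek_jsonl(path):
        print(sorted(record.keys()))
```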

### **Parse the dataset**

The following are the `example-local-jsonl.yaml` settings, for using local JSON / JSONL data:

---
```yaml
trainer:
  max_steps: 10
  target_batch_size: 32
model:
  load_model: ../model/L6-D512-world-init.pth
  ctx_len: 1024
  lr_init: 3e-4
  bptt_learning: true
  bptt_learning_range: -1
data:
  data_path: ../datapath/enwiki8_neox_1024/

  # Note that json works for both ".json" and ".jsonl"
  source: "json"
  source_data_dir: "../dataset/dataset-config/jsonl/"
  tokenizer: world
  
  test_split: 0.01
  test_split_shuffle: false
```
---

### Understanding the `data` config, for JSON / JSONL / CSV datasets

**data.source** 

This can be configured as `text / json / csv / pandas` for local files. For the most part, since json/csv/pandas all deal with structured data, they work in a similar fashion.

**data.source_data_dir** 

Folder / Directory which contains the respective `text / json (or jsonl) / csv / pandas` files

**data.tokenizer**

The tokenizer to use for the dataset. Use either `neox` or `world` for the respective RWKV models. For a custom HF tokenizer, refer to `./dataset-config-huggingface-examples.ipynb`.

> All the advanced settings for column dataset handling in `./dataset-config-huggingface-examples.ipynb` work with `JSON / CSV / pandas` based local data file formats. This includes forced text rechunking, multi-column formatting, column masking, etc.

---

### Parse the dataset, and run the training process

In [None]:
# Let's preload the required dataset
!cd ../../RWKV-v5 && python3 preload_datapath.py ../notebook/dataset-config/example-local-jsonl.yaml

In [None]:
# Validate the dataset is working, by doing a quick training run
!cd ../../RWKV-v5 && python3 lightning_trainer.py fit -c ../notebook/dataset-config/example-local-jsonl.yaml