This notebook tests the deduplication (decontamination) code from EleutherAI's `janitor.py` on small sections of Dolma to estimate how long deduplicating the full dataset would take.

To run ``janitor.py`` with C++ on Linux:
1. At ``lm-evaluation-harness/scripts/clean_training_data``, run ``c++ -O3 -Wall -shared -std=c++11 -fPIC $(python3 -m pybind11 --includes) janitor_util.cpp -o janitor_util$(python3-config --extension-suffix)``
2. Rename the resulting ``.so`` file to ``janitor_util.so``
3. Add the directory containing ``janitor_util.so`` to Python's module search path so that ``import janitor_util`` succeeds: ``sys.path.append(harness_dir + "/scripts/clean_training_data")``
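
Before timing anything, it can help to confirm that the compiled module is actually visible from the directory named in step 3. The sketch below is only an illustration: `locate_module` is a hypothetical helper (not part of `lm-evaluation-harness`), and the relative `build_dir` path is an assumption that should be replaced with the real checkout location.

```python
from importlib.machinery import PathFinder
from typing import Optional


def locate_module(module_name: str, search_dir: str) -> Optional[str]:
    """Return the file that `import module_name` would load from search_dir, or None."""
    # PathFinder searches only the directories passed in; sys.path is left untouched.
    spec = PathFinder.find_spec(module_name, [search_dir])
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # Hypothetical location; substitute the path to the real checkout.
    build_dir = "lm-evaluation-harness/scripts/clean_training_data"
    found = locate_module("janitor_util", build_dir)
    print(found or f"janitor_util not found in {build_dir}")
```

If the helper returns `None`, the build or rename step most likely did not produce a file that Python can import from that directory.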

In [1]:
import pyarrow.parquet as pq
from pathlib import Path
from transformers import AutoTokenizer
import sys
import datetime

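# "__file__" is a literal string here (notebooks do not define __file__), so the path
# resolves relative to the working directory; parents[3] is three levels above it.
harness_dir = str(Path("__file__").resolve().parents[3] / "lm-evaluation-harness")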
sys.path.append(harness_dir)

sys.path.append(harness_dir + "/scripts/clean_training_data")
from lm_eval.decontamination.janitor import Janitor

  _torch_pytree._register_pytree_node(
2024-02-11:15:28:17,438 INFO     [utils.py:145] Note: detected 128 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2024-02-11:15:28:17,439 INFO     [utils.py:148] Note: NumExpr detected 128 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-02-11:15:28:17,439 INFO     [utils.py:160] NumExpr defaulting to 8 threads.


In [2]:
with open("./arithmetic.txt", "r") as file:
    arithmetic = file.read()

In [2]:
data = pq.read_table("/data/tir/projects/tir7/user_data/mchen5/dolma_100B/c4/part_1.arrow")
data_string = "".join(data.column("text").to_pandas())
print("Loaded c4/part_1.arrow")

Loaded c4/part_1.arrow


In [3]:
LLAMA_DIR = "/data/datasets/models/huggingface/meta-llama/Llama-2-70b-hf/"
tokenizer = AutoTokenizer.from_pretrained(LLAMA_DIR)
tokenizer.pad_token = tokenizer.eos_token
encoded_inputs = tokenizer(
    data_string, truncation=True, padding=True, return_tensors="pt"
)
num_non_padding_toks = (
    encoded_inputs["attention_mask"].sum(dim=1).tolist()
)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [None]:
sum(num_non_padding_toks)

In [7]:
# Decontaminate parts 5-12 of c4 (8 files) and accumulate the total time
total_decontamination_time = datetime.timedelta(hours=0)
janitor = Janitor(delete_chars="")
janitor.register_contaminant(arithmetic)

for file_num in range(5, 13):
    arrow_path = (
        f"/data/tir/projects/tir7/user_data/mchen5/dolma_100B/c4/part_{file_num}.arrow"
    )
    data = pq.read_table(arrow_path)
    data_string = "".join(data.column("text").to_pandas())
    print(f"Loaded c4/part_{file_num}.arrow")

    pre_decontaminate = datetime.datetime.now()
    # Test decontaminating arithmetic against c4 part {file_num}
    # Note: clean_python() runs the pure-Python path
    result = janitor.clean_python(data_string)
    # Measure elapsed time once so the printed and accumulated values agree
    elapsed = datetime.datetime.now() - pre_decontaminate

    print(f"Decontaminated c4 part {file_num} of arithmetic in {elapsed}")

    total_decontamination_time += elapsed

Loaded c4/part_5.arrow


For estimating runtimes:

| Folder | # of arrow files |
|--------|------------------|
| c4 | 4213499 |
| common-crawl | 510983 |
| gutenberg-books | 1178 |
| peS2o | 20803 |
| stack-code | 103818 |
| wiki-en-simple | 2785999 |
| dolma_100B (total) | 7636280 |


Testing results:
- Deduplicating full part 1 of c4 against arithmetic (C++):
    - 512G RAM (43G used), 1 GPU, 4 CPUs (169% efficiency): 15 min 40 sec
- Deduplicating 1/10 of part 1 of c4 against arithmetic (Python):
    - 512G RAM (43G used), 1 GPU, 4 CPUs (169% efficiency): 1 min 35 sec
- Deduplicating full part 2 of c4 against arithmetic (C++):
    - 512G RAM (26G used), 1 GPU, 16 CPUs (14% efficiency): 16 min 16 sec
- Deduplicating parts 5-12 of c4 against arithmetic (C++):
    - 512G RAM (43G used), 1 GPU, 16 CPUs (66% efficiency): 140 min
        - Average of 17.5 min per part
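
The file counts and the 17.5 min average can be combined into a rough back-of-the-envelope estimate. The sketch below rests on strong assumptions that the measurements above do not establish: that every arrow file in every folder costs about as much as a c4 part, and that adding workers scales perfectly. `estimate_hours` and the `num_workers` parameter are hypothetical names introduced here for illustration.

```python
# Arrow file counts copied from the table above.
ARROW_FILE_COUNTS = {
    "c4": 4213499,
    "common-crawl": 510983,
    "gutenberg-books": 1178,
    "peS2o": 20803,
    "stack-code": 103818,
    "wiki-en-simple": 2785999,
}

# Measured average for c4 parts 5-12 (C++); assumed to hold for every file.
MINUTES_PER_FILE = 17.5


def estimate_hours(num_files: int,
                   minutes_per_file: float = MINUTES_PER_FILE,
                   num_workers: int = 1) -> float:
    """Estimate wall-clock hours, assuming uniform per-file cost and perfect scaling."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return num_files * minutes_per_file / 60 / num_workers


if __name__ == "__main__":
    for folder, count in ARROW_FILE_COUNTS.items():
        print(f"{folder}: about {estimate_hours(count):,.0f} hours on one worker")
    total_files = sum(ARROW_FILE_COUNTS.values())
    print(f"dolma_100B (total): about {estimate_hours(total_files):,.0f} hours on one worker")
```

As a sanity check, `estimate_hours(8)` reproduces the 140 min measured for parts 5-12. Files in other folders may differ a lot in size, so the per-folder figures are at best an order-of-magnitude guide.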

To do:
- Parallelize (multithreading or multiprocessing?) by splitting the data into chunks and deduplicating each chunk in parallel
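
One possible shape for this to-do item is sketched below, using processes rather than threads so that pure-Python work is not serialized by the GIL. It is an untested idea, not a measured result. `split_into_batches` and `clean_in_parallel` are hypothetical helpers, and `clean_batch` stands in for a picklable, module-level function that would build its own `Janitor`, register the contaminant, and clean the joined text of one batch. Splitting on document boundaries (rather than slicing one long joined string) is an assumption made so that no document is cut in the middle.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, List, Sequence


def split_into_batches(docs: Sequence[str], num_batches: int) -> List[List[str]]:
    """Split docs into at most num_batches contiguous, near-equal batches."""
    if num_batches < 1:
        raise ValueError("num_batches must be at least 1")
    base, extra = divmod(len(docs), num_batches)
    batches, start = [], 0
    for i in range(num_batches):
        # The first `extra` batches each take one additional document.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break
        batches.append(list(docs[start:start + size]))
        start += size
    return batches


def clean_in_parallel(docs: Sequence[str],
                      clean_batch: Callable[[List[str]], object],
                      num_workers: int = 4) -> list:
    """Run clean_batch on each batch in a separate process; results keep batch order."""
    batches = split_into_batches(docs, num_workers)
    # clean_batch must be defined at module level so it can be pickled
    # and sent to the worker processes.
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(clean_batch, batches))
```

Each worker holds its own copy of its batch, so memory use grows with the number of workers; that would need checking against the RAM figures above before scaling up.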