This notebook runs the deduplication code from EleutherAI's `janitor.py` on small sections of Dolma to estimate how long full deduplication would take.

To run ``janitor.py`` with C++ on Linux:
1. At ``lm-evaluation-harness/scripts/clean_training_data``, run ``c++ -O3 -Wall -shared -std=c++11 -fPIC $(python3 -m pybind11 --includes) janitor_util.cpp -o janitor_util$(python3-config --extension-suffix)``
2. Rename the resulting ``.so`` file to ``janitor_util.so``
3. Tell Python where to find ``janitor_util.so`` when it imports ``janitor_util``: ``sys.path.append(harness_dir + "/scripts/clean_training_data")``
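To confirm that step 3 worked before importing `Janitor`, a quick check along these lines may help. This is a minimal sketch: the helper name `cpp_extension_available` is hypothetical and not part of the harness; it only uses the standard-library `importlib` machinery to ask whether a module with the given name can be located.

```python
import importlib
import importlib.util
import sys


def cpp_extension_available(search_dir: str, module_name: str = "janitor_util") -> bool:
    """Return True if `module_name` can be located once `search_dir` is on sys.path.

    Hypothetical helper for illustration; it does not import the module,
    it only checks that the import system can find it.
    """
    if search_dir not in sys.path:
        sys.path.append(search_dir)
    # Make sure newly built files are picked up by the import system.
    importlib.invalidate_caches()
    return importlib.util.find_spec(module_name) is not None
```

If this returns `False` for the `scripts/clean_training_data` directory, the compiled extension is likely missing or misnamed.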

In [1]:
import pyarrow.parquet as pq
from pathlib import Path
import pandas as pd
import sys
import datetime
import os
import pyarrow
from tqdm import tqdm
import copy

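# NOTE: __file__ is undefined in a notebook, so the literal string "__file__" is
# resolved against the current working directory instead.
harness_dir = str(Path("__file__").resolve().parents[3] / "lm-evaluation-harness")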
sys.path.append(harness_dir)

sys.path.append(harness_dir + "/scripts/clean_training_data")
from lm_eval.decontamination.janitor import Janitor

os.environ['NUMEXPR_MAX_THREADS'] = '256'
os.environ['NUMEXPR_NUM_THREADS'] = '128'
import numexpr as ne



In [2]:
with open("./tasks_txt_files/arithmetic.txt", "r") as file:
    arithmetic: str = file.read() # 339K
with open("./contaminant.txt", "r") as file:
    contaminant: str = file.read() # 3.4G

In [4]:
data_raw: pyarrow.lib.Table = pq.read_table("/data/tir/projects/tir7/user_data/mchen5/dolma_100B/c4/part_1.arrow")
data_raw_size = sys.getsizeof(data_raw)
print(f"Size of data_raw: {data_raw_size} bytes")

In [3]:
data_mini: pyarrow.lib.Table = pq.read_table("data_mini.arrow")
data_mini_size = sys.getsizeof(data_mini)
print(f"Size of data_mini: {data_mini_size} bytes")

Size of data_mini: 112133501 bytes


In [4]:
df: pd.DataFrame = data_mini.to_pandas()
df["text"] = df["text"].str.encode("utf-8", errors="ignore")
df.head(5)

|   | id | text |
|---|----|------|
| 0 | 09c6eceb562caeba5b94489087fb1e8d | b'TAMPA, Fla., Nov. 03, 2016 (GLOBE NEWSWIRE) ... |
| 1 | 7378e5a823604985555d1d9267827368 | b'It was brimming with midges. Everywhere, the... |
| 2 | 43088e9ab3bdb2236fc493594b99f72f | b'We encourage all our employees to be ambitio... |
| 3 | 14b802b07c5b0685470f5c87fc60e394 | b'The first road assignment is coming this wee... |
| 4 | 954f973826676c5a9421c0286f964bd3 | b"Course to upgrade skills for experienced Hr ... |
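The `errors="ignore"` encoding step silently drops any character that UTF-8 cannot encode, and the resulting values are `bytes`, not `str`. The sketch below uses a made-up sample string (an assumption, not taken from the dataset) to show both effects, including the difference between `str()` on a bytes object and `.decode()`, which matters wherever the column is later passed through `str(row["text"])`.

```python
# Made-up sample: contains a lone surrogate (\udc80) that UTF-8 cannot encode.
text = "caf\u00e9 \udc80 ok"

# errors="ignore" drops the unencodable character instead of raising.
encoded = text.encode("utf-8", errors="ignore")

# str() on bytes returns the repr, including the b'' wrapper and escapes.
as_repr = str(encoded)

# decode() recovers real text (minus the dropped character).
as_text = encoded.decode("utf-8")

print(as_repr)
print(as_text)
```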


In [9]:
def test_decontaminate(contaminant: str, df: pd.DataFrame) -> tuple[Janitor, pd.DataFrame]:
    janitor = Janitor()
    result = df.copy()
    result["num_contaminated"] = 0

    # Time how long it takes to register the contaminant.
    pre_register = datetime.datetime.now()
    print("Registering contaminant")
    janitor.register_contaminant(contaminant)
    registration_time = datetime.datetime.now() - pre_register
    print(f"Registered in {registration_time}")

    print("Decontaminating")
    for index, row in tqdm(df.iterrows(), total=len(df)):
        # NOTE: "text" holds bytes at this point, so str() yields its repr ("b'...'").
        (cleaned, num_contaminated) = janitor.clean_cpp(str(row["text"]))
        # Write with .at; chained result.iloc[index][col] = ... assigns to a temporary copy.
        result.at[index, "num_contaminated"] = num_contaminated
        if num_contaminated != 0:
            result.at[index, "text"] = "".join(cleaned)

    return (janitor, result)
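The registration and decontamination phases are each timed by hand with `datetime.datetime.now()`. One possible way to factor that out is a small context manager, sketched below. The name `timed` and the `results` dictionary are hypothetical, introduced only for illustration; `time.perf_counter` is used here because it is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager
from datetime import timedelta


@contextmanager
def timed(label: str, results: dict):
    """Record the wall-clock duration of the enclosed block under `label`.

    Hypothetical helper: stores a timedelta in `results` and prints it.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = timedelta(seconds=time.perf_counter() - start)
        print(f"{label}: {results[label]}")


# Usage with a stand-in workload in place of the janitor calls.
timings = {}
with timed("registration", timings):
    sum(range(100_000))
```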

    

In [10]:
"""
def test_decontaminate(contaminant: str, output_filename: str):
    print(f"Contaminant size {len(contaminant)}")
    janitor = Janitor(delete_chars="")

    registration_time = datetime.timedelta(hours=0)
    pre_register = datetime.datetime.now()
    print("Registering contaminant")
    janitor.register_contaminant(contaminant)
    registration_time += datetime.datetime.now() - pre_register
    print(f"Registered in {str(registration_time)}")

    decontamination_time = datetime.timedelta(hours=0)
    pre_decontaminate = datetime.datetime.now()
    print("Decontaminating")
    # NOTE: Running clean_cpp throws unicodedecode error; maybe sort this out later
    result = janitor.clean_python(data_string)
    
    decontamination_time += datetime.datetime.now() - pre_decontaminate
    print(f"Decontaminated in {str(decontamination_time)}")

    print(f"Total time: {str(registration_time + decontamination_time)}")
    return janitor
"""

In [10]:
(arithmetic_janitor, df_dedup) = test_decontaminate(arithmetic, df)

Registering contaminant
Registered in 0:00:00.132472
Decontaminating


100%|██████████| 50000/50000 [00:24<00:00, 2024.42it/s]

389





In [12]:
scrolls_janitor = test_decontaminate(scrolls)

Contaminant size 3518772866
Registering contaminant
Registered in 0:03:56.102353
Decontaminating
Decontaminated in 0:00:10.991375
Total time: 0:04:07.093728


For estimating runtimes:
| Folder | # of arrow files |
|--------|-------|
| c4 | 4213499 |
| common-crawl | 510983 |
| gutenberg-books | 1178 |
| peS2o | 20803 |
| stack-code | 103818 |
| wiki-en-simple | 2785999 |
| dolma_100B (total) | 7636280 |


Testing results:
- Deduplicating full part 1 of c4 against arithmetic (C++):
    - 512G RAM (43G used), 1 GPU, 4 CPUs (169% efficiency) - 15 min 40 sec
- Deduplicating 1/10 of part 1 of c4 against arithmetic (Python):
    - 512G RAM (43G used), 1 GPU, 4 CPUs (169% efficiency) - 1 min 35 sec
- Deduplicating full part 2 of c4 against arithmetic (C++):
    - 512G RAM (26G used), 1 GPU, 16 CPUs (14% efficiency) - 16 min 16 sec
- Deduplicating parts 5 - 12 of c4 against arithmetic (C++):
    - 512G RAM (43G used), 1 GPU, 16 CPUs (66% efficiency) - 140 min
        - Average 17.5 min per part
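The per-part average above can be turned into a rough extrapolation. The sketch below is only arithmetic under stated assumptions: it assumes every part takes about as long as the measured ones and that parallel jobs scale perfectly, neither of which has been verified here. The function names are hypothetical, and the number of parts per folder is not given in this notebook, so it is left as an input.

```python
import math


def minutes_per_part(total_minutes: float, num_parts: int) -> float:
    """Average runtime per part from a measured multi-part run."""
    return total_minutes / num_parts


def estimate_wall_minutes(num_parts: int, per_part_minutes: float, parallel_jobs: int = 1) -> float:
    """Estimated wall-clock minutes, assuming equal-cost parts and perfect scaling."""
    batches = math.ceil(num_parts / parallel_jobs)
    return batches * per_part_minutes


# Parts 5 - 12 are 8 parts measured at 140 minutes in total.
per_part = minutes_per_part(140, 8)
print(per_part)
print(estimate_wall_minutes(8, per_part, parallel_jobs=4))
```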

To do:
- Multithread(?) by splitting data into chunks and deduplicating each chunk in parallel
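
The chunking idea in the to-do item could look roughly like the sketch below. It is an illustration only: `split_into_chunks` and `clean_in_parallel` are hypothetical names, the `clean_one` callable stands in for a call such as `janitor.clean_cpp`, and it is an untested assumption that the janitor can be shared safely across threads or that threads give a real speedup for this workload.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def split_into_chunks(items: Sequence, num_chunks: int) -> List[Sequence]:
    """Split `items` into at most `num_chunks` contiguous, near-equal slices."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    size, extra = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        if end > start:
            chunks.append(items[start:end])
        start = end
    return chunks


def clean_in_parallel(texts: Sequence[str], clean_one: Callable[[str], str], num_workers: int = 4) -> List[str]:
    """Apply `clean_one` to every text, one chunk per worker, preserving order."""
    chunks = split_into_chunks(texts, num_workers)

    def clean_chunk(chunk: Sequence[str]) -> List[str]:
        return [clean_one(text) for text in chunk]

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Executor.map returns results in the same order as the input chunks.
        cleaned_chunks = list(pool.map(clean_chunk, chunks))
    return [text for chunk in cleaned_chunks for text in chunk]


# Usage with a stand-in cleaning function.
print(clean_in_parallel(["a", "b", "c", "d", "e"], str.upper, num_workers=2))
```

If the cleaning step turns out to be CPU-bound Python code that holds the GIL, a process pool would be the more likely candidate, at the cost of registering the contaminant once per process.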