This notebook performs a few basic checks on the output of dedup.py.

In [33]:
import os
import sys
import pandas as pd

ORIGINAL_DIR = "/data/tir/projects/tir7/user_data/mchen5/dolma_100B"
DEDUPED_DIR = "/data/tir/projects/tir7/user_data/mchen5/dolma_100B_deduped_3"
domains = [
    "c4",
    "common-crawl",
    "peS2o",
    "gutenberg-books",
    "stack-code",
    "wiki-en-simple",
]


sys.path.append(
    "/data/tir/projects/tir7/user_data/mchen5/llm-pretraining-behaviours/lm-evaluation-harness"
)
from lm_eval.decontamination.janitor import Janitor

Traceback (most recent call last):
  File "/data/tir/projects/tir7/user_data/mchen5/llm-pretraining-behaviours/lm-evaluation-harness/lm_eval/decontamination/janitor.py", line 11, in <module>
    import janitor_util
ModuleNotFoundError: No module named 'janitor_util'


First, the filenames in the original and deduped directories should match.

In [24]:
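original_file_names = {}
deduped_file_names = {}

# os.walk yields one (dirpath, dirnames, filenames) tuple per directory, top level first,
# so each value is a list of sorted filename lists and v[0] holds the domain's top-level files.
for domain in domains:
    original_file_names[domain] = [
        sorted(files) for _, _, files in os.walk(f"{ORIGINAL_DIR}/{domain}")
    ]
    deduped_file_names[domain] = [
        sorted(files) for _, _, files in os.walk(f"{DEDUPED_DIR}/{domain}")
    ]

# Per-domain count of top-level files, then an exact comparison of all filenames.
print("Original files: ", {k: len(v[0]) for k, v in original_file_names.items()})
print("Deduped files: ", {k: len(v[0]) for k, v in deduped_file_names.items()})
print(original_file_names == deduped_file_names)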


Original files:  {'c4': 26, 'common-crawl': 92, 'peS2o': 5, 'gutenberg-books': 13, 'stack-code': 12, 'wiki-en-simple': 4}
Deduped files:  {'c4': 26, 'common-crawl': 92, 'peS2o': 5, 'gutenberg-books': 13, 'stack-code': 12, 'wiki-en-simple': 4}


True
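
The comparison above only reports `True` or `False`. If it ever fails, it helps to know which files differ. The following is a minimal sketch of that idea, run on throwaway temporary directories rather than the real data. The helper names `relative_files` and `compare_dirs` are hypothetical and not part of dedup.py.

```python
import os
import tempfile
from typing import Set, Tuple


def relative_files(root: str) -> Set[str]:
    """Return every file path under root, expressed relative to root."""
    found = set()
    for dirpath, _, files in os.walk(root):
        for name in files:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def compare_dirs(original_root: str, deduped_root: str) -> Tuple[Set[str], Set[str]]:
    """Return (files missing from deduped, files present only in deduped)."""
    original = relative_files(original_root)
    deduped = relative_files(deduped_root)
    return original - deduped, deduped - original


# Tiny demo on temporary directories (hypothetical layout, not the real dolma paths).
with tempfile.TemporaryDirectory() as tmp:
    original_root = os.path.join(tmp, "original")
    deduped_root = os.path.join(tmp, "deduped")
    layout = [(original_root, ["a.jsonl", "b.jsonl"]), (deduped_root, ["a.jsonl"])]
    for root, names in layout:
        os.makedirs(os.path.join(root, "c4"))
        for name in names:
            open(os.path.join(root, "c4", name), "w").close()

    missing, extra = compare_dirs(original_root, deduped_root)
    print("Missing from deduped:", sorted(missing))
    print("Only in deduped:", sorted(extra))
```

Comparing sets of relative paths also covers files in nested subdirectories and does not depend on the order in which `os.walk` visits them.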

Also, both directories should take roughly the same amount of space for each domain, since decontamination is expected to remove only a small fraction of the text.

In [31]:
!du -sh /data/tir/projects/tir7/user_data/mchen5/dolma_100B/*

33G	/data/tir/projects/tir7/user_data/mchen5/dolma_100B/c4
123G	/data/tir/projects/tir7/user_data/mchen5/dolma_100B/common-crawl
4.7G	/data/tir/projects/tir7/user_data/mchen5/dolma_100B/gutenberg-books
15G	/data/tir/projects/tir7/user_data/mchen5/dolma_100B/peS2o
18G	/data/tir/projects/tir7/user_data/mchen5/dolma_100B/stack-code
4.2G	/data/tir/projects/tir7/user_data/mchen5/dolma_100B/wiki-en-simple


In [30]:
!du -sh /data/tir/projects/tir7/user_data/mchen5/dolma_100B_deduped_3/*

33G	/data/tir/projects/tir7/user_data/mchen5/dolma_100B_deduped_3/c4
123G	/data/tir/projects/tir7/user_data/mchen5/dolma_100B_deduped_3/common-crawl
4.6G	/data/tir/projects/tir7/user_data/mchen5/dolma_100B_deduped_3/gutenberg-books
15G	/data/tir/projects/tir7/user_data/mchen5/dolma_100B_deduped_3/peS2o
18G	/data/tir/projects/tir7/user_data/mchen5/dolma_100B_deduped_3/stack-code
4.2G	/data/tir/projects/tir7/user_data/mchen5/dolma_100B_deduped_3/wiki-en-simple
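
The two `du` listings above have to be compared by eye. The sketch below shows one way the same check could be automated in Python. It is an illustration under assumptions: `dir_size_bytes`, `size_ratio`, and the `MIN_RATIO` tolerance are hypothetical names and values, and `os.path.getsize` measures apparent file size, which can differ slightly from the disk usage that `du` reports.

```python
import os
import tempfile


def dir_size_bytes(root: str) -> int:
    """Sum the apparent sizes of all files under root."""
    total = 0
    for dirpath, _, files in os.walk(root):
        for name in files:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def size_ratio(original_root: str, deduped_root: str) -> float:
    """Deduped size divided by original size (1.0 means unchanged)."""
    original = dir_size_bytes(original_root)
    if original == 0:
        raise ValueError(f"no data found under {original_root}")
    return dir_size_bytes(deduped_root) / original


# Hypothetical tolerance: flag a domain if it shrank by more than 10%.
MIN_RATIO = 0.9

# Tiny demo on temporary directories (not the real dolma paths).
with tempfile.TemporaryDirectory() as tmp:
    original_root = os.path.join(tmp, "original")
    deduped_root = os.path.join(tmp, "deduped")
    for root, num_bytes in [(original_root, 1000), (deduped_root, 950)]:
        os.makedirs(root)
        with open(os.path.join(root, "shard.jsonl"), "wb") as f:
            f.write(b"x" * num_bytes)

    ratio = size_ratio(original_root, deduped_root)
    status = "OK" if MIN_RATIO <= ratio <= 1.0 else "CHECK"
    print(f"deduped/original = {ratio:.2f} ({status})")
```

A ratio above 1.0 is flagged as well, because removing text should never make a domain larger.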


Here's a simple demo of decontamination, using the same method as in dedup.py.

In [62]:
# Make janitor, register contaminant
with open(
    "/data/tir/projects/tir7/user_data/mchen5/llm-pretraining-behaviours/dolma_data_processing/decontamination/dedup/contaminant_mini.txt",
    "r",
) as file:
    contaminant: str = file.read()
janitor = Janitor()
janitor.register_contaminant(contaminant)


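def decontaminate(df: pd.DataFrame) -> pd.DataFrame:
    """Clean each row's text in place and record the outcome.

    Adds `num_contaminated` (the count reported by the janitor) and `thrown`
    (True if cleaning raised an exception). Returns the same DataFrame.
    """
    df["num_contaminated"] = 0
    df["thrown"] = False

    for index, row in df.iterrows():
        try:
            cleaned, num_contaminated = janitor.clean_python(row["text"])
            df.at[index, "num_contaminated"] = num_contaminated
            if num_contaminated != 0:
                # clean_python returns the surviving chunks; join them back into one string.
                df.at[index, "text"] = "".join(cleaned)
        except Exception:
            # Flag rows the janitor could not process instead of aborting the whole run.
            df.at[index, "thrown"] = True

    return df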


clean_test_1 = "Artificial intelligence is the intelligence of machines or software, as opposed to the intelligence of living beings, primarily of humans. It is a field of study in computer science that develops and studies intelligent machines. Such machines may be called AIs."
clean_test_2 = "早上好中国 现在我有冰淇淋 我很喜欢冰淇淋 但是 速度与激情9 比冰淇淋 速度与激情 速度与激情9 我最喜欢 所以…现在是音乐时间 准备 1 2 3 两个礼拜以后 速度与激情9 *3 不要忘记 不要错过 记得去电影院看速度与激情9 因为非常好电影 动作非常好 差不多一样冰淇淋 再见"
dirty_test_1 = """
<Problem ID="nluds-0001" Grade="1" Source="http://www.k5learning.com">
		<Body>Seven red apples and two green apples are in the basket.</Body>
		<Question>How many apples are in the basket?</Question>
		<Solution-Type>Addition</Solution-Type>
 		<Answer>9 (apples)</Answer>
THIS IS A DIRTY STRING THIS IS A DIRTY STRING THIS IS A DIRTY STRING THIS IS A DIRTY STRING
		<Formula>7+2=9</Formula>
	</Problem>
"""

df = pd.DataFrame([clean_test_1, clean_test_2, dirty_test_1], columns=["text"])
decontaminate(df)



   text                                                num_contaminated  thrown
0  Artificial intelligence is the intelligence of...  0                 False
1  早上好中国 现在我有冰淇淋 我很喜欢冰淇淋 但是 速度与激情9 比冰淇淋 速度与激情 速度与...  0                 False
2                                                     2                 False
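
For intuition about what `num_contaminated` measures, here is a toy sketch of overlap detection, assuming contamination is identified through word n-grams shared with the registered contaminant. It is not Janitor's implementation: the function names, the normalization, and the choice of `n=5` are hypothetical and chosen only to keep the example small.

```python
import string
from typing import List, Set, Tuple


def normalize(text: str) -> List[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_ngrams(words: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of all contiguous n-word windows."""
    return {tuple(words[i : i + n]) for i in range(len(words) - n + 1)}


def count_shared_ngrams(document: str, contaminant: str, n: int = 5) -> int:
    """Count distinct n-grams that appear in both document and contaminant."""
    doc_ngrams = word_ngrams(normalize(document), n)
    contaminant_ngrams = word_ngrams(normalize(contaminant), n)
    return len(doc_ngrams & contaminant_ngrams)


contaminant_demo = "Seven red apples and two green apples are in the basket."
dirty_demo = "Question: Seven red apples and two green apples are in the basket. How many?"
clean_demo = "Artificial intelligence is the intelligence of machines or software."

dirty_count = count_shared_ngrams(dirty_demo, contaminant_demo)
clean_count = count_shared_ngrams(clean_demo, contaminant_demo)
print("dirty:", dirty_count, "clean:", clean_count)
```

A document that shares no n-grams with the contaminant gets a count of zero and would be left unchanged, which matches the first two rows of the table above.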
