# GneissWeb Recipe

#### This notebook presents the GneissWeb recipe and applies the components in sequence to reproduce the GneissWeb processing pipeline using DPK transforms. 
#### ![](recipe3.png)



In [2]:
# !pip install "data-prep-toolkit-transforms[rep_removal]==1.0.1.dev1"

### 0. Read the input parquet file
##### Download a parquet file from HF using the HF download API

In [3]:
# from huggingface_hub import hf_hub_download
# import pandas as pd

# REPO_ID = "HuggingFaceFW/fineweb"
# FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"

# hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")

In [6]:
import os
import urllib.request

In [7]:
# Create the input folder before downloading the sample parquet file into it.
os.makedirs("tmp/input", exist_ok=True)
urllib.request.urlretrieve("https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/universal/rep_removal/test-data/input/test1.parquet", "tmp/input/test1.parquet")
# urllib.request.urlretrieve("https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/language/extreme_tokenized/test-data/input/arrow/test1.arrow", "tmp/input/test1.arrow")

('tmp/input/test1.parquet', <http.client.HTTPMessage at 0x1110dada0>)

#### Pip installations

##### These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements file that includes the right release.

##### Example for transform developers working from git clone:

##### make venv

##### source venv/bin/activate

##### pip install jupyterlab



### 1. Repetition Removal
##### This component applies exact substring deduplication to remove any substring of a predetermined length that occurs more than once within a single parquet file. It adapts the implementation from [deduplicate-text-datasets](https://github.com/google-research/deduplicate-text-datasets).
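
##### The idea can be illustrated with a naive sliding-window sketch. This is not the suffix-array implementation from deduplicate-text-datasets that the transform adapts. The function names, the window length `k`, and the choice to keep the first occurrence and drop later ones are assumptions made for illustration only.

```python
def repeated_spans(text: str, k: int) -> list[tuple[int, int]]:
    """Return merged [start, end) spans covering every length-k window
    that has already appeared earlier in the text."""
    seen = set()
    spans: list[tuple[int, int]] = []
    for i in range(len(text) - k + 1):
        window = text[i:i + k]
        if window in seen:
            # Later occurrence of a known window: extend the previous span
            # if it touches or overlaps, otherwise start a new one.
            if spans and i <= spans[-1][1]:
                spans[-1] = (spans[-1][0], max(spans[-1][1], i + k))
            else:
                spans.append((i, i + k))
        else:
            seen.add(window)
    return spans


def drop_spans(text: str, spans: list[tuple[int, int]]) -> str:
    """Remove the given [start, end) spans from the text."""
    pieces = []
    previous_end = 0
    for start, end in spans:
        pieces.append(text[previous_end:start])
        previous_end = end
    pieces.append(text[previous_end:])
    return "".join(pieces)


sample = "abcdefXabcdefY"
spans = repeated_spans(sample, k=6)
print(spans, drop_spans(sample, spans))
```

##### Storing every window in a set costs memory proportional to the text length times `k`, which is why a suffix-array approach is a better fit for real datasets.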


#### Prerequisites

##### The repetition removal transform requires Rust to be installed on the machine. You can install Rust by following the instructions [here](https://www.rust-lang.org/tools/install).

##### Add Rust to $PATH

##### If Rust is not on your $PATH, follow the steps below to add the Rust installation location so that the transform can run.

##### Use the !whereis cargo command to find where Rust is installed on your machine, and add that location, up to and including /bin, to the path.

##### Example: whereis cargo produces: cargo: /Users/USERNAME/.cargo/bin/cargo

##### Set $PATH to include /Users/USERNAME/.cargo/bin/
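
##### From inside the notebook, the path can also be extended for the current kernel only. The sketch below assumes cargo lives in ~/.cargo/bin; adjust the directory to whatever whereis cargo reports. The helper name `ensure_on_path` is made up for this example.

```python
import os
import shutil


def ensure_on_path(directory: str) -> None:
    """Prepend a directory to PATH for this process if it is not already there."""
    directory = os.path.expanduser(directory)
    parts = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        os.environ["PATH"] = os.pathsep.join([directory] + parts)


# Assumed install location; replace with the directory reported on your machine.
ensure_on_path("~/.cargo/bin")

# Prints the resolved cargo executable, or None if it still cannot be found.
print(shutil.which("cargo"))
```

##### The change only affects the running Python process and the subprocesses it starts; it does not modify your shell configuration.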

In [8]:
from dpk_rep_removal.runtime import RepRemoval

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [9]:
RepRemoval(input_folder= "tmp/input",
            output_folder= "tmp/files-repremoval",
            rep_removal_contents_column_name='text', 
            rep_removal_num_threads='1',
            ).transform()

00:38:28 INFO - pipeline id pipeline_id
00:38:28 INFO - code location None
00:38:28 INFO - data factory data_ is using local data access: input_folder - tmp/input output_folder - tmp/files-repremoval
00:38:28 INFO - data factory data_ max_files -1, n_sample -1
00:38:28 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']

cpu speed: 3228 MHz, Cores: 10
gpu_usage: 0.00%, GPU speed: 0 MHz


INFO:root:running the merge
INFO:root:merging complete
[1m[32m    Updating[0m crates.io index
[1m[32m   Compiling[0m libc v0.2.169
[1m[32m   Compiling[0m version_check v0.9.5
[1m[32m   Compiling[0m proc-macro2 v1.0.93
[1m[32m   Compiling[0m either v1.13.0
[1m[32m   Compiling[0m unicode-ident v1.0.16
[1m[32m   Compiling[0m shlex v1.3.0
[1m[32m   Compiling[0m glob v0.3.2
[1m[32m   Compiling[0m syn v1.0.109
[1m[32m   Compiling[0m autocfg v1.4.0
[1m[32m   Compiling[0m zstd-safe v2.0.6+zstd.1.4.7
[1m[32m   Compiling[0m itertools v0.9.0
[1m[32m   Compiling[0m heck v0.4.1
[1m[32m   Compiling[0m os_str_bytes v6.6.1
[1m[32m   Compiling[0m proc-macro-error-attr v1.0.4
[1m[32m   Compiling[0m proc-macro-error v1.0.4
[1m[32m   Compiling[0m hashbrown v0.12.3
[1m[32m   Compiling[0m indexmap v1.9.3
[1m[32m   Compiling[0m clap_lex v0.2.4
[1m[32m   Compiling[0m textwrap v0.16.1
[1m[32m   Compiling[0m bitflags v1.3.2
[1m[32m   Compiling[0m

Start load!
0 / 19909 
Duplicates found: 7250
Total time taken: 2ms


00:38:43 INFO - Completed 1 files (100.0%) in 0.248 min
00:38:43 INFO - Done processing 1 files, waiting for flush() completion.
00:38:43 INFO - done flushing in 0.0 sec
00:38:43 INFO - Completed execution in 0.248 min, execution result 0


0

### 2. Annotation


### 2.1. fastText Quality Annotator
##### This step annotates the documents using two fastText quality classifiers: (i) the fastText classifier from [DCLM](https://arxiv.org/pdf/2406.11794) and (ii) our own fastText classifier trained on a mix of high-quality synthetic data and data annotated by an LLM for high educational value.

### 2.2. Readability Scores Quality Annotator
##### This transform calculates the McAlpine-EFLAW readability score for each document in the output parquet file from the previous step and adds a McAlpine-EFLAW readability score column to the data.

##### The McAlpine-EFLAW readability score of a document is the number of words plus the number of mini-words (words of ≤ 3 characters), divided by the number of sentences. A lower score means the document is easier to understand for a reader with English as a foreign language.
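
##### As a worked illustration of the formula, the sketch below computes the score with deliberately simple regex-based word and sentence splitting. The splitting rules and the function name `mcalpine_eflaw` are assumptions for this example; the transform may tokenize text differently and therefore produce different values.

```python
import re


def mcalpine_eflaw(text: str) -> float:
    """(words + mini-words) / sentences, using naive regex splitting."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Mini-words are words of at most 3 characters.
    mini_words = [w for w in words if len(w) <= 3]
    # Treat runs of '.', '!' or '?' as sentence boundaries.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return (len(words) + len(mini_words)) / len(sentences)


# 9 words, 8 of them mini-words, 2 sentences: (9 + 8) / 2
print(mcalpine_eflaw("The cat sat on the mat. It was happy."))  # 8.5
```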

### 2.3. Extreme-Tokenized Annotator
##### This annotator retrieves the tokens generated for a set of documents. Then, for each document, it calculates the size and the total number of characters. The number of tokens is divided by the size and by the number of characters, and the resulting values are stored in two columns (tokens_per_doc_size and tokens_per_doc_num_chars).

##### The annotator transform annotates the input table with 5 columns:

###### 1. doc_num_tokens - number of tokens for each document
###### 2. doc_size_kbs - document size in KB
###### 3. doc_num_chars - number of characters in the document
###### 4. tokens_per_doc_size - ratio between number of tokens and document size
###### 5. tokens_per_doc_num_chars - ratio between number of tokens and number of characters in document
##### Documents with extremely high or low number of tokens per character (or tokens per byte) are identified as extreme-tokenized documents and can be excluded in the filtering step.
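
##### The five columns can be sketched for a single document as follows. The token count is taken as an input because the tokenizer is outside the scope of this example. Measuring the size as UTF-8 bytes divided by 1024 is an assumption, and the function name `annotate_tokenization` is made up for illustration.

```python
def annotate_tokenization(text: str, num_tokens: int) -> dict:
    """Build the five annotation values for one document."""
    num_chars = len(text)
    # Assumption: document size is the UTF-8 byte length expressed in KB.
    size_kbs = len(text.encode("utf-8")) / 1024
    return {
        "doc_num_tokens": num_tokens,
        "doc_size_kbs": size_kbs,
        "doc_num_chars": num_chars,
        "tokens_per_doc_size": num_tokens / size_kbs if size_kbs else 0.0,
        "tokens_per_doc_num_chars": num_tokens / num_chars if num_chars else 0.0,
    }


# A 2048-character ASCII document that a tokenizer split into 512 tokens.
print(annotate_tokenization("a" * 2048, num_tokens=512))
```

##### A document whose ratios fall far outside the range seen for typical text would then be a candidate for exclusion in the filtering step.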



### 2.4. Document Category Classifiers
##### This step annotates the documents using four fastText category classifiers:
######   1. Science
######   2. Education
######   3. Technology & computing
######   4. Medical health

### 3. Ensemble Quality Filter
##### This step filters out low-quality documents from the input data by combining multiple quality annotators with the category information of each document.
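
##### To make the idea of an ensemble rule concrete, here is a purely hypothetical sketch. The column names, thresholds, and the rule itself are all invented for illustration and are not the GneissWeb filtering rule; the only point is to show how several annotations and a category label can be combined into one keep/drop decision.

```python
def keep_document(doc: dict, rules: dict) -> bool:
    """Hypothetical ensemble rule; every name and threshold is illustrative."""
    # Category-specific limits fall back to the "default" entry.
    limits = rules.get(doc.get("category"), rules["default"])
    # Pass the quality check if either classifier score is high enough.
    good_quality = max(doc["quality_score_a"], doc["quality_score_b"]) >= limits["min_quality"]
    readable = doc["readability"] <= limits["max_readability"]
    normal_tokens = (
        limits["min_tokens_per_char"]
        <= doc["tokens_per_doc_num_chars"]
        <= limits["max_tokens_per_char"]
    )
    return good_quality and readable and normal_tokens


# Made-up thresholds: science documents get more lenient limits.
rules = {
    "default": {"min_quality": 0.5, "max_readability": 30,
                "min_tokens_per_char": 0.1, "max_tokens_per_char": 0.5},
    "science": {"min_quality": 0.3, "max_readability": 45,
                "min_tokens_per_char": 0.1, "max_tokens_per_char": 0.5},
}

doc = {"category": "science", "quality_score_a": 0.3, "quality_score_b": 0.4,
       "readability": 40, "tokens_per_doc_num_chars": 0.25}
print(keep_document(doc, rules))
```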