# GneissWeb Recipe

#### To help reproduce GneissWeb, this notebook presents the GneissWeb recipe and applies its components in sequence, using Data Prep Kit (DPK) transforms to rebuild the GneissWeb processing pipeline.
<br>

#### Owner:  IBM Research
<br>


### **An Overview of the GneissWeb Recipe**
#### The GneissWeb dataset was obtained by applying the following processing steps:
#### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Exact substring deduplication at line level (Sec. 1)
#### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Quality annotators (Sec. 2): 
#### &nbsp;&nbsp;&nbsp;&nbsp;&emsp;&emsp;&nbsp; - Custom built fastText Quality Classifier (Sec. 2.1)
#### &nbsp;&nbsp;&nbsp;&nbsp;&emsp;&emsp;&nbsp; - Custom built fastText Category Classifiers (Sec. 2.2)
#### &nbsp;&nbsp;&nbsp;&nbsp;&emsp;&emsp;&nbsp; - Custom built Readability Score Quality Annotator (Sec. 2.3)
#### &nbsp;&nbsp;&nbsp;&nbsp;&emsp;&emsp;&nbsp; - Custom built Extreme-Tokenized-Documents Quality Annotator (Sec. 2.4)
#### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Category-aware Ensemble Quality Filter (Sec. 3)

####  These were applied in the order shown in the Figure.
#### &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ![](GneissWeb_recipe_new.png)

#### Please refer to the [GneissWeb](https://huggingface.co/datasets/ibm-granite/GneissWeb) dataset page for more details.
<br>
    


### pip installations

##### Adapt these pip installs to the appropriate release level. Alternatively, pre-configure the venv that runs JupyterLab with a requirements.txt file that pins the right release.


##### Example for transform developers working from a data-prep-kit git clone:

##### cd data-prep-kit/examples/notebooks/GneissWeb/

##### python -m venv venv

##### source venv/bin/activate

##### pip install --no-cache "data-prep-toolkit-transforms[rep_removal, readability, extreme_tokenized, filter, tokenization]==1.0.1.dev1"

##### pip install jupyterlab

##### venv/bin/jupyter lab


### 0. Read the input parquet file


### 0.1. Download a parquet file 
#### Download a parquet file from HF using the HF download API

In [3]:
from huggingface_hub import hf_hub_download
import pandas as pd

REPO_ID = "HuggingFaceFW/fineweb"
FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"

file1 = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")




### 0.2. Resize the parquet file
#### Resize the file to a smaller size for testing purposes

In [4]:
import os

from dpk_resize.runtime import Resize
Resize(input_folder= os.path.dirname(file1),
        output_folder= "input",
        resize_max_mbytes_per_table= 200).transform()

23:59:31 INFO - Split file parameters are : {'max_rows_per_table': -1, 'max_mbytes_per_table': 200.0, 'size_type': 'disk'}
23:59:31 INFO - pipeline id pipeline_id
23:59:31 INFO - code location None
23:59:31 INFO - data factory data_ is using local data access: input_folder - /Users/hajaremami/.cache/huggingface/hub/datasets--HuggingFaceFW--fineweb/snapshots/0f039043b23fe1d4eed300b504aa4b4a68f1c7ba/data/CC-MAIN-2013-20 output_folder - input
23:59:31 INFO - data factory data_ max_files -1, n_sample -1
23:59:31 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
23:59:31 INFO - orchestrator resize started at 2025-02-15 23:59:31
23:59:31 INFO - Number of files is 1, source profile {'max_file_size': 2048.0454998016357, 'min_file_size': 2048.0454998016357, 'total_file_size': 2048.0454998016357}
00:00:08 INFO - Completed 1 files (100.0%) in 0.623 min
00:00:08 INFO - Done processing 1 f

0


## 1. Exact substring deduplication
##### This component applies exact substring deduplication: it removes any substring of a predetermined length that occurs more than once within a single parquet file. It adapts the implementation from [deduplicate-text-datasets](https://github.com/google-research/deduplicate-text-datasets).

<!--  -->
#### Prerequisites

##### The repetition removal transform requires Rust to be installed on the machine. You can install Rust by following the instructions [here](https://www.rust-lang.org/tools/install).

##### Add Rust to $PATH

##### If Rust is not on your $PATH, follow the steps below to add the Rust installation location so that the transform can run properly.

##### Use the !whereis cargo command to find where Rust is installed on your machine, then add the directory containing the cargo binary (the path up to and including /bin) to $PATH.

##### Example: whereis cargo produces: cargo: /Users/USERNAME/.cargo/bin/cargo

##### In that case, set $PATH to include /Users/USERNAME/.cargo/bin/
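
From inside the notebook, the same PATH adjustment can be made in Python before running the transform. The sketch below is only an illustration and not part of DPK: `prepend_to_path` is a hypothetical helper, and `~/.cargo/bin` is an assumed install location that should be replaced with whatever `whereis cargo` reports on your machine.

```python
import os
import shutil
from pathlib import Path


def prepend_to_path(path_value: str, directory: str) -> str:
    """Return path_value with directory placed first, without duplicating it."""
    entries = [
        entry
        for entry in path_value.split(os.pathsep)
        if entry and entry != directory
    ]
    return os.pathsep.join([directory, *entries])


# Assumed install location; replace with the directory reported by `whereis cargo`.
cargo_bin = str(Path.home() / ".cargo" / "bin")

# Only touch PATH when cargo cannot already be found.
if shutil.which("cargo") is None:
    os.environ["PATH"] = prepend_to_path(os.environ.get("PATH", ""), cargo_bin)
```

The change applies only to the current Python process and anything it launches, so it needs to run in the notebook session before the repetition removal cell.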

#### Input Parameters

The transform can be initialized with the following parameters:

| Parameter                          | Default                            | Description                                       |
|------------------------------------|------------------------------------|---------------------------------------------------|
| `rep_removal_contents_column_name` | `contents`                         | Name of the column holding the document contents  |
| `rep_removal_length_thresh`        | `50`                               | Length threshold for processing                   |
| `rep_removal_frequency_threshold`  | `1`                                | Frequency threshold for processing                |
| `rep_removal_retain_first_copy`    | `True`                             | Boolean value for whether to retain first copy    |
| `rep_removal_num_threads`          | `psutil.cpu_count(logical=False)`  | Value for number of threads to use for processing |


#### Output Format

The output format will be a new parquet file with the repeated sequence(s) removed.
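
To make the effect of the length threshold and of retaining the first copy concrete, here is a deliberately naive sketch. It is not the transform's algorithm (the transform builds suffix arrays, as its logs show); it simply remembers every fixed-length character window it has seen and drops characters covered by a window that appeared earlier. The function name and the character-level windows are assumptions made for illustration.

```python
def remove_repeated_substrings(docs, length_thresh=50):
    """Drop text covered by any window of length_thresh characters seen before.

    The first occurrence of each window is kept (retain-first-copy behaviour);
    later occurrences are removed from the documents in which they reappear.
    """
    seen = set()
    cleaned = []
    for text in docs:
        drop = [False] * len(text)
        for start in range(len(text) - length_thresh + 1):
            window = text[start:start + length_thresh]
            if window in seen:
                # Mark every character of the repeated window for removal.
                for i in range(start, start + length_thresh):
                    drop[i] = True
            else:
                seen.add(window)
        cleaned.append("".join(ch for ch, d in zip(text, drop) if not d))
    return cleaned


docs = ["ABCDEFGHIJKLMNOP first", "ABCDEFGHIJKLMNOP second"]
cleaned = remove_repeated_substrings(docs, length_thresh=8)
```

In this toy input the shared prefix survives in the first document and is stripped from the second, which is the behaviour the retain-first-copy option describes. A set of windows grows far too large for real corpora, which is one reason a suffix-array implementation is used in practice.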


In [7]:
%%time
from dpk_rep_removal.runtime import RepRemoval

RepRemoval(input_folder= "input",
            output_folder= "tmp/repRemoval",
            rep_removal_contents_column_name='text', 
            rep_removal_num_threads='10',
            ).transform()

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
00:00:10 INFO - pipeline id pipeline_id
00:00:10 INFO - code location None
00:00:10 INFO - data factory data_ is using local data access: input_folder - input output_folder - tmp/repRemoval
00:00:10 INFO - data factory data_ max_files -1, n_sample -1
00:00:10 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:00:10 INFO - orchestrator rep_removal started at 2025-02-16 00:00:10
00:00:10 INFO - Number of files is 9, source profile {'max_file_size': 166.99776935577393, 'min_file_size': 104.17425632476807, 'total_file_size': 1437.520227432251}
00:00:21 INFO - encoding parquet
00:01:00 INFO - making suffix array
00:01:00 INFO - Starting the deduplication process for file: /var/folders/f3/5zmfvg4j539bhmnsxzqmbc2h0000gn/T/tmp

cpu speed: 3228 MHz, Cores: 10
gpu_usage: 0.00%, GPU speed: 0 MHz


00:01:04 INFO - Creating part: 127653705-170204940
00:01:04 INFO - Creating part: 42551235-85202470
00:01:04 INFO - Creating part: 0-42651235
00:01:04 INFO - Creating part: 85102470-127753705
00:01:13 INFO - Checking file integrity...
00:01:13 INFO - Merging suffix trees...
00:01:38 INFO - Merge successful.
00:01:38 INFO - Final cleanup and verification...
00:01

Start load!
0 / 17020493 


00:01:51 INFO - collecting duplicates


Duplicates found: 1401249
Total time taken: 818ms


00:02:00 INFO - Num Duplicate Rows: 21906
00:02:02 INFO - Completed 1 files (11.11%) in 1.7 min
00:02:03 INFO - encoding parquet
00:02:43 INFO - making suffix array
00:02:43 INFO - Starting the deduplication process for file: /var/folders/f3/5zmfvg4j539bhmnsxzqmbc2h0000gn/T/tmprwwyvbph/save_dir/parquet
00:02:43 INFO - timeout is: 5347.551239157373
00:02:43 INFO - Scheduling 4 jobs to create dataset parts.


cpu speed: 3228 MHz, Cores: 10
gpu_usage: 0.00%, GPU speed: 0 MHz


00:02:47 INFO - Creating part: 85341076-128111614
00:02:47 INFO - Creating part: 42670538-85441076
00:02:47 INFO - Creating part: 128011614-170682154
00:02:47 INFO - Creating part: 0-42770538
00:02:56 INFO - Checking file integrity...
00:02:56 INFO - Merging suffix trees...
00:03:22 INFO - Merge successful.
00:03:22 INFO - Final cleanup and verification...
00:03

Start load!
0 / 17068214 


00:03:24 INFO - collecting duplicates


Duplicates found: 1297013
Total time taken: 890ms


00:03:33 INFO - Num Duplicate Rows: 21418
00:03:35 INFO - Completed 2 files (22.22%) in 3.25 min
00:03:36 INFO - encoding parquet
00:04:16 INFO - making suffix array
00:04:16 INFO - Starting the deduplication process for file: /var/folders/f3/5zmfvg4j539bhmnsxzqmbc2h0000gn/T/tmpibb04qoy/save_dir/parquet
00:04:16 INFO - timeout is: 5346.1872366790585
00:04:16 INFO - Scheduling 4 jobs to create dataset parts.


cpu speed: 3228 MHz, Cores: 10
gpu_usage: 0.00%, GPU speed: 0 MHz


00:04:20 INFO - Creating part: 127978593-170638124
00:04:20 INFO - Creating part: 0-42759531
00:04:20 INFO - Creating part: 85319062-128078593
00:04:20 INFO - Creating part: 42659531-85419062
00:04:28 INFO - Checking file integrity...
00:04:28 INFO - Merging suffix trees...
00:04:54 INFO - Merge successful.
00:04:54 INFO - Final cleanup and verification...
00:04

Start load!
0 / 17063811 


00:04:55 INFO - collecting duplicates


Duplicates found: 1328924
Total time taken: 858ms


00:05:04 INFO - Num Duplicate Rows: 20805
00:05:06 INFO - Completed 3 files (33.33%) in 4.764 min
00:05:07 INFO - encoding parquet
00:05:47 INFO - making suffix array
00:05:47 INFO - Starting the deduplication process for file: /var/folders/f3/5zmfvg4j539bhmnsxzqmbc2h0000gn/T/tmp0elat747/save_dir/parquet
00:05:47 INFO - timeout is: 5359.475712515489
00:05:47 INFO - Scheduling 4 jobs to create dataset parts.


cpu speed: 3228 MHz, Cores: 10
gpu_usage: 0.00%, GPU speed: 0 MHz


00:05:51 INFO - Creating part: 42766769-85633538
00:05:51 INFO - Creating part: 0-42866769
00:05:51 INFO - Creating part: 85533538-128400307
00:05:51 INFO - Creating part: 128300307-171067076
00:05:59 INFO - Checking file integrity...
00:05:59 INFO - Merging suffix trees...
00:06:25 INFO - Merge successful.
00:06:25 INFO - Final cleanup and verification...
00:06

Start load!
0 / 17106706 


00:06:27 INFO - collecting duplicates


Duplicates found: 1256571
Total time taken: 864ms


00:06:36 INFO - Num Duplicate Rows: 19782
00:06:38 INFO - Completed 4 files (44.44%) in 6.298 min
00:06:39 INFO - encoding parquet
00:07:18 INFO - making suffix array
00:07:18 INFO - Starting the deduplication process for file: /var/folders/f3/5zmfvg4j539bhmnsxzqmbc2h0000gn/T/tmpikdhq87w/save_dir/parquet
00:07:18 INFO - timeout is: 5357.358921933085
00:07:18 INFO - Scheduling 4 jobs to create dataset parts.


cpu speed: 3228 MHz, Cores: 10
gpu_usage: 0.00%, GPU speed: 0 MHz


00:07:22 INFO - Creating part: 85499372-128349058
00:07:22 INFO - Creating part: 128249058-170998746
00:07:22 INFO - Creating part: 0-42849686
00:07:22 INFO - Creating part: 42749686-85599372
00:07:31 INFO - Checking file integrity...
00:07:31 INFO - Merging suffix trees...
00:07:57 INFO - Merge successful.
00:07:57 INFO - Final cleanup and verification...
00:07

Start load!
0 / 17099873 


00:07:59 INFO - collecting duplicates


Duplicates found: 1166360
Total time taken: 876ms


00:08:08 INFO - Num Duplicate Rows: 19066
00:08:10 INFO - Completed 5 files (55.56%) in 7.824 min
00:08:10 INFO - encoding parquet
00:08:51 INFO - making suffix array
00:08:51 INFO - Starting the deduplication process for file: /var/folders/f3/5zmfvg4j539bhmnsxzqmbc2h0000gn/T/tmpj_btrb32/save_dir/parquet
00:08:51 INFO - timeout is: 5372.160037174721
00:08:51 INFO - Scheduling 4 jobs to create dataset parts.


cpu speed: 3228 MHz, Cores: 10
gpu_usage: 0.00%, GPU speed: 0 MHz


00:08:54 INFO - Creating part: 42869131-85838262
00:08:54 INFO - Creating part: 0-42969131
00:08:54 INFO - Creating part: 85738262-128707393
00:08:54 INFO - Creating part: 128607393-171476526
00:09:03 INFO - Checking file integrity...
00:09:03 INFO - Merging suffix trees...
00:09:29 INFO - Merge successful.
00:09:29 INFO - Final cleanup and verification...
00:09

Start load!
0 / 17147651 


00:09:31 INFO - collecting duplicates


Duplicates found: 1204603
Total time taken: 886ms


00:09:40 INFO - Num Duplicate Rows: 19049
00:09:42 INFO - Completed 6 files (66.67%) in 9.363 min
00:09:43 INFO - encoding parquet
00:10:23 INFO - making suffix array
00:10:23 INFO - Starting the deduplication process for file: /var/folders/f3/5zmfvg4j539bhmnsxzqmbc2h0000gn/T/tmpptikhbme/save_dir/parquet
00:10:23 INFO - timeout is: 5364.340644361834
00:10:23 INFO - Scheduling 4 jobs to create dataset parts.


cpu speed: 3228 MHz, Cores: 10
gpu_usage: 0.00%, GPU speed: 0 MHz


00:10:27 INFO - Creating part: 0-42906029
00:10:27 INFO - Creating part: 128418087-171224116
00:10:27 INFO - Creating part: 42806029-85712058
00:10:27 INFO - Creating part: 85612058-128518087
00:10:36 INFO - Checking file integrity...
00:10:36 INFO - Merging suffix trees...
00:11:02 INFO - Merge successful.
00:11:02 INFO - Final cleanup and verification...
00:11

Start load!
0 / 17122410 


00:11:03 INFO - collecting duplicates


Duplicates found: 1220217
Total time taken: 822ms


00:11:12 INFO - Num Duplicate Rows: 19826
00:11:14 INFO - Completed 7 files (77.78%) in 10.901 min
00:11:15 INFO - encoding parquet
00:11:55 INFO - making suffix array
00:11:55 INFO - Starting the deduplication process for file: /var/folders/f3/5zmfvg4j539bhmnsxzqmbc2h0000gn/T/tmpmh1aghjl/save_dir/parquet
00:11:55 INFO - timeout is: 5368.35092936803
00:11:55 INFO - Scheduling 4 jobs to create dataset parts.


cpu speed: 3228 MHz, Cores: 10
gpu_usage: 0.00%, GPU speed: 0 MHz


00:11:59 INFO - Creating part: 0-42938392
00:11:59 INFO - Creating part: 128515176-171353568
00:11:59 INFO - Creating part: 42838392-85776784
00:11:59 INFO - Creating part: 85676784-128615176
00:12:08 INFO - Checking file integrity...
00:12:08 INFO - Merging suffix trees...
00:12:34 INFO - Merge successful.
00:12:34 INFO - Final cleanup and verification...
00:12

Start load!
0 / 17135355 


00:12:36 INFO - collecting duplicates


Duplicates found: 1193033
Total time taken: 899ms


00:12:44 INFO - Num Duplicate Rows: 19341
00:12:47 INFO - Completed 8 files (88.89%) in 12.437 min
00:12:47 INFO - encoding parquet
00:13:13 INFO - making suffix array
00:13:13 INFO - Starting the deduplication process for file: /var/folders/f3/5zmfvg4j539bhmnsxzqmbc2h0000gn/T/tmpa7znwhth/save_dir/parquet
00:13:13 INFO - timeout is: 3375.1241635687734
00:13:13 INFO - Scheduling 4 jobs to create dataset parts.


cpu speed: 3228 MHz, Cores: 10
gpu_usage: 0.00%, GPU speed: 0 MHz


00:13:17 INFO - Creating part: 80259156-107012208
00:13:17 INFO - Creating part: 26753052-53606104
00:13:17 INFO - Creating part: 53506104-80359156
00:13:17 INFO - Creating part: 0-26853052
00:13:22 INFO - Checking file integrity...
00:13:22 INFO - Merging suffix trees...
00:13:37 INFO - Merge successful.
00:13:37 INFO - Final cleanup and verification...
00:13:3

Start load!
0 / 10701219 


00:13:38 INFO - collecting duplicates


Duplicates found: 686798
Total time taken: 557ms


00:13:44 INFO - Num Duplicate Rows: 11030
00:13:45 INFO - Completed 9 files (100.0%) in 13.414 min
00:13:45 INFO - Done processing 9 files, waiting for flush() completion.
00:13:45 INFO - done flushing in 0.0 sec
00:13:45 INFO - Completed execution in 13.588 min, execution result 0


CPU times: user 39.3 s, sys: 36.9 s, total: 1min 16s
Wall time: 13min 36s


0

## 2. Annotation


### 2.1. FastText Quality Annotator
##### This transform annotates each document with two fastText quality classifiers: 
##### (i) [GneissWeb.Quality_annotator](https://huggingface.co/ibm-granite/GneissWeb.Quality_annotator) classifier trained on a mix of high-quality synthetic data and data annotated by an LLM for high educational value 
##### (ii) the fastText classifier from [DCLM](https://arxiv.org/pdf/2406.11794)

These fastText models are used as part of the ensemble filter in GneissWeb to detect and remove low-quality documents. 


#### Input Parameters

The transform can be initialized with the following parameters:

| Parameter                             | Default                            | Description                                       |
|------------------------------------   |------------------------------------|---------------------------------------------------|
| `gcls_model_credential`               | unset                              | Credential used to download the model; this is a Hugging Face token. [Guide to getting a Hugging Face token](https://huggingface.co/docs/hub/security-tokens) |
| `gcls_model_file_name`                | unset                              | File name of the model to download, e.g. fasttext_gneissweb_quality_annotator.bin |
| `gcls_model_url`                      | unset                              | Location of the model. For fastText, this is the model's repository name, e.g. ibm-granite/GneissWeb.Quality_annotator |
| `gcls_content_column_name`            | contents                           | Name of the column containing documents   |
| `gcls_output_label_column_name`       | label                              | Name of the output column to hold predicted classes |
| `gcls_output_score_column_name`       | score                              | Name of the output column to hold score of prediction |


#### Output Format

The output format will be a new parquet file with the label and score columns added.
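
As a rough picture of what the added columns contain, the sketch below annotates a single record with a fastText-style predictor, i.e. a callable that returns a tuple of labels and a matching sequence of probabilities, as fastText's Python `predict` does. The `annotate` helper, the `toy_predict` stand-in, and the label strings are hypothetical; how the DPK transform handles predictions internally is not shown in this notebook.

```python
def annotate(record, predict, content_column="contents",
             label_column="label", score_column="score"):
    """Return a copy of record with the top predicted label and its score added."""
    labels, scores = predict(record[content_column])
    annotated = dict(record)  # leave the input record untouched
    annotated[label_column] = labels[0]
    annotated[score_column] = float(scores[0])
    return annotated


def toy_predict(text):
    """Hypothetical stand-in for a loaded fastText model's predict method."""
    if "theorem" in text.lower():
        return ("__label__hq",), (0.9,)
    return ("__label__cc",), (0.6,)


record = {"text": "A theorem about prime numbers."}
annotated = annotate(
    record,
    toy_predict,
    content_column="text",
    label_column="cosmo_fastText_label",
    score_column="cosmo_fastText_score",
)
```

Choosing distinct column names per classifier, as the cells below do, lets several annotators write to the same table without overwriting each other.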

In [16]:
credential= "HUGGINGFACE CREDENTIAL"

In [13]:
%%time 
from dpk_gneissweb_classification.transform_python import Classification

Classification(input_folder= "tmp/repRemoval",
        output_folder= "tmp/fastText/quality",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_gneissweb_quality_annotator.bin",
        gcls_model_url= "ibm-granite/GneissWeb.Quality_annotator",
        gcls_output_label_column_name= "cosmo_fastText_label",
        gcls_output_score_column_name= "cosmo_fastText_score",
        gcls_content_column_name= "text").transform()

00:16:45 INFO - parameters are : {'gcls_model_credential': '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', 'gcls_model_file_name': 'fasttext_gneissweb_quality_annotator.bin', 'gcls_model_url': 'ibm-granite/GneissWeb.Quality_annotator', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'cosmo_fastText_label', 'gcls_output_score_column_name': 'cosmo_fastText_score'}
00:16:45 INFO - pipeline id pipeline_id
00:16:45 INFO - code location None
00:16:45 INFO - data factory data_ is using local data access: input_folder - tmp/repRemoval output_folder - tmp/fastText/quality
00:16:45 INFO - data factory data_ max_files -1, n_sample -1
00:16:45 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:16:45 INFO - orchestrator gcls started at 2025-02-16 00:16:45
00:16:45 INFO - Number of files is 9, source profile {'max_file_size': 165.65634059906006, 'min_file_size': 103.46352863

CPU times: user 2min 39s, sys: 5.9 s, total: 2min 44s
Wall time: 2min 45s


0

In [17]:
%%time 

Classification(input_folder= "tmp/fastText/quality",
        output_folder= "tmp/fastText/DCLM",
        gcls_model_credential= credential,
        gcls_model_file_name= "openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train.bin",
        gcls_model_url= "mlfoundations/fasttext-oh-eli5",
        gcls_output_label_column_name= "dclm_fastText_label",
        gcls_output_score_column_name= "dclm_fastText_score",
        gcls_content_column_name= "text").transform()

00:20:06 INFO - parameters are : {'gcls_model_credential': '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', 'gcls_model_file_name': 'openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train.bin', 'gcls_model_url': 'mlfoundations/fasttext-oh-eli5', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'dclm_fastText_label', 'gcls_output_score_column_name': 'dclm_fastText_score'}
00:20:06 INFO - pipeline id pipeline_id
00:20:06 INFO - code location None
00:20:06 INFO - data factory data_ is using local data access: input_folder - tmp/fastText/quality output_folder - tmp/fastText/DCLM
00:20:06 INFO - data factory data_ max_files -1, n_sample -1
00:20:06 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:20:06 INFO - orchestrator gcls started at 2025-02-16 00:20:06
00:20:06 INFO - Number of files is 9, source profile {'max_file_size': 165.78384017944336, 'min_file_size': 103.542

CPU times: user 2min 42s, sys: 5.3 s, total: 2min 47s
Wall time: 2min 47s


0

### 2.2. Document Category Classifiers
##### This step annotates each document using four fastText category classifiers:
##### &emsp;&emsp;  - [GneissWeb.Med_classifier](https://huggingface.co/ibm-granite/GneissWeb.Med_classifier)
##### &emsp;&emsp;  - [GneissWeb.Edu_classifier](https://huggingface.co/ibm-granite/GneissWeb.Edu_classifier)
##### &emsp;&emsp;  - [GneissWeb.Tech_classifier](https://huggingface.co/ibm-granite/GneissWeb.Tech_classifier)
##### &emsp;&emsp;  - [GneissWeb.Sci_classifier](https://huggingface.co/ibm-granite/GneissWeb.Sci_classifier)

The ensemble filter in GneissWeb uses the category annotations from these fastText models for category-aware readability score quality filtering and extreme-tokenized quality filtering.


In [18]:
%%time 

Classification(input_folder= "tmp/fastText/DCLM",
        output_folder= "tmp/fastText/medical",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_medical.bin",
        gcls_model_url= "ibm-granite/GneissWeb.Med_classifier",
        gcls_output_label_column_name= "medical_label",
        gcls_output_score_column_name= "medical_score",
        gcls_content_column_name= "text").transform()

00:24:16 INFO - parameters are : {'gcls_model_credential': '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', 'gcls_model_file_name': 'fasttext_medical.bin', 'gcls_model_url': 'ibm-granite/GneissWeb.Med_classifier', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'medical_label', 'gcls_output_score_column_name': 'medical_score'}
00:24:16 INFO - pipeline id pipeline_id
00:24:16 INFO - code location None
00:24:16 INFO - data factory data_ is using local data access: input_folder - tmp/fastText/DCLM output_folder - tmp/fastText/medical
00:24:16 INFO - data factory data_ max_files -1, n_sample -1
00:24:16 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:24:16 INFO - orchestrator gcls started at 2025-02-16 00:24:16
00:24:16 INFO - Number of files is 9, source profile {'max_file_size': 166.23744297027588, 'min_file_size': 103.83616161346436, 'total_file_size': 1430.85

CPU times: user 2min 38s, sys: 7.08 s, total: 2min 45s
Wall time: 2min 47s


0

In [19]:
%%time 

Classification(input_folder= "tmp/fastText/medical",
        output_folder= "tmp/fastText/education",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_education.bin",
        gcls_model_url= "ibm-granite/GneissWeb.Edu_classifier",
        gcls_output_label_column_name= "education_label",
        gcls_output_score_column_name= "education_score",
        gcls_content_column_name= "text").transform()

00:27:03 INFO - parameters are : {'gcls_model_credential': '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', 'gcls_model_file_name': 'fasttext_education.bin', 'gcls_model_url': 'ibm-granite/GneissWeb.Edu_classifier', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'education_label', 'gcls_output_score_column_name': 'education_score'}
00:27:03 INFO - pipeline id pipeline_id
00:27:03 INFO - code location None
00:27:03 INFO - data factory data_ is using local data access: input_folder - tmp/fastText/medical output_folder - tmp/fastText/education
00:27:03 INFO - data factory data_ max_files -1, n_sample -1
00:27:03 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:27:03 INFO - orchestrator gcls started at 2025-02-16 00:27:03
00:27:03 INFO - Number of files is 9, source profile {'max_file_size': 166.36492156982422, 'min_file_size': 103.91450691223145, 'total_file_siz

CPU times: user 2min 38s, sys: 7.37 s, total: 2min 46s
Wall time: 2min 48s


0

In [20]:
%%time 

Classification(input_folder= "tmp/fastText/education",
        output_folder= "tmp/fastText/technology",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_technology_computing.bin",
        gcls_model_url= "ibm-granite/GneissWeb.Tech_classifier",
        gcls_output_label_column_name= "technology_computing_label",
        gcls_output_score_column_name= "technology_computing_score",
        gcls_content_column_name= "text").transform()

00:29:51 INFO - parameters are : {'gcls_model_credential': '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', 'gcls_model_file_name': 'fasttext_technology_computing.bin', 'gcls_model_url': 'ibm-granite/GneissWeb.Tech_classifier', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'technology_computing_label', 'gcls_output_score_column_name': 'technology_computing_score'}
00:29:51 INFO - pipeline id pipeline_id
00:29:51 INFO - code location None
00:29:51 INFO - data factory data_ is using local data access: input_folder - tmp/fastText/education output_folder - tmp/fastText/technology
00:29:51 INFO - data factory data_ max_files -1, n_sample -1
00:29:51 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:29:51 INFO - orchestrator gcls started at 2025-02-16 00:29:51
00:29:51 INFO - Number of files is 9, source profile {'max_file_size': 166.49221515655518, 'min_file_size'

CPU times: user 2min 35s, sys: 5.11 s, total: 2min 40s
Wall time: 2min 40s


0

In [21]:
%%time 

Classification(input_folder= "tmp/fastText/technology",
        output_folder= "tmp/fastText/science",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_science.bin",
        gcls_model_url= "ibm-granite/GneissWeb.Sci_classifier",
        gcls_output_label_column_name= "science_label",
        gcls_output_score_column_name= "science_score",
        gcls_content_column_name= "text").transform()

00:32:32 INFO - parameters are : {'gcls_model_credential': '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', 'gcls_model_file_name': 'fasttext_science.bin', 'gcls_model_url': 'ibm-granite/GneissWeb.Sci_classifier', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'science_label', 'gcls_output_score_column_name': 'science_score'}
00:32:32 INFO - pipeline id pipeline_id
00:32:32 INFO - code location None
00:32:32 INFO - data factory data_ is using local data access: input_folder - tmp/fastText/technology output_folder - tmp/fastText/science
00:32:32 INFO - data factory data_ max_files -1, n_sample -1
00:32:32 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:32:32 INFO - orchestrator gcls started at 2025-02-16 00:32:32
00:32:32 INFO - Number of files is 9, source profile {'max_file_size': 166.6246156692505, 'min_file_size': 104.07626056671143, 'total_file_size': 14

CPU times: user 2min 35s, sys: 5.55 s, total: 2min 41s
Wall time: 2min 42s


0

### 2.3. Readability Scores Quality Annotator

Readability scores are formulas based on text statistics (such as sentence length, average number of words, and number of syllables) that estimate how easily a text can be read and understood.

This transform calculates the [McAlpine-EFLAW](https://www.angelfire.com/nd/nirmaldasan/journalismonline/fpetge.html) readability score for each document in the output parquet file from the previous step and adds it to the data as a new column.

The McAlpine-EFLAW readability score of a document is computed as the number of words in the document plus the number of mini-words (words of ≤ 3 characters), divided by the number of sentences. A lower score means the document is easier to understand for a reader with English as a foreign language.
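
The formula above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `mcalpine_eflaw_sketch` and the regex-based word and sentence splitting are assumptions made for this example, and the scores produced by the transform itself may differ because its tokenization rules are not reproduced here.

```python
import re


def mcalpine_eflaw_sketch(text: str) -> float:
    """Approximate McAlpine-EFLAW: (words + mini-words) / sentences."""
    # Assumption: a "word" is a run of letters, digits or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Mini-words are words of at most 3 characters.
    mini_words = [w for w in words if len(w) <= 3]
    # Assumption: sentences end with '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Guard against division by zero for text without sentence punctuation.
    n_sentences = max(len(sentences), 1)
    return (len(words) + len(mini_words)) / n_sentences


sample = "The cat sat on the mat. It was a very good day."
# 12 words, 10 mini-words, 2 sentences
print(mcalpine_eflaw_sketch(sample))
```

Short sentences made of short words are not automatically "easy" under this score: mini-words add to the numerator, so a sentence packed with small function words is penalized.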

#### Input Parameters

The transform can be initialized with the following parameters:

| Parameter                          | Default                            | Description                                       |
|------------------------------------|------------------------------------|---------------------------------------------------|
| `readability_contents_column_name` | `text`                             | specifies the name of the column holding the document text.  |
| `readability_score_list`           | `mcalpine_eflaw_textstat`          | list of readability scores to be computed by the transform   |



#### Output Format

The output format will be a new parquet file with the Readability scores added.

In [24]:
# !pip install textstat
from dpk_readability.runtime import Readability

Readability(
    input_folder="tmp/fastText/science",
    output_folder="tmp/readabilty",
    readability_contents_column_name="text",
).transform()

00:48:44 INFO - Readability parameters are : {'readability_contents_column_name': 'text', 'readability_score_list': 'mcalpine_eflaw_textstat'}
00:48:44 INFO - pipeline id pipeline_id
00:48:44 INFO - code location None
00:48:44 INFO - data factory data_ is using local data access: input_folder - tmp/fastText/science output_folder - tmp/readabilty
00:48:44 INFO - data factory data_ max_files -1, n_sample -1
00:48:44 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:48:44 INFO - orchestrator readability started at 2025-02-16 00:48:44
00:48:44 INFO - Number of files is 9, source profile {'max_file_size': 166.7492561340332, 'min_file_size': 104.1508150100708, 'total_file_size': 1435.1349716186523}
00:49:09 INFO - Completed 1 files (11.11%) in 0.426 min
00:49:36 INFO - Completed 2 files (22.22%) in 0.871 min
00:50:03 INFO - Completed 3 files (33.33%) in 1.315 min
00:50:30 INFO - 

0

### 2.4. Extreme-Tokenized-Documents Quality Annotator
##### This annotator retrieves the tokens generated for a set of documents. For each document, it then calculates the size and the total number of characters. The number of tokens is divided by the size and by the number of characters, and the resulting values are stored in two columns (tokens_per_doc_size and tokens_per_doc_num_chars).

##### The annotator transform adds five columns to the input table:

##### &emsp;&emsp;1. doc_num_tokens - number of tokens for each document
##### &emsp;&emsp;2. doc_size_kbs - document size in KB
##### &emsp;&emsp;3. doc_num_chars - number of characters in the document
##### &emsp;&emsp;4. tokens_per_doc_size - ratio between number of tokens and document size
##### &emsp;&emsp;5. tokens_per_doc_num_chars - ratio between number of tokens and number of characters in document
##### Documents with an extremely high or low number of tokens per character (or tokens per byte) are identified as extreme-tokenized documents and can be excluded in the filtering step.
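
##### The five annotation columns can be illustrated with a small stand-alone sketch. The helper `annotate_tokens` is hypothetical and is not part of the DPK transform; in particular, measuring the size as UTF-8 bytes divided by 1024 is an assumption made for this example.

```python
def annotate_tokens(text: str, num_tokens: int) -> dict:
    """Compute the five annotation values for one document (illustrative only)."""
    num_chars = len(text)
    # Assumption: document size is the UTF-8 byte length expressed in KB (1024 bytes).
    size_kbs = len(text.encode("utf-8")) / 1024
    return {
        "doc_num_tokens": num_tokens,
        "doc_size_kbs": size_kbs,
        "doc_num_chars": num_chars,
        # Ratios default to 0.0 for empty documents to avoid division by zero.
        "tokens_per_doc_size": num_tokens / size_kbs if size_kbs else 0.0,
        "tokens_per_doc_num_chars": num_tokens / num_chars if num_chars else 0.0,
    }


# A 2048-character ASCII document that a tokenizer split into 512 tokens.
row = annotate_tokens("a" * 2048, num_tokens=512)
print(row["tokens_per_doc_size"], row["tokens_per_doc_num_chars"])
```

##### In this sketch the example document has 0.25 tokens per character, which falls inside both token-ratio ranges used by the filter in Sec. 3.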



#### 2.4.1 Tokenization

In [29]:
## packages needed for the tokenization
# !pip install "transformers>=4.38.2" "torch" "python-dotenv"


In [30]:
from dpk_tokenization2arrow.transform_python import Tokenization2Arrow

Tokenization2Arrow(
        input_folder= "tmp/readabilty",
        output_folder= "tmp/arrows",
        tkn_tokenizer=  "bigcode/starcoder",
        tkn_doc_id_column= "id",
        tkn_doc_content_column= "text",
        tkn_chunk_size= 20_000).transform()

00:58:18 INFO - pipeline id pipeline_id
00:58:18 INFO - code location None
00:58:18 INFO - data factory data_ is using local data access: input_folder - tmp/readabilty output_folder - tmp/arrows
00:58:18 INFO - data factory data_ max_files -1, n_sample -1
00:58:18 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:58:18 INFO - orchestrator Tokenization2Arrow started at 2025-02-16 00:58:18
00:58:18 INFO - Number of files is 9, source profile {'max_file_size': 166.8946590423584, 'min_file_size': 104.24081420898438, 'total_file_size': 1436.3791055679321}
00:58:18 INFO - Tokenizer config['tokenizer'] = 'bigcode/starcoder' loaded.
00:58:19 INFO - Tokenization2ArrowTransform.transform_binary file_name = '/Users/hajaremami/Desktop/DPK_notebook/data-prep-kit/examples/notebooks/GneissWeb/tmp/readabilty/000_00000_0.parquet'
01:02:27 INFO - Completed 1 files (11.11%) in 4.146 min
01:02

0

#### 2.4.2 Annotation

#### Input Parameters

The transform can be initialized with the following parameters:

| Parameter                          | Default                            | Description                                       |
|------------------------------------|------------------------------------|---------------------------------------------------|
| `et_contents_column_name`          | `text`                             | specifies the name of the column holding the document text.  |
| `et_arrow_path`                    | `unset`                            | location of the folder containing the arrow (tokenization) files.   |



#### Output Format

The output format will be a new parquet file with 5 columns added.

In [36]:
from dpk_extreme_tokenized.runtime import ExtremeTokenized

ExtremeTokenized(
    input_folder="tmp/readabilty",
    output_folder="tmp/extreme_tokenized",
    et_contents_column_name="text",
    et_arrow_path="tmp/arrows",
).transform()

09:58:41 INFO - data factory et_ is using local configuration without input/output path
09:58:41 INFO - data factory et_ max_files -1, n_sample -1
09:58:41 INFO - data factory et_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
09:58:41 INFO - pipeline id pipeline_id
09:58:41 INFO - code location None
09:58:41 INFO - data factory data_ is using local data access: input_folder - tmp/readabilty output_folder - tmp/extreme_tokenized
09:58:41 INFO - data factory data_ max_files -1, n_sample -1
09:58:41 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
09:58:41 INFO - orchestrator et started at 2025-02-16 09:58:41
09:58:41 INFO - Number of files is 9, source profile {'max_file_size': 166.8946590423584, 'min_file_size': 104.24081420898438, 'total_file_size': 1436.3791055679321}
09:58:59 INFO - Com

0

### 3. Category-aware Ensemble Quality Filter

##### GneissWeb ensemble filtering rule: A document is retained if either the fastText combination and the category-aware readability score filter agree, or the fastText combination and the category-aware extreme-tokenized filter agree. Here the fastText combination is the logical OR of the fastText classifiers, i.e., it is satisfied when either of the fastText classifiers agrees. Please refer to the [GneissWeb](https://huggingface.co/datasets/ibm-granite/GneissWeb) dataset page and the GneissWeb paper for more details.




##### This step filters out low-quality documents from the input data using multiple quality annotators and by leveraging the category information of documents.
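
##### The ensemble rule can be restated as a plain Python predicate over one row of annotations. This is a reading aid only: the function `keep_document` is hypothetical, and the column names and thresholds are copied from the SQL criteria used in the filter cell of this notebook.

```python
CATEGORY_LABELS = {
    "technology_computing_label": "technology",
    "medical_label": "medical",
    "education_label": "education",
    "science_label": "science",
}


def keep_document(row: dict) -> bool:
    """Return True if the row passes the ensemble rule (illustrative restatement)."""
    # fastText combination: logical OR of the two quality classifiers.
    fasttext_ok = row["dclm_fastText_score"] > 0.002 or row["cosmo_fastText_score"] > 0.03

    # A document is "in category" if any category classifier fires,
    # and "all cc" if every category classifier returns the 'cc' label.
    in_category = any(row[col] == label for col, label in CATEGORY_LABELS.items())
    all_cc = all(row[col] == "cc" for col in CATEGORY_LABELS)

    # Category-aware readability filter: looser threshold for category documents.
    eflaw = row["mcalpine_eflaw_textstat"]
    readability_ok = (eflaw < 70 and in_category) or (eflaw < 30 and all_cc)

    # Category-aware extreme-tokenized filter (SQL BETWEEN is inclusive).
    ratio = row["tokens_per_doc_num_chars"]
    tokenized_ok = (0.1 <= ratio <= 0.5 and in_category) or (0.22 <= ratio <= 0.28 and all_cc)

    return fasttext_ok and (readability_ok or tokenized_ok)


example = {
    "dclm_fastText_score": 0.01, "cosmo_fastText_score": 0.0,
    "mcalpine_eflaw_textstat": 50, "tokens_per_doc_num_chars": 0.6,
    "technology_computing_label": "cc", "medical_label": "cc",
    "education_label": "cc", "science_label": "science",
}
print(keep_document(example))
```

##### The example row is a science document with a readability score of 50: it fails the token-ratio range but passes the category readability threshold, so it is kept.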

#### Input Parameters

The transform can be initialized with the following parameters:

| Parameter                          | Default                            | Description                                       |
|------------------------------------|------------------------------------|---------------------------------------------------|
| `filter_criteria_list`             | `[]`                               | specifies the list of row filter criteria (in SQL WHERE clause format). Each filter criterion is a string. The default value of this parameter is [] (an empty list, meaning that all the rows in the input table will be kept).  |
| `filter_logical_operator`          | `AND`                              | specifies the logical operator that joins filter criteria (AND or OR).   |
| `filter_columns_to_drop`           | `[]`                              | the list with the names of the columns to drop after row filtering is complete. The default value of this parameter is [] (an empty list, meaning that all the columns in the input table will be kept).   |



#### Output Format

The output will be a new parquet file from which the rows that do not satisfy the filter criteria have been removed.
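
To see how `filter_criteria_list` and `filter_logical_operator` interact, the sketch below joins a list of WHERE-style criteria with AND or OR and runs the result against an in-memory SQLite table. It only illustrates the semantics described in the parameter table: the helpers `combine_criteria` and `filter_rows`, the toy `docs` table, and the use of SQLite are all assumptions for this example and do not reflect how the DPK filter transform is implemented.

```python
import sqlite3


def combine_criteria(criteria: list[str], operator: str = "AND") -> str:
    """Join WHERE-style criteria with AND/OR; an empty list keeps every row."""
    if not criteria:
        return "1 = 1"
    return f" {operator} ".join(f"({c})" for c in criteria)


def filter_rows(rows, criteria, operator="AND"):
    """Return the ids of the rows that satisfy the combined criteria."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id INTEGER, score REAL, label TEXT)")
    conn.executemany("INSERT INTO docs VALUES (?, ?, ?)", rows)
    where = combine_criteria(criteria, operator)
    kept = conn.execute(f"SELECT id FROM docs WHERE {where} ORDER BY id").fetchall()
    conn.close()
    return [r[0] for r in kept]


rows = [(1, 0.9, "science"), (2, 0.1, "science"), (3, 0.9, "cc"), (4, 0.1, "cc")]
criteria = ["score > 0.5", "label IN ('science')"]

print(filter_rows(rows, criteria, "AND"))  # rows matching both criteria
print(filter_rows(rows, criteria, "OR"))   # rows matching at least one criterion
print(filter_rows(rows, [], "AND"))        # empty list: nothing is filtered out
```

The GneissWeb filter cell uses `OR` for the same reason: each of its two criteria already encodes "fastText combination AND one category-aware filter", and a document is kept if either criterion holds.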

In [37]:
from dpk_filter.transform_python import Filter

Filter(input_folder= "tmp/extreme_tokenized",
        output_folder= "output",
        filter_criteria_list= 
        ['(("dclm_fastText_score" > 0.002 OR "cosmo_fastText_score" > 0.03)) AND (((mcalpine_eflaw_textstat < 70) AND (technology_computing_label IN (\'technology\') OR medical_label IN (\'medical\') OR education_label IN (\'education\') OR science_label IN (\'science\'))) OR ((mcalpine_eflaw_textstat < 30) AND (technology_computing_label IN (\'cc\') AND medical_label IN (\'cc\') AND education_label IN (\'cc\') AND science_label IN (\'cc\'))))',
         '(("dclm_fastText_score" > 0.002 OR "cosmo_fastText_score" > 0.03)) AND (((tokens_per_doc_num_chars BETWEEN 0.1 AND 0.5) AND (technology_computing_label IN (\'technology\') OR medical_label IN (\'medical\') OR education_label IN (\'education\') OR science_label IN (\'science\'))) OR ((tokens_per_doc_num_chars BETWEEN 0.22 AND 0.28) AND (technology_computing_label IN (\'cc\') AND medical_label IN (\'cc\') AND education_label IN (\'cc\') AND science_label IN (\'cc\'))))'],
        filter_logical_operator= "OR").transform()

10:02:50 INFO - pipeline id pipeline_id
10:02:50 INFO - code location None
10:02:50 INFO - data factory data_ is using local data access: input_folder - tmp/extreme_tokenized output_folder - output
10:02:50 INFO - data factory data_ max_files -1, n_sample -1
10:02:50 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
10:02:50 INFO - orchestrator filter started at 2025-02-16 10:02:50
10:02:50 INFO - Number of files is 9, source profile {'max_file_size': 169.34099292755127, 'min_file_size': 105.82319831848145, 'total_file_size': 1457.4556217193604}
10:02:52 INFO - Completed 1 files (11.11%) in 0.03 min
10:02:53 INFO - Completed 2 files (22.22%) in 0.061 min
10:02:55 INFO - Completed 3 files (33.33%) in 0.091 min
10:02:57 INFO - Completed 4 files (44.44%) in 0.122 min
10:02:59 INFO - Completed 5 files (55.56%) in 0.154 min
10:03:01 INFO - Completed 6 files (66.67%) in 0.184 min
10

0