# GneissWeb Recipe

#### To make GneissWeb reproducible, this notebook presents the GneissWeb recipe and applies its components in sequence, using DPK transforms, to reproduce the GneissWeb processing pipeline.
<br>

#### Owner:  Hajar Emami-Gohari (hajar.emami@ibm.com)
<br>


### **An Overview of the GneissWeb Recipe**
##### The GneissWeb dataset was obtained by applying the following processing steps:
##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Step 1: Exact substring deduplication at line level 
##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Step 2: Quality annotators: 
##### &nbsp;&nbsp;&nbsp;&nbsp;&emsp;&emsp;&nbsp; - Step 2.1: Custom built fastText Quality Classifier 
##### &nbsp;&nbsp;&nbsp;&nbsp;&emsp;&emsp;&nbsp; - Step 2.2: Custom built fastText Category Classifiers 
##### &nbsp;&nbsp;&nbsp;&nbsp;&emsp;&emsp;&nbsp; - Step 2.3: Custom built Readability Score Quality Annotator 
##### &nbsp;&nbsp;&nbsp;&nbsp;&emsp;&emsp;&nbsp; - Step 2.4: Custom built Extreme-Tokenized-Documents Quality Annotator 
##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Step 3: Category-aware Ensemble Quality Filter

##### These steps were applied in the order shown in the figure below.
##### &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ![](GneissWeb_recipe_new.png)

##### Please refer to the [GneissWeb](https://huggingface.co/datasets/ibm-granite/GneissWeb) dataset page, [GneissWeb blog](https://research.ibm.com/blog/gneissweb-for-granite-training), and [GneissWeb Technical paper](https://arxiv.org/abs/2502.14907) for more details.
<br>
    


## Prerequisites

##### The repetition removal transform requires Rust to be installed on the machine. You can install Rust by following the instructions [here](https://www.rust-lang.org/tools/install).

##### Add Rust to $PATH

##### If Rust is not on your $PATH, follow the steps below to add its installation location so that the transform can execute properly.

##### Use the `!whereis cargo` command to find where Rust is installed on your machine, and take the path up to and including `/bin`.

##### Example: `whereis cargo` produces `cargo: /Users/USERNAME/.cargo/bin/cargo`

##### Set $PATH to include `/Users/USERNAME/.cargo/bin/`
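
The PATH adjustment can also be made from inside the notebook. The following is a minimal sketch, assuming Rust was installed with rustup into its default location (`~/.cargo/bin`); the helper name `ensure_on_path` is hypothetical and not part of DPK. The change only affects the current Python process and the subprocesses it starts.

```python
import os
import shutil
from pathlib import Path


def ensure_on_path(bin_dir: str) -> str:
    """Prepend bin_dir to PATH for this process if it is not already listed."""
    entries = [e for e in os.environ.get("PATH", "").split(os.pathsep) if e]
    if bin_dir not in entries:
        os.environ["PATH"] = os.pathsep.join([bin_dir] + entries)
    return os.environ["PATH"]


# Assumption: rustup's default install location; adjust if `whereis cargo`
# reports a different directory on your machine.
cargo_bin = str(Path.home() / ".cargo" / "bin")
if shutil.which("cargo") is None:
    ensure_on_path(cargo_bin)
print("cargo found at:", shutil.which("cargo"))
```

If the last line prints `None`, Rust is either not installed or lives in a different directory.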

##### The pip installs below need to be adapted to use the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements.txt file that includes the right release.


##### Example for transform developers working from data-prep-kit git clone:

##### cd data-prep-kit/examples/notebooks/GneissWeb/

##### python -m venv venv

##### source venv/bin/activate

##### pip install jupyterlab

##### venv/bin/jupyter lab


In [None]:
!pip install --no-cache "data-prep-toolkit-transforms[rep_removal, readabilty, extreme_tokenized, filter, tokenization]==1.0.1.dev1"
!pip install fasttext-wheel textstat
## packages needed for the tokenization
!pip install "transformers>=4.38.2" "torch" "python-dotenv"

### Pre-requisite: Load and split large parquet file

#### Download a parquet file from HF using the HF download API

In [1]:
from huggingface_hub import hf_hub_download
import pandas as pd

REPO_ID = "HuggingFaceFW/fineweb"
FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"

file1 = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")


#### Resize the file to a smaller size for testing purposes

###### Estimated completion time: ~0.6 min

In [3]:
import os

from dpk_resize.runtime import Resize
Resize(input_folder= os.path.dirname(file1),
        output_folder= "input",
        resize_max_mbytes_per_table= 200).transform()


## Step 1. Exact substring deduplication
##### This component applies exact substring deduplication at the level of a single parquet file: it removes any substring of a predetermined length that occurs more than once within the file. It adapts the implementation from [deduplicate-text-datasets](https://github.com/google-research/deduplicate-text-datasets).

#### Input Parameters

The transform can be initialized with the following parameters:

| Parameter                          | Default                            | Description                                       |
|------------------------------------|------------------------------------|---------------------------------------------------|
| `rep_removal_contents_column_name` | `contents`                         | Name of the column holding the document contents  |
| `rep_removal_length_thresh`        | `50`                               | Length threshold for processing                   |
| `rep_removal_frequency_threshold`  | `1`                                | Frequency threshold for processing                |
| `rep_removal_retain_first_copy`    | `True`                             | Boolean value for whether to retain first copy    |
| `rep_removal_num_threads`          | `psutil.cpu_count(logical=False)`  | Value for number of threads to use for processing |


#### Output Format

The output format will be a new parquet file with the repeated sequence(s) removed.
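
To make the idea concrete, here is a toy sketch of exact substring deduplication. It is not the transform's implementation (which adapts the suffix-array approach of deduplicate-text-datasets); it works on characters with a naive sliding window, keeps the first copy, and is only meant for tiny inputs. The function name and the character-level threshold are assumptions made for illustration.

```python
def remove_repeated_substrings(docs, length_thresh=50):
    """Toy exact-substring dedup: drop every character that lies inside a
    window of `length_thresh` characters that was already seen earlier,
    keeping the first copy. Memory-hungry; for illustration only."""
    seen = set()
    cleaned = []
    for doc in docs:
        drop = [False] * len(doc)
        for start in range(len(doc) - length_thresh + 1):
            window = doc[start:start + length_thresh]
            if window in seen:
                # Mark the whole repeated window for removal.
                for i in range(start, start + length_thresh):
                    drop[i] = True
            else:
                seen.add(window)
        cleaned.append("".join(ch for ch, d in zip(doc, drop) if not d))
    return cleaned


docs = [
    "the quick brown fox jumps",
    "a lazy dog. the quick brown fox sleeps",
]
print(remove_repeated_substrings(docs, length_thresh=10))
```

With a threshold of 10 characters, the second document loses the span it shares with the first one, while the first document is left untouched.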

###### Estimated completion time: ~13 min


In [4]:
%%time
from dpk_rep_removal.runtime import RepRemoval

RepRemoval(input_folder= "input",
            output_folder= "tmp/repRemoval",
            rep_removal_contents_column_name='text', 
            rep_removal_num_threads='10',
            ).transform()

## Step 2. Annotation


### Step 2.1. FastText Quality Annotator
##### This transform annotates each document with two fastText quality classifiers: 
##### (i) [GneissWeb.Quality_annotator](https://huggingface.co/ibm-granite/GneissWeb.Quality_annotator) classifier trained on a mix of high-quality synthetic data and data annotated by an LLM for high educational value 
##### (ii) the fastText classifier from [DCLM](https://arxiv.org/pdf/2406.11794)

These fastText models are used as part of the ensemble filter in GneissWeb to detect and remove low-quality documents. 


#### Input Parameters

The transform can be initialized with the following parameters:

| Parameter                             | Default                            | Description                                       |
|------------------------------------   |------------------------------------|---------------------------------------------------|
| `gcls_model_credential`               | unset                              | Credential used to download the model; this is a Hugging Face token. [Guide to get a Hugging Face token](https://huggingface.co/docs/hub/security-tokens)  |
| `gcls_model_file_name`                | unset                              | File name of the model to download, e.g. fasttext_gneissweb_quality_annotator.bin |
| `gcls_model_url`                      | unset                              | Location of the model. For fastText, this is the repository name of the model, e.g. ibm-granite/GneissWeb.Quality_annotator |
| `gcls_content_column_name`            | contents                           | Name of the column containing the documents   |
| `gcls_output_label_column_name`       | label                              | Name of the output column holding the predicted classes |
| `gcls_output_score_column_name`       | score                              | Name of the output column holding the prediction scores |


#### Output Format

The output format will be a new parquet file with the label and score columns added.
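
Rather than pasting the token into the notebook, one option is to read it from the environment. The sketch below assumes the token has been exported in a variable named `HF_TOKEN`; both that name and the helper `get_hf_credential` are illustrative assumptions, not part of the transform.

```python
import os


def get_hf_credential(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token stored in env_var, or raise if unset."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token"
        )
    return token
```

The returned string can then be assigned to `credential` and passed as `gcls_model_credential`.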

###### Expected completion time: ~5 min

In [16]:
credential= "HUGGINGFACE CREDENTIAL"

In [7]:
%%time 
from dpk_gneissweb_classification.transform_python import Classification

Classification(input_folder= "tmp/repRemoval",
        output_folder= "tmp/fastText/quality",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_gneissweb_quality_annotator.bin",
        gcls_model_url= "ibm-granite/GneissWeb.Quality_annotator",
        gcls_output_label_column_name= "cosmo_fastText_label",
        gcls_output_score_column_name= "cosmo_fastText_score",
        gcls_content_column_name= "text").transform()

In [8]:
%%time 

Classification(input_folder= "tmp/fastText/quality",
        output_folder= "tmp/fastText/DCLM",
        gcls_model_credential= credential,
        gcls_model_file_name= "openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train.bin",
        gcls_model_url= "mlfoundations/fasttext-oh-eli5",
        gcls_output_label_column_name= "dclm_fastText_label",
        gcls_output_score_column_name= "dclm_fastText_score",
        gcls_content_column_name= "text").transform()

### Step 2.2. Document Category Classifiers
##### This step annotates each document using four fastText category classifiers:
##### &emsp;&emsp;  - [GneissWeb.Med_classifier](https://huggingface.co/ibm-granite/GneissWeb.Med_classifier)
##### &emsp;&emsp;  - [GneissWeb.Edu_classifier](https://huggingface.co/ibm-granite/GneissWeb.Edu_classifier)
##### &emsp;&emsp;  - [GneissWeb.Tech_classifier](https://huggingface.co/ibm-granite/GneissWeb.Tech_classifier)
##### &emsp;&emsp;  - [GneissWeb.Sci_classifier](https://huggingface.co/ibm-granite/GneissWeb.Sci_classifier)

These fastText models are used as part of the ensemble filter in GneissWeb to leverage the category annotations in category-aware readability score quality filtering and extreme-tokenized quality filtering.
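
The four classifiers run as a chain in which each step reads the previous step's output folder. A minimal sketch of that chaining, using the folder, model, and column names from the cells in this notebook, is shown here; the helper `build_classifier_chain` is hypothetical and only builds the keyword arguments without running anything.

```python
# (output subfolder, model file, model repository, column prefix)
CATEGORY_MODELS = [
    ("medical", "fasttext_medical.bin",
     "ibm-granite/GneissWeb.Med_classifier", "medical"),
    ("education", "fasttext_education.bin",
     "ibm-granite/GneissWeb.Edu_classifier", "education"),
    ("technology", "fasttext_technology_computing.bin",
     "ibm-granite/GneissWeb.Tech_classifier", "technology_computing"),
    ("science", "fasttext_science.bin",
     "ibm-granite/GneissWeb.Sci_classifier", "science"),
]


def build_classifier_chain(first_input, base_output="tmp/fastText"):
    """Build one kwargs dict per classifier; each reads the previous output."""
    configs = []
    input_folder = first_input
    for subfolder, file_name, repo, prefix in CATEGORY_MODELS:
        output_folder = f"{base_output}/{subfolder}"
        configs.append({
            "input_folder": input_folder,
            "output_folder": output_folder,
            "gcls_model_file_name": file_name,
            "gcls_model_url": repo,
            "gcls_output_label_column_name": f"{prefix}_label",
            "gcls_output_score_column_name": f"{prefix}_score",
            "gcls_content_column_name": "text",
        })
        input_folder = output_folder  # next classifier consumes this output
    return configs


chain = build_classifier_chain("tmp/fastText/DCLM")
for cfg in chain:
    print(cfg["input_folder"], "->", cfg["output_folder"])
```

Each dictionary could then be passed to the transform as `Classification(gcls_model_credential=credential, **cfg).transform()`, which is equivalent to the four explicit cells that follow.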

###### Expected completion time: ~10 min

In [9]:
%%time 

Classification(input_folder= "tmp/fastText/DCLM",
        output_folder= "tmp/fastText/medical",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_medical.bin",
        gcls_model_url= "ibm-granite/GneissWeb.Med_classifier",
        gcls_output_label_column_name= "medical_label",
        gcls_output_score_column_name= "medical_score",
        gcls_content_column_name= "text").transform()

In [10]:
%%time 

Classification(input_folder= "tmp/fastText/medical",
        output_folder= "tmp/fastText/education",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_education.bin",
        gcls_model_url= "ibm-granite/GneissWeb.Edu_classifier",
        gcls_output_label_column_name= "education_label",
        gcls_output_score_column_name= "education_score",
        gcls_content_column_name= "text").transform()

In [11]:
%%time 

Classification(input_folder= "tmp/fastText/education",
        output_folder= "tmp/fastText/technology",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_technology_computing.bin",
        gcls_model_url= "ibm-granite/GneissWeb.Tech_classifier",
        gcls_output_label_column_name= "technology_computing_label",
        gcls_output_score_column_name= "technology_computing_score",
        gcls_content_column_name= "text").transform()

In [12]:
%%time 

Classification(input_folder= "tmp/fastText/technology",
        output_folder= "tmp/fastText/science",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_science.bin",
        gcls_model_url= "ibm-granite/GneissWeb.Sci_classifier",
        gcls_output_label_column_name= "science_label",
        gcls_output_score_column_name= "science_score",
        gcls_content_column_name= "text").transform()

### Step 2.3. Readability Scores Quality Annotator

Readability scores are formulas based on text statistics (such as sentence length, average number of words, number of syllables etc.) designed to assess how easily the text can be read and understood.

This transform calculates the [McAlpine-EFLAW](https://www.angelfire.com/nd/nirmaldasan/journalismonline/fpetge.html) readability score for each document in the output parquet file from the previous step and adds it to the data as a new column.

The McAlpine-EFLAW readability score of a document is the number of words in the document plus the number of mini-words (words of ≤ 3 characters), divided by the number of sentences. A lower score means the document is easier to understand for a reader with English as a foreign language.
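
The formula can be sketched in a few lines of Python. This is an illustration under simplifying assumptions (regex-based word and sentence splitting); the transform relies on textstat, whose tokenization differs, so the values below will not necessarily match the `mcalpine_eflaw_textstat` column.

```python
import re


def mcalpine_eflaw(text: str) -> float:
    """(words + mini-words) / sentences, with naive regex tokenization."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text)
    mini_words = [w for w in words if len(w) <= 3]
    if not sentences:
        return 0.0
    return (len(words) + len(mini_words)) / len(sentences)


# 11 words, 10 of them mini-words, 2 sentences: (11 + 10) / 2
print(mcalpine_eflaw("The cat sat on the mat. It was a sunny day."))  # 10.5
```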

#### Input Parameters

The transform can be initialized with the following parameters:

| Parameter                          | Default                            | Description                                       |
|------------------------------------|------------------------------------|---------------------------------------------------|
| `readability_contents_column_name` | `text`                             | specifies the name of the column holding the document text.  |
| `readability_score_list`           | `mcalpine_eflaw_textstat`          | list of readability scores to be computed by the transform   |



#### Output Format

The output format will be a new parquet file with the Readability scores added.

###### Expected completion time: ~3.778 min

In [13]:
from dpk_readability.runtime import Readability

Readability(
    input_folder="tmp/fastText/science",
    output_folder="tmp/readabilty",
    readability_contents_column_name="text",
).transform()

### Step 2.4. Extreme-Tokenized-Documents Quality Annotator
##### This annotator retrieves the tokens generated for a set of documents. Then, for each document, it calculates the size and the total number of characters. The number of tokens is divided by the size and by the number of characters, and the resulting values are stored in two columns (tokens_per_doc_size and tokens_per_doc_num_chars).

##### The annotator transform annotates the input table with 5 columns:

##### &emsp;&emsp;1. doc_num_tokens - number of tokens for each document
##### &emsp;&emsp;2. doc_size_kbs - document size in kb
##### &emsp;&emsp;3. doc_num_chars - number of characters in the document
##### &emsp;&emsp;4. tokens_per_doc_size - ratio between number of tokens and document size
##### &emsp;&emsp;5. tokens_per_doc_num_chars - ratio between number of tokens and number of characters in document
##### Documents with extremely high or low number of tokens per character (or tokens per byte) are identified as extreme-tokenized documents and can be excluded in the filtering step.
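
As a rough illustration of how the five columns relate to each other, the sketch below derives them from a document's text and a token count. It assumes the document size is the UTF-8 byte length expressed in kilobytes; the transform may measure size differently, and the function name is hypothetical.

```python
def extreme_tokenized_columns(text: str, num_tokens: int) -> dict:
    """Derive the five annotation columns for one document."""
    num_chars = len(text)
    # Assumption: size is the UTF-8 encoded length in kilobytes.
    size_kbs = len(text.encode("utf-8")) / 1024
    return {
        "doc_num_tokens": num_tokens,
        "doc_size_kbs": size_kbs,
        "doc_num_chars": num_chars,
        "tokens_per_doc_size": num_tokens / size_kbs if size_kbs else 0.0,
        "tokens_per_doc_num_chars": num_tokens / num_chars if num_chars else 0.0,
    }


# A 2048-character ASCII document tokenized into 512 tokens.
print(extreme_tokenized_columns("a" * 2048, num_tokens=512))
```

Here the document yields 0.25 tokens per character, which falls inside the range used by the filtering step for documents without a category label.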

###### Estimated completion time: ~40 min

#### Step 2.4.1 Tokenization

In [14]:
from dpk_tokenization2arrow.transform_python import Tokenization2Arrow

Tokenization2Arrow(
        input_folder= "tmp/readabilty",
        output_folder= "tmp/arrows",
        tkn_tokenizer=  "bigcode/starcoder",
        tkn_doc_id_column= "id",
        tkn_doc_content_column= "text",
        tkn_chunk_size= 20_000).transform()

#### Step 2.4.2 Annotation

#### Input Parameters

The transform can be initialized with the following parameters:

| Parameter                          | Default                            | Description                                       |
|------------------------------------|------------------------------------|---------------------------------------------------|
| `et_contents_column_name`          | `text`                             | specifies the name of the column holding the document text.  |
| `et_arrow_path`                    | `unset`                            | location of the folder containing the arrow (tokenization) files.   |



#### Output Format

The output format will be a new parquet file with 5 columns added.

In [15]:
from dpk_extreme_tokenized.runtime import ExtremeTokenized

ExtremeTokenized(
    input_folder="tmp/readabilty",
    output_folder="tmp/extreme_tokenized",
    et_contents_column_name="text",
    et_arrow_path="tmp/arrows",
).transform()

### Step 3. Category-aware Ensemble Quality Filter

##### GneissWeb ensemble filtering rule: A document is retained if either the fastText combination and the category-aware readability score filter agree, or the fastText combination and the category-aware extreme-tokenized filter agree. Here the fastText combination is the logical OR of the fastText classifiers, i.e., at least one of the fastText classifiers agrees. Please refer to the [GneissWeb](https://huggingface.co/datasets/ibm-granite/GneissWeb) dataset page and the GneissWeb paper for more details.




##### This filtering step filters out low-quality documents from the input data using multiple quality annotators and by leveraging the category information of documents. 
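
The rule can be read as a per-document predicate. The sketch below restates it in plain Python using the column names and thresholds that appear in the filter criteria of this notebook's final cell; it is meant as a reading aid, not as a replacement for the Filter transform, and `keep_document` is a hypothetical name.

```python
CATEGORY_LABELS = {
    "medical_label": "medical",
    "education_label": "education",
    "technology_computing_label": "technology",
    "science_label": "science",
}


def keep_document(row: dict) -> bool:
    """Return True if the document passes the ensemble quality filter."""
    # fastText combination: logical OR of the two quality classifiers.
    fasttext_ok = (row["dclm_fastText_score"] > 0.002
                   or row["cosmo_fastText_score"] > 0.03)

    in_category = any(row[col] == label for col, label in CATEGORY_LABELS.items())
    all_cc = all(row[col] == "cc" for col in CATEGORY_LABELS)

    # Category-aware readability filter.
    eflaw = row["mcalpine_eflaw_textstat"]
    readability_ok = (in_category and eflaw < 70) or (all_cc and eflaw < 30)

    # Category-aware extreme-tokenized filter (bounds are inclusive).
    tpc = row["tokens_per_doc_num_chars"]
    tokenized_ok = ((in_category and 0.1 <= tpc <= 0.5)
                    or (all_cc and 0.22 <= tpc <= 0.28))

    return fasttext_ok and (readability_ok or tokenized_ok)
```

For example, a document labeled `cc` by all four category classifiers with a readability score of 50 is kept only if its tokens-per-character ratio lies between 0.22 and 0.28, whereas the same document labeled `science` already passes through the looser readability threshold of 70.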

#### Input Parameters

The transform can be initialized with the following parameters:

| Parameter                          | Default                            | Description                                       |
|------------------------------------|------------------------------------|---------------------------------------------------|
| `filter_criteria_list`             | `[]`                               | specifies the list of row filter criteria (in SQL WHERE clause format). Each filter criterion is a string. The default value of this parameter is [] (an empty list, meaning that all the rows in the input table will be kept).  |
| `filter_logical_operator`          | `AND`                              | specifies the logical operator that joins filter criteria (AND or OR).   |
| `filter_columns_to_drop`           | `[]`                              | the list with the names of the columns to drop after row filtering is complete. The default value of this parameter is [] (an empty list, meaning that all the columns in the input table will be kept).   |



#### Output Format

The output format will be a new parquet file with the rows that do not meet a specific set of criteria removed.

###### Estimated completion time: ~0.3 min

In [16]:
from dpk_filter.transform_python import Filter

Filter(input_folder= "tmp/extreme_tokenized",
        output_folder= "output",
        filter_criteria_list= 
        ['(("dclm_fastText_score" > 0.002 OR "cosmo_fastText_score" > 0.03)) AND (((mcalpine_eflaw_textstat < 70) AND (technology_computing_label IN (\'technology\') OR medical_label IN (\'medical\') OR education_label IN (\'education\') OR science_label IN (\'science\'))) OR ((mcalpine_eflaw_textstat < 30) AND (technology_computing_label IN (\'cc\') AND medical_label IN (\'cc\') AND education_label IN (\'cc\') AND science_label IN (\'cc\'))))',
         '(("dclm_fastText_score" > 0.002 OR "cosmo_fastText_score" > 0.03)) AND (((tokens_per_doc_num_chars BETWEEN 0.1 AND 0.5) AND (technology_computing_label IN (\'technology\') OR medical_label IN (\'medical\') OR education_label IN (\'education\') OR science_label IN (\'science\'))) OR ((tokens_per_doc_num_chars BETWEEN 0.22 AND 0.28) AND (technology_computing_label IN (\'cc\') AND medical_label IN (\'cc\') AND education_label IN (\'cc\') AND science_label IN (\'cc\'))))'],
        filter_logical_operator= "OR").transform()