# GneissWeb Recipe

#### This notebook presents the GneissWeb recipe and applies the components in sequence to reproduce the GneissWeb processing pipeline using DPK transforms. 
#### ![](recipe3.png)



In [2]:
!pip install "data-prep-toolkit-transforms[rep_removal, readabilty, extreme_tokenized, filter]==1.0.1.dev1"
!pip install langcodes huggingface-hub fasttext-wheel

### 0. Read the input parquet file
##### Download a parquet file from HF using the HF download API

In [3]:
# from huggingface_hub import hf_hub_download
# import pandas as pd

# REPO_ID = "HuggingFaceFW/fineweb"
# FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"

# hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")

In [4]:
import os
import urllib.request

# Download a small sample parquet file into ./input
os.makedirs("input", exist_ok=True)
urllib.request.urlretrieve("https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/universal/rep_removal/test-data/input/test1.parquet", "input/test1.parquet")
# urllib.request.urlretrieve("https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/language/extreme_tokenized/test-data/input/arrow/test1.arrow", "tmp/input/test1.arrow")

('input/test1.parquet', <http.client.HTTPMessage at 0x1074e89a0>)

#### Pip installations

##### These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements file that includes the right release.

##### Example for transform developers working from git clone:

##### make venv

##### source venv/bin/activate

##### pip install jupyterlab



### 1. Repetition Removal
##### This component applies exact substring deduplication: within each parquet file, it removes any substring of a predetermined length that occurs more than once. It adapts the implementation from [deduplicate-text-datasets](https://github.com/google-research/deduplicate-text-datasets).
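
##### To make the idea concrete, here is a minimal pure-Python sketch of exact substring deduplication that keeps the first copy of each repeated span. It is an illustration only: the transform relies on the implementation adapted from deduplicate-text-datasets, whereas this sketch uses a naive window index. The function name `remove_repeated_substrings` and the tiny thresholds are assumptions made for the demonstration.

```python
def remove_repeated_substrings(text: str, min_len: int) -> str:
    """Drop later copies of any substring of length >= min_len, keeping the first copy.

    Naive sketch (hypothetical helper, not the DPK implementation): every window of
    min_len characters is indexed by the position where it first appears. A later,
    non-overlapping occurrence of the same window is marked for removal.
    """
    first_seen = {}                      # window text -> index of its first occurrence
    drop = [False] * len(text)           # characters to remove
    for i in range(len(text) - min_len + 1):
        window = text[i:i + min_len]
        first = first_seen.setdefault(window, i)
        # Only remove copies that start after the first copy has ended,
        # so the first copy is always retained intact.
        if first + min_len <= i:
            for j in range(i, i + min_len):
                drop[j] = True
    return "".join(ch for ch, dropped in zip(text, drop) if not dropped)


print(remove_repeated_substrings("abcabcabc", 3))  # -> abc
```

##### Consecutive matching windows merge into one removed span, so a repeat longer than `min_len` is removed as a whole.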


#### Prerequisites

##### Running the repetition removal transform requires Rust to be installed on the machine. You can install Rust by following the instructions [here](https://www.rust-lang.org/tools/install).

##### Add Rust to $PATH

##### If Rust is not on your $PATH, follow the steps below to add the Rust installation location so that the transform can execute properly.

##### Use the `!whereis cargo` command to find where Rust is installed on your machine, and add the directory containing the binary (the path up to `/bin`) to $PATH.

##### For example, `whereis cargo` produces: `cargo: /Users/USERNAME/.cargo/bin/cargo`

##### In that case, set $PATH to include `/Users/USERNAME/.cargo/bin/`
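
##### The PATH update can also be done from inside the notebook. The sketch below assumes the default rustup location `~/.cargo/bin`; the helper name `prepend_to_path` is hypothetical, and the directory should be replaced with whatever `whereis cargo` reports on your machine.

```python
import os
import shutil
from pathlib import Path


def prepend_to_path(directory: str, path_value: str) -> str:
    """Return path_value with directory placed first, without duplicating it."""
    parts = [p for p in path_value.split(os.pathsep) if p and p != directory]
    return os.pathsep.join([directory, *parts])


# Assumed default rustup install location; adjust if cargo lives elsewhere.
cargo_bin = str(Path.home() / ".cargo" / "bin")

# Only touch PATH when cargo cannot already be found.
if shutil.which("cargo") is None:
    os.environ["PATH"] = prepend_to_path(cargo_bin, os.environ.get("PATH", ""))
```

##### The change applies only to the current kernel process and to subprocesses it launches.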

In [5]:
%%time
from dpk_rep_removal.runtime import RepRemoval

RepRemoval(input_folder= "input",
            output_folder= "tmp/repRemoval",
            rep_removal_contents_column_name='text', 
            rep_removal_num_threads='1',
            ).transform()

  from .autonotebook import tqdm as notebook_tqdm
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
16:54:37 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
16:54:37 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
16:54:37 INFO - data factory data_ is using local data access: input_folder - input output_folder - tmp/repRemoval
INFO:data_processing.data_access.data_access_factory_base746a4404-d03c-4082-9f2e-68a9653021e0:data factory data_ is using local data access: input_folder - input output_folder - tmp/repRemoval
16:54:37 INFO - data factory data_ max_files -1, n_sample -1
INFO:data_processing.data_access.data_access_factory_base746a4404-d03c-4082-9f2e-68a9653021e0:data factory data_ max_files -1, n_sample -1
16:54:37 INFO - data factory data_ Not using data sets, che

cpu speed: 3228 MHz, Cores: 10
gpu_usage: 0.00%, GPU speed: 0 MHz


INFO:root:running the merge
INFO:root:merging complete
[1m[32m    Finished[0m dev [optimized + debuginfo] target(s) in 0.09s
[1m[32m     Running[0m `venv/lib/python3.10/site-packages/dpk_rep_removal/rust/target/debug/dedup_dataset self-similar --data-file /var/folders/f3/5zmfvg4j539bhmnsxzqmbc2h0000gn/T/tmpe_ti_ok9/save_dir/parquet --length-threshold 50 --cache-dir /var/folders/f3/5zmfvg4j539bhmnsxzqmbc2h0000gn/T/tmpe_ti_ok9/cache --num-threads 1 --frequency-threshold 1 --retain-first-copy`


Start load!
0 / 19909 
Duplicates found: 7250
Total time taken: 2ms


16:54:40 INFO - Completed 1 files (100.0%) in 0.036 min
INFO:data_processing.runtime.pure_python.transform_orchestrator:Completed 1 files (100.0%) in 0.036 min
16:54:40 INFO - Done processing 1 files, waiting for flush() completion.
INFO:data_processing.runtime.pure_python.transform_orchestrator:Done processing 1 files, waiting for flush() completion.
16:54:40 INFO - done flushing in 0.0 sec
INFO:data_processing.runtime.pure_python.transform_orchestrator:done flushing in 0.0 sec
16:54:40 INFO - Completed execution in 0.036 min, execution result 0
INFO:data_processing.runtime.pure_python.transform_launcher:Completed execution in 0.036 min, execution result 0


CPU times: user 1.2 s, sys: 2.17 s, total: 3.36 s
Wall time: 8.06 s


0

### 2. Annotation


### 2.1. fastText Quality Annotator
##### This step annotates the documents using two fastText quality classifiers: (i) the fastText classifier from [DCLM](https://arxiv.org/pdf/2406.11794) and (ii) our own fastText classifier, trained on a mix of high-quality synthetic data and data annotated by an LLM for high educational value.

In [7]:
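# Hugging Face access token used to download the classifier models.
# Set it before running the cells below (do not commit a real token), e.g.:
# credential = "<your Hugging Face token>"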

In [10]:
%%time 
from dpk_gneissweb_classification.transform_python import Classification

Classification(input_folder= "tmp/repRemoval",
        output_folder= "tmp/fasttext/quality",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_gneissweb_quality_annotator.bin",
        gcls_model_url= "ibm-research/GneissWeb.Quality_annotator",
        gcls_content_column_name= "text").transform()

17:14:55 INFO - parameters are : {'gcls_model_credential': '<redacted>', 'gcls_model_file_name': 'fasttext_gneissweb_quality_annotator.bin', 'gcls_model_url': 'ibm-research/GneissWeb.Quality_annotator', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'lang', 'gcls_output_score_column_name': 'score'}
INFO:dpk_gneissweb_classification.transform:parameters are : {'gcls_model_credential': '<redacted>', 'gcls_model_file_name': 'fasttext_gneissweb_quality_annotator.bin', 'gcls_model_url': 'ibm-research/GneissWeb.Quality_annotator', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'lang', 'gcls_output_score_column_name': 'score'}
17:14:55 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
17:14:55 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
17:14:55 INFO - data factory data_ is using local 

CPU times: user 1.18 s, sys: 985 ms, total: 2.17 s
Wall time: 2.76 s


0

In [11]:
%%time 

Classification(input_folder= "tmp/fasttext/quality",
        output_folder= "tmp/fasttext/DCLM",
        gcls_model_credential= credential,
        gcls_model_file_name= "openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train.bin",
        gcls_model_url= "mlfoundations/fasttext-oh-eli5",
        gcls_content_column_name= "text").transform()

17:24:43 INFO - parameters are : {'gcls_model_credential': '<redacted>', 'gcls_model_file_name': 'openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train.bin', 'gcls_model_url': 'mlfoundations/fasttext-oh-eli5', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'lang', 'gcls_output_score_column_name': 'score'}
INFO:dpk_gneissweb_classification.transform:parameters are : {'gcls_model_credential': '<redacted>', 'gcls_model_file_name': 'openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train.bin', 'gcls_model_url': 'mlfoundations/fasttext-oh-eli5', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'lang', 'gcls_output_score_column_name': 'score'}
17:24:43 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
17:24:43 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
17:24:43 INFO - data factory data_ is using 

CPU times: user 8.23 s, sys: 20.8 s, total: 29 s
Wall time: 3min 51s


1

### 2.2. Document Category Classifiers
##### This step annotates the documents using four fastText category classifiers:
######   1. Science
######   2. Education
######   3. Technology & computing
######   4. Medical health
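
##### The four classifiers run as a chain in which each step reads the folder written by the previous one. The sketch below only builds and prints that chain; the model file names and repositories are the ones used in this notebook, while `build_chain`, `CATEGORY_MODELS`, and the starting folder argument are illustrative assumptions.

```python
# (folder suffix, model file, Hugging Face repository) for each category classifier.
CATEGORY_MODELS = [
    ("medical", "fasttext_medical.bin", "ibm-research/GneissWeb.Med_classifier"),
    ("education", "fasttext_education.bin", "ibm-research/GneissWeb.Edu_classifier"),
    ("technology", "fasttext_technology_computing.bin", "ibm-research/GneissWeb.Tech_classifier"),
    ("science", "fasttext_science.bin", "ibm-research/GneissWeb.Sci_classifier"),
]


def build_chain(first_input: str, base: str = "tmp/fasttext") -> list:
    """Return one parameter dict per classifier; each output feeds the next input."""
    steps = []
    current = first_input
    for name, model_file, model_url in CATEGORY_MODELS:
        output = f"{base}/{name}"
        steps.append({
            "input_folder": current,
            "output_folder": output,
            "gcls_model_file_name": model_file,
            "gcls_model_url": model_url,
        })
        current = output
    return steps


for step in build_chain("tmp/fasttext/quality"):
    print(step["input_folder"], "->", step["output_folder"])
    # With the transform installed, each step could then be run as:
    # Classification(gcls_model_credential=credential,
    #                gcls_content_column_name="text", **step).transform()
```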

In [None]:
%%time 

Classification(input_folder= "tmp/fasttext/quality",
        output_folder= "tmp/fasttext/medical",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_medical.bin",
        gcls_model_url= "ibm-research/GneissWeb.Med_classifier",
        gcls_content_column_name= "text").transform()

In [None]:
%%time 

Classification(input_folder= "tmp/fasttext/medical",
        output_folder= "tmp/fasttext/education",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_education.bin",
        gcls_model_url= "ibm-research/GneissWeb.Edu_classifier",
        gcls_content_column_name= "text").transform()

In [None]:
%%time 

Classification(input_folder= "tmp/fasttext/education",
        output_folder= "tmp/fasttext/technology",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_technology_computing.bin",
        gcls_model_url= "ibm-research/GneissWeb.Tech_classifier",
        gcls_content_column_name= "text").transform()

In [None]:
%%time 

Classification(input_folder= "tmp/fasttext/technology",
        output_folder= "tmp/fasttext/science",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_science.bin",
        gcls_model_url= "ibm-research/GneissWeb.Sci_classifier",
        gcls_content_column_name= "text").transform()

### 2.3. Readability Scores Quality Annotator
##### This transform calculates the McAlpine-EFLAW readability score for each document in the output parquet file from the previous step and adds a McAlpine-EFLAW readability score column to the data.

##### The McAlpine-EFLAW readability score of a document is the number of words plus the number of mini-words (words of ≤ 3 characters), divided by the number of sentences. A lower score means the document is easier to understand for a reader with English as a foreign language.
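
##### As a worked example, the score can be written as (W + M) / S, where W is the word count, M the mini-word count, and S the sentence count. The sketch below uses simple regular expressions for word and sentence splitting; that tokenization is an assumption, so the values may differ from those produced by the transform.

```python
import re


def mcalpine_eflaw(text: str) -> float:
    """Approximate McAlpine-EFLAW score: (words + mini-words) / sentences."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    mini_words = [w for w in words if len(w) <= 3]            # words of <= 3 characters
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0                                            # no sentences, no score
    return (len(words) + len(mini_words)) / len(sentences)


# 9 words, 8 of them mini-words, 2 sentences: (9 + 8) / 2
print(mcalpine_eflaw("The cat sat on the mat. It was happy."))  # -> 8.5
```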

In [None]:
from dpk_readability.runtime import Readability

Readability(
    input_folder="tmp/fasttext/science",
    output_folder="tmp/readabilty",
    readability_contents_column_name="text",
    readability_curriculum=False,
).transform()

### 2.4. Extreme-Tokenized Annotator
##### This annotator retrieves the tokens generated for a set of documents. Then, for each document, it calculates the size and the total number of characters. The number of tokens is divided by the size and by the number of characters, and the resulting values are stored in two columns (`tokens_per_doc_size` and `tokens_per_doc_num_chars`).

##### The annotator transform annotates the input table with 5 columns:

###### 1. doc_num_tokens - number of tokens for each document
###### 2. doc_size_kbs - document size in kb
###### 3. doc_num_chars - number of characters in the document
###### 4. tokens_per_doc_size - ratio between number of tokens and document size
###### 5. tokens_per_doc_num_chars - ratio between number of tokens and number of characters in document
##### Documents with an extremely high or low number of tokens per character (or tokens per byte) are identified as extreme-tokenized documents and can be excluded in the filtering step.
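
##### The five columns can be derived from a document's text and its token count as sketched below. This is an illustration rather than the transform's code: the helper name is hypothetical, and measuring size as UTF-8 bytes divided by 1024 is an assumption about how `doc_size_kbs` is defined.

```python
def extreme_tokenized_columns(text: str, num_tokens: int) -> dict:
    """Compute the five annotation columns for one document (illustrative sketch)."""
    size_kbs = len(text.encode("utf-8")) / 1024   # assumed: size in KB of UTF-8 bytes
    num_chars = len(text)
    return {
        "doc_num_tokens": num_tokens,
        "doc_size_kbs": size_kbs,
        "doc_num_chars": num_chars,
        # Guard against empty documents to avoid division by zero.
        "tokens_per_doc_size": num_tokens / size_kbs if size_kbs else 0.0,
        "tokens_per_doc_num_chars": num_tokens / num_chars if num_chars else 0.0,
    }


# A 2048-character ASCII document tokenized into 512 tokens.
row = extreme_tokenized_columns("a" * 2048, num_tokens=512)
print(row["tokens_per_doc_size"], row["tokens_per_doc_num_chars"])  # -> 256.0 0.25
```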



#### 2.4.1 Tokenization

#### 2.4.2 Annotation

In [None]:
from dpk_extreme_tokenized.runtime import ExtremeTokenized

ExtremeTokenized(
    input_folder="tmp/readabilty",
    output_folder="tmp/extreme_tokenized",
    et_contents_column_name="text",
    et_arrow_path="tmp/extreme_tokenized/arrow",
).transform()

### 3. Ensemble Quality Filter
##### This step filters out low-quality documents from the input data by combining multiple quality annotators and leveraging the category information of the documents.

In [None]:
from dpk_filter.transform_python import Filter

Filter(input_folder= "tmp/extreme_tokenized",
        output_folder= "output",
        filter_criteria_list= [
            
        ],
        filter_logical_operator= "AND").transform()
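
##### The criteria list above is left empty. To illustrate how several criteria combine under a logical operator, here is a small pure-Python sketch. It does not use the Filter transform's own criteria syntax, and the column names (`quality_score`, `readability_score`) and all thresholds are hypothetical values chosen for the example, not the GneissWeb settings.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]
Criterion = Callable[[Row], bool]


def apply_criteria(rows: List[Row], criteria: List[Criterion],
                   logical_operator: str = "AND") -> List[Row]:
    """Keep rows that satisfy all criteria (AND) or at least one criterion (OR)."""
    if logical_operator not in ("AND", "OR"):
        raise ValueError(f"unsupported logical operator: {logical_operator}")
    combine = all if logical_operator == "AND" else any
    return [row for row in rows if combine(c(row) for c in criteria)]


# Hypothetical annotated documents.
docs = [
    {"quality_score": 0.9, "readability_score": 18.0, "tokens_per_doc_num_chars": 0.25},
    {"quality_score": 0.2, "readability_score": 45.0, "tokens_per_doc_num_chars": 0.24},
    {"quality_score": 0.8, "readability_score": 20.0, "tokens_per_doc_num_chars": 0.90},
]

# Hypothetical thresholds, one criterion per annotator.
criteria = [
    lambda r: r["quality_score"] >= 0.5,
    lambda r: r["readability_score"] <= 30.0,
    lambda r: 0.1 <= r["tokens_per_doc_num_chars"] <= 0.5,
]

kept_and = apply_criteria(docs, criteria, "AND")
kept_or = apply_criteria(docs, criteria, "OR")
print(len(kept_and), len(kept_or))  # -> 1 3
```

##### With "AND", a document must pass every annotator's check, so only the first document survives; with "OR", passing any single check is enough.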