# GneissWeb Recipe

#### This notebook presents the GneissWeb recipe and applies the components in sequence to reproduce the GneissWeb processing pipeline using DPK transforms. 
<br>
__________________________________________________________________________________________________________________________________________________________

##### Contributors: Hajar Emami Gohari (Hajar.Emami@ibm.com)
<br>




#### The GneissWeb Recipe consists of the following ingredients:
#### - Read the input Data (Sec. 0)
#### - Exact substring deduplication at line level (Sec. 1)
#### - Ensemble quality annotator (Sec. 2) consisting of: 
#### &emsp;&emsp; - Custom built FastText Quality Classifier (Sec. 2.1)
#### &emsp;&emsp; - Custom built FastText Category Classifiers (Sec. 2.2)
#### &emsp;&emsp; - Custom built Readability Score Quality Annotator (Sec. 2.3)
#### &emsp;&emsp; - Custom built Extreme-Tokenized-Documents Quality Annotator (Sec. 2.4)
#### &emsp;&emsp;&emsp;&emsp; - Tokenization
#### &emsp;&emsp;&emsp;&emsp; - Annotation
#### - Category-aware Quality Filter (Sec. 3)

#### &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; ![](GneissWeb_recipe_new.png)

In [39]:
# !pip install --no-cache "data-prep-toolkit-transforms[rep_removal, readabilty, extreme_tokenized, filter, tokenization]==1.0.1.dev1"
# !pip install langcodes huggingface-hub fasttext-wheel

### 0. Read the input parquet file
##### Download a parquet file from HF using the HF download API

In [2]:
# from huggingface_hub import hf_hub_download
# import pandas as pd

# REPO_ID = "HuggingFaceFW/fineweb"
# FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"

# hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")

In [3]:
import urllib.request
import shutil

shutil.os.makedirs("input", exist_ok=True)
urllib.request.urlretrieve("https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/universal/rep_removal/test-data/input/test1.parquet", "input/test1.parquet")
# urllib.request.urlretrieve("https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/language/extreme_tokenized/test-data/input/arrow/test1.arrow", "tmp/input/test1.arrow")

('input/test1.parquet', <http.client.HTTPMessage at 0x103e2efe0>)

#### Pip installations

##### These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release.

##### Example for transform developers working from git clone:

##### make venv

##### source venv/bin/activate

##### pip install jupyterlab



### 1. Repetition Removal
##### This component applies exact substring deduplication to remove any substring of predetermined length that repeats more than once within a single parquet file level by adapting the implementation from [deduplicate-text-datasets](https://github.com/google-research/deduplicate-text-datasets)

<!--  -->
#### Prerequisites

##### To run the repetition removal transform, Rust is required to be installed on the machine. You can install rust following instructions [here](https://www.rust-lang.org/tools/install).

##### Add Rust to $PATH

##### If Rust is not added to your $PATH, run the below steps to add the rust installation location for proper execution.

##### You can use the !whereis cargo command to find where rust is installed in your machine, and set the path there up to the /bin

##### ex: whereis cargo produces: cargo: /Users/USERNAME/.cargo/bin/cargo

##### set the $PATH to include /Users/USERNAME/.cargo/bin/

In [40]:
# import pandas as pd

# pq = "input/test1.parquet"
# df = pd.read_parquet(pq)
# df.head(20)

In [4]:
%%time
from dpk_rep_removal.runtime import RepRemoval

RepRemoval(input_folder= "input",
            output_folder= "tmp/repRemoval",
            rep_removal_contents_column_name='text', 
            rep_removal_num_threads='1',
            ).transform()

  from .autonotebook import tqdm as notebook_tqdm
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
23:09:25 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
23:09:25 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
23:09:25 INFO - data factory data_ is using local data access: input_folder - input output_folder - tmp/repRemoval
INFO:data_processing.data_access.data_access_factory_base57a8845a-6882-43cd-8e6f-24f330ca6ad4:data factory data_ is using local data access: input_folder - input output_folder - tmp/repRemoval
23:09:25 INFO - data factory data_ max_files -1, n_sample -1
INFO:data_processing.data_access.data_access_factory_base57a8845a-6882-43cd-8e6f-24f330ca6ad4:data factory data_ max_files -1, n_sample -1
23:09:25 INFO - data factory data_ Not using data sets, che

cpu speed: 3228 MHz, Cores: 10
gpu_usage: 0.00%, GPU speed: 0 MHz


INFO:root:running the merge
INFO:root:merging complete
[1m[32m    Finished[0m dev [optimized + debuginfo] target(s) in 0.07s
[1m[32m     Running[0m `venv/lib/python3.10/site-packages/dpk_rep_removal/rust/target/debug/dedup_dataset self-similar --data-file /var/folders/f3/5zmfvg4j539bhmnsxzqmbc2h0000gn/T/tmpwxurx_g9/save_dir/parquet --length-threshold 50 --cache-dir /var/folders/f3/5zmfvg4j539bhmnsxzqmbc2h0000gn/T/tmpwxurx_g9/cache --num-threads 1 --frequency-threshold 1 --retain-first-copy`


Start load!
0 / 19909 
Duplicates found: 7250
Total time taken: 7ms


23:09:27 INFO - Completed 1 files (100.0%) in 0.036 min
INFO:data_processing.runtime.pure_python.transform_orchestrator:Completed 1 files (100.0%) in 0.036 min
23:09:27 INFO - Done processing 1 files, waiting for flush() completion.
INFO:data_processing.runtime.pure_python.transform_orchestrator:Done processing 1 files, waiting for flush() completion.
23:09:27 INFO - done flushing in 0.0 sec
INFO:data_processing.runtime.pure_python.transform_orchestrator:done flushing in 0.0 sec
23:09:27 INFO - Completed execution in 0.036 min, execution result 0
INFO:data_processing.runtime.pure_python.transform_launcher:Completed execution in 0.036 min, execution result 0


CPU times: user 1.33 s, sys: 2.11 s, total: 3.44 s
Wall time: 7.5 s


0

In [41]:
# import pandas as pd

# pq = "tmp/repRemoval/test1.parquet"
# df = pd.read_parquet(pq)
# df.head(20)

### 2. Annotation


### 2.1. Fasttext Quality Annotator
##### This step annotates the documents using two FastText quality classifiers: (i) the fastText classifier from [DCLM](https://arxiv.org/pdf/2406.11794) and (ii) our own fastText classifier trained on a mix of high-quality synthetic data and data annotated by an LLM for high educational value. 

In [5]:
credential= ""

In [21]:
%%time 
from dpk_gneissweb_classification.transform_python import Classification

Classification(input_folder= "tmp/repRemoval",
        output_folder= "tmp/fasttext/quality",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_gneissweb_quality_annotator.bin",
        gcls_model_url= "ibm-research/GneissWeb.Quality_annotator",
        gcls_output_label_column_name= "cosmo_fastText_label",
        gcls_output_score_column_name= "cosmo_fastText_score",
        gcls_content_column_name= "text").transform()

12:05:42 INFO - parameters are : {'gcls_model_credential': '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', 'gcls_model_file_name': 'fasttext_gneissweb_quality_annotator.bin', 'gcls_model_url': 'ibm-research/GneissWeb.Quality_annotator', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'cosmo_fasttext_label', 'gcls_output_score_column_name': 'cosmo_fasttext_score'}
INFO:dpk_gneissweb_classification.transform:parameters are : {'gcls_model_credential': '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', 'gcls_model_file_name': 'fasttext_gneissweb_quality_annotator.bin', 'gcls_model_url': 'ibm-research/GneissWeb.Quality_annotator', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'cosmo_fasttext_label', 'gcls_output_score_column_name': 'cosmo_fasttext_score'}
12:05:42 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
12:05:42 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code lo

CPU times: user 1.25 s, sys: 1.04 s, total: 2.29 s
Wall time: 3.24 s


0

In [22]:
%%time 

Classification(input_folder= "tmp/fasttext/quality",
        output_folder= "tmp/fasttext/DCLM",
        gcls_model_credential= credential,
        gcls_model_file_name= "openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train.bin",
        gcls_model_url= "mlfoundations/fasttext-oh-eli5",
        gcls_output_label_column_name= "dclm_fastText_label",
        gcls_output_score_column_name= "dclm_fastText_score",
        gcls_content_column_name= "text").transform()

12:05:48 INFO - parameters are : {'gcls_model_credential': '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', 'gcls_model_file_name': 'openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train.bin', 'gcls_model_url': 'mlfoundations/fasttext-oh-eli5', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'dclm_Fasttext_label', 'gcls_output_score_column_name': 'dclm_Fasttext_score'}
INFO:dpk_gneissweb_classification.transform:parameters are : {'gcls_model_credential': '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', 'gcls_model_file_name': 'openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train.bin', 'gcls_model_url': 'mlfoundations/fasttext-oh-eli5', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'dclm_Fasttext_label', 'gcls_output_score_column_name': 'dclm_Fasttext_score'}
12:05:48 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
12:05:48 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code 

CPU times: user 1.16 s, sys: 815 ms, total: 1.97 s
Wall time: 2.24 s


0

### 2.2. Document Category Classifiers
##### This step annotates the documents using four FastText category classifiers:
##### &emsp;&emsp;  1. Science
##### &emsp;&emsp;  2. Education
##### &emsp;&emsp;  3. Technology & computing
##### &emsp;&emsp;  4. Medical health

In [23]:
%%time 

Classification(input_folder= "tmp/fasttext/DCLM",
        output_folder= "tmp/fasttext/medical",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_medical.bin",
        gcls_model_url= "ibm-research/GneissWeb.Med_classifier",
        gcls_output_label_column_name= "medical_label",
        gcls_output_score_column_name= "medical_score",
        gcls_content_column_name= "text").transform()

12:06:05 INFO - parameters are : {'gcls_model_credential': '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', 'gcls_model_file_name': 'fasttext_medical.bin', 'gcls_model_url': 'ibm-research/GneissWeb.Med_classifier', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'medical_label', 'gcls_output_score_column_name': 'medical_score'}
INFO:dpk_gneissweb_classification.transform:parameters are : {'gcls_model_credential': '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', 'gcls_model_file_name': 'fasttext_medical.bin', 'gcls_model_url': 'ibm-research/GneissWeb.Med_classifier', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'medical_label', 'gcls_output_score_column_name': 'medical_score'}
12:06:05 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
12:06:05 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
12:06:05 INFO - data factory data_ is using local data access:

CPU times: user 1.91 s, sys: 1.12 s, total: 3.03 s
Wall time: 3.2 s


0

In [24]:
%%time 

Classification(input_folder= "tmp/fasttext/medical",
        output_folder= "tmp/fasttext/education",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_education.bin",
        gcls_model_url= "ibm-research/GneissWeb.Edu_classifier",
        gcls_output_label_column_name= "education_label",
        gcls_output_score_column_name= "education_score",
        gcls_content_column_name= "text").transform()

12:06:16 INFO - parameters are : {'gcls_model_credential': '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', 'gcls_model_file_name': 'fasttext_education.bin', 'gcls_model_url': 'ibm-research/GneissWeb.Edu_classifier', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'education_label', 'gcls_output_score_column_name': 'education_score'}
INFO:dpk_gneissweb_classification.transform:parameters are : {'gcls_model_credential': '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', 'gcls_model_file_name': 'fasttext_education.bin', 'gcls_model_url': 'ibm-research/GneissWeb.Edu_classifier', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'education_label', 'gcls_output_score_column_name': 'education_score'}
12:06:16 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
12:06:16 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
12:06:16 INFO - data factory data_ is using local 

CPU times: user 1.69 s, sys: 996 ms, total: 2.68 s
Wall time: 2.82 s


0

In [25]:
%%time 

Classification(input_folder= "tmp/fasttext/education",
        output_folder= "tmp/fasttext/technology",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_technology_computing.bin",
        gcls_model_url= "ibm-research/GneissWeb.Tech_classifier",
        gcls_output_label_column_name= "technology_computing_label",
        gcls_output_score_column_name= "technology_computing_score",
        gcls_content_column_name= "text").transform()

12:06:22 INFO - parameters are : {'gcls_model_credential': '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', 'gcls_model_file_name': 'fasttext_technology_computing.bin', 'gcls_model_url': 'ibm-research/GneissWeb.Tech_classifier', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'technology_computing_label', 'gcls_output_score_column_name': 'technology_computing_score'}
INFO:dpk_gneissweb_classification.transform:parameters are : {'gcls_model_credential': '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', 'gcls_model_file_name': 'fasttext_technology_computing.bin', 'gcls_model_url': 'ibm-research/GneissWeb.Tech_classifier', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'technology_computing_label', 'gcls_output_score_column_name': 'technology_computing_score'}
12:06:22 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
12:06:22 INFO - code location None
INFO:data_processing.runtime.execution_configuration:c

CPU times: user 1.95 s, sys: 1.31 s, total: 3.25 s
Wall time: 3.59 s


0

In [26]:
%%time 

Classification(input_folder= "tmp/fasttext/technology",
        output_folder= "tmp/fasttext/science",
        gcls_model_credential= credential,
        gcls_model_file_name= "fasttext_science.bin",
        gcls_model_url= "ibm-research/GneissWeb.Sci_classifier",
        gcls_output_label_column_name= "science_label",
        gcls_output_score_column_name= "science_score",
        gcls_content_column_name= "text").transform()

12:06:27 INFO - parameters are : {'gcls_model_credential': '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', 'gcls_model_file_name': 'fasttext_science.bin', 'gcls_model_url': 'ibm-research/GneissWeb.Sci_classifier', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'science_label', 'gcls_output_score_column_name': 'science_score'}
INFO:dpk_gneissweb_classification.transform:parameters are : {'gcls_model_credential': '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', 'gcls_model_file_name': 'fasttext_science.bin', 'gcls_model_url': 'ibm-research/GneissWeb.Sci_classifier', 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': 'science_label', 'gcls_output_score_column_name': 'science_score'}
12:06:27 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
12:06:27 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
12:06:27 INFO - data factory data_ is using local data access:

CPU times: user 2.38 s, sys: 1.49 s, total: 3.87 s
Wall time: 4.18 s


0

### 2.3. Readability Scores Quality Annotator
##### This transform calculates the McAlpine-EFLAW readability score for each document in the output parquet file from the previous step and adds McAlpine-EFLAW readability score column to the data.

##### McAlpine-EFLAW readability score of a document is a numerical score computed as a function of the number of words in a document plus the number of mini-words (consisting of ≤ 3 characters) divided by the number of sentences. Lower score means the document is easier to understand for a reader with English as a foreign language. 

In [20]:
# !pip install textstat
from dpk_readability.runtime import Readability

Readability(
    input_folder="tmp/fasttext/science",
    output_folder="tmp/readabilty",
    readability_contents_column_name="text",
).transform()

11:41:15 INFO - Readability parameters are : {'readability_contents_column_name': 'text', 'readability_score_list': 'mcalpine_eflaw_textstat'}
11:41:15 INFO - pipeline id pipeline_id
11:41:15 INFO - code location None
11:41:15 INFO - data factory data_ is using local data access: input_folder - tmp/fasttext/science output_folder - tmp/readabilty
11:41:15 INFO - data factory data_ max_files -1, n_sample -1
11:41:15 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
11:41:15 INFO - orchestrator readability started at 2025-02-14 11:41:15
11:41:15 INFO - Number of files is 1, source profile {'max_file_size': 0.042934417724609375, 'min_file_size': 0.042934417724609375, 'total_file_size': 0.042934417724609375}
11:41:15 INFO - Completed 1 files (100.0%) in 0.0 min
11:41:15 INFO - Done processing 1 files, waiting for flush() completion.
11:41:15 INFO - done flushing in 0.0 sec
11:41:15

0

In [42]:
# import pandas as pd

# pq = "tmp/readabilty/test1.parquet"
# df = pd.read_parquet(pq)
# df.info()

### 2.4. Extreme-Tokenized-Documents Quality Annotator
##### This annotator retrieves the tokens generated for a set of documents. Then, it calculates, for each document, the size and the total number of characters. The number of tokens is divided by the size and by the number of characters, and the resulting values are stored in two columns ( tokens_per_doc_size and tokens_per_doc_num_chars).

##### The annotator transform annotates the input table with 5 columns:

##### &emsp;&emsp;1. doc_num_tokens - number of tokens for each document
##### &emsp;&emsp;2. doc_size_kbs - document size in kb
##### &emsp;&emsp;3. doc_num_chars - number of characters in the document
##### &emsp;&emsp;4. tokens_per_doc_size - ratio between number of tokens and document size
##### &emsp;&emsp;5. tokens_per_doc_num_chars - ratio between number of tokens and number of characters in document
##### Documents with extremely high or low number of tokens per character (or tokens per byte) are identified as extreme-tokenized documents and can be excluded in the filtering step.



#### 2.4.1 Tokenization

In [22]:
## packages needed for the tokenization
# !pip install "transformers>=4.38.2" "torch" "python-dotenv"


In [31]:
from dpk_tokenization2arrow.transform_python import Tokenization2Arrow

Tokenization2Arrow(
        input_folder= "tmp/readabilty",
        output_folder= "tmp/arrows",
        tkn_tokenizer=  "bigcode/starcoder",
        tkn_doc_id_column= "id",
        tkn_doc_content_column= "text",
        tkn_chunk_size= 20_000).transform()

11:52:31 INFO - pipeline id pipeline_id
11:52:31 INFO - code location None
11:52:31 INFO - data factory data_ is using local data access: input_folder - tmp/readabilty output_folder - tmp/arrows
11:52:31 INFO - data factory data_ max_files -1, n_sample -1
11:52:31 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
11:52:31 INFO - orchestrator Tokenization2Arrow started at 2025-02-14 11:52:31
11:52:31 INFO - Number of files is 1, source profile {'max_file_size': 0.03496551513671875, 'min_file_size': 0.03496551513671875, 'total_file_size': 0.03496551513671875}
11:52:31 INFO - Tokenizer config['tokenizer'] = 'bigcode/starcoder' loaded.
11:52:31 INFO - Tokenization2ArrowTransform.transform_binary file_name = '/Users/hajaremami/Desktop/DPK_notebook/data-prep-kit/examples/notebooks/GneissWeb/tmp/readabilty/test1.parquet'
11:52:31 INFO - Completed 1 files (100.0%) in 0.0 min
11:52:31 

0

#### 2.4.2 Annotation

In [38]:
from dpk_extreme_tokenized.runtime import ExtremeTokenized

ExtremeTokenized(
    input_folder="tmp/readabilty",
    output_folder="tmp/extreme_tokenized",
    et_contents_column_name="text",
    et_arrow_path="tmp/arrows",
).transform()

12:07:51 INFO - data factory et_ is using local configuration without input/output path
12:07:51 INFO - data factory et_ max_files -1, n_sample -1
12:07:51 INFO - data factory et_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
12:07:51 INFO - pipeline id pipeline_id
12:07:51 INFO - code location None
12:07:51 INFO - data factory data_ is using local data access: input_folder - tmp/readabilty output_folder - tmp/extreme_tokenized
12:07:51 INFO - data factory data_ max_files -1, n_sample -1
12:07:51 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
12:07:51 INFO - orchestrator et started at 2025-02-14 12:07:51
12:07:51 INFO - Number of files is 1, source profile {'max_file_size': 0.03496551513671875, 'min_file_size': 0.03496551513671875, 'total_file_size': 0.03496551513671875}
12:07:51 INFO -

1

### 3. Category-aware Ensemble Quality Filter
##### This filtering step filters out low-quality documents from the input data using multiple quality annotators and by leveraging the category information of documents. 

In [16]:
from dpk_filter.transform_python import Filter

Filter(input_folder= "tmp/readabilty",
        output_folder= "output",
        filter_criteria_list= 
        # ['(("dclm_fasttext_score" > 0.002 OR "cosmo_10k_edu_fasttext_score" > 0.03)) AND (((mcalpine_eflaw_textstat < 70) AND (technology_computing_label IN ('technology') OR medical_label IN ('medical') OR education_label IN ('education') OR science_label IN ('science'))) OR ((mcalpine_eflaw_textstat < 30) AND (technology_computing_label IN ('cc') AND medical_label IN ('cc') AND education_label IN ('cc') AND science_label IN ('cc'))))', '(("dclm_fasttext_score" > 0.002 OR "cosmo_10k_edu_fasttext_score" > 0.03)) AND (((tokens_per_doc_num_chars BETWEEN 0.1 AND 0.5) AND (technology_computing_label IN ('technology') OR medical_label IN ('medical') OR education_label IN ('education') OR science_label IN ('science'))) OR ((tokens_per_doc_num_chars BETWEEN 0.22 AND 0.28) AND (technology_computing_label IN ('cc') AND medical_label IN ('cc') AND education_label IN ('cc') AND science_label IN ('cc'))))']
        ['(("dclm_fasttext_score" > 0.002 OR "cosmo_10k_edu_fasttext_score" > 0.03)) AND (((mcalpine_eflaw_textstat < 70) AND (technology_computing_label IN (\'technology\') OR medical_label IN (\'medical\') OR education_label IN (\'education\') OR science_label IN (\'science\'))) OR ((mcalpine_eflaw_textstat < 30) AND (technology_computing_label IN (\'cc\') AND medical_label IN (\'cc\') AND education_label IN (\'cc\') AND science_label IN (\'cc\'))))',
         '(("dclm_fasttext_score" > 0.002 OR "cosmo_10k_edu_fasttext_score" > 0.03)) AND (((tokens_per_doc_num_chars BETWEEN 0.1 AND 0.5) AND (technology_computing_label IN (\'technology\') OR medical_label IN (\'medical\') OR education_label IN (\'education\') OR science_label IN (\'science\'))) OR ((tokens_per_doc_num_chars BETWEEN 0.22 AND 0.28) AND (technology_computing_label IN (\'cc\') AND medical_label IN (\'cc\') AND education_label IN (\'cc\') AND science_label IN (\'cc\'))))'],
        filter_logical_operator= "OR").transform()

23:14:32 INFO - pipeline id pipeline_id
23:14:32 INFO - code location None
23:14:32 INFO - data factory data_ is using local data access: input_folder - tmp/readabilty output_folder - output
23:14:32 INFO - data factory data_ max_files -1, n_sample -1
23:14:32 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
23:14:32 INFO - orchestrator filter started at 2025-02-13 23:14:32
23:14:32 INFO - Number of files is 1, source profile {'max_file_size': 0.05427265167236328, 'min_file_size': 0.05427265167236328, 'total_file_size': 0.05427265167236328}
  File "/Users/hajaremami/Desktop/DPK_notebook/data-prep-kit/examples/notebooks/GneissWeb/venv/lib/python3.10/site-packages/data_processing/runtime/transform_file_processor.py", line 79, in process_file
    out_files, stats = self.transform.transform_binary(file_name=f_name, byte_array=filedata)
  File "/Users/hajaremami/Desktop/DPK_notebo

1