# GneissWeb Recipe

#### This notebook presents the GneissWeb recipe and applies its components in sequence to reproduce the GneissWeb processing pipeline using Data Prep Kit (DPK) transforms.
#### ![](recipe3.png)



#### Pip installations

##### These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements file that includes the right release.

##### Example for transform developers working from git clone:

##### make venv

##### source venv/bin/activate

##### pip install jupyterlab



### 1. Repetition Removal
##### This component applies exact substring deduplication to remove any substring of a predetermined length that repeats more than once within a single parquet file, adapting the implementation from [deduplicate-text-datasets](https://github.com/google-research/deduplicate-text-datasets).
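
##### To illustrate the idea only, the naive sketch below finds the character ranges covered by any substring of a fixed length that occurs more than once in a single string. It is a stand-in under stated assumptions: the transform adapts the suffix-array implementation from deduplicate-text-datasets, whereas `repeated_spans` is a hypothetical helper that hashes every window and works on one Python string rather than on a parquet file.

```python
from collections import defaultdict


def repeated_spans(text: str, length: int) -> list[tuple[int, int]]:
    """Return merged (start, end) ranges covered by windows that occur more than once.

    Naive illustration: every window of `length` characters is hashed, which is
    far less efficient than a suffix array on large inputs.
    """
    positions = defaultdict(list)
    for i in range(len(text) - length + 1):
        positions[text[i:i + length]].append(i)

    # Start offsets of every window that appears at two or more positions.
    starts = sorted(i for occ in positions.values() if len(occ) > 1 for i in occ)

    merged: list[tuple[int, int]] = []
    for start in starts:
        end = start + length
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent range: extend the previous one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(repeated_spans("abcXabc", 3))  # [(0, 3), (4, 7)]
```

##### How the flagged ranges are then removed (for example, whether one occurrence is kept) is decided by the transform and is not shown in this sketch.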


#### Prerequisites

##### The repetition removal transform requires Rust to be installed on the machine. You can install Rust by following the instructions [here](https://www.rust-lang.org/tools/install).

##### Add Rust to $PATH

##### If Rust is not on your $PATH, follow the steps below to add the Rust installation location so that the transform can run.

##### Use the !whereis cargo command to find where Rust is installed on your machine, and add the directory up to and including /bin to your $PATH.

##### For example, whereis cargo produces: cargo: /Users/USERNAME/.cargo/bin/cargo

##### In that case, set the $PATH to include /Users/USERNAME/.cargo/bin/
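
##### From inside the notebook, the $PATH of the running kernel can be extended along these lines. This is a minimal sketch: `prepend_to_path` is a hypothetical helper, and the `~/.cargo/bin` location is an assumption matching the example above, so substitute the directory reported by whereis cargo on your machine.

```python
import os
import shutil
from pathlib import Path
from typing import MutableMapping


def prepend_to_path(directory: str, env: MutableMapping[str, str] = os.environ) -> str:
    """Put `directory` at the front of PATH unless it is already listed."""
    entries = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in entries:
        entries.insert(0, directory)
    env["PATH"] = os.pathsep.join(entries)
    return env["PATH"]


# Assumption: cargo lives under the home directory, as in the example above.
cargo_bin = str(Path.home() / ".cargo" / "bin")
prepend_to_path(cargo_bin)

# shutil.which returns the resolved path to cargo, or None if it is still not found.
print(shutil.which("cargo"))
```

##### The change applies only to the current kernel process and the subprocesses it starts; it does not modify your shell profile.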

### 2. Annotation


### 2.1. Fasttext Quality Annotator
##### This step annotates the documents using two fastText quality classifiers: (i) the fastText classifier from [DCLM](https://arxiv.org/pdf/2406.11794) and (ii) our own fastText classifier trained on a mix of high-quality synthetic data and data annotated by an LLM for high educational value.

### 2.2. Readability Scores Quality Annotator
##### This transform calculates the McAlpine-EFLAW readability score for each document in the output parquet file from the previous step and adds a McAlpine-EFLAW readability score column to the data.

##### The McAlpine-EFLAW readability score of a document is computed as the number of words in the document plus the number of mini-words (words of ≤ 3 characters), divided by the number of sentences. A lower score means the document is easier to understand for a reader with English as a foreign language.
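
##### As a rough illustration of the score described above, the sketch below computes (words + mini-words) / sentences for a single string. The regular expressions used to split words and sentences are assumptions made for this example, and `mcalpine_eflaw` is a hypothetical helper; the transform's own tokenization may differ and give slightly different values.

```python
import re


def mcalpine_eflaw(text: str) -> float:
    """(number of words + number of mini-words) / number of sentences."""
    # Assumption: a word is a run of letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Mini-words are words of three characters or fewer.
    mini_words = [w for w in words if len(w) <= 3]
    # Assumption: sentences end at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return (len(words) + len(mini_words)) / len(sentences)


# 9 words, 8 of them mini-words, 2 sentences: (9 + 8) / 2
print(mcalpine_eflaw("The cat sat on the mat. It was happy."))  # 8.5
```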

### 2.3. Extreme-Tokenized Annotator
##### This annotator retrieves the tokens generated for a set of documents. Then, for each document, it calculates the size and the total number of characters. The number of tokens is divided by the size and by the number of characters, and the resulting values are stored in two columns (tokens_per_doc_size and tokens_per_doc_num_chars).

##### The annotator transform annotates the input table with 5 columns:

###### 1. doc_num_tokens - number of tokens for each document
###### 2. doc_size_kbs - document size in kb
###### 3. doc_num_chars - number of characters in the document
###### 4. tokens_per_doc_size - ratio between number of tokens and document size
###### 5. tokens_per_doc_num_chars - ratio between number of tokens and number of characters in document
##### Documents with an extremely high or low number of tokens per character (or tokens per byte) are identified as extreme-tokenized documents and can be excluded in the filtering step.
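
##### The five columns listed above can be sketched for a single document as follows. This is an illustration, not the transform's code: `annotate_tokenization` is a hypothetical helper, the token count is supplied by the caller, and the size is assumed to be the UTF-8 byte length divided by 1024.

```python
def annotate_tokenization(text: str, num_tokens: int) -> dict:
    """Build the five annotation values for one document."""
    # Assumption: document size is the UTF-8 byte length, with 1 kb = 1024 bytes.
    size_kbs = len(text.encode("utf-8")) / 1024
    num_chars = len(text)
    return {
        "doc_num_tokens": num_tokens,
        "doc_size_kbs": size_kbs,
        "doc_num_chars": num_chars,
        # Guard against empty documents to avoid division by zero.
        "tokens_per_doc_size": num_tokens / size_kbs if size_kbs else 0.0,
        "tokens_per_doc_num_chars": num_tokens / num_chars if num_chars else 0.0,
    }


# Whitespace splitting stands in for a real tokenizer in this example.
sample = "GneissWeb applies several annotators before filtering."
print(annotate_tokenization(sample, num_tokens=len(sample.split())))
```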



### 2.4. Document Category Classifiers
##### This step annotates the documents using four fastText category classifiers:
######   1. Science
######   2. Education
######   3. Technology & computing
######   4. Medical health

### 3. Ensemble Quality Filter
##### This step filters out low-quality documents from the input data by combining multiple quality annotators with the category information of the documents.
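
##### One way such a combination could look is sketched below. Everything in it is hypothetical and chosen only for illustration: the field names, the threshold values, the per-category override, and the rule itself (best quality score above a minimum, readability below a maximum, tokens per character within bounds). It is not the GneissWeb filtering rule.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    min_quality: float
    max_readability: float
    min_tokens_per_char: float
    max_tokens_per_char: float


# Hypothetical values, for illustration only.
DEFAULT = Thresholds(0.5, 30.0, 0.1, 0.5)
PER_CATEGORY = {"science": Thresholds(0.5, 40.0, 0.1, 0.5)}


def keep_document(doc: dict) -> bool:
    """Decide whether to keep a document based on its annotations."""
    # Category information selects which thresholds apply.
    t = PER_CATEGORY.get(doc.get("category"), DEFAULT)
    quality_ok = max(doc["quality_scores"]) >= t.min_quality
    readable = doc["readability"] <= t.max_readability
    ratio = doc["tokens_per_doc_num_chars"]
    tokenized_ok = t.min_tokens_per_char <= ratio <= t.max_tokens_per_char
    return quality_ok and readable and tokenized_ok


doc = {
    "category": "science",
    "quality_scores": [0.2, 0.7],
    "readability": 35.0,
    "tokens_per_doc_num_chars": 0.25,
}
print(keep_document(doc))  # True
```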