# Metrics in RedPajama V2 and CulturaX

## CulturaX (strikethrough found in RedPajama):
- ~Number of words~
- ~Character repetition ratio~ (calculated from n-gram repetitions)
- ~Word repetition ratio~ (from unique words and/or unigram entropy)
- ~Special character ratio~
- ~Stop word ratio~
- ~Flagged word ratio~
- ~Language identification confidence~
- ~Perplexity score~
- ~Document length (number of characters)~
- ~Number of lines~
- Short line length ratio  => what does this mean??
- ~Short line ratio~ (can be calculated from lines_num_words) => used threshold 3 => could try 5?

Additionally, UT1 blacklist filtering is applied beforehand => not needed in quantile extraction, since it is a binary variable.

## RedPajama annotation tags and their meanings. Overlap with CulturaX marked in red.


| Annotation Tag | Description | Category | Reference |
| --- | --- | --- | --- |
| ccnet_bucket | head, middle or tail bucket of the perplexity score | CCNet | CCNet |
| <font color='red'> <b>ccnet_language_score | score of the language identification model | CCNet | CCNet |
| <font color='red'> <b>ccnet_length </font> | number of characters | CCNet | CCNet |
| <font color='red'><b>ccnet_nlines </font> | number of lines | CCNet | CCNet |
| ccnet_original_length | number of characters before in-document line deduplication | CCNet | CCNet |
| ccnet_original_nlines | number of lines before in-document line deduplication | CCNet | CCNet |
| <font color='red'><b>ccnet_perplexity | perplexity of an LM trained on Wikipedia | CCNet | CCNet |
| rps_doc_books_importance | Given a bag of {1,2}-wordgram model trained on Books p, and a model trained on the source domain q, this is the logarithm of the ratio p(doc)/q(doc). | ML Heuristics | Importance Resampling (Xie et al.) |
| rps_doc_openwebtext_importance | Given a bag of {1,2}-wordgram model trained on OpenWebText p, and a model trained on the source domain q, this is the logarithm of the ratio p(doc)/q(doc). | ML Heuristics | Importance Resampling (Xie et al.) |
| rps_doc_wikipedia_importance | Given a bag of {1,2}-wordgram model trained on Wikipedia articles p, and a model trained on the source domain q, this is the logarithm of the ratio p(doc)/q(doc). | ML Heuristics | Importance Resampling (Xie et al.) |
| rps_doc_ml_wikiref_score | Fasttext classifier prediction for the document being a Wikipedia reference. This is the same fasttext model used in the RedPajama-1T dataset. Only applies to English data. | ML Heuristics | LLaMA, RedPajama-1T |
| rps_doc_ml_palm_score | Fasttext classifier prediction for the document being a Wikipedia article, OpenWebText sample or a RedPajama-V1 book. Only for English data. | ML Heuristics | PALM, GLaM |
| rps_doc_ml_wikipedia_score | Fasttext classifier prediction for the document being a Wikipedia article. This is used for non-English data | ML Heuristics | - |
| rps_doc_curly_bracket | The ratio between the number of occurrences of '{' or '}' and the number of characters in the raw text. | Natural Language | C4 |
| rps_doc_frac_all_caps_words | The fraction of words in the content that only consist of uppercase letters. This is based on the raw content. | Natural Language | Pretrainer’s Guide |
| rps_doc_frac_lines_end_with_ellipsis | The fraction of lines that end with an ellipsis, where an ellipsis is defined as either "..." or "…". | Natural Language | RefinedWeb, Gopher |
| <font color='red'><b>rps_doc_frac_no_alph_words | The fraction of words that contain no alphabetical character. | Natural Language | RefinedWeb, Gopher |
| rps_doc_lorem_ipsum | The ratio between the number of occurrences of 'lorem ipsum' and the number of characters in the content after normalization. | Natural Language | C4 |
| rps_doc_mean_word_length | The mean length of words in the content after normalization. | Natural Language | RefinedWeb, Gopher |
| <font color='red'><b>rps_doc_stop_word_fraction | The ratio between the number of stop words and the number of words in the document. Stop words are obtained from the stopwords-json repo. | Natural Language | RefinedWeb, Gopher |
| rps_doc_symbol_to_word_ratio | The ratio of symbols to words in the content. Symbols are defined as "#", "...", and "…". | Natural Language | RefinedWeb, Gopher |
| <font color='red'><b>rps_doc_frac_unique_words | The fraction of unique words in the content. This is also known as the degeneracy of a text sample. Calculated based on the normalized content. | Natural Language | Pretrainer’s Guide |
| <font color='red'><b>rps_doc_unigram_entropy | The entropy of the unigram distribution of the content. This measures the diversity of the content and is computed using sum(-x / total * log(x / total)) where the sum is taken over counts of unique words in the normalized content. | Natural Language | - |
| <font color='red'><b>rps_doc_word_count | The number of words in the content after normalization. | Natural Language | RefinedWeb, Gopher |
| rps_lines_ending_with_terminal_punctuation_mark | Indicates whether a line ends with a terminal punctuation mark. A terminal punctuation mark is defined as one of: ".", "!", "?", "”". | Natural Language | C4 |
| rps_lines_javascript_counts | The number of occurrences of the word "javascript" in each line. | Natural Language | C4 |
| <font color='red'><b>rps_lines_num_words | The number of words in each line. This is computed based on the normalized text. | Natural Language | C4 , RefinedWeb |
| rps_lines_numerical_chars_fraction | The ratio between the number of numerical characters and the total number of characters in each line. This is based on the normalized content. | Natural Language | RefinedWeb |
| rps_lines_start_with_bulletpoint | Whether the lines start with a bullet point symbol. The following set of unicodes are considered a bullet point: \u2022 (bullet point), \u2023 (triangular bullet point), \u25B6 (black right pointing triangle), \u25C0 (black left pointing triangle), \u25E6 (white bullet point), \u25A0 (black square), \u25A1 (white square), \u25AA (black small square), \u25AB (white small square), \u2013 (en dash). | Natural Language | RefinedWeb, Gopher |
| rps_lines_uppercase_letter_fraction | The ratio between the number of uppercase letters and the total number of characters in each line. This is based on the raw text. | Natural Language | RefinedWeb |
| rps_doc_num_sentences | The number of sentences in the content. This is calculated using the regular expression r'\b[^.!?]+[.!?]*'. | Natural Language | C4 |
| <font color='red'><b>rps_doc_frac_chars_dupe_10grams | The fraction of characters in duplicate word 10grams. This operates on the lower-cased, punctuation removed content. It is also ensured that characters in overlapping ngrams are only counted once. | Repetitiveness | RefinedWeb, Gopher |
| <font color='red'><b>rps_doc_frac_chars_dupe_5grams | The fraction of characters in duplicate word 5grams. | Repetitiveness | RefinedWeb, Gopher |
| rps_doc_frac_chars_dupe_6grams | The fraction of characters in duplicate word 6grams. | Repetitiveness | RefinedWeb, Gopher |
| rps_doc_frac_chars_dupe_7grams | The fraction of characters in duplicate word 7grams. | Repetitiveness | RefinedWeb, Gopher |
| rps_doc_frac_chars_dupe_8grams | The fraction of characters in duplicate word 8grams. | Repetitiveness | RefinedWeb, Gopher |
| rps_doc_frac_chars_dupe_9grams | The fraction of characters in duplicate word 9grams. | Repetitiveness | RefinedWeb, Gopher |
| rps_doc_frac_chars_top_2gram | The fraction of characters in the top word 2gram. | Repetitiveness | RefinedWeb, Gopher |
| rps_doc_frac_chars_top_3gram | The fraction of characters in the top word 3gram. | Repetitiveness | RefinedWeb, Gopher |
| rps_doc_frac_chars_top_4gram | The fraction of characters in the top word 4gram. | Repetitiveness | RefinedWeb, Gopher |
| <font color='red'> <b>rps_doc_ldnoobw_words | The number of sequences of words that are contained in the List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words blocklist. The blocklist is obtained from the LDNOOBW repo. | toxicity | C4 |
| <font color='blue'> <b>rps_doc_ut1_blacklist | A categorical id corresponding to the list of categories of the domain of the document. Categories are obtained from the UT1 blacklist. The list is obtained from UT-Capitole. | toxicity | RefinedWeb |
| minhash_signature_0.7 | Banded minhash signature of the document, for fuzzy deduplication at Jaccard similarity 0.7. The signature is based on 128 hash functions and grouped into 14 bands and 9 rows for LSH. | Deduplication | |
| minhash_signature_0.8 | Banded minhash signature of the document, for fuzzy deduplication at Jaccard similarity 0.8. The signature is based on 128 hash functions and grouped into 9 bands and 13 rows for LSH. | Deduplication | |
| minhash_signature_0.9 | Banded minhash signature of the document, for fuzzy deduplication at Jaccard similarity 0.9. The signature is based on 128 hash functions and grouped into 5 bands and 25 rows for LSH. | Deduplication | |
| minhash_signature_1.0 | Banded minhash signature of the document, for fuzzy deduplication at Jaccard similarity 1.0. The signature is based on 128 hash functions and grouped into 1 band and 128 rows for LSH. | Deduplication | |
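
To make two of the table's definitions concrete, here is a minimal sketch of `rps_doc_unigram_entropy` and `rps_doc_frac_unique_words` as described above. It is not the RedPajama implementation: the normalization (lower-casing plus whitespace splitting) and the use of the natural logarithm are assumptions, and the function names are made up for illustration.

```python
import math
from collections import Counter


def unigram_entropy(words):
    """Entropy of the unigram distribution: sum(-x / total * log(x / total))."""
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Natural log is an assumption; the table only says "log".
    return sum(-c / total * math.log(c / total) for c in counts.values())


def frac_unique_words(words):
    """Number of distinct words divided by the total number of words."""
    return len(set(words)) / len(words) if words else 0.0


# Assumed normalization: lower-casing and whitespace splitting only.
words = "The cat sat on the mat".lower().split()
entropy = unigram_entropy(words)
unique = frac_unique_words(words)
print(round(entropy, 4), round(unique, 4))
```

A document that repeats one word many times would score low on both values, while a bare word list would score close to 1.0 on the unique-word fraction.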



    

## "Task": Choose intuitively from these, but at least include the ones in CulturaX

The chosen metrics are:

    import numpy as np

    def quality_signals(signals, results):

        # word count
        number_of_words=signals["rps_doc_word_count"][0][2]

        # line count
        number_of_lines=signals["ccnet_nlines"][0][2]

        # character count
        number_of_characters=signals["ccnet_length"][0][2]

        # certainty of language
        language_identification=signals["ccnet_language_score"][0][2]

        # perplexity
        perplexity=signals["ccnet_perplexity"][0][2]

        # fraction of stop words vs all words
        stop_words=signals["rps_doc_stop_word_fraction"][0][2]

        # fraction of words that contain no alphabetic character
        special_characters=signals["rps_doc_frac_no_alph_words"][0][2]

        # number of flagged word sequences (LDNOOBW blocklist); a count, not a ratio
        flagged_words=signals["rps_doc_ldnoobw_words"][0][2]

        # words per line => used to calculate two other measures
        words_per_line = np.array(signals["rps_lines_num_words"])[:,2]

        # mean words per line
        words_per_line_mean=np.mean(words_per_line)

        # lines that have 0-2 words / all lines
        short_line_ratio=np.count_nonzero(words_per_line<3)/number_of_lines

        #short_line_length_ratio ??????

        # characters in duplicate 10-grams
        character_repetition10=signals["rps_doc_frac_chars_dupe_10grams"][0][2]

        # ... in duplicate 5 grams
        character_repetition5=signals["rps_doc_frac_chars_dupe_5grams"][0][2]

        # fraction of unique words => hypothesis: remove both quantiles, as both
        # repetitive content and just a list of words is bad
        word_repetition=signals["rps_doc_frac_unique_words"][0][2]

        # unigram entropy
        unigram_entropy=signals["rps_doc_unigram_entropy"][0][2]

        # lines that end in punctuation / all lines => can be used to filter online shops etc.
        lines_end_in_punct=np.count_nonzero(np.array(signals["rps_lines_ending_with_terminal_punctuation_mark"])[:,2]==1)/number_of_lines
        
        # add all to results; every key in results must match a local variable name above
        computed = locals()
        for k,v in results.items():
            v.append(computed[k])
    
        return results
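
The indexing above (`[0][2]` and `[:,2]`) implies that every signal is a list of `[start, end, value]` entries, with one entry for document-level signals and one per line for line-level signals. The following sketch replays the two line-based ratios on a toy record in plain Python. The record and the helper `line_values` are invented for illustration and are not taken from the dataset.

```python
# Toy record in the assumed [start, end, value] layout (values are made up).
signals = {
    "ccnet_nlines": [[0, 120, 4.0]],
    "rps_lines_num_words": [[0, 30, 12], [30, 40, 2], [40, 100, 9], [100, 120, 1]],
    "rps_lines_ending_with_terminal_punctuation_mark": [
        [0, 30, 1.0], [30, 40, 0.0], [40, 100, 1.0], [100, 120, 0.0],
    ],
}


def line_values(signals, key):
    """Return the third field (the value) of every entry of a signal."""
    return [entry[2] for entry in signals[key]]


number_of_lines = signals["ccnet_nlines"][0][2]
words_per_line = line_values(signals, "rps_lines_num_words")

# mean words per line
words_per_line_mean = sum(words_per_line) / len(words_per_line)

# lines that have 0-2 words / all lines
short_line_ratio = sum(1 for n in words_per_line if n < 3) / number_of_lines

# lines that end in terminal punctuation / all lines
ends = line_values(signals, "rps_lines_ending_with_terminal_punctuation_mark")
lines_end_in_punct = sum(1 for flag in ends if flag == 1) / number_of_lines

print(words_per_line_mean, short_line_ratio, lines_end_in_punct)
```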

## Extracting the thresholds

1. Select a random sample: ``generate_sample_paths.sh`` draws random numbers, discards any number that was already drawn, and appends ``data/redpajama-v2/quality-2023-14/{number}/{language}_{head/middle}.signals.json.gz`` to the sample for each new number until the sample size reaches N=100 or N=500. The paths are saved to a file.
2. Read all sampled files and run ``get_quantiles.py {path to sample file}``, which contains the function above. This also calculates quantiles using ``numpy.percentile()``:

        def calculate_quantiles(arr):
            # flatten() and the percentile levels Q1..Q4 are not shown here
            try:
                # Flatten nested lists and drop missing values before sorting
                arr = flatten(arr)
                sorted_arr = np.sort([i for i in arr if i is not None])
                # Calculating quantiles
                percentile_Q1 = np.percentile(sorted_arr, Q1)
                percentile_Q2 = np.percentile(sorted_arr, Q2)
                percentile_Q3 = np.percentile(sorted_arr, Q3)
                percentile_Q4 = np.percentile(sorted_arr, Q4)
                return percentile_Q1, percentile_Q2, percentile_Q3, percentile_Q4
            except Exception:
                # Placeholder values that mark a failed calculation in the results file
                return "failed", "calculation", "", ""
        
    The values are saved to ``results/{name of the file}_{sample size}_results.txt``.
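
The sampling in step 1 can be sketched in Python as follows. This is not a port of ``generate_sample_paths.sh``: the function name, the `num_shards` value, and the zero-padded four-digit shard number are assumptions. `random.sample` draws without replacement, which replaces the "generate again if already seen" loop.

```python
import random


def sample_signal_paths(language, bucket, sample_size, num_shards, seed=None):
    """Draw sample_size distinct shard numbers and build one path per shard."""
    rng = random.Random(seed)
    # Sampling without replacement guarantees that no shard is picked twice.
    shard_numbers = rng.sample(range(num_shards), sample_size)
    return [
        # The {n:04d} zero padding is an assumption about the directory names.
        f"data/redpajama-v2/quality-2023-14/{n:04d}/{language}_{bucket}.signals.json.gz"
        for n in shard_numbers
    ]


# num_shards=5000 is a placeholder; use the real number of shards in the snapshot.
paths = sample_signal_paths("de", "head", sample_size=100, num_shards=5000, seed=0)
print(len(paths), paths[0])
```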
 
 ***
You can run this on LUMI with 
        
        sbatch sl-quantiles.sh {name of the file} # <= like "en_head", "es_middle"
        
Modify the sample size inside ``sl-quantiles.sh``, and make sure the directories ``samples`` and ``results`` exist.
***

## Lastly, which quantile should be used as the threshold?

- for all "counts" (word, char, line) bigger is better => select > Q10
- language certainty: bigger is better => select > Q10
- perplexity: smaller better => select < Q90
- stop words, special char, flagged words: smaller better => select < Q90
- words_per_line, words_per_line_mean: bigger better => select > Q10
- short_line_ratio: smaller better => select < Q90
- repetition, unigram entropy: maybe both?
- lines end in punctuation: bigger better => select > Q10
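
The rule of thumb above can be sketched as a small helper that returns a direction and a cutoff. The `percentile` function below uses linear interpolation between closest ranks, which should match the default behaviour of ``numpy.percentile()``. The function names and the toy values are illustrative only.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between closest ranks."""
    data = sorted(v for v in values if v is not None)
    if not data:
        raise ValueError("no values to compute a percentile from")
    position = (len(data) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(data) - 1)
    fraction = position - lower
    return data[lower] + (data[upper] - data[lower]) * fraction


def threshold(values, bigger_is_better):
    """Keep documents above Q10 if bigger is better, below Q90 otherwise."""
    if bigger_is_better:
        return ">", percentile(values, 10)
    return "<", percentile(values, 90)


# Toy values; None entries are dropped, as in calculate_quantiles.
word_counts = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None]
perplexities = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]

words_rule = threshold(word_counts, bigger_is_better=True)
perplexity_rule = threshold(perplexities, bigger_is_better=False)
print(words_rule, perplexity_rule)
```

For metrics where both tails look suspicious, such as the repetition and entropy measures, both calls can be combined into a two-sided rule.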

NOTE: the stop-word direction above is wrong (a higher stop-word ratio is better) -> corrected to > Q10 in the selection below.

## Results :)

I drew two samples, N=100 and N=500, and the results were almost identical. I use N=100 for all languages, since English timed out for N=500. This *probably* does not matter, given how close the N=100 and N=500 results were. I have restarted the English N=500 run and might correct this after it finishes. Here is an example for de_head with N=500:

### GERMAN HEAD
|metric|quantiles|choose|
|---|---|---|
|number_of_words: | [40.0, 121.0, 659.0, 1344.0]| >40|
|number_of_lines: | [3.0, 6.0, 26.0, 51.0]|>3|
|number_of_characters: | [302.0, 923.0, 4944.0, 10107.0]|>302|
|language_identification: | [0.97, 0.99, 1.0, 1.0]|>0.97|
|perplexity: | [187.7, 239.5, 332.5, 355.3]|<355.3|
|stop_words: | [0.18367347, 0.27448494, 0.36567164, 0.3988604]|<0.39886|
|special_characters: | [0.13100437, 0.15292096, 0.23450552, 0.33333333]|<0.3333|
|flagged_words: | [0.0, 0.0, 0.0, 0.0]|-|
|words_per_line: | [2.0, 4.0, 35.0, 68.0]|>2|
|words_per_line_mean: | [8.96265866927047, 14.61111111111111, 33.45454545454545, 46.666666666666664]|>8.96266|
|short_line_ratio: | [0.0, 0.0, 0.2, 0.3333333333333333]|-|
|character_repetition10: | [0.0, 0.0, 0.0, 0.0464846]|(maybe not needed)|
|character_repetition5: | [0.0, 0.0, 0.05250678, 0.14074074]|<0.14074|
|word_repetition: | [0.43010753, 0.52428811, 0.73205742, 0.85714286]|>0.43011, <0.85714|
|unigram_entropy: | [3.37919357, 4.29631906, 5.418824377500001, 5.774639442]|>3.37919, <5.7746|
|lines_end_in_punct: | [0.0, 0.25, 0.6, 0.75]|-|

#### ~Okay, now extracting the results~ This has moved to parse_quantiles.py, but you can still test it here :)
1. For each metric, choose the correct index (0 or 3) and whether we should be under or over it.
2. Loop over the results and save them as a JSON dictionary.

    selection_index = {"number_of_words":[[0],[">"]],
                       "number_of_lines":[[0],[">"]],
                       "number_of_characters":[[0],[">"]],
                       "language_identification":[[0],[">"]],
                       "perplexity":[[3],["<"]],
                       "stop_words":[[0],[">"]],
                       "special_characters":[[3],["<"]],
                       "flagged_words":[[3],["<"]],
                       "words_per_line":[[0],[">"]],
                       "words_per_line_mean":[[0],[">"]],
                       "short_line_ratio":[[3],["<"]],
                       "character_repetition10":[[3],["<"]],  #TODO: think if 0<x<3 better
                       "character_repetition5":[[3],["<"]],   #TODO: same
                       "word_repetition":[[0,3],[">","<"]],
                       "unigram_entropy":[[0,3],[">","<"]],
                       "lines_end_in_punct":[[0],[">"]]
                       }

    import ast
    import json
    import os

    results = {}

    results_dir = "/scratch/project_462000086/amanda/results/100c_limit_for_short_line/"
    for file in os.listdir(results_dir):
        if ".txt" in file:
            with open(os.path.join(results_dir, file), "r") as f:
                lines = f.readlines()
            # language key = text before the first underscore in the file name
            lang = "_".join(file.split("_")[:1])
            results[lang] = {}
            for line in lines:
                # skip the sampled file paths and empty lines
                if "json.gz" not in line and line != "\n":
                    l = line.replace("\n", "").split("\t")
                    key = l[0].replace(": ", "")
                    results[lang][key] = {}
                    # parse the list of quantiles without executing arbitrary code
                    value = ast.literal_eval(l[1])
                    thrshl_ind, thrshl_dir = selection_index[key]
                    for v, k in zip(thrshl_ind, thrshl_dir):
                        results[lang][key][k] = str(value[v])

    print(json.dumps(results, indent=4))

    # quality_thresholds.json in dir /scratch/project_462000086/data/redpajama-v2
    #with open("/scratch/project_462000086/data/redpajama-v2/quality_thresholds.json", "w") as outfile:
    #    outfile.write(json.dumps(results))

Output (truncated):

    {
        "en": {
            "number_of_words": {
                ">": "56.0"
            },
            "number_of_lines": {
                ">": "3.0"
            },
            "number_of_characters": {
                ">": "349.0"
            },
            "language_identification": {
                ">": "0.85"
            },
            "perplexity": {
                "<": "485.69999999999993"
            },
            "stop_words": {
                ">": "0.19662921"
            },
            "special_characters": {
                "<": "0.3"
            },
            "flagged_words": {
                "<": "0.0"
            },
            "words_per_line": {
                ">": "3.0"
            },
            "words_per_line_mean": {
                ">": "11.0"
            },
            "short_line_ratio": {
                "<": "0.9000000000000001"
            },
            "character_repetition10": {
                "<": "0.06695012800000001"
            },
            "character_repetition5": {
                "<": "0.17425743"
            },
            "word_repetition": {
                ">": "0.35081615",
                "<": "0.7

#### This is how this format can be used:

    for key,value in results["fr_head"]["number_of_words"].items():
        print(key, value)
        f = lambda x : eval("x"+key+value)

Output:

    > 65.0

Then:

    print(f(89))
    print(f(4))

Output:

    True
    False


This ``eval()`` approach is nice in my opinion, but it is slow, hence the final ``filter.py`` uses two ``if`` conditions instead.
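
A minimal sketch of such an ``eval()``-free check is shown below. The actual ``filter.py`` is not reproduced here, so the function name `passes` and its structure are assumptions. The threshold values are copied from the English output and the German table above.

```python
def passes(value, rules):
    """Check a metric value against a {">": "...", "<": "..."} threshold entry."""
    # Lower bound: the value must be strictly greater than the ">" threshold.
    if ">" in rules and not value > float(rules[">"]):
        return False
    # Upper bound: the value must be strictly smaller than the "<" threshold.
    if "<" in rules and not value < float(rules["<"]):
        return False
    return True


# Thresholds are stored as strings, as in the JSON dictionary above.
thresholds = {
    "number_of_words": {">": "56.0"},
    "perplexity": {"<": "485.69999999999993"},
    "word_repetition": {">": "0.43011", "<": "0.85714"},
}

print(passes(89, thresholds["number_of_words"]))
print(passes(4, thresholds["number_of_words"]))
print(passes(0.6, thresholds["word_repetition"]))
```

Converting the string with `float()` once per comparison avoids building and evaluating an expression string for every document.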

## Short line ratio revisited

CulturaX uses a 100-character limit. As RedPajama V2 has no characters-per-line signal, I calculated the mean word length in characters and used it to approximate how many words fit in 100 characters.

For this I ran ``count_mean.sh`` which can be found in the scripts directory.

The results are:

    Apptainer> ./count_mean.sh de
    Mean: 6.4507
    Apptainer> ./count_mean.sh it
    Mean: 5.54443
    Apptainer> ./count_mean.sh fr
    Mean: 5.44505
    Apptainer> ./count_mean.sh es
    Mean: 5.25742
    Apptainer> ./count_mean.sh en
    Mean: 5.16533
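
The conversion from mean word length to a word-count limit could look like the sketch below. Whether the separating space should be counted is an assumption (the notes above do not say), so both variants are available. The helper name `words_in_limit` is made up for illustration.

```python
# Mean word lengths in characters, copied from the count_mean.sh output above.
mean_word_length = {
    "de": 6.4507,
    "it": 5.54443,
    "fr": 5.44505,
    "es": 5.25742,
    "en": 5.16533,
}

CHAR_LIMIT = 100  # CulturaX short-line limit in characters


def words_in_limit(mean_length, char_limit=CHAR_LIMIT, count_space=True):
    """Approximate how many average-length words fit into char_limit characters."""
    # Assumption: each word is followed by one separating space.
    per_word = mean_length + 1 if count_space else mean_length
    return char_limit / per_word


word_limits = {lang: words_in_limit(m) for lang, m in mean_word_length.items()}
for lang, limit in word_limits.items():
    print(lang, round(limit, 1))
```

A line with fewer words than the language's limit would then count as short when computing the ratio from ``rps_lines_num_words``.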