# How to use EasyLLM quality data filters

EasyLLM's `data` package provides quality filters for cleaning text data before pretraining.

In [None]:
!pip install "easyllm[data]" --upgrade

## Perplexity filtering

Perplexity filtering can be used to improve model quality, coherence, and training efficiency by removing confusing text segments and focusing model learning on more standard, comprehensible language.
Perplexity filtering is implemented using `KenLM` models trained on Wikipedia. You only need to provide a language ID (e.g. `de`) and the perplexity bounds `min_threshold` and `max_threshold`. The filter returns `True` if the perplexity of the text falls outside these thresholds and `False` otherwise.


In [2]:
from easyllm.data.filters import PerplexityFilter

ppl = PerplexityFilter("en", min_threshold=10, max_threshold=1000)

# Get the perplexity of a text
print(ppl.model.get_perplexity("I am very perplexed"))
# 341.3 (low perplexity: the sentence is formal and grammatically correct)

print(ppl.model.get_perplexity("im hella trippin"))
# 46793.5 (high perplexity: the sentence is colloquial and contains grammar mistakes)

# Test the filter: 341.3 lies within the thresholds, so the text is not filtered
assert ppl("I am very perplexed") == False


341.3
46793.5
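
The threshold decision can be sketched without KenLM. The snippet below is an illustration only, not EasyLLM's implementation: `is_outside_thresholds` is a hypothetical helper, and it assumes both bounds are inclusive. It is applied to the two perplexity scores printed above.

```python
def is_outside_thresholds(perplexity: float, min_threshold: float, max_threshold: float) -> bool:
    """Return True (filter the text) when the score lies outside [min_threshold, max_threshold]."""
    return perplexity < min_threshold or perplexity > max_threshold


# Scores taken from the output above, thresholds as in the PerplexityFilter call.
print(is_outside_thresholds(341.3, min_threshold=10, max_threshold=1000))    # inside the range: kept
print(is_outside_thresholds(46793.5, min_threshold=10, max_threshold=1000))  # above the maximum: filtered
```

A lower bound is useful as well as an upper one, because very low perplexity can indicate highly repetitive or templated text.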


## NonAlphaNumericFilter

The `NonAlphaNumericFilter` removes documents based on their share of non-alphanumeric characters. Following [Gopher (Rae et al., 2021)](https://arxiv.org/pdf/2112.11446.pdf), a document is removed if more than 20% of its characters are non-alphanumeric.

In [1]:
from easyllm.data.filters import NonAlphaNumericFilter

nam = NonAlphaNumericFilter()

# not filtered
assert nam("This is a test") == False

# filtered
assert nam("This is a test!!!!!!!") == True
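
To make the 20% rule concrete, here is a minimal standalone sketch of such a ratio check. It is not the library's code: `non_alnum_ratio` and `too_many_symbols` are hypothetical names, and the sketch assumes that whitespace does not count as non-alphanumeric. That assumption is consistent with the example above, since the three spaces in "This is a test" (3 of 14 characters) would otherwise already exceed 20%.

```python
import re

# Assumption: anything that is not an ASCII letter, digit, or whitespace counts as a symbol.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9\s]")


def non_alnum_ratio(text: str) -> float:
    """Fraction of characters that are neither ASCII letters, digits, nor whitespace."""
    if not text:
        return 0.0
    return len(NON_ALNUM.findall(text)) / len(text)


def too_many_symbols(text: str, cutoff: float = 0.2) -> bool:
    """Return True (filter the document) when the ratio exceeds the cutoff."""
    return non_alnum_ratio(text) > cutoff


print(non_alnum_ratio("This is a test"))          # 0 of 14 characters
print(non_alnum_ratio("This is a test!!!!!!!"))   # 7 of 21 characters
print(too_many_symbols("This is a test!!!!!!!"))
```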


## SymbolToWordFilter

The `SymbolToWordFilter` removes any document with a symbol-to-word ratio greater than 0.1 for either the hash symbol or the ellipsis. It is based on [Gopher (Rae et al., 2021)](https://arxiv.org/pdf/2112.11446.pdf).

In [1]:
from easyllm.data.filters import SymbolToWordFilter

stw = SymbolToWordFilter()

# not filtered
assert stw("This is a test") == False

# filtered
assert stw("spam#spam#spam#spam#spam#spam#spam#spam") == True
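
The symbol-to-word ratio can be approximated in a few lines. The sketch below is one possible reading of the rule, not EasyLLM's implementation: the function names are hypothetical, words are assumed to be whitespace-separated tokens, and an ellipsis is assumed to be either three dots or the single `…` character.

```python
def symbol_ratios(text: str) -> tuple[float, float]:
    """Return (hash-to-word ratio, ellipsis-to-word ratio) for the text."""
    num_words = max(len(text.split()), 1)  # avoid division by zero on empty text
    num_hashes = text.count("#")
    num_ellipses = text.count("...") + text.count("…")
    return num_hashes / num_words, num_ellipses / num_words


def exceeds_symbol_ratio(text: str, cutoff: float = 0.1) -> bool:
    """Return True (filter the document) if either ratio is above the cutoff."""
    return any(ratio > cutoff for ratio in symbol_ratios(text))


print(symbol_ratios("This is a test"))                           # no hashes, no ellipses
print(symbol_ratios("spam#spam#spam#spam#spam#spam#spam#spam"))  # 7 hashes in 1 whitespace-separated word
print(exceeds_symbol_ratio("wait... what... no..."))             # 3 ellipses in 3 words
```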

## DigitToCharacter

The `DigitToCharacter` filter removes any document in which digits make up more than 20% of the characters.

In [1]:
from easyllm.data.filters import DigitToCharacter

ntw = DigitToCharacter()

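# not filtered: digits make up about 19.7% of the characters
assert ntw("Hello 123 world 456 this text 789 contains 1234 numbers more words") == False

# filtered: digits make up about 55.6% of the characters
assert ntw("Hello 34534 34534 ") == True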


num_digits: 13
total_chars: 66
num_digits / total_chars: 0.19696969696969696
num_digits: 10
total_chars: 18
num_digits / total_chars: 0.5555555555555556
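
The ratios in the output above can be reproduced with plain Python. This is a sketch under assumptions rather than the library's code: `digit_ratio` and `too_many_digits` are hypothetical names, only the ASCII digits 0-9 are counted, and the total includes every character, whitespace included.

```python
import string


def digit_ratio(text: str) -> float:
    """Fraction of characters in the text that are ASCII digits."""
    if not text:
        return 0.0
    num_digits = sum(ch in string.digits for ch in text)
    return num_digits / len(text)


def too_many_digits(text: str, cutoff: float = 0.2) -> bool:
    """Return True (filter the document) when the digit ratio exceeds the cutoff."""
    return digit_ratio(text) > cutoff


kept = "Hello 123 world 456 this text 789 contains 1234 numbers more words"
removed = "Hello 34534 34534 "

print(digit_ratio(kept))     # 13 digits out of 66 characters
print(digit_ratio(removed))  # 10 digits out of 18 characters
print(too_many_digits(kept), too_many_digits(removed))
```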
