# How to use EasyLLM Quality data filters

EasyLLM's `data` package adds quality filters for preprocessing text data to improve pretraining.

In [None]:
!pip install "easyllm[data]" --upgrade

## Perplexity filtering

Perplexity filtering can be used to improve model quality, coherence, and training efficiency by removing confusing text segments and focusing model learning on more standard, comprehensible language.
Perplexity filtering is implemented using `KenLM` models trained on Wikipedia. You only need to provide a language id, e.g. `de`, and the perplexity bounds `min_threshold` and `max_threshold`. The filter returns `True` if the perplexity of the text falls outside these thresholds and `False` otherwise.


In [2]:
from easyllm.data.filters import PerplexityFilter

ppl = PerplexityFilter("en", min_threshold=10, max_threshold=1000)

# Get perplexity
print(ppl.model.get_perplexity("I am very perplexed"))
# 341.3 (low perplexity, since the sentence is formal and has no grammar mistakes)

print(ppl.model.get_perplexity("im hella trippin"))
# 46793.5 (high perplexity, since the sentence is colloquial and contains grammar mistakes)

# Test the filter: True means the text would be filtered out
assert ppl("I am very perplexed") == False
assert ppl("im hella trippin") == True


341.3
46793.5
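
The threshold decision described above can be sketched in plain Python. This is only an illustration of the rule "return `True` when the perplexity is outside the thresholds", not EasyLLM's implementation. The helper name `is_filtered_by_perplexity` is hypothetical, and treating both bounds as inclusive is an assumption.

```python
def is_filtered_by_perplexity(perplexity: float, min_threshold: float, max_threshold: float) -> bool:
    """Return True when the perplexity lies outside [min_threshold, max_threshold]."""
    # Assumption: a value exactly equal to a threshold is kept (bounds are inclusive).
    return perplexity < min_threshold or perplexity > max_threshold


# Perplexity values taken from the cell output above.
print(is_filtered_by_perplexity(341.3, min_threshold=10, max_threshold=1000))    # False
print(is_filtered_by_perplexity(46793.5, min_threshold=10, max_threshold=1000))  # True
```

With the thresholds used above, the formal sentence is kept and the colloquial one would be removed.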


## NonAlphaNumericFilter

The `NonAlphaNumericFilter` removes documents based on their share of non-alphanumeric characters. Following [Gopher (Rae et al., 2021)](https://arxiv.org/pdf/2112.11446.pdf), a document is removed if more than 20% of its characters are non-alphanumeric.

In [1]:
from easyllm.data.filters import NonAlphaNumericFilter

nam = NonAlphaNumericFilter()

# not filtered
assert nam("This is a test") == False

# filtered
assert nam("This is a test!!!!!!!") == True
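
To make the 20% rule concrete, here is a minimal plain-Python sketch of such a ratio check. It is not the library's code. The function names are hypothetical, and the sketch assumes that whitespace does not count as a non-alphanumeric character.

```python
def non_alphanumeric_ratio(text: str) -> float:
    """Share of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    # Assumption: whitespace is not counted as a non-alphanumeric character.
    flagged = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return flagged / len(text)


def has_too_many_symbols(text: str, cutoff: float = 0.2) -> bool:
    """Return True when the non-alphanumeric share exceeds the cutoff."""
    return non_alphanumeric_ratio(text) > cutoff


print(has_too_many_symbols("This is a test"))         # False (0 of 14 characters)
print(has_too_many_symbols("This is a test!!!!!!!"))  # True (7 of 21 characters)
```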


## SymbolToWordFilter

The `SymbolToWordFilter` removes any document with a symbol-to-word ratio greater than 0.1 for either the hash symbol or the ellipsis. Based on [Gopher (Rae et al., 2021)](https://arxiv.org/pdf/2112.11446.pdf).

In [1]:
from easyllm.data.filters import SymbolToWordFilter

stw = SymbolToWordFilter()

assert stw("This is a test") == False

assert stw("spam#spam#spam#spam#spam#spam#spam#spam") == True
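
A symbol-to-word ratio can be approximated as follows. This is a rough sketch under stated assumptions rather than EasyLLM's implementation: words are taken to be whitespace-separated tokens, the ellipsis is matched both as `...` and as the single character `…`, and the helper names are hypothetical.

```python
def symbol_to_word_ratio(text: str, symbols=("#", "...", "…")) -> float:
    """Number of symbol occurrences divided by the number of whitespace-separated words."""
    words = text.split()
    if not words:
        return 0.0
    symbol_count = sum(text.count(symbol) for symbol in symbols)
    return symbol_count / len(words)


def exceeds_symbol_ratio(text: str, cutoff: float = 0.1) -> bool:
    """Return True when the symbol-to-word ratio is above the cutoff."""
    return symbol_to_word_ratio(text) > cutoff


print(exceeds_symbol_ratio("This is a test"))                           # False
print(exceeds_symbol_ratio("spam#spam#spam#spam#spam#spam#spam#spam"))  # True
```

In the second example the whole string is a single whitespace-separated token containing seven hash symbols, so the ratio is far above 0.1.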

## DigitToCharacter

The `DigitToCharacter` filter removes any document in which more than 20% of the characters are digits.

In [1]:
from easyllm.data.filters import DigitToCharacter

ntw = DigitToCharacter()

assert ntw("Hello 123 world 456 this text 789 contains 1234 numbers more words") == False

assert ntw("Hello 34534 34534 ") == True
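
The digit check can be sketched in a few lines of plain Python. The helper names are hypothetical and the sketch assumes the ratio is taken over all characters, including whitespace. It is meant as an illustration, not as the library's code.

```python
def digit_ratio(text: str) -> float:
    """Share of characters in the text that are digits."""
    if not text:
        return 0.0
    # Assumption: the denominator counts every character, including whitespace.
    return sum(1 for ch in text if ch.isdigit()) / len(text)


def has_too_many_digits(text: str, cutoff: float = 0.2) -> bool:
    """Return True when more than `cutoff` of the characters are digits."""
    return digit_ratio(text) > cutoff


print(has_too_many_digits("Hello world 42"))      # False (2 of 14 characters)
print(has_too_many_digits("Hello 34534 34534 "))  # True (10 of 18 characters)
```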


## UrlRatioFilter

The `UrlRatioFilter` removes any document in which URLs make up more than 20% of the content.

In [2]:
from easyllm.data.filters import UrlRatioFilter 

ur = UrlRatioFilter()

assert ur("https://www.google.com") == True

assert ur("Example text with some urls http://www.example.com and more text https://www.example2.com and more text") == False
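
One way to compute a URL ratio is to divide the number of URLs by the number of whitespace-separated words. That reading is an assumption, chosen because it is consistent with the two examples above: in the second example the URLs account for well over 20% of the characters, yet the document is kept. The regular expression and function names below are illustrative and are not taken from EasyLLM.

```python
import re

# Simplified URL pattern (assumption): "http://" or "https://" followed by non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def url_to_word_ratio(text: str) -> float:
    """Number of URLs divided by the number of whitespace-separated words."""
    words = text.split()
    if not words:
        return 0.0
    return len(URL_PATTERN.findall(text)) / len(words)


def has_too_many_urls(text: str, cutoff: float = 0.2) -> bool:
    """Return True when the URL-to-word ratio is above the cutoff."""
    return url_to_word_ratio(text) > cutoff


print(has_too_many_urls("https://www.google.com"))  # True (1 URL, 1 word)
print(has_too_many_urls(
    "Example text with some urls http://www.example.com and more text "
    "https://www.example2.com and more text"
))  # False (2 URLs, 13 words)
```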

## BulletpointRatioFilter 

The `BulletpointRatioFilter` removes documents in which more than 90% of the lines are bullet points. Based on [Gopher (Rae et al., 2021)](https://arxiv.org/pdf/2112.11446.pdf).

In [1]:
from easyllm.data.filters import BulletpointRatioFilter

br = BulletpointRatioFilter()

assert br("This is a text with \n- some bullets but\nnot all") == False

assert br("- some bullets and\n- some more") == True
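
The bullet-point rule can be sketched by counting the lines that start with a bullet marker. The set of markers (`-`, `*`, `•`), the decision to skip empty lines, and the helper names are all assumptions made for this illustration.

```python
# Assumption: these prefixes mark a bullet point.
BULLET_PREFIXES = ("-", "*", "•")


def bullet_line_ratio(text: str) -> float:
    """Share of non-empty lines that start with a bullet marker."""
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    if not lines:
        return 0.0
    bullet_lines = sum(1 for line in lines if line.startswith(BULLET_PREFIXES))
    return bullet_lines / len(lines)


def is_mostly_bullets(text: str, cutoff: float = 0.9) -> bool:
    """Return True when more than `cutoff` of the lines are bullet points."""
    return bullet_line_ratio(text) > cutoff


print(is_mostly_bullets("This is a text with \n- some bullets but\nnot all"))  # False (1 of 3 lines)
print(is_mostly_bullets("- some bullets and\n- some more"))                    # True (2 of 2 lines)
```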

## WhitespaceRatioFilter

The `WhitespaceRatioFilter` removes documents in which more than 25% of the text is whitespace.


In [2]:
from easyllm.data.filters import WhitespaceRatioFilter

wr = WhitespaceRatioFilter()

assert wr("This is a test") == False

assert wr("Hello world!      This text has    extra whitespace.") == True
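
As a rough illustration of the 25% rule, the whitespace share of a string can be computed like this. The helper names are hypothetical, and the sketch assumes that every character for which `str.isspace()` is true (spaces, tabs, newlines) counts as whitespace.

```python
def whitespace_ratio(text: str) -> float:
    """Share of characters in the text that are whitespace."""
    if not text:
        return 0.0
    return sum(1 for ch in text if ch.isspace()) / len(text)


def has_too_much_whitespace(text: str, cutoff: float = 0.25) -> bool:
    """Return True when more than `cutoff` of the characters are whitespace."""
    return whitespace_ratio(text) > cutoff


print(has_too_much_whitespace("This is a test"))            # False (3 of 14 characters)
print(has_too_much_whitespace("word" + " " * 10 + "word"))  # True (10 of 18 characters)
```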

## ParenthesesRationFilter

The `ParenthesesRationFilter` removes all sentences with a parentheses ratio greater than 10%.

In [1]:
from easyllm.data.filters import ParenthesesRationFilter

pr = ParenthesesRationFilter()

assert pr("This is a normal sentence") == False

assert pr("This a (with ) ] {(e)") == True
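
A parentheses ratio can be approximated as the share of bracket characters in the text. Which characters count as "parentheses" is an assumption here: the sketch counts round, square, and curly brackets, which matches the mixed brackets in the example above. The helper names are hypothetical.

```python
# Assumption: round, square, and curly brackets all count as parentheses.
BRACKET_CHARS = set("()[]{}")


def parentheses_ratio(text: str) -> float:
    """Share of characters in the text that are bracket characters."""
    if not text:
        return 0.0
    return sum(1 for ch in text if ch in BRACKET_CHARS) / len(text)


def has_too_many_parentheses(text: str, cutoff: float = 0.1) -> bool:
    """Return True when the bracket share is above the cutoff."""
    return parentheses_ratio(text) > cutoff


print(has_too_many_parentheses("This is a normal sentence"))  # False
print(has_too_many_parentheses("This a (with ) ] {(e)"))      # True
```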

## LongWordFilter

The `LongWordFilter` removes documents that contain words longer than 1000 characters, e.g. minified JavaScript files.

In [2]:
from easyllm.data.filters import LongWordFilter

lw = LongWordFilter()

assert lw("This is a test") == False

assert lw(f"This is a test with a {'longword'*500}") == True
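
The long-word check boils down to finding the longest whitespace-separated token. The sketch below illustrates that idea in plain Python. It is not the library's code, the helper names are hypothetical, and splitting on whitespace is an assumption about how words are delimited.

```python
def longest_word_length(text: str) -> int:
    """Length of the longest whitespace-separated token, or 0 for empty text."""
    return max((len(word) for word in text.split()), default=0)


def contains_long_word(text: str, max_length: int = 1000) -> bool:
    """Return True when any word is longer than `max_length` characters."""
    return longest_word_length(text) > max_length


print(contains_long_word("This is a test"))                             # False
print(contains_long_word(f"This is a test with a {'longword' * 500}"))  # True (4000-character token)
```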