# Post Normalization

Trying to find and reuse an existing model to keep the development process simple.

Spelling is not a huge issue at the moment. 
From manual testing, politically aligned content is usually ok.

In [1]:
from transformers import pipeline

[Spelling Correction English Model](https://huggingface.co/oliverguhr/spelling-correction-english-base) 

Clever way to generate training data: they took correctly spelled text and made misspelled versions of it.
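To make the idea concrete, here is a minimal sketch of turning clean text into (noisy, clean) training pairs. It is only an illustration and not necessarily how the model's authors built their data: the `corrupt` and `make_pairs` helpers, the two noise operations (swap and drop), and the `rate` parameter are all assumptions.

```python
import random


def corrupt(text, rate=0.1, rng=None):
    """Return a misspelled copy of `text` (hypothetical helper).

    Each letter is, with probability `rate`, either swapped with the
    following letter or dropped.
    """
    rng = rng or random.Random()
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            can_swap = i + 1 < len(chars) and chars[i + 1].isalpha()
            if can_swap and rng.random() < 0.5:
                # Swap this letter with the next one.
                out.extend([chars[i + 1], c])
                i += 2
            else:
                # Drop this letter.
                i += 1
            continue
        out.append(c)
        i += 1
    return "".join(out)


def make_pairs(sentences, rate=0.1, seed=0):
    """Build (noisy, clean) pairs from correctly spelled sentences."""
    rng = random.Random(seed)
    return [(corrupt(s, rate, rng), s) for s in sentences]


pairs = make_pairs(["Let's do a comparison."], rate=0.2)
for noisy, clean in pairs:
    print(f"{noisy!r} -> {clean!r}")
```

A seq2seq model trained on pairs like these learns to map the noisy side back to the clean side.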

In [None]:
# 558MB Model
# 66k+ downloads, actively used
spelling_corrector = pipeline(
    "text2text-generation", 
    model="oliverguhr/spelling-correction-english-base"
)

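result = spelling_corrector("lets do a comparsion", max_length=2048)
print(result)  # [{'generated_text': "Let's do a comparison."}]
# TODO: The pipeline warns that both max_new_tokens (=256, not set here, so apparently a default)
# and max_length (=2048) are set. max_new_tokens takes precedence, so max_length has no effect.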

Device set to use cuda:0
Both `max_new_tokens` (=256) and `max_length`(=2048) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': "Let's do a comparison."}]
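One way to avoid the warning above is to pass only `max_new_tokens` and leave `max_length` out. The sketch below is an assumption about how that could look: `correct_spelling` is a hypothetical wrapper, and it accepts any pipeline-like callable so it can be tried without downloading the model.

```python
def correct_spelling(corrector, text, max_new_tokens=256):
    """Call a text2text pipeline-like callable with a single length limit.

    `corrector` is assumed to behave like the `spelling_corrector` pipeline:
    it takes a string plus generation keyword arguments and returns a list
    of dicts with a "generated_text" key.
    """
    outputs = corrector(text, max_new_tokens=max_new_tokens)
    return outputs[0]["generated_text"]


# Usage with the pipeline defined earlier (needs the model to be loaded):
# correct_spelling(spelling_corrector, "lets do a comparsion")
```

Since only one limit is passed, there is nothing left for the two settings to conflict over.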


In [None]:
# Model is about 892MB
grammar_corrector = pipeline(
    task="text2text-generation",
    model="vennify/t5-base-grammar-correction",
    device='cuda'
)

# NOTE: Model requires "grammar: " prefix
result = grammar_corrector("grammar: This sentences has has bads grammar.")
print(result)  # "This sentence has bad grammar."

Device set to use cuda:0


[{'generated_text': 'This sentence has bad grammar.'}]


In [6]:
def norm_sm_post(text):
    """Normalize a social media post: correct spelling first, then grammar."""
    # correct spelling
    corrected = spelling_corrector(text)[0].get("generated_text")
    assert corrected is not None
    print(f"Spelling: {corrected}")

    # fix grammar
    result = grammar_corrector(f"grammar: {corrected}")[0].get("generated_text")
    assert result is not None
    print(f"Grammar: {result}")

    return result

In [9]:
# Testing

bad_posts = [
    "omg cant beleive trump said that smh 🤮",
]

for bad_post in bad_posts:
    print(f"Fixing: {bad_post}")
    norm_sm_post(bad_post)

Fixing: omg cant beleive trump said that smh 🤮
Spelling: Some can't believe Trump said that she's desperate.
Grammar: Some can't believe Trump said that she's desperate.
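The corrected text above differs from the post in more than spelling (it now ends with "she's desperate."), so it helps to see exactly which words a model changed. Below is a small sketch for auditing that by hand; `word_changes` is a hypothetical helper built on the standard library's `difflib`, not part of either model.

```python
from difflib import SequenceMatcher


def word_changes(before, after):
    """List the word-level edits that turn `before` into `after`.

    Returns (operation, words_before, words_after) tuples for every
    span that is not identical in both texts.
    """
    a, b = before.split(), after.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (op, a[i1:i2], b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


post = "omg cant beleive trump said that smh 🤮"
corrected = "Some can't believe Trump said that she's desperate."
for op, old, new in word_changes(post, corrected):
    print(op, old, "->", new)
```

Replacements that touch words other than obvious misspellings are a sign the model is rewriting the post rather than normalizing it.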


https://www.interviewquery.com/p/social-media-datasets

https://huggingface.co/datasets/Exorde/exorde-social-media-december-2024-week1
This is a very large dataset.

```python
from datasets import load_dataset

ds = load_dataset("Exorde/exorde-social-media-december-2024-week1")
```

Isn't that cool! Looks like 16GB of data though. 
It comes from many sources and looks like it's theme-based.
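Given the size, streaming a few rows may be more practical than downloading everything. This is a sketch under assumptions: `peek` and `stream_exorde` are hypothetical helpers, and the `"train"` split name is a guess that should be checked against the dataset card.

```python
from itertools import islice


def peek(rows, n=3):
    """Return the first `n` items of any iterable without reading the rest."""
    return list(islice(rows, n))


def stream_exorde(split="train"):
    """Open the dataset lazily instead of downloading it in full.

    The split name is an assumption. `datasets` is imported here so that
    `peek` stays usable without the library installed.
    """
    from datasets import load_dataset

    return load_dataset(
        "Exorde/exorde-social-media-december-2024-week1",
        split=split,
        streaming=True,
    )


# Usage (needs network access):
# for row in peek(stream_exorde()):
#     print(row)
```

With `streaming=True`, rows are fetched as they are iterated over, so looking at a handful of posts does not require the full download.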