A huge LLM (GPT-3.5-Turbo, 175B parameters) teaches a smaller model (distilbert-base-uncased, 66M parameters) to check grammar and spelling on X messages while ignoring slang, shortcuts and hashtags.

konnir/social_media_grammar_spelling

Teacher Student - Social Media Spelling and Grammar checker

Check grammar and spelling on social-media-style messages, with consideration for the use of slang, shortcuts and hashtags.

Welcome to the Tweet Checker! It was trained on social-media-style messages and supports informal English and all Twitter special symbols.

First, what is it and what can it do (demo taste)?

Live Demo - http://social.nir-kon.com

(Please allow about one minute for the machine to wake up.)

image

Note: yes, these are 2010 tweets - there was no "Wassup" back then.

How it was made?

Data:

  • Cheng-Caverlee-Lee geo-location tweets (2010), 5M tweets (50K used for the POC)

Models:

  • GPT-3.5-Turbo (OpenAI, 175B parameters)
  • DistilBERT (distilbert-base-uncased, 66M parameters, FP32, English, 2019)

Train:

  • Fine-tuned DistilBERT on 50K tweets for a binary classification task: Valid / Invalid.
  • PyTorch with Hugging Face.

How?

  • Used prompt trial and error to find the best prompt.
  • Labeled 50K tweets with GPT-3.5-Turbo as the “source of truth”.

Results:

  • Overall accuracy is 86% (micro F1); "Valid" reaches 92% (F1) and "Invalid" 59% (F1; minority class, 18% of the data).
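For reference, micro-averaged F1 over a single-label classification task reduces to plain accuracy, while per-class F1 exposes the weak minority class. A minimal sketch with toy labels (not the project's real predictions):

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for a single class from true/predicted label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def micro_f1(y_true, y_pred):
    """Micro F1; for single-label classification this equals accuracy."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy illustration only:
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1]
print(round(micro_f1(y_true, y_pred), 2))        # → 0.75
print(round(per_class_f1(y_true, y_pred, 0), 2)) # → 0.67
```

This is why the per-class numbers (92% vs. 59%) tell more than the single 86% figure.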

Idea: a teacher-student design (an LLM majority decides the "ground truth").

image

POC minimal setup:

image

image

The code is all in the repo; follow this order for simplicity:

  • Server:
    • tweets_server.py - FastAPI server code.
    • tweet_predict/tweets_predictor.py - class that classifies whether tweets are correct in grammar and spelling only.
  • Prediction model:
    • /model - holds the model and the tokenizer files (Git LFS).
  • Data Set Exploration:
    • ds_exploration/5_M_ds_set_up.ipynb - notebook for the full data exploration and setup, creating labels and training sets.
    • ds_exploration/5_M_ds_set_up.ipynb - small addition to the above (found later).
  • Labels Creations by OpenAI API:
    • classification_by_open_ai/llm_classifier.py - classifies the tweets as yes/no according to grammar using the LLM, keeping track of work done.
    • classification_by_open_ai/llm_clean_up_and_order.ipynb - notebook to explore and handle all the cleaning and ordering of the OpenAI result (sometimes messy).
    • prompts/x_classify_system_prompt.txt - system prompt for X classification (long).
  • Fine Tune DistilBERT
    • classification_distil_bert/distilbert_x_model.ipynb - notebook exploring DistilBERT fine-tuning on the data, including class weights, multiple models and epochs; also saves the TEST set for the UI demo later.
  • UI and Demo:
    • templates/index.html - simple JS-based demo to test and get a direct impression, for demonstration and R&D purposes.
    • static/images - images folder for the project.
    • demo_texts/demo_messages.py - class that serves a random demo message from a list and returns the prediction and errors.
    • static/texts/balanced_test_df.csv - a TEST DS with balanced valid and invalid messages and their OpenAI errors, used by the "Demo" button.
  • Data:
    • data/raw_train_tweets_classified_open_ai.csv - about 3M rows of the full raw data set (Git LFS, 688 MB).
    • data/clean_train_tweets.csv - DS after sending to OpenAI, enriched with "label" and "error" columns (Git LFS, 8 MB).
    • data/clean_train_tweets_classified_open_ai.csv - DS for training after cleaning up OpenAI issues (Git LFS, 4 MB).
  • Other:
    • resources_and_plan/X_resources.ods - list of resources for tweets.

Content (short summary):

  • Collection and pre-processing of X messages:
    • Tried the X API for scraping tweets and got the cold shoulder from the new management: only 1,500 tweets can be scraped per day, under a frightening license.
    • Collected many tweet datasets from all over the internet (it turns out the X license changed and many were deleted).
    • The chosen dataset is huge: 5M messages.
    • Basic cleaning - empty, duplicated, URLs, special characters:
      • Fairly clean DS - the main work was removing single @ annotations and decoding HTML.
    • Handling X specials -
      • Twitter has many signs and types, like hashtags and DM messages; decided to keep them as is.
      • Emoji: many in my DS; also decided to keep them as is.
    • Handling slang and shortcuts:
      • After researching a little, we decided to keep all slang as is.
      • Same for shortcuts and the rest; it seems the LLM classifier can deal with them later.
    • General:
      • The DS is mostly 140 characters (see the box plots below); I decided to drop everything above 280 characters or below 3 from the DS and POC.
  • DS creation (with OpenAI GPT-3.5-Turbo):
    • Looking to create a ground truth, I chose OpenAI due to the API availability.
    • The "playground" OpenAI offers helped me form the right prompt for my tweets and evaluate the quality.
    • Pros:
      • Industry standard.
      • Availability.
      • Low price.
      • Fast.
    • Cons:
      • Limit of 10K messages a day (I'm on the very low tiers; companies get much more).
      • Not as good as GPT-4-Turbo.
      • The API is not perfect; setting temperature=0 and top_p=0 did not work.
  • Classifier creation:
    • DistilBERT was chosen for its ease of training and proven record on simple text classification tasks.
    • Pros:
      • 66M parameters, but it gets the work done for simple tasks.
      • Light, ~260 MB with 66M parameters (FP32).
      • Fast on GPU, and on CPU if needed, for inference.
      • Short training time compared to BERT / RoBERTa / GPT-2, which were candidates for this.
    • Cons:
      • I took the Hugging Face version with PyTorch; it was probably not the best choice: once I changed the class weights, nothing worked out of the box from there.
  • Web service:
    • Created a REST server that receives messages and returns classifications through the predictor.
    • Extends to multiple messages for later GPU utilization.
  • Demo UI:
    • For visualization and a deeper understanding of the data and the problem domain.
    • Created a simple UI to type a message and show whether it is OK or needs a FIX.
    • Expanded the UI to show the OpenAI answer for the message (from the test set).

Data Exploration (see /ds_exploration/5_M_ds_set_up.ipynb):

  • 3,609,675 tweets.

  • First step: look at the first 5 lines to see what we're working with:
    image

  • Let's load it into a DataFrame:
    image

  • 80,807 nulls were dropped.

  • 119,065 duplicates were dropped (left one).

  • 939 messages shorter than 4 characters were dropped (our task needs context).
    image

  • The longest message is 31,135 characters.

  • Looking at long messages:
    image

  • Decided to keep Twitter's limit (280) and dropped all messages longer than that.

  • Looking at message lengths:
    image

  • The most frequent length is 140 (box plot later).

  • Looking at Twitter specials: RT (22K), DM (13K), "t.com" - no need to remove them.

  • Decided to convert all text from HTML to plain text (entities showed up in the icons).

  • Looking into icons:
    image

  • Looking at icons and symbols; also no need to remove:
    image

  • Saved to clean_train_tweets.csv for later work.
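The filtering rules above can be sketched with the standard library alone (illustrative only; the notebook's actual cleaning is done in pandas):

```python
import html

def clean_tweets(tweets):
    """Apply the cleaning rules described above: drop empties and
    duplicates, decode HTML entities, and keep only messages of
    4..280 characters."""
    seen = set()
    cleaned = []
    for text in tweets:
        if not text:
            continue                      # drop nulls/empties
        text = html.unescape(text)        # e.g. "&amp;" -> "&"
        if not (4 <= len(text) <= 280):   # Twitter limit, context minimum
            continue
        if text in seen:                  # keep the first duplicate only
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

print(clean_tweets(["gm!", "Wassup &amp; hello", "Wassup &amp; hello", ""]))
# → ['Wassup & hello']
```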

Create the data set with an LLM, OpenAI GPT-3.5-Turbo (see classification_by_open_ai/llm_classifier.py):

  • First step: adapt the prompt in the OpenAI playground and send a lot of messages:

  • https://platform.openai.com/playground/p/default-grammar?mode=chat&model=gpt-3.5-turbo
    image

  • Selected system prompt (consulted with GPT-4, four times):
    You are a sophisticated tool developed for scrutinizing Twitter messages. Your primary responsibility is to identify and correct spelling and grammar mistakes within these messages. Although Twitter is known for its informal language and slang, your objective includes distinguishing between acceptable informal expressions and actual spelling or grammatical inaccuracies. This means contractions should be used correctly (e.g., "I'm" instead of "im"), and verbs should be in their proper form (e.g., "making" instead of "makin"), even in the midst of slang or informal contexts. Your analysis should bypass the slang itself unless it directly leads to a spelling or grammatical mistake. Upon reviewing a message, respond with "yes" if it adheres to standard spelling and grammar rules, considering the nuances of Twitter's informal communication. If any errors are present, reply with "no" and concisely specify each error found, emphasizing solely the spelling and grammatical issues without critiquing the informal or slang usage, unless it constitutes an error in spelling or grammar.

  • Selected model: GPT-3.5-Turbo:

  • Pros:

    • Affordable: classifying 10K messages costs ~$1.5 even with this long prompt...
    • Quite accurate in most cases, and knows informal English and social media tweets.
  • Cons:

    • GPT-4-Turbo is much better but costs 10x more.
    • Limits on daily and hourly tokens.
  • How to classify?

  • It's a long task, so we send 10 messages at a time and save each response (this proved to be right; the connection is not always stable).

  • API: temperature and top_p are not working on the API, but results are generally good and we can deal with errors.

  • Creates "raw_train_tweets_classified_open_ai.csv".
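The batch-and-checkpoint loop can be sketched as below. `classify_batch` is a hypothetical stand-in for the real GPT-3.5-Turbo call (the actual client code lives in classification_by_open_ai/llm_classifier.py); the point is appending each batch's answers to disk so an unstable connection never loses completed work:

```python
import csv
import os

def classify_batch(messages):
    """Hypothetical stub for the GPT-3.5-Turbo call; a real version
    would send the system prompt plus the batch to the chat API."""
    return ["yes" if msg and msg[0].isupper() else "no" for msg in messages]

def label_in_batches(messages, out_path, batch_size=10):
    """Send small batches and append each response to disk immediately,
    resuming from the checkpoint file if the run was interrupted."""
    done = 0
    if os.path.exists(out_path):                    # resume from checkpoint
        with open(out_path, newline="") as f:
            done = sum(1 for _ in csv.reader(f))
    with open(out_path, "a", newline="") as f:
        writer = csv.writer(f)
        for i in range(done, len(messages), batch_size):
            batch = messages[i:i + batch_size]
            for msg, answer in zip(batch, classify_batch(batch)):
                writer.writerow([msg, answer])
            f.flush()                               # persist before next batch
```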

Pre-Process for train (look at /classification_by_open_ai/llm_clean_up_and_order.ipynb):

  • First look:
    image

  • ones = 24K

  • zeros = 6.5K

  • Fixes to the OpenAI result:

    • "yes" -> 1:
      image

    • "no errors found" -> 1:
      image

    • "no errors" -> 1
      image

    • also "no, there are no errors" -> 1

  • Now it's time to look at the DS (~30K):
    image image

  • Saving to "clean_train_tweets_classified_open_ai.csv" for further work.
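The response fixes above can be collected into one normalization step (a sketch; the phrasings are the ones listed above):

```python
def normalize_label(answer: str) -> int:
    """Map a raw GPT-3.5-Turbo answer to a binary label. Several
    'no ...' phrasings actually mean 'no errors found', i.e. a
    valid message, so they map to 1 like a plain 'yes'."""
    text = answer.strip().lower()
    valid_phrases = ("yes", "no errors found", "no errors",
                     "no, there are no errors")
    if text.startswith(valid_phrases):
        return 1          # valid message
    return 0              # invalid: the answer lists the errors

print(normalize_label("Yes"))                        # → 1
print(normalize_label("No errors found."))           # → 1
print(normalize_label('no, "teh" should be "the"'))  # → 0
```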

Train (look at /classification_distil_bert/distilbert_x_model.ipynb):

  • The model is DistilBERT - small, 66M parameters, recommended everywhere for this task, and it allows multiple training runs on a low budget.

  • In the future, bigger models should be tested.

  • The framework is PyTorch with Hugging Face, for simplicity.

  • A first look at text length shows a small bias between ones and zeros in terms of length:
    image

  • Splitting into train and test.

  • For now, a validation set is not split off, due to issues with the HF infrastructure.

  • Another look at the train and test sets:
    image image

  • Tokenizing and preparing data loader.

  • Note: max_length was set to 350 tokens, since our tweets are at most 280 characters (English).

  • Model = DistilBertForSequenceClassification

  • training_args -> default learning rate, batch size = 32 (about 11GB of GPU memory).

  • Trained on a sample of 1K just to see that it works, and predicted on 1 example -> all OK.

  • Tried a bigger training run; the results were not good, so decided on "class weights" (in the latest HF version you need to do it via an override).

  • CustomTrainer - overrides compute_loss specifically to apply class weights.

  • CustomCallback - saves the models during the epochs.

  • Issues: couldn't set up validation, and there is no per-epoch train-loss data; managed to bypass this with wandb, but had to track all losses.

  • Training view in wandb:
    image

  • Results for the class-weights model:

  • Epoch 6:
    image image

  • A look at the real results shows our data is still limited:
    image

  • Epoch 2 (best on the minority class):
    image image

  • Trying with an even dataset (removing ones so that zeros = ones):
    image image

  • Time to decide: for now, the balanced result (ones = zeros) is not the best overall, but it has higher recall and will be stricter on FIX (we miss less), so going with it.

  • We need more data, especially zeros.
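Class weights of this kind are usually inverse-frequency: the rarer class gets a proportionally larger weight. A stdlib sketch, using a split close to the ~24K ones / ~6.5K zeros counted above (in the notebook, such weights feed a weighted cross-entropy inside the compute_loss override):

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights: weight(c) = N / (K * n_c),
    where N is the total count, K the number of classes, and n_c
    the count of class c. Rare classes get larger weights."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}

# Illustrative split close to the dataset above:
labels = [1] * 24000 + [0] * 6500
weights = class_weights(labels)
print({k: round(v, 2) for k, v in sorted(weights.items())})
# → {0: 2.35, 1: 0.64}
```

The minority class (zeros) ends up weighted roughly 3.7x more than the majority class, which is what pushes the model to stop ignoring "Invalid" messages.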

Building the predictor (see tweets_server.py and tweet_predict/tweets_predictor.py):

  • Prediction server: uses the chosen model to predict results.
  • Future: predict on many requests at once.
  • Uses GPU; a future improvement is ONNX for up to 5x faster inference (requires knowing the hardware and the usage profile).
  • REST server based on FastAPI:
    • POST
    • /correct_tweet
    • Data sent in a form.
    • Response: JSON with 0 for wrong and 1 for correct (strings).
    • See the image at the top for usage in Postman.

Demo UI:

  • A simple demo UI in JS.
  • Takes messages from the test set (pre-loaded CSV) and shows the OpenAI response for each text.
  • Allows checking any tweet for a direct impression.
  • See image at the page top.

License:

  • All original code by the author is free to use in any way, without any warranty.
  • Data (like the Cheng-Caverlee-Lee 2010 5M-tweet set) and/or other third-party material used here belong to their creators; it is the user's responsibility to comply with their licenses.
  • X datasets belong to X (Twitter); it is the user's responsibility to comply with its licenses.
  • The purpose of this code and demo is research only.
