Work has already begun towards developing a COVID-19 vaccine. From measles to the common flu, vaccines have lowered the risk of illness and death, and have saved countless lives around the world. Unfortunately, in some countries, the 'anti-vaxxer' movement has led to lower rates of vaccination and new outbreaks of old diseases.
Although it may be many months before we see COVID-19 vaccines available on a global scale, it is important to monitor public sentiment towards vaccinations now and especially in the future when COVID-19 vaccines are offered to the public. The anti-vaccination sentiment could pose a serious threat to the global efforts to get COVID-19 under control in the long term.
The objective of this challenge is to develop a machine learning model to assess whether a Twitter post related to vaccinations is positive, neutral, or negative. This model will be deployed with Streamlit in a Docker container.
Tweets have been classified as pro-vaccine (1), neutral (0) or anti-vaccine (-1). The tweets have had usernames and web addresses removed.
Variable definition:
tweet_id: Unique identifier of the tweet
safe_tweet: Text contained in the tweet. Some sensitive information, such as usernames and URLs, has been removed
label: Sentiment of the tweet (-1 for negative, 0 for neutral, 1 for positive)
agreement: The tweets were labeled by three people. Agreement indicates the percentage of the three reviewers that agreed on the given label. You may use this column in your training, but agreement data will not be shared for the test set.
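The column definitions above can be turned into a small loading step. The following is a minimal sketch using only the Python standard library; the sample rows are hypothetical, the `load_rows` helper and its `min_agreement` threshold are illustrative names, and it assumes that `agreement` is stored as a fraction between 0 and 1. Dropping low-agreement rows is one possible use of the column, not a recommendation from the challenge.

```python
import csv
import io

# Hypothetical rows in the Train.csv layout described above (not real data).
SAMPLE_TRAIN_CSV = """tweet_id,safe_tweet,label,agreement
A1,Got my measles shot today,1,1.0
B2,Vaccines are on the news again,0,0.666667
C3,I do not trust vaccines,-1,1.0
D4,<user> <url>,0,0.333333
"""

VALID_LABELS = {-1, 0, 1}


def load_rows(handle, min_agreement=0.0):
    """Parse rows, keep valid labels, and drop rows below min_agreement."""
    rows = []
    for record in csv.DictReader(handle):
        try:
            label = int(float(record["label"]))
            agreement = float(record["agreement"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or malformed fields
        if label in VALID_LABELS and agreement >= min_agreement:
            rows.append({
                "tweet_id": record["tweet_id"],
                "safe_tweet": record["safe_tweet"],
                "label": label,
                "agreement": agreement,
            })
    return rows


all_rows = load_rows(io.StringIO(SAMPLE_TRAIN_CSV))
# Keep only tweets where at least two of the three reviewers agreed.
confident_rows = load_rows(io.StringIO(SAMPLE_TRAIN_CSV), min_agreement=0.6)
print(len(all_rows), len(confident_rows))
```

To run this on the real data, replace the `io.StringIO(...)` argument with an open file handle for Train.csv. Because agreement is not provided for the test set, it can only be used to filter or weight training rows, never as a model input at prediction time.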
Files available for download are:
Train.csv - Labelled tweets on which to train your model
Test.csv - Tweets that you must classify using your trained model
SampleSubmission.csv - An example of what your submission file should look like. The order of the rows does not matter, but the IDs must be correct. Values in the 'label' column should range between -1 and 1.
NLP_Primer_twitter_challenge.ipynb - A starter notebook to help you make your first submission on this challenge.
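As an illustration of the submission rules, here is a minimal sketch that writes predictions to CSV and clips them to the allowed range. The header names `tweet_id` and `label` are assumptions based on the variable definitions above, so check them against SampleSubmission.csv. The helper names and the sample predictions are hypothetical.

```python
import csv
import io


def clip_label(value):
    """Keep a prediction inside the allowed [-1, 1] range."""
    return max(-1.0, min(1.0, float(value)))


def write_submission(predictions, handle):
    """Write (tweet_id, prediction) pairs as a two-column CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["tweet_id", "label"])  # assumed header names
    for tweet_id, value in predictions:
        writer.writerow([tweet_id, clip_label(value)])


# Hypothetical IDs and raw model outputs, for illustration only.
predictions = [("A1", 0.83), ("B2", -1.4), ("C3", 0.0)]
buffer = io.StringIO()
write_submission(predictions, buffer)
print(buffer.getvalue())
```

Passing a file opened with `open("submission.csv", "w", newline="")` instead of the in-memory buffer would write the result to disk.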
Fork this repo and run the notebook on Google Colab. The Hugging Face models are deep learning based, so training them needs substantial GPU compute. Please use Colab, another cloud GPU provider, or a local machine with an NVIDIA GPU.
Note that Google Colab sessions have time limits and may disconnect after a period of inactivity. However, you can save your progress and re-establish the connection to the GPU when needed.
Hugging Face is an open-source platform and provider of machine learning technologies. You can install their package to access pre-built models and either use them directly or fine-tune them (that is, retrain them on your dataset, leveraging the knowledge gained during the original training). You can then host your trained models on the platform so that you can use them later on other devices and apps.
Please go to the website and sign in to access all the features of the platform.
Read more about Text classification with Hugging Face
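When fine-tuning a classification model, class indices normally start at 0, so the -1/0/1 labels need to be remapped before training and mapped back afterwards. The sketch below shows one possible way to do this in plain Python. It is an assumption for illustration, not necessarily what the starter notebook does. The mapping, the function names, and the example logits are all hypothetical. It also shows how class probabilities could be collapsed into a single value in [-1, 1], which may suit an error-based metric better than a hard label.

```python
import math

# Classification heads index classes from 0, so the -1/0/1 labels are remapped.
LABEL_TO_ID = {-1: 0, 0: 1, 1: 2}
ID_TO_LABEL = {class_id: label for label, class_id in LABEL_TO_ID.items()}


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_label(logits):
    """Probability-weighted average of the labels, a value in [-1, 1]."""
    probs = softmax(logits)
    return sum(p * ID_TO_LABEL[i] for i, p in enumerate(probs))


def hard_label(logits):
    """Label of the highest-scoring class."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return ID_TO_LABEL[best]


logits = [0.2, 1.5, 0.3]  # hypothetical model output for one tweet
print(hard_label(logits), round(expected_label(logits), 3))
```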
The evaluation metric for this challenge is the Root Mean Squared Error.
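For reference, RMSE is the square root of the mean of the squared differences between the true labels and the predictions. A minimal sketch in plain Python follows; the `rmse` helper is an illustrative name, the example values are hypothetical, and the official scoring code may differ in its details.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one value is required")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical labels and predictions, for illustration only.
y_true = [1, 0, -1, 0]
y_pred = [1, 0, 1, 0]
print(rmse(y_true, y_pred))  # one prediction is off by 2: sqrt(4 / 4) = 1.0
```

Because errors are squared, predicting 1 for a tweet labelled -1 costs four times as much as predicting 0 for it, so confusing the two extreme classes is penalised most heavily.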
- Quick intro to NLP
- Getting Started With Hugging Face in 15 Minutes
- Fine-tuning a Neural Network explained
- Fine-Tuning-DistilBert - Hugging Face Transformer for Poem Sentiment Prediction | NLP
- Introduction to NLP: Playlist
If you found this article helpful, please give it a clap or a star on GitHub!