# Predicting Sentiment from Tweets using BERT

This notebook takes a dataset of Tweets, pre-processes them, and then uses an ensemble BERT model to predict the sentiment of each Tweet. Follow the instructions below and run each cell in order.

## Installing libraries

To run this code natively on your own computer instead of in Docker, install the following libraries first.

* `pip install nltk`
* `pip3 install emoji`
* `pip install pandas`
* `pip install transformers`
* `pip install datasets`
* `pip3 install torch torchvision`
* `pip install -U scikit-learn`
* `pip install regex`
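
After installing, it can help to confirm that each package is importable before running the rest of the notebook. The snippet below is a minimal sketch, not part of the original notebook: the helper name `missing_modules` is hypothetical, and the module list is an assumption derived from the pip commands above (note that `scikit-learn` is imported as `sklearn`).

```python
from importlib.util import find_spec

# Import names for the packages listed above (assumed from the pip commands).
# scikit-learn is imported as `sklearn`, which differs from its pip name.
REQUIRED_MODULES = [
    "nltk", "emoji", "pandas", "transformers", "datasets",
    "torch", "torchvision", "sklearn", "regex",
]


def missing_modules(names):
    """Return the entries of `names` that Python cannot locate for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

This only checks that the modules can be located; it does not verify versions or that they load without errors.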

## Pre-processing

First, we'll pre-process your data to remove any URLs, emojis, and username mentions. Before running this code, replace the placeholder names in the cell below: `"TWEETSETNAME.csv"` with the file name of your dataset, and `"TWEETSCOLUMN"` with the name of the column in your CSV file that holds the Tweets. Note that the cell reads the file from `../`, the directory one level above this notebook, so place your CSV there or adjust the path to match.
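
A wrong column name is an easy mistake to make here, so a quick check of the CSV header can save a failed run. The following is a small sketch using only the standard library; the helper name `csv_has_column` is hypothetical and not part of this notebook's code.

```python
import csv


def csv_has_column(path, column):
    """Return True if the header row of the CSV at `path` contains `column`."""
    with open(path, newline="", encoding="utf-8") as handle:
        # The first row is assumed to be the header; an empty file yields [].
        header = next(csv.reader(handle), [])
    return column in header
```

For example, `csv_has_column("../TWEETSETNAME.csv", "TWEETSCOLUMN")` should return `True` once both placeholders have been replaced with your own names.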

In [3]:
import pandas as pd
from TweetNormalizer import normalizeTweet

tweets = pd.read_csv("../TWEETSETNAME.csv")  # Change TWEETSETNAME to the name of your CSV file
tweets_column = "TWEETSCOLUMN"  # Change TWEETSCOLUMN to the name of the column in your CSV with the text of the Tweets

tweets["label"] = 1  # Placeholder label; overwritten by the predictions later
tweets[tweets_column] = tweets[tweets_column].apply(normalizeTweet)  # Normalize the text of each Tweet

tweets.to_csv("tweets_to_predict.csv", index=False)  # Full dataset for the actual run
tweets[0:5].to_csv("tweets_to_predict_test.csv", index=False)  # First five rows for the test run
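
The `normalizeTweet` function comes from the separate `TweetNormalizer` module, whose source is not shown here. To give a rough idea of what removing URLs, emojis, and username mentions can look like, here is a simplified stand-in. It is an assumption for illustration only, not the actual `TweetNormalizer` logic; the name `simple_clean_tweet` and the regular expressions are hypothetical, and the emoji ranges are not exhaustive.

```python
import re

# Hypothetical patterns for illustration; the real normalizer may differ.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator letters used for flags
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)


def simple_clean_tweet(text):
    """Strip URLs, @mentions, and common emoji, then collapse whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    text = EMOJI_PATTERN.sub(" ", text)
    return " ".join(text.split())
```

One limitation of this sketch: the mention pattern matches any `@` followed by word characters, so it would also cut into e-mail addresses.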

## Test Run

Running the full pipeline on all of your Tweets may take a while, so first run the cell below to check that everything is set up correctly. If the output ends with the first few rows of a dataframe, you should be good!

In [4]:
from datasets import Dataset
test_tweets = pd.read_csv("tweets_to_predict_test.csv")
test_dataset = Dataset.from_pandas(test_tweets)

from sentiment_analysis import get_sentiment_predictions

test_tweets["label"] = get_sentiment_predictions(test_dataset[tweets_column])
test_tweets[[tweets_column, "label"]].head()

***** Running Prediction *****
  Num examples = 5
  Batch size = 4
100%|██████████| 2/2 [00:00<00:00, 17.37it/s]Didn't find file ../model/twitter-roberta_SA\added_tokens.json. We won't load it.
Didn't find file ../model/twitter-roberta_SA\tokenizer.json. We won't load it.
loading file ../model/twitter-roberta_SA\vocab.json
loading file ../model/twitter-roberta_SA\merges.txt
loading file None
loading file ../model/twitter-roberta_SA\special_tokens_map.json
loading file ../model/twitter-roberta_SA\tokenizer_config.json
loading file None
100%|██████████| 2/2 [00:00<00:00,  8.84it/s]
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
loading configuration file ../model/twitter-roberta_SA\config.json
Model config RobertaConfig {
  "_name_or_path": "c

|   | tweet_text | label |
|---|---|---|
| 0 | Weve heard that false information about the CO... | 2 |
| 1 | Completely agree . Be smart and get the vaccin... | 0 |
| 2 | You know whats even crazier than indiscriminat... | 2 |
| 3 | Not my problem . Ive not been seeing anyone si... | 0 |
| 4 | If those wanting a vaccine got the vaccine why... | 2 |
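
The internals of `get_sentiment_predictions` are not shown in this notebook, so how the ensemble combines its models is not documented here. As a rough illustration only, one common way to combine several classifiers is to average their per-class probabilities and pick the most likely class. The sketch below assumes each model returns one list of raw scores (logits) per Tweet; the function names and the input layout are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)  # Subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def ensemble_predict(per_model_logits):
    """Average class probabilities across models and return one class index per Tweet.

    `per_model_logits` holds one entry per model; each entry is a list with
    one list of logits per Tweet (an assumed layout for this sketch).
    """
    n_models = len(per_model_logits)
    predictions = []
    for tweet_logits in zip(*per_model_logits):  # Logits for one Tweet from every model
        probabilities = [softmax(logits) for logits in tweet_logits]
        mean = [sum(column) / n_models for column in zip(*probabilities)]
        predictions.append(max(range(len(mean)), key=mean.__getitem__))
    return predictions


# Two hypothetical models scoring two Tweets over three classes
model_a = [[2.0, 0.1, 0.3], [0.2, 0.1, 1.5]]
model_b = [[1.0, 0.2, 0.9], [0.1, 0.3, 2.0]]
print(ensemble_predict([model_a, model_b]))  # prints [0, 2]
```

Averaging probabilities rather than taking a majority vote lets a confident model outweigh an uncertain one, which is one reason the approach is popular for small ensembles.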


## Actual Run

If that displayed correctly, run the following cell. Your output will be saved as `tweets_predicted.csv`. This step might take a while.

In [5]:
tweets = pd.read_csv("tweets_to_predict.csv")
dataset = Dataset.from_pandas(tweets)

sentiment_integers = get_sentiment_predictions(dataset[tweets_column])
sentiment_dict = {0: "positive", 1: "negative", 2: "neutral"}
tweets["label"] = [sentiment_dict[i] for i in sentiment_integers]  # Translate integers into actual sentiment labels
tweets.to_csv("tweets_predicted.csv", index=False)

e ../model/vaccineBert_SA\bpe.codes
loading file ../model/vaccineBert_SA\added_tokens.json
loading file ../model/vaccineBert_SA\special_tokens_map.json
loading file ../model/vaccineBert_SA\tokenizer_config.json
loading file None
Adding <mask> to the vocabulary
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
loading configuration file ../model/vaccineBert_SA\config.json
Model config RobertaConfig {
  "_name_or_path": "vaccineBert",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    