# COVID-19 Tweet Predictions

This notebook runs predictions on COVID-19 related tweets.
The goal is to classify each tweet as a rumour or a non-rumour using a fine-tuned RoBERTa sequence classification model.

## Outline

1. Installing Required Packages
2. Environment Setup
3. Data Loading
4. Model Inference
5. Saving Predictions

## How to Run

To execute the notebook, run each cell in order.
If running on Google Colab, mount your Google Drive first: the data, the preprocessing module, and the fine-tuned model are all read from it.


## Installing Required Packages

The following packages are essential for running the analyses in this notebook.


In [26]:
!pip install transformers
!pip install tweet-preprocessor
!pip install nltk
!pip install spacy
!pip install gensim



## Environment Setup (Google Drive Mounting)

If running this notebook on Google Colab, you'll need to mount your Google Drive to access files stored there.


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Data Directory Check

This cell lists the contents of the project directory on Drive to confirm that the data folder, the preprocessing module, and the fine-tuned model directory are present.


In [2]:
!ls "/content/drive/My Drive/JH_NLP"

 claim.png
 preprocess_data.ipynb
 preprocessing.py
 project_data
 __pycache__
 test_bert.ipynb
 test.ipynb
'test_roberta - Copy.ipynb'
 test_roberta.ipynb
 train_bert.ipynb
 train.ipynb
 train_roberta.ipynb
 twitter-rumour-classification_roberta_512_batch16_grad1
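The same check can be scripted so the notebook fails early when something is missing. The following is a minimal sketch, assuming that the three entries read later in this notebook are the only ones required; the helper name `missing_entries` and the `REQUIRED` list are illustrative and not part of the project code.

```python
from pathlib import Path


def missing_entries(base_dir, required):
    """Return the names in `required` that do not exist under `base_dir`."""
    base = Path(base_dir)
    return [name for name in required if not (base / name).exists()]


# Entries this notebook reads later on (assumed to be the full requirement).
REQUIRED = [
    "preprocessing.py",
    "project_data",
    "twitter-rumour-classification_roberta_512_batch16_grad1",
]

missing = missing_entries("/content/drive/MyDrive/JH_NLP", REQUIRED)
if missing:
    print("Missing from the project directory:", missing)
else:
    print("All required entries are present.")
```

Checking by name keeps the sketch independent of whether an entry is a file or a folder.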


In [47]:
import sys
sys.path.append('/content/drive/MyDrive/JH_NLP')
import pandas as pd
from transformers import RobertaConfig, RobertaForSequenceClassification, RobertaTokenizerFast, Trainer, TrainingArguments

import torch.nn as nn
import torch
import numpy as np
import os
import json
from pprint import pprint
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
# Local helper module stored in the Drive folder added to sys.path above
from preprocessing import preprocess_data
from sklearn.model_selection import train_test_split

In [5]:
max_sequence_length = 512
# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', max_length = max_sequence_length)
# Load the fine-tuned classifier from Drive and move it to the selected device
model = RobertaForSequenceClassification.from_pretrained('/content/drive/MyDrive/JH_NLP/twitter-rumour-classification_roberta_512_batch16_grad1/', local_files_only=True).to(device)

In [6]:
def convert_label(label):
    """Map a label string to its numeric class id."""
    if label == "rumour":
        return 1
    elif label == "non-rumour":
        return 0
    else:
        raise ValueError("label classes must be 'rumour' or 'non-rumour'")


def convert_label_to_rumour(label):
    """Map a numeric class id back to its label string."""
    if label == 1:
        return "rumour"
    elif label == 0:
        return "non-rumour"
    else:
        raise ValueError("label classes must be 1 or 0")


def get_labels(label_path, sourceIds):
    """Load the label file and return numeric labels in the order of sourceIds."""
    with open(label_path) as f:
        labels = json.load(f)
    corresponding_labels = [labels[source_id] for source_id in sourceIds]
    numeric_labels = [convert_label(label) for label in corresponding_labels]

    return numeric_labels

class TestDataset(Dataset):
    """Map-style dataset over the tokenizer output; it carries no labels."""
    def __init__(self, tokenized_texts):
        self.tokenized_texts = tokenized_texts
    
    def __len__(self):
        return len(self.tokenized_texts["input_ids"])
    
    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.tokenized_texts.items()}
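From the code above, `get_labels` expects a JSON object that maps each source tweet ID to either "rumour" or "non-rumour". The sketch below shows that shape with a made-up two-entry file. The IDs, the temporary file, and the names `LABEL_TO_ID` and `load_numeric_labels` are illustrative only; the mapping is restated as a dictionary so the block runs on its own.

```python
import json
import tempfile

# Same mapping as convert_label, written as a lookup table.
LABEL_TO_ID = {"non-rumour": 0, "rumour": 1}


def load_numeric_labels(label_path, source_ids):
    """Return numeric labels in the same order as `source_ids`."""
    with open(label_path) as f:
        labels = json.load(f)
    return [LABEL_TO_ID[labels[source_id]] for source_id in source_ids]


# Write a tiny stand-in label file (invented IDs, not project data).
example_labels = {"1001": "rumour", "1002": "non-rumour"}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(example_labels, f)
    label_path = f.name

numeric = load_numeric_labels(label_path, ["1002", "1001"])
print(numeric)  # [0, 1]
```

The output order follows the list of IDs passed in, not the order of the keys in the file, which is what keeps labels aligned with the preprocessed texts.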

## Load test data to get predictions

In [7]:
covid_path = "/content/drive/MyDrive/JH_NLP/project_data/covid.data.jsonl"
test_texts, sourceIds = preprocess_data(data_path=covid_path, max_sequence_length=max_sequence_length)
test_encodings = tokenizer(test_texts, padding = 'max_length', truncation=True, max_length = max_sequence_length)

# Wrap the encodings in a TestDataset so a DataLoader can batch them
test_dataset = TestDataset(test_encodings)

In [10]:
# Collect the raw logits of every batch; TestDataset carries no labels
all_logits = []

dataloader = DataLoader(test_dataset, batch_size=8)

with torch.no_grad():
    for batch in tqdm(dataloader):
        input_ids = batch['input_ids'].to(device)
        # Pass the attention mask so the padding tokens are ignored
        attention_mask = batch['attention_mask'].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)

        # The first element of the model output holds the logits
        all_logits.append(outputs[0].detach())

# One tensor of shape (number of tweets, 2); every row is treated the same way
preds = torch.cat(all_logits, dim=0)
        

100%|██████████| 2183/2183 [18:27<00:00,  1.97it/s]
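The model returns two raw scores (logits) per tweet, one for each class. The following is a minimal, dependency-free sketch of how such scores map to probabilities and labels; the logit values are invented for illustration, and `softmax`, `argmax`, and `ID_TO_LABEL` are names chosen for this example. Because softmax and sigmoid both preserve the ordering of the scores within a row, the label picked from raw logits is the same as the one picked from probabilities.

```python
import math

# Same mapping as convert_label_to_rumour, written as a lookup table.
ID_TO_LABEL = {0: "non-rumour", 1: "rumour"}


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Invented logits for two tweets: [score for non-rumour, score for rumour]
example_logits = [[2.0, -1.0], [-0.5, 1.5]]

for logits in example_logits:
    probs = softmax(logits)
    # The winning class does not change when logits become probabilities.
    assert argmax(logits) == argmax(probs)
    print(ID_TO_LABEL[argmax(logits)], [round(p, 3) for p in probs])
```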


## Save predictions

In [11]:
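# Pick the higher-scoring class for each tweet and convert to plain Python ints
predictions = preds.argmax(dim=1).cpu().tolist()
# Map each source tweet ID to its predicted label string
predictions_dict = {sourceId: convert_label_to_rumour(prediction) for sourceId, prediction in zip(sourceIds, predictions)}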
with open("/content/drive/MyDrive/JH_NLP/covid-output.json", "w") as outputfile:
    json.dump(predictions_dict, outputfile)
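A quick sanity check after saving is to reload the file and count how many tweets fall into each class. The sketch below does this on a small stand-in file, since the real counts depend on the run; the IDs are made up and `summarise_predictions` is a hypothetical helper, not part of the project code.

```python
import json
import tempfile
from collections import Counter


def summarise_predictions(path):
    """Count how many IDs were assigned to each predicted label."""
    with open(path) as f:
        predictions = json.load(f)
    return Counter(predictions.values())


# Stand-in output with the same shape as the saved file: ID -> label string.
stand_in = {"1001": "rumour", "1002": "non-rumour", "1003": "non-rumour"}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(stand_in, f)
    output_path = f.name

counts = summarise_predictions(output_path)
print(dict(counts))  # {'rumour': 1, 'non-rumour': 2}
```

To inspect the real output, pass the path used in the cell above (`/content/drive/MyDrive/JH_NLP/covid-output.json`) to `summarise_predictions`.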