# Sentiment Analysis on The Reddit Climate Change Dataset
by Santiago Segovia

### Analytical Process

1. Compare sentiment in data with Hugging Face's approach
    * Use `distilbert-base-uncased` model to calculate sentiment probability
    * Label data based on probabilities
    * Compare both sentiment metrics
2. Use data from Reddit to tune our own sentiment analysis model
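
As a rough sketch of step 1, the snippet below turns a positive-class probability into a label and measures how often it agrees with the label already present in the data. The 0.5 threshold, the `-1`/`1` label coding, and the function names are assumptions made for illustration, not choices confirmed by this notebook.

```python
def label_from_probability(p_positive, threshold=0.5):
    """Map a positive-class probability to a sentiment label (1 or -1)."""
    return 1 if p_positive >= threshold else -1


def agreement_rate(model_labels, dataset_labels):
    """Share of records where both labeling approaches agree."""
    if len(model_labels) != len(dataset_labels):
        raise ValueError("Both label lists must have the same length")
    matches = sum(m == d for m, d in zip(model_labels, dataset_labels))
    return matches / len(model_labels)


# Toy probabilities standing in for model output (made-up numbers)
probs = [0.91, 0.12, 0.55, 0.40]
model_labels = [label_from_probability(p) for p in probs]
dataset_labels = [1, -1, -1, -1]
print(agreement_rate(model_labels, dataset_labels))
```

A single agreement rate hides which class the disagreements fall in, so a per-class breakdown may be worth adding when comparing the two metrics.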

### 1. Install Dependencies and Initial Setup

In [1]:
!pip install datasets transformers accelerate -q

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ibis-framework 7.1.0 requires pyarrow<15,>=2, but you have pyarrow 15.0.0 which is incompatible.

In [2]:
import pandas as pd
import torch

from google.colab import drive
from transformers import pipeline

In [3]:
# Mount GDrive
drive.mount("/content/drive")

Mounted at /content/drive


In [4]:
# Load data (takes ~1 min to load)
data_path = "/content/drive/Shareddrives/adv-ml-project/Data/"
comments = pd.read_csv(data_path + "comments_filtered.csv")

In [6]:
comments.head(2)

| Unnamed: 0 | subreddit.name | date | body | sentiment | label |
|---|---|---|---|---|---|
| 0 | askreddit | 2022-08-31 23:56:46 | I think climate change tends to get some peopl... | 0.6634 | 1.0 |
| 1 | askreddit | 2022-08-31 23:54:25 | They need to change laws so it's more worth se... | 0.469 | 1.0 |


### 2. Preprocess Data

In order to evaluate the performance of our model, we need to create a train-test split. We randomly pick 20% of the records and identify them as part of the testing dataset:

In [7]:
import random
random.seed(120938)

num_test_obs = round(comments.shape[0] * 0.2)
ids_test_obs = random.sample(range(comments.shape[0]), num_test_obs)

In [8]:
comments['test_split'] = 0
comments.loc[ids_test_obs,'test_split'] = 1

In [9]:
comments.head(2)

| Unnamed: 0 | subreddit.name | date | body | sentiment | label | test_split |
|---|---|---|---|---|---|---|
| 0 | askreddit | 2022-08-31 23:56:46 | I think climate change tends to get some peopl... | 0.6634 | 1.0 | 0 |
| 1 | askreddit | 2022-08-31 23:54:25 | They need to change laws so it's more worth se... | 0.469 | 1.0 | 1 |
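
The split logic can be sanity-checked with a small standard-library sketch. It samples without replacement, so the train and test positions cannot overlap, and a fixed seed makes the split reproducible. The helper `split_indices` and the 1000-row size are hypothetical and only serve the illustration.

```python
import random


def split_indices(n_rows, test_share=0.2, seed=120938):
    """Return (train_ids, test_ids) as sorted lists of row positions."""
    rng = random.Random(seed)  # local generator, leaves the global state alone
    n_test = round(n_rows * test_share)
    test_ids = set(rng.sample(range(n_rows), n_test))
    train_ids = [i for i in range(n_rows) if i not in test_ids]
    return train_ids, sorted(test_ids)


train_ids, test_ids = split_indices(1000)
print(len(train_ids), len(test_ids))
```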


We need to convert our data into a `Dataset` object to easily use Hugging Face's functions. We use the `from_dict()` method from the `datasets` [library](https://huggingface.co/docs/datasets/en/create_dataset).

In [12]:
# Fill NaN values with empty strings, otherwise from_dict will raise an error
comments['body'] = comments['body'].fillna('')

In [13]:
from datasets import Dataset

# Create train and test data
train_data_dict = {"text": comments.loc[comments['test_split'] == 0, 'body'].tolist(),
                   "label": comments.loc[comments['test_split'] == 0, 'label'].tolist()}
test_data_dict = {"text": comments.loc[comments['test_split'] == 1, 'body'].tolist(),
                  "label": comments.loc[comments['test_split'] == 1, 'label'].tolist()}

train_data = Dataset.from_dict(train_data_dict)
test_data = Dataset.from_dict(test_data_dict)

In [16]:
# Create a smaller training dataset for faster training times
small_train_dataset = train_data.shuffle(seed=42).select(range(3000))
small_test_dataset = test_data.shuffle(seed=42).select(range(300))

In [17]:
# Example of structures
print(small_train_dataset[0])
print(small_test_dataset[0])

{'text': "This is fundamentally wrong. Democrats have confirmed more than 40 federal judges to benches across the country since Biden's election, which will shape the American legal landscape for generations. Democrats passed the American Rescue Plan, the Infrastructure Investment and Jobs Act, the Affordable Care Act, hate crimes legislation, gun violence prevention, set up and run the 1/6 investigations, and more. We've re-entered climate change agreements, staunchly supported Ukraine against Russia, rebuilt international alliances, and more.\n\nIt's not perfect and there's still a lot to be done but you know what we need to do more? *More Democrats in Congress.* If you don't like who's getting elected, vote in primaries and work to nominate more progressives. But if your candidate doesn't win the primary, taking your ball and going home only helps Republicans. We're a big tent party so our job is harder, but it's more necessary than ever.", 'label': -1.0}
{'text': "If you really are

To [preprocess](https://huggingface.co/docs/transformers/preprocessing#everything-you-always-wanted-to-know-about-padding-and-truncation) our data, we will use the [DistilBERT tokenizer](https://huggingface.co/docs/transformers/v4.15.0/en/model_doc/distilbert#transformers.DistilBertTokenizer):


In [18]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Next, we prepare the text inputs for the model for both splits of our dataset (training and test) by using the [map method](https://huggingface.co/docs/datasets/en/process#map) with [batch processing](https://huggingface.co/docs/datasets/en/process#batch-processing):

In [20]:
# Prepare the text inputs for the model
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = small_test_dataset.map(preprocess_function, batched=True)

Sentences aren’t always the same length, which is an issue because model inputs need to have a uniform shape. Padding makes tensors rectangular by adding a special padding token to shorter sequences. We use a [Data Collator](https://huggingface.co/docs/transformers/en/main_classes/data_collator) to convert our training samples to PyTorch tensors and pad each batch to the length of its longest sequence (see [padding](https://huggingface.co/docs/transformers/preprocessing#pad)):

In [21]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [22]:
data_collator

DataCollatorWithPadding(tokenizer=DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}, padding=True, max_length=None, pad_to_multiple_of=None, return_tensors='pt')
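
To make the padding step concrete, here is a simplified pure-Python illustration of what padding a batch on the right looks like. It is a sketch of the idea, not the library's implementation: `pad_batch` is a hypothetical helper, and the token ids between `[CLS]` and `[SEP]` are arbitrary placeholders. The pad id `0` and the right-side padding follow the tokenizer settings printed above.

```python
PAD_TOKEN_ID = 0  # id of [PAD] in the tokenizer output shown above


def pad_batch(batch_input_ids, pad_token_id=PAD_TOKEN_ID):
    """Right-pad every sequence to the longest one in the batch."""
    max_len = max(len(ids) for ids in batch_input_ids)
    padded, masks = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_token_id] * n_pad)
        masks.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding positions
    return {"input_ids": padded, "attention_mask": masks}


# 101 = [CLS], 102 = [SEP]; the ids in between are placeholders
batch = pad_batch([[101, 2023, 2003, 102], [101, 2023, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed global length keeps short batches small, which is the main reason to pad in the collator instead of in the tokenization step.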

### 3. Training the model

For training, we will be using the [Trainer API](https://huggingface.co/docs/transformers/v4.15.0/en/main_classes/trainer#transformers.Trainer), which is optimized for fine-tuning Transformers models such as DistilBERT.

First, we define DistilBERT as our base model:

In [23]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Even though the pre-trained model can compute a loss, during training it appeared to return only the logits. The following code, currently commented out, is meant to ensure that a loss is returned during training:

In [18]:
# from transformers import AutoModelForSequenceClassification

# class CustomModel(AutoModelForSequenceClassification):
#     def forward(self, input_ids=None, attention_mask=None, labels=None):
#         outputs = super().forward(input_ids=input_ids, attention_mask=attention_mask, labels=labels)

#         # Retrieve the loss from the outputs
#         loss = outputs.loss if labels is not None else None

#         return {"loss": loss, "logits": outputs.logits}

# # Create an instance of the custom model
# model = CustomModel.from_pretrained("distilbert-base-uncased", num_labels=2)

Then, we define the metrics we'll use to evaluate how good the fine-tuned model is ([accuracy and f1 score](https://huggingface.co/metrics)):

In [24]:
import numpy as np
from datasets import load_metric  # deprecated in newer `datasets` releases in favor of the `evaluate` library

# Load the metrics once instead of on every evaluation call
accuracy_metric = load_metric("accuracy")
f1_metric = load_metric("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Predicted class = index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}
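
To show what these two metrics measure, the following dependency-free sketch computes accuracy and binary F1 from toy logits. The numbers are made up, labels are assumed to be coded as `0`/`1`, and `accuracy_and_f1` is a hypothetical helper rather than part of any library.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_and_f1(logits, labels, positive=1):
    """Accuracy and binary F1 for the positive class."""
    preds = [argmax(row) for row in logits]
    pairs = list(zip(preds, labels))
    correct = sum(p == y for p, y in pairs)
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


# Toy logits for four comments (made-up numbers)
logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
labels = [1, 0, 0, 1]
print(accuracy_and_f1(logits, labels))
```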

Before training our model, we need to define the training arguments and build a `Trainer` from all the objects we constructed up to this point:

In [25]:
from transformers import TrainingArguments, Trainer

output_name = data_path + "Output"

training_args = TrainingArguments(
   output_dir=output_name,
   learning_rate=2e-5,
   per_device_train_batch_size=16,
   per_device_eval_batch_size=16,
   num_train_epochs=2,
   weight_decay=0.01,
   save_strategy="epoch",
)

trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_train,
   eval_dataset=tokenized_test,
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)
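
As a quick back-of-the-envelope check of how long training will run, the number of optimization steps follows from the dataset size, batch size, and epoch count. The sketch assumes a single device, no gradient accumulation, and that the last partial batch is kept.

```python
import math

n_train = 3000    # size of small_train_dataset
batch_size = 16   # per_device_train_batch_size
epochs = 2        # num_train_epochs

steps_per_epoch = math.ceil(n_train / batch_size)  # last batch is partial
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```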


Now, we train the model on the dataset:

In [26]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


ValueError: Target size (torch.Size([16])) must be the same as input size (torch.Size([16, 2]))
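
A likely cause, offered as an assumption rather than a confirmed diagnosis: the `label` column holds floats (`-1.0`, `1.0`), and with float labels and `num_labels=2` the model falls back to a multi-label loss that expects targets of shape `[16, 2]`. Single-label classification expects integer class ids from `0` to `num_labels - 1`. The sketch below remaps the labels accordingly. `LABEL_TO_ID` and `to_class_id` are hypothetical names, and the mapping assumes the column only takes the values `-1.0` and `1.0`.

```python
LABEL_TO_ID = {-1.0: 0, 1.0: 1}  # assumed coding: negative -> 0, positive -> 1


def to_class_id(label):
    """Convert a float sentiment label into an integer class id."""
    try:
        return LABEL_TO_ID[float(label)]
    except KeyError:
        # Fail loudly on values the assumed coding does not cover
        raise ValueError(f"Unexpected label value: {label!r}") from None


labels = [1.0, -1.0, -1.0, 1.0]
print([to_class_id(x) for x in labels])
```

The same mapping could be applied to the `label` lists before calling `Dataset.from_dict`, so that the batches carry integer labels.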