# Sentiment Analysis on The Reddit Climate Change Dataset
by Santiago Segovia

### Motivation

We use the Reddit Climate Change Dataset to fine-tune a DistilBERT model that classifies whether a Reddit comment about climate change is positive or negative. The analysis follows [this publication](https://huggingface.co/blog/sentiment-analysis-python) from Hugging Face.

### 1. Install Dependencies and Initial Setup

In [1]:
!pip install datasets transformers accelerate -q

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ibis-framework 7.1.0 requires pyarrow<15,>=2, but you have pyarrow 15.0.0 which is incompatible.

In [2]:
import pandas as pd
import torch

from google.colab import drive
from transformers import pipeline

In [3]:
# Mount GDrive
drive.mount("/content/drive")

Mounted at /content/drive


In [27]:
# Load data (takes ~1 min to load)
import csv

small_data = False
dtype_dict = {'label': int}
data_path = "/content/drive/Shareddrives/adv-ml-project/Data/"
comments = pd.read_csv(data_path + "comments_filtered.csv", quoting=csv.QUOTE_NONNUMERIC, dtype=dtype_dict)

### 2. Preprocess Data

To evaluate the performance of our model, we need a train-test split. We randomly pick 20% of the records and flag them as the test set:

In [28]:
import random
random.seed(120938)

num_test_obs = round(comments.shape[0] * 0.2)
ids_test_obs = random.sample(range(comments.shape[0]), num_test_obs)

In [29]:
comments['test_split'] = 0
comments.loc[ids_test_obs,'test_split'] = 1
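
The split above can be sketched on toy data without pandas. This is a minimal illustration, assuming a hypothetical helper `assign_test_split` (not part of the original notebook) that mirrors the same idea: sample 20% of the row positions without replacement and flag them as test rows.

```python
import random

def assign_test_split(n_rows, test_fraction=0.2, seed=120938):
    """Return a list of 0/1 flags marking which row positions are test rows.

    Hypothetical helper for illustration; it mirrors the sampling logic
    used in the notebook but works on plain Python lists.
    """
    rng = random.Random(seed)  # local generator, so global random state is untouched
    num_test = round(n_rows * test_fraction)
    # Sampling without replacement guarantees unique row positions
    test_ids = set(rng.sample(range(n_rows), num_test))
    return [1 if i in test_ids else 0 for i in range(n_rows)]

flags = assign_test_split(n_rows=10)
print(sum(flags), "of", len(flags), "rows flagged as test")
```

Because the sampled values are row positions, assigning them with `.loc` as in the cell above relies on the DataFrame keeping its default 0 to n-1 index.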

In [30]:
comments.head(3)

| | subreddit.name | date | body | sentiment | label | test_split |
|---|---|---|---|---|---|---|
| 0 | askreddit | 2022-08-31 23:56:46 | I think climate change tends to get some peopl... | 0.6634 | 1 | 0 |
| 1 | askreddit | 2022-08-31 23:54:25 | They need to change laws so it's more worth se... | 0.469 | 1 | 1 |
| 2 | askreddit | 2022-08-31 23:52:41 | That a big part of the solution to climate cha... | 0.8937 | 1 | 0 |


We need to convert our data to a Hugging Face `Dataset` object so that we can easily use the library's functions. We use the `Dataset.from_dict()` method from the `datasets` [library](https://huggingface.co/docs/datasets/en/create_dataset).

In [31]:
# Fill NaN values with empty strings, otherwise from_dict will raise an error
comments['body'] = comments['body'].fillna('')

In [32]:
from datasets import Dataset

# Create train and test data
train_data_dict = {"text": comments.loc[comments['test_split'] == 0, 'body'].tolist(),
                   "label": comments.loc[comments['test_split'] == 0, 'label'].tolist()}
test_data_dict = {"text": comments.loc[comments['test_split'] == 1, 'body'].tolist(),
                  "label": comments.loc[comments['test_split'] == 1, 'label'].tolist()}

train_data = Dataset.from_dict(train_data_dict)
test_data = Dataset.from_dict(test_data_dict)

In [33]:
# Create a smaller training dataset for faster training times
small_train_dataset = train_data.shuffle(seed=42).select(range(3000))
small_test_dataset = test_data.shuffle(seed=42).select(range(300))

In [22]:
# Example of structures
print(small_train_dataset[0])
print(small_test_dataset[0])

{'text': "I did not say climate change is false. This is how views become misrepresented and conversations shut down. I couldn't have been more clear in my previous post that it's real. It's the first damn sentence after the definition.\n\nMy point was exactly the first sentence in your second paragraph. Propaganda doesn't have to be false. Climate change is not false. But we should not act as though the movie doesn't draw a political circle around the issue. Either way we can agree to disagree.", 'label': 0}
{'text': 'They really tried, they sent out a memo saying basically "don\'t do what Trump did" he may have actually broke the Republican party. I don\'t think he did it on purpose, I think he just showed the world what they have been dog whistling for years, he turned "the science isn\'t out on climate change" into "global warming it a hoax made up by the chinese to make us industry non competitive." ', 'label': 1}


To [preprocess](https://huggingface.co/docs/transformers/preprocessing#everything-you-always-wanted-to-know-about-padding-and-truncation) our data, we will use the [DistilBERT tokenizer](https://huggingface.co/docs/transformers/v4.15.0/en/model_doc/distilbert#transformers.DistilBertTokenizer):


In [34]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Next, we prepare the text inputs for the model for both splits of our dataset (training and test) by using the [map method](https://huggingface.co/docs/datasets/en/process#map) with [batch processing](https://huggingface.co/docs/datasets/en/process#batch-processing):

In [36]:
# Prepare the text inputs for the model
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

if small_data:
    tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
    tokenized_test = small_test_dataset.map(preprocess_function, batched=True)
else:
    tokenized_train = train_data.map(preprocess_function, batched=True)
    tokenized_test = test_data.map(preprocess_function, batched=True)

Map:   0%|          | 0/729867 [00:00<?, ? examples/s]

Map:   0%|          | 0/182467 [00:00<?, ? examples/s]
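
To make the batched call concrete: with `batched=True`, `map` passes the function a dictionary whose values are lists covering a whole batch of rows, not a single row. The sketch below imitates that contract in plain Python. The whitespace tokenizer, the names `toy_tokenize` and `toy_preprocess`, and the `max_length` of 8 are stand-ins invented for illustration; the real DistilBERT tokenizer splits text into subword tokens and, per its settings, truncates at 512 tokens.

```python
def toy_tokenize(texts, max_length=8):
    """Toy stand-in for a tokenizer: split on whitespace and truncate."""
    vocab = {}
    input_ids = []
    for text in texts:
        ids = []
        for token in text.lower().split():
            # Assign each previously unseen token the next free integer id
            ids.append(vocab.setdefault(token, len(vocab) + 1))
        input_ids.append(ids[:max_length])  # truncation keeps the first max_length ids
    return {"input_ids": input_ids}

def toy_preprocess(examples):
    # `examples` is a dict of lists, as a batched `map` call would supply it
    return toy_tokenize(examples["text"])

batch = {
    "text": ["Climate change is real", "a " * 20],
    "label": [1, 0],
}
encoded = toy_preprocess(batch)
print([len(ids) for ids in encoded["input_ids"]])
```

The second text has 20 tokens and is cut to 8, while the first keeps its 4 tokens; the sequences still differ in length, which is why padding is needed before batching.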

Sentences aren’t always the same length, which is an issue because model inputs need to have a uniform shape. Padding ensures tensors are rectangular by adding a special padding token to shorter sentences. We use a [Data Collator](https://huggingface.co/docs/transformers/en/main_classes/data_collator) to convert our training samples to PyTorch tensors and concatenate them with the correct amount of [padding](https://huggingface.co/docs/transformers/preprocessing#pad):

In [37]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [38]:
data_collator

DataCollatorWithPadding(tokenizer=DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}, padding=True, max_length=None, pad_to_multiple_of=None, return_tensors='pt')
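
As a rough picture of what dynamic padding does, the sketch below pads a batch of token-id lists to the length of the longest sequence in that batch and builds a matching attention mask. It is a plain-Python illustration, not the `DataCollatorWithPadding` implementation; `pad_batch` is a hypothetical helper, and the pad id of 0 and right-side padding are taken from the tokenizer settings printed above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# 101 and 102 are the [CLS] and [SEP] ids; the ids in between are arbitrary
batch = pad_batch([[101, 4785, 2689, 102], [101, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding to the longest sequence in each batch, rather than to one fixed global length, keeps batches of short comments small.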

### 3. Training the model

For training, we will be using the [Trainer API](https://huggingface.co/docs/transformers/v4.15.0/en/main_classes/trainer#transformers.Trainer), which is optimized for fine-tuning Transformers models such as DistilBERT.

First, we define DistilBERT as our base model:

In [39]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Then, we define the metrics we'll use to evaluate the fine-tuned model ([accuracy and F1 score](https://huggingface.co/metrics)):

In [40]:
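import numpy as np
from datasets import load_metric

# Load the metrics once, rather than on every evaluation call.
# Note: load_metric is deprecated in recent `datasets` releases
# in favor of the separate `evaluate` library.
accuracy_metric = load_metric("accuracy")
f1_metric = load_metric("f1")

def compute_metrics(eval_pred):
    # eval_pred unpacks into the model's logits and the reference labels
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}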
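
For reference, the two metrics can be written out by hand. The sketch below is a plain-Python illustration under the assumption of binary labels with 1 as the positive class; `accuracy_and_f1` is a hypothetical helper, not the implementation behind `load_metric`.

```python
def accuracy_and_f1(logits, labels):
    """Compute accuracy and binary F1 (positive class = 1) from raw logits."""
    # Predicted class = index of the largest logit in each row
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    # F1 = 2TP / (2TP + FP + FN); set to 0 when the denominator is empty
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}

toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
toy_labels = [1, 0, 0, 1]
result = accuracy_and_f1(toy_logits, toy_labels)
print(result)
```

Accuracy counts all correct predictions, while F1 ignores true negatives, so the two can diverge when the classes are imbalanced.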

Before training our model, we need to define the training arguments and a `Trainer` that combines all the objects we have constructed so far:

In [41]:
from transformers import TrainingArguments, Trainer

output_name = data_path + "Output"

training_args = TrainingArguments(
   output_dir=output_name,
   learning_rate=2e-5,
   per_device_train_batch_size=16,
   per_device_eval_batch_size=16,
   num_train_epochs=2,
   weight_decay=0.01,
   save_strategy="epoch",
)

trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_train,
   eval_dataset=tokenized_test,
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)
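
These arguments determine how long training runs. As a back-of-the-envelope check, the number of optimization steps can be estimated from the dataset size, batch size, and epoch count. The sketch assumes a single device and no gradient accumulation (neither is stated in the notebook), and `total_training_steps` is a hypothetical helper.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimization steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch may be partial
    return steps_per_epoch * num_epochs

# 729,867 is the training-set size reported by the tokenization progress bar
full_run_steps = total_training_steps(729_867, batch_size=16, num_epochs=2)
print(full_run_steps)  # 91234
```

With `save_strategy="epoch"`, a checkpoint would be written at the end of each epoch, that is, every 45,617 steps under these assumptions.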

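Now, we train the model. Note that the recorded output below reports only 126 optimization steps over 2 epochs, which at a batch size of 16 corresponds to roughly 1,000 training examples, so it appears to come from a run on a reduced subset rather than on the full training set: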

In [25]:
trainer.train()


TrainOutput(global_step=126, training_loss=0.3542396000453404, metrics={'train_runtime': 80.1626, 'train_samples_per_second': 24.949, 'train_steps_per_second': 1.572, 'total_flos': 216950551951968.0, 'train_loss': 0.3542396000453404, 'epoch': 2.0})

Finally, we compute the evaluation metrics to see how well our model performs:

In [26]:
trainer.evaluate()

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


{'eval_loss': 0.6809479594230652,
 'eval_accuracy': 0.69,
 'eval_f1': 0.7256637168141592,
 'eval_runtime': 2.7248,
 'eval_samples_per_second': 36.7,
 'eval_steps_per_second': 2.569,
 'epoch': 2.0}
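
One way to read these numbers: accuracy and F1 together pin down how the errors are distributed once the number of evaluated examples is known. The check below is a sketch that infers that number from `eval_runtime` and `eval_samples_per_second`, so the sample count is an inference from the log rather than something the notebook states; binary F1 with 1 as the positive class is also assumed.

```python
# Values copied from the evaluation output above
eval_runtime = 2.7248
eval_samples_per_second = 36.7
eval_accuracy = 0.69
eval_f1 = 0.7256637168141592

# Inferred number of evaluated examples (runtime x throughput)
n_eval = round(eval_runtime * eval_samples_per_second)
errors = round(n_eval * (1 - eval_accuracy))  # false positives + false negatives

# F1 = 2TP / (2TP + errors)  =>  TP = F1 * errors / (2 * (1 - F1))
true_positives = round(eval_f1 * errors / (2 * (1 - eval_f1)))
true_negatives = n_eval - errors - true_positives

print(n_eval, errors, true_positives, true_negatives)
```

Under these assumptions the metrics correspond to 41 true positives, 28 true negatives, and 31 misclassified comments out of 100, so the estimate rests on a small evaluation sample.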