# Sentiment Analysis on the Reddit Climate Change Dataset
by Santiago Segovia

### Analytical Process

1. Compare the sentiment scores in the data with Hugging Face's approach
    * Use the `distilbert-base-uncased` model to calculate sentiment probabilities
    * Label data based on probabilities
    * Compare both sentiment metrics
2. Use data from Reddit to tune our own sentiment analysis model
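
As a rough sketch of step 1, the comparison could look like the following. The 0.5 threshold, the function names, the choice to label by the sign of the dataset's score, and the example values are all assumptions for illustration, not necessarily the method used later in this notebook.

```python
def label_from_probability(p_positive, threshold=0.5):
    """Map a positive-class probability to a label (the threshold is an assumption)."""
    return "positive" if p_positive >= threshold else "negative"


def label_from_score(score):
    """Map a signed sentiment score to a label by its sign; zero counts as neutral."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def agreement_rate(probabilities, scores):
    """Share of non-neutral records on which both labels coincide."""
    pairs = [
        (label_from_probability(p), label_from_score(s))
        for p, s in zip(probabilities, scores)
        if s != 0  # neutral scores have no binary counterpart, so skip them
    ]
    if not pairs:
        return float("nan")
    return sum(a == b for a, b in pairs) / len(pairs)


# Made-up model probabilities paired with made-up signed scores
example_probabilities = [0.91, 0.35, 0.62, 0.50]
example_scores = [0.66, -0.35, -0.10, 0.0]
rate = agreement_rate(example_probabilities, example_scores)
print(rate)
```

Skipping neutral scores is one possible design choice; a three-class comparison would need a second threshold on the model side.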

### 1. Install Dependencies and Initial Setup

In [1]:
!pip install datasets transformers -q

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ibis-framework 7.1.0 requires pyarrow<15,>=2, but you have pyarrow 15.0.0 which is incompatible.

In [2]:
import pandas as pd
import torch

from google.colab import drive
from datasets import Dataset
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification
from transformers import pipeline

In [3]:
# Mount GDrive
drive.mount("/content/drive")

Mounted at /content/drive


In [4]:
# Load data (takes ~1 min to load)
data_path = "/content/drive/Shareddrives/adv-ml-project/Data/"
comments = pd.read_csv(data_path + "comments_filtered.csv")
comments['date'] = pd.to_datetime(comments['date'])



In [8]:
comments.head(2)

| | subreddit.name | body | sentiment | date |
|---|---|---|---|---|
| 0 | askreddit | I think climate change tends to get some peopl... | 0.6634 | 2022-08-31 23:56:46 |
| 1 | askreddit | They need to change laws so it's more worth se... | 0.469 | 2022-08-31 23:54:25 |


In [9]:
comments.dtypes

subreddit.name            object
body                      object
sentiment                 object
date              datetime64[ns]
dtype: object
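
The `dtypes` output lists `sentiment` as `object` rather than a float, which suggests that at least some values were read as strings. A minimal sketch of one way to coerce the column, assuming that unparseable entries should become `NaN` (the toy values below are made up):

```python
import pandas as pd

# Toy column mixing numeric strings, a non-numeric string, and a missing value
toy = pd.DataFrame({"sentiment": ["0.6634", "0.469", "not available", None]})

# errors="coerce" turns anything that cannot be parsed into NaN
toy["sentiment"] = pd.to_numeric(toy["sentiment"], errors="coerce")

print(toy.dtypes)
print(toy["sentiment"].isna().sum())
```

Counting the resulting `NaN` values shows how many records would be lost to the coercion before deciding whether to drop them.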

### 2. Preprocess Data

To evaluate our model's performance, we need a train-test split. We randomly select 20% of the records and flag them as the test set:

In [19]:
import random
random.seed(120938)

num_test_obs = round(comments.shape[0] * 0.2)
ids_test_obs = random.sample(range(comments.shape[0]), num_test_obs)

In [24]:
comments['test_split'] = 0
comments.loc[ids_test_obs,'test_split'] = 1

In [25]:
comments.head()

| | subreddit.name | body | sentiment | date | test_split |
|---|---|---|---|---|---|
| 0 | askreddit | I think climate change tends to get some peopl... | 0.6634 | 2022-08-31 23:56:46 | 0 |
| 1 | askreddit | They need to change laws so it's more worth se... | 0.469 | 2022-08-31 23:54:25 | 1 |
| 2 | askreddit | That a big part of the solution to climate cha... | 0.8937 | 2022-08-31 23:52:41 | 0 |
| 3 | askreddit | &gt;Not climate change mind you\n\nHi. I have ... | 0.0 | 2022-08-31 23:49:45 | 0 |
| 4 | worldnews | Climate change is not "staring" you in the fac... | -0.3453 | 2022-08-31 23:48:15 | 0 |
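
The split can be sanity-checked on a toy frame. The sketch below repeats the same sampling idea but assigns the flag by position with `iloc`, which is an assumption on our part: `.loc` is label-based, so the cells above rely on the DataFrame keeping its default `0..n-1` index. The toy data and variable names are made up.

```python
import random

import pandas as pd

# Toy stand-in for the comments DataFrame
toy = pd.DataFrame({"body": [f"comment {i}" for i in range(10)]})

random.seed(120938)
n_test = round(len(toy) * 0.2)
test_positions = random.sample(range(len(toy)), n_test)

# Flag the sampled rows by position, independent of the index labels
toy["test_split"] = 0
toy.iloc[test_positions, toy.columns.get_loc("test_split")] = 1

# Share of rows flagged as test
print(toy["test_split"].mean())
```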


We need to convert our data to a Hugging Face `Dataset` so that we can easily use the library's functions. We use the `from_dict()` method from the [`datasets` library](https://huggingface.co/docs/datasets/en/create_dataset).

In [26]:
# Fill NaN values with empty strings, otherwise from_dict will raise an error
comments['body'] = comments['body'].fillna('')

# Create train and test data
train_data_dict = {"text": comments.loc[comments['test_split'] == 0, 'body'].tolist()}
test_data_dict = {"text": comments.loc[comments['test_split'] == 1, 'body'].tolist()}
train_data = Dataset.from_dict(train_data_dict)
test_data = Dataset.from_dict(test_data_dict)

In [30]:
# Example of structures
print(train_data[0])
print(test_data[0])

{'text': 'I think climate change tends to get some people riled up. \n\nWhen I was part of a debate club, they loved throwing that subject in. One case we had to discuss was whether or not it was okay to fly if it pollutes the air. A friend of mine on the team got very worked up because he loves to travel. At the end, we actually had to make up because our disagreement about flying got very heated.'}
{'text': "They need to change laws so it's more worth selling agriculture products in the US rather than export it.  They also need to change laws so there are monetary penalties for growing crops that are not particularly viable to an area's natural climate.  As it stands right now, my neighbor makes double the price per head of cattle by exporting out of country than he would selling right here.  All the people complaining about climate change on here should probably be complaining about this too."}


To [preprocess](https://huggingface.co/docs/transformers/preprocessing#everything-you-always-wanted-to-know-about-padding-and-truncation) our data, we use the [DistilBERT tokenizer](https://huggingface.co/docs/transformers/v4.15.0/en/model_doc/distilbert#transformers.DistilBertTokenizer):


In [31]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


Next, we tokenize both splits of our dataset (training and test) with the `map` method, which applies `preprocess_function` to batches of records:

In [32]:
# Prepare the text inputs for the model
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_train = train_data.map(preprocess_function, batched=True)
tokenized_test = test_data.map(preprocess_function, batched=True)

Map: 729886 examples

Map: 182471 examples
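
To make the shape of a batched mapping function concrete, here is a library-free sketch. `toy_tokenize` and `toy_map` are hypothetical stand-ins (whitespace splitting instead of the real tokenizer, a tiny `max_length`), not the `datasets` or `transformers` implementations:

```python
def toy_tokenize(batch, max_length=512):
    """Hypothetical stand-in for tokenizer(batch["text"], truncation=True)."""
    input_ids = []
    for text in batch["text"]:
        tokens = text.lower().split()
        # Truncation: keep at most max_length tokens per record
        input_ids.append(tokens[:max_length])
    return {"input_ids": input_ids}


def toy_map(columns, fn, batch_size=2):
    """Apply fn to slices of a column-oriented dict, as a batched map would."""
    n_rows = len(columns["text"])
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Original columns are kept and the new ones are added alongside them
    return {**columns, **new_columns}


columns = {"text": ["Climate change is real", "Short one", "A third comment here"]}
tokenized = toy_map(columns, lambda batch: toy_tokenize(batch, max_length=3))
print(tokenized["input_ids"])
```

The key point is that the function receives a dict of lists and returns a dict of lists of the same length, so each output row stays aligned with its input row.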

Sentences aren’t always the same length, which is a problem because model inputs need a uniform shape. Padding makes tensors rectangular by adding a special padding token to shorter sentences. We use a `data_collator` to convert our training samples to PyTorch tensors and batch them with the correct amount of [padding](https://huggingface.co/docs/transformers/preprocessing#everything-you-always-wanted-to-know-about-padding-and-truncation), which by default means padding each batch to the length of its longest sequence:

In [33]:
# DataCollatorWithPadding was already imported in the setup cell
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
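
Conceptually, padding a batch works roughly as in the plain-Python sketch below. This is an illustration under stated assumptions, not the collator's actual implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and a padding id of `0` is assumed.

```python
def pad_batch(sequences, pad_token_id=0):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up token id sequences of different lengths
batch = pad_batch([[11, 12, 13, 14], [21, 22]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps short batches small, which matters when most comments are far shorter than the longest one.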

### 3. Train the Model