### **r/place 2023 Sentiment Analysis Model**
This Jupyter notebook will fine-tune the DistilBERT model to perform sentiment analysis on Reddit comments in July 2023. Feel free to tweak the variables and code here. Credits are included at the end of the notebook.

**Install Dependencies**<br>
This notebook has been tested on Python 3.11.2 and uses Pytorch.

In [43]:
import csv
import pandas as pd
import sklearn
import torch

**Load the Data**<br>
The target CSV file has Reddit comments in Column 0 and a score in Column 1. The scores correspond to the following sentiments: -1 = negative, 0 = neutral, 1 = positive. We will tweak the range from [-1, 1] to [0, 2] to match the model's labels.

In [44]:
# define the data path and store the comments in a list
data_path = "data/Reddit_Data.csv"
comments_and_scores = []

# read the csv and store each comment with its respective score
with open(data_path, "r", encoding="utf8") as f:
    csv_reader = csv.reader(f)
    next(csv_reader)
    for row in csv_reader:
        comment, score = row
        comments_and_scores.append((comment, int(score)+1))

print(comments_and_scores[0])

(' family mormon have never tried explain them they still stare puzzled from time time like some kind strange creature nonetheless they have come admire for the patience calmness equanimity acceptance and compassion have developed all the things buddhism teaches ', 2)


**Separate Training and Testing Datasets**<br>
We need to separate these comments into training and testing datasets.

In [45]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(comments_and_scores,
                                       test_size=0.2,
                                       random_state=24)

In [46]:
print(train_set[0])
print(test_set[0])

(' yeh hindutva promote kar rahe hai ', 1)
('vote for the best available candidate nota waste like you said ', 2)


In [47]:
# extract the training comments and scores
train_comments = [group[0] for group in train_set]
train_scores = [group[1] for group in train_set]

# extract the testing comments and scores
test_comments = [group[0] for group in test_set]
test_scores = [group[1] for group in test_set]

In [48]:
print(train_comments[0], train_scores[0])
print(test_comments[0], test_scores[0])

 yeh hindutva promote kar rahe hai  1
vote for the best available candidate nota waste like you said  2


Now that we have the training and testing datasets, we will convert them into Pandas DataFrame objects.

In [49]:
train_set = {"text": train_comments, "score": train_scores}
train_set = pd.DataFrame(data)
print(train_set)

                                                    text  score
0       family mormon have never tried explain them t...      2
1      buddhism has very much lot compatible with chr...      2
2      seriously don say thing first all they won get...      0
3      what you have learned yours and only yours wha...      1
4      for your own benefit you may want read living ...      2
...                                                  ...    ...
37244                                              jesus      1
37245  kya bhai pure saal chutiya banaya modi aur jab...      2
37246              downvote karna tha par upvote hogaya       1
37247                                         haha nice       2
37248             facebook itself now working bjp’ cell       1

[37249 rows x 2 columns]


In [None]:
test_set = {"text": test_comments, "score": test_scores}
test_set = pd.DataFrame(data)
print(test_set)

**Tokenize the Data**<br>
Prior to training the model, we will tokenize the Reddit comments into small pieces to make it easier for the model to identify the comment's sentiment.

In [50]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [51]:
# tokenize the training and testing datasets
# tokenized_train = [tokenizer(text) for train_set["text"]]