# Argumentative Reddit Bot Project

# Part 1. Data Collection and Processing

In this notebook, we'll pull together data from different sources to create a dataset that we can train our model on. We want data containing back-and-forth conversations between redditors who are arguing over potentially controversial topics. The data should have:


*   At least two back-and-forth remarks
*   Some topic to argue over

Our bot will..


In [None]:
!pip install -U accelerate



In [None]:
# Mount your Google Drive to access the data
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Change working directory to your folder path
#%cd drive/MyDrive/AI\ Camp/Captivating\ Cupcakes/data

%cd /content/drive/MyDrive/Captivating Cupcakes/data

/content/drive/.shortcut-targets-by-id/1hRdudFgbMMqNyWwC1uvEQEX198UmxrEl/Captivating Cupcakes/data


In [None]:
# Run this to check if it worked
!pwd

/content/drive/.shortcut-targets-by-id/1hRdudFgbMMqNyWwC1uvEQEX198UmxrEl/Captivating Cupcakes/data


### I. Reddit Comment Data

This dataset contains comments from Reddit users, each with a `controversiality` flag and a `score`. While it doesn't include replies, it can be useful if we want the model to also come up with the original comment for some user input (e.g. a topic or subreddit name). [Dataset Link.](https://www.kaggle.com/datasets/smagnan/1-million-reddit-comments-from-40-subreddits)

e.g.

User Input: /// Topic: Apples

/// Bot 1: Apples are actually not that good for you.

/// Bot 2: Are you dumb? How are apples not good for you?

/// Bot 3: I'm not dumb, you are. If apples are so good for you, why doesn't everyone eat apples everyday?

...

User Input: /// Comment: Are you dumb? How are apples not good for you?

/// Bot 1: How are apples good for you. Tell me one good reason why.

/// Bot 2: ...



In [None]:
# Pandas allows us to open .csv files
import pandas as pd

# Reading in the data
rc = pd.read_csv('reddit_comments.csv')

# Looking at first 5 rows
rc.head()

Unnamed: 0,subreddit,body,controversiality,score
0,gameofthrones,Your submission has been automatically removed...,0,1
1,aww,"Dont squeeze her with you massive hand, you me...",0,19
2,gaming,It's pretty well known and it was a paid produ...,0,3
3,news,You know we have laws against that currently c...,0,10
4,politics,"Yes, there is a difference between gentle supp...",0,1


In [None]:
# Here, we can extract comments that are controversial
rc_filtered = rc.query('controversiality == 1')
rc_filtered.sample(5) # Finds 5 random controversial comments

Unnamed: 0,subreddit,body,controversiality,score
548203,worldnews,"Ah, conspiracy theorists and unfalsifiable pre...",1,3
112058,hockey,There’s an unsurprising lack of Boston fans in...,1,0
813287,movies,Some people say when she smiles you see too mu...,1,-1
438584,ChapoTrapHouse,This guy thinks so https://www.reddit.com/r/An...,1,0
646326,news,"They aren't hypocrites though, for the reasons...",1,-2


In [None]:
# Let's take a look at one controversial comment
import random
random_idx = random.sample(range(len(rc_filtered)), 1)
rc_filtered.iloc[random_idx, 1].tolist()

['He made one three. If any refs call his elbows or travels where would he be?']

In [None]:
rc_filtered.iloc[random_idx, 0].tolist()

['nba']

In [None]:
# We can also see which comments have a negative score (more downvotes than upvotes)
rc_negative = rc.query('score < 0')
rc_negative.head(5)

Unnamed: 0,subreddit,body,controversiality,score
20,leagueoflegends,"TLDR:\n\n""We invested a lot of brain power and...",0,-4
23,nba,[the only reason anyone knows who Jared Dudley...,0,-5
30,news,This is disgusting every week,0,-1
31,nba,So you’d say 26 year old Buddy Hield isn’t a p...,1,-2
49,AskReddit,Not in my findings,1,-1


In [None]:
rc_negative.body.tolist()[9]

'&gt; If you remove suicides and gang violence our "gun violence epidemic" suddenly disappears entirely.\n\nMass-shootings in Universities are not by gangs, nor suicide cases (despite shooters routinely taking their own lives afterwards). It is something that just doesn\'t happen on the scale it does in America, and a large factor in that is the ease of gun access.\n\n6 people a year (wherever you pulled that stat) is 6 too many.'

In [None]:
# Same idea, let's look at one negative score comment
random_idx = random.sample(range(len(rc_negative)), 1)
rc_negative.iloc[random_idx, 1].tolist()

['Keep the animal caged.']

In [None]:
rc_negative.iloc[random_idx, 0].tolist()

['todayilearned']

In [None]:
!pip install better_profanity

Collecting better_profanity
  Downloading better_profanity-0.7.0-py3-none-any.whl (46 kB)
Installing collected packages: better_profanity
Successfully installed better_profanity-0.7.0


In [None]:
from better_profanity import profanity

profanity.contains_profanity(rc_negative.body.tolist()[0])

In [None]:
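# Flag the first 500 negative-score comments: True means profanity was detected
filtered_rc = [profanity.contains_profanity(comment) for comment in rc_negative.body.tolist()[0:500]]

# To drop the flagged rows, invert the flags and use them as a boolean mask:
# filtered_rc_data = rc_negative.iloc[:500][[not flag for flag in filtered_rc]]

print(filtered_rc)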

[False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, True, True, False, False, False, False, False, False, False, True, False, False, True, False, False, True, False, False, True, False, False, False, False, False, False, True, True, False, False, True, False, True, False, False, True, True, False, False, True, False, False, False, False, True, True, False, False, True, False, True, True, False, False, False, True, True, False, True, False, False, False, True, False, True, False, False, False, False, False, False, False, True, False, True, False, False, True, True, False, False, False, False, False, True, True, False, False, False, False, True, False, True, False, False, True, True, False, False, False, False, False, True, False, True, True, False, False, False, False, False, False, False, True, True, False, True, False, True, False, False, False, True, False, True, False, False, False, False, False, False, False,
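
The flags printed above line up one-to-one with the first 500 comments, so they can serve as a keep/drop mask. Here is a minimal sketch of that idea on plain lists. The `toy_contains_profanity` helper and `BLOCKLIST` are hypothetical stand-ins for `profanity.contains_profanity`, used only so the example runs without extra installs.

```python
from itertools import compress

# Hypothetical stand-in for profanity.contains_profanity, so the sketch
# runs without installing better_profanity.
BLOCKLIST = {"darn", "heck"}

def toy_contains_profanity(text):
    """Return True if any blocklisted word appears in the text."""
    words = (word.strip(".,!?").lower() for word in text.split())
    return any(word in BLOCKLIST for word in words)

comments = [
    "This is disgusting every week",
    "What the heck is this",
    "Not in my findings",
]

# One boolean per comment: True means the comment was flagged.
flags = [toy_contains_profanity(comment) for comment in comments]

# Keep only the comments whose flag is False.
clean_comments = list(compress(comments, (not flag for flag in flags)))
print(clean_comments)
```

The same inverted mask could be applied to the DataFrame rows instead of a list, as long as the mask and the rows have the same length.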

In [None]:
import csv

csv_file_path = 'rc_results.csv'

# Write the list to the CSV file
with open(csv_file_path, 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(['Value'])  # Write header
    for boolean_value in filtered_rc:
        csv_writer.writerow([boolean_value])

### II. r/CasualConversation Conversational Data

This dataset contains conversations between Reddit users in the r/CasualConversation subreddit. It isn't specific to arguments or controversial topics, so we run an argument detector over it to pull the argumentative conversations out of [this dataset](https://www.kaggle.com/datasets/jerryqu/reddit-conversations).

In [None]:
import pandas as pd
cc = pd.read_csv('casual_conversations.csv', index_col = 0)
cc.head()

Unnamed: 0,0,1,2
0,What kind of phone(s) do you guys have?,I have a pixel. It's pretty great. Much better...,Does it really charge all the way in 15 min?
1,I have a pixel. It's pretty great. Much better...,Does it really charge all the way in 15 min?,"Pretty fast. I've never timed it, but it's und..."
2,Does it really charge all the way in 15 min?,"Pretty fast. I've never timed it, but it's und...","cool. I've been thinking of getting one, my ph..."
3,What kind of phone(s) do you guys have?,Samsung Galaxy J1. It's my first cell phone an...,What do you think of it? Anything you don't like?
4,Samsung Galaxy J1. It's my first cell phone an...,What do you think of it? Anything you don't like?,I love it. I can't think of anything I don't l...


In [None]:
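def combine_row(row):
    """Join every comment in a row into one string, each followed by ' | '."""
    return "".join(comment + ' | ' for comment in row)

def process_row(row):
    """Format a conversation as alternating Bot_1 / Bot_2 turns, one per line."""
    result = ""
    for i, comment in enumerate(row):
        # Even-numbered turns belong to Bot_1, odd-numbered turns to Bot_2
        user = 2 if i % 2 else 1
        result += f'$$$Bot_{user}: ' + comment + ' \n'
    return result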
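
The `$$$Bot_N:` prefix also makes a formatted string easy to split back into turns, which may help when post-processing generated text later. Below is a small sketch that assumes the exact format produced by `process_row`. `split_turns` is a hypothetical helper and not part of the notebook.

```python
import re

# Matches the speaker tag written by process_row, e.g. "$$$Bot_1: ".
TURN_TAG = re.compile(r"\$\$\$Bot_(\d+): ")

def split_turns(formatted):
    """Split a formatted conversation into (speaker_number, text) pairs."""
    # re.split with one capture group yields:
    # [prefix, speaker, text, speaker, text, ...]
    parts = TURN_TAG.split(formatted)
    speakers = parts[1::2]
    texts = parts[2::2]
    return [(int(s), t.strip()) for s, t in zip(speakers, texts)]

example = ("$$$Bot_1: Does it really charge all the way in 15 min? \n"
           "$$$Bot_2: Pretty fast. I've never timed it. \n")
print(split_turns(example))
```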

In [None]:
!pip install git+https://github.com/chriswales95/Canary.git@development

In [None]:
from canary.argument_pipeline import download_model, load_model, analyse_file
download_model("all")

detector = load_model("argument_detector")

In [None]:
results = [detector.predict(comment) for comment in cc.iloc[:,0]]

In [None]:
# Load the detector predictions saved from an earlier run
cc_results = pd.read_csv('cc_results.csv').iloc[:, 0].tolist()

In [None]:
cc_data = cc[cc_results].reset_index(drop=True)

In [None]:
cc_data

Unnamed: 0,0,1,2
0,I have a pixel. It's pretty great. Much better...,Does it really charge all the way in 15 min?,"Pretty fast. I've never timed it, but it's und..."
1,Samsung Galaxy J1. It's my first cell phone an...,What do you think of it? Anything you don't like?,I love it. I can't think of anything I don't l...
2,CONGRATS! Hope you are as happy as you could p...,Oh I definitely am! I still find myself questi...,"Don't question it, just enjoy every moment you..."
3,"so mate, you missed out the important bit, did...",Um. Im a girl. Also. Is that your only motivat...,"Not my only motivation, but life isnt worth li..."
4,"Good luck. I don't know why you'd need it, but...",To not cry or trip. Haha. Thank you. :),&gt; not cry\n\nare you insane\n\nyou ought to...
...,...,...,...
18053,You are the second post I read where they are ...,I think so. Trolls are rampant.,overtly rampant
18054,You should probably try spending less time on ...,wat,&gt;You should probably try spending less time...
18055,Are you from Singapore?Because I recognised th...,It's also in the Philippines! But our prizes a...,Also in Australia!
18056,It's also in the Philippines! But our prizes a...,Also in Australia!,and New Zealand!


In [None]:
cc_data = cc_data.apply(process_row, axis=1).tolist()

In [None]:
cc_data[0:2]

["$$$Bot_1: I have a pixel. It's pretty great. Much better than what I had before.  \n$$$Bot_2: Does it really charge all the way in 15 min? \n$$$Bot_1: Pretty fast. I've never timed it, but it's under half an hour.  \n",
 "$$$Bot_1: Samsung Galaxy J1. It's my first cell phone and I've had it for 7 months. \n$$$Bot_2: What do you think of it? Anything you don't like? \n$$$Bot_1: I love it. I can't think of anything I don't like about it. \n"]

### III. THRED Dataset

This dataset has the same format as the r/CasualConversation dataset, but its conversations have 3, 4, or 5 turns per line. We can apply the same processing logic as in part II to get similar data; the code below keeps only the three-turn conversations.

In [None]:
# Read the text file and split it into conversations
with open('thred_dev.txt', 'r') as file:
    conversations = file.read().split('\n')

# Define lists to store conversations and their turns
all_conversations = []

# Process each conversation
for conversation in conversations:
    if conversation:
        # Split the conversation into turns and keep it only if it has exactly three
        turns = conversation.split('\t')
        if len(turns) == 3:
            # Store the conversation and its turns
            all_conversations.append(turns)
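
The loading cell above keeps only three-turn conversations. If the four- and five-turn examples are wanted too, the length check could be relaxed. The sketch below assumes the same tab-separated, one-conversation-per-line layout. `parse_conversations` and its parameters are hypothetical names.

```python
def parse_conversations(lines, min_turns=3, max_turns=5):
    """Split tab-separated lines into turn lists, keeping allowed lengths."""
    conversations = []
    for line in lines:
        line = line.strip("\n")
        if not line:
            continue  # skip blank lines
        turns = line.split("\t")
        if min_turns <= len(turns) <= max_turns:
            conversations.append(turns)
    return conversations

sample_lines = [
    "a\tb\tc",
    "a\tb\tc\td",
    "only two\tturns",
    "",
]
print(parse_conversations(sample_lines))
```

Since `process_row` alternates speakers by turn index, it should handle these longer conversations without changes.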


In [None]:
thred_results = [detector.predict(conv[0]) for conv in all_conversations]

In [None]:
import csv

csv_file_path = 'thred_results.csv'

# Write the list to the CSV file
with open(csv_file_path, 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(['Value'])  # Write header
    for boolean_value in thred_results:
        csv_writer.writerow([boolean_value])

csv_file_path = 'cc_results.csv'

# Write the list to the CSV file
with open(csv_file_path, 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(['Value'])  # Write header
    for boolean_value in results:
        csv_writer.writerow([boolean_value])
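
One thing to watch when reading these flag files back without pandas: the `csv` module returns every cell as a string, and `bool('False')` is `True` in Python. Here is a small sketch of a safe round trip using an in-memory buffer. The `flags` values are made up for illustration.

```python
import csv
import io

flags = [True, False, True]

# Write the flags the same way as above: a header row, then one value per row.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['Value'])
for flag in flags:
    writer.writerow([flag])

# Read them back; each cell arrives as the string 'True' or 'False'.
buffer.seek(0)
reader = csv.reader(buffer)
next(reader)  # skip the header row
restored = [row[0] == 'True' for row in reader]
print(restored)
```

Comparing against the literal string `'True'` avoids the truthiness trap of calling `bool()` on a non-empty string.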

In [None]:
import numpy as np
thred_results = pd.read_csv('thred_results.csv').iloc[:, 0].tolist()
thred_data = np.array(all_conversations)[thred_results]

In [None]:
thred_data[0]

array(["that 's the secret though . if you only drink soda and beer the water weight just stays off !",
       "i really hope you 're kidding . please be kidding .",
       "there 's a lot of pseudoscience surrounding water and weight so people are bypassing the problem by drinking soft drinks ."],
      dtype='<U177')

In [None]:
thred_data = [process_row(convo) for convo in thred_data]

In [None]:
thred_data[0:2]

["$$$Bot_1: that 's the secret though . if you only drink soda and beer the water weight just stays off ! \n$$$Bot_2: i really hope you 're kidding . please be kidding . \n$$$Bot_1: there 's a lot of pseudoscience surrounding water and weight so people are bypassing the problem by drinking soft drinks . \n",
 "$$$Bot_1: what song is a solid 10/10 , that not many people know about ? \n$$$Bot_2: the xx - intro \n$$$Bot_1: this . i genuinely love this track and no nothing else about this artist . it 's an incredibly track though . \n"]

### IV. Combining Data

In [None]:
cc_data.extend(thred_data)


NameError: ignored

In [None]:
import csv
# Specify the CSV file path
csv_file_path = "conversation_data.csv"

# Write the list to the CSV file
with open(csv_file_path, mode='w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Value'])  # Write header
    for item in cc_data:
        csv_writer.writerow([item])

NameError: ignored

In [None]:
cc_data[0]

NameError: ignored

# Part 2. Training our Model

In [None]:
import csv

# Specify the CSV file path
csv_file_path = "conversation_data.csv"

# Read the list from the CSV file
# (newline='' lets the csv module handle newlines embedded in fields)
data = []
with open(csv_file_path, mode='r', newline='') as csv_file:
    csv_reader = csv.reader(csv_file)
    next(csv_reader)  # Skip the header row
    for row in csv_reader:
        data.append(row[0])
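
Each training string contains embedded `\n` characters, so the CSV fields span multiple lines. The `csv` documentation recommends opening files with `newline=''` for both reading and writing so that embedded newlines survive intact. Here is a minimal round-trip sketch. It uses a temporary file and made-up conversations rather than the real `conversation_data.csv`.

```python
import csv
import os
import tempfile

conversations = [
    "$$$Bot_1: hello \n$$$Bot_2: hi there \n",
    "$$$Bot_1: apples, or pears? \n$$$Bot_2: pears \n",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.csv")

    # newline='' lets the csv module control line endings itself.
    with open(path, mode='w', newline='', encoding='utf-8') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(['Value'])
        for item in conversations:
            writer.writerow([item])

    with open(path, mode='r', newline='', encoding='utf-8') as csv_file:
        reader = csv.reader(csv_file)
        next(reader)  # skip the header row
        restored = [row[0] for row in reader]

print(restored == conversations)
```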

In [None]:
len(data)

0

In [None]:
!pip install transformers --quiet

In [None]:
import torch
from torch.utils.data import Dataset, random_split
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM, GPT2Tokenizer

# Custom dataset class to load dataset
class ConversationDataset(Dataset):
    def __init__(self, txt_list, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []
        for txt in txt_list:
            # Wrap each conversation in start/end tokens, then encode it with
            # the GPT-2 tokenizer, truncating or padding to max_length
            encodings_dict = tokenizer('<|startoftext|>' + txt + '<|endoftext|>',
                                       truncation=True,
                                       max_length=max_length,
                                       padding="max_length")
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        # Each item is an (input_ids, attention_mask) pair
        return self.input_ids[idx], self.attn_masks[idx]
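
For intuition, `truncation=True` together with `padding="max_length"` makes every example exactly `max_length` tokens long, and the attention mask marks which positions hold real tokens. Below is a plain-Python sketch of that behaviour, assuming right-side padding. `pad_or_truncate` and the token ids are illustrative, and this is not the tokenizer's actual implementation.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]              # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad        # right-side padding
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

ids, mask = pad_or_truncate([11, 12, 13], max_length=5, pad_id=0)
print(ids)
print(mask)
```

Because the start and end markers are part of the text being encoded, a conversation longer than `max_length` tokens would lose its trailing `<|endoftext|>` to truncation.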

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
# Set the random seed to a fixed value to get reproducible results
torch.manual_seed(42)

# Download the pre-trained GPT-2 tokenizer and set the special tokens
# marking the beginning and end of a sequence, plus a padding token
tokenizer = AutoTokenizer.from_pretrained("gpt2",
                            bos_token='<|startoftext|>',
                            eos_token='<|endoftext|>',
                            pad_token='<|pad|>')

# Download the pre-trained GPT-2 model and transfer it to the GPU
model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()

# Resize the token embeddings to cover the new special tokens
# (the vocabulary grows from 50257 to 50259 entries)
model.resize_token_embeddings(len(tokenizer))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embeding dimension will be 50259. This might induce some performance reduction as *Tensor Cores* will not be available. For more details  about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc


Embedding(50259, 768)

In [None]:
max_length = 300

# Load dataset
dataset = ConversationDataset(data[0:10000], tokenizer, max_length)

# Split data into train/val
train_size = int(0.9 * len(dataset))

train_data, val_data = random_split(dataset, [train_size, len(dataset) - train_size])

max_length

300

In [None]:
len(data)

275442
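
The full dataset holds 275,442 conversations, but the dataset cell uses only the first 10,000. The short calculation below works out what that implies for the split and for the number of optimizer steps. It takes the batch size and epoch count from the training cell and assumes a single GPU with no gradient accumulation.

```python
import math

num_examples = 10_000          # data[0:10000] in the dataset cell
train_fraction = 0.9
batch_size = 2                 # per_device_train_batch_size
num_epochs = 5                 # num_train_epochs

train_size = int(train_fraction * num_examples)
val_size = num_examples - train_size
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * num_epochs

print(train_size, val_size, steps_per_epoch, total_steps)
```

The resulting total agrees with the `global_step=22500` reported in the training output.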

In [None]:
tokenizer.batch_decode(val_data[10])

["<|startoftext|>$$$Bot_1: You're the best, but the purple alligator is sitting too far to ypur left. \n$$$Bot_2: Is Gumby purple? Or is that Barny? \n$$$Bot_1: I've nary a clue. But if you Rollerblade over that cloud you could find out. \n<|endoftext|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad

In [None]:
training_args = TrainingArguments(output_dir='theo/results',
                                  num_train_epochs=5,
                                  logging_steps=1000,
                                  save_steps=5000,
                                  evaluation_strategy='steps',
                                  eval_steps=1000,
                                  per_device_train_batch_size=2,
                                  per_device_eval_batch_size=2,
                                  warmup_steps=100,
                                  learning_rate=5e-5,
                                  weight_decay=0.01,
                                  logging_dir='theo/logs')

trainer = Trainer(model=model, args=training_args,
                  train_dataset=train_data,
                  eval_dataset=val_data,
                  # This custom collate function is necessary
                  # to build batches of data
                  data_collator=lambda data: {
                      'input_ids': torch.stack([f[0] for f in data]),
                      'attention_mask': torch.stack([f[1] for f in data]),
                      'labels': torch.stack([f[0] for f in data])})

# Start training process!
trainer.train()

Step,Training Loss,Validation Loss
1000,0.9904,0.54625
2000,0.5488,0.521343
3000,0.5301,0.508532
4000,0.5172,0.492302
5000,0.4766,0.485061
6000,0.4486,0.479542
7000,0.4378,0.473392
8000,0.4365,0.465997
9000,0.4277,0.459471
10000,0.3832,0.459657



TrainOutput(global_step=22500, training_loss=0.4201080559624566, metrics={'train_runtime': 5762.5966, 'train_samples_per_second': 7.809, 'train_steps_per_second': 3.904, 'total_flos': 6889536000000000.0, 'train_loss': 0.4201080559624566, 'epoch': 5.0})
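
The collate function passed to `Trainer` reuses `input_ids` as `labels`, so the padded positions also count toward the loss. A common alternative is to set the label to `-100` wherever the attention mask is 0, since `-100` is the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below shows the idea on plain lists. `mask_padding_labels` is a hypothetical helper, the token ids are illustrative, and this is a possible variation rather than what the notebook does.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids, replacing padded positions with IGNORE_INDEX."""
    return [
        token if keep else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]

input_ids = [50257, 15496, 50256, 50258, 50258]
attention_mask = [1, 1, 1, 0, 0]
print(mask_padding_labels(input_ids, attention_mask))
```

With tensors, the same effect could be had by cloning `input_ids` and assigning `-100` where `attention_mask == 0`. Loss values obtained that way would not be directly comparable to the ones reported above.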

In [None]:
# Save model in the specified file path
trainer.save_model("theo/model")
tokenizer.save_pretrained("theo/model")

('theo/model/tokenizer_config.json',
 'theo/model/special_tokens_map.json',
 'theo/model/vocab.json',
 'theo/model/merges.txt',
 'theo/model/added_tokens.json',
 'theo/model/tokenizer.json')

In [None]:
!pip install huggingface_hub



In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
from huggingface_hub import HfApi

api = HfApi()

In [None]:
# Create your repo first to upload the model
api.create_repo(repo_id="convobot")

ValueError: ignored

In [None]:
# Upload your model to huggingface. You can clone the repo anytime to use the model.
import os

model_pth = "theo/model"

files = os.listdir(model_pth)
for fi in files:
    print(os.path.join(model_pth, fi))

    api.upload_file(
        path_or_fileobj=os.path.join(model_pth, fi),
        path_in_repo=fi,
        repo_id="onyxify/convobot",
        repo_type="model",
    )