# Training Data Collection

The given dataset may be too small to fine-tune a model. Assuming it is a good representation of real-world production data, I wanted to keep it separate as a final test set, so I needed to create a new training dataset.

Given the limited time, I decided to use readily available crypto-related Reddit data found on Kaggle:

* https://www.kaggle.com/datasets/leukipp/reddit-crypto-data: Reddit posts from various crypto-related subreddits.
* https://www.kaggle.com/datasets/gpreda/reddit-cryptocurrency: Reddit posts and comments from the CryptoCurrency subreddit.

I built the new training dataset from these two sources. The goal is to use it to fine-tune the fastest well-performing model: DistilBERT.

Some things to watch out for when creating the dataset:

* Be careful with data leakage. We don't want examples from the test set to appear in the training data.
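As a minimal sketch of what such a leakage check could look like, the snippet below compares texts after lowercasing and collapsing whitespace, so trivially different copies still count as the same example. The helper names `normalize` and `remove_leaked` and the toy strings are assumptions made for illustration; they are not part of this notebook.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so near-identical copies compare equal."""
    return re.sub(r'\s+', ' ', text).strip().lower()


def remove_leaked(train_texts, test_texts):
    """Return the training texts whose normalized form is absent from the test set."""
    test_keys = {normalize(t) for t in test_texts}
    return [t for t in train_texts if normalize(t) not in test_keys]


# Toy data, invented for illustration only
train = ['BTC to the moon', 'ETH gas fees are too high', 'btc  to the MOON ']
test = ['BTC to the moon']

clean_train = remove_leaked(train, test)
print(clean_train)
```

Normalizing before comparing is a design choice: it is stricter than an exact string match, at the cost of possibly discarding a few training texts that only look alike.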


In [2]:
import pandas as pd
import re

import emoji
from transformers import pipeline
from tqdm.notebook import tqdm_notebook



In [3]:
tqdm_notebook.pandas()

In [4]:
pd.set_option('max_colwidth', None)

In [5]:
# posts published between 2022-01-01 00:00:13 and 2022-12-31 23:17:08
communities = ['cryptocurrency', 'bitcoin', 'ethereum', 'dogecoin']

In [6]:
# Load the submissions of each subreddit, then stack them into a single frame
frames = []

for community in communities:
    submissions = pd.read_csv(f'../datasets/reddit_posts/{community}/submission.csv', usecols=['submission', 'subreddit', 'selftext'])
    frames.append(submissions)

df = pd.concat(frames)
df = df.rename(columns={'selftext': 'text'})

In [7]:
len(df)

391312

In [211]:
# Augment data from https://www.kaggle.com/datasets/gpreda/reddit-cryptocurrency
comments = pd.read_csv('../datasets/reddit_crypto_posts_comments/reddit_cc.csv', usecols=['id', 'body'])

In [212]:
comments['subreddit'] = 'cryptocurrency'
comments = comments.rename(columns={'id': 'submission', 'body': 'text'})

In [213]:
len(comments)

40918

In [214]:
comments = comments.drop_duplicates(subset='text')

In [215]:
df = pd.concat([df, comments])

In [216]:
len(df)

424371

In [218]:
df['subreddit'].value_counts()

cryptocurrency    276246
bitcoin            66576
dogecoin           58320
ethereum           23229
Name: subreddit, dtype: int64

In [222]:
# Check for duplicate text
df['text'].value_counts()[:10]

[removed]

In [225]:
df = df.dropna()

# Drop every copy of a duplicated text (keep=False), not just the repeats,
# because duplicated texts are most likely moderator or code-of-conduct boilerplate
df = df.drop_duplicates(subset='text', keep=False)
# Drop the placeholders Reddit leaves behind for removed or deleted posts
df = df[(df['text'] != '[removed]') & (df['text'] != '[deleted]')]
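To make the effect of `keep=False` concrete, here is a small sketch on a toy frame (the rows are invented for illustration). With the default `keep='first'`, one copy of each duplicated text survives; with `keep=False`, every copy is dropped.

```python
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({'text': ['rule reminder', 'unique post', 'rule reminder', 'another post']})

# Default behaviour: the first 'rule reminder' is kept
kept_first = toy.drop_duplicates(subset='text')

# keep=False: both 'rule reminder' rows are removed
kept_none = toy.drop_duplicates(subset='text', keep=False)

print(kept_first['text'].tolist())
print(kept_none['text'].tolist())
```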

In [226]:
len(df)

90810

### Remove comments that also appear in the original crypto dataset to prevent data leakage

In [227]:
crypto_comments = pd.read_csv('../datasets/crypto_reddit_sentiment.csv', usecols=['Reddit URL', 'Comment Text'])

In [228]:
def extract_reddit_community(url):
    # Raw strings avoid invalid-escape warnings; the captured group is the subreddit name
    community_search = re.search(r'reddit\.com/r/(\w+)/', url, re.IGNORECASE)
    return community_search.group(1).lower()


def extract_submission(url):
    # The captured group is the submission id that follows /comments/
    submission_search = re.search(r'reddit\.com/r/\w+/comments/(\w+)', url, re.IGNORECASE)
    return submission_search.group(1).lower()
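As a quick illustration of what the two helpers return, the sketch below repeats them in self-contained form and applies them to a made-up URL. The URL is hypothetical and only follows the usual `reddit.com/r/<subreddit>/comments/<id>/` shape; it is not taken from the dataset.

```python
import re


def extract_reddit_community(url):
    # Capture the subreddit name that follows /r/
    match = re.search(r'reddit\.com/r/(\w+)/', url, re.IGNORECASE)
    return match.group(1).lower()


def extract_submission(url):
    # Capture the submission id that follows /comments/
    match = re.search(r'reddit\.com/r/\w+/comments/(\w+)', url, re.IGNORECASE)
    return match.group(1).lower()


# Hypothetical URL, for illustration only
url = 'https://www.reddit.com/r/Bitcoin/comments/abc123/some_title/'

community = extract_reddit_community(url)
submission = extract_submission(url)
print(community, submission)
```

Both helpers assume the URL matches the pattern. A URL with a different shape makes `re.search` return `None`, and the `.group(1)` call then raises an `AttributeError`.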

In [229]:
crypto_comments['subreddit'] = crypto_comments['Reddit URL'].apply(extract_reddit_community)
crypto_comments['submission'] = crypto_comments['Reddit URL'].apply(extract_submission)

In [230]:
crypto_comments = crypto_comments[crypto_comments['subreddit'].isin(communities)]

In [233]:
overlap = crypto_comments.merge(df, how='inner', left_on='Comment Text', right_on='text')

In [234]:
overlap

Unnamed: 0,Comment Text,Reddit URL,subreddit_x,submission_x,submission_y,subreddit_y,text
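The merge comes back empty, so no training text exactly matches a test comment and nothing needs to be dropped here. Had there been matches, one possible way to remove them is an anti-join with `isin`, sketched below on toy frames. The frames and the name `train_clean` are assumptions for illustration.

```python
import pandas as pd

# Toy frames, invented for illustration only
train = pd.DataFrame({
    'submission': ['a1', 'b2', 'c3'],
    'text': ['hodl', 'buy the dip', 'gm'],
})
test = pd.DataFrame({'Comment Text': ['buy the dip']})

# Keep only the training rows whose text does not occur among the test comments
train_clean = train[~train['text'].isin(test['Comment Text'])]

print(train_clean['text'].tolist())
```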


In [237]:
df.columns

Index(['submission', 'subreddit', 'text'], dtype='object')

In [240]:
# Convert only the emojis into text. URLs are left alone because DistilBERT can handle them (it doesn't map them to UNK tokens).
df['text'] = df['text'].apply(emoji.demojize)

In [None]:
df.to_csv('training.csv', index=False)