## Subsetting Reddit Climate Data
by *Santiago Segovia*

Lines of code: ~ 100

This notebook subsets the Reddit climate change data by date and by subreddit. The resulting files are used to fine-tune a sentiment analysis model. The code also assigns `positive` and `negative` sentiment labels and drops the comments classified as neutral.

## I. Initial Set-up

In [None]:
import pandas as pd

from google.colab import drive

In [None]:
# Mount GDrive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
# Load data (takes 2 mins to load `comments`)
data_path = "/content/drive/Shareddrives/adv-ml-project/Data/"
comments = pd.read_csv(data_path + "the-reddit-climate-change-dataset-comments.csv")
posts = pd.read_csv(data_path + "the-reddit-climate-change-dataset-posts.csv")

In [None]:
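# Choose how subreddits are selected: by a minimum number of comments
# (by_num_comments) or by an explicit list of subreddit names (by_categories)
by_num_comments = False
by_categories = not by_num_comments
if by_categories:
    categories_subset = ['collapse','futurology','canada','australia','the_donald']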

In [None]:
# Create a datetime column from the Unix timestamp (in seconds) stored in `created_utc`
comments['date'] = pd.to_datetime(comments['created_utc'], unit='s')
posts['date'] = pd.to_datetime(posts['created_utc'], unit='s')
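
As a small illustration of the conversion above, the sketch below applies `pd.to_datetime(..., unit='s')` to a toy frame and then filters it with a date string, as the date subset later in the notebook does. The frame `toy` and its values are made up for the example; only the column name `created_utc` comes from the dataset.

```python
import pandas as pd

# Hypothetical toy data: Unix timestamps in seconds
toy = pd.DataFrame({"created_utc": [1388534400, 1420070400, 1661990400]})

# Interpret the integers as seconds since 1970-01-01 00:00:00 UTC
toy["date"] = pd.to_datetime(toy["created_utc"], unit="s")

# pandas parses the ISO date string, so the comparison is done on datetimes
recent = toy[toy["date"] >= "2015-01-01"]
print(recent)
```

The first row falls on 2014-01-01 and is dropped; a row at exactly midnight on 2015-01-01 is kept because the comparison is inclusive.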

We create a function that assigns labels from the `sentiment` variable: negative scores map to `0`, positive scores map to `1`, and a score of exactly 0 is treated as neutral and left unlabeled:

In [None]:
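# Define label: 0 = negative, 1 = positive, None (missing) = neutral
def create_label(sentiment):
    if sentiment < 0:
        return 0
    elif sentiment > 0:
        return 1
    # A sentiment of exactly 0 (or a missing value) stays unlabeled
    return None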

In [None]:
comments['label'] = comments['sentiment'].apply(create_label)
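
On a dataset with millions of rows, a row-by-row `apply` can be slow. The sketch below shows one possible vectorized alternative using `np.select`. It is not the approach the notebook takes, and the series `sentiment` is toy data made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sentiment scores: negative, neutral, positive
sentiment = pd.Series([-0.5, 0.0, 0.7])

# Conditions are evaluated in order; rows matching neither get the default (NaN)
labels = pd.Series(
    np.select([sentiment < 0, sentiment > 0], [0.0, 1.0], default=np.nan),
    index=sentiment.index,
)
print(labels)
```

Neutral rows end up as `NaN`, so a later `isna()` filter removes them just as it would for labels created with `apply`.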

## II. Data Subsetting

In [None]:
initial_comments_shape = comments.shape
initial_posts_shape = posts.shape
print("Number of records in comments df:", initial_comments_shape[0])
print("Number of records in posts df:", initial_posts_shape[0])

Number of records in comments df: 4600698
Number of records in posts df: 620908


In [None]:
# We keep columns we'll use in the analysis
comments = comments[['subreddit.name','date','body','sentiment','label']]
posts = posts[['subreddit.name','date','title']]

In [None]:
# Subset by date (keep every record from 2015 onwards)
comments = comments[comments['date']>='2015-01-01']
posts = posts[posts['date']>='2015-01-01']

In [None]:
mid_comments_shape = comments.shape
mid_posts_shape = posts.shape
print("Number of records in comments df:", mid_comments_shape[0])
print(" Reduction of", round((initial_comments_shape[0] - mid_comments_shape[0]) * 100 / initial_comments_shape[0], 2), "% vs. original")
print("Number of records in posts df:", mid_posts_shape[0])
print(" Reduction of", round((initial_posts_shape[0] - mid_posts_shape[0]) * 100 / initial_posts_shape[0], 2), "% vs. original")

Number of records in comments df: 4338011
 Reduction of 5.71 % vs. original
Number of records in posts df: 566808
 Reduction of 8.71 % vs. original


In [None]:
# Subset by label (drop neutral comments, which have no label)
comments = comments[comments['label'].notna()]

In [None]:
mid_comments_shape = comments.shape
print("Number of records in comments df:", mid_comments_shape[0])
print(" Reduction of", round((initial_comments_shape[0] - mid_comments_shape[0]) * 100 / initial_comments_shape[0], 2), "% vs. original")

Number of records in comments df: 4013116
 Reduction of 12.77 % vs. original
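
The reduction figures compare the current number of rows with the original number of rows. Since the same expression is repeated after each subsetting step, it could be factored into a small helper. The sketch below is one way to do that; `pct_reduction` is a hypothetical name that does not appear in the notebook, and the row counts are toy values.

```python
def pct_reduction(initial_rows, current_rows, ndigits=2):
    """Percentage of rows removed relative to the initial row count."""
    if initial_rows <= 0:
        raise ValueError("initial_rows must be positive")
    return round((initial_rows - current_rows) * 100 / initial_rows, ndigits)

# Toy example: 50 of 200 rows were dropped
print(" Reduction of", pct_reduction(200, 150), "% vs. original")
```

With these toy values the helper returns `25.0`.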


In [None]:
# Count the number of comments per subreddit (used by the subreddit filter below)
def count_categories(categories):
    category_counts = {}
    for category in categories:
        if category in category_counts:
            category_counts[category] += 1
        else:
            category_counts[category] = 1

    return list(category_counts.items())

In [None]:
subreddits = count_categories(comments['subreddit.name'])
sorted_subreddits = sorted(subreddits, key=lambda x: x[1], reverse=True)
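
For reference, pandas can produce the same sorted list of `(name, count)` pairs directly with `Series.value_counts()`, which sorts in descending order by default. This is a sketch of an alternative, not the notebook's method, and the series `names` is toy data made up for the example.

```python
import pandas as pd

# Hypothetical subreddit names, one entry per comment
names = pd.Series(["collapse", "canada", "collapse", "futurology", "collapse", "canada"])

# value_counts() counts occurrences and sorts them from most to least frequent
sorted_subreddits = list(names.value_counts().items())
print(sorted_subreddits)
```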

In [None]:
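def drop_tuples_below_threshold(tuples_list, by_num_comments, by_categories,
                                threshold=None, cat_to_keep=None):
    """Select subreddits from a list of (name, count) tuples.

    Returns the names to keep and their (name, count) tuples, either for
    counts >= threshold or for names listed in cat_to_keep.
    """
    if by_num_comments:
        to_keep = []
        cat_num = []
        for name, count in tuples_list:
            if count >= threshold:
                to_keep.append(name)
                cat_num.append((name, count))
        return to_keep, cat_num
    elif by_categories:
        cat_num = []
        for name, count in tuples_list:
            if name in cat_to_keep:
                cat_num.append((name, count))

        return cat_to_keep, cat_num
    # Neither mode selected: fail loudly instead of returning None
    raise ValueError("Set either by_num_comments or by_categories to True")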

In [None]:
# Keep subreddits by comment-count threshold or by the explicit category list
if by_num_comments:
    categories, counts_categories = drop_tuples_below_threshold(sorted_subreddits, by_num_comments, by_categories,
                                                                threshold=100000)
else:
    categories, counts_categories = drop_tuples_below_threshold(sorted_subreddits, by_num_comments, by_categories,
                                                                cat_to_keep=categories_subset)

In [None]:
counts_categories

[('collapse', 88010),
 ('futurology', 83235),
 ('canada', 59037),
 ('australia', 46267),
 ('the_donald', 30492)]
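
The two selection modes (minimum comment count or explicit list of names) can also be expressed on a pandas Series of counts. The sketch below is a minimal illustration under that assumption; `select_subreddits` is a hypothetical helper that is not part of the notebook, and the counts are toy values.

```python
import pandas as pd

def select_subreddits(counts, threshold=None, keep=None):
    """Filter a Series of counts indexed by subreddit name."""
    if threshold is not None:
        # Keep subreddits with at least `threshold` comments
        return counts[counts >= threshold]
    if keep is not None:
        # Keep only the named subreddits, in the order they appear in `counts`
        return counts[counts.index.isin(keep)]
    raise ValueError("Provide either `threshold` or `keep`.")

# Hypothetical comment counts per subreddit
counts = pd.Series({"collapse": 50, "futurology": 40, "canada": 5, "news": 3})

by_threshold = select_subreddits(counts, threshold=10)
by_name = select_subreddits(counts, keep=["canada", "collapse"])
print(by_threshold)
print(by_name)
```

The index of the returned Series can then be passed to `isin` to filter the comments and posts frames.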

In [None]:
comments = comments[comments['subreddit.name'].isin(categories)]
posts = posts[posts['subreddit.name'].isin(categories)]

In [None]:
end_comments_shape = comments.shape
end_posts_shape = posts.shape
print("Number of records in comments df:", end_comments_shape[0])
print(" Reduction of", round((initial_comments_shape[0] - end_comments_shape[0]) * 100 / initial_comments_shape[0], 2), "% vs. original")
print("Number of records in posts df:", end_posts_shape[0])
print(" Reduction of", round((initial_posts_shape[0] - end_posts_shape[0]) * 100 / initial_posts_shape[0], 2), "% vs. original")

Number of records in comments df: 307041
 Reduction of 93.33 % vs. original
Number of records in posts df: 18775
 Reduction of 96.98 % vs. original


## III. Export Data

In [None]:
# Export files (strings are quoted so commas and quotes in comment bodies survive)
import csv

comments.reset_index(drop=True, inplace=True)
posts.reset_index(drop=True, inplace=True)
# `data_path` already ends with "/", so the subfolder is appended without a leading slash
if by_num_comments:
    comments.to_csv(data_path + 'by_threshold/comments_filtered.csv', quoting=csv.QUOTE_NONNUMERIC, index=False)
    posts.to_csv(data_path + 'by_threshold/posts_filtered.csv', quoting=csv.QUOTE_NONNUMERIC, index=False)
else:
    comments.to_csv(data_path + 'by_category/comments_filtered.csv', quoting=csv.QUOTE_NONNUMERIC, index=False)
    posts.to_csv(data_path + 'by_category/posts_filtered.csv', quoting=csv.QUOTE_NONNUMERIC, index=False)
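
To check that free-text columns survive the export, the sketch below writes a toy frame with `quoting=csv.QUOTE_NONNUMERIC` and reads it back. The temporary directory, the frame `toy`, and its contents are assumptions made for the example; the column names and the output file name mirror the ones used above.

```python
import csv
import os
import tempfile

import pandas as pd

# Hypothetical comment whose body contains both a comma and double quotes
toy = pd.DataFrame({
    "subreddit.name": ["collapse"],
    "body": ['He said "hi", then left'],
    "label": [1.0],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "by_category")
    os.makedirs(out_dir, exist_ok=True)  # to_csv does not create missing folders
    out_path = os.path.join(out_dir, "comments_filtered.csv")

    # Quote every non-numeric field; embedded quotes are doubled on write
    toy.to_csv(out_path, quoting=csv.QUOTE_NONNUMERIC, index=False)
    restored = pd.read_csv(out_path)

print(restored)
```

Building the path with `os.path.join` avoids having to track whether a base path ends with a slash.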