## Sample Reddit comments

Collect Reddit comment data to build a classifier that can identify toxic comments.

The script samples all comments from "top"-sorted posts within a list of subreddits.

Each comment is stored as a row in a CSV file with the following features:
- comment ID#
- subreddit name
- post ID#
- parent ID#
- comment timestamp
- comment age since post time
- comment age since now
- user ID#
- user name
- user created date
- user comment karma
- user link karma
- number of replies to the comment
- controversial flag state
- comment vote score
- comment text (converted to ASCII)
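
The derived features in this list (the two comment ages and the ASCII-only text) can be sketched as below. This is a minimal illustration, assuming timestamps are Unix times in seconds; the helper names `to_ascii` and `comment_ages` are hypothetical and do not appear in the script.

```python
import time


def to_ascii(text):
    """Drop every character that has no ASCII encoding."""
    return text.encode().decode('ascii', errors='ignore')


def comment_ages(comment_utc, post_utc, now_utc=None):
    """Return (age since post time, age since now), both in seconds."""
    if now_utc is None:
        now_utc = time.time()  # current Unix time in seconds
    return comment_utc - post_utc, now_utc - comment_utc
```

Note that characters are dropped rather than transliterated, so accented letters and emoji disappear from the stored text.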

Note: to install PRAW, run:
pip install praw


In [None]:
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---

import pandas as pd
import numpy as np
import datetime
import time
import csv

## Create Reddit instance and log in

In [None]:
import praw

reddit = praw.Reddit(client_id='7BHzw3jn54Hm7Q',
                     client_secret='Qw9lMWDx99daGcJ1vX6xX_peL3c',
                     user_agent='testscript',
                    )
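
Credentials written directly into a notebook are easy to leak when the notebook is shared. One possible alternative, sketched here under the assumption that the values are exported as environment variables, is to build the keyword arguments at run time. The function name `load_reddit_credentials` and the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET` and `REDDIT_USER_AGENT` are hypothetical choices for illustration.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Build keyword arguments for praw.Reddit from environment variables.

    The variable names below are hypothetical; use whatever names you export.
    """
    names = {
        'client_id': 'REDDIT_CLIENT_ID',
        'client_secret': 'REDDIT_CLIENT_SECRET',
        'user_agent': 'REDDIT_USER_AGENT',
    }
    missing = [var for var in names.values() if var not in env]
    if missing:
        raise KeyError('missing environment variables: ' + ', '.join(missing))
    return {key: env[var] for key, var in names.items()}
```

With the variables set, the instance could then be created with `reddit = praw.Reddit(**load_reddit_credentials())`.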



## Collecting posts and comments 

This cell collects all comments from 100 'top' sorted posts in a given list of subreddits. The collected comments include a number of features such as time of comment, user info, number of replies and voting score.

The comments are written to a CSV file.
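
For later analysis, the CSV can be read back with typed columns. The sketch below assumes the column names from the `header` list in the collection cell and uses a made-up sample row in memory instead of a real output file; `read_comments` is a hypothetical helper, not part of the script.

```python
import csv
import io

# Column names as defined in the collection cell's header list.
HEADER = ['comment_ID', 'sub_name', 'post_ID', 'parent_ID',
          'time', 'age_re_post', 'age_re_now',
          'u_id', 'u_name', 'u_created', 'u_comment_karma', 'u_link_karma',
          'num_replies', 'controversy', 'score', 'text']

FLOAT_COLS = ('time', 'age_re_post', 'age_re_now', 'u_created')
INT_COLS = ('u_comment_karma', 'u_link_karma', 'num_replies',
            'controversy', 'score')


def read_comments(csvfile):
    """Parse rows written by the collection cell into typed dicts."""
    rows = []
    for row in csv.DictReader(csvfile):
        for col in FLOAT_COLS:
            row[col] = float(row[col])
        for col in INT_COLS:
            row[col] = int(row[col])
        rows.append(row)
    return rows


# Made-up sample row, for illustration only.
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(HEADER)
writer.writerow(['c1', 'politics', 'p1', 't3_p1',
                 1550000600.0, 600.0, 3000.0,
                 'u1', 'example_user', 1500000000.0, 120, 45,
                 2, 0, 17, 'An example comment, with a comma.'])
sample.seek(0)

comments = read_comments(sample)
```

For a real output file, the same function could be given a handle opened with `open(csvfilename, newline='', encoding='utf-8')`.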

NOTE 1: Reddit rate-limits API access, so sampling takes a very long time: hours for large comment trees.

NOTE 2: On 2/19/19, sub.comments.replace_more(limit=0) (remove all MoreComments placeholders) was changed to sub.comments.replace_more(limit=None) (expand all MoreComments placeholders). Data sampled before this date does not include comments from deeper in the comment tree.
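
The difference between the two settings can be pictured with a toy model. This is not PRAW code: the classes `Comment` and `More` and the functions `resolve` and `flatten` are hypothetical stand-ins, and the hidden comments are held locally, whereas PRAW fetches them over the network. The sketch only assumes that a placeholder hides part of the tree and that it is either dropped or replaced by what it hides.

```python
class Comment:
    def __init__(self, body, replies=()):
        self.body = body
        self.replies = list(replies)


class More:
    """Placeholder standing in for comments that have not been loaded."""

    def __init__(self, hidden):
        self.hidden = list(hidden)


def resolve(nodes, expand):
    """Drop placeholders (expand=False) or replace them with their contents."""
    out = []
    for node in nodes:
        if isinstance(node, More):
            if expand:
                out.extend(resolve(node.hidden, expand))
        else:
            out.append(Comment(node.body, resolve(node.replies, expand)))
    return out


def flatten(nodes):
    """Return comment bodies in breadth-first order."""
    flat = []
    queue = list(nodes)
    while queue:
        node = queue.pop(0)
        flat.append(node.body)
        queue.extend(node.replies)
    return flat


tree = [
    Comment('a', [Comment('b'), More([Comment('c', [Comment('d')])])]),
    More([Comment('e')]),
]
dropped = flatten(resolve(tree, expand=False))   # placeholders removed
expanded = flatten(resolve(tree, expand=True))   # placeholders expanded
```

In this toy tree, dropping the placeholders keeps two comments while expanding them keeps all five, which mirrors why the earlier samples are smaller.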


In [None]:
header = ['comment_ID', 'sub_name','post_ID', 'parent_ID', 
          'time', 'age_re_post','age_re_now',
          'u_id', 'u_name', 'u_created', 'u_comment_karma', 'u_link_karma',
          'num_replies', 'controversy', 'score', 'text']

# give list of subreddit names to sample from
# subnames = ['politics', 'democrats', 'republicans']
subnames = ['politics']
# subnames = ['aww']
# subnames = ['photography']
#subnames = ['todayilearned']

# create output filename, appending unique time string
csvfilename = ('comment_sample_' + '_'.join(subnames) + '_' +
    datetime.datetime.now().strftime('%y%m%d_%H%M%S') + '.csv')

# number of posts ('submissions') to sample from each subreddit
numsubs = 100

with open(csvfilename, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)          
    writer.writerow(header)

    # sample from each sub in sublist
    for subname in subnames:
        i = 1
        # sample all comments from each of numsubs posts
        for sub in reddit.subreddit(subname).top(limit=numsubs):
            print('post %d : %s'%(i,sub.title))
            i += 1
            # expand comment tree to include all comments 
            sub.comments.replace_more(limit=None)
            print(' ',len(sub.comments.list()),'comments')
            print(' #collected: ',end='')
            for comnum, com in enumerate(sub.comments.list()):
                try:
                    # skip comments whose score is still hidden
                    if not com.score_hidden:
                        # drop non-ASCII characters from the comment text
                        text = com.body.encode().decode('ascii', errors='ignore')
                        writer.writerow([com.id, subname, sub.id, com.parent_id,
                                         com.created_utc,
                                         com.created_utc - sub.created_utc,
                                         # time.time() is UTC-based, unlike utcnow().timestamp()
                                         time.time() - com.created_utc,
                                         com.author.id, com.author.name,
                                         com.author.created_utc,
                                         com.author.comment_karma,
                                         com.author.link_karma,
                                         len(com.replies.list()),
                                         com.controversiality,
                                         com.score,
                                         text])
                except Exception:
                    # skip comments with missing data (e.g. deleted authors)
                    pass
#                     print('**error:', com.body)

                if comnum % 250 == 0:
                    print(comnum,',',end='')
                    
            # write this post's comments to disk
            csvfile.flush()
print('\n\ndone')



In [None]:
# run this cell if you interrupt the kernel to close the open output file 
csvfile.close()