# Hong Kong Subreddit Data 2010-2018 - Pushshift API (Part 1)
The HongKong subreddit was started in November 2009.

There are a number of ways to obtain Reddit data:
- Web scraping using packages such as Beautiful Soup or Selenium
- The Reddit API using PRAW
- A third-party API such as Pushshift.io

The first method requires a large amount of time to collect 8 years' worth of submissions (posts created by users) and comments.
The second method has a limitation: only the 1000 latest submissions/comments are available.
Given these constraints, I chose the third method. More information on Pushshift can be found [here](https://pushshift.io/author/stuck_in_the_matrix/).

Note: Pushshift's API documentation is not fully up to date, so browsing the pushshift subreddit may help resolve your issues.

In [None]:
import time
import datetime
import json
import requests

# Getting the data

Pushshift.io provides an API for querying Reddit submissions and comments. Submissions and comments are served by separate endpoints, so the function below needs to be run twice, once for each.

Although each query returns at most 1000 items, the `after` parameter restricts the search to items created within a given number of days before the current date. For example, to obtain data from 1 year ago until now, you would pass `365d` in the `after` parameter of the API.
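As a small illustration of the `after` parameter, the sketch below only formats the query URL for a given look-back window and makes no network request. The helper name `build_pushshift_url` is hypothetical; the endpoint and the `subreddit`, `size`, and `after` parameters are the ones used in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.pushshift.io/reddit/search'

def build_pushshift_url(data_type, days_ago, subreddit='HongKong', size=1000):
    """Format a search URL for items created within the last `days_ago` days.

    Hypothetical helper for illustration only: it builds a string and
    does not contact the API.
    """
    params = {'subreddit': subreddit, 'size': size, 'after': '{}d'.format(days_ago)}
    return '{}/{}/?{}'.format(BASE_URL, data_type, urlencode(params))

# submissions from the last 365 days
url = build_pushshift_url('submission', 365)
print(url)
```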

Since the aim is to obtain 8 years of data, the script below collects up to 1000 items per iteration, and each new query slightly overlaps the previous one.
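To see where the overlap comes from, here is a minimal offline sketch of such a loop. It rests on several assumptions: `fake_fetch` and `paginate` are hypothetical stand-ins (no request is made), results are taken to come back oldest first, the page size is shrunk to 3, and the look-back window is rounded up to whole days. Because the window is a whole number of days, each new query starts at or before the last item already received, so some items come back twice.

```python
DAY = 86400   # seconds in a day
HOUR = 3600   # seconds in an hour

def fake_fetch(items, now, days_ago, size):
    """Hypothetical stand-in for one API call: oldest-first items newer than the cutoff."""
    cutoff = now - days_ago * DAY
    return [item for item in items if item['created_utc'] > cutoff][:size]

def paginate(items, now, start_epoch, size=3):
    """Collect items by repeatedly querying whole-day look-back windows."""
    remaining_days = -(-(now - start_epoch) // DAY)  # ceiling division
    collected = []
    while remaining_days > 0:
        batch = fake_fetch(items, now, remaining_days, size)
        if not batch:
            break
        collected.extend(batch)
        last = batch[-1]['created_utc']
        # round up so the next window starts at or before the last item,
        # but always move forward by at least one day so the loop ends
        remaining_days = min(-(-(now - last) // DAY), remaining_days - 1)
    return collected

now = 240 * HOUR  # pretend "now" is 10 days after epoch 0
# made-up items, with creation times given in hours since epoch 0
items = [{'id': name, 'created_utc': hours * HOUR}
         for name, hours in [('a', 12), ('b', 29), ('c', 38), ('d', 43), ('e', 132)]]

ids = [item['id'] for item in paginate(items, now, start_epoch=0)]
print(ids)
```

In this toy run every item is collected, and `b`, `c`, and `e` each appear twice, which is why a de-duplication step is needed afterwards.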

In [None]:
def get_reddit_data(start_date_epoch, data_type):
    """
    Obtain all HongKong subreddit data from a start date until today.

    :param start_date_epoch: Epoch timestamp in seconds. If you have milliseconds, divide by 1000.
    :param data_type: String, either 'comment' or 'submission'
    :returns: List of data_type items (may contain duplicates)
    """
    end_time = datetime.datetime.fromtimestamp(round(datetime.datetime.now().timestamp()))

    def days_until_end(epoch):
        # whole days between epoch and end_time, rounded up so that the next
        # query window starts at or before this timestamp (overlap, not a gap)
        delta = end_time - datetime.datetime.fromtimestamp(round(epoch))
        return delta.days + (1 if delta.seconds else 0)

    remaining_days = days_until_end(start_date_epoch)
    data = []

    while remaining_days > 0:
        # size = maximum number of items returned per request
        # after = only return items created within the last N days
        response = requests.get(
            'https://api.pushshift.io/reddit/search/{}/'.format(data_type),
            params={'subreddit': 'HongKong', 'size': 1000,
                    'after': '{}d'.format(remaining_days)})
        response.raise_for_status()
        batch = response.json()['data']  # named batch to avoid shadowing the json module
        if not batch:  # nothing newer left to fetch
            break
        data.extend(batch)

        # continue from the last item received (assumes items arrive oldest first);
        # always advance by at least one day so the same query is never repeated
        remaining_days = min(days_until_end(batch[-1]['created_utc']), remaining_days - 1)

        # minimum delay requested by Pushshift to avoid hitting its servers too hard
        time.sleep(1)

    return data

## Obtain a list of raw submissions and comments data
- Obtaining the comments data may take a while.
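The start date used in the cells below, `1262304000`, is midnight UTC on 1 January 2010 expressed as an epoch timestamp in seconds. A short sketch of how such a value can be derived with the standard library:

```python
import datetime

# midnight UTC on 1 January 2010 as an epoch timestamp in seconds
start = datetime.datetime(2010, 1, 1, tzinfo=datetime.timezone.utc)
start_epoch = int(start.timestamp())
print(start_epoch)  # 1262304000
```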

In [None]:
# obtain submissions data starting from Jan, 2010
submissions = get_reddit_data(1262304000, 'submission')
#Json format
json_data_submissions ={'data':submissions}

In [None]:
# output submissions data into file as a checkpoint
with open('hongkong_submissions.json', 'w') as outfile:  
    json.dump(json_data_submissions, outfile)

In [None]:
# obtain comments data starting from Jan, 2010
comments = get_reddit_data(1262304000, 'comment')
#Json format
json_data_comments ={'data':comments}

In [None]:
# output comments data into file as a checkpoint
with open('hongkong_comments.json', 'w') as outfile:  
    json.dump(json_data_comments, outfile)
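If the notebook is restarted, a checkpoint can be loaded back instead of querying the API again. The sketch below shows the round trip on a made-up sample row and a temporary file, so it runs on its own; the file name `checkpoint.json` is only an illustration.

```python
import json
import os
import tempfile

# made-up sample in the same {'data': [...]} layout as the checkpoints
sample = {'data': [{'id': 'a1', 'created_utc': 1262304000}]}

path = os.path.join(tempfile.mkdtemp(), 'checkpoint.json')
with open(path, 'w') as outfile:
    json.dump(sample, outfile)

# reload the list of items from the checkpoint file
with open(path) as infile:
    restored = json.load(infile)['data']

print(restored == sample['data'])
```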

# Removing duplicates from the raw comments and submissions data
- Since consecutive API calls overlap, the comments and submissions need to be de-duplicated.
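As a point of comparison with the functions defined in the cells below, de-duplication by `id` can also be sketched with a set of already-seen ids. This is an alternative illustration, not the approach this notebook uses; `dedupe_by_id` is a hypothetical name and the sample rows are made up.

```python
def dedupe_by_id(items):
    """Keep the first occurrence of each id, preserving the original order."""
    seen = set()
    unique = []
    for item in items:
        if item['id'] not in seen:
            seen.add(item['id'])
            unique.append(item)
    return unique

sample = [{'id': 'a1'}, {'id': 'b2'}, {'id': 'a1'}, {'id': 'c3'}, {'id': 'b2'}]
unique_ids = [item['id'] for item in dedupe_by_id(sample)]
print(unique_ids)
```

Each item is visited once, so the cost grows linearly with the size of the list and does not depend on the list being sorted by time.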

## Submissions

In [None]:
submission_list = submissions.copy()

In [None]:
def get_duplicate_id_list(data_type_list):
    """
    Gets a list of duplicate items by id
    
    :param data_type_list: A list of submissions/comments.
    :returns: List of duplicated items within data_type_list
    """
    duplicate_ids = []
    
    for i in range(len(data_type_list)-1):
        data_id = data_type_list[i]['id']
        start_time = datetime.datetime.fromtimestamp(round(data_type_list[i]['created_utc']))
        
        for j in range(i+1,len(data_type_list)):
            end_time = datetime.datetime.fromtimestamp(round(data_type_list[j]['created_utc']))
            difference_days = (end_time-start_time).days
            
            if data_id == data_type_list[j]['id']: # if duplicate add to list
                duplicate_ids.append(data_id)
                break
            elif difference_days > 2: #break if cannot find duplicate after 2 days of post
                break
                
    return duplicate_ids

"""
Gets a list of duplicate items
"""
def get_duplicate_id_list(submission_list):
    duplicate_ids = []
    for i in range(len(submission_list)-1):
        submission_id = submission_list[i]['id']
        for j in range(i+1,len(submission_list)):
            if submission_id == submission_list[j]['id']:
                duplicate_ids.append(submission_id)
                break
    return duplicate_ids

In [None]:
def remove_duplicates(data_type_list):
    """
    Removes duplicate items from data_type_list
    
    :param data_type_list: A list of submissions/comments.
    :returns: List of de-duplicated data_type_list
    """
    duplicate_ids = get_duplicate_id_list(data_type_list)
    
    for ids in duplicate_ids:
        for index, element in reversed(list(enumerate(data_type_list))):
            if element['id'] == ids: # find and remove the first duplicate from the back of the list
                data_type_list.pop(index)
                break
                
    return data_type_list

In [None]:
filtered_submission_list = remove_duplicates(submission_list)

In [None]:
# Json format
json_data_submission_filtered ={'data':filtered_submission_list}
# output submissions data into file as a checkpoint
with open('hongkong_submissions_filtered.json', 'w') as outfile:  
    json.dump(json_data_submission_filtered, outfile)

## Comments

In [None]:
comments_list = comments.copy()

In [None]:
filtered_comments_list = remove_duplicates(comments_list)

In [None]:
# Json format
json_data_comments_filtered ={'data':filtered_comments_list}
# output comments data into file as a checkpoint
with open('hongkong_comments_filtered.json', 'w') as outfile:  
    json.dump(json_data_comments_filtered, outfile)

## Combining comments to their submission posts.

Submissions and comments are matched through the submission's `id` and the comment's `link_id`, respectively.

To match them efficiently, the comments are first grouped by submission. The submission list is then iterated over, and the relevant group of comments is inserted into each submission.
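The grouping step can be sketched in a single pass with `collections.defaultdict`. This is a simplified illustration rather than the function defined in the next cell; `group_comments_by_submission` is a hypothetical name and the sample comments are made up. A comment's `link_id` carries a `t3_` prefix in front of the submission id, so the first three characters are dropped before matching.

```python
from collections import defaultdict

def group_comments_by_submission(comments):
    """Return a dict mapping submission id -> list of its comments."""
    grouped = defaultdict(list)
    for comment in comments:
        submission_id = comment['link_id'][3:]  # drop the 't3_' prefix
        grouped[submission_id].append(comment)
    return dict(grouped)

sample_comments = [
    {'id': 'c1', 'link_id': 't3_abc'},
    {'id': 'c2', 'link_id': 't3_xyz'},
    {'id': 'c3', 'link_id': 't3_abc'},
]
grouped = group_comments_by_submission(sample_comments)
summary = {key: [c['id'] for c in value] for key, value in grouped.items()}
print(summary)
```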

In [None]:
def combine_comments(json_comments):
    """
    Combine comments that share the same submission id
    
    :param json_comments: A list of comments
    :returns: Dictionary. key: submission id, value: list of comments
    """
    submission_record = {}
    
    for i in range(len(json_comments)):
        if json_comments[i]['link_id'][3:] not in submission_record: # create new key if id doesn't exist in submission_record
            comments = [json_comments[i]]
            comment_first_date = datetime.datetime.fromtimestamp(round(json_comments[i]['created_utc']))
            
            for j in range(i+1,len(json_comments)):
                #if id matches key in submission_record, add it.
                if json_comments[j]['link_id'][3:] == json_comments[i]['link_id'][3:]: 
                    comments.append(json_comments[j])
                else:
                    # if id doesn't match and if post is greater than 60 days old, stop the search.
                    comment_latest_date = datetime.datetime.fromtimestamp(round(json_comments[j]['created_utc']))
                    remaining_days = (comment_latest_date-comment_first_date).days
                    if remaining_days > 60:
                        break
            # add list of comments to submission_record key (id)        
            submission_record[json_comments[i]['link_id'][3:]] = comments
            
    return submission_record 

In [None]:
dict_comments = combine_comments(filtered_comments_list)

In [None]:
# output grouped comments data into file as a checkpoint
with open('hongkong_combined_comments.json', 'w') as outfile:  
    json.dump(dict_comments, outfile)

In [None]:
def insert_comments_submission(submission_list, dict_comments):
    """
    Insert grouped comments to submission list by id
    
    :param submission_list: A list of submissions
    :param dict_comments: A dictionary of comments
    :returns: A list of submissions
    """
    # shallow copy: the list is new, but the submission dicts are shared with submission_list
    submissions = submission_list.copy()
    for submission in submissions:
        if submission['id'] in dict_comments:
            submission['comments'] = dict_comments[submission['id']]
    return submissions

In [None]:
complete_json = insert_comments_submission(filtered_submission_list, dict_comments)

In [None]:
# Complete processing output to json file
with open('hongkong_complete.json', 'w') as outfile:  
    json.dump(complete_json, outfile)