# Using the Pushshift API

Pushshift is a service that archives and indexes Reddit at regular intervals. It allows for higher-level search functionality and querying for Reddit comments and submissions, facilitating data collection for analysis and modeling. It leverages the requests library to return a json response that can then be parsed for the data of interest.

Resources: 
- Pushshift Endpoints: https://pushshift.io/
- Pushshift Documentation: https://github.com/pushshift/api
- Pushshift Subreddit: https://www.reddit.com/r/pushshift/comments/89pxra/pushshift_api_with_large_amounts_of_data/


In [5]:
import requests, time, csv, json, re
import pandas as pd

## Setting the base query syntax:

Setting the query url to the pushshift api

In [6]:
url = 'https://api.pushshift.io/reddit/search/'

Setting the parameters for the query. A full list of parameters can be found on: https://pushshift.io/api-parameters/

In [7]:
params = {'searchType':'submission',
          'subreddit':'conservative,libertarian',
          'sort':'desc',
          'size':10,
#           'before':,
#           'after':,
         }

Making the request.

In [8]:
response = requests.get(url, params=params)

Checking the url to make sure the query terms are correct and the server is responsive

In [9]:
response.status_code

200

The status code returned from the server is 200, meaning the query was accepted and there aren't any connection issues. Checking length of the json file.

In [10]:
len(response.json()['data'])

10

Length is 10, as expected. Assessing the file structure for keys of interest.

In [11]:
response.json()

{'data': [{'author': 'TheDailyCosco',
   'author_flair_background_color': '',
   'author_flair_css_class': 'Conservative',
   'author_flair_richtext': [{'e': 'text', 't': 'Conservative'}],
   'author_flair_template_id': '421b644c-8b38-11e1-bc5f-12313d18ad57',
   'author_flair_text': 'Conservative',
   'author_flair_text_color': 'dark',
   'author_flair_type': 'richtext',
   'author_fullname': 't2_30vro8jr',
   'author_patreon_flair': False,
   'body': "Your son's history teacher isn't teaching history. They're also an idiot.",
   'created_utc': 1553726504,
   'gildings': {'gid_1': 0, 'gid_2': 0, 'gid_3': 0},
   'id': 'ejj4mia',
   'link_id': 't3_b6aulk',
   'no_follow': True,
   'parent_id': 't3_b6aulk',
   'permalink': '/r/Conservative/comments/b6aulk/my_sons_history_teacher_taught_him_that_the/ejj4mia/',
   'retrieved_on': 1553726505,
   'score': 1,
   'send_replies': True,
   'stickied': False,
   'subreddit': 'Conservative',
   'subreddit_id': 't5_2qh6p'},
  {'author': '[deleted]',

Keys of interest are:
- author
- body
- created_utc
- link_id
- parent_id
- permalink
- subreddit
- subreddit_id

In [12]:
col_list = ['author',
            'body',
            'subreddit',
            'subreddit_id',
            'created_utc',
            'retrieved_on',
            'link_id',
            'parent_id',
            'permalink',
            ]

 ## Querying Reddit and saving raw data in .json format:

Writing a function for creating a logfile and formatting file names with a unique timestamp.

In [13]:
def filename_format_log(file_path, 
                        logfile = '../assets/file_log.txt', 
                        now = round(time.time()), 
                        file_description = None): 
   
    try:
        ext = re.search('(?<!^)(?<!\.)\.(?!\.)', file_path).start() 
    except:
        raise NameError('Please enter a relative path with a file extension.') 
    
    stamp = re.search('(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path).start()
    formatted_name = f'{file_path[:stamp]}{now}_{file_path[stamp:]}'  
    if not file_description:
        file_description = f'Pull: {time.asctime(time.gmtime(now))}'
    with open(logfile, 'a+') as f:
        f.write(f'{formatted_name}: {file_description}\n')
    return formatted_name, now, file_description

Writing a function for collecting comments and parsing into a dataframe with the features of interest, saving out the raw data for each pull. Request loop inspired: [(Source)](https://www.reddit.com/r/pushshift/comments/89pxra/pushshift_api_with_large_amounts_of_data/).

In [20]:
def reddit_query(subreddits, n_samples=1500, searchType='comment', before=None, after=None):
    url = f'https://api.pushshift.io/reddit/search/'
    last_comment = round(time.time())
    comment_list = []
    
    run = 1
    while len(comment_list) < n_samples:
        
        try:
            print(f'Starting query {run}')
            
            params = {'searchType':searchType,
              'subreddit':subreddits,
              'sort':'desc',
              'size':1000,
              'before':last_comment-1,
              'after':after,
             }
                
            response = requests.get(url, params = params)
            posts = response.json()['data']
            
            if len(posts) == 0:
                last_comment = last_comment
            else:
                last_comment = posts[-1]['created_utc']
                comment_list.extend(posts)
                timestamp = posts[-1]['created_utc']
                time.sleep(1) 
                run += 1
        except:
            if response.status_code != 200:
                return f'Check status. Error code: {response.status_code}'
            else:
                return 'Error. Pull not completed.'
    
    formatted_name, now, file_description = filename_format_log(file_path =f'../data/raw_{searchType}s.json', now=timestamp)
    with open(formatted_name, 'w+') as f:
        json.dump(comment_list, f)
    
    print(f'Saved and completed query and returned {len(comment_list)} {searchType}s.')
    print(f'Reddit text is ready for processing.')
    return print(f'Last timestamp was {timestamp}.')

Using the query function to collect 15 comments from the conservative subreddit.

In [21]:
reddit_query(subreddits='conservative',
             n_samples=15,
             searchType='comment')

Starting query 1
Saved and completed query and returned 15 comments.
Reddit text is ready for processing.
Last timestamp was 1553726115.


In [26]:
with open(f'../data/1553726115_raw_comments.json', 'r') as f:
    cons_sample_list = json.load(f)

In [27]:
len(cons_sample_list)

15

In [28]:
cons_sample_list[0]

{'author': 'banes_rule_of_two',
 'author_flair_background_color': None,
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_template_id': None,
 'author_flair_text': None,
 'author_flair_text_color': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_x1rvf',
 'author_patreon_flair': False,
 'body': "And Christians get a lot of shit for that. I'm not going to defend Christianity, I'm not a Christian. You asserted that Islam is fair and just. I've pointed out some evil shit that Islam endorses and promotes that leads me to conclude that Islam is also evil.",
 'created_utc': 1553726582,
 'gildings': {'gid_1': 0, 'gid_2': 0, 'gid_3': 0},
 'id': 'ejj4qpr',
 'link_id': 't3_b5tngt',
 'no_follow': True,
 'parent_id': 't1_ejizf2k',
 'permalink': '/r/Conservative/comments/b5tngt/the_uk/ejj4qpr/',
 'retrieved_on': 1553726583,
 'score': 1,
 'send_replies': True,
 'stickied': False,
 'subreddit': 'Conservative',
 'subreddit_id': 't5_2qh6p'}

Parsing the json file into a dataframe containing the features of interest.

In [29]:
def reddit_parse(sample):
    
    col_list = ['author',
                'body',
                'subreddit',
                'subreddit_id',
                'created_utc',
                'link_id',
                'parent_id',
                'permalink',
                ]
    
    comments_df = pd.DataFrame(sample)
    comments_df = comments_df[col_list]
    
    comments_df.rename(columns={'subreddit':'libertarian'}, inplace=True)
    comments_df['libertarian'] = comments_df['libertarian'].map({'Conservative':0, 'Libertarian':1})
    
    col_order = ['author',
                 'body',
                 'libertarian',
                 'created_utc',
                 'subreddit_id',
                 'parent_id',
                 'link_id',
                 'permalink',
                ]

    return comments_df[col_order]

Reviewing the shape of the dataframe to ensure correct transformation

In [30]:
cons_comments = reddit_parse(cons_sample_list)

In [31]:
cons_comments_df.shape

NameError: name 'cons_comments_df' is not defined

Shape corresponds with expected values. Reviewing the head of the dataframe to ensure data was correctly labeled. 

In [44]:
cons_comments_df.head()

Unnamed: 0,author,body,libertarian,created_on,retrieved_on,created_utc,subreddit_id,parent_id,link_id,permalink
0,rojindahar,"He already does Ketamine, 58 second mark: http...",0,Sun Dec 16 01:14:01 2018,Sun Dec 16 01:14:10 2018,1544922841,t5_2qh6p,t3_a6krv2,t3_a6krv2,/r/Conservative/comments/a6krv2/10yearold_boy_...
1,[deleted],[removed],0,Sun Dec 16 01:13:28 2018,Sun Dec 16 01:13:39 2018,1544922808,t5_2qh6p,t1_ebvclrz,t3_a6icni,/r/Conservative/comments/a6icni/to_guarantee_a...
2,[deleted],[removed],0,Sun Dec 16 01:13:15 2018,Sun Dec 16 01:13:26 2018,1544922795,t5_2qh6p,t1_ebvg44o,t3_a6icni,/r/Conservative/comments/a6icni/to_guarantee_a...
3,leadrain86,Actually that is quite the opposite. Conservat...,0,Sun Dec 16 01:13:09 2018,Sun Dec 16 01:13:20 2018,1544922789,t5_2qh6p,t1_ebvhrk2,t3_a6a7h7,/r/Conservative/comments/a6a7h7/one_year_ago_t...
4,[deleted],[removed],0,Sun Dec 16 01:11:49 2018,Sun Dec 16 01:12:00 2018,1544922709,t5_2qh6p,t3_a4llsj,t3_a4llsj,/r/Conservative/comments/a4llsj/keep_tyrants_l...
