# Retrieving data for our Tips bot #

In this notebook we will retrieve posts for *LifeProTips*, *ShittyLifeProTips*, *UnethicalLifeProTips* and *IllegalLifeProTips*.

---

To start, we import `praw` and `psaw` and configure the Reddit API.

In [1]:
import praw
from psaw import PushshiftAPI
import json

reddit = praw.Reddit(client_id='<your client id>',
                     client_secret='<your client secret>',
                     user_agent='<your user agent>',
                     username='<your user id>',
                     password='<your password>')

api = PushshiftAPI(reddit)
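
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch, the keyword arguments for `praw.Reddit` could instead be read from environment variables. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `reddit_kwargs` are hypothetical, chosen only for illustration.

```python
import os

# Hypothetical environment variable names (an assumption, not part of the notebook),
# mapped to the keyword arguments that praw.Reddit is called with above.
ENV_TO_KWARG = {
    'REDDIT_CLIENT_ID': 'client_id',
    'REDDIT_CLIENT_SECRET': 'client_secret',
    'REDDIT_USER_AGENT': 'user_agent',
    'REDDIT_USERNAME': 'username',
    'REDDIT_PASSWORD': 'password',
}

def reddit_kwargs(environ=os.environ):
    '''Build the keyword arguments for praw.Reddit from environment variables.'''
    # Fail early with a readable message if any variable is missing
    missing = [name for name in ENV_TO_KWARG if name not in environ]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {kwarg: environ[name] for name, kwarg in ENV_TO_KWARG.items()}
```

With the variables set, the configuration above would become `reddit = praw.Reddit(**reddit_kwargs())`.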

Define the retrieval functions.

In [2]:
def label(post_dict, tag):
    '''Put <tag> as the value for the post_dict key <label>.'''
    
    post_dict['label'] = tag

In [10]:
import datetime as dt

def post_retrieval(file, subreddit, date, count=500):
    '''Retrieve <count> posts from <subreddit> from <date> onwards and dump them into <file>.'''
    
    # Create empty post list
    posts = []
    
    submissions = api.search_submissions(after=date, subreddit=subreddit, limit=None, sort="date:asc")
    
    for submission in submissions:
        # Retrieve the selected fields from the submission object and store them in a post dictionary
        post = vars(submission)
        post_dict = {field: post[field] for field in post_fields[:-1]}  # 'label' is filled in below
        
        label(post_dict, subreddit)
        posts.append(post_dict)
        
        # Every 1000 posts, write everything collected so far into the data file, so that the progress
        # is not lost in case of internet connection issues or other problems
        if len(posts) % 1000 == 0:
            print("We're at: " + str(len(posts)))
            with open(file, 'w+') as f:
                json.dump(posts, f)
        
        # Stop once we have collected <count> posts
        if len(posts) >= count:
            break
    
    # Write the final list so that posts collected after the last checkpoint are not lost
    with open(file, 'w+') as f:
        json.dump(posts, f)
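
The periodic dump in `post_retrieval` opens the data file with `'w+'`, which truncates it before the new contents are written. If the process dies in the middle of a write, the previous checkpoint is gone too. One possible way around this, sketched below, is to write to a temporary file in the same directory and then rename it over the target. The helper name `dump_atomically` is hypothetical, and the sketch assumes the target directory already exists.

```python
import json
import os
import tempfile

def dump_atomically(posts, file):
    '''Write <posts> as JSON to a temporary file, then rename it over <file>.'''
    # The temporary file lives next to the target so the rename stays on one filesystem
    directory = os.path.dirname(os.path.abspath(file))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as f:
            json.dump(posts, f)
        # os.replace overwrites the target in a single step
        os.replace(tmp_path, file)
    except BaseException:
        # Clean up the temporary file and leave the old checkpoint untouched
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Inside the loop, `dump_atomically(posts, file)` would take the place of the `with open(file, 'w+')` block.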

Now, we define the names of the subreddits we want.

In [11]:
subreddits = ('lifeprotips', 'shittylifeprotips', 'unethicallifeprotips', 'illegallifeprotips')

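To manage several retrievals, we set a `file_id`: each run writes to its own dump file and continues from the last date in the previous run's dump.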

In [12]:
file_id = 1
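
The resume step that `file_id` enables can be sketched as a small helper. This is only an illustration under stated assumptions: the name `start_timestamp` is hypothetical, and the check for an empty dump is an addition that the notebook itself does not make. The default date is built with an explicit UTC time zone, because `timestamp()` on a naive `datetime` is interpreted in the machine's local time zone, so the same date can yield different start values on different machines.

```python
import datetime as dt
import json
import os

def start_timestamp(old_dump, default):
    '''Return the created_utc of the last post in <old_dump>, or <default> if there is none.'''
    if os.path.isfile(old_dump):
        with open(old_dump, 'r') as f:
            dump_list = json.load(f)
        if dump_list:  # an empty dump gives nothing to resume from
            return int(dump_list[-1]['created_utc'])
    return default

# Midnight on 1 January 2018 in UTC, independent of the local time zone
default = int(dt.datetime(2018, 1, 1, tzinfo=dt.timezone.utc).timestamp())
print(default)  # 1514764800
```

Because the posts are requested in ascending date order, the last entry of a dump is its most recent post, which is why its `created_utc` is a suitable starting point for the next run.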

Define the desired fields and retrieve

In [13]:
import os

# Fields to be retrieved from each submission ('label' is not fetched, it is filled in by label())
#comment_fields = ('id', 'body', 'link_id', 'permalink', 'score', 'subreddit_id')
post_fields = ('id','name', 'created_utc', 'title', 'selftext', 'score', 'label')

count = 30000

# Retrieve for subreddits
for subreddit in subreddits:
    
    # Start date from which we want to retrieve the posts
    date = int(dt.datetime(2018,1,1).timestamp())
    
    # Check if there are previous dumps of the subreddit, so that we continue from its last date
    old_dump = 'tips/{}_dump_{}.json'.format(subreddit, file_id-1)
    if os.path.isfile(old_dump):
        with open(old_dump, 'r') as f:
            dump_list = json.load(f)
            date = int(dump_list[-1]['created_utc'])
    
    print('\nRetrieving {} posts from {} starting by {}'.format(count, subreddit, date))
    
    post_retrieval('tips/{}_dump_{}.json'.format(subreddit, file_id), subreddit, date, count)


Retrieving 30000 posts from lifeprotips starting by 1514761200
We're at: 1000
We're at: 2000
We're at: 3000
We're at: 4000
We're at: 5000
We're at: 6000
We're at: 7000
We're at: 8000
We're at: 9000
We're at: 10000
We're at: 11000
We're at: 12000
We're at: 13000
We're at: 14000
We're at: 15000
We're at: 16000
We're at: 17000
We're at: 18000
We're at: 19000
We're at: 20000
We're at: 21000
We're at: 22000
We're at: 23000
We're at: 24000
We're at: 25000
We're at: 26000
We're at: 27000
We're at: 28000
We're at: 29000
We're at: 30000

Retrieving 30000 posts from shittylifeprotips starting by 1514761200
We're at: 1000
We're at: 2000
We're at: 3000
We're at: 4000
We're at: 5000
We're at: 6000
We're at: 7000
We're at: 8000
We're at: 9000
We're at: 10000
We're at: 11000
We're at: 12000
We're at: 13000
We're at: 14000
We're at: 15000
We're at: 16000
We're at: 17000
We're at: 18000
We're at: 19000
We're at: 20000
We're at: 21000
We're at: 22000
We're at: 23000
We're at: 24000
We're at: 25000
We'r

DONE