## Failed Approach (Using API)

The following blocks try the API wrapper approach, which failed spectacularly: the API frequently froze while looping through the results. I suggest you skip to the next section, which works well.

You can also use `PushshiftAPI()` from `pushshift_py` directly, without psaw.

In [2]:
from pushshift_py import PushshiftAPI
import datetime as dt
import psaw
import pandas as pd
import requests
import json
import csv
import time
api = psaw.PushshiftAPI()

In [3]:
startEpoch = int(dt.datetime(2020,1,1).timestamp())
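Note that calling `.timestamp()` on a naive `datetime` interprets it in the machine's local timezone, so the epoch above can differ from one computer to another. A minimal sketch of a timezone-independent alternative, assuming UTC is the intended reference (the name `startEpochUtc` is only illustrative):

```python
import datetime as dt

# Attach an explicit UTC timezone so the epoch is the same on every machine.
startEpochUtc = int(dt.datetime(2020, 1, 1, tzinfo=dt.timezone.utc).timestamp())
print(startEpochUtc)  # 1577836800
```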

The following block shows how to get information using Pushshift. We list the features we want in `filter`, and the call returns a generator whose elements are submission objects; it can be converted into a list if needed.

In [6]:
features = ['url','author', 'title', 'subreddit', 'id', 'created', 'score']
subreddit = 'NBA'

data = api.search_submissions(after=startEpoch,
                            subreddit=subreddit,
                            filter= features,
                            limit=10)

for datum in data:
    print(datum.id, datum.subreddit, datum.title, datum.author, datum.url, datum.created, datum.score)

n7owcn nba Jay Williams on the Celtics struggles: "It's the job of the head coach to make the talent mesh. Kyrie was scapegoated. Kyrie is no longer there. What's the issue now? It seems like there's always an issue in Boston to a degree. It has to be geared a little bit more towards Brad Stevens." LeUnbeatableJames23 https://streamable.com/xhihoh 1620497330.0 1
n7or7u nba Malachi Flynn has passed Lamelo Ball in games played MasterPsaysUgh https://www.reddit.com/r/nba/comments/n7or7u/malachi_flynn_has_passed_lamelo_ball_in_games/ 1620496846.0 1
n7oqjz nba Shitpost Saturday + Game Thread Index NBA_MOD https://www.reddit.com/r/nba/comments/n7oqjz/shitpost_saturday_game_thread_index/ 1620496802.0 1
n7ook2 nba A breakdown of Stanley Johnson's playmaking vs the Wizards SkycapRex https://youtu.be/ACOYfg-KK3s 1620496607.0 1
n7of27 nba Winners of this year's final will be somek20 https://www.reddit.com/r/nba/comments/n7of27/winners_of_this_years_final_will_be/ 1620495655.0 1
n7o30e nba Google 
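Because the result is a generator, it can only be consumed once: a second loop over the same object yields nothing. The sketch below illustrates this with a stand-in generator and shows how `list()` keeps the results around. `fake_search` and `Submission` are hypothetical names that only mimic the shape of the results; they are not part of psaw.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects yielded by the search; not a psaw type.
Submission = namedtuple("Submission", ["id", "title", "score"])

def fake_search(limit):
    """Yield `limit` fake submissions lazily, the way a search generator would."""
    for i in range(limit):
        yield Submission(id=f"post{i}", title=f"Title {i}", score=1)

results = fake_search(limit=3)
posts = list(results)      # materialize the results so they can be reused
leftover = list(results)   # the generator is exhausted, so this is empty

print(len(posts), len(leftover))
print([post.id for post in posts])
```

If the data is needed more than once (for example, printing it and then storing it), converting to a list first avoids issuing the same request twice.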

In [7]:
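import os
import praw

# Read the credentials from environment variables instead of hard-coding them.
# The variable names below are placeholders; use whatever names you have set.
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    password=os.environ["REDDIT_PASSWORD"],
    user_agent="testscript by u/kc_the_scraper",
    username="kc_the_scraper",
)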

We can use praw to get the post body using the following block.

In [7]:
reddit.submission(id='eiev5d').selftext

'first off, happy new years. feel free to spam this with “new years shit post” and get your upvotes. anyways,\n\n ben and embiid won’t win together. ever. if ben simmons got built around like giannis did, aka shooters all around him, as well as a coach that would push him to shoot 1-2 threes per game, that said team would be unstoppable and he would be great.\n\nyes, ben is a coward. but, the great sub known as r/nba, i am proposing you the idea that if ben had shooters around him he would be amazing. imagine him driving inside and using his fantastic passing ability to kick it to shooters. it would be great. instead, embiid is in the paint, and he won’t shoot it, and it’s just a mess.\n\na trade idea i thought if could be ben for dlo, once ben’s max kicks in, so next season. (we’ll see how this one plays out) ben simmons would be playing with klay and steph, could maybe play that forward role, and he would be golden. him with his passing ability and steph and klay with their shooting 

In the following blocks, we create a table and store the information in it. For reasons I could not pin down, though, the API often freezes while we loop through the data.

In [2]:
import sqlite3

In [3]:
conn = sqlite3.connect('redditPosts.sqlite')
cur = conn.cursor()

In [10]:
cur.execute('''CREATE TABLE IF NOT EXISTS Posts(
                id TEXT PRIMARY KEY,
                subreddit TEXT,
                title TEXT,
                author TEXT,
                url TEXT,
                created int)
                ''')

<sqlite3.Cursor at 0x22b90141340>

In [11]:
features = ['url','author', 'title', 'subreddit', 'id', 'created']
subreddit = 'stocks'
latest = dt.datetime(2021,5,8).timestamp()
earliest = dt.datetime(2020,1,1).timestamp()

startEpoch = earliest

while startEpoch <= latest:
    data = api.search_submissions(after=startEpoch,
                            subreddit=subreddit,
                            filter= features,
                            limit=100)
    
    # Default to the current start so that an empty result ends the loop
    # instead of raising a NameError or reusing a stale timestamp.
    currentTime = startEpoch
    for datum in data:
        cur.execute('''INSERT OR IGNORE INTO Posts VALUES (?,?,?,?,?,?)'''
                    , (datum.id, datum.subreddit, datum.title, datum.author, datum.url, datum.created))
        
        currentTime = datum.created
    
    conn.commit()
    # Stop when no post moved the timestamp forward.
    if currentTime == startEpoch:
        break
    startEpoch = currentTime + 1
    print(dt.datetime.fromtimestamp(startEpoch))





KeyboardInterrupt: 
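The loop above relies on `INSERT OR IGNORE` together with the `id` primary key, so that fetching the same post twice (for example after restarting an interrupted run) does not create duplicate rows. A small self-contained check of that behavior, using an in-memory database and a made-up row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS Posts(
                id TEXT PRIMARY KEY,
                subreddit TEXT,
                title TEXT,
                author TEXT,
                url TEXT,
                created int)''')

# Made-up row used only for illustration.
row = ("abc123", "stocks", "Example title", "example_user",
       "https://example.com", 1577836800)

cur.execute("INSERT OR IGNORE INTO Posts VALUES (?,?,?,?,?,?)", row)
# Same id again: the primary-key conflict is ignored and no row is added.
cur.execute("INSERT OR IGNORE INTO Posts VALUES (?,?,?,?,?,?)", row)
conn.commit()

row_count = cur.execute("SELECT COUNT(*) FROM Posts").fetchone()[0]
print(row_count)
```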

## Another Approach (Getting JSON)

The method above is shaky at best: a lot of the time the API just freezes. I find calling the Pushshift endpoint directly with requests much easier. The following code blocks contain what you need to fetch and store Reddit data.

In [1]:
import requests
import datetime as dt
import sqlite3
import json
import time

In [2]:
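def getPushShiftData(after,before, sub):
    # Let requests build and escape the query string.
    url = 'https://api.pushshift.io/reddit/search/submission/'
    params = {'size': 100, 'after': int(after), 'before': int(before), 'subreddit': sub}
    # The timeout keeps a stalled request from hanging forever.
    r = requests.get(url, params=params, timeout=30)
    r.raise_for_status()  # raise on HTTP errors, e.g. when requests are too frequent
    return r.json()['data']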
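A single failed request makes `getPushShiftData` raise. One possible way to tolerate occasional failures is to retry with increasing pauses. The sketch below is an assumption on my part, not something the original notebook does; `retry_with_backoff` and its parameters are hypothetical names, and the demo uses a stand-in function instead of a real request.

```python
import time

def retry_with_backoff(fetch, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)

# Demo with a stand-in fetch that fails twice before succeeding.
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated failure")
    return [{"id": "abc123"}]

waits = []  # record the pauses instead of actually sleeping
result = retry_with_backoff(flaky_fetch, sleep=waits.append)
print(result, waits)
```

With the real function, the call would look like `retry_with_backoff(lambda: getPushShiftData(after, before, sub))`.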

In [3]:
def extractInfo(datum,features):
    info = {}
    
    for feature in features:
        info[feature] = datum[feature]
    
    return info

In [45]:
print(extractInfo(data[0], ['url','author', 'title', 'subreddit', 'id', 'created_utc']))

{'url': 'https://www.reddit.com/r/stocks/comments/eifodm/annual_forecast_for_facebook_fb_2020/', 'author': 'Ituglobal', 'title': 'Annual Forecast for Facebook (FB) – 2020', 'subreddit': 'stocks', 'id': 'eifodm', 'created_utc': 1577863982}


In [4]:
def getLatestTime(data):
    return data[-1]['created_utc']

In [5]:
def dataStoragePipeline(after, before, sub, conn):
    # Fields to store, in the same order as the columns of the Posts table.
    features = ['id', 'subreddit', 'title', 'author', 'full_link', 'created_utc']
    cursor = conn.cursor()
    while after < before:
        data = getPushShiftData(after, before, sub)
        if not data:
            break  # no more posts in this time window
        for datum in data:
            info = extractInfo(datum, features)
            cursor.execute('''INSERT OR IGNORE INTO Posts 
                                VALUES (?,?,?,?,?,?)'''
                              , tuple(info[feature] for feature in features))
        
        # Continue from one second after the last post in this batch.
        after = getLatestTime(data) + 1
        conn.commit()
        print("The latest post is submitted at", dt.datetime.fromtimestamp(after-1))
        time.sleep(0.1)  # brief pause between requests
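The loop in `dataStoragePipeline` is a form of timestamp-based pagination: fetch a batch, then move `after` just past the last post and fetch again. The sketch below isolates that idea with an in-memory stand-in for the API. `paginate`, `fake_fetch` and `FAKE_POSTS` are hypothetical, and the stand-in assumes batches come back in ascending time order with `after` treated as inclusive; the real endpoint may behave differently.

```python
def paginate(fetch, after, before):
    """Collect all items in [after, before) by repeatedly advancing `after`."""
    collected = []
    while after < before:
        batch = fetch(after, before)
        if not batch:
            break  # nothing left in the window
        collected.extend(batch)
        after = batch[-1]["created_utc"] + 1  # resume just past the last item
    return collected

# Stand-in data source: ten fake posts, one per second.
FAKE_POSTS = [{"id": f"p{t}", "created_utc": t} for t in range(100, 110)]

def fake_fetch(after, before, size=4):
    """Return up to `size` fake posts, oldest first (assumed inclusive `after`)."""
    matches = [p for p in FAKE_POSTS if after <= p["created_utc"] < before]
    return matches[:size]

posts = paginate(fake_fetch, after=100, before=200)
print(len(posts), posts[0]["id"], posts[-1]["id"])
```

One caveat of advancing by a full second: if several posts share the timestamp of the last item and the batch cuts off between them, the remaining ones would be skipped. Resuming from the last timestamp itself and letting `INSERT OR IGNORE` drop the repeats is one possible way around that.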
        
        

In [7]:
import sqlite3
conn = sqlite3.connect('redditPosts.sqlite')
cur = conn.cursor()
subreddit = 'wallstreetbets'
latest = int(time.time())
earliest = dt.datetime(2021,1,1).timestamp()
cur.execute('''SELECT MAX(created) FROM Posts
                WHERE subreddit = ?''', (subreddit,))
databaseLatest = cur.fetchone()[0]

if databaseLatest:
    start = max(earliest, databaseLatest)
else:
    start = earliest
print(dt.datetime.fromtimestamp(start))
while start < latest:
    try:
        dataStoragePipeline(after = start, before = latest, sub = subreddit, conn = conn)
        # The pipeline returned normally, so the whole range is stored. Stop here
        # rather than looping again with an unchanged start.
        break
    except KeyboardInterrupt:
        print("Interrupted by keyboard. Stopping.")
        break
        
    except Exception:
        print("Error occurred. Probably due to frequent requests. Will resume working in 1 seconds.")
        time.sleep(1)
        # Resume from the newest post already stored for this subreddit.
        cur.execute('''SELECT MAX(created) FROM Posts
                        WHERE subreddit = ?''', (subreddit,))
        databaseLatest = cur.fetchone()[0]
        
        if databaseLatest:
            start = max(earliest, databaseLatest)
        else:
            start = earliest
        

2021-01-25 11:31:27
The latest post is submitted at 2021-01-25 11:34:42
The latest post is submitted at 2021-01-25 11:37:59
The latest post is submitted at 2021-01-25 11:41:17
The latest post is submitted at 2021-01-25 11:45:11
The latest post is submitted at 2021-01-25 11:48:33
The latest post is submitted at 2021-01-25 11:53:15
The latest post is submitted at 2021-01-25 11:57:54
The latest post is submitted at 2021-01-25 12:02:10
The latest post is submitted at 2021-01-25 12:07:04
The latest post is submitted at 2021-01-25 12:11:14
The latest post is submitted at 2021-01-25 12:15:57
The latest post is submitted at 2021-01-25 12:20:44
The latest post is submitted at 2021-01-25 12:25:07
The latest post is submitted at 2021-01-25 12:30:13
The latest post is submitted at 2021-01-25 12:34:48
The latest post is submitted at 2021-01-25 12:39:36
Error occurred. Probably due to frequent requests. Will resume working in 1 seconds.
The latest post is submitted at 2021-01-25 12:45:51
The latest 
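The resume logic appears twice in the cell above: once before the loop and once in the error handler. As a sketch of how it could be factored out (the helper name `resume_point` is hypothetical, and the demo uses an in-memory database with a made-up row):

```python
import sqlite3

def resume_point(conn, subreddit, earliest):
    """Return the timestamp to resume from: the newest stored post, or `earliest`."""
    cur = conn.cursor()
    cur.execute("SELECT MAX(created) FROM Posts WHERE subreddit = ?", (subreddit,))
    newest = cur.fetchone()[0]  # None when no rows match
    return max(earliest, newest) if newest is not None else earliest

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Posts(id TEXT PRIMARY KEY, subreddit TEXT, title TEXT,
                                   author TEXT, url TEXT, created int)""")

# Empty table: fall back to the earliest timestamp.
empty_start = resume_point(conn, "wallstreetbets", earliest=1609459200)

# After one (made-up) post is stored, resume from its timestamp.
conn.execute("INSERT INTO Posts VALUES (?,?,?,?,?,?)",
             ("abc123", "wallstreetbets", "Example", "example_user",
              "https://example.com", 1611574287))
filled_start = resume_point(conn, "wallstreetbets", earliest=1609459200)
print(empty_start, filled_start)
```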