## Failed Approach (Using API)

The following blocks use the API-wrapper approach, which failed spectacularly. I suggest skipping ahead to the next section, which works well.

You can also use `PushshiftAPI()` from `pushshift_py` directly, without psaw.

```Python
from pushshift_py import PushshiftAPI
import datetime as dt
import psaw
import pandas as pd
import requests
import json
import csv
import time
api = psaw.PushshiftAPI()

startEpoch = int(dt.datetime(2020,1,1).timestamp())
```
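
One thing to watch: `dt.datetime(2020,1,1).timestamp()` interprets a naive datetime in the machine's local time zone, while fields such as `created_utc` are UTC epoch seconds. A minimal sketch of building the boundary in UTC instead (the helper name `utc_epoch` is made up for illustration):

```python
import datetime as dt

def utc_epoch(year, month, day):
    """Return epoch seconds for midnight UTC on the given date."""
    moment = dt.datetime(year, month, day, tzinfo=dt.timezone.utc)
    return int(moment.timestamp())

start_epoch = utc_epoch(2020, 1, 1)
print(start_epoch)  # 1577836800
```

Whether the few hours of difference matter depends on how precise the date range needs to be.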
    
The following block shows how to fetch data through Pushshift by specifying the fields we want. The return value is a generator whose elements are `submission` objects, though it can easily be converted into a list.

```Python
features = ['url','author', 'title', 'subreddit', 'id', 'created', 'score']
subreddit = 'NBA'

data = api.search_submissions(after=startEpoch,
                            subreddit=subreddit,
                            filter= features,
                            limit=10)

for datum in data:
    print(datum.id, datum.subreddit, datum.title, datum.author, datum.url, datum.created, datum.score)

import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder: use your own app's ID
    client_secret="YOUR_CLIENT_SECRET",  # placeholder: never publish real secrets
    password="YOUR_PASSWORD",            # placeholder
    user_agent="testscript by u/kc_the_scraper",
    username="kc_the_scraper",
)
```
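
Because the results come back as a generator, they can only be iterated once. Below is a small sketch of materializing them into a list of dictionaries. It uses a hypothetical stand-in generator (`fake_search`) rather than a live API call, and it assumes each result behaves like a namedtuple, which may not match the wrapper's actual type exactly:

```python
from collections import namedtuple
from itertools import islice

# Stand-in for the objects yielded by the search; not the real wrapper type.
Submission = namedtuple("Submission", ["id", "subreddit", "title", "score"])

def fake_search(limit):
    """Hypothetical generator that mimics a lazy search result."""
    for i in range(limit):
        yield Submission(id=f"post{i}", subreddit="NBA", title=f"Title {i}", score=i)

results = fake_search(limit=10)

# Take the first three items without consuming the rest of the generator.
first_three = list(islice(results, 3))

# Convert each namedtuple into a plain dict for storage or a DataFrame.
rows = [s._asdict() for s in first_three]
print(rows[0]["id"], len(rows))

# The generator remembers its position, so only the unread items remain.
remaining = list(results)
print(len(remaining))
```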

We can use praw to fetch the body of a post by its ID:
```Python
reddit.submission(id='eiev5d').selftext
```



In the following blocks, we create a table and store the information. For some reason, though, the API often freezes when we loop through the data.
```Python
import sqlite3

conn = sqlite3.connect('redditPosts.sqlite')
cur = conn.cursor()

cur.execute('''CREATE TABLE IF NOT EXISTS Posts(
                id TEXT PRIMARY KEY,
                subreddit TEXT,
                title TEXT,
                author TEXT,
                url TEXT,
                created int)
                ''')

features = ['url','author', 'title', 'subreddit', 'id', 'created']
subreddit = 'stocks'
latest = dt.datetime(2021,5,8).timestamp()
earliest = dt.datetime(2020,1,1).timestamp()

startEpoch = earliest

while startEpoch <= latest:
    data = api.search_submissions(after=startEpoch,
                            subreddit=subreddit,
                            filter=features,
                            limit=100)

    currentTime = None  # reset each pass so an empty batch can be detected
    for datum in data:
        cur.execute('''INSERT OR IGNORE INTO Posts VALUES (?,?,?,?,?,?)'''
                    , (datum.id, datum.subreddit, datum.title, datum.author, datum.url, datum.created))

        currentTime = datum.created

    conn.commit()
    # Stop when nothing new came back; otherwise the loop would never end.
    if currentTime is None or currentTime == startEpoch:
        break
    startEpoch = currentTime + 1
    print(dt.datetime.fromtimestamp(startEpoch))

```
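
The pagination idea in the loop above (fetch a batch, move the start time past the newest post, repeat) can be sketched without the API. Here `fetch_batch` is a hypothetical stand-in that treats `after` as inclusive and returns posts in ascending time order; the real service may behave differently on both points. The loop stops as soon as a batch comes back empty:

```python
def fetch_batch(posts, after, before, limit):
    """Hypothetical stand-in for the API: posts with after <= created < before."""
    matching = [p for p in posts if after <= p["created"] < before]
    return sorted(matching, key=lambda p: p["created"])[:limit]

def collect_all(posts, start, end, limit=100):
    """Page through [start, end) by moving `start` past the newest post seen."""
    collected = []
    while start < end:
        batch = fetch_batch(posts, start, end, limit)
        if not batch:
            break  # nothing new: stop instead of looping forever
        collected.extend(batch)
        # Note: posts sharing the last timestamp but cut off by `limit`
        # would be skipped here; the fake data below has unique timestamps.
        start = batch[-1]["created"] + 1
    return collected

# 250 fake posts, one per second starting at t=1000.
fake_posts = [{"id": str(i), "created": 1000 + i} for i in range(250)]
result = collect_all(fake_posts, start=1000, end=2000, limit=100)
print(len(result))
```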

## Another Approach (Getting JSON)

The method above is shaky at best; the API often just freezes. Calling the endpoint directly with `requests` turned out to be much easier. The following code blocks contain everything needed to store the Reddit data.

```Python
import requests
import datetime as dt
import sqlite3
import json
import time
```

```Python
def getPushShiftData(after, before, sub):
    """Fetch up to 100 submissions from `sub` posted between `after` and `before`."""
    url = 'https://api.pushshift.io/reddit/search/submission/'
    params = {'size': 100, 'after': int(after), 'before': int(before), 'subreddit': sub}
    r = requests.get(url, params=params)
    r.raise_for_status()  # surface HTTP errors (e.g. too many requests) as exceptions
    return r.json()['data']

def extractInfo(datum, features):
    """Return only the requested fields of one submission."""
    info = {}

    for feature in features:
        info[feature] = datum[feature]

    return info

def getLatestTime(data):
    """Timestamp of the last submission in the batch."""
    return data[-1]['created_utc']

def dataStoragePipeline(after, before, sub, conn):
    """Page through the time range and store every submission in the Posts table."""
    cursor = conn.cursor()
    while after < before:
        data = getPushShiftData(after, before, sub)
        if not data:
            break
        for datum in data:
            cursor.execute('''INSERT OR IGNORE INTO Posts
                                VALUES (?,?,?,?,?,?)'''
                              , (datum['id'], datum['subreddit'], datum['title'], datum['author'], datum['full_link'], datum['created_utc']))

        # Move the window past the newest post stored so far.
        after = getLatestTime(data) + 1
        conn.commit()
        print("The latest post is submitted at", dt.datetime.fromtimestamp(after-1))
        time.sleep(0.1)
```

```Python
conn = sqlite3.connect('redditPosts.sqlite')
cur = conn.cursor()
subreddit = 'wallstreetbets'
end = int(time.time())
start = dt.datetime(2021,1,1).timestamp()
cur.execute('''SELECT MIN(created), MAX(created) FROM Posts
                WHERE subreddit = ?''', (subreddit,))
datatimes = cur.fetchone()

# MIN/MAX return (None, None) when no rows match, so check before comparing.
dataEarly, dataLate = datatimes
if dataEarly is not None:
    if end < dataEarly:
        end = dataEarly
    elif start < dataLate:
        start = dataLate  # resume from the newest post already stored
```
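
A detail that is easy to miss in this resume logic: an aggregate query such as `SELECT MIN(...), MAX(...)` always returns exactly one row, and when no rows match, that row is `(None, None)`, which is still truthy. A short sketch with an in-memory database (the table is a simplified stand-in for `Posts`, and `stored_range` is a made-up helper name):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Posts (id TEXT PRIMARY KEY, subreddit TEXT, created INT)")

def stored_range(cur, subreddit):
    """Return (earliest, latest) timestamps for a subreddit, or None if no rows."""
    cur.execute(
        "SELECT MIN(created), MAX(created) FROM Posts WHERE subreddit = ?",
        (subreddit,),
    )
    earliest, latest = cur.fetchone()  # always one row, possibly (None, None)
    if earliest is None:
        return None
    return earliest, latest

print(stored_range(cur, "wallstreetbets"))  # None

cur.executemany(
    "INSERT INTO Posts VALUES (?, ?, ?)",
    [("a", "wallstreetbets", 1609480876), ("b", "wallstreetbets", 1611600943)],
)
print(stored_range(cur, "wallstreetbets"))  # (1609480876, 1611600943)
```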

```Python
while start < end:
    try:
        dataStoragePipeline(after=start, before=end, sub=subreddit, conn=conn)
        break  # the pipeline returned normally, so the whole range is stored
    except KeyboardInterrupt:
        print("Interrupted by keyboard. Stopping.")
        break
    except Exception:
        print("Error occurred. Probably due to frequent requests. Will resume working in 1 seconds.")
        time.sleep(1)
        # Resume from the newest post already stored.
        cur.execute('''SELECT MIN(created), MAX(created) FROM Posts
                        WHERE subreddit = ?''', (subreddit,))
        dataEarly, dataLate = cur.fetchone()

        if dataEarly is not None:
            if end < dataEarly:
                end = dataEarly
            elif start < dataLate:
                start = dataLate
```

```
The latest post is submitted at 2021-01-25 14:18:06
The latest post is submitted at 2021-01-25 14:23:28
The latest post is submitted at 2021-01-25 14:29:08
The latest post is submitted at 2021-01-25 14:34:22
The latest post is submitted at 2021-01-25 14:40:07
Error occurred. Probably due to frequent requests. Will resume working in 1 seconds.
The latest post is submitted at 2021-01-25 14:44:50
The latest post is submitted at 2021-01-25 14:51:04
The latest post is submitted at 2021-01-25 14:57:59
The latest post is submitted at 2021-01-25 15:03:49
Error occurred. Probably due to frequent requests. Will resume working in 1 seconds.
The latest post is submitted at 2021-01-25 15:11:34
The latest post is submitted at 2021-01-25 15:17:19
The latest post is submitted at 2021-01-25 15:24:34
Error occurred. Probably due to frequent requests. Will resume working in 1 seconds.
The latest post is submitted at 2021-01-25 15:31:01
The latest post is submitted at 2021-01-25 15:38:34
The latest post i
```
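
The log above shows occasional failures that clear up after a short pause. One common way to handle that, sketched here as an alternative to a fixed one-second sleep, is to retry with an increasing delay. `with_retries` and `flaky` are made-up names, and the `sleep` parameter exists only so the example can run without actually waiting:

```python
import time

def with_retries(action, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `action()` until it succeeds, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception as exc:  # KeyboardInterrupt is not caught here
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay} s")
            sleep(delay)

# Hypothetical action that fails twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated failure")
    return "ok"

waits = []  # record the requested delays instead of sleeping
outcome = with_retries(flaky, sleep=waits.append)
print(outcome, waits)
```

In the real loop, `action` would be a call that fetches and stores one batch, so a failure only repeats that batch rather than the whole range.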

```Python
cur.execute('''SELECT MIN(created), MAX(created) FROM Posts
                WHERE subreddit = ?''', (subreddit,))
print(cur.fetchone())
```

```
(1611600943, 1609480876)
```
