## Getting Reddit Data
There are two options for extracting data from Reddit:

The requests library, which will allow us to interface directly with the Reddit API.

The PRAW library, which is a wrapper library that adds an extra layer of abstraction in accessing the Reddit API.

Here we will cover the first option, using the requests library to interface directly with the API.

The final extraction script will look like this:

In [3]:
import requests
import pandas as pd


class Reddit:
    def __init__(self, client_id, secret_token, username, password):
        # first create authentication object
        auth = requests.auth.HTTPBasicAuth(client_id, secret_token)
        # build login dictionary
        login = {'grant_type': 'password',
                 'username': username,
                 'password': password}
        # setup header info (incl description of API)
        headers = {'User-Agent': 'hptest/0.0.1'}
        # send request for OAuth token
        res = requests.post(f'https://www.reddit.com/api/v1/access_token',
                            auth=auth, data=login, headers=headers)
        # pull auth bearer token from response
        token = res.json()['access_token']
        # add authorization to headers dictionary
        headers['Authorization'] = f'bearer {token}'
        # add headers dict to internal attributes
        self.headers = headers
        # and api
        self.api = 'https://oauth.reddit.com'

    def get_new(self, subreddit, iters):
        # initialize dataframe to store data
        df = pd.DataFrame()
        # initialize parameters dictionary
        params = {'limit': 100}
        # iterate through several times to make sure we get all the data available
        for i in range(iters):
            # make request
            res = requests.get(f'{self.api}/r/{subreddit}/new',
                               headers=self.headers,
                               params=params)
            # check that we returned something (if not we reached end)
            if len(res.json()['data']['children']) == 0:
                print('No more found')
                return df
            # iterate through each thread recieved
            for thread in res.json()['data']['children']:
                # add info to dataframe
                df = df.append({
                    'id': thread['data']['name'],
                    'created_utc': int(thread['data']['created_utc']),
                    'subreddit': thread['data']['subreddit'],
                    'title': thread['data']['title'],
                    'selftext': thread['data']['selftext'],
                    'upvote_ratio': thread['data']['upvote_ratio'],
                    'ups': thread['data']['ups'],
                    'downs': thread['data']['downs'],
                    'score': thread['data']['score']
                }, ignore_index=True)
            # get earliest ID
            earliest = df['id'].iloc[len(df)-1]
            # add earliest ID to params
            params['after'] = earliest
        return df

In [4]:
SUB = 'investing'

In [5]:
CLIENT_ID = 'jnSuSTCO2WHq3PIHm43YRQ'
SECRET_TOKEN = '-Oj2oApu_HDlSrlOa_F9uynXmQDkcQ'

In [8]:
USER = 'Direct-Assistance-62'
PWD = '***'

In [9]:
reddit = Reddit(CLIENT_ID, SECRET_TOKEN, USER, PWD)

In [10]:
data = reddit.get_new(SUB, 20)

No more found


In [12]:
data

Unnamed: 0,created_utc,downs,id,score,selftext,subreddit,title,ups,upvote_ratio
0,1.642328e+09,0.0,t3_s58zdo,0.0,\n\nThis past September ClimeWorks launched t...,investing,Breakthrough That Could Reverse Climate Change...,0.0,0.13
1,1.642327e+09,0.0,t3_s58p7l,19.0,Have a general question? Want to offer some c...,investing,Daily General Discussion and Advice Thread - J...,19.0,0.80
2,1.642322e+09,0.0,t3_s57c11,0.0,I tried using crypto as a savings account but ...,investing,I've come in to a little money recently due to...,0.0,0.45
3,1.642312e+09,0.0,t3_s54zb3,0.0,I am closing my Betterment account after exper...,investing,Tax Loss Harvesting When Using a VTI and Chill...,0.0,0.50
4,1.642306e+09,0.0,t3_s53082,79.0,All around the news that US inflation is at 4...,investing,High inflationary environment: Warren Buffett ...,79.0,0.87
...,...,...,...,...,...,...,...,...,...
892,1.635097e+09,0.0,t3_qex2rl,626.0,"As title says what is your stocks, bonds, prop...",investing,"What is the % split of your assets? Stocks, Bo...",626.0,0.91
893,1.635087e+09,0.0,t3_qetphf,14.0,Let's say the FED starts rising rates to conta...,investing,Are there any asset classes from which one cou...,14.0,0.67
894,1.635079e+09,0.0,t3_qer7uo,16.0,"Hi folks,\n\nI have been debating with myself ...",investing,Investing for the metaverse? Are Facebook and ...,16.0,0.69
895,1.635078e+09,0.0,t3_qer227,305.0,I'm reading through some material about the su...,investing,How big of a risk is the supply chain crisis?,305.0,0.92


In [11]:
data = data.replace({'|': ''}, regex=True)

In [14]:
data.to_csv(f'reddit_{SUB}.csv', sep='|', index=False)