## Hayden French (hmf9kx) ETL Project

This project extracts and transforms data from the social media website Reddit using its API. The user chooses a subreddit (a community on the website), a number of top posts to retrieve, and the time frame to draw them from. The script then sorts the posts by how "controversial" they are, using a metric I created. The resulting dataframe can be exported to a .csv file if desired.

In [1]:
# Importing necessary packages
import requests
import pandas as pd
from configparser import ConfigParser

In [2]:
# Public credentials
username = 'DS3002API'
client_id = 'no9xCWocGbF-E_D-6U8Meg'

In [3]:
# Reading confidential credentials from a separate file
parser = ConfigParser()

# parser.read() returns the list of files it parsed; a missing file gives an empty list, not an exception
if not parser.read('notebook.cfg'):
    print('Unable to read configuration file')

try:
    secret_key = parser.get('my_api', 'secret_key')
    pw = parser.get('my_api', 'pw')
except Exception as e:
    print(f'Unable to fetch secret key and/or password: {e}')
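
The cell above expects `notebook.cfg` to contain a `my_api` section with `secret_key` and `pw` entries. As a minimal sketch of that layout, the snippet below parses the same structure from a string instead of a file. The values are placeholders and the `example_` names are hypothetical, chosen only for illustration.

```python
from configparser import ConfigParser

# Hypothetical contents of notebook.cfg; the values are placeholders, not real credentials
example_cfg = """
[my_api]
secret_key = YOUR_SECRET_KEY
pw = YOUR_PASSWORD
"""

example_parser = ConfigParser()
example_parser.read_string(example_cfg)

# Same lookups as the notebook cell, applied to the example text
example_secret = example_parser.get('my_api', 'secret_key')
example_pw = example_parser.get('my_api', 'pw')
print(example_secret, example_pw)
```

Keeping the secret key and password in a separate file like this means they never have to appear in the notebook itself.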

In [4]:
# Building the basic-auth object; the credentials themselves are only verified when the token is requested
try:
    auth = requests.auth.HTTPBasicAuth(client_id, secret_key)
except NameError:
    print('Client id and/or secret key is not defined')

In [5]:
data = {
    'grant_type': 'password',
    'username': username,
    'password': pw
}

headers = {'User-Agent': 'My DS3002 API'}

In [6]:
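# Retrieving access token from the API
try:
    res = requests.post('https://www.reddit.com/api/v1/access_token', auth=auth, data=data, headers=headers)
    res.raise_for_status()  # surface HTTP errors such as 401 instead of failing later on a missing key
    token = res.json()['access_token']
except (requests.RequestException, KeyError, ValueError) as e:
    print(f'Unable to retrieve access token: {e}')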

In [7]:
headers['Authorization'] = f'bearer {token}'

The user can choose three parameters for the API request. The first is the subreddit to extract posts from. The second is the maximum number of posts to retrieve. The third is the time frame: the request fetches the top posts from that period. If you are unfamiliar with Reddit, a good example input would be 'news', '50', and 'month' (if you are viewing this on GitHub, this is the result you will see). This retrieves the 50 most popular posts submitted to the news subreddit in the last month.
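
The three parameters could be checked before the request is sent. The sketch below is one possible approach, not part of the original notebook: `validate_params` and `VALID_TIME_FRAMES` are hypothetical names, and the allowed values (1 to 100 posts and the six time frames) are taken from the input prompts.

```python
# Time frames offered by the input prompt
VALID_TIME_FRAMES = {'hour', 'day', 'week', 'month', 'year', 'all'}

def validate_params(sub, limit, time_frame):
    """Return cleaned (sub, limit, time_frame) or raise ValueError."""
    sub = sub.strip()
    if not sub:
        raise ValueError('Subreddit name must not be empty')
    limit = int(limit)  # raises ValueError for non-numeric input
    if not 1 <= limit <= 100:
        raise ValueError('Number of posts must be between 1 and 100')
    time_frame = time_frame.strip().lower()
    if time_frame not in VALID_TIME_FRAMES:
        raise ValueError(f'Time frame must be one of {sorted(VALID_TIME_FRAMES)}')
    return sub, limit, time_frame

print(validate_params('news', '50', 'Month'))
```

Rejecting bad input early gives a clearer message than a failed request or an empty result would.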

In [8]:
sub = input('What subreddit would you like to fetch posts from?: ')
limit = int(input('How many posts would you like to retrieve (1-100)?: '))
time = input('I would like to view the top posts from the past (hour, day, week, month, year, all): ').lower()

What subreddit would you like to fetch posts from?:  news
How many posts would you like to retrieve (1-100)?:  50
I would like to view the top posts from the past (hour, day, week, month, year, all):  month


In [9]:
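# Requesting data as a json
try:
    res = requests.get(f'https://oauth.reddit.com/r/{sub}/top', headers=headers, params={'t': time, 'limit': limit})
    res.raise_for_status()  # surface HTTP errors such as 403 or 404 for an invalid subreddit
    r = res.json()
except (requests.RequestException, ValueError) as e:
    print(f'Unable to fetch posts: {e}')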

In [10]:
# Navigating the json
l = r['data']['children']

In [11]:
# Creating the blank dataframe which we will populate later
df = pd.DataFrame(data={'Title': [], 'Number of Upvotes': [], 'Percent Upvoted': [], 'Number of Comments': [], 
                        'Link': []}).astype({"Number of Upvotes": int, "Number of Comments": int})

In [12]:
# Populating our dataframe with select information from the json
# (DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first)
rows = []
for post in l:
    j = post['data']
    rows.append({'Title': j['title'], 'Number of Upvotes': int(j['score']),
                 'Percent Upvoted': j['upvote_ratio'],
                 'Number of Comments': int(j['num_comments']),
                 # Text-only posts may lack 'url_overridden_by_dest', so fall back to 'url'
                 'Link': j.get('url_overridden_by_dest', j.get('url'))})
df = pd.DataFrame(rows, columns=df.columns)

In [13]:
# Creating two new columns from existing columns in the dataframe
df['Upvote to Comment Ratio'] = round(df['Number of Upvotes'] / df['Number of Comments'], 2)
df['Controversy Rating'] = 1 / (df['Percent Upvoted'] * df['Upvote to Comment Ratio'])
df = df.sort_values('Controversy Rating', ascending=False)

I created two new columns. The upvote to comment ratio is the number of upvotes divided by the number of comments, and the 'controversy rating' is the reciprocal of the product of that ratio and the percent upvoted. The idea is that more controversial posts attract more comments relative to their upvotes and have a lower share of upvotes among all votes. This is not a perfect metric, but it gives a reasonable first-pass ranking. For best results, I recommend selecting a subreddit that is prone to controversial topics, such as 'news', 'pics', or 'politics'.
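
To make the arithmetic concrete, here is a small worked example that applies the same two steps to a single post. The function name `controversy_rating` is hypothetical. The inputs (55739 upvotes, 13803 comments, 0.83 percent upvoted) are taken from the Utah post in the displayed result.

```python
def controversy_rating(upvotes, comments, percent_upvoted):
    # Step 1: upvote to comment ratio, rounded to 2 decimals as in the dataframe
    upvote_to_comment = round(upvotes / comments, 2)
    # Step 2: reciprocal of (percent upvoted * upvote to comment ratio)
    return 1 / (percent_upvoted * upvote_to_comment)

rating = controversy_rating(55739, 13803, 0.83)
print(round(rating, 6))
```

The ratio rounds to 4.04, so the rating is 1 / (0.83 × 4.04), which agrees with the 0.298223 shown for that post.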

In [14]:
# Displaying result sorted by controversy rating
df

|  | Title | Number of Upvotes | Percent Upvoted | Number of Comments | Link | Upvote to Comment Ratio | Controversy Rating |
|---|---|---|---|---|---|---|---|
| 32 | Utah bans transgender athletes in girls sports... | 55739 | 0.83 | 13803 | https://apnews.com/article/3ffc9205bfbeb05ae3a... | 4.04 | 0.298223 |
| 20 | New Secret Service report details growing ince... | 65846 | 0.8 | 11642 | https://www.cbsnews.com/news/incel-threat-secr... | 5.66 | 0.220848 |
| 46 | National average for a gallon of gas tops $4, ... | 42925 | 0.9 | 6445 | https://www.cnbc.com/2022/03/06/national-avera... | 6.66 | 0.166834 |
| 47 | U.S. home sales tumble; higher prices, mortgag... | 40748 | 0.94 | 6126 | https://www.reuters.com/world/us/us-existing-h... | 6.65 | 0.159974 |
| 5 | As inflation heats up, 64% of Americans are no... | 91959 | 0.9 | 12945 | https://www.cnbc.com/2022/03/08/as-prices-rise... | 7.1 | 0.156495 |
| 42 | Plane carrying Donald Trump made emergency lan... | 48065 | 0.79 | 5677 | https://www.washingtonpost.com/nation/2022/03/... | 8.47 | 0.149448 |
| 40 | Foo Fighters' Taylor Hawkins had 10 different ... | 51694 | 0.88 | 5669 | https://www.cbsnews.com/amp/news/taylor-hawkin... | 9.12 | 0.124601 |
| 6 | Russia's 40-mile convoy has stalled on its way... | 90274 | 0.92 | 8936 | https://www.npr.org/2022/03/01/1083733700/russ... | 10.1 | 0.107619 |
| 33 | Russian fast-food chain backed by parliament t... | 55297 | 0.92 | 4790 | https://www.independent.co.uk/news/world/europ... | 11.54 | 0.09419 |
| 22 | Capitol riot suspect is granted refugee status... | 65599 | 0.86 | 5227 | https://www.cnn.com/2022/03/22/politics/evan-n... | 12.55 | 0.092653 |


In [15]:
export = input("Would you like to export to csv? (y/n): ").lower()

Would you like to export to csv? (y/n):  n


In [16]:
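# Any non-empty answer (including 'n') is truthy, so compare against 'y' explicitly
if export == 'y':
    try:
        df.to_csv('output.csv', index=False)
    except OSError as e:
        print(f'Unable to export: {e}')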