# Exercise

For your exercise do the following:

1. Choose a Reddit page you want to crawl
2. The following fields should be present when you crawl **(10 points)**:
    - author
    - subreddit
    - date created 
    - number of comments 
    - score
    - submission title 
    - submission description
3. After crawling, save your results to a pandas dataframe **(3 points)**. 
4. Answer the following questions **(12 points)**:
    - How many submissions were you able to gather? 
    - Who has the most submissions? 
    - Which submission has the highest score? 
    - Which submission has the highest number of comments?
    - Which day of the week has the most submissions? 
    
**Tip:** _For item 4, recall how to use the aggregation functions in `pandas` such as `count`, `value_counts`, and `sum`. To get the day of the week, look into how to read `dayofweek` from a datetime object in `pandas`. (Hint: You may need to use `pd.to_datetime` to convert your date column.)_
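
As a minimal sketch of the hint, the snippet below converts three `created_utc` values that appear later in this notebook's output into datetimes and reads off the weekday. The frame name `demo` and its column names are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Three epoch timestamps (seconds) taken from the crawl output in this notebook
demo = pd.DataFrame({"created_utc": [1591114811, 1591682373, 1592027539]})

# unit="s" tells pandas the integers are seconds since 1970-01-01 UTC
demo["created"] = pd.to_datetime(demo["created_utc"], unit="s")

# dayofweek is numeric (Monday=0 ... Sunday=6); day_name() gives the label
demo["dayofweek"] = demo["created"].dt.dayofweek
demo["day_name"] = demo["created"].dt.day_name()

print(demo[["created", "dayofweek", "day_name"]])
```

The `.dt` accessor works on the whole column at once, so no `apply` with a lambda is needed.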

In [1]:
import datetime as dt
import time

import pandas as pd
import requests

In [2]:
URL = "https://api.pushshift.io/reddit/submission/search/"  #query submissions
PARAMS = {
    'after': 1591056000, #get dates after June 2, 2020
    'before': 1597017600, #get dates before August 10, 2020
    'sort_type': 'score', # sort by score
    'sort': 'desc', # sort in descending order
    'subreddit': 'valorant', # do a search on valorant subreddit
    'size': 30, # return at most 30 search results
}

#use the requests library to query pushshift api
r = requests.get(url = URL, params=PARAMS)
#parse returned data to a json object
r.json()

{'data': [{'all_awardings': [{'award_sub_type': 'GLOBAL',
     'award_type': 'global',
     'coin_price': 70,
     'coin_reward': 0,
     'count': 1,
     'days_of_drip_extension': 0,
     'days_of_premium': 0,
     'description': '*Lowers face into palm*',
     'end_date': None,
     'giver_coin_reward': 0,
     'icon_format': 'PNG',
     'icon_height': 2048,
     'icon_url': 'https://i.redd.it/award_images/t5_22cerq/ey2iodron2s41_Facepalm.png',
     'icon_width': 2048,
     'id': 'award_b1b44fa1-8179-4d84-a9ed-f25bb81f1c5f',
     'is_enabled': True,
     'is_new': False,
     'name': 'Facepalm',
     'penny_donate': 0,
     'penny_price': 0,
     'resized_icons': [{'height': 16,
       'url': 'https://preview.redd.it/award_images/t5_22cerq/ey2iodron2s41_Facepalm.png?width=16&amp;height=16&amp;auto=webp&amp;s=d06b7de23ce8b8ea0f3e7cfd15033ac4893b72f0',
       'width': 16},
      {'height': 32,
       'url': 'https://preview.redd.it/award_images/t5_22cerq/ey2iodron2s41_Facepalm.png?widt
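
The response above is a dictionary whose `data` key holds a list of submission records, each carrying many keys that the exercise does not need. The sketch below shows one way to keep only the wanted keys. The `payload` here is a hypothetical, heavily shortened stand-in for `r.json()`, and `pick_fields` is an illustrative helper, not part of any library.

```python
# Hypothetical, shortened stand-in for r.json(); real records carry many more keys
payload = {
    "data": [
        {"author": "NadnerbEey_", "score": 52, "num_comments": 226, "all_awardings": []},
        {"author": "justemaaz", "score": 47, "num_comments": 177, "all_awardings": []},
    ]
}

WANTED = ["author", "score", "num_comments"]

def pick_fields(record, wanted=WANTED):
    # Keep only the wanted keys; .get() returns None when a key is missing
    return {key: record.get(key) for key in wanted}

# .get("data", []) avoids a KeyError if the response has no "data" key
rows = [pick_fields(rec) for rec in payload.get("data", [])]
print(rows)
```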

In [3]:
# Question 2 - Fields present in the crawl

def to_utc(date):
    #This function converts a naive datetime, interpreted as UTC, to a Unix timestamp in seconds.
    #This automates the conversion of dates instead of going to https://www.unixtimeconverter.io/ 
    return int(date.replace(tzinfo=dt.timezone.utc).timestamp())
    
def to_readable_date(timestamp):
    #This function converts a Unix timestamp to a Year-Month-Day string in UTC 
    return dt.datetime.fromtimestamp(timestamp, tz=dt.timezone.utc).strftime("%Y-%m-%d")

#Declare start and end of reddit posts to extract 
start_date = dt.datetime.strptime("2020-06-02", "%Y-%m-%d")
end_date = dt.datetime.strptime("2020-08-10", "%Y-%m-%d")

#Create a range of dates to iterate over 
#Note: periods is the number of consecutive days generated from the start date 
#We add 2 so that the range runs through August 11, the day after end_date; 
#this way the last window (August 10 to August 11) covers the final day 
date_range = (pd.date_range(
                start_date, 
                periods=(end_date - start_date).days + 2)
              .tolist())

#prepare the parameters needed to call the API
sort_type="score"
sort="desc"
#fields to request for each submission (Question 2)
fields=["author","subreddit","created_utc","num_comments","score", "title", "selftext", "id"]
subreddit = 'valorant'
url = "https://api.pushshift.io/reddit/submission/search/"
results = []
#loop through the dates 
for i, s_date in enumerate(date_range):
    #prevents us from getting an index out of range error
    if i != len(date_range)-1:
        #declare end date 
        e_date = date_range[i+1]
        #call the API
        r = requests.get(url = url, params={
            'after': to_utc(s_date),
            'before': to_utc(e_date),
            'sort_type': sort_type,
            'sort': sort,
            'subreddit': subreddit,
            'fields': fields,
            "size": 500
        })

        #add logs 
        print(f"Doing {s_date.strftime('%Y-%m-%d')} to {e_date.strftime('%Y-%m-%d')}")
        if r.status_code == 200:
            results.append(r.json()['data'])
            print("=====Done")
        else:
            print("=====Skipped")
        #pause for 1 second between calls so that we don't get blocked for abusing the API
        time.sleep(1)

Doing 2020-06-02 to 2020-06-03
=====Done
Doing 2020-06-03 to 2020-06-04
=====Done
Doing 2020-06-04 to 2020-06-05
=====Done
Doing 2020-06-05 to 2020-06-06
=====Done
Doing 2020-06-06 to 2020-06-07
=====Done
Doing 2020-06-07 to 2020-06-08
=====Done
Doing 2020-06-08 to 2020-06-09
=====Done
Doing 2020-06-09 to 2020-06-10
=====Done
Doing 2020-06-10 to 2020-06-11
=====Done
Doing 2020-06-11 to 2020-06-12
=====Done
Doing 2020-06-12 to 2020-06-13
=====Done
Doing 2020-06-13 to 2020-06-14
=====Done
Doing 2020-06-14 to 2020-06-15
=====Done
Doing 2020-06-15 to 2020-06-16
=====Done
Doing 2020-06-16 to 2020-06-17
=====Done
Doing 2020-06-17 to 2020-06-18
=====Done
Doing 2020-06-18 to 2020-06-19
=====Done
Doing 2020-06-19 to 2020-06-20
=====Done
Doing 2020-06-20 to 2020-06-21
=====Done
Doing 2020-06-21 to 2020-06-22
=====Done
Doing 2020-06-22 to 2020-06-23
=====Done
Doing 2020-06-23 to 2020-06-24
=====Done
Doing 2020-06-24 to 2020-06-25
=====Done
Doing 2020-06-25 to 2020-06-26
=====Done
Doing 2020-06-26
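
The loop above requests one day at a time by pairing each date with the next one. A minimal sketch of just that windowing step, without any network call, is shown below. The names `days`, `to_epoch`, and `windows` are hypothetical; the dates are the ones used in this notebook.

```python
import datetime as dt

import pandas as pd

start_date = dt.datetime(2020, 6, 2)
end_date = dt.datetime(2020, 8, 10)

# One extra day past end_date so the final window covers August 10 itself
days = pd.date_range(start_date, end_date + dt.timedelta(days=1), freq="D")

def to_epoch(day):
    # Treat the naive date as UTC and convert it to whole seconds
    return int(day.replace(tzinfo=dt.timezone.utc).timestamp())

# Pair each day with the following day: (after, before) for every request
windows = [(to_epoch(a), to_epoch(b)) for a, b in zip(days[:-1], days[1:])]

print(len(windows), windows[0], windows[-1])
```

Pairing with `zip` removes the need for the index check that guards against running past the end of the list. This range yields 70 daily windows. The 7000 rows reported later in the notebook are consistent with 100 results per window, which may mean the API capped each response below the requested `size` of 500; that is an inference from the counts, not something the output states.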

In [4]:
results

[[{'author': 'NadnerbEey_',
   'created_utc': 1591114811,
   'id': 'gvabt4',
   'num_comments': 226,
   'score': 52,
   'selftext': '',
   'subreddit': 'VALORANT',
   'title': "A simple reload cockblocked Reyna's potential."},
  {'author': 'justemaaz',
   'created_utc': 1591111018,
   'id': 'gv9552',
   'num_comments': 177,
   'score': 47,
   'selftext': '• Mercato del pesco = “pesco” is the tree that produces peaches; the correct translation for “fish market” is “mercato del pesce”\n\n• Meccanico di barche = it does not make sense, since we don’t have “boat mechanics” in Italy; the closest thing we have is “officina navale”, which is the translation of “naval/marine workshop”\n\n• Polpo troppo = Not sure what this is supposed to mean, since the literal translation is “too much octopus” (not even this, cause the correct form would be “troppo polpo” not “polpo troppo”). Is the fish market having an abundance of octopuses and its informing customers about that? \n\n• Gelato = ice cream s

In [5]:
# Question 3 - Save results to pandas dataframe

flat_list = []

for sublist in results:
    if sublist is not None:
        for item in sublist:
            flat_list.append(item)

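df = pd.DataFrame(flat_list)
display(df.head())
#Note: to_csv also writes the row index, which comes back as an "Unnamed: 0" column when the file is read again
df.to_csv("reddit_valorant.csv")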

Unnamed: 0,author,created_utc,id,num_comments,score,selftext,subreddit,title
0,NadnerbEey_,1591114811,gvabt4,226,52,,VALORANT,A simple reload cockblocked Reyna's potential.
1,justemaaz,1591111018,gv9552,177,47,• Mercato del pesco = “pesco” is the tree that...,VALORANT,Italian in the new map is an atrocity- here al...
2,molenzwiebel,1591107983,gv8a0o,2,41,Looking for players to play with? Check out ou...,VALORANT,"Patch 1.0 Bug Megathread, Known Launch Issues,..."
3,OWPD,1591114871,gvaciy,1140,41,here is all the juice enjoy :D [https://twitte...,VALORANT,Valorant cheaters heartbroken knowing they sti...
4,mixtoday,1591107733,gv87jx,26,27,I'm actually trying to start my 4th game in a ...,VALORANT,Servers down?
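
The flattening and saving steps can also be sketched with `itertools.chain` and an in-memory CSV round trip. The `results_demo` list is a hypothetical miniature of `results`, reusing a few ids and scores from the output in this notebook; passing `index=False` is an optional choice shown here, not what the cell above does.

```python
import io
from itertools import chain

import pandas as pd

# Hypothetical miniature of `results`: one list of records per daily request
results_demo = [
    [{"id": "gvabt4", "score": 52}, {"id": "gv9552", "score": 47}],
    [],  # a day that returned nothing
    [{"id": "gzh5x4", "score": 17301}],
]

# chain.from_iterable concatenates the inner lists without an explicit loop
flat = list(chain.from_iterable(results_demo))
frame = pd.DataFrame(flat)

# Round trip through CSV in memory; index=False avoids an extra "Unnamed: 0" column
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.columns.tolist())
```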


In [6]:
#read back the file saved in the previous cell (assumes the same working directory)
df = pd.read_csv('reddit_valorant.csv')
df.head(10)

Unnamed: 0.1,Unnamed: 0,author,created_utc,id,num_comments,score,selftext,subreddit,title
0,0,NadnerbEey_,1591114811,gvabt4,226,52,,VALORANT,A simple reload cockblocked Reyna's potential.
1,1,justemaaz,1591111018,gv9552,177,47,• Mercato del pesco = “pesco” is the tree that...,VALORANT,Italian in the new map is an atrocity- here al...
2,2,molenzwiebel,1591107983,gv8a0o,2,41,Looking for players to play with? Check out ou...,VALORANT,"Patch 1.0 Bug Megathread, Known Launch Issues,..."
3,3,OWPD,1591114871,gvaciy,1140,41,here is all the juice enjoy :D [https://twitte...,VALORANT,Valorant cheaters heartbroken knowing they sti...
4,4,mixtoday,1591107733,gv87jx,26,27,I'm actually trying to start my 4th game in a ...,VALORANT,Servers down?
5,5,abh998,1591107796,gv885v,14,21,I've been getting this error when I'm trying t...,VALORANT,Unexpected provisioning error?
6,6,molenzwiebel,1591107903,gv896w,878,19,"No launch is perfect, and as such there are pl...",VALORANT,Known Launch Issues Megathread
7,7,mirecarrot,1591114377,gva6uv,28,18,I made this version of valorant wallpapers for...,VALORANT,VALORANT Background Ultra-wide
8,8,Noobface_,1591122322,gvcqkb,185,18,,VALORANT,New spot on Ascent might need a patch
9,9,World_of_tech,1591110063,gv8v4a,41,18,I cannot get in to match game just stuck afte...,VALORANT,What happened to servers?


In [7]:
df.describe()

Unnamed: 0.1,Unnamed: 0,created_utc,num_comments,score
count,7000.0,7000.0,7000.0,7000.0
mean,3499.5,1594085000.0,29.752143,79.726
std,2020.870275,1739784.0,115.962755,629.317295
min,0.0,1591108000.0,0.0,1.0
25%,1749.75,1592592000.0,1.0,1.0
50%,3499.5,1594080000.0,4.0,3.0
75%,5249.25,1595595000.0,16.0,9.0
max,6999.0,1597065000.0,2614.0,17301.0


In [8]:
#Question 4.1 How many submissions were you able to gather?

len(df)
# df.count()

7000

In [9]:
#Question 4.2 Who has the most submissions?

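#"[deleted]" groups posts from removed accounts, so it is not a single author;
#the second row is the most active identifiable author
df['author'].value_counts(sort=True).head(2)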

[deleted]    123
Darkoplax     12
Name: author, dtype: int64
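
Because `[deleted]` is a placeholder rather than one person, it can be filtered out before counting. The sketch below uses a small hypothetical `authors` series, not the crawled data.

```python
import pandas as pd

# Hypothetical author column; "[deleted]" is Reddit's placeholder for removed accounts
authors = pd.Series(
    ["[deleted]", "Darkoplax", "[deleted]", "Darkoplax", "OWPD", "[deleted]"],
    name="author",
)

# Drop the placeholder before counting so that it cannot come out on top
known = authors[authors != "[deleted]"]
counts = known.value_counts()

top_author = counts.idxmax()
print(top_author, counts.max())
```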

In [10]:
#Question 4.3 Which submission has the highest score?

#  scoresubmissions = df[['id', 'score']]
#  print(scoresubmissions[scoresubmissions['score']==scoresubmissions['score'].max()])

df.nlargest(1, 'score')

Unnamed: 0.1,Unnamed: 0,author,created_utc,id,num_comments,score,selftext,subreddit,title
700,700,AlexT__,1591682373,gzh5x4,406,17301,,VALORANT,Funny glitch: spectating makes the Prime colle...


In [11]:
#Question 4.4 Which submission has the highest number of comments?

#  commentsubmission = df[['id', 'num_comments']]
#  print(commentsubmission[commentsubmission['num_comments']==commentsubmission['num_comments'].max()])

df.nlargest(1, 'num_comments')

Unnamed: 0.1,Unnamed: 0,author,created_utc,id,num_comments,score,selftext,subreddit,title
1102,1102,JustGavinBennett,1592027539,h81o1f,2614,2115,"The skins in this game look fantastic, but rea...",VALORANT,$71 Skin Bundles Shouldn’t Be A Thing
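
`nlargest(1, ...)` returns a single row even when several submissions share the top value. As a small sketch on hypothetical data, `keep="all"` returns every tied row, and `idxmax` gives the index label of the first maximum.

```python
import pandas as pd

# Hypothetical posts; "b" and "c" tie for the highest score
posts = pd.DataFrame(
    {"id": ["a", "b", "c"], "score": [10, 25, 25], "num_comments": [7, 3, 1]}
)

top_first = posts.nlargest(1, "score")            # first of the tied rows only
top_all = posts.nlargest(1, "score", keep="all")  # every row tied for first place

# idxmax returns the index label of the maximum; .loc then looks up that row
most_commented_id = posts.loc[posts["num_comments"].idxmax(), "id"]

print(top_first["id"].tolist(), sorted(top_all["id"]), most_commented_id)
```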


In [12]:
#Question 4.5 Which day of the week has the most submissions?

# code guide: Epoch Timestamps: https://pandas-docs.github.io/pandas-docs-travis/user_guide/timeseries.html
# code guide: Day of Week: https://stackoverflow.com/questions/9847213/how-do-i-get-the-day-of-week-given-a-date
# code guide: Day int to calendar day: https://stackoverflow.com/questions/36341484/get-day-name-from-weekday-int

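df['timeStamp'] = pd.to_datetime(df['created_utc'], unit='s')
#dayofweek numbers the days from Monday=0 to Sunday=6
df['Day of Week'] = df['timeStamp'].dt.dayofweek
#count the submissions per day; idxmax then gives the day number with the most rows
#(nlargest on this column would only return the largest day number, not the most frequent day)
busiest_day = df['Day of Week'].value_counts().idxmax()
busiest_day

In [13]:
#Question 4.5 Which day of the week has the most submissions?

import calendar
calendar.day_name[int(busiest_day)]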
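
The largest weekday number and the most frequent weekday are different quantities. The sketch below contrasts them on a hypothetical `weekday` series rather than on the crawled data.

```python
import calendar

import pandas as pd

# Hypothetical weekday numbers (Monday=0 ... Sunday=6) for seven submissions
weekday = pd.Series([1, 1, 1, 5, 5, 6, 0], name="Day of Week")

# nlargest looks at the values themselves: 6 is simply the largest number present
largest_number = int(weekday.nlargest(1).iloc[0])

# value_counts tallies rows per weekday; idxmax picks the weekday with most rows
most_frequent = int(weekday.value_counts().idxmax())

print(calendar.day_name[largest_number], calendar.day_name[most_frequent])
```

In this toy series Sunday is the largest number, but Tuesday has the most submissions, so only the `value_counts` route answers the question.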