# Applying API Requests On Hacker News Search API. #

---

## Purpose of this project: ##

The objective is to showcase my familiarity using API requests technique to collect data from [Hacker News Search][1]. [Here's][2] the link to the API documentation. I will be collecting about 500,000 stories data from Hacker News. After collecting the data, I will build a simple [data pipeline][3] to perform data cleaning.

__Background info:__

Hacker News is a sites where people can share stories, sites, and articles or ask questions or post hiring notices.


## Table of content: ##

* Find an efficient API requests framework and do a test run.
    * [Strategising the API request procedure.](#Strategising-the-API-request-procedure.)
    * [Creating an API requests framework and do a test run.](#Creating-an-API-requests-framework-and-do-a-test-run.)
* Implement an actual run with the API requests framework and save it into CSV file.
    * [Implementing an actual run with created API requests framework.](#Implementing-an-actual-run-with-created-API-requests-framework.)
    * [Save the dataset into CSV file.](#Save-the-dataset-into-CSV-file.)
* [Notes for future reference.](#Notes-for-future-reference.)
    
[1]:https://hn.algolia.com/
[2]:https://hn.algolia.com/api
[3]:https://hn.algolia.com/api
[4]:https://hn.algolia.com/api

In [1]:
import requests
import pandas as pd
from pprint import pprint

## Strategising the API request procedure. ##

Return to [Table of content:](#Table-of-content:)

---

Hacker News API limits the number of API requests from a single IP to 10,000 per hour according to the API documentation. I will have to ensure the requests runs within the rate limit at a constant interval and monitor the requests frequency to avoid breaching the rate limit rule and overloading the server. It will take 1,000 requests to collect about 500,000 stories data if each request page is able to contain 500 stories. 

__Here's the endpoint that I will make a `get` request to:__

- __hn.algolia.com/api/v1/search_by_date?__: This endpoint will collect stories starting from the lastest date.

__Here're the `get` requests parameters:__

- __page:__ From page 0 to 4,999.
- __hitsPerPage:__ 300 stories per request page.
- __tags:__ Data must be "story" type, not "ask".
- __numericFilters:__ Story must have more than 100 points.

## Creating an API requests framework and do a test run. ##

Return to [Table of content:](#Table-of-content:)

---

Before doing an actual run, I will first __observe the result of the API endpoint, create an API requests framework, and do a test run.__

In [2]:
params = {
    "page": 0,
    "hitsPerPage": 10,
    "tags": "story",
    "numericFilters": "points > 100"
}

response = requests.get("https://hn.algolia.com/api/v1/search_by_date?", params=params)
print("Request status: {}".format(response.status_code))
print("Type: {}; Result: {}".format(type(response.json()), pprint(response.json())))

Request status: 200
{'exhaustiveNbHits': False,
 'hits': [{'_highlightResult': {'author': {'matchLevel': 'none',
                                           'matchedWords': [],
                                           'value': 'yskchu'},
                                'title': {'matchLevel': 'none',
                                          'matchedWords': [],
                                          'value': 'Now Googlers Are '
                                                   'Protesting Company’s Deals '
                                                   'with Big Oil'},
                                'url': {'matchLevel': 'none',
                                        'matchedWords': [],
                                        'value': 'https://www.bloomberg.com/news/articles/2019-11-04/now-googlers-are-protesting-company-s-cloud-deals-with-big-oil'}},
           '_tags': ['story', 'author_yskchu', 'story_21451917', 'front_page'],
           'author': 'yskchu',
           'co

---

The requests to the API endpoint is successful. The collected data is a __"story" type__ and the stories are __ordered by date__ starting from the latest. Each story contains __more than 100 points__. Next is to ensure the `hitsPerPage` outputs __500 stories__ exactly. The `hits` key inside the dictionary contains a list of all the stories. Hence, I only have to count the length of the list in order to find out the number of results. 

In [3]:
params = {
    "page": 0,
    "hitsPerPage": 500,
    "tags": "story",
    "numericFilters": "points > 100",
}

response = requests.get("https://hn.algolia.com/api/v1/search_by_date?", params=params)
print("Request status: {}".format(response.status_code))
print("Type: {}; Result: {}".format(type(response.json()), len(response.json()["hits"])))

Request status: 200
Type: <class 'dict'>; Result: 500


--- 

Looking at the result above, the request outputs 500 stories exactly. The next step is __locate the target keys__ inside the dictionary in order to collect relevant data for analysis. 

__Here're the target keys that can be useful for analysis later:__ 

1. __objectID:__ It is a story ID. Might be useful if I need to make an API request to gather more information on this specific story via __hn.algolia.com/api/v1/items/:id__ endpoint.
1. __title:__ For finding out the topic. 
1. __created_at:__ For determining how current the story is.
1. __points:__ For measuring the popularity of the story. 
1. __num_comments:__ For measuring the readers' participation and interest on the particular story.
1. __url:__ For finding the most referred sites. 

The `hits` key inside the dictionary contains a list of all the stories. Hence, I just have to convert the list of stories into a DataFrame and filter the columns by the target keys. 

In [4]:
ls_hits = response.json()["hits"]

columns = [
    "objectID", "title", "created_at", "points", "num_comments", "url"
]

pd.DataFrame(data=ls_hits)[columns]

Unnamed: 0,objectID,title,created_at,points,num_comments,url
0,21451917,Now Googlers Are Protesting Company’s Deals wi...,2019-11-05T12:55:24.000Z,124,142,https://www.bloomberg.com/news/articles/2019-1...
1,21451847,Facebook Libra Is Architecturally Unsound,2019-11-05T12:44:52.000Z,251,113,http://www.stephendiehl.com/posts/libra.html
2,21451434,The Windows Update Marathon in a VM: From Wind...,2019-11-05T11:36:56.000Z,154,78,https://www.winhistory.de/more/386/updatem.htm
3,21450402,Unofficial Windows XP SP4,2019-11-05T07:29:09.000Z,123,77,https://ryanvm.net/forum/viewtopic.php?t=10321
4,21450126,Free Online Courses from Top Universities,2019-11-05T06:17:18.000Z,237,33,http://www.openculture.com/freeonlinecourses
...,...,...,...,...,...,...
495,21345142,Obscure Charges That Utility Companies Add to ...,2019-10-24T14:43:18.000Z,119,29,https://www.propublica.org/article/the-obscure...
496,21344522,Congressman's phone password is 111111,2019-10-24T13:36:35.000Z,321,194,https://gfycat.com/uncommonacclaimedboar
497,21343989,Show HN: Passbox – Give access to your data on...,2019-10-24T12:32:52.000Z,110,133,https://passbox.co
498,21343860,Web Based Qt Design Viewer,2019-10-24T12:19:35.000Z,132,88,https://www.qt.io/blog/web-based-qt-design-viewer


---

__Find the maximum request frequency to avoid breaching the rate limit rule.__

In [5]:
# This is to find the maximum request frequency to avoid breaching the rate limit rule. 
def rate_limit(total_requests, seconds):
    rate_limit = total_requests / seconds
    print("Request: {}; Max frequency: {} requests/s".format(total_requests, rate_limit))
    
rate_limit(10000, 3600)

Request: 10000; Max frequency: 2.7777777777777777 requests/s


---

Looking at the result above, the request frequency should not exceed 2.7 requests per second. If I were to make 1 request per second, it will take 1,000 seconds to complete it, which is about 17 minutes. 

__Find the optimal request rate.__

In [6]:
def optimal_request_rate(n_requests, sleep_time):
    from time import time, sleep
    from IPython.core.display import clear_output
    
    # Create a 'requested' variable to keep track of the current request made.
    requested = 0
    # Time the request to track the request frequency.
    start = time() 
    
    for n in range(0, n_requests):
        requested += 1
        sleep(sleep_time)

        elapsed_time = time() - start

        print("Request: {}; Frequency: {} requests/s".format(requested, requested / elapsed_time))
        clear_output(wait=True)
        
optimal_request_rate(10, 1)

Request: 10; Frequency: 0.9953925021145811 requests/s


In [7]:
params = {
    "page": 0,
    "hitsPerPage": 500,
    "tags": "story",
    "numericFilters": "points > 100",
}

target_keys = [
    "objectID", "title", "created_at", "points", "num_comments", "url"
]

url = "https://hn.algolia.com/api/v1/search_by_date?"

def api_requests(n_requests, url, sleep_time, target_keys=None, params={}):
    import numpy as np
    from time import time, sleep
    from IPython.core.display import clear_output
    from warnings import warn
    
    # Create a list to store unsuccessful requests for reference later.
    # Create a 'requested' variable to keep track of the current request made.
    request_failed = np.array([])
    requested = 0
    
    # Create empty DataFrame to store the data.
    df = pd.DataFrame(data=[])

    # Time the request to ensure it is staying within the rate limit.
    start = time() 
    # Set the parameters when the running the api requests.
    params = params

    # Make n number of requests, which equals to n pages.
    for page in range(0, n_requests):

        # Break the loop if the total request is greater than n_requests (just in case).
        if requested > n_requests:
            warn("Request number is greater than {}}.".format(n_requests))
            break
            
        requested += 1
        response = requests.get("{}".format(url), params=params)
        
        # Sleep for n seconds.
        sleep(sleep_time)
        elapsed_time = time() - start
        
        # Warn unsuccessful status code.
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(requested, response.status_code))
            request_failed = np.append(request_failed, [requested, response.status_code])
        
        # Append the items to the list if the request is successful.
        if response.status_code == 200:
            ls_data = response.json()["hits"]
        
        # Print out the number of requests and time to monitor the progress.
        print("Request: {}; Frequency: {} requests/s".format(requested, requested / elapsed_time))
        clear_output(wait=True)
        
        # Add the collected data to DataFrame. 
        if target_keys: 
            df_collected_data = pd.DataFrame(data=ls_data)[target_keys]
        else:
            df_collected_data = pd.DataFrame(data=ls_data)
        df = pd.concat(objs=[df, df_collected_data], axis=0)

    return df, request_failed

__Do a test run with only 5 API requests.__

In [8]:
hn_stories, request_failed = api_requests(n_requests=5, url=url, sleep_time=1, 
                                       target_keys=target_keys, params=params)

Request: 5; Frequency: 0.4291873916544104 requests/s


In [9]:
hn_stories

Unnamed: 0,objectID,title,created_at,points,num_comments,url
0,21451917,Now Googlers Are Protesting Company’s Deals wi...,2019-11-05T12:55:24.000Z,125,142,https://www.bloomberg.com/news/articles/2019-1...
1,21451847,Facebook Libra Is Architecturally Unsound,2019-11-05T12:44:52.000Z,251,113,http://www.stephendiehl.com/posts/libra.html
2,21451434,The Windows Update Marathon in a VM: From Wind...,2019-11-05T11:36:56.000Z,154,78,https://www.winhistory.de/more/386/updatem.htm
3,21450402,Unofficial Windows XP SP4,2019-11-05T07:29:09.000Z,123,77,https://ryanvm.net/forum/viewtopic.php?t=10321
4,21450126,Free Online Courses from Top Universities,2019-11-05T06:17:18.000Z,237,33,http://www.openculture.com/freeonlinecourses
...,...,...,...,...,...,...
495,21345142,Obscure Charges That Utility Companies Add to ...,2019-10-24T14:43:18.000Z,119,29,https://www.propublica.org/article/the-obscure...
496,21344522,Congressman's phone password is 111111,2019-10-24T13:36:35.000Z,321,194,https://gfycat.com/uncommonacclaimedboar
497,21343989,Show HN: Passbox – Give access to your data on...,2019-10-24T12:32:52.000Z,110,133,https://passbox.co
498,21343860,Web Based Qt Design Viewer,2019-10-24T12:19:35.000Z,132,88,https://www.qt.io/blog/web-based-qt-design-viewer


In [10]:
request_failed

array([], dtype=float64)

---

The API requests framework is successful and there is no failed request. The only the exception is that the time interval between each request was halfed, so there might be room to lower the `sleep_time` slightly. 

## Implementing an actual run with created API requests framework. ##

Return to [Table of content:](#Table-of-content:)

In [11]:
hn_stories, request_failed = api_requests(n_requests=1000, url=url, sleep_time=.75, 
                                       target_keys=target_keys, params=params)

Request: 1000; Frequency: 0.4687252955027714 requests/s


In [12]:
hn_stories

Unnamed: 0,objectID,title,created_at,points,num_comments,url
0,21456399,Ask HN: How is your mental health?,2019-11-05T20:36:52.000Z,108,88,
1,21455739,Factual Inaccuracies of “Facebook Libra Is Arc...,2019-11-05T19:26:41.000Z,101,28,https://tonyarcieri.com/factual-inaccuracies-o...
2,21455276,AT&T to pay $60M over U.S. allegations it lied...,2019-11-05T18:39:30.000Z,123,75,https://www.reuters.com/article/us-at-t-settle...
3,21455231,Risky Mortgage Bonds Are Back and Delinquencie...,2019-11-05T18:35:20.000Z,152,90,https://www.bloomberg.com/news/articles/2019-1...
4,21455033,World Scientists’ Warning of a Climate Emergency,2019-11-05T18:17:21.000Z,103,71,https://academic.oup.com/bioscience/advance-ar...
...,...,...,...,...,...,...
495,21352277,Mir Books,2019-10-25T05:58:41.000Z,132,28,https://mirtitles.org/
496,21352239,Double Slit Experiment No Mystery,2019-10-25T05:50:39.000Z,106,116,https://billwadge.wordpress.com/2019/10/25/dou...
497,21352161,Gmail marking email from me as spam,2019-10-25T05:31:49.000Z,475,392,https://www.mail-archive.com/mailop@mailop.org...
498,21352045,An Illustrated Guide to OAuth and OpenID Connect,2019-10-25T05:02:18.000Z,431,74,https://developer.okta.com/blog/2019/10/21/ill...


In [13]:
request_failed

array([], dtype=float64)

## Save the dataset into CSV file. ##

Return to [Table of content:](#Table-of-content:)

---
__Save the dataset as .csv format.__

In [14]:
hn_stories.to_csv("hn_stories.csv", index=False, header=True, sep=",")

## Notes for future reference. ##

__List of things to take note in the future:__

* The API endpoints may have changed.
* Useful data such as `points` and `num_comments` may be outdated. 
* Hacker News might impose a requirement for API authentication. 
* Terms & condition, and license may be updated. It might introduce some limitations to data usage. 

Return to [Table of content:](#Table-of-content:)