# Applying API Requests On Hacker News Search API. #

---

## Purpose of this project: ##

The objective is to showcase my familiarity using API requests technique to collect data from [Hacker News Search][1]. [Here's][2] the link to the API documentation. After collecting the data, I will build a simple [data pipeline][3] to perform data cleaning.

__Background info:__

Hacker News is a sites where people can share stories, sites, and articles or ask questions or post hiring notices.


## Table of content: ##

* Find an efficient API requests framework and do a test run.
    * [Strategising the API request procedure.](#Strategising-the-API-request-procedure.)
    * [Creating an API requests framework and do a test run.](#Creating-an-API-requests-framework-and-do-a-test-run.)
* Implement an actual run with the API requests framework and save it into CSV file.
    * [Implementing an actual run with created API requests framework.](#Implementing-an-actual-run-with-created-API-requests-framework.)
    * [Save the dataset into CSV file.](#Save-the-dataset-into-CSV-file.)
* [Notes for future reference.](#Notes-for-future-reference.)
    
[1]:https://hn.algolia.com/
[2]:https://hn.algolia.com/api
[3]:https://hn.algolia.com/api
[4]:https://hn.algolia.com/api

In [1]:
import requests
import pandas as pd
from pprint import pprint

## Strategising the API request procedure. ##

Return to [Table of content:](#Table-of-content:)

---

Hacker News API limits the number of API requests from a single IP to 10,000 per hour according to the API documentation. I will have to ensure the requests runs within the rate limit at a constant interval and monitor the requests frequency to avoid breaching the rate limit rule and overloading the server. Each request page will contain 500 stories. 

__Here's the endpoint that I will make a `get` request to:__

- __hn.algolia.com/api/v1/search_by_date?__: This endpoint will collect stories starting from the lastest date.

__Here're the `get` requests parameters:__

- __page:__ From page 0 to 4,999.
- __hitsPerPage:__ 300 stories per request page.
- __tags:__ Data must be "story" type, not "ask".
- __numericFilters:__ Story must have more than 100 points.

## Creating an API requests framework and do a test run. ##

Return to [Table of content:](#Table-of-content:)

---

Before doing an actual run, I will first __observe the result of the API endpoint, create an API requests framework, and do a test run.__

In [2]:
params = {
    "page": 0,
    "hitsPerPage": 10,
    "tags": "story",
    "numericFilters": "points > 100",
    "numericFilters": "created_at_i < 1573344000",
}

response = requests.get("https://hn.algolia.com/api/v1/search_by_date?", params=params)
print("Request status: {}".format(response.status_code))
print("Type: {}; Result: {}".format(type(response.json()), pprint(response.json())))

Request status: 200
{'exhaustiveNbHits': False,
 'hits': [{'_highlightResult': {'author': {'matchLevel': 'none',
                                           'matchedWords': [],
                                           'value': 'doener'},
                                'title': {'matchLevel': 'none',
                                          'matchedWords': [],
                                          'value': 'Fritz AI is the machine '
                                                   'learning platform for iOS '
                                                   'and Android'},
                                'url': {'matchLevel': 'none',
                                        'matchedWords': [],
                                        'value': 'https://www.fritz.ai/'}},
           '_tags': ['story', 'author_doener', 'story_21495397'],
           'author': 'doener',
           'comment_text': None,
           'created_at': '2019-11-09T23:58:56.000Z',
           'created_at_i': 15

---

The requests to the API endpoint is successful. The collected data is a __"story" type__ and the stories are __ordered by date__ starting from the latest. Each story contains __more than 100 points__. Next is to ensure the `hitsPerPage` outputs __500 stories__ exactly. The `hits` key inside the dictionary contains a list of all the stories. Hence, I only have to count the length of the list in order to find out the number of results. 

In [3]:
params = {
    "page": 0,
    "hitsPerPage": 500,
    "tags": "story",
    "numericFilters": "points > 100",
    "numericFilters": "created_at_i < 1573344000",
}

response = requests.get("https://hn.algolia.com/api/v1/search_by_date?", params=params)
print("Request status: {}".format(response.status_code))
print("Type: {}; Result: {}".format(type(response.json()), len(response.json()["hits"])))

Request status: 200
Type: <class 'dict'>; Result: 500


--- 

Looking at the result above, the request outputs 500 stories exactly. The next step is __locate the target keys__ inside the dictionary in order to collect relevant data for analysis. 

__Here're the target keys that can be useful for analysis later:__ 

1. __objectID:__ It is a story ID. Might be useful if I need to make an API request to gather more information on this specific story via __hn.algolia.com/api/v1/items/:id__ endpoint.
1. __title:__ For finding out the topic. 
1. __created_at:__ For determining how current the story is.
1. __points:__ For measuring the popularity of the story. 
1. __num_comments:__ For measuring the readers' participation and interest on the particular story.
1. __url:__ For finding the most referred sites. 

The `hits` key inside the dictionary contains a list of all the stories. Hence, I just have to convert the list of stories into a DataFrame and filter the columns by the target keys. 

In [4]:
ls_hits = response.json()["hits"]

columns = [
    "objectID", "title", "created_at", "points", "num_comments", "url"
]

pd.DataFrame(data=ls_hits)[columns]

Unnamed: 0,objectID,title,created_at,points,num_comments,url
0,21495397,Fritz AI is the machine learning platform for ...,2019-11-09T23:58:56.000Z,3,0,https://www.fritz.ai/
1,21495395,Ask HN: What is a good book to learn about the...,2019-11-09T23:58:26.000Z,4,1,
2,21495392,Python Overtakes Java on GitHub,2019-11-09T23:58:09.000Z,3,0,https://www.zdnet.com/article/programming-lang...
3,21495386,Phase 1 Trial of Genetically Modified Autologo...,2019-11-09T23:56:39.000Z,2,0,https://www.americangene.com/press-releases/in...
4,21495377,Hacktuite (The True Decentralized Static Micro...,2019-11-09T23:54:45.000Z,1,0,
...,...,...,...,...,...,...
495,21489568,TikTok time-bomb: Silly clips raise some serio...,2019-11-09T04:19:40.000Z,2,0,https://www.economist.com/business/2019/11/07/...
496,21489539,Ask HN: What are the best resources for learni...,2019-11-09T04:12:52.000Z,63,13,
497,21489492,Brains Create Identical Objects,2019-11-09T04:00:01.000Z,3,0,https://www.scaruffi.com/phi/syn220.html
498,21489469,Ask HN: Why aren't more companies source code ...,2019-11-09T03:55:51.000Z,1,1,


---

__Find the maximum request frequency to avoid breaching the rate limit rule.__

In [5]:
# This is to find the maximum request frequency to avoid breaching the rate limit rule. 
def rate_limit(total_requests, seconds):
    rate_limit = total_requests / seconds
    print("Request: {}; Max frequency: {} requests/s".format(total_requests, rate_limit))
    
rate_limit(10000, 3600)

Request: 10000; Max frequency: 2.7777777777777777 requests/s


---

Looking at the result above, the request frequency should not exceed 2.7 requests per second. If I were to make 1 request per second, it will take 1,000 seconds to complete it, which is about 17 minutes. 

__Find the optimal request rate.__

In [6]:
def optimal_request_rate(n_requests, sleep_time):
    from time import time, sleep
    from IPython.core.display import clear_output
    
    # Create a 'requested' variable to keep track of the current request made.
    requested = 0
    # Time the request to track the request frequency.
    start = time() 
    
    for n in range(0, n_requests):
        requested += 1
        sleep(sleep_time)

        elapsed_time = time() - start

        print("Request: {}; Frequency: {} requests/s".format(requested, requested / elapsed_time))
        clear_output(wait=True)
        
optimal_request_rate(10, 1)

Request: 10; Frequency: 0.9948941246319378 requests/s


In [7]:
params = {
    "page": 0,
    "hitsPerPage": 500,
    "tags": "story",
    "numericFilters": "points > 100",
    "numericFilters": "created_at_i < 1573344000",
}

target_keys = [
    "objectID", "title", "created_at", "points", "num_comments", "url"
]

url = "https://hn.algolia.com/api/v1/search_by_date?"

def api_requests(n_requests, url, sleep_time, target_keys=None, params={}):
    import numpy as np
    from time import time, sleep
    from IPython.core.display import clear_output
    from warnings import warn
    
    # Create a list to store unsuccessful requests for reference later.
    # Create a 'requested' variable to keep track of the current request made.
    request_failed = np.array([])
    requested = 0
    
    # Create empty DataFrame to store the data.
    df = pd.DataFrame(data=[])

    # Time the request to ensure it is staying within the rate limit.
    start = time() 
    # Set the parameters when the running the api requests.
    params_dict = params.copy()

    # Make n number of requests, which equals to n pages.
    for page in range(0, n_requests):

        # Break the loop if the total request is greater than n_requests (just in case).
        if requested > n_requests:
            warn("Request number is greater than {}}.".format(n_requests))
            break
            
        requested += 1
        response = requests.get("{}".format(url), params=params_dict)
        
        # Rewrite the page number for each request.
        params_dict["page"] += 1
        
        # Sleep for n seconds.
        sleep(sleep_time)
        elapsed_time = time() - start
        
        # Warn unsuccessful status code.
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(requested, response.status_code))
            request_failed = np.append(request_failed, [requested, response.status_code])
        
        # Append the items to the list if the request is successful.
        if response.status_code == 200:
            ls_data = response.json()["hits"]
        
        # Print out the number of requests and time to monitor the progress.
        print("Request: {}; Frequency: {} requests/s".format(requested, requested / elapsed_time))
        clear_output(wait=True)
        
        # If the requests outputs 0 result, stop the API requests.
        results = len(response.json()["hits"])
        if  results == 0:
            warn("Request outputs no more result.")
            return df, request_failed
        
        # Add the collected data to DataFrame. 
        if target_keys: 
            df_collected_data = pd.DataFrame(data=ls_data)[target_keys]
        else:
            df_collected_data = pd.DataFrame(data=ls_data)
        df = pd.concat(objs=[df, df_collected_data], axis=0, join='outer')

    return df, request_failed

__Do a test run with only 5 API requests.__

In [8]:
hn_stories, request_failed = api_requests(n_requests=5, url=url, sleep_time=1, 
                                          target_keys=target_keys, params=params)



In [9]:
hn_stories

Unnamed: 0,objectID,title,created_at,points,num_comments,url
0,21495397,Fritz AI is the machine learning platform for ...,2019-11-09T23:58:56.000Z,3,0,https://www.fritz.ai/
1,21495395,Ask HN: What is a good book to learn about the...,2019-11-09T23:58:26.000Z,4,1,
2,21495392,Python Overtakes Java on GitHub,2019-11-09T23:58:09.000Z,3,0,https://www.zdnet.com/article/programming-lang...
3,21495386,Phase 1 Trial of Genetically Modified Autologo...,2019-11-09T23:56:39.000Z,2,0,https://www.americangene.com/press-releases/in...
4,21495377,Hacktuite (The True Decentralized Static Micro...,2019-11-09T23:54:45.000Z,1,0,
...,...,...,...,...,...,...
495,21482808,Edge on Chromium: Stable Release Comes at the ...,2019-11-08T15:05:20.000Z,4,0,https://techplanet.today/post/edge-on-chromium...
496,21482806,Sir Martin Sorrell’s Silicon Valley Charm Offe...,2019-11-08T15:05:10.000Z,1,0,https://techcrunch.com/2019/11/07/sir-martin-s...
497,21482795,"Kubernetes development, simplified–Skaffold is...",2019-11-08T15:03:21.000Z,3,0,https://cloud.google.com/blog/products/applica...
498,21482781,How to Land a Journal Cover,2019-11-08T15:01:40.000Z,3,0,https://www.nature.com/articles/d41586-019-003...


In [10]:
request_failed

array([], dtype=float64)

In [11]:
hn_stories.duplicated().sum()

0

---

The API requests framework is successful and there is no failed request. The only the exception is that the time interval between each request was halfed, so there might be room to lower the `sleep_time` slightly. 

## Implementing an actual run with created API requests framework. ##

Return to [Table of content:](#Table-of-content:)

In [12]:
hn_stories, request_failed = api_requests(n_requests=1000, url=url, sleep_time=.75, 
                                          target_keys=target_keys, params=params)



In [13]:
hn_stories

Unnamed: 0,objectID,title,created_at,points,num_comments,url
0,21495397,Fritz AI is the machine learning platform for ...,2019-11-09T23:58:56.000Z,3,0,https://www.fritz.ai/
1,21495395,Ask HN: What is a good book to learn about the...,2019-11-09T23:58:26.000Z,4,1,
2,21495392,Python Overtakes Java on GitHub,2019-11-09T23:58:09.000Z,3,0,https://www.zdnet.com/article/programming-lang...
3,21495386,Phase 1 Trial of Genetically Modified Autologo...,2019-11-09T23:56:39.000Z,2,0,https://www.americangene.com/press-releases/in...
4,21495377,Hacktuite (The True Decentralized Static Micro...,2019-11-09T23:54:45.000Z,1,0,
...,...,...,...,...,...,...
495,21482808,Edge on Chromium: Stable Release Comes at the ...,2019-11-08T15:05:20.000Z,4,0,https://techplanet.today/post/edge-on-chromium...
496,21482806,Sir Martin Sorrell’s Silicon Valley Charm Offe...,2019-11-08T15:05:10.000Z,1,0,https://techcrunch.com/2019/11/07/sir-martin-s...
497,21482795,"Kubernetes development, simplified–Skaffold is...",2019-11-08T15:03:21.000Z,3,0,https://cloud.google.com/blog/products/applica...
498,21482781,How to Land a Journal Cover,2019-11-08T15:01:40.000Z,3,0,https://www.nature.com/articles/d41586-019-003...


In [14]:
request_failed

array([], dtype=float64)

## Save the dataset into CSV file. ##

Return to [Table of content:](#Table-of-content:)

---
__Save the dataset as .csv format.__

In [15]:
hn_stories.to_csv("hn_stories.csv", index=False, header=True, sep=",")

## Notes for future reference. ##

__List of things to take note in the future:__

* The API endpoints may have changed.
* Useful data such as `points` and `num_comments` may be outdated. 
* Hacker News might impose a requirement for API authentication. 
* Terms & condition, and license may be updated. It might introduce some limitations to data usage. 

Return to [Table of content:](#Table-of-content:)