# Applying API Requests On Hacker News Stories. #

---

## Purpose of this project: ##

The objective is to showcase my familiarity with using API requests to collect story data from [Hacker News (Y Combinator)][1]. The API documentation is [here][2].


__Background info:__

Hacker News is a site where people can share stories, sites, and articles, ask questions, or post hiring notices.

## Table of content: ##

* Find an efficient API requests framework and do a test run.
    * [Strategising the API request procedure.](#Strategising-the-API-request-procedure.)
    * [Creating the 'get' requests framework and do a test run.](#Creating-the-'get'-requests-framework-and-do-a-test-run.)
    * [Making 'get' request to collect more ids and compile them into a list.](#Making-'get'-request-to-collect-more-ids-and-compile-them-into-a-list.)
    
* Implement an actual run with the API requests framework and save the result into a CSV file.
    * [Running an actual 'get' request.](#Running-an-actual-'get'-request.)
    * [Convert into 'dataframe' object and save the dataset into CSV file.](#Convert-into-'dataframe'-object-and-save-the-dataset-into-CSV-file.)
* [Notes for future reference.](#Notes-for-future-reference.)
    
[1]:https://news.ycombinator.com/
[2]:https://github.com/HackerNews/API

In [1]:
import requests
import pandas as pd

## Strategising the API request procedure. ##

Return to [Table of content:](#Table-of-content:)

---

Each story has its own unique id. According to the API documentation, each list endpoint returns only a limited number of ids. Here are the endpoints that I will send a `get` request to in order to collect ids:

1. /beststories.json
1. /topstories.json
1. /askstories.json
1. /showstories.json

After collecting those ids, I will then make a `get` request to the __/item/id__ endpoint. Because the API returns one item per request, a `get` request has to be made individually for each id by looping through the id list. Although the API has no rate limit, this takes a lot of time.
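As a rough sketch of the two-step procedure described above, the URLs for both steps can be built from one base address. The base URL is the one used in the requests later in this notebook; the helper names `list_url` and `item_url` are made up for illustration and are not part of the API.

```python
BASE_URL = "https://hacker-news.firebaseio.com/v0"

# The four list endpoints named above.
LIST_ENDPOINTS = ["beststories", "topstories", "askstories", "showstories"]


def list_url(name):
    """Return the URL that lists story ids for one endpoint name."""
    return "{}/{}.json".format(BASE_URL, name)


def item_url(id_):
    """Return the URL for a single item; one request is needed per id."""
    return "{}/item/{}.json".format(BASE_URL, id_)


list_urls = [list_url(name) for name in LIST_ENDPOINTS]
print(list_urls[0])
print(item_url(20542107))
```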

__Based on the API documentation, each id may contain the following data:__

* __id:__ The item's unique id.
* __deleted:__ `true` if the item is deleted.
* __type:__ The type of item. One of "job", "story", "comment", "poll", or "pollopt".
* __by:__ The username of the item's author.
* __time:__ Creation date of the item, in [Unix Time](http://en.wikipedia.org/wiki/Unix_time).
* __text:__ The comment, story or poll text. HTML.
* __dead:__ `true` if the item is dead.
* __parent:__ The comment's parent: either another comment or the relevant story.
* __poll:__ The pollopt's associated poll.
* __kids:__ The ids of the item's comments, in ranked display order.
* __url:__ The URL of the story.
* __score:__ The story's score, or the votes for a pollopt.
* __title:__ The title of the story, poll or job.
* __parts:__ A list of related pollopts, in display order.
* __descendants:__ In the case of stories or polls, the total comment count.

__Here is the data that can be useful for analysis later:__

1. __type:__ Each item may be a job, story, comment, poll, or pollopt. This will help to filter the items later.
1. __time:__ For determining how current the story is, after converting it to datetime.
1. __text:__ For text analysis exploration.
1. __url:__ For finding the popular sites that most readers go to.
1. __score:__ For measuring the popularity of the story.
1. __title:__ For finding out the topic.
1. __descendants:__ For measuring readers' participation and interest in a particular topic.
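To illustrate two of the fields above, here is a minimal sketch of converting `time` to a datetime and reducing `url` to its site name, using only the standard library. The record is hypothetical: the timestamp is in the same format the API returns, and the URL is a shortened stand-in.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

# Hypothetical record in the shape the item endpoint returns.
item = {"time": 1564236658, "url": "https://www.eff.org/deeplinks/"}

# Unix time counts seconds since 1970-01-01 UTC.
created = datetime.fromtimestamp(item["time"], tz=timezone.utc)

# The network location is the part of the URL that identifies the site.
site = urlparse(item["url"]).netloc

print(created.isoformat(), site)
```

On a pandas column, the same conversion is usually written as `pd.to_datetime(column, unit="s")`.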

---

Before doing an actual run, I will first create a framework and do a test run. I will start out by getting the best stories via the __/beststories.json__ endpoint.

__Get a list of best stories.__

In [2]:
response1 = requests.get("https://hacker-news.firebaseio.com/v0/beststories.json")
print("Request status: {}".format(type(response1.status_code)))
print("Type: {}; Result: {}".format(type(response1.text), response1.text))

Request status: <class 'int'>
Type: <class 'str'>; Result: [20542107,20531987,20532763,20533576,20549457,20542862,20537409,20540852,20533026,20530046,20531039,20541492,20535698,20535008,20536288,20530350,20551615,20538614,20549804,20533923,20533318,20543646,20531955,20541781,20545561,20535385,20547731,20537941,20552797,20539387,20533446,20529001,20539012,20552054,20547228,20531541,20546356,20555249,20554194,20541446,20531394,20545276,20546288,20531334,20539978,20535390,20532785,20549685,20544058,20543495,20542470,20550165,20550709,20549422,20547441,20544395,20540795,20542002,20535654,20536652,20535628,20544076,20552752,20544222,20542738,20541188,20545257,20549424,20546231,20554806,20550656,20541422,20540693,20541275,20552675,20530813,20529314,20540379,20556095,20550167,20549056,20535293,20536828,20552365,20549706,20538158,20533096,20555229,20534963,20529689,20549080,20539922,20551148,20549354,20544564,20542606,20535984,20531616,20529312,20553024,20549442,20548115,20540724,20535872,2055

---
Looking at the result above, the response body is a string. I will convert the ids into integers before making the `get` requests.

In [3]:
import re

# Split the string values.
# Create a list to contain each of the values and convert every string value into integer.
id_ls = response1.text.split(",")
id_ls = [int(re.sub(r"\D", "", id_)) for id_ in id_ls]
id_ls[:10]

[20542107,
 20531987,
 20532763,
 20533576,
 20549457,
 20542862,
 20537409,
 20540852,
 20533026,
 20530046]
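Since the response body is a JSON array, a JSON parser is an alternative to splitting the string by hand. The sketch below compares both approaches on a shortened sample in the same shape as `response1.text`; the variable names are only for illustration.

```python
import json
import re

# Sample text in the same shape as `response1.text` (shortened).
text = "[20542107,20531987,20532763]"

# The body is a JSON array, so a JSON parser yields integers directly.
ids_from_json = json.loads(text)

# The split-and-strip approach used above, for comparison.
ids_from_regex = [int(re.sub(r"\D", "", id_)) for id_ in text.split(",")]

print(ids_from_json == ids_from_regex)
```

With a `requests` response object, `response1.json()` performs the same parsing in one call.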

## Creating the 'get' requests framework and do a test run. ##

Return to [Table of content:](#Table-of-content:)

In [4]:
from time import time
from IPython.display import clear_output

# Create 2 separate lists. 
# One to contain unsuccessful requests.
# Another to contain data from `response.json()`.
request_failed = []
stories = []

# Create a variable for counting requests.
request = 0

# Record the start time so that the request rate can be computed.
start = time()

# Adding parameter to url to display JSON format nicely.
params = {"print":"pretty"}

# Loop over 10 ids for testing.
for id_ in id_ls[:10]:
    request += 1
    response = requests.get("https://hacker-news.firebaseio.com/v0/item/{}.json".format(id_), params=params)
    elapsed_time = time() - start
    
    # Check whether the response is successful or not.
    # `response.ok` is True for status codes below 400, whereas
    # `raise_for_status()` would raise an exception and stop the loop.
    # If unsuccessful, add the id and status code to the indicated list.
    if response.ok:
        stories.append(response.json())
    else:
        request_failed.append([id_, response.status_code])
    
    # Print out the number of requests and time to monitor the progress.
    print("Request: {}; Frequency: {} requests/s".format(request, request / elapsed_time))
    clear_output(wait=True)

Request: 10; Frequency: 0.8756282918753179 requests/s
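The test loop could be hardened a little further. The sketch below is one possible variant, not the framework used in this notebook: it reuses a single connection through `requests.Session`, sets a timeout, and records failures instead of stopping. The names `collect_items` and `make_requests_getter` are hypothetical, and treating an empty (`null`) body as a missing item is an assumption about how the API answers unknown ids.

```python
def collect_items(ids, get_json):
    """Fetch every id with `get_json` and separate successes from failures.

    `get_json` is any callable that takes an id and returns the parsed
    JSON body, raising an exception when the request fails.
    """
    stories, request_failed = [], []
    for id_ in ids:
        try:
            item = get_json(id_)
        except Exception as error:  # keep going; record what went wrong
            request_failed.append([id_, repr(error)])
            continue
        if item is None:  # assumed: a missing item comes back as JSON null
            request_failed.append([id_, "empty response"])
        else:
            stories.append(item)
    return stories, request_failed


def make_requests_getter(timeout=10):
    """Build a `get_json` callable backed by one shared requests.Session."""
    import requests  # imported here so `collect_items` can be tried offline

    session = requests.Session()

    def get_json(id_):
        url = "https://hacker-news.firebaseio.com/v0/item/{}.json".format(id_)
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
        return response.json()

    return get_json
```

A test run would then look like `stories, request_failed = collect_items(id_ls[:10], make_requests_getter())`.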


---
__Check the compiled stories and convert the list of stories into a `DataFrame` object for review.__

In [5]:
stories[:5]

[{'by': 'dredmorbius',
  'descendants': 530,
  'id': 20542107,
  'kids': [20542705,
   20542852,
   20556909,
   20542612,
   20542618,
   20542656,
   20542368,
   20543637,
   20543155,
   20542851,
   20543535,
   20544503,
   20543092,
   20542505,
   20542390,
   20542427,
   20544736,
   20543355,
   20542906,
   20544023,
   20552083,
   20543126,
   20543328,
   20545816,
   20543422,
   20546392,
   20542523,
   20553103,
   20547999,
   20545315,
   20544694,
   20550303,
   20545010,
   20545070,
   20544898,
   20547338,
   20545154,
   20543949,
   20546825,
   20545782,
   20546652,
   20542354,
   20548675,
   20544245,
   20544661,
   20547432,
   20545191,
   20544173,
   20545183,
   20545361,
   20545420,
   20544140,
   20545457,
   20542443,
   20542689,
   20542723,
   20546523,
   20544271,
   20545422,
   20548145,
   20542905,
   20542879,
   20544063,
   20543699,
   20544264,
   20544984,
   20546265,
   20543996,
   20543567,
   20542383,
   20542804,
   205

In [6]:
pd.DataFrame(stories)

| | by | descendants | id | kids | score | time | title | type | url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | dredmorbius | 530 | 20542107 | [20542705, 20542852, 20556909, 20542612, 20542... | 884 | 1564236658 | Adblocking: How about Nah? | story | https://www.eff.org/deeplinks/2019/07/adblocki... |
| 1 | parsimo2010 | 70 | 20531987 | [20532148, 20532695, 20532176, 20540584, 20532... | 740 | 1564109820 | Decades-Old Computer Science Conjecture Solved... | story | https://www.quantamagazine.org/mathematician-s... |
| 2 | cpeterso | 162 | 20532763 | [20533597, 20533371, 20533348, 20533647, 20534... | 545 | 1564121664 | Mozilla debuts implementation of WebThings Gat... | story | https://venturebeat.com/2019/07/25/mozilla-deb... |
| 3 | zoobab | 223 | 20533576 | [20534838, 20533782, 20533727, 20534294, 20535... | 523 | 1564134694 | The UK authorities made illegal copies of the ... | story | https://twitter.com/SophieintVeld/status/11546... |
| 4 | braythwayt | 252 | 20549457 | [20549523, 20550613, 20550475, 20549622, 20552... | 488 | 1564343108 | Malicious code in the purescript NPM installer | story | https://harry.garrood.me/blog/malicious-code-i... |
| 5 | tafda | 197 | 20542862 | [20543176, 20543063, 20543844, 20543058, 20543... | 376 | 1564246439 | The Roots of Boeing’s 737 Max Crisis: A Regula... | story | https://www.nytimes.com/2019/07/27/business/bo... |
| 6 | carapace | 326 | 20537409 | [20537922, 20538634, 20539303, 20543060, 20539... | 586 | 1564167557 | How is China able to provide enough food to fe... | story | https://www.quora.com/How-is-China-able-to-pro... |
| 7 | timafuyc | 68 | 20540852 | [20549190, 20547705, 20547233, 20549106, 20547... | 498 | 1564212569 | Tokyo subway’s humble duct-tape typographer | story | https://medium.com/@chrisgaul/tokyo-subways-hu... |
| 8 | InvOfSmallC | 308 | 20533026 | [20534135, 20534614, 20534300, 20537640, 20534... | 457 | 1564126368 | Users hate change | story | https://gist.github.com/sleepyfox/a4d311ffcdc4... |
| 9 | girlwhocodes | 237 | 20530046 | [20530409, 20531859, 20530242, 20531878, 20530... | 483 | 1564090771 | Square’s Growth Framework for Engineers and En... | story | https://developer.squareup.com/blog/squares-gr... |


---
The result looks fine. The next step is to collect top stories, "ask" stories, and "show" stories, each from its own endpoint.

## Making 'get' request to collect more ids and compile them into a list. ##

Return to [Table of content:](#Table-of-content:)

---

__Get a list of top stories.__

In [7]:
response2 = requests.get("https://hacker-news.firebaseio.com/v0/topstories.json")
print("Request status: {}".format(type(response2.status_code)))
print("Type: {}; Result: {}".format(type(response2.text), response2.text))

Request status: <class 'int'>
Type: <class 'str'>; Result: [20556095,20556201,20556145,20556382,20556217,20555550,20556045,20556068,20556346,20555463,20554240,20555785,20547295,20554194,20552752,20554540,20553024,20555351,20556734,20546539,20546955,20544628,20541446,20554806,20545850,20556043,20556336,20555249,20547252,20546503,20556494,20556869,20555503,20547824,20550656,20551571,20552850,20554122,20556441,20543181,20552365,20556706,20551615,20549457,20552054,20551946,20550709,20555453,20551148,20550923,20555720,20553220,20552797,20550165,20540795,20550811,20555681,20549422,20549442,20550167,20540852,20549685,20549804,20549706,20549424,20541488,20550445,20553028,20550143,20549895,20542107,20540493,20546906,20550516,20546609,20551592,20547013,20549354,20541275,20549056,20547911,20554305,20541683,20546356,20546288,20541448,20549983,20547104,20540871,20550758,20541401,20542862,20551641,20546246,20547856,20546231,20541492,20553777,20540928,20541113,20541441,20548815,20555229,20548806,2054

---

__Get a list of "ask" stories.__

In [8]:
response3 = requests.get("https://hacker-news.firebaseio.com/v0/askstories.json")
print("Request status: {}".format(type(response3.status_code)))
print("Type: {}; Result: {}".format(type(response3.text), response3.text))

Request status: <class 'int'>
Type: <class 'str'>; Result: [20556336,20557018,20555977,20546906,20555169,20553659,20555966,20552823,20539187,20551761,20546513,20551269,20549310,20550641,20550436,20543880,20548849,20534533,20546958,20534551,20532979,20543786,20539566,20532057,20537492,20529655,20542203,20541956,20535533,20532615,20548745,20534743,20533211,20532139,20539141,20528927,20540653,20536347,20538295,20537209,20541962,20555001,20554292,20554228,20553968,20553886,20553831,20553605,20553340,20553338,20553260,20553136,20552710,20552660,20552473,20551939,20551925,20551834,20551785,20551734,20551655,20551537,20551503,20551453,20551020,20550942,20550626]


---
__Get a list of "show" stories.__

In [9]:
response4 = requests.get("https://hacker-news.firebaseio.com/v0/showstories.json")
print("Request status: {}".format(type(response4.status_code)))
print("Type: {}; Result: {}".format(type(response4.text), response4.text))

Request status: <class 'int'>
Type: <class 'str'>; Result: [20556168,20550167,20546356,20548264,20552619,20551223,20550620,20537966,20550732,20549268,20552400,20548036,20545977,20539678,20533318,20531307,20543817,20546407,20534729,20535143,20545259,20543569,20546884,20530716,20540683,20536539,20536139,20535245,20535095,20534885,20534803,20533930,20532746,20532460,20542160,20555019,20554772,20554742,20554599,20554262,20553452,20553336,20552274,20551336,20551334,20551178,20550856,20550651]


---
__Combine all 4 of the lists together.__

In [10]:
# Split the string values and combine all 4 of the lists.
# Create a `generator` to contain each of the values and convert every string value into integer.
id_ls = response1.text.split(",") + response2.text.split(",") + response3.text.split(",") + response4.text.split(",")
id_ls = (int(re.sub(r"\D", "", id_)) for id_ in id_ls)
id_ls

<generator object <genexpr> at 0x1158752b0>
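The same id can appear in more than one list; in the outputs above, `20556095` shows up in both the best and the top stories. Below is a minimal sketch of removing such repeats before the actual run, so that no item is requested twice. The short lists are stand-ins for the real ones.

```python
# Shortened stand-ins for the id lists; one id appears in both.
best_ids = [20542107, 20531987, 20556095]
top_ids = [20556095, 20556201]

combined = best_ids + top_ids

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats while preserving the order of first appearance.
unique_ids = list(dict.fromkeys(combined))

print(len(combined), len(unique_ids))
```

Note also that a generator can be iterated only once, so a list is the safer choice if the loop may need to be re-run.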

## Running an actual 'get' request. ##

Return to [Table of content:](#Table-of-content:)

In [11]:
request_failed = []
stories = []

request = 0
start = time()

for id_ in id_ls:
    request += 1
    response = requests.get("https://hacker-news.firebaseio.com/v0/item/{}.json".format(id_), params=params)
    elapsed_time = time() - start
    
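    # `response.ok` is True for status codes below 400; failures are
    # recorded with their status code instead of raising an exception.
    if response.ok:
        stories.append(response.json())
    else:
        request_failed.append([id_, response.status_code])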
    
    print("Request: {}; Frequency: {} requests/s".format(request, request / elapsed_time))
    clear_output(wait=True)

Request: 803; Frequency: 0.8746811639174001 requests/s
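At under one request per second, the run above spends most of its time waiting on the network. One common way to shorten that wait is to issue several requests at once from a thread pool. The sketch below is an assumption-laden illustration rather than part of this project: `fetch_one` and `fetch_all` are hypothetical helpers, `get_json` stands for any callable that takes an id and returns the parsed item, and the pool size of 8 is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_one(get_json, id_):
    """Return (id, item, error) so that one failure does not stop the run."""
    try:
        return id_, get_json(id_), None
    except Exception as error:
        return id_, None, repr(error)


def fetch_all(ids, get_json, max_workers=8):
    """Fetch all ids on a small thread pool, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda id_: fetch_one(get_json, id_), ids))


# Offline stand-in for a real request, used only to show the call shape.
results = fetch_all([1, 2, 3], lambda id_: {"id": id_})
print(results)
```

Even without a published rate limit, keeping the pool small is the polite option.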


## Convert into 'dataframe' object and save the dataset into CSV file. ##

Return to [Table of content:](#Table-of-content:)

---

__Check the compiled stories and convert the list into a `DataFrame` object.__

In [12]:
stories[:5]

[{'by': 'dredmorbius',
  'descendants': 530,
  'id': 20542107,
  'kids': [20542705,
   20542852,
   20556909,
   20542612,
   20542618,
   20542656,
   20542368,
   20543637,
   20543155,
   20542851,
   20543535,
   20544503,
   20543092,
   20542505,
   20542390,
   20542427,
   20544736,
   20543355,
   20542906,
   20544023,
   20552083,
   20543126,
   20543328,
   20545816,
   20543422,
   20546392,
   20542523,
   20553103,
   20547999,
   20545315,
   20544694,
   20550303,
   20545010,
   20545070,
   20544898,
   20547338,
   20545154,
   20543949,
   20546825,
   20545782,
   20546652,
   20542354,
   20548675,
   20544245,
   20544661,
   20547432,
   20545191,
   20544173,
   20545183,
   20545361,
   20545420,
   20544140,
   20545457,
   20542443,
   20542689,
   20542723,
   20546523,
   20544271,
   20545422,
   20548145,
   20542905,
   20542879,
   20544063,
   20543699,
   20544264,
   20544984,
   20546265,
   20543996,
   20543567,
   20542383,
   20542804,
   205

In [14]:
hn = pd.DataFrame(stories)
hn.head(15)

| | by | descendants | id | kids | score | text | time | title | type | url |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | dredmorbius | 530.0 | 20542107 | [20542705, 20542852, 20556909, 20542612, 20542... | 884 | | 1564236658 | Adblocking: How about Nah? | story | https://www.eff.org/deeplinks/2019/07/adblocki... |
| 1 | parsimo2010 | 70.0 | 20531987 | [20532148, 20532695, 20532176, 20540584, 20532... | 740 | | 1564109820 | Decades-Old Computer Science Conjecture Solved... | story | https://www.quantamagazine.org/mathematician-s... |
| 2 | cpeterso | 162.0 | 20532763 | [20533597, 20533371, 20533348, 20533647, 20534... | 545 | | 1564121664 | Mozilla debuts implementation of WebThings Gat... | story | https://venturebeat.com/2019/07/25/mozilla-deb... |
| 3 | zoobab | 223.0 | 20533576 | [20534838, 20533782, 20533727, 20534294, 20535... | 523 | | 1564134694 | The UK authorities made illegal copies of the ... | story | https://twitter.com/SophieintVeld/status/11546... |
| 4 | braythwayt | 252.0 | 20549457 | [20549523, 20550613, 20550475, 20549622, 20552... | 488 | | 1564343108 | Malicious code in the purescript NPM installer | story | https://harry.garrood.me/blog/malicious-code-i... |
| 5 | tafda | 197.0 | 20542862 | [20543176, 20543063, 20543844, 20543058, 20543... | 376 | | 1564246439 | The Roots of Boeing’s 737 Max Crisis: A Regula... | story | https://www.nytimes.com/2019/07/27/business/bo... |
| 6 | carapace | 326.0 | 20537409 | [20537922, 20538634, 20539303, 20543060, 20539... | 586 | | 1564167557 | How is China able to provide enough food to fe... | story | https://www.quora.com/How-is-China-able-to-pro... |
| 7 | timafuyc | 68.0 | 20540852 | [20549190, 20547705, 20547233, 20549106, 20547... | 498 | | 1564212569 | Tokyo subway’s humble duct-tape typographer | story | https://medium.com/@chrisgaul/tokyo-subways-hu... |
| 8 | InvOfSmallC | 308.0 | 20533026 | [20534135, 20534614, 20534300, 20537640, 20534... | 457 | | 1564126368 | Users hate change | story | https://gist.github.com/sleepyfox/a4d311ffcdc4... |
| 9 | girlwhocodes | 237.0 | 20530046 | [20530409, 20531859, 20530242, 20531878, 20530... | 483 | | 1564090771 | Square’s Growth Framework for Engineers and En... | story | https://developer.squareup.com/blog/squares-gr... |


---
__Save the dataset in .csv format.__

In [16]:
hn.to_csv("hacker_news.csv", index=False, header=True, sep=",")
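One thing to keep in mind when loading the file again: a CSV cell can only hold text, so a list column such as `kids` comes back as a string. The sketch below shows one way to restore it, using a tiny hypothetical dataset and an in-memory buffer in place of `hacker_news.csv`.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row dataset; the second story has no comments yet.
hn_sample = pd.DataFrame([
    {"id": 1, "kids": [2, 3]},
    {"id": 4, "kids": None},
])

# Write to an in-memory buffer instead of a file for the demonstration.
buffer = io.StringIO()
hn_sample.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)

# After the round trip a list is plain text such as "[2, 3]";
# missing values come back as NaN (a float), so only strings are parsed.
restored["kids"] = restored["kids"].apply(
    lambda value: ast.literal_eval(value) if isinstance(value, str) else []
)
print(restored["kids"].tolist())
```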

## Notes for future reference. ##

__List of things to take note of in the future:__

* The API endpoints may have changed.
* Useful data such as `descendants` and `score` may be outdated.
* Hacker News might introduce a rate limit, in which case the `get` requests framework has to be modified.
* Hacker News might impose a requirement for API authentication.
* The terms and conditions and the license may be updated, which might introduce some limitations on data usage.
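If a rate limit is introduced, one possible modification is to retry a failed request after a growing pause. This is a sketch under assumptions, not a tested part of the framework: `get_with_backoff` is a hypothetical helper, `get_json` stands for any callable that fetches one item and raises on failure, and the retry count and delays are arbitrary.

```python
import time


def get_with_backoff(get_json, id_, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `get_json(id_)`, waiting longer after each failed attempt."""
    for attempt in range(retries + 1):
        try:
            return get_json(id_)
        except Exception:
            if attempt == retries:
                raise  # out of attempts; let the caller see the error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

The `sleep` argument exists only so that the waiting can be replaced during a dry run.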

Return to [Table of content:](#Table-of-content:)