# "Part 1: Getting data from HackerNews"
> "Here I will describe the process that I used to get HN data."

- toc: true
- branch: master
- badges: true
- comments: true
- author: Pushkar Paranjpe
- categories: [HN, google, chrome, extension, unsupervised, ML]

### Where is the data?

I had two options: download the news data from HN using their [API](https://github.com/HackerNews/API), or get a (near) up-to-date dump via [Kaggle](https://www.kaggle.com/hacker-news/hacker-news). Yes! Kaggle is not just for getting into the world of competitive ML coding. It is also a fabulous resource for a broad variety of structured datasets.
Here is the link to the [HN data](https://www.kaggle.com/hacker-news/hacker-news).

![](./images/kaggle_hn_data.png "Kaggle is for data too!")

The header image says it has posts from 2006 to late 2017. But if you look inside the data tables, you will find data up to the current week. The current day and up to a couple of previous days may be missing. You will have to use the HN API to get those - but more on that later. For now, all the HN data up to the current day minus two will suffice.
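As a rough idea of what filling that gap could look like, here is a minimal sketch that uses only the standard library. The endpoint pattern comes from the public HN API documentation. The helper names (`item_url`, `to_story_row`, `fetch_item`) and the sample payload are made up for illustration and are not part of this project's code.

```python
import json
from datetime import datetime, timezone
from urllib.request import urlopen

HN_API = "https://hacker-news.firebaseio.com/v0"


def item_url(item_id):
    """Build the URL for a single HN item (story, comment, job, ...)."""
    return f"{HN_API}/item/{item_id}.json"


def to_story_row(item):
    """Map a raw API item to (title, id, timestamp); None if it is not a titled story."""
    if not item or item.get("type") != "story" or not item.get("title"):
        return None
    # The API reports creation time as Unix seconds.
    ts = datetime.fromtimestamp(item["time"], tz=timezone.utc)
    return item["title"], item["id"], ts


def fetch_item(item_id, timeout=10):
    """Network call: download one item as a dict."""
    with urlopen(item_url(item_id), timeout=timeout) as resp:
        return json.load(resp)


# Offline demonstration with a hand-written payload shaped like an API item.
sample = {"id": 1, "type": "story", "title": "A made-up title", "time": 1589887182}
row = to_story_row(sample)
print(row)
```

The row has the same three fields (title, id, timestamp) that are pulled from the Kaggle table later in this post, so rows fetched this way could be appended to that data.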

### Using Google's BigQuery to access the data source

We will write some SQL queries to explore the data and save the results to Kaggle's working directory. You get a (generous) 5GB of space in your working directory! That's where I will put the HN data. Later, I will show you a nifty trick to download that data to your data processing machine.

First - fire up a Kaggle notebook. Start coding!


```python
import pandas as pd
from google.cloud import bigquery


# Create a "Client" object
client = bigquery.Client()
```

Our SQL client is ready.

Now let's explore the HN dataset a bit.


```python
query = """
        SELECT COUNT(*)
        FROM
        bigquery-public-data.hacker_news.full
        WHERE
          type = 'story'
          AND
          title IS NOT NULL;
"""

# Set up the query
query_job = client.query(query)

# API request - run the query, and return a pandas DataFrame
query_job.to_dataframe()
```



Output:


|   | f0_ |
| - | - |
| 0  |	3524454 |
 

So - there are about 3.5M rows of the type 'story' that have a non-null title. Why do we care about the `title`? Well - we will be using the title field alone to build the news catalogue.

Let's see some rows.


```python
query = """
        SELECT *
        FROM
        bigquery-public-data.hacker_news.full
        WHERE type = 'story'
        LIMIT 10;
"""

# Set up the query
query_job = client.query(query)

# API request - run the query, and return a pandas DataFrame
query_job.to_dataframe()
```

![](./images/hn_stories_rows.png "Sample rows from the HN table.")
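One thing to keep in mind: as far as I understand BigQuery's documentation, a `LIMIT` on a `SELECT *` does not reduce the amount of data the query reads. A dry run can report the bytes a query would scan without actually running it. The sketch below assumes the `google-cloud-bigquery` client used above. `estimate_scan_gb` is a hypothetical helper name, not part of the library.

```python
def estimate_scan_gb(client, query, job_config):
    """Return the gigabytes a query would scan, given a dry-run job config."""
    # With dry_run=True the job is validated and sized, but never executed.
    job = client.query(query, job_config=job_config)
    return job.total_bytes_processed / 1e9


# In a Kaggle notebook this would be called roughly like so:
#   from google.cloud import bigquery
#   dry_run = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
#   estimate_scan_gb(client, query, dry_run)
```

Selecting only the columns you need (as done later with `title, id, timestamp`) is the usual way to bring that number down.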

What is the latest row in this table? Let's find out!

```python
query = """
        SELECT MAX(timestamp)
        FROM
        bigquery-public-data.hacker_news.full
        WHERE type = 'story'
        AND DATE(timestamp) > '2020-05-01';
"""

# Set up the query
query_job = client.query(query)

# API request - run the query, and return a pandas DataFrame
query_job.to_dataframe()
```

I ran the query on `2020-05-20 9:30 p.m. IST`. Here was the output:

| | f0_|
| - | - |
|0	| 2020-05-19 11:19:42+00:00|

So - the latest story is a little over a day old (9:30 p.m. IST is 16:00 UTC, which makes the gap roughly 29 hours). Not bad for a data source that is freely available and maintained by a 3rd party (not me, laziness!).
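The gap can be worked out precisely with timezone-aware datetimes. This is a small sketch that assumes the table's timestamp is in UTC (as the `+00:00` suffix suggests) and that IST is UTC+5:30:

```python
from datetime import datetime, timedelta, timezone

# India Standard Time as a fixed offset from UTC.
IST = timezone(timedelta(hours=5, minutes=30), name="IST")

run_time = datetime(2020, 5, 20, 21, 30, tzinfo=IST)                  # when the query ran
latest_row = datetime(2020, 5, 19, 11, 19, 42, tzinfo=timezone.utc)   # MAX(timestamp)

# Aware datetimes subtract correctly even across different time zones.
lag = run_time - latest_row

print(run_time.astimezone(timezone.utc))  # 2020-05-20 16:00:00+00:00
print(lag)                                # 1 day, 4:40:18
```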


### Fetch data

Let's fetch the data now.

```python
query = """
        SELECT title, id, timestamp
        FROM
        bigquery-public-data.hacker_news.full
        WHERE
                type = 'story'
            AND title IS NOT NULL;
"""

# Set up the query
query_job = client.query(query)

# API request - run the query, and return a pandas DataFrame
df_stories = query_job.to_dataframe()

df_stories.shape
```

This query will run for a little longer, but no longer than 10 minutes. Remember, we are getting *all* the HN stories that have a title, and there are about 3.5M of them at the time of this writing. Patience!

Here's the output:  
`(3524454, 3)`

Naice (: We've got > 3.5 million stories to play with! And for each of them we have an HN id, a timestamp and a title.
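Before saving, a quick sanity check on the DataFrame can catch surprises early. The sketch below is only a suggestion about what might be worth checking (duplicate ids, blank titles, date range). `summarize_stories` is a hypothetical helper, and the toy frame stands in for `df_stories` so the example runs on its own.

```python
import pandas as pd


def summarize_stories(df):
    """Return a few basic health metrics for a (title, id, timestamp) DataFrame."""
    return {
        "rows": len(df),
        "duplicate_ids": int(df["id"].duplicated().sum()),
        # Titles that are non-null but contain only whitespace.
        "blank_titles": int((df["title"].str.strip() == "").sum()),
        "earliest": df["timestamp"].min(),
        "latest": df["timestamp"].max(),
    }


# Toy data with one duplicated id and one whitespace-only title.
toy = pd.DataFrame({
    "title": ["Show HN: A toy example", "  ", "Ask HN: Another toy row"],
    "id": [1, 2, 2],
    "timestamp": pd.to_datetime(
        ["2020-05-17 10:00:00", "2020-05-18 11:00:00", "2020-05-19 11:19:42"],
        utc=True,
    ),
})

summary = summarize_stories(toy)
print(summary)
```

In the notebook, the call would simply be `summarize_stories(df_stories)`.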

### Save data

Let's save our hard work locally, i.e. inside the Kaggle working directory.

```python
import pickle


with open('hn.stories.dt.pickle', 'wb') as outf:
    pickle.dump(df_stories, outf)
```

Check our work:  
`!ls -al --block-size=M *.pickle`

Output:  
`-rw-r--r-- 1 root root 250M May 20 16:10 hn.stories.dt.pickle`

That's good. A sweet 250 megabytes of HN data.
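Since this file is going to be copied to another machine, it may be worth recording a checksum now and comparing it after the transfer. This is an optional extra and not part of the original workflow. `sha256_of` is a hypothetical helper built on the standard library's `hashlib`.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a large pickle never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# In the notebook: print(sha256_of("hn.stories.dt.pickle"))
```

Running the same function on the downloaded copy should print the same hex string.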

### A nifty little trick

Here's the nifty trick that I promised: creating a cute little download link!

```python
from IPython.display import FileLink


FileLink(r'hn.stories.dt.pickle')
```

Output:  
[hn.stories.dt.pickle](hn.stories.dt.pickle)

The output is a hyperlink to your pickle file. Use `wget` or your favorite download manager to get that file onto your data processing machine. I have used an AWS EC2 instance for further work. So let's say bye to Kaggle and hello to EC2. See you there - in the [next blog post](https://pushkarparanjpe.github.io/kidepaha_fastpages/hn/google/chrome/extension/unsupervised/ml/2020/05/24/HackerMark-featurise-titles.html)!