# Hacker News Data Pipeline

## Intro

We began with the concepts of functional programming and then built our own data pipeline class in Python. Along the way, we covered advanced Python concepts such as decorators, closures, and good API design. We also learned how to implement a directed acyclic graph as the scheduler for our pipeline.

Finally, we built a robust data pipeline that schedules our tasks in the correct order! In this project, we will take the pipeline we have been building and apply it to a real-world data pipeline project. Starting from a JSON API, we will filter, clean, aggregate, and summarize data in a sequence of tasks that apply these transformations for us.

The data we will use comes from a [Hacker News](https://news.ycombinator.com/) (HN) API that returns [JSON data](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON) of the top stories in 2014 (the sample was shrunk to January only to keep the file size small). If you're unfamiliar with Hacker News, it's a link aggregator website where users vote up stories that are interesting to the community. It is similar to [Reddit](https://www.reddit.com/), but the community revolves only around computer science and entrepreneurship posts.

<img src="hn.png"/>

To make things easier, we have already downloaded a list of JSON posts to a file called `hn_stories_2014.json`. The JSON file contains a single key `stories`, which contains a list of stories (posts). Each post has a set of keys, but we will deal only with the following keys:

- `created_at`: A timestamp of the story's creation time.
- `created_at_i`: A unix epoch timestamp.
- `url`: The URL of the story link.
- `objectID`: The ID of the story.
- `author`: The story's author (username on HN).
- `points`: The number of upvotes the story had.
- `title`: The headline of the post.
- `num_comments`: The number of comments a post has.

Here's an example of the full list of keys in a story:

```JSON
{
    "story_text": "",
    "created_at": "2014-05-29T08:23:46Z",
    "story_title": null,
    "story_id": null,
    "comment_text": null,
    "created_at_i": 1401351826,
    "url": "http://bits.blogs.nytimes.com/2014/05/28/making-twitter-easier-to-use/",
    "parent_id": null,
    "objectID": "7815285",
    "author": "Leynos",
    "points": 1,
    "title": "Making Twitter Easier to Use",
    "_tags": [
        "story",
        "author_Leynos",
        "story_7815285"
    ],
    "num_comments": 0,
    "_highlightResult": {
        "story_text": {
            "matchedWords": [],
            "value": "",
            "matchLevel": "none"
        },
        "author": {
            "matchedWords": [],
            "value": "Leynos",
            "matchLevel": "none"
        },
        "url": {
            "matchedWords": [],
            "value": "http://bits.blogs.nytimes.com/2014/05/28/making-twitter-easier-to-use/",
            "matchLevel": "none"
        },
        "title": {
            "matchedWords": [],
            "value": "Making Twitter Easier to Use",
            "matchLevel": "none"
        }
    },
    "story_url": null
}
```
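
The `created_at` and `created_at_i` keys describe the same instant in two formats. As a quick check, here is a minimal sketch using the values from the example story above. It assumes the trailing `Z` means UTC, as it does in ISO 8601:

```python
from datetime import datetime, timezone

created_at = "2014-05-29T08:23:46Z"
created_at_i = 1401351826

# strptime returns a naive datetime, so we attach UTC explicitly
# before converting to a Unix timestamp.
parsed = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%SZ")
parsed_utc = parsed.replace(tzinfo=timezone.utc)

print(int(parsed_utc.timestamp()) == created_at_i)  # True
```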

Using this dataset, we will run a sequence of basic natural language processing tasks using our `Pipeline` class. The goal is to find the top 100 keywords of Hacker News posts in our 2014 sample. Because Hacker News is one of the most popular technology social media sites, this will give us a sense of the most talked-about tech topics at the time!

## Setup

Module imports:

```python
import json
import io
import datetime
import csv
import string

# Self-implemented modules
from pipeline import Pipeline
from pipeline import build_csv
from stop_words import stop_words
```

Global variables:

```python
json_file = "hn_stories_2014.json"
pipeline = Pipeline()
```
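
The `pipeline` module itself is not shown in this write-up. As a rough sketch of the interface the cells below rely on (a `task` decorator that accepts `depends_on`, and a `run()` method that returns results keyed by task function), something like the following would be enough. The class name `SketchPipeline` is hypothetical, and this version simply runs tasks in registration order instead of using the DAG scheduler built earlier:

```python
class SketchPipeline:
    """Hypothetical stand-in for the real Pipeline class."""

    def __init__(self):
        # Tasks in registration order, each paired with its dependency.
        self.tasks = []

    def task(self, depends_on=None):
        def decorator(func):
            self.tasks.append((func, depends_on))
            return func
        return decorator

    def run(self):
        # Registration order is a valid execution order here: a task can
        # only name a dependency that was already defined above it.
        results = {}
        for func, dependency in self.tasks:
            if dependency is None:
                results[func] = func()
            else:
                results[func] = func(results[dependency])
        return results


demo = SketchPipeline()


@demo.task()
def numbers():
    return [1, 2, 3]


@demo.task(depends_on=numbers)
def doubled(values):
    return [value * 2 for value in values]


results = demo.run()
print(results[doubled])  # [2, 4, 6]
```

Each task receives the output of the task it depends on as its only argument, which is how the tasks in the rest of this project are written.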

## Loading the JSON Data

We'll start by loading the JSON file data into Python. Because JSON files resemble a key-value dictionary, the goal is to parse the JSON file into a list of Python `dict` objects. We can accomplish this using the [`json` module](https://docs.python.org/3/library/json.html).

```python
@pipeline.task()
def file_to_json():
    # Parse the whole file, then return only the list under the `stories` key.
    with open(json_file, 'r') as file:
        stories = json.load(file)
    return stories['stories']
```

```python
results = pipeline.run()
result = results[file_to_json]
print(json.dumps(result[0], indent=4))
```
```
{
    "story_text": "",
    "created_at": "2014-01-31T23:59:19Z",
    "story_title": null,
    "story_id": null,
    "comment_text": null,
    "created_at_i": 1391212759,
    "url": "http://www.newrepublic.com/article/115883/drugs-drinking-water-new-epa-study-finds-more-we-knew",
    "parent_id": null,
    "objectID": "7160071",
    "author": "jrs99",
    "points": 2,
    "title": "Drugs in water",
    "_tags": [
        "story",
        "author_jrs99",
        "story_7160071"
    ],
    "num_comments": 1,
    "_highlightResult": {
        "story_text": {
            "matchedWords": [],
            "value": "",
            "matchLevel": "none"
        },
        "author": {
            "matchedWords": [],
            "value": "jrs99",
            "matchLevel": "none"
        },
        "url": {
            "matchedWords": [],
            "value": "http://www.newrepublic.com/article/115883/drugs-drinking-water-new-epa-study-finds-more-we-knew",
            "matchLevel": "none"
        }
```

## Filtering the Most Popular Stories

Great! Now that we have loaded all the stories as a list of `dict` objects, we can operate on them. Let's start by filtering the list of stories to get the most popular stories of the year.

Like any social link aggregator site, individual users can post whatever content they want. The reason we want the most popular stories is to ensure that we select stories that were the most talked about during the year. We can filter for popular stories by ensuring they are links (not `Ask HN` posts), have a good number of points, and have some comments.

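```python
@pipeline.task(depends_on=file_to_json)
def filter_stories(stories):
    # Keep link posts (not "Ask HN") with more than 50 points
    # and more than 1 comment.
    def is_popular(story):
        return (story["points"] > 50
                and story["num_comments"] > 1
                and not story["title"].startswith("Ask HN"))

    # Return a generator so later tasks can stream the stories.
    return (story for story in stories if is_popular(story))
```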
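
Note that `filter_stories` returns a generator rather than a list. A generator can be consumed only once, which matters if you want to both inspect a task's output and pass it on to another task. A small illustration, using made-up stories (the data below is hypothetical):

```python
stories = [
    {"title": "Ask HN: Favorite editor?", "points": 120, "num_comments": 80},
    {"title": "A new database engine", "points": 75, "num_comments": 12},
    {"title": "Unnoticed post", "points": 2, "num_comments": 0},
]

popular = (
    story for story in stories
    if story["points"] > 50
    and story["num_comments"] > 1
    and not story["title"].startswith("Ask HN")
)

first_pass = [story["title"] for story in popular]
second_pass = [story["title"] for story in popular]  # generator is exhausted

print(first_pass)   # ['A new database engine']
print(second_pass)  # []
```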

```python
results = pipeline.run()
result = results[filter_stories]
print(json.dumps(next(result), indent=4))
```
```
{
    "story_text": "",
    "created_at": "2014-01-31T23:34:59Z",
    "story_title": null,
    "story_id": null,
    "comment_text": null,
    "created_at_i": 1391211299,
    "url": "http://maxhorstmann.net/2014/01/31/why-dart-should-learn-json-while-its-still-young/",
    "parent_id": null,
    "objectID": "7159926",
    "author": "Max_Horstmann",
    "points": 98,
    "title": "Why Dart should learn JSON while it\u2019s still young",
    "_tags": [
        "story",
        "author_Max_Horstmann",
        "story_7159926"
    ],
    "num_comments": 54,
    "_highlightResult": {
        "story_text": {
            "matchedWords": [],
            "value": "",
            "matchLevel": "none"
        },
        "author": {
            "matchedWords": [],
            "value": "Max_Horstmann",
            "matchLevel": "none"
        },
        "url": {
            "matchedWords": [],
            "value": "http://maxhorstmann.net/2014/01/31/why-dart-should-learn-json-while-its-still-young/"
```

## Converting to CSV

With a reduced set of stories, it's time to write these `dict` objects to a CSV file. We translate the dictionaries to CSV so that the later summarization tasks work with a consistent data format. Keeping data formats consistent makes each pipeline task easier to adapt to future requirements.

```python
@pipeline.task(depends_on=filter_stories)
def json_to_csv(stories):
    def parse(story):
        # Keep only the columns we need; parse the timestamp into a datetime.
        return [story['objectID'],
                datetime.datetime.strptime(story['created_at'], "%Y-%m-%dT%H:%M:%SZ"),
                story['url'],
                story['points'],
                story['title']]
    parsed = (parse(story) for story in stories)
    header = ['objectID', 'created_at', 'url', 'points', 'title']
    # Write the rows to an in-memory file instead of to disk.
    return build_csv(parsed, header, io.StringIO())
```
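
`build_csv` comes from the `pipeline` module and is not shown here. Judging from the call above, it takes an iterable of rows, a header, and a file-like object, and returns a file that `csv.reader` can read from the start. A minimal sketch under that assumption (the name `sketch_build_csv` is hypothetical):

```python
import csv
import io


def sketch_build_csv(lines, header=None, file=None):
    """Write `lines` (and an optional header) as CSV, then rewind the file."""
    if file is None:
        file = io.StringIO()
    writer = csv.writer(file)
    if header:
        writer.writerow(header)
    writer.writerows(lines)
    # Rewind so the next task reads from the start of the file.
    file.seek(0)
    return file


rows = [["7159926", 98], ["7159896", 281]]
csv_file = sketch_build_csv(rows, header=["objectID", "points"])
print(list(csv.reader(csv_file)))
# [['objectID', 'points'], ['7159926', '98'], ['7159896', '281']]
```

Every value comes back as a string when the CSV is read again, which is why `points` appears quoted in the output.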

```python
results = pipeline.run()
result = results[json_to_csv]
reader = csv.reader(result)
for i in range(4):
    print(next(reader))
```
```
['objectID', 'created_at', 'url', 'points', 'title']
['7159926', '2014-01-31 23:34:59', 'http://maxhorstmann.net/2014/01/31/why-dart-should-learn-json-while-its-still-young/', '98', 'Why Dart should learn JSON while it’s still young']
['7159896', '2014-01-31 23:29:22', 'http://voicechatapi.com/?hn', '281', 'Show HN: We just open-sourced a Skype replacement with HTML5']
['7159768', '2014-01-31 23:04:18', 'http://stackoverflow.com/questions/1995113/strangest-language-feature/2002154#2002154', '67', 'Strangest Programming Language Feature?']
```


## Extract Title Column

Using the CSV file format we created in the previous task, we can now extract the title column. Once we have extracted the title of each popular post, we can move on to the word frequency task.

The steps are:

1. Import `csv` and create a `csv.reader()` object from the file object.
2. Find the index of `title` in the header.
3. Iterate through the reader and return the item at the title index from each row.

```python
@pipeline.task(depends_on=json_to_csv)
def extract_titles(csv_file):
    reader = csv.reader(csv_file)
    # The first row is the header; use it to locate the title column.
    header = next(reader)
    index_title = header.index('title')
    return (line[index_title] for line in reader)
```
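
To see these steps in isolation, without running the whole pipeline, here is a small self-contained sketch. It reads an in-memory CSV that reuses two of the titles from the output above, with a reduced set of columns chosen for illustration:

```python
import csv
import io

csv_file = io.StringIO(
    "objectID,points,title\n"
    "7159896,281,Show HN: We just open-sourced a Skype replacement with HTML5\n"
    "7159768,67,Strangest Programming Language Feature?\n"
)

reader = csv.reader(csv_file)
header = next(reader)                # first row holds the column names
index_title = header.index("title")  # position of the title column
titles = [row[index_title] for row in reader]

print(titles)
```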

```python
results = pipeline.run()
result = results[extract_titles]
for i in range(3):
    print(next(result))
```
```
Why Dart should learn JSON while it’s still young
Show HN: We just open-sourced a Skype replacement with HTML5
Strangest Programming Language Feature?
```


## Clean the Titles

Because we're trying to build a word frequency model from Hacker News titles, we need a consistent set of words to count. For example, `Google`, `google`, `GooGle?`, and `google.` all refer to the same keyword: `google`. If we simply split each title into words, however, they would be counted as four different words.

To clean the titles, we should lowercase them and remove the punctuation. An easy way to rid a string of punctuation is to check each character, determine whether it is punctuation, and keep it only if it is not. The `string` module provides a handy constant that contains the ASCII punctuation characters.

```python
print(string.punctuation)
```
```
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
```


```python
@pipeline.task(depends_on=extract_titles)
def clean_titles(titles):
    def clean(title):
        result = title
        # Remove ASCII punctuation plus the curly quotes that appear in some titles.
        for char in string.punctuation + '‘’':
            result = result.replace(char, '')
        return result.lower()
    return (clean(title) for title in titles)
```
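
As an alternative to replacing one character at a time, the same cleaning can be done in a single pass with `str.translate`. This is a sketch of a different technique, not the approach used in the task above:

```python
import string

# Map every punctuation character (plus curly quotes) to None, i.e. delete it.
REMOVE_PUNCTUATION = str.maketrans("", "", string.punctuation + "‘’")


def clean(title):
    return title.translate(REMOVE_PUNCTUATION).lower()


print(clean("Why Dart should learn JSON while it’s still young"))
# why dart should learn json while its still young
print(clean("Strangest Programming Language Feature?"))
# strangest programming language feature
```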

```python
results = pipeline.run()
result = results[clean_titles]
for i in range(3):
    print(next(result))
```
```
why dart should learn json while its still young
show hn we just opensourced a skype replacement with html5
strangest programming language feature
```


## Create Word Frequency Dictionary

With the titles cleaned, we can now build the word frequency dictionary. A word frequency dictionary is a set of key-value pairs that maps each word to the number of times it is used in a text. Here's an example of how a word frequency dictionary would work on a single string:

```python
sample_text = "Wow, the Dataquest Data Engineering track is the best track!"
print(word_freq_from_string(sample_text))
```
```
{'wow': 1, 'the': 2, 'dataquest': 1, 'data': 1, 'engineering': 1, 'track': 2, 'is': 1, 'best': 1}
```

As you can see, the text has been stripped of its punctuation and lowercased. Furthermore, to find actual keywords, the word frequency dictionary should exclude stop words. Stop words are words that occur frequently in a language, like "the" and "or", and are commonly ignored in keyword searches.

We have included a module called `stop_words` (the file `stop_words.py`) with a `tuple` of the most commonly used stop words in the English language. Here's what the sample text would look like without the stop words:

```python
sample_text = "Wow, the Dataquest Data Engineering track is the best track!"
print(word_freq_no_stop_words(sample_text))
```
```
{'wow': 1, 'dataquest': 1, 'data': 1, 'engineering': 1, 'track': 2, 'best': 1}
```
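
The helper functions `word_freq_from_string` and `word_freq_no_stop_words` used in these examples are not defined in this write-up. One possible way to write them is sketched below. The implementation and the short `SAMPLE_STOP_WORDS` tuple are assumptions for illustration; the real `stop_words` tuple is much longer:

```python
import string

# A tiny illustrative stop word list (hypothetical, not the real module).
SAMPLE_STOP_WORDS = ("the", "is", "a", "or", "and")


def word_freq_from_string(text, ignored=()):
    # Lowercase, strip punctuation, then count each remaining word.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    frequencies = {}
    for word in cleaned.split():
        if word in ignored:
            continue
        frequencies[word] = frequencies.get(word, 0) + 1
    return frequencies


def word_freq_no_stop_words(text):
    return word_freq_from_string(text, ignored=SAMPLE_STOP_WORDS)


sample_text = "Wow, the Dataquest Data Engineering track is the best track!"
print(word_freq_from_string(sample_text))
print(word_freq_no_stop_words(sample_text))
```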

```python
@pipeline.task(depends_on=clean_titles)
def build_keyword_dictionary(titles):
    frequencies = {}
    for title in titles:
        # Split the title into words and drop the stop words.
        words = [word for word in title.split() if word and word not in stop_words]
        for word in words:
            if word not in frequencies:
                frequencies[word] = 0
            frequencies[word] += 1
    return frequencies
```

```python
results = pipeline.run()
result = results[build_keyword_dictionary]
print(result['google'])
print(result['apple'])
print(result['facebook'])
print(result['microsoft'])
```
```
32
5
14
6
```


## Sorted Top 100 Words

Finally, we're ready to sort the top words used in all the titles. In this final task, it's up to you to decide how you want to sort the top words. The goal is to output a list of (`word`, `frequency`) tuples sorted from most used to least used.

```python
@pipeline.task(depends_on=build_keyword_dictionary)
def top_100(frequencies):
    # Sort (word, count) pairs by count, highest first, and keep the first 100.
    return sorted(frequencies.items(), key=lambda x: x[1], reverse=True)[:100]
```
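
Since the sorting approach is left open, one alternative worth knowing is `collections.Counter`, whose `most_common(n)` method returns the `n` highest counts directly. A short sketch using a few of the counts from this dataset:

```python
from collections import Counter

frequencies = {"google": 32, "new": 46, "data": 20, "nsa": 19}

# most_common(n) returns the n highest counts as (word, count) tuples.
top = Counter(frequencies).most_common(3)
print(top)  # [('new', 46), ('google', 32), ('data', 20)]

# Same result as the sorted() approach used in the task above.
by_sorted = sorted(frequencies.items(), key=lambda x: x[1], reverse=True)[:3]
print(top == by_sorted)  # True
```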

```python
results = pipeline.run()
result = results[top_100]
print(result)
```
```
[('new', 46), ('google', 32), ('data', 20), ('nsa', 19), ('python', 18), ('video', 18), ('using', 16), ('open', 16), ('code', 15), ('year', 15), ('bitcoin', 15), ('facebook', 14), ('startup', 14), ('free', 14), ('programming', 13), ('2014', 13), ('use', 13), ('windows', 13), ('make', 12), ('released', 12), ('state', 12), ('javascript', 12), ('court', 12), ('2013', 12), ('internet', 12), ('years', 11), ('web', 11), ('security', 11), ('apps', 11), ('work', 11), ('billion', 11), ('c', 11), ('software', 11), ('users', 10), ('yc', 10), ('linux', 10), ('way', 10), ('mac', 10), ('10', 9), ('hacker', 9), ('twitter', 9), ('source', 9), ('worlds', 9), ('project', 9), ('public', 9), ('1', 9), ('raises', 9), ('dogecoin', 9), ('time', 9), ('man', 9), ('like', 9), ('money', 9), ('stop', 8), ('machine', 8), ('death', 8), ('server', 8), ('github', 8), ('mobile', 8), ('access', 8), ('life', 8), ('os', 8), ('world', 8), ('tech', 8), ('making', 8), ('vs', 8), ('best', 8), ('good', 8), ('ceo', 7), ('ask',
```

## Conclusion

The final result yielded some interesting keywords, including terms like `bitcoin` (the cryptocurrency), `nsa`, and many others. Even though this was a basic natural language processing task, it provided some interesting insights into conversations from January 2014. Now that you have built the pipeline, there are additional tasks you can perform with the data.

Here are just a few:

- Extend the `Pipeline` class to save each task's output to a file. This lets you "checkpoint" tasks so they don't have to be run twice.
- Use the `nltk` package for more advanced natural language processing tasks.
- Convert to a CSV before filtering, so you can keep all the stories from 2014 in a raw file.
- Fetch the data directly from the Hacker News JSON API. Instead of reading from the file we provided, you can run the same processing on newer data.
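
As a starting point for the checkpointing idea, here is one possible sketch. It wraps a task in a hypothetical `checkpoint` decorator that stores the result as JSON and reuses it on later calls. It assumes the task returns a `dict`, a `list`, or an iterable of JSON-serializable items, and it does not reflect how the `Pipeline` class itself is written:

```python
import functools
import json
import os
import tempfile


def checkpoint(path):
    """Cache a task's result as JSON at `path` (hypothetical helper)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Reuse the saved output if this task already ran.
            if os.path.exists(path):
                with open(path, "r") as file:
                    return json.load(file)
            result = func(*args, **kwargs)
            # Materialize generators so the result can be serialized.
            if not isinstance(result, (dict, list)):
                result = list(result)
            with open(path, "w") as file:
                json.dump(result, file)
            return result
        return wrapper
    return decorator


calls = []
path = os.path.join(tempfile.mkdtemp(), "squares.json")


@checkpoint(path)
def squares():
    calls.append("ran")
    return (n * n for n in range(4))


first = squares()   # computes the result and writes the file
second = squares()  # loads from the file; the function body is skipped
print(first, second, len(calls))  # [0, 1, 4, 9] [0, 1, 4, 9] 1
```

One trade-off to keep in mind: a JSON round trip turns tuples into lists, so tasks that return tuples would need a different storage format.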