# Hacker News Pipeline

In this project, I will begin with raw data then filter, clean, aggregate, and summarize it in a sequence of tasks using a data pipeline. The dataset I will be using comes from a [Hacker News](https://news.ycombinator.com/) API that returns JSON data of the top stories in 2014. The JSON posts are in a file called `hn_stories_2014.json`. Each post has a set of keys, and I am interested in the following keys:

* `created_at`: A timestamp of the story's creation time.
* `created_at_i`: A unix epoch timestamp.
* `url`: The URL of the story link.
* `objectID`: The ID of the story.
* `author`: The story's author (username on HN).
* `points`: The number of upvotes the story had.
* `title`: The headline of the post.
* `num_comments`: The number of a comments a post has.

With this dataset, I will run a sequence of basic natural language processing tasks using the `Pipeline` class. My goal will be to find the top 100 keywords of Hacker News posts in 2014. This will give insight into some of the most popular tech topics in that year.

I will create pipieline tasks that do the following:

1) Loads the JSON file data into Python by parses it into a Python `dict` object

2) Filters popular stories by posts that have more than 50 points, more than one comment, and do not begin with "Ask HN" 

3) Writes the `dict` objects to a CSV file in order to have a consistent data cormat when running later summarizations

4) Extracts the title column from each post and runs the next word frequency task

5) Turns titles into all lowercase letters and removes punctuation, in order to recognize "Title" and "title!" as the same word

6) Builds a word frequency dictionary with key value pairs that connect words to the number of times that word is used in a text, as well as excludes stop words such as "the, "or", etc.

7) Sorts the top words used in all the titles

In [1]:
import csv
import json
import io
import string
from datetime import datetime
from pipeline import build_csv, Pipeline
from stop_words import stop_words

pipeline = Pipeline()

@pipeline.task()
def file_to_json():
    with open('hn_stories_2014.json', 'r') as file:
        data = json.load(file)
        stories = data['stories']
    return stories
    
@pipeline.task(depends_on=file_to_json)
def filter_stories(stories):
    def popular(story):
        return story['points'] > 50 and story['num_comments'] > 1 and not story['title'].startswith("Ask HN") 
    return (story for story in stories if popular(story))

@pipeline.task(depends_on=filter_stories)
def json_to_csv(stories):
    lines = []
    for story in stories:
        lines.append((
            story['objectID'], 
            datetime.strptime(story['created_at'], "%Y-%m-%dT%H:%M:%SZ"),
            story['url'],
            story['points'],
            story['title']
        ))
    csv_file = build_csv(
        lines, 
        header=['objectID', 'created_at', 'url', 'points', 'title'],
        file=io.StringIO()
    )
    return csv_file.readlines()

@pipeline.task(depends_on=json_to_csv)
def extract_titles(csv_file):
    reader = csv.reader(csv_file)
    header = next(reader)
    idx = header.index('title')
    return ((line[idx]) for line in reader)

@pipeline.task(depends_on=extract_titles)
def clean_titles(titles):
    for title in titles:
        title = title.lower()
        title = ''.join(x for x in title if x not in string.punctuation)
        yield title

@pipeline.task(depends_on=clean_titles)
def build_keyword_dictionary(titles):
    word_frequency = {}
    for title in titles:
        for word in title.split(' '):
            if word and word not in stop_words:
                if word not in word_frequency:
                    word_frequency[word] = 1
                word_frequency[word] += 1
    return word_frequency

@pipeline.task(depends_on=build_keyword_dictionary)
def get_top_100(word_freq):
    all = [
        (word, word_freq[word]) 
        for word in sorted(word_freq, key=word_freq.get, reverse=True)]
    return all[:100]


Now that the entire pipeline is built, it is time to run it and determine the top 100 words that were discussed in the Hacker News posts of 2014. 

In [2]:
output = pipeline.run()
print(output[get_top_100])

[('new', 186), ('google', 168), ('bitcoin', 102), ('open', 93), ('programming', 91), ('web', 89), ('data', 86), ('video', 80), ('python', 76), ('code', 73), ('released', 72), ('facebook', 72), ('using', 71), ('2013', 66), ('javascript', 66), ('source', 65), ('free', 65), ('game', 64), ('internet', 63), ('microsoft', 60), ('c', 60), ('linux', 59), ('app', 58), ('pdf', 56), ('language', 55), ('work', 55), ('2014', 53), ('software', 53), ('startup', 52), ('use', 51), ('make', 51), ('apple', 51), ('yc', 49), ('time', 49), ('security', 49), ('github', 46), ('nsa', 46), ('windows', 45), ('world', 42), ('like', 42), ('way', 42), ('heartbleed', 41), ('project', 41), ('computer', 41), ('1', 41), ('git', 38), ('dont', 38), ('users', 38), ('design', 38), ('ios', 38), ('life', 37), ('os', 37), ('twitter', 37), ('developer', 37), ('ceo', 37), ('vs', 37), ('day', 36), ('big', 36), ('android', 35), ('online', 35), ('years', 34), ('simple', 34), ('court', 34), ('api', 33), ('guide', 33), ('mt', 33), (