# Hacker News Pipeline

In this course, we began with the concepts of functional programming, and then built our own data pipeline class in Python. We learned about advanced Python concepts such as the decorators, closures, and good API design. In the last mission, we also learned how to implement a directed acyclic graph as the scheduler for our pipeline.

After completing all these missions, we have finally built a robust data pipeline that schedules our tasks in the correct order! In this guided project, we will use the pipeline we have been building, and apply it to a real world data pipeline project. From a JSON API, we will filter, clean, aggregate, and summarize data in a sequence of tasks that will apply these transformations for us.

Using this dataset, we will run a sequence of basic natural language processing tasks using our Pipeline class. The goal will be to find the top 100 keywords of Hacker News posts in 2014

## Introduction to the Data

Instantiate an instance of the Pipeline class and assign it to the variable "pipeline."

In [80]:
from pipeline import Pipeline

pipeline = Pipeline()

## Loading the JSON Data

* Create a pipeline.task() function that takes in no arguments.
* Call the function file_to_json(), where the function does the following:
    * Loads the hn_stories_2014.json file into a Python dict.
    * Returns the list of stories.

In [81]:
import json

@pipeline.task()
def file_to_json():
    with open('hn_stories_2014.json', 'r') as f:
        data = json.load(f)
        stories = data['stories']
    return stories


## Filtering the Stories

* Create a pipeline.task() function that depends on the file_to_json() function.
* Call the new function filter_stories(), that filters popular stories that have more than 50 points, more than 1 comment, and do not begin with Ask HN.
* filter_stories() should return a generator of these filtered stories

In [82]:
@pipeline.task(depends_on=file_to_json)
def filter_stories(stories):
    filtered = [x for x in stories if x["points"]>50 and x["num_comments"]>1 and not x['title'].startswith('Ask HN')]
    return filtered

## Convert to CSV

With a reduced set of stories, it's time to write these dict objects to a CSV file. The purpose of translating the dictionaries to a CSV is that we want to have a consistent data format when running the later summarizations. By keeping consistent data formats, each of your pipeline tasks will be adaptable with future task requirements.

In [83]:
from pipeline import build_csv
from datetime import datetime
import io

@pipeline.task(depends_on=filter_stories)
def json_to_csv(stories):
    lines = []
    for story in stories:
        lines.append(
            (story['objectID'], datetime.strptime(story['created_at'], "%Y-%m-%dT%H:%M:%SZ"), story['url'], story['points'], story['title'])
        )
    return build_csv(lines, header=['objectID', 'created_at', 'url', 'points', 'title'], file=io.StringIO())


## Extract Title Column

Using the CSV file format we created in the previous task, we can now extract the title column. Once we have extracted the titles of each popular post, we can then run the next word frequency task. 

In [84]:
import csv

@pipeline.task(depends_on=json_to_csv)
def extract_titles(csv_file):
    reader = csv.reader(csv_file)
    header = next(reader)
    idx = header.index('title')
    titles = [line[idx] for line in reader]
    return titles

## Clean the Titles

Because we're trying to create a word frequency model of words from Hacker News titles, we need a way to create a consistent set of words to use. 

To clean the titles, we should make sure to lower case the titles, and to remove the punctuation.

In [85]:
import string

@pipeline.task(depends_on=extract_titles)
def clean_titles(titles):
    cleaned = []
    for title in titles:
        title = title.lower()
        title = ''.join(c for c in title if c not in string.punctuation)
        cleaned.append(title)
    return cleaned

## Create the Word Frequency Dictionary

With a cleaned title, we can now build the word frequency dictionary. A word frequency dictionary are key value pairs that connects a word to the number of times it is used in a text.

In [86]:
from stop_words import stop_words

@pipeline.task(depends_on=clean_titles)
def build_keyword_dictionary(cleaned):
    word_freq = {}
    for title in cleaned:
        for word in title.split(' '):
            if word and word not in stop_words:
                if word not in word_freq:
                    word_freq[word] = 1
                word_freq[word] += 1
    return word_freq


## Sort the Top Words

Finally, we're ready to sort the top words used in all the titles. In this final task, it's up to you to decide how you want to sort the top words. The goal is to output a list of tuples with (word, frequency) as the entries sorted from most used, to least most used.

In [87]:
@pipeline.task(depends_on=build_keyword_dictionary)
def top_words(word_freq):
    freq_tuple = [
        (word, word_freq[word])
        for word in sorted(word_freq, key=word_freq.get, reverse=True)
    ]
    return freq_tuple[:100]

In [88]:
ran = pipeline.run()
print(ran[top_words])

[('new', 186), ('google', 168), ('bitcoin', 102), ('open', 93), ('programming', 91), ('web', 89), ('data', 86), ('video', 80), ('python', 76), ('code', 73), ('facebook', 72), ('released', 72), ('using', 71), ('2013', 66), ('javascript', 66), ('free', 65), ('source', 65), ('game', 64), ('internet', 63), ('microsoft', 60), ('c', 60), ('linux', 59), ('app', 58), ('pdf', 56), ('work', 55), ('language', 55), ('2014', 53), ('software', 53), ('startup', 52), ('use', 51), ('make', 51), ('apple', 51), ('security', 49), ('time', 49), ('yc', 49), ('github', 46), ('nsa', 46), ('windows', 45), ('like', 42), ('way', 42), ('world', 42), ('computer', 41), ('1', 41), ('project', 41), ('heartbleed', 41), ('users', 38), ('dont', 38), ('design', 38), ('git', 38), ('ios', 38), ('ceo', 37), ('twitter', 37), ('os', 37), ('developer', 37), ('vs', 37), ('life', 37), ('day', 36), ('big', 36), ('android', 35), ('online', 35), ('simple', 34), ('court', 34), ('years', 34), ('api', 33), ('mt', 33), ('learning', 33)