# Hacker News Pipeline

In this guided project, from a JSON API, we will filter, clean, aggregate, and summarize data in a sequence of tasks that will apply these transformations for us.

The data we will use comes from a Hacker News (HN) API that returns JSON data of the top stories in 2014. Hacker News is a link aggregator website that users vote up stories that are interesting to the community. It is similar to Reddit, but the community only revolves around on computer science and entrepreneurship posts.

The data used for the project includes a list of JSON posts in a file called *hn_stories_2014.json*. The JSON file contains a single key stories, which contains a list of stories (posts). Each post has a set of keys, but we will deal only with the following keys:

- **created_at**: A timestamp of the story's creation time.
- **created_at_i**: A unix epoch timestamp.
- **url**: The URL of the story link.
- **objectID**: The ID of the story.
- **author**: The story's author (username on HN).
- **points**: The number of upvotes the story had.
- **title**: The headline of the post.
- **num_comments**: The number of a comments a post has.

Using this dataset, we will run a sequence of basic natural language processing tasks using our Pipeline class. The goal will be to find the top 100 keywords of Hacker News posts in 2014. Because Hacker News is the most popular technology social media site, this will give us an understanding of the most talked about tech topics in 2014!

## Introduction to the Data

In [10]:
from datetime import datetime
import json
import io
import csv
import string

from pipeline import build_csv, Pipeline
from stop_words import stop_words

pipeline = Pipeline()

## Loading the JSON data

In [11]:
# Create a task function that takes no argument
@pipeline.task()
def file_to_json():
    with open('hn_stories_2014.json', 'r') as f:
        data = json.load(f)       # load json file in a dict
        stories = data['stories'] # return the list of stories
    return stories

## Filtering the stories

Like any social link aggregator site, individual users can post whatever content they want. The reason we want the most popular stories is to ensure that we select stories that were the most talked about during the year. We can filter for popular stories by ensuring they are links (not *Ask HN* posts), have a good number of points, and have some comments.

In [12]:
# Create a task function that depends on the file_to_json() function
@pipeline.task(depends_on=file_to_json)

# Filter popular stories that have 
# more than 50 points, more than 1 comment, and do not begin with Ask HN
def filter_stories(stories):
    def is_popular(story):
        return story['points'] > 50 and story['num_comments'] > 1 and not story['title'].startswith('Ask HN') 
    # Return a generator of these filtered stories
    return (story for story in stories if is_popular(story)) 

## Converting to CSV

With a reduced set of stories, it's time to write these dict objects to a CSV file. The purpose of translating the dictionaries to a CSV is that we want to have a consistent data format when running the later summarizations. By keeping consistent data formats, each of your pipeline tasks will be adaptable with future task requirements.

In [13]:
# Create a task function that depends on the filter_stories() function
@pipeline.task(depends_on=filter_stories)

# Create a function to writed filtered JSON stories to a CSV file
def json_to_csv(stories): # return value from build_csv using lines, header and io.StringIO() file
    lines = []
    for story in stories:
        lines.append(
            (story['objectID'], 
             datetime.strptime(story['created_at'], "%Y-%m-%dT%H:%M:%SZ"), 
             story['url'], story['points'], story['title'])
        )
    return build_csv(lines, 
                     header=['objectID', 'created_at', 'url', 'points', 'title'], 
                     file=io.StringIO())

## Extracting Title Column

Using the CSV file format we created in the previous task, we can now extract the title column. Once we have extracted the titles of each popular post, we can then run the next word frequency task.

In [14]:
# Create a task function that depends on the json_to_csv() function
@pipeline.task(depends_on=json_to_csv)

# Create a function that returns a generator of every Hacker News story title
def extract_titles(csv_file):
    reader = csv.reader(csv_file) # create object from the file object
    header = next(reader)         # get header
    idx = header.index('title')   # find the index of the title in the header
    # Loop through the reader and return each item of the title index position
    return (line[idx] for line in reader)

## Cleaning the Titles

Because we're trying to create a word frequency model of words from Hacker News titles, we need a way to create a consistent set of words to use. For example, words like *Google, google, GooGle?,* and *google.*, all mean the same keyword: google. If we were to split the title into words, however, they would all be lumped into different categories.

To clean the titles, we should make sure to lower case the titles, and to remove the punctuation. An easy way to rid a string of punctuation is to check each character, determine if it is a letter or punctuation, and only keep the letter. From the *string* package, we are given a handy string constant that contains all the punctuation needed:

In [15]:
# Create a task function that depends on the extract_titles() function
@pipeline.task(depends_on=extract_titles)
def clean_title(titles): # return cleaned titles
    for title in titles:
        title = title.lower() # title is lower case
        title = ''.join(c for c in title if c not in string.punctuation) # remove punctuation from titles
        yield title

## Creating the Word Frequency Dictionary

With a cleaned title, we can now build the word frequency dictionary. A word frequency dictionary are key value pairs that connects a word to the number of times it is used in a text. Furthermore, to find actual keywords, we should enforce the word frequency dictionary to not include stop words. Stop words are words that occur frequently in language like "the", "or", etc., and are commonly rejected in keyword searches.

In [16]:
# Create a task function that depends on the clean_titles() function
@pipeline.task(depends_on=clean_title)

# Create a function that returns a dit of the word frequency of all titles
def build_keyword_dictionary(titles):
    word_freq = {}
    for title in titles:
        for word in title.split(' '):
            if word and word not in stop_words: # word frequency should NOT include stop words
                if word not in word_freq:
                    word_freq[word] = 1
                word_freq[word] += 1
    return word_freq

## Sorting the Top Words

In [17]:
# Create a task function that depends on the build_keyword_dictionary() function
@pipeline.task(depends_on=build_keyword_dictionary)

# Create a function that returns a list of top 100 tuples 
def top_keywords(word_freq):
    freq_tuple = [
        (word, word_freq[word])
        for word in sorted(word_freq, key=word_freq.get, reverse=True)
    ]
    return freq_tuple[:100]

# Run the pipeline
ran = pipeline.run()
print(ran[top_keywords])

[('new', 186), ('google', 168), ('bitcoin', 102), ('open', 93), ('programming', 91), ('web', 89), ('data', 86), ('video', 80), ('python', 76), ('code', 73), ('facebook', 72), ('released', 72), ('using', 71), ('2013', 66), ('javascript', 66), ('free', 65), ('source', 65), ('game', 64), ('internet', 63), ('microsoft', 60), ('c', 60), ('linux', 59), ('app', 58), ('pdf', 56), ('work', 55), ('language', 55), ('software', 53), ('2014', 53), ('startup', 52), ('apple', 51), ('use', 51), ('make', 51), ('time', 49), ('yc', 49), ('security', 49), ('nsa', 46), ('github', 46), ('windows', 45), ('world', 42), ('way', 42), ('like', 42), ('1', 41), ('project', 41), ('computer', 41), ('heartbleed', 41), ('git', 38), ('users', 38), ('dont', 38), ('design', 38), ('ios', 38), ('developer', 37), ('os', 37), ('twitter', 37), ('ceo', 37), ('vs', 37), ('life', 37), ('big', 36), ('day', 36), ('android', 35), ('online', 35), ('years', 34), ('simple', 34), ('court', 34), ('guide', 33), ('learning', 33), ('mt', 3

## Conclusion

The final result yielded some interesting keywords. There were terms like bitcoin (the cryptocurrency), heartbleed (the 2014 hack), and many others. Even though this was a basic natural language processing task, it did provide some interesting insights into conversations from 2014. 

## Next Steps

- Rewrite the Pipeline class' output to save a file of the output for each task. This will allow us to "checkpoint" tasks so we don't have to be run twice.
- Use the nltk package for more advanced natural language processing tasks.
- Convert to a CSV before filtering, so we can keep all the stories from 2014 in a raw file.
- Fetch the data from Hacker News directly from a JSON API (https://hn.algolia.com/api). Instead of reading from the original file, we can perform additional data processing using newer data.