# Hacker News Pipeline

From a JSON API, transform data with a Pipeline class of processing tasks to filter, clean, aggregate, and summarize data including runing a sequence of basic natural language processing. The goal will be to find the top 100 keywords of Hacker News posts in 2014. Because Hacker News is the most popular technology social media site, this will give us an understanding of the most talked about tech topics in 2014!

The data comes from a Hacker News (HN) API that returns JSON data of the top stories in 2014. Hacker News is a link aggregator website that users vote up stories that are interesting to the community. It is similar to Reddit, but the community only revolves around on computer science and entrepreneurship posts.

The JSON file contains a single key stories, which contains a list of stories (posts). Each post has a set of keys, but we will deal only with the following keys:

- created_at: A timestamp of the story's creation time.
- created_at_i: A unix epoch timestamp.
- url: The URL of the story link.
- objectID: The ID of the story.
- author: The story's author (username on HN).
- points: The number of upvotes the story had.
- title: The headline of the post.
- num_comments: The number of a comments a post has.



In [12]:
from pipeline import Pipeline, build_csv
import json
import csv
import io
import string
from stop_words import stop_words
from datetime import datetime


In [13]:
pipeline = Pipeline()

Step 1: load in all the stories as a list of dict objects

In [14]:
@pipeline.task()
def file_to_json():
    with open('hn_stories_2014.json', 'r') as f:
        data = json.load(f)
    return data['stories']


Step 2: filter popular stories that have more than 50 points, more than 1 comment, and do not begin with "Ask HN".

In [15]:
@pipeline.task(depends_on=file_to_json)
def filter_stories(stories):
    def is_popular(story):
        return story['points']>50 and story['num_comments'] and not story['title'].startswith('Ask HN') 
    
    return (story for story in stories if is_popular(story))  
        
    

Step 3: write these dict objects to a CSV file. The purpose of translating the dictionaries to a CSV is to have a consistent data format when running the later summarizations. By keeping consistent data formats, each of your pipeline tasks will be adaptable with future task requirements.

In [16]:
@pipeline.task(depends_on=filter_stories)
def json_to_csv(stories):
    lines = []
    for story in stories:
        line = (story['objectID'], datetime.strptime(story['created_at'], '%Y-%m-%dT%H:%M:%SZ'), story['url'], story['points'], story['title']) 
        lines.append(line)
    return build_csv(lines, header=['objectID', 'created_at', 'url', 'points', 'title'], file=io.StringIO())
          

Step 4: extract the title column using the CSV file format we created in the previous task. 

In [17]:
@pipeline.task(depends_on=json_to_csv)
def extract_titles(csvfile):
    reader = csv.reader(csvfile)
    header = next(reader)
    title_idx = header.index('title')
    return (line[title_idx] for line in reader)
               

Step 5: to clean the titles, lower case the titles, and remove the punctuation before creating a non-empty word frequency model of words from Hacker News titles,

In [18]:
@pipeline.task(depends_on=extract_titles)
def clean_titles(titles):
    for title in titles:
        title = title.lower()
        title = ''.join(c for c in title if c not in string.punctuation)  
        yield title
    

Step 6: build the word frequency dictionary not including stop words.

In [19]:
@pipeline.task(depends_on=clean_titles)
def build_keyword_dictionary(titles):
    word_frequency = {}
    for title in titles:
        words = title.split(' ')
        for word in words:
            if word and word not in stop_words:
                if word not in word_frequency:
                    word_frequency[word] = 1
                word_frequency[word] += 1
                                 
    return word_frequency


Step 7: sort the top words used in all the titles.   Save a file of the output for each task. This will allow you to "checkpoint" tasks so they don't have to be run twice.

In [20]:
@pipeline.task(depends_on=build_keyword_dictionary)
def sort_top_words(word_freq):
    sorted_word_freq = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)  
    
    csv_file = 'word_frequency_desc.csv'  
    try:
        with open(csv_file, 'w') as f:
            writer = csv.writer(f)
            writer.writerow(('words', 'frequency'))
            for row in sorted_word_freq:
                writer.writerow(row)
    except IOError:
        print('I/O error')
        
    return sorted_word_freq[:100]




Step 9: Run the pipeline and print the word and frequency in descending order

In [21]:
run_pipeline = pipeline.run()
print(run_pipeline[sort_top_words])


[('new', 186), ('google', 168), ('bitcoin', 102), ('open', 94), ('programming', 91), ('web', 89), ('data', 87), ('video', 80), ('python', 77), ('code', 73), ('released', 72), ('facebook', 72), ('using', 71), ('javascript', 66), ('free', 66), ('source', 66), ('2013', 66), ('game', 64), ('internet', 63), ('microsoft', 60), ('c', 60), ('linux', 59), ('app', 58), ('pdf', 56), ('language', 55), ('work', 55), ('software', 53), ('2014', 53), ('startup', 52), ('make', 51), ('use', 51), ('apple', 51), ('yc', 49), ('security', 49), ('time', 49), ('nsa', 46), ('github', 46), ('windows', 45), ('way', 42), ('like', 42), ('world', 42), ('project', 41), ('heartbleed', 41), ('computer', 41), ('1', 41), ('git', 38), ('dont', 38), ('design', 38), ('ios', 38), ('users', 38), ('os', 37), ('developer', 37), ('twitter', 37), ('vs', 37), ('ceo', 37), ('life', 37), ('big', 36), ('day', 36), ('online', 35), ('android', 35), ('years', 34), ('simple', 34), ('court', 34), ('mt', 33), ('guide', 33), ('apps', 33), 

Additional Steps:

- Use the nltk package for more advanced natural language processing tasks.

- Convert to a CSV before filtering, so you can keep all the stories from 2014 in a raw file.

- Fetch the data from Hacker News directly from a JSON API. Instead of reading from the file we gave, you can perform additional data processing using newer data.