# Hacker News: Stories, 2014

<i>Created by the startup incubator Y Combinator in 2007, [Hacker News](https://news.ycombinator.com/) is a social news site where *posts* (user-submitted content) are voted and commented on, in a format much like Reddit's. Unlike Reddit, however, Hacker News makes some voting privileges, such as downvoting, available only once users have accumulated enough karma (user points), which is meant to discourage [trolling](https://unlcms.unl.edu/engineering/james-hanson/trolls-and-their-impact-social-media) and encourage intelligent, respectful discourse. Because the site is popular in technology and startup circles, its top posts can draw hundreds of thousands of user engagements.</i>

## Dataset

The dataset for this mission is in JavaScript Object Notation (JSON). `hn_2014.json` was downloaded from the Hacker News API and contains data about stories posted to Hacker News in 2014. Each story is a JSON object whose keys include the title, URL, points, number of comments, and creation date, to name a few.

To review:

| Column              | Definition          |
|:--------------------|:--------------------|
| author | The username of the person who submitted the story. |
| createdAt | The date and time at which the story was created. |
| createdAtI | An integer value representing the date and time at which the story was created. |
| numComments | The number of comments that were made on the story. |
| objectId | The unique identifier from Hacker News for the story. |
| points | The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes. |
| storyText | The text of the story (if the story contains text). |
| tags | A list of tags associated with the story. |
| title | The title of the story. |
| url | The URL that the story links to (if the story links to a URL). |
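
As a minimal sketch of how a JSON array of story objects maps onto a DataFrame, the example below parses two invented records that use only a subset of the keys above. The names `raw`, `records`, and `frame` are hypothetical and do not appear in the notebook.

```python
import json
import pandas as pd

# Hypothetical records; the values are invented, not taken from hn_2014.json
raw = """[
  {"author": "alice", "title": "First post", "points": 3, "numComments": 0},
  {"author": "bob", "title": "Second post", "points": 10, "numComments": 4}
]"""

# json.loads turns the JSON array into a list of dictionaries
records = json.loads(raw)

# Each dictionary becomes a row; each key becomes a column
frame = pd.DataFrame(records)
print(frame.shape)  # (2, 4)
```
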

The aim here is to explore tips and syntax shortcuts, including list comprehensions, passing functions as arguments and lambda functions.

In [1]:
import numpy as np
import pandas as pd
import json

In [2]:
# open the file (closing it automatically), load the JSON, and create a DataFrame
with open('datasets/hn_2014.json') as hn:
    hnjs = json.load(hn)
hndf = pd.DataFrame(hnjs)

In [3]:
hndf.head(4).T

| | 0 | 1 | 2 | 3 |
|:---|:---|:---|:---|:---|
| author | dragongraphics | jcr | callum85 | d3v3r0 |
| numComments | 0 | 0 | 0 | 0 |
| points | 2 | 1 | 1 | 1 |
| url | http://ashleynolan.co.uk/blog/are-we-getting-t... | http://spectrum.ieee.org/automaton/robotics/ho... | http://online.wsj.com/articles/apple-to-buy-be... | http://alexsblog.org/2014/05/29/dont-wait-for-... |
| storyText | | | | |
| createdAt | 2014-05-29T08:07:50Z | 2014-05-29T08:05:58Z | 2014-05-29T08:05:06Z | 2014-05-29T08:00:08Z |
| tags | [story, author_dragongraphics, story_7815238] | [story, author_jcr, story_7815234] | [story, author_callum85, story_7815230] | [story, author_d3v3r0, story_7815222] |
| createdAtI | 1401350870 | 1401350758 | 1401350706 | 1401350408 |
| title | Are we getting too Sassy? Weighing up micro-op... | Telemba Turns Your Old Roomba and Tablet Into ... | Apple Agrees to Buy Beats for $3 Billion | Don’t wait for inspiration |
| objectId | 7815238 | 7815234 | 7815230 | 7815222 |


## List Comprehensions

A list comprehension provides a concise syntax for creating a new list out of an existing one, using a single line of code. Below are some examples.
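
As a minimal sketch of the equivalence (the `numbers` list is invented for illustration), the explicit loop and the list comprehension below build the same list:

```python
# Hypothetical sample data, not taken from the dataset
numbers = [1, 2, 3, 4]

# Explicit loop: create an empty list, then append to it one item at a time
squares_loop = []
for n in numbers:
    squares_loop.append(n * n)

# List comprehension: the same transformation as a single expression
squares_comp = [n * n for n in numbers]

print(squares_loop == squares_comp)  # True
```
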

In [4]:
cols = ["col_{}".format(i) for i in range(1,5)]
data = np.zeros((4,4))

df = pd.DataFrame(data, columns=cols)
df

| | col_1 | col_2 | col_3 | col_4 |
|:---|:---|:---|:---|:---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 |


In [5]:
# generate a new column using list comprehension
hndf['hasComments'] = [x > 0 for x in hndf['numComments']]
hndf[hndf['hasComments']].head(3).T

| | 11 | 14 | 19 |
|:---|:---|:---|:---|
| author | outrightfree | kamaal | mr_tyzic |
| numComments | 1 | 2 | 27 |
| points | 1 | 2 | 161 |
| url | http://techcrunch.com/gallery/five-super-succe... | https://www.kickstarter.com/projects/227461008... | http://projects.aljazeera.com/2014/portrait-of... |
| storyText | | | |
| createdAt | 2014-05-29T06:18:51Z | 2014-05-29T05:05:22Z | 2014-05-29T03:51:01Z |
| tags | [story, author_outrightfree, story_7815001] | [story, author_kamaal, story_7814838] | [story, author_mr_tyzic, story_7814608] |
| createdAtI | 1401344331 | 1401339922 | 1401335461 |
| title | Five Super Successful Tech Pivots | Gi Bike: The light, full-size, electric, foldi... | For Hire: Dedicated Young Man With Down Syndrome |
| objectId | 7815001 | 7814838 | 7814608 |


In [6]:
# create a list of urls
urls = [url for url in hndf['url']]
urls[:5]

['http://ashleynolan.co.uk/blog/are-we-getting-too-sassy',
 'http://spectrum.ieee.org/automaton/robotics/home-robots/telemba-telepresence-robot',
 'http://online.wsj.com/articles/apple-to-buy-beats-1401308971',
 'http://alexsblog.org/2014/05/29/dont-wait-for-inspiration/',
 'http://techcrunch.com/2014/05/28/hackerone-get-9m-in-series-a-funding-to-build-bug-tracking-bounty-programs/']

In [7]:
# use an if clause as a filter: keep the comment counts of stories with more than 1,000 points
thousand_points = [n for n, p in zip(hndf['numComments'], hndf['points']) if p > 1000]

# total number of comments across those stories
sum(thousand_points)

3401
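
The same filter can also be written with a pandas boolean mask instead of a comprehension. The sketch below compares the two on a small invented DataFrame (`sample` is a hypothetical stand-in for `hndf`, so the total differs from the notebook's result):

```python
import pandas as pd

# Hypothetical miniature stand-in for hndf; the values are invented
sample = pd.DataFrame({
    "points": [1500, 20, 1001, 1000],
    "numComments": [300, 5, 120, 80],
})

# List comprehension with an if clause, pairing the two columns with zip
via_comprehension = [
    n for n, p in zip(sample["numComments"], sample["points"]) if p > 1000
]

# Boolean mask: select numComments where points exceed 1000, then sum
via_mask = sample.loc[sample["points"] > 1000, "numComments"].sum()

print(sum(via_comprehension), via_mask)  # 420 420
```

The comprehension keeps the focus on Python syntax, while the mask is the more common pandas idiom for larger frames.
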

## Passing Functions as Arguments

Appending parentheses to a function's name calls the function. Without the parentheses, the name refers to the function object itself: its logic stays intact, but nothing is executed. In that form, a function behaves like any other variable and can be passed as an argument. The example below illustrates this.

In [8]:
def a_function():
    return "This is a function"

a_function()

'This is a function'

In [9]:
t = type(a_function)
print(t)

<class 'function'>


In [10]:
sisters = [
    {"age": 36, "name": "Sabine"},
    {"age": 40, "name": "Zoe"},
    {"age": 41, "name": "Heidi"}
]

def get_age(json_dict):
    return json_dict['age']

# get the details of the youngest
youngest = min(sisters, key=get_age)
print('My youngest sister is {} at {} years old.'.format(youngest['name'], youngest['age']))

My youngest sister is Sabine at 36 years old.
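
To make the role of the `key` argument more concrete, here is a simplified sketch of what a function like `min` does with it. `min_by` is a hypothetical helper written for illustration; it is not how the built-in is actually implemented.

```python
sisters = [
    {"age": 36, "name": "Sabine"},
    {"age": 40, "name": "Zoe"},
    {"age": 41, "name": "Heidi"},
]

def get_age(json_dict):
    return json_dict["age"]

def min_by(items, key):
    """Hypothetical helper: a simplified stand-in for min(items, key=key)."""
    best = None
    for item in items:
        # key arrives as an argument and is called here like any other function
        if best is None or key(item) < key(best):
            best = item
    # Returns None for an empty input, whereas the built-in min raises ValueError
    return best

# get_age is passed without parentheses, so it is handed over rather than called
print(min_by(sisters, get_age)["name"])  # Sabine
```
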


In [11]:
def get_num_comments(json_dict):
    return json_dict['numComments']

# get the post with the most comments
max(hnjs, key=get_num_comments)

{'author': 'platz',
 'numComments': 1208,
 'points': 889,
 'url': 'https://blog.mozilla.org/blog/2014/04/03/brendan-eich-steps-down-as-mozilla-ceo/',
 'storyText': None,
 'createdAt': '2014-04-03T19:02:53Z',
 'tags': ['story', 'author_platz', 'story_7525198'],
 'createdAtI': 1396551773,
 'title': 'Brendan Eich Steps Down as Mozilla CEO',
 'objectId': '7525198'}
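
For simple key lookups like this one, the standard library's `operator.itemgetter` can stand in for a hand-written helper. The sketch below uses invented records (`stories` is hypothetical and only mimics the shape of the entries in `hnjs`):

```python
from operator import itemgetter

# Hypothetical records shaped like the entries in hnjs; the values are invented
stories = [
    {"title": "A", "numComments": 3},
    {"title": "B", "numComments": 12},
    {"title": "C", "numComments": 7},
]

# itemgetter("numComments") returns a callable that looks up that key,
# which is the same job get_num_comments performs
most_discussed = max(stories, key=itemgetter("numComments"))
print(most_discussed["title"])  # B
```
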

## Lambda Functions

Functions are usually defined so that their logic can be reused. For temporary or one-off functions, Python offers a special syntax: lambda functions. A lambda function is a single expression written inline, so it can be defined at the exact point where it is passed as an argument.
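
As a small sketch of the equivalence, the named function and the lambda below produce the same sort order. The `stories` list and `get_points` are invented for illustration.

```python
# Hypothetical records; the values are invented
stories = [
    {"title": "A", "points": 5},
    {"title": "B", "points": 42},
    {"title": "C", "points": 17},
]

# Named function: defined once, reusable anywhere
def get_points(story):
    return story["points"]

by_def = sorted(stories, key=get_points, reverse=True)

# Lambda: the same logic written inline, right where it is needed
by_lambda = sorted(stories, key=lambda story: story["points"], reverse=True)

print(by_def == by_lambda)  # True
print([s["title"] for s in by_lambda])  # ['B', 'C', 'A']
```
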

In [12]:
# use a lambda function to generate a key out of a story's points
hn_sorted_points = sorted(hnjs, key=lambda d: d['points'], reverse=True)
hn_sorted_points[0:2]

[{'author': 'frederfred',
  'numComments': 398,
  'points': 2732,
  'url': 'http://gabrielecirulli.github.io/2048/',
  'storyText': '',
  'createdAt': '2014-03-10T15:44:42Z',
  'tags': ['story', 'author_frederfred', 'story_7373566'],
  'createdAtI': 1394466282,
  'title': '2048',
  'objectId': '7373566'},
 {'author': 'brokenparser',
  'numComments': 260,
  'points': 1958,
  'url': 'https://thedaywefightback.org/',
  'storyText': '',
  'createdAt': '2014-02-11T08:12:28Z',
  'tags': ['story', 'author_brokenparser', 'story_7216471'],
  'createdAtI': 1392106348,
  'title': 'Today is The Day We Fight Back',
  'objectId': '7216471'}]

In [13]:
top_hn_stories = [d['title'] for d in hn_sorted_points[:5]]
top_hn_stories

['2048',
 'Today is The Day We Fight Back',
 'Wozniak: “Actually, the movie was largely a lie about me”',
 'Microsoft Open Sources C# Compiler',
 'Elon Musk: To the People of New Jersey']

### Ternary Operator

A ternary operator (in Python, a conditional expression) returns one of two values depending on a boolean expression. The syntax is as follows:

`[on_true] if [expression] else [on_false]`
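
A minimal sketch comparing a full `if`/`else` statement with the equivalent ternary form; the function names and labels are invented for illustration.

```python
# Full if/else statement
def label_if_else(num_comments):
    if num_comments > 0:
        return "discussed"
    else:
        return "quiet"

# Ternary (conditional expression): the same decision as one expression
def label_ternary(num_comments):
    return "discussed" if num_comments > 0 else "quiet"

print([label_ternary(n) for n in [0, 3]])  # ['quiet', 'discussed']
```

Because the ternary form is an expression, it can be used inside a lambda or a list comprehension, where a multi-line `if` statement is not allowed.
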

In [14]:
tags = hndf['tags']

# identify type of tags
tag_types = tags.apply(type)
tag_types.value_counts(dropna=False)

<class 'list'>    35806
Name: tags, dtype: int64

In [15]:
# number of tags per story
tag_counts = tags.apply(len)
tag_counts.value_counts(dropna=False)

3    33459
4     2347
Name: tags, dtype: int64

In [16]:
# get stories with four tags
four_tags = tags[tags.apply(len) == 4]

# grab the fourth tag if there is one
cleaned_tags = tags.apply(lambda l: l[-1] if len(l) == 4 else None)

# assign results back to the main dataframe
hndf['tags'] = cleaned_tags
hndf['tags'].value_counts(dropna=False)

NaN        33459
ask_hn      1348
show_hn      999
Name: tags, dtype: int64

In [17]:
hndf[hndf['tags'] == 'ask_hn'].head().T

| | 86 | 104 | 107 | 165 | 281 |
|:---|:---|:---|:---|:---|:---|
| author | cweagans | nightstrike789 | ISeemToBeAVerb | greekspain | hoodoof |
| numComments | 4 | 0 | 0 | 0 | 1 |
| points | 1 | 1 | 1 | 1 | 1 |
| url | | | | | http://www.wikihow.com/Capitalise-Correctly |
| storyText | As a followup to my question from a few days a... | I am working on a personal project to help peo... | Howdy folks. I&#x27;m a designer&#x2F;develope... | Hi, just wondering - I am playing for a univer... | |
| createdAt | 2014-05-28T19:51:02Z | 2014-05-28T18:57:32Z | 2014-05-28T18:47:05Z | 2014-05-28T15:57:05Z | 2014-05-28T05:38:21Z |
| tags | ask_hn | ask_hn | ask_hn | ask_hn | ask_hn |
| createdAtI | 1401306662 | 1401303452 | 1401302825 | 1401292625 | 1401255501 |
| title | Ask HN: New technical cofounder. How should ow... | Ask HN: Categorizing company cultures | Ask HN: Content Design Service? | Ask HN: Does anyone have a uber / taxi clone a... | Ask HN editors: why are HN headlines so badly ... |
| objectId | 7812404 | 7812099 | 7812048 | 7810927 | 7808556 |
