# Hacker News Data Exploration

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

- `id`: The unique identifier from Hacker News for the post
- `title`: The title of the post
- `url`: The URL that the post links to, if the post has a URL
- `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: The number of comments that were made on the post
- `author`: The username of the person who submitted the post
- `created_at`: The date and time at which the post was submitted
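
To make the column layout concrete, the sketch below pairs the column names with a single row using `dict(zip(...))`. The row is the first record shown in the data preview; the names `columns`, `sample_row`, and `post` are illustrative and not part of the analysis itself.

```python
# Column names in the order they appear in the CSV header.
columns = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# One record from the data set; every field is read from the CSV as a string.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Pair each column name with its value for easier lookup by name.
post = dict(zip(columns, sample_row))

print(post['title'])
print(int(post['num_comments']))  # numeric fields need an explicit int() conversion
```

Because the analysis keeps rows as plain lists, it reads the same fields by position: index 1 for `title`, 4 for `num_comments`, and 6 for `created_at`.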

We're specifically interested in posts whose titles begin with either `"Ask HN"` or `"Show HN"`. Users submit `"Ask HN"` posts to ask the Hacker News community a specific question. Likewise, users submit `"Show HN"` posts to show the Hacker News community a project, product, or just generally something interesting.

We'll compare these two types of posts to determine the following:

- Do `"Ask HN"` or `"Show HN"` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need and reading the data set into a list of lists.

In [1]:
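from csv import reader

# Read the whole CSV into a list of lists; the `with` block closes the file afterwards.
with open('dataset/hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

# Separate the header row from the data rows.
headers = hn[0]
hn = hn[1:]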

print(headers)
print("\n")
for each_row in hn[:5]:
    print(each_row)
    print("\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




Now that we've removed the headers from `hn`, we're ready to filter our data. Since we're only concerned with post titles beginning with `"Ask HN"` or `"Show HN"`, we'll create new lists of lists containing just the data for those titles.

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for each_row in hn:
    title = each_row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(each_row)
    elif title.startswith('show hn'):
        show_posts.append(each_row)
    else:
        other_posts.append(each_row)
        
print("Number of posts asking ('Ask HN'): " + str(len(ask_posts)))
print("Number of posts showing ('Show HN'): " + str(len(show_posts)))
print("Number of other posts: " + str(len(other_posts)))

Number of posts asking ('Ask HN'): 1744
Number of posts showing ('Show HN'): 1162
Number of other posts: 17194
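
The filter lowercases each title before calling `startswith`, so capitalization variants still match. A small sketch with made-up titles (hypothetical, not taken from the data set) shows the effect; `classify` is an illustrative helper, not a function used elsewhere in this analysis.

```python
# Hypothetical titles for illustration only; they are not rows from the data set.
titles = [
    'Ask HN: How do you organize your notes?',
    'ask hn: favorite debugging tools?',
    'Show HN: A tiny static site generator',
    'Why asking questions matters',
]

def classify(title):
    """Label a title as 'ask', 'show', or 'other' based on its prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

labels = [classify(t) for t in titles]
print(labels)
```

The last title contains the word "asking" but does not start with the prefix, so it falls into the "other" group.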


Next, let's determine if ask posts or show posts receive more comments on average.

To avoid repeating the same logic for each post type, we'll define a function, `avg_no_of_cmnts()`. It takes a list of lists (the rows for `"Ask HN"` or `"Show HN"` posts) as its only parameter, sums the comment counts across those rows, and returns the average number of comments per post.

In [3]:
def avg_no_of_cmnts(rows):
    """Return the average number of comments (column index 4) across `rows`."""
    total = 0
    for each_row in rows:
        total += int(each_row[4])

    return total / len(rows)

avg_ask_comments = avg_no_of_cmnts(ask_posts)
avg_show_comments = avg_no_of_cmnts(show_posts)

print("Average comments for posts asking ('Ask HN'): " + str(round(avg_ask_comments, 2)))
print("Average comments for posts showing ('Show HN'): " + str(round(avg_show_comments,2)))

Average comments for posts asking ('Ask HN'): 14.04
Average comments for posts showing ('Show HN'): 10.32


The results above show that, in this sample, `"Ask HN"` posts receive more comments on average (**14.04**) than `"Show HN"` posts (**10.32**).

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts. Next, we'll determine if ask posts created at a certain *time* are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.

2. Calculate the average number of comments ask posts receive by hour created.
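
Both steps depend on extracting the hour from `created_at`. As a quick sketch, assuming every timestamp follows the `month/day/year hour:minute` layout seen in the preview rows, `datetime.strptime` parses the string and `strftime("%H")` returns a zero-padded hour:

```python
import datetime as dt

# Timestamps copied from the preview rows.
samples = ['8/4/2016 11:52', '6/17/2016 0:01']

hours = []
for stamp in samples:
    parsed = dt.datetime.strptime(stamp, "%m/%d/%Y %H:%M")
    hours.append(parsed.strftime("%H"))  # zero-padded two-digit hour string

print(hours)
```

Keeping the hour as a zero-padded string means the dictionary keys look like `'00'` through `'23'`.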

In [4]:
import datetime as dt

result_list = []

for each_row in ask_posts:
    temp_list = [each_row[6], int(each_row[4])]
    result_list.append(temp_list)
    
counts_by_hour = {}
comments_by_hour = {}

for elem in result_list:
    dt_obj = dt.datetime.strptime(elem[0], "%m/%d/%Y %H:%M")
    hour = dt_obj.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = elem[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += elem[1]
    
# Swap each (hour, value) pair to [value, hour] so that sorting orders by value.
counts_by_hour_list = [[count, hour] for hour, count in counts_by_hour.items()]
comments_by_hour_list = [[total, hour] for hour, total in comments_by_hour.items()]

print("Number of posts by hour [High->Low]: ")
for count, hour in sorted(counts_by_hour_list, reverse=True):
    print('{}: {}'.format(hour, count))

print("\n")

print("Number of comments by hour [High->Low]: ")
for total, hour in sorted(comments_by_hour_list, reverse=True):
    print('{}: {}'.format(hour, total))

Number of posts by hour [High->Low]: 
15: 116
19: 110
21: 109
18: 109
16: 108
14: 107
17: 100
13: 85
20: 80
12: 73
22: 71
23: 68
01: 60
10: 59
11: 58
02: 58
00: 55
03: 54
08: 48
04: 47
05: 46
09: 45
06: 44
07: 34


Number of comments by hour [High->Low]: 
15: 4477
16: 1814
21: 1745
20: 1722
18: 1439
14: 1416
02: 1381
13: 1253
19: 1188
17: 1146
10: 793
12: 687
01: 683
11: 641
23: 543
08: 492
22: 479
05: 464
00: 447
03: 421
06: 397
04: 337
07: 267
09: 251
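
With the counts and comment totals in hand, the second step divides one by the other. The sketch below uses three of the hours printed above rather than the full dictionaries; `sample_counts`, `sample_comments`, and `avg_by_hour` are illustrative names chosen so they do not overwrite the variables from the analysis.

```python
# A subset of the per-hour results printed above (hour -> value).
sample_counts = {'15': 116, '02': 58, '20': 80}
sample_comments = {'15': 4477, '02': 1381, '20': 1722}

# Average comments per ask post for each hour.
avg_by_hour = {
    hour: sample_comments[hour] / sample_counts[hour]
    for hour in sample_counts
}

# List hours from the highest average to the lowest.
for hour, avg in sorted(avg_by_hour.items(), key=lambda item: item[1], reverse=True):
    print('{}: {:.2f} average comments per post'.format(hour, avg))
```

On this subset, hour 15 comes out highest at roughly 38.6 comments per post.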
