# Exploring Hacker News posts

## Introduction
In this project, we'll work with a data set of submissions to popular technology site [Hacker News](https://news.ycombinator.com/). It is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.


## Data set
The data set used is [available](https://www.kaggle.com/hacker-news/hacker-news-posts), and it has been reduced to approximately 20 k rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

The columns are:

|  Column name |                Description                |
|:------------:|:-----------------------------------------:|
|      id      | Post's unique identifier from Hacker News |
|     title    |             Title of the post             |
|      url     |      The URL that the posts links to      |
|  num_points  |   The number of points the post acquired  |
| num_comments |       Number of comments on the post      |
|    author    |     Username of who submitted the post    |
|  created_at  |   Date and time of the post's submission  |




We're specifically interested in posts whose titles begin with either _Ask HN_ or _Show HN_. Users submit those posts to ask the Hacker News community a specific question.


We'll compare these two types of posts to determine the following:

- Do _Ask HN_ or _Show HN_ receive more comments on average?
- Do posts created at a certain time receive more comments on average?

We can see some rows of the data set below.

In [2]:
from csv import reader

file = open('hacker_news.csv')
read_file = reader(file)
hn = list(read_file)
headers = hn[0]
hn = hn[1:]
print(hn[0])
print('\n')
print(hn[1])
print('\n')
print(hn[2])
print('\n')
print(hn[3])
print('\n')
print(hn[4])

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


And then we can observe the headers:

In [3]:
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


## Extracting Ask HN and Show HN Posts

Now we will filter the data set into 3 lists:
- ask_posts: Contains _Ask HN_ posts
- show_posts: Contains _Show HN_ posts
- other_posts: Contains the remaining posts

In [4]:
ask_posts = []
show_posts = []
other_posts = []
for post in hn:
    tit = post[1]
    title = tit.lower() #To avoid capitalization problems
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
print('Total of Ask HN posts: ' + str(len(ask_posts)))
print('Total of Show HN posts: ' + str(len(show_posts)))
print('Total of other posts: ' + str(len(other_posts)))

Total of Ask HN posts: 1744
Total of Show HN posts: 1162
Total of other posts: 17194


Now we can calculate the total of comments for each list.

In [5]:
total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
avg_ask_comments = round(total_ask_comments / len(ask_posts))

total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
avg_show_comments = round(total_show_comments / len(show_posts))

total_other_comments = 0
for post in other_posts:
    num_comments = int(post[4])
    total_other_comments += num_comments
avg_other_comments = round(total_other_comments / len(other_posts))

string = 'Average comments for {list} posts is: {num}'
print(string.format(list="Ask HN", num=avg_ask_comments))
print(string.format(list="Show HN", num=avg_show_comments))
print(string.format(list="other", num=avg_other_comments))

Average comments for Ask HN posts is: 14
Average comments for Show HN posts is: 10
Average comments for other posts is: 27


As we can see above, _Ask HN_ posts receive more comments than _Show HN_.

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

- Calculate the amount of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.

We will start at the first step:

In [6]:
import datetime as dt

result_list = []

for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])

counts_by_hour = {}
comments_by_hour = {}

for i in result_list:
    hour = i[0]
    comment = i[1]
    hour = dt.datetime.strptime(hour, "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(hour, "%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment

We'll use both dictionaries created above, to calculate the average number of comments for posts created during each hour of the day.

In [7]:
avg_by_hour = []
for i in comments_by_hour:
    counts = counts_by_hour[i]
    comments = comments_by_hour[i]
    average = comments/counts
    avg_by_hour.append([i, average])

print(avg_by_hour)

[['12', 9.337837837837839], ['08', 10.142857142857142], ['03', 7.672727272727273], ['15', 38.27350427350427], ['09', 5.586956521739131], ['22', 6.680555555555555], ['10', 13.233333333333333], ['20', 21.28395061728395], ['06', 8.844444444444445], ['16', 16.798165137614678], ['04', 7.083333333333333], ['21', 15.9], ['18', 13.1], ['01', 11.737704918032787], ['11', 10.898305084745763], ['17', 11.356435643564357], ['00', 8.160714285714286], ['13', 14.906976744186046], ['07', 7.685714285714286], ['23', 7.884057971014493], ['14', 13.13888888888889], ['02', 23.45762711864407], ['05', 10.48936170212766], ['19', 10.72972972972973]]


We can clean the data to make it easier to understand.

In [8]:
swap_avg_by_hour = []

for i in avg_by_hour: # Swap to order by average comments
    swap_avg_by_hour.append([i[1],i[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")
string = "{hour}: {average:.2f} average comments per post"
for i in sorted_swap[:5]:
    average = i[0]
    hour = i[1]
    hour = dt.datetime.strptime(hour,"%H")
    hour = dt.datetime.strftime(hour,"%H:%M")
    print(string.format(string, hour=hour, average=average))

Top 5 Hours for Ask Posts Comments
15:00: 38.27 average comments per post
02:00: 23.46 average comments per post
20:00: 21.28 average comments per post
16:00: 16.80 average comments per post
21:00: 15.90 average comments per post


We can see from the top shown above, that creating a post between 15:00 and 16:59 will provide an average of +16 comments, up to 38. Creating the post at nigth, between 20:00 and 21:59 will result in 15 to 21 comments per post.

There is an indication that Hacker News also fits nigth fans, because creating a post at 2 AM will provide an average of 23 comments.