# Exploring Hacker News posts

## Introduction
In this project, we'll work with a data set of submissions to popular technology site [Hacker News](https://news.ycombinator.com/). It is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.


## Data set
The data set used is [available](https://www.kaggle.com/hacker-news/hacker-news-posts), and it has been reduced to approximately 20 k rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

The columns are:

|  Column name |                Description                |
|:------------:|:-----------------------------------------:|
|      id      | Post's unique identifier from Hacker News |
|     title    |             Title of the post             |
|      url     |      The URL that the posts links to      |
|  num_points  |   The number of points the post acquired  |
| num_comments |       Number of comments on the post      |
|    author    |     Username of who submitted the post    |
|  created_at  |   Date and time of the post's submission  |




We're specifically interested in posts whose titles begin with either _Ask HN_ or _Show HN_. Users submit those posts to ask the Hacker News community a specific question.


We'll compare these two types of posts to determine the following:

- Do _Ask HN_ or _Show HN_ receive more comments on average?
- Do posts created at a certain time receive more comments on average?

We can see some rows of the data set below.

In [1]:
from csv import reader

file = open('hacker_news.csv')
read_file = reader(file)
hn = list(read_file)
headers = hn[0]
hn = hn[1:]
print(hn[0])
print('\n')
print(hn[1])
print('\n')
print(hn[2])
print('\n')
print(hn[3])
print('\n')
print(hn[4])

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']


And then we can observe the headers:

In [2]:
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


## Extracting Ask HN and Show HN Posts

Now we will filter the data set into 3 lists:
- ask_posts: Contains _Ask HN_ posts
- show_posts: Contains _Show HN_ posts
- other_posts: Contains the remaining posts

In [3]:
ask_posts = []
show_posts = []
other_posts = []
for post in hn:
    tit = post[1]
    title = tit.lower() #To avoid capitalization problems
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
print('Total of Ask HN posts: ' + str(len(ask_posts)))
print('Total of Show HN posts: ' + str(len(show_posts)))
print('Total of other posts: ' + str(len(other_posts)))

Total of Ask HN posts: 9139
Total of Show HN posts: 10158
Total of other posts: 273822


Now we can calculate the total of comments for each list.

In [4]:
total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
avg_ask_comments = round(total_ask_comments / len(ask_posts))

total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
avg_show_comments = round(total_show_comments / len(show_posts))

total_other_comments = 0
for post in other_posts:
    num_comments = int(post[4])
    total_other_comments += num_comments
avg_other_comments = round(total_other_comments / len(other_posts))

string = 'Average comments for {list} posts is: {num}'
print(string.format(list="Ask HN", num=avg_ask_comments))
print(string.format(list="Show HN", num=avg_show_comments))
print(string.format(list="other", num=avg_other_comments))

Average comments for Ask HN posts is: 10
Average comments for Show HN posts is: 5
Average comments for other posts is: 6


As we can see above, _Ask HN_ posts receive more comments than _Show HN_.

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

- Calculate the amount of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.

We will start at the first step:

In [5]:
import datetime as dt

result_list = []

for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])

counts_by_hour = {}
comments_by_hour = {}

for i in result_list:
    hour = i[0]
    comment = i[1]
    hour = dt.datetime.strptime(hour, "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(hour, "%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment

We'll use both dictionaries created above, to calculate the average number of comments for posts created during each hour of the day.

In [6]:
avg_by_hour = []
for i in comments_by_hour:
    counts = counts_by_hour[i]
    comments = comments_by_hour[i]
    average = comments/counts
    avg_by_hour.append([i, average])

print(avg_by_hour)

[['02', 11.122222222222222], ['01', 7.392226148409894], ['22', 8.78125], ['21', 8.67437379576108], ['19', 7.151898734177215], ['17', 9.438775510204081], ['15', 28.632148377125194], ['14', 9.673151750972762], ['13', 16.285393258426968], ['11', 8.942492012779553], ['10', 10.646643109540635], ['09', 7.058295964125561], ['07', 7.0], ['03', 7.922794117647059], ['23', 6.6773255813953485], ['20', 8.731898238747554], ['16', 7.7], ['08', 9.182170542635658], ['00', 7.546357615894039], ['18', 7.949593495934959], ['12', 12.361516034985423], ['04', 9.680327868852459], ['06', 6.757446808510639], ['05', 8.752380952380953]]


We can clean the data to make it easier to understand.

In [7]:
swap_avg_by_hour = []

for i in avg_by_hour: # Swap to order by average comments
    swap_avg_by_hour.append([i[1],i[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")
string = "{hour}: {average:.2f} average comments per post"
for i in sorted_swap[:5]:
    average = i[0]
    hour = i[1]
    hour = dt.datetime.strptime(hour,"%H")
    hour = dt.datetime.strftime(hour,"%H:%M")
    print(string.format(string, hour=hour, average=average))

Top 5 Hours for Ask Posts Comments
15:00: 28.63 average comments per post
13:00: 16.29 average comments per post
12:00: 12.36 average comments per post
02:00: 11.12 average comments per post
10:00: 10.65 average comments per post


We can see from the top shown above, that creating a post between 15:00 and 16:59 will provide an average of +16 comments, up to 38. Creating the post at nigth, between 20:00 and 21:59 will result in 15 to 21 comments per post.

There is an indication that Hacker News also fits nigth fans, because creating a post at 2 AM will provide an average of 23 comments.