# Hacker News
## Posts analysis

This project sets out to answer the following two questions:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time of day receive more comments on average?



In [1]:
# Read data
from csv import reader

# Open the local copy of the data set; the file is closed when the block ends
with open('/home/li0t/Documents/workspace/data-sets/HN_posts_year_to_Sep_26_2016.csv') as hn_posts_data:
    hn_posts = list(reader(hn_posts_data))

# Remove headers
hn_headers = hn_posts[0] 
hn_posts = hn_posts[1:]

print('Posts headers\n')
print(hn_headers)

print('Sample posts\n')
for post in hn_posts[0:3]:
    print('\n')
    print(post)
    
print('\nPosts count: {0}'.format(len(hn_posts)))

Posts headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
Sample posts



['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']

Posts count: 293119
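
As an aside, the column names printed above can also serve as keys when reading rows. The following is a minimal sketch, assuming the same header and using the first sample row as in-memory input (the `sample_csv` string is illustrative; the real file is read from disk):

```python
import csv
import io

# Header and first sample row copied from the output above.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579008,You have two days to comment if you want stem cells to be "
    "classified as your own,"
    "http://www.regulations.gov/document?D=FDA-2015-D-3719-0018,"
    "1,0,altstar,9/26/2016 3:26\n"
)

# DictReader uses the first row as field names, so each value can be
# looked up by column name instead of by position.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

first = rows[0]
print(first['author'], first['num_comments'], first['created_at'])
```

Note that every value is still a string at this point, so numeric columns such as `num_comments` need converting before any arithmetic.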


## Preparing data
Let's split the posts into three categories:
1. HN Ask posts (titles starting with `Ask HN`)
2. HN Show posts (titles starting with `Show HN`)
3. Other posts

In [2]:
# Map each column name to its index in a row
cols = {
    'id': 0,
    'title': 1,
    'url': 2,
    'num_points': 3,
    'num_comments': 4,
    'author': 5,
    'created_at': 6
}

def is_post_type(post, post_type):
    title = post[cols['title']].lower()
    return title.startswith(post_type)

hn_ask_posts = []
hn_show_posts = []
hn_other_posts = []

for post in hn_posts:
    if is_post_type(post, 'ask hn'):
        hn_ask_posts.append(post)
    elif is_post_type(post, 'show hn'):
        hn_show_posts.append(post)
    else: 
        hn_other_posts.append(post)
        
        
print('HN Ask posts: {0}'.format(len(hn_ask_posts)))
print('HN Show posts: {0}'.format(len(hn_show_posts)))
print('HN Other posts: {0}'.format(len(hn_other_posts)))

HN Ask posts: 9139
HN Show posts: 10158
HN Other posts: 273822
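
The prefix check is case-insensitive and only looks at the start of the title. A small sketch on made-up titles (the titles and the `classify` helper are illustrative and not part of the data set) shows how the edge cases fall out:

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles, chosen to exercise the edge cases.
examples = [
    'Ask HN: How do you learn a new codebase?',
    'SHOW HN: My weekend project',
    'Why I stopped reading Ask HN threads',
]

labels = [classify(title) for title in examples]
print(labels)  # ['ask', 'show', 'other']
```

The three counts above also add up to the total number of posts (9139 + 10158 + 273822 = 293119), so every post lands in exactly one category.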


## Counting comments
Which type of posts get the most comments?

In [3]:
# Count comments and calculate their averages
hn_ask_comments = 0
hn_show_comments = 0

for post in hn_ask_posts:
    hn_ask_comments += int(post[cols['num_comments']])

for post in hn_show_posts:
    hn_show_comments += int(post[cols['num_comments']])

hn_ask_comments_avg = hn_ask_comments / len(hn_ask_posts)
hn_show_comments_avg = hn_show_comments / len(hn_show_posts)

print('HN Ask posts comments average: {0:.2f}'.format(hn_ask_comments_avg))
print('HN Show posts comments average: {0:.2f}'.format(hn_show_comments_avg))

HN Ask posts comments average: 10.39
HN Show posts comments average: 4.89


**On average, HN Ask posts get roughly 2.1 times as many comments as HN Show posts (10.39 vs. 4.89).**

Given that, we'll focus the rest of the analysis on the **HN Ask** posts.
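
The ratio between the two averages can be checked directly. The values below are copied from the printed output, and the variable names are chosen only for this sketch:

```python
# Averages copied from the output above.
hn_ask_avg = 10.39
hn_show_avg = 4.89

ratio = hn_ask_avg / hn_show_avg
print('Ask/Show ratio: {0:.2f}'.format(ratio))  # Ask/Show ratio: 2.12
```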

## Determining time distribution
How are HN Ask posts (and their comments) distributed over the hours of the day?

In [18]:
import datetime as dt

date_format = '%m/%d/%Y %H:%M'

# Collect the creation date and comment count of each post
comments_by_date = []

for post in hn_ask_posts:
    date_str = post[cols['created_at']]
    comments = float(post[cols['num_comments']])
    
    comments_by_date.append([date_str, comments])
    
posts_by_hour = {}
comments_by_hour = {}

# Count posts and comments by hour
for elem in comments_by_date:
    date_str = elem[0]
    date = dt.datetime.strptime(date_str, date_format)
    hour = date.strftime('%H')
        
    comments = elem[1]
    
    if hour in posts_by_hour:
        posts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    else: 
        posts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    
# Calculate average comment by hour
avg_comments_by_hour = {}

for hour in posts_by_hour:
    posts = posts_by_hour[hour]
    comments = comments_by_hour[hour]
    
    average = comments / posts
    
    avg_comments_by_hour[hour] = average
    
# Sort hours by average, highest first. Sorting the (hour, average) pairs
# directly keeps hours whose averages tie; swapping keys and values into
# a new dict would silently drop one of them.
sorted_averages = sorted(avg_comments_by_hour.items(), key=lambda item: item[1], reverse=True)

print('Sorted average comments by hour\n')
for hour, average in sorted_averages:
    print('{0}:00 : {1:.2f}'.format(hour, average))

Sorted average comments by hour

15:00 : 28.68
13:00 : 16.32
12:00 : 12.38
02:00 : 11.14
10:00 : 10.68
04:00 : 9.71
14:00 : 9.69
17:00 : 9.45
08:00 : 9.19
11:00 : 8.96
22:00 : 8.80
05:00 : 8.79
20:00 : 8.75
21:00 : 8.69
03:00 : 7.95
18:00 : 7.94
16:00 : 7.71
00:00 : 7.56
01:00 : 7.41
19:00 : 7.16
07:00 : 7.01
06:00 : 6.78
23:00 : 6.70
09:00 : 6.65
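
The same per-hour aggregation can be written more compactly with `collections.defaultdict`. This is a sketch on a handful of made-up rows (the `sample` list is hypothetical), not a replacement for the cell above:

```python
import datetime as dt
from collections import defaultdict

date_format = '%m/%d/%Y %H:%M'

# Hypothetical (created_at, num_comments) pairs in the data set's format.
sample = [
    ('9/26/2016 3:26', 4),
    ('9/26/2016 3:59', 6),
    ('9/25/2016 15:05', 30),
    ('9/24/2016 15:40', 10),
    ('9/24/2016 9:00', 1),
]

# hour -> [post count, comment total]
totals = defaultdict(lambda: [0, 0])
for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, date_format).strftime('%H')
    totals[hour][0] += 1
    totals[hour][1] += num_comments

averages = {hour: comments / posts for hour, (posts, comments) in totals.items()}

# Sort (hour, average) pairs by average, highest first.
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for hour, average in ranked:
    print('{0}:00 : {1:.2f}'.format(hour, average))
```

Keeping the post count next to the comment total makes it easy to see how many posts sit behind each hourly average.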


## Conclusions
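- HN Ask posts receive more comments on average than HN Show posts (10.39 vs. 4.89).
- HN Ask posts created during the 15:00 hour receive by far the most comments on average (28.68), followed by 13:00 (16.32) and 12:00 (12.38). The 14:00 hour averages only 9.69, so the early afternoon is not uniformly strong.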