# Hacker News post analysis

This analysis looks at post data from the site Hacker News. It aims to answer these two questions:

1. Do `Ask HN` or `Show HN` posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

This analysis removes all submissions that did not receive any comments and then randomly samples from the remaining submissions, reducing the data from 300k rows to 20k.
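The filtering and sampling step itself is not shown in this notebook. Here is a minimal sketch of how it could be done, assuming the full dataset has the same columns as `hacker_news.csv`. The function name `filter_and_sample`, the seed, and the tiny `demo_rows` list are made up for illustration.

```python
import random

def filter_and_sample(rows, sample_size, seed=0):
    """Drop posts with zero comments, then randomly sample the rest.

    `rows` is a list of lists without the header row; the comment count
    is assumed to be at index 4, as in hacker_news.csv.
    """
    commented = [row for row in rows if int(row[4]) > 0]
    # Never ask for more rows than are available.
    size = min(sample_size, len(commented))
    return random.Random(seed).sample(commented, size)

# Hypothetical stand-in rows:
# id, title, url, num_points, num_comments, author, created_at
demo_rows = [
    ['1', 'Ask HN: A', '', '3', '0', 'a', '1/1/2016 10:00'],
    ['2', 'Show HN: B', '', '5', '4', 'b', '1/2/2016 11:00'],
    ['3', 'C', 'http://example.com', '9', '12', 'c', '1/3/2016 12:00'],
    ['4', 'D', 'http://example.com', '1', '0', 'd', '1/4/2016 13:00'],
]

sample = filter_and_sample(demo_rows, sample_size=2)
print(len(sample))
```

Passing a fixed seed makes the sample reproducible between runs, which helps when the averages later in the analysis need to be compared.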


In [15]:
# reading in the data 
import csv
import datetime as dt

# open the file in a with-block so it is closed after reading
with open('hacker_news.csv', newline='') as opened_file:
    read_file = csv.reader(opened_file)
    hn = list(read_file)

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [9]:
# make a list of just the header 
headers = hn[0]

# remove the header row from the list of lists
hn = hn[1:]

# verify that the header row has been removed
hn[:2]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30']]

In [10]:
ask_posts = [] 
show_posts = [] 
other_posts = [] 

for post in hn: 
    title = post[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif title.lower().startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
        
print('number of ask posts:', len(ask_posts))
print('number of show posts:', len(show_posts))
print('number of other posts:', len(other_posts))


number of ask posts: 1744
number of show posts: 1162
number of other posts: 17194
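The prefix check above is case-insensitive because the title is lowercased before calling `startswith`. A small sketch of the same rule as a function; the name `post_type` and the second example title are hypothetical and only there to show the behavior.

```python
def post_type(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

print(post_type('Ask HN: How do you learn?'))
print(post_type('SHOW HN: My side project'))
print(post_type('Interactive Dynamic Video'))
```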


In [11]:
total_ask_comments = 0 

for post in ask_posts:
    total_ask_comments += int(post[4])
    
avg_ask_comments = round(total_ask_comments / len(ask_posts),2)

print('Total ask comments:',total_ask_comments)
print('Average number of ask comments:', avg_ask_comments)

Total ask comments: 24483
Average number of ask comments: 14.04


In [12]:
total_show_comments = 0 

for post in show_posts:
    total_show_comments += int(post[4])

avg_show_comments = round(total_show_comments / len(show_posts), 2)

print('Total show comments:', total_show_comments)
print('Average show comments:', avg_show_comments)

Total show comments: 11988
Average show comments: 10.32


In [14]:
total_other_comments = 0 

for post in other_posts:
    total_other_comments += int(post[4])
    
    
avg_other_comments = round(total_other_comments / len(other_posts), 2)

print('Total other comments:', total_other_comments)
print('Average other comments:', avg_other_comments)

Total other comments: 462055
Average other comments: 26.87


## Which type of post received more comments on average?

On average, `Ask HN` posts received more comments than `Show HN` posts: **14.04** comments per `Ask HN` post versus **10.32** per `Show HN` post. Posts that are neither ask nor show actually have the highest average engagement, at **26.87** comments per post.
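As a quick arithmetic check, the three averages can be recomputed from the totals and post counts printed in the cells above. The `totals` dictionary just restates those numbers; its name and layout are only for illustration.

```python
# (total comments, number of posts) as printed in the cells above
totals = {
    'ask': (24483, 1744),
    'show': (11988, 1162),
    'other': (462055, 17194),
}

averages = {
    kind: round(comments / posts, 2)
    for kind, (comments, posts) in totals.items()
}

for kind, avg in averages.items():
    print(kind, avg)
```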

The last part of the analysis looks only at `Ask HN` posts.


In [26]:
result_list = []
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

# collect [created_at, num_comments] for every Ask HN post
for post in ask_posts:
    result_list.append(
        [post[6], int(post[4])])

# tally the number of posts and comments for each hour of the day
for row in result_list:
    date = row[0]
    num_comment = row[1]
    hour = dt.datetime.strptime(date, date_format).strftime('%H')
    if hour in counts_by_hour:
        comments_by_hour[hour] += num_comment
        counts_by_hour[hour] += 1
    else:
        comments_by_hour[hour] = num_comment
        counts_by_hour[hour] = 1 
   

print('Posts by hour:')
print()
print(counts_by_hour)
print()
print('Comments by hour:')
print()
print(comments_by_hour)


Posts by hour:

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}

Comments by hour:

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
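The same per-hour tally can be written without the `if`/`else` branch. This is a sketch using `collections.defaultdict` from the standard library; the `sample_posts` rows are hypothetical stand-ins shaped like `[created_at, num_comments]`, not the real `ask_posts` data.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# Hypothetical rows: [created_at, num_comments]
sample_posts = [
    ['8/4/2016 11:52', 52],
    ['1/26/2016 19:30', 10],
    ['6/23/2016 11:05', 8],
]

counts = defaultdict(int)    # posts created in each hour
comments = defaultdict(int)  # comments received by those posts

for created_at, num_comments in sample_posts:
    hour = dt.datetime.strptime(created_at, date_format).strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```

A missing key in a `defaultdict(int)` starts at 0, so the first post seen for an hour needs no special case.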


In [30]:
avg_by_hour = [] 

for hour in comments_by_hour:
    avg_by_hour.append([hour, round(comments_by_hour[hour] / counts_by_hour[hour],2)])

avg_by_hour

[['09', 5.58],
 ['13', 14.74],
 ['10', 13.44],
 ['14', 13.23],
 ['16', 16.8],
 ['23', 7.99],
 ['12', 9.41],
 ['17', 11.46],
 ['15', 38.59],
 ['21', 16.01],
 ['20', 21.52],
 ['02', 23.81],
 ['18', 13.2],
 ['03', 7.8],
 ['05', 10.09],
 ['19', 10.8],
 ['01', 11.38],
 ['22', 6.75],
 ['08', 10.25],
 ['04', 7.17],
 ['00', 8.13],
 ['06', 9.02],
 ['07', 7.85],
 ['11', 11.05]]

In [36]:
# swap each [hour, average] pair to [average, hour] so that sorted() orders by average

swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print(sorted_swap)

[[5.58, '09'], [14.74, '13'], [13.44, '10'], [13.23, '14'], [16.8, '16'], [7.99, '23'], [9.41, '12'], [11.46, '17'], [38.59, '15'], [16.01, '21'], [21.52, '20'], [23.81, '02'], [13.2, '18'], [7.8, '03'], [10.09, '05'], [10.8, '19'], [11.38, '01'], [6.75, '22'], [10.25, '08'], [7.17, '04'], [8.13, '00'], [9.02, '06'], [7.85, '07'], [11.05, '11']]
[[38.59, '15'], [23.81, '02'], [21.52, '20'], [16.8, '16'], [16.01, '21'], [14.74, '13'], [13.44, '10'], [13.23, '14'], [13.2, '18'], [11.46, '17'], [11.38, '01'], [11.05, '11'], [10.8, '19'], [10.25, '08'], [10.09, '05'], [9.41, '12'], [9.02, '06'], [8.13, '00'], [7.99, '23'], [7.85, '07'], [7.8, '03'], [7.17, '04'], [6.75, '22'], [5.58, '09']]
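Swapping the columns works, but `sorted` also accepts a `key` function, which avoids building a second list. A small sketch using three of the hourly averages from above; the names `avg_sample` and `sorted_sample` are only for illustration.

```python
# A few [hour, average] pairs copied from avg_by_hour above
avg_sample = [['09', 5.58], ['15', 38.59], ['02', 23.81]]

# Sort by the average (index 1), highest first, keeping the column order
sorted_sample = sorted(avg_sample, key=lambda row: row[1], reverse=True)

print(sorted_sample)
```

One difference to be aware of: when two hours have the same average, the swapped version falls back to comparing the hour strings, while the `key` version keeps the hours in their original order.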


In [39]:
print('Top 5 hours for "Ask HN" post comments')
print()

for average, hour in sorted_swap[:5]:
    print('{}: {:.2f} average comments'.format(dt.datetime.strptime(hour, '%H').strftime('%H:%M'), average))


Top 5 hours for "Ask HN" post comments

15:00: 38.59 average comments
02:00: 23.81 average comments
20:00: 21.52 average comments
16:00: 16.80 average comments
21:00: 16.01 average comments
