# Hacker News Analysis

In this project I analyze a public data set of Hacker News posts to find out which kinds of posts drive the most engagement.

The exercise is also practice in parsing datetime strings and working with the resulting values.

In [10]:
# Read the data set
from csv import reader

# Use a context manager so the file is closed after reading
with open("hacker_news.csv") as f:
    hn = list(reader(f))

#Extract headers
headers = hn[0]
hn = hn[1:]

print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [16]:
# Split posts into "Ask HN", "Show HN", and everything else by title prefix
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
        
print("The number of 'ask hn' posts is {0}".format(len(ask_posts)))
print("The number of 'show hn' posts is {0}".format(len(show_posts)))
print("The number of 'other' posts is {0}".format(len(other_posts)))


The number of 'ask hn' posts is 1744
The number of 'show hn' posts is 1162
The number of 'other' posts is 17194


### Determine which type of post generates more comments on average

In [30]:
# Sum the num_comments column (index 4) for ask posts
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])

print("Total 'ask' comments is {0:,}".format(total_ask_comments))
print("Average 'ask' comments is {0:,.2f}".format(total_ask_comments / len(ask_posts)))


# Repeat the same calculation for show posts
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])

print("Total 'show' comments is {0:,}".format(total_show_comments))
print("Average 'show' comments is {0:,.2f}".format(total_show_comments / len(show_posts)))

Total 'ask' comments is 24,483
Average 'ask' comments is 14.04
Total 'show' comments is 11,988
Average 'show' comments is 10.32
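
The two loops above repeat the same calculation, so it could be factored into one helper. The following is only a sketch: `average_comments` and `sample_rows` are hypothetical names and values for illustration, and the column index 4 is taken from the header list printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Return (total, average) comments for a list of CSV rows.

    Assumes each row stores the comment count as a string at comments_index.
    """
    total = sum(int(row[comments_index]) for row in posts)
    # Guard against an empty list to avoid ZeroDivisionError
    average = total / len(posts) if posts else 0.0
    return total, average


# Hypothetical sample rows shaped like the data set (not real posts)
sample_rows = [
    ["1", "Ask HN: example one", "", "5", "10", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: example two", "", "3", "4", "user_b", "8/4/2016 12:10"],
]

total, average = average_comments(sample_rows)
print("Total comments is {0:,}".format(total))
print("Average comments is {0:,.2f}".format(average))
```

The same function could then be called once with `ask_posts` and once with `show_posts`.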


### Determine whether ask posts attract more comments depending on the hour posted

In [50]:
import datetime as dt

result_list = []

for i in ask_posts:
    created_at = i[6]
    comments = int(i[4])
    result_list.append([created_at,comments])
    
counts_by_hour = {}
comments_by_hour = {}

for created_at, comments in result_list:
    # Parse the timestamp and read the hour (0-23) directly from the datetime
    # object; this avoids the platform-specific "%-H" format directive
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour

    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

print(counts_by_hour)
print(comments_by_hour)



{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
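
As a possible alternative, the per-hour tally can be written with `collections.defaultdict`, which removes the need to check whether an hour is already a key. This is a sketch rather than the notebook's own code: `tally_by_hour` is a hypothetical helper, and the sample rows are made up for illustration in the `[created_at, comments]` shape used by `result_list`.

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(rows, fmt="%m/%d/%Y %H:%M"):
    """Count posts and sum comments per hour of day (0-23)."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, n_comments in rows:
        # .hour gives the hour as an int on every platform
        hour = dt.datetime.strptime(created_at, fmt).hour
        counts[hour] += 1
        comments[hour] += n_comments
    return dict(counts), dict(comments)


# Illustrative rows only, not taken from the ask posts
sample = [
    ["8/4/2016 11:52", 52],
    ["1/26/2016 19:30", 10],
    ["6/23/2016 11:05", 1],
]

counts, comments = tally_by_hour(sample)
print(counts)
print(comments)
```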


### Calculate average comments by hour posted

In [58]:
avg_by_hour = []

for h in range(24):
    # Hours with no posts get an average of 0 instead of raising an error
    count = counts_by_hour.get(h, 0)
    avg_comments = comments_by_hour.get(h, 0) / count if count else 0

    avg_by_hour.append([avg_comments, h])


avg_by_hour.sort(reverse=True)
for i in range(5):
    print("{0}:00 EST: {1:,.2f} average comments per post".format(avg_by_hour[i][1],avg_by_hour[i][0]))

15:00 EST: 38.59 average comments per post
2:00 EST: 23.81 average comments per post
20:00 EST: 21.52 average comments per post
16:00 EST: 16.80 average comments per post
21:00 EST: 16.01 average comments per post
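
The ranking step can also be sketched with a dictionary of averages and `sorted` with a `key` function, instead of sorting `[average, hour]` lists. The names `counts_sample`, `comments_sample`, and `top_hours` are hypothetical; the three hours and their figures are copied from the printed dictionaries above.

```python
# Subset of the per-hour figures printed earlier (hours 15, 2, and 20)
counts_sample = {15: 116, 2: 58, 20: 80}
comments_sample = {15: 4477, 2: 1381, 20: 1722}

# Average comments per post for each hour
avg_sample = {h: comments_sample[h] / counts_sample[h] for h in counts_sample}

# Sort (hour, average) pairs by average, highest first
top_hours = sorted(avg_sample.items(), key=lambda item: item[1], reverse=True)

for hour, avg in top_hours:
    print("{0}:00 EST: {1:,.2f} average comments per post".format(hour, avg))
```

Sorting on an explicit key keeps the hour and the average clearly separated and avoids relying on list ordering to break ties.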


# Key Takeaways

The analysis shows that Hacker News posts whose titles begin with "Ask HN" receive more comments on average than posts whose titles begin with "Show HN": about 14 comments per ask post versus about 10 per show post. The gap is modest, and no statistical significance test was run on it here.

Ask posts created during the 3 pm hour (15:00 to 15:59 Eastern Time) average 38.59 comments, more than double the 14.04 average across all ask posts.
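
The comparison in the paragraph above can be checked with simple arithmetic on the totals printed earlier. This is only a quick sanity check; the variable names are hypothetical, and the figures are copied from the notebook output.

```python
# Figures copied from the printed output above
comments_hour_15 = 4477      # total comments on ask posts created in hour 15
posts_hour_15 = 116          # number of ask posts created in hour 15
total_ask_comments = 24483   # total comments across all ask posts
total_ask_posts = 1744       # number of ask posts

avg_hour_15 = comments_hour_15 / posts_hour_15
avg_overall = total_ask_comments / total_ask_posts
ratio = avg_hour_15 / avg_overall

print("Hour 15 average: {0:.2f}".format(avg_hour_15))
print("Overall average: {0:.2f}".format(avg_overall))
print("Ratio: {0:.2f}".format(ratio))
```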

The analysis covered 1,744 ask posts, which received 24,483 comments in total.