## Hacker News Data Analysis 

In this project, we'll work with a data set of submissions to the popular technology site Hacker News. Hacker News was started by the startup incubator Y Combinator; user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of its listings can get hundreds of thousands of visitors as a result.

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:

- Ask HN: How to improve my personal website?
- Ask HN: Am I the only one outraged by Twitter shutting down share counts?
- Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. Below are a few examples:

- Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
- Show HN: Something pointless I made
- Show HN: Shanhu.io, a programming playground powered by e8vm

We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need and reading the data set into a list of lists.



In [2]:
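from csv import reader

# Read the whole file into a list of lists; the with block closes the file afterwards.
with open("hacker_news.csv") as file_open:
    hn=list(reader(file_open))

# Separate the header row from the data rows.
headers=hn[0]
hn=hn[1:]
print(headers)
print("\n")
print(hn[0:5])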


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
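The header/row split above can be tried without the CSV file. The following is a minimal sketch, assuming an in-memory string stands in for `hacker_news.csv`; the two data rows are copied from the output above, and the names `sample_csv`, `sample_headers`, and `sample_rows` are illustrative only.

```python
from csv import reader
from io import StringIO

# Two rows copied from the output above; the string stands in for hacker_news.csv.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0,8,2,walterbell,9/30/2015 4:12\n"
)

rows = list(reader(StringIO(sample_csv)))
sample_headers, sample_rows = rows[0], rows[1:]

print(sample_headers)
print(len(sample_rows))

# csv.reader returns every field as a string, so numeric columns need int().
num_comments = int(sample_rows[0][4])
print(num_comments)
```

Because every field comes back as a string, the later cells convert `num_comments` with `int()` before doing arithmetic.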


### Creating a list of Ask HN, Show HN and Other posts

Now we will separate the posts by type, based on how their titles begin, and create a separate list for each type.

In [3]:
ask_posts=[]
show_posts=[]
other_posts=[]

for row in hn:
    # Compare in lowercase so that "Ask HN", "ask hn", and "ASK HN" all match.
    title=row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
print("\n")
print(ask_posts[0:4])

1744
1162
17194


[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']]
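The classification rule used above can also be written as a small function, which makes it easy to check on individual titles. This is only a sketch: `post_type` is a hypothetical helper name, and the sample titles are taken from earlier in this notebook.

```python
def post_type(title):
    """Return 'ask', 'show', or 'other' based on how the title begins."""
    lowered = title.lower()  # case-insensitive comparison
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


sample_titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "Interactive Dynamic Video",
]
for title in sample_titles:
    print(post_type(title), "-", title)
```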


Next, we will determine whether Ask HN or Show HN posts receive more comments on average.

In [4]:
# Collect the comment count of every Ask HN post (column 4 is num_comments).
total_ask_comments=[]
for row in ask_posts:
    num_comments=int(row[4])
    total_ask_comments.append(num_comments)

avg_ask_comments=sum(total_ask_comments)/len(total_ask_comments)
print("Average Ask HN comments:", avg_ask_comments)

# Repeat for the Show HN posts.
total_show_comments=[]
for row in show_posts:
    num_comments=int(row[4])
    total_show_comments.append(num_comments)

avg_show_comments=sum(total_show_comments)/len(total_show_comments)
print("Average Show HN comments:", avg_show_comments)
    

Average Ask HN comments: 14.038417431192661
Average Show HN comments: 10.31669535283993
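The cell above repeats the same averaging steps for both lists. One way to avoid the repetition is a small helper, sketched below. `average_comments` is a hypothetical name, returning `0.0` for an empty list is an assumption made here to avoid a division by zero, and the sample rows are the first four Ask HN rows printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column (index 4 in this data set)."""
    if not posts:
        return 0.0  # assumption: treat an empty list as an average of zero
    counts = [int(row[comments_index]) for row in posts]
    return sum(counts) / len(counts)


# First four Ask HN rows, copied from the earlier output.
sample_ask_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
    ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'],
]

sample_average = average_comments(sample_ask_posts)
print(sample_average)  # (6 + 29 + 1 + 3) / 4 = 9.75
```

In the notebook itself, the same helper could be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.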


On average, Show HN posts receive fewer comments (about 10.3) than Ask HN posts (about 14.0). One possible explanation is that people are more inclined to answer a question that has already been asked, and the answers may start further discussion threads, increasing the overall number of comments. Since the average is higher for Ask HN posts, we will focus our attention there and try to find out which hours of the day attract the most comments.

In [5]:
import datetime as dt

# Pair each Ask HN post's creation time with its comment count.
result_list=[]
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour={}    # hour -> number of Ask HN posts created in that hour
comments_by_hour={}  # hour -> total comments on those posts
date_format="%m/%d/%Y %H:%M"

for created_at, comment in result_list:
    # Parse the timestamp and keep only the zero-padded hour, e.g. "09".
    hr=dt.datetime.strptime(created_at,date_format).strftime("%H")
    if hr in counts_by_hour:
        counts_by_hour[hr]+=1
        comments_by_hour[hr]+=comment
    else:
        counts_by_hour[hr]=1
        comments_by_hour[hr]=comment
        
print(counts_by_hour)
print("\n")
print(comments_by_hour)

{'15': 116, '12': 73, '05': 46, '09': 45, '14': 107, '00': 55, '11': 58, '13': 85, '10': 59, '08': 48, '19': 110, '20': 80, '17': 100, '01': 60, '02': 58, '21': 109, '03': 54, '18': 109, '06': 44, '22': 71, '04': 47, '16': 108, '23': 68, '07': 34}


{'15': 4477, '12': 687, '05': 464, '09': 251, '14': 1416, '00': 447, '11': 641, '13': 1253, '10': 793, '08': 492, '19': 1188, '20': 1722, '17': 1146, '01': 683, '02': 1381, '21': 1745, '03': 421, '18': 1439, '06': 397, '22': 479, '04': 337, '16': 1814, '23': 543, '07': 267}
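The dictionary keys above are zero-padded strings such as '09' because the hour is extracted with the `"%H"` format code. The short sketch below walks through the parsing step for a single timestamp, taken from the first Ask HN row shown earlier; the variable names are illustrative only.

```python
import datetime as dt

created_at = "8/16/2016 9:55"  # creation time of the first Ask HN row shown earlier

# strptime turns the string into a datetime object using the given format.
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

hour_key = parsed.strftime("%H")  # zero-padded string, the form used as a dictionary key
hour_int = parsed.hour            # the same hour as an integer

print(parsed)
print(hour_key)
print(hour_int)
```

Using `parsed.hour` instead would give integer keys; either works as long as the same form is used for both dictionaries.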


We will now create a list of lists from the two dictionaries above. Each entry holds an hour and the average number of comments per Ask HN post created during that hour.

In [6]:

avg_by_hour=[]
for hour in counts_by_hour:
    # Average comments per post = total comments in the hour / number of posts in the hour.
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])

print(avg_by_hour)
    

[['15', 38.5948275862069], ['12', 9.41095890410959], ['05', 10.08695652173913], ['09', 5.5777777777777775], ['14', 13.233644859813085], ['00', 8.127272727272727], ['11', 11.051724137931034], ['13', 14.741176470588234], ['10', 13.440677966101696], ['08', 10.25], ['19', 10.8], ['20', 21.525], ['17', 11.46], ['01', 11.383333333333333], ['02', 23.810344827586206], ['21', 16.009174311926607], ['03', 7.796296296296297], ['18', 13.20183486238532], ['06', 9.022727272727273], ['22', 6.746478873239437], ['04', 7.170212765957447], ['16', 16.796296296296298], ['23', 7.985294117647059], ['07', 7.852941176470588]]


We will now sort the list of lists by average and format the top results to make the data easier to read.

In [14]:
# Put the average first so that sorting the list orders it by average.
swap_avg_by_hour=[]
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

# Highest average first.
sorted_swap=sorted(swap_avg_by_hour, reverse=True)

print("\n")
print(sorted_swap)
print("\n")
print("Top 5 Hours for Ask Posts Comments")

for average, hour in sorted_swap[0:5]:
    # Turn the hour string, e.g. "15", into "15:00".
    hr=dt.datetime.strptime(hour,"%H").strftime("%H:%M")
    avg_format="{:.2f} average comments per post".format(average)
    print(hr,":", avg_format)
    


[[38.5948275862069, '15'], [9.41095890410959, '12'], [10.08695652173913, '05'], [5.5777777777777775, '09'], [13.233644859813085, '14'], [8.127272727272727, '00'], [11.051724137931034, '11'], [14.741176470588234, '13'], [13.440677966101696, '10'], [10.25, '08'], [10.8, '19'], [21.525, '20'], [11.46, '17'], [11.383333333333333, '01'], [23.810344827586206, '02'], [16.009174311926607, '21'], [7.796296296296297, '03'], [13.20183486238532, '18'], [9.022727272727273, '06'], [6.746478873239437, '22'], [7.170212765957447, '04'], [16.796296296296298, '16'], [7.985294117647059, '23'], [7.852941176470588, '07']]


[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'],
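The swap-then-sort step above can also be done without swapping, by giving `sorted` a key function. The sketch below is one possible alternative, not the approach used in the cell above; `avg_by_hour_sample` holds only seven of the `[hour, average]` pairs from the earlier output (including the five highest), so it illustrates the technique rather than reproducing the full result.

```python
# A subset of the [hour, average] pairs printed earlier, including the five highest.
avg_by_hour_sample = [
    ["15", 38.5948275862069],
    ["12", 9.41095890410959],
    ["02", 23.810344827586206],
    ["20", 21.525],
    ["16", 16.796296296296298],
    ["21", 16.009174311926607],
    ["09", 5.5777777777777775],
]

# Sort on the second element (the average), highest first.
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, average in ranked[:5]:
    print(f"{hour}:00: {average:.2f} average comments per post")

top_hours = [hour for hour, _ in ranked[:5]]
print(top_hours)
```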

According to the data, the hours at the top of this list are the best times to post an Ask HN question in order to receive the most comments.

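It looks like the two-hour window from 3 PM to 5 PM might be the most useful time for people to get responses to their queries: posts created in the 15:00 hour average about 38.6 comments, and those in the 16:00 hour about 16.8. The 8 PM to 10 PM window looks productive too, at about 21.5 and 16.0 comments per post.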

2 AM, at about 23.8 comments per post, falls outside this general pattern, and it might be interesting to dig deeper into that.
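One way to start digging is to check whether a few heavily commented posts pull the 2 AM average up, by comparing the mean with the median. The sketch below uses made-up comment counts purely for illustration; they are not taken from the data set, and `summarize` is a hypothetical helper name. With the notebook's data, the list could instead be built from the `ask_posts` rows whose hour is '02'.

```python
from statistics import mean, median


def summarize(comment_counts):
    """Return the mean, median, and largest value of a list of comment counts."""
    return {
        "mean": mean(comment_counts),
        "median": median(comment_counts),
        "largest": max(comment_counts),
    }


# Hypothetical comment counts (made up for illustration, NOT from the data set).
hypothetical_counts = [1, 2, 3, 90]

summary = summarize(hypothetical_counts)
print(summary)
```

A mean far above the median, as in this made-up list, would suggest that the hour's average is driven by one or two unusually popular posts rather than by consistently higher engagement.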

An analysis of the Show HN data might shed more light as well.
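To repeat the hourly analysis for Show HN posts, the steps used for Ask HN could be wrapped in a function that accepts any list of rows. The following is a rough sketch under the assumption that rows keep the column layout of this data set (`num_comments` at index 4, `created_at` at index 6); `avg_comments_by_hour` is a hypothetical name, and the demo uses the first four Ask HN rows shown earlier only because they are available here.

```python
import datetime as dt
from collections import defaultdict


def avg_comments_by_hour(posts, date_format="%m/%d/%Y %H:%M"):
    """Return [hour, average comments per post] pairs, highest average first."""
    totals = defaultdict(int)  # hour -> total comments
    counts = defaultdict(int)  # hour -> number of posts
    for row in posts:
        hour = dt.datetime.strptime(row[6], date_format).strftime("%H")
        totals[hour] += int(row[4])
        counts[hour] += 1
    averages = [[hour, totals[hour] / counts[hour]] for hour in counts]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)


# Demo rows copied from the earlier Ask HN output; in the notebook, pass show_posts instead.
demo_rows = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
    ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'],
]

demo_result = avg_comments_by_hour(demo_rows)
print(demo_result)
```

Calling `avg_comments_by_hour(show_posts)` in the notebook would give the Show HN counterpart of the sorted list above, which could then be compared hour by hour with the Ask HN results.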