# Hacker News Analysis

Hacker News is a site started by Y Combinator, a company that specializes in accelerating startups. The site allows users to submit posts, which are then voted and commented on by other users. It is popular with the tech and startup crowds. We will analyze a dataset of submissions from this site. Let's load the dataset so we can work with it.

In [1]:
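from csv import reader

# Read the whole file into a list of rows; the context manager closes the file
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

# Separate the header row from the data rows
hn_header = hn[0]
hn = hn[1:]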

print(hn_header)
for row in hn[:3]:
    print('\n')
    print(row)
    

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


In [2]:
hn[13][1]

'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform'

We specifically want to focus on posts whose titles start with 'Show HN' or 'Ask HN'. 'Show HN' posts consist of users showing off projects and products, while 'Ask HN' posts consist of users asking specific questions.

Our analysis will be focusing on answering two questions:
    
   1. Does 'Ask HN' or 'Show HN' receive more comments on average?
   2. Do posts created at a certain time receive more comments on average?

Let's begin by cleaning up our data: we separate the 'Ask HN' and 'Show HN' posts from all other posts.

# Data Clean Up

In [3]:
ask_post = []
show_post = []
other_post = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_post.append(row)
    elif title.lower().startswith('show hn'):
        show_post.append(row)
    else:
        other_post.append(row)
        
print('''
Ask HN : {}
Show HN : {}
Others : {}
'''.format(len(ask_post), len(show_post), len(other_post) ) ) 


Ask HN : 1744
Show HN : 1162
Others : 17194



# Data Analysis Pt. 1

What we have done is split the dataset into three lists and displayed the name and length of each. Now we can answer our first question:

Does 'Ask HN' or 'Show HN' receive more comments on average?

In [4]:
total_ask_comments = 0

for row in ask_post:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_post)
print(round(avg_ask_comments,  1))
    
total_show_comments = 0 
for row in show_post:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_post)
print(round(avg_show_comments, 1))

14.0
10.3


# Data Analysis Pt. 2

Above, we used two loops to find the average number of comments for "Ask HN" and "Show HN" posts. We can see that "Ask HN" posts received more comments on average (14.0 versus 10.3). This suggests that people are more willing to answer questions than to comment on other people's projects and products.
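The two loops above differ only in the list they read, so the averaging step could be factored into a single helper. The sketch below is one possible way to do that. `average_comments` is a hypothetical name that does not appear in the notebook, and the sample rows are made up for illustration; they only mimic the dataset's column order.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for a list of rows (hypothetical helper)."""
    if not posts:
        # Assumption: report 0.0 for an empty list instead of raising an error
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ['1', 'Ask HN: example question', '', '5', '12', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: another question', '', '3', '8', 'user_b', '1/2/2016 15:30'],
]

print(round(average_comments(sample_rows), 1))
```

With the real data, the same helper could be called once with `ask_post` and once with `show_post`.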

Now we need to answer our second question: do posts created at a certain time receive more comments on average?
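Answering this requires pulling the hour out of each `created_at` string. As a minimal sketch, the snippet below parses a single timestamp copied from the first data row shown earlier. The variable names are illustrative only.

```python
import datetime as dt

created_at = '8/4/2016 11:52'  # sample value from the first data row

# Parse the month/day/year hour:minute layout used in the dataset
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

# Extract the hour as a zero-padded string, suitable as a dictionary key
hour = parsed.strftime('%H')
print(hour)
```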

In [5]:
import datetime as dt

In [6]:
result_list = []
for row in ask_post:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

counts_by_hour = {}    # hour -> number of Ask HN posts created in that hour
comments_by_hour = {}  # hour -> total comments on those posts

for created_at, num_comments in result_list:
    date = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    hour = date.strftime('%H')

    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments


In [7]:
print(counts_by_hour)
print(comments_by_hour)

{'21': 109, '23': 68, '19': 110, '18': 109, '20': 80, '09': 45, '16': 108, '13': 85, '01': 60, '03': 54, '17': 100, '22': 71, '04': 47, '14': 107, '00': 55, '08': 48, '07': 34, '15': 116, '12': 73, '10': 59, '05': 46, '06': 44, '11': 58, '02': 58}
{'21': 1745, '23': 543, '19': 1188, '18': 1439, '20': 1722, '09': 251, '16': 1814, '13': 1253, '01': 683, '03': 421, '17': 1146, '22': 479, '04': 337, '14': 1416, '00': 447, '08': 492, '07': 267, '15': 4477, '12': 687, '10': 793, '05': 464, '06': 397, '11': 641, '02': 1381}


In [8]:
avg_by_hour = []

for hour in counts_by_hour:
    # Average comments per post for this hour
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

print(avg_by_hour)

[['21', 16.009174311926607], ['23', 7.985294117647059], ['19', 10.8], ['18', 13.20183486238532], ['20', 21.525], ['09', 5.5777777777777775], ['16', 16.796296296296298], ['13', 14.741176470588234], ['01', 11.383333333333333], ['03', 7.796296296296297], ['17', 11.46], ['22', 6.746478873239437], ['04', 7.170212765957447], ['14', 13.233644859813085], ['00', 8.127272727272727], ['08', 10.25], ['07', 7.852941176470588], ['15', 38.5948275862069], ['12', 9.41095890410959], ['10', 13.440677966101696], ['05', 10.08695652173913], ['06', 9.022727272727273], ['11', 11.051724137931034], ['02', 23.810344827586206]]


So we have found the average number of comments per post for each hour! However, this looks way too messy to understand. Let's clean it up a bit so we can see what has happened here.

In [12]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour[:5])
print('\n')

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap[:5])
print('\n')

for row in sorted_swap[:5]:
    avg, hour = row
    # Rewrite the row in place as [time, description]
    row[0] = dt.datetime.strptime(hour, '%H').strftime('%H:%M')
    # Format the float directly; int(avg) would drop the fractional part
    row[1] = '{:.2f} average comments per post'.format(avg)

print(sorted_swap[:5])

[[16.009174311926607, '21'], [7.985294117647059, '23'], [10.8, '19'], [13.20183486238532, '18'], [21.525, '20']]


[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21']]


[['15:00', '38.59 average comments per post'], ['02:00', '23.81 average comments per post'], ['20:00', '21.52 average comments per post'], ['16:00', '16.80 average comments per post'], ['21:00', '16.01 average comments per post']]


Based on our analysis, it seems that the best time to post an 'Ask HN' question to receive the most comments is 3 PM (15:00), with an average of about 38.6 comments per post.
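The swap-then-sort step works because lists compare element by element, but the same ranking can also be obtained without swapping by passing a `key` function to `sorted`. The sketch below assumes a list shaped like `avg_by_hour`. The three sample entries are copied from the `avg_by_hour` output shown earlier, and `avg_by_hour_sample` and `top_hours` are hypothetical names.

```python
# A few [hour, average] pairs copied from the avg_by_hour output
avg_by_hour_sample = [
    ['21', 16.009174311926607],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

# Sort by the average (second element), highest first
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours:
    print('{}:00 -> {:.2f} average comments per post'.format(hour, avg))
```

Sorting with a key keeps each pair in its original `[hour, average]` order, so no second list is needed.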

# Conclusion

We started our analysis trying to answer two questions:

   1. Does 'Ask HN' or 'Show HN' receive more comments on average?
   2. Do posts created at a certain time receive more comments on average?
   
We learned that 'Ask HN' posts, which ask questions of the forum, earn more comments on average than 'Show HN' posts. We then found that the best time to ask a question is 3 PM.