# Hacker News post analysis

[Data source](https://www.kaggle.com/hacker-news/hacker-news-posts)

Below are descriptions of the columns:

- `id`: The unique identifier from Hacker News for the post
- `title`: The title of the post
- `url`: The URL that the post links to, if the post has a URL
- `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: The number of comments that were made on the post
- `author`: The username of the person who submitted the post
- `created_at`: The date and time at which the post was submitted

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:

- Ask HN: How to improve my personal website?
- Ask HN: Am I the only one outraged by Twitter shutting down share counts?
- Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit Show HN posts to show the Hacker News community a project, a product, or just something interesting. Below are a few examples:

- Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
- Show HN: Something pointless I made
- Show HN: Shanhu.io, a programming playground powered by e8vm

## Goal

We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need and reading the data set into a list of lists.

In [4]:
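from csv import reader

# Read the whole CSV into a list of lists; the first row is the header.
# The `with` block closes the file once it has been read.
with open('HN_posts_year_to_Sep_26_2016.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)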

print(hn[:4])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']]


Let's separate the data into `hn_header` for the header row and `hn` for the remaining rows.

In [16]:
hn_header = hn[:1]
hn = hn[1:]

print(len(hn_header), len(hn))

1 293115
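
Because `hn_header` holds the column names, it can also be used to look up column positions by name instead of hard-coding indexes such as `row[4]`. The following is a minimal sketch of that idea, using the header and first data row printed above; the name `col` is introduced here for illustration and is not used elsewhere in this notebook.

```python
# Header and first data row, copied from the output printed earlier.
# Note that hn_header is a list containing one list (it was built with hn[:1]).
hn_header = [['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
first_row = ['12579008',
             'You have two days to comment if you want stem cells to be classified as your own',
             'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
             '1', '0', 'altstar', '9/26/2016 3:26']

# Map each column name to its position in a row.
col = {name: index for index, name in enumerate(hn_header[0])}

print(col['num_comments'])                  # position of the comment count
print(int(first_row[col['num_comments']]))  # comment count of the first post
print(first_row[col['created_at']])         # creation timestamp of the first post
```

The rest of the notebook keeps using numeric indexes, so this lookup is optional; it mainly makes the intent of each index easier to read.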


Now that we've removed the header row from `hn`, we're ready to filter our data. Since we're only concerned with post titles beginning with `Ask HN` or `Show HN`, we'll create new lists of lists containing just the data for those titles.

To find the posts that begin with either `Ask HN` or `Show HN`, we'll use the string method `startswith`.

We will filter the `hn` dataset into three smaller sets: `ask_posts`, `show_posts` and `other_posts`.

In [17]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts), len(show_posts), len(other_posts))

9139 10158 273818


Let's print the first five rows of each list as a basic check that the filtering worked.

In [19]:
print(ask_posts[:5])
print("=" * 100)
print(show_posts[:5])
print("=" * 100)
print(other_posts[:5])

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]
[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25

Next, let's determine if ask posts or show posts receive more comments on average.

In [23]:
total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    num_comment = int(row[4])
    total_ask_comments += num_comment
print(total_ask_comments)

for row in show_posts:
    num_comment = int(row[4])
    total_show_comments += num_comment
print(total_show_comments)

avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)

print('Avg for ask:', avg_ask_comments, 'Avg for show:', avg_show_comments)

94986
49633
Avg for ask: 10.393478498741656 Avg for show: 4.886099625910612


On average, `Ask HN` posts receive more than twice as many comments (about 10.39 per post) as `Show HN` posts (about 4.89 per post), a ratio of roughly 2.1 to 1.

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.

Let's calculate the number of ask posts created per hour, along with the total number of comments.

In [51]:
import datetime as dt

# Pair each ask post's creation timestamp with its comment count.
result_list = []

for row in ask_posts:
    result_list.append([row[-1], int(row[4])])

counts_by_hour = {}    # hour -> number of ask posts created in that hour
comments_by_hour = {}  # hour -> total comments on those posts

for created_at, num_comments in result_list:
    # Timestamps look like '9/26/2016 3:26'; keep only the zero-padded hour.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments

for key in counts_by_hour:
    print(key, ':', counts_by_hour[key])
print("=" * 110)
for key in comments_by_hour:
    print(key, ':', comments_by_hour[key])

02 : 269
01 : 282
22 : 383
21 : 518
19 : 552
17 : 587
15 : 646
14 : 513
13 : 444
11 : 312
10 : 282
09 : 222
07 : 226
03 : 271
23 : 343
20 : 510
16 : 579
08 : 257
00 : 301
18 : 614
12 : 342
04 : 243
06 : 234
05 : 209
==============================================================================================================
02 : 2996
01 : 2089
22 : 3372
21 : 4500
19 : 3954
17 : 5547
15 : 18525
14 : 4972
13 : 7245
11 : 2797
10 : 3013
09 : 1477
07 : 1585
03 : 2154
23 : 2297
20 : 4462
16 : 4466
08 : 2362
00 : 2277
18 : 4877
12 : 4234
04 : 2360
06 : 1587
05 : 1838


We've produced two dictionaries:

- `counts_by_hour`: contains the number of ask posts created during each hour of the day.
- `comments_by_hour`: contains the total number of comments received by ask posts created during each hour.
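
The same two dictionaries can also be built without the explicit `if hour not in ...` check. The sketch below uses `collections.defaultdict` as an alternative to the approach in the cell above, not as a replacement for it. The list `sample` is introduced here for illustration; its timestamps and comment counts are copied from the five ask posts printed earlier.

```python
import datetime as dt
from collections import defaultdict

# [created_at, num_comments] pairs copied from the first five ask posts above.
sample = [
    ['9/26/2016 2:53', 7],
    ['9/26/2016 1:17', 3],
    ['9/25/2016 22:57', 0],
    ['9/25/2016 22:48', 3],
    ['9/25/2016 21:50', 2],
]

# defaultdict(int) starts every missing key at 0, so no membership test is needed.
sample_counts_by_hour = defaultdict(int)
sample_comments_by_hour = defaultdict(int)

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime('%H')
    sample_counts_by_hour[hour] += 1
    sample_comments_by_hour[hour] += num_comments

print(dict(sample_counts_by_hour))
print(dict(sample_comments_by_hour))
```

With only five rows the totals are not meaningful in themselves; the point is that hour `22` accumulates two posts and their combined comments in a single pass.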

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day. 

In [60]:
avg_by_hour = []

for key in comments_by_hour:
    avg_by_hour.append([key, (comments_by_hour[key] / counts_by_hour[key])])
    
sort_avg_by_hour = sorted(avg_by_hour)

for row in sort_avg_by_hour:
    print(row)

['00', 7.5647840531561465]
['01', 7.407801418439717]
['02', 11.137546468401487]
['03', 7.948339483394834]
['04', 9.7119341563786]
['05', 8.794258373205741]
['06', 6.782051282051282]
['07', 7.013274336283186]
['08', 9.190661478599221]
['09', 6.653153153153153]
['10', 10.684397163120567]
['11', 8.96474358974359]
['12', 12.380116959064328]
['13', 16.31756756756757]
['14', 9.692007797270955]
['15', 28.676470588235293]
['16', 7.713298791018998]
['17', 9.449744463373083]
['18', 7.94299674267101]
['19', 7.163043478260869]
['20', 8.749019607843136]
['21', 8.687258687258687]
['22', 8.804177545691905]
['23', 6.696793002915452]


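Sorted by hour, this list makes it hard to spot the hours with the highest averages, which is what we care about. Let's swap the two values in each row so that the average comes first, then sort in descending order.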

In [83]:
# Put the average first so that sorting orders rows by average, not by hour.
swp_avg_by_hour = []

for row in avg_by_hour:
    swp_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swp_avg_by_hour, reverse=True)

print("The top 10 Hour slots for Ask Posts Comments")
print('=' * 110)
for index, row in enumerate(sorted_swap[:10]):
    avg, hour = row
    # Reformat the hour string, e.g. '15' -> '15:00'.
    time = dt.datetime.strptime(hour, "%H").strftime('%H:%M')
    print(f"{index+1:02}: at {time}: on average there were {avg:.2f} comments per post")

The top 10 Hour slots for Ask Posts Comments
==============================================================================================================
01: at 15:00: on average there were 28.68 comments per post
02: at 13:00: on average there were 16.32 comments per post
03: at 12:00: on average there were 12.38 comments per post
04: at 02:00: on average there were 11.14 comments per post
05: at 10:00: on average there were 10.68 comments per post
06: at 04:00: on average there were 9.71 comments per post
07: at 14:00: on average there were 9.69 comments per post
08: at 17:00: on average there were 9.45 comments per post
09: at 08:00: on average there were 9.19 comments per post
10: at 11:00: on average there were 8.96 comments per post
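
Swapping the columns is one way to sort by average. Another option, sketched below, is to leave each row as `[hour, average]` and pass a `key` function to `sorted`. The list `avg_sample` is introduced here for illustration and holds five of the rows printed earlier.

```python
# Five [hour, average] rows copied from the avg_by_hour output above.
avg_sample = [
    ['00', 7.5647840531561465],
    ['02', 11.137546468401487],
    ['12', 12.380116959064328],
    ['13', 16.31756756756757],
    ['15', 28.676470588235293],
]

# Sort by the second element (the average), highest first, without swapping.
top_hours = sorted(avg_sample, key=lambda row: row[1], reverse=True)

for hour, avg in top_hours:
    print(f"{hour}:00 {avg:.2f}")
```

This keeps the hour in the first position throughout, so no second list is needed.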


### Timezone

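Bear in mind that the hours above are expressed in whatever timezone the dataset's timestamps were recorded in, which may differ from your own. This notebook has not established which timezone that is.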
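
Once the source timezone is known, shifting a timestamp is straightforward. The sketch below uses fixed UTC offsets from the standard `datetime` module. The function name `shift_timestamp` and both offsets are hypothetical: the `-4` source offset is a placeholder, not a documented property of this dataset, and fixed offsets ignore daylight saving changes.

```python
import datetime as dt

def shift_timestamp(timestamp, source_offset_hours, target_offset_hours):
    """Reinterpret a dataset timestamp in another fixed UTC offset."""
    source_tz = dt.timezone(dt.timedelta(hours=source_offset_hours))
    target_tz = dt.timezone(dt.timedelta(hours=target_offset_hours))
    naive = dt.datetime.strptime(timestamp, "%m/%d/%Y %H:%M")
    # Attach the assumed source offset, then convert to the target offset.
    return naive.replace(tzinfo=source_tz).astimezone(target_tz)

# ASSUMPTION: -4 and +1 are placeholder offsets chosen only for illustration.
converted = shift_timestamp('9/26/2016 3:26', -4, 1)
print(converted.strftime("%m/%d/%Y %H:%M"))
```

For an analysis that spans a whole year, a named timezone that tracks daylight saving time would be more accurate than a single fixed offset.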

### Next steps

- Find out which timezone this data was generated in, then convert the hours to your local time.
- Determine if show or ask posts receive more points on average.
- Determine if posts created at a certain time are more likely to receive more points.
- Compare your results to the average number of comments and points other posts receive.
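
The points questions above can reuse the averaging pattern from the comment analysis. The sketch below shows one possible starting point. The helper `average_column` and the two sample lists are introduced here for illustration; the rows are copied from the outputs printed earlier, and four rows are far too few to support any conclusion.

```python
def average_column(rows, index):
    """Return the mean of the integer values stored at `index` in each row."""
    total = 0
    for row in rows:
        total += int(row[index])
    return total / len(rows)

# Rows copied from the ask_posts and show_posts output earlier in this notebook.
ask_sample = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
]
show_sample = [
    ['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'],
    ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'],
]

NUM_POINTS = 3  # position of `num_points` in each row

avg_ask_points = average_column(ask_sample, NUM_POINTS)
avg_show_points = average_column(show_sample, NUM_POINTS)
print('Avg points for ask sample:', avg_ask_points)
print('Avg points for show sample:', avg_show_points)
```

To answer the question properly, pass the full `ask_posts` and `show_posts` lists to the helper instead of the samples.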