# Hacker News posts: the right post type and time for collecting comments

In this small project we will compare two types of posts on `Hacker News` (HN), a site filled with technology-related posts: `Ask HN` and `Show HN`.  
Ask HN posts are those where users ask the community for advice, while Show HN posts are meant to show other users something interesting.

The goal is to determine whether `Ask HN or Show HN posts receive more comments on average` and whether `posts created at a certain time receive more comments on average`.

For this we will use a sample (about 20 000 out of 300 000 rows) of the Hacker News data set. The sample was made by omitting posts without comments and then choosing rows randomly from the rest of the data.
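The sampling procedure described above can be sketched roughly as follows. This is only an illustration, not the code that produced the actual sample: the function name `sample_commented_rows`, the seed, and the demo rows are all hypothetical, and the comment count is assumed to sit in column 4, as in the data set shown later.

```python
import random


def sample_commented_rows(rows, k, comments_idx=4, seed=0):
    """Drop rows with zero comments, then pick up to k rows at random."""
    commented = [row for row in rows if int(row[comments_idx]) > 0]
    rng = random.Random(seed)  # a fixed seed keeps the sample reproducible
    return rng.sample(commented, min(k, len(commented)))


# Made-up rows in the same column order as the data set (hypothetical values)
demo_rows = [
    ['1', 'Ask HN: A?', '', '5', '3', 'alice', '8/4/2016 11:52'],
    ['2', 'Some link', 'http://example.com', '1', '0', 'bob', '8/4/2016 12:00'],
    ['3', 'Show HN: B', 'http://example.org', '9', '7', 'carol', '8/5/2016 9:15'],
]

demo_sample = sample_commented_rows(demo_rows, k=2)
print(len(demo_sample))  # 2: only two rows have comments
```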

# Preparing the data

First, let's look at the first five rows of our data set:

In [2]:
from csv import reader
import datetime as dt

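# Read the whole file into a list of rows; the file is closed automatically
with open('hacker_news.csv') as opened_hn:
    list_hn = list(reader(opened_hn))

headers = list_hn[0]  # first row holds the column names
hn = list_hn[1:]      # remaining rows are the posts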

print(headers)
for row in hn[:5]:
    print(row, '\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'] 



Now we can filter out the entries we need: those whose titles start with `ask hn` or `show hn` (ignoring case).

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(
f'''Number of ask posts: {len(ask_posts)}
Number of show posts: {len(show_posts)}
Number of other posts: {len(other_posts)}'''
)

Number of ask posts: 1744
Number of show posts: 1162
Number of other posts: 17194


# Calculating the average number of comments for Ask and Show posts

The next step is to find the average number of comments for both ask and show posts.

In [4]:
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)

total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments / len(show_posts)

print(
f'''Average number of comments for ask post: {round(avg_ask_comments, 2)}
Average number of comments for show post: {round(avg_show_comments, 2)}'''
)

Average number of comments for ask post: 14.04
Average number of comments for show post: 10.32


We can see that, on average, `ask posts receive more comments` than show posts.  
So let's focus the rest of our analysis on this type of post.
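The two loops above differ only in the list they iterate over, so the calculation could be folded into one small helper. The sketch below is an optional refactoring, not part of the original notebook: `average_comments` and the demo rows are hypothetical names and values, and the comment count is assumed to be in column 4.

```python
def average_comments(posts, comments_idx=4):
    """Return the mean number of comments over a list of post rows."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(row[comments_idx]) for row in posts)
    return total / len(posts)


# Made-up rows in the same column order as the data set (hypothetical values)
demo_posts = [
    ['1', 'Ask HN: A?', '', '5', '3', 'alice', '8/4/2016 11:52'],
    ['2', 'Ask HN: B?', '', '2', '7', 'bob', '8/4/2016 12:00'],
]

print(average_comments(demo_posts))  # 5.0
```

With the lists built earlier, the same helper would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.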

# Calculating the average number of comments per hour

Next, we will build a list with the creation time and number of comments for each ask post, then count the posts and comments for each hour of creation.

In [5]:
result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])
    
counts_by_hour = {}
comments_by_hour = {}

for pair in result_list:
    date = dt.datetime.strptime(pair[0], '%m/%d/%Y %H:%M')
    hour = date.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = pair[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += pair[1]

print(f'Posts, by hour:\n')
for hour in sorted(counts_by_hour):
    print(f'{hour}:{counts_by_hour[hour]} posts')
print(f'\nComments, by hour:\n')
for hour in sorted(comments_by_hour):
    print(f'{hour}:{comments_by_hour[hour]} comments')

Posts, by hour:

0:55 posts
1:60 posts
2:58 posts
3:54 posts
4:47 posts
5:46 posts
6:44 posts
7:34 posts
8:48 posts
9:45 posts
10:59 posts
11:58 posts
12:73 posts
13:85 posts
14:107 posts
15:116 posts
16:108 posts
17:100 posts
18:109 posts
19:110 posts
20:80 posts
21:109 posts
22:71 posts
23:68 posts

Comments, by hour:

0:447 comments
1:683 comments
2:1381 comments
3:421 comments
4:337 comments
5:464 comments
6:397 comments
7:267 comments
8:492 comments
9:251 comments
10:793 comments
11:641 comments
12:687 comments
13:1253 comments
14:1416 comments
15:4477 comments
16:1814 comments
17:1146 comments
18:1439 comments
19:1188 comments
20:1722 comments
21:1745 comments
22:479 comments
23:543 comments


As we can see, `the largest number of posts` is created between `14.00 - 19.00` (and again at 21.00), while most comments go to posts created around 2.00 and between 13.00 - 21.00.

These totals depend on how many posts each hour has and can be inflated by individual posts with very many comments, so it is worth finding a more reliable value.  
That value is the `average number of comments per post for each hour`: the total number of comments for an hour divided by the number of posts created in that hour.
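As a quick check of this definition, the sketch below recomputes the average for three hours using the post and comment totals printed above. The names `hour_totals` and `hour_averages` are only illustrative.

```python
# hour: (posts, comments), copied from the per-hour output above
hour_totals = {2: (58, 1381), 9: (45, 251), 15: (116, 4477)}

# Average comments per post = total comments / number of posts
hour_averages = {
    hour: comments / posts for hour, (posts, comments) in hour_totals.items()
}

for hour in sorted(hour_averages):
    posts, comments = hour_totals[hour]
    print(f'{hour}: {comments} / {posts} = {hour_averages[hour]:.2f}')

# 2: 1381 / 58 = 23.81
# 9: 251 / 45 = 5.58
# 15: 4477 / 116 = 38.59
```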

In [6]:
avg_by_hour = []

for hour in sorted(counts_by_hour):
    posts = counts_by_hour[hour]
    comments = comments_by_hour[hour]
    avg_by_hour.append([hour, comments/posts])
        
print(f'Average amount of comments per hour\n')

for pair in avg_by_hour:
    print(f'{pair[0]}:{round(pair[1], 2)}')

Average amount of comments per hour

0:8.13
1:11.38
2:23.81
3:7.8
4:7.17
5:10.09
6:9.02
7:7.85
8:10.25
9:5.58
10:13.44
11:11.05
12:9.41
13:14.74
14:13.23
15:38.59
16:16.8
17:11.46
18:13.2
19:10.8
20:21.52
21:16.01
22:6.75
23:7.99


Thus, we found the most popular hours for comments (more than 20 comments per post on average): 2.00, 15.00 and 20.00.

Now let's present the top hours in a more readable form.

In [9]:
sorted_avg_by_hour = sorted(avg_by_hour, key = lambda x: x[1], reverse = True)
print(f'Top 5 hours for Ask posts comments:\n')
for pair in sorted_avg_by_hour[:5]:
    hour = dt.datetime.strptime(str(pair[0]), '%H')
    print(f'{hour.strftime("%H")}:00 : {pair[1]:.2f}')

Top 5 hours for Ask posts comments:

15:00 : 38.59
02:00 : 23.81
20:00 : 21.52
16:00 : 16.80
21:00 : 16.01


Based on this, we can say that `the most productive times for comments` are:  
* 15:00 - 17:00  
* 20:00 - 22:00  
* 2:00 - 3:00

But we need to take the `time difference` into account.  
Our data set uses the `Eastern Time in the US` time zone.  
If we are located in Vladivostok, Russia, the favorable hours for receiving comments shift by +15 hours (relative to Eastern Standard Time; the shift is +14 hours while daylight saving time is in effect in the US):
* 6:00 - 8:00  
* 11:00 - 13:00  
* 17:00 - 18:00
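The conversion can be reproduced with the standard `datetime` module. The sketch below assumes fixed offsets, UTC-5 for Eastern Standard Time and UTC+10 for Vladivostok, so it ignores daylight saving time. The function `convert_hour` and the constant names are hypothetical. For real dates, the `zoneinfo` module (Python 3.9+) with proper time zone names would handle daylight saving time as well.

```python
import datetime as dt

# Assumed fixed offsets: US Eastern Standard Time (UTC-5), Vladivostok (UTC+10)
EASTERN = dt.timezone(dt.timedelta(hours=-5))
VLADIVOSTOK = dt.timezone(dt.timedelta(hours=10))


def convert_hour(hour, source_tz=EASTERN, target_tz=VLADIVOSTOK):
    """Return the hour of day in target_tz for a given hour in source_tz."""
    # The date is arbitrary: with fixed offsets only the hour matters
    moment = dt.datetime(2016, 1, 1, hour, tzinfo=source_tz)
    return moment.astimezone(target_tz).hour


for hour in (15, 20, 2):
    print(f'{hour:02d}:00 Eastern -> {convert_hour(hour):02d}:00 Vladivostok')

# 15:00 Eastern -> 06:00 Vladivostok
# 20:00 Eastern -> 11:00 Vladivostok
# 02:00 Eastern -> 17:00 Vladivostok
```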

# Findings

In conclusion, we can say that:

* `Ask HN posts` receive `more comments` than show posts on average (based on data for posts with comments)  
* there are `hours` when posts `receive more comments` on average - `15.00 - 17.00, 20.00 - 22.00, 2.00 - 3.00` (based on ask posts comments data)

In addition, we need to keep in mind the `time zone` we are making the analysis for and convert the hours if necessary.