## Investigating Hacker News Posts with the Most Comments 

**LOOPS, MANIPULATING STRINGS AND DATE/TIME DATA**

![Image](https://s3.amazonaws.com/dq-content/354/hacker_news.jpg)

Hacker News is a site established by the startup incubator *Y Combinator*, where user-submitted stories (known as "posts") are voted and commented on. Hacker News is extremely popular in the technology and startup fields, and posts that make it to the top of Hacker News' listings (e.g., receive the most points or comments) can get hundreds of thousands of visitors as a result. 

The current project focuses on determining what kinds of Hacker News posts receive more comments.
Specifically, we will address two questions:
1. Do *Ask HN* or *Show HN* posts receive more comments? (*Ask HN* posts pose questions to the community, while *Show HN* posts show material, such as projects, products, etc.)
2. Do posts created at a certain time receive more comments on average?

The dataset can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts). Note that this is a cleaned version, reduced from 300,000 submissions down to 20,000 (submissions without comments were removed, and then 20,000 submissions were randomly sampled from the remaining dataset).

Below are brief descriptions of each column:

* id: Unique identifier for the post
* title: Title of the post
* url: URL that the post links to
* num_points: Number of points the post acquired (calculated as total number of upvotes minus total number of downvotes)
* num_comments: Number of comments made on the post
* author: Username of person who submitted the post
* created_at: Date and time at which the post was submitted

Let us start by importing libraries and reading the dataset: 

In [120]:
#import libraries and read file

from csv import reader

#Use a context manager so the file is closed once it has been read
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))

#Display first 5 rows of dataset

for i in range(5):
    print(hn[i])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


The first row contains the names of our columns. Let's amend our Hacker News dataset by removing column names:

In [121]:
#Remove header row from dataset

hn_header = hn[0]  #hn[:0] would be an empty list; hn[0] is the header row
hn = hn[1:]

for i in range(5):
    print(hn[i])

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
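
Later cells refer to columns by position (for example, `row[4]` for `num_comments`). A name-to-index lookup can make those positions easier to read. This is a minimal sketch, assuming the header row shown in the first output; `header` and `col` are illustrative names that do not appear in the original notebook:

```python
# Header row, copied from the first cell's output.
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position within a row.
col = {name: index for index, name in enumerate(header)}

# First data row, copied from the output above.
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/', '386', '52',
       'ne0phyte', '8/4/2016 11:52']

print(row[col['title']])              # Interactive Dynamic Video
print(int(row[col['num_comments']]))  # 52
print(col['created_at'])              # 6
```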


Based on our aims, we are interested in exploring post titles beginning with *Ask HN* or *Show HN*. We will lowercase each title with the `lower` method and test its prefix with the `startswith` string method, separating the posts into three lists (*Ask HN*, *Show HN*, and everything else):

In [122]:
##Filter Ask HN and Show HN into separate lists

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
print('\n')
print(ask_posts[0:4])
print('\n')
print(show_posts[0:4])
print('\n')
print(other_posts[0:4])

1744
1162
17194


[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']]


[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients'

Above, we see that the majority of posts (17,194) are classified as other (i.e., neither *Ask HN* nor *Show HN*). Additionally, there are 1,744 posts that start with *Ask HN*, and 1,162 posts that begin with *Show HN*.
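
The filtering rule can also be expressed as a small helper that returns exactly one label per title. This is a sketch rather than the notebook's own code: `classify_title` is a hypothetical name, and the last sample title is made up to show that the check ignores case.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    elif lowered.startswith('show hn'):
        return 'show'
    return 'other'

samples = [
    'Ask HN: How to improve my personal website?',  # from the output above
    'Show HN: Something pointless I made',          # from the output above
    'Interactive Dynamic Video',                    # from the dataset preview
    'ASK HN: uppercase prefix',                     # hypothetical title
]

for title in samples:
    print(classify_title(title), '|', title)
```

Chaining the checks with `elif` guarantees that each title lands in exactly one category, so the three counts always add up to the number of rows.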

### Average Number of Comments per Post Type

Let's assess which of these 2 classes of posts has more comments, on average:

In [124]:
#Compute average number of comments for ask_posts:

total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])

#Divide once, after the loop has summed every post
avg_ask_comments = total_ask_comments / len(ask_posts)

print("The average number of Ask HN comments per post is:", avg_ask_comments)
print('\n')

#Compute average number of comments for show_posts:

total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])

#Divide once, after the loop has summed every post
avg_show_comments = total_show_comments / len(show_posts)

print("The average number of Show HN comments per post is:", avg_show_comments)

The average number of Ask HN comments per post is: 14.038417431192661


The average number of Show HN comments per post is: 10.31669535283993


Our computed averages show that *Ask HN* posts receive about 3.7 more comments per post, on average, than *Show HN* posts (14.04 versus 10.32). This suggests that *Ask HN* posts encourage more activity and interaction than *Show HN* posts.
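
Since the same average is computed for both post types, the calculation could be wrapped in one function. This is a minimal sketch: `average_comments` is a hypothetical helper, and the two sample rows are copied from the *Ask HN* output shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Two Ask HN rows copied from the earlier output.
sample = [
    ['12296411', 'Ask HN: How to improve my personal website?', '',
     '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '',
     '28', '29', 'tkfx', '11/22/2015 13:43'],
]

print(average_comments(sample))  # (6 + 29) / 2 = 17.5
print(average_comments([]))      # 0.0
```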

### Average Number of Comments for *Ask HN* Posts by Hour

Next, we will determine whether *Ask HN* posts created at a certain time attract more comments. To do so, we will:

1. Calculate the number of *Ask HN* posts created during each hour of the day, along with the number of comments those posts received.
2. Calculate the average number of comments *Ask HN* posts receive by hour created.
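
Both steps depend on extracting the hour from the `created_at` string. The sketch below shows that parsing step on its own, using a `created_at` value from one of the *Ask HN* rows printed earlier; the variable names are illustrative.

```python
import datetime as dt

created_at = '8/16/2016 9:55'  # created_at value from an Ask HN row shown earlier

# Parse month/day/year hour:minute into a datetime object.
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

print(parsed.hour)            # 9  (integer)
print(parsed.strftime('%H'))  # 09 (zero-padded string)
```

The zero-padded string form is convenient as a dictionary key because sorting the keys as text then matches chronological order ('09' sorts before '10').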

Let us start with step 1, creating frequency tables for the number of posts per hour and number of comments per hour:

In [125]:
#Number of Ask HN posts and comments by hour created

import datetime as dt

counts_by_hour = {}
comments_by_hour = {}

for row in ask_posts:
    num_comments = int(row[4])
    #Parse created_at, then keep the zero-padded hour as the dictionary key
    date = dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M")
    hour = date.strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments

print("Frequency Table for Number of Ask Posts per Hour:")
for key in sorted(counts_by_hour.keys()):
    print(key, ":", counts_by_hour[key])
    
print('\n')
print("Frequency Table for Number of Ask Comments per Hour:")
for key in sorted(comments_by_hour.keys()):
    print(key, ":", comments_by_hour[key])

Frequency Table for Number of Ask Posts per Hour:
00 : 55
01 : 60
02 : 58
03 : 54
04 : 47
05 : 46
06 : 44
07 : 34
08 : 48
09 : 45
10 : 59
11 : 58
12 : 73
13 : 85
14 : 107
15 : 116
16 : 108
17 : 100
18 : 109
19 : 110
20 : 80
21 : 109
22 : 71
23 : 68


Frequency Table for Number of Ask Comments per Hour:
00 : 447
01 : 683
02 : 1381
03 : 421
04 : 337
05 : 464
06 : 397
07 : 267
08 : 492
09 : 251
10 : 793
11 : 641
12 : 687
13 : 1253
14 : 1416
15 : 4477
16 : 1814
17 : 1146
18 : 1439
19 : 1188
20 : 1722
21 : 1745
22 : 479
23 : 543


Looking at our first frequency table, we see that the hour with the most posts is 3:00PM (i.e., hour 15), with 116 posts. More generally, posting is concentrated between 2:00PM and 9:00PM (i.e., hours 14 through 21), which together account for 839 of the 1,744 posts, or roughly half. The fewest posts are created during the early morning hours (4:00AM to 9:00AM).

Looking at our second frequency table, we see that the hour with the most comments is also 3:00PM (i.e., hour 15), with 4,477 comments. Note that comments are grouped by the hour in which the post was created, not the hour in which each comment was written. Generally, posts created between 1:00PM and 9:00PM (i.e., hours 13 through 21) collect the most comments.
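
Rather than scanning the tables by eye, the peak hour can be looked up with `max` and a `key` function. This is a small sketch that uses only a few entries copied from the tables above; `counts_sample` and `comments_sample` are illustrative names, not the notebook's full dictionaries.

```python
# A few entries copied from the frequency tables above (not the full tables).
counts_sample = {'13': 85, '14': 107, '15': 116, '16': 108}
comments_sample = {'13': 1253, '14': 1416, '15': 4477, '16': 1814}

# max() iterates over the keys; key=dict.get ranks them by their values.
busiest_post_hour = max(counts_sample, key=counts_sample.get)
busiest_comment_hour = max(comments_sample, key=comments_sample.get)

print(busiest_post_hour, counts_sample[busiest_post_hour])          # 15 116
print(busiest_comment_hour, comments_sample[busiest_comment_hour])  # 15 4477
```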

We can surmise a number of things here:
1. There is a flurry of activity, both posting and commenting, during the mid-afternoon and evening hours.
2. It's possible that people comment more during these hours because people know more posts are made available during this time.
3. Conversely, we can theorize that those who are around to comment during this time may also be available to post more *Ask HN* content.

Let us now turn to step 2 of our plan: calculating the average number of comments an *Ask HN* post receives by hour created.

In [129]:
#Calculate average number of comments for Ask posts for each hour

avg_by_hour = []

for key in comments_by_hour:
    avg_by_hour.append([key, comments_by_hour[key]/counts_by_hour[key]])

#Sort lists by hour:

sorted_abh = sorted(avg_by_hour)

#Organizing output

print("Average Number of Comments an Ask HN Post Receives per Hour")
print('\n')
template = "At {hour}:00, there were {avg:,.2f} average comments per post"
for row in sorted_abh:
    hour = row[0]
    avg = row[1]
    output = template.format(hour=hour, avg=avg)
    print(output)

Average Number of Comments an Ask HN Post Receives per Hour


At 00:00, there were 8.13 average comments per post
At 01:00, there were 11.38 average comments per post
At 02:00, there were 23.81 average comments per post
At 03:00, there were 7.80 average comments per post
At 04:00, there were 7.17 average comments per post
At 05:00, there were 10.09 average comments per post
At 06:00, there were 9.02 average comments per post
At 07:00, there were 7.85 average comments per post
At 08:00, there were 10.25 average comments per post
At 09:00, there were 5.58 average comments per post
At 10:00, there were 13.44 average comments per post
At 11:00, there were 11.05 average comments per post
At 12:00, there were 9.41 average comments per post
At 13:00, there were 14.74 average comments per post
At 14:00, there were 13.23 average comments per post
At 15:00, there were 38.59 average comments per post
At 16:00, there were 16.80 average comments per post
At 17:00, there were 11.46 average comments 

Above, we see that *Ask HN* posts created at 3:00PM (i.e., hour 15) receive the most comments on average, at 38.59 comments per post. Interestingly, the second largest average belongs to posts created at 2:00AM, at 23.81 comments per post.
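
To read off the top hours directly, the `[hour, average]` pairs can be sorted by the average instead of by the hour. This is a sketch using a subset of the values printed above; `avg_sample` and `ranked` are illustrative names.

```python
# [hour, average] pairs copied from the output above (a subset, for brevity).
avg_sample = [['02', 23.81], ['10', 13.44], ['13', 14.74],
              ['15', 38.59], ['16', 16.80]]

# Sort by the average (second element of each pair), largest first.
ranked = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked[:3]:
    print("{}:00 -> {:.2f} average comments per post".format(hour, avg))
# 15:00 -> 38.59 average comments per post
# 02:00 -> 23.81 average comments per post
# 16:00 -> 16.80 average comments per post
```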

#### Conclusion for *Ask HN* Posts:

1. We began our analysis with *Ask HN* posts, given that they are more common than *Show HN* posts (1,744 versus 1,162).
2. The greatest number of *Ask HN* posts are created at ~3PM.
3. *Ask HN* posts created at ~3PM also receive the greatest total number of comments and the largest average number of comments per post.

Of course, one resulting question is why there is such a flurry of activity around this time for *Ask HN* posts. One hypothesis is that this is the point in the workday when people are most likely to take a break, so they have a bit of time to visit and interact with their favorite websites.

Another question is why *Ask HN* posts receive more comments on average than *Show HN* posts. One explanation is that people are intrinsically motivated to help others, so they are likelier to read and respond to *Ask HN* posts to help posters resolve their questions.

### Average Number of Comments for *Show HN* Posts by Hour

Let's redo the previous analysis, focusing on the *Show HN* data. Perhaps we don't have anything to ask, but we would still like to maximize the number of viewers and commenters when we want to show something.
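
Because this analysis repeats the *Ask HN* steps, the shared logic could be written once as a function and called for either list. The following is a sketch under that assumption: `hourly_tables` is a hypothetical helper, not part of the original notebook, and the sample rows are copied from the *Show HN* output shown earlier.

```python
import datetime as dt

def hourly_tables(posts, date_format='%m/%d/%Y %H:%M'):
    """Return (posts per hour, comments per hour), keyed by zero-padded hour."""
    counts, comments = {}, {}
    for row in posts:
        hour = dt.datetime.strptime(row[6], date_format).strftime('%H')
        counts[hour] = counts.get(hour, 0) + 1
        comments[hour] = comments.get(hour, 0) + int(row[4])
    return counts, comments

# Three Show HN rows copied from the earlier output.
sample = [
    ['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
     'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'],
    ['10646440', 'Show HN: Something pointless I made',
     'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'],
    ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm',
     'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'],
]

counts, comments = hourly_tables(sample)
for hour in sorted(counts):
    print(hour, ":", counts[hour], "post(s),", comments[hour], "comment(s)")
```

Using `dict.get` with a default of 0 replaces the `if hour not in ...` branch, so each dictionary is updated in a single line.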

In [130]:
#Number of Show HN posts and comments by hour created

import datetime as dt

counts_by_hour = {}
comments_by_hour = {}

for row in show_posts:
    num_comments = int(row[4])
    #Parse created_at, then keep the zero-padded hour as the dictionary key
    date = dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M")
    hour = date.strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments

print("Frequency Table for Number of Show Posts per Hour:")
for key in sorted(counts_by_hour.keys()):
    print(key, ":", counts_by_hour[key])
    
print('\n')
print("Frequency Table for Number of Show Comments per Hour:")
for key in sorted(comments_by_hour.keys()):
    print(key, ":", comments_by_hour[key])

Frequency Table for Number of Show Posts per Hour:
00 : 31
01 : 28
02 : 30
03 : 27
04 : 26
05 : 19
06 : 16
07 : 26
08 : 34
09 : 30
10 : 36
11 : 44
12 : 61
13 : 99
14 : 86
15 : 78
16 : 93
17 : 93
18 : 61
19 : 55
20 : 60
21 : 47
22 : 46
23 : 36


Frequency Table for Number of Show Comments per Hour:
00 : 487
01 : 246
02 : 127
03 : 287
04 : 247
05 : 58
06 : 142
07 : 299
08 : 165
09 : 291
10 : 297
11 : 491
12 : 720
13 : 946
14 : 1156
15 : 632
16 : 1084
17 : 911
18 : 962
19 : 539
20 : 612
21 : 272
22 : 570
23 : 447


Looking at our first frequency table, we see that the hour with the most posts is 1:00PM (i.e., hour 13), with 99 posts. More generally, posting is concentrated between 1:00PM and 5:00PM (i.e., hours 13 through 17), which together account for 449 of the 1,162 posts (about 39%). The fewest posts are created from late night through the morning (11:00PM to 10:00AM).

Looking at our second frequency table, we see that the hour with the most comments is 2:00PM (i.e., hour 14), with 1,156 comments. As before, comments are grouped by the hour in which the post was created. Generally, posts created between 12:00PM and 6:00PM (i.e., hours 12 through 18) collect the most comments.

Some interpretations:
1. There is a flurry of activity, both posting and commenting, during the mid-afternoon and early evening hours.
2. It's possible that people comment more during these hours because people know more posts are made available during this time.

Let's compute the average number of comments a typical *Show HN* post receives by hour created.

In [131]:
#Calculate average number of show comments for posts for each hour

avg_by_hour = []

for key in comments_by_hour:
    avg_by_hour.append([key, comments_by_hour[key]/counts_by_hour[key]])

#Sort lists by hour:

sorted_abh = sorted(avg_by_hour)

#Organizing output

print("Average Number of Comments a Show HN Post Receives per Hour")
print('\n')
template = "At {hour}:00, there were {avg:,.2f} average comments per post"
for row in sorted_abh:
    hour = row[0]
    avg = row[1]
    output = template.format(hour=hour, avg=avg)
    print(output)

Average Number of Comments a Show HN Post Receives per Hour


At 00:00, there were 15.71 average comments per post
At 01:00, there were 8.79 average comments per post
At 02:00, there were 4.23 average comments per post
At 03:00, there were 10.63 average comments per post
At 04:00, there were 9.50 average comments per post
At 05:00, there were 3.05 average comments per post
At 06:00, there were 8.88 average comments per post
At 07:00, there were 11.50 average comments per post
At 08:00, there were 4.85 average comments per post
At 09:00, there were 9.70 average comments per post
At 10:00, there were 8.25 average comments per post
At 11:00, there were 11.16 average comments per post
At 12:00, there were 11.80 average comments per post
At 13:00, there were 9.56 average comments per post
At 14:00, there were 13.44 average comments per post
At 15:00, there were 8.10 average comments per post
At 16:00, there were 11.66 average comments per post
At 17:00, there were 9.80 average comments per 

Dividing the comment totals by the post counts for each hour, we see that *Show HN* posts created at 6:00PM (i.e., hour 18) receive the most comments on average, at 15.77 comments per post (962 comments across 61 posts). Interestingly, the second highest average (15.71) belongs to posts created at midnight (12:00AM).
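
The printed list of averages stops after 17:00, so the figures for later hours can be rechecked from the two frequency tables. The sketch below does this for three hours; `show_tables` is an illustrative name, and its values are copied from the *Show HN* tables above.

```python
# hour -> (number of posts, number of comments), copied from the Show HN tables.
show_tables = {'00': (31, 487), '14': (86, 1156), '18': (61, 962)}

for hour in sorted(show_tables):
    posts, comments = show_tables[hour]
    print("{}:00 -> {:.2f} average comments per post".format(hour, comments / posts))
# 00:00 -> 15.71 average comments per post
# 14:00 -> 13.44 average comments per post
# 18:00 -> 15.77 average comments per post
```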

#### Conclusion for *Show HN* Posts:

1. The greatest number of *Show HN* posts are created at ~1PM.
2. *Show HN* posts created at ~2PM receive the greatest total number of comments. The largest average numbers of comments per post, however, belong to posts created at 6PM and 12AM.

There is an interesting pattern here. Most content is posted in the early afternoon, likely because people have time during lunch or an early afternoon break to share content. Posts created in the early afternoon also collect the most comments in total, but the largest average number of comments per post belongs to posts created in the early evening (6PM) and at midnight (12AM).

Commenting activity may occur later in the day because this is the time when people are no longer at work, and they have time to dedicate to reading the longer articles characteristic of *Show HN* posts. 

It would be interesting to see whether the length of *Show HN* posts plays a role in the number of comments received as well. Also, we could consider the variable `num_points`, the net number of votes a post receives (upvotes minus downvotes), as a proxy for popularity.
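
One possible starting point for the length question is sketched below. The dataset has no post body, so this sketch uses title length as a stand-in, which is an assumption. The 40-character threshold is arbitrary, `average_by_title_length` is a hypothetical helper, and three rows are far too few to support any conclusion.

```python
def average_by_title_length(posts, threshold=40):
    """Average comments for titles shorter than `threshold` characters vs. the rest."""
    totals = {'short': [0, 0], 'long': [0, 0]}  # bucket -> [post count, comment count]
    for row in posts:
        bucket = 'short' if len(row[1]) < threshold else 'long'
        totals[bucket][0] += 1
        totals[bucket][1] += int(row[4])
    # None marks a bucket with no posts, so we never divide by zero.
    return {bucket: (comments / count if count else None)
            for bucket, (count, comments) in totals.items()}

# Three Show HN rows copied from the earlier output.
sample = [
    ['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
     'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'],
    ['10646440', 'Show HN: Something pointless I made',
     'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'],
    ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm',
     'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'],
]

print(average_by_title_length(sample))
```

The same pattern would work for `num_points` by summing `int(row[3])` instead of `int(row[4])`.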