# Analysing Hacker News Posts

## Description of the Project

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

This project is designed to compare two types of common Hacker News listings. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just something interesting.

`Ask HN` examples:
- Ask HN: How to improve my personal website?
- Ask HN: Am I the only one outraged by Twitter shutting down share counts?
- Ask HN: Aby recent changes to CSS that broke mobile?

`Show HN` examples:
- Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
- Show HN: Something pointless I made
- Show HN: Shanhu.io, a programming playground powered by e8vm

The dataset to be used for this analysis is available on Kaggle [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts).

## Project Goal

Our aim is to compare "Ask HN" and "Show HN" posts to answer the following questions:
1) Which type of post receives more comments, on average?
2) Does the time a post is created affect the average number of comments?

Answering these questions should help users create posts that generate more comments.

### Process and Analysis

Importing the CSV:

In [131]:
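from csv import reader

# Read the whole CSV into a list of rows (each row is a list of strings);
# the with-block closes the file once it has been read
with open('HN_posts_year_to_Sep_26_2016.csv', encoding='UTF8') as opened_file:
    allposts = list(reader(opened_file))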

Viewing the first five entries of the dataset, and determining the length of the dataset:

In [132]:
print(allposts[:5])
print(len(allposts))

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]
293120


Removing the header row:

In [133]:
header = allposts[0]
allposts = allposts[1:]

print("The headers:")
print(header)
print('\n')
print("The first five rows of data:")  

for row in allposts[0:5]:    
    print(row)
    
print('\n')
print("There are",len(allposts), "rows of data.")

The headers:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


The first five rows of data:
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']
['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor
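
As an aside, the standard library's `csv.DictReader` treats the first row as the header automatically, so each row can be read by column name instead of by index. The sketch below is not what this notebook does; it uses a two-line in-memory stand-in (the header plus one row from the output above) so that it runs without the dataset file.

```python
import csv
import io

# In-memory stand-in for the real file: the header plus one data row shown above
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR  the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
)

# DictReader consumes the header row and maps each value to its column name
rows = list(csv.DictReader(sample_csv))

print(len(rows))
print(rows[0]["author"], rows[0]["num_comments"], rows[0]["created_at"])
```

Values are still strings at this point, so `num_comments` would need `int()` before any arithmetic, just as in the list-based approach.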

Separating "Ask HN", "Show HN" and "Other" posts into lists:

In [134]:
ask_posts = []
show_posts = []
other_posts = []

for row in allposts:
    title = row[1]
    title = title.lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
                                 
print("Total posts:",len(allposts))
print("Ask posts:",len(ask_posts))
print("Show posts:",len(show_posts))
print("Other posts:",len(other_posts))

Total posts: 293119
Ask posts: 9139
Show posts: 10158
Other posts: 273822
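
The prefix test used in the loop above can also be wrapped in a small function, which makes it easy to try on a handful of titles. This is only a sketch: `classify_title` is a hypothetical helper name that does not appear in the notebook, and the sample titles are taken from the examples and output shown earlier.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


sample_titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "algorithmic music",
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```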


Finding total number of comments in ask posts to determine the average number of comments per ask post:

In [136]:
total_ask_comments = 0

for row in ask_posts:
    comments = int(row[4])
    total_ask_comments = total_ask_comments + comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(round(avg_ask_comments,2))

10.39


Finding total number of comments in show posts to determine the average number of comments per show post:

In [137]:
total_show_comments = 0

for row in show_posts:
    comments = int(row[4])
    total_show_comments = total_show_comments + comments
    
# Divide by the number of show posts (not ask posts) to get the show average
avg_show_comments = total_show_comments / len(show_posts)
print(round(avg_show_comments,2))

4.89


Posts that start with "Ask HN" receive, on average, 10.39 comments.

Posts that start with "Show HN" receive, on average, 4.89 comments.
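
Because the same averaging loop is needed for each post type, it could be folded into one function so that the divisor always matches the list being summed. The sketch below assumes the dataset's column order (comments at index 4); `average_comments` is a hypothetical helper name and the two rows are made up for illustration.

```python
def average_comments(posts, comments_index=4):
    """Average of the num_comments column over a list of rows."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the dataset's column order, for illustration only
sample_posts = [
    ['1', 'Ask HN: example one', '', '4', '7', 'user_a', '9/26/2016 2:53'],
    ['2', 'Ask HN: example two', '', '1', '3', 'user_b', '9/26/2016 1:17'],
]

sample_average = average_comments(sample_posts)
print(round(sample_average, 2))
```

With the real data this would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.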

Importing the datetime module as dt:

In [138]:
import datetime as dt

Checking that datetime parses the timestamps correctly and that we can extract the hour as a zero-padded, two-digit string:

In [141]:
print(ask_posts[0])
for row in ask_posts[:5]:
    time = dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(time, "%H")
    print(time)
    print(hour)

['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']
2016-09-26 02:53:00
02
2016-09-26 01:17:00
01
2016-09-25 22:57:00
22
2016-09-25 22:48:00
22
2016-09-25 21:50:00
21
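
For reference, a parsed `datetime` object exposes the hour in two ways: `strftime("%H")` gives the zero-padded string used in this notebook, while the `.hour` attribute gives a plain integer. The snippet below is a small sketch using the timestamp of the first ask post shown above; the variable names are illustrative only.

```python
import datetime as dt

created_at = "9/26/2016 2:53"  # timestamp of the first ask post shown above
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

hour_as_text = parsed.strftime("%H")  # zero-padded string, as used for the dictionary keys
hour_as_int = parsed.hour             # plain integer

print(repr(hour_as_text), hour_as_int)
```

The string form keeps a fixed width, which is convenient for dictionary keys and for sorting as text; the integer form is handier for arithmetic.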


Creating two dictionaries, one mapping each hour to the number of posts and one mapping each hour to the total number of comments, then populating them:

In [142]:
counts_by_hour = {}
comments_by_hour = {}

for row in ask_posts:
    comments = int(row[4])
    time = dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(time, "%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
    else:
        counts_by_hour[hour] = 1
        
    if hour in comments_by_hour:
        comments_by_hour[hour] += comments
    else:
        comments_by_hour[hour] = comments
        
print(counts_by_hour)
print(comments_by_hour)

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
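
The `if hour in ... else ...` pattern above can be shortened with `collections.defaultdict`, which supplies a starting value of 0 for any key that has not been seen yet. This is an alternative sketch, not the notebook's own code; the three rows are made up and follow the dataset's column order.

```python
import datetime as dt
from collections import defaultdict

# Made-up rows in the dataset's column order, for illustration only
sample_posts = [
    ['1', 'Ask HN: a', '', '1', '7', 'user_a', '9/26/2016 2:53'],
    ['2', 'Ask HN: b', '', '1', '3', 'user_b', '9/26/2016 2:10'],
    ['3', 'Ask HN: c', '', '1', '5', 'user_c', '9/25/2016 22:57'],
]

counts = defaultdict(int)    # hour -> number of posts
comments = defaultdict(int)  # hour -> total comments

for row in sample_posts:
    hour = dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += int(row[4])

print(dict(counts))
print(dict(comments))
```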


Creating a list of hour and average comments for the hour:

In [143]:
avg_comments_by_hour = []

# For each hour, divide the total comments by the number of posts
# and store the result as an [hour, average] pair
for hour in counts_by_hour:
    avg_comments_by_hour.append(
        [hour, comments_by_hour[hour] / counts_by_hour[hour]]
    )

In [144]:
print(avg_comments_by_hour)

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]


Rounding these averages to two decimal places:

In [145]:
for row in avg_comments_by_hour:
    avg_comments = row[1]
    avg_comments = round(avg_comments,2)
    row[1] = avg_comments

print(avg_comments_by_hour)

[['02', 11.14], ['01', 7.41], ['22', 8.8], ['21', 8.69], ['19', 7.16], ['17', 9.45], ['15', 28.68], ['14', 9.69], ['13', 16.32], ['11', 8.96], ['10', 10.68], ['09', 6.65], ['07', 7.01], ['03', 7.95], ['23', 6.7], ['20', 8.75], ['16', 7.71], ['08', 9.19], ['00', 7.56], ['18', 7.94], ['12', 12.38], ['04', 9.71], ['06', 6.78], ['05', 8.79]]
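
Note that `round()` returns a float, so a value such as 8.80 is displayed as `8.8`. If a fixed two-decimal display is wanted, string formatting is one option. The snippet below is a small sketch using the hour-22 average from the output above.

```python
average = 8.804177545691905  # hour 22 average from the output above

rounded = round(average, 2)   # float: the trailing zero is not displayed
formatted = f"{average:.2f}"  # string: always shows two decimal places

print(rounded, formatted)
```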


Sorting these values from the highest average number of comments to the lowest:

Swapping hour and avg_comments:

In [146]:
swapped_avg_comments_by_hour = []
for hour in avg_comments_by_hour:
    hr = hour[0]
    comments = hour[1]
    swapped_avg_comments_by_hour.append([comments, hr])
print(swapped_avg_comments_by_hour)

[[11.14, '02'], [7.41, '01'], [8.8, '22'], [8.69, '21'], [7.16, '19'], [9.45, '17'], [28.68, '15'], [9.69, '14'], [16.32, '13'], [8.96, '11'], [10.68, '10'], [6.65, '09'], [7.01, '07'], [7.95, '03'], [6.7, '23'], [8.75, '20'], [7.71, '16'], [9.19, '08'], [7.56, '00'], [7.94, '18'], [12.38, '12'], [9.71, '04'], [6.78, '06'], [8.79, '05']]


Sorting the new list in descending order:

In [147]:
ordered_avg_comments_by_hour = sorted(swapped_avg_comments_by_hour,reverse=True)
print(ordered_avg_comments_by_hour)

[[28.68, '15'], [16.32, '13'], [12.38, '12'], [11.14, '02'], [10.68, '10'], [9.71, '04'], [9.69, '14'], [9.45, '17'], [9.19, '08'], [8.96, '11'], [8.8, '22'], [8.79, '05'], [8.75, '20'], [8.69, '21'], [7.95, '03'], [7.94, '18'], [7.71, '16'], [7.56, '00'], [7.41, '01'], [7.16, '19'], [7.01, '07'], [6.78, '06'], [6.7, '23'], [6.65, '09']]
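
The swap step is needed because `sorted()` compares lists by their first element. An alternative, sketched below under the assumption that the `[hour, average]` layout is kept, is to pass a `key` function that selects the average directly. The list here is a four-item subset of the values above, and `ranked` is an illustrative name.

```python
# Subset of the [hour, average] pairs printed above
avg_by_hour = [['02', 11.14], ['15', 28.68], ['13', 16.32], ['09', 6.65]]

# Sort on the second element (the average), highest first, without swapping columns
ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

print(ranked)
```

One difference to be aware of: with a `key`, hours that share the same average keep their original relative order, whereas sorting the swapped lists breaks ties by the hour string.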


Printing the top five hours for average comments per post in the "Ask HN" category of posts:

In [148]:
print("Top 5 Hours for Ask Posts Comments")
for row in ordered_avg_comments_by_hour[:5]:
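    # row[0] is the average number of comments per post, row[1] is the hour
    print(row[0],"average comments per post at hour",row[1])

Top 5 Hours for Ask Posts Comments
28.68 average comments per post at hour 15
16.32 average comments per post at hour 13
12.38 average comments per post at hour 12
11.14 average comments per post at hour 02
10.68 average comments per post at hour 10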


Based on the above, an "Ask HN" post is likely to receive the most comments if it is posted between 3:00 pm and 4:00 pm (hour 15 in the dataset's timestamps). If posting at that time is not possible, posting between 10:00 am and 4:00 pm is still likely to attract a relatively large number of comments.

Repeating the process for "Show HN" posts:

Creating two dictionaries, one mapping each hour to the number of posts and one mapping each hour to the total number of comments, then populating them:

In [151]:
counts_by_hour_2 = {}
comments_by_hour_2 = {}

for row in show_posts:
    comments = int(row[4])
    time = dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(time, "%H")
    if hour in counts_by_hour_2:
        counts_by_hour_2[hour] += 1
    else:
        counts_by_hour_2[hour] = 1
        
    if hour in comments_by_hour_2:
        comments_by_hour_2[hour] += comments
    else:
        comments_by_hour_2[hour] = comments
        
print(counts_by_hour_2)
print(comments_by_hour_2)

{'00': 276, '23': 319, '20': 525, '19': 556, '18': 656, '16': 801, '14': 696, '10': 323, '09': 302, '08': 316, '06': 192, '03': 206, '21': 430, '17': 761, '15': 836, '11': 402, '07': 236, '04': 194, '13': 610, '12': 516, '01': 247, '22': 377, '02': 209, '05': 172}
{'00': 1283, '23': 1444, '20': 2183, '19': 2791, '18': 3242, '16': 3769, '14': 3839, '10': 1228, '09': 1411, '08': 1771, '06': 904, '03': 934, '21': 1759, '17': 3236, '15': 3824, '11': 2413, '07': 1577, '04': 978, '13': 3314, '12': 3609, '01': 1006, '22': 1450, '02': 1076, '05': 592}


Creating a list of hour and average comments for the hour:

In [158]:
avg_comments_by_hour_2 = []

# For each hour, divide the total comments by the number of posts
# and store the result as an [hour, average] pair
for hour in counts_by_hour_2:
    avg_comments_by_hour_2.append(
        [hour, comments_by_hour_2[hour] / counts_by_hour_2[hour]]
    )
print(avg_comments_by_hour_2)

[['00', 4.648550724637682], ['23', 4.5266457680250785], ['20', 4.158095238095238], ['19', 5.01978417266187], ['18', 4.942073170731708], ['16', 4.705368289637953], ['14', 5.515804597701149], ['10', 3.801857585139319], ['09', 4.672185430463577], ['08', 5.6044303797468356], ['06', 4.708333333333333], ['03', 4.533980582524272], ['21', 4.090697674418605], ['17', 4.252299605781866], ['15', 4.574162679425838], ['11', 6.002487562189055], ['07', 6.682203389830509], ['04', 5.041237113402062], ['13', 5.432786885245902], ['12', 6.994186046511628], ['01', 4.0728744939271255], ['22', 3.8461538461538463], ['02', 5.148325358851674], ['05', 3.441860465116279]]


Rounding these averages to two decimal places:

In [159]:
for row in avg_comments_by_hour_2:
    avg_comments = row[1]
    avg_comments = round(avg_comments,2)
    row[1] = avg_comments

print(avg_comments_by_hour_2)

[['00', 4.65], ['23', 4.53], ['20', 4.16], ['19', 5.02], ['18', 4.94], ['16', 4.71], ['14', 5.52], ['10', 3.8], ['09', 4.67], ['08', 5.6], ['06', 4.71], ['03', 4.53], ['21', 4.09], ['17', 4.25], ['15', 4.57], ['11', 6.0], ['07', 6.68], ['04', 5.04], ['13', 5.43], ['12', 6.99], ['01', 4.07], ['22', 3.85], ['02', 5.15], ['05', 3.44]]


Sorting these values from the highest average number of comments to the lowest:

Swapping hour and avg_comments:

In [160]:
swapped_avg_comments_by_hour_2 = []
for hour in avg_comments_by_hour_2:
    hr = hour[0]
    comments = hour[1]
    swapped_avg_comments_by_hour_2.append([comments, hr])
print(swapped_avg_comments_by_hour_2)

[[4.65, '00'], [4.53, '23'], [4.16, '20'], [5.02, '19'], [4.94, '18'], [4.71, '16'], [5.52, '14'], [3.8, '10'], [4.67, '09'], [5.6, '08'], [4.71, '06'], [4.53, '03'], [4.09, '21'], [4.25, '17'], [4.57, '15'], [6.0, '11'], [6.68, '07'], [5.04, '04'], [5.43, '13'], [6.99, '12'], [4.07, '01'], [3.85, '22'], [5.15, '02'], [3.44, '05']]


Sorting the new list in descending order:

In [161]:
ordered_avg_comments_by_hour_2 = sorted(swapped_avg_comments_by_hour_2,reverse=True)
print(ordered_avg_comments_by_hour_2)

[[6.99, '12'], [6.68, '07'], [6.0, '11'], [5.6, '08'], [5.52, '14'], [5.43, '13'], [5.15, '02'], [5.04, '04'], [5.02, '19'], [4.94, '18'], [4.71, '16'], [4.71, '06'], [4.67, '09'], [4.65, '00'], [4.57, '15'], [4.53, '23'], [4.53, '03'], [4.25, '17'], [4.16, '20'], [4.09, '21'], [4.07, '01'], [3.85, '22'], [3.8, '10'], [3.44, '05']]


Printing the top five hours for average comments per post in the "Show HN" category of posts:

In [162]:
print("Top 5 Hours for Show Posts Comments")
for row in ordered_avg_comments_by_hour_2[:5]:
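    # row[0] is the average number of comments per post, row[1] is the hour
    print(row[0],"average comments per post at hour",row[1])

Top 5 Hours for Show Posts Comments
6.99 average comments per post at hour 12
6.68 average comments per post at hour 07
6.0 average comments per post at hour 11
5.6 average comments per post at hour 08
5.52 average comments per post at hour 14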


Based on the above, a "Show HN" post is likely to receive the most comments if it is posted between 12:00 pm and 1:00 pm (hour 12 in the dataset's timestamps). If posting at that time is not possible, posting between 7:00 am and 1:00 pm is still likely to attract a relatively large number of comments.

The difference between the two post types may be due to users reading and commenting on articles earlier in the day, but not having the time to answer questions until later in the day.
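
Since the same steps were run twice, once per post type, they could be collected into a single function. The sketch below is one possible way to do that and is not part of the original notebook: `average_comments_by_hour` and `top_n` are hypothetical names, and the sample rows are made up in the dataset's column order.

```python
import datetime as dt


def average_comments_by_hour(posts, top_n=5):
    """Return up to top_n [hour, average comments] pairs, highest average first."""
    counts = {}
    comments = {}
    for row in posts:
        hour = dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M").strftime("%H")
        counts[hour] = counts.get(hour, 0) + 1
        comments[hour] = comments.get(hour, 0) + int(row[4])

    averages = [[hour, round(comments[hour] / counts[hour], 2)] for hour in counts]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up rows in the dataset's column order, for illustration only
sample_posts = [
    ['1', 'Show HN: a', '', '1', '7', 'user_a', '9/26/2016 2:53'],
    ['2', 'Show HN: b', '', '1', '3', 'user_b', '9/26/2016 2:10'],
    ['3', 'Show HN: c', '', '1', '4', 'user_c', '9/25/2016 22:57'],
]

top_hours = average_comments_by_hour(sample_posts)
print(top_hours)
```

With the real lists this would be called once as `average_comments_by_hour(ask_posts)` and once as `average_comments_by_hour(show_posts)`, which removes the need for the `_2` copies of each variable.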

## Conclusion

Our aim was to compare "Ask HN" and "Show HN" posts to answer the following questions:

1) Which type of post receives more comments, on average?

`Ask HN` posts receive more comments, on average. 

Posts that start with `Ask HN` receive, on average, 10.39 comments.

Posts that start with `Show HN` receive, on average, 4.89 comments.

2) Does the time a post is created affect the average number of comments?

The time a post is created does affect the average number of comments, and the ideal times are different for each type of post.

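Top 5 Hours for `Ask HN` Comments:
- 28.68 average comments per post at hour 15
- 16.32 average comments per post at hour 13
- 12.38 average comments per post at hour 12
- 11.14 average comments per post at hour 02
- 10.68 average comments per post at hour 10

Top 5 Hours for `Show HN` Comments:
- 6.99 average comments per post at hour 12
- 6.68 average comments per post at hour 07
- 6.0 average comments per post at hour 11
- 5.6 average comments per post at hour 08
- 5.52 average comments per post at hour 14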

### Average Comments per `Ask HN` Post at Different Times of the Day

In [198]:
import pandas as pd
from IPython.display import display, HTML

In [199]:
df = pd.DataFrame(data=ordered_avg_comments_by_hour[:5],columns=['Average Comments','Hour'])
display(HTML(df.to_html(index=False)))

| Average Comments | Hour |
|---|---|
| 28.68 | 15 |
| 16.32 | 13 |
| 12.38 | 12 |
| 11.14 | 2 |
| 10.68 | 10 |


### Average Comments per `Show HN` Post at Different Times of the Day

In [201]:
df = pd.DataFrame(data=ordered_avg_comments_by_hour_2[:5],columns=['Average Comments','Hour'])
display(HTML(df.to_html(index=False)))

| Average Comments | Hour |
|---|---|
| 6.99 | 12 |
| 6.68 | 7 |
| 6.0 | 11 |
| 5.6 | 8 |
| 5.52 | 14 |
