# Exploring Hacker News Posts

[Hacker News](https://news.ycombinator.com/) is a site where user-submitted stories (known as "posts") are voted and commented upon. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`:

* `Ask HN`: posts that users submit to ask the Hacker News community a specific question.
* `Show HN`: posts that users submit to show the Hacker News community a project, product, or just generally something interesting.

In our analysis, we'll compare these two types of posts to determine the following:

* Do `Ask HN` or `Show HN` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?


## Introduction

First, we'll read in our [dataset](https://www.kaggle.com/hacker-news/hacker-news-posts) (a description of each column can be found by following the link).
The dataset contains approximately 300,000 rows, with information such as the URL each post links to, the points it was awarded, and the number of comments it received.


In [1]:
# Read in the Data
from csv import reader

# Use a context manager so the file is closed once it has been read.
with open('hacker_news.csv', encoding='utf8') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

print(hn[:5])


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]


# Removing Headers from a List of Lists

Notice that the first inner list contains the column headers, while each list after it contains the data for one row. To analyse our data, we first need to remove the row containing the column headers.

In [2]:
# Remove the headers.
headers = hn[0]
hn = hn[1:]

print(headers)
print('\n')
print(hn[:5])


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
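The same separation can also be done while reading the file. The sketch below is one possible alternative, not the approach used in this notebook: it calls `next()` on the `csv.reader` object to consume the header row before collecting the rest. The in-memory `sample_csv` and its three columns are made up for illustration and stand in for `hacker_news.csv`.

```python
import io
from csv import reader

# Hypothetical sample standing in for hacker_news.csv (illustration only).
sample_csv = io.StringIO(
    "id,title,num_comments\n"
    "1,Ask HN: Example question?,7\n"
    "2,Show HN: Example project,0\n"
)

rows = reader(sample_csv)
sample_headers = next(rows)  # the first row holds the column names
sample_data = list(rows)     # every remaining row is data

print(sample_headers)
print(sample_data)
```

Because the reader is an iterator, `next()` advances it past the header, so `list(rows)` only sees the data rows.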


# Extracting Ask HN and Show HN Posts

Since we're only concerned with post titles beginning with `Ask HN` or `Show HN`, we'll create new lists of lists containing just the data for those titles.

To find the desired posts, we'll use the string method `startswith`, while accounting for variations in capitalization.

In [3]:
# Identify posts that begin with either `Ask HN` or `Show HN` and separate into different lists
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))



9139
10158
273822
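To make the matching rule easier to check on its own, the filter can be sketched as a small function. The name `classify_title` is hypothetical and not part of the notebook; it applies the same lowercase `startswith` test as the loop above.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title begins."""
    lowered = title.lower()  # normalise case before comparing prefixes
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


print(classify_title('Ask HN: What TLD do you use for local development?'))
print(classify_title('SHOW HN: Finding puns computationally'))
print(classify_title('algorithmic music'))
```

Lowercasing first means that `ASK HN`, `Ask HN`, and `ask hn` are all treated as the same prefix.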


Below are the first five rows in the new `ask_posts` list of lists:

In [4]:
for row in ask_posts[:5]:
    print(row)
    print('\n')

['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']


['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']


['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57']


['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']


['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']




Below are the first five rows in the `show_posts` list of lists:


In [5]:
for row in show_posts[:5]:
    print(row)
    print('\n')

['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36']


['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01']


['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44']


['12577991', 'Show HN: Pomodoro-centric, heirarchical project management with ES6 modules', 'https://github.com/jakebian/zeal', '2', '0', 'dbranes', '9/25/2016 23:17']


['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']




# Calculating the Average Number of Comments for Ask HN and Show HN Posts

Now we'll determine if `Ask HN` posts or `Show HN` posts receive more comments on average.

In [6]:
# Calculate the average number of comments on `Ask HN` posts.
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

10.393478498741656


In [7]:
# Calculate the average number of comments on `Show HN` posts.
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

4.886099625910612


As we can see from the results above, `Ask HN` posts receive about 10.4 comments on average, roughly twice the 4.9 comments that `Show HN` posts receive.
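Since the two cells above repeat the same loop, the calculation could be factored into one helper. This is a minimal sketch, assuming the comment count sits at index 4 as it does in this dataset; the function name `average_comments` is hypothetical. The two sample rows are copied from the `ask_posts` output shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments across a list of post rows."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Two rows taken from the ask_posts output above (7 and 3 comments).
sample_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
]

print(average_comments(sample_posts))
```

With a helper like this, the comparison reduces to calling it once on `ask_posts` and once on `show_posts`.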

# Finding the Number of Ask Posts and Comments by Hour Created

Since `Ask HN` posts receive more comments on average, we'll focus our remaining analysis on these posts.

During the second part of our analysis, we'll determine whether `Ask HN` posts created at a certain time are more likely to attract comments. We'll carry out the following steps:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

Below, we'll tackle the first step: calculating the number of `Ask HN` posts and comments received by hour created.


In [8]:
# Count the ask posts created during each hour of the day and total the comments they received.
import datetime as dt

result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])
    
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    num_comments = row[1]
    hour = dt.datetime.strptime(date, date_format).strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
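
The key step in the loop above is turning a `created_at` string into an hour label. The short sketch below isolates that step for a single timestamp taken from the dataset output; the variable names `created_at`, `parsed`, and `hour_label` are chosen here for illustration.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"
created_at = '9/26/2016 2:53'  # timestamp of the first Ask HN row shown earlier

# strptime turns the string into a datetime object ...
parsed = dt.datetime.strptime(created_at, date_format)
# ... and strftime("%H") formats the hour as a zero-padded string.
hour_label = parsed.strftime("%H")

print(parsed.hour)
print(hour_label)
```

Note that `parsed.hour` is the integer `2`, whereas `strftime("%H")` gives the string `'02'`, which is why the dictionary keys in this notebook are two-character strings.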


# Calculating the Average Number of Comments for Ask HN Posts by Hour

Above, we created two dictionaries:

* `counts_by_hour`: contains the number of `Ask HN` posts created during each hour of the day.
* `comments_by_hour`: contains the total number of comments received by `Ask HN` posts created during each hour of the day.

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [9]:
# Calculate the average number of comments that `Ask HN` posts created at each hour of the day receive.
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])

print(avg_by_hour)
   

                        

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]
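
The division can be checked by hand on a toy example. The two dictionaries below hold made-up values (they are not taken from the dataset) and mirror the structure of `counts_by_hour` and `comments_by_hour`; the `sample_` names are hypothetical.

```python
# Made-up values for illustration: 2 posts at 15:00 with 50 comments in total,
# and 4 posts at 02:00 with 20 comments in total.
sample_counts_by_hour = {'15': 2, '02': 4}
sample_comments_by_hour = {'15': 50, '02': 20}

# Average = total comments for the hour / number of posts in that hour.
sample_avg_by_hour = [
    [hr, sample_comments_by_hour[hr] / sample_counts_by_hour[hr]]
    for hr in sample_comments_by_hour
]

print(sample_avg_by_hour)
```

This works because both dictionaries share the same keys, so each hour's total can be paired with its post count.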


# Sorting and Printing Values from a List of Lists

Although we now have the results we need to discern the average number of comments for posts created during each hour of the day, the format above makes it difficult to identify the hours with the highest averages.
We'll finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [10]:
# Swap the columns so the average comes first, then sort the list of lists in descending order.
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)


[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]
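
Swapping the columns is one way to sort by average. A possible alternative, sketched below, is to leave the rows as `[hour, average]` and pass a `key` function to `sorted`. The three sample rows use averages rounded from the results above; the name `sample_avg_by_hour` is introduced here for illustration.

```python
# Three [hour, average] rows, with averages rounded from the results above.
sample_avg_by_hour = [['02', 11.14], ['15', 28.68], ['13', 16.32]]

# Sort on the second element (the average) without swapping the columns.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

print(sorted_by_avg)
```

The `key` approach leaves the original column order intact, so later code can keep reading the hour from index 0.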


In [11]:
# Print the top results in a readable format.
print("Top 5 Hours for Ask Posts Comments")

template = "{hour}: {avg:.2f} average comments per post"

for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(row[1], "%H").strftime("%H:%M")
    avg = row[0]
    formatted_result = template.format(hour=hour, avg=avg)
    print(formatted_result)

Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


We can see here that the hour with the highest average number of comments per post is 15:00. According to the [dataset documentation](https://www.kaggle.com/hacker-news/hacker-news-posts/), the timezone is set to US Eastern Time.
This translates to 20:00 British Summer Time, which is five hours ahead of Eastern Daylight Time.
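
The conversion can be verified with the standard library. The sketch below assumes fixed daylight-saving offsets (UTC-4 for Eastern Daylight Time, UTC+1 for British Summer Time) and uses 26 September 2016, a date that appears in the dataset; the names `EDT`, `BST`, `peak_eastern`, and `peak_british` are introduced here for illustration. Fixed offsets do not track daylight-saving changes, so this only holds while both regions are on summer time.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for the daylight-saving period (illustration only).
EDT = timezone(timedelta(hours=-4), 'EDT')
BST = timezone(timedelta(hours=1), 'BST')

# 15:00 US Eastern on a date that appears in the dataset.
peak_eastern = datetime(2016, 9, 26, 15, 0, tzinfo=EDT)
peak_british = peak_eastern.astimezone(BST)

print(peak_british.strftime("%H:%M %Z"))
```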

# Conclusion

In this project, we analysed `Ask HN` and `Show HN` posts on [Hacker News](https://news.ycombinator.com/) to determine which type of post, created at which time, receives the most comments on average.
Based on our analysis, to maximise the engagement of a post, we'd recommend that it be categorised as an `Ask HN` post and created between the hours of 20:00 and 21:00 BST.