## Exploring Hacker News Posts

Hacker News is a site started by the startup incubator Y Combinator, where users submit posts that other users vote and comment on.  Hacker News is popular in technology circles, and top posts can get hundreds of thousands of views.

The dataset we will be working with contains approximately 80,000 submissions that received at least one comment; posts without comments are filtered out during cleaning.  Below is a list of the columns with a short description of each.

**id**: A unique identifier from Hacker News for the post  
**title**: Title of the post  
**url**: URL that the post links to, if available  
**num_points**: Number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes  
**num_comments**: Number of comments made on the post  
**author**: Username of the person who submitted the post  
**created_at**: Date and time at which the post was submitted  

## Introduction

In this analysis, we are interested in posts whose titles begin with either "Ask HN" or "Show HN".  Users submit Ask HN posts to ask the Hacker News community a question, whereas Show HN posts are intended to show the community an interesting concept or idea.

We will be comparing these two types of posts to answer the following questions:

1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need and reading the dataset into a list of lists.

In [30]:
from csv import reader

# Use a context manager so the file is closed once it has been read.
with open("hacker_news.csv", encoding="utf8") as opened_file:
    hn = list(reader(opened_file))

print(hn[0])
print("\n")
print(hn[1])
print("\n")
print(hn[2])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
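
To make the shape of the parsed data concrete, here is a minimal sketch that runs `reader` over an in-memory string instead of `hacker_news.csv`. The `sample_csv` string (the header plus one row copied from the output above) and the use of `io.StringIO` are assumptions made for illustration; they are not part of the original notebook.

```python
import io
from csv import reader

# Hypothetical in-memory sample: the header plus one row from the output above.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR  the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
)

# reader() yields one list of strings per CSV line.
sample_rows = list(reader(io.StringIO(sample_csv)))

print(sample_rows[0])     # header row
print(sample_rows[1][4])  # num_comments, still a string at this point
```

Every field comes back as a string, which is why the later cells wrap `row[4]` in `int()` before doing arithmetic.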


## Cleaning the Dataset

Let's separate the headers from the rest of the data.

In [31]:
headers = hn[0]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [32]:
hn = hn[1:]
print(hn[0:2])

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']]


Let's remove all entries that don't have at least one comment.

In [33]:
hn_comments = []

for row in hn:
    n_comments = int(row[4])
    if n_comments > 0:
        hn_comments.append(row)

hn = hn_comments
print(len(hn_comments))

80401
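
The same filter can also be written as a list comprehension. The sketch below is one possible alternative, shown on two made-up rows (`sample_rows` and the `NUM_COMMENTS` constant are hypothetical names introduced for this example) rather than on the full dataset.

```python
# Hypothetical sample rows in the same column order as the dataset.
sample_rows = [
    ["1", "Ask HN: example question", "", "5", "3", "alice", "9/26/2016 3:26"],
    ["2", "An example link post", "http://example.com", "1", "0", "bob", "9/26/2016 3:24"],
]

NUM_COMMENTS = 4  # index of the num_comments column

# Keep only the rows whose comment count is greater than zero.
with_comments = [row for row in sample_rows if int(row[NUM_COMMENTS]) > 0]

print(len(with_comments))
```

Naming the column index avoids repeating the bare `4` throughout the analysis, though the explicit loop above works just as well.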


## Extracting Ask HN and Show HN Posts

Now let's find out how many posts of each type there are.  To do this, we will iterate over the "title" column and group posts based on whether the title starts with "ask hn" or "show hn".  Because users capitalize these prefixes inconsistently, we will lowercase each title before checking its prefix.

In [35]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("Ask Posts: " + str(len(ask_posts)))
print("Show Posts: " + str(len(show_posts)))
print("Other Posts: " + str(len(other_posts)))

Ask Posts: 6911
Show Posts: 5059
Other Posts: 68431
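
The prefix check can be isolated in a small helper, which makes the lowercasing step easy to verify on its own. This is a sketch only: `classify_title` and the sample titles are hypothetical and do not appear in the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the lowercased title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical titles with mixed capitalization.
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My side project",
    "SQLAR  the SQLite Archiver",
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Because the title is lowercased before the comparison, "SHOW HN" and "Show HN" land in the same group.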


## Calculating the Average Number of Comments for Different Types of Posts

Now that we have the posts grouped, we can see which type of post receives more comments on average.

In [37]:
total_ask_comments = 0

for row in ask_posts:
    n_comments = int(row[4])
    total_ask_comments += n_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)

total_show_comments = 0

for row in show_posts:
    n_comments = int(row[4])
    total_show_comments += n_comments
    
avg_show_comments = total_show_comments / len(show_posts)

print("Avg. Number of Ask Comments: " + str(avg_ask_comments))
print("Avg. Number of Show Comments: " + str(avg_show_comments))

Avg. Number of Ask Comments: 13.744175951381855
Avg. Number of Show Comments: 9.810832180272781


On average, Ask HN posts receive about 40% more comments than Show HN posts (13.74 versus 9.81 comments per post).  This seems reasonable because Ask HN posts are interactive by design: they request help from the community.
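
The relative difference between the two averages can be checked directly. The short sketch below hard-codes the two averages printed above; `relative_difference` is a name introduced for this example.

```python
# Averages copied from the output above.
avg_ask_comments = 13.744175951381855
avg_show_comments = 9.810832180272781

# How many more comments Ask HN posts receive, relative to Show HN posts.
relative_difference = (avg_ask_comments - avg_show_comments) / avg_show_comments

print("{:.1%}".format(relative_difference))
```

The result is a little over 40%, measured against the Show HN average as the baseline.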

## Determining Number of Posts and Comments by Hour

Let's move on to our second question: is the number of comments a post receives related to the time of day it is created?  We restrict this part of the analysis to Ask HN posts.

We will do this by creating a new list of lists that contains the creation time and the number of comments for each Ask HN post.  Afterward, we will iterate through this list to build two frequency tables: the number of posts created in each hour and the number of comments those posts received.

In [43]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    n_comments = int(row[4])
    new_list = [created_at, n_comments]
    result_list.append(new_list)

counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    n_comment = int(row[1])
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = n_comment 
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += n_comment    

print(counts_by_hour)
print(comments_by_hour)        

{'02': 227, '01': 223, '22': 287, '21': 407, '19': 420, '17': 404, '15': 467, '14': 378, '13': 326, '11': 251, '10': 219, '09': 176, '07': 157, '03': 212, '16': 415, '08': 190, '00': 231, '23': 276, '20': 392, '18': 452, '12': 274, '04': 186, '06': 176, '05': 165}
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '16': 4466, '08': 2362, '00': 2277, '23': 2297, '20': 4462, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
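
To see what the hour extraction does for a single value, the sketch below parses one `created_at` string taken from the first data row. The `dict.get` update at the end is an assumed alternative to the `if`/`else` branch in the cell above, not the notebook's own code.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"
sample_created_at = "9/26/2016 3:26"  # value from the first data row

# Parse the string into a datetime, then format the hour as two digits.
parsed = dt.datetime.strptime(sample_created_at, date_format)
hour_key = parsed.strftime("%H")

print(hour_key)     # zero-padded string used as the dictionary key
print(parsed.hour)  # the same hour as an integer

# Assumed alternative: dict.get supplies 0 for hours not seen yet.
counts_by_hour = {}
counts_by_hour[hour_key] = counts_by_hour.get(hour_key, 0) + 1
```

Using the zero-padded string as the key is what produces entries such as `'02'` and `'15'` in the dictionaries above.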


Let's create a new list of lists that contains, for each hour, the average number of comments per post.

In [46]:
avg_posts_by_hour = []

for hour in comments_by_hour:
    avg_posts_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
print(avg_posts_by_hour)    

[['02', 13.198237885462555], ['01', 9.367713004484305], ['22', 11.749128919860627], ['21', 11.056511056511056], ['19', 9.414285714285715], ['17', 13.73019801980198], ['15', 39.66809421841542], ['14', 13.153439153439153], ['13', 22.2239263803681], ['11', 11.143426294820717], ['10', 13.757990867579908], ['09', 8.392045454545455], ['07', 10.095541401273886], ['03', 10.160377358490566], ['16', 10.76144578313253], ['08', 12.43157894736842], ['00', 9.857142857142858], ['23', 8.322463768115941], ['20', 11.38265306122449], ['18', 10.789823008849558], ['12', 15.452554744525548], ['04', 12.688172043010752], ['06', 9.017045454545455], ['05', 11.139393939393939]]


## Formatting Our Results

Now that we have our results, let's format them so they're a little easier to read.

In [47]:
swap_avg_posts_by_hour = []

for entry in avg_posts_by_hour:
    swap_list = [entry[1], entry[0]]
    swap_avg_posts_by_hour.append(swap_list)
    
print(swap_avg_posts_by_hour)    

[[13.198237885462555, '02'], [9.367713004484305, '01'], [11.749128919860627, '22'], [11.056511056511056, '21'], [9.414285714285715, '19'], [13.73019801980198, '17'], [39.66809421841542, '15'], [13.153439153439153, '14'], [22.2239263803681, '13'], [11.143426294820717, '11'], [13.757990867579908, '10'], [8.392045454545455, '09'], [10.095541401273886, '07'], [10.160377358490566, '03'], [10.76144578313253, '16'], [12.43157894736842, '08'], [9.857142857142858, '00'], [8.322463768115941, '23'], [11.38265306122449, '20'], [10.789823008849558, '18'], [15.452554744525548, '12'], [12.688172043010752, '04'], [9.017045454545455, '06'], [11.139393939393939, '05']]


In [49]:
sorted_swap = sorted(swap_avg_posts_by_hour, reverse = True)

print("Top 5 Hours for Ask HN Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
        dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg
        )
    )

Top 5 Hours for Ask HN Comments
15:00: 39.67 average comments per post
13:00: 22.22 average comments per post
12:00: 15.45 average comments per post
10:00: 13.76 average comments per post
17:00: 13.73 average comments per post
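
Swapping the columns is one way to sort by the average. An alternative, sketched here on three `[hour, average]` pairs copied from the output above, is to pass a `key` function to `sorted` and leave the pairs in their original order. The names `avg_by_hour` and `top_hours` are introduced only for this example.

```python
# Three [hour, average] pairs copied from the results above.
avg_by_hour = [
    ["02", 13.198237885462555],
    ["15", 39.66809421841542],
    ["13", 22.2239263803681],
]

# Sort by the second element (the average), highest first.
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

This avoids building the intermediate swapped list, at the cost of a small lambda.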


Based on the above results, Ask HN posts created during the 3 PM ET hour received the most comments on average (39.67 per post), well ahead of the 1 PM hour (22.22 per post).
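
Readers outside Eastern Time may want to convert this hour to another time zone. The sketch below converts 3 PM ET to UTC under a loudly stated assumption: it uses a fixed UTC-4 offset (Eastern Daylight Time), which holds for the sample date 9/26/2016 but not for dates in winter. The name `eastern_daylight` is introduced for this example.

```python
import datetime as dt

# Assumption: Eastern Daylight Time (UTC-4), in effect on the sample date 9/26/2016.
eastern_daylight = dt.timezone(dt.timedelta(hours=-4), "EDT")

# 3 PM Eastern on the sample date.
best_hour_et = dt.datetime(2016, 9, 26, 15, 0, tzinfo=eastern_daylight)

# Convert the same instant to UTC.
best_hour_utc = best_hour_et.astimezone(dt.timezone.utc)

print(best_hour_utc.strftime("%H:%M"))
```

For dates that may fall on either side of a daylight saving change, a time zone database lookup would be more reliable than a fixed offset.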

## Conclusion

In this project, we analyzed Ask HN and Show HN posts to determine which type of post and which hour of the day received the most comments on average.

Based on our analysis, if you want to attract the most comments, we recommend posting an Ask HN at 3 PM ET.