# Exploring Hacker News Posts

Hacker News is a site where user-submitted stories (or 'posts') are voted and commented on. Hacker News is popular in startup circles and posts that make it to the top of Hacker News' listings can get hundreds of thousands of views as a result.

The data set used in this project is available [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

For the purposes of this project, we are interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the community questions. Users submit `Show HN` posts to show the community a project, product or something of interest. 

Specifically, we will compare these two types of posts to determine the following:
   - Do `Ask HN` or `Show HN` posts receive more comments on average?
   - Do posts created at a certain time receive more comments on average?
   
**Please note that the data set we are working with has been reduced from nearly 300,000 rows to about 20,000 rows by first removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.**



## Introduction

We will start by importing the libraries we need to read the data set and then turn it into a list of lists.

In order to analyze our data, we will isolate the header row of the data set. We want to keep a reference, but don't want it included in the data.

In [1]:
from csv import reader # import reader from csv module
with open('hacker_news.csv', newline='') as opened_file: # opens data set and closes it when the block ends
    read_file = reader(opened_file) # reader applied to opened data set
    hn = list(read_file) # creates a list of lists from the read data set
hn_headers = hn[0] # isolates header row
hn = hn[1:] # list of lists containing only data

print(hn_headers) # printing header row
print(hn[:5]) # printing first 5 rows of the list of lists

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
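
To try the header/data split without downloading the file, a small in-memory sample can stand in for it. This is only a sketch: the two rows are copied from the output above, and the names `sample_csv`, `rows`, `headers`, and `data` are illustrative rather than part of the project code.

```python
import io
from csv import reader

# Two rows copied from the printed output above stand in for hacker_news.csv.
# Assumption: the real file has the same seven columns in the same order.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "10301696,Note by Note: The Making of Steinway L1037 (2007),"
    "http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0,8,2,walterbell,9/30/2015 4:12\n"
)

# io.StringIO behaves like an opened text file, so reader() works unchanged
rows = list(reader(io.StringIO(sample_csv)))

headers, data = rows[0], rows[1:]  # split the header row from the data rows
print(headers)
print(len(data))
```

Every value comes back as a string, which is why the later cells convert `num_comments` with `int()` before doing arithmetic.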


A few columns in the header row will be useful in our analysis: the title, the number of comments, and the date the post was created. We print them here for later reference:

In [2]:
print(hn_headers[1]) # prints title column from header
print(hn_headers[4]) # prints number of comments column from header
print(hn_headers[-1]) # prints date the post was created from the header

title
num_comments
created_at


## Isolating `Ask HN` and `Show HN` Posts

We will separate the posts beginning with `Ask HN` and `Show HN` into two different lists, taking letter case variation into account when we match the titles. Working with separate lists makes the rest of the analysis simpler.

In [3]:
ask_posts = [] # establishes list for ask hn posts
show_posts = [] # establishes list for show hn posts
other_posts = [] # establishes list for other posts 

for post in hn: # loop through data set
    title = post[1] # assigns title column to variable
    if title.lower().startswith('ask hn'):
        ask_posts.append(post) # adds post to ask_posts list if it begins with ask hn
    elif title.lower().startswith('show hn'):
        show_posts.append(post)  # adds post to show_posts list if it begins with show hn
    else:
        other_posts.append(post) # adds all other posts to other_posts list

# print lengths of created lists        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
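
The case-insensitive prefix test used above can be sketched as a small helper. The function name `classify` and the sample titles are hypothetical and only illustrate how `lower()` and `startswith()` combine.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix, ignoring case."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles, not taken from the data set
sample_titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
]
labels = [classify(title) for title in sample_titles]
print(labels)
```

Lowercasing the title once before comparing means `Ask HN`, `ask hn`, and `ASK HN` all land in the same group.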


## Calculating the Average Number of Comments on `Ask HN` and `Show HN` Posts

Now that we have separate lists for each post type, we can determine the average number of comments each receives.

In [4]:
total_ask_comments = 0 # initiates ask post comment counter

for row in ask_posts: # loops through ask posts data
    n_ask_comments = int(row[4]) # assigns number of comments column to variable
    total_ask_comments = total_ask_comments + n_ask_comments # adds number of comments for a row to counter
avg_ask_comments = total_ask_comments / len(ask_posts) # calculates average number of comments for ask posts
print('Average # of comments ask posts receive:', avg_ask_comments) # prints result


total_show_comments = 0 # initiates show post comment counter

for row in show_posts: # loops through show posts data
    n_show_comments = int(row[4]) # assigns number of comments column to variable
    total_show_comments = total_show_comments + n_show_comments # adds number of comments for a row to counter
avg_show_comments = total_show_comments / len(show_posts) # calculates average number of comments for show posts
print('Average # of comments show posts receive:', avg_show_comments) # prints result

Average # of comments ask posts receive: 14.038417431192661
Average # of comments show posts receive: 10.31669535283993


It appears that on average `Ask HN` posts receive about 14 comments, whereas `Show HN` posts receive about 10. We will turn the focus of our analysis to `Ask HN` posts as they receive more comments on average.
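
Because the same averaging logic is applied to both lists, it could be factored into one function. The following is a sketch under that assumption; `average_comments` and the sample rows are hypothetical and not part of the original notebook.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column (index 4) over a list of rows."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows laid out like the data set: id, title, url, num_points,
# num_comments, author, created_at
sample_posts = [
    ['1', 'Ask HN: first question', '', '3', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: second question', '', '5', '4', 'user_b', '8/4/2016 12:10'],
]
print(average_comments(sample_posts))
```

With the real lists this would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`. An empty list would raise `ZeroDivisionError`, so it assumes each list has at least one post.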

## Determining the Number of Ask Posts and Comments by Hour Created

Next, we want to determine whether ask posts created at certain times are more likely to receive comments. We will do this by:
   - Calculating the number of ask posts created in each hour of the day, along with the number of comments received
   - Calculating the average number of comments ask posts receive by hour created

In [8]:
import datetime as dt # import datetime module

result_list = [] # establishes empty list
for row in ask_posts: # loops through ask posts data
    result_list.append([row[6], int(row[4])]) # appends two elements to result list as a list
    # added elements are the created at and number of comments columns
    
counts_by_hour = {} # establishes empty dictionary to count posts
comments_by_hour = {} # establishes empty dictionary to count comments
date_format = "%m/%d/%Y %H:%M" # establishes datetime format to match created at column
for row in result_list: # loops through result list created above
    date = row[0] # assigns created at column to variable
    comments = row[1] # assigns number of comments column to variable
    time = dt.datetime.strptime(date, date_format).strftime('%H') # parses the created at value
    # and formats the resulting datetime object as a two-digit hour string
    if time not in counts_by_hour: # evaluates when given hour isn't a key in counts_by_hour dictionary
        counts_by_hour[time] = 1 # establishes key for given hour in dictionary
        comments_by_hour[time] = comments # establishes key for given hour in comments_by_hour dictionary
    else:
        counts_by_hour[time] += 1 # adds 1 to key value in counts_by_hour dictionary
        comments_by_hour[time] += comments # adds number of comments to key value for comments_by_hour dictionary 

comments_by_hour
    

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
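
The hour extraction and the two running tallies can be tried on a handful of values. This sketch uses `dict.get` with a default of 0 in place of the `if`/`else` branch; the sample pairs are hypothetical and only the date format is taken from the cell above.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"  # same format as the created_at column

# Hypothetical [created_at, num_comments] pairs
sample = [['8/4/2016 11:52', 52], ['9/30/2015 4:12', 2], ['1/26/2016 11:05', 10]]

counts = {}    # number of posts per hour
comments = {}  # total comments per hour
for created_at, n_comments in sample:
    # Parse the timestamp, then format the hour as a zero-padded string
    hour = dt.datetime.strptime(created_at, date_format).strftime('%H')
    counts[hour] = counts.get(hour, 0) + 1
    comments[hour] = comments.get(hour, 0) + n_comments

print(counts)
print(comments)
```

Formatting with `'%H'` keeps the keys zero-padded (`'04'` rather than `'4'`), which matches the keys shown in the output above.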

## Calculating the Average Number of Comments for `Ask HN` Posts by Hour

We want to find out whether posts created at a certain time of day receive more comments on average than posts created at other times. To do so, we divide the total number of comments for `Ask HN` posts in each hour by the number of `Ask HN` posts created in that hour.

In [9]:
avg_by_hour = [] # initiates empty list

for key in comments_by_hour: # loops through comments by hour dictionary
    avg_by_hour.append([key, (comments_by_hour[key] / counts_by_hour[key])])
    # appends row to empty list containing the hour and the calculated average number of comments

avg_by_hour # prints created list



[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]
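
The per-hour division can also be written as a list comprehension over the two dictionaries. The totals below are hypothetical stand-ins, since this sketch is only meant to show the shape of the calculation.

```python
# Hypothetical totals; both dictionaries are assumed to share the same hour keys
comments_sample = {'15': 30, '02': 12}
counts_sample = {'15': 3, '02': 4}

# One [hour, average] pair per hour, mirroring the structure of avg_by_hour
avg_sample = [[hour, comments_sample[hour] / counts_sample[hour]]
              for hour in comments_sample]
print(avg_sample)
```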

## Sorting and Analyzing the `avg_by_hour` list of lists

Next, we will switch the values in each row so that the average number of comments is displayed before the hour since our conclusions will be based on the average number of comments.

We will then sort the resulting list of lists in descending order so that the highest averages appear first and are easy to identify.

Lastly, we will display the five rows with the highest average number of comments value to identify which times of the day receive the most comments on average.

In [10]:
swap_avg_by_hour = [] # initiates empty list

for row in avg_by_hour: # loops through avg by hour list of lists
    hour = row[0] # assigns hour to variable
    avg_comments = row[1] # assigns average number of comments to variable
    swap_avg_by_hour.append([avg_comments, hour]) # appends same row with flipped values to new list

swap_avg_by_hour # prints created list

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [13]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True) # sorts swap avg by hour list of lists in descending order
sorted_swap # prints sorted list of lists

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
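
Swapping the columns is one way to sort by the average. An alternative sketch, not used in this notebook, passes a `key` function to `sorted` so the original `[hour, average]` order is kept. The sample values are rounded from the output above.

```python
# [hour, average] pairs, rounded from the avg_by_hour output above
avg_by_hour_sample = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort on the second element (the average) without reordering the columns
ranked = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)
for hour, avg in ranked:
    print(hour, avg)
```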

In [15]:
print('Top 5 Hours for Ask Posts Comments') # prints text in string
for row in sorted_swap[:5]: # loops through first five rows of sorted list of lists
    avg_comments = row[0] # assigns average number of comments to variable
    hour = row[1] # assigns hour to variable
    print(
        '{}: {:.2f} average comments per post'.format(
            dt.datetime.strptime(hour, '%H').strftime('%H:%M'), avg_comments)
    )
    # the two {} placeholders set the output format
    # strptime parses the hour string into a datetime object
    # strftime formats that object as HH:MM for the first {}
    # avg_comments fills the second {} with 2 decimal places

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The top hour in terms of average number of comments is 15:00, with 38.59 comments per post. The gap between the top hour and the runner-up is larger than the gap between the runner-up and the fifth-ranked hour: the top hour receives 14.78 more comments on average than the runner-up (roughly 62% more), while the runner-up receives 7.80 more comments on average than the fifth-ranked hour (roughly 49% more).
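
The comparison above is a relative difference, (a − b) / b. As a sketch of that arithmetic, the snippet below uses the unrounded averages from the sorted list. The helper name `pct_more` is hypothetical, and percentages computed from the rounded two-decimal values may differ slightly in the later digits.

```python
# Unrounded averages taken from the sorted output above
top = 38.5948275862069          # 15:00
runner_up = 23.810344827586206  # 02:00
fifth = 16.009174311926607      # 21:00

def pct_more(a, b):
    """Return how much larger a is than b, as a percentage of b."""
    return (a - b) / b * 100

print('{:.2f} more comments ({:.1f}% more)'.format(top - runner_up, pct_more(top, runner_up)))
print('{:.2f} more comments ({:.1f}% more)'.format(runner_up - fifth, pct_more(runner_up, fifth)))
```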

**Please note that according to the data set [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts/home) the time zone being used is Eastern Time in the United States. We can therefore write 15:00 as 3:00 PM EST, which also lets us work out the best posting times in other time zones.**
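
As a rough sketch of converting the top hour to other time zones, the snippet below uses fixed UTC offsets from the standard library. It assumes EST is UTC−5 and PST is UTC−8 and ignores daylight saving time, so results during the summer months would be off by an hour. The date is arbitrary and chosen only because standard time applies in January.

```python
import datetime as dt

# Fixed offsets (assumption: no daylight saving adjustment)
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')
PST = dt.timezone(dt.timedelta(hours=-8), 'PST')

# 15:00 Eastern on an arbitrary winter date
posting_time = dt.datetime(2016, 1, 26, 15, 0, tzinfo=EST)

utc_time = posting_time.astimezone(dt.timezone.utc)
pst_time = posting_time.astimezone(PST)

print(utc_time.strftime('%H:%M %Z'))
print(pst_time.strftime('%H:%M %Z'))
```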

## Conclusion

In this project, we analyzed `Ask HN` and `Show HN` posts to determine which type of post received the most comments on average. We determined that `Ask HN` posts receive more comments on average, and then analyzed which hours of the day received the most comments on average for those posts. We found that 15:00 (3:00 PM EST) was the hour that received the most comments on average, by quite a bit. We would recommend that when posting on Hacker News it would be best to use `Ask HN` posts and to post between 3:00 PM EST and 4:00 PM EST.

**Please note that the data set that we analyzed excludes posts that did not have any comments. Additionally a random sample was taken out of the remaining posts to make the data more manageable. Taking that background information into account, it would be more accurate for us to conclude that of the posts that received comments, `Ask HN` posts received more comments on average than `Show HN` posts, and `Ask HN` posts created during the 15:00 hour (EST) of the day received the most comments on average.**