# Exploring Hacker News Posts Through Data Analysis

Using a dataset of randomly sampled posts with comments, I will explore posts on [Hacker News](https://news.ycombinator.com/). I am interested in two types of posts. "Ask HN" posts ask the community a specific question. "Show HN" posts show the community something of interest, such as an article, project, or product.

I want to conduct analysis to determine the following:
1. Do "Ask HN" or "Show HN" posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

## Introduction

In [1]:
# Read the dataset into a list of lists

from csv import reader

# Use a context manager so the file is closed after reading
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


To continue the analysis, I have to remove the column header row. 

In [2]:
headers = hn[0]
hn = hn[1:]

print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Extract Show HN and Ask HN Posts

Now that the column header row has been removed, I am ready to separate the data into 'Ask HN' posts, 'Show HN' posts, and all other posts.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
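As a quick sanity check, every row lands in exactly one of the three lists, so the counts above (1,744 + 1,162 + 17,194 = 20,100) add up to the number of rows in `hn`. The sketch below applies the same case-insensitive prefix test to a few made-up titles. `classify_title` and `sample_titles` are hypothetical names used only for illustration and are not part of the analysis.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles, not taken from the dataset
sample_titles = [
    'Ask HN: What are you reading?',
    'SHOW HN: My weekend project',
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Lowercasing the title first is what lets a title such as 'SHOW HN: ...' match the 'show hn' prefix.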


## Average Comments on Ask HN and Show HN Posts

In [4]:
# To determine the number of 'Ask HN' comments

total_ask_comments = 0
for row in ask_posts:
    comment_number = row[4]
    comment_number = int(comment_number)
    total_ask_comments += comment_number
    
# To determine the average number of comments per 'Ask HN' post
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [5]:
# To determine the number of 'Show HN' comments
total_show_comments = 0
for row in show_posts:
    comment_num = row[4]
    comment_num = int(comment_num)
    total_show_comments += comment_num
    
# To determine the average number of comments per 'Show HN' post
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


Based on these calculations, 'Ask HN' posts receive more comments on average: about 14 comments per post, versus about 10 for 'Show HN' posts.
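The two averaging cells repeat the same loop, so the calculation could be expressed once as a small helper. This is only a sketch: `average_comments`, its `comments_index` parameter, and the sample rows are assumptions introduced here for illustration, not part of the original notebook.

```python
def average_comments(posts, comments_index=4):
    """Average of the comment counts stored as strings at comments_index."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows that follow the dataset's column order
sample_posts = [
    ['1', 'Ask HN: A?', '', '5', '10', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: B?', '', '3', '4', 'user_b', '1/1/2016 10:00'],
]
sample_average = average_comments(sample_posts)
print(sample_average)
```

With a helper like this, the same function could be called on `ask_posts` and `show_posts` instead of writing two loops.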

## Calculate the Number of Posts and Comments by the Hour 

The next step is to determine whether 'Ask HN' posts created at a certain time of day tend to receive more comments. I will:

1. Calculate the number of ask posts created in each hour of the day and the total number of comments on those posts
2. Calculate the average number of comments ask posts receive for each hour
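The first step can be sketched on a few made-up rows. This version uses `collections.defaultdict` so that a missing hour starts at zero, which is an alternative to checking whether the key already exists. The sample data and the names `sample_results`, `counts`, and `comments` are assumptions for illustration only.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs in the dataset's date format
sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 9:05', 4],
    ['5/2/2016 15:14', 30],
]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # total comments per hour

for created_at, num_comments in sample_results:
    parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    hour = parsed.strftime('%H')  # zero-padded hour, e.g. '09'
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```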

In [6]:
#Pull the created time and number of comments from ask_posts
import datetime as dt

result_list = []
for row in ask_posts:
    created_time = row[6]
    comments_num = int(row[4])
    result_list.append([created_time, comments_num])
    
print(result_list[:4])
    

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3]]


In [7]:
# Count the posts and total the comments for each hour, storing both in dictionaries
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    post_date = row[0]
    post_date = dt.datetime.strptime(post_date, "%m/%d/%Y %H:%M")
    post_hour = post_date.strftime("%H")
    
    if post_hour not in counts_by_hour:
        counts_by_hour[post_hour] = 1
        comments_by_hour[post_hour] = row[1]
        
    else:
        counts_by_hour[post_hour] += 1
        comments_by_hour[post_hour] += row[1]

print(counts_by_hour)
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


I will use the counts_by_hour and comments_by_hour dictionaries to calculate the average number of comments for posts created during each hour of the day.
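As a worked check of the arithmetic, the average for an hour is its total comments divided by its number of posts. The sketch below uses two entries copied from the dictionaries printed above; `counts_sample`, `comments_sample`, and `avg_sample` are illustrative names, not variables from the notebook.

```python
# Figures copied from the dictionaries printed above
counts_sample = {'15': 116, '09': 45}
comments_sample = {'15': 4477, '09': 251}

# Average comments per post for each hour
avg_sample = {
    hour: comments_sample[hour] / counts_sample[hour]
    for hour in counts_sample
}

for hour, avg in avg_sample.items():
    print(f"{hour}: {avg:.2f}")
```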

In [8]:
# Calculate the average number of comments per post for each hour
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, (comments_by_hour[hour]/counts_by_hour[hour]) ])
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


We now have the average number of comments for posts created during each hour of the day. However, the results are easier to read if we sort the avg_by_hour list of lists and print the five highest values in a cleaner format.

In [9]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [10]:
#Sort the values and print the top 5 hours for Ask Post Comments
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    template = "{h}: {c:.2f} average comments per post"
    hours = dt.datetime.strptime(row[1], '%H')
    hours_format = hours.strftime('%H:00')
    output = template.format(h = hours_format, c=row[0])
    print(output)

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
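The same ranking could also be produced without the swap step by passing a `key` function to `sorted`, so that the list is ordered by the average directly. This is a sketch of that alternative, not the approach used above; the sample pairs are copied from the avg_by_hour output, and `avg_pairs` and `top` are names introduced for illustration.

```python
# A few [hour, average] pairs copied from the avg_by_hour output
avg_pairs = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort by the second element (the average), highest first
top = sorted(avg_pairs, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```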


## Conclusion 

After checking the documentation for the dataset, I realized that the hours are in EST. I also reside in the EST zone, so I do not need to convert the times to another time zone. The best times to create an 'Ask HN' post for higher comment engagement are 15:00 (3 PM), 02:00 (2 AM), 20:00 (8 PM), 16:00 (4 PM), and 21:00 (9 PM). Posts created at 3 PM receive an average of about 38.6 comments!
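Readers in other time zones could shift these hours with the standard library. The sketch below treats the dataset's times as a fixed UTC-5 offset and converts the top five hours to UTC. The fixed offset ignores daylight saving time, which is a simplifying assumption, and the date is arbitrary because only the hour matters. `EST`, `top_hours`, and `utc_hours` are names introduced here for illustration.

```python
import datetime as dt

# Fixed UTC-5 offset; an assumption that ignores daylight saving time
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')

top_hours = ['15', '02', '20', '16', '21']  # top five hours from the results

utc_hours = []
for hour in top_hours:
    # Arbitrary date; only the hour is of interest
    local = dt.datetime(2016, 1, 1, int(hour), 0, tzinfo=EST)
    utc_hours.append(local.astimezone(dt.timezone.utc).strftime('%H:00'))

print(utc_hours)
```

Replacing `dt.timezone.utc` with another `tzinfo` object would give the equivalent hours for a different zone.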