Exploring Hacker News Posts

In this project we're going to analyze posts made on the website Hacker News. Hacker News is a website similar to Reddit but more focused on the technology sector. Users submit stories, also known as posts, which are voted and commented upon. Two types of posts users make are Ask HN and Show HN. Ask HN posts usually involve individuals asking the community about a specific topic, while Show HN posts are about showing something, such as a project an individual is working on or a program they've created.

The columns in the data set are as follows:
id - unique identifier from Hacker News for the post
title - title of the post
url - URL the post links to, if it has one
num_points - number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments - number of comments that were made on the post
author - username of the individual who submitted the post
created_at - date and time at which the post was submitted

Let's start by opening the data set, storing the header row in a variable named header and the remaining rows in a variable named hn, and then viewing the first few rows.

In [2]:
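# Import csv, open the file, read it, and convert it to a list of rows
import csv

# The with block closes the file once it has been read
with open('hacker_news.csv') as f:
    hn = list(csv.reader(f))

header = hn[0]  # first row holds the column names
hn = hn[1:]     # remaining rows are the posts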

In [4]:
# print header and first 5 rows of data set
print(header)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
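
Because every row is a plain list, the later cells reach columns by position (for example index 1 for the title and index 4 for the number of comments). A minimal sketch of one way to make those positions self-describing is shown here, using the header row and first data row from the output above. The names `header_row`, `col`, and `sample_row` are illustrative and are not part of the original notebook.

```python
# Header row copied from the output above.
header_row = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position in a row.
col = {name: index for index, name in enumerate(header_row)}

# First data row copied from the output above.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

# Look up positions by name instead of remembering the numbers.
print(col['title'], col['num_comments'], col['created_at'])
print(sample_row[col['title']])
```

With a mapping like this, `row[col['num_comments']]` would read the same value as `row[4]`.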


Since we're only interested in Ask HN and Show HN posts, we'll use the .startswith string method to separate those two types of posts from the rest. Titles are lowercased first so that the match is case-insensitive.

In [5]:
ask_posts = []
show_posts = []
other_posts = []

for v in hn:
    title = v[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(v)
    elif title.lower().startswith('show hn'):
        show_posts.append(v)
    else:
        other_posts.append(v)


In [6]:
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


After filtering through the data set, we're left with 1744 ask posts and 1162 show posts.
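
The prefix test used in the filtering loop can also be pulled into a small helper, which makes it easy to check on a few titles. This is only a sketch: the function name `classify_title` and the sample titles are hypothetical and do not come from the data set.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # lowercase so the match ignores case
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical titles, used only to exercise the helper.
sample_titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Note that a prefix test also matches any title that merely begins with those letters, so it is a heuristic rather than an exact classification.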

Next we will see which of the two types of posts receives a higher average number of comments.

In [8]:
total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])  # index 4 is num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
avg_ask_comments

14.038417431192661

In [10]:
total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])  # index 4 is num_comments

avg_show_comments = total_show_comments / len(show_posts)

avg_show_comments

10.31669535283993

Ask posts received an average of about 14 comments per post, while show posts received an average of about 10 comments per post.
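
The two cells above repeat the same sum-and-divide steps, so they could be folded into one function. The following is a sketch under assumptions: `average_comments` and the two sample rows are hypothetical, with the rows laid out in the same column order as the data set.

```python
def average_comments(posts, comments_index=4):
    """Average the integer value at comments_index across posts."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical rows in the same column order as the data set.
sample_posts = [
    ['1', 'Ask HN: A?', '', '5', '10', 'alice', '1/1/2016 9:00'],
    ['2', 'Ask HN: B?', '', '3', '4', 'bob', '1/2/2016 15:30'],
]
print(average_comments(sample_posts))
```

Called as `average_comments(ask_posts)` and `average_comments(show_posts)`, it should give the same averages as the loops above.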

Now that we know ask posts receive more comments on average, we'll see whether ask posts created at a certain time of day receive more comments. To do that, we'll count the number of ask posts created in each hour of the day, total the comments those posts received, and then calculate the average number of comments per ask post for each hour.
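
Grouping by hour depends on turning each created_at string into a datetime object and then pulling out the hour. A small sketch of that single step follows, using the created_at value from the first row printed earlier. The variable names `created_at`, `parsed`, and `hour` are illustrative.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"   # month/day/year hour:minute
created_at = "8/4/2016 11:52"    # value copied from the first row printed earlier

# Parse the string into a datetime object.
parsed = dt.datetime.strptime(created_at, date_format)

# Format only the hour as a zero-padded, two-character string.
hour = parsed.strftime("%H")

print(parsed)
print(hour)
```

Because "%H" always yields two characters, the hour strings can be used directly as dictionary keys such as '09' or '15'.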

In [12]:
#import datetime module
import datetime as dt

result_list = []

for v in ask_posts:
    result_list.append([v[6], int(v[4])])

#test     
#print(result_list)

counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1

comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
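
The if/else bookkeeping in the loop above is one way to build the two dictionaries. An alternative sketch uses `collections.defaultdict` from the standard library, which starts missing keys at 0. The `sample_pairs` list is hypothetical data standing in for result_list, and this is not the notebook's own method.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# Hypothetical [created_at, num_comments] pairs standing in for result_list.
sample_pairs = [
    ["8/4/2016 11:52", 6],
    ["8/5/2016 11:05", 2],
    ["8/5/2016 15:40", 9],
]

counts = defaultdict(int)    # posts per hour; missing hours start at 0
comments = defaultdict(int)  # comments per hour; missing hours start at 0

for created_at, num_comments in sample_pairs:
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```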

Next we'll use the two dictionaries we created above (counts_by_hour and comments_by_hour) to calculate the average number of comments for posts created during each hour of the day.

In [13]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

In [14]:
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

Now that we've calculated the average number of comments for posts created during each hour of the day, we'll sort the list to find the highest values and print out the top 5. Since sorted compares lists by their first element, we first swap each pair so that the average comes before the hour.

In [15]:
swap_avg_by_hour = []

for v in avg_by_hour:
    swap_avg_by_hour.append([v[1], v[0]])
    

In [16]:
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [17]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
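
Swapping the columns is one way to sort by the average. As a sketch of an alternative, the `key` argument of `sorted` can point at the second element of each pair, so the original [hour, average] order is kept. The list `sample_avg_by_hour` holds a few pairs copied from avg_by_hour above, and the names `sample_avg_by_hour` and `ranked` are illustrative.

```python
# A few [hour, average] pairs copied from avg_by_hour above.
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort on the second element (the average), highest first.
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print("{}:00 -> {:.2f}".format(hour, avg))
```

One difference from the swap approach: when two hours have exactly the same average, this version keeps them in their original order instead of ordering them by hour.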

In [18]:
print("Top 5 Hours for 'Ask HN' Comments")
for avg, hour in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for 'Ask HN' Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


We can see in the data above that Ask HN posts created at 15:00 received the highest average number of comments per post, at 38.59. The next highest average belongs to posts created at 02:00, with 23.81 comments per post, almost 15 fewer than at 15:00.

Conclusion

In this project we analyzed ask posts and show posts to determine which type of post, and which time of day, receives the most comments on average. We've come to the conclusion that Ask HN posts received on average about 4 more comments than Show HN posts. After analyzing the Ask HN posts, we learned that posts created at 15:00 receive the highest number of comments on average.