## Data Analysis of Hacker News Posts

Hacker News is often described as the "Reddit of tech and startup circles". This project analyzes a dataset of Hacker News posts to determine which type of post, and which posting time (EST), draws the most community engagement.

In [5]:
# Importing the data from .csv file, and separating it into a list and a header.
from csv import reader

# Using a with-block so the file is closed once the rows have been read.
with open('/Users/burnsjse/PythonDirectory/Datasets/hacker_news.csv', encoding='utf8') as opened_file:
    hn = list(reader(opened_file))
hn_header = hn[0]
hn = hn[1:]
print(hn_header)
print('\n\n')
print(hn[:4])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']



[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]


In [12]:
# Here we are separating the ask, show, and other posts into separate lists.
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    #print(title)
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('length of ask posts: ' + str(len(ask_posts)))
print('length of show posts: ' + str(len(show_posts)))
print('length of other posts: ' + str(len(other_posts)))

length of ask posts: 9139
length of show posts: 10158
length of other posts: 273822


In [19]:
# Computing the average number of comments per post for each post type to measure engagement
total_ask_comments = 0
for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average ask comments: ' + str(round(avg_ask_comments,1)))

total_show_comments = 0
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments
avg_show_comments = total_show_comments / len(show_posts)
print('Average show comments: ' + str(round(avg_show_comments,1)))

Average ask comments: 10.4
Average show comments: 4.9


Ask posts average about 10.4 comments each, roughly twice the 4.9 average for show posts, so the rest of the analysis focuses on ask posts.
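
The two averaging loops above differ only in the list they iterate over, so the calculation could be factored into one helper. The following is a minimal sketch, not part of the original notebook: the name `average_comments` and its `comments_index` parameter are illustrative, and it assumes each row stores the comment count as a string at index 4, as in the `hn` rows printed earlier. The sample rows are made up for demonstration.

```python
def average_comments(rows, comments_index=4):
    """Return the mean number of comments per row (0.0 for an empty list)."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# Hand-made rows in the same column layout as the dataset (values are made up).
sample_rows = [
    ['1', 'Ask HN: example A', '', '3', '6', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: example B', '', '5', '10', 'user_b', '9/26/2016 4:10'],
]
print(average_comments(sample_rows))  # 8.0
```

With such a helper, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two loops, and the empty-list guard avoids a division by zero if a category has no posts.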

In [51]:
# Creating dictionaries with the number of ask posts and the total number of comments on those posts for each hour of the day.

import datetime as dt
result_list = []
for row in ask_posts:
    created_at = row[6]
    comments = row[4]
    sublist = [created_at, comments]
    result_list.append(sublist)    
#print(result_list[:5])
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    when = row[0]
    comments = int(row[1])
    hour_dt = dt.datetime.strptime(when, "%m/%d/%Y %H:%M")
    hour_str = hour_dt.strftime("%H")
    if hour_str in counts_by_hour:
        counts_by_hour[hour_str] += 1
    else:
        counts_by_hour[hour_str] = 1
        
    if hour_str in comments_by_hour:
        comments_by_hour[hour_str] += comments
    else:
        comments_by_hour[hour_str] = comments

print(counts_by_hour)
print('\n\n')
print(comments_by_hour)

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}



{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
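
The per-hour tallies above rely on explicit `if`/`else` checks for missing keys. As a hedged alternative, `collections.defaultdict` can supply the zero starting value automatically. This is only a sketch under assumptions: the function name `aggregate_by_hour` and its `date_format` parameter are hypothetical, the input is assumed to be `[created_at, comments]` pairs like `result_list`, and the sample rows are invented.

```python
import datetime as dt
from collections import defaultdict

def aggregate_by_hour(rows, date_format="%m/%d/%Y %H:%M"):
    """Return (counts, comments) dicts keyed by zero-padded hour strings."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
        counts[hour] += 1
        comments[hour] += int(num_comments)
    # Convert back to plain dicts so missing hours raise KeyError as usual.
    return dict(counts), dict(comments)

# Made-up [created_at, comments] pairs for illustration.
sample = [
    ['9/26/2016 3:26', '4'],
    ['9/26/2016 3:59', '6'],
    ['9/25/2016 15:05', '20'],
]
counts, comments = aggregate_by_hour(sample)
print(counts)    # {'03': 2, '15': 1}
print(comments)  # {'03': 10, '15': 20}
```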


In [68]:
# Using the dictionaries to calculate the average comments per post for each hour, stored as a list so it can be sorted.

avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
print(avg_by_hour)

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]


In [76]:
# Sorting the list in descending order and printing the top 5 hours to create an ask post.
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
#print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print('Top 5 hours for Ask Posts Comments:')
for row in sorted_swap[:5]:
    time = dt.datetime.strptime(row[1], "%H")
    average = row[0]
    # %M formats minutes; %m would print the month instead.
    print('{} : {:.2f} average comments per post'.format(time.strftime("%H:%M"), average))

Top 5 hours for Ask Posts Comments:
15:00 : 28.68 average comments per post
13:00 : 16.32 average comments per post
12:00 : 12.38 average comments per post
02:00 : 11.14 average comments per post
10:00 : 10.68 average comments per post
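
Swapping the columns before sorting works, but `sorted` also accepts a `key` function, which makes the intermediate swapped list unnecessary. The sketch below is an illustration rather than the notebook's own method: `top_hours` is a hypothetical helper, it assumes each entry holds the hour string at index 0 and the average at index 1, and the sample averages are rounded, made-up stand-ins for the real values.

```python
def top_hours(avg_by_hour, n=5):
    """Return the n entries with the highest average (index 1), best first."""
    return sorted(avg_by_hour, key=lambda row: row[1], reverse=True)[:n]

# Illustrative [hour, average] entries, not the full results.
sample_avgs = [['02', 11.1], ['15', 28.7], ['13', 16.3], ['09', 6.7]]
for row in top_hours(sample_avgs, n=3):
    hour, average = row[0], row[1]
    print('{}:00 : {:.2f} average comments per post'.format(hour, average))
```

Sorting on the average alone also avoids comparing hour strings when two hours happen to share the same average.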
