# Hacker News Analysis

This notebook analyzes the Hacker News dataset to answer two questions:
* Do 'Ask HN' or 'Show HN' posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

In [1]:
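from csv import reader

def open_data(file):
    # Read a CSV file from the local data directory and return its rows as a list of lists.
    # The with-block closes the file once all rows have been read.
    with open('/Users/nstanzione/Documents/EDU/DataQuest/Data/' + file) as opened_file:
        return list(reader(opened_file))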

In [2]:
hn = open_data('HackerNews.csv')

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

## Data Set-up

The steps below simplify the code in later sections: we store the header row separately and remove it from the dataset.

In [4]:
print(hn[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]


In [5]:
headers = hn[0]

In [6]:
hn = hn[1:]

In [7]:
print(hn[:5])

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


## Ask/Show HN Analysis

To answer the first question, we check whether the title of each post begins with 'Ask HN' or 'Show HN'. The steps below separate these posts from the others.

In [26]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts), len(show_posts), len(other_posts))     

9139 10158 273822
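The prefix check used above can be pulled out into a small helper so the rule lives in one place. This is only a sketch: `classify_title` and `sample_titles` are hypothetical names that do not appear in the original notebook, and the sample titles are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles for illustration only (not rows from the dataset).
sample_titles = ['Ask HN: How do you learn?', 'SHOW HN: My project', 'algorithmic music']
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Lowercasing the title before the comparison is what makes variants such as 'ASK HN' or 'ask hn' fall into the same group.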


In [32]:
total_ask_comments = 0
total_show_comments = 0

for row in hn:
    title = row[1]
    num_comments = int(row[4])
    if title.lower().startswith('ask hn'):
        total_ask_comments += num_comments
    elif title.lower().startswith('show hn'):
        total_show_comments += num_comments


avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)
        
print(total_ask_comments, avg_ask_comments)
print(total_show_comments, avg_show_comments)
    

94986 10.393478498741656
49633 4.886099625910612


### Results

As shown above, 'Ask HN' posts attract more activity: they average about 10.4 comments per post, more than twice the average of about 4.9 for 'Show HN' posts. This makes sense because 'Ask HN' posts explicitly invite commentary, while 'Show HN' posts are more informative in nature and not an invitation to respond.
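The comparison can be checked directly from the totals and post counts printed earlier. The snippet below is a small sanity check, not part of the original notebook; the variable names are chosen here for illustration, and the numbers are copied from the outputs above.

```python
# Totals and post counts copied from the printed outputs above.
total_ask, n_ask = 94986, 9139
total_show, n_show = 49633, 10158

avg_ask = total_ask / n_ask      # average comments per 'Ask HN' post
avg_show = total_show / n_show   # average comments per 'Show HN' post
ratio = avg_ask / avg_show       # how many times larger the 'Ask HN' average is

print(round(avg_ask, 2), round(avg_show, 2), round(ratio, 2))
```

The ratio comes out a little above 2, which matches the statement that 'Ask HN' posts average more than twice as many comments.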

## Time of Post Analysis

We will now look at how the time a post is created, particularly the hour of day, relates to the number of comments it receives. In the prior section, we noted that 'Ask HN' posts receive more comments on average, so we will focus on that subset.

In [25]:
import datetime as dt

In [38]:
result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments_dt = row[4]
    result_list.append([created_at,num_comments_dt])
    
print(result_list[:5])

[['9/26/2016 2:53', '7'], ['9/26/2016 1:17', '3'], ['9/25/2016 22:57', '0'], ['9/25/2016 22:48', '3'], ['9/25/2016 21:50', '2']]
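Before grouping by hour, it helps to see how a single `created_at` string is turned into an hour label. The sketch below parses the first timestamp from the output above; `sample`, `parsed`, and `hour_label` are illustrative names, not ones used in the notebook.

```python
import datetime as dt

# First timestamp from the printed result_list above.
sample = '9/26/2016 2:53'

# %m/%d/%Y %H:%M also accepts values without leading zeros when parsing.
parsed = dt.datetime.strptime(sample, '%m/%d/%Y %H:%M')

# strftime('%H') always produces a zero-padded, two-character hour.
hour_label = parsed.strftime('%H')
print(hour_label)  # 02
```

Because the label is zero-padded, hours sort and compare consistently as strings, which is why the dictionary keys that follow look like '02' rather than '2'.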


In [52]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    d = dt.datetime.strptime(row[0],"%m/%d/%Y %H:%M")
    h = d.strftime("%H")
    c = int(row[1])
    if h in counts_by_hour:
        counts_by_hour[h] += 1
        comments_by_hour[h] += c
    else:
        counts_by_hour[h] = 1
        comments_by_hour[h] = c

print(counts_by_hour)
print("\n")
print(comments_by_hour)
    
    

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}


{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


In [54]:
avg_by_hour = []

# Both dictionaries share the same hour keys, so a single loop is enough.
for h in comments_by_hour:
    avg_by_hour.append([h, comments_by_hour[h] / counts_by_hour[h]])
      
print(avg_by_hour)    

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]


In [56]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
print(swap_avg_by_hour)


[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


In [61]:
sorted_swap = sorted(swap_avg_by_hour,reverse=True)
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    avg = float(row[0])
    h2 = dt.datetime.strptime(row[1],"%H")
    hour = h2.strftime("%H")
    print("{h3}: {a:,.2f} average comments per post".format(h3=hour,a=avg))
     


Top 5 Hours for Ask Posts Comments
15: 28.68 average comments per post
13: 16.32 average comments per post
12: 12.38 average comments per post
02: 11.14 average comments per post
10: 10.68 average comments per post
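The swap-then-sort approach above can also be expressed with a sort key, which avoids building a second list. This is an alternative sketch rather than the notebook's method; the `avg_by_hour` values below are a subset copied from the printed output above, and `top` is a name introduced here.

```python
# A subset of the [hour, average] pairs printed earlier.
avg_by_hour = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['12', 12.380116959064328],
    ['10', 10.684397163120567],
    ['09', 6.653153153153153],
]

# Sort on the average (second element) in descending order and keep the top five.
top = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top:
    print("{}: {:,.2f} average comments per post".format(hour, avg))
```

Sorting with `key` leaves each pair in its original `[hour, average]` order, so no swap is needed before or after the sort.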


### Results

Based on this dataset, the best time to create an 'Ask HN' post in order to receive the most comments is around 3pm EST: posts created during the 15:00 hour average 28.68 comments per post.
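Readers in other time zones may want to convert the peak hour. The sketch below assumes, as the result above does, that the timestamps are in EST, and it models EST as a fixed UTC-5 offset (no daylight saving adjustment). The date is arbitrary and only needed to build a `datetime` object.

```python
import datetime as dt

# Assumption: dataset timestamps are EST, modelled here as a fixed UTC-5 offset.
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')

# Arbitrary date; only the 15:00 hour matters for this conversion.
peak = dt.datetime(2016, 1, 15, 15, 0, tzinfo=EST)

# Convert the peak hour to UTC.
peak_utc = peak.astimezone(dt.timezone.utc)
print(peak_utc.strftime('%H:%M'))
```

A fixed offset keeps the example self-contained; handling daylight saving time would require a full time zone database instead.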