## HACKER NEWS ANALYSIS GUIDED PROJECT

Hacker News is a site where users can submit "anything that gratifies one's intellectual curiosity", but "if they'd cover it on TV news, it's probably off-topic". Once users have accumulated a certain number of karma points, they may vote on posts.

[Hacker News Guidelines](https://news.ycombinator.com/newsguidelines.html)

[Wikipedia Page](https://en.wikipedia.org/wiki/Hacker_News)

This project uses a subset of roughly 20,000 rows, reduced from the original 300,000 or so by removing posts with no comments and then randomly sampling the remaining rows.
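The subsetting described above (drop posts with no comments, then sample at random) might look like the following sketch. It is not the code used to prepare the dataset: the function name `sample_commented_rows`, the fixed seed, and the demo rows are all assumptions made for illustration, and the column order is assumed to match the header printed further down.

```python
import random

def sample_commented_rows(rows, k, seed=0):
    """Keep rows with at least one comment, then sample up to k of them."""
    # Assumption: column 4 holds num_comments as a string.
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(commented, min(k, len(commented)))

# Made-up rows: id, title, url, num_points, num_comments, author, created_at
demo_rows = [
    ['1', 'Ask HN: A?', '', '3', '0', 'a', '1/1/2016 10:00'],
    ['2', 'Show HN: B', 'http://example.com', '5', '2', 'b', '1/1/2016 11:00'],
    ['3', 'C', 'http://example.com', '1', '7', 'c', '1/2/2016 12:00'],
    ['4', 'D', 'http://example.com', '9', '0', 'd', '1/3/2016 13:00'],
]
subset = sample_commented_rows(demo_rows, k=2)
print(len(subset))
```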

The goal is to determine:

* which type of post, 'Ask HN' or 'Show HN', receives more comments on average
* whether the time a post was created influences the average number of comments it receives

'Ask HN' posts are where users ask questions to the community.

'Show HN' posts are where users upload something they have created.

In [26]:
from csv import reader

# Read the whole CSV into a list of rows; the with-block closes the file.
with open('hacker_news.csv', newline='') as opened_file:
    hn_data_with_header = list(reader(opened_file))

hn_header = hn_data_with_header[0]  # column names
hn_data = hn_data_with_header[1:]   # data rows
print(hn_header)
print('\n')
print(hn_data[:3])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]


Separate the data into Ask HN posts, Show HN posts, and other posts.

In [27]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn_data:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Number of Ask HN posts:', len(ask_posts))
print('Number of Show HN posts:', len(show_posts))
print('Number of other posts:', len(other_posts))

Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of other posts: 17194


In [28]:
ask_posts[:3]

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14']]

In [4]:
show_posts[:3]

[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46'],
 ['11590768',
  'Show HN: Shanhu.io, a programming playground powered by e8vm',
  'https://shanhu.io',
  '1',
  '1',
  'h8liu',
  '4/28/2016 18:05']]

In [29]:
other_posts[:3]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20']]

Find the total number of comments for each type of post.

In [30]:
def find_total_comments(data):
    total_comments = 0
    for row in data:
        total_comments += int(row[4])
    return total_comments

total_ask_comments = find_total_comments(ask_posts)
total_show_comments = find_total_comments(show_posts)

print('Total comments for Ask HN posts:', total_ask_comments)
print('Total comments for Show HN posts:', total_show_comments)   

Total comments for Ask HN posts: 24483
Total comments for Show HN posts: 11988


Find the average number of comments for each type of post.

In [31]:
print('Average comments for Ask HN posts:', round(total_ask_comments / len(ask_posts), 2))
print('Average comments for Show HN posts:', round(total_show_comments / len(show_posts), 2))

Average comments for Ask HN posts: 14.04
Average comments for Show HN posts: 10.32


On average, Ask HN posts receive more comments than Show HN posts (14.04 versus 10.32), so we'll use Ask HN posts to determine whether the time a post was created influences the number of comments.
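A mean can be pulled upward by a handful of heavily commented posts, so it can help to look at the median alongside it. The sketch below shows the idea on a small made-up list; `summarize_comments` and `demo_counts` are hypothetical names, and the numbers are not taken from the dataset.

```python
from statistics import mean, median

def summarize_comments(counts):
    """Return the mean (rounded to 2 places) and median of a list of counts."""
    return {'mean': round(mean(counts), 2), 'median': median(counts)}

# Made-up comment counts: one very popular post among several quiet ones.
demo_counts = [1, 2, 2, 3, 92]
summary = summarize_comments(demo_counts)
print(summary)
```

In this toy list the single large value dominates the mean while the median stays close to the typical post.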

In [8]:
import datetime as dt

In [33]:
result_list = []
for row in ask_posts:
    number_of_comments = int(row[4])
    result_list.append([row[6], number_of_comments])
result_list[:20]

[['8/16/2016 9:55', 6],
 ['11/22/2015 13:43', 29],
 ['5/2/2016 10:14', 1],
 ['8/2/2016 14:20', 3],
 ['10/15/2015 16:38', 17],
 ['9/26/2015 23:23', 1],
 ['4/22/2016 12:24', 4],
 ['11/16/2015 9:22', 1],
 ['2/24/2016 17:57', 1],
 ['6/4/2016 17:17', 2],
 ['9/19/2015 17:04', 7],
 ['9/22/2015 13:16', 1],
 ['6/21/2016 15:45', 1],
 ['1/13/2016 21:17', 4],
 ['10/4/2015 21:27', 4],
 ['1/25/2016 20:27', 2],
 ['10/27/2015 2:47', 3],
 ['1/19/2016 12:01', 1],
 ['3/22/2016 2:05', 22],
 ['9/8/2015 14:04', 2]]

In [10]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    num_comments = row[1]
    hour_str = row[0]
    hour_dt = dt.datetime.strptime(hour_str, "%m/%d/%Y %H:%M")
    hour = hour_dt.hour
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments

In [11]:
comments_by_hour

{9: 251,
 13: 1253,
 10: 793,
 14: 1416,
 16: 1814,
 23: 543,
 12: 687,
 17: 1146,
 15: 4477,
 21: 1745,
 20: 1722,
 2: 1381,
 18: 1439,
 3: 421,
 5: 464,
 19: 1188,
 1: 683,
 22: 479,
 8: 492,
 4: 337,
 0: 447,
 6: 397,
 7: 267,
 11: 641}

In [32]:
avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
sorted(avg_by_hour)

[[0, 8.127272727272727],
 [1, 11.383333333333333],
 [2, 23.810344827586206],
 [3, 7.796296296296297],
 [4, 7.170212765957447],
 [5, 10.08695652173913],
 [6, 9.022727272727273],
 [7, 7.852941176470588],
 [8, 10.25],
 [9, 5.5777777777777775],
 [10, 13.440677966101696],
 [11, 11.051724137931034],
 [12, 9.41095890410959],
 [13, 14.741176470588234],
 [14, 13.233644859813085],
 [15, 38.5948275862069],
 [16, 16.796296296296298],
 [17, 11.46],
 [18, 13.20183486238532],
 [19, 10.8],
 [20, 21.525],
 [21, 16.009174311926607],
 [22, 6.746478873239437],
 [23, 7.985294117647059]]
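The counting and averaging steps above could also be folded into one function. The sketch below uses `collections.defaultdict` to avoid the explicit `if hour in ...` check; `average_comments_by_hour` is a hypothetical helper, and the demo pairs are three entries copied from `result_list`.

```python
import datetime as dt
from collections import defaultdict

def average_comments_by_hour(pairs):
    """pairs: [created_at string, comment count]; returns {hour: average}."""
    counts = defaultdict(int)  # number of posts created in each hour
    totals = defaultdict(int)  # total comments on posts created in each hour
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
        counts[hour] += 1
        totals[hour] += num_comments
    return {hour: totals[hour] / counts[hour] for hour in sorted(counts)}

demo_pairs = [['8/16/2016 9:55', 6], ['11/16/2015 9:22', 1], ['5/2/2016 10:14', 1]]
averages = average_comments_by_hour(demo_pairs)
print(averages)
```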

In [22]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
swap_avg_by_hour

[[5.5777777777777775, 9],
 [14.741176470588234, 13],
 [13.440677966101696, 10],
 [13.233644859813085, 14],
 [16.796296296296298, 16],
 [7.985294117647059, 23],
 [9.41095890410959, 12],
 [11.46, 17],
 [38.5948275862069, 15],
 [16.009174311926607, 21],
 [21.525, 20],
 [23.810344827586206, 2],
 [13.20183486238532, 18],
 [7.796296296296297, 3],
 [10.08695652173913, 5],
 [10.8, 19],
 [11.383333333333333, 1],
 [6.746478873239437, 22],
 [10.25, 8],
 [7.170212765957447, 4],
 [8.127272727272727, 0],
 [9.022727272727273, 6],
 [7.852941176470588, 7],
 [11.051724137931034, 11]]

In [23]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    hour_str = str(row[1])
    hour_dt = dt.datetime.strptime(hour_str, "%H")
    time = hour_dt.strftime("%H:%M")
    print('{}: {:.2f} average number of comments per Ask HN post'.format(time, row[0]))

Top 5 hours for Ask Posts Comments
15:00: 38.59 average number of comments per Ask HN post
02:00: 23.81 average number of comments per Ask HN post
20:00: 21.52 average number of comments per Ask HN post
16:00: 16.80 average number of comments per Ask HN post
21:00: 16.01 average number of comments per Ask HN post
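The swap-and-sort step above is one way to rank hours by average. An alternative sketch sorts the `[hour, average]` pairs directly with a `key` function, so no swapped copy is needed. `top_hours` is a hypothetical helper, and `demo_avg` holds a few averages from the table above, rounded to two places.

```python
def top_hours(avg_by_hour, n=5):
    """Return the n [hour, average] pairs with the highest averages."""
    return sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:n]

demo_avg = [[9, 5.58], [15, 38.59], [2, 23.81], [20, 21.52]]
for hour, avg in top_hours(demo_avg, n=3):
    print('{:02d}:00: {:.2f} average number of comments per Ask HN post'.format(hour, avg))
```

Formatting the hour with `{:02d}` also avoids the round trip through `strptime` and `strftime`.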


Convert the top hours to London, UK time by adding 8 hours, assuming the dataset's timestamps are in the Pacific Time Zone.

In [25]:
print("Top 5 hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    hour_str = str(row[1])
    hour_dt = dt.datetime.strptime(hour_str, "%H")
    hour_dt += dt.timedelta(hours=8)
    time = hour_dt.strftime("%H:%M")
    
    print('{}: {:.2f} average number of comments per Ask HN post'.format(time, row[0]))

Top 5 hours for Ask Posts Comments
23:00: 38.59 average number of comments per Ask HN post
10:00: 23.81 average number of comments per Ask HN post
04:00: 21.52 average number of comments per Ask HN post
00:00: 16.80 average number of comments per Ask HN post
05:00: 16.01 average number of comments per Ask HN post
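Because only the hour of the day matters here, the same 8-hour shift can be sketched with modular arithmetic instead of `timedelta`. `shift_hour` is a hypothetical helper; like the cell above, it assumes a fixed whole-hour offset and ignores daylight-saving changes, so treat the result as approximate.

```python
def shift_hour(hour, offset_hours):
    """Shift an hour of the day by a whole-hour offset, wrapping past midnight."""
    return (hour + offset_hours) % 24

# The five top hours from the ranking above, shifted by the assumed 8-hour offset.
for hour in [15, 2, 20, 16, 21]:
    print('{:02d}:00 -> {:02d}:00'.format(hour, shift_hour(hour, 8)))
```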


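In this sample, Ask HN posts created at 15:00 in the dataset's time zone, which is 23:00 (11pm) in the UK under the conversion above, receive the most comments on average (38.59 per post). Posting around 11pm UK time may therefore give an Ask HN post the best chance of attracting comments.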