# EXPLORING HACKER NEWS POSTS
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted posts are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that reach the top of its listings can get hundreds of thousands of visitors as a result.

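In this notebook we use Python to analyze Hacker News posts. We compare how many comments Ask HN and Show HN posts receive on average, then check whether Ask HN posts created at a particular hour attract more comments. Let's start!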

In [3]:
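from csv import reader

# Read the whole CSV into a list of rows, then close the file.
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
opened_file.close()

# Split the header row from the data rows.
hn_header = hn[0]
hn = hn[1:]
len(hn)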

20100
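
The loading step above can also be written with a context manager, so the file is closed even if reading fails partway. The following is a minimal sketch rather than the notebook's own code: the helper name `load_rows` is hypothetical, and `encoding='utf8'` is an assumption about how the file was saved.

```python
from csv import reader


def load_rows(path, encoding='utf8'):
    """Return (header, rows) for a CSV file. Hypothetical helper."""
    # newline='' is what the csv module documentation recommends
    # when passing a file object to csv.reader.
    with open(path, newline='', encoding=encoding) as f:
        rows = list(reader(f))
    # The first row is the header; everything after it is data.
    return rows[0], rows[1:]
```

Assuming the file sits next to the notebook, usage would look like `hn_header, hn = load_rows('hacker_news.csv')`.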

In [5]:
def explore_data(dataset, start, end):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)

print(hn_header)
print('\n')
explore_data(hn, 0, 4)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


In [6]:
import re
ask_posts = []
show_posts = []
other_posts = []

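# Both patterns are matched case-insensitively below (re.I).
ask_patt = r'^Ask HN'   # anchored: the title must start with "Ask HN"
show_patt = r'Show HN'  # not anchored: matches "Show HN" anywhere in the title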

for row in hn:
    title = row[1]
    match1 = re.search(ask_patt, title, re.I)
    match2 = re.search(show_patt, title, re.I)
    if match1:
        ask_posts.append(row)
    elif match2:
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Num Ask: ',len(ask_posts)) 
print('Num Show: ',len(show_posts)) 
print('Num other: ',len(other_posts))

Num Ask:  1744
Num Show:  1165
Num other:  17191
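
The same three-way split can be sketched without regular expressions by lowercasing the title and checking its prefix. This is an illustrative alternative, not the method used above, and the function name `classify_title` is hypothetical. Because it only treats a title as a Show HN post when the title starts with "Show HN", its counts may differ slightly from the ones printed above, where `show_patt` has no `^` anchor.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles, for illustration only.
sample_titles = [
    'Ask HN: How do you learn Python?',
    'show hn: My weekend project',
    'Interactive Dynamic Video',
    'Why I stopped posting Show HN threads',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

The last sample title contains "Show HN" in the middle, so the prefix check labels it `'other'`, whereas the unanchored regular expression would count it as a Show HN post.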


In [7]:
ask_comments = [int(row[4]) for row in ask_posts]
show_comments = [int(row[4]) for row in show_posts]

avg_ask_comments = sum(ask_comments)/len(ask_comments)
avg_show_comments = sum(show_comments)/len(show_comments)

print('Average comments per Ask HN post:', avg_ask_comments)
print('Average comments per Show HN post:', avg_show_comments)


Average comments per Ask HN post: 14.038417431192661
Average comments per Show HN post: 10.302145922746782


In [8]:
import datetime as dt

result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

counts_by_hour = {}
comments_by_hour = {}

dformat = '%m/%d/%Y %H:%M'

for created_at, num_comments in result_list:
    # Parse the timestamp and keep only the two-digit hour, e.g. '15'.
    hour = dt.datetime.strptime(created_at, dformat).strftime('%H')
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments


In [9]:
avg_by_hour = [[h, comments_by_hour[h]/counts_by_hour[h]] for h in counts_by_hour]

print(avg_by_hour)

[['10', 13.440677966101696], ['05', 10.08695652173913], ['03', 7.796296296296297], ['20', 21.525], ['18', 13.20183486238532], ['16', 16.796296296296298], ['00', 8.127272727272727], ['19', 10.8], ['13', 14.741176470588234], ['21', 16.009174311926607], ['14', 13.233644859813085], ['17', 11.46], ['06', 9.022727272727273], ['11', 11.051724137931034], ['22', 6.746478873239437], ['12', 9.41095890410959], ['02', 23.810344827586206], ['04', 7.170212765957447], ['08', 10.25], ['15', 38.5948275862069], ['01', 11.383333333333333], ['07', 7.852941176470588], ['09', 5.5777777777777775], ['23', 7.985294117647059]]
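
The counting and averaging steps above can be sketched as a single function that uses `collections.defaultdict`, which removes the need for the `if`/`else` branch. The function name `average_comments_by_hour` is hypothetical, and the input pairs below are made-up values in the same `[created_at, num_comments]` shape as `result_list`.

```python
import datetime as dt
from collections import defaultdict


def average_comments_by_hour(pairs, dformat='%m/%d/%Y %H:%M'):
    """Map each two-digit hour string to its average comment count."""
    counts = defaultdict(int)    # number of posts per hour
    comments = defaultdict(int)  # total comments per hour
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, dformat).strftime('%H')
        counts[hour] += 1
        comments[hour] += num_comments
    return {hour: comments[hour] / counts[hour] for hour in counts}


# Made-up sample, for illustration only.
sample_pairs = [
    ['8/4/2016 11:52', 52],
    ['1/26/2016 11:30', 10],
    ['6/23/2016 22:20', 1],
]
sample_avg = average_comments_by_hour(sample_pairs)
print(sample_avg)
```

Returning a dictionary instead of a list of lists is a design choice for this sketch; `list(sample_avg.items())` gives hour and average pairs comparable to `avg_by_hour`.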


In [10]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)

[[13.440677966101696, '10'], [10.08695652173913, '05'], [7.796296296296297, '03'], [21.525, '20'], [13.20183486238532, '18'], [16.796296296296298, '16'], [8.127272727272727, '00'], [10.8, '19'], [14.741176470588234, '13'], [16.009174311926607, '21'], [13.233644859813085, '14'], [11.46, '17'], [9.022727272727273, '06'], [11.051724137931034, '11'], [6.746478873239437, '22'], [9.41095890410959, '12'], [23.810344827586206, '02'], [7.170212765957447, '04'], [10.25, '08'], [38.5948275862069, '15'], [11.383333333333333, '01'], [7.852941176470588, '07'], [5.5777777777777775, '09'], [7.985294117647059, '23']]


In [102]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

for each in sorted_swap:
    avg_com = each[0]
    avg_com = float('{0:.2f}'.format(avg_com))
    hour = each[1]
    hour = dt.datetime.strptime(hour, '%H')
    hour = dt.datetime.strftime(hour, '%H:%M')
    print(hour, avg_com, 'average comments per post')
    

15:00 38.59 average comments per post
02:00 23.81 average comments per post
20:00 21.52 average comments per post
16:00 16.8 average comments per post
21:00 16.01 average comments per post
13:00 14.74 average comments per post
10:00 13.44 average comments per post
14:00 13.23 average comments per post
18:00 13.2 average comments per post
17:00 11.46 average comments per post
01:00 11.38 average comments per post
11:00 11.05 average comments per post
19:00 10.8 average comments per post
08:00 10.25 average comments per post
05:00 10.09 average comments per post
12:00 9.41 average comments per post
06:00 9.02 average comments per post
00:00 8.13 average comments per post
23:00 7.99 average comments per post
07:00 7.85 average comments per post
03:00 7.8 average comments per post
04:00 7.17 average comments per post
22:00 6.75 average comments per post
09:00 5.58 average comments per post
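
The swap-then-sort steps above can also be sketched with the `key` argument of `sorted`, which ranks the `[hour, average]` pairs directly and makes the swapped copy unnecessary. The function name `top_hours` and the parameter `n` are hypothetical; the sample values are copied from the `avg_by_hour` output above.

```python
def top_hours(avg_by_hour, n=5):
    """Return formatted lines for the n hours with the highest averages."""
    # Sort on the second element (the average), highest first.
    ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
    return [
        '{}:00 {:.2f} average comments per post'.format(hour, avg)
        for hour, avg in ranked[:n]
    ]


# A few pairs taken from the avg_by_hour output above.
sample = [
    ['10', 13.440677966101696],
    ['02', 23.810344827586206],
    ['15', 38.5948275862069],
    ['09', 5.5777777777777775],
]
for line in top_hours(sample, n=3):
    print(line)
```

Note that `'{:.2f}'` always keeps two decimal places, so an average of 16.8 would be shown as `16.80` here rather than `16.8` as in the output above.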
