## Playing with Data from Hacker News

### Goals

* Determine whether posts whose titles start with **`Ask HN`** or **`Show HN`** receive more comments on average.
* Examine whether posts created at a certain time of day receive more comments on average.

### Description of the Data Set

* [0] **id:** The unique identifier from Hacker News for the post;
* [1] **title:** The title of the post;
* [2] **url:** The URL that the post links to, if the post has a URL;
* [3] **num_points:** The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes;
* [4] **num_comments:** The number of comments that were made on the post;
* [5] **author:** The username of the person who submitted the post;
* [6] **created_at:** The date and time at which the post was submitted;
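
Because `csv.reader` yields each row as a plain list of strings, the bracketed numbers above are the list indexes used throughout the analysis. A minimal sketch of reading fields by position, using the first data row that this notebook prints; the variable names `row` and `record` are illustrative only:

```python
# Column order as described in the list above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# First data row of hacker_news.csv, as printed in this notebook.
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/',
       '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Every field is a string, so numeric columns need an explicit int().
num_points = int(row[3])
num_comments = int(row[4])

# Pairing names with values makes a single row easier to inspect.
record = dict(zip(headers, row))
print(record['title'], num_points, num_comments)
```

Indexing by position is what the cells in this notebook do; the `dict(zip(...))` form is just a convenience for inspecting one row by column name.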

In [1]:
from csv import reader

# Read the whole CSV into a list of rows; the context manager closes the file.
with open('hacker_news.csv') as f:
    hn = list(reader(f))

print(len(hn))
for row in hn[:5]:
    print(row, "\n")

20101
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 



>Removing the header row.

In [2]:
headers = hn[0]
del hn[0]
print(headers, '\n')
for row in hn[:5]:
    print(row, '\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'] 



### Extracting "Ask HN" and "Show HN" Posts

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts)+len(show_posts)+ len(other_posts))

20100
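
The prefix test can also be pulled out into a small function so that it can be checked on its own. This is a sketch only: `classify` is a hypothetical helper, not part of the notebook, and the first two sample titles are made up for illustration.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # makes the prefix match case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# The first two titles are invented; the third appears in the data set.
samples = ['Ask HN: What are you reading?',
           'SHOW HN: My side project',
           'Interactive Dynamic Video']
labels = [classify(t) for t in samples]
print(labels)
```

Lowercasing before the comparison is what lets titles such as `SHOW HN: ...` land in the right group.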


### Calculating the Average Number of Comments for Ask HN and Show HN Posts

In [4]:
total_ask_comments = 0
for i in ask_posts:
    total_ask_comments += int(i[4])
avg_ask_comments = total_ask_comments/len(ask_posts)
total_show_comments = 0
for i in show_posts:
    total_show_comments += int(i[4])
avg_show_comments = total_show_comments/len(show_posts)
print('Total Ask Comments = :',total_ask_comments)
print('Total Show comments = :',  total_show_comments)
print('Average Ask Comments = :',avg_ask_comments) 
print('Average Show Comments = :',avg_show_comments)

Total Ask Comments = : 24483
Total Show comments = : 11988
Average Ask Comments = : 14.038417431192661
Average Show Comments = : 10.31669535283993


>The results above show that 'Ask' posts receive on average about 36% more comments than 'Show' posts (14.04 versus 10.32 comments per post).
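
The size of the gap can be computed directly from the two averages printed above. A quick check, with the values copied from that output:

```python
avg_ask_comments = 14.038417431192661   # copied from the output above
avg_show_comments = 10.31669535283993   # copied from the output above

# Relative difference: how much larger the Ask average is, in percent.
pct_more = (avg_ask_comments / avg_show_comments - 1) * 100
print('Ask posts receive {:.1f}% more comments on average'.format(pct_more))
```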

### Finding the Number of Ask Posts and Comments by Hour Created

In [5]:
import datetime as dt

hour_freq = {}      # number of Ask HN posts created in each hour
comments_freq = {}  # total comments on Ask HN posts created in each hour
for row in ask_posts:
    dt_obj = dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M')
    hour_obj = dt_obj.strftime('%H')  # zero-padded hour string, e.g. '09'
    if hour_obj in hour_freq:
        hour_freq[hour_obj] += 1
        comments_freq[hour_obj] += int(row[4])
    else:
        hour_freq[hour_obj] = 1
        comments_freq[hour_obj] = int(row[4])
for i in sorted(hour_freq.items()):
    print('Time:', i[0], 'Posts: ', i[1], 'Comments:', comments_freq[i[0]])

Time: 00 Posts:  55 Comments: 447
Time: 01 Posts:  60 Comments: 683
Time: 02 Posts:  58 Comments: 1381
Time: 03 Posts:  54 Comments: 421
Time: 04 Posts:  47 Comments: 337
Time: 05 Posts:  46 Comments: 464
Time: 06 Posts:  44 Comments: 397
Time: 07 Posts:  34 Comments: 267
Time: 08 Posts:  48 Comments: 492
Time: 09 Posts:  45 Comments: 251
Time: 10 Posts:  59 Comments: 793
Time: 11 Posts:  58 Comments: 641
Time: 12 Posts:  73 Comments: 687
Time: 13 Posts:  85 Comments: 1253
Time: 14 Posts:  107 Comments: 1416
Time: 15 Posts:  116 Comments: 4477
Time: 16 Posts:  108 Comments: 1814
Time: 17 Posts:  100 Comments: 1146
Time: 18 Posts:  109 Comments: 1439
Time: 19 Posts:  110 Comments: 1188
Time: 20 Posts:  80 Comments: 1722
Time: 21 Posts:  109 Comments: 1745
Time: 22 Posts:  71 Comments: 479
Time: 23 Posts:  68 Comments: 543
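
The same two tallies can be built without the `if`/`else` branch by using `collections.defaultdict`. The sketch below is an alternative, not the approach the notebook takes: `tally_by_hour` is a hypothetical helper and the three rows are made up to show two posts sharing an hour.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(rows):
    """Return (posts_per_hour, comments_per_hour), keyed by 'HH' strings."""
    posts = defaultdict(int)     # missing keys start at 0
    comments = defaultdict(int)
    for row in rows:
        created = dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M')
        hour = created.strftime('%H')
        posts[hour] += 1
        comments[hour] += int(row[4])
    return dict(posts), dict(comments)

# Made-up rows in the data set's column order, for illustration only.
rows = [
    ['1', 'Ask HN: example A', '', '5', '10', 'user_a', '8/4/2016 15:05'],
    ['2', 'Ask HN: example B', '', '3', '4', 'user_b', '8/5/2016 15:40'],
    ['3', 'Ask HN: example C', '', '1', '2', 'user_c', '8/5/2016 2:10'],
]
posts, comments = tally_by_hour(rows)
print(posts)
print(comments)
```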


### Calculating the Average Number of Comments for Ask HN Posts by Hour

In [6]:
avg_comments = {}
for i in hour_freq:
    # round() with a single argument returns the nearest integer
    avg_comments[i] = round(comments_freq[i] / hour_freq[i])
for i in sorted(avg_comments.items()):
    print(i)

('00', 8)
('01', 11)
('02', 24)
('03', 8)
('04', 7)
('05', 10)
('06', 9)
('07', 8)
('08', 10)
('09', 6)
('10', 13)
('11', 11)
('12', 9)
('13', 15)
('14', 13)
('15', 39)
('16', 17)
('17', 11)
('18', 13)
('19', 11)
('20', 22)
('21', 16)
('22', 7)
('23', 8)


>The same result, but using a list of lists instead of a dictionary to store the average number of comments per hour (left unrounded this time):

In [7]:
avg_com = []
for i in hour_freq:
    avg_c = comments_freq[i]/hour_freq[i]
    avg_com.append([i, avg_c])
for i in sorted(avg_com):
    print(i)

['00', 8.127272727272727]
['01', 11.383333333333333]
['02', 23.810344827586206]
['03', 7.796296296296297]
['04', 7.170212765957447]
['05', 10.08695652173913]
['06', 9.022727272727273]
['07', 7.852941176470588]
['08', 10.25]
['09', 5.5777777777777775]
['10', 13.440677966101696]
['11', 11.051724137931034]
['12', 9.41095890410959]
['13', 14.741176470588234]
['14', 13.233644859813085]
['15', 38.5948275862069]
['16', 16.796296296296298]
['17', 11.46]
['18', 13.20183486238532]
['19', 10.8]
['20', 21.525]
['21', 16.009174311926607]
['22', 6.746478873239437]
['23', 7.985294117647059]


### Sorting and Printing Values from a List of Lists

In [8]:
avg_com_swapped = []
for i in avg_com:
    swap_list = [i[1], i[0]]
    avg_com_swapped.append(swap_list)
for i in sorted(avg_com_swapped, reverse=True):
    hour = dt.datetime.strptime(i[1], '%H')
    # %Z is empty for a naive datetime, so this yields e.g. '15:00 '
    str_hour = hour.strftime('%H:%M %Z')
    print('{1}: {0:.2f} average comments per post'.format(i[0], str_hour))

15:00 : 38.59 average comments per post
02:00 : 23.81 average comments per post
20:00 : 21.52 average comments per post
16:00 : 16.80 average comments per post
21:00 : 16.01 average comments per post
13:00 : 14.74 average comments per post
10:00 : 13.44 average comments per post
14:00 : 13.23 average comments per post
18:00 : 13.20 average comments per post
17:00 : 11.46 average comments per post
01:00 : 11.38 average comments per post
11:00 : 11.05 average comments per post
19:00 : 10.80 average comments per post
08:00 : 10.25 average comments per post
05:00 : 10.09 average comments per post
12:00 : 9.41 average comments per post
06:00 : 9.02 average comments per post
00:00 : 8.13 average comments per post
23:00 : 7.99 average comments per post
07:00 : 7.85 average comments per post
03:00 : 7.80 average comments per post
04:00 : 7.17 average comments per post
22:00 : 6.75 average comments per post
09:00 : 5.58 average comments per post
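
Swapping the columns is one way to sort by the average. `sorted` also accepts a `key` argument, which sorts on the second element directly and leaves the `[hour, average]` layout untouched. A minimal sketch of that alternative, using four pairs copied from the `In [7]` output above:

```python
from operator import itemgetter

# A few [hour, average] pairs copied from the In [7] output.
avg_com = [['00', 8.127272727272727], ['02', 23.810344827586206],
           ['15', 38.5948275862069], ['20', 21.525]]

# Sort on the second element (the average), largest first.
ranked = sorted(avg_com, key=itemgetter(1), reverse=True)
for hour, avg in ranked:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```

One difference from the swap approach: when two hours have exactly the same average, the swapped lists fall back to comparing the hour strings, whereas the `key` version keeps tied entries in their original order.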


>Results show that, if you live on the East coast of the US, your best chance of attracting a high number of comments on an Ask HN post is to publish during the 15:00 hour (3 pm to 4 pm). On average, the number of comments per post created during this hour far exceeds that of every other hour.

>Does this mean that towards the end of the working day people look for distractions to pass the time until 5pm?

>Or maybe this is the time when people in Europe and Asia also become active so the numbers jump?

>Given the available data we can only speculate...

### Playing a Little Bit More:

#### Determine if show or ask posts receive more points on average.

In [9]:
# checking ask_posts avg number of points per post
ask_total_points = 0
for i in ask_posts:
    ask_total_points += int(i[3])
    
ask_avg_points = ask_total_points/len(ask_posts)
print(ask_avg_points)

15.061926605504587


In [10]:
# checking show_posts avg number of points per post
show_total_points = 0
for i in show_posts:
    show_total_points += int(i[3])
    
show_avg_points = show_total_points/len(show_posts)
print(show_avg_points)

27.555077452667813


>On average, Show posts receive substantially more points than Ask posts (about 27.6 versus 15.1 points per post).
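
The comment and point averages all follow the same pattern: sum one integer column and divide by the number of posts. A small helper can capture that once. This is a sketch: `average_column` is a hypothetical name, and the two rows are copied from the sample output near the top of this notebook just to have something to run on.

```python
def average_column(posts, col_ind):
    """Mean of an integer column across posts; None for an empty list."""
    if not posts:
        return None  # avoids ZeroDivisionError when a group is empty
    return sum(int(row[col_ind]) for row in posts) / len(posts)

# Two rows copied from the sample output at the top of this notebook.
posts = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise',
     'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
     '3', '1', 'hswarna', '6/17/2016 0:01'],
]
avg_points = average_column(posts, 3)    # num_points
avg_comments = average_column(posts, 4)  # num_comments
print(avg_points, avg_comments)
```

With such a helper, the Ask and Show averages would each become a single call, for example `average_column(ask_posts, 3)`.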

#### Determine if posts created at a certain time are more likely to receive more points. 
#### Compare your results to the average number of comments and points other posts receive.


In [16]:
# function returning frequency distributions per hour of 
# number of posts and total points
def freq(d_set, col_ind):
    dic_num = {}
    dic_sum = {}
    for row in d_set:
        date_time = dt.datetime.strptime(row[6],'%m/%d/%Y %H:%M')
        hour = date_time.strftime('%H')
        if hour in dic_num:
            dic_num[hour] += 1
        else:
            dic_num[hour] = 1
        
        if hour in dic_sum:
            dic_sum[hour] += int(row[col_ind])
        else:
            dic_sum[hour] = int(row[col_ind])
    return (dic_num, dic_sum)

In [22]:
ask = freq(ask_posts, 3)
show = freq(show_posts, 3)
other = freq(other_posts, 3)


In [23]:
#function to find the frequency distribution for average points
def avg_points(tup):
    freq_num = tup[0]
    freq_sum = tup[1]
    freq_avg = {}
    for hour in freq_sum:
        freq_avg[hour] = freq_sum[hour]/freq_num[hour]
    return freq_avg

In [33]:
ask_avg_points = avg_points(ask)
show_avg_points = avg_points(show)
other_avg_points = avg_points(other)
#print(sorted(other_avg_points.items()))

In [37]:
#function to sort the avg nums in descending order
def sort_des(dic):
    list_swap = []
    for i in dic:
        list_swap.append([dic[i], i])
    return sorted(list_swap, reverse = True)

In [42]:
ask_sorted = sort_des(ask_avg_points)
show_sorted = sort_des(show_avg_points)
other_sorted = sort_des(other_avg_points)
#print(other_sorted)

In [80]:
# function to print the sorted list of lists in a 
# more legible format
def print_res(li_of_lists, n):
    for i in li_of_lists[:n]:
        st = '{0:.2f} number of points on avg for the {1} o\'clock period'.format(i[0], i[1])
        print(st)
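
The four helper functions above form a pipeline: count and sum per hour, divide, sort, print. For comparison, the same steps can be folded into a single function. This is a condensed sketch under the same assumptions about column layout and date format: `top_hours` is a hypothetical name and the rows are made up.

```python
import datetime as dt

def top_hours(rows, col_ind, n):
    """Return up to n (average, 'HH') pairs, highest average first."""
    counts, totals = {}, {}
    for row in rows:
        hour = dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M').strftime('%H')
        counts[hour] = counts.get(hour, 0) + 1
        totals[hour] = totals.get(hour, 0) + int(row[col_ind])
    averages = [(totals[h] / counts[h], h) for h in counts]
    return sorted(averages, reverse=True)[:n]

# Made-up rows in the data set's column order, for illustration only.
rows = [
    ['1', 'example A', '', '10', '0', 'user_a', '1/2/2016 13:15'],
    ['2', 'example B', '', '20', '0', 'user_b', '1/3/2016 13:45'],
    ['3', 'example C', '', '40', '0', 'user_c', '1/3/2016 9:05'],
]
result = top_hours(rows, 3, 2)
for avg, hour in result:
    print('{:.2f} points on average for the {}:00 hour'.format(avg, hour))
```

Keeping the steps as separate functions, as the notebook does, makes each intermediate result available for inspection; the single function is shorter but hides them.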

##### Top 10 average points per hour for Ask posts

In [84]:
print_res(ask_sorted, 10)

29.99 number of points on avg for the 15 o'clock period
24.26 number of points on avg for the 13 o'clock period
23.35 number of points on avg for the 16 o'clock period
19.41 number of points on avg for the 17 o'clock period
18.68 number of points on avg for the 10 o'clock period
15.97 number of points on avg for the 18 o'clock period
15.79 number of points on avg for the 21 o'clock period
14.39 number of points on avg for the 20 o'clock period
14.22 number of points on avg for the 11 o'clock period
13.75 number of points on avg for the 19 o'clock period


##### Top 10 average points per hour for Show posts

In [85]:
print_res(show_sorted, 10)

42.39 number of points on avg for the 23 o'clock period
41.69 number of points on avg for the 12 o'clock period
40.35 number of points on avg for the 22 o'clock period
37.84 number of points on avg for the 00 o'clock period
36.31 number of points on avg for the 18 o'clock period
33.64 number of points on avg for the 11 o'clock period
30.95 number of points on avg for the 19 o'clock period
30.32 number of points on avg for the 20 o'clock period
28.56 number of points on avg for the 15 o'clock period
28.32 number of points on avg for the 16 o'clock period


##### Top 10 average points per hour for Other posts

In [87]:
print_res(other_sorted, 10)

62.53 number of points on avg for the 13 o'clock period
61.79 number of points on avg for the 14 o'clock period
60.54 number of points on avg for the 15 o'clock period
60.48 number of points on avg for the 10 o'clock period
60.01 number of points on avg for the 19 o'clock period
58.47 number of points on avg for the 02 o'clock period
58.46 number of points on avg for the 00 o'clock period
57.98 number of points on avg for the 17 o'clock period
57.57 number of points on avg for the 11 o'clock period
57.40 number of points on avg for the 12 o'clock period


>No discernible patterns. And rightly so: people's positive or negative assessments don't seem to depend on the time of day (or night, for that matter).