# NEXT BIG POST of HackerNews

In this project I analyse the trends and features of HackerNews posts. HackerNews is the popular ICT community blog. This dataset contains 12 months of posts from September 2015 to September 2016. 

Dataset is cleaned and reduced because we need only posts with attributes such as - having url, having comments, ahaving upvotes and downvotes and so on. 

I will analyse posts having certain properties and will try to suggest the next successful post.

In [38]:
opened_hn = open('hacker_news.csv')
from csv import reader
read_file = reader(opened_hn)
hn = list(read_file)

In [39]:
hn[0:5] #displaying first 5 rows

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [40]:
headers = hn[0]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [41]:
hn = hn[1:]

In [42]:
print(hn[0:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [43]:
# I am spliting the posts into three types - ask hn, show hn and other. 
ask_posts = []
show_posts = []
other_posts = [] 

In [44]:
for row in hn:
    p_title = row[1]
    p_title = p_title.lower()
    if p_title.startswith('ask hn'):
        ask_posts.append(row)
    elif p_title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
# And I am interested in quantity of ask and show posts

print('ASK:', len(ask_posts))
print('SHOW:', len(show_posts))
print('OTHER:', len(other_posts))

ASK: 1744
SHOW: 1162
OTHER: 17194


In [45]:
# Now i will calculate the comments differences between post types
# ASK posts comments
total_ask_comments = 0

for p_row in ask_posts:
    p_comments = int(p_row[4])
    total_ask_comments += p_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [46]:
# SHOW posts comments

total_show_comments = 0

for p_row in show_posts:
    p_comments = int(p_row[4])
    total_show_comments += p_comments

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)




10.31669535283993


As we can see, the posts/topics that are raising the question for the community (Ask HN) receive more comments on average. They receive 14 comments per post compared to the show type of topics/posts. 

This shows us that community tends to react more to unsolved problems rather then solved problems (Show HN posts are about news or solutions). This perfectly represents the ICT sector community as creators. 

In [47]:
# First let's take a look maybe other type posts have bigger average comment
#count, this will be useful if i want to prognose next big post.
total_other_comments = 0

for p_row in other_posts:
    p_comments = int(p_row[4])
    total_other_comments += p_comments

avg_other_comments = total_other_comments / len(other_posts)
print(avg_other_comments)


26.8730371059672


In [48]:
# and yes of course they have! i will include this later. 
# From now on i will continue with only ASK HN posts because 
# guided project is for only analysing comments

In [49]:
import datetime as dt

In [50]:
result_list = [] #here i create a list with post gull date and post comment count
for row in ask_posts:
    created_at = row[6]
    comments_num = int(row[4])
    cre_and_com = [created_at, comments_num]
    result_list.append(cre_and_com)

counts_by_hour = {}
comments_by_hour = {} 
#here i convert date to dt object, then i extraxt the hour as string
#then i put to dictionary by that string - the post count by that hour
#and the comment count by that hour
for row in result_list: 
    date = row[0]
    p_comments = row[1]
    parse_rule = ("%m/%d/%Y %H:%M")
    dt_date = dt.datetime.strptime(date, parse_rule)
    post_hour = dt.datetime.strftime(dt_date, '%H')
    if post_hour not in counts_by_hour:
        counts_by_hour[post_hour] = 1
        comments_by_hour[post_hour] = p_comments
    if post_hour in counts_by_hour:
        counts_by_hour[post_hour] += 1
        comments_by_hour[post_hour] += p_comments

In [51]:
counts_by_hour

{'00': 56,
 '01': 61,
 '02': 59,
 '03': 55,
 '04': 48,
 '05': 47,
 '06': 45,
 '07': 35,
 '08': 49,
 '09': 46,
 '10': 60,
 '11': 59,
 '12': 74,
 '13': 86,
 '14': 108,
 '15': 117,
 '16': 109,
 '17': 101,
 '18': 110,
 '19': 111,
 '20': 81,
 '21': 110,
 '22': 72,
 '23': 69}

In [52]:
comments_by_hour

{'00': 457,
 '01': 716,
 '02': 1384,
 '03': 422,
 '04': 340,
 '05': 493,
 '06': 398,
 '07': 269,
 '08': 497,
 '09': 257,
 '10': 794,
 '11': 643,
 '12': 691,
 '13': 1282,
 '14': 1419,
 '15': 4478,
 '16': 1831,
 '17': 1147,
 '18': 1441,
 '19': 1191,
 '20': 1724,
 '21': 1749,
 '22': 481,
 '23': 544}

In [53]:
#i was looking for posts per hour and comments per same hour because i wanted
#to find the average. so here is the average comments per one post per hour. 
avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, (comments_by_hour[hour] / counts_by_hour[hour])])

In [54]:
avg_by_hour

[['00', 8.160714285714286],
 ['02', 23.45762711864407],
 ['05', 10.48936170212766],
 ['22', 6.680555555555555],
 ['07', 7.685714285714286],
 ['17', 11.356435643564357],
 ['19', 10.72972972972973],
 ['13', 14.906976744186046],
 ['15', 38.27350427350427],
 ['04', 7.083333333333333],
 ['18', 13.1],
 ['10', 13.233333333333333],
 ['14', 13.13888888888889],
 ['16', 16.798165137614678],
 ['20', 21.28395061728395],
 ['21', 15.9],
 ['11', 10.898305084745763],
 ['12', 9.337837837837839],
 ['08', 10.142857142857142],
 ['06', 8.844444444444445],
 ['23', 7.884057971014493],
 ['09', 5.586956521739131],
 ['03', 7.672727272727273],
 ['01', 11.737704918032787]]

In [56]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)

[[8.160714285714286, '00'], [23.45762711864407, '02'], [10.48936170212766, '05'], [6.680555555555555, '22'], [7.685714285714286, '07'], [11.356435643564357, '17'], [10.72972972972973, '19'], [14.906976744186046, '13'], [38.27350427350427, '15'], [7.083333333333333, '04'], [13.1, '18'], [13.233333333333333, '10'], [13.13888888888889, '14'], [16.798165137614678, '16'], [21.28395061728395, '20'], [15.9, '21'], [10.898305084745763, '11'], [9.337837837837839, '12'], [10.142857142857142, '08'], [8.844444444444445, '06'], [7.884057971014493, '23'], [5.586956521739131, '09'], [7.672727272727273, '03'], [11.737704918032787, '01']]


In [57]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

In [58]:
sorted_swap[0:6]

[[38.27350427350427, '15'],
 [23.45762711864407, '02'],
 [21.28395061728395, '20'],
 [16.798165137614678, '16'],
 [15.9, '21'],
 [14.906976744186046, '13']]

In [59]:
print('Top 5 hours for Ask posts Comments')

Top 5 hours for Ask posts Comments


In [65]:
for row in sorted_swap[0:6]:
    hour = row[1]
    avg_comments = row[0]
    hour = dt.datetime.strptime(hour, '%H')
    hour = dt.datetime.strftime(hour, '%H:%M')
    row[1] = hour
    print('{}: {:.2f} average comments per post'.format(hour, avg_comments))

15:00: 38.27 average comments per post
02:00: 23.46 average comments per post
20:00: 21.28 average comments per post
16:00: 16.80 average comments per post
21:00: 15.90 average comments per post
13:00: 14.91 average comments per post


# Conclusions

Next big post on hacker news should be tagged with ASK HN tag and It should be posted between 15:00 and 20:00 to acttract more audience 