# Exploring Hacker News Posts



Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, much like on Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted
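Each row is read as a plain list, so the columns above are reached by position. As a minimal sketch (the `col` mapping and `sample_row` names are illustrative and not part of the original notebook), a name-to-index lookup can make later indexing such as `row[4]` easier to follow:

```python
# Column names in the order they appear in the CSV header.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper mapping: column name -> list position.
col = {name: index for index, name in enumerate(headers)}

# One row taken from the data set, stored as strings just as csv.reader returns it.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Numeric columns arrive as text and need an explicit conversion.
num_comments = int(sample_row[col['num_comments']])

print(col['title'], col['num_comments'], col['created_at'])  # 1 4 6
print(num_comments)                                          # 52
```

The rest of the analysis uses these same positions directly: index 1 for the title, 4 for the comment count, and 6 for the creation time.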

In [1]:
from csv import reader 

In [2]:
# Open the file in a with-block so it is closed once the rows are read
with open('hacker_news.csv', newline='') as opened_file:
    hn = list(reader(opened_file))

headers = hn[0]    # separate the header row from the data rows
hn = hn[1:]
hn[0:4]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

The header row is now separated from the rest of the data.

In [3]:
headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [4]:
hn[0:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

### Separating the data into Ask HN, Show HN, and other posts

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question, and Show HN posts to show the community a project, product, or just generally something interesting.

We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

### First, separate the Ask HN and Show HN posts by checking whether each lowercased title starts with 'ask hn' or 'show hn'

In [5]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title_lower = title.lower()    # lowercase so the prefix check is case-insensitive

    if title_lower.startswith('ask hn'):
        ask_posts.append(row)
    elif title_lower.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('ask posts: ' + str(len(ask_posts)),
      'show posts: ' + str(len(show_posts)),
      'other posts: ' + str(len(other_posts)))


ask posts: 1744 show posts: 1162 other posts: 17194


### There are 1,744 ask posts, 1,162 show posts, and 17,194 other posts
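The prefix check can also be wrapped in a small function, which makes it easy to try on individual titles. This is only a sketch: `classify_title` is a hypothetical helper, and the second example title is invented to show that the match ignores case.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

examples = [
    'Ask HN: How to improve my personal website?',  # from the data set
    'SHOW HN: A hypothetical side project',          # invented title, upper-case prefix
    'Interactive Dynamic Video',                     # from the data set
]

labels = [classify_title(title) for title in examples]
print(labels)  # ['ask', 'show', 'other']
```

Note that a plain prefix match would also accept a title such as "Ask HNers ...", so checking for `'ask hn:'` with the colon may be a stricter alternative if that matters.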

In [6]:
ask_posts[0:3]   #checking to see if the filtering worked

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14']]

Let's see which type of post gets more comments by calculating the average number of comments each type receives.

In [7]:
total_ask_comments = 0 
for row in ask_posts:
    num_comments = int(row[4])
    
    total_ask_comments += num_comments 
    
avg_ask_comments = total_ask_comments/len(ask_posts)

print(total_ask_comments)
print(avg_ask_comments)


24483
14.038417431192661


In [8]:
total_show_comments = 0 
for row in show_posts:
    num_comments = int(row[4])
    
    total_show_comments += num_comments

avg_show_comments = total_show_comments/len(show_posts)

print(total_show_comments)
print(avg_show_comments)
    
    

11988
10.31669535283993


### Ask posts receive more comments on average than show posts (about 14.04 versus 10.32), which makes sense when you consider that ask posts carry a call to action that invites users to engage and reply
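The two averaging loops differ only in the list they iterate over, so the calculation could be factored into one function. The following is a sketch rather than the notebook's own code: `average_comments` and its `comments_index` parameter are hypothetical names, and the three sample rows are the first three ask posts from the data set.

```python
def average_comments(posts, comments_index=4):
    """Average the comment counts of rows whose count is stored as a string."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '',
     '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '',
     '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '',
     '1', '1', 'polskibus', '5/2/2016 10:14'],
]

avg = average_comments(sample_posts)
print(avg)  # 12.0, since (6 + 29 + 1) / 3 = 12
```

Calling it as `average_comments(ask_posts)` and `average_comments(show_posts)` should reproduce the two averages computed above.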

Next, I'd like to determine whether ask posts created at a certain time are more likely to attract comments by:

- calculating the number of ask posts created in each hour of the day, along with the number of comments they received

- calculating the average number of comments ask posts receive for each hour of creation
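Both steps depend on pulling the hour out of the `created_at` string. A minimal sketch of that parsing step, using one timestamp from the data set and the format string `"%m/%d/%Y %H:%M"` (month/day/year followed by a 24-hour time):

```python
import datetime as dt

created_at = '8/16/2016 9:55'  # timestamp of the first ask post

# strptime turns the string into a datetime object according to the format.
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

print(parsed.month, parsed.day, parsed.year)  # 8 16 2016
print(parsed.hour)                            # 9
```

The `hour` attribute is an integer from 0 to 23, which is why it works well as a dictionary key for grouping posts.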

In [9]:
ask_posts[0:3]

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14']]

In [10]:
import datetime as dt  # the datetime module lets us parse and format date and time data

In [39]:
result_list = []

for row in ask_posts:             #extracting the time each post was created along with the number of comments
    created_at = row[6]
    num_comments = int(row[4])
    
    result_list.append([created_at, num_comments])
   

In [41]:
result_list[0]

['8/16/2016 9:55', 6]

In [47]:
counts_by_hour = {}          
comments_by_hour = {}

for row in result_list:
    date = row[0]
    comment = row[1]
    
    date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    
    hour = date.hour
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
        
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment

print(counts_by_hour)
print(comments_by_hour)
    

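{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}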


In [48]:
counts_by_hour

{9: 45,
 13: 85,
 10: 59,
 14: 107,
 16: 108,
 23: 68,
 12: 73,
 17: 100,
 15: 116,
 21: 109,
 20: 80,
 2: 58,
 18: 109,
 3: 54,
 5: 46,
 19: 110,
 1: 60,
 22: 71,
 8: 48,
 4: 47,
 0: 55,
 6: 44,
 7: 34,
 11: 58}

In [49]:
comments_by_hour

{9: 251,
 13: 1253,
 10: 793,
 14: 1416,
 16: 1814,
 23: 543,
 12: 687,
 17: 1146,
 15: 4477,
 21: 1745,
 20: 1722,
 2: 1381,
 18: 1439,
 3: 421,
 5: 464,
 19: 1188,
 1: 683,
 22: 479,
 8: 492,
 4: 337,
 0: 447,
 6: 397,
 7: 267,
 11: 641}

### Using the two dictionaries, comments_by_hour and counts_by_hour, we can create a list of lists containing each hour in which posts were created and the average number of comments those posts received


In [53]:
avg_by_hour = []

# Both dictionaries share the same keys, so each hour's comment total can be looked up directly.
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

avg_by_hour

[[9, 5.5777777777777775],
 [13, 14.741176470588234],
 [10, 13.440677966101696],
 [14, 13.233644859813085],
 [16, 16.796296296296298],
 [23, 7.985294117647059],
 [12, 9.41095890410959],
 [17, 11.46],
 [15, 38.5948275862069],
 [21, 16.009174311926607],
 [20, 21.525],
 [2, 23.810344827586206],
 [18, 13.20183486238532],
 [3, 7.796296296296297],
 [5, 10.08695652173913],
 [19, 10.8],
 [1, 11.383333333333333],
 [22, 6.746478873239437],
 [8, 10.25],
 [4, 7.170212765957447],
 [0, 8.127272727272727],
 [6, 9.022727272727273],
 [7, 7.852941176470588],
 [11, 11.051724137931034]]
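For comparison, the counting and averaging can be done in a single pass with `collections.defaultdict`, which removes the need for the "key not seen yet" branch. This is a sketch under stated assumptions: `average_comments_by_hour` is a hypothetical function name, and the fourth sample pair is invented so that one hour has more than one post.

```python
from collections import defaultdict
import datetime as dt

def average_comments_by_hour(pairs):
    """Map each creation hour to the mean comment count of posts made in that hour."""
    counts = defaultdict(int)    # hour -> number of posts
    comments = defaultdict(int)  # hour -> total comments
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
        counts[hour] += 1
        comments[hour] += num_comments
    return {hour: comments[hour] / counts[hour] for hour in counts}

sample_pairs = [
    ['8/16/2016 9:55', 6],     # from the data set
    ['11/22/2015 13:43', 29],  # from the data set
    ['5/2/2016 10:14', 1],     # from the data set
    ['8/17/2016 9:05', 2],     # invented pair, same hour as the first
]

averages = average_comments_by_hour(sample_pairs)
print(averages)  # {9: 4.0, 13: 29.0, 10: 1.0}
```

Returning a dictionary instead of a list of lists is a design choice; `list(averages.items())` gives hour and average pairs if a list is preferred.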

### Using the data above, we can work out the best time to post on Hacker News for maximum engagement

In [54]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

[[5.5777777777777775, 9], [14.741176470588234, 13], [13.440677966101696, 10], [13.233644859813085, 14], [16.796296296296298, 16], [7.985294117647059, 23], [9.41095890410959, 12], [11.46, 17], [38.5948275862069, 15], [16.009174311926607, 21], [21.525, 20], [23.810344827586206, 2], [13.20183486238532, 18], [7.796296296296297, 3], [10.08695652173913, 5], [10.8, 19], [11.383333333333333, 1], [6.746478873239437, 22], [10.25, 8], [7.170212765957447, 4], [8.127272727272727, 0], [9.022727272727273, 6], [7.852941176470588, 7], [11.051724137931034, 11]]


In [57]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True) 
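Swapping the columns works because `sorted` compares lists element by element, starting with the first. An alternative sketch is to leave the pairs as `[hour, average]` and pass a `key` function that picks the average. The variable names here are illustrative, and the sample holds four of the pairs computed above.

```python
# A few [hour, average] pairs taken from the results above.
avg_by_hour_sample = [
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [2, 23.810344827586206],
    [20, 21.525],
]

# Sort by the second element (the average), largest first.
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

top_hours = [hour for hour, _ in ranked]
print(top_hours)  # [15, 2, 20, 9]
```

One small difference: with the swapped lists, ties in the average are broken by the hour, whereas the `key` version keeps tied pairs in their original order.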

In [60]:
print("Top 4 Hours for Ask Posts Comments")

Top 4 Hours for Ask Posts Comments


In [84]:
for row in sorted_swap[0:4]:
    average = row[0]
    hour_of = row[1]

    # hour_of is an int, so parse it into a datetime before formatting it as HH:MM
    h = dt.datetime.strptime(str(hour_of), "%H").strftime("%H:%M")

    print("{time} : {comms:.2f} average comments per post".format(time=h, comms=average))

15:00 : 38.59 average comments per post
02:00 : 23.81 average comments per post
20:00 : 21.52 average comments per post
16:00 : 16.80 average comments per post


## The best times to post on Hacker News are 3 pm, 2 am, 8 pm, and 4 pm

It is quite interesting that 2 am has the second-highest average number of comments. Perhaps users in a different time zone are active and responding to posts at that time?
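To explore the time zone idea, it helps to see what a given hour looks like elsewhere. The data set description above does not state which time zone `created_at` uses, so the sketch below assumes a fixed UTC-5 offset purely for illustration; `ASSUMED_SOURCE_TZ` and `local_hour_elsewhere` are hypothetical names, and daylight saving time is ignored.

```python
import datetime as dt

# ASSUMPTION: timestamps are treated as UTC-5. The source does not confirm this.
ASSUMED_SOURCE_TZ = dt.timezone(dt.timedelta(hours=-5))

def local_hour_elsewhere(hour, target_offset_hours):
    """Convert an hour in the assumed source zone to HH:MM at another fixed UTC offset."""
    # The date is arbitrary; only the time of day matters for a fixed offset.
    source_time = dt.datetime(2016, 1, 1, hour, 0, tzinfo=ASSUMED_SOURCE_TZ)
    target_tz = dt.timezone(dt.timedelta(hours=target_offset_hours))
    return source_time.astimezone(target_tz).strftime("%H:%M")

print(local_hour_elsewhere(2, 0))    # 07:00 at UTC
print(local_hour_elsewhere(2, 5.5))  # 12:30 at UTC+5:30
```

Under that assumption, a 2 am post would land in the morning or around midday for readers several hours ahead, which is at least consistent with the guess above. It does not confirm it.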