# Hacker News Site Data Analysis

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

We took the data from Kaggle. Note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. The data source is:
https://www.kaggle.com/hacker-news/hacker-news-posts 

In this project we're specifically interested in posts whose titles begin with either 'Ask HN' or 'Show HN'. Users submit Ask HN posts to ask the Hacker News community a specific question, and Show HN posts to show the community a project, product, or something else they find interesting.

Later in this project we compare these two types of posts ('Ask HN' and 'Show HN') to determine which type receives more comments on average, and whether posts created at a certain time of day receive more comments on average.

In [1]:
from csv import reader

# Read the whole CSV into a list of lists; the with block closes the file.
with open('hacker_news.csv') as open_file:
    read_file = reader(open_file)
    hn = list(read_file)

hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [2]:
# Extract the first row (index 0) as the header.
# Remove the header row from the list so that hn holds only the data rows.
# Display the headers, then display the first five rows of hn.

In [3]:
headers = hn[0]
hn = hn[1:]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [4]:
hn[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

In [5]:
# If the lowercase version of the title
# starts with 'ask hn', append the row to ask_posts;
# if it starts with 'show hn', append the row to show_posts;
# otherwise, append the row to other_posts.


In [6]:
ask_posts = []
show_posts = []
other_posts = []


for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [7]:
#Check the number of posts in ask_posts, show_posts, and other_posts

print('No of posts in \'Ask Posts:\'',len(ask_posts))
print('No of posts in \'Show Posts:\'',len(show_posts))
print('No of posts in \'Other Posts\'',len(other_posts))

No of posts in 'Ask Posts:' 1744
No of posts in 'Show Posts:' 1162
No of posts in 'Other Posts' 17194
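
The prefix check used in the loop above can also be written as a small function. This is a minimal sketch rather than the notebook's own code: the function name `classify_title` and the sample titles are assumptions made for illustration and are not rows from `hacker_news.csv`.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical sample titles (not taken from the dataset).
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'Interactive Dynamic Video',
]

labels = [classify_title(t) for t in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```

Lowercasing the title once before comparing keeps the check case-insensitive, so 'Ask HN', 'ASK HN', and 'ask hn' all land in the same group.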


In [8]:
#Determine if ask posts or show posts receive more comments on average.

total_ask_comments = 0

for row in ask_posts:
    num_comments = row[4]
    num_comments = int(num_comments)
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments/len(ask_posts)
print('Average number of comments on ask posts: ', avg_ask_comments)

Average number of comments on ask posts:  14.038417431192661


In [9]:
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments/len(show_posts)
print('Average number of comments on show posts: ', avg_show_comments)

Average number of comments on show posts:  10.31669535283993


On average, ask posts in our sample receive approximately 14 comments, whereas show posts receive approximately 10. Since ask posts receive more comments on average, we'll focus our remaining analysis on these posts only.
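
The two averaging loops above repeat the same steps, so they could be folded into one helper. The sketch below assumes the same column layout as the dataset (`num_comments` at index 4); the name `average_comments` and the sample rows are hypothetical and only illustrate the arithmetic.

```python
def average_comments(posts, comments_index=4):
    """Average of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows in the same column order as hacker_news.csv.
sample_posts = [
    ['1', 'Ask HN: A', '', '5', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: B', '', '2', '29', 'user_b', '11/22/2015 13:43'],
    ['3', 'Ask HN: C', '', '1', '1', 'user_c', '5/2/2016 10:14'],
]

print(average_comments(sample_posts))  # (6 + 29 + 1) / 3 = 12.0
```

The guard for an empty list avoids a `ZeroDivisionError` if one of the groups turns out to contain no posts.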

In [10]:
import datetime as dt

# Calculate the number of ask posts created per hour, along with the total number of comments.

Create an empty list and assign it to result_list. It will become a list of lists, each holding a post's created_at value and its number of comments.

In [11]:
result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

In [12]:
result_list[:5]

[['8/16/2016 9:55', 6],
 ['11/22/2015 13:43', 29],
 ['5/2/2016 10:14', 1],
 ['8/2/2016 14:20', 3],
 ['10/15/2015 16:38', 17]]

In [13]:
# Number of ask posts created during each hour of the day.
counts_by_hour = {}
# Total number of comments received by ask posts created during each hour.
comments_by_hour = {}

date_format = "%m/%d/%Y %H:%M"


for row in result_list:
    date = row[0]
    comment = int(row[1])
    hour = dt.datetime.strptime(date, date_format).strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment

comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}

In [14]:
counts_by_hour

{'00': 55,
 '01': 60,
 '02': 58,
 '03': 54,
 '04': 47,
 '05': 46,
 '06': 44,
 '07': 34,
 '08': 48,
 '09': 45,
 '10': 59,
 '11': 58,
 '12': 73,
 '13': 85,
 '14': 107,
 '15': 116,
 '16': 108,
 '17': 100,
 '18': 109,
 '19': 110,
 '20': 80,
 '21': 109,
 '22': 71,
 '23': 68}
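
The same per-hour tally can be written without the `if hour not in ...` branch by using `collections.defaultdict`. This is a sketch of an alternative, not the code that produced the outputs above; the function name `tally_by_hour` and the sample pairs are assumptions for illustration.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"


def tally_by_hour(pairs):
    """Return (counts, comments) dicts keyed by zero-padded hour strings."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        # Parse the timestamp, then keep only the hour as a two-digit string.
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)


# Hypothetical [created_at, num_comments] pairs in the same shape as result_list.
sample_pairs = [
    ['8/16/2016 9:55', 6],
    ['5/2/2016 9:14', 1],
    ['8/2/2016 14:20', 3],
]

counts, comments = tally_by_hour(sample_pairs)
print(counts)    # {'09': 2, '14': 1}
print(comments)  # {'09': 7, '14': 3}
```

Because missing keys default to 0, both dictionaries can be updated with `+=` on the first and every later occurrence of an hour.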

Create a list of lists containing the hours during which posts were created and the average number of comments those posts received.

In [15]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour,
                        comments_by_hour[hour] / counts_by_hour[hour]])

In [16]:
avg_by_hour

[['17', 11.46],
 ['16', 16.796296296296298],
 ['22', 6.746478873239437],
 ['04', 7.170212765957447],
 ['23', 7.985294117647059],
 ['00', 8.127272727272727],
 ['19', 10.8],
 ['07', 7.852941176470588],
 ['14', 13.233644859813085],
 ['13', 14.741176470588234],
 ['20', 21.525],
 ['05', 10.08695652173913],
 ['10', 13.440677966101696],
 ['11', 11.051724137931034],
 ['15', 38.5948275862069],
 ['08', 10.25],
 ['02', 23.810344827586206],
 ['12', 9.41095890410959],
 ['01', 11.383333333333333],
 ['21', 16.009174311926607],
 ['03', 7.796296296296297],
 ['06', 9.022727272727273],
 ['18', 13.20183486238532],
 ['09', 5.5777777777777775]]
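
As a quick check of the arithmetic, each average is the total number of comments for an hour divided by the number of posts in that hour. The sketch below reuses three of the per-hour totals and counts printed earlier (hours 15, 02, and 20); the variable names ending in `_sample` are introduced here for illustration only.

```python
# Totals and counts copied from the comments_by_hour and counts_by_hour outputs.
comments_sample = {'15': 4477, '02': 1381, '20': 1722}
counts_sample = {'15': 116, '02': 58, '20': 80}

# Average comments per post for each hour, as [hour, average] pairs.
avg_sample = [
    [hour, comments_sample[hour] / counts_sample[hour]]
    for hour in comments_sample
]

for hour, avg in avg_sample:
    print(hour, avg)
```

For example, hour 15 has 4477 comments across 116 posts, which gives roughly 38.59 comments per post.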

In [17]:
swap_avg_by_hour = []

for each in avg_by_hour:
    swap_avg_by_hour.append([each[1], each[0]])

In [18]:
swap_avg_by_hour

[[11.46, '17'],
 [16.796296296296298, '16'],
 [6.746478873239437, '22'],
 [7.170212765957447, '04'],
 [7.985294117647059, '23'],
 [8.127272727272727, '00'],
 [10.8, '19'],
 [7.852941176470588, '07'],
 [13.233644859813085, '14'],
 [14.741176470588234, '13'],
 [21.525, '20'],
 [10.08695652173913, '05'],
 [13.440677966101696, '10'],
 [11.051724137931034, '11'],
 [38.5948275862069, '15'],
 [10.25, '08'],
 [23.810344827586206, '02'],
 [9.41095890410959, '12'],
 [11.383333333333333, '01'],
 [16.009174311926607, '21'],
 [7.796296296296297, '03'],
 [9.022727272727273, '06'],
 [13.20183486238532, '18'],
 [5.5777777777777775, '09']]

In [20]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [21]:
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
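
The swap step exists only so that `sorted` compares the averages first. An alternative, sketched here with a few rounded averages taken from the output above, is to leave the `[hour, average]` pairs as they are and pass a `key` function to `sorted`. The variable names `sample_avg_by_hour` and `ranked` are assumptions for illustration.

```python
# A small subset of [hour, average] pairs, rounded from the averages above.
sample_avg_by_hour = [
    ['17', 11.46],
    ['15', 38.59],
    ['09', 5.58],
    ['02', 23.81],
]

# Sort by the second element (the average), highest first.
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

print(ranked[:2])  # [['15', 38.59], ['02', 23.81]]
```

With this approach the pairs keep their original `[hour, average]` order, so no second list is needed.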

In [28]:
# Print the 5 hours with the highest average number of comments
# (sorted_swap is already sorted in descending order).
print("Top 5 Hours for Ask Posts Comments")
for avg, hr in sorted_swap[:5]:
    print(
    "{}: {:.2f} average comments per posts".format(
    dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg)
    )
    

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per posts
02:00: 23.81 average comments per posts
20:00: 21.52 average comments per posts
16:00: 16.80 average comments per posts
21:00: 16.01 average comments per posts
