# Hacker News Python Project

In this project we will use Python to explore posts on the website Hacker News and answer questions such as:

- Do `Ask HN` or `Show HN` posts receive more comments on average?

- Do posts created at certain times receive more comments on average?

The data set used in this project contains a sample of approximately 20,000 posts from Hacker News, all of which have received comments from Hacker News readers.

### Introduction

First, we import the `csv` module, read the data set `hacker_news.csv`, and store the data as a list of lists in the variable `hn`.

In [1]:
import csv

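# Open the file in a context manager so that it is closed after reading.
with open("hacker_news.csv") as file:
    read_file = csv.reader(file)
    # Each row becomes one list of strings inside hn.
    hn = list(read_file)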


print(hn[:4])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]


### Removing Headers

The first list in the list of lists contains the column headers, and each list after it contains the data for one row. To analyse the data, we first need to remove the row containing the column headers.

In [2]:
headers = hn[0]

hn = hn[1:]

print(headers)
print("\n")
print(hn[:4])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


The output above confirms that the header row has been removed from the `hn` list of lists.

### Extracting Ask HN and Show HN Posts

Since we are only interested in posts whose titles begin with `Ask HN` or `Show HN`, we'll create lists of lists containing just the data for those posts.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


We have 1744 `Ask HN` posts and 1162 `Show HN` posts to use for our analysis.
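
The title check used above can also be wrapped in a small helper, which makes the rule easy to try on individual titles. The sketch below is one possible way to do this. The function name `classify_title` and the labels it returns are illustrative choices, not part of the original notebook, and the sample titles are taken from the data shown earlier.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()  # compare case-insensitively, as in the loop above
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# A few titles copied from the data set output.
sample_titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "Interactive Dynamic Video",
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```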

### Calculating the Average Number of Comments for Ask HN and Show HN Posts

Now that we have separated the `Ask HN` and `Show HN` posts, we will calculate the average number of comments each type of post receives.

In [4]:
print(ask_posts[:5])
print("\n")
print(show_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 

Above are the first five rows of each of the `Ask HN` and `Show HN` lists of lists we just created.

Next, we determine whether `Ask HN` or `Show HN` posts receive more comments on average.

In [5]:
total_ask_comments = 0

for posts in ask_posts:
    num_comments = int(posts[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)

print(avg_ask_comments)


total_show_comments = 0

for posts in show_posts:
    num_comments = int(posts[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)

print(avg_show_comments)

14.038417431192661
10.31669535283993


From the above analysis, we can see that `Ask HN` posts receive more comments on average (about 14.04 per post) than `Show HN` posts (about 10.32 per post).
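
Because the same total-then-divide calculation is repeated for both lists, it could be factored into a single function. The following is a minimal sketch under a few assumptions: the helper name `average_comments`, its `comments_index` parameter, and the choice to return `0.0` for an empty list are all illustrative and not part of the original notebook. The three sample rows are copied from the `Ask HN` output above.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for rows that store it as a string."""
    if not posts:
        return 0.0  # assumption: treat an empty list as an average of zero
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Three rows copied from the Ask HN sample output.
sample_ask_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]
sample_average = average_comments(sample_ask_posts)
print(sample_average)  # (6 + 29 + 1) / 3
```

With the full lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages printed above.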

### Finding the Number of Ask Posts and Comments by Hour Created

Since `Ask HN` posts receive more comments on average, we will use them for the next part of our analysis to determine whether `Ask HN` posts created at a certain time of day are more likely to attract comments from readers.

In [6]:
import datetime as dt

result_list = []

for posts in ask_posts:
    created_at = posts[6]
    num_comments = int(posts[4])
    result_list.append([created_at,num_comments])
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    comment_date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    comment_hour = comment_date.strftime("%H")
    comment = row[1]
    if comment_hour not in counts_by_hour:
        counts_by_hour[comment_hour] = 1
        comments_by_hour[comment_hour] = comment
    else:
        counts_by_hour[comment_hour] += 1
        comments_by_hour[comment_hour] += comment

comments_by_hour                                   
    

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
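
To make the parsing step concrete, the short sketch below applies the same `strptime` format string to a single timestamp and extracts the zero-padded hour that serves as the dictionary key. The variable names `parsed` and `hour_key` are illustrative; the timestamp is taken from the first `Ask HN` row shown earlier.

```python
import datetime as dt

# Timestamp copied from the first Ask HN row in the sample output.
created_at = "8/16/2016 9:55"

# Parse month/day/year and 24-hour time into a datetime object.
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

# Format only the hour, zero-padded to two digits, for use as a dictionary key.
hour_key = parsed.strftime("%H")

print(parsed)
print(hour_key)
```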

### Calculating the Average Number of Comments for Ask HN Posts by Hour

In [13]:
avg_by_hour = []

for x in counts_by_hour:
    avg_by_hour.append([x, comments_by_hour[x]/counts_by_hour[x]])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

### Sorting and Printing Values from a List of Lists

In [14]:
swap_avg_by_hour = []
for i in avg_by_hour:
    swap_avg_by_hour.append([i[1],i[0]])
    
swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [15]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
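
Swapping the columns works because `sorted` compares lists element by element, so the average becomes the primary sort key. An alternative sketch, not used in this notebook, keeps the `[hour, average]` order and passes a `key` function to `sorted` instead. The names `sample_avg_by_hour` and `sorted_by_avg` are illustrative, and the three rows are copied from the averages computed above.

```python
# Three [hour, average] pairs copied from the avg_by_hour output.
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

# Sort by the second element (the average), highest first, without swapping columns.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)
print(sorted_by_avg)
```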

In [16]:
# Print the 5 hours with the highest average number of comments per post.

print("Top 5 Hours for 'Ask HN' Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for 'Ask HN' Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Posts created during the 15:00 hour receive the most comments per post on average, at 38.59. That is roughly 62% higher than the second-highest hour, 02:00, which averages 23.81 comments per post.
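
The relative increase between the two highest averages can be checked directly. The sketch below uses the two values from the sorted output above; the variable names are illustrative.

```python
# Averages for the top two hours, copied from the sorted output.
highest = 38.5948275862069          # 15:00
second_highest = 23.810344827586206  # 02:00

# Relative increase of the highest average over the second highest.
relative_increase = (highest - second_highest) / second_highest
print("{:.0%}".format(relative_increase))
```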

### Conclusions

In this project we have analysed a sample of `Ask HN` and `Show HN` posts from the website Hacker News. Based on this sample, we have found that `Ask HN` posts receive more comments on average than `Show HN` posts. We have also found that `Ask HN` posts created during the 15:00 hour receive the most comments per post on average.