# Exploring Hacker News Ask HN and Show HN posts

### In this project I explore the Hacker News posts dataset, comparing Ask HN and Show HN posts. I want to determine which type gets more comments on average, and whether posts created at certain times get more comments on average.

In [159]:
#Reading in the dataset
from csv import reader

#Using a with block so the file is closed once it has been read
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

### id: The unique identifier from Hacker News for the post
### title: The title of the post
### url: The URL that the post links to, if the post has a URL
### num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
### num_comments: The number of comments that were made on the post
### author: The username of the person who submitted the post
### created_at: The date and time at which the post was submitted
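
With the column names listed above, each row can also be read as a dictionary keyed by column name, which avoids positional indexes such as `row[4]`. The following is a minimal sketch, not the approach used in the rest of this notebook. `sample_csv` is a hypothetical stand-in for `hacker_news.csv`, holding the header row and two rows copied from the dataset.

```python
import io
from csv import DictReader

# Stand-in for hacker_news.csv: the header row plus two rows from the dataset
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,3,1,hswarna,6/17/2016 0:01\n"
)

# DictReader uses the first row as keys, so no separate header handling is needed
posts = list(DictReader(sample_csv))

for post in posts:
    # Values are still strings and need converting before doing arithmetic
    print(post["title"], int(post["num_comments"]))
```

The trade-off is slightly more memory per row in exchange for code that names the column it reads.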

In [160]:
for x in hn[:5]:
    print('{} \n'.format(x))


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 



In [161]:
#Extracting the header row and removing it from the dataset
headers = hn[0]
hn.pop(0)
for x in hn[:5]:
    print('{} \n'.format(x))

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'] 



In [162]:
#Separating out different types of posts
ask_posts = []
show_posts = []
other_posts = []

#Lowercasing the title so that startswith matches regardless of capitalization
for row in hn:
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

#Checking out how many posts each type has
print('There are a total of {amount} posts in our database'.format(amount=len(hn)))
print("There are a total of {amount} Ask HN posts".format(amount=len(ask_posts)))
print("There are a total of {amount} Show HN posts".format(amount=len(show_posts)))
print("There are a total of {amount} other posts".format(amount=len(other_posts)))

There are a total of 20100 posts in our database
There are a total of 1744 Ask HN posts
There are a total of 1162 Show HN posts
There are a total of 17194 other posts


### Now that the posts are separated into different lists, I will determine whether Ask HN or Show HN posts get more comments on average

In [163]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)

total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)

print("On average Ask HN get {:.2f} comments on each post".format(avg_ask_comments))
print("On average Show HN get {:.2f} comments on each post".format(avg_show_comments))

if avg_ask_comments > avg_show_comments:
    difference = avg_ask_comments - avg_show_comments
    print("On average Ask HN gets {difference:.2f} more comments on each post than Show HN".format(difference=difference))
    
else:
    difference = avg_show_comments - avg_ask_comments
    print("On average Show HN gets {difference:.2f} more comments on each post than Ask HN".format(difference=difference))

On average Ask HN get 14.04 comments on each post
On average Show HN get 10.32 comments on each post
On average Ask HN gets 3.72 more comments on each post than Show HN
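
The two averaging loops above follow the same pattern, so they can be folded into one helper. This is a sketch under assumptions: `average_comments` is a hypothetical function name, the sample rows are invented for illustration, and returning 0.0 for an empty list is a choice made here, not something the dataset requires.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for rows shaped like the hn rows."""
    if not posts:
        return 0.0  # assumption: treat an empty list as an average of 0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical miniature rows; only index 4 (num_comments) matters here
sample_ask = [["1", "Ask HN: a", "", "5", "10", "u1", "1/1/2016 10:00"],
              ["2", "Ask HN: b", "", "3", "20", "u2", "1/1/2016 11:00"]]
sample_show = [["3", "Show HN: c", "", "9", "6", "u3", "1/1/2016 12:00"]]

print("{:.2f}".format(average_comments(sample_ask)))
print("{:.2f}".format(average_comments(sample_show)))
```

With the real lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two loops.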


### As the cell above shows, Ask HN posts get on average 3.72 more comments per post than Show HN posts. This was expected: Ask HN posts ask for advice or help, whereas Show HN posts share something the author has made or discovered, or offer advice and tips (a few examples of each are shown below). A post that asks a question invites people to answer it, so it tends to attract more comments.

In [164]:
for x in ask_posts[:5]:
    print(x[1])

Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?
Ask HN: Looking for Employee #3 How do I do it?
Ask HN: Someone offered to buy my browser extension from me. What now?


In [165]:
for x in show_posts[:5]:
    print(x[1])

Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm
Show HN: Webscope  Easy way for web developers to communicate with Clients
Show HN: GeoScreenshot  Easily test Geo-IP based web pages


### I will now focus on Ask HN posts to find out whether there is a relationship between the time a post was created and the number of comments it receives.

In [166]:
import datetime as dt

result_list = []

#Adding the time the post was created and the number of comments of each post to result_list
for row in ask_posts:
    temp_list = [row[6], int(row[4])]
    result_list.append(temp_list)

#Number of ask posts created during each hour of the day
counts_by_hour = {}
#Total number of comments received by ask posts created during each hour
comments_by_hour = {}

date_format = "%m/%d/%Y %H:%M"
for row in result_list:
    created_at_dt = dt.datetime.strptime(row[0], date_format)
    created_at_hour_dt = created_at_dt.strftime("%H")
    
    if created_at_hour_dt not in counts_by_hour:
        counts_by_hour[created_at_hour_dt] = 1
        comments_by_hour[created_at_hour_dt] = row[1]
    else:
        counts_by_hour[created_at_hour_dt] += 1
        comments_by_hour[created_at_hour_dt] += row[1]
    
print(counts_by_hour)
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


In [167]:
#Creating a list of [hour, average comments per post] pairs
avg_by_hour_list = []

for hour in counts_by_hour:
    avg_by_hour = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour_list.append([hour, avg_by_hour])

for x in avg_by_hour_list:
    print(x)

['09', 5.5777777777777775]
['13', 14.741176470588234]
['10', 13.440677966101696]
['14', 13.233644859813085]
['16', 16.796296296296298]
['23', 7.985294117647059]
['12', 9.41095890410959]
['17', 11.46]
['15', 38.5948275862069]
['21', 16.009174311926607]
['20', 21.525]
['02', 23.810344827586206]
['18', 13.20183486238532]
['03', 7.796296296296297]
['05', 10.08695652173913]
['19', 10.8]
['01', 11.383333333333333]
['22', 6.746478873239437]
['08', 10.25]
['04', 7.170212765957447]
['00', 8.127272727272727]
['06', 9.022727272727273]
['07', 7.852941176470588]
['11', 11.051724137931034]
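
The counting and averaging steps above can be condensed into a single function. The sketch below uses `collections.defaultdict` so that the "is this hour already in the dictionary" check disappears. `average_comments_by_hour` and `sample_rows` are hypothetical names, and the sample values are a small illustration, not the full dataset.

```python
import datetime as dt
from collections import defaultdict

def average_comments_by_hour(rows, date_format="%m/%d/%Y %H:%M"):
    """Map each two-digit hour to the mean comment count of posts created in it."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return {hour: comments[hour] / counts[hour] for hour in counts}

# Hypothetical [created_at, num_comments] pairs in the same shape as result_list
sample_rows = [["8/4/2016 11:52", 52], ["1/26/2016 11:30", 10], ["6/17/2016 0:01", 1]]
averages = average_comments_by_hour(sample_rows)
print(averages)
```

Calling `average_comments_by_hour(result_list)` would give a dictionary with the same hour and average pairs as the list printed above.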


In [168]:
#Swapping each pair to [average, hour] so that sorting the list orders it by the average
swap_avg_by_hour_list = []

for x in avg_by_hour_list:
    swap_avg_by_hour_list.append([x[1], x[0]])
    
print(swap_avg_by_hour_list)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
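
The swap puts the average first so that sorting the list orders it by the average. An alternative, sketched here with a few hypothetical rounded values, is to leave the pairs as `[hour, average]` and tell `sorted` which element to compare through its `key` argument.

```python
# Hypothetical [hour, average] pairs with the same field order as avg_by_hour_list
sample_avg_by_hour = [["09", 5.58], ["15", 38.59], ["02", 23.81], ["20", 21.52]]

# key picks the average (index 1) as the sort value; reverse=True puts the largest first
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

This avoids building a second list, at the cost of a small lambda.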


In [169]:
#Sorting the swapped list

sorted_swap = sorted(swap_avg_by_hour_list, reverse=True)

print("Top 5 Hours for Ask Posts Comments : ")

for x in sorted_swap[:5]:
    #Formatting the hour as HH:MM so the placeholder date 1900-01-01 is not printed
    houred = dt.datetime.strptime(x[1], "%H").strftime("%H:%M")
    print("{hour}: {avg:.2f} average comments per post".format(hour=houred,avg=x[0]))

Top 5 Hours for Ask Posts Comments : 
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


### The timestamps in the dataset are in US Eastern Time. Since we are in a different timezone and might travel around, I will set this up so that you can enter your own timezone and the hours are converted automatically
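
At its core, the conversion is a shift by the difference between the two UTC offsets. The sketch below shows that idea with only the standard library. It assumes fixed winter offsets (UTC-5 for US Eastern, UTC+2 for Tallinn) and ignores daylight saving time. `convert_hour`, `EASTERN` and `TALLINN` are hypothetical names introduced for illustration.

```python
import datetime as dt

# Assumed fixed winter offsets; daylight saving time is ignored in this sketch
EASTERN = dt.timezone(dt.timedelta(hours=-5))   # US Eastern standard time
TALLINN = dt.timezone(dt.timedelta(hours=2))    # Eastern European standard time

def convert_hour(hour, source_tz=EASTERN, target_tz=TALLINN):
    """Convert a two-digit hour string from source_tz to target_tz."""
    source_time = dt.datetime(2016, 1, 1, int(hour), tzinfo=source_tz)
    return source_time.astimezone(target_tz).strftime("%H")

print(convert_hour("15"))
print(convert_hour("20"))
```

A timezone database such as pytz is still needed to handle daylight saving time correctly, which is why the cells that follow use it.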

In [173]:
import pytz

#The timestamps in the dataset are in US Eastern Time
eastern = pytz.timezone('US/Eastern')

#Set your timezone (pytz.all_timezones lists the valid names)
local_tz = pytz.timezone("Europe/Tallinn")

#Building a new list instead of changing sorted_swap in place,
#so that re-running this cell does not shift the hours a second time
local_sorted_swap = []

for avg, hour in sorted_swap:
    #January 1 is used as the reference date, so the winter UTC offsets apply
    eastern_time = eastern.localize(dt.datetime(year=2016, month=1, day=1, hour=int(hour)))
    local_hour = eastern_time.astimezone(local_tz).strftime('%H')
    local_sorted_swap.append([avg, local_hour])


In [176]:
#Printing the top 5 hours in the chosen timezone (the list is already sorted by average)

print("Top 5 Hours for Ask Posts Comments in {tz}: ".format(tz=local_tz))

for avg, hour in local_sorted_swap[:5]:
    print("{hour}:00: {avg:.2f} average comments per post".format(hour=hour, avg=avg))

Top 5 Hours for Ask Posts Comments in Europe/Tallinn: 
22:00: 38.59 average comments per post
09:00: 23.81 average comments per post
03:00: 21.52 average comments per post
23:00: 16.80 average comments per post
04:00: 16.01 average comments per post


### We have now converted the times to our current timezone and can see that at 22: