# What Insight Can You Get From a Social News Website? An Analysis on Hacker News

Hacker News is a popular social news website founded by Y Combinator, a well-known startup accelerator in the US. The site mainly focuses on computer science and entrepreneurship. Users can submit content to Hacker News, which other members of the community can vote and comment on. In this project, we look at data on about 20,000 threads submitted to Hacker News. We are especially interested in two kinds of posts: `Ask HN` and `Show HN`. `Ask HN` posts are those in which a member asks the community something, for example, `How to create a good data science project?`; `Show HN` posts are those in which a member shows the community something interesting, e.g., `CatDrone: a drone that follows your cat`.

We will analyze the data of Hacker News to answer the following two questions: 
* Which kind of post receives more comments on average?
* Does the time of posting affect the number of comments?

# Data Extraction

Let's first take a look at the dataset we have.

In [2]:
from csv import reader

# Read the whole file into a list of rows; the with block closes the file afterwards
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
for row in hn[:4]:
    print(row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']




We see that we have information about the title of a post, the number of points and comments the post received, its author, and the time at which the post was created. We then remove the header row from the main dataset for later analysis.

In [3]:
headers = hn[0]
hn = hn[1:]
print(headers)
for row in hn[:4]:
    print(row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']




Now, we extract the `Ask HN` and `Show HN` posts into two separate lists. 

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title_lowercase = title.lower()
    if title_lowercase.startswith('ask hn'):
        ask_posts.append(row)
    elif title_lowercase.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print('Number of \'Ask HN\' posts:',len(ask_posts))
print('Number of \'Show HN\' posts:',len(show_posts))
print('Number of other posts:',len(other_posts))
print('Total number of posts:',len(hn))

Number of 'Ask HN' posts: 1744
Number of 'Show HN' posts: 1162
Number of other posts: 17194
Total number of posts: 20100
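
The prefix rule used above can also be packaged as a small function, which makes it easy to check on a few titles. This is only a minimal sketch: the name `classify_title` and the sample titles are illustrative assumptions, not part of the dataset or the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles, used only to illustrate the rule
sample_titles = [
    'Ask HN: How to create a good data science project?',
    'Show HN: CatDrone: a drone that follows your cat',
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Note that the rule only matches titles that begin with the prefix, so a title that merely mentions `Ask HN` somewhere in the middle is counted as an ordinary post.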


# Analysis

In this section, we calculate the average number of comments `Ask HN` and `Show HN` posts received, respectively.

In [5]:
total_ask_comments = 0

for row in ask_posts:
    no_comments = int(row[4])
    total_ask_comments += no_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average comments on \'Ask HN\' posts:',avg_ask_comments)

total_show_comments = 0

for row in show_posts:
    no_comments = int(row[4])
    total_show_comments += no_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print('Average comments on \'Show HN\' posts:', avg_show_comments)

Average comments on 'Ask HN' posts: 14.038417431192661
Average comments on 'Show HN' posts: 10.31669535283993
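
Since the same averaging loop is written twice, it could be factored into a helper. The sketch below assumes the column layout shown earlier (comment count at index 4); the function name `average_comments`, the empty-list behaviour, and the sample rows are assumptions made for illustration.

```python
def average_comments(posts, comments_index=4):
    """Mean of the integer comment counts stored at comments_index in each row."""
    if not posts:
        return 0.0  # assumption: treat an empty group as 0 rather than raising
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the same column layout as the dataset
sample_posts = [
    ['1', 'Ask HN: A', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B', '', '3', '20', 'user_b', '8/4/2016 12:10'],
]
sample_avg = average_comments(sample_posts)
print(sample_avg)
```

With such a helper, the two averages above would come from `average_comments(ask_posts)` and `average_comments(show_posts)`.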


We found that `Ask HN` posts received on average about 36% more comments than `Show HN` posts (14.04 versus 10.32). Next, we look into `Ask HN` posts to see whether posting time affects the number of comments a thread receives. To this end, we first extract the creation time of each `Ask HN` thread, together with the number of comments it received, into a list named `result_list`. Then, for each hour of the day, we divide the total number of comments by the number of threads created in that hour, which gives the average number of comments per post for each hour.

In [6]:
import datetime as dt

result_list = []

for row in ask_posts:
    time_created = row[6]
    no_comments = int(row[4])
    result_list.append([time_created, no_comments])
    
# Number of posts and total number of comments, keyed by two-digit hour
counts_by_hours = {}
comments_by_hours = {}

for row in result_list:
    datetime_created = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour_created = datetime_created.strftime("%H")
    if hour_created not in counts_by_hours:
        counts_by_hours[hour_created] = 1
        comments_by_hours[hour_created] = row[1]
    else:
        counts_by_hours[hour_created] += 1
        comments_by_hours[hour_created] += row[1]


# Average comments per post for each hour: total comments / number of posts
avg_by_hour = []

for key in counts_by_hours:
    avg_by_hour.append([key, comments_by_hours[key] / counts_by_hours[key]])


for hour in avg_by_hour:
    print(hour)

['09', 5.5777777777777775]
['13', 14.741176470588234]
['10', 13.440677966101696]
['14', 13.233644859813085]
['16', 16.796296296296298]
['23', 7.985294117647059]
['12', 9.41095890410959]
['17', 11.46]
['15', 38.5948275862069]
['21', 16.009174311926607]
['20', 21.525]
['02', 23.810344827586206]
['18', 13.20183486238532]
['03', 7.796296296296297]
['05', 10.08695652173913]
['19', 10.8]
['01', 11.383333333333333]
['22', 6.746478873239437]
['08', 10.25]
['04', 7.170212765957447]
['00', 8.127272727272727]
['06', 9.022727272727273]
['07', 7.852941176470588]
['11', 11.051724137931034]
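
The per-hour grouping above can be written more compactly with `collections.defaultdict`, which removes the need for the `if`/`else` branch. This is a sketch under stated assumptions rather than the notebook's own code: the function name `average_comments_by_hour` and the sample pairs are hypothetical, and the timestamp format is the one used in the dataset.

```python
import datetime as dt
from collections import defaultdict

def average_comments_by_hour(pairs, time_format="%m/%d/%Y %H:%M"):
    """Map each two-digit hour to the mean comment count of posts created in that hour."""
    counts = defaultdict(int)   # number of posts per hour
    totals = defaultdict(int)   # total comments per hour
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, time_format).strftime("%H")
        counts[hour] += 1
        totals[hour] += num_comments
    return {hour: totals[hour] / counts[hour] for hour in counts}

# Hypothetical [created_at, num_comments] pairs
sample_pairs = [
    ['8/4/2016 11:52', 4],
    ['8/5/2016 11:05', 8],
    ['6/17/2016 0:01', 3],
]
sample_hourly = average_comments_by_hour(sample_pairs)
print(sample_hourly)
```

Returning the post counts alongside the averages would also show how many posts each hourly average is based on.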


We now sort the list in descending order of the average number of comments. Because `sorted` compares lists element by element, one way to do this is to swap the `hour` and `comments` columns so that the average comes first.

In [9]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

for hour in sorted_swap:
    print(hour)

[38.5948275862069, '15']
[23.810344827586206, '02']
[21.525, '20']
[16.796296296296298, '16']
[16.009174311926607, '21']
[14.741176470588234, '13']
[13.440677966101696, '10']
[13.233644859813085, '14']
[13.20183486238532, '18']
[11.46, '17']
[11.383333333333333, '01']
[11.051724137931034, '11']
[10.8, '19']
[10.25, '08']
[10.08695652173913, '05']
[9.41095890410959, '12']
[9.022727272727273, '06']
[8.127272727272727, '00']
[7.985294117647059, '23']
[7.852941176470588, '07']
[7.796296296296297, '03']
[7.170212765957447, '04']
[6.746478873239437, '22']
[5.5777777777777775, '09']
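
As an alternative to swapping the columns, `sorted` accepts a `key` function that selects the value to sort by, so the `[hour, average]` layout can be kept as is. The list below is a hypothetical stand-in with the same shape as `avg_by_hour`, not the real results.

```python
# Hypothetical [hour, average] pairs in the same shape as avg_by_hour
sample_avg_by_hour = [['09', 5.5], ['15', 38.5], ['02', 23.75]]

# Sort by the second element (the average), largest first
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print(hour, avg)
```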


We narrow down the list to the five hours at which `Ask HN` threads receive the most comments on average.

In [10]:
print('Top 5 Hours for \'Ask HN\' Posts Comments')
for row in sorted_swap[:5]:
    avg_comments = float(row[0])
    hour_dt = dt.datetime.strptime(row[1], "%H")
    hour_formatted = hour_dt.strftime("%H:%M")
    print("{}: {:.2f} average comments per post.".format(hour_formatted, avg_comments))
    

Top 5 Hours for 'Ask HN' Posts Comments
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


These are the hours at which an `Ask HN` post receives the most comments on average. The top hour (3pm-4pm) averages 38.59 comments per post, well above the runner-up (23.81 for 2am-3am) and every other hour of the day.

# Concluding Remarks

In this project, we analyzed Hacker News data on `Ask HN` and `Show HN` posts. We found that `Ask HN` posts receive more comments on average than `Show HN` posts (about 14.0 versus 10.3), and that `Ask HN` posts created between 3pm and 4pm receive the most comments on average. These findings shed light on when to post in order to attract more comments from the Hacker News community.