# Hacker News Posts

In this project we'll work with a data set of posts from [Hacker News](https://news.ycombinator.com/), a popular site where technology-related stories (or 'posts') are voted and commented on.

You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting.

## Opening the data

We'll start by opening the data set, reading it into a list of rows, and displaying the header and the first 5 rows:

In [3]:
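from csv import reader

# Read the whole file into a list of rows; the with-block closes the file afterwards
with open('hacker_news.csv') as file:
    hn = list(reader(file))

# Separate the header row from the data rows
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])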

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
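
The reading step can be tried without the CSV file on disk. The following is a minimal sketch that assumes an in-memory stand-in (`sample_csv`, a hypothetical name) built from the header and the first row shown above; `sample_headers` and `sample_rows` are likewise illustrative names.

```python
import csv
import io

# Stand-in for hacker_news.csv: the header and the first row printed above.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

# csv.reader yields one list of strings per line of input.
with sample_csv as f:
    rows = list(csv.reader(f))

# Split the header from the data rows, as in the cell above.
sample_headers, sample_rows = rows[0], rows[1:]
print(sample_headers)
print(sample_rows[0][1])
```

Every field comes back as a string, which is why the later cells convert `num_comments` with `int()`.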


## Filtering the data

Now we'll separate the posts whose titles start with `Ask HN` or `Show HN`, as mentioned previously. To do this we'll use regular expressions to sort the rows into 3 lists: the `Ask` posts, the `Show` posts and the rest.

In [9]:
import re
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    match1 = re.search(r"^Ask HN",title,re.I)
    match2 = re.search(r"^Show HN",title,re.I)
    if match1:
        ask_posts.append(row)
    elif match2:
        show_posts.append(row)
    else:
        other_posts.append(row)
        
ask_percentage = round((len(ask_posts)/len(hn))*100,2)
show_percentage = round((len(show_posts)/len(hn))*100,2)
other_percentage = round((len(other_posts)/len(hn))*100,2)
        
print(f"{len(ask_posts)} Ask posts ({ask_percentage}% of all posts)")
print(f"{len(show_posts)} Show posts ({show_percentage}% of all posts)")
print(f"{len(other_posts)} Other posts ({other_percentage}% of all posts)")

1744 Ask posts (8.68% of all posts)
1162 Show posts (5.78% of all posts)
17194 Other posts (85.54% of all posts)
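
To see how the anchored, case-insensitive patterns behave, here is a small sketch on a few made-up titles. The titles and the `classify` helper are hypothetical and exist only for illustration; the patterns are the ones used in the cell above.

```python
import re

# Hypothetical titles, invented for illustration (not taken from the data set).
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "ask hn: lowercase prefix still matches",
    "Show HN: My weekend project",
    "Why I ask HN for advice",  # 'ask HN' appears, but not at the start
]

def classify(title):
    """Return 'ask', 'show' or 'other' using the same patterns as above."""
    # ^ anchors the match to the start of the title; re.I ignores case.
    if re.search(r"^Ask HN", title, re.I):
        return "ask"
    if re.search(r"^Show HN", title, re.I):
        return "show"
    return "other"

labels = [classify(title) for title in sample_titles]
print(labels)
```

The last title lands in `other` because `^` only allows a match at the beginning of the string.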


## Analyzing the data

Next, we'll compare the two types of post by the number of comments they receive, which is stored in the `num_comments` column. We'll use list comprehensions to extract these values and then compute the average for each type.

In [11]:
ask_comments = [int(row[4]) for row in ask_posts]
show_comments = [int(row[4]) for row in show_posts]

avg_ask_comments = sum(ask_comments)/len(ask_comments)
avg_show_comments = sum(show_comments)/len(show_comments)

print('Ask posts have',round(avg_ask_comments,3),'comments on average')
print('Show posts have',round(avg_show_comments,3),'comments on average')

Ask posts have 14.038 comments on average
Show posts have 10.317 comments on average
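
The gap between the two averages can also be expressed as a relative difference. This is a quick sketch that assumes the two rounded averages printed above as inputs; the variable names are illustrative.

```python
# Rounded averages copied from the output above.
avg_ask = 14.038
avg_show = 10.317

# Relative difference of Ask over Show, as a percentage.
relative_difference = (avg_ask - avg_show) / avg_show * 100
print(round(relative_difference, 1))
```

The result comes out at roughly 36%.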


As we can observe, `Ask` posts receive more comments on average than `Show` posts, about 36% more. From now on we'll focus only on the `Ask` posts and look at them more closely. With the help of the `datetime` module, we'll analyze how the hour at which a post was created relates to the number of comments it receives.

In [20]:
import datetime as dt
created_date = [row[6] for row in ask_posts]
counts_by_hour = {}
comments_by_hour = {}
result_list = zip(created_date,ask_comments)

for date,comment in result_list:
    date_dt = dt.datetime.strptime(date,'%m/%d/%Y %H:%M')
    hour = date_dt.strftime('%H')
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
        
avg_by_hour = [[i,round(comments_by_hour[i]/counts_by_hour[i],3)] for i in comments_by_hour]
print(avg_by_hour)

[['09', 5.578], ['13', 14.741], ['10', 13.441], ['14', 13.234], ['16', 16.796], ['23', 7.985], ['12', 9.411], ['17', 11.46], ['15', 38.595], ['21', 16.009], ['20', 21.525], ['02', 23.81], ['18', 13.202], ['03', 7.796], ['05', 10.087], ['19', 10.8], ['01', 11.383], ['22', 6.746], ['08', 10.25], ['04', 7.17], ['00', 8.127], ['06', 9.023], ['07', 7.853], ['11', 11.052]]
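
The parse-and-group step can be followed on a handful of rows. The sketch below is an alternative formulation, not the cell above: it assumes `collections.defaultdict` in place of the two dictionaries and uses the integer `.hour` attribute instead of a zero-padded string. The first four timestamps and comment counts come from the sample rows printed earlier; the last pair is hypothetical, added so that one hour has two posts.

```python
import datetime as dt
from collections import defaultdict

sample = [
    ("8/4/2016 11:52", 52),
    ("1/26/2016 19:30", 10),
    ("6/23/2016 22:20", 1),
    ("9/30/2015 4:12", 2),
    ("8/5/2016 11:05", 8),  # hypothetical row, not from the data set
]

# Maps hour -> [number of posts, total comments]
totals = defaultdict(lambda: [0, 0])

for created_at, n_comments in sample:
    # Same format string as above: month/day/year hour:minute
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    totals[hour][0] += 1
    totals[hour][1] += n_comments

# Average comments per post for each hour
averages = {hour: comments / posts for hour, (posts, comments) in totals.items()}
print(averages)
```

Keeping the hour as an integer makes numeric sorting straightforward; the zero-padded strings used above sort the same way alphabetically, so either representation works here.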


## Final Results

We now have the average number of comments per post for each hour of the day, but the list is unordered and hard to read, so we'll sort it by average and print the top entries in a cleaner format.

In [35]:
avg_by_hour = sorted(avg_by_hour,key=lambda row:row[1],reverse=True)
print('Top 5 Hours for \'Ask HN\' Comments:\n')
for couple in avg_by_hour[:5]:
    hour = dt.datetime.strptime(couple[0],'%H')
    hour = hour.strftime('%H:%M')
    print("{}: {:.2f} average comments per post".format(hour,couple[1]))

Top 5 Hours for 'Ask HN' Comments:

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
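
The ordering relies on `sorted` with a key function that picks the average out of each `[hour, average]` pair. As a minimal sketch, here is the same idea using `operator.itemgetter` instead of a lambda, on four of the pairs from the unsorted list shown earlier; `sample_avgs` and `top` are illustrative names.

```python
from operator import itemgetter

# A few [hour, average] pairs copied from the unsorted list above.
sample_avgs = [['09', 5.578], ['15', 38.595], ['02', 23.81], ['20', 21.525]]

# itemgetter(1) is equivalent to lambda row: row[1]
top = sorted(sample_avgs, key=itemgetter(1), reverse=True)

for hour, avg in top[:3]:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```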


As observed in the table above, the most popular hours are after lunch (15:00-16:00), after dinner (20:00-21:00) and the early morning (02:00). What these hours have in common is that people are likely to be resting or less busy, so they have more time to spare to check posts on the web.