# Analyzing Hacker News Posts

## Introduction

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that reach the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

Users submit `Ask HN` posts to ask the Hacker News community a specific question. Likewise, users submit `Show HN` posts to show the Hacker News community a project, a product, or just something interesting.

## Dataset

You can find the original dataset [here](https://www.kaggle.com/hacker-news/hacker-news-posts). The version used in this analysis has been reduced from approximately 300,000 rows to approximately 20,000 rows (available in the repository) by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions. (Courtesy: Dataquest)

## Problem Statement

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

## Reading and Exploring the dataset

In [1]:
import csv

# Read every row up front; the context manager closes the file afterwards
with open("hacker_news.csv", newline="") as file:
    hn = list(csv.reader(file))

# Exploring header row
header = hn[0]
print(header)

hn = hn[1:]

# Testing the dataset
print('\n')
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Extracting `Ask HN` or `Show HN` posts

Since we're only concerned with post titles beginning with `Ask HN` or `Show HN`, we'll create new lists of lists containing just the data for those titles.

To do that, we only need to check whether each title starts with one of these two phrases, ignoring letter case.

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
# Number of posts in each category
print("Ask HN Posts:", len(ask_posts))
print("Show HN Posts:", len(show_posts))
print("Other Posts:", len(other_posts))

Ask HN Posts: 1744
Show HN Posts: 1162
Other Posts: 17194
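
As a quick sanity check on the prefix matching used above, the sketch below wraps the same test in a small helper and runs it on a few titles. The helper name `categorize` and all but one of the sample titles are made up for illustration; only `Interactive Dynamic Video` comes from the dataset preview.

```python
def categorize(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # normalize case so 'ASK HN' and 'ask hn' both match
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Mostly hypothetical titles, for illustration only
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My weekend project",
    "Interactive Dynamic Video",
    "Why I ask HN for advice",  # phrase appears mid-title, so it is not matched
]

for title in sample_titles:
    print(categorize(title), "-", title)
```

Because `startswith` only looks at the beginning of the string, a title that merely mentions "ask hn" somewhere in the middle falls into the "other" group.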


## Calculating the average number of comments on `Ask HN` and `Show HN` posts

This brings us to our first question:

Do `Ask HN` or `Show HN` posts receive more comments on average?

Let's find out.

In [3]:
def average_comments(posts):
    total_comments = 0
    for post in posts:
        num_comments = int(post[4])
        total_comments += num_comments
        
    avg_comments = total_comments/len(posts)
    return round(avg_comments)

print("Average comments on Ask HN Posts:",average_comments(ask_posts))
print("Average comments on Show HN Posts:",average_comments(show_posts))
print("Average comments on All other Posts:",average_comments(other_posts))

Average comments on Ask HN Posts: 14
Average comments on Show HN Posts: 10
Average comments on All other Posts: 27


As we can observe from the output above:
- Other posts receive more comments on average than either `Ask HN` or `Show HN` posts.
- **However, on average, `Ask HN` posts attract more comments than `Show HN` posts.**
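
The averages above are rounded to whole numbers, which can hide small differences between groups. As a minimal sketch, assuming we want to keep two decimal places, the same calculation can be written with `statistics.mean`. The rows in `sample_posts` and the function name `average_comments_exact` are hypothetical and only mirror the dataset's column order.

```python
from statistics import mean

# Hypothetical rows in the same column order as the dataset:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ["1", "Ask HN: Example A", "", "5", "3", "user_a", "1/1/2016 10:00"],
    ["2", "Ask HN: Example B", "", "2", "8", "user_b", "1/2/2016 11:30"],
    ["3", "Ask HN: Example C", "", "9", "2", "user_c", "1/3/2016 12:45"],
]

def average_comments_exact(posts):
    """Mean of the num_comments column (index 4), rounded to two decimals."""
    return round(mean(int(post[4]) for post in posts), 2)

print(average_comments_exact(sample_posts))  # 4.33
```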

## Finding number of comments by hour

We'll determine whether we can maximize the number of comments an ask post receives by creating it at a certain time. First, we'll find the number of ask posts created during each hour of the day, along with the number of comments those posts received. Then, we'll calculate the average number of comments received by ask posts created at each hour of the day.

Below we build two dictionaries from our list: one contains the number of ask posts created during each hour of the day, and the other contains the total number of comments received by ask posts created during that hour.
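
Grouping by hour depends on turning each `created_at` string into an hour label. The sketch below shows that single step in isolation, using two timestamps from the preview rows printed earlier. The helper name `extract_hour` is an assumption made for illustration.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

def extract_hour(created_at):
    """Parse a created_at string and return the hour as a zero-padded string."""
    parsed = dt.datetime.strptime(created_at, date_format)  # string -> datetime
    return parsed.strftime("%H")                            # datetime -> 'HH'

# Two created_at values taken from the dataset preview
for raw in ["8/4/2016 11:52", "6/17/2016 0:01"]:
    print(raw, "->", extract_hour(raw))
```

Note that `strptime` accepts months, days, and hours without leading zeros, while `strftime("%H")` always returns two digits, so `0:01` becomes `00`. Using the zero-padded string as the dictionary key keeps the hours sortable as text.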

In [4]:
import datetime as dt

counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

def posts_by_hour(posts):
    for post in posts:
        num_comments = int(post[4])
        date = post[6]
        time = dt.datetime.strptime(date, date_format).strftime("%H")

        if time not in counts_by_hour:
            counts_by_hour[time] = 1
            comments_by_hour[time] = num_comments
        else:
            counts_by_hour[time] += 1
            comments_by_hour[time] += num_comments

posts_by_hour(ask_posts)

We have just created a function that counts the total number of posts and the total number of comments for each hour and stores them in two dictionaries.

Now, for our second goal, we need to find the average number of comments per post for each hour.

In [5]:
avg_comments_by_hour = []

for hr in comments_by_hour:
    avg_comments_by_hour.append([round(comments_by_hour[hr]/counts_by_hour[hr]), hr])

print("Average Comments by Hour:")
sorted(avg_comments_by_hour, reverse=True)

Average Comments by Hour:


[[39, '15'],
 [24, '02'],
 [22, '20'],
 [17, '16'],
 [16, '21'],
 [15, '13'],
 [13, '18'],
 [13, '14'],
 [13, '10'],
 [11, '19'],
 [11, '17'],
 [11, '11'],
 [11, '01'],
 [10, '08'],
 [10, '05'],
 [9, '12'],
 [9, '06'],
 [8, '23'],
 [8, '07'],
 [8, '03'],
 [8, '00'],
 [7, '22'],
 [7, '04'],
 [6, '09']]

Looking at the output above, we can see that:

- 15:00 has 39 average comments per post
- 02:00 has 24 average comments per post
- 20:00 has 22 average comments per post
- 16:00 has 17 average comments per post
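
A summary like the one above can also be generated directly from the sorted list. This is a minimal sketch: `top_hours` is copied by hand from the first four entries of the output, and the helper name `format_row` is hypothetical.

```python
import datetime as dt

# First four [average, hour] pairs copied from the sorted output above
top_hours = [[39, "15"], [24, "02"], [22, "20"], [17, "16"]]

def format_row(avg, hour):
    """Turn an hour label such as '15' into '15:00' and build a summary line."""
    clock = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    return f"{clock} has {avg} average comments per post"

for avg, hour in top_hours:
    print(format_row(avg, hour))
```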

The hour with the highest average (15:00) receives about 60% more comments per post than the hour with the second-highest average (02:00).
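
The roughly 60% figure follows from the two rounded averages in the output. A short worked calculation, using those rounded values rather than the exact averages:

```python
highest = 39         # rounded average for posts created at hour 15
second_highest = 24  # rounded average for posts created at hour 02

# Relative increase of the highest average over the second highest
pct_increase = (highest - second_highest) / second_highest * 100
print(f"{pct_increase:.1f}%")  # 62.5%
```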

## Conclusion

Let's answer the two questions from the problem statement.

**Do `Ask HN` or `Show HN` posts receive more comments on average?**

`Ask HN` posts received about 14 comments per post on average, compared to about 10 comments per post for `Show HN` posts.

**Do posts created at a certain time receive more comments on average?**

Yes. `Ask HN` posts created between 15:00 and 16:00 EST received the most comments on average, at about 39 comments per post.