# Exploring Hacker News
In this project, we will analyze the popular tech news site [Hacker News](https://news.ycombinator.com).

Hacker News is an extremely popular website in technology and startup circles where users submit stories (a.k.a. "posts") that other users can vote and comment on. It works much like [Reddit](https://www.reddit.com), and many of its posts are even referenced there.

For our analysis, we'll use [this dataset](https://www.kaggle.com/hacker-news/hacker-news-posts) from Kaggle, which contains the posts between September 2015 and 2016. Note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by first removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.

Below are the descriptions of the columns:
- 0 `id`: The unique identifier from Hacker News for the post
- 1 `title`: The title of the post
- 2 `url`: The URL that the post links to, if the post has a URL
- 3 `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- 4 `num_comments`: The number of comments that were made on the post
- 5 `author`: The username of the person who submitted the post 
- 6 `created_at`: The date and time at which the post was submitted
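
The code in the following sections refers to these columns by position (for example `row[1]` for the title). As a minimal sketch, named index constants could make such lookups easier to read. The constant names and the `sample_row` values are illustrative assumptions and are not part of the dataset or the notebook.

```python
# Hypothetical constants mirroring the column order listed above.
ID, TITLE, URL, NUM_POINTS, NUM_COMMENTS, AUTHOR, CREATED_AT = range(7)

# Made-up row shaped like a row from the CSV file (every value is a string).
sample_row = ['1', 'Ask HN: Example question?', '', '5', '3',
              'example_user', '8/4/2016 11:52']

title = sample_row[TITLE]
# Counts are read as strings, so they must be converted before doing math.
num_comments = int(sample_row[NUM_COMMENTS])

print(title, num_comments)
```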

## 1. Objectives
We're specifically interested in posts whose titles begin with `Ask HN` or `Show HN`.
- Users submit `Ask HN` posts to ask the community questions.
- Users submit `Show HN` posts to show the community a project.

We'll compare these two types of posts to determine whether `Ask HN` or `Show HN` posts receive more comments on average, and whether posts created at a certain time receive more comments.

## 2. Loading the data
Our data is stored in `hacker_news.csv`, so let's use Python to load the file and store it in a list of rows.

In [8]:
from csv import reader

# read the whole file into a list of rows; the with block closes the file
with open("hacker_news.csv") as opened_file:
    hacker_news_list = list(reader(opened_file))

# split header and body
hacker_news_header = hacker_news_list[0]
hacker_news_data = hacker_news_list[1:]

# print length and a few rows
print("Length of our dataset:", len(hacker_news_data))
print("\nHeader:",hacker_news_header)
print("\nFirst few rows:")
for row in hacker_news_data[0:4]:
    print(row)

Length of our dataset: 20100

Header: ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

First few rows:
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


### 2.1. Separating the posts
Now that all of our data is loaded, we need to separate the posts whose titles begin with either `Ask HN` or `Show HN`, in any letter case.

In [10]:
ask_posts = []
show_posts = []
other_posts = []

for row in hacker_news_data:
    title = row[1].lower()
    
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('There are {posts} ask posts'.format(posts=len(ask_posts)))
print('There are {posts} show posts'.format(posts=len(show_posts)))
print('There are {posts} other posts'.format(posts=len(other_posts)))

There are 1744 ask posts
There are 1162 show posts
There are 17194 other posts
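
The case-insensitive prefix check can also be wrapped in a small helper, which makes it easy to try on individual titles. This is only a sketch: the function name `classify_title` and the sample titles are assumptions made for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the title prefix, ignoring case."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Illustrative titles; the first two are made up.
sample_titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Lowercasing the title once before comparing is what lets `ASK HN`, `Ask HN` and `ask hn` all land in the same group.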


## 3. Analysis
With the prepared data, let's start our analysis.

### 3.1. Comments on Ask- vs. Show-Posts
As mentioned, we first want to determine whether ask or show posts receive more comments.

In [13]:
# calculate average comments on ask posts
total_ask_comments = 0

for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of comments on ask posts:", avg_ask_comments)

# calculate average comments on show posts
total_show_comments = 0

for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
print("Average number of comments on show posts:", avg_show_comments)

Average number of comments on ask posts: 14.038417431192661
Average number of comments on show posts: 10.31669535283993


Ask posts (about 14 comments on average) gather more comments than show posts (about 10 comments on average), though show posts don't fall far behind.
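
Because the two loops above differ only in the list they iterate over, the calculation could be factored into one function. The following is a sketch under that assumption; `average_comments` and the sample rows are hypothetical and only mimic the row layout of the dataset.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments over a list of rows; 0.0 for an empty list."""
    if not posts:
        return 0.0  # avoid dividing by zero when there are no posts
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Made-up rows in the same column order as the dataset.
sample_posts = [
    ['1', 'Ask HN: A?', '', '2', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B?', '', '1', '4', 'user_b', '8/4/2016 12:10'],
]

sample_average = average_comments(sample_posts)
print(sample_average)
```

With the lists from this notebook, the same function could be called once for `ask_posts` and once for `show_posts`.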

### 3.2. Comments and time
Next, we'll determine if posts created at a certain time are more likely to attract comments. Since we found that ask posts have more comments in general, we will focus on ask posts only for this analysis.

To perform this analysis, we will:
- Calculate the number of ask posts created in each hour of the day, along with the number of comments they received.
- Calculate the average number of comments ask posts receive by hour created.
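
Both steps depend on extracting the hour from the `created_at` string. As a small sketch, here is that extraction for a single timestamp, using the value from the first row printed earlier; the variable names are chosen for illustration.

```python
import datetime as dt

# Month/day/year followed by a 24-hour clock time, as in the dataset.
date_format = "%m/%d/%Y %H:%M"
sample_timestamp = "8/4/2016 11:52"

# strptime turns the string into a datetime object ...
parsed = dt.datetime.strptime(sample_timestamp, date_format)
# ... and strftime turns the hour back into a zero-padded string.
hour_label = parsed.strftime("%H")

print(parsed, hour_label)
```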

In [33]:
import datetime as dt
result_list = []

# gather base data
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])

# extract and assign times
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    created_at = row[0]
    num_comments = row[1]
    created_at_date = dt.datetime.strptime(created_at,date_format)
    created_at_hour = created_at_date.strftime("%H")
    
    if created_at_hour not in counts_by_hour:
        counts_by_hour[created_at_hour] = 1
        comments_by_hour[created_at_hour] = num_comments
    else:
        counts_by_hour[created_at_hour] += 1
        comments_by_hour[created_at_hour] += num_comments

# confirm we got this right
print("Length of counts_by_hour:",len(counts_by_hour))
print("Length of comments_by_hour:",len(comments_by_hour))

comments_by_hour

Length of counts_by_hour: 24
Length of comments_by_hour: 24


{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}

Next, we need to calculate the average number of comments per post for each hour.

In [36]:
avg_comments_by_hour = []

for hour in counts_by_hour:
    avg_comments = comments_by_hour[hour] / counts_by_hour[hour]
    avg_comments_by_hour.append([hour, avg_comments])
    
avg_comments_by_hour

[['02', 23.810344827586206],
 ['22', 6.746478873239437],
 ['21', 16.009174311926607],
 ['08', 10.25],
 ['11', 11.051724137931034],
 ['01', 11.383333333333333],
 ['00', 8.127272727272727],
 ['05', 10.08695652173913],
 ['14', 13.233644859813085],
 ['17', 11.46],
 ['07', 7.852941176470588],
 ['13', 14.741176470588234],
 ['23', 7.985294117647059],
 ['10', 13.440677966101696],
 ['12', 9.41095890410959],
 ['16', 16.796296296296298],
 ['20', 21.525],
 ['06', 9.022727272727273],
 ['09', 5.5777777777777775],
 ['04', 7.170212765957447],
 ['15', 38.5948275862069],
 ['19', 10.8],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297]]

Before we can draw a conclusion, we need to sort the list of average values to get a clearer picture.

In [37]:
# swap the columns first
swap_avg_by_hour = []

for row in avg_comments_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
swap_avg_by_hour

[[23.810344827586206, '02'],
 [6.746478873239437, '22'],
 [16.009174311926607, '21'],
 [10.25, '08'],
 [11.051724137931034, '11'],
 [11.383333333333333, '01'],
 [8.127272727272727, '00'],
 [10.08695652173913, '05'],
 [13.233644859813085, '14'],
 [11.46, '17'],
 [7.852941176470588, '07'],
 [14.741176470588234, '13'],
 [7.985294117647059, '23'],
 [13.440677966101696, '10'],
 [9.41095890410959, '12'],
 [16.796296296296298, '16'],
 [21.525, '20'],
 [9.022727272727273, '06'],
 [5.5777777777777775, '09'],
 [7.170212765957447, '04'],
 [38.5948275862069, '15'],
 [10.8, '19'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03']]

In [42]:
# sort in descending order of average comments
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Post Comments")
output_template = "{time}: {avg:.2f} average comments per post"
for row in sorted_swap[0:5]:
    time = dt.datetime.strptime(row[1], "%H").strftime("%H:%M")
    avg_comments = row[0]
    print(output_template.format(time=time, avg=avg_comments))

Top 5 Hours for Ask Post Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
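
Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` also accepts a `key` function, so the `[hour, average]` pairs could be sorted directly. The sample list below reuses a few values from the output above, and the variable names are illustrative.

```python
# A few [hour, average] pairs copied from the averages computed earlier.
sample_avg_by_hour = [
    ['08', 10.25],
    ['20', 21.525],
    ['15', 38.5948275862069],
    ['17', 11.46],
]

# Sort by the second element of each pair, largest average first.
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

One difference to keep in mind: if two hours had exactly the same average, the swapped-list approach would order them by hour, while the `key` approach keeps them in their original order.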


Since the timestamps in the dataset use US Eastern Time, the best hours to submit ask posts, converted to the Vienna time zone (+6 hours), are 21:00, 08:00, 02:00, 22:00 and 03:00, in this order.
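
The conversion is a six-hour shift that wraps around midnight. A short sketch of that arithmetic follows, assuming the offset stays constant at six hours, as stated in the text; the variable names are illustrative.

```python
# Top five hours in US Eastern Time, taken from the output above.
top_eastern_hours = [15, 2, 20, 16, 21]

# Assumption: a fixed offset of +6 hours between US Eastern Time and Vienna.
OFFSET_HOURS = 6

# The modulo wraps hours past midnight back into the 0-23 range.
vienna_hours = [(hour + OFFSET_HOURS) % 24 for hour in top_eastern_hours]
vienna_labels = ["{:02d}:00".format(hour) for hour in vienna_hours]

print(vienna_labels)
```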

## 4. Conclusion
In this project, we analyzed ask and show posts from the Hacker News community. We tried to figure out which type of post receives more comments and at which hour ask posts attract the most comments. Based on the results, we recommend submitting an ask post at 15:00 US Eastern Time, which is 21:00 in Vienna.

Note, however, that the dataset excludes all posts that did not receive any comments. The results therefore only apply to posts that do receive comments.