# Exploring Popular Hacker News Posts

This project analyzes a Kaggle dataset to find which kinds of posts receive the most comments on average on Hacker News, a popular technology website where users submit posts that other users vote and comment on. As on Reddit, posts that reach the top can attract thousands of visitors. We will look at posts whose titles begin with "Ask HN" (questions to the community) or "Show HN" (projects or other things users want to share) and compare the average number of comments each type receives. We will also check whether posts created at certain times of day receive more comments on average.


## Concepts Used

* The basics of programming in Python (arithmetical operations, variables, common data types, etc.)
* Jupyter Notebook
* Working with Strings
* Object-oriented programming
* Dates and times
* Lists and for loops
* Conditional statements
* Dictionaries
* Functions 

## Data Overview

The dataset contains information on Hacker News posts for a 12-month period (up to September 26, 2016).

[Hacker News Posts](https://www.kaggle.com/hacker-news/hacker-news-posts): The dataset has been reduced from almost 300,000 to approximately 20,000 rows by removing submissions that didn't receive any comments and randomly sampling from the remaining submissions.

## Data Exploration

We will load the dataset and look at the first couple of rows. We will also print, describe and identify columns that can help us with our analysis.

In [1]:
from csv import reader

### Hacker News Dataset ###

# Read the whole file into a list of lists; the with block closes the file afterwards
with open("hacker_news.csv") as file:
    hn = list(reader(file))

for row in hn[:5]:
    print(row)
    print('\n')
    
print('Number of rows:', len(hn))
print('Number of columns:', len(hn[0]))

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


Number of rows: 20101
Number of columns: 7
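
As an alternative to indexing rows by position, the standard library's `csv.DictReader` maps each row to the header names. The following is a minimal sketch, not the notebook's own code: it uses a two-row in-memory sample (copied from the output above) in place of `hacker_news.csv`, and the names `sample_csv` and `sample_rows` are made up for illustration.

```python
import csv
import io

# In-memory stand-in for hacker_news.csv (header plus two rows from the output above).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,3,1,hswarna,6/17/2016 0:01\n"
)

# DictReader consumes the header row and yields one dict per data row.
sample_rows = list(csv.DictReader(sample_csv))

for row in sample_rows:
    print(row["title"], "|", row["num_comments"], "comments")
```

Note that the values are still strings, so `num_comments` would need an `int()` conversion before any arithmetic.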


For the HN dataset, we have data describing 20,100 posts (the first row is the header) with 7 columns. Here are the column descriptions from the Kaggle website.


* "title": title of the post

* "url": the url of the item being linked to

* "num_points": the number of upvotes the post received

* "num_comments": the number of comments the post received

* "author": the name of the account that made the post

* "created_at": the date and time the post was made (the time zone is Eastern Time in the US) 

At first glance, `title`, `num_points`, `num_comments`, and `created_at` seem to be the most useful columns.

In [2]:
# Extracting the header row

hn_header = hn[0]

# Removing the header row from our dataset

hn = hn[1:]
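
The cells that follow refer to columns by hard-coded positions such as `row[1]` and `post[4]`. One way to make those positions self-documenting is to build a name-to-index lookup from the header. This is only a sketch: `header`, `col` and `sample_row` are illustrative names, and the values are copied from the output above.

```python
# Header row as printed above; in the notebook this is `hn_header`.
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position so later code can avoid magic numbers.
col = {name: position for position, name in enumerate(header)}

sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

print(col['title'], col['num_comments'], col['created_at'])
print(sample_row[col['title']], int(sample_row[col['num_comments']]))
```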

## Data Analysis: Ask HN and Show HN Posts

* Since we want to analyze **Ask HN** and **Show HN** posts, we will create a separate list for each type.

* We will also display a couple of rows from each of the new lists.

* Then, we will calculate the average number of comments for the **Ask HN** and **Show HN** lists.

In [3]:
# Using a for loop to extract posts based on title

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("# of ASK HN posts:",len(ask_posts))
print('\n')
print("# of SHOW HN posts:",len(show_posts))
print('\n')
print("# of OTHER posts:",len(other_posts))
print('\n')

print("ASK HN Posts")
print('\n')
for row in ask_posts[:5]:
    print(row)
    print('\n')

print("SHOW HN Posts")
print('\n')
for row in show_posts[:5]:
    print(row)
    print('\n')

# of ASK HN posts: 1744


# of SHOW HN posts: 1162


# of OTHER posts: 17194


ASK HN Posts


['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']


['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']


['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']


['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']


['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']


SHOW HN Posts


['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']


['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']

Now, let's calculate the average number of comments for *Ask HN* and *Show HN* posts.

In [4]:
# Average number of comments per Ask HN post
total_ask_comments = 0
ask_post_count = 0

for post in ask_posts:
    comments = int(post[4])  # num_comments column
    ask_post_count += 1
    total_ask_comments += comments
    
avg_ask_comments = total_ask_comments/ask_post_count
print("Ask Comments Average:" + str(avg_ask_comments))

# Average number of comments per Show HN post
total_show_comments = 0
show_post_count = 0

for post in show_posts:
    comments = int(post[4])  # num_comments column
    show_post_count += 1
    total_show_comments += comments
    
avg_show_comments = total_show_comments/show_post_count
print("Show Comments Average:" + str(avg_show_comments))

Ask Comments Average:14.038417431192661
Show Comments Average:10.31669535283993
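
The two averages are computed with the same logic, so the calculation could be factored into a small helper. The sketch below assumes rows shaped like those in the dataset, with the comment count at index 4. The function name `average_comments` and the `sample_posts` list are hypothetical, and the sample rows are copied from the Ask HN output above.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments across posts, or 0.0 if posts is empty."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Three Ask HN rows copied from the output above.
sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]

print(average_comments(sample_posts))  # (6 + 29 + 1) / 3 = 12.0
```

The guard for an empty list avoids a `ZeroDivisionError` if a filter ever matches no posts.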


*Ask* posts receive about 3.7 more comments on average than *Show* posts (14.04 versus 10.32). Since Ask HN posts attract more comments, we'll focus the rest of the analysis on them. To check whether the creation time affects the number of comments, we'll build two dictionaries:

* `counts_by_hour`: the number of Ask HN posts created during each hour of the day.

* `comments_by_hour`: the total number of comments received by Ask HN posts created during each hour.

In [5]:
import datetime as dt

result_list = []

for post in ask_posts:
    created_at = post[6]
    n_comments = int(post[4])
    result_list.append([created_at, n_comments])

counts_by_hour = {}
comments_by_hour = {}

for item in result_list:
    dt_object = dt.datetime.strptime(item[0], "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(dt_object,"%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = item[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += item[1] 

print("Ask Posts created according to each hour")
print('\n')
print(counts_by_hour)
print('\n')
print("Comments received according to each hour")
print('\n')
print(comments_by_hour)

Ask Posts created according to each hour


{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


Comments received according to each hour


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
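
The same tally can be written without the `if`/`else` branch by using `collections.defaultdict`, and the hour can be read directly from the parsed `datetime` object. This is a sketch on a handful of `[created_at, n_comments]` pairs taken from the Ask HN rows printed earlier. The `sample_*` names are illustrative, and the hour keys here are integers rather than the zero-padded strings used in the notebook.

```python
import datetime as dt
from collections import defaultdict

# [created_at, n_comments] pairs in the same shape as `result_list`.
sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
]

# Missing keys start at 0, so no membership check is needed.
sample_counts = defaultdict(int)
sample_comments = defaultdict(int)

for created_at, n_comments in sample_results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour  # int in 0..23
    sample_counts[hour] += 1
    sample_comments[hour] += n_comments

print(dict(sample_counts))
print(dict(sample_comments))
```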


Now we will calculate the average number of comments per post for each hour. This will help us determine whether posting at a certain time tends to attract more comments.

In [6]:
avg_by_hour = []

for comment in comments_by_hour:
    avg_by_hour.append([comment,comments_by_hour[comment]/counts_by_hour[comment]])

print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


We'll finish by sorting the list of lists and printing the five highest values.

In [7]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

print(swap_avg_by_hour)
print('\n')

sorted_swap = sorted(swap_avg_by_hour,reverse=True)
print("Top 5 Hours for Ask Posts Comments")
print('\n')

for item in sorted_swap[:5]:
    output = "{time}: {average:.2f} comments per post"
    time = dt.datetime.strptime(item[1],"%H")
    time_update = dt.datetime.strftime(time,"%H:%M")
    print(output.format(time=time_update,average=item[0]))
    print('\n')

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


Top 5 Hours for Ask Posts Comments


15:00: 38.59 comments per post


02:00: 23.81 comments per post


20:00: 21.52 comments per post


16:00: 16.80 comments per post


21:00: 16.01 comments per post
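
Swapping the columns is one way to sort by the average. Another option is to pass a `key` function to `sorted`, which leaves the `[hour, average]` layout untouched. The sketch below uses a few pairs copied from `avg_by_hour`, and the names `sample_avg_by_hour` and `top_hours` are illustrative.

```python
# A few [hour, average] pairs copied from avg_by_hour above.
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
    ['16', 16.796296296296298],
]

# Sort on the average (index 1), largest first, without swapping columns.
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in top_hours[:3]:
    print(f"{hour}:00: {average:.2f} comments per post")
```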




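The timestamps in the dataset are in US Eastern Time. Based on this sample, the best time to post an Ask HN question is between 3:00 pm and 3:59 pm Eastern Time, when posts averaged 38.59 comments, compared with 23.81 for the next-best hour (2:00 am).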
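
Readers in other time zones may want to translate that hour into local time. The sketch below does this with fixed offsets from the standard library. It assumes EST is UTC-5 and PST is UTC-8 and ignores daylight saving time, so the result can be off by an hour for part of the year. The date is an arbitrary day within the dataset's range, and the variable names are illustrative.

```python
import datetime as dt

# Assumption: fixed offsets, EST = UTC-5 and PST = UTC-8 (daylight saving ignored).
EST = dt.timezone(dt.timedelta(hours=-5), "EST")
PST = dt.timezone(dt.timedelta(hours=-8), "PST")

# 15:00 Eastern on an arbitrary date within the dataset's range.
peak_est = dt.datetime(2016, 1, 26, 15, 0, tzinfo=EST)

# astimezone converts the same instant into another offset.
peak_utc = peak_est.astimezone(dt.timezone.utc)
peak_pst = peak_est.astimezone(PST)

print(peak_utc.strftime("%H:%M %Z"))
print(peak_pst.strftime("%H:%M %Z"))
```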

It is worth noting that the dataset we analyzed excludes posts that received no comments. It is therefore more accurate to say that, among Ask HN posts that received at least one comment, those created during the 3 pm hour (Eastern Time) received the most comments on average.