# Exploring Hacker News Posts
### Hemanth Soni, June 2020

---

## Introduction and Overview

The goal of this project is to analyze a dataset of posts from [Hacker News](https://news.ycombinator.com/) to answer two questions:
1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

## Importing data

I will start by importing the necessary data into the project: a subset of the [full data set from Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts). This subset was prepared by the Dataquest team, who removed all submissions that didn't receive any comments and then randomly sampled the remaining submissions down to a more manageable ~20K rows (the original dataset has ~300K).

In [22]:
# Opening file and saving to list (the with block closes the file afterwards)
from csv import reader
with open('hacker_news_posts/hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

# Splitting out header and table
headers = hn[0]
hn = hn[1:]

print(headers)
print(hn[0:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
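
The rest of the analysis reads columns by position (for example `each[4]` for `num_comments`). As a minimal sketch of an alternative, `csv.DictReader` lets each column be read by its header name. The `sample_csv` object below is only a stand-in for the real file, built from two rows of the output above.

```python
import csv
import io

# Two rows copied from the output above, standing in for the real CSV file.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0,8,2,walterbell,9/30/2015 4:12\n"
)

# DictReader uses the header row as keys, so columns can be read by name.
rows = list(csv.DictReader(sample_csv))

for row in rows:
    print(row["title"], "->", int(row["num_comments"]), "comments")
```

Values still arrive as strings, so numeric columns need the same `int()` conversion used elsewhere in this notebook.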


## Splitting lists into sub-lists

I'll now split out the list into three separate groups: one for Ask posts, one for Show posts, and one for everything else. This will make it easier to conduct the analysis on each type of post and compare them.

In [14]:
ask_posts = []
show_posts = []
other_posts = []

for each in hn:
    title = each[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(each)
    elif title.lower().startswith('show hn'):
        show_posts.append(each)
    else:
        other_posts.append(each)
        
print('Total Ask posts:',len(ask_posts))
print('Total Show posts:',len(show_posts))
print('Other posts:',len(other_posts))

Total Ask posts: 1744
Total Show posts: 1162
Other posts: 17194


## Comparing "Ask" vs. "Show" posts

Now that each type of post is in its own list, I can compare Ask and Show posts to see which drives greater engagement on the platform.

### Comparing engagement by comments

In [66]:
def indexCalc(dataset, index):
    # Prints the total, row count, and average of the numeric column at `index`
    total = 0
    
    for each in dataset:
        total += int(each[index])
    
    avg = total / len(dataset)
    
    print('Total:',total)
    print('Count:',len(dataset))
    print('Average:',avg)
    print('')
    
indexCalc(ask_posts,4)
indexCalc(show_posts,4)

Total: 24483
Count: 1744
Average: 14.038417431192661

Total: 11988
Count: 1162
Average: 10.31669535283993



### Comparing engagement by points

In [67]:
indexCalc(ask_posts,3)
indexCalc(show_posts,3)

Total: 26268
Count: 1744
Average: 15.061926605504587

Total: 32019
Count: 1162
Average: 27.555077452667813



From this quick calculation, Ask posts receive more comments on average (~14.0 vs. ~10.3 for Show posts) but far fewer points (~15.1 vs. ~27.6 for Show posts).
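
The comparison above is read off printed output. As a sketch of a variation, a helper that returns the average makes it possible to compare the two groups directly in code. The name `average_of` and the empty-list behavior are assumptions for illustration; the sample rows come from the import output earlier, with the `url` column left blank for brevity.

```python
def average_of(rows, index):
    """Return the mean of the integer values found at position `index` in each row."""
    if not rows:
        return 0.0  # assumption: treat an empty list as an average of zero
    return sum(int(row[index]) for row in rows) / len(rows)


# Three rows from the import output above (url column blanked for brevity).
sample = [
    ['12224879', 'Interactive Dynamic Video', '', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise', '', '3', '1', 'hswarna', '6/17/2016 0:01'],
    ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', '', '8', '2', 'walterbell', '9/30/2015 4:12'],
]

avg_comments = average_of(sample, 4)  # num_comments column
avg_points = average_of(sample, 3)    # num_points column

print(f"Average comments: {avg_comments:.2f}")
print(f"Average points: {avg_points:.2f}")
```

With the real lists, an expression such as `average_of(ask_posts, 4) > average_of(show_posts, 4)` would state the comparison explicitly.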

## Identifying the best time of day to submit an "Ask HN" post

### To maximize comments

By examining the dataset of Ask posts, I can begin to understand the best time of day to make a post (where the objective is to maximize comments). I'll start by counting the posts and comments for each hour of creation, then compute the average comments per post for each hour to identify the best one.

In [57]:
import datetime as dt

result_list = []

# Extracting needed data into a separate list
for each in ask_posts:
    result_list.append([each[6], int(each[4]), int(each[3])])

# Creating frequency tables of the number of posts and comments by hour
counts_by_hour, comments_by_hour = {}, {}

for each in result_list:
    date = dt.datetime.strptime(each[0], "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(date, "%H")
    
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += int(each[1])
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = int(each[1])

# Creating a list of the average number of comments per post by hour
avg_by_hour = []

for each in counts_by_hour:
    counts = counts_by_hour[each]
    comments = comments_by_hour[each]
    avg_by_hour.append([each, comments/counts])

# Sorting the above list to be more easily readable

swap_avg_by_hour = []

for each in avg_by_hour:
    swap_avg_by_hour.append([each[1], each[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Displaying top 5 hours to post

print('Top 5 hours for Ask Hacker News comments:')

output = '{}: {:.2f} average comments per post'

for each in sorted_swap[:5]:
    time = dt.datetime.strptime(each[1], "%H")
    time = dt.datetime.strftime(time, "%H:%M")
    print(output.format(time, each[0]))

Top 5 hours for Ask Hacker News comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Since the data is already in my time zone (Eastern), no adjustment is needed. The best hours to post are mid-afternoon (3pm), late night (2am), and evening (8pm). This wide spread suggests that different cohorts of users may be logging in from different time zones throughout the day. More granular data on where posts are made from (e.g. IP addresses) could help test this hypothesis.
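
The counting, averaging, and sorting steps in this section can also be written more compactly. The sketch below is one possible variation, not the code that produced the results above: it uses `collections.defaultdict` to avoid the `if hour in ...` checks and sorts with a `key` function instead of swapping columns. The `sample` pairs are made up purely for illustration.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs, for illustration only.
sample = [
    ("8/4/2016 15:10", 40),
    ("8/5/2016 15:45", 20),
    ("8/5/2016 2:30", 12),
    ("8/6/2016 20:05", 9),
]

# hour -> [post count, comment total]
totals = defaultdict(lambda: [0, 0])

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    totals[hour][0] += 1
    totals[hour][1] += num_comments

averages = {hour: comments / count for hour, (count, comments) in totals.items()}

# Sort by the average itself, highest first.
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)

for hour, avg in ranked:
    print(f"{hour:02d}:00: {avg:.2f} average comments per post")
```

Keeping the hour as an integer also removes the need to parse it back with `strptime` before printing.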

### To maximize points

In [63]:
# Creating frequency tables for points by hour
points_by_hour = {}

for each in result_list:
    date = dt.datetime.strptime(each[0], "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(date, "%H")
    
    if hour in points_by_hour:
        points_by_hour[hour] += int(each[2])
    else:
        points_by_hour[hour] = int(each[2])

# Creating a list that calculates the average points per post by hour
avg_points_by_hour = []

for each in points_by_hour:
    counts = counts_by_hour[each]
    points = points_by_hour[each]
    avg_points_by_hour.append([each, points/counts])

# Sorting the above list to be more easily readable

swap_avg_points_by_hour = []

for each in avg_points_by_hour:
    swap_avg_points_by_hour.append([each[1], each[0]])

sorted_points_swap = sorted(swap_avg_points_by_hour, reverse=True)

# Displaying top 5 hours to post

print('Top 5 hours for Ask Hacker News points:')

output = '{}: {:.2f} average points per post'

for each in sorted_points_swap[:5]:
    time = dt.datetime.strptime(each[1], "%H")
    time = dt.datetime.strftime(time, "%H:%M")
    print(output.format(time, each[0]))

Top 5 hours for Ask Hacker News points:
15:00: 29.99 average points per post
13:00: 24.26 average points per post
16:00: 23.35 average points per post
17:00: 19.41 average points per post
10:00: 18.68 average points per post


For points, the spread is narrower: four of the top five hours fall between 1pm and 5pm Eastern, with 3pm the best. 3pm was also the best hour for engagement (comments), so it appears to be the ideal time to submit an 'Ask HN' post on Hacker News.
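
As a quick check on that conclusion, the two top-5 lists can be compared directly. This is a small sketch using the hours copied from the two outputs above; the variable names are chosen for illustration.

```python
# Top-5 hours copied from the two outputs above, in ranked order.
top_comment_hours = ["15:00", "02:00", "20:00", "16:00", "21:00"]
top_point_hours = ["15:00", "13:00", "16:00", "17:00", "10:00"]

# Keep the comment ranking order while filtering to hours that also rank for points.
in_both = [hour for hour in top_comment_hours if hour in top_point_hours]

print("Hours in both top-5 lists:", in_both)
```

Only 15:00 and 16:00 appear in both rankings, and 15:00 leads each of them.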