# Exploring Hacker News Posts

In this project, we'll be conducting some analysis on a subset of a dataset cataloguing posts to the [Hacker News site](https://news.ycombinator.com/news).

For this analysis, we'll home in on "Ask HN" posts, which ask the Hacker News community for answers or opinions on a question, and "Show HN" posts, which share a project, product, or something generally interesting with the community. Some examples:
>Ask HN: How to improve my personal website? <br>
Ask HN: Am I the only one outraged by Twitter shutting down share counts? <br>
Show HN: Something pointless I made <br>
Show HN: Shanhu.io, a programming playground powered by e8vm

The key questions we intend to answer are:
* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

The [dataset](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts) we're making use of is sourced from Kaggle. Some caveats:
* We've reduced the nearly 300,000 rows to about 20,000 by removing all submissions that received no comments and then taking a random sample of the remainder. This should make the results more relevant and is enough for the light analysis we have planned.
* The dataset covers only the 12 months up to September 26, 2016.

Here's the description of the data (columns) from the Kaggle details:
> * title: title of the post (self explanatory)
> * url: the url of the item being linked to
> * num_points: the number of upvotes the post received
> * num_comments: the number of comments the post received
> * author: the name of the account that made the post
> * created_at: the date and time the post was made (the time zone is Eastern Time in the US)

---

## Opening and Exploring the Data

We'll begin by importing the library we need and reading the dataset into a list of lists.


In [20]:
import csv

# Read the whole file into a list of lists; the with-block closes the file
with open('hacker_news.csv', 'r') as fhand:
    hn = list(csv.reader(fhand))

# Display the first five rows
print(hn[:5])

# Extract the first row of data (the dataset's headers)
headers = hn[0]
print(headers)
hn = hn[1:]
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte'
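The header-splitting step can also be wrapped in a small function. The following is a minimal sketch, not the notebook's own code: `read_rows` is a hypothetical helper, and an in-memory file holding two rows copied from the output above stands in for `hacker_news.csv` so the example runs without the dataset.

```python
import csv
import io

# Two rows copied from the output above; the in-memory file is an assumption
# that stands in for hacker_news.csv.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,"
    "3,1,hswarna,6/17/2016 0:01\n"
)


def read_rows(file_obj):
    """Return (headers, rows) from an open CSV file object (hypothetical helper)."""
    reader = csv.reader(file_obj)
    headers = next(reader)  # the first row holds the column names
    rows = list(reader)     # every remaining row is a post
    return headers, rows


with sample_csv as fhand:   # the with-block closes the file when done
    sample_headers, sample_rows = read_rows(fhand)

print(sample_headers)
print(len(sample_rows))
```

With the real file, the same function could be called on `open('hacker_news.csv', 'r')` instead of the in-memory sample.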

We'll now divide the dataset into lists for Ask HN, Show HN, and Other Posts (the remainder) for the analysis to come.

In [18]:
ask_posts = list()
show_posts = list()
other_posts = list()

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Number of Ask HN posts: {}'.format(len(ask_posts)))
print('Number of Show HN posts: {}'.format(len(show_posts)))
print('Number of Other HN posts: {}'.format(len(other_posts)))

Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of Other HN posts: 17194
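The prefix test used in the loop above can be checked on its own. This is a small sketch under the assumption that a post's category depends only on how its title starts; `classify_title` is a hypothetical name, and the example titles are taken from earlier in this notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()  # lowercase once so the match is case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


example_titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
]

for example_title in example_titles:
    print(classify_title(example_title), '-', example_title)
```

Keeping the rule in one function makes it easy to test edge cases, such as titles that mention "Ask HN" somewhere other than the start.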


Next, we'll determine if Ask HN or Show HN posts receive more comments on average.

In [19]:
total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average Ask HN comments: {}'.format(avg_ask_comments))

total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print('Average Show HN comments: {}'.format(avg_show_comments))

Average Ask HN comments: 14.038417431192661
Average Show HN comments: 10.31669535283993


The short answer is that Ask HN posts receive more comments on average (~14 comments per post vs. ~10).
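Because the Ask HN and Show HN calculations follow the same pattern, one option is to compute both with a single helper so the total and the post count always come from the same list. The sketch below is illustrative only: `average_comments` and the toy rows are hypothetical and not part of the dataset.

```python
def average_comments(posts, comments_index=4):
    """Mean of the comments column for a list of rows (hypothetical helper)."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty category
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)  # the divisor is always the same list we summed


# Toy rows in the dataset's column order; the values are made up for illustration.
toy_posts = [
    ['1', 'Ask HN: Example A', '', '5', '52', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Example B', '', '3', '10', 'user_b', '1/26/2016 19:30'],
    ['3', 'Ask HN: Example C', '', '2', '1', 'user_c', '6/23/2016 22:20'],
]

print(average_comments(toy_posts))
```

In the notebook, the same function could be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.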

We'll now move on to our second question: do posts created at a certain time receive more comments on average?

First, we'll examine the Ask HN posts.

In [30]:
import datetime as dt

# Create a list with our necessary data (date/time of post, comments count)
result_list = list()
for post in ask_posts:
    created_at, comments = post[6], int(post[4])
    result_list.append([created_at, comments])

# Create dictionaries to track number of posts and comments by hour of posting
counts_by_hour = dict()
comments_by_hour = dict()
for result in result_list:
    dt_created_at = dt.datetime.strptime(result[0], '%m/%d/%Y %H:%M')
    hour = dt_created_at.strftime('%H')
    counts_by_hour[hour] = counts_by_hour.get(hour, 0) + 1
    comments_by_hour[hour] = comments_by_hour.get(hour, 0) + result[1]
    

In [31]:
# Form a list of lists of the average number of comments per posting hour
avg_by_hour = list()
for hour in list(comments_by_hour.keys()):
    hourly_average = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, hourly_average])
    
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
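To make the hour-bucketing step easier to follow, here is a compact sketch of the same idea on a handful of rows. It assumes the `created_at` format shown in the data (`month/day/year hour:minute`). The `sample_posts` pairs are illustrative (the last one is made up so that one hour has two posts), and the dictionary names are hypothetical.

```python
import datetime as dt
from collections import defaultdict

# (created_at, num_comments) pairs; illustrative values only.
sample_posts = [
    ('8/4/2016 11:52', 52),
    ('1/26/2016 19:30', 10),
    ('6/17/2016 0:01', 1),
    ('8/5/2016 11:10', 8),  # made-up second post in the 11:00 hour
]

post_counts = defaultdict(int)     # hour -> number of posts
comment_totals = defaultdict(int)  # hour -> total comments

for created_at, comments in sample_posts:
    # strptime accepts unpadded values such as '8/4/2016' and '0:01'
    parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    post_counts[parsed.hour] += 1
    comment_totals[parsed.hour] += comments

# Average comments per post for each hour that appears in the sample
hourly_averages = {
    hour: comment_totals[hour] / post_counts[hour] for hour in post_counts
}

for hour in sorted(hourly_averages):
    print('{:02d}:00 -> {:.2f}'.format(hour, hourly_averages[hour]))
```

Using the integer `parsed.hour` as the key, rather than a formatted string, keeps the hours in numeric order when sorted.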


In [52]:
# Sort hourly results by average

sorted_avg_by_hour = sorted(avg_by_hour, key=lambda pair: pair[1],
                            reverse=True)

print('Top 5 Hours for Ask Posts Comments')
for hourly_result in sorted_avg_by_hour[:5]:
    time = dt.datetime.strptime(hourly_result[0], '%H')
    print('{}: {:.2f} average comments per post'.format(
        time.strftime('%H:00'), hourly_result[1]))



Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


## (early) Conclusions
Thus, 3 PM Eastern Time looks to be the best time to submit an Ask HN post, since posts created in that hour receive the highest average number of comments (38.59 per post). It's followed by 2 AM, 8 PM, 4 PM, and 9 PM. The top hours are scattered across the day and the averages vary widely, so there doesn't appear to be a strong relationship between posting hour and comment count. Perhaps better content leads to better engagement.
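One way to judge how much weight an hourly average deserves is to look at the number of posts and the median alongside the mean, since a single heavily discussed post can pull the mean up. The sketch below uses made-up comment counts purely for illustration. It is not derived from the dataset, and `hypothetical_comments` and `hourly_summary` are hypothetical names.

```python
from statistics import mean, median

# Made-up comment counts per posting hour, for illustration only.
hypothetical_comments = {
    '15': [2, 3, 4, 150],    # one outlier dominates the mean
    '16': [12, 15, 18, 20],  # evenly spread counts
}

hourly_summary = {}
for hour, comments in hypothetical_comments.items():
    hourly_summary[hour] = {
        'posts': len(comments),
        'mean': mean(comments),
        'median': median(comments),
    }

for hour, stats in hourly_summary.items():
    print('{}:00 posts={posts} mean={mean:.2f} median={median:.2f}'.format(
        hour, **stats))
```

In this toy data the 15:00 hour has the higher mean but the lower median, which is the kind of gap that would suggest an outlier rather than a consistently better posting time.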

Some next steps for analysis:
* Determine the best time for a Show HN post
* Determine whether Show HN or Ask HN posts receive more points on average.
* Determine whether posts created at a certain time tend to receive more points.
* Compare these results to the average number of comments and points "other posts" (non-Ask/Show) receive.
* Expand the dataset.
* Reformatting.
