# Hacker News
## Dataquest project

This notebook deals with a fictional scenario as part of the Dataquest.io curriculum.

#### Scenario
Hacker News is a technology-focused website where users submit posts that other users can upvote, downvote, or comment on. Two recurring post types are posts whose titles start with either **Show HN** or **Ask HN**.

The analysis here focuses on two questions: which of the two post types generates more comments, and at which time of day a post is likely to attract the most comments.

#### Dataset
The original dataset can be found on [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts).

In [1]:
from csv import reader
import datetime as dt

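# Read the whole CSV into a list of rows; the context manager closes the file afterwards
with open("HN_posts_year_to_Sep_26_2016.csv", newline="") as f:
    hn = list(reader(f))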

### Data exploration

In [2]:
# Print the header row and the first four posts
for row in hn[:5]:
    print(row)
    print("\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']




### Data cleaning

In [3]:
# Storing headers in a separate variable

headers = hn[0]
hn = hn[1:]

In [4]:
# Splitting posts into 3 different categories

ask_hn = []
show_hn = []
other_hn = []

for post in hn:
    post_title = post[1].lower()
    if post_title.startswith('ask hn'):
        ask_hn.append(post)
    elif post_title.startswith('show hn'):
        show_hn.append(post)
    else:
        other_hn.append(post)

In [5]:
# Verifying that we still have the expected number of posts.

print("Ask HN: " + str(len(ask_hn)))
print("Show HN: " + str(len(show_hn)))
print("Other posts: " + str(len(other_hn)))
print("Actual total: " + str(len(ask_hn) + len(show_hn) + len(other_hn)))
print("Expected total: " + str(len(hn)))      

Ask HN: 9139
Show HN: 10158
Other posts: 273822
Actual total: 293119
Expected total: 293119


In [6]:
# Finding the total and average number of comments for ask and show posts.

total_ask_comments = 0

for post in ask_hn:
    total_ask_comments += int(post[4])
    
avg_ask_comments = total_ask_comments / len(ask_hn)


total_show_comments = 0

for post in show_hn:
    total_show_comments += int(post[4])
    
avg_show_comments = total_show_comments / len(show_hn)
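
The two loops in the cell above repeat the same arithmetic. As a minimal sketch, that logic could live in one helper. `average_comments` is a hypothetical name that is not part of the original notebook, the sample rows are made up, and the sketch assumes the comment count sits at index 4, as it does in this dataset.

```python
def average_comments(posts, comments_idx=4):
    """Return the mean number of comments per post (0.0 for an empty list)."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_idx]) for post in posts)
    return total / len(posts)


# Made-up rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ["1", "Ask HN: example one", "", "3", "10", "user_a", "9/26/2016 3:26"],
    ["2", "Ask HN: example two", "", "1", "4", "user_b", "9/26/2016 4:10"],
]

print(average_comments(sample_posts))
```

With such a helper, the same call would work unchanged for `ask_hn` and `show_hn`, and the guard for an empty list avoids a division by zero.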

In [7]:
print("Average comments for Ask HN posts: " + str(avg_ask_comments))
print("Average comments for Show HN posts: " + str(avg_show_comments))

Average comments for Ask HN posts: 10.393478498741656
Average comments for Show HN posts: 4.886099625910612


### Conclusion 1: Average comments Ask HN vs. Show HN

On average, Ask HN posts clearly receive more comments than Show HN posts: roughly 10.4 versus 4.9 per post. Because this analysis looks for the posts that attract the most comments, the remainder focuses on Ask HN posts.

To determine the ideal time to post an Ask HN, we take the `created_at` value of every Ask HN post and group the posts by hour of creation. Dividing the total number of comments on posts created in a given hour by the number of posts created in that hour gives the average number of comments per post for that hour.
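
The grouping described here can be sketched on a few made-up rows. Everything in the sketch is illustrative: `comments_per_hour` is a hypothetical helper, the rows are invented, and it assumes `created_at` sits at index 6 in month/day/year hour:minute format, as in this dataset.

```python
import datetime as dt
from collections import defaultdict


def comments_per_hour(posts, comments_idx=4, created_idx=6):
    """Map each hour ("00".."23") to the mean comments per post created in that hour."""
    totals = defaultdict(int)   # hour -> summed comments
    counts = defaultdict(int)   # hour -> number of posts
    for post in posts:
        created = dt.datetime.strptime(post[created_idx], "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")
        totals[hour] += int(post[comments_idx])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


# Made-up rows: two posts created in hour 15, one in hour 02
sample_posts = [
    ["1", "Ask HN: a", "", "1", "6", "user_a", "9/26/2016 15:05"],
    ["2", "Ask HN: b", "", "1", "2", "user_b", "9/25/2016 15:40"],
    ["3", "Ask HN: c", "", "1", "9", "user_c", "9/26/2016 2:19"],
]

print(comments_per_hour(sample_posts))
```

Using `defaultdict` removes the need to check whether an hour has been seen before, at the cost of one extra import.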

In [8]:
# Number of posts and comments per hour

times_comments = {}
times_posts = {}

for post in ask_hn: 
    date_time = dt.datetime.strptime(post[6], "%m/%d/%Y %H:%M")
    time = date_time.strftime("%H")
    comments = int(post[4])
    
    if time in times_comments:
        times_comments[time] += comments
        times_posts[time] += 1
        
    else:
        times_comments[time] = comments
        times_posts[time] = 1

In [9]:
# Average comments per post per hour

times_avg = {}

for hour in times_comments:
    avg = times_comments[hour] / times_posts[hour]
    times_avg[hour] = avg

In [10]:
# Converting dictionary to list for easier sorting

times_list = []

for hour in times_avg:
    avg = [times_avg[hour], hour]
    times_list.append(avg)
    
times_list = sorted(times_list, reverse=True)
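
As an alternative sketch, the dictionary can be sorted directly, without first building a list of swapped pairs, by passing a `key` function to `sorted`. The values below are made up for illustration and `rank_hours` is a hypothetical name.

```python
def rank_hours(avg_by_hour):
    """Return (hour, average) pairs ordered from highest to lowest average."""
    return sorted(avg_by_hour.items(), key=lambda pair: pair[1], reverse=True)


# Made-up averages keyed by hour
sample_avg = {"09": 6.5, "15": 28.5, "13": 16.25}

for hour, avg in rank_hours(sample_avg):
    print("{}:00 -> {:.2f}".format(hour, avg))
```

One difference from the list-based approach: when two hours have the same average, this version keeps them in dictionary order instead of breaking the tie by hour.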

In [11]:
times_list

[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10'],
 [9.7119341563786, '04'],
 [9.692007797270955, '14'],
 [9.449744463373083, '17'],
 [9.190661478599221, '08'],
 [8.96474358974359, '11'],
 [8.804177545691905, '22'],
 [8.794258373205741, '05'],
 [8.749019607843136, '20'],
 [8.687258687258687, '21'],
 [7.948339483394834, '03'],
 [7.94299674267101, '18'],
 [7.713298791018998, '16'],
 [7.5647840531561465, '00'],
 [7.407801418439717, '01'],
 [7.163043478260869, '19'],
 [7.013274336283186, '07'],
 [6.782051282051282, '06'],
 [6.696793002915452, '23'],
 [6.653153153153153, '09']]

### Conclusion 2: Average comments per hour

In [12]:
print("Top 5 hours in Eastern US time (UTC-4) for comments on Ask HN posts")
print("\n")

for item in times_list[:5]:
    time = item[1] + ":00"
    comments = item[0]
    print("{}: {:.2f} average comments per post.".format(time, comments))

Top 5 hours in Eastern US time (UTC-4) for comments on Ask HN posts


15:00: 28.68 average comments per post.
13:00: 16.32 average comments per post.
12:00: 12.38 average comments per post.
02:00: 11.14 average comments per post.
10:00: 10.68 average comments per post.


In [24]:
print("Top 5 hours in Amsterdam time (UTC+1) for comments on Ask HN posts")
print("\n")

for item in times_list[:5]:
    # Shift by 5 hours (UTC-4 to UTC+1), wrap past midnight, and keep two-digit hours
    time = "{:02d}:00".format((int(item[1]) + 5) % 24)
    comments = item[0]
    print("{}: {:.2f} average comments per post.".format(time, comments))

Top 5 hours in Amsterdam time (UTC+1) for comments on Ask HN posts


20:00: 28.68 average comments per post.
18:00: 16.32 average comments per post.
17:00: 12.38 average comments per post.
07:00: 11.14 average comments per post.
15:00: 10.68 average comments per post.
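
Adding a fixed number to the hour needs care around midnight. A minimal sketch of the same conversion using `datetime` time zones follows. It assumes the fixed offsets stated above (UTC-4 and UTC+1), which ignore daylight saving changes, and `convert_hour` is a hypothetical helper.

```python
import datetime as dt

EASTERN = dt.timezone(dt.timedelta(hours=-4))    # assumed fixed offset, no DST handling
AMSTERDAM = dt.timezone(dt.timedelta(hours=1))   # assumed fixed offset, no DST handling


def convert_hour(hour, source=EASTERN, target=AMSTERDAM):
    """Convert an hour string such as "15" from source to target; returns "HH:00"."""
    # The calendar date is arbitrary; only the hour matters for this conversion.
    moment = dt.datetime(2016, 9, 26, int(hour), tzinfo=source)
    return moment.astimezone(target).strftime("%H:00")


for hour in ["15", "02", "21"]:
    print(hour + ":00 Eastern ->", convert_hour(hour), "Amsterdam")
```

Because the arithmetic is done on full `datetime` objects, an Eastern hour such as 21 rolls over to the next day correctly instead of producing an hour above 23.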
