# Exploring Hacker News Posts

The objective of this project is to determine whether the time at which a post is created is related to the number of comments it receives on the Hacker News website. Hacker News is extremely popular in technology and startup circles, and posts that reach the top of its listings can attract hundreds of thousands of visitors.

In [1]:
from csv import reader

# Read the whole CSV into a list of rows; the `with` block closes the file afterwards.
with open('hacker_news.csv', encoding='utf-8') as opened_file:
    hn=list(reader(opened_file))
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [2]:
headers=hn[0]
hn=hn[1:]
print('Headers of the dataset:')
print(headers)
print('The dataset after removing headers:')
print(hn[:5])

Headers of the dataset:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


Now that we've removed the headers from the dataset, we're ready to filter our data.

In [4]:
# Extract the ASK HN posts and SHOW HN posts
ask_posts=[]
show_posts=[]
other_posts=[]

for row in hn:
    title=row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Number of posts in "ask_posts":', len(ask_posts))
print('Number of posts in "show_posts":', len(show_posts))
print('Number of posts in "other_posts":', len(other_posts))
print('Number of posts in the dataset:', len(hn))
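# Sanity check: the three groups together should account for every row in the dataset
print('Sum of the three groups:', len(ask_posts)+len(show_posts)+len(other_posts))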
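
To illustrate why the titles are lowercased before the prefix check, here is a minimal sketch. The `classify` helper and the first two sample titles are hypothetical and are not part of the notebook; the third title comes from the dataset preview above.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# The first two titles are made up for illustration only.
sample_titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My project',
    'Interactive Dynamic Video',
]
labels = [classify(title) for title in sample_titles]
print(labels)
```

Without the call to `lower()`, a title such as `SHOW HN: My project` would not match the prefix `show hn` and would land in the "other" group.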

## Calculating the average number of comments for Ask HN and Show HN posts

In this section, we determine whether ask posts or show posts receive more comments on average.

In [6]:
# Find the total number of comments in ask posts and assign it to `total_ask_comments`
total_ask_comments=0
for row in ask_posts:
    nb_comments=int(row[4])
    total_ask_comments+=nb_comments
avg_ask_comments=total_ask_comments/len(ask_posts)
print('Total number of comments in ask posts:',"{:,}".format(total_ask_comments))
print('Average number of comments in ask posts:',"{:,.2f}".format(avg_ask_comments))

Total number of comments in ask posts: 24,483
Average number of comments in ask posts: 14.04


In [7]:
# Find the total number of comments in show posts and assign it to `total_show_comments`

total_show_comments=0

for row in show_posts:
    nb_comments=int(row[4])
    total_show_comments+=nb_comments

avg_show_comments=total_show_comments/len(show_posts)
print('Total number of comments in show posts:',"{:,}".format(total_show_comments))
print('Average number of comments in show posts:', "{:,.2f}".format(avg_show_comments))

Total number of comments in show posts: 11,988
Average number of comments in show posts: 10.32


According to the results above, ask posts receive more comments on average than show posts (14.04 versus 10.32 comments per post). Although the number of ask posts and the number of show posts do not differ widely, the total number of comments differs substantially between the two groups (24,483 versus 11,988). Because ask posts tend to attract more comments, we will focus on them for the rest of the analysis.
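
The two cells above repeat the same loop for each group. As a sketch only, the computation could be factored into one helper. The function name `average_comments`, its `comments_index` parameter, and the sample rows are assumptions introduced for illustration; they are not part of the notebook or the dataset.

```python
def average_comments(posts, comments_index=4):
    """Return (total, average) comment counts for a list of post rows."""
    total = sum(int(row[comments_index]) for row in posts)
    # Guard against an empty group to avoid dividing by zero.
    average = total / len(posts) if posts else 0.0
    return total, average

# Hypothetical rows in the dataset's column order, for illustration only.
sample_posts = [
    ['1', 'Ask HN: Example A', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Example B', '', '3', '4', 'user_b', '8/4/2016 15:10'],
]
total, average = average_comments(sample_posts)
print('Total:', total)
print('Average: {:.2f}'.format(average))
```

In the notebook, the same helper could be called once with `ask_posts` and once with `show_posts`.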

## The number of ask posts and comments by hour created

Next, we'll determine if ask posts created at a certain time are more likely to attract comments.

In [16]:
import datetime as dt
result_list=[]
for row in ask_posts:
    created_at=row[6]
    nb_comments=int(row[4])
    result_list.append([created_at,nb_comments])

counts_by_hour={} # Number of ask posts created during each hour of the day
comments_by_hour={} # Total comments received by ask posts created during each hour

for row in result_list:
    date_str=row[0]
    nb_comments=row[1]
    date_dt=dt.datetime.strptime(date_str,"%m/%d/%Y %H:%M")
    hour_str=date_dt.strftime("%H")
    
    if hour_str in counts_by_hour:
        counts_by_hour[hour_str]+=1
        comments_by_hour[hour_str]+=nb_comments
    else:
        counts_by_hour[hour_str]=1
        comments_by_hour[hour_str]=nb_comments
comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
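
The hour keys above come from parsing each `created_at` string and formatting it back as a two-digit hour. A minimal sketch of that step, using two `created_at` values from the dataset preview at the top of the notebook (the variable names are illustrative):

```python
import datetime as dt

date_format = '%m/%d/%Y %H:%M'

# Two created_at strings taken from the first rows printed earlier.
afternoon = dt.datetime.strptime('8/4/2016 11:52', date_format)
midnight = dt.datetime.strptime('6/17/2016 0:01', date_format)

# strftime('%H') always yields a zero-padded, two-character hour.
afternoon_hour = afternoon.strftime('%H')
midnight_hour = midnight.strftime('%H')
print(afternoon_hour, midnight_hour)
```

Because `%H` is zero-padded on output, an input hour written as `0` becomes the key `'00'`, which is why every key in `comments_by_hour` has two characters.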

In [15]:
avg_by_hour=[]
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])

swap_avg_by_hour=[]
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
sorted_swap=sorted(swap_avg_by_hour, reverse=True)
print('Top 5 Hours for Ask Posts Comments:')
for elem in sorted_swap[:5]:
    hour=dt.datetime.strptime(elem[1],"%H").strftime("%H:00")
    avg_cmt="{:.2f}".format(elem[0])
    print(hour,':',avg_cmt, ' average comments per post')

Top 5 Hours for Ask Posts Comments:
15:00 : 38.59  average comments per post
02:00 : 23.81  average comments per post
20:00 : 21.52  average comments per post
16:00 : 16.80  average comments per post
21:00 : 16.01  average comments per post


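From these results, ask posts created during the `15:00` hour receive the most comments on average (38.59 per post), so posting around 3 PM appears to give the best chance of attracting comments. Posts created during the `02:00`, `20:00`, `16:00`, and `21:00` hours also have high average comments per post. Note that these hours are expressed in whatever time zone the `created_at` column uses.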
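
The ranking cell swaps each `[hour, average]` pair so that `sorted` orders by the average. As an alternative sketch, the `key` argument of `sorted` can do this without building a second list. The sample values below are the rounded averages copied from the printed output above, listed in shuffled order; the variable names are illustrative.

```python
# [hour, average comments per post], rounded values copied from the output above.
avg_by_hour_sample = [
    ['16', 16.80],
    ['02', 23.81],
    ['21', 16.01],
    ['15', 38.59],
    ['20', 21.52],
]

# Sort by the second element of each pair, largest average first.
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)
for hour, avg in top_hours:
    print('{}:00 : {:.2f} average comments per post'.format(hour, avg))
```

One difference worth noting: with the swapped lists, ties in the average are broken by the hour string, whereas sorting with `key` leaves tied entries in their original order.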