# Hacker News
We are looking at a dataset for the website Hacker News. Our primary interest is in the number of comments posts receive, and when the comments are made. In order to determine the best time to create a post to receive the most comments which relates to the total number of views.  
The dataset is a modified version of [Hacker News Posts](https://www.kaggle.com/hacker-news/hacker-news-posts?select=HN_posts_year_to_Sep_26_2016.csv) from Kaggle.

In [8]:
# Open the file and create our raw dataset.
import csv
opened_file = ''
with open('hacker_news.csv') as f:
    read_file = csv.reader(f)
    raw_data = [row for row in read_file]
header = raw_data[0]
raw_data = raw_data[1:]
print('-Column names for dataset-')
for i, col in enumerate(header):
    print(i, col)

-Column names for dataset-
0 id
1 title
2 url
3 num_points
4 num_comments
5 author
6 created_at


In [2]:
def explorer(dataset, start, stop, totals=True):
    for row in dataset[start:stop]:
        print(row)
    if totals:
        rows = len(dataset)
        cols = len(dataset[0])
    print(f'rows: {rows} x cols: {cols}')

In [9]:
explorer(raw_data, 0, 5)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
rows: 20100 x cols: 7


In [11]:
ask_posts = []
show_posts = []
other_posts = []

for row in raw_data:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    if title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
        
print('Ask HN posts:', len(ask_posts))
print('Show HN posts:', len(show_posts))
print('Other posts:', len(other_posts))

Ask HN posts: 1744
Show HN posts: 1162
Other posts: 18938


Check the data in the show_posts set:

In [10]:
explorer(show_posts, 0, 5)

['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']
['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']
['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']
['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11']
['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']
rows: 1162 x cols: 7


### Average comments by post type
On average the Ask HN posts receive 4 more comments per post over the Show HN posts.

In [6]:
def get_comment_info(dataset, index=4):
    total = 0
    for row in dataset:
        num_comments = int(row[index])
        total += num_comments
    print(int(total/len(dataset)))
    return total

print('Average number of comments per post:')
print('ask HN: ', end='')
total_ask = get_comment_info(ask_posts)
print('show HN: ', end='')
total_show = get_comment_info(show_posts)
print('other: ', end='')
total_other = get_comment_info(other_posts)

    

Average number of comments per post:
ask HN: 14
show HN: 10
other: 25


## Hourly breakdown of average comments per post

In [7]:
import datetime as dt
result_list = []
for post in ask_posts:
    created_at = post[6]
    n_comments = int(post[4])
    result_list.append([created_at, n_comments])

    # dicts for hourly calcs
counts_by_hour = {}
comments_by_hour = {}

for i in result_list:
    date_obj = dt.datetime.strptime(i[0], '%m/%d/%Y %H:%M')
    hour = date_obj.strftime('%H')
    i[0] = hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = i[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += i[1]

avg_by_hour = []
for k, v in counts_by_hour.items():
    avg_by_hour.append([comments_by_hour[k]/counts_by_hour[k], k])
avg_sorted = sorted(avg_by_hour, reverse=True)

print('-Top 5 hours for Ask HN posts-')
for i in avg_sorted[:5]:
    avg_comments = f'{i[0]:.2f} average comments per post'
    hrs = dt.datetime.strptime(i[1], '%H')
    hrs = hrs.strftime('%H:%M')
    print(f'{hrs}: {avg_comments}')


-Top 5 hours for Ask HN posts-
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


## Optimal Time to Post Ask HackerNews?
Based on the average comments per post during each hour, the best time would be __3:00 pm EST or 1:00 pm MST__.