# Hacker News Post Analysis #


Hacker News is a website that is popular in technology and start-up circles. Users submit stories, known as posts, which other users can vote and comment on.

This analysis will:

- Determine if *Ask HN* or *Show HN* posts receive more comments on average
- Determine if posts created at a certain time of day receive more comments on average

Users use *Ask HN* posts to ask the Hacker News community a specific question. Some examples of *Ask HN* posts are:

* Ask HN: How to improve my personal website?
* Ask HN: Am I the only one outraged by Twitter shutting down share counts?
* Ask HN: Aby recent changes to CSS that broke mobile?

Users use *Show HN* posts to show the Hacker News community a project, product, or something interesting. Some examples of *Show HN* posts are:

* Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
* Show HN: Something pointless I made
* Show HN: Shanhu.io, a programming playground powered by e8vm

The Hacker News data can be accessed at [Hacker News link](https://www.kaggle.com/hacker-news/hacker-news-posts). The columns are described below:
* **id:** the unique identifier from Hacker News for the post
* **title:** the title of the post
* **url:** the URL that the post links to, if the post has a URL
* **num_points:** the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* **num_comments:** the number of comments on the post
* **author:** the username of the person who submitted the post
* **created_at:** the date and time of the post's submission


### Exploration of Hacker News Dataset ###

In [1]:
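#read the Hacker News file into a list of lists

from csv import reader

#the with-statement closes the file once the rows have been read
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

#show the first five rows (the header plus four posts)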
print('The first five rows of the Hacker News dataset') 
print('\n') 
hn[:5]


The first five rows of the Hacker News dataset




[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

The first row of the dataset holds the column names rather than a post, so it is moved into a separate list and removed from the data before the analysis.

In [2]:
#store the header row in headers and keep only the post rows in hn

headers = hn[0]
hn = hn[1:]
print(headers)
print('\n')
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
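
Because each row is a plain list, the later cells refer to columns by position (for example `row[4]` for `num_comments`). A minimal sketch of looking those positions up by name instead, assuming the header list printed above; the `col` dictionary and `sample_row` are illustrative names and are not part of the original notebook:

```python
# Header row as printed above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position (hypothetical helper dictionary).
col = {name: index for index, name in enumerate(headers)}

# One row copied from the dataset preview above.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

# The csv reader returns every value as a string, so convert before doing arithmetic.
n_comments = int(sample_row[col['num_comments']])
print(col['num_comments'], n_comments)
```

This makes it easier to see that index 4 is `num_comments`, index 3 is `num_points`, and index 6 is `created_at` in the cells that follow.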


### Extract *Ask HN* and *Show HN* Posts ###

Now that the Hacker News dataset has been loaded, the next step is to generate lists of only those posts that are *Ask HN* or *Show HN* posts.

This is achieved by looping through the Hacker News dataset and filtering on titles that start with Ask HN or Show HN, ignoring case. All other posts will be saved as Other.

In [3]:
#split the posts into Ask HN, Show HN, and other posts based on the title prefix

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('The number of Ask HN posts is', len(ask_posts))
print('The number of Show HN posts is', len(show_posts))
print('The number of Other posts is', len(other_posts))

The number of Ask HN posts is 1744
The number of Show HN posts is 1162
The number of Other posts is 17194
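
The filter relies on `str.startswith` after lowercasing, so a title only counts as *Ask HN* or *Show HN* when the prefix appears at the very beginning. A small sketch of that rule on a few titles; `classify` is a hypothetical helper, and the last title is invented for illustration:

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

titles = [
    'Ask HN: How to improve my personal website?',  # from the dataset
    'Show HN: Something pointless I made',          # from the dataset
    'Interactive Dynamic Video',                    # from the dataset
    'Why I stopped reading Ask HN threads',         # invented example
]
labels = [classify(t) for t in titles]
print(labels)
```

The invented title mentions Ask HN but does not start with it, so it is classified as `other`.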


**Explore *Ask HN* and *Show HN* Posts**

In [4]:
#print the first five rows from the Ask HN posts

print('These are the 1st five rows of the Ask HN posts')
print('\n')
print(ask_posts[:5])

These are the 1st five rows of the Ask HN posts


[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


In [5]:
#print the first five rows from the Show HN posts

print('These are the 1st five rows of the Show HN posts')
print('\n')
print(show_posts[:5])

These are the 1st five rows of the Show HN posts


[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]


### Determine Average Number Of Comments for *Ask HN* and *Show HN* Posts ###

One aspect of the analysis is to determine which type of post, *Ask HN* or *Show HN*, receives more comments. In order to determine this, the average number of comments per post will be calculated.

In [6]:
#calculate the average number of comments for
#Ask HN and Show HN posts

total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])

for row in show_posts:
    total_show_comments += int(row[4])

#average = total comments divided by the number of posts
avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)

print('The average number of comments per Ask HN post is', round(avg_ask_comments, 2))
print('The average number of comments per Show HN post is', round(avg_show_comments, 2))
print('\n')
print('The total number of comments on Ask HN posts is', total_ask_comments)
print('The total number of comments on Show HN posts is', total_show_comments)

The average number of comments per Ask HN post is 14.04
The average number of comments per Show HN post is 10.32


The total number of comments on Ask HN posts is 24483
The total number of comments on Show HN posts is 11988


The total number of comments depends on how many posts of each type were in the dataset when it was downloaded. Because Hacker News content changes regularly and the two groups differ in size (1744 *Ask HN* posts versus 1162 *Show HN* posts), the average number of comments per post is a better gauge than the total number of comments.

In this dataset, *Ask HN* posts receive about 14.04 comments on average, compared with about 10.32 for *Show HN* posts, which is roughly 36% more. *Ask HN* posts also have the larger total (24483 versus 11988 comments), partly because there are more of them.

By this measure, *Ask HN* posts draw more discussion than *Show HN* posts. To conclude whether this is a common occurrence, the analysis would have to be repeated on a newer download of the dataset that contains new *Ask HN* and *Show HN* posts.
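
The average here is the total number of comments divided by the number of posts, not by the comment count of any single post. A minimal sketch on the five *Ask HN* rows previewed earlier, reduced to their `num_comments` values; `average` is an illustrative helper name:

```python
def average(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)

# num_comments for the five Ask HN rows printed earlier (strings, as read from the CSV).
sample_comments = ['6', '29', '1', '3', '17']
sample_avg = average([int(c) for c in sample_comments])
print(sample_avg)
```

The five values sum to 56, so the sample average is 56 / 5.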

### Determine What Time of Day *Ask HN* Posts Receive the Most Comments ###

Now that it's been determined that *Ask HN* posts receive more comments on average than *Show HN* posts, the next step is to find out whether *Ask HN* posts created at a certain hour of the day attract more comments.

First, determine the number of posts created during each hour and the total number of comments those posts received.

**Calculate Number of *Ask HN* Posts per Hour & Total Number of Comments**

In [7]:
#calculate the number of Ask HN posts created per hour and
#calculate the total number of comments

import datetime as dt

#generate list to include created_at (time post created) and 
#number of comments
result_list = []
for row in ask_posts:
    created_time = row[6]
    n_comments = int(row[4])
    result_list.append([created_time, n_comments])
    
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    hr = row[0]
    comment = row[1]
    #parse the created_at string, then keep only the hour (00-23)
    dt_hr = dt.datetime.strptime(hr, "%m/%d/%Y %H:%M")
    hr_only = dt_hr.strftime("%H")
    
    #populate counts_by_hour and comments_by_hour dictionaries
    #with hour from datetime object as the dictionary key
    if hr_only not in counts_by_hour:
        counts_by_hour[hr_only] = 1
        comments_by_hour[hr_only] = comment
    else:
        counts_by_hour[hr_only] += 1
        comments_by_hour[hr_only] += comment

print('The number of counts by hour is', counts_by_hour)
print('\n')
print('The number of comments by hour is', comments_by_hour)

The number of counts by hour is {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


The number of comments by hour is {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
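
The same bookkeeping can be written more compactly. A sketch, assuming timestamps in the `month/day/year hour:minute` format shown in the dataset; `collections.defaultdict` stands in for the explicit `if`/`else`. The first three `[created_at, num_comments]` pairs come from the *Ask HN* preview, and the last pair is invented so that one hour appears twice:

```python
import datetime as dt
from collections import defaultdict

sample = [
    ['8/16/2016 9:55', 6],     # from the Ask HN preview
    ['11/22/2015 13:43', 29],  # from the Ask HN preview
    ['5/2/2016 10:14', 1],     # from the Ask HN preview
    ['1/1/2016 9:05', 4],      # invented row, same hour as the first
]

posts_per_hour = defaultdict(int)     # missing keys start at 0
comments_per_hour = defaultdict(int)

for created_at, n_comments in sample:
    # Parse the timestamp, then format only the zero-padded hour (00-23).
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    posts_per_hour[hour] += 1
    comments_per_hour[hour] += n_comments

print(dict(posts_per_hour))
print(dict(comments_per_hour))
```

Both rows created during the 9 o'clock hour land under the key `'09'`, so their counts and comments accumulate together.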


**Calculate Average Number of Comments per *Ask HN* Post by Hour**

Now that the number of *Ask HN* posts per hour and the number of comments per hour have been determined, the next step is to use this information to calculate the average number of comments per *Ask HN* post for each hour of the day.

In [8]:
#calculate the average number of comments per post for
#each hour of the day

avg_by_hour = []

#use the dictionaries previously calculated for number of comments per hour
#and number of posts per hour
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
print('The average number of comments per post by hour of day is', '\n', avg_by_hour)


The average number of comments per post by hour of day is 
 [['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


For readability, sort the averages in descending order to see which hours of the day receive the highest average number of comments per post.

In [35]:
#sort avg_by_hour in descending order with the average number
#of comments as the first element in the list
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print('Top 5 Hours for Ask HN Posts Comments')
print('Time is in EST')
for loop in sorted_swap:
    dt_object = dt.datetime.strptime(loop[1], '%H')
    loop[1] = dt.datetime.strftime(dt_object, '%I:%M %p')
    loop[0] = "{:.2f} average comments per Ask HN posts".format(loop[0])

#swap each pair back so the hour is the first element of the list
resorted = []
for row in sorted_swap:
    resorted.append([row[1], row[0]])
print(resorted[:5])

Top 5 Hours for Ask HN Posts Comments
Time is in EST
[['03:00 PM', '38.59 average comments per Ask HN posts'], ['02:00 AM', '23.81 average comments per Ask HN posts'], ['08:00 PM', '21.52 average comments per Ask HN posts'], ['04:00 PM', '16.80 average comments per Ask HN posts'], ['09:00 PM', '16.01 average comments per Ask HN posts']]
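
Swapping the pairs is one way to sort by the average. `sorted` also accepts a `key` function, which avoids the swap. A sketch using three `[hour, average]` pairs copied from the output above and rounded to two decimals; the variable names are illustrative:

```python
import datetime as dt

pairs = [['09', 5.58], ['15', 38.59], ['02', 23.81]]

# Sort by the second element (the average), largest first.
ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    # Convert the 24-hour string into a 12-hour clock label.
    label = dt.datetime.strptime(hour, '%H').strftime('%I:%M %p')
    print('{}: {:.2f} average comments per post'.format(label, avg))
```

The original `[hour, average]` order is preserved, so no second pass is needed to swap the elements back.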


***Ask HN* Post with Most Points per Hour**

Now that it's been established which hour of the day has the most comments on average, which hour of the day has the most points?

In [72]:
#calculate the number of points per hour for Ask HN posts

tot_pts_hr = {}

for row in ask_posts:
    hr = row[6]
    pts = int(row[3])
    dt_hr = dt.datetime.strptime(hr, "%m/%d/%Y %H:%M") 
    hr_only = dt.datetime.strftime(dt_hr, "%I %p")
    if hr_only not in tot_pts_hr:
        tot_pts_hr[hr_only] = pts
    else:
        tot_pts_hr[hr_only] += pts

print('The total number of points per hour is', tot_pts_hr)    

The total number of points per hour is {'09 AM': 329, '01 PM': 2062, '10 AM': 1102, '02 PM': 1282, '04 PM': 2522, '11 PM': 581, '12 PM': 782, '05 PM': 1941, '03 PM': 3479, '09 PM': 1721, '08 PM': 1151, '02 AM': 793, '06 PM': 1741, '03 AM': 374, '05 AM': 552, '07 PM': 1513, '01 AM': 700, '10 PM': 511, '08 AM': 515, '04 AM': 389, '12 AM': 451, '06 AM': 591, '07 AM': 361, '11 AM': 825}
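
To read the top hour off the dictionary without scanning it by eye, `max` with a `key` function works. A sketch on four entries copied from the output above; `sample_pts` is an illustrative name. These values are totals, so dividing each by the number of posts in that hour would be needed to get an average:

```python
# Four entries copied from the totals printed above.
sample_pts = {'09 AM': 329, '01 PM': 2062, '04 PM': 2522, '03 PM': 3479}

# max iterates over the keys and compares them by their associated totals.
top_hour = max(sample_pts, key=sample_pts.get)
print(top_hour, sample_pts[top_hour])
```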


### Conclusion ###

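Among the *Ask HN* posts in this dataset, those created during the 3:00 PM hour (EST) receive the most comments on average (38.59 per post). That hour also accumulated the most total points (3479). Therefore, it seems the optimal time to submit an *Ask HN* post is around 3:00 PM EST.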
