# Introduction

For this project, we compare two types of posts from Hacker News, a site started by the startup incubator Y Combinator. User-submitted posts there are voted and commented upon, in a manner similar to reddit.

Our interest lies in posts whose titles begin with either 'Ask HN' or 'Show HN'. Ask HN posts are submitted by users asking the community a specific question, such as 'How to improve my personal website?'. Show HN posts are submitted by users showing the community a project or something else they find interesting.

We'll compare these two types of posts to determine the following:

* Does **Ask HN** or **Show HN** receive more comments on average?
* Do posts created at a certain time receive more comments on average?

First, we'll read the data and keep a copy of the header row.


In [11]:
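from csv import reader

# Read the whole file into a list of rows; the context manager closes the file
with open('hacker_news.csv') as open_file:
    hn = list(reader(open_file))

# Keep a copy of the header row; note that hn itself still includes it
hn_header = hn[:1]

# Display the header plus the first four posts
hn[:5]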


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]
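
Note that `hn[:1]` copies the header without removing it, which is why the header is still the first row in the output above. A minimal sketch of one way to separate the two is shown here. The in-memory `sample_csv` text and the names `all_rows`, `header`, and `rows` are illustrative assumptions, not part of the notebook.

```python
from csv import reader
from io import StringIO

# Hypothetical one-post sample standing in for hacker_news.csv
sample_csv = StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

all_rows = list(reader(sample_csv))
header = all_rows[0]   # column names
rows = all_rows[1:]    # data rows only, header excluded

print(header)
print(len(rows))
```

Slicing with `[1:]` leaves only the post rows, so later loops never see the column names.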

**Filtering the Data**

Next, we filter the data into two lists: posts whose titles begin with **Ask HN** and posts whose titles begin with **Show HN**. Everything else goes into a third list. Because `hn` still contains the header row, that row also ends up in the third list.

In [18]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    # Compare in lowercase so 'ASK HN', 'Ask HN' and 'ask hn' all match
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17195
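
The prefix test can be isolated in a small helper, which makes the case-insensitive matching easy to check on its own. This is only a sketch: `classify_title` and two of the three sample titles are hypothetical and do not come from the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles, except the last one, which appears in the output above
sample_titles = [
    'Ask HN: How to improve my personal website?',
    'SHOW HN: A made-up project title',
    'Interactive Dynamic Video',
]

labels = [classify_title(t) for t in sample_titles]
print(labels)
```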


**Determining which type of posts receive more comments**

Now that the posts have been separated into different lists, we can determine which type of post receives more comments on average.

In [19]:
total_ask_comments = 0

# num_comments is column 4 and is stored as a string, so convert it first
for post in ask_posts:
    total_ask_comments += int(post[4])

avg_ask_comments = total_ask_comments / len(ask_posts)

print(avg_ask_comments)

14.038417431192661


In [20]:
total_show_comments = 0

# Same calculation for the show posts
for post in show_posts:
    total_show_comments += int(post[4])

avg_show_comments = total_show_comments / len(show_posts)

print(avg_show_comments)


10.31669535283993


Ask posts receive more comments on average than show posts: about 14.0 versus 10.3, or around 36% more. This answers the first of our questions. Since ask posts receive more comments, we will focus the rest of the analysis on them.
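
The "around 36%" figure is the relative difference between the two averages printed above. A short check of that arithmetic, with the two averages copied in as literals and `relative_increase` as an illustrative name:

```python
# Averages copied from the outputs above
avg_ask_comments = 14.038417431192661
avg_show_comments = 10.31669535283993

# How much larger the ask average is, relative to the show average
relative_increase = (avg_ask_comments - avg_show_comments) / avg_show_comments

print("{:.0%}".format(relative_increase))
```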

Next, we'll determine if ask posts created at a certain time are more likely to attract comments using the following steps:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.

2. Calculate the average number of comments ask posts receive by the hour they were created in.



**Calculating the number of Ask posts and comments by the Hour of Day**

In [30]:
import datetime as dt

result_list = []

for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])
    
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    created_at = row[0]
    comment = row[1]
    # Parse the timestamp and keep only the zero-padded hour, e.g. '09'
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")

    if hour not in counts_by_hour:
        # First post seen for this hour
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
        
comments_by_hour, counts_by_hour

({'09': 251,
  '13': 1253,
  '10': 793,
  '14': 1416,
  '16': 1814,
  '23': 543,
  '12': 687,
  '17': 1146,
  '15': 4477,
  '21': 1745,
  '20': 1722,
  '02': 1381,
  '18': 1439,
  '03': 421,
  '05': 464,
  '19': 1188,
  '01': 683,
  '22': 479,
  '08': 492,
  '04': 337,
  '00': 447,
  '06': 397,
  '07': 267,
  '11': 641},
 {'09': 45,
  '13': 85,
  '10': 59,
  '14': 107,
  '16': 108,
  '23': 68,
  '12': 73,
  '17': 100,
  '15': 116,
  '21': 109,
  '20': 80,
  '02': 58,
  '18': 109,
  '03': 54,
  '05': 46,
  '19': 110,
  '01': 60,
  '22': 71,
  '08': 48,
  '04': 47,
  '00': 55,
  '06': 44,
  '07': 34,
  '11': 58})

**Calculating the average number of comments on ask posts by hour of day**

In [42]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]
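
The counting and averaging steps can also be done in a single pass. The sketch below is an alternative, not the notebook's own approach. It uses `collections.defaultdict` on a small made-up sample, and it keys on the integer `hour` attribute rather than the zero-padded string used above.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs, not taken from the dataset
sample_posts = [
    ["8/4/2016 11:52", 6],
    ["1/26/2016 19:30", 10],
    ["6/23/2016 11:05", 2],
]

# hour -> [number of posts, total comments]
totals = defaultdict(lambda: [0, 0])

for created_at, num_comments in sample_posts:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    totals[hour][0] += 1
    totals[hour][1] += num_comments

# Average comments per post for each hour
avg_comments = {hour: comments / count for hour, (count, comments) in totals.items()}
print(avg_comments)
```

Keeping both tallies in one dictionary means the count and the comment total for an hour cannot get out of step.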

We have the results, but the list is in no particular order. To rank the hours, we swap each pair so that the average comes first, then sort in descending order.

In [44]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [47]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
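
Swapping the columns works because lists are compared element by element, so the average decides the order. An alternative is to leave the pairs as `[hour, average]` and give `sorted` a `key` function. The three rounded sample rows and the name `ranked` below are illustrative only.

```python
# Three [hour, average] pairs, rounded from the results above
avg_by_hour_sample = [['09', 5.58], ['15', 38.59], ['02', 23.81]]

# Sort on the second element (the average), largest first
ranked = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)

for hour, avg in ranked:
    print(hour, avg)
```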

In [53]:
print("Top 5 Hours for Ask Posts Comments")

for avg, hr in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post".format(dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg))
    
    

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The hour with the most comments per Ask Post is 3pm (15:00), and by some distance. This suggests that an Ask Post on Hacker News is most likely to attract comments if it is created during that hour. The documentation for the dataset indicates that the timezone used is EST, which is five hours behind GMT, so 3pm EST corresponds to 8pm (20:00) GMT.
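
The same conversion can be applied to all of the top 5 hours. The sketch below assumes a fixed UTC-5 offset for EST with no daylight saving adjustment and treats GMT as UTC. The helper name `est_hour_to_gmt` and the placeholder date are assumptions made for illustration.

```python
import datetime as dt

# Fixed offset for EST (UTC-5); daylight saving time is deliberately ignored
EST = dt.timezone(dt.timedelta(hours=-5), "EST")

def est_hour_to_gmt(hour):
    """Convert an hour of the day in EST to the matching HH:MM in GMT."""
    local = dt.datetime(2016, 1, 1, hour, 0, tzinfo=EST)  # date is arbitrary
    return local.astimezone(dt.timezone.utc).strftime("%H:%M")

# Top 5 hours from the ranking above
for hr in (15, 2, 20, 16, 21):
    print("{:02d}:00 EST -> {} GMT".format(hr, est_hour_to_gmt(hr)))
```

Hours late in the EST evening roll over into the early morning of the next day in GMT.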

**Conclusion**

The Hacker News dataset was analysed, and Ask Posts were found to receive more comments on average than Show Posts. For Ask Posts, the hour with the highest average number of comments was 3pm EST, or 8pm GMT. Since 4pm EST is also in the top 5, we would suggest creating posts during the 3pm or 4pm EST hours (8pm or 9pm GMT) to maximise comments. The other hours in the top 5 are 2am, 8pm and 9pm EST (7am, 1am and 2am GMT), so posts created then would also stand a good chance of receiving more comments.