# Hacker News Post Analysis

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.


You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that I have reduced it from almost 300,000 rows to about 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the [columns](https://www.kaggle.com/hacker-news/hacker-news-posts):

- id: the unique identifier from Hacker News for the post
- title: the title of the post
- url: the URL that the posts links to, if the post has a URL
- num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: the number of comments on the post
- author: the username of the person who submitted the post
- created_at: the date and time of the post's submission
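
As a minimal sketch of how these columns can be addressed by name, the example below parses a two-line sample with `csv.DictReader`. The sample row is the first data row of the file as printed later in this notebook; the `sample_csv` string and the variable names are illustrative only, and the notebook itself uses `csv.reader` with positional indexes instead.

```python
import csv
import io

# Illustrative two-line sample: the header plus the first data row of the dataset.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

# DictReader uses the header row as keys, so columns are addressed by name.
rows = list(csv.DictReader(io.StringIO(sample_csv)))
first = rows[0]

# Every value is read as a string; numeric columns need an explicit conversion.
num_comments = int(first["num_comments"])
print(first["title"], num_comments, first["created_at"])
```

Because every field arrives as a string, `num_points` and `num_comments` have to be converted with `int()` before any arithmetic.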

This analysis will:
- Determine whether *Ask HN* or *Show HN* posts receive more comments on average
- Determine whether posts created at a certain time receive more comments on average

Users submit *Ask HN* posts to ask the Hacker News community a specific question. Some examples of *Ask HN* posts are:

* Ask HN: How to improve my personal website?
* Ask HN: Am I the only one outraged by Twitter shutting down share counts?
* Ask HN: Aby recent changes to CSS that broke mobile?

Users submit *Show HN* posts to show the Hacker News community a project, product, or something interesting. Some examples of *Show HN* posts are:
* Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
* Show HN: Something pointless I made
* Show HN: Shanhu.io, a programming playground powered by e8vm

In [11]:
# Read the Hacker News file
from csv import reader

# The with-statement closes the file once it has been read
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

# Show the first five rows (the header plus four posts)
hn[0:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

The first row of the dataset contains the column names rather than a post. It is not needed for the analysis, so we separate it into its own list.

In [12]:
headers = hn[0]
hn = hn[1:]
print(headers)
print("\n")
print(hn[0:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


### Extract *Ask HN* and *Show HN* Posts ###

Since we're only concerned with post titles beginning with *Ask HN* or *Show HN*, we'll create new lists of lists from `hn` containing just the data for those titles.

In [20]:
ask_posts = []
show_posts = []
other_posts = []


for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Number of Ask HN posts: {}'.format(len(ask_posts)))
print('Number of Show HN posts: {}'.format(len(show_posts)))
print('Number of other posts: {}'.format(len(other_posts)))

Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of other posts: 17194
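
The prefix check in the loop above can also be written as a small helper, which makes the case-insensitive matching easy to test in isolation. This is only a sketch: `classify_title` and `sample_titles` are hypothetical names that do not appear in the notebook, and the titles are taken from the examples quoted earlier.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # lowercase first so 'ASK HN' and 'Ask HN' both match
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


sample_titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "Interactive Dynamic Video",
]
labels = [classify_title(title) for title in sample_titles]
print(labels)
```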


In [26]:
for entry in range(0,2):
    print(ask_posts[entry][1])
    print(ask_posts[-entry-1][1])

print("\n")

for entry in range(0, 2):
    print(show_posts[entry][1])
    print(show_posts[-entry-1][1])

Ask HN: How to improve my personal website?
Ask HN: Why are papers still published as PDFs?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: How do you balance a serious relationship with starting a company?


Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
Show HN: Parse recipe ingredients using JavaScript
Show HN: Something pointless I made
Show HN: PhantomJsCloud, Headless Browser SaaS


Now let's move on to the first question: whether **ask posts** or **show posts** get more comments on average.

To check this, we read the number of comments for each post (index 4 of each row), add it to a running total, and divide that total by the number of posts in the list. We do this separately for ask posts and show posts.

# Average Number of Comments for *Ask HN* and *Show HN* Posts

In [21]:
for row in ask_posts[0:3]:
    print(row[4])

6
29
1


In [28]:
# Ask HN: total and average number of comments
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)

# Show HN: total and average number of comments
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)

print("The average number of comments on ask posts is {:,.2f}".format(avg_ask_comments))
print("The average number of comments on show posts is {:,.2f}".format(avg_show_comments))

The average number of comments on ask posts is 14.04
The average number of comments on show posts is 10.32
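
The two loops above differ only in the list they iterate over, so the calculation could be factored into one function. The sketch below assumes the same row layout (comment count as a string at index 4); `average_comments` and `sample_rows` are hypothetical names, and the three comment counts are the ones printed for the first three ask posts above.

```python
def average_comments(posts, comments_index=4):
    """Mean of the comment counts for a list of rows; 0.0 for an empty list."""
    if not posts:
        return 0.0  # avoid dividing by zero when there are no posts
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Illustrative rows: only the num_comments field (index 4) is filled in.
sample_rows = [
    ["", "", "", "", "6", "", ""],
    ["", "", "", "", "29", "", ""],
    ["", "", "", "", "1", "", ""],
]
print("{:.2f}".format(average_comments(sample_rows)))
```

With such a helper, the two averages would come from `average_comments(ask_posts)` and `average_comments(show_posts)`.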


To quantify the difference, we subtract the average number of comments on show posts from the average number of comments on ask posts. The result is how many more comments an ask post receives on average.

In [30]:
avg_comments = avg_ask_comments - avg_show_comments
print("The ask section gets "+"{:.2f}".format(avg_comments)+' more comments')

The ask section gets 3.72 more comments


**Ask posts** receive on average about **4 comments** more per post.

A plausible explanation is that ask posts explicitly invite users to suggest solutions, so readers are more involved than they are with show posts.

Since ask posts are more likely to receive comments, we will focus the remaining analysis on these posts only.

As a next step, we will investigate whether ask posts created at a certain time receive more comments. To do this, we will count the number of ask posts created in each hour of the day, sum the number of comments they received, and calculate the average number of comments per ask post for each hour.

In [33]:
import datetime as dt

In [37]:
result_list = []

# Iterate through each post in ask_posts, read the timestamp
# (created_at) and the number of comments (num_comments),
# and append both as a two-element list to result_list.
# A separate variable named `list` is avoided so that the
# built-in list type is not shadowed.

for row in ask_posts:
    created_at = row[6]
    num_comments = row[4]
    result_list.append([created_at, num_comments])


In [38]:
result_list[0:5]

[['8/16/2016 9:55', '6'],
 ['11/22/2015 13:43', '29'],
 ['5/2/2016 10:14', '1'],
 ['8/2/2016 14:20', '3'],
 ['10/15/2015 16:38', '17']]
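
Each timestamp is still a string at this point. The short sketch below shows how one of them can be parsed with `datetime.strptime` and how the hour can then be read either as an integer or as a zero-padded label; the variable names are illustrative, and the timestamp is the first entry printed above.

```python
import datetime as dt

timestamp = "8/16/2016 9:55"  # first entry of result_list shown above

# %m/%d/%Y %H:%M matches month/day/year followed by a 24-hour time.
parsed = dt.datetime.strptime(timestamp, "%m/%d/%Y %H:%M")

hour_number = parsed.hour           # hour as an integer
hour_label = parsed.strftime("%H")  # hour as a zero-padded string
print(hour_number, hour_label)
```

The zero-padded string form is convenient as a dictionary key because it sorts in chronological order.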

In [43]:
counts_by_hour = {}    # number of ask posts created during each hour of the day
comments_by_hour = {}  # total number of comments on ask posts created during each hour
for row in result_list:
    num_comment = int(row[1])
    hour_dt = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")  # convert the date string to a datetime object
    hour = hour_dt.strftime('%H')  # extract the zero-padded hour (%H) from the datetime object


    # Build two frequency tables keyed by hour: increment the post
    # count for this hour and add this post's comments to the
    # running comment total for the same hour.
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comment
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comment
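
The same per-hour aggregation can be sketched with `collections.defaultdict`, which removes the need for the `if`/`else` branch. This is an alternative formulation rather than the notebook's own code: the first five sample rows are the ones printed from `result_list` above, and the sixth row is hypothetical, added only so that one hour contains two posts.

```python
from collections import defaultdict
import datetime as dt

sample = [
    ["8/16/2016 9:55", "6"],
    ["11/22/2015 13:43", "29"],
    ["5/2/2016 10:14", "1"],
    ["8/2/2016 14:20", "3"],
    ["10/15/2015 16:38", "17"],
    ["1/1/2016 9:05", "4"],  # hypothetical row, not from the dataset
]

counts = defaultdict(int)    # posts per hour; missing keys start at 0
comments = defaultdict(int)  # total comments per hour; missing keys start at 0
for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += int(num_comments)

# Average comments per post for each hour that has at least one post.
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages["09"])
```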



In [79]:

hours = []
for hour in counts_by_hour:
    avg_by_hours = comments_by_hour[hour]/counts_by_hour[hour]
    hours.append([hour,avg_by_hours])
    
    

In [80]:
hours

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

In [81]:
avg_by_hour = hours
swap_avg_by_hour= []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
print(swap_avg_by_hour)


[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [82]:
swap_avg_by_hour = sorted(swap_avg_by_hour,reverse=True)

In [83]:
swap_avg_by_hour[0:5]

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21']]
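
Swapping the columns works because `sorted` compares lists by their first element. An alternative sketch, assuming the same `[hour, average]` layout as `hours`, is to keep the original order and pass a `key` function instead. The sample values below are copied from the output above; `hours_sample` and `top_three` are illustrative names.

```python
hours_sample = [
    ["09", 5.5777777777777775],
    ["13", 14.741176470588234],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
    ["20", 21.525],
]

# Sort by the average (second element), largest first, and keep the top three.
top_three = sorted(hours_sample, key=lambda pair: pair[1], reverse=True)[:3]
print(top_three)
```

This leaves each pair in `[hour, average]` order, so no second list is needed.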

In [74]:
for average, hour in swap_avg_by_hour[:5]:
    hour_object = dt.datetime.strptime(hour, '%H') # convert the string to datetime object
    time = hour_object.strftime('%H:%M') # format the datetime object 
    print('{time}: {average:.2f} comments per post'.format(time=time, average=average) )

15:00: 38.59 comments per post
02:00: 23.81 comments per post
20:00: 21.52 comments per post
16:00: 16.80 comments per post
21:00: 16.01 comments per post


In [85]:
for average, hour_str in swap_avg_by_hour[0:5]:
    hour_object = dt.datetime.strptime(hour_str, '%H')
    hour = hour_object.strftime("%H:%M")
    print('{time}: {average:.2f} comments per post'.format(time=hour, average=average))

15:00: 38.59 comments per post
02:00: 23.81 comments per post
20:00: 21.52 comments per post
16:00: 16.80 comments per post
21:00: 16.01 comments per post


We should create a post at around 15:00 to have a higher chance of receiving comments.

The timestamps in this dataset are in US Eastern Time. To get the equivalent posting times in Germany, we add 6 hours to each time to convert it to Central European Time (CET). For London, the offset from Eastern Time would be 5 hours.

In [89]:
print('Top 5 Hours in CET (Berlin) for Ask Posts Comments')

for average, hour in swap_avg_by_hour[:5]:
    hour_object = dt.datetime.strptime(hour, '%H') # convert the string to datetime object
    cet = hour_object + dt.timedelta(hours=6)
    time = cet.strftime('%H:%M') # format the datetime object
    print('{time}: {average:.2f} comments per post'.format(time=time, average=average))

Top 5 Hours in CET (Berlin) for Ask Posts Comments
21:00: 38.59 comments per post
08:00: 23.81 comments per post
02:00: 21.52 comments per post
22:00: 16.80 comments per post
03:00: 16.01 comments per post
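
Adding a fixed number of hours works, but the conversion can also be expressed with timezone-aware datetimes, which makes the offsets explicit. The sketch below is a simplification under stated assumptions: it uses fixed standard-time offsets (Eastern UTC-5, CET UTC+1, London UTC+0) and ignores daylight saving time, and `convert_hour` is a hypothetical helper. The standard-library `zoneinfo` module (Python 3.9+) could be used instead if daylight saving rules matter.

```python
import datetime as dt

# Assumed fixed offsets (standard time only, no daylight saving).
EASTERN = dt.timezone(dt.timedelta(hours=-5), "EST")
CET = dt.timezone(dt.timedelta(hours=1), "CET")
LONDON = dt.timezone(dt.timedelta(hours=0), "GMT")


def convert_hour(hour_str, target_tz):
    """Convert an Eastern-time hour label such as '15' to 'HH:MM' in target_tz."""
    # The date is arbitrary; only the time of day matters here.
    eastern_time = dt.datetime(2016, 1, 15, int(hour_str), 0, tzinfo=EASTERN)
    return eastern_time.astimezone(target_tz).strftime("%H:%M")


for hour in ["15", "02", "20"]:
    print(hour, convert_hour(hour, CET), convert_hour(hour, LONDON))
```

Hours that cross midnight, such as 20:00 Eastern, wrap around to the early morning of the next day in the target zone.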


Our analysis found that *Ask HN* posts receive more comments on average than *Show HN* posts. To give an ask post the best chance of receiving comments, we should publish it around **3pm** Eastern Time. That corresponds to around **9pm** in Central European Time (Berlin), the offset used in the code above; in London it would be around 8pm.

Even though this analysis has now given us an estimate of when it is worth writing an article, it doesn't mean that we will always get the most comments if we publish at 3pm. So, of course, the content of the article still plays a key role.