# Finding Five Hot Hacker News Posting Hours in Pakistan

In this project, we aim to find five hot Hacker News posting hours for people in Pakistan, so that their posts have a higher chance of receiving comments. Hacker News is a popular website where technology-related posts, or stories, are voted and commented on. We are interested in posts whose titles begin with `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Likewise, users submit `Show HN` posts to show the community a project, a product, or just something interesting.

We'll analyze existing data about Hacker News posts to find the hot HN posting hours in Pakistan. To support our recommendation, we'll try to find out:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

### Summary of Results

After analyzing the data, we conclude that the hottest hour for creating posts in Pakistan is `3:00pm`, when `Ask HN` posts received the most comments on average. The next four hours are `2:00am`, `8:00pm`, `4:00pm`, and `9:00pm`. Posting at these hours raises the chance of receiving comments but does not guarantee them.

For more details, please refer to the full analysis below.

# Exploring Existing Data

To avoid spending money on organizing a survey, we'll try to make use of existing data to determine whether we can reach any reliable result. 

The dataset is publicly available on Kaggle, along with full descriptions of its columns. It has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments. Below is a quick exploration of the data, which is stored locally in `hacker_news.csv`.

In [1]:
# Read the data
import csv
with open('hacker_news.csv') as csvfile:
    hn = list(csv.reader(csvfile, delimiter=','))

# Extracting headers
headers = hn[0]
# Removing headers from data
hn = hn[1:]
# Display first five rows
hn[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]
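
The later cells read columns by position (`row[1]`, `row[4]`, `row[6]`). As a minimal sketch, those positions can be looked up by name from the header row instead. The header row is never printed in this notebook, so the column names below (other than `title` and `num_comments`, which are mentioned in the text) are assumptions; the data row is the first row displayed above.

```python
import csv
import io

# Hypothetical two-line sample: the header names are assumed, and the data
# row is copied from the first row displayed above.
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

rows = list(csv.reader(sample))
sample_headers, sample_data = rows[0], rows[1:]

# Map each column name to its position so later code can avoid magic numbers.
col = {name: index for index, name in enumerate(sample_headers)}

first = sample_data[0]
print(first[col["title"]], first[col["num_comments"]], first[col["created_at"]])
```

If the real header row differs, only the keys of `col` change; the lookup pattern stays the same.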

# Extracting Ask HN and Show HN Posts
As mentioned in the introduction, we want to find the posting hours with a higher chance of receiving comments, and we are only interested in `Ask HN` and `Show HN` posts. For the purpose of our analysis, we need to extract the posts whose titles start with either `Ask HN` or `Show HN`.

The dataset provides information about Hacker News posts. Every row has a `title` column containing the title of the post. To extract the `Ask HN` and `Show HN` posts, we filter for posts whose titles begin with either "Ask HN" or "Show HN".

In [2]:
# Extract posts whose titles start with Ask HN or Show HN
import re

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if re.search("^Ask HN",title,re.IGNORECASE):
        ask_posts.append(row)
    elif re.search("^Show HN",title,re.IGNORECASE):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Total Ask Posts: ",len(ask_posts))
print("Total Show Posts: ",len(show_posts))
print("Total Other Posts: ",len(other_posts))

Total Ask Posts:  1744
Total Show Posts:  1162
Total Other Posts:  17194


In the last code block, we separated the posts that begin with `Ask HN` and `Show HN` into two lists of lists named `ask_posts` and `show_posts`, and put the remaining posts in the `other_posts` list of lists. From the counts above:

- There are more Ask HN posts (1,744) than Show HN posts (1,162).
- Together they make up far fewer posts than the other category (17,194).

This suggests that, among posts that received comments, the HN community asks questions more often than it shows projects, products, or other interesting things.
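
The case-insensitive prefix test does not strictly need a regular expression. As a minimal sketch, the same split can be expressed with `str.startswith` on a lowercased title; `classify` is a hypothetical helper introduced only for illustration, and the sample titles are taken from rows displayed in this notebook.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Titles copied from rows displayed in this notebook.
titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "Interactive Dynamic Video",
]
print([classify(t) for t in titles])
```

For plain ASCII prefixes like these, this should behave the same as the `re.IGNORECASE` search used above.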

Below are the first five rows in the `ask_posts` list of lists:

In [3]:
# Display first five Ask HN posts
ask_posts[:5]

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14'],
 ['12210105',
  'Ask HN: Looking for Employee #3 How do I do it?',
  '',
  '1',
  '3',
  'sph130',
  '8/2/2016 14:20'],
 ['10394168',
  'Ask HN: Someone offered to buy my browser extension from me. What now?',
  '',
  '28',
  '17',
  'roykolak',
  '10/15/2015 16:38']]

Below are the first five rows in the `show_posts` list of lists:

In [4]:
# Display first five Show HN posts
show_posts[:5]

[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46'],
 ['11590768',
  'Show HN: Shanhu.io, a programming playground powered by e8vm',
  'https://shanhu.io',
  '1',
  '1',
  'h8liu',
  '4/28/2016 18:05'],
 ['12178806',
  'Show HN: Webscope  Easy way for web developers to communicate with Clients',
  'http://webscopeapp.com',
  '3',
  '3',
  'fastbrick',
  '7/28/2016 7:11'],
 ['10872799',
  'Show HN: GeoScreenshot  Easily test Geo-IP based web pages',
  'https://www.geoscreenshot.com/',
  '1',
  '9',
  'kpsychwave',
  '1/9/2016 20:45']]

# Determining Whether Ask HN or Show HN Posts Receive More Comments

Let's begin by finding out whether **Ask HN** or **Show HN** posts receive more comments on average. This is a good starting point for finding the hot posting hours in Pakistan, when posts are more likely to receive comments.

The dataset provides the number of comments each post has received over a year. We will examine the `num_comments` column, which contains the total number of comments for each post. To determine which posts receive more comments, we'll calculate the **Average Ask Comments** and **Average Show Comments**.

In [5]:
# Determine if Ask HN or Show HN posts receive more comments
ask_comments = [int(row[4]) for row in ask_posts]
show_comments = [int(row[4]) for row in show_posts]

avg_ask_comments = sum(ask_comments)/len(ask_comments)
avg_show_comments = sum(show_comments)/len(show_comments)

print("Average Ask Comments: ",avg_ask_comments)
print("Average Show Comments: ",avg_show_comments)

Average Ask Comments:  14.038417431192661
Average Show Comments:  10.31669535283993


According to the analysis above, we determined that on average, `Ask HN` posts receive more comments than `Show HN` posts. Since `Ask HN` posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.
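
A mean can be pulled upward by a few heavily commented posts, so it may be worth checking the median alongside it. The sketch below uses only the `num_comments` values from the five `Ask HN` and five `Show HN` rows displayed earlier. Such a small sample is not representative of the full dataset and is used purely to illustrate the calculation.

```python
from statistics import mean, median

# num_comments for the first five Ask HN and Show HN rows displayed earlier.
ask_sample = [6, 29, 1, 3, 17]
show_sample = [22, 102, 1, 3, 9]

for label, sample in (("Ask HN", ask_sample), ("Show HN", show_sample)):
    print(label, "mean:", mean(sample), "median:", median(sample))
```

In this tiny sample, the single 102-comment `Show HN` post lifts the mean well above the median, which is the kind of effect to watch for when comparing averages.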

# Ask HN Posts & Comments by Hour Created

Now that we have determined which posts receive more comments on average, it's time to address the second question from the introduction: are posts created at a certain time more likely to attract comments? This takes us another step closer to finding the hottest posting hours in Pakistan.

We'll use the following steps to perform this analysis:

1. Calculate the number of Ask HN posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments Ask HN posts receive by hour created.

In the next code block, we tackle the first step: for each hour of the day, we count the `Ask HN` posts created in that hour and total the comments they received.

In [6]:
# Number of Ask HN posts and comments by hour created
import datetime as dt

created_date = [row[6] for row in ask_posts]

counts_by_hour = {}
comments_by_hour = {}

# Pair each creation timestamp with its comment count
result_list = list(zip(created_date,ask_comments))

for date,comment in result_list:
    # Parse the timestamp and keep only the zero-padded hour, e.g. "09"
    hour = dt.datetime.strptime(date,"%m/%d/%Y %H:%M").strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment

We have created two dictionaries:

- `counts_by_hour`: contains the number of **Ask HN** posts created during each hour of the day.
- `comments_by_hour`: contains the total number of comments received by **Ask HN** posts created during each hour of the day.

We'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.
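
The keys of both dictionaries are zero-padded hour strings produced by `strptime` followed by `strftime("%H")`. A quick check on one timestamp, copied from the first `Ask HN` row shown earlier, illustrates the conversion:

```python
import datetime as dt

# Timestamp copied from the first Ask HN row shown earlier.
created_at = "8/16/2016 9:55"

# %m, %d and %H accept values without a leading zero when parsing.
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
hour_key = parsed.strftime("%H")

print(parsed)    # 2016-08-16 09:55:00
print(hour_key)  # 09
```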

In [7]:
# Calculate avg number of comments for Ask HN posts by hour
avg_by_hour = [[hour, comments_by_hour[hour]/counts_by_hour[hour]] for hour in counts_by_hour]
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

In the last code block, we calculated the average number of comments for posts created during each hour of the day and stored the results in a list of lists named `avg_by_hour`.
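
The counting and averaging steps above can also be sketched in one pass with `collections.defaultdict`, which removes the need to test whether an hour has been seen before. This is an alternative formulation, not the method used in the cells above. The first five pairs are copied from the `Ask HN` rows shown earlier; the last pair is hypothetical and is added only so that one hour contains two posts.

```python
import datetime as dt
from collections import defaultdict

# (created_at, num_comments) pairs from the Ask HN rows shown earlier.
sample = [
    ("8/16/2016 9:55", 6),
    ("11/22/2015 13:43", 29),
    ("5/2/2016 10:14", 1),
    ("8/2/2016 14:20", 3),
    ("10/15/2015 16:38", 17),
    ("1/1/2016 9:05", 4),  # hypothetical row, not from the dataset
]

# hour -> [post count, comment total]
totals = defaultdict(lambda: [0, 0])
for created_at, comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    totals[hour][0] += 1
    totals[hour][1] += comments

# Average comments per post for each hour in the sample.
avg_sample = {hour: total / count for hour, (count, total) in totals.items()}
print(avg_sample)
```

The result is kept in a separate name, `avg_sample`, so that it does not overwrite `avg_by_hour` if both are run in the same session.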

Although we now have the average number of comments for each hour of the day, the unsorted list makes it difficult to see which hours stand out. To find the top five hottest hours, we need to sort `avg_by_hour` in descending order of average comments.

In [8]:
# Sort by Comments
avg_by_hour = sorted(avg_by_hour,key=lambda x:x[1], reverse=True)

# Format printing
print("Top 5 Hours for 'Ask HN' Comments")
for hour,average in avg_by_hour[0:5]:
    print("{} {:.2f} average comments per post".format(dt.datetime.strptime(hour,"%H").strftime("%H:00"),average))

Top 5 Hours for 'Ask HN' Comments
15:00 38.59 average comments per post
02:00 23.81 average comments per post
20:00 21.52 average comments per post
16:00 16.80 average comments per post
21:00 16.01 average comments per post
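
This notebook does not state which time zone the `created_at` values are recorded in. If they are not already in Pakistan Standard Time (UTC+5, with no daylight saving time), the hours above would need to be shifted before being read as local hours. The sketch below shows one way to do that for a fixed-offset source zone. `to_pakistan_hour` is a hypothetical helper, and the source offset of 0 (UTC) used in the loop is an assumption for illustration only, not a property of the dataset.

```python
import datetime as dt

# Pakistan Standard Time is a fixed UTC+5 offset.
PKT = dt.timezone(dt.timedelta(hours=5), "PKT")


def to_pakistan_hour(hour, source_utc_offset_hours):
    """Shift an 'HH' hour string from a fixed-offset source zone to PKT."""
    source = dt.timezone(dt.timedelta(hours=source_utc_offset_hours))
    # Any date works here because both zones are fixed offsets.
    moment = dt.datetime(2016, 1, 1, int(hour), tzinfo=source)
    return moment.astimezone(PKT).strftime("%H:00")


# Assumption for illustration: treat the source timestamps as UTC (offset 0).
for hour in ["15", "02", "20", "16", "21"]:
    print(hour, "->", to_pakistan_hour(hour, 0))
```

A source zone that observes daylight saving time cannot be described by a single fixed offset. In that case the full timestamps, rather than the hour alone, would need to be converted, for example with the standard library `zoneinfo` module.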


# Conclusion

After analyzing the data, we conclude that the hottest hour for creating posts in Pakistan is `3:00pm`, with an average of 38.59 comments per `Ask HN` post. The next four hours are `2:00am`, `8:00pm`, `4:00pm`, and `9:00pm`. Posting at these hours raises the chance of receiving comments but does not guarantee them.