# Hacker News Post Exploratory Analysis

## Introduction

The objective of this analysis is to investigate the characteristics of posts that could increase the chance of receiving comments.

The following two perspectives are taken into consideration.
1. The types of posts that receive more comments on average.
We will focus on comparing posts whose titles begin with `Ask HN` and `Show HN` respectively, which are common post types on Hacker News.
- `Ask HN` posts contain questions that the author asks in order to get answers from the Hacker News community.
- `Show HN` posts contain something that the author wishes to share with the Hacker News community.

2. The time of day at which a post should be created to receive more comments on average.

### Data source

The data set investigated in this analysis is [Hacker News Posts](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts), which contains information about Hacker News posts from September 2015 to September 2016.

## Open and read the data

We start by opening the file and reading it into a list of lists for analysis.

In [22]:
# Open and read the file
from csv import reader

opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)

In [23]:
# Print the first four rows of hn (the header plus three posts)
print(hn[:4])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]


In [24]:
# Separate the header from the data
headers = hn[0]
hn = hn[1:]

In [25]:
# Verify the separation
print(headers)
print('\n')
print(hn[:4])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
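
The loading and header-separation steps above can also be written with a context manager, so that the file is closed automatically. The sketch below is a variant for illustration, not the notebook's own code: `load_rows` is a hypothetical helper, and a small in-memory sample (two rows copied from the output above) stands in for `hacker_news.csv` so that the block runs on its own.

```python
import csv
import io


def load_rows(file_obj):
    """Return (header, rows) read from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# In-memory stand-in for hacker_news.csv (rows copied from the output above).
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,3,1,hswarna,6/17/2016 0:01\n"
)

# The with-block closes the file object once reading is done.
with sample as f:
    headers, rows = load_rows(f)

print(headers)
print(len(rows))
```

With the real file, the `with` line would presumably become `with open('hacker_news.csv', newline='') as f:`; any `encoding` argument would be an assumption about how the file was saved.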


## Filter data

Only two types of post are compared in this analysis: ask posts (posts whose titles start with `Ask HN`) and show posts (posts whose titles start with `Show HN`). We therefore first separate these posts from the rest.

In [26]:
# Create three empty lists to store the separated data
ask_posts = []
show_posts = []
other_posts = []

In [27]:
# Separate the data into the three different lists
for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


In [28]:
# Check if the separation is correct
print(ask_posts[:5])
print('\n')
print(show_posts[:5])
print('\n')
print(other_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 

We have separated the data into three different lists:
- Ask posts: 1744 posts
- Show posts: 1162 posts
- Other posts: 17194 posts

We will be looking at the lists of **ask posts** and **show posts** respectively in the analysis.
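
The separation rule used above can be isolated in a small function, which makes it easy to check on individual titles. This is a minimal sketch: `classify_title` is a hypothetical helper name, the first three titles are taken from the outputs above, and the last title is made up to show that a title merely containing the phrase is not counted as an ask post.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
    # Hypothetical title: contains "ask HN" but does not start with it.
    'Why I never ask HN for advice',
]

for title in titles:
    print(classify_title(title), '-', title)
```

Because the check uses `startswith`, the comparison is on the title prefix only, which matches the filtering loop above.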

## Part I: Compare comments received by ask posts and show posts

In this part, we determine whether `Ask HN` posts or `Show HN` posts receive more comments on average.

Average comments received by **ask posts**.

In [29]:
# Find the total number of comments in ask posts
total_ask_comments = 0

for row in ask_posts:
    num_comments = row[4]
    num_comments = int(num_comments)
    total_ask_comments += num_comments
    
print(total_ask_comments)

24483


In [30]:
# Compute the number of comments on average
total_ask_posts = len(ask_posts)

avg_ask_comments = total_ask_comments / total_ask_posts
print(avg_ask_comments)

14.038417431192661


Average comments received by **show posts**.

In [31]:
# Find the total number of comments in show posts
total_show_comments = 0

for row in show_posts:
    num_comments = row[4]
    num_comments = int(num_comments)
    total_show_comments += num_comments
    
print(total_show_comments)

11988


In [32]:
# Compute the number of comments on average
total_show_posts = len(show_posts)

avg_show_comments = total_show_comments / total_show_posts
print(avg_show_comments)

10.31669535283993


Observation:

From the results above, **ask posts** receive 14.04 comments per post on average, while **show posts** receive 10.32 comments per post.

**Ask posts** receive more comments than show posts in general.
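
The two average calculations above follow the same steps, so they could be folded into one function. The following is a sketch under that assumption: `average_comments` is a hypothetical helper, and the sample rows are the first three ask posts from the output shown earlier, not the full list.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count; counts are stored as strings in each row."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# First three ask posts from the output above.
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]

print(average_comments(sample_ask))  # (6 + 29 + 1) / 3 = 12.0
```

Calling the same function on `ask_posts` and `show_posts` would be expected to reproduce the two averages reported above.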

## Part II: Time factor on comments received

In this part, we determine which hours of the day a post should be created in to receive more comments on average.

We will be focusing on the data of **ask posts** in this analysis.

In [33]:
# Import datetime module for analysis of time
import datetime as dt
from datetime import datetime

We are going to calculate the number of ask posts and comments by hour created.

First, we are going to isolate the data of post creation date and number of comments from the list.

In [34]:
# Create a list to store the post creation time
# and number of comments
result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = row[4]
    num_comments = int(num_comments)
    result_list.append([created_at, num_comments])

print(result_list[:3])

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]


From the list created, we are going to count the following:
1. The number of posts created in each hour
2. The number of comments received by the posts created in each hour

The data will be stored in separate dictionaries for the calculation.

In [35]:
# Create frequency tables of posts created per hour
# and comments received by those posts per hour
counts_by_hours = {}
comments_by_hours = {}

for row in result_list:
    created_date = row[0]
    num_comment = row[1]
   
    hour = created_date.split(' ')
    hour = hour[1]
    
    hour = dt.datetime.strptime(hour,'%H:%M')
    hour = hour.strftime('%H')
    
    if hour not in counts_by_hours:
        counts_by_hours[hour] = 1
        comments_by_hours[hour] = num_comment
    else:
        counts_by_hours[hour] += 1
        comments_by_hours[hour] += num_comment

In [36]:
# Check if they are created successfully
print(counts_by_hours)
print('\n')
print(comments_by_hours)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
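
The same per-hour counting can be sketched with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch, and with the whole timestamp parsed in one `strptime` call instead of splitting the string first. This is an alternative illustration rather than the method used above; the sample is the first three entries of `result_list`, and the names `sample_results`, `counts`, and `comments` are introduced only for this example.

```python
from collections import defaultdict
from datetime import datetime

# First three entries of result_list, as printed above.
sample_results = [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]

counts = defaultdict(int)    # posts created per hour
comments = defaultdict(int)  # comments received by those posts per hour

for created_at, num_comments in sample_results:
    # Parse the full timestamp, then keep only the zero-padded hour.
    hour = datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```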


From the two dictionaries, we are going to compute the average number of comments per post during each hour of the day.

In [37]:
avg_by_hour = []

# Both dictionaries share the same keys, so each hour can be looked up directly
for hour, num_post in counts_by_hours.items():
    avg_comment = comments_by_hours[hour] / num_post
    avg_by_hour.append([hour, avg_comment])

print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


Swap the two columns so that the list can be sorted by average, then sort it in descending order.

In [38]:
swap_avg_by_hour = []

for row in avg_by_hour:
    hour = row[0]
    average = row[1]
    swap_avg_by_hour.append([average, hour])
    
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [39]:
sorted_swap = sorted(swap_avg_by_hour,reverse=True)

print(sorted_swap)

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]


Here are the top 5 hours for **ask posts** comments.

In [42]:
print('Top 5 Hours for Ask Posts Comments')

for row in sorted_swap[:5]:
    average = row[0]
    hour = row[1]
    hour = datetime.strptime(hour,'%H')
    hour = hour.strftime('%H:00')
    
    string = '{}: {:.2f} average comments per post'
    formatted_string = string.format(hour, average)
    
    print(formatted_string)

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
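
The swap step exists only so that `sorted` compares the averages first. As a hedged alternative, `sorted` can be given a `key` function that picks the average directly, which leaves the `[hour, average]` order untouched. The sketch below uses four pairs copied from `avg_by_hour`; `sample_avg_by_hour` and `ranked` are names introduced for this example.

```python
# A few [hour, average] pairs copied from avg_by_hour above.
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
]

# Sort by the second element (the average), highest first.
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in ranked[:3]:
    print(f'{hour}:00: {average:.2f} average comments per post')
```

One difference to keep in mind: with a `key`, hours that share exactly the same average keep their original relative order, whereas sorting the swapped lists breaks such ties by the hour string.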


Observation:

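The hours printed above are as recorded in the dataset. Assuming those timestamps are in US Eastern Time, which is 6 hours behind Central European Summer Time (CEST), posts created at the following CEST times receive the highest average number of comments: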

- 21:00
- 08:00
- 02:00
- 22:00
- 03:00
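
The hours in this list are the top five hours from the output above, shifted forward by 6 hours. A minimal sketch of that shift follows. The 6-hour offset is an assumption: it is the usual difference between US Eastern Time and Central European Time, and it applies only if the dataset's `created_at` values are in US Eastern Time. `shift_hour` is a hypothetical helper, and a fixed offset ignores the few weeks each year when the two regions change their clocks on different dates.

```python
def shift_hour(hour, offset_hours=6):
    """Shift a two-digit hour string forward by offset_hours, wrapping at midnight."""
    # offset_hours=6 is an assumed Eastern-to-CEST difference, not a value from the dataset.
    shifted = (int(hour) + offset_hours) % 24
    return f'{shifted:02d}:00'


# Hours from the top-5 output above, in ranking order.
top_hours = ['15', '02', '20', '16', '21']

print([shift_hour(hour) for hour in top_hours])
```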

## Conclusion

The results of this analysis give a preliminary idea of how to increase the chance of receiving more comments on Hacker News, based on the type of post and the time the post is created.

- **Post nature: Ask posts receive more comments in general**

By comparing posts that ask a question (titles starting with `Ask HN`) with posts that present something (titles starting with `Show HN`), it is observed that the former receive about 3.7 more comments per post on average (14.04 versus 10.32).

However, it should be noted that there may be other common post types besides the two above, and these are not studied in this analysis.

- **Time factor: Posts created late at night, after midnight and in the early morning tend to receive more comments**

It is observed that there are three periods in which newly created posts receive more comments on average:

   - Late night (9pm, 10pm) and after midnight (2am, 3am): a possible explanation is that this is leisure time after dinner and chores, when users are active on the internet
   - Early morning (8am): a possible explanation is that this is when users are commuting to school or work

Recommendation:

Based on these results, both the type of post and the time it is created can be taken into account in order to receive more comments. Nevertheless, a more in-depth analysis is recommended to gain better insight in this regard.