# Analysis of Hacker News Posts

[Hacker News](https://news.ycombinator.com/) is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

The full dataset, from which a sample is used for this project, can be found on [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts). The dataset used here has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted

This project compares two types of posts on the Hacker News (HN) platform: Ask HN posts (in which users ask the HN community a question) and Show HN posts (in which users show the HN community a project, product, or something generally interesting).

The objective is to discover:

- Which type of HN post (ask or show) receives more comments
- Whether the time of day at which a post is created relates to the number of comments it receives

# Introduction
First, we read in the dataset and display its first five rows, including the header row.

In [1]:
# Read in the data.
from csv import reader

# Open the Hacker News data set and load every row into a list of lists.
# The `with` block closes the file once the rows have been read.
with open('hacker_news.csv', encoding='utf8') as opened_file:
    hn_dataset = list(reader(opened_file))
hn_dataset[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

# Removing Headers from a List of Lists

In [2]:
headers = hn_dataset[0] # extracting the header row from the full dataset
hn = hn_dataset[1:] # remaining data without header row

In [3]:
headers # displaying the header row

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [4]:
hn[:5] # Displaying the first five rows of the hn dataset

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

# Filtering Ask HN and Show HN Posts

Because we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.

To find the posts that begin with either Ask HN or Show HN, we'll use the string method `startswith`. Given a string object, say `string1`, we can check whether it starts with, say, `dq` by calling `string1.startswith('dq')`. If `string1` starts with `dq`, the call returns `True`; otherwise it returns `False`.
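
As a small illustration of how `startswith` behaves, the sketch below uses made-up titles (not rows from the dataset). It also shows that the check is case-sensitive, which is why lowercasing a title first is useful:

```python
# Hypothetical titles, used only to illustrate the behaviour of startswith.
titles = [
    'Ask HN: How do I learn Rust?',
    'ask hn: best laptop?',
    'Show HN: My side project',
]

for title in titles:
    # Lowercase the title first so that case variations are matched too.
    is_ask = title.lower().startswith('ask hn')
    print(title, '->', is_ask)

# Without lowercasing, a differently cased prefix does not match.
exact_match = 'ask hn: best laptop?'.startswith('Ask HN')
print(exact_match)
```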

Below we use this method, together with `lower` to ignore case, to separate posts beginning with `Ask HN` and `Show HN` (and case variations) into two different lists.

In [5]:
# Identify posts that begin with either `Ask HN` or `Show HN` and separate the data into different lists.
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


This indicates that there are 1,744 Ask HN posts, 1,162 Show HN posts and 17,194 Other posts.

The first five rows of the `ask_posts` list of lists:

In [6]:
ask_posts[:5]

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14'],
 ['12210105',
  'Ask HN: Looking for Employee #3 How do I do it?',
  '',
  '1',
  '3',
  'sph130',
  '8/2/2016 14:20'],
 ['10394168',
  'Ask HN: Someone offered to buy my browser extension from me. What now?',
  '',
  '28',
  '17',
  'roykolak',
  '10/15/2015 16:38']]

The first five rows of the `show_posts` list of lists:

In [7]:
show_posts[:5]

[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46'],
 ['11590768',
  'Show HN: Shanhu.io, a programming playground powered by e8vm',
  'https://shanhu.io',
  '1',
  '1',
  'h8liu',
  '4/28/2016 18:05'],
 ['12178806',
  'Show HN: Webscope  Easy way for web developers to communicate with Clients',
  'http://webscopeapp.com',
  '3',
  '3',
  'fastbrick',
  '7/28/2016 7:11'],
 ['10872799',
  'Show HN: GeoScreenshot  Easily test Geo-IP based web pages',
  'https://www.geoscreenshot.com/',
  '1',
  '9',
  'kpsychwave',
  '1/9/2016 20:45']]

# Calculating the Average Number of Comments for Ask HN and Show HN Posts

In [8]:
# Calculating the total and average number of ask HN comments
total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)

# Calculating the total and average number of show comments
total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)


print('Average Ask HN comments: {:.2f}'.format(avg_ask_comments))
print('Average Show HN comments: {:.2f}'.format(avg_show_comments))

Average Ask HN comments: 14.04
Average Show HN comments: 10.32


Ask posts receive 14.04 comments on average, while show posts receive 10.32, so ask posts receive more comments on average than show posts.

Since ask posts receive more comments on average, we'll focus the remaining analysis on these posts.
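
The two averaging loops in the cell above follow the same pattern, so they could be folded into one helper. The sketch below assumes a hypothetical function name, `average_comments`, and uses three made-up rows in place of the real lists:

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for a list of post rows.

    Assumes each row stores num_comments as a string at comments_index,
    as in the dataset described above. Returns 0.0 for an empty list.
    """
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Made-up rows in the dataset's column order, for illustration only.
sample_posts = [
    ['1', 'Ask HN: example A', '', '2', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: example B', '', '28', '29', 'user_b', '11/22/2015 13:43'],
    ['3', 'Ask HN: example C', '', '1', '1', 'user_c', '5/2/2016 10:14'],
]
sample_avg = average_comments(sample_posts)
print('{:.2f}'.format(sample_avg))
```

With the real data, the same helper would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.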

# Determining Total Ask Posts and Comments Created by Hour

The next step is to determine whether ask posts created at a certain time of day are more likely to attract comments. To achieve this, we will use the following steps:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments that ask posts receive for each hour of creation.


In this part, we calculate the number of ask posts created in each hour and the number of comments those posts received. We use the [datetime module](https://docs.python.org/3/library/datetime.html) to work with the data in the `created_at` column.
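
For instance, a `created_at` value such as `'8/16/2016 9:55'` can be parsed with `strptime` and the format string `'%m/%d/%Y %H:%M'`, after which the hour is available as an attribute. A minimal sketch (the variable names are illustrative):

```python
import datetime as dt

created_at = '8/16/2016 9:55'  # same format as the created_at column
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

hour = parsed.hour  # integer hour of the day, from 0 to 23
print(parsed, hour)
```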

In [9]:
#importing datetime library
import datetime as dt

In [10]:
result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])]) # Appending created_at and number of comments in result_list

In [11]:
# This code aggregates the total number of posts and comments per hour
posts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    comments = row[1]
    date = row[0]
    date = dt.datetime.strptime(date, '%m/%d/%Y %H:%M')
    hour = date.hour
    if hour not in posts_by_hour:
        posts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        posts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

In [12]:
# Output from the posts_by_hour dictionary, sorted in descending order
# We can see that the most posts were created during hour 15 (116)

print('Posts created per hour (descending order):')
dict(sorted(posts_by_hour.items(), key=lambda item: item[1], reverse=True))

Posts created per hour (descending order):


{15: 116,
 19: 110,
 21: 109,
 18: 109,
 16: 108,
 14: 107,
 17: 100,
 13: 85,
 20: 80,
 12: 73,
 22: 71,
 23: 68,
 1: 60,
 10: 59,
 2: 58,
 11: 58,
 0: 55,
 3: 54,
 8: 48,
 4: 47,
 5: 46,
 9: 45,
 6: 44,
 7: 34}

In [13]:
# Output from the comments_by_hour dictionary, sorted in descending order
# We see that posts created during hour 15 received the most comments (4477)

print('Total comments added by hour (Descending order):')
dict(sorted(comments_by_hour.items(), key=lambda item: item[1], reverse=True))

Total comments added by hour (Descending order):


{15: 4477,
 16: 1814,
 21: 1745,
 20: 1722,
 18: 1439,
 14: 1416,
 2: 1381,
 13: 1253,
 19: 1188,
 17: 1146,
 10: 793,
 12: 687,
 1: 683,
 11: 641,
 23: 543,
 8: 492,
 22: 479,
 5: 464,
 0: 447,
 3: 421,
 6: 397,
 4: 337,
 7: 267,
 9: 251}

After the last step, we have two dictionaries, as described below:

- posts_by_hour: contains the number of ask posts created during each hour of the day.
- comments_by_hour: contains the total number of comments received by the ask posts created during each hour.

Having calculated the number of posts and comments for each hour, we can now calculate the average number of comments for posts created during each hour of the day.
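
The same tallies could also be built with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. This is only an alternative sketch, with a few made-up `[created_at, num_comments]` pairs standing in for `result_list`:

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs, for illustration only.
sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 9:14', 1],
]

sample_posts_by_hour = defaultdict(int)     # missing hours start at 0
sample_comments_by_hour = defaultdict(int)

for created_at, comments in sample_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    sample_posts_by_hour[hour] += 1
    sample_comments_by_hour[hour] += comments

print(dict(sample_posts_by_hour))
print(dict(sample_comments_by_hour))
```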

# Determining Average Number of Comments for Ask HN Posts by Hour

In this part we will calculate the average number of comments per post for posts created during each hour of the day.

The result is a list of lists in which the first element is the hour and the second element is the average number of comments per post. The result is stored in a variable named `avg_by_hour`.

In [14]:
avg_by_hour = []
for hour in posts_by_hour:
    avg_by_hour.append([hour, round(comments_by_hour[hour] / posts_by_hour[hour], 2)])

In [15]:
print('Average comments made by hour (Descending order):')
sorted(avg_by_hour, key=lambda x: x[1], reverse=True)

Average comments made by hour (Descending order):


[[15, 38.59],
 [2, 23.81],
 [20, 21.52],
 [16, 16.8],
 [21, 16.01],
 [13, 14.74],
 [10, 13.44],
 [14, 13.23],
 [18, 13.2],
 [17, 11.46],
 [1, 11.38],
 [11, 11.05],
 [19, 10.8],
 [8, 10.25],
 [5, 10.09],
 [12, 9.41],
 [6, 9.02],
 [0, 8.13],
 [23, 7.99],
 [7, 7.85],
 [3, 7.8],
 [4, 7.17],
 [22, 6.75],
 [9, 5.58]]

After sorting the results in descending order of the average number of comments per post, we can see that the highest average occurs for posts created during hour 15 (15:00), with 38.59 comments per post.
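
As a quick sanity check, the top value can be reproduced from the two totals reported earlier for hour 15 (116 posts and 4,477 comments):

```python
posts_at_15 = 116       # from the posts_by_hour output above
comments_at_15 = 4477   # from the comments_by_hour output above

avg_at_15 = round(comments_at_15 / posts_at_15, 2)
print(avg_at_15)  # 38.59
```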

# Printing Sorted Values from a List of Lists

In [16]:
# Getting the top 5 hours (EST) for ask post comments
ask_sorted = sorted(avg_by_hour, key=lambda x: x[1], reverse=True)

print('Top 5 Hours for Ask Posts Comments in EST Timezone')
for hour, avg in ask_sorted[:5]:
    hour_est = dt.datetime.strptime(str(hour),'%H') 
    hour_est = hour_est.strftime('%H:%M')

    print('{}(EST): {} average comments per post'.format(hour_est,avg))

Top 5 Hours for Ask Posts Comments in EST Timezone
15:00(EST): 38.59 average comments per post
02:00(EST): 23.81 average comments per post
20:00(EST): 21.52 average comments per post
16:00(EST): 16.8 average comments per post
21:00(EST): 16.01 average comments per post


From this we see that, in this sample, ask posts created during the 15:00 hour received the most comments on average. The average for that hour is about 62% higher than the average for the second-highest hour (38.59 versus 23.81).
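
The relative difference between the top two hours can be worked out directly from the averages printed above:

```python
highest_avg = 38.59   # hour 15
second_avg = 23.81    # hour 2

relative_increase = (highest_avg - second_avg) / second_avg
print('{:.0%}'.format(relative_increase))  # 62%
```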

According to the data set documentation, the timezone used is Eastern Time in the US. Below, the top hours are converted to my local time zone (GMT+2).

In [17]:
print('Top 5 Hours for Ask Posts Comments in GMT+2 Timezone')
for hour, avg in ask_sorted[:5]:
    hour_gmt2 = dt.datetime.strptime(str(hour),'%H')
    hour_gmt2 += dt.timedelta(hours=7) # converting EST (UTC-5) to GMT+2
    hour_gmt2 = hour_gmt2.strftime('%H:%M')

    print('{}(GMT+2): {} average comments per post'.format(hour_gmt2,avg))

Top 5 Hours for Ask Posts Comments in GMT+2 Timezone
22:00(GMT+2): 38.59 average comments per post
09:00(GMT+2): 23.81 average comments per post
03:00(GMT+2): 21.52 average comments per post
23:00(GMT+2): 16.8 average comments per post
04:00(GMT+2): 16.01 average comments per post
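
An alternative to adding a fixed `timedelta` by hand is to attach time zones to the datetime and let `astimezone` do the conversion. The sketch below assumes fixed offsets of UTC-5 for EST and UTC+2 for the local zone, ignores daylight saving time, and uses an arbitrary placeholder date. The helper name `est_hour_to_gmt2` is hypothetical:

```python
import datetime as dt

# Assumed fixed offsets; daylight saving time is deliberately ignored.
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')
GMT_PLUS_2 = dt.timezone(dt.timedelta(hours=2), 'GMT+2')

def est_hour_to_gmt2(hour):
    """Convert an hour of the day in EST to an 'HH:MM' string in GMT+2."""
    # The date is an arbitrary placeholder; only the time of day matters here.
    est_time = dt.datetime(2016, 1, 1, hour, 0, tzinfo=EST)
    return est_time.astimezone(GMT_PLUS_2).strftime('%H:%M')

converted = [est_hour_to_gmt2(h) for h in (15, 2, 20)]
print(converted)
```

Hours late in the EST day wrap past midnight into the next day in GMT+2, which is why 20:00 EST maps to 03:00.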


# Conclusion

The goal of this project was to determine which type of post, Ask HN or Show HN, receives more comments on average, and at what time of day. Based on this analysis, to increase the chance of a post receiving many comments, we'd recommend that it be an ask post created between 15:00 and 16:00 EST (3:00 pm to 4:00 pm), which corresponds to 22:00 to 23:00 in my local time zone (GMT+2).

What can be stated with confidence from this analysis is that, among the posts in this sample (all of which received at least one comment), ask posts received more comments on average than show posts, and ask posts created between 15:00 and 16:00 EST (3:00 pm to 4:00 pm) received the most comments on average.