# Guided Project: Exploring Hacker News Posts

In this project, we explore posts from Hacker News, a site that is very popular in technical and startup circles. Users submit posts, which other users then vote and comment on, similar to Reddit. Posts that reach the top of the Hacker News listing can get hundreds of thousands of visitors as a result.


For this exercise, we analyze a reduced data set of approximately 20,000 rows. The data set is available at https://www.kaggle.com/hacker-news/hacker-news-posts. It consists of 7 columns, described below:

- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted
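
Because each row is a plain list, the code in this project refers to columns by position (for example, index 1 for title and index 4 for num_comments). As a minimal sketch, assuming the columns appear in the order listed above, a lookup from column name to position can be built from the header row. The name col is a hypothetical helper, not part of the project code.

```python
# Column names in the order listed above (this matches the CSV header row).
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its position in a row.
col = {name: index for index, name in enumerate(headers)}

# One sample row, shaped like the rows in the data set.
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/', '386', '52',
       'ne0phyte', '8/4/2016 11:52']

print(row[col['title']])              # Interactive Dynamic Video
print(int(row[col['num_comments']]))  # 52
```

Looking columns up by name is optional here; it mainly makes it easier to see which field an index such as 4 or 6 refers to.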

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

We'll compare these two types of posts to determine the following:
- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need and reading the data set into a list of lists.

## 1. Importing Data Set

In [6]:
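# Import the csv module and read the data set into a list of lists
import csv

# The with statement closes the file once it has been read
with open("hacker_news.csv") as opened_file:
    hn = list(csv.reader(opened_file))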

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


## 2. Removing Headers from a List of Lists

In [7]:
# Remove the headers.
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## 3. Extracting Ask HN and Show HN Posts

As we said in the introduction, we will only analyze posts starting with Ask HN and Show HN. To do that, we will create new lists of lists containing just the data for those titles.


To find the posts that begin with either Ask HN or Show HN, we'll use the string method startswith(). Since this method is case sensitive, we first call the lower() method on each title, which returns a lowercase copy of the string, and then compare the result against a lowercase prefix.
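
The effect of lowercasing before the prefix check can be seen on a few titles. This is a small sketch; the titles below are made up for illustration and are not taken from the data set.

```python
# Made-up titles for illustration only.
titles = [
    'Ask HN: How do you learn a new language?',
    'ask hn: any tips for a first job?',
    'Show HN: My weekend project',
    'An article about compilers',
]

# Case-sensitive check: misses the title written in lowercase.
strict = [t.startswith('Ask HN') for t in titles]

# Case-insensitive check: lowercase the title, then compare to a lowercase prefix.
is_ask = [t.lower().startswith('ask hn') for t in titles]

print(strict)  # [True, False, False, False]
print(is_ask)  # [True, True, False, False]
```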


Now, let's start extracting Ask HN and Show HN posts.

In [8]:
# Create three empty lists, one for each category of post
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


## 4. Calculating Average Number of Comments for Ask HN and Show HN Posts

Next, we determine whether ask posts or show posts receive more comments on average.

In [9]:
# Calculate the total number of comments on ask posts
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
# Calculate the average number of comments per ask post
avg_ask_comments = total_ask_comments / len(ask_posts)

print(avg_ask_comments)

14.038417431192661


In [10]:
# Calculate the total number of comments on show posts
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
# Calculate the average number of comments per show post
avg_show_comments = total_show_comments / len(show_posts)

print(avg_show_comments)

10.31669535283993


As the output above shows, ask posts receive about 14 comments on average, while show posts receive about 10. Since ask posts are more likely to receive comments, we will focus the remaining analysis on them.
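
The two cells above run the same loop on different lists, so the calculation could also be wrapped in a function. The following is one possible sketch, not the approach used in this project. The function name average_comments, its comments_index parameter, and the sample rows are all hypothetical, and returning 0.0 for an empty list is an assumption made to avoid dividing by zero.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for a list of rows.

    Assumes the comment count is stored as a string at comments_index.
    """
    if not posts:
        return 0.0  # assumption: an empty list has an average of 0
    total = 0
    for row in posts:
        total += int(row[comments_index])
    return total / len(posts)


# Made-up rows with the same seven columns as the data set.
sample_posts = [
    ['1', 'Ask HN: first question', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: second question', '', '3', '20', 'user_b', '1/2/2016 11:00'],
]

print(average_comments(sample_posts))  # 15.0
```

With the lists built earlier, the calls would be average_comments(ask_posts) and average_comments(show_posts).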

## 5. Calculating the Amount of Ask Posts and Comments per Hour

Next, we determine whether ask posts created at a certain time are more likely to attract comments. To do that, we will use the datetime module.
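
Before applying it to every row, the parsing step can be tried on a single value. This sketch uses one created_at string from the sample output shown earlier: strptime() turns the string into a datetime object, and strftime() formats that object back into a two-digit hour.

```python
import datetime as dt

# A created_at value in the same format as the data set.
created_at = '8/4/2016 11:52'

# Parse month/day/year and hour:minute into a datetime object.
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

# Format only the hour, zero-padded to two digits.
hour = parsed.strftime('%H')

print(parsed)  # 2016-08-04 11:52:00
print(hour)    # 11
```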

In [11]:
# Importing datetime module
import datetime as dt

result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])
    
counts_by_hour = {}
comments_by_hour = {}

date_format = "%m/%d/%Y %H:%M"

for each_row in result_list:
    date = each_row[0]
    comment = each_row[1]
    date = dt.datetime.strptime(date, date_format)
    time = dt.datetime.strftime(date, "%H")
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1
        
print(comments_by_hour)
print(counts_by_hour)

{'03': 421, '13': 1253, '00': 447, '19': 1188, '22': 479, '08': 492, '14': 1416, '04': 337, '01': 683, '23': 543, '10': 793, '05': 464, '17': 1146, '20': 1722, '16': 1814, '07': 267, '02': 1381, '09': 251, '12': 687, '21': 1745, '18': 1439, '06': 397, '15': 4477, '11': 641}
{'03': 54, '13': 85, '00': 55, '19': 110, '22': 71, '08': 48, '14': 107, '04': 47, '01': 60, '23': 68, '10': 59, '05': 46, '17': 100, '20': 80, '16': 108, '07': 34, '02': 58, '09': 45, '12': 73, '21': 109, '18': 109, '06': 44, '15': 116, '11': 58}


## 6. Calculating the Average Number of Comments for Ask HN Posts by Hour

In [12]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
avg_by_hour

[['03', 7.796296296296297],
 ['13', 14.741176470588234],
 ['00', 8.127272727272727],
 ['19', 10.8],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['14', 13.233644859813085],
 ['04', 7.170212765957447],
 ['01', 11.383333333333333],
 ['23', 7.985294117647059],
 ['10', 13.440677966101696],
 ['05', 10.08695652173913],
 ['17', 11.46],
 ['20', 21.525],
 ['16', 16.796296296296298],
 ['07', 7.852941176470588],
 ['02', 23.810344827586206],
 ['09', 5.5777777777777775],
 ['12', 9.41095890410959],
 ['21', 16.009174311926607],
 ['18', 13.20183486238532],
 ['06', 9.022727272727273],
 ['15', 38.5948275862069],
 ['11', 11.051724137931034]]

## 7. Sorting the List of Lists

In the previous step, we calculated the average number of comments for ask posts created during each hour of the day - stored in the list of lists named avg_by_hour.

Since the unsorted results are hard to read, we will now create a new list of lists with the two columns swapped and sort it by the average number of comments.
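
The reason for swapping the columns is that sorted() compares inner lists element by element, starting with the first. A small sketch with made-up values shows this, along with an alternative that is not used in this project: passing a key function so the list is sorted by the second element directly.

```python
# Made-up [hour, average] pairs for illustration.
sample = [['03', 7.8], ['15', 38.6], ['02', 23.8]]

# Without a key, the first element (the hour string) decides the order.
by_hour = sorted(sample)

# With a key, the average at index 1 decides the order; no swap is needed.
by_average = sorted(sample, key=lambda pair: pair[1], reverse=True)

print(by_hour)     # [['02', 23.8], ['03', 7.8], ['15', 38.6]]
print(by_average)  # [['15', 38.6], ['02', 23.8], ['03', 7.8]]
```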

In [13]:
# Creating a list that equals avg_by_hour with swapped columns.
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
swap_avg_by_hour

[[7.796296296296297, '03'],
 [14.741176470588234, '13'],
 [8.127272727272727, '00'],
 [10.8, '19'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [13.233644859813085, '14'],
 [7.170212765957447, '04'],
 [11.383333333333333, '01'],
 [7.985294117647059, '23'],
 [13.440677966101696, '10'],
 [10.08695652173913, '05'],
 [11.46, '17'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [7.852941176470588, '07'],
 [23.810344827586206, '02'],
 [5.5777777777777775, '09'],
 [9.41095890410959, '12'],
 [16.009174311926607, '21'],
 [13.20183486238532, '18'],
 [9.022727272727273, '06'],
 [38.5948275862069, '15'],
 [11.051724137931034, '11']]

In [14]:
# Sorting the swap_avg_by_hour in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")

for avg, hr in sorted_swap[:5]:
    string = "{}: {:.2f} average comments per post".format(dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg)
    print(string)

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The hour with the highest average number of comments is 15:00, at 38.59 comments per post. The second-best hour for posting is 02:00, at approximately 24 comments per post. The average at 15:00 is about 62% higher than the average at the second-highest hour.
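
The size of the gap between the two top hours can be checked directly from the averages computed earlier. The variable names below are chosen for this sketch only.

```python
# Averages copied from avg_by_hour.
top_avg = 38.5948275862069       # 15:00
second_avg = 23.810344827586206  # 02:00

# Relative increase of the top hour over the second-highest hour, in percent.
pct_increase = (top_avg - second_avg) / second_avg * 100

print(round(pct_increase))  # 62
```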


According to the data set documentation (https://www.kaggle.com/hacker-news/hacker-news-posts), the times are given in US Eastern Time.

## 8. Conclusion

In this guided project, we analyzed Ask HN and Show HN posts in a reduced Hacker News data set. Users submit Ask HN posts to ask the community a question, and Show HN posts to show the community a project, a product, or something generally interesting. We aimed to find which of these post types receives more comments. Since ask posts receive more comments on average than show posts, we continued the analysis with ask posts only. We found that ask posts created at 15:00 (US Eastern Time) receive the most comments on average.