# Exploring Hacker News Posts

In this project, we will explore news posts on the website _Hacker News_. 

_Hacker News_ is a site similar to Reddit where user-submitted stories are voted and commented on. It is extremely popular in technology and startup circles, attracting hundreds of thousands of visitors.

In our [data set](https://www.kaggle.com/hacker-news/hacker-news-posts), the number of rows has been reduced by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. A fuller description of the data set can be found at the link above.

We are specifically interested in posts with titles that begin with `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the community a specific question, and `Show HN` posts to show the community a project, a product or something generally interesting.

The aim of our project is to determine the following:
- Does `Ask HN` or `Show HN` receive more comments on average?
- Do posts created at a certain time receive more comments on average?

## Introduction

First, we will read in the data set needed for this project.

In [1]:
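# Import reader from the csv module to parse the file
from csv import reader

# Open the file in a with block so that it is closed after reading
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    # Convert file into a list of lists
    hn = list(read_file)

# Display the first five rows of hn (the header row plus four posts)
print(hn[:5])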

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


### Remove headers from data set

In [2]:
# Extract data to variable headers
headers = hn[0]
hn = hn[1:]

# Display Headers
print(headers)
print('\n')

# Display the first five rows of hn
print(hn[:5])


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
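As an aside, the header row can also be used to access columns by name rather than by position. The following is a minimal sketch, not the approach used in this project: it assumes the standard library's `csv.DictReader` and reads two lines copied from the output above out of an in-memory string (`sample_csv` is a hypothetical name) instead of the real file.

```python
import csv
import io

# Header and first data row copied from the output above,
# held in memory so the example does not need hacker_news.csv
sample_csv = (
    'id,title,url,num_points,num_comments,author,created_at\n'
    '12224879,Interactive Dynamic Video,'
    'http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n'
)

# DictReader uses the first line as keys, so no manual header removal is needed
rows = list(csv.DictReader(io.StringIO(sample_csv)))
first = rows[0]

print(first['title'])              # Interactive Dynamic Video
print(int(first['num_comments']))  # 52
```

The rest of this project keeps the list-of-lists format, where `row[4]` holds `num_comments` and `row[6]` holds `created_at`.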


### Extracting Ask HN and Show HN Posts

We are only interested in `Ask HN` and `Show HN` posts, so we will extract these posts from the data set. We will use the string method `startswith`, which returns a Boolean indicating whether the string starts with the given prefix. Because `startswith` is case-sensitive, we convert each title to lowercase before calling it.
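To see why capitalization matters, here is a small sketch. The titles in `sample_titles` are hypothetical and are not taken from the data set.

```python
# Hypothetical titles, used only to illustrate the matching behaviour
sample_titles = [
    'Ask HN: How do you learn?',
    'ask hn: any tips?',
    'Show HN: My project',
    'Asking for a friend',
]

# For each title, record (case-sensitive match, match after lowercasing)
matches = [
    (title.startswith('Ask HN'), title.lower().startswith('ask hn'))
    for title in sample_titles
]

for title, (exact, lowered) in zip(sample_titles, matches):
    print(title, exact, lowered)
```

Only the second title differs between the two checks: it is missed by the case-sensitive match and caught after lowercasing.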

In [3]:
# Initialize three empty lists for Ask HN, Show HN and other posts
ask_posts = []
show_posts = []
other_posts = []

# Loop through the data set to get the title of each post
for row in hn:
    # Lowercase the title so that matching ignores capitalization
    title = row[1].lower()

    # Sort each post into one of the three lists
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Check Number of posts in three lists

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
     

1744
1162
17194


### Check the average number of comments for Ask HN and Show HN posts

To answer our first question (does `Ask HN` or `Show HN` receive more comments on average?), we must compare the average number of comments for each type of post.

Average Number of Comments = Total Number of Comments/Number of posts

In [4]:
# Find the total number of comments for Ask HN posts
# Initialize the running total
total_ask_comments = 0

for row in ask_posts:
    comments = int(row[4])
    # Add the comments from each post to the running total
    total_ask_comments += comments

# Calculate the average number of comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of ask comments:", avg_ask_comments)

# Use Similar method to find Average Number of Comments for show posts
total_show_comments = 0

for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments

avg_show_comments = total_show_comments/len(show_posts)
print("Average number of show comments:",avg_show_comments)


Average number of ask comments: 14.038417431192661
Average number of show comments: 10.31669535283993


From these findings, we can observe that `Ask HN` posts receive more comments on average than `Show HN` posts. This could be because people are more likely to reply to a question than to comment on something others wish to show.
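The two loops above perform the same calculation, so it could also be wrapped in a helper function. This is a minimal sketch: `average_comments` and the rows in `sample_posts` are hypothetical and are not part of the original analysis.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows in the same column order as the data set
sample_posts = [
    ['1', 'Ask HN: example one', '', '5', '10', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: example two', '', '3', '4', 'user_b', '1/2/2016 15:30'],
]

print(average_comments(sample_posts))  # 7.0
```

Returning `0.0` for an empty list is a design choice made here to avoid a `ZeroDivisionError`. Raising an error would be an equally reasonable alternative.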

## Do posts created at a certain time receive more comments on average?

Having answered our first question, we now turn to the second.

Since `Ask HN` posts are more likely to receive comments, we will focus our remaining analysis on these posts. As we are working with time, we will make use of the `datetime` module as well as the data in the `created_at` column.
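As a quick illustration of how `datetime.strptime` handles the `created_at` format, the sketch below parses two values that appear in the rows printed earlier. The variable names are chosen for this example only.

```python
import datetime as dt

# created_at values taken from the sample rows shown earlier
created_at = '8/4/2016 11:52'
after_midnight = '6/17/2016 0:01'

# %m/%d/%Y %H:%M means month/day/four-digit year, then 24-hour clock hour:minute
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
parsed_midnight = dt.datetime.strptime(after_midnight, '%m/%d/%Y %H:%M')

print(parsed)                           # 2016-08-04 11:52:00
print(parsed.hour)                      # 11
print(parsed_midnight.strftime('%H'))   # 00
```

Note that `strftime('%H')` returns a zero-padded string such as `'00'`, while the `hour` attribute returns an integer.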

### Finding the Number of Ask Posts and Comments by Hour

We will build two frequency tables: one for the number of `Ask HN` posts created in each hour of the day, and one for the total number of comments those posts received.



In [5]:
# Import datetime module to assist us
import datetime as dt

# Filter the data set to only include what we require:
# the time each post was created and its number of comments
result_list = []

for row in ask_posts:
    time = row[6]
    comments = int(row[4])
    result_list.append([time, comments])

# Initialize two dictionaries to hold the frequency tables:
# total number of posts created in each hour
# total number of comments on posts created in that hour
counts_by_hour = {}
comments_by_hour = {}

# Build the frequency tables for posts and comments by hour
for row in result_list:
    time = row[0]
    comments = row[1]
    time_dt = dt.datetime.strptime(time, '%m/%d/%Y %H:%M')
    hour = time_dt.strftime('%H')

    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

print(counts_by_hour)
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
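The same frequency tables could also be built with `collections.defaultdict`, which removes the need to check whether a key already exists. This is a sketch of an alternative, not the method used above; the pairs in `sample_results` are hypothetical.

```python
from collections import defaultdict
import datetime as dt

# Hypothetical [created_at, num_comments] pairs in the same shape as result_list
sample_results = [
    ['8/4/2016 11:52', 6],
    ['1/26/2016 19:30', 10],
    ['6/23/2016 11:05', 2],
]

# Missing keys start at int(), which is 0, so += works on first use
sample_counts_by_hour = defaultdict(int)
sample_comments_by_hour = defaultdict(int)

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    sample_counts_by_hour[hour] += 1
    sample_comments_by_hour[hour] += num_comments

print(dict(sample_counts_by_hour))    # {'11': 2, '19': 1}
print(dict(sample_comments_by_hour))  # {'11': 8, '19': 10}
```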


### Calculating the Average Number of Comments for Ask HN Posts by hour

We have separated the number of comments and posts by hour. We will now calculate the average number of comments per post for each hour.

Average Number of Comments = Total Number of Comments/Number of Posts

In [6]:
# Calculate Average number of comments per post
# During each hour of the day

avg_by_hour = []

for key in counts_by_hour:
    avg = comments_by_hour[key]/counts_by_hour[key]
    avg_by_hour.append([key,avg])

In [7]:
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
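The averaging step can also be written as a dictionary comprehension. The sketch below is one possible alternative and uses only three of the hourly totals printed earlier; the `_subset` names are introduced for this example.

```python
# Three entries copied from counts_by_hour and comments_by_hour above
counts_subset = {'15': 116, '02': 58, '20': 80}
comments_subset = {'15': 4477, '02': 1381, '20': 1722}

# Divide total comments by number of posts for each hour
avg_subset = {
    hour: comments_subset[hour] / counts_subset[hour]
    for hour in counts_subset
}

for hour, avg in avg_subset.items():
    print(hour, round(avg, 3))
```

A dictionary keeps each hour attached to its average, whereas the list of lists used in this project is convenient for the sorting step that follows.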


### Sorting and Printing Values

We have our results, but the format makes it difficult to identify the hours with the highest values. We will sort the list of lists and print the five highest average comment counts for `Ask HN` posts in a format that is easier to read.

To sort by the average number of comments, we need to swap the order of the two values inside each list, because `sorted` compares lists by their first element.

In [8]:
# Initialize empty list
swap_avg_by_hour = []
# Swap the order of the values in each pair
for row in avg_by_hour:
    time = row[0]
    avg_num = row[1]
    swap_avg_by_hour.append([avg_num, time])

print(swap_avg_by_hour)
print('\n')
# Sort the new list in descending order of average comments
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Print the 5 hours with highest average number of comments
print('Top 5 Hours for Ask Posts Comments.')

for avg,hour in sorted_swap[:5]:
    hour = dt.datetime.strptime(hour, '%H')
    hour = hour.strftime('%H:%M')
    string = ('{time}: {num:.2f} average comments per post').format(time= hour, num= avg)
    print(string)
    

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


Top 5 Hours for Ask Posts Comments.
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
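Swapping the columns is one way to sort by the average. Another is to pass a `key` function to `sorted`, which leaves each `[hour, average]` pair in its original order. The sketch below shows this alternative on five of the pairs printed earlier; `avg_by_hour_subset` and `top_hours` are names introduced for this example.

```python
# Five [hour, average] pairs copied from avg_by_hour above
avg_by_hour_subset = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['16', 16.796296296296298],
    ['21', 16.009174311926607],
]

# Sort by the second element of each pair, highest average first
top_hours = sorted(avg_by_hour_subset, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
# 15:00: 38.59 average comments per post
# 02:00: 23.81 average comments per post
# 16:00: 16.80 average comments per post
```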


Based on our findings, the hour with the highest average number of comments is 15:00, with an average of 38.59 comments per post.

According to the data set [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts), the time zone is Eastern Time in the US. Hence, this translates to 3:00 p.m. Eastern Time.
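Readers in other time zones will need to convert this hour. Eastern Time is UTC-5 during standard time (EST) and UTC-4 during daylight saving time (EDT), so the result depends on the date. The sketch below uses fixed offsets from the standard library. The two dates are hypothetical examples, one in winter and one in summer.

```python
import datetime as dt

# Fixed offsets for the two halves of the year in US Eastern Time
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')
EDT = dt.timezone(dt.timedelta(hours=-4), 'EDT')

# Hypothetical post times at 15:00 Eastern, one per season
winter_post = dt.datetime(2016, 1, 26, 15, 0, tzinfo=EST)
summer_post = dt.datetime(2016, 8, 4, 15, 0, tzinfo=EDT)

winter_utc = winter_post.astimezone(dt.timezone.utc)
summer_utc = summer_post.astimezone(dt.timezone.utc)

print(winter_utc.strftime('%H:%M'))  # 20:00
print(summer_utc.strftime('%H:%M'))  # 19:00
```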

## Conclusion

In conclusion, we found that `Ask HN` posts created at 3:00 p.m. Eastern Time receive the most comments on average.

However, it is important to note that our analysis may be inaccurate to a certain extent. Posts without comments were excluded, and the remaining data was randomly sampled to reduce the size of the data set. Also, our hourly analysis is based on `Ask HN` posts only, so applying it to other posts assumes that they follow the same trend.