# Hacker News Comment Submission Analysis

## Introduction

This project collects data from a popular technology site, [Hacker News](https://news.ycombinator.com/), and analyzes the popularity of its user submissions. Hacker News, which functions similarly to Reddit, is a social news site geared toward technology, computer science, and entrepreneurship audiences. Users regularly submit news and discussions, and top posts can attract hundreds or thousands of visitors. Two kinds of posts are particularly popular: those whose titles begin with `Ask HN`, which ask the Hacker News community a specific question, and those whose titles begin with `Show HN`, which show the community a project, product, or interesting topic.

A sample Hacker News posts data set can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts) via Kaggle. This data will be collected, cleaned, and analyzed to determine the following:
* Do `Ask HN` or `Show HN` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

Note: The outputs below reflect the full data set of almost 300,000 rows, which includes submissions that did not receive any comments.

In [4]:
# Import dependencies
from csv import reader
import datetime as dt

In [5]:
# Open and read the Hacker News file; the with block closes it afterwards.
with open("Resources/hacker_news.csv", encoding='utf-8') as open_file:
    hn = list(reader(open_file))

# Separate column names from data rows.
hn_head = hn[0]
print(hn_head)

print("\n")

# Display the first five rows of data
hn = hn[1:]
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


## Data Analysis

In [6]:
# Create three empty lists to separate different posts.
ask_posts = []
show_posts = []
other_posts = []

# Loop through each title post to separate them accordingly.
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
# Count the total number of posts per post type.
print("There are {:,} 'ask hn' posts.".format(len(ask_posts)))
print("There are {:,} 'show hn' posts.".format(len(show_posts)))
print("There are {:,} other posts.".format(len(other_posts)))


There are 9,139 'ask hn' posts.
There are 10,158 'show hn' posts.
There are 273,822 other posts.


In [7]:
total_ask_comments = 0

# Iterate over the ask posts and add up their comments to get the sum.
for post in ask_posts:
    num_ask_comments = int(post[4])
    total_ask_comments = total_ask_comments + num_ask_comments
    
# Compute the average number of comments on ask posts.
avg_ask_comments = total_ask_comments / len(ask_posts)
print("There are {:,} comments in 'ask hn' posts.".format(total_ask_comments))
print("There is an average of {:,} comments in 'ask hn' posts.".format(avg_ask_comments))

There are 94,986 comments in 'ask hn' posts.
There is an average of 10.393478498741656 comments in 'ask hn' posts.


In [8]:
total_show_comments = 0

# Iterate over the show posts and add up their comments to get the sum.
for post in show_posts:
    num_show_comments = int(post[4])
    total_show_comments = total_show_comments + num_show_comments
    
# Compute the average number of comments on show posts.
avg_show_comments = total_show_comments / len(show_posts)
print("There are {:,} comments in 'show hn' posts.".format(total_show_comments))
print("There is an average of {:,} comments in 'show hn' posts.".format(avg_show_comments))

There are 49,633 comments in 'show hn' posts.
There is an average of 4.886099625910612 comments in 'show hn' posts.


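Based on this data set, `ask hn` posts receive more comments than `show hn` posts, both in total (94,986 vs 49,633) and on average (approximately 10.39 vs 4.89 comments per post), even though there are fewer `ask hn` submissions (9,139 vs 10,158).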
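The totals and averages for the two post types are computed with the same logic, so it could be factored into one helper. The following is a minimal sketch: `average_comments`, its `comments_index` parameter, and the sample rows are hypothetical names and data introduced here for illustration, not part of the original notebook.

```python
def average_comments(posts, comments_index=4):
    """Return (total, average) comments for a list of post rows."""
    # Column 4 holds 'num_comments' in this data set's header.
    total = sum(int(post[comments_index]) for post in posts)
    # Guard against an empty list to avoid a ZeroDivisionError.
    average = total / len(posts) if posts else 0.0
    return total, average


# Hypothetical sample rows in the same column order as the data set.
sample_posts = [
    ['1', 'Ask HN: Example question?', '', '5', '7', 'user_a', '9/26/2016 2:53'],
    ['2', 'Ask HN: Another question?', '', '2', '3', 'user_b', '9/26/2016 1:17'],
]

total, average = average_comments(sample_posts)
print("Total: {:,}, average: {:.2f}".format(total, average))
```

Calling the same helper with `ask_posts` and then `show_posts` would keep the two computations consistent.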

Since ask posts receive more comments than show posts, the remaining analysis will focus on the `ask hn` posts. To determine whether ask posts created at a certain time are more likely to attract comments, the following steps are needed:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments that ask posts receive, by hour created.
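The two steps above can be sketched compactly with `collections.defaultdict`, which removes the need to check whether an hour key already exists. This is an illustrative alternative under stated assumptions, not the notebook's own method: `average_comments_by_hour` is a hypothetical name, and the sample rows are illustrative `[created_at, num_comments]` pairs.

```python
import datetime as dt
from collections import defaultdict


def average_comments_by_hour(rows, date_format="%m/%d/%Y %H:%M"):
    """Map each hour ('00' to '23') to the average comments per post."""
    counts = defaultdict(int)    # Step 1a: number of posts per hour.
    comments = defaultdict(int)  # Step 1b: number of comments per hour.
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    # Step 2: average comments per post for each hour.
    return {hour: comments[hour] / counts[hour] for hour in counts}


# Illustrative [created_at, num_comments] rows.
sample_rows = [
    ['9/26/2016 2:53', 7],
    ['9/26/2016 2:10', 3],
    ['9/25/2016 22:57', 0],
]

averages = average_comments_by_hour(sample_rows)
print(averages)
```

Returning a dictionary keeps the hour as the key, so the hours can later be ranked with `sorted(averages.items(), key=lambda item: item[1], reverse=True)` without swapping columns.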


In [9]:
# Create an empty list to store the time of posts and number of post comments.
result_list = []

# Iterate over ask posts.
for post in ask_posts:
    post_time = post[6]
    num_comments = int(post[4])
    result_list.append([post_time, num_comments])

# Preview the list of lists.
result_list[:5]

[['9/26/2016 2:53', 7],
 ['9/26/2016 1:17', 3],
 ['9/25/2016 22:57', 0],
 ['9/25/2016 22:48', 3],
 ['9/25/2016 21:50', 2]]

In [10]:
counts_by_hour = {}
comments_by_hour = {}

# Iterate over the list of lists created in the previous cell.
# Extract the hour from the date.
# The loop variable is named row so it does not shadow the built-in list.
for row in result_list:
    post_time = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    post_hour = post_time.strftime("%H")

    # Build frequency tables of post counts and comment totals keyed by hour.
    if post_hour not in counts_by_hour:
        counts_by_hour[post_hour] = 1
        comments_by_hour[post_hour] = row[1]
    else:
        counts_by_hour[post_hour] += 1
        comments_by_hour[post_hour] += row[1]


In [11]:
# Calculate average comments per post during each hour of the day.
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
    
avg_by_hour

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]

In [12]:
# Recreate the hour average list by swapping the hour and average elements.
swap_avg_by_hour = []
for hour in avg_by_hour:
    swap_avg_by_hour.append([hour[1],hour[0]])
    
swap_avg_by_hour

[[11.137546468401487, '02'],
 [7.407801418439717, '01'],
 [8.804177545691905, '22'],
 [8.687258687258687, '21'],
 [7.163043478260869, '19'],
 [9.449744463373083, '17'],
 [28.676470588235293, '15'],
 [9.692007797270955, '14'],
 [16.31756756756757, '13'],
 [8.96474358974359, '11'],
 [10.684397163120567, '10'],
 [6.653153153153153, '09'],
 [7.013274336283186, '07'],
 [7.948339483394834, '03'],
 [6.696793002915452, '23'],
 [8.749019607843136, '20'],
 [7.713298791018998, '16'],
 [9.190661478599221, '08'],
 [7.5647840531561465, '00'],
 [7.94299674267101, '18'],
 [12.380116959064328, '12'],
 [9.7119341563786, '04'],
 [6.782051282051282, '06'],
 [8.794258373205741, '05']]

In [13]:
# Sort the list in descending order for better readability.
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Print formatted results.
print("Top 5 Hours for Ask posts Comments")
print("==================================")
for average, hour in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post".format(dt.datetime.strptime(hour, "%H").strftime("%H:%M"),average))

            
    

Top 5 Hours for Ask posts Comments
==================================
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


## Conclusion

The purpose of this project was to use the [Hacker News](https://news.ycombinator.com/) [data set](https://www.kaggle.com/hacker-news/hacker-news-posts) to determine the popularity of posts, measured by the user comments they attract, at certain times of the day. The `ask hn` and `show hn` posts, two commonly used post types, were compared and analyzed.

The findings show that, although `ask hn` posts were slightly less numerous than `show hn` posts, they received more comments in total and more comments per post on average. To maximize the expected number of comments on an `ask hn` post, it is recommended to create it at around 15:00 (3:00 pm EST), the hour in which ask posts received an average of 28.68 comments.
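Readers in other time zones need to convert the recommended hour before using it. The sketch below assumes, as stated above, that the timestamps are in EST, and models EST as a fixed UTC-5 offset that ignores daylight saving time. `est_hour_to_utc` is a hypothetical helper introduced for illustration.

```python
import datetime as dt

# Assumption: EST is treated as a fixed UTC-5 offset (no daylight saving time).
EST = dt.timezone(dt.timedelta(hours=-5), name="EST")


def est_hour_to_utc(hour):
    """Convert an hour of the day in EST to the matching time in UTC."""
    # The date is arbitrary; only the hour matters for the conversion.
    local = dt.datetime(2016, 9, 26, hour, 0, tzinfo=EST)
    return local.astimezone(dt.timezone.utc).strftime("%H:%M")


utc_time = est_hour_to_utc(15)
print("15:00 EST is {} UTC".format(utc_time))
```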

### Analysis Next Steps:
* Determine if show or ask posts receive more points on average.
* Determine if posts created at a certain time are more likely to receive more points.
* Compare results to the average number of comments and points other posts receive.
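As a starting point for these next steps, the same title-based classification can be paired with any numeric column. The sketch below is illustrative only: `average_by_type` and `NUM_POINTS` are hypothetical names, the sample rows are made up, and column index 3 is taken from the `num_points` position in the header printed earlier.

```python
def average_by_type(rows, column_index):
    """Average a numeric column separately for ask, show, and other posts."""
    totals = {'ask': 0, 'show': 0, 'other': 0}
    counts = {'ask': 0, 'show': 0, 'other': 0}
    for row in rows:
        title = row[1].lower()
        if title.startswith('ask hn'):
            kind = 'ask'
        elif title.startswith('show hn'):
            kind = 'show'
        else:
            kind = 'other'
        totals[kind] += int(row[column_index])
        counts[kind] += 1
    # Skip post types with no rows to avoid dividing by zero.
    return {kind: totals[kind] / counts[kind] for kind in totals if counts[kind]}


NUM_POINTS = 3  # Index of 'num_points' in the header row.

# Hypothetical sample rows in the same column order as the data set.
sample = [
    ['1', 'Ask HN: Example question?', '', '4', '7', 'user_a', '9/26/2016 2:53'],
    ['2', 'Ask HN: Another question?', '', '2', '3', 'user_b', '9/26/2016 1:17'],
    ['3', 'Show HN: Example project', 'http://example.com', '10', '1', 'user_c', '9/25/2016 22:57'],
    ['4', 'algorithmic music', 'http://example.com', '1', '0', 'user_d', '9/26/2016 3:16'],
]

avg_points = average_by_type(sample, NUM_POINTS)
print(avg_points)
```

Passing index 4 instead of `NUM_POINTS` would give the average comments per post type, which makes the comparison with other posts straightforward.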