## Exploring Hacker News Posts

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that reach the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

For the analysis, we are specifically interested in posts with titles that begin with either *'Ask HN'* or *'Show HN'*. Users submit *'Ask HN'* posts to ask the Hacker News community a specific question. Likewise, users submit *'Show HN'* posts to show the Hacker News community a project, product, or just something interesting.

In this project, we'll compare these two types of posts to determine the following:
- Do *Ask HN* or *Show HN* posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

---

In [1]:
from csv import reader

# read dataset
# the with-block closes the file once the rows have been read
with open('data/hacker_news.csv', encoding='utf8') as opened_file:
    list_file = list(reader(opened_file))
hn_headers = list_file[0]
hn = list_file[1:]

print('Column headers:', hn_headers)
print('Number of rows:', len(hn))

Column headers: ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
Number of rows: 20100


---

Next, we filter the data down to *'Ask HN'* and *'Show HN'* posts. Because the post type can be determined from whether the title starts with *'Ask HN'* or *'Show HN'* (ignoring case), we can build separate lists holding just the rows for each type.

---

In [2]:
# splitting rows based on whether they are 'ask' or 'show'
ask_posts = []
show_posts = []
other_posts = []
for i in hn:
    title = i[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(i)
    elif title.lower().startswith('show hn'):
        show_posts.append(i)
    else:
        other_posts.append(i)

# number of posts in each list
print('ask_posts:', len(ask_posts))
print('show_posts:', len(show_posts))
print('other:', len(other_posts))

ask_posts: 1744
show_posts: 1162
other: 17194


---

Now we want to calculate the average number of comments per post type. This will give us a general insight into which type of post gets more comments. 

---

In [3]:
# calculate average number of comments for 'ask hn' posts
total_ask_comments = 0
for i in ask_posts:
    num_comments = int(i[4])
    total_ask_comments = total_ask_comments + num_comments    
avg_ask_comments = total_ask_comments / len(ask_posts)

# calculate average number of comments for 'show hn' posts
total_show_comments = 0
for i in show_posts:
    num_comments = int(i[4])
    total_show_comments = total_show_comments + num_comments
avg_show_comments = total_show_comments / len(show_posts)

print('average number of \'ask hn\' comments:', avg_ask_comments)
print('average number of \'show hn\' comments:', avg_show_comments)

average number of 'ask hn' comments: 14.038417431192661
average number of 'show hn' comments: 10.31669535283993


---

On average, *'Ask HN'* posts receive about 3.7 more comments per post than *'Show HN'* posts (14.04 versus 10.32). Since *ask* posts are more likely to receive comments, we'll focus the remaining analysis on these posts.
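
The difference between the two averages can be checked directly. The sketch below is illustrative only: `average_comments` is a hypothetical helper that is not part of the notebook, and the sample rows are made up to mirror the dataset's column order.

```python
# Hypothetical helper (not in the original notebook): mean of the
# num_comments column (index 4) over a list of rows.
def average_comments(posts, comments_index=4):
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows in the dataset's column order, for illustration only.
sample_ask = [
    ['1', 'Ask HN: example A', '', '5', '10', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: example B', '', '2', '4', 'user_b', '8/16/2016 15:10'],
]
print(average_comments(sample_ask))  # 7.0

# Gap between the two averages reported above, rounded to two decimals.
gap = round(14.038417431192661 - 10.31669535283993, 2)
print(gap)  # 3.72
```

A helper like this would also replace the two near-identical loops used for the *ask* and *show* averages.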

Next, we'll determine if *ask* posts created at a certain time are more likely to attract comments.
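
Grouping by hour depends on turning each `created_at` string into a `datetime` object. As a minimal sketch, assuming timestamps laid out like the made-up value below, `strptime` parses the string and the `hour` attribute gives the hour of the day:

```python
import datetime as dt

# Made-up timestamp in the same layout as the created_at column.
sample_created_at = '8/16/2016 9:55'

# %m = month, %d = day, %Y = four-digit year, %H = 24-hour clock, %M = minute
parsed = dt.datetime.strptime(sample_created_at, '%m/%d/%Y %H:%M')

print(parsed.hour)               # 9
print(parsed.strftime('%H:00'))  # 09:00
```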

---

In [4]:
import datetime as dt

result_list = []
for i in ask_posts:
    created = dt.datetime.strptime(i[6], '%m/%d/%Y %H:%M')
    num_comments = int(i[4])
    result_list.append([created, num_comments])

# Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
counts_by_hour = {}
comments_by_hour = {}
for i in result_list:
    hour = i[0].hour
    num_comments = i[1]
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
        
# Calculate the average number of comments ask posts receive by hour created.
avg_by_hour = []
for i in comments_by_hour:
    avg = round(comments_by_hour[i] / counts_by_hour[i], 2)
    avg_by_hour.append([avg, i])

In [5]:
sorted_avg = sorted(avg_by_hour, reverse=True)

# print the five hours with the highest average number of comments
print('Top 5 Hours for Ask Post Comments')
for i in sorted_avg[:5]:
    print('{}:00 {:.2f} average comments per post'.format(i[1], i[0]))

Top 5 Hours for Ask Post Comments
15:00 38.59 average comments per post
2:00 23.81 average comments per post
20:00 21.52 average comments per post
16:00 16.80 average comments per post
21:00 16.01 average comments per post


---

As the output above shows, *ask* posts created during these five hours received the most comments on average: 15:00, 02:00, 20:00, 16:00, and 21:00 (in order from highest to lowest).
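
An hourly average does not show how many posts stand behind it, and an hour with very few posts can rank high because of a single busy thread. One possible way to check this, sketched here on made-up `(hour, num_comments)` pairs rather than the real dataset, is to report the post count next to each average. The `summary` structure is an assumption for illustration.

```python
from collections import defaultdict

# Made-up (hour, num_comments) pairs, for illustration only.
sample = [(15, 40), (15, 2), (2, 30), (20, 10), (20, 12), (20, 8)]

counts = defaultdict(int)  # posts per hour
totals = defaultdict(int)  # comments per hour
for hour, num_comments in sample:
    counts[hour] += 1
    totals[hour] += num_comments

# Hypothetical summary: (average, hour, number of posts), highest average first.
summary = sorted(
    ((round(totals[h] / counts[h], 2), h, counts[h]) for h in counts),
    reverse=True,
)
for avg, hour, n in summary:
    print('{:02d}:00 {:.2f} average comments over {} posts'.format(hour, avg, n))
```

In the sample, the hour with the highest average is backed by a single post, which is the kind of case the extra column makes visible.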

---