<h1>Guided Project: Exploring Hacker News Posts</h1>

This is a guided project from Dataquest. We will use a sample of a dataset from <a href='https://www.kaggle.com/hacker-news/hacker-news-posts'>Kaggle</a> that contains information on posts submitted to the <a href='https://news.ycombinator.com/'>Hacker News site</a>.

In [6]:
from csv import reader

# Read the CSV into a list of rows; the context manager closes the file.
with open('hacker_news.csv') as open_file:
    hn = list(reader(open_file))

hn_header = hn[0]  # column names
hn = hn[1:]        # data rows only
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

First, let's see which type of post receives more comments on average: ask posts or show posts.

In [15]:
total_ask_comments = []
for post in ask_posts:
    value = int(post[4])
    total_ask_comments.append(value)
avg_ask_comments = sum(total_ask_comments) / len(total_ask_comments)
print("Average number of ask comments: ", avg_ask_comments)

total_show_comments = []
for post in show_posts:
    value = int(post[4])
    total_show_comments.append(value)
avg_show_comments = sum(total_show_comments) / len(total_show_comments)
print("Average number of show comments: ", avg_show_comments)

Average number of ask comments:  14.038417431192661
Average number of show comments:  10.31669535283993


Ask posts receive more comments on average (14.04) than show posts (10.32). Because ask posts receive more comments, we will focus the rest of the analysis on them.
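
The two averages above are computed with the same loop written twice. As a minimal sketch, that logic could be wrapped in a small helper. The function name `average_comments`, its `comments_index` parameter, and the sample rows are hypothetical and not part of the original notebook; only the positions of the title (1), comment count (4), and creation date (6) follow the code above.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the integer comment counts in `posts`.

    `comments_index` is the column holding the comment count (4 in the code above).
    Returns 0.0 for an empty list to avoid dividing by zero.
    """
    if not posts:
        return 0.0
    return sum(int(post[comments_index]) for post in posts) / len(posts)


# Hypothetical rows; values other than the comment counts are placeholders.
sample_posts = [
    ["1", "Ask HN: Example A", "", "5", "6", "user_a", "8/16/2016 9:55"],
    ["2", "Ask HN: Example B", "", "2", "10", "user_b", "11/22/2015 13:43"],
]
print(average_comments(sample_posts))  # 8.0
```

With such a helper, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two loops.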

Next, are ask posts created at a certain time more likely to attract comments? First we will calculate the number of ask posts created during each hour of the day as well as the number of comments received. Then, we will calculate the average number of comments that ask posts receive by hour created.

In [86]:
import datetime as dt

result_list = []
for post in ask_posts:
    date_created = post[6] # pull date post was created
    n_comments = int(post[4])
    result_list.append([date_created, n_comments])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

In [88]:
avg_by_hour = []

for hour, count in counts_by_hour.items():
    # Every hour in counts_by_hour has at least one post, so count is never 0.
    average = comments_by_hour[hour] / count
    avg_by_hour.append([hour, average])
        
avg_by_hour[0:9]

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069]]
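
To make the hourly tally easier to follow, here is a small self-contained sketch of the same parse, count, and average steps on three made-up rows. The `sample_*` names and the timestamps are hypothetical; the date format string is the one used above, and `dict.get` is used here only as a compact alternative to the `if`/`else` branch.

```python
import datetime as dt

# Hypothetical [created_at, n_comments] pairs in the same shape as result_list.
sample_results = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 9:14", 1],
]

sample_counts = {}
sample_comments = {}
for created_at, n_comments in sample_results:
    # Parse the timestamp, then keep only the zero-padded hour, e.g. "09".
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    # dict.get supplies 0 for an hour that has not been seen yet.
    sample_counts[hour] = sample_counts.get(hour, 0) + 1
    sample_comments[hour] = sample_comments.get(hour, 0) + n_comments

# Average comments per post for each hour.
sample_avg = {h: sample_comments[h] / sample_counts[h] for h in sample_counts}
print(sample_avg)  # {'09': 3.5, '13': 29.0}
```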

In [103]:
swap_avg_by_hour = []
for row in avg_by_hour:
    row0 = row[1]
    row1 = row[0]
    swap_avg_by_hour.append([row0, row1])
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Post Comments:")

for row in sorted_swap[:5]:  # keep only the five highest averages
    average = row[0]
    hour = dt.datetime.strptime(row[1], "%H")
    hour = dt.datetime.strftime(hour, "%H:%M")
    template = "{hour_window}: {avg_comments:.2f} average comments per post"
    output = template.format(hour_window=hour, avg_comments=average)
    print(output)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
Top 5 Hours for Ask Post Comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post

According to the results above, ask posts created during the hour beginning at 3 p.m. receive the most comments on average (38.59 per post), followed by 2 a.m. (23.81) and 8 p.m. (21.52). The top hours are scattered across the day, so no single period stands out, but these hours are candidates for when posting may attract the most activity. Further statistical testing, sampling, and analysis are needed before drawing firmer conclusions.
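
One possible first step in that further analysis is to look at the post count and the median alongside the mean for each hour, since a single very active thread can inflate an hourly average. The sketch below uses made-up comment counts for one hour and is not drawn from the dataset; it only illustrates how the mean and the median can diverge.

```python
from statistics import mean, median

# Hypothetical comment counts for posts created in a single hour.
# One very popular thread dominates the mean but barely moves the median.
comments_in_hour = [2, 3, 1, 4, 0, 250]

print(len(comments_in_hour))             # 6
print(round(mean(comments_in_hour), 2))  # 43.33
print(median(comments_in_hour))          # 2.5
```

If an hour in the real data shows a similarly large gap between its mean and median, its high average may reflect a few outlier posts rather than a consistently good posting time.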