# Hacker News Post Analysis (Part of Certification Program)
* Users submit Ask HN posts to ask the Hacker News community a specific question.
* Users submit Show HN posts to show the Hacker News community a project, product, or just something interesting.

## Aim
* Do Ask HN or Show HN receive more comments on average?
* Do posts created at a certain time receive more comments on average?

## Dataset
* **id**: the unique identifier from Hacker News for the post
* **title**: the title of the post
* **url**: the URL that the post links to, if the post has a URL
* **num_points**: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* **num_comments**: the number of comments on the post
* **author**: the username of the person who submitted the post
* **created_at**: the date and time of the post's submission (the time zone is Eastern Time in the US)


In [31]:
# Open the dataset, build a list of lists, and display the header plus the first two rows.
from csv import reader

# The with-block closes the file once the rows have been read.
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))
print(hn[:3])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]
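The loading step can also be wrapped in a small function that returns the header and the data rows separately. The following is a minimal sketch, not the notebook's own code: `load_rows` is a hypothetical helper, the sample row is made-up illustrative data, and the `utf-8` encoding in the commented usage is an assumption about the file.

```python
import csv
import io

def load_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# Tiny in-memory sample standing in for hacker_news.csv (illustrative data only).
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,5,3,alice,8/4/2016 11:52\n"
)
header, rows = load_rows(sample)
print(header)
print(rows)

# With the real file (encoding is an assumption), the context manager
# closes the file automatically:
# with open("hacker_news.csv", newline="", encoding="utf-8") as f:
#     header, rows = load_rows(f)
```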


In [32]:
# Extract the header row and remove it from the main dataset
headers = hn[0]
hn = hn[1:]
print(hn[:2])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]


In [33]:
# Divide posts into ask, show, and other
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(f"ask_posts: {len(ask_posts)}, show_posts: {len(show_posts)}, other_posts: {len(other_posts)}")    

ask_posts: 1744, show_posts: 1162, other_posts: 17194


In [34]:
# Find the total number of comments in ask posts
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])

avg_ask_comments = total_ask_comments / len(ask_posts)

print(f"average ask comments: {avg_ask_comments}")
    
# Find the total number of comments in show posts
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments / len(show_posts)

print(f"Average show comments: {avg_show_comments}")


average ask comments: 14.038417431192661
Average show comments: 10.31669535283993
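The two loops above follow the same pattern, so the average can be computed by one reusable function. This is a sketch under stated assumptions: `average_comments` is a hypothetical helper, the sample rows are invented for illustration, and index 4 is the `num_comments` column as in the dataset description.

```python
def average_comments(posts, comments_index=4):
    """Mean of the integer comment counts at comments_index; 0.0 if posts is empty."""
    if not posts:
        return 0.0
    return sum(int(row[comments_index]) for row in posts) / len(posts)

# Illustrative rows only, laid out like the rows in hn.
sample_posts = [
    ["1", "Ask HN: A?", "", "5", "4", "alice", "8/4/2016 11:52"],
    ["2", "Ask HN: B?", "", "2", "10", "bob", "8/4/2016 12:10"],
]
print(average_comments(sample_posts))  # 7.0
```

With the notebook's lists, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.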


On average, ask posts receive more comments than show posts (about 14.04 versus 10.32 comments per post). Since ask posts attract more comments, we'll focus the remaining analysis on them.
Next, we'll determine whether ask posts created at a certain time of day are more likely to attract comments, using the following steps:

* Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
* Calculate the average number of comments ask posts receive by hour created.

In [52]:
from datetime import datetime as dt
import pytz

result_list = []

for row in ask_posts:
    creation_time = row[6]
    number_of_comments = int(row[4])
    result_list.append([creation_time,number_of_comments])

# Number of ask posts created during each hour of the day.
counts_by_hour = {}

# Total number of comments received by the ask posts created in each hour.
comments_by_hour = {}

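# Convert each timestamp to Istanbul time.
# Caveat: strptime() returns a naive datetime, and astimezone() treats a naive
# datetime as the machine's local time, so the source zone here is the system
# time zone rather than US Eastern.
my_local_tmz = pytz.timezone('Europe/Istanbul')

for row in result_list:
    d1 = dt.strptime(row[0],"%m/%d/%Y %H:%M")
    my_local_time = d1.astimezone(my_local_tmz)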
    
    hour = my_local_time.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = int(row[1])
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += int(row[1])
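The dataset description states that `created_at` is in US Eastern Time. One way to make that source zone explicit before converting is to attach it to the parsed datetime. The following is a minimal sketch, assuming Python 3.9+ with the standard-library `zoneinfo` module and time zone data available on the system; `eastern_to_istanbul` is a hypothetical helper and was not used to produce the results in this notebook.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9

EASTERN = ZoneInfo("America/New_York")   # source zone, per the dataset description
ISTANBUL = ZoneInfo("Europe/Istanbul")   # target zone used in this notebook

def eastern_to_istanbul(timestamp):
    """Parse an 'm/d/YYYY H:MM' string as Eastern Time and convert it to Istanbul time."""
    naive = datetime.strptime(timestamp, "%m/%d/%Y %H:%M")
    aware = naive.replace(tzinfo=EASTERN)  # mark the naive value as Eastern Time
    return aware.astimezone(ISTANBUL)

converted = eastern_to_istanbul("8/4/2016 11:52")
print(converted.strftime("%Y-%m-%d %H:%M"))  # 2016-08-04 18:52
```

Hours bucketed this way may differ from the ones reported here, because the source zone changes which hour each post falls into.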



Next, we calculate the average number of comments per post for posts created during each hour of the day.

In [53]:
avg_by_hour = []

for hour in counts_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour,avg])

print(avg_by_hour)

[['12', 8.970588235294118], ['15', 10.342105263157896], ['13', 12.403225806451612], ['17', 17.017241379310345], ['19', 12.112244897959183], ['02', 7.764705882352941], ['11', 9.885245901639344], ['20', 14.0], ['16', 13.247191011235955], ['18', 38.554621848739494], ['23', 26.71276595744681], ['00', 8.927083333333334], ['22', 8.121212121212121], ['05', 23.63157894736842], ['14', 11.095238095238095], ['04', 8.587301587301587], ['21', 13.10576923076923], ['06', 8.183673469387756], ['07', 7.148936170212766], ['03', 11.51063829787234], ['01', 7.5], ['08', 10.386363636363637], ['09', 8.85], ['10', 6.861111111111111]]


In [54]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

sorted_swap = sorted(swap_avg_by_hour, key=None, reverse=True)

#Top 5 Hours for Ask Posts Comments
print(sorted_swap[:5])

[[38.554621848739494, '18'], [26.71276595744681, '23'], [23.63157894736842, '05'], [17.017241379310345, '17'], [14.0, '20']]
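Swapping the columns is one way to sort by the average. An alternative is to pass a `key` function to `sorted`, which sorts on the second element directly and keeps the `[hour, average]` order. A short sketch, using a few values rounded from the output above as sample input:

```python
# A subset of avg_by_hour, with averages rounded from the notebook output.
avg_by_hour = [["12", 8.97], ["18", 38.55], ["23", 26.71], ["05", 23.63]]

# Sort by the average (second element), highest first.
top = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print(f"{hour}:00 : {avg:.2f} average comments per post")
```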


In [55]:
for row in sorted_swap[:5]:
    time = dt.strptime(row[1],"%H").strftime("%H:00")
    print(f"{time} : {row[0]:.2f} average comments per post")
    

18:00 : 38.55 average comments per post
23:00 : 26.71 average comments per post
05:00 : 23.63 average comments per post
17:00 : 17.02 average comments per post
20:00 : 14.00 average comments per post


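In this dataset, ask posts created during the 18:00 hour (in the converted time zone used above) received the most comments on average, about 38.55 per post. Posting an Ask HN question around 18:00 therefore gives the best chance of attracting comments.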