## Exploring Hacker News Posts ##

The data set I'm using contains approximately 20,000 rows of Hacker News posts.
I'm specifically interested in posts whose titles begin with 'Ask HN' or 'Show HN'. The purpose of this project is to determine which type of post receives more comments on average, and which posting hours receive the most comments.

- Ask HN: posts that ask the Hacker News community a specific question
- Show HN: posts that show the Hacker News community a project, a product, or something generally interesting

In [41]:
from csv import reader

# open and read the file
opened_file = open('hacker_news.csv')

read_file = reader(opened_file)
hn = list(read_file)
opened_file.close()

#create a header variable
#remove the header from the hn list
header = hn[0]
hn = hn[1:]

print(hn[0:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
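
The loading step can also be written with a `with` block, which closes the file even if reading fails partway through. This is a minimal sketch, not the code used in this notebook: the `load_rows` helper is hypothetical, the header names in the sample are assumed rather than taken from the real file, and an in-memory sample stands in for `hacker_news.csv` so the snippet runs on its own.

```python
import io
from csv import reader


def load_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]


# Hypothetical in-memory sample standing in for hacker_news.csv.
# The header names are an assumption; the data row matches the first row printed above.
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

# With the real file this would read: with open('hacker_news.csv') as f:
with sample as f:
    header, rows = load_rows(f)

print(header)
print(rows[0])
```

Separating the header inside one helper keeps the slicing in a single place, so later cells only ever see data rows.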


### Create lists for each post type
I want to separate the posts by type and count how many there are of each. I am specifically looking for 'ask hn' and 'show hn' posts; titles are lowercased before matching so the check is case-insensitive.
These lists will be used in the next step to calculate the average number of comments for each post type.

In [42]:
ask_posts = []
show_posts = []
other_posts = []

#loop through hn and sort each post into the ask, show, or other list
#then report how many posts of each type were found
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print('Number of Ask HN posts:', len(ask_posts))
print('Number of Show HN posts:', len(show_posts))
print('Number of Other posts:', len(other_posts))

Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of Other posts: 17194
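
The matching rule used in the loop above can be isolated in a small function, which makes it easy to check on a few titles before running it over the whole data set. A minimal sketch: the `post_type` helper and the sample titles are hypothetical and only illustrate the prefix check.

```python
def post_type(title):
    """Classify a title as 'ask', 'show', or 'other' (case-insensitive prefix match)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical titles, including one that mentions "ask HN" but does not start with it
titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
    'Why I ask HN for advice',
]

counts = {}
for title in titles:
    kind = post_type(title)
    counts[kind] = counts.get(kind, 0) + 1

print(counts)
```

Because the check uses `startswith`, a title that only contains the phrase somewhere in the middle is counted as an other post.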


### Calculate the average number of comments 
The next two cells loop through the ask_posts and show_posts lists to find the total number of comments for each type of post. I then divide each total by the number of posts of that type to get the average number of comments per post.

In [43]:
total_ask_comments = 0

#loop through ask_posts to determine the total number of comments
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = round(total_ask_comments/len(ask_posts), 2)

print("The average number of comments on an Ask post is:", avg_ask_comments)

The average number of comments on an Ask post is: 14.04


In [44]:
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = round(total_show_comments/len(show_posts), 2)
print('The average number of comments on a Show post is:', avg_show_comments)


The average number of comments on a Show post is: 10.32
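
Since the two cells above repeat the same calculation, it could be factored into one function. A minimal sketch, assuming the comment count sits in column 4 as it does in this data set; the `average_comments` helper and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, rounded to 2 decimals."""
    if not posts:
        # Avoid dividing by zero when a post type has no rows
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return round(total / len(posts), 2)


# Hypothetical rows; only the comment column (index 4) matters here
sample_posts = [
    ['1', 'Ask HN: first', '', '3', '6'],
    ['2', 'Ask HN: second', '', '5', '10'],
    ['3', 'Ask HN: third', '', '1', '2'],
]

print(average_comments(sample_posts))
```

With the notebook's lists this would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.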


So far I've been able to determine that 'ask hn' posts receive more comments on average than 'show hn' posts (14.04 versus 10.32). I will now focus on 'ask hn' posts only and determine the hours at which a post is likely to receive the most comments.

In [45]:
result_list = []
#extract the timestamp from each post in the ask_posts list
#take the number of comments for each post
#append each [created_at, num_comments] pair to the result_list
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    post_list = [created_at, num_comments]
    result_list.append(post_list)
#print an example of how the result_list is structured
print(result_list[0])

['8/16/2016 9:55', 6]


### Extracting the hour the post was created
I created two dictionaries:
- counts by hour (the number of posts created in each hour)
- comments by hour (the total number of comments on those posts)

I loop through the result_list created in the cell above. For each entry I parse the timestamp into a datetime object and pull out the hour the post was created. Each hour becomes a key in both dictionaries, which accumulate the post count and the comment total for that hour.

In [46]:
import datetime as dt

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    created_at = row[0]
    #parse the timestamp string into a datetime object
    created_dt = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    hour = created_dt.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

print(counts_by_hour)
print(comments_by_hour)

{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
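
The same tally can be written without the `if`/`else` branch by using `collections.defaultdict`, which starts every missing key at zero. A minimal sketch of that alternative: the first sample entry is the one printed from result_list earlier, and the other two are hypothetical.

```python
import datetime as dt
from collections import defaultdict

# First entry comes from result_list; the other two are hypothetical
sample = [
    ['8/16/2016 9:55', 6],
    ['8/16/2016 9:05', 4],
    ['11/22/2015 13:43', 29],
]

counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    counts_by_hour[hour] += 1               # one more post in this hour
    comments_by_hour[hour] += num_comments  # add this post's comments

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```

Both versions produce the same dictionaries; the `defaultdict` form only removes the need to treat the first post of each hour as a special case.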


### Store the average number of comments by hour
For each hour, I divide the total number of comments in the comments_by_hour dictionary by the number of posts in the counts_by_hour dictionary. The hour is the key shared by both dictionaries, so the result is the average number of comments per post for that hour.

In [47]:
avg_by_hour_list = []
for hour in comments_by_hour:
    num_of_comments = comments_by_hour[hour]
    num_of_posts = counts_by_hour[hour]
    avg_by_hour = num_of_comments/num_of_posts
    comment_list = [hour, avg_by_hour]
    avg_by_hour_list.append(comment_list)

print(avg_by_hour_list)

[[9, 5.5777777777777775], [13, 14.741176470588234], [10, 13.440677966101696], [14, 13.233644859813085], [16, 16.796296296296298], [23, 7.985294117647059], [12, 9.41095890410959], [17, 11.46], [15, 38.5948275862069], [21, 16.009174311926607], [20, 21.525], [2, 23.810344827586206], [18, 13.20183486238532], [3, 7.796296296296297], [5, 10.08695652173913], [19, 10.8], [1, 11.383333333333333], [22, 6.746478873239437], [8, 10.25], [4, 7.170212765957447], [0, 8.127272727272727], [6, 9.022727272727273], [7, 7.852941176470588], [11, 11.051724137931034]]
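
As a worked check of the arithmetic, the averages for three of the hours can be recomputed directly from the two dictionaries printed earlier. A minimal sketch using a dictionary comprehension; only the three-hour subset is new here, the counts and totals are copied from the outputs above.

```python
# Subset of the counts and comment totals printed earlier
counts_by_hour = {15: 116, 2: 58, 20: 80}
comments_by_hour = {15: 4477, 2: 1381, 20: 1722}

# Average comments per post = total comments / number of posts, per hour
avg_by_hour = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in comments_by_hour
}

for hour, avg in avg_by_hour.items():
    print(hour, round(avg, 2))
```

For example, hour 15 has 4477 comments across 116 posts, which gives roughly 38.59 comments per post and matches the value in the list above.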


### Swap the avg_by_hour_list values for sorting
To find the hours with the highest averages, I swap the values in each [hour, average] pair so that the average comes first. sorted() compares lists element by element, so the swapped pairs are ordered by average, and reverse=True puts the highest averages at the front.

In [48]:
swap_avg_by_hour = []

for row in avg_by_hour_list:
    avg = row[1]
    hour = row[0]
    new_list = [avg, hour]
    swap_avg_by_hour.append(new_list)
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")

Top 5 Hours for Ask Posts Comments
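
The swap is one way to sort by average; another is to leave the pairs as [hour, average] and pass a `key` function to `sorted()`. A minimal sketch of that alternative, using four of the pairs from avg_by_hour_list; the `sorted_by_avg` name is hypothetical.

```python
# Four [hour, average] pairs taken from avg_by_hour_list
avg_by_hour_list = [
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [2, 23.810344827586206],
    [20, 21.525],
]

# Sort on the second element (the average), highest first, without swapping
sorted_by_avg = sorted(avg_by_hour_list, key=lambda pair: pair[1], reverse=True)

print([hour for hour, avg in sorted_by_avg])
```

The trade-off is that later code would read the hour from index 0 and the average from index 1, the opposite of the swapped list.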


### Find the top five hours to post
I take the top five entries from the sorted list and print each one as a formatted string. Based on this data set, a user who wants the most comments on an 'Ask HN' post has the best chance when posting during these hours.

In [59]:
top_five_hours_list = sorted_swap[0:5]

for x in top_five_hours_list:
    hour = dt.time(x[1])
    hour_str = hour.strftime("%H:%M")
    avg_comments = x[0]
    template = "{0}: {avg_comments:.2f} average comments per post.".format(hour_str, avg_comments = avg_comments)
    print(template)

15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.
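
The formatting step can also be written with an f-string, which keeps the hour and the average next to the text they fill in. A minimal sketch, assuming the same [average, hour] layout as the sorted list; the two pairs are copied from the results above and the `lines` list is hypothetical.

```python
import datetime as dt

# [average, hour] pairs copied from the sorted results
top_hours = [[38.5948275862069, 15], [23.810344827586206, 2]]

lines = []
for avg_comments, hour in top_hours:
    # dt.time(hour) builds a time at that hour with zero minutes
    hour_str = dt.time(hour).strftime('%H:%M')
    lines.append(f"{hour_str}: {avg_comments:.2f} average comments per post.")

print('\n'.join(lines))
```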
