# Analyzing Submissions on Hacker News
Hacker News is a site started by the startup incubator Y Combinator, where user submitted stores receive votes and comments. I would like to analyze posts with titles that begin with **Ask HN** or **Show HN**. Users submit **Ask HN** posts to ask the Hacker News community a specific question. Likewise users submit **Show HN** posts to show the Hacker News community a project, product or just something interesting. I want to compare these type of posts to determine the following:
- Do Ask HN or SHOW HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

I'm going to start by importing the needed libraries and reading the dataset into a list of lists.

In [41]:
import csv
opened_file = open('hacker_news.csv')
read_file = csv.reader(opened_file)

hn = []
for row in read_file:
    hn.append(row)
    
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


From the above output I can see that I need to remove the column headers in order to analyze the data. I will do that by removing the first row of the data using list splicing.

In [42]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


Now I want to create separate lists for posts beginning with Ask HN and Show HN including any case variations

In [43]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


Now I want to determine if ask posts or show posts receive more comments on average.

In [44]:
total_ask_comments = 0
for row in ask_posts:
    comments = row[4]
    comments_int = int(comments)
    total_ask_comments += comments_int
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

total_show_comments = 0
for row in show_posts:
    comments = row[4]
    comments_int = int(comments)
    total_show_comments += comments_int

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

14.038417431192661
10.31669535283993


From this simple analysis, it looks like Ask HN posts receive more comments on average compared to Show HN posts.

Since ask posts are more likely to receive comments, I want to focus my remaining analysis on just those posts.

Next I want to determine if ask posts created at a certain time are more likely to attract comments. I will perform that analysis by following the steps below:
- calculate the number of ask post created in each hour of the day, along with the number of comments received
- calculate the average number of comments ask posts receive by hour created

First I want to focus on calculating the number of ask posts created in each hour a day. 
- I will do that by first importing the datetime module as dt.
- Create an empty list and assign it to a list
- I will iterate over the ask_posts list and append it to the new list with two elements
    - the first element should be the column created at
    - the second element should be the number of comments converted into an integer
- Create two empty dictionaries
- loop through each row of the list and extract teh hour from the date
- use datetime.strptime() method to parse the date 
- Create a frequency table for the hours

In [45]:
import datetime as dt
result_list = []
for row in ask_posts:
    created = row[6]
    comments = row[4]
    comments_int = int(comments)
    list_elem = [created, comments_int]
    result_list.append(list_elem)

counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    date = row[0]
    date_obj = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = date_obj.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

print(counts_by_hour)
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


Now that I have two dictionaries:
- counts_by_hour: containing the number of ask posts created during each hour of the day
- comments_by_hour: containing the corresponding number of comments ask posts created at each hour received

I will now use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [46]:
avg_by_hour = []
for comment in comments_by_hour:
    avg_by_hour.append([comment, comments_by_hour[comment]/counts_by_hour[comment]])
    
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


Now that I have identified the average amount of comments per hour. I want to format this list so that it is easy to identify the hours with the highest values.
I want to finish this project by sorting the list of lists and printing the five highest values in a format that is easier to read

In [47]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print("Top 5 hours for ask posts comments")
template = "{hour}: {comments:.2f} average comments per post"
for row in sorted_swap[:5]:
    hour = row[1]
    comment = row[0]
    hour_dt = dt.datetime.strptime(hour, "%H")
    hour_str = hour_dt.strftime("%H:%M")
    final_str = template.format(hour=hour_str, comments=comment)
    print(final_str)
    

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
Top 5 hours for ask posts comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Now I would like to convert the times to the time zone I live in to better understand which hours are best for posting if you want high comment engagement.

In [48]:
for row in sorted_swap[:5]:
    hour = row[1]
    comment = row[0]
    hour_dt = dt.datetime.strptime(hour, "%H")
    hour_str = hour_dt.strftime("%I:%M %p")
    final_str = template.format(hour=hour_str, comments=comment)
    print(final_str)

03:00 PM: 38.59 average comments per post
02:00 AM: 23.81 average comments per post
08:00 PM: 21.52 average comments per post
04:00 PM: 16.80 average comments per post
09:00 PM: 16.01 average comments per post


# Final Analysis
From these calculations I am able to determine that the best times for posting Ask HN posts fall into two ranges 3-4pm and 8-9pm. There is also an outlier in posts created at 2AM that have a lot of activity. This could be because some programmers are nocturnal and work into the night or because Hacker News is globally accessible

# Additional Analysis
Now that I have completed the guided portion of this project. I would like to look at some other types of analysis. Such as determing if show or ask posts receive more points on average from other users. Determing if posts created at a certain time are more likely to receive more points and then compare my results to the average number of comments and points other posts receive

## Calculating extra steps
First I want to use the lists I created earlier for ask, show and other posts. I need to loop through them to gather information about number of points and determine which set of posts receive the most points on average.

In [49]:
total_ask_points = 0
for row in ask_posts:
    points = row[3]
    points_int = int(comments)
    total_ask_points += points_int
    
avg_ask_points = total_ask_points / len(ask_posts)
print(avg_ask_points)

total_show_points = 0
for row in show_posts:
    points = row[3]
    points_int = int(points)
    total_show_points += points_int

avg_show_points = total_show_points / len(show_posts)
print(avg_show_points)

2.0
27.555077452667813


From this calculation its obvious that Show HN posts receive more points on average. So now I want to determine if show posts created at a certain time are more likely to receive more points.

In [50]:
import datetime as dt
result_list_show = []
for row in show_posts:
    created = row[6]
    points = row[3]
    points_int = int(points)
    list_elem_s = [created, points_int]
    result_list_show.append(list_elem_s)

show_counts_by_hour = {}
points_by_hour = {}
for row in result_list_show:
    date = row[0]
    date_obj = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = date_obj.strftime("%H")
    if hour not in show_counts_by_hour:
        show_counts_by_hour[hour] = 1
        points_by_hour[hour] = row[1]
    else:
        show_counts_by_hour[hour] += 1
        points_by_hour[hour] += row[1]

print(show_counts_by_hour)
print(points_by_hour)

{'14': 86, '22': 46, '18': 61, '07': 26, '20': 60, '05': 19, '16': 93, '19': 55, '15': 78, '03': 27, '17': 93, '06': 16, '02': 30, '13': 99, '08': 34, '21': 47, '04': 26, '11': 44, '12': 61, '23': 36, '09': 30, '01': 28, '10': 36, '00': 31}
{'14': 2187, '22': 1856, '18': 2215, '07': 494, '20': 1819, '05': 104, '16': 2634, '19': 1702, '15': 2228, '03': 679, '17': 2521, '06': 375, '02': 340, '13': 2438, '08': 519, '21': 866, '04': 386, '11': 1480, '12': 2543, '23': 1526, '09': 553, '01': 700, '10': 681, '00': 1173}


Now I want to use these two new dictionaries for finding the average points per hour.
- show_counts_by_hour: containing the number of show posts created each hour
- points_by_hour: containing the corresponding number of points per hour

In [55]:
avgpoints_by_hour = []
for points in points_by_hour:
    avgpoints_by_hour.append([points, points_by_hour[points]/show_counts_by_hour[points]])
    
print(avgpoints_by_hour)

[['14', 25.430232558139537], ['22', 40.34782608695652], ['18', 36.31147540983606], ['07', 19.0], ['20', 30.316666666666666], ['05', 5.473684210526316], ['16', 28.322580645161292], ['19', 30.945454545454545], ['15', 28.564102564102566], ['03', 25.14814814814815], ['17', 27.107526881720432], ['06', 23.4375], ['02', 11.333333333333334], ['13', 24.626262626262626], ['08', 15.264705882352942], ['21', 18.425531914893618], ['04', 14.846153846153847], ['11', 33.63636363636363], ['12', 41.68852459016394], ['23', 42.388888888888886], ['09', 18.433333333333334], ['01', 25.0], ['10', 18.916666666666668], ['00', 37.83870967741935]]


Now I want to compare my results to the average number of comments and points other posts receive. First I need to organize the average number of comments and posts that Ask HN and Show HN posts receive into lists

In [56]:
avg_ask_posts = [avg_ask_comments, avg_ask_points]
avg_show_posts = [avg_show_comments, avg_show_points]
print(avg_ask_posts)
print(avg_show_posts)

[14.038417431192661, 2.0]
[10.31669535283993, 27.555077452667813]


Now I need to calculate the average number of comments on other posts as well as the average number of points.

In [59]:
total_other_comments = 0
for row in other_posts:
    comments = row[4]
    comments_int = int(comments)
    total_other_comments += comments_int
    
avg_other_comments = total_other_comments / len(other_posts)
print(avg_other_comments)

total_other_points = 0
for row in other_posts:
    points = row[3]
    points_int = int(comments)
    total_other_points += points_int
    
avg_other_points = total_other_points / len(other_posts)
print(avg_other_points)


26.8730371059672
58.0


Now lets add the average comments and points into a list together for future analysis.

In [60]:
avg_lists = []
avg_lists.append(avg_ask_posts)
avg_lists.append(avg_show_posts)
avg_lists.append(avg_other_posts)
print(avg_lists)

NameError: name 'avg_other_posts' is not defined