# <font color=blue>Analyzing Hacker News Posts</font>

This project compares 'Ask HN' and 'Show HN' posts from [Hacker News](https://news.ycombinator.com).

In 'Ask HN' posts, users submit a question to the Hacker News community (e.g. "What are your favorite podcasts?"). In 'Show HN' posts, users share projects, products, and other interesting technology-related items.
We will compare these two types of posts to learn the following:

- Does "Ask HN" or "Show HN" receive more comments on average?
- Do posts created at a certain time receive more comments on average?

We are working with a reduced dataset, cut from about 300,000 rows to about 20,000. This was done by removing posts with no comments and then randomly sampling the remainder down to the smaller size.
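The reduction described above could be reproduced along the following lines. This is only a sketch: the helper name `reduce_dataset`, the fixed seed, the toy rows, and the assumption that the comment count sits at index 4 are all illustrative, and it is not how the dataset was actually prepared.

```python
import random

def reduce_dataset(rows, target_size, comments_index=4, seed=0):
    """Drop rows with zero comments, then randomly sample down to target_size.

    Hypothetical helper for illustration; `rows` must not include the header row.
    """
    with_comments = [row for row in rows if int(row[comments_index]) > 0]
    if len(with_comments) <= target_size:
        return with_comments            # nothing left to sample away
    rng = random.Random(seed)           # fixed seed so the sample is repeatable
    return rng.sample(with_comments, target_size)

# Made-up rows in the same column order as the dataset.
toy_rows = [
    ['1', 'Post A', '', '3', '0', 'user_a', '1/1/2016 10:00'],
    ['2', 'Post B', '', '5', '2', 'user_b', '1/2/2016 11:00'],
    ['3', 'Post C', '', '1', '7', 'user_c', '1/3/2016 12:00'],
    ['4', 'Post D', '', '8', '0', 'user_d', '1/4/2016 13:00'],
    ['5', 'Post E', '', '2', '4', 'user_e', '1/5/2016 14:00'],
]

sample = reduce_dataset(toy_rows, target_size=2)
print(len(sample))   # 2
```

Filtering before sampling matters here: sampling first and filtering afterwards would leave fewer rows than the target size.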

In [1]:
from csv import reader

with open('hacker_news.csv') as opened_file:   # the file is closed automatically afterwards
    read_file = reader(opened_file)
    hn = list(read_file)                       # list of lists, one inner list per row

headers = hn[0]                                # the first row holds the column names
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


The first row is a header. We keep it in `headers` as a reference for the column names.

Now let's print a handful of rows to see what the data looks like.

In [2]:
for row in hn[14:20]:      #arbitrary slice that includes both Ask HN and Show HN rows
    print(row)
    print("")

['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']

['11587596', 'Custom Deleters for C++ Smart Pointers', 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html', '59', '18', 'ingve', '4/28/2016 10:01']

['12335860', 'How often to update third party libraries?', '', '7', '5', 'rabid_oxen', '8/22/2016 12:37']

['11403750', 'Review my AI based marketing bot', 'http://beta.crowdfireapp.com/?beta=agnipath', '1', '2', 'abhishekmaddy', '4/1/2016 9:45']

['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']

['10837634', "Ten years later, did Boston's Big Dig deliver?", 'https://www.bostonglobe.com/magazine/2015/12/29/years-later-did-big-dig-deliver/tSb8PIMS4QJUETsMpA7SpI/story.html', '109', '116', 'jseliger', '1/4/2016 18:58']



As we can see, there are 7 columns. We are most interested in 'title', 'num_comments', and 'created_at' (the submission date and time). The 'title' column tells us the post type: it starts with "Ask HN", "Show HN", or neither.

# Separating Ask vs. Show posts

We are going to split the data into three separate lists: Ask, Show, and Other.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn[1:]:                   #looping through dataset, skipping the header row
    title = row[1]                   #assigning title column
    title = title.lower()            #lowercasing so the prefix check is case-insensitive
    if title.startswith('ask hn'):   #splitting our lists up based on startswith method
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("There are " + str(len(ask_posts)) + " 'Ask HN' posts.")
print("There are " + str(len(show_posts)) + " 'Show HN' posts.")
print("There are " + str(len(other_posts)) + " other posts.")

There are 1744 'Ask HN' posts.
There are 1162 'Show HN' posts.
There are 17194 other posts.


# Finding the average number of comments per post

Let's find the average number of comments per post for Ask HN and Show HN.

In [4]:
total_ask_comments = 0

for asks in ask_posts:
    num_comments = int(asks[4])                #string to int
    total_ask_comments += num_comments         #running total        

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [5]:
total_show_comments = 0

for show in show_posts:
    num_comments = int(show[4])          #string to int
    total_show_comments += num_comments  #running total

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


As we can see, Ask HN posts receive about 14 comments per post on average, compared with about 10 for Show HN posts. Since Ask HN posts have the higher average, we will focus on them.
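Because the same averaging logic is applied to both lists, it could be factored into a small function. The sketch below is one way to do that; the name `average_comments` and the toy rows are hypothetical, and it assumes the comment count is stored as a string at index 4, as in this dataset.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments across `posts` (0.0 for an empty list)."""
    if not posts:
        return 0.0                      # avoid dividing by zero
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Made-up rows in the same column order as the dataset.
toy_posts = [
    ['1', 'Ask HN: example one', '', '5', '10', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: example two', '', '2', '4', 'user_b', '1/1/2016 15:30'],
]

print(average_comments(toy_posts))   # 7.0
```

With the notebook's lists, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.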

# Ask posts and comments by hour

Let's find out at which hour of the day Ask HN posts are created when they attract the most comments.

In [6]:
import datetime as dt

result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])    #created new list
    
count_by_hour = {}              #number of ask posts created in each hour
comments_by_hour = {}           #total comments on the posts created in each hour

for row in result_list:
    time_stamp = row[0]
    comments = row[1]
    date = dt.datetime.strptime(time_stamp, "%m/%d/%Y %H:%M")   #parsing the timestamp string into a datetime
    hour = date.strftime("%H")                                  #keeping only the zero-padded hour, e.g. '09'
    
    if hour not in count_by_hour:                                #creating a frequency table with hour and comments
        count_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        count_by_hour[hour] += 1
        comments_by_hour[hour] += comments

print(count_by_hour)
print("")
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


The first dictionary maps each hour to the number of Ask HN posts created during that hour. For example, '09': 45 means that 45 posts were created during the 9:00 am hour.

The second dictionary maps each hour to the total number of comments on those posts. For example, '09': 251 means that the posts created during the 9:00 am hour received 251 comments in total.
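As an alternative to the explicit `if`/`else` check, both tallies can be built in one pass with `collections.defaultdict`, which starts every missing key at zero. The sketch below uses a hypothetical helper, `tally_by_hour`, and made-up rows. It assumes the same column positions and timestamp format as the dataset.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(posts, created_index=6, comments_index=4):
    """Return (post counts, comment totals), each keyed by zero-padded hour."""
    counts = defaultdict(int)       # missing hours start at 0
    comments = defaultdict(int)
    for post in posts:
        created = dt.datetime.strptime(post[created_index], "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")
        counts[hour] += 1
        comments[hour] += int(post[comments_index])
    return dict(counts), dict(comments)

# Made-up rows in the same column order as the dataset.
toy_posts = [
    ['1', 'Ask HN: a', '', '1', '3', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: b', '', '1', '7', 'user_b', '11/22/2015 9:05'],
    ['3', 'Ask HN: c', '', '1', '2', 'user_c', '4/1/2016 15:45'],
]

counts, comments = tally_by_hour(toy_posts)
print(counts)     # {'09': 2, '15': 1}
print(comments)   # {'09': 10, '15': 2}
```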

_Now let's look at average number of comments during that specific hour._

In [7]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, round((comments_by_hour[hour]/count_by_hour[hour]), 2)])   #finding average and rounding 2 decimal places
    
print(avg_by_hour)

[['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8], ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01], ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09], ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17], ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05]]


For example, ['09', 5.58] means that posts created during the 9:00 am hour received an average of almost 6 comments each.
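Each average is simply the hour's comment total divided by its post count. As a quick check, the snippet below recomputes two of them from the values printed in the dictionaries above; the variable names are only for illustration.

```python
# Values copied from the count_by_hour and comments_by_hour output above.
posts_at_09, comments_at_09 = 45, 251
posts_at_15, comments_at_15 = 116, 4477

avg_at_09 = round(comments_at_09 / posts_at_09, 2)
avg_at_15 = round(comments_at_15 / posts_at_15, 2)

print(avg_at_09)   # 5.58
print(avg_at_15)   # 38.59
```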

Let's sort the results to make the data more legible.

In [8]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])   #avg comments, hour - personal preference
    
print(swap_avg_by_hour)

[[5.58, '09'], [14.74, '13'], [13.44, '10'], [13.23, '14'], [16.8, '16'], [7.99, '23'], [9.41, '12'], [11.46, '17'], [38.59, '15'], [16.01, '21'], [21.52, '20'], [23.81, '02'], [13.2, '18'], [7.8, '03'], [10.09, '05'], [10.8, '19'], [11.38, '01'], [6.75, '22'], [10.25, '08'], [7.17, '04'], [8.13, '00'], [9.02, '06'], [7.85, '07'], [11.05, '11']]


In [9]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)   #descending order
sorted_swap

[[38.59, '15'],
 [23.81, '02'],
 [21.52, '20'],
 [16.8, '16'],
 [16.01, '21'],
 [14.74, '13'],
 [13.44, '10'],
 [13.23, '14'],
 [13.2, '18'],
 [11.46, '17'],
 [11.38, '01'],
 [11.05, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.09, '05'],
 [9.41, '12'],
 [9.02, '06'],
 [8.13, '00'],
 [7.99, '23'],
 [7.85, '07'],
 [7.8, '03'],
 [7.17, '04'],
 [6.75, '22'],
 [5.58, '09']]
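Swapping the columns works because Python compares lists element by element, so the average in the first position decides the order. A possible alternative is to leave the pairs as they are and pass a `key` function to `sorted`. The sketch below shows this on a four-row excerpt of the averages; `avg_by_hour_sample` and `sorted_by_avg` are illustrative names.

```python
# A short excerpt of the [hour, average] pairs computed earlier.
avg_by_hour_sample = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort on the second element (the average), highest first.
sorted_by_avg = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

print(sorted_by_avg)
# [['15', 38.59], ['02', 23.81], ['20', 21.52], ['09', 5.58]]
```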

In [10]:
print("Top 5 Hours for Ask Post Comment")
print("")
for avg, hour in sorted_swap[:5]:
    hr_dt = dt.datetime.strptime(str(hour),'%H')     #parsing the hour string into a datetime
    hr_str = hr_dt.strftime("%H:%M")                 #formatting as HH:MM
    print("{}: {:.2f} average comments per post".format(hr_str, avg))       

Top 5 Hours for Ask Post Comment

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


# Conclusion

We can see that posts created at 3:00 p.m. (EST), hour 15, have the highest average number of comments per post, at 38.59. We can use this insight to maximize the number of comments received when posting in the Ask HN category on Hacker News.