## Predict the number of votes a post will attract

This study is based on a data set of Hacker News posts from the 12 months up to September 26, 2016.

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

Users submit Ask HN posts to ask the Hacker News community a specific question.

Below are a couple of examples:


>Ask HN: How to improve my personal website?
>Ask HN: Am I the only one outraged by Twitter shutting down share counts?
>Ask HN: Aby recent changes to CSS that broke mobile?


Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. Below are a couple of examples:

>Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
>Show HN: Something pointless I made
>Show HN: Shanhu.io, a programming playground powered by e8vm


We'll compare these two types of posts to determine the following:

1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need and reading the data set into a list of lists.


**Step 1**

1. Open the HN_posts_year_to_Sep_26_2016.csv data set and assign the data to a list, hn.
2. Extract the first row of data, and assign it to the variable hn_header.
3. Remove the first row from hn.
4. Display hn_header.
5. Display the first rows of hn to verify that the header row was removed properly.

In [1]:
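from csv import reader

### The Hacker News data set ###
# The path below is specific to the author's machine; adjust it to where the CSV is stored.
opened_file = open(r"C:\Users\ojesus\Desktop\pt_morj\Formation\Python\my_datasets\HN_posts_year_to_Sep_26_2016.csv", encoding='utf8')
read_file = reader(opened_file)
hn = list(read_file)
opened_file.close()  # the rows are now in memory, so the file can be closed
hn_header = hn[0]  # column names
hn = hn[1:]        # drop the header row
print(hn_header)
# Note: hn[1:6] skips the first data row; hn[:5] would show the first five rows.
print(hn[1:6])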

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14'], ['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']]
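
As a side note, the header can also be split from the data rows while reading, without slicing the list afterwards. The following is only a minimal sketch: it reads from an in-memory string (`sample_csv`, a made-up stand-in for the real file) that contains the header and one row copied from the output above. With the real file, the same pattern would go inside a `with open(...)` block.

```python
import csv
import io

# Assumption: a tiny in-memory stand-in for the CSV file, built from the
# header and one row shown in the output above.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12578975,Saving the Hassle of Shopping,"
    "https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/,"
    "1,1,bdoux,9/26/2016 3:13\n"
)

rows = csv.reader(sample_csv)
header = next(rows)  # consume the first row as the header
data = list(rows)    # everything that remains is data

print(header)
print(data)
```

Because `next()` consumes the header before `list()` is called, there is no header row left to remove from `data`.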


**Step 2**

Split the posts into three lists of lists based on the title: posts whose titles begin with Ask HN, posts whose titles begin with Show HN, and all other posts.

In [2]:
ask_posts   = []
show_posts  = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):       
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))    

print(ask_posts[:6])
print(show_posts[:6])
#print(other_posts[:6])        
  


9139
10158
273822
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50'], ['12576946', 'Ask HN: How hard would it be to make a cheap, hackable phone?', '', '2', '1', 'hkt', '9/25/2016 19:30']]
[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/201
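
The title check used above can also be isolated in a small helper, which makes it easy to try on individual titles. This is only a sketch: `classify_title` is a hypothetical name that does not appear in the notebook, and the sample titles are taken from the output shown earlier.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' depending on how the title starts."""
    lowered = title.lower()  # make the prefix check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Titles copied from the output shown earlier
sample_titles = [
    'Ask HN: What TLD do you use for local development?',
    'Show HN: Finding puns computationally',
    'algorithmic music',
]
labels = [classify_title(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```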

**Step 3**

Calculate the total number of comments on ask posts and the average number of comments per ask post.

In [3]:
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])

print(total_ask_comments)
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

94986
10.393478498741656


**Step 4**

Calculate the total number of comments on show posts and the average number of comments per show post.

In [4]:
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])

print(total_show_comments)
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

49633
4.886099625910612


**Conclusion:** On average, ask posts receive more than twice as many comments as show posts (about 10.39 versus 4.89 comments per post).
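
Steps 3 and 4 repeat the same loop, so the calculation could be wrapped in one function. The sketch below assumes a hypothetical helper named `average_comments` and, as a labeled design choice, returns 0.0 for an empty list. It is checked on two ask-post rows copied from the Step 2 output, and the ratio is recomputed from the totals and counts printed above.

```python
def average_comments(posts, comments_index=4):
    """Average of the num_comments column (index 4 in this data set)."""
    if not posts:
        return 0.0  # assumption: treat an empty list as an average of zero
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Two rows copied from the ask_posts output in Step 2
sample_rows = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
]
sample_avg = average_comments(sample_rows)
print(sample_avg)  # (7 + 3) / 2 = 5.0

# Ratio of the two averages, using the totals and counts printed above
ratio = (94986 / 9139) / (49633 / 10158)
print(ratio)
```

The ratio comes out a little above 2, which is where "more than twice as many" comes from.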

**Step 5**

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

Calculate the number of ask posts created in each hour of the day, along with the total number of comments those posts received.

In [57]:
result_list = []
for row in ask_posts:
    # keep only created_at (index 6) and num_comments (index 4)
    result_list.append([row[6], row[4]])

print(result_list[:100])

[['9/26/2016 2:53', '7'], ['9/26/2016 1:17', '3'], ['9/25/2016 22:57', '0'], ['9/25/2016 22:48', '3'], ['9/25/2016 21:50', '2'], ['9/25/2016 19:30', '1'], ['9/25/2016 19:22', '22'], ['9/25/2016 17:55', '3'], ['9/25/2016 15:48', '0'], ['9/25/2016 15:35', '13'], ['9/25/2016 15:28', '0'], ['9/25/2016 14:43', '0'], ['9/25/2016 14:17', '3'], ['9/25/2016 13:08', '2'], ['9/25/2016 11:27', '2'], ['9/25/2016 10:51', '0'], ['9/25/2016 10:47', '6'], ['9/25/2016 9:04', '97'], ['9/25/2016 7:09', '4'], ['9/25/2016 3:00', '1'], ['9/24/2016 23:04', '0'], ['9/24/2016 22:02', '7'], ['9/24/2016 21:18', '2'], ['9/24/2016 20:58', '0'], ['9/24/2016 19:57', '1'], ['9/24/2016 19:02', '0'], ['9/24/2016 17:55', '0'], ['9/24/2016 17:27', '1'], ['9/24/2016 16:50', '0'], ['9/24/2016 16:03', '5'], ['9/24/2016 15:29', '66'], ['9/24/2016 14:03', '1'], ['9/24/2016 10:10', '11'], ['9/24/2016 8:46', '7'], ['9/24/2016 8:39', '1'], ['9/24/2016 8:38', '1'], ['9/24/2016 8:28', '1'], ['9/24/2016 3:36', '3'], ['9/24/2016 0:21

In [26]:
from datetime import datetime

counts_by_hour = {}    # hour -> number of ask posts created in that hour
comments_by_hour = {}  # hour -> total comments on those posts
date_format = "%H"

for row in result_list:
    num_comments = int(row[-1])
    ask_date = datetime.strptime(row[-2], '%m/%d/%Y %H:%M')
    # format the parsed datetime as a two-digit hour string, e.g. '02'
    ask_hour = ask_date.strftime(date_format)

    if ask_hour not in counts_by_hour:
        counts_by_hour[ask_hour] = 1
    else:
        counts_by_hour[ask_hour] += 1

    if ask_hour not in comments_by_hour:
        comments_by_hour[ask_hour] = num_comments
    else:
        comments_by_hour[ask_hour] += num_comments

print(counts_by_hour)
print(comments_by_hour)


{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
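
The dictionary keys above are two-digit strings such as '02' because of how the hour is extracted. The short sketch below shows that step on its own: `strptime` parses the `created_at` text into a datetime object, and `strftime('%H')` turns it back into a zero-padded hour. The helper name `hour_of` is hypothetical, and the sample timestamps are taken from the output of the first cell in Step 5.

```python
from datetime import datetime


def hour_of(created_at):
    """Return the zero-padded hour ('00' to '23') of a created_at string."""
    parsed = datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    return parsed.strftime('%H')


print(hour_of('9/26/2016 2:53'))   # 02
print(hour_of('9/25/2016 22:57'))  # 22
```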


**Step 6**

Based on the information above, calculate the average number of comments per ask post for each hour.

In [28]:
avg_by_hour = []
for hour in counts_by_hour:
    # both dictionaries were filled in the same loop, so they share the same keys
    avg_calc = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg_calc])

print(avg_by_hour)

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]


**Step 7**

Sort the data and print the five hours with the highest average number of comments per post, formatted like this:

15:00: 28.68 average comments per post.

In [56]:
swap_avg_by_hour = []
for row in avg_by_hour:
    # put the average first so that sorted() orders the rows by it
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap[:5])

for row in sorted_swap[:5]:
    # parse the hour string, then format it as HH:MM
    ask_hour = datetime.strptime(row[1], '%H')
    ask_hour = ask_hour.strftime('%H:%M')
    num_comm = row[0]
    var_final = "{0}: {1:.2f} average comments per post.".format(ask_hour, num_comm)
    print(var_final)


[[28.676470588235293, '15'], [16.31756756756757, '13'], [12.380116959064328, '12'], [11.137546468401487, '02'], [10.684397163120567, '10']]
15:00: 28.68 average comments per post.
13:00: 16.32 average comments per post.
12:00: 12.38 average comments per post.
02:00: 11.14 average comments per post.
10:00: 10.68 average comments per post.
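
Swapping the columns is one way to sort by the average. Another option, sketched below, is to pass a `key` function to `sorted` so the original `[hour, average]` order is kept. The list `avg_by_hour_sample` is a hypothetical subset of four values copied from the Step 6 output, used only to keep the example short.

```python
# A small subset of the [hour, average] pairs printed in Step 6
avg_by_hour_sample = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['09', 6.653153153153153],
]

# Sort by the second element (the average), highest first, and keep the top two
top_two = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)[:2]

lines = [
    "{}:00: {:.2f} average comments per post.".format(hour, avg)
    for hour, avg in top_two
]
for line in lines:
    print(line)
```

On this subset, the two lines printed match the first two lines of the output above.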


**Step 8**

**Conclusion:** The three hours of the day with the highest average number of comments per ask post are:

15:00: 28.68 average comments per post.  
13:00: 16.32 average comments per post.  
12:00: 12.38 average comments per post.  
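In other words, ask posts created around lunchtime and in the early afternoon (12:00, 13:00, and 15:00, in the time zone used by the created_at column) receive the most comments on average.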