## Hacker News Posts

[Hacker News](https://news.ycombinator.com/) is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

`id`: The unique identifier from Hacker News for the post

`title`: The title of the post

`url`: The URL that the posts links to, if it the post has a URL

`num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes

`num_comments`: The number of comments that were made on the post

`author`: The username of the person who submitted the post

`created_at`: The date and time at which the post was submitted

In [2]:
# Importing the HackerNews.csv into a list


from csv import reader

opened_file = open("HN_posts_year_to_Sep_26_2016.csv", encoding = "utf8")
read_file = reader(opened_file)
hn = list(read_file)

# Print the first 5 rows

for row in hn[:5]:
    print(row)
    print("\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']




In [3]:
# Only run this once
# Assign header row to headers variable
# Remove header row from hn

hn_raw = list(hn)
headers = hn_raw[0]
hn = hn_raw[1:]

print(headers)
print("\n")
print(hn[:5])


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


In [4]:
# There are three types of posts
# 1. Ask HN posts, 2. Show HN posts, and 3. Other posts
# We will create three lists to contain these types of posts

ask_posts = []
show_posts = []
other_posts= []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Check the length of each list

print("Asks posts number is", len(ask_posts))
print("Asks posts number is", len(show_posts))
print("Asks posts number is", len(other_posts))

Asks posts number is 9139
Asks posts number is 10158
Asks posts number is 273822


In [5]:
# Determine if asks posts or show posts receive more comments on average

total_ask_comments = 0
for row in ask_posts:
    comments = row[4]
    comments = int(comments)
    total_ask_comments += comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print(round(avg_ask_comments,2))

total_show_comments = 0
for row in show_posts:
    comments = row[4]
    comments = int(comments)
    total_show_comments += comments

avg_show_comments = total_show_comments / len(ask_posts)
print(round(avg_show_comments,2))

# Average ask comments is higher than average show comments

10.39
5.43


In [6]:
# Let's focus on ask posts since they have higher avg comments
# In ask posts, let's find out what time of the day 
# there are the highest comments


# Creating a list to store created_at and comments
import datetime as dt
result_list = []

for row in ask_posts:
    created_at = row[6]
    comments = row[4]
    comments = int(comments)
    result_list.append([created_at, comments])

# Creating two dictionarys, to hold count, and to hold comments, by hour
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    hour = row[0]
    hour = dt.datetime.strptime(hour, "%m/%d/%Y %H:%M")
    hour = hour.strftime("%H")
    hour = int(hour)
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

# Create a list to store the average number of comments by hour
avg_by_hour = []

for row in counts_by_hour:
    avg_by_hour.append([row, round((comments_by_hour[row] / counts_by_hour[row]),2)])
    
avg_by_hour


        



[[2, 11.12],
 [1, 7.39],
 [22, 8.78],
 [21, 8.67],
 [19, 7.15],
 [17, 9.44],
 [15, 28.63],
 [14, 9.67],
 [13, 16.29],
 [11, 8.94],
 [10, 10.65],
 [9, 7.06],
 [7, 7.0],
 [3, 7.92],
 [23, 6.68],
 [20, 8.73],
 [16, 7.7],
 [8, 9.18],
 [0, 7.55],
 [18, 7.95],
 [12, 12.36],
 [4, 9.68],
 [6, 6.76],
 [5, 8.75]]

In [7]:
# Create a list to help sort the values

swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

sorted_swap

[[28.63, 15],
 [16.29, 13],
 [12.36, 12],
 [11.12, 2],
 [10.65, 10],
 [9.68, 4],
 [9.67, 14],
 [9.44, 17],
 [9.18, 8],
 [8.94, 11],
 [8.78, 22],
 [8.75, 5],
 [8.73, 20],
 [8.67, 21],
 [7.95, 18],
 [7.92, 3],
 [7.7, 16],
 [7.55, 0],
 [7.39, 1],
 [7.15, 19],
 [7.06, 9],
 [7.0, 7],
 [6.76, 6],
 [6.68, 23]]

In [37]:
print("Top 5 Hours for Ask Posts Comments")
print("\n")
for row in sorted_swap[:5]:
    row[1] = str(row[1])
    avg_comments = row[0]
    time = dt.datetime.strptime(row[1], "%H")
    time = time.strftime("%H:%M")
    template = "{0}: {1:.2f} average comments per post"
    print(template.format(time, avg_comments))
    
# Looks like the best time to post for high average comments would be 
# between 12 pm - 3 pm EST

Top 5 Hours for Ask Posts Comments


15:00: 28.63 average comments per post
13:00: 16.29 average comments per post
12:00: 12.36 average comments per post
02:00: 11.12 average comments per post
10:00: 10.65 average comments per post
