<a href="https://colab.research.google.com/github/ipshitaRB/Explore-Hacker-News/blob/master/ExploreHackerNewsPosts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to gain more participation from readers on your Hacker News post?

### Analysis of the popular posts on Hacker News

The dataset only includes data about posts that have comments. Below is a description of the columns:

*   id: The unique identifier from Hacker News for the post
*   title: The title of the post
*   url: The URL that the post links to, if the post has a URL
*   num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
*   num_comments: The number of comments that were made on the post
*   author: The username of the person who submitted the post
*   created_at: The date and time at which the post was submitted
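
As a minimal sketch of how these columns can be read by name rather than by position, the snippet below parses two sample rows in the dataset's format with `csv.DictReader`. The in-memory `sample_csv` string stands in for the real file and is an assumption made for illustration; the notebook itself uses `csv.reader` and positional indexes.

```python
import csv
import io

# Two rows in the dataset's column order; the header matches the list above.
sample_csv = """id,title,url,num_points,num_comments,author,created_at
12579005,SQLAR  the SQLite Archiver,https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24
12578989,algorithmic music,http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,1,0,poindontcare,9/26/2016 3:16
"""

# DictReader uses the header row as keys, so each row is a dict keyed by column name.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

for row in rows:
    # Every value is read as a string; numeric columns need an explicit conversion.
    print(row["title"], int(row["num_comments"]), row["created_at"])
```

Named access makes it harder to mix up `num_points` and `num_comments`, at the cost of slightly more memory per row.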


These are the questions to focus on:

*   Do Ask HN or Show HN posts receive more comments on average?
*   Do posts created at a certain time receive more comments on average?





In [0]:
# Mount Google Drive so the dataset file can be read
from google.colab import drive
drive.mount('/content/drive')

In [0]:
from csv import reader

# Read the CSV into a list of lists; the with-block closes the file afterwards
with open("/content/drive/My Drive/HN_posts_year_to_Sep_26_2016.csv") as opened_file:
    posts = list(reader(opened_file))

# Quick exploration of the data: the header row plus the first four posts
for post in posts[:5]:
    print(post)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


Segregate posts into Ask HN, Show HN, and others


In [0]:
# Split the data set into ask HN, show HN and other categories of posts
ask_posts, show_posts, other_posts = [],[],[]

for post in posts[1:]:
    title = post[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)

# Report the size of each category
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822
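
The prefix test used for the split can also be sketched as a small function, which makes the rule easy to check on its own. The function name `classify_title` and the example titles are hypothetical and are not part of the notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical titles, made up for illustration
examples = ["Ask HN: How do you learn?", "SHOW HN: My project", "algorithmic music"]
labels = [classify_title(title) for title in examples]
print(labels)
```

Note that a title which merely mentions "Ask HN" somewhere in the middle is still classified as `other`, which matches the `startswith` logic in the cell above.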


Compute the average number of comments on Ask HN posts

In [0]:
# Find average number of comments for Ask HN posts
total_ask_comments = 0
for post in ask_posts:
    num_comments = post[4]
    num_comments = int(num_comments)
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print(round(avg_ask_comments))

10


Compute the average number of comments on Show HN posts

In [0]:
# Find total number comments in Show HN posts and compute their average

total_show_comments = 0
for post in show_posts:
    num_comments = post[4]
    num_comments = int(num_comments)
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print(round(avg_show_comments))

5


**The average number of comments on Ask HN posts (about 10) appears to be roughly twice the average number of comments on Show HN posts (about 5).**
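
Because the two averaging loops differ only in the list they iterate over, the calculation could be factored into one helper. This is a sketch under assumptions: `average_comments` is a hypothetical name, and the `toy_posts` rows are invented for illustration in the dataset's column order.

```python
def average_comments(rows, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not rows:
        return 0.0
    return sum(int(row[comments_index]) for row in rows) / len(rows)


# Hypothetical rows: id, title, url, num_points, num_comments, author, created_at
toy_posts = [
    ["1", "Ask HN: a", "", "3", "4", "user_a", "9/26/2016 3:26"],
    ["2", "Ask HN: b", "", "1", "10", "user_b", "9/26/2016 15:01"],
]

toy_average = average_comments(toy_posts)
print(toy_average)
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` would give the unrounded values behind the figures printed above.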



---


Explore the number of Ask HN posts and comments by hour of creation



In [0]:
import datetime as dt

# make a list of number of comments and time of creation for each post
result_list = []
for post in ask_posts:
    created_at = post[6]
    num_comments = post[4]
    num_comments = int(num_comments)
    result_list.append([created_at, num_comments])

counts_by_hour = {}
comments_by_hour = {}


for row in result_list:
    created_at_dt = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour_dt = created_at_dt.strftime("%H")  # zero-padded hour, e.g. "03"
    comment = row[1]
    if hour_dt not in counts_by_hour:
        counts_by_hour[hour_dt] = 1
        comments_by_hour[hour_dt] = comment
    else:
        counts_by_hour[hour_dt] += 1
        comments_by_hour[hour_dt] += comment

print(counts_by_hour)
print(comments_by_hour)


{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
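
The same tally can be written without the explicit "is the key already there" check by using `collections.defaultdict`. The sketch below assumes the notebook's `%m/%d/%Y %H:%M` timestamp format; the function name `tally_by_hour` and the `toy_pairs` data are hypothetical.

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(pairs, date_format="%m/%d/%Y %H:%M"):
    """Return (posts per hour, comments per hour) for [created_at, num_comments] pairs."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        # Parse the timestamp, then keep only the zero-padded hour as the key
        hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)


# Hypothetical [created_at, num_comments] pairs, made up for illustration
toy_pairs = [["9/26/2016 3:26", 2], ["9/25/2016 3:05", 5], ["8/16/2016 15:40", 7]]
toy_counts, toy_comments = tally_by_hour(toy_pairs)
print(toy_counts)
print(toy_comments)
```

Missing keys start at zero automatically, so both dictionaries are updated with a single `+=` per post.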




---

Calculate average number of comments per post for each hour of the day



In [0]:
avg_by_hour = []

for hour in counts_by_hour:
    average_comments = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, average_comments])

print(avg_by_hour)

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]




---

When should you post to receive the most comments?

- Identify the 5 hours with the highest average number of comments per post

In [0]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
print(swap_avg_by_hour)

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


In [0]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print("Top 5 Hours for Ask Posts Comments")
output = "{}:00: {:.2f} average"
for row in sorted_swap[:5]:
    print(output.format(row[1],row[0]))
    

Top 5 Hours for Ask Posts Comments
15:00: 28.68 average
13:00: 16.32 average
12:00: 12.38 average
02:00: 11.14 average
10:00: 10.68 average
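
The swap step is one way to sort by the average. An alternative sketch passes a `key` function to `sorted`, which leaves the `[hour, average]` pairs in their original order. The `toy_avg_by_hour` values below are a few of the averages from the output above, rounded to one decimal for brevity.

```python
# A few [hour, average] pairs, rounded from the notebook output above
toy_avg_by_hour = [["02", 11.1], ["15", 28.7], ["13", 16.3], ["09", 6.7]]

# Sort on the second element (the average), highest first, and keep the top 3
top_hours = sorted(toy_avg_by_hour, key=lambda row: row[1], reverse=True)[:3]

for hour, average in top_hours:
    print("{}:00: {:.2f} average".format(hour, average))
```

Sorting on the average alone also avoids comparing hour strings when two hours happen to share the same average.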


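**As shown, Ask HN posts created at 15:00 receive by far the most comments on average (about 28.7 per post). Three of the top 5 hours (12:00, 13:00, and 15:00) fall in the early afternoon in the Eastern US time zone; the other two are 02:00 and 10:00.**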