## Hacker News Data Analysis

### Overview of the project
[Hacker News](https://news.ycombinator.com) is a popular site in tech and startup circles, where user-submitted posts are voted and commented on. It was started by the startup incubator Y Combinator. In this project, we dig into a dataset of posts and look for patterns in the comments and points they receive.


### Datasets
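This project uses the file `HN_posts_year_to_Sep_26_2016.csv`. As loaded below, it has 7 columns and 293,119 rows (the sum of the ask, show and other post counts printed later), and it includes posts that received no comments.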

In [10]:
from csv import reader
# open the file in a with-block so it is closed once the rows are read
with open("/Users/Ming/jupyter/p_apps/HN_posts_year_to_Sep_26_2016.csv") as opened_file:
    data_all = list(reader(opened_file))
headers = data_all[0]
hn = data_all[1:]

In [11]:
from pprint import pprint
pprint(headers)
pprint(hn[:1])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008',
  'You have two days to comment if you want stem cells to be classified as '
  'your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26']]


In [17]:
# separate ask posts and show posts from all other posts, using the title prefix
ask_posts, show_posts, other_posts = [], [], []
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts),len(show_posts),len(other_posts))

9139 10158 273822
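
The prefix test used above can be pulled into a small helper, which makes the rule easy to check on its own. This is a minimal sketch: `classify_title` and the sample titles are hypothetical names introduced for illustration, and the rule is the same case-insensitive `startswith` check as in the cell above.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' from the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical sample titles; only the first one comes from the dataset output.
sample_titles = [
    "Ask HN: What TLD do you use for local development?",
    "SHOW HN: a made-up project",
    "A title without either prefix",
]
labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Note that a plain prefix match also accepts titles such as "Ask HNers ...", so a stricter rule (for example requiring the colon) may be preferable depending on the data.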


In [18]:
print(ask_posts[:10])


[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50'], ['12576946', 'Ask HN: How hard would it be to make a cheap, hackable phone?', '', '2', '1', 'hkt', '9/25/2016 19:30'], ['12576899', 'Ask HN: What is that one deciding factor that makes a website successful?', '', '22', '22', 'ziggystardust', '9/25/2016 19:22'], ['12576398', 'Ask HN: Is the world really short of software developers?', '', '2', '3', 'chirau', '9/25/2016 17:55'], ['12575803', 'Ask HN: Geolocalized publ

In [21]:
# find the number of comments for ask and show posts
total_ask_comments, total_show_comments = 0, 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
for row in show_posts:
    total_show_comments += int(row[4])  
avg_ask_comments = total_ask_comments/len(ask_posts)
avg_show_comments = total_show_comments/len(show_posts)

print(f"Ask posts receive {avg_ask_comments} comments on average.")
print(f"Show posts receive {avg_show_comments} comments on average.")

Ask posts receive 10.393478498741656 comments on average.
Show posts receive 4.886099625910612 comments on average.


### Analysis on number of comments

The average number of comments for ask posts is more than twice that of show posts. This is plausible: ask posts explicitly invite answers, so they tend to draw more replies.

Next we will focus on ask posts and look at the relationship between posted time and the number of comments.
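
Grouping by hour depends on parsing the `created_at` string. A minimal sketch of that step follows, using the value from the first data row shown earlier; the constant name `CREATED_AT_FORMAT` is an assumption made for this example.

```python
import datetime as dt

# Layout of the created_at column as seen in the sample row:
# month/day/year hour:minute, without zero padding.
CREATED_AT_FORMAT = "%m/%d/%Y %H:%M"

sample = "9/26/2016 3:26"  # created_at of the first data row printed above
parsed = dt.datetime.strptime(sample, CREATED_AT_FORMAT)

print(parsed.hour)               # hour of day as an integer
print(parsed.strftime("%H:%M"))  # zero-padded time string
```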

In [33]:
import datetime as dt
result_list = []
counts_by_hour = {}
comments_by_hour = {}
for row in ask_posts:
    created_time = row[6]
    num_comments = int(row[4])
    created_time = dt.datetime.strptime(created_time, '%m/%d/%Y %H:%M')
    created_hour = created_time.hour
    
    if created_hour in counts_by_hour:
        counts_by_hour[created_hour] += 1
        comments_by_hour[created_hour] += num_comments
    else:
        counts_by_hour[created_hour] = 1
        comments_by_hour[created_hour] = num_comments       
    

In [37]:
# compute the average number of comments per post for each hour
avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
print(len(comments_by_hour))

24


In [47]:
swap_avg_by_hour=[]
for i in avg_by_hour:
    swap_avg_by_hour.append([i[1],i[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")
for i in sorted_swap[:5]:
    t = dt.datetime.strptime(str(i[1]),"%H")
    t = t.strftime("%H:%M")
    n = i[0]
    print(f"{t}: {n:.2f} average comments per post" )

Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


#### Results
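The 15:00 hour stands out: ask posts created then average 28.68 comments per post, well ahead of 16.32 for 13:00 in second place. The time zone of `created_at` is not stated here, so the hour should be read as the dataset's clock rather than a specific local time. An average can also be pulled up by a few very large threads, so this does not by itself show that every post at that hour gets more replies.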
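
The count, total, average and sort steps are repeated for each post type and metric in this notebook, so they can be wrapped in one function. The sketch below is one possible shape, not the notebook's own code: `average_by_hour`, `top_hours` and the toy rows are hypothetical, and the column positions (3 for points, 4 for comments, 6 for `created_at`) follow the header printed earlier.

```python
import datetime as dt
from collections import defaultdict


def average_by_hour(rows, value_index, time_index=6, fmt="%m/%d/%Y %H:%M"):
    """Map each hour of day to the mean of the integer column value_index."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        hour = dt.datetime.strptime(row[time_index], fmt).hour
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in counts}


def top_hours(averages, n=5):
    """Return the n (hour, average) pairs with the highest averages."""
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:n]


# Toy rows in the dataset's column order (values are made up for illustration).
toy_rows = [
    ["1", "Ask HN: a", "", "4", "7", "user1", "9/26/2016 2:53"],
    ["2", "Ask HN: b", "", "6", "3", "user2", "9/26/2016 2:17"],
    ["3", "Ask HN: c", "", "1", "0", "user3", "9/25/2016 22:57"],
]
avg_comments = average_by_hour(toy_rows, value_index=4)
for hour, avg in top_hours(avg_comments):
    print(f"{hour:02d}:00: {avg:.2f} average comments per post")
```

With the real data, a call such as `average_by_hour(ask_posts, value_index=4)` would stand in for the two dictionary-building cells. Ties are ordered by first appearance here, whereas sorting `[average, hour]` pairs breaks ties by hour.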

Show posts, by comparison, have no obvious favorite hour: the top 5 hours below are close to one another.

In [48]:
import datetime as dt
result_list = []
counts_by_hour = {}
comments_by_hour = {}
for row in show_posts:
    created_time = row[6]
    num_comments = int(row[4])
    created_time = dt.datetime.strptime(created_time, '%m/%d/%Y %H:%M')
    created_hour = created_time.hour
    
    if created_hour in counts_by_hour:
        counts_by_hour[created_hour] += 1
        comments_by_hour[created_hour] += num_comments
    else:
        counts_by_hour[created_hour] = 1
        comments_by_hour[created_hour] = num_comments    

In [50]:
# compute the average number of comments per post for each hour
avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
swap_avg_by_hour=[]
for i in avg_by_hour:
    swap_avg_by_hour.append([i[1],i[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Show Posts Comments")
for i in sorted_swap[:5]:
    t = dt.datetime.strptime(str(i[1]),"%H")
    t = t.strftime("%H:%M")
    n = i[0]
    print(f"{t}: {n:.2f} average comments per post" )

Top 5 Hours for Show Posts Comments
12:00: 6.99 average comments per post
07:00: 6.68 average comments per post
11:00: 6.00 average comments per post
08:00: 5.60 average comments per post
14:00: 5.52 average comments per post


#### Number of Points Received

Next, we study the average points received by show and ask posts. Although show posts receive fewer comments on average, they receive somewhat more points: about 14.8 per post versus 11.3 for ask posts. We then look at posting time to see whether certain hours are more likely to earn points.

In [52]:
# accumulate each group's points in its own total so the two sums stay separate
total_ask_points = 0
total_show_points = 0
for row in ask_posts:
    total_ask_points += int(row[3])
avg_ask_points = total_ask_points/len(ask_posts)

for row in show_posts:
    total_show_points += int(row[3])
avg_show_points = total_show_points/len(show_posts)

print(f'On average, ask post received {avg_ask_points:.2f} points.')
print(f'On average, show post received {avg_show_points:.2f} points.')

On average, ask post received 11.31 points.
On average, show post received 14.84 points.
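
A group average is only meaningful if the running total starts from zero for each group. One way to make that hard to get wrong is to compute each mean inside a function, so no accumulator is shared between groups. This is a minimal sketch; `mean_of_column` and the toy rows are hypothetical.

```python
def mean_of_column(rows, index):
    """Mean of the integer values stored as strings in column index."""
    values = [int(row[index]) for row in rows]
    return sum(values) / len(values)


# Toy rows in the dataset's column order (points are in column 3).
toy_ask = [
    ["1", "Ask HN: a", "", "4", "7", "user1", "9/26/2016 2:53"],
    ["2", "Ask HN: b", "", "6", "3", "user2", "9/26/2016 1:17"],
]
toy_show = [
    ["3", "Show HN: c", "", "20", "1", "user3", "9/26/2016 0:36"],
]

print(f"ask: {mean_of_column(toy_ask, 3):.2f} points")
print(f"show: {mean_of_column(toy_show, 3):.2f} points")
```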


In [61]:
# points by hour for show posts
points_by_hour = {}
counts_by_hour = {}  # start from empty counts instead of reusing the earlier cell's dict
for row in show_posts:
    created_time = row[6]
    num_points = int(row[3])
    created_time = dt.datetime.strptime(created_time, '%m/%d/%Y %H:%M')
    created_hour = created_time.hour

    if created_hour in points_by_hour:
        points_by_hour[created_hour] += num_points
        counts_by_hour[created_hour] += 1
    else:
        points_by_hour[created_hour] = num_points
        counts_by_hour[created_hour] = 1

avg_by_hour = []

for hour in points_by_hour:
    avg_by_hour.append([hour, points_by_hour[hour] / counts_by_hour[hour]])
pprint(avg_by_hour)

[[0, 15.547101449275363],
 [23, 15.862068965517242],
 [20, 13.234285714285715],
 [19, 16.057553956834532],
 [18, 15.144817073170731],
 [16, 14.340823970037453],
 [14, 15.09051724137931],
 [10, 13.321981424148607],
 [9, 12.456953642384105],
 [8, 14.683544303797468],
 [6, 15.994791666666666],
 [3, 10.524271844660195],
 [21, 13.930232558139535],
 [17, 13.88042049934297],
 [15, 13.94377990430622],
 [11, 19.258706467661693],
 [7, 13.995762711864407],
 [4, 13.95360824742268],
 [13, 17.018032786885247],
 [12, 20.905038759689923],
 [1, 11.866396761133604],
 [22, 13.331564986737401],
 [2, 13.224880382775119],
 [5, 10.662790697674419]]


In [63]:
# compute the average points per post for each hour (show posts)
avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, points_by_hour[hour] / counts_by_hour[hour]])
    
swap_avg_by_hour=[]
for i in avg_by_hour:
    swap_avg_by_hour.append([i[1],i[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Show Posts Points")
for i in sorted_swap[:5]:
    t = dt.datetime.strptime(str(i[1]),"%H")
    t = t.strftime("%H:%M")
    n = i[0]
    print(f"{t}: {n:.2f} average points per post" )

Top 5 Hours for Show Posts Points
12:00: 20.91 average points per post
11:00: 19.26 average points per post
13:00: 17.02 average points per post
19:00: 16.06 average points per post
06:00: 15.99 average points per post


In [64]:
# points by hour for ask posts
points_by_hour = {}
counts_by_hour = {}  # start from empty counts instead of reusing the show-post counts
for row in ask_posts:
    created_time = row[6]
    num_points = int(row[3])
    created_time = dt.datetime.strptime(created_time, '%m/%d/%Y %H:%M')
    created_hour = created_time.hour

    if created_hour in points_by_hour:
        points_by_hour[created_hour] += num_points
        counts_by_hour[created_hour] += 1
    else:
        points_by_hour[created_hour] = num_points
        counts_by_hour[created_hour] = 1

avg_by_hour = []

for hour in points_by_hour:
    avg_by_hour.append([hour, points_by_hour[hour] / counts_by_hour[hour]])
pprint(avg_by_hour)

[[2, 10.944237918215613],
 [1, 9.439716312056738],
 [22, 9.402088772845953],
 [21, 9.733590733590734],
 [19, 8.66304347826087],
 [17, 12.189097103918229],
 [15, 21.637770897832816],
 [14, 10.50682261208577],
 [13, 17.93243243243243],
 [11, 9.153846153846153],
 [10, 13.436170212765957],
 [9, 7.941441441441442],
 [7, 9.026548672566372],
 [3, 9.3690036900369],
 [23, 7.626822157434402],
 [20, 8.805882352941177],
 [16, 10.310880829015543],
 [8, 10.67704280155642],
 [0, 9.418604651162791],
 [18, 11.156351791530945],
 [12, 13.576023391812866],
 [4, 10.905349794238683],
 [6, 8.675213675213675],
 [5, 9.789473684210526]]


In [65]:
# compute the average points per post for each hour (ask posts)
avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, points_by_hour[hour] / counts_by_hour[hour]])
    
swap_avg_by_hour=[]
for i in avg_by_hour:
    swap_avg_by_hour.append([i[1],i[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Points")
for i in sorted_swap[:5]:
    t = dt.datetime.strptime(str(i[1]),"%H")
    t = t.strftime("%H:%M")
    n = i[0]
    print(f"{t}: {n:.2f} average points per post" )

Top 5 Hours for Ask Posts Points
15:00: 21.64 average points per post
13:00: 17.93 average points per post
12:00: 13.58 average points per post
10:00: 13.44 average points per post
17:00: 12.19 average points per post


#### Results 
From the calculation above, 15:00 is again the best hour for ask posts, this time for average points, followed by 13:00. For show posts, the best hours for points cluster around midday, from 11:00 to 13:00. As with comments per hour, the lead of the top hour is more pronounced for ask posts; both comments and points per hour are more evenly distributed for show posts.
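
One simple way to put a number on "more pronounced" is the ratio of the best hour's average to the runner-up's. The sketch below uses the top two values copied from the printed top-5 lists; the names `top_two` and `lead_ratio` are introduced here for illustration, and the ratio is just one possible measure.

```python
# Top two hourly averages copied from the printed top-5 lists.
top_two = {
    "ask comments": (28.68, 16.32),
    "show comments": (6.99, 6.68),
    "ask points": (21.64, 17.93),
    "show points": (20.91, 19.26),
}

# Ratio of the best hour to the runner-up; values near 1.0 mean a flat ranking.
lead_ratio = {name: first / second for name, (first, second) in top_two.items()}

for name, ratio in lead_ratio.items():
    print(f"{name}: top hour is {ratio:.2f}x the runner-up")
```

For both metrics the ask-post ratio is larger than the show-post ratio, which matches the reading above.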


#### Authors with the most points and comments

Let's list the authors whose posts received the most points and the most comments, by building dictionaries that map each author to their total points and total comments.


In [80]:
author_points = {}
author_comments = {}
for row in hn:
    author = row[5]
    point = int(row[3])
    comment = int(row[4])
    if author in author_points:
        author_points[author] += point
    else:
        author_points[author] = point
        
    if author in author_comments:
        author_comments[author] += comment
    else:
        author_comments[author] = comment
        
author_point_list = []
author_comment_list = []
for k, v in author_points.items():
    author_point_list.append([v, k])
    
for k, v in author_comments.items():
    author_comment_list.append([v, k])
    
author_point_list = sorted(author_point_list,reverse = True)
author_comment_list = sorted(author_comment_list,reverse = True)
for k, item in enumerate(author_point_list[:10], 1):
    print(f"{k}. {item[1]} has {item[0]} points.")
    
print("\n")
for k, item in enumerate(author_comment_list[:10],1):
    print(f"{k}. {item[1]} has {item[0]} comments.")

1. ingve has 69465 points.
2. prostoalex has 32510 points.
3. jonbaer has 26157 points.
4. nkurz has 21085 points.
5. adamnemecek has 21071 points.
6. walterbell has 19810 points.
7. dnetesn has 19253 points.
8. jseliger has 17740 points.
9. uptown has 16900 points.
10. DiabloD3 has 15846 points.


1. ingve has 27940 comments.
2. prostoalex has 21572 comments.
3. jseliger has 13911 comments.
4. whoishiring has 12892 comments.
5. nkurz has 11643 comments.
6. walterbell has 9814 comments.
7. jonbaer has 9271 comments.
8. dnetesn has 8650 comments.
9. uptown has 8555 comments.
10. coloneltcb has 8262 comments.
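
The per-author totals can also be built with `collections.Counter`, whose `most_common` method returns the largest totals directly. This is an alternative sketch rather than the cell's own approach; `totals_by_author` and the toy rows are hypothetical, and the column positions follow the header printed earlier.

```python
from collections import Counter


def totals_by_author(rows, value_index, author_index=5):
    """Sum the integer column value_index per author."""
    totals = Counter()
    for row in rows:
        totals[row[author_index]] += int(row[value_index])
    return totals


# Toy rows in the dataset's column order (authors and values are made up).
toy_rows = [
    ["1", "t1", "", "10", "2", "alice", "9/26/2016 2:53"],
    ["2", "t2", "", "5", "1", "bob", "9/26/2016 2:17"],
    ["3", "t3", "", "7", "4", "alice", "9/25/2016 22:57"],
]
points = totals_by_author(toy_rows, value_index=3)
comments = totals_by_author(toy_rows, value_index=4)

for rank, (author, total) in enumerate(points.most_common(10), 1):
    print(f"{rank}. {author} has {total} points.")
```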


#### Results
Author "ingve" tops both lists, with 69465 points, more than double the total of "prostoalex", the runner-up in both lists. In fact, 8 of the top 10 authors by points also appear in the top 10 by comments. One plausible explanation is that posts which draw more comments also tend to earn more points; authors who submit many posts would also rank high on both totals.
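
The overlap between the two top-10 lists printed above can be counted with a set intersection. The author names below are copied from that output; the variable names are introduced for this sketch.

```python
# Top-10 authors copied from the printed lists.
top_by_points = [
    "ingve", "prostoalex", "jonbaer", "nkurz", "adamnemecek",
    "walterbell", "dnetesn", "jseliger", "uptown", "DiabloD3",
]
top_by_comments = [
    "ingve", "prostoalex", "jseliger", "whoishiring", "nkurz",
    "walterbell", "jonbaer", "dnetesn", "uptown", "coloneltcb",
]

# Authors present in both lists, and those unique to each list.
in_both = set(top_by_points) & set(top_by_comments)
only_points = set(top_by_points) - in_both
only_comments = set(top_by_comments) - in_both

print(f"{len(in_both)} authors appear in both lists")
print("only in points list:", sorted(only_points))
print("only in comments list:", sorted(only_comments))
```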