# Exploring Hacker News Posts
In this project, we'll compare two different types of posts from Hacker News, a popular site where technology-related stories (or 'posts') are voted and commented on. The two types of posts we'll explore begin with either Ask HN or Show HN.

Ask HN and Show HN are standard post types on Hacker News: users either ask the community a specific question or share a project, a product, or some other information.

Users submit Ask HN posts to ask the Hacker News community a specific question, such as "What is the best online course you've ever taken?" Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

Part 1

* Which types of posts are most common on Hacker News?
* Do Ask HN or Show HN receive more comments on average?
* Do posts created at a certain time receive more comments on average?

Part 2 (use pandas)

* For all posts, determine whether posts created at a certain time are more likely to receive more points.
* Which users are the most popular on Hacker News?

In [1]:
import datetime as dt
from csv import reader

# Read the file in as a list of lists; the with-block closes the file afterwards.
with open('HN_posts.csv') as open_file:
    hn = list(reader(open_file))

# Extract the first row of data and assign it to the variable hn_header.
hn_header = hn[0]
# Remove the header row from hn.
hn = hn[1:]
# Print the first 2 rows of the dataset.
print(hn[:2])


[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']]
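The rows above are plain lists, so the later cells refer to columns by position (`row[1]` for the title, `row[4]` for the comment count). As a minimal sketch, a name-to-index lookup can make those positions easier to read. The header names used here are an assumption (`num_comments` in particular does not appear anywhere in this notebook), and the sample text is built from the two rows printed above rather than read from `HN_posts.csv`.

```python
import csv
import io

# Two sample rows copied from the output above; the header names are assumed.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579008,You have two days to comment if you want stem cells to be classified as your own,"
    "http://www.regulations.gov/document?D=FDA-2015-D-3719-0018,1,0,altstar,9/26/2016 3:26\n"
    "12579005,SQLAR  the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
)

rows = list(csv.reader(sample_csv))
header, data = rows[0], rows[1:]

# Map each column name to its position so later code can avoid bare indexes.
col = {name: index for index, name in enumerate(header)}

for row in data:
    print(row[col["author"]], int(row[col["num_comments"]]))
```

With this lookup, `row[col["title"]]` reads the same value as `row[1]` in the cells that follow.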


## Part 1

1. Which types of posts are most common on Hacker News?
2. Do Ask HN or Show HN receive more comments on average?
3. Do posts created at a certain time receive more comments on average?

In [2]:
# separate posts with Ask HN and Show HN types from other posts:
#-----------------------------------------------------------------
# create empty lists to store ask, show and other posts:
ask_posts=[]
show_posts=[]
other_posts=[]

for row in hn:
    # Assign the title of each post to a variable called title:
    title = row[1]
    # If the title starts with 'Ask HN' (case-insensitive), add the row to ask_posts:
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    # If the title starts with 'Show HN' (case-insensitive), add the row to show_posts:
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    # Otherwise, add the row to other_posts:
    else:
        other_posts.append(row)


print ("Number of 'Ask HN' posted:",len(ask_posts))
print ("Number of 'Show HN' posted:",len(show_posts))
print ("Number of 'Other' posted:",len(other_posts))


Number of 'Ask HN' posted: 9139
Number of 'Show HN' posted: 10158
Number of 'Other' posted: 273822


### Conclusions about the number of Ask HN and Show HN posts on Hacker News:
1. The dataset contains 9139 Ask HN posts and 10158 Show HN posts.
2. About 3.1% of the dataset is Ask HN posts (9139 out of 293119).
3. About 3.5% of the dataset is Show HN posts (10158 out of 293119).
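The shares quoted above can be checked with a few lines of arithmetic. This is a small sketch that uses only the three counts printed by the previous cell; the variable names are illustrative.

```python
# Counts copied from the output above.
counts = {"Ask HN": 9139, "Show HN": 10158, "Other": 273822}

# Total number of posts across the three categories.
total = sum(counts.values())

# Share of each post type, as a percentage of all posts.
shares = {kind: 100 * n / total for kind, n in counts.items()}

print(total)
for kind, share in shares.items():
    print(f"{kind}: {share:.1f}%")
```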

## Next, let's determine if ask posts or show posts receive more comments on average.

In [3]:
#comments on average for 'Ask HN'
total_ask_comments=0

for post in ask_posts:
    total_ask_comments+=int(post[4])
avg_ask_comments=total_ask_comments/len(ask_posts)

print(avg_ask_comments)

10.393478498741656


In [4]:
#comments on average for 'Show HN'
total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

4.886099625910612


In [5]:
#comments on average for 'Other'
total_other_comments=0

for post in other_posts:
    total_other_comments+=int(post[4])
avg_other_comments=total_other_comments/len(other_posts)
print(avg_other_comments)

6.4572678601427205


### Conclusions about comments on Ask HN and Show HN posts:
On average, Ask HN posts in our sample receive about 10.39 comments, compared with about 4.89 for Show HN posts and 6.46 for other posts. Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.
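The three averaging cells above repeat the same loop. One possible refactoring, shown here only as a sketch, is a small helper; the function name `average_comments` and the sample rows are hypothetical, and the comment count is assumed to sit at index 4 as in the cells above.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts stored at comments_index."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty category
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical rows in the same column order as the dataset.
sample_posts = [
    ["1", "Ask HN: Example question?", "", "3", "10", "user_a", "9/26/2016 3:26"],
    ["2", "Ask HN: Another question?", "", "1", "4", "user_b", "9/26/2016 3:24"],
]

print(average_comments(sample_posts))  # (10 + 4) / 2 = 7.0
```

In the notebook, the same helper could be called once each for `ask_posts`, `show_posts`, and `other_posts`.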



## Find the best time to post
We'll determine if ask posts created at a certain time are more likely to attract comments.

In [6]:
#convert date to datetime
dateformat='%m/%d/%Y %H:%M'
for row in ask_posts:
    date_clean=dt.datetime.strptime(row[-1],dateformat)
    row[-1]=date_clean


In [7]:
# Store the hour (as a two-digit string) and the comment count for each post.
date_comment_only=[]
for row in ask_posts:
    # append takes a single argument, so each [hour, comments] pair is wrapped in one list.
    date_comment_only.append([row[-1].strftime('%H'),int(row[4])])

count_by_hour={}
comment_by_hour={}
for row in date_comment_only:
    time=row[0]
    comment=row[1]
    if time not in count_by_hour:
        count_by_hour[time]=1
        comment_by_hour[time]=comment
    else:
        count_by_hour[time]+=1
        comment_by_hour[time]+=comment      
        
print(comment_by_hour)
print(count_by_hour)

{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
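The same per-hour tallies can be built without the `if`/`else` branch by using `collections.defaultdict`. The following is a minimal sketch on hypothetical `(created_at, num_comments)` pairs written in the dataset's date format; it is an alternative to the cell above, not the code that produced the output.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical (created_at, num_comments) pairs in the dataset's date format.
sample = [
    ("9/26/2016 3:26", "4"),
    ("9/25/2016 3:05", "6"),
    ("9/26/2016 15:10", "20"),
]

# Missing keys start at 0, so no membership check is needed.
count_by_hour = defaultdict(int)
comment_by_hour = defaultdict(int)

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    count_by_hour[hour] += 1
    comment_by_hour[hour] += int(num_comments)

print(dict(count_by_hour))
print(dict(comment_by_hour))
```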


In [8]:
#Average Comments Number for Ask HN Posts by Hour
avg_comment_by_hour={}
for hour in comment_by_hour:
    avg_comment_by_hour[hour]=round(comment_by_hour[hour]/count_by_hour[hour],2)

#Sort based on dict value
sorted_avg_comment_by_hour = sorted(avg_comment_by_hour.items(), key=lambda kv: kv[1],reverse=True)
sorted_avg_comment_by_hour

[('15', 28.68),
 ('13', 16.32),
 ('12', 12.38),
 ('02', 11.14),
 ('10', 10.68),
 ('04', 9.71),
 ('14', 9.69),
 ('17', 9.45),
 ('08', 9.19),
 ('11', 8.96),
 ('22', 8.8),
 ('05', 8.79),
 ('20', 8.75),
 ('21', 8.69),
 ('03', 7.95),
 ('18', 7.94),
 ('16', 7.71),
 ('00', 7.56),
 ('01', 7.41),
 ('19', 7.16),
 ('07', 7.01),
 ('06', 6.78),
 ('23', 6.7),
 ('09', 6.65)]

In [14]:
# Show the top 5 hours with the most comments per post
print("Top 5 Hours for 'Ask HN' Comments")
for hr,comment in sorted_avg_comment_by_hour[:5]:
    print('{}: {} average comment per post'.format(dt.datetime.strptime(hr,'%H').strftime('%H:%M'),comment))


Top 5 Hours for 'Ask HN' Comments
15:00: 28.68 average comment per post
13:00: 16.32 average comment per post
12:00: 12.38 average comment per post
02:00: 11.14 average comment per post
10:00: 10.68 average comment per post


## Part 1 Conclusion: 
The hour that receives the most comments per post on average is 15:00, with an average of 28.68 comments per post. That is roughly 76% more than the second-highest hour, 13:00, which averages 16.32 comments per post.

In this project, we analyzed ask posts and show posts to determine which type of post and which posting time receive the most comments on average. Based on the analysis, to maximize the number of comments a post receives, we'd recommend posting it as an Ask HN post between 15:00 and 16:00.
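The gap between the top two hours can be verified directly from the averages printed earlier. This short sketch uses only those two values; the variable names are illustrative.

```python
# Average comments per post for the top two hours, from the output above.
top_hour_avg = 28.68     # 15:00
second_hour_avg = 16.32  # 13:00

# Relative increase of the top hour over the second-highest hour.
relative_increase = (top_hour_avg - second_hour_avg) / second_hour_avg

print(f"{relative_increase:.1%}")
```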

## Part 2 (use pandas)

1. For all posts, determine whether posts created at a certain time are more likely to receive more points.
2. Which users are the most popular on Hacker News?

In [17]:
#load in libraries
import pandas as pd
import re
%matplotlib inline

#read in the data set and convert the date
df_hn = pd.read_csv('HN_posts.csv',parse_dates=['created_at'],index_col=[0])

### Top Ten Posts 

Apple's letter to its customers about the US government's request to break into the iPhone received the most upvotes, followed by a BBC article about the UK voting to leave the EU.

In [28]:
df_hn[['title','url','num_points','created_at']].sort_values(by='num_points',ascending=False)[0:10]

| id | title | url | num_points | created_at |
|---|---|---|---|---|
| 11116274 | A Message to Our Customers | http://www.apple.com/customer-letter/ | 5771 | 2016-02-17 08:38:00 |
| 11966167 | UK votes to leave EU | http://www.bbc.co.uk/news/uk-politics-36615028 | 3125 | 2016-06-24 03:48:00 |
| 12494998 | Pardon Snowden | https://www.pardonsnowden.org/ | 2553 | 2016-09-14 08:31:00 |
| 12073675 | Tell HN: New features and a moderator | | 2381 | 2016-07-11 19:34:00 |
| 11390545 | Ubuntu on Windows | http://blog.dustinkirkland.com/2016/03/ubuntu-... | 2049 | 2016-03-30 16:35:00 |
| 11893153 | Microsoft to acquire LinkedIn for $26B | http://news.microsoft.com/2016/06/13/microsoft... | 2049 | 2016-06-13 12:34:00 |
| 11080701 | Physicists Detect Gravitational Waves, Proving... | http://www.nytimes.com/2016/02/12/science/ligo... | 2011 | 2016-02-11 15:37:00 |
| 10226196 | 14-Year-Old Boy Arrested for Bringing Homemade... | http://techcrunch.com/2015/09/16/14-year-old-b... | 1952 | 2015-09-16 13:00:00 |
| 10982340 | Request For Research: Basic Income | https://blog.ycombinator.com/basic-income | 1876 | 2016-01-27 19:23:00 |
| 12136578 | Why Im Suing the US Government | https://www.bunniestudios.com/blog/?p=4782 | 1855 | 2016-07-21 13:10:00 |


### Best time to post (for all posts)

In [27]:
df_hn['hour'] = df_hn['created_at'].dt.hour
df_groupby = df_hn.groupby(by='hour')
df_groupby['num_points'].mean().sort_values(ascending=False)
# Ideally, outliers should be stripped out before analyzing the impact of hour of day or day of week.


hour
12    16.785927
2     16.406170
11    16.192910
13    16.109430
0     15.879906
1     15.555303
4     15.403210
5     15.375918
19    15.362623
18    15.279771
10    15.034617
3     15.010244
17    14.987266
8     14.941080
15    14.757951
6     14.750407
7     14.740000
21    14.580325
16    14.509668
23    14.504527
9     14.499006
22    14.127970
14    14.051935
20    13.607835
Name: num_points, dtype: float64

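So posts created at 12:00 have the highest average number of points (16.79), which makes midday the best time to post. The differences between hours are small, though (13.61 to 16.79), and 02:00 ranks second.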
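The code comment in the cell above notes that outliers should be handled before comparing hours. One common option, which the notebook itself does not apply, is to compare medians instead of means. The sketch below uses hypothetical point totals for a single hour (only the 5771 value echoes the top post in the table above) to show how one viral post shifts the mean but not the median.

```python
from statistics import mean, median

# Hypothetical point totals for posts created in one hour; one post went viral.
points = [1, 2, 2, 3, 4, 5, 5771]

hourly_mean = mean(points)      # pulled far upward by the single 5771-point post
hourly_median = median(points)  # the middle value, barely affected by the outlier

print(hourly_mean)
print(hourly_median)
```

In pandas, the equivalent change would be to call `.median()` instead of `.mean()` on the grouped `num_points` column.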

In [37]:
df_hn['day_of_week'] = df_hn['created_at'].dt.dayofweek
df_groupby = df_hn.groupby(by='day_of_week')
df_groupby['num_points'].mean().sort_values(ascending=False)
#Monday is 0 and Sunday is 6

day_of_week
6    17.752834
5    17.331082
0    15.408457
3    14.525682
2    14.435828
4    14.372634
1    13.856638
Name: num_points, dtype: float64

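So the weekend is the best time to post: Sunday (17.75) and Saturday (17.33) have the highest average points, while weekdays range from 13.86 to 15.41.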
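The day indexes in the output are easier to read as names. As a small sketch, the standard library's `calendar.day_name` uses the same Monday-is-0 convention, so it can label the averages copied from the output above; the dictionary name is illustrative.

```python
import calendar

# Mean points per post by day index, copied from the output above.
avg_points_by_day = {
    6: 17.752834, 5: 17.331082, 0: 15.408457, 3: 14.525682,
    2: 14.435828, 4: 14.372634, 1: 13.856638,
}

# calendar.day_name follows the same convention as the output: Monday is 0.
ranked = sorted(avg_points_by_day.items(), key=lambda kv: kv[1], reverse=True)
for day_index, avg in ranked:
    print(f"{calendar.day_name[day_index]}: {avg:.2f}")
```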

### Popular Authors

In [40]:
# Top 10 users whose posts attract the most upvotes in total
df_groupby = df_hn.groupby(by='author')
df_groupby['num_points'].sum().sort_values(ascending=False)[:10]


author
ingve          69465
prostoalex     32510
jonbaer        26157
nkurz          21085
adamnemecek    21071
walterbell     19810
dnetesn        19253
jseliger       17740
uptown         16900
DiabloD3       15846
Name: num_points, dtype: int64

In [41]:
# Top 10 users who attract the most upvotes per post on average (among those with 10 or more posts)
df_groupby['num_points'].mean()[df_groupby['num_points'].count() > 9].sort_values(ascending=False)[0:10]

author
sama           425.200000
epaga          284.954545
dang           261.166667
whoishiring    207.694444
erlend_sh      192.812500
firloop        192.588235
urs2102        183.705882
MarcScott      180.090909
platz          173.636364
potshot        166.800000
Name: num_points, dtype: float64

## Part 2 Conclusion:

* The best time to post is around midday and on the weekend, although the differences between hours are small.
* ingve has attracted the most upvotes in total.
* sama attracts the most upvotes per post on average (among users with 10 or more posts).
