# Hacker News

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

In [122]:
import numpy as np
import pandas as pd
hn=pd.read_csv('hacker_news.csv')
hn.head(5)

| | id | title | url | num_points | num_comments | author | created_at |
|---|---|---|---|---|---|---|---|
| 0 | 12224879 | Interactive Dynamic Video | http://www.interactivedynamicvideo.com/ | 386 | 52 | ne0phyte | 8/4/2016 11:52 |
| 1 | 10975351 | How to Use Open Source and Shut the Fuck Up at... | http://hueniverse.com/2016/01/26/how-to-use-op... | 39 | 10 | josep2 | 1/26/2016 19:30 |
| 2 | 11964716 | Florida DJs May Face Felony for April Fools' W... | http://www.thewire.com/entertainment/2013/04/f... | 2 | 1 | vezycash | 6/23/2016 22:20 |
| 3 | 11919867 | Technology ventures: From Idea to Enterprise | https://www.amazon.com/Technology-Ventures-Ent... | 3 | 1 | hswarna | 6/17/2016 0:01 |
| 4 | 10301696 | Note by Note: The Making of Steinway L1037 (2007) | http://www.nytimes.com/2007/11/07/movies/07ste... | 8 | 2 | walterbell | 9/30/2015 4:12 |


In [123]:
ask_posts=hn[hn['title'].str.lower().str.startswith('ask hn')]['title']
ask_posts

7              Ask HN: How to improve my personal website?
17       Ask HN: Am I the only one outraged by Twitter ...
22       Ask HN: Aby recent changes to CSS that broke m...
30         Ask HN: Looking for Employee #3 How do I do it?
31       Ask HN: Someone offered to buy my browser exte...
                               ...                        
20039    Ask HN: Is it feasible to port Apple's Swift t...
20042       Ask HN: What to do when a developer goes dark?
20045                           Ask HN: Killer app for AR?
20048    Ask HN: How do you balance a serious relations...
20061      Ask HN: Why are papers still published as PDFs?
Name: title, Length: 1744, dtype: object

In [124]:
show_posts=hn[hn['title'].str.lower().str.startswith('show hn')]['title']
show_posts

13       Show HN: Wio Link  ESP8266 Based Web of Things...
39                     Show HN: Something pointless I made
46       Show HN: Shanhu.io, a programming playground p...
84       Show HN: Webscope  Easy way for web developers...
97       Show HN: GeoScreenshot  Easily test Geo-IP bas...
                               ...                        
19993    Show HN: Geocoding API built with government o...
19999    Show HN: Decorating: Animated pulsed for your ...
20014                             Show HN: Idea to startup
20065       Show HN: PhantomJsCloud, Headless Browser SaaS
20070    Show HN: Parse recipe ingredients using JavaSc...
Name: title, Length: 1162, dtype: object

In [125]:
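# bool1 marks Show HN posts, bool2 marks Ask HN posts (case-insensitive title prefix)
bool1=hn['title'].str.lower().str.startswith('show hn')
bool2=hn['title'].str.lower().str.startswith('ask hn')
# bool3 marks posts that are either Show HN or Ask HN; bool_not marks everything else
bool3=bool1 | bool2
bool_not=~bool3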

In [126]:
other_posts=hn[bool_not]['title']

In [127]:
total_asks=len(ask_posts)
total_show=len(show_posts)
total_others=len(other_posts)
print("total ask posts " +str(total_asks))
print("total show posts " +str(total_show))
print("total others posts " +str(total_others))

total ask posts 1744
total show posts 1162
total others posts 17194


In [128]:
# Average number of comments per post for Show HN, Ask HN, and all other posts
avg_show_comments=(hn[bool1]['num_comments'].sum())/total_show
avg_ask_comments=(hn[bool2]['num_comments'].sum())/total_asks
avg_other_comments=(hn[bool_not]['num_comments'].sum())/total_others

In [129]:
print("Avg show comments: "+str(avg_show_comments))
print("Avg ask comments: "+str(avg_ask_comments))
print("Avg other comments: "+str(avg_other_comments))


Avg show comments: 10.31669535283993
Avg ask comments: 14.038417431192661
Avg other comments: 26.8730371059672


**On average, Ask HN posts receive more comments than Show HN posts**
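The comparison above comes down to labelling each title by its prefix and averaging `num_comments` within each label. Here is a minimal sketch of that idea in plain Python. The rows in `sample_posts` and the helper names `categorize` and `average_comments_by_category` are hypothetical and only for illustration; they are not taken from `hacker_news.csv` or from the cells above.

```python
from collections import defaultdict

# Hypothetical (title, num_comments) rows; not real rows from hacker_news.csv.
sample_posts = [
    ("Ask HN: How do you learn?", 12),
    ("ask hn: best editor?", 8),
    ("Show HN: My side project", 4),
    ("Interesting article about compilers", 30),
]

def categorize(title):
    """Return 'ask', 'show', or 'other' based on the lowercased title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

def average_comments_by_category(posts):
    """Sum comments and count posts per category, then divide."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for title, num_comments in posts:
        category = categorize(title)
        totals[category] += num_comments
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}

averages = average_comments_by_category(sample_posts)
print(averages)
```

Lowercasing before the prefix check mirrors the `.str.lower().str.startswith(...)` masks used in the notebook, so titles such as "ask hn: ..." are counted as ask posts too.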

We'll determine whether ask posts created at a certain time of day are more likely to attract comments, using the following steps:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments those posts received.
2. Calculate the average number of comments ask posts receive by hour created.
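The two steps above can be sketched without pandas to make the bookkeeping explicit. The sample pairs below are hypothetical; only the timestamp format (`%m/%d/%Y %H:%M`) is taken from the dataset's `created_at` column. This is an illustration of the idea, not the notebook's actual implementation, which uses `groupby`.

```python
from collections import Counter
from datetime import datetime

# Hypothetical (created_at, num_comments) pairs in the dataset's timestamp format.
sample_ask_posts = [
    ("8/16/2016 9:55", 6),
    ("1/29/2016 9:42", 4),
    ("8/2/2016 14:20", 10),
]

# Step 1: count posts and total comments for each hour of the day.
posts_by_hour = Counter()
comments_by_hour = Counter()
for created_at, num_comments in sample_ask_posts:
    hour = datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    posts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

# Step 2: average comments per post for each hour.
avg_by_hour = {
    hour: comments_by_hour[hour] / posts_by_hour[hour] for hour in posts_by_hour
}

# List hours from highest to lowest average.
for hour, avg in sorted(avg_by_hour.items(), key=lambda item: item[1], reverse=True):
    print(f"{hour}:00  {avg:.2f} average comments")
```

Keeping the hour as a zero-padded string (`"09"` rather than `9`) matches the `strftime('%H')` output used later in the notebook.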

In [130]:
hn[bool2]['created_at']

7          8/16/2016 9:55
17       11/22/2015 13:43
22         5/2/2016 10:14
30         8/2/2016 14:20
31       10/15/2015 16:38
               ...       
20039      1/29/2016 9:42
20042      8/7/2016 12:58
20045       7/4/2016 8:50
20048       3/5/2016 1:25
20061      5/21/2016 9:22
Name: created_at, Length: 1744, dtype: object

In [131]:
# Work on an explicit copy of the Ask HN rows so later column assignments do not touch a slice of hn
ask_df=hn[bool2].copy()
created_at=ask_df['created_at']
# Parse timestamps such as '8/16/2016 9:55' and keep the zero-padded hour as a string
created_at_dttime=pd.to_datetime(created_at, format='%m/%d/%Y %H:%M')
created_at_hr=created_at_dttime.dt.strftime('%H')
created_at_hr

7        09
17       13
22       10
30       14
31       16
         ..
20039    09
20042    12
20045    08
20048    01
20061    09
Name: created_at, Length: 1744, dtype: object

**Number of comments each hour for Ask HN**

In [133]:
# Attach the hour column, then total the comments on Ask HN posts for each hour
ask_df=ask_df.assign(created_hr=created_at_hr)
grouped_df=ask_df.groupby(['created_hr'])['num_comments'].sum()
grouped_df

created_hr
00     447
01     683
02    1381
03     421
04     337
05     464
06     397
07     267
08     492
09     251
10     793
11     641
12     687
13    1253
14    1416
15    4477
16    1814
17    1146
18    1439
19    1188
20    1722
21    1745
22     479
23     543
Name: num_comments, dtype: int64

**Number of posts each hour for Ask HN**

In [137]:
# created_hr was added in the previous cell; count the Ask HN posts created in each hour
grouped_df=ask_df.groupby(['created_hr'])['id'].count()
grouped_df

created_hr
00     55
01     60
02     58
03     54
04     47
05     46
06     44
07     34
08     48
09     45
10     59
11     58
12     73
13     85
14    107
15    116
16    108
17    100
18    109
19    110
20     80
21    109
22     71
23     68
Name: id, dtype: int64

**Average number of comments per Ask HN post, by hour created**

In [142]:
grouped_df=ask_df.groupby(['created_hr'])['num_comments'].mean()
grouped_df

created_hr
00     8.127273
01    11.383333
02    23.810345
03     7.796296
04     7.170213
05    10.086957
06     9.022727
07     7.852941
08    10.250000
09     5.577778
10    13.440678
11    11.051724
12     9.410959
13    14.741176
14    13.233645
15    38.594828
16    16.796296
17    11.460000
18    13.201835
19    10.800000
20    21.525000
21    16.009174
22     6.746479
23     7.985294
Name: num_comments, dtype: float64
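As a sanity check, each average in this table should equal the hour's total comments divided by its post count from the two earlier tables. The short sketch below redoes that division for three hours, using totals and counts copied from those outputs; the dictionary names are only illustrative.

```python
# Totals and post counts copied from the per-hour outputs earlier in this notebook.
total_comments = {"15": 4477, "02": 1381, "20": 1722}
post_counts = {"15": 116, "02": 58, "20": 80}

# Average comments per post = total comments / number of posts.
recomputed = {
    hour: round(total_comments[hour] / post_counts[hour], 6)
    for hour in total_comments
}

for hour, avg in recomputed.items():
    print(hour, avg)
```

The recomputed values (38.594828, 23.810345, and 21.525) agree with the `mean()` output for hours 15, 02, and 20.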

In [148]:
ask_df.groupby(['created_hr'])['num_comments'].mean().sort_values(ascending=False).head(5)

created_hr
15    38.594828
02    23.810345
20    21.525000
16    16.796296
21    16.009174
Name: num_comments, dtype: float64

Thus, the five hours with the highest average number of comments per Ask HN post are 15, 02, 20, 16, and 21.

In this project, we analyzed ask posts and show posts to determine which type of post and which posting time receive the most comments on average. Based on this analysis, to maximize the number of comments a post receives, we'd recommend categorizing it as an ask post rather than a show post and creating it between 15:00 and 16:00 (3:00 pm to 4:00 pm EST).
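Readers in other time zones may want to translate that window. The sketch below converts the start of the recommended hour to UTC. It assumes the timestamps are in Eastern Standard Time, as stated above, and models EST as a fixed UTC-5 offset, which ignores daylight saving time. The calendar date is arbitrary and only there to build a valid `datetime`.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EST modelled as a fixed UTC-5 offset (no daylight saving adjustment).
EST = timezone(timedelta(hours=-5), "EST")

# Start of the recommended window; the date itself is arbitrary.
start_est = datetime(2016, 1, 4, 15, 0, tzinfo=EST)
start_utc = start_est.astimezone(timezone.utc)

print(start_est.strftime("%H:%M %Z"), "->", start_utc.strftime("%H:%M %Z"))
```

Under this assumption, 15:00 EST corresponds to 20:00 UTC.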