# Exploring Hackers News Posts

이 프로젝트는 [Hacker News](https://www.kaggle.com/hacker-news/hacker-news-posts)의 데이터 셋을 이용해 진행됩니다.
Ask HN posts 와 Show HN posts를 비교해보는 작업을 해볼 것입니다.

- Ask HN 또는 Show HN이 평균 이상의 Comments를 받았는지?
- 특정 시간에 만들어진 Posts가 평균 이상의 Comments를 받았는지?

# Introduction

먼저 데이터를 읽고, header를 삭제합니다.

In [1]:
#Read in the data
import csv

opened_file = open('hacker_news.csv')
hn = list(csv.reader(opened_file))
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16']]

# Removing Headers

In [4]:
# Remove the headers
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
[['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14'], ['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13'], ['12578954', "Macalifa  A new open-source music app for UWP that won't suck", 'http://forums.windowscentral.com/windows-phone-apps/440523-m

# Extracting Ask HN and Show HN Posts

*Ask HN* 또는 *Show HN* 으로 시작하는 posts 들을 서로 다른 list 안으로 분류할 것이다.

In [8]:
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(post)
    elif title.lower().startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
#print(ask_posts[:5])
#print(show_posts[:5])
#print(other_posts[:5])

9139
10158
273820


# Calculating the Average Number of Comments for Ask HN and Show HN Posts

*Ask HN* 과 *Show HN*의 평균 comments 갯수를 구해보자


In [9]:
total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

10.393478498741656


In [10]:
total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

4.886099625910612


평균적으로 *ask posts*는 약10개 , *show posts*는 약 5개 정도의 comments를 받는 것을 알 수 있고, 일반적으로 *ask posts*가 더 많은 comments를 받는다

# Finding the Amount of Ask Posts and Comments by Hour Created

특정시간에 *ask posts*의 comments 숫자를 maximzie 시킬 수 있는지 알아볼 것이다.
먼저, 하루동안 각각의 시간동안 *ask posts* 가 몇개 생성되는지, 그리고 거기에 대한 comments 수는 몇개 달리는지 파악할 것이다. 그 다음에 매 시간에 생성된 *ask posts* 에 대한 평균적인 comments 수를 계산할 것이다


In [12]:
import datetime as dt

result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])]) # 6번재 column: post 생성 날짜 , 4번째 colum: comments 갯수

comments_by_hour = {}
counts_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    n_comment = row[1]
    hour = dt.datetime.strptime(date, date_format).strftime("%H")
    
    if hour in counts_by_hour:
        comments_by_hour[hour] += n_comment
        counts_by_hour[hour] += 1
    else:
        comments_by_hour[hour] = n_comment
        counts_by_hour[hour] = 1

In [13]:
comments_by_hour

{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '23': 2297,
 '20': 4462,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}

In [14]:
counts_by_hour

{'02': 269,
 '01': 282,
 '22': 383,
 '21': 518,
 '19': 552,
 '17': 587,
 '15': 646,
 '14': 513,
 '13': 444,
 '11': 312,
 '10': 282,
 '09': 222,
 '07': 226,
 '03': 271,
 '23': 343,
 '20': 510,
 '16': 579,
 '08': 257,
 '00': 301,
 '18': 614,
 '12': 342,
 '04': 243,
 '06': 234,
 '05': 209}

# Calculating the Average Number of Comments for Ask HN Posts by Hour

In [19]:
avg_comments_by_hour = []

for hour in comments_by_hour:
    avg_comments = comments_by_hour[hour] / counts_by_hour[hour]
    avg_comments_by_hour.append([avg_comments, hour])

sorted_avg_comments_by_hour = sorted(avg_comments_by_hour, reverse = True)
sorted_avg_comments_by_hour

[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10'],
 [9.7119341563786, '04'],
 [9.692007797270955, '14'],
 [9.449744463373083, '17'],
 [9.190661478599221, '08'],
 [8.96474358974359, '11'],
 [8.804177545691905, '22'],
 [8.794258373205741, '05'],
 [8.749019607843136, '20'],
 [8.687258687258687, '21'],
 [7.948339483394834, '03'],
 [7.94299674267101, '18'],
 [7.713298791018998, '16'],
 [7.5647840531561465, '00'],
 [7.407801418439717, '01'],
 [7.163043478260869, '19'],
 [7.013274336283186, '07'],
 [6.782051282051282, '06'],
 [6.696793002915452, '23'],
 [6.653153153153153, '09']]

In [21]:
# 상위 평균 comments 수를 가진 시간 타임 5개 출력

print("Top 5Hours for 'Ask HN' Comments")
for avg_n_comments, hour in sorted_avg_comments_by_hour[:5]:
    print("{} => {:.2f} average comments per post".format(dt.datetime.strptime(hour, "%H").strftime("%H:%M"), avg_n_comments))

Top 5Hours for 'Ask HN' Comments
15:00 => 28.68 average comments per post
13:00 => 16.32 average comments per post
12:00 => 12.38 average comments per post
02:00 => 11.14 average comments per post
10:00 => 10.68 average comments per post


평균적으로 오후 3시에 가장 높은 comments수를 받을 수 있음 (평균 28.68개의 comments per posts)


# Conclusion

이번 프로젝트에서,*ask posts* 와 *show posts* 중에서 어떤 타입의 posts가 평균적으로 comments 갯수를 많이 받는지 알아보았다. 오후 3시~오후4시 사이에 *ask posts*를 할때 제일 많은 평균 comments 갯수를 받을 수 있었다. 