# Exploring Hacker News Posts

### About:
#### Analyse the dataset of Hacker News (https://news.ycombinator.com/) posts

### Goal:
#### Determine the best time and topic for a post to maximize its popularity

#### Import required libs

In [24]:
from csv import reader
import datetime as dt

#### Open and read file into list

In [2]:
# Use a context manager so the file is closed once it has been read
with open("HN_posts_year_to_Sep_26_2016.csv") as opened_file:
    hn = list(reader(opened_file))

In [4]:
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16']]

In [5]:
# extract header
header = hn[0]
hn = hn[1:]

In [7]:
hn[:5]

[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14']]

#### Separate posts into categories: Ask posts, Show posts and other

In [15]:
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif title.lower().startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822
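
The prefix check above can also be written as a small reusable function. This is only a sketch: the name `categorize` and the sample titles are made up for illustration and are not part of the original notebook.

```python
def categorize(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical titles for illustration only (not rows from the dataset).
sample_titles = [
    "Ask HN: What are you reading?",
    "SHOW HN: My weekend project",
    "algorithmic music",
]
categories = [categorize(t) for t in sample_titles]
print(categories)
```

Keeping the rule in one place makes it easier to adjust later, for example if the prefix should be required to end with a colon.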


#### Check whether Ask or Show posts are more popular

In [19]:
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

10.393478498741656


In [20]:
total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

4.886099625910612
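
The two averaging loops above differ only in the list they iterate over, so they could share one helper. The sketch below assumes the same column layout as the dataset (comment count at index 4); the function name `average_comments` and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Average the comment counts of `posts`; return 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ["1", "Ask HN: example A", "", "3", "10", "user_a", "9/26/2016 3:26"],
    ["2", "Ask HN: example B", "", "1", "4", "user_b", "9/26/2016 3:24"],
]
print(average_comments(sample_posts))
```

The empty-list guard is an added design choice; the original cells would raise `ZeroDivisionError` if a category had no posts.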


The audience seems more likely to answer questions than to discuss someone's ideas and projects: Ask posts average about 10.4 comments, while Show posts average about 4.9.

#### Check whether there is a relation between a post's creation time and its popularity

In [28]:
result_list = []
for post in ask_posts:
    created_at = post[6]
    comm_num = int(post[4])
    result_list.append([created_at, comm_num])

In [30]:
counts_by_hour = {}
comments_by_hour = {}
for post in result_list:
    date = dt.datetime.strptime(post[0], "%m/%d/%Y %H:%M")
    if date.hour not in counts_by_hour:
        counts_by_hour[date.hour] = 1
        comments_by_hour[date.hour] = post[1]
    else:
        counts_by_hour[date.hour] += 1
        comments_by_hour[date.hour] += post[1]
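
The same per-hour aggregation can be expressed without the `if`/`else` branch by using `collections.defaultdict`. This is a sketch of an alternative, not the notebook's own code; `aggregate_by_hour` and the sample rows are assumptions made for illustration.

```python
import datetime as dt
from collections import defaultdict


def aggregate_by_hour(rows, date_format="%m/%d/%Y %H:%M"):
    """Return (post counts, comment totals) keyed by the hour of creation."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, date_format).hour
        counts[hour] += 1
        comments[hour] += num_comments
    # Convert back to plain dicts so missing hours raise KeyError as usual.
    return dict(counts), dict(comments)


# Hypothetical [created_at, comment count] pairs, same shape as result_list.
sample_rows = [
    ["9/26/2016 3:26", 2],
    ["9/25/2016 3:05", 4],
    ["9/25/2016 15:40", 6],
]
counts, comments = aggregate_by_hour(sample_rows)
print(counts)
print(comments)
```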

In [59]:
def display_avg_post_by_hour(counts_by_hour, comments_by_hour):
    # Return [hour, average comments per post] pairs for every hour seen
    avg_pairs = []
    for h in counts_by_hour:
        avg = comments_by_hour[h] / counts_by_hour[h]
        avg_pairs.append([h, avg])
    return avg_pairs

In [62]:
# average number of comments per post for each hour
avg_by_hour = display_avg_post_by_hour(counts_by_hour, comments_by_hour)

In [64]:
avg_by_hour

[[2, 11.137546468401487],
 [1, 7.407801418439717],
 [22, 8.804177545691905],
 [21, 8.687258687258687],
 [19, 7.163043478260869],
 [17, 9.449744463373083],
 [15, 28.676470588235293],
 [14, 9.692007797270955],
 [13, 16.31756756756757],
 [11, 8.96474358974359],
 [10, 10.684397163120567],
 [9, 6.653153153153153],
 [7, 7.013274336283186],
 [3, 7.948339483394834],
 [23, 6.696793002915452],
 [20, 8.749019607843136],
 [16, 7.713298791018998],
 [8, 9.190661478599221],
 [0, 7.5647840531561465],
 [18, 7.94299674267101],
 [12, 12.380116959064328],
 [4, 9.7119341563786],
 [6, 6.782051282051282],
 [5, 8.794258373205741]]

#### Format the list of average comments per post by hour: sort it and print it in a readable format

In [68]:
swap_avg_by_hour = []
for h in avg_by_hour:
    swap_avg_by_hour.append([h[1],h[0]])

In [71]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [76]:
# top 5 hours for Ask post comments
print(sorted_swap[:5])

[[28.676470588235293, 15], [16.31756756756757, 13], [12.380116959064328, 12], [11.137546468401487, 2], [10.684397163120567, 10]]
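
Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` accepts a `key` function, which sorts the `[hour, average]` pairs directly. The sample values below are rounded from the output shown earlier and the variable names are hypothetical.

```python
# [hour, average comments] pairs, rounded from the notebook output for brevity.
sample_avg_by_hour = [[2, 11.14], [15, 28.68], [13, 16.32], [9, 6.65]]

# Sort by the second element (the average), highest first, without swapping.
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)
print(top_hours[:3])
```

With this approach each pair keeps its `[hour, average]` order, so later code does not need to remember which index holds which value.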


In [90]:
out_format = "{hr:%H}:00: {num:.2f} average comments per post"
for h in sorted_swap:
    hour = dt.datetime.strptime(str(h[1]), "%H")
    print(out_format.format(hr=hour, num=h[0]))

15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
04:00: 9.71 average comments per post
14:00: 9.69 average comments per post
17:00: 9.45 average comments per post
08:00: 9.19 average comments per post
11:00: 8.96 average comments per post
22:00: 8.80 average comments per post
05:00: 8.79 average comments per post
20:00: 8.75 average comments per post
21:00: 8.69 average comments per post
03:00: 7.95 average comments per post
18:00: 7.94 average comments per post
16:00: 7.71 average comments per post
00:00: 7.56 average comments per post
01:00: 7.41 average comments per post
19:00: 7.16 average comments per post
07:00: 7.01 average comments per post
06:00: 6.78 average comments per post
23:00: 6.70 average comments per post
09:00: 6.65 average comments per post
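
Since the hour is already an integer, the same lines can be produced without a round trip through `strptime`. The following is a minimal sketch using an f-string with zero padding; `format_hour_line` is a hypothetical helper name.

```python
def format_hour_line(hour, avg):
    """Format one result line, zero-padding the hour to two digits."""
    return f"{hour:02d}:00: {avg:.2f} average comments per post"


# Values taken from the averages computed earlier in the notebook.
print(format_hour_line(15, 28.676470588235293))
print(format_hour_line(2, 11.137546468401487))
```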


### Conclusion
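According to this analysis:
 - The best topic for a post is "Ask HN": Ask posts average about 10.4 comments, versus about 4.9 for Show posts
 - The best time to create an Ask post is 15:00, with about 28.7 average comments per post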