# Exploring Hacker News Posts

[Hacker News](https://news.ycombinator.com/) is a site similar in format to Reddit. Users submit posts, which other users vote on to indicate their popularity. Users can also comment on posts, and those comments receive votes of their own. Hacker News focuses on the tech industry, whereas Reddit covers a wide range of topics divided into subsections called "subreddits." Our goal for this project is to determine whether "Ask HN" or "Show HN" posts receive more comments on average, and whether the time of posting has an effect on the number of comments.

We'll start by reading in data from a [Hacker News dataset](https://www.kaggle.com/hacker-news/hacker-news-posts) found on Kaggle. 

In [6]:
import csv

with open('HN_posts_year_to_Sep_26_2016.csv', 'r', encoding='utf-8') as file:
    hn = list(csv.reader(file))

hn_headers = hn[0]
hn = hn[1:]

In [43]:
hn_headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
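
The positional indexes used later (for example `post[4]` for `num_comments`) follow from the header order above. As a hedged alternative sketch, `csv.DictReader` lets each row be addressed by column name instead. The two sample rows below are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Hypothetical two-row sample using the same column layout as the dataset.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,5,12,alice,9/26/2016 3:26\n"
    "2,Show HN: Example project,http://example.com,3,4,bob,9/26/2016 15:10\n"
)

# DictReader consumes the header row and maps each value to its column name.
rows = list(csv.DictReader(sample_csv))

first_title = rows[0]["title"]
first_comments = int(rows[0]["num_comments"])  # values are read as strings
print(first_title, first_comments)
```

The rest of this notebook keeps the list-of-lists layout, so this is only an illustration of the column order.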

In [40]:
hn[:5]

[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14']]

Since we only want posts whose titles start with "Ask HN" or "Show HN," we will sort them into separate lists using the startswith() method. Capitalization isn't enforced in titles, so we will first convert each title to lowercase with the lower() method to make the comparison ignore case.

In [12]:
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)

In [14]:
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822
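
The case-insensitive prefix check can be isolated into a small helper to make its behavior easy to verify. This is a minimal sketch; the function name `classify_title` and the sample titles are hypothetical and not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix, ignoring case."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles showing that capitalization does not affect the result.
for sample in ['ASK HN: Any advice?', 'show hn: My project', 'algorithmic music']:
    print(sample, '->', classify_title(sample))
```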


Now, we'll check to see which type of post receives more comments on average.

In [16]:
total_ask_comments = 0
total_show_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])
for post in show_posts:
    total_show_comments += int(post[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)    

In [17]:
print(avg_ask_comments)
print(avg_show_comments)

10.393478498741656
4.886099625910612
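
The two averaging loops share the same logic, so they could be folded into one function. The sketch below assumes rows in the same column order as the dataset, with `num_comments` at index 4. The helper name `average_comments` and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Made-up rows: id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ['1', 'Ask HN: First example?', '', '1', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: Second example?', '', '2', '6', 'user_b', '9/26/2016 15:02'],
]

print(average_comments(sample_posts))
```

Guarding against an empty list avoids a `ZeroDivisionError` if a filter happens to match no posts.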


As we can see, Ask HN posts receive more comments per post on average (about 10.4, versus about 4.9 for Show HN). One plausible explanation is that questions tend to provoke more discussion, depending on how complex or opinion-driven the question is. Since Ask HN posts tend to garner more comments, we will focus on them for the rest of the analysis. Next, we'll determine how much the hour of posting affects the number of comments a post receives. We'll use the datetime module for this.

In [18]:
import datetime as dt

result_list = []

for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])    

In the following code, counts_by_hour holds the number of Ask HN posts created during each hour, and comments_by_hour holds the total number of comments those posts received.

In [25]:
counts_by_hour = {}
comments_by_hour = {}

# Timestamps look like '9/26/2016 3:26' (month/day/year hour:minute)
dt_template = '%m/%d/%Y %H:%M'
for result in result_list:
    dt_obj = dt.datetime.strptime(result[0], dt_template)
    hour = dt_obj.strftime('%H')
    num_comments = result[1]
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments

In [37]:
counts_by_hour

{'02': 269,
 '01': 282,
 '22': 383,
 '21': 518,
 '19': 552,
 '17': 587,
 '15': 646,
 '14': 513,
 '13': 444,
 '11': 312,
 '10': 282,
 '09': 222,
 '07': 226,
 '03': 271,
 '23': 343,
 '20': 510,
 '16': 579,
 '08': 257,
 '00': 301,
 '18': 614,
 '12': 342,
 '04': 243,
 '06': 234,
 '05': 209}

In [38]:
comments_by_hour

{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '23': 2297,
 '20': 4462,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}
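
The same per-hour tallies can be built without the explicit `if`/`else` by using `collections.defaultdict`, which starts every missing key at zero. This is a sketch of an alternative, not the approach used above, and the three sample records are made up.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs in the dataset's timestamp format.
sample_results = [
    ['9/26/2016 3:26', 4],
    ['9/25/2016 15:02', 10],
    ['9/24/2016 15:40', 6],
]

sample_counts = defaultdict(int)    # posts per hour
sample_comments = defaultdict(int)  # total comments per hour

for created_at, num_comments in sample_results:
    # Parse the timestamp, then format the hour as a zero-padded string such as '03'.
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))
print(dict(sample_comments))
```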

Now, we'll calculate the average number of comments per post for each hour.

In [30]:
avg_by_hour = []

for hour in counts_by_hour:
    avg_comments_per_post = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg_comments_per_post])

In [33]:
avg_by_hour

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]

We'll want to see this list sorted by the average comments per post. By default, the sorted() function compares nested lists element by element, starting with the first element (here, the hour), so we'll swap the values into a new list that puts the average first.

In [49]:
swap_avg_by_hour = []
for avg in avg_by_hour:
    swap_avg_by_hour.append([avg[1], avg[0]])
    
swap_avg_by_hour = sorted(swap_avg_by_hour, reverse=True)

for avg, hour in swap_avg_by_hour:
    print(f'{hour}:00: {avg:.2f} average comments per post')

15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
04:00: 9.71 average comments per post
14:00: 9.69 average comments per post
17:00: 9.45 average comments per post
08:00: 9.19 average comments per post
11:00: 8.96 average comments per post
22:00: 8.80 average comments per post
05:00: 8.79 average comments per post
20:00: 8.75 average comments per post
21:00: 8.69 average comments per post
03:00: 7.95 average comments per post
18:00: 7.94 average comments per post
16:00: 7.71 average comments per post
00:00: 7.56 average comments per post
01:00: 7.41 average comments per post
19:00: 7.16 average comments per post
07:00: 7.01 average comments per post
06:00: 6.78 average comments per post
23:00: 6.70 average comments per post
09:00: 6.65 average comments per post
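
The swap step is one way to get this ordering. Another option is the `key` argument of `sorted()`, which selects the value to sort on without building a second list. The sketch below uses three rounded values copied from the averages above. The name `sample_avg_by_hour` is introduced only for this illustration.

```python
# Three [hour, average] pairs, rounded from the results above.
sample_avg_by_hour = [['02', 11.14], ['15', 28.68], ['09', 6.65]]

# key picks the second element (the average) as the sort value.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print(f'{hour}:00: {avg:.2f} average comments per post')
```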


From this result, we can see that Ask HN posts created during the 15:00 hour (3pm EST) receive the most comments per post, about 28.68 on average. The next two hours, 13:00 and 12:00, suggest that noon to mid-afternoon is generally a good time to submit an "Ask HN" post.