# Hacker News Posts

In this project, we will work with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/).

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

We are specifically interested in posts whose titles begin with either Ask HN or Show HN.
    
__We will compare these two types of posts to determine the following:__
 * Do Ask HN or Show HN posts receive more comments on average?
 * Do posts created at a certain time receive more comments on average?

## Data Set Source

Resource: https://www.kaggle.com/hacker-news/hacker-news-posts

It contains 293,119 rows of data gathered over the 12 months up to September 26, 2016. Below are descriptions of the columns:

* `id` : The unique identifier from Hacker News for the post
* `title` : The title of the post
* `url` : The URL that the post links to, if the post has a URL
* `num_points` : The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments` : The number of comments that were made on the post
* `author` : The username of the person who submitted the post
* `created_at` : The date and time at which the post was submitted
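
Because each row is read as a plain list, the analysis accesses columns by position (for example, index 4 for `num_comments`). As a small illustrative sketch, not part of the original analysis, a name-to-index lookup can make those positions explicit. The `COLUMN_INDEX` name is an assumption introduced here for illustration.

```python
# Column names in the order listed above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its position in a row.
COLUMN_INDEX = {name: position for position, name in enumerate(headers)}

# An example row from the data set (every value is a string as read from the CSV).
sample_row = [
    '12579008',
    'You have two days to comment if you want stem cells to be classified as your own',
    'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
    '1',
    '0',
    'altstar',
    '9/26/2016 3:26',
]

# Look up the comment count by column name instead of a hard-coded index.
num_comments = int(sample_row[COLUMN_INDEX['num_comments']])
print(COLUMN_INDEX['num_comments'], num_comments)
```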

## Importing Data Set
Let's start by importing the libraries we need and reading the data set into a list of lists.

In [11]:
from csv import reader

# open the file in a context manager so it is closed after reading
with open('hacker_news.csv', encoding='utf8') as opened_file:
    read_file = reader(opened_file)
    list_data = list(read_file)
headers = list_data[0]
hn = list_data[1:]

# display headers, total rows and first three samples
print(headers,"\n")
print("Rows:",len(hn))
hn[:3]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

Rows: 293119


[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19']]

Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:

    Ask HN: How to improve my personal website?
    Ask HN: Am I the only one outraged by Twitter shutting down share counts?
    Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit Show HN posts to show the Hacker News community a project, a product, or just something interesting. Below are a few examples:

    Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
    Show HN: Something pointless I made
    Show HN: Shanhu.io, a programming playground powered by e8vm


## Extracting Ask HN and Show HN Posts

Since we are only interested in posts whose titles begin with __Ask HN__ or __Show HN__, we will create new lists of lists containing just the data for those posts.

In [12]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.startswith('Ask HN'):
        ask_posts.append(row)
    elif title.startswith('Show HN'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Total Posts: ", len(hn))
print("Ask Posts: ", len(ask_posts))
print("Show Posts: ", len(show_posts))
print("Other Posts: ", len(other_posts))

Total Posts:  293119
Ask Posts:  9122
Show Posts:  10150
Other Posts:  273847


__Note that the `Ask HN` and `Show HN` posts together make up 19,272 rows, down from almost 300,000 rows in the full data set.__
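
The filter above matches the prefixes exactly as written, so a title that starts with lowercase `ask hn` would end up in `other_posts`. The sketch below shows one possible case-insensitive variant. The `classify_title` helper is hypothetical, and applying it to the full data set may produce counts that differ from those reported above.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix, ignoring case."""
    normalized = title.lower()
    if normalized.startswith('ask hn'):
        return 'ask'
    if normalized.startswith('show hn'):
        return 'show'
    return 'other'


# A few sample titles, including one written in lowercase on purpose.
sample_titles = [
    'Ask HN: How to improve my personal website?',
    'show hn: Something pointless I made',
    'What if we just printed a flatscreen television on the side of our boxes?',
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```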

## Calculating Average Number of Comments

Next, we will work with the number of comments on `Ask HN` and `Show HN` posts, starting with the total number of comments and then the average number of comments per post.

In [5]:
# total number of comments on Ask HN posts
total_ask_comments = 0
for row in ask_posts:
    n_comments = int(row[4])
    total_ask_comments += n_comments

# average number of comments per Ask HN post
avg_ask_comments = total_ask_comments / len(ask_posts)

print("Total ask comments:", total_ask_comments)
print("Average comments per post:", avg_ask_comments)

Total ask comments: 94930
Average comments per post: 10.406709055031792


In [13]:
# total number of comments on Show HN posts
total_show_comments = 0
for row in show_posts:
    n_comments = int(row[4])
    total_show_comments += n_comments

# average number of comments per Show HN post
avg_show_comments = total_show_comments / len(show_posts)

print("Total show comments:", total_show_comments)
print("Average comments per post:", avg_show_comments)

Total show comments: 49617
Average comments per post: 4.888374384236453


We find that __`Ask HN` posts receive more comments, averaging about 10.41__ comments per post, compared with about 4.89 comments per post for `Show HN`.
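
The two cells above repeat the same loop. As a sketch, assuming each row keeps `num_comments` as a string at index 4, a small helper could compute either average. The name `average_comments` and the example rows are made up for illustration and are not part of the original analysis.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts stored as strings at comments_index."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Tiny made-up rows in the same column order as the data set.
example_posts = [
    ['1', 'Ask HN: First question?', '', '3', '10', 'user_a', '9/26/2016 2:53'],
    ['2', 'Ask HN: Second question?', '', '1', '4', 'user_b', '9/26/2016 3:10'],
]

example_average = average_comments(example_posts)
print(example_average)
```

With the real data, the same call would be made once with `ask_posts` and once with `show_posts`.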

## Finding the Number of Posts and Comments By Hour

Next, we will determine if ask posts created at a certain time are more likely to attract comments by using these steps:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

We will tackle the first step using the `datetime` module to work with the data in the `created_at` column. We begin by parsing the dates, which are stored as strings, into datetime objects with `datetime.strptime()`. Based on the `created_at` values, the date format is `%m/%d/%Y %H:%M`.
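
As a quick illustration of that format string, the sketch below parses a single `created_at` value from the data set and extracts the hour. The variable names are chosen only for this example.

```python
import datetime as dt

# One created_at value as it appears in the data set.
created_at = '9/26/2016 3:26'

# Parse the string into a datetime object using the format described above.
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

# Format the hour as a zero-padded, two-digit string.
hour_label = parsed.strftime('%H')

print(parsed)
print(parsed.hour, hour_label)
```

Note that `parsed.hour` is an integer, while `strftime('%H')` returns a zero-padded string, which is the form used as dictionary keys later in this notebook.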

In [7]:
import datetime as dt

result_list = []

for row in ask_posts:
    n_comments = int(row[4])
    created_at = row[6]
    result_list.append([created_at, n_comments])
    
result_list[0]

['9/26/2016 2:53', 7]

By looping through `ask_posts`, we created `result_list`, a list of lists in which each element holds a post's creation time as a string and its number of comments.

Below we will create two dictionaries:
 * `counts_by_hour` : contains the number of ask posts created during each hour of the day.
 * `comments_by_hour` : contains the total number of comments received by ask posts created during each hour of the day.

In [8]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    creation = row[0]
    creation_date = dt.datetime.strptime(creation, "%m/%d/%Y %H:%M")
    creation_time = creation_date.strftime("%H")
    if creation_time not in counts_by_hour:
        counts_by_hour[creation_time] = 1
    else:
        counts_by_hour[creation_time] += 1
    
    n_comments = row[1]
    if creation_time not in comments_by_hour:
        comments_by_hour[creation_time] = n_comments
    else:
        comments_by_hour[creation_time] += n_comments
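
The `if`/`else` bookkeeping above can also be written with `collections.defaultdict`, which supplies a starting value of 0 for keys that have not been seen yet. This is an alternative sketch, not the code that produced the results in this notebook. The `sample_results` list and the dictionary names are made up for illustration.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, n_comments] pairs in the same shape as result_list.
sample_results = [
    ['9/26/2016 2:53', 7],
    ['9/26/2016 2:10', 3],
    ['9/25/2016 15:30', 20],
]

# defaultdict(int) returns 0 for a missing key, so no membership check is needed.
counts = defaultdict(int)
comments = defaultdict(int)

for created_at, n_comments in sample_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += n_comments

print(dict(counts))
print(dict(comments))
```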

Next, we will use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [9]:
avg_by_hour = []

for key, value in comments_by_hour.items():
    n_posts_at_hour = counts_by_hour[key]
    # average number of comments per post for this hour
    avg_comments_at_hour = value / n_posts_at_hour

    avg_by_hour.append([key, avg_comments_at_hour])
    
avg_by_hour

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.819371727748692],
 ['21', 8.720930232558139],
 ['19', 7.176043557168784],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.70703125],
 ['13', 16.350678733031675],
 ['11', 9.012903225806452],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.04],
 ['03', 7.974074074074074],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.717993079584775],
 ['08', 9.190661478599221],
 ['00', 7.575250836120401],
 ['18', 7.954248366013072],
 ['12', 12.380116959064328],
 ['04', 9.743801652892563],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]

## Sorting Result List

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read. 

We will swap the first and second values of each list so that Python's built-in `sorted()` function sorts by the average rather than by the hour.

### Top 5 Hours for Ask Post Comments

In [14]:
swap_avg_by_hour = []

for avg in avg_by_hour:
    swap_avg_by_hour.append([avg[1], avg[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
template = "{time}: {avg:.2f} average comments per post"

for avg in sorted_swap[:5]:
    created_at = dt.datetime.strptime(avg[1], "%H")
    time_format = created_at.strftime("%H:%M")    
    print(template.format(time=time_format, avg=avg[0]))

15:00: 28.68 average comments per post
13:00: 16.35 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
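
Swapping the columns is one way to sort by average. Another option, shown here only as a sketch, is to pass a `key` function to `sorted()` so that the original `[hour, average]` order is kept. The sample values are a rounded subset of the results above, and the variable names are assumptions for this example.

```python
# A rounded subset of the [hour, average] pairs computed earlier.
sample_avg_by_hour = [
    ['02', 11.14],
    ['15', 28.68],
    ['13', 16.35],
    ['12', 12.38],
]

# Sort by the second element of each pair (the average), largest first.
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```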


## Conclusion

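* `Ask HN` posts received more comments on average (about 10.41 per post) than `Show HN` posts (about 4.89 per post)
* Ask posts created at 3 PM (Eastern Time in the US) received the most comments on average (28.68 per post), followed by those created at 1 PM (16.35) and 12 PM (12.38)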