# Exploring Hacker News Posts

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

## 1. Collect and Explore the Data

First, we will collect the dataset from [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

According to the source's description, the Hacker News Posts dataset covers the 12 months up to September 26, 2016.

In [1]:
from csv import reader

# Hacker News Posts dataset
file = 'HN_posts_year_to_Sep_26_2016.csv'
with open(file, encoding='utf8', newline='') as opened_file: # Close the file once it has been read
    read_file = reader(opened_file)
    hn_data = list(read_file) # Transform our data from a reader object to a list of lists
hn_header = hn_data[0]
hn_data = hn_data[1:]

print("Number of columns: ",len(hn_header))
print('Number of rows: ',len(hn_data))


Number of columns:  7
Number of rows:  293119


Our dataset has a total of 7 columns and 293119 rows. Below we can find the respective information regarding the columns:

- `id` : The unique identifier from Hacker News for the post
- `title` : The title of the post
- `url` : The URL that the post links to, if the post has a URL
- `num_points` : The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments` : The number of comments that were made on the post
- `author` : The username of the person who submitted the post
- `created_at` : The date and time at which the post was submitted
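
Because each row is loaded as a plain list, columns are addressed by position (for example, index 1 for `title` and index 4 for `num_comments`). As a minimal sketch of an alternative, assuming the same header row, `csv.DictReader` from the standard library lets each field be read by name. The in-memory sample reuses one row of the dataset; the names `sample_csv`, `rows`, and `first` are illustrative only.

```python
import csv
import io

# In-memory stand-in for the CSV file: the header plus one row from the dataset.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR  the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
)

# DictReader uses the first line as field names and yields one dict per row.
rows = list(csv.DictReader(sample_csv))

first = rows[0]
print(first["title"])              # access by column name instead of index 1
print(int(first["num_comments"]))  # values are strings until converted
```

Positional access is kept in the rest of this notebook; the dictionary form is mainly useful when the column order might change.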

Next, we will create a function to return the data of a given subset of our dataset and then display the first 5 rows of our dataset.

In [2]:
def explore_data(dataset: list, start: int, end: int):
    data = dataset[start:end]
    for row in data:
        print(row)
        print('\n')

print(hn_header)
explore_data(hn_data,0,5) # Display first 5 rows

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']




## 2. Filtering by `Ask HN` or `Show HN`

We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting.

In [3]:
ask_posts = [] # list only for Ask HN posts
show_posts = [] # list only for Show HN posts
other_posts = [] # list for other posts

for post in hn_data:
    title = post[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post) # Append the row itself, not the list


Let's check the number of posts in each of the lists created above.

In [4]:
print('Total of Ask HN posts: ',len(ask_posts))
print('Total of Show HN posts: ',len(show_posts))
print('Total of remaining posts: ',len(other_posts))

Total of Ask HN posts:  9139
Total of Show HN posts:  10158
Total of remaining posts:  273822


In [5]:
print('----------------------ASK POSTS----------------------')
explore_data(ask_posts,0,5)
print('\n')
print('----------------------SHOW POSTS----------------------')
explore_data(show_posts,0,5)

----------------------ASK POSTS----------------------
['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']


['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']


['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57']


['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']


['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']




----------------------SHOW POSTS----------------------
['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36']


['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01']


['12578098'

## 3. Analyzing the Filtered Datasets

### 3.1 Posts with more comments on average

Next, we will determine which type of post, Ask or Show, receives more comments on average.

In [6]:
def calculate_avg_post(dataset: list, post_type: str):
    total_comments = 0
    for post in dataset:
        num_comments = int(post[4])
        total_comments += num_comments
    
    avg_comments = total_comments/len(dataset)
    print("The average number of comments in {} posts is: {}".format(post_type,round(avg_comments,2)))

In [7]:
calculate_avg_post(ask_posts, "Ask")
calculate_avg_post(show_posts, "Show")

The average number of comments in Ask posts is: 10.39
The average number of comments in Show posts is: 4.89


The output above shows that Ask posts receive more comments on average than Show posts (10.39 versus 4.89). This is plausible: a question invites replies, whereas a presented project does not necessarily call for one, and everyone has a different opinion 😊.
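
`calculate_avg_post` prints its result, so the two averages cannot be compared directly in code. A minimal sketch of a variant that returns the value instead is shown here; the name `average_comments` and the three sample rows are hypothetical, and the function assumes `num_comments` sits at index 4, as in the dataset.

```python
def average_comments(dataset: list) -> float:
    """Return the mean of the num_comments column (index 4), or 0.0 if empty."""
    if not dataset:
        return 0.0
    total_comments = sum(int(post[4]) for post in dataset)
    return total_comments / len(dataset)

# Hypothetical rows in the dataset's column order; only index 4 matters here.
sample_posts = [
    ['1', 'Ask HN: example A', '', '4', '7', 'user_a', '9/26/2016 2:53'],
    ['2', 'Ask HN: example B', '', '6', '3', 'user_b', '9/26/2016 1:17'],
    ['3', 'Ask HN: example C', '', '1', '2', 'user_c', '9/25/2016 22:57'],
]

print(round(average_comments(sample_posts), 2))  # (7 + 3 + 2) / 3
```

Returning the number also guards against a division by zero when a filtered list turns out to be empty.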

### 3.2 Determine the number of Ask posts and comments per hour

Next, we'll determine whether ask posts created at a certain time of day are more likely to attract comments. We'll use the following steps to perform this analysis:
- Calculate the number of ask posts created in each hour of the day, along with the number of comments they received.
- Calculate the average number of comments ask posts receive by hour created.
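
The two steps above can be sketched on a handful of rows before running them on the full list. This is a minimal sketch, assuming the `created_at` format `month/day/year hour:minute` seen in the dataset. The `sample` pairs are taken from the first five Ask HN rows printed earlier, and `collections.defaultdict` is used here only to shorten the bookkeeping.

```python
import datetime as dt
from collections import defaultdict

# (created_at, num_comments) pairs from the first five Ask HN rows shown earlier.
sample = [
    ('9/26/2016 2:53', 7),
    ('9/26/2016 1:17', 3),
    ('9/25/2016 22:57', 0),
    ('9/25/2016 22:48', 3),
    ('9/25/2016 21:50', 2),
]

counts = defaultdict(int)    # posts created in each hour
comments = defaultdict(int)  # comments received by posts created in each hour

for created_at, num_comments in sample:
    # Parse the timestamp, then keep only the zero-padded hour, e.g. '02'.
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

# Step 2: average comments per post for each hour seen in the sample.
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```
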


In [43]:
import datetime as dt


result_list = []

for post in ask_posts:
    created_date = post[6]
    num_comments = int(post[4])
    result_list.append([created_date,num_comments])

counts_by_hour = {} # contains the number of ask posts created during each hour of the day. 
comments_by_hour = {} # contains the corresponding number of comments ask posts created at each hour received

for post in result_list:
    date_hour = post[0]
    num_comments = int(post[1])
    hour = dt.datetime.strptime(date_hour,'%m/%d/%Y %H:%M')
    hour = dt.datetime.strftime(hour,'%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments

In [44]:
counts_by_hour

{'02': 269,
 '01': 282,
 '22': 383,
 '21': 518,
 '19': 552,
 '17': 587,
 '15': 646,
 '14': 513,
 '13': 444,
 '11': 312,
 '10': 282,
 '09': 222,
 '07': 226,
 '03': 271,
 '23': 343,
 '20': 510,
 '16': 579,
 '08': 257,
 '00': 301,
 '18': 614,
 '12': 342,
 '04': 243,
 '06': 234,
 '05': 209}

In [47]:
comments_by_hour

{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '23': 2297,
 '20': 4462,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}

### 3.3 Hours with more comments on average

Next, we will calculate the average number of comments per post for posts created during each hour of the day.

For each hour: average comments per post = `comments_by_hour[hour]` / `counts_by_hour[hour]`
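
As a worked example of this ratio, the snippet below uses the totals reported above for 15:00 and 13:00. It is only an arithmetic check on values copied from the output, not part of the original analysis.

```python
# Totals copied from the counts_by_hour and comments_by_hour output above.
counts = {'15': 646, '13': 444}
comments = {'15': 18525, '13': 7245}

for hour in counts:
    # Total comments divided by number of posts, rounded to two decimals.
    avg = round(comments[hour] / counts[hour], 2)
    print(hour, avg)
```
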

In [15]:
avg_by_hour = []

for hour in comments_by_hour:
    avg = round(comments_by_hour[hour]/counts_by_hour[hour],2)
    avg_by_hour.append([hour,avg])

avg_by_hour.sort()
avg_by_hour

[['00', 7.56],
 ['01', 7.41],
 ['02', 11.14],
 ['03', 7.95],
 ['04', 9.71],
 ['05', 8.79],
 ['06', 6.78],
 ['07', 7.01],
 ['08', 9.19],
 ['09', 6.65],
 ['10', 10.68],
 ['11', 8.96],
 ['12', 12.38],
 ['13', 16.32],
 ['14', 9.69],
 ['15', 28.68],
 ['16', 7.71],
 ['17', 9.45],
 ['18', 7.94],
 ['19', 7.16],
 ['20', 8.75],
 ['21', 8.69],
 ['22', 8.8],
 ['23', 6.7]]

Let's find the five hours with the highest average number of comments per post.

In [28]:
swap_avg_by_hour = []
for hour in avg_by_hour:
    swap_avg_by_hour.append([hour[1],hour[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

In [42]:
# Print the 5 hours with the highest average number of comments per post.
print("---------------Top 5 Hours for Ask Posts Comments---------------")
for post in sorted_swap[:5]:
    hour = dt.datetime.strptime(str(post[1])+":00","%H:%M")
    hour = dt.datetime.strftime(hour,"%H:%M")
    print("{}: {} average comments per post.".format(hour,post[0]))
    
    

---------------Top 5 Hours for Ask Posts Comments---------------
15:00: 28.68 average comments per post.
13:00: 16.32 average comments per post.
12:00: 12.38 average comments per post.
02:00: 11.14 average comments per post.
10:00: 10.68 average comments per post.


Ask posts created at 15:00 receive the most comments on average, 28.68 per post. The top three hours (15:00, 13:00, and 12:00) fall around and just after lunchtime.
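
The swap-and-sort step used above can also be written without building a second list. As a minimal sketch, `sorted` with a `key` function orders the `[hour, average]` pairs by their second element. The `avg_by_hour` values are copied from the output above, and `top_five` is an illustrative name.

```python
# [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour = [
    ['00', 7.56], ['01', 7.41], ['02', 11.14], ['03', 7.95],
    ['04', 9.71], ['05', 8.79], ['06', 6.78], ['07', 7.01],
    ['08', 9.19], ['09', 6.65], ['10', 10.68], ['11', 8.96],
    ['12', 12.38], ['13', 16.32], ['14', 9.69], ['15', 28.68],
    ['16', 7.71], ['17', 9.45], ['18', 7.94], ['19', 7.16],
    ['20', 8.75], ['21', 8.69], ['22', 8.8], ['23', 6.7],
]

# Sort by the average (second element), highest first, and keep five pairs.
top_five = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_five:
    print("{}:00: {:.2f} average comments per post.".format(hour, avg))
```

Sorting with a key leaves each pair in its original `[hour, average]` order, so no swapped copy has to be kept in sync.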

## 4. Conclusion

In this project, we analyzed ask posts and show posts from [Hacker News](https://news.ycombinator.com/) to determine which type of post, and which posting time, receives the most comments on average.
The dataset contains more Show posts (10158) than Ask posts (9139). We analyzed posting times only for Ask posts because they receive more comments per post on average: 10.39, compared with 4.89 for Show posts.

Based on our study, an ask post is most likely to attract comments if it is created between 15:00 and 16:00 (3pm-4pm). Averages are generally lower between 8pm and 7am, although 02:00 (11.14 comments per post) is an exception.

