# Exploring Hacker News Posts

This is the final project suggested in the _Python for Data Science: Intermediate_ course hosted by [Dataquest.io](https://www.dataquest.io/).

In this project, we will analyze two kinds of posts from the well-known technology website [Hacker News](https://news.ycombinator.com/): `Ask HN` and `Show HN`.

People submit `Ask HN` posts to ask the community a specific tech-related question. `Show HN` posts, on the other hand, share projects, products, or anything else the author finds interesting.

We will compare these posts to answer the following questions:

- Do `Ask HN` or `Show HN` posts receive more comments?
- At what time do these posts receive the most comments on average?

## Opening and Exploring the Data

For this project, we will use a [data set](https://www.kaggle.com/hacker-news/hacker-news-posts) from [Kaggle.com](https://www.kaggle.com/).
This data set contains Hacker News posts from September 2015 to September 2016.

First of all, we'll read the data, separate the headers and have a look at the first few rows.

In [1]:
from csv import reader

# Use a context manager so the file is closed once the rows are read
with open('HN_posts_year_to_Sep_26_2016.csv', encoding='utf8') as opened_file:
    hn = list(reader(opened_file))

headers = hn[0]
hn = hn[1:]

print(len(hn))
print(headers)
hn[0:5]

293119
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14']]

The data set contains the title, URL, and author of each post, along with the fields that matter most for this project: the number of comments and the creation date.
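
Because every row is a plain list, columns are addressed by position (for example, index 4 for `num_comments`). As a minimal sketch, assuming a tiny in-memory sample rather than the real file, the header row can be turned into a name-to-index lookup. The names `sample_csv` and `col` are hypothetical and used only for illustration.

```python
from csv import reader
from io import StringIO

# Hypothetical one-row sample that mimics the layout of the real file.
sample_csv = StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Example title,http://example.com,1,0,some_user,9/26/2016 3:26\n"
)

rows = list(reader(sample_csv))
sample_headers, sample_rows = rows[0], rows[1:]

# Map each column name to its position in a row.
col = {name: index for index, name in enumerate(sample_headers)}

first = sample_rows[0]
print(col["num_comments"])       # position of the comments column
print(first[col["created_at"]])  # creation date of the sample row
```

A lookup like this keeps later code readable, although the rest of this project sticks to plain numeric indexes.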

## Extracting Ask HN and Show HN Posts

Since we only want to analyze `Ask HN` and `Show HN` posts, we will separate them from the rest by searching for titles that begin with those strings, ignoring case.

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif title.lower().startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
        
print("Number of Ask HN posts:   ", len(ask_posts))
print("Number of Show HN posts:  ", len(show_posts))
print("Number of posts not used: ", len(other_posts))

Number of Ask HN posts:    9139
Number of Show HN posts:   10158
Number of posts not used:  273822


Although the `lower()` and `startswith()` methods work well here, we could also use a regular expression to find these posts.
To do so, we match any title that starts with either _ask hn_ or _show hn_, using case-insensitive matching.

In [3]:
# import the regex module
import re

# our regex to find Ask HN posts
regex_ask = r"^ask\shn"

# our regex to find Show HN posts
regex_show = r"^show\shn"

ask_posts_re = []
show_posts_re = []
other_posts_re = []

for post in hn:
    title = post[1]
    if re.search(regex_ask, title, re.IGNORECASE):
        ask_posts_re.append(post)
    elif re.search(regex_show, title, re.IGNORECASE):
        show_posts_re.append(post)
    else:
        other_posts_re.append(post)
        
print("Number of Ask HN posts:   ", len(ask_posts_re))
print("Number of Show HN posts:  ", len(show_posts_re))
print("Number of posts not used: ", len(other_posts_re))

Number of Ask HN posts:    9139
Number of Show HN posts:   10158
Number of posts not used:  273822


Both methods produce the same counts.
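
To compare the two approaches on a small scale, the sketch below classifies a handful of titles with each method and checks that the labels agree. The titles and the helper names `classify_startswith` and `classify_regex` are hypothetical; they are not part of the data set or of the cells above.

```python
import re

# Hypothetical titles chosen to exercise both branches and the fallback.
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "ASK HN: Best books of the year?",
    "Show HN: My weekend project",
    "Why I ask HN readers for feedback",
    "algorithmic music",
]

def classify_startswith(title):
    """Label a title using lower() and startswith()."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

def classify_regex(title):
    """Label a title using anchored, case-insensitive regexes."""
    if re.search(r"^ask\shn", title, re.IGNORECASE):
        return "ask"
    if re.search(r"^show\shn", title, re.IGNORECASE):
        return "show"
    return "other"

labels_startswith = [classify_startswith(t) for t in sample_titles]
labels_regex = [classify_regex(t) for t in sample_titles]
print(labels_startswith == labels_regex)
```

One subtle difference: `\s` matches any whitespace character, while `startswith('ask hn')` requires a literal space, so the two methods could disagree on a title that has a tab after "ask". The identical counts above suggest this does not affect the results for this data set.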

## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Now we can calculate the average number of comments for each type of post and answer the first question of the project.

In [4]:
total_ask_comments = 0

for post in ask_posts:
    comments = int(post[4])
    total_ask_comments += comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of comments in Ask HN posts: ", avg_ask_comments)

Average number of comments in Ask HN posts:  10.393478498741656


In [5]:
total_show_comments = 0

for post in show_posts:
    comments = int(post[4])
    total_show_comments += comments
    
avg_show_comments = total_show_comments / len(show_posts)
print("Average number of comments in Show HN posts: ", avg_show_comments)

Average number of comments in Show HN posts:  4.886099625910612


On average, `Ask HN` posts receive roughly twice as many comments as `Show HN` posts (about 10.39 versus 4.89 per post).
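
The two loops above differ only in the list they iterate over, so the calculation could be wrapped in a small helper. This is a sketch under stated assumptions: `average_comments` and `sample_posts` are hypothetical names, and the rows are invented to match the column layout of the data set.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count of the posts, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Hypothetical rows with the same seven columns as the data set.
sample_posts = [
    ["1", "Ask HN: A", "", "1", "10", "user_a", "9/26/2016 3:26"],
    ["2", "Ask HN: B", "", "1", "0", "user_b", "9/26/2016 4:10"],
    ["3", "Ask HN: C", "", "1", "5", "user_c", "9/26/2016 5:00"],
]

print(average_comments(sample_posts))
```

With the real lists, the same helper would be called once with `ask_posts` and once with `show_posts`.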

From now on, we will analyze only `Ask HN` posts, since they attract more comments.

## Finding the Number of Ask HN Posts and Comments by Hour

Next, we will use the `datetime` module to find the number of `Ask HN` posts and comments by hour.

We will create two dictionaries: one that counts the posts created in each hour and one that sums the comments those posts received. To extract the hour, we parse each `created_at` value with the format string `%m/%d/%Y %H:%M`.

In [6]:
import datetime as dt

result_list = []

for post in ask_posts:
    post_list = [post[6], int(post[4])]
    result_list.append(post_list)
    
posts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comments = row[1]
    hour = dt.datetime.strptime(date, date_format).strftime("%H")
    
    if hour not in posts_by_hour:
        posts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        posts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

Let's look at the total number of comments each hour, to get a general idea.

In [7]:
comments_by_hour

{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '23': 2297,
 '20': 4462,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}
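
As an alternative to the explicit `if hour not in ...` check, the same accumulation can be sketched with `collections.defaultdict`, which starts every missing key at zero. The rows below are hypothetical `(created_at, num_comments)` pairs, and `posts_per_hour` and `comments_per_hour` are illustrative names rather than the dictionaries built above.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# Hypothetical [created_at, num_comments] pairs.
sample_rows = [
    ["9/26/2016 3:26", 4],
    ["9/25/2016 15:02", 10],
    ["8/16/2016 15:40", 6],
]

posts_per_hour = defaultdict(int)     # missing hours start at 0
comments_per_hour = defaultdict(int)

for created_at, comments in sample_rows:
    # Parse the timestamp and keep only the zero-padded hour, e.g. "03".
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    posts_per_hour[hour] += 1
    comments_per_hour[hour] += comments

print(dict(posts_per_hour))
print(dict(comments_per_hour))
```

Both approaches produce the same totals; `defaultdict` simply removes the branch that initializes each hour.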

## Calculating the Average Number of Comments for Ask HN Posts by Hour

Now we can compute the average number of comments per post for each hour, which we will store in a list of lists for the next step of our project.

In [8]:
avg_by_hour = []

for hour in comments_by_hour:
    avg = comments_by_hour[hour] / posts_by_hour[hour]
    avg_by_hour.append([hour, avg])
    
avg_by_hour

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]

## Sorting and Printing Values from a List of Lists

To make the results easier to read, let's swap the columns of the list of lists and sort it in descending order, so we can see the hours in which `Ask HN` posts received the most comments on average.

In [9]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap

[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10'],
 [9.7119341563786, '04'],
 [9.692007797270955, '14'],
 [9.449744463373083, '17'],
 [9.190661478599221, '08'],
 [8.96474358974359, '11'],
 [8.804177545691905, '22'],
 [8.794258373205741, '05'],
 [8.749019607843136, '20'],
 [8.687258687258687, '21'],
 [7.948339483394834, '03'],
 [7.94299674267101, '18'],
 [7.713298791018998, '16'],
 [7.5647840531561465, '00'],
 [7.407801418439717, '01'],
 [7.163043478260869, '19'],
 [7.013274336283186, '07'],
 [6.782051282051282, '06'],
 [6.696793002915452, '23'],
 [6.653153153153153, '09']]
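
Swapping the columns is one way to sort by the average. As a minimal sketch of an alternative, `sorted()` also accepts a `key` function, which sorts on the second element without building a swapped copy. The list `sample_avg_by_hour` is a hypothetical excerpt with rounded values.

```python
# Hypothetical excerpt in the same [hour, average] layout as avg_by_hour.
sample_avg_by_hour = [
    ["02", 11.14],
    ["15", 28.68],
    ["09", 6.65],
    ["13", 16.32],
]

# Sort by the average (index 1), highest first; the column order is unchanged.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, avg in sorted_by_avg:
    print(hour, avg)
```

With this variant, later code would unpack each row as `hour, avg` instead of `avg, hour`.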

As a final step, we will print the top five hours in a more readable format. According to the [documentation of the data set](https://www.kaggle.com/hacker-news/hacker-news-posts), the timestamps use US Eastern Time (EST).

In [10]:
print("Top 5 Hours for Ask Posts Comments")

string_format = "At {} EST, {:.2f} average comments per post."
for avg, hour in sorted_swap[:5]:
    print(string_format.format(dt.datetime.strptime(hour, "%H").strftime("%H:%M"), avg))

Top 5 Hours for Ask Posts Comments
At 15:00 EST, 28.68 average comments per post.
At 13:00 EST, 16.32 average comments per post.
At 12:00 EST, 12.38 average comments per post.
At 02:00 EST, 11.14 average comments per post.
At 10:00 EST, 10.68 average comments per post.
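
Readers in other time zones may want these hours converted. The sketch below converts an hour to UTC, assuming a fixed offset of UTC-5 for every timestamp. That is a simplification: it ignores daylight saving time, and the helper `est_hour_to_utc` is a hypothetical name introduced here.

```python
import datetime as dt

# Assumption: every timestamp uses a fixed UTC-5 offset (no daylight saving).
EST = dt.timezone(dt.timedelta(hours=-5), name="EST")

def est_hour_to_utc(hour):
    """Convert an hour string such as '15' from UTC-5 to a UTC 'HH:MM' string."""
    # The date is arbitrary; only the time of day matters for the conversion.
    local = dt.datetime(2016, 1, 1, int(hour), tzinfo=EST)
    return local.astimezone(dt.timezone.utc).strftime("%H:%M")

for hour in ["15", "13", "12", "02", "10"]:
    print("{}:00 EST is {} UTC".format(hour, est_hour_to_utc(hour)))
```

Under this assumption, the top hour of 15:00 EST corresponds to 20:00 UTC.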


## Conclusions

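In this project, we analyzed two kinds of Hacker News posts, `Ask HN` and `Show HN`, to determine which type and which posting times attract the most comments on average. `Ask HN` posts received about twice as many comments per post as `Show HN` posts (10.39 versus 4.89). Among `Ask HN` posts, those created at 15:00 EST received the most comments on average (28.68 per post), so in this data set that is the best hour to post an `Ask HN` entry to maximize comments.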