# Hacker News Data Exploration

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result. In this project, we are going to explore a dataset containing Hacker News data.

The data set can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

- `id`: The unique identifier from Hacker News for the post
- `title`: The title of the post
- `url`: The URL that the post links to, if the post has a URL
- `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: The number of comments that were made on the post
- `author`: The username of the person who submitted the post
- `created_at`: The date and time at which the post was submitted

We're specifically interested in posts whose titles begin with either `"Ask HN"` or `"Show HN"`. Users submit `"Ask HN"` posts to ask the Hacker News community a specific question. Likewise, users submit `"Show HN"` posts to show the Hacker News community a project, product, or just generally something interesting.

We'll compare these two types of posts to determine the following:

- Do `"Ask HN"` or `"Show HN"` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?
- Do `"Ask HN"` or `"Show HN"` posts receive more points on average?
- Do posts created at a certain time receive more points on average?

We can also compare the results for these two types of posts to the average number of comments and points that other posts receive.

Let's start by importing the libraries we need and reading the data set into a list of lists.

In [1]:
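from csv import reader

# Read the whole CSV into a list of lists; the `with` block closes the file.
with open('dataset/hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

headers = hn[0]   # the first row holds the column names
hn = hn[1:]       # the remaining rows are the posts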

print(headers)
print("\n")
for each_row in hn[:5]:
    print(each_row)
    print("\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
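As a minimal sketch of the loading step, the snippet below reads from an in-memory string instead of `dataset/hacker_news.csv`, so it runs without the real file. The two data rows are copied from the output above; `sample_csv` and `load_rows` are hypothetical names used only for illustration.

```python
import io
from csv import reader

# Header plus two rows copied from the output above, kept in memory so the
# sketch does not depend on the real CSV file.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "10301696,Note by Note: The Making of Steinway L1037 (2007),"
    "http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0,"
    "8,2,walterbell,9/30/2015 4:12\n"
)


def load_rows(file_obj):
    """Return (headers, rows) read from an open CSV file object."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]


with io.StringIO(sample_csv) as f:
    sample_headers, sample_rows = load_rows(f)

print(sample_headers)
print(len(sample_rows))
```

Every value comes back as a string, which is why the later cells call `int()` before doing arithmetic on `num_comments` or `num_points`.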




Now that we've removed the headers from `hn`, we're ready to filter our data. Since we're only concerned with post titles beginning with `"Ask HN"` or `"Show HN"`, we'll create new lists of lists containing just the data for those titles.

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for each_row in hn:
    title = each_row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(each_row)
    elif title.startswith('show hn'):
        show_posts.append(each_row)
    else:
        other_posts.append(each_row)
        
print("Number of posts asking ('Ask HN'): " + str(len(ask_posts)))
print("Number of posts showing ('Show HN'): " + str(len(show_posts)))
print("Number of other posts: " + str(len(other_posts)))

Number of posts asking ('Ask HN'): 1744
Number of posts showing ('Show HN'): 1162
Number of other posts: 17194
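The filtering rule can be sketched on its own with a few titles. The helper name `classify_title` and most of the sample titles below are made up for illustration; only the lowercase-then-`startswith` check mirrors the cell above.

```python
def classify_title(title):
    """Label a title 'ask', 'show', or 'other' using a case-insensitive prefix check."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles, apart from the third, which appears in the output above.
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: A tiny CSV viewer',
    'Interactive Dynamic Video',
    'Why I ask HN for advice',
]

labels = [classify_title(t) for t in sample_titles]
print(labels)  # ['ask', 'show', 'other', 'other']
```

The last title contains "ask HN" but does not start with it, so it is counted with the other posts.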


Next, let's determine if ask posts or show posts receive more comments on average.

To avoid repeating the same code for each type of post, we can create a function `avg_no_of()`. It takes a list of lists (the rows for `"Ask HN"` or `"Show HN"` posts) as its first parameter and, as its second parameter, the index of the column to average (here index `4`, the `num_comments` column). It sums the values in that column across all rows and returns the total divided by the number of rows.

In [3]:
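def avg_no_of(rows, col):
    # Sum the integer values found in column `col` across all rows
    total = 0
    for each_row in rows:
        value = int(each_row[col])
        total += value

    # Assumes `rows` is not empty; an empty list would raise ZeroDivisionError
    avg = total/len(rows)
    return avg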

avg_ask_comments = avg_no_of(ask_posts, 4)
avg_show_comments = avg_no_of(show_posts, 4)

print("Average comments for posts asking ('Ask HN'): " + str(round(avg_ask_comments, 2)))
print("Average comments for posts showing ('Show HN'): " + str(round(avg_show_comments,2)))

Average comments for posts asking ('Ask HN'): 14.04
Average comments for posts showing ('Show HN'): 10.32


From the output above, `"Ask HN"` posts receive more comments on average (**14.04**) than `"Show HN"` posts (**10.32**).
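To make the averaging arithmetic concrete, here is a small worked sketch on three made-up rows. The function name `avg_of_column` is hypothetical; it follows the same idea as `avg_no_of()` but returns `None` for an empty list instead of raising an error, which is an assumption about how one might want to handle that case.

```python
def avg_of_column(rows, col):
    """Average of the integer values in column `col`; None if `rows` is empty."""
    if not rows:
        return None
    total = sum(int(row[col]) for row in rows)
    return total / len(rows)


# Made-up rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ['1', 'Ask HN: A', '', '10', '4', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: B', '', '20', '6', 'user_b', '1/1/2016 11:00'],
    ['3', 'Ask HN: C', '', '30', '11', 'user_c', '1/1/2016 12:00'],
]

print(avg_of_column(sample_rows, 4))  # (4 + 6 + 11) / 3 = 7.0
print(avg_of_column(sample_rows, 3))  # (10 + 20 + 30) / 3 = 20.0
print(avg_of_column([], 4))           # None
```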

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts. Next, we'll determine if ask posts created at a certain *time* are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.

2. Calculate the average number of comments ask posts receive by hour created.

Because we are working with times, we import the `datetime` module in the next cell. We then count the `"Ask HN"` posts created in each hour and store the counts in `counts_by_hour`, and we total the comments those posts received in each hour and store the totals in `comments_by_hour`.

In [4]:
import datetime as dt

result_list = []

for each_row in ask_posts:
    temp_list = [each_row[6], int(each_row[4])]
    result_list.append(temp_list)
    
counts_by_hour = {}
comments_by_hour = {}

for elem in result_list:
    dt_obj = dt.datetime.strptime(elem[0], "%m/%d/%Y %H:%M")
    hour = dt_obj.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = elem[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += elem[1]
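The parsing and accumulation steps can be sketched on a handful of values. The pairs below are illustrative (the last timestamp is made up), and the names `sample`, `counts`, and `comments` are hypothetical. The sketch uses `dict.get()` with a default of `0`, which is one possible alternative to the `if`/`else` branch above.

```python
import datetime as dt

# Illustrative [created_at, num_comments] pairs; the last one is made up.
sample = [
    ['8/4/2016 11:52', 52],
    ['9/30/2015 4:12', 2],
    ['6/17/2016 0:01', 1],
    ['1/26/2016 11:05', 10],
]

counts = {}    # number of posts per hour
comments = {}  # total comments per hour

for created_at, n_comments in sample:
    # Parse the timestamp, then keep only the zero-padded hour as a string
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] = counts.get(hour, 0) + 1
    comments[hour] = comments.get(hour, 0) + n_comments

print(counts)    # {'11': 2, '04': 1, '00': 1}
print(comments)  # {'11': 62, '04': 2, '00': 1}
```

Note that `strftime("%H")` zero-pads the hour, so `4:12` becomes the key `'04'`.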

Now that we have the number of `"Ask HN"` posts and comments for each hour, we can print them to see the distribution. The distribution is easier to read when it is sorted by total in descending order.

Below, we create a function `print_sorted_dict_by_val_desc()` that takes a dictionary as its parameter and prints its entries in descending order of value. We can use it to interpret the data in `counts_by_hour` and `comments_by_hour` more easily.

In [5]:
def print_sorted_dict_by_val_desc(input_dict):
    converted_list = []
    for k,v in input_dict.items():
        a, b = v, k
        converted_list.append([a,b])
    
    for elem in sorted(converted_list, reverse=True):
        print('{}: {}'.format(elem[1],elem[0]))
    
print("Number of posts by hour [High->Low]: ") 
print_sorted_dict_by_val_desc(counts_by_hour)
    
print("\n")
    
print("Number of comments by hour [High->Low]: ")    
print_sorted_dict_by_val_desc(comments_by_hour)

Number of posts by hour [High->Low]: 
15: 116
19: 110
21: 109
18: 109
16: 108
14: 107
17: 100
13: 85
20: 80
12: 73
22: 71
23: 68
01: 60
10: 59
11: 58
02: 58
00: 55
03: 54
08: 48
04: 47
05: 46
09: 45
06: 44
07: 34


Number of comments by hour [High->Low]: 
15: 4477
16: 1814
21: 1745
20: 1722
18: 1439
14: 1416
02: 1381
13: 1253
19: 1188
17: 1146
10: 793
12: 687
01: 683
11: 641
23: 543
08: 492
22: 479
05: 464
00: 447
03: 421
06: 397
04: 337
07: 267
09: 251


From the distribution of `"Ask HN"` posts and comments by hour (printed above), we can see a pattern. In the morning and late at night, we see less activity (every hour from 03:00 through 09:00 has fewer than 60 posts). From the afternoon through the evening, we see more activity (every hour from 13:00 through 21:00 has at least 80 posts).
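As an aside on the sorting step, a dictionary can also be ordered by value without first building a list of swapped pairs, by passing a `key` function to `sorted()`. The sketch below uses four counts copied from the output above; `sample_counts` and `ranked` are hypothetical names.

```python
# Four hour counts copied from the "Number of posts by hour" output
sample_counts = {'15': 116, '19': 110, '07': 34, '21': 109}

# Sort the (hour, count) pairs by count, highest first
ranked = sorted(sample_counts.items(), key=lambda item: item[1], reverse=True)

for hour, value in ranked:
    print('{}: {}'.format(hour, value))
```

One difference worth knowing: with a `key` function, hours with equal values keep their original dictionary order, whereas sorting `[value, hour]` lists breaks ties by the hour string.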

Let's now calculate the average number of comments per post for posts created during each hour of the day. We can create a list of lists named `avg_by_hour`, where each inner list has two elements: the hour, followed by the average number of comments per post in that hour.

In [6]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_cmnts_per_post = round(comments_by_hour[hour]/counts_by_hour[hour],2)
    avg_by_hour.append([hour, avg_cmnts_per_post])
    
avg_by_hour

[['09', 5.58],
 ['13', 14.74],
 ['10', 13.44],
 ['14', 13.23],
 ['16', 16.8],
 ['23', 7.99],
 ['12', 9.41],
 ['17', 11.46],
 ['15', 38.59],
 ['21', 16.01],
 ['20', 21.52],
 ['02', 23.81],
 ['18', 13.2],
 ['03', 7.8],
 ['05', 10.09],
 ['19', 10.8],
 ['01', 11.38],
 ['22', 6.75],
 ['08', 10.25],
 ['04', 7.17],
 ['00', 8.13],
 ['06', 9.02],
 ['07', 7.85],
 ['11', 11.05]]

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

Below, we create a list `swap_avg_by_hour` that equals `avg_by_hour` with swapped columns.

In [7]:
swap_avg_by_hour = []

for elem in avg_by_hour:
    temp_list = [elem[1], elem[0]]
    swap_avg_by_hour.append(temp_list)
    
swap_avg_by_hour

[[5.58, '09'],
 [14.74, '13'],
 [13.44, '10'],
 [13.23, '14'],
 [16.8, '16'],
 [7.99, '23'],
 [9.41, '12'],
 [11.46, '17'],
 [38.59, '15'],
 [16.01, '21'],
 [21.52, '20'],
 [23.81, '02'],
 [13.2, '18'],
 [7.8, '03'],
 [10.09, '05'],
 [10.8, '19'],
 [11.38, '01'],
 [6.75, '22'],
 [10.25, '08'],
 [7.17, '04'],
 [8.13, '00'],
 [9.02, '06'],
 [7.85, '07'],
 [11.05, '11']]

We can now use the `sorted()` [function](https://docs.python.org/3/library/functions.html#sorted) to sort `swap_avg_by_hour` in descending order. Since the first column of this list is the average number of comments, sorting the list will sort by the average number of comments. We can save the sorted result in `sorted_swap`.

Finally, we can print the top 5 hours for `"Ask HN"` posts' comments.

In [8]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments: ")
print("\n")
for elem in sorted_swap[:5]:
    print('{}: {:.2f} average comments per post'.format(dt.datetime.strptime(elem[1], "%H").strftime("%H:%M"), elem[0]))

Top 5 Hours for Ask Posts Comments: 


15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Based on these averages, ask posts created during the following hours received the most comments per post (*Eastern Time US* [as per the documentation](https://www.kaggle.com/hacker-news/hacker-news-posts)):
- 15:00
- 02:00
- 20:00
- 16:00
- 21:00
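The averaging and ranking steps can be combined in a compact sketch. The totals and counts below are copied from the outputs above for four hours only; the dictionary comprehension and the names `sample_comments`, `sample_counts`, `avg`, and `top` are illustrative choices, not the method used in the cells above.

```python
# Comment totals and post counts for four hours, copied from the outputs above
sample_comments = {'15': 4477, '02': 1381, '20': 1722, '09': 251}
sample_counts = {'15': 116, '02': 58, '20': 80, '09': 45}

# Average comments per post for each hour
avg = {hour: sample_comments[hour] / sample_counts[hour] for hour in sample_comments}

# Keep the three hours with the highest averages
top = sorted(avg.items(), key=lambda item: item[1], reverse=True)[:3]

for hour, value in top:
    print('{}:00: {:.2f} average comments per post'.format(hour, value))
```

With these four hours, the ranking comes out as 15, 02, 20, which matches the order of the first three entries in the full result.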

Next, let's determine if ask posts or show posts receive more points on average. We can reuse the `avg_no_of()` function that calculated the average comments earlier. This time we pass `3` as the second argument, as it is the index of the `num_points` column in the dataset.

In [9]:
avg_ask_points = avg_no_of(ask_posts, 3)
avg_show_points = avg_no_of(show_posts, 3)

print("Average points for posts asking ('Ask HN'): " + str(round(avg_ask_points, 2)))
print("Average points for posts showing ('Show HN'): " + str(round(avg_show_points,2)))

Average points for posts asking ('Ask HN'): 15.06
Average points for posts showing ('Show HN'): 27.56


From the output above, `"Ask HN"` posts receive fewer points on average (**15.06**) than `"Show HN"` posts (**27.56**). For average comments, it was the opposite.

Since show posts receive more points on average, we'll focus our remaining analysis just on these posts. Next, we'll determine if show posts created at a certain time are more likely to attract points. First, let's find the total points for each hour below.

In [13]:
result_list = []

for each_row in show_posts:
    temp_list = [each_row[6], int(each_row[3])]
    result_list.append(temp_list)
    
points_by_hour = {}

for elem in result_list:
    dt_obj = dt.datetime.strptime(elem[0], "%m/%d/%Y %H:%M")
    hour = dt_obj.strftime("%H")
    if hour not in points_by_hour:
        points_by_hour[hour] = elem[1]
    else:
        points_by_hour[hour] += elem[1]
        
print("Number of points by hour [High->Low]: ")    
print_sorted_dict_by_val_desc(points_by_hour)

Number of points by hour [High->Low]: 
16: 2634
12: 2543
17: 2521
13: 2438
15: 2228
18: 2215
14: 2187
22: 1856
20: 1819
19: 1702
23: 1526
11: 1480
00: 1173
21: 866
01: 700
10: 681
03: 679
09: 553
08: 519
07: 494
04: 386
06: 375
02: 340
05: 104


Let's now calculate the average number of points per post for posts created during each hour of the day. We can create a list of lists named `avg_pts_by_hour`, where each inner list has two elements: the hour, followed by the average points per post in that hour.

In [14]:
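avg_pts_by_hour = []

# Caution: `counts_by_hour` was built from `ask_posts` in an earlier cell,
# so this divides show-post points by ask-post counts for each hour.
for hour in points_by_hour:
    avg_pts_per_post = round(points_by_hour[hour]/counts_by_hour[hour],2)
    avg_pts_by_hour.append([hour, avg_pts_per_post])
    
avg_pts_by_hour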

[['14', 20.44],
 ['22', 26.14],
 ['18', 20.32],
 ['07', 14.53],
 ['20', 22.74],
 ['05', 2.26],
 ['16', 24.39],
 ['19', 15.47],
 ['15', 19.21],
 ['03', 12.57],
 ['17', 25.21],
 ['06', 8.52],
 ['02', 5.86],
 ['13', 28.68],
 ['08', 10.81],
 ['21', 7.94],
 ['04', 8.21],
 ['11', 25.52],
 ['12', 34.84],
 ['23', 22.44],
 ['09', 12.29],
 ['01', 11.67],
 ['10', 11.54],
 ['00', 21.33]]

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

Below, we create a list `swap_avg_pts_by_hour` that equals `avg_pts_by_hour` with swapped columns.

In [15]:
swap_avg_pts_by_hour = []

for elem in avg_pts_by_hour:
    temp_list = [elem[1], elem[0]]
    swap_avg_pts_by_hour.append(temp_list)
    
swap_avg_pts_by_hour

[[20.44, '14'],
 [26.14, '22'],
 [20.32, '18'],
 [14.53, '07'],
 [22.74, '20'],
 [2.26, '05'],
 [24.39, '16'],
 [15.47, '19'],
 [19.21, '15'],
 [12.57, '03'],
 [25.21, '17'],
 [8.52, '06'],
 [5.86, '02'],
 [28.68, '13'],
 [10.81, '08'],
 [7.94, '21'],
 [8.21, '04'],
 [25.52, '11'],
 [34.84, '12'],
 [22.44, '23'],
 [12.29, '09'],
 [11.67, '01'],
 [11.54, '10'],
 [21.33, '00']]

We can now use the `sorted()` [function](https://docs.python.org/3/library/functions.html#sorted) to sort `swap_avg_pts_by_hour` in descending order. Since the first column of this list is the average number of points, sorting the list will sort by the average number of points. We can save the sorted result in `sorted_swap_points`.

Finally, we can print the top 5 hours for `"Show HN"` posts' points.

In [17]:
sorted_swap_points = sorted(swap_avg_pts_by_hour, reverse=True)

print("Top 5 Hours for Show Posts Points: ")
print("\n")
for elem in sorted_swap_points[:5]:
    print('{}: {:.2f} average points per post'.format(dt.datetime.strptime(elem[1], "%H").strftime("%H:%M"), elem[0]))

Top 5 Hours for Show Posts Points: 


12:00: 34.84 average points per post
13:00: 28.68 average points per post
22:00: 26.14 average points per post
11:00: 25.52 average points per post
17:00: 25.21 average points per post


From the above data, it can be concluded that to have a higher chance of receiving points, a show post should be created in the following hours (*Eastern Time US* [as per the documentation](https://www.kaggle.com/hacker-news/hacker-news-posts)):

- 12:00
- 13:00
- 22:00
- 11:00
- 17:00
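One detail in the show-post cells is worth a closer look: the per-hour averages divide `points_by_hour` by `counts_by_hour`, and `counts_by_hour` was built from ask posts. A per-hour average for show posts would need the number of show posts in each hour. The sketch below shows one way that could look, using made-up rows; `sample_show`, `show_counts_by_hour`, `show_points_by_hour`, and `avg_show_pts_by_hour` are hypothetical names, and the numbers are not taken from the dataset.

```python
import datetime as dt

# Made-up show-post rows for illustration: [created_at, num_points]
sample_show = [
    ['8/4/2016 12:10', 30],
    ['8/5/2016 12:45', 10],
    ['8/6/2016 17:20', 8],
]

show_counts_by_hour = {}  # number of show posts per hour
show_points_by_hour = {}  # total points of show posts per hour

for created_at, points in sample_show:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    show_counts_by_hour[hour] = show_counts_by_hour.get(hour, 0) + 1
    show_points_by_hour[hour] = show_points_by_hour.get(hour, 0) + points

# Divide show-post points by show-post counts for the same hour
avg_show_pts_by_hour = {
    hour: show_points_by_hour[hour] / show_counts_by_hour[hour]
    for hour in show_points_by_hour
}

print(avg_show_pts_by_hour)  # {'12': 20.0, '17': 8.0}
```

Applied to the real `show_posts` rows, the same pattern would likely change the averages and possibly the ranking of hours.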

# Conclusion

In this project, we analyzed ask posts and show posts to determine which type of post and which posting time receive the most comments and the most points on average.

Based on our analysis, to maximize the number of comments a post receives, we'd recommend that the post be categorized as an ask post and created between 15:00 and 16:00 (3:00 pm - 4:00 pm Eastern Time).

On the other hand, show posts receive more points on average than ask posts. Based on our analysis, to maximize the number of points a post receives, we'd recommend that the post be categorized as a show post and created between 12:00 and 13:00 (12:00 pm - 1:00 pm Eastern Time).

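However, it should be noted that the data set we analyzed excluded posts without any comments. In addition, the per-hour point averages for show posts were computed with `counts_by_hour`, which counts ask posts, so the show-post timing recommendation should be rechecked against show-post counts.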