# Exploring Hacker News Posts

## Introduction
> “Hacker News” is a popular social news website in the programming community that focuses on computer science and entrepreneurship. Its format is similar to Reddit’s: users post technology-related stories, which are then voted on. Two notable kinds of posts are “Ask HN” posts, which ask the Hacker News community a question, and “Show HN” posts, which show the community a project, product, or something generally interesting. The aim of this project is to compare these two types of posts and, more specifically, to answer the following questions:
> - Do “Ask HN” or “Show HN” posts receive more comments on average?
> - Do posts created at a certain time receive more comments on average?
> - Do “Ask HN” or “Show HN” posts receive more points on average?
> - Do posts created at a certain time receive more points on average?

> The dataset we’ll be working with throughout this project has been filtered to only include post submissions that received comments, resulting in an omission of 280,000 rows. The first part of this notebook will focus on preparing and cleaning the dataset for analysis, while the second part will center on analyzing the cleaned data and answering the above questions.

In [1]:
# Importing Libraries
from csv import reader
import pandas as pd

In [2]:
# Loading the dataset

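def open_file(filename):
    # Read the CSV into a list of rows; the context manager closes the file afterwards
    with open(filename, newline="") as f:
        return list(reader(f))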

hackernews_data = open_file("/Users/omarstinner/Data Files/Python Projects/Files/Guided Project - Exploring Hacker News Posts/hacker_news.csv")
header = hackernews_data[0]

## Part 1: Cleaning The Dataset

In [3]:
def title_seperator(title):
    return [row for row in hackernews_data[1:] if (row[1].lower()).startswith(title)]

ask_posts = title_seperator("ask hn")
show_posts = title_seperator("show hn")

> **Function:** Separates "Ask HN" and "Show HN" posts from the rest of the dataset by matching the start of each title, ignoring case.
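
As a minimal sketch of the same prefix filter, the example below runs it on a few made-up rows. The sample rows and the helper name `filter_by_prefix` are hypothetical and only mirror the dataset's column order. It illustrates that the match is case-insensitive and applies only to the start of the title.

```python
# Hypothetical sample rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ["1", "Ask HN: How do you learn Rust?", "", "5", "3", "alice", "8/4/2016 11:52"],
    ["2", "Show HN: My weekend project", "http://example.com", "12", "4", "bob", "1/26/2016 19:30"],
    ["3", "ASK HN: Best laptop for coding?", "", "2", "1", "carol", "6/23/2016 22:20"],
    ["4", "Why I ask HN before buying", "http://example.org", "7", "2", "dave", "5/2/2016 10:14"],
]

def filter_by_prefix(rows, prefix):
    """Return rows whose title (column 1) starts with prefix, ignoring case."""
    return [row for row in rows if row[1].lower().startswith(prefix)]

sample_ask = filter_by_prefix(sample_rows, "ask hn")
sample_show = filter_by_prefix(sample_rows, "show hn")

print([row[0] for row in sample_ask])   # ids of the "Ask HN" rows
print([row[0] for row in sample_show])  # ids of the "Show HN" rows
```

Row 4 mentions "ask HN" in the middle of its title, so it is not picked up by either filter.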

In [4]:
from datetime import datetime

def number_of_comments_and_posts(title):
    return sum([int(row[4]) for row in hackernews_data[1:] if (row[1].lower()).startswith(title)]), sum([1 for row in hackernews_data[1:] if (row[1].lower()).startswith(title)]) 

number_of_ask_comments,number_of_ask_posts = number_of_comments_and_posts("ask hn")
number_of_show_comments,number_of_show_posts = number_of_comments_and_posts("show hn")

# Replace the "created_at" value (the last column) with its hour and keep the other columns unchanged
ready_ask_posts = [row[:-1] + [datetime.strptime(row[-1], "%m/%d/%Y %H:%M").hour] for row in ask_posts]
ready_ask_posts.insert(0, header)

ready_show_posts = [row[:-1] + [datetime.strptime(row[-1], "%m/%d/%Y %H:%M").hour] for row in show_posts]
ready_show_posts.insert(0, header)

> **Function:** Returns two values for the given post type ("Ask HN" or "Show HN"): 1. The total number of comments 2. The total number of posts

> **What's Happening?** After passing the dataset through the function, we see that "Ask HN" posts are more popular than "Show HN" posts. We nonetheless keep both types so that they can be compared in Part 2. To prepare the "ask_posts" and "show_posts" datasets for conversion to pandas DataFrames, we first replace each "created_at" value with just its hour and then add the header row back.
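
The hour extraction can be sketched on a single row as follows. The timestamp and the row are made-up values in the same "month/day/year hour:minute" layout that the format string above expects. This is only an illustration, not a row from the dataset.

```python
from datetime import datetime

# Hypothetical row in the dataset's column order, with "created_at" last
row = ["42", "Ask HN: Example question?", "", "6", "2", "someuser", "8/16/2016 9:55"]

# strptime accepts month, day, and hour values without zero padding
parsed = datetime.strptime(row[-1], "%m/%d/%Y %H:%M")
hour = parsed.hour

# Build a new row in which the timestamp is replaced by the hour only
row_with_hour = row[:-1] + [hour]

print(hour)
print(row_with_hour)
```

Slicing with `row[:-1]` leaves the original row untouched, so the raw timestamp is still available if it is needed later.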

In [5]:
# Converting ready_ask_posts into a DataFrame
pandas_ask_posts = pd.DataFrame(ready_ask_posts[1:], columns = ready_ask_posts[0])
pandas_ask_posts["num_comments"] = pandas_ask_posts["num_comments"].astype(int)

# Converting ready_show_posts into a DataFrame
pandas_show_posts = pd.DataFrame(ready_show_posts[1:], columns = ready_show_posts[0])
pandas_show_posts["num_comments"] = pandas_show_posts["num_comments"].astype(int)

> **What's Happening?** We first convert the "ready_ask_posts" and "ready_show_posts" datasets into pandas DataFrames. We then convert all the values in the "num_comments" column to integers, because the CSV reader returns every field as a string. This allows us to perform mathematical operations on them.
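
A small sketch of why the conversion matters, using a made-up three-row DataFrame (the values are hypothetical): once the strings are converted to integers, numeric aggregations such as the sum and mean work as expected.

```python
import pandas as pd

# Hypothetical values; csv.reader returns every field as a string
df = pd.DataFrame({"num_comments": ["6", "29", "1"]})

# Convert the string column to integers so that arithmetic is numeric
df["num_comments"] = df["num_comments"].astype(int)

total = df["num_comments"].sum()
average = df["num_comments"].mean()

print(df["num_comments"].dtype)
print(total, average)
```

Note that `astype(int)` raises a `ValueError` if any value is not a valid integer string, so it also acts as a basic sanity check on the column.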

## Part 2: Analyzing The Data

#### Do “Ask HN” or “Show HN” posts receive more comments on average?

In [6]:
print(pandas_ask_posts["num_comments"].mean())
print(pandas_show_posts["num_comments"].mean())

14.038417431192661
10.31669535283993


> After calculating the average number of comments for both types of posts, we see that "Ask HN" posts receive around 14 comments per post on average, while "Show HN" posts receive around 10. One plausible explanation is that people are more inclined to respond to a direct question than to a post that merely shows someone's work.

#### Do posts created at a certain time receive more comments on average?

In [7]:
# Creating a dictionary storing the average number of comments per post for each hour, for "Ask HN" and "Show HN" posts
hour_average_comments_per_post_ask = {k : (pandas_ask_posts.loc[pandas_ask_posts["created_at"] == k, "num_comments"].sum())/(pandas_ask_posts["created_at"] == k).sum() for k in set(pandas_ask_posts["created_at"].tolist())}
hour_average_comments_per_post_show = {k : (pandas_show_posts.loc[pandas_show_posts["created_at"] == k, "num_comments"].sum())/(pandas_show_posts["created_at"] == k).sum() for k in set(pandas_show_posts["created_at"].tolist())}

# Sorting the dictionaries by average number of comments, highest first
sorted_ask_hour_avg = dict(sorted(hour_average_comments_per_post_ask.items(), key = lambda x: x[1], reverse = True))
sorted_show_hour_avg = dict(sorted(hour_average_comments_per_post_show.items(), key = lambda x: x[1], reverse = True))

print("Average Comments per Hour: 'Ask HN' posts:")
for k,v in sorted_ask_hour_avg.items():
    print(k, ":", v)

print("\n")

print("Average Comments per Hour: 'Show HN' posts:")
for k,v in sorted_show_hour_avg.items():
    print(k, ":", v)

Average Comments per Hour: 'Ask HN' posts:
15 : 38.5948275862069
2 : 23.810344827586206
20 : 21.525
16 : 16.796296296296298
21 : 16.009174311926607
13 : 14.741176470588234
10 : 13.440677966101696
14 : 13.233644859813085
18 : 13.20183486238532
17 : 11.46
1 : 11.383333333333333
11 : 11.051724137931034
19 : 10.8
8 : 10.25
5 : 10.08695652173913
12 : 9.41095890410959
6 : 9.022727272727273
0 : 8.127272727272727
23 : 7.985294117647059
7 : 7.852941176470588
3 : 7.796296296296297
4 : 7.170212765957447
22 : 6.746478873239437
9 : 5.5777777777777775


Average Comments per Hour: 'Show HN' posts:
18 : 15.770491803278688
0 : 15.709677419354838
14 : 13.44186046511628
23 : 12.416666666666666
22 : 12.391304347826088
12 : 11.80327868852459
16 : 11.655913978494624
7 : 11.5
11 : 11.159090909090908
3 : 10.62962962962963
20 : 10.2
19 : 9.8
17 : 9.795698924731182
9 : 9.7
13 : 9.555555555555555
4 : 9.5
6 : 8.875
1 : 8.785714285714286
10 : 8.25
15 : 8.102564102564102
21 : 5.787234042553192
8 : 4.852941176470588

> For "Ask HN" posts, the hour with the most comments per post is 15:00, with an average of around 39 comments. Interestingly, 2:00 also seems to be a very active hour for "Ask HN" posts, with an average of around 24 comments. Posts created at 10:00, 13:00, and 14:00 all receive a similar average (roughly 13 to 15 comments). As expected, "Show HN" posts do not get nearly as much attention as "Ask HN" posts. Their most active hours are 18:00 and 0:00, each with an average of around 16 comments. If we assume that many of the questions are educational and come from students, it would seem that people get more responses to their questions towards the end of the school day (a time when users can log in and answer questions).
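
The per-hour averages could also be computed with a pandas `groupby`, which is one possible alternative to the dictionary comprehension used in the cell above. The sketch below uses a made-up five-row DataFrame (the name `posts` and its values are hypothetical) with the same two column names.

```python
import pandas as pd

# Hypothetical miniature of the prepared data: hour of creation and comment count
posts = pd.DataFrame({
    "created_at": [15, 15, 2, 2, 9],
    "num_comments": [40, 20, 10, 30, 5],
})

# Mean number of comments per hour, sorted from highest to lowest
avg_by_hour = (
    posts.groupby("created_at")["num_comments"]
    .mean()
    .sort_values(ascending=False)
)

for hour, avg in avg_by_hour.items():
    print(hour, ":", avg)
```

The result is a Series indexed by hour, so it can be sorted, sliced, or plotted without building an intermediate dictionary.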

#### Do “Ask HN” or “Show HN” posts receive more points on average?

In [8]:
# Converting the "num_points" column to an integer type so that we can perform math operations on it
pandas_ask_posts["num_points"] = pandas_ask_posts["num_points"].astype(int)
pandas_show_posts["num_points"] = pandas_show_posts["num_points"].astype(int)

print(pandas_ask_posts["num_points"].mean())
print(pandas_show_posts["num_points"].mean())

15.061926605504587
27.555077452667813


> Discussion communities are places that are meant to be respected and treated professionally. On other discussion platforms, users can "downvote" posts that are deemed irrelevant, inappropriate, or repetitive. Hacker News does not offer such a feature for posts, so we will instead treat a lack of points as the equivalent of such "downvotes". Repetitive questions are penalized on popular discussion boards, and we could be observing the same pattern in the Hacker News community. The results show that "Ask HN" posts receive only around 15 points on average, which could mean that users are asking repetitive or irrelevant questions, leading to a lower point count. "Show HN" posts, on the other hand, receive almost double that, at around 28 points on average. This is a reasonable outcome, as people are more inclined to upvote posts in which users share their interesting projects or findings.
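
The size of the gap can be checked with a short calculation. The two averages below are copied from the output of cell In [8]; the variable names are only illustrative.

```python
# Averages copied from the printed output of cell In [8]
ask_avg_points = 15.061926605504587
show_avg_points = 27.555077452667813

# How many times more points "Show HN" posts receive on average
ratio = show_avg_points / ask_avg_points

print(round(ratio, 2))  # about 1.83
```

A ratio of roughly 1.83 is consistent with describing the "Show HN" average as almost double the "Ask HN" average.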

#### Do posts created at a certain time receive more points on average?

In [9]:
# Creating a dictionary storing the average number of points per post for each hour, for "Ask HN" and "Show HN" posts
hour_average_points_per_post_ask = {k : (pandas_ask_posts.loc[pandas_ask_posts["created_at"] == k, "num_points"].sum())/(pandas_ask_posts["created_at"] == k).sum() for k in set(pandas_ask_posts["created_at"].tolist())}
hour_average_points_per_post_show = {k : (pandas_show_posts.loc[pandas_show_posts["created_at"] == k, "num_points"].sum())/(pandas_show_posts["created_at"] == k).sum() for k in set(pandas_show_posts["created_at"].tolist())}

# Sorting the dictionaries by average number of points, highest first
sorted_ask_hour_avg = dict(sorted(hour_average_points_per_post_ask.items(), key = lambda x: x[1], reverse = True))
sorted_show_hour_avg = dict(sorted(hour_average_points_per_post_show.items(), key = lambda x: x[1], reverse = True))

print("Average Points per Hour: 'Ask HN' posts:")
for k,v in sorted_ask_hour_avg.items():
    print(k, ":", v)

print("\n")

print("Average Points per Hour: 'Show HN' posts:")
for k,v in sorted_show_hour_avg.items():
    print(k, ":", v)

Average Points per Hour: 'Ask HN' posts:
15 : 29.99137931034483
13 : 24.258823529411764
16 : 23.35185185185185
17 : 19.41
10 : 18.677966101694917
18 : 15.972477064220184
21 : 15.788990825688073
20 : 14.3875
11 : 14.224137931034482
19 : 13.754545454545454
2 : 13.672413793103448
6 : 13.431818181818182
5 : 12.0
14 : 11.981308411214954
1 : 11.666666666666666
8 : 10.729166666666666
12 : 10.712328767123287
7 : 10.617647058823529
23 : 8.544117647058824
4 : 8.27659574468085
0 : 8.2
9 : 7.311111111111111
22 : 7.197183098591549
3 : 6.925925925925926


Average Points per Hour: 'Show HN' posts:
23 : 42.388888888888886
12 : 41.68852459016394
22 : 40.34782608695652
0 : 37.83870967741935
18 : 36.31147540983606
11 : 33.63636363636363
19 : 30.945454545454545
20 : 30.316666666666666
15 : 28.564102564102566
16 : 28.322580645161292
17 : 27.107526881720432
14 : 25.430232558139537
3 : 25.14814814814815
1 : 25.0
13 : 24.626262626262626
6 : 23.4375
7 : 19.0
10 : 18.916666666666668
9 : 18.433333333333334
21 : 

> "Ask HN" posts receive their highest average, around 30 points, at 15:00. As expected, "Show HN" posts receive more points than "Ask HN" posts: comparing each type's best-performing hour, "Show HN" posts average about 41% more points (roughly 42 points at 23:00 vs. 30 points at 15:00). For "Show HN" posts, 12:00 and 22:00 also had very high average points. As previously discussed, posts about interesting projects are more likely to receive upvotes than question posts, good or bad.

## Conclusion
> Throughout this project, we explored the different characteristics of "Ask HN" and "Show HN" posts. We determined that "Ask HN" posts tend to receive more comments, while "Show HN" posts tend to receive more points. Both findings are plausible. The expected result of a pressing question is a rapid response (in our case, a comment), which would explain the higher average comment count for "Ask HN" posts. Users are also more likely to upvote posts about interesting topics (an upvote being the equivalent of a "like" on Instagram or Facebook), which would explain the higher average point count for "Show HN" posts.
