# Exploring Hacker News posts

## Dataset

This dataset contains information about posts from [Hacker News](https://www.ycombinator.com/), a Reddit-style, technology- and startup-oriented website where users can submit stories (a.k.a. posts) and receive comments and votes.

The [source dataset](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts) has been randomly down-sampled to about 20,000 rows after removing all submissions without any comments.

The dataset contains the following columns:
- **id** - the unique identifier from Hacker News for the post;
- **title** - the title of the post;
- **url** - the URL that the post links to, if the post has a URL;
- **num_points** - the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes;
- **num_comments** - the number of comments on the post;
- **author** - the username of the person who submitted the post;
- **created_at** - the date and time of the post's submission.
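
The code later in this notebook addresses columns by position. As a minimal sketch, assuming the CSV columns appear in the order listed above, the positions can be derived from the names rather than hard-coded. The names `COLUMNS` and `COLUMN_INDEX` are introduced here for illustration only.

```python
# Column names in the order listed above (assumed to match the CSV header).
COLUMNS = ["id", "title", "url", "num_points", "num_comments", "author", "created_at"]

# Hypothetical helper mapping: column name -> zero-based position in a row.
COLUMN_INDEX = {name: position for position, name in enumerate(COLUMNS)}

print(COLUMN_INDEX["title"])         # 1
print(COLUMN_INDEX["num_points"])    # 3
print(COLUMN_INDEX["num_comments"])  # 4
print(COLUMN_INDEX["created_at"])    # 6
```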


## Research problem

This analysis focuses on two kinds of posts:
- **Ask HN** - created by users to ask the Hacker News community a specific question;
- **Show HN** - made to show the community a project, a product, or something else worth taking a look at.

The questions we are to answer are:
1. Which posts on average receive more comments: **Ask HN** or **Show HN**?
2. Do **Ask HN** posts created at a certain time receive more comments on average?

Additional problems considered in this notebook:
1. The average number of comments per hour the **Show HN** posts receive.
2. Which posts on average receive more points: **Ask HN** or **Show HN**?
3. Determine if posts created at a certain time are more likely to receive points.

## Dataset loading

In [1]:
from csv import reader

# Pass the encoding explicitly (the file is assumed to be UTF-8). Relying on the
# platform default raised:
# UnicodeDecodeError: 'charmap' codec can't decode byte 0x83 in position 5227
with open("hacker_news.csv", encoding="utf-8") as file:
    hn = list(reader(file))

print("First five rows of the 'Hacker News' dataset:\n")

for row in hn[:5]:
    print(row)

## Data pre-processing

We will move the header row into a separate variable, **headers**, and keep only the data rows in **hn**.

In [None]:
headers = hn[0]
hn = hn[1:]

print(f"Headers: \n {headers}\n")
print(f"First five lines of the dataset without headers:\n {hn[:5]}\n")

All **Ask HN** posts will be extracted into the **ask_posts** variable and all **Show HN** posts into the **show_posts** variable.
All other posts will be stored in **other_posts**.

In [None]:
ask_posts = []
show_posts = []
other_posts = []

TITLE_IDX = 1
ASK_HN_PREFIX = "ask hn"
SHOW_HN_PREFIX = "show hn"

for row in hn:
    title = row[TITLE_IDX].lower()
    if title.startswith(ASK_HN_PREFIX):
        ask_posts.append(row)
    elif title.startswith(SHOW_HN_PREFIX):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(f"Number of ASK HN posts: {len(ask_posts)}")
print(f"Number of SHOW HN posts: {len(show_posts)}")
print(f"Number of posts outside of this category: {len(other_posts)}")

In [None]:
print(f"{show_posts[:5]}")
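
The prefix test used above can be tried in isolation. The sketch below is illustrative only: `classify_title` is a hypothetical helper, not part of the notebook's code, and the titles are made up. It shows that the match is case-insensitive and applies only to the start of the title.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Made-up titles, for illustration only.
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My weekend project",
    "A post that mentions Ask HN in the middle",
]
labels = [classify_title(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```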

## Research question no. 1

Below is the helper function that calculates the average number of comments per post in the supplied dataset.

In [None]:
def get_average_number_of_comments(dataset):
    total_comments = 0
    NUM_COMMENT_IDX = 4
    for row in dataset:
        total_comments += int(row[NUM_COMMENT_IDX])
    return total_comments/len(dataset)

The average number of comments per post for each of the specified datasets:

In [None]:
avg_ask_comments = get_average_number_of_comments(ask_posts)
avg_show_comments = get_average_number_of_comments(show_posts)

print(f"Average number of comments per ASK HN post: {avg_ask_comments}")
print(f"Average number of comments per SHOW HN post: {avg_show_comments}")

On average, **Ask HN** posts receive more comments than **Show HN** posts, but the two values seem to be comparable.

## Research question no. 2

Let us determine whether **Ask HN** posts created at certain hours of the day are more likely to attract comments. The function below calculates, for each hour of the day, the average number of comments per post in the specified subset of posts.

It is run here to calculate these hourly averages for the **Ask HN** posts.

In [None]:
import datetime as dt

def compute_avg_comments_per_hour(dataset):
    avg_by_hour = []
    result_list = []
    counts_by_hour = {}
    comments_by_hour = {}

    DATE_COLUMN = 6
    COMMENTS_COLUMN = 4

    for post in dataset:
        result_list.append( (dt.datetime.strptime(post[DATE_COLUMN], 
                                                 "%m/%d/%Y %H:%M"),
                             int(post[COMMENTS_COLUMN])))

    for element in result_list:
        hour = element[0].hour
        comments = element[1]
        counts_by_hour[hour] = counts_by_hour.get(hour,0) + 1
        comments_by_hour[hour] = comments_by_hour.get(hour,0) + comments
        
    for hour in comments_by_hour:
        avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour] ])
        
    return avg_by_hour

avg_by_hour = compute_avg_comments_per_hour(ask_posts)
print("Average comments per hour for the Ask HN posts:\n")
print(avg_by_hour)
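
The hour is taken from the parsed `created_at` value. As a minimal sketch, assuming timestamps follow the month/day/year hour:minute layout that the format string above expects (the sample value is made up), parsing a single timestamp looks like this:

```python
import datetime as dt

# Made-up timestamp in the layout the format string expects.
sample_created_at = "8/16/2016 9:55"

parsed = dt.datetime.strptime(sample_created_at, "%m/%d/%Y %H:%M")
print(parsed)       # 2016-08-16 09:55:00
print(parsed.hour)  # 9
```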

For readability, the output needs to be re-formatted and sorted by the average number of comments per post. The code that does this is wrapped in a function.

In [None]:
def print_avg_comments_per_hour_data(avg_by_hour):
    sorted_swap = sorted([ [elem[1], elem[0]] for elem in avg_by_hour], reverse = True)

    for elem in sorted_swap:
        hour = dt.datetime.strftime(dt.datetime(2025, 12, 12, hour=elem[1]), "%H:%M")
        print(f"{hour} -> {elem[0]:.2f}")
    
print(f"Average comments per hour for the Ask HN posts presented in readable format:\n")
print_avg_comments_per_hour_data(avg_by_hour)
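
The swap-and-sort step can also be written with a `key` function, which avoids reordering the columns. This is only a sketch on made-up `[hour, average]` pairs; `sample_avg_by_hour` and `sorted_by_average` are names introduced here for illustration.

```python
# Made-up [hour, average] pairs in the same shape as avg_by_hour.
sample_avg_by_hour = [[9, 1.5], [15, 4.25], [2, 3.0]]

# Sort by the average (second item), highest first, without swapping columns.
sorted_by_average = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

lines = [f"{hour:02d}:00 -> {average:.2f}" for hour, average in sorted_by_average]
for line in lines:
    print(line)
# 15:00 -> 4.25
# 02:00 -> 3.00
# 09:00 -> 1.50
```

One difference from the swap approach: when two hours have the same average, this version keeps them in their original order instead of ordering them by hour.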

The **Ask HN** posts that attract the most comments are created at 15:00 US Eastern Time, which corresponds to 21:00 Central European Time. Assuming this tendency continues, **Ask HN** posts should be created at that hour to attract the most comments.
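
As a quick check of the conversion between the two time zones, the sketch below uses fixed standard-time offsets (UTC-5 for Eastern, UTC+1 for Central European). This is an assumption: daylight saving time is ignored, and the variable names are introduced here for illustration.

```python
import datetime as dt

# Fixed standard-time offsets (assumption: daylight saving time is ignored).
EASTERN_STANDARD = dt.timezone(dt.timedelta(hours=-5), "EST")
CENTRAL_EUROPEAN = dt.timezone(dt.timedelta(hours=1), "CET")

# The calendar date is arbitrary; only the hour matters here.
posted_eastern = dt.datetime(2016, 1, 15, 15, 0, tzinfo=EASTERN_STANDARD)
posted_cet = posted_eastern.astimezone(CENTRAL_EUROPEAN)

print(posted_eastern.strftime("%H:%M %Z"))  # 15:00 EST
print(posted_cet.strftime("%H:%M %Z"))      # 21:00 CET
```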

## Additional problem 1: the average number of comments per hour the **Show HN** posts receive

To answer this question, we reuse the functions written for the previous research question.

In [None]:
avg_show_hn_comments_by_hour = compute_avg_comments_per_hour(show_posts)

print(f"Average comments per hour for the Show HN posts:\n")
print_avg_comments_per_hour_data(avg_show_hn_comments_by_hour)

As we can see, the average number of comments per post ranges from 3.05 to 15.77 depending on the hour of creation. The difference between the maximum and minimum values is:

In [None]:
difference = 15.77 - 3.05
# Round to two decimals so floating-point noise does not show up in the output.
print(f"The difference between max comments per hour and min comments per hour is equal to {difference:.2f}")
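
The two extremes can also be derived from the hourly averages instead of being typed in by hand. The sketch below uses made-up `[hour, average]` pairs in the shape returned by `compute_avg_comments_per_hour`; the values are not results from the dataset.

```python
# Made-up [hour, average] pairs in the same shape as the function's output.
sample_avg_by_hour = [[0, 4.0], [12, 10.5], [18, 7.25]]

averages = [average for _hour, average in sample_avg_by_hour]
highest = max(averages)
lowest = min(averages)
difference = highest - lowest

print(f"max = {highest:.2f}, min = {lowest:.2f}, difference = {difference:.2f}")
# max = 10.50, min = 4.00, difference = 6.50
```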

## Additional problem 2: whether **Ask HN** or **Show HN** posts receive more points on average

The following code calculates that:

In [None]:
ask_hn_points = [int(row[3]) for row in ask_posts]
avg_points_per_ask_hn_post = sum(ask_hn_points)/len(ask_hn_points)

print(f"Average number of points per Ask HN post is {avg_points_per_ask_hn_post}")


In [None]:
show_hn_points = [int(row[3]) for row in show_posts]
avg_points_per_show_hn_post = sum(show_hn_points)/len(show_hn_points)

print(f"Average number of points per Show HN post is {avg_points_per_show_hn_post}")


It turns out that **Show HN** posts receive almost twice as many points as **Ask HN** posts. This supports the statement that **Show HN** posts engage Hacker News users more, at least in terms of votes.

We can also investigate the average number of points per **other** post.

In [None]:
other_points = [int(row[3]) for row in other_posts]
avg_points_per_other_post = sum(other_points)/len(other_points)

print(f"Average number of points per Other post is {avg_points_per_other_post}")

It turns out that this value is significantly higher than the values for **Ask HN** and **Show HN** posts.
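
The three cells above repeat the same computation for different subsets. A minimal sketch of a shared helper is shown below. `average_of_column` is a hypothetical function, not part of the notebook, and the rows are made up; only the column position 3 for `num_points` is taken from the cells above.

```python
def average_of_column(rows, column_index):
    """Return the mean of an integer column, or None for an empty dataset."""
    if not rows:
        return None
    return sum(int(row[column_index]) for row in rows) / len(rows)

NUM_POINTS_IDX = 3  # position of num_points, as used in the cells above

# Made-up rows with the same seven-column layout as the dataset.
sample_rows = [
    ["1", "Ask HN: example", "", "4", "2", "alice", "1/1/2016 10:00"],
    ["2", "Show HN: example", "http://example.com", "10", "0", "bob", "1/1/2016 11:00"],
]
print(average_of_column(sample_rows, NUM_POINTS_IDX))  # 7.0
print(average_of_column([], NUM_POINTS_IDX))           # None
```

Returning `None` for an empty subset is a design choice made for this sketch; it avoids a `ZeroDivisionError` when a category has no posts.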

## Additional problem 3: determine if posts created at a certain time are more likely to receive more points.

For the specified dataset, the following function returns a dictionary whose keys are hours of the day and whose values are the average number of points for posts created during that hour. The result is sorted in descending order by the average number of points.

In [None]:
def compute_average_points_per_hour(dataset):
    date_points = [ ( dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M"), int(row[3])) for row in dataset]
    hour_points = [(row[0].hour, row[1]) for row in date_points]
    
    hour_to_sum_points = {}
    hour_to_num_posts = {}
    
    for row in hour_points:
        hour = row[0]
        points = row[1]
        
        hour_to_sum_points[hour] = hour_to_sum_points.get(hour,0) + points
        hour_to_num_posts[hour] = hour_to_num_posts.get(hour,0) + 1
        
    for key in hour_to_sum_points:
        hour_to_sum_points[key] = hour_to_sum_points[key] / hour_to_num_posts[key]
        
    return dict(sorted(hour_to_sum_points.items(), key = lambda item: item[1], reverse = True))
        
    
    

In [None]:
print("Average number of points per ASK HN post")
print(" --- ")
compute_average_points_per_hour(ask_posts)

In [None]:
print("Average number of points per Show HN posts")
print(" --- ")
compute_average_points_per_hour(show_posts)

In [None]:
print("Average number of points per other post")
print(" --- ")
compute_average_points_per_hour(other_posts)

It turns out that the hour at which a post is created is correlated with the number of points it receives:
1. for the **Ask HN** posts - the best hours are 15 and 13.
2. for the **Show HN** posts - the best hours are 23 and 12.
3. for the **Other** posts - these are 13 and 14.

It is worth noting that for the **Other** posts the tendency is weaker: the highest hourly average is 62.52 points and the lowest is 45.24, while for the **Ask HN** and **Show HN** posts the lowest hourly average does not exceed 7 points.
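
One way to put a number on how strong the tendency is for the **Other** posts is to express the gap between the extremes as a share of the maximum. The sketch below uses only the two values quoted above; `relative_spread` is a measure introduced here for illustration, not one used elsewhere in the notebook.

```python
# Hourly extremes for the Other posts, as quoted in the text above.
highest_other = 62.52
lowest_other = 45.24

# Hypothetical measure: the gap between the extremes as a share of the maximum.
relative_spread = (highest_other - lowest_other) / highest_other
print(f"Relative spread for Other posts: {relative_spread:.1%}")
```

For these two values the spread comes to a little under 28% of the maximum.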