# Hacker News: Behind the Numbers of Tech's Most Vibrant Community #

Objective: Compare the two most popular post types on Hacker News, "Ask HN" and "Show HN", and determine which one receives more comments per post on average. Then, for the better-performing post type, determine which hour of the day attracts the most comments per post.

Dataset used: https://www.kaggle.com/datasets/hacker-news/hacker-news-posts 

Steps:
1. Read the file and transform it into a list of lists
2. Separate the header row from the data rows
3. Split the posts into separate lists: "Ask HN", "Show HN" and other posts
4. Calculate the average number of comments per post for each type
5. Use datetime to parse each creation time into a datetime object
6. Create two new dictionaries with: hour:total_posts and hour:total_comments
7. Calculate the average number of comments per post for each hour
8. Swap the columns and sort the results

In [36]:
# Read in the data
from csv import reader

# Open the file in a with-block so it is closed automatically,
# then transform the reader into a list of lists
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))

In [37]:
# Separate the header from the data rows

header = hn[0]
hn1 = hn[1:]

In [38]:
# Create three new lists for storing the different post types

ask_posts = []
show_posts = []
other_posts = []

# Iterate through the list and append to new lists by using lower() and startswith() methods
for row in hn1:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("The amount of posts in ask hn is: ", len(ask_posts))
print("The amount of posts in show hn is: ", len(show_posts))
print("The amount of other posts is: ", len(other_posts))

The amount of posts in ask hn is:  1744
The amount of posts in show hn is:  1162
The amount of other posts is:  17194


In [39]:
# Calculate the amounts of comments with for loop and addition

ask_comments = 0
show_comments = 0

for row in ask_posts:
    comments = int(row[4])
    ask_comments += comments
    
for row in show_posts:
    comments = int(row[4])
    show_comments += comments
    

print("Total of comments in ask hn:", ask_comments)
print("Total of comments in show hn:", show_comments)


# Calculate the average by dividing the number of comments by the number of posts of the same type. Round the answer
avg_ask_comments = round(ask_comments / len(ask_posts))
print("The average amount of comments in ask hn is: ", avg_ask_comments)
avg_show_comments = round(show_comments / len(show_posts))
print("The average amount of comments in show hn is: ", avg_show_comments)

Total of comments in ask hn: 24483
Total of comments in show hn: 11988
The average amount of comments in ask hn is:  14
The average amount of comments in show hn is:  10
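
Because the sum-and-divide logic is repeated once per post type, it is easy to mix up the two lists. A minimal sketch of a shared helper is shown below. The name `average_comments`, its `comments_index` parameter and the sample rows are hypothetical and only mirror the column layout used in this notebook (comment count in column 4).

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical sample rows laid out like the dataset (num_comments in column 4)
sample_ask = [
    ["1", "Ask HN: A?", "", "5", "6", "user_a", "8/16/2016 9:55"],
    ["2", "Ask HN: B?", "", "2", "10", "user_b", "11/22/2015 13:43"],
]
sample_show = [
    ["3", "Show HN: C", "", "3", "4", "user_c", "5/2/2016 10:14"],
]

print(average_comments(sample_ask))   # 8.0
print(average_comments(sample_show))  # 4.0
```

Since each call takes a single list, the total and the post count always come from the same post type.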


In [40]:
# Transform the strings into datetime objects and then extract the data we need

import datetime as dt

result_list = []

# Collect the data we need (created_at and num_comments) from the posts that start with "ask hn"
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])  

# Create two new dictionaries to save the hour:posts and hour:comments
posts_per_hour = {}
comments_per_hour = {}
    
# Loop through the list to count the posts and sum the comments for each hour
date_format = "%m/%d/%Y %H:%M"
for row in result_list:
    date = row[0]
    comments = row[1]
    date_dt = dt.datetime.strptime(date, date_format) # Turning string into a datetime object
    date_str = date_dt.strftime("%H") # Extracting the desired outcome, which is the hour of the post
    if date_str not in posts_per_hour:
        posts_per_hour[date_str] = 1
        comments_per_hour[date_str] = comments
    else:
        posts_per_hour[date_str] += 1
        comments_per_hour[date_str] += comments

print(posts_per_hour)
print(comments_per_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
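
The same per-hour aggregation can be written without the explicit "key not in dictionary" check. The sketch below uses `collections.defaultdict` from the standard library. The function name `aggregate_by_hour` and the sample rows are hypothetical, and the date format is the one used in this notebook.

```python
import datetime as dt
from collections import defaultdict


def aggregate_by_hour(rows, date_format="%m/%d/%Y %H:%M"):
    """Count posts and sum comments per hour for [created_at, num_comments] rows."""
    posts = defaultdict(int)     # missing hours start at 0
    comments = defaultdict(int)
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
        posts[hour] += 1
        comments[hour] += num_comments
    return dict(posts), dict(comments)


# Hypothetical sample rows in the same shape as result_list
sample = [["8/16/2016 9:55", 6], ["11/22/2015 9:10", 4], ["5/2/2016 15:14", 30]]
posts, comments = aggregate_by_hour(sample)
print(posts)     # {'09': 2, '15': 1}
print(comments)  # {'09': 10, '15': 30}
```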


In [41]:
# Calculate the average comments per post for each hour and store [hour, average] pairs in a new list
avg_by_hour = []

for row in comments_per_hour:
    avg_by_hour.append([row, round(comments_per_hour[row] / posts_per_hour[row])])
    
avg_by_hour

[['09', 6],
 ['13', 15],
 ['10', 13],
 ['14', 13],
 ['16', 17],
 ['23', 8],
 ['12', 9],
 ['17', 11],
 ['15', 39],
 ['21', 16],
 ['20', 22],
 ['02', 24],
 ['18', 13],
 ['03', 8],
 ['05', 10],
 ['19', 11],
 ['01', 11],
 ['22', 7],
 ['08', 10],
 ['04', 7],
 ['00', 8],
 ['06', 9],
 ['07', 8],
 ['11', 11]]

In [42]:
# Swap the columns so that the order is: average by hour, hour
swap_avg_per_hour = []

for row in avg_by_hour:
    swap_avg_per_hour.append([row[1], row[0]])

# Use sorted() function to arrange the hours from highest to lowest
sorted_swap = sorted(swap_avg_per_hour, reverse = True)

sorted_swap

[[39, '15'],
 [24, '02'],
 [22, '20'],
 [17, '16'],
 [16, '21'],
 [15, '13'],
 [13, '18'],
 [13, '14'],
 [13, '10'],
 [11, '19'],
 [11, '17'],
 [11, '11'],
 [11, '01'],
 [10, '08'],
 [10, '05'],
 [9, '12'],
 [9, '06'],
 [8, '23'],
 [8, '07'],
 [8, '03'],
 [8, '00'],
 [7, '22'],
 [7, '04'],
 [6, '09']]
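
Swapping the columns is one way to sort by the average. An alternative sketch, assuming the same [hour, average] pairs, is to pass a `key` function to `sorted()` and leave the columns as they are. The sample values below are taken from the results above.

```python
# A few [hour, average] pairs from the results above
avg_sample = [["09", 6], ["15", 39], ["02", 24], ["20", 22]]

# Sort by the second element (the average), highest first
ranked = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)
print(ranked)  # [['15', 39], ['02', 24], ['20', 22], ['09', 6]]
```

One difference to keep in mind: with a `key` on the average only, hours with equal averages keep their original relative order, whereas sorting the swapped lists breaks ties by the hour string. Using `key=lambda pair: (pair[1], pair[0])` would reproduce the tie-breaking of the swapped version.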

In [43]:
print("Top 5 hours to post in Hacker News")

# Print the desired outcome using datetime
for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(row[1], "%H")
    hr = hour.strftime("%H:%M")
    print("At {} the average comments per post was {}.".format(hr, row[0]))
    

Top 5 hours to post in Hacker News
At 15:00 the average comments per post was 39.
At 02:00 the average comments per post was 24.
At 20:00 the average comments per post was 22.
At 16:00 the average comments per post was 17.
At 21:00 the average comments per post was 16.


To conclude the project, we analyzed the 20,100 posts in the Hacker News CSV file. We first compared "Ask HN" and "Show HN" posts. "Ask HN" posts received more comments on average: about 14 per post (24,483 comments across 1,744 posts) versus about 10 per post for "Show HN" (11,988 comments across 1,162 posts).

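For "Ask HN" posts, the best hour to post seems to be 15:00-16:00 (EST), which averaged 39 comments per post. The other top hours were 02:00, 20:00, 16:00 and 21:00, so posting between 15:00 and 17:00 or between 20:00 and 22:00 is generally a good idea.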
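
Readers outside the EST time zone need to convert these hours before using them. The sketch below converts an EST hour to UTC with the standard `datetime` module. It assumes EST means a fixed offset of UTC-5 with no daylight saving time, which this analysis does not verify, and the helper name `est_hour_to_utc` is hypothetical.

```python
import datetime as dt

# Assumption: EST is a fixed UTC-5 offset (daylight saving time is ignored)
EST = dt.timezone(dt.timedelta(hours=-5), "EST")


def est_hour_to_utc(hour):
    """Convert a whole hour in EST to an 'HH:MM' string in UTC."""
    # The date is arbitrary; only the hour matters for a fixed offset
    local = dt.datetime(2016, 1, 1, hour, 0, tzinfo=EST)
    return local.astimezone(dt.timezone.utc).strftime("%H:%M")


for hour in (15, 2, 20):
    print("{:02d}:00 EST is {} UTC".format(hour, est_hour_to_utc(hour)))
```

To convert to another zone, replace `dt.timezone.utc` with the target time zone object.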

Learnings:
- Working with strings
- Object-oriented programming
- Working with dates and times

