# Exploring Hacker News Posts
This project analyses what influences post popularity on [Hacker News](https://news.ycombinator.com/). The two types of posts explored are `Ask HN` and `Show HN`.

Users submit `Ask HN` posts to ask the Hacker News community a specific question. Users submit `Show HN` posts to show the Hacker News community a project, product or something interesting.

These two types of posts will be compared to determine the following: 
* Do `Ask HN` or `Show HN` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

The data set was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.
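
The reduction described above can be sketched as follows. This only illustrates the filter-then-sample idea and is not the procedure actually used to produce the data set: the example rows, the `reduce_rows` helper, the sample size and the seed are all hypothetical.

```python
import random

# Hypothetical raw rows in the same column order as the data set:
# id, title, url, num_points, num_comments, author, created_at
raw_rows = [
    ['1', 'Post A', '', '10', '0', 'user_a', '1/1/2016 10:00'],
    ['2', 'Post B', '', '5', '3', 'user_b', '1/2/2016 11:00'],
    ['3', 'Post C', '', '7', '12', 'user_c', '1/3/2016 12:00'],
    ['4', 'Post D', '', '1', '0', 'user_d', '1/4/2016 13:00'],
    ['5', 'Post E', '', '2', '1', 'user_e', '1/5/2016 14:00'],
]


def reduce_rows(rows, sample_size, seed=0):
    """Drop rows with zero comments, then draw a random sample of the rest."""
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the sample reproducible
    return rng.sample(commented, min(sample_size, len(commented)))


sample = reduce_rows(raw_rows, sample_size=2)
print(len(sample))
```

Because every post with zero comments is dropped before sampling, the averages computed later describe only posts that received at least one comment.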

In [3]:
from csv import reader

# Read every row of the CSV into a list of lists.
# The with block closes the file once reading is done.
with open('hacker_news.csv', newline='') as hn_file:
    hn = list(reader(hn_file))
hn[:3]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30']]

### Remove headers

In [5]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


### Extracting Ask Hacker News and Show Hacker News Posts
The data is split into three lists: posts whose titles start with "Ask HN", posts whose titles start with "Show HN", and all other posts. Titles are lowercased first, so the match is case-insensitive.

In [7]:
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else: 
        other_posts.append(post)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


### Calculate the average number of comments on Ask Hacker News and Show Hacker News posts

In [9]:
def calculate_avg_comment_number(posts_list):
    total_comments = 0
    for post in posts_list:
        comment_number = int(post[4])
        total_comments += comment_number
    return total_comments / len(posts_list)

avg_ask_comments = calculate_avg_comment_number(ask_posts)
avg_show_comments = calculate_avg_comment_number(show_posts)
avg_other_comments = calculate_avg_comment_number(other_posts)
print(avg_ask_comments)
print(avg_show_comments)
print(avg_other_comments)

14.038417431192661
10.31669535283993
26.8730371059672


On average, "Ask HN" posts receive approximately 14 comments while "Show HN" posts receive about 10 comments. Since "Ask HN" posts tend to receive more comments, the remaining analysis focuses on these posts.

### Determine if Ask HN posts created at certain hours of the day receive more comments than Ask HN posts created at other times of day 

In [12]:
# Count the ask posts created during each hour of the day and total the comments they received.
import datetime as dt

result_list = []

for post in ask_posts:
    result_list.append(
        [post[6], int(post[4])]
    )

# posts_by_hour contains the number of ask posts created during each hour of the day
posts_by_hour = {}
# comments_by_hour contains the total number of comments received by ask posts created during each hour
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for result in result_list:
    date = result[0]
    comment = result[1]
    hour = dt.datetime.strptime(date, date_format).strftime("%H")
    if hour in posts_by_hour:
        comments_by_hour[hour] += comment
        posts_by_hour[hour] += 1
    else:
        comments_by_hour[hour] = comment
        posts_by_hour[hour] = 1

comments_by_hour


{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

In [13]:
# Calculate the average number of comments received by `Ask HN` posts created during each hour of the day.
avg_by_hour = []
for hour in posts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / posts_by_hour[hour]])

avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

In [14]:
sorted_avg_by_hour = sorted(avg_by_hour, key=lambda x: x[1], reverse=True)
sorted_avg_by_hour

[['15', 38.5948275862069],
 ['02', 23.810344827586206],
 ['20', 21.525],
 ['16', 16.796296296296298],
 ['21', 16.009174311926607],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['18', 13.20183486238532],
 ['17', 11.46],
 ['01', 11.383333333333333],
 ['11', 11.051724137931034],
 ['19', 10.8],
 ['08', 10.25],
 ['05', 10.08695652173913],
 ['12', 9.41095890410959],
 ['06', 9.022727272727273],
 ['00', 8.127272727272727],
 ['23', 7.985294117647059],
 ['07', 7.852941176470588],
 ['03', 7.796296296296297],
 ['04', 7.170212765957447],
 ['22', 6.746478873239437],
 ['09', 5.5777777777777775]]

In [15]:
for hour, avg in sorted_avg_by_hour[:5]:
    time = dt.datetime.strptime(hour, '%H').strftime('%H:%M')
    formatted_avg = format(avg, '.2f')
    print(f"{time}: {formatted_avg} average comments per post")

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


"Ask HN" posts created at 15:00 receive the most comments on average, at 38.59 comments per post. That is about 62% more than the second highest hour, 02:00, which averages 23.81 comments per post.
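
The size of the gap between the top two hours can be checked with a short calculation. The two averages are copied from the output above; the `percent_increase` helper is a hypothetical name introduced only for this illustration.

```python
def percent_increase(higher, lower):
    """Return how much larger `higher` is than `lower`, as a percentage of `lower`."""
    return (higher - lower) / lower * 100


# Average comments per post for the top two hours, taken from the results above.
top_avg = 38.5948275862069      # 15:00
second_avg = 23.810344827586206  # 02:00

increase = percent_increase(top_avg, second_avg)
print(f"{increase:.1f}% more comments per post at 15:00 than at 02:00")
```

The result is a little over 62%.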

According to the data set [documentation](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts), the timestamps are in US Eastern Time, so the 15:00 peak corresponds to 3:00 pm Eastern.
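
Readers in other time zones may want to convert these hours to their local time. The following is a minimal sketch, assuming that "Eastern Time" maps to the `America/New_York` zone and that Python 3.9 or later (for `zoneinfo`) and a system time zone database are available. The `to_timezone` helper is hypothetical; the example timestamp is taken from the first row of the data set.

```python
import datetime as dt
from zoneinfo import ZoneInfo  # standard library from Python 3.9 onwards

date_format = "%m/%d/%Y %H:%M"
# Assumption: the data set's "Eastern Time" corresponds to this IANA zone.
eastern = ZoneInfo("America/New_York")


def to_timezone(created_at, target_zone):
    """Parse a created_at string as Eastern Time and convert it to target_zone."""
    naive = dt.datetime.strptime(created_at, date_format)
    localized = naive.replace(tzinfo=eastern)  # attach the assumed source zone
    return localized.astimezone(ZoneInfo(target_zone))


# created_at value of the first row in the data set
converted = to_timezone("8/4/2016 11:52", "UTC")
print(converted.strftime("%m/%d/%Y %H:%M %Z"))
```

Passing a different zone name, such as `"Europe/London"`, converts the same timestamp to that zone. Daylight saving time is handled by the time zone database rather than by a fixed offset.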