# Exploring Hacker News Posts

We will explore the Hacker News Posts dataset to answer two questions:
1) Do 'Ask HN' or 'Show HN' posts receive more comments on average?
2) Do posts created at a certain time receive more comments on average?

*Ask HN: Users submit 'Ask HN' posts to ask the Hacker News Community a specific question*

*Show HN: Users submit 'Show HN' posts to show a project, product, or something interesting*

You can find the data set [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. 

Below are descriptions of the columns:
* <mark>id</mark>: The unique identifier from Hacker News for the post
* <mark>title</mark>: The title of the post
* <mark>url</mark>: The URL that the post links to, if the post has a URL
* <mark>num_points</mark>: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* <mark>num_comments</mark>: The number of comments that were made on the post
* <mark>author</mark>: The username of the person who submitted the post
* <mark>created_at</mark>: The date and time at which the post was submitted
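
Because the csv reader used below returns each row as a plain list of strings, columns are accessed by position. Here is a minimal sketch of one way to keep those positions readable, assuming the file's columns appear in the order listed above. The names `COLUMNS`, `COL`, and `sample_row` are illustrative and not part of the original analysis.

```python
# Column names in the order described above (assumed to match the CSV header).
COLUMNS = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position in a row.
COL = {name: index for index, name in enumerate(COLUMNS)}

# One row from the dataset, as the csv reader would return it (all strings).
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

print(COL['num_comments'])                    # position of the comment count
print(int(sample_row[COL['num_comments']]))   # comment count converted to an integer
```

The rest of this notebook uses the literal positions (`h[1]`, `ask[4]`, `ask[6]`), which correspond to `title`, `num_comments`, and `created_at`.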

In [1]:
from csv import reader

# Read the dataset (the with block closes the file once it has been read)
with open('hacker_news.csv') as dataset:
    hn = list(reader(dataset))

# Display the header and the first five rows of the dataset
for h in hn[:6]:
    print(h)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


In [2]:
# Header 
header = hn[0]
print(header)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [3]:
# Removing header
hn = hn[1:]
for h in hn[:5]:
    print(h)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


In [4]:
# Extracting 'Ask HN' and 'Show HN' posts
ask_posts = []
show_posts = []
other_posts = []

for h in hn:
    # lower case
    title = h[1].lower()
    # find titles that start with 'ask hn'
    if title.startswith('ask hn'):
        ask_posts.append(h)
    # find titles that start with 'show hn'
    elif title.startswith('show hn'):
        show_posts.append(h)
    # other posts that don't fit the above criteria
    else:
        other_posts.append(h)

print('Number of posts with Ask HN:', len(ask_posts))
print('Number of posts with Show HN:', len(show_posts))
print('Number of other posts:', len(other_posts))

Number of posts with Ask HN: 1744
Number of posts with Show HN: 1162
Number of other posts: 17194


In [5]:
# Example of ask and show HN posts
print('Ask HN')
for ask in ask_posts[:5]:
    print(ask)
print()

print('Show HN')
for show in show_posts[:5]:
    print(show)

Ask HN
['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']
['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']
['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']
['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']

Show HN
['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']
['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']
['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', 

In [6]:
# Find the total number of comments for 'Ask' posts
total_ask_comments = 0

# Adding number of comments from 'Ask' posts
for ask in ask_posts:
    n_comments = int(ask[4])
    total_ask_comments += n_comments
    
# Finding the average of 'Ask' posts
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average number of comments for Ask HN posts',avg_ask_comments)

# Find the total number of comments for 'Show' Posts
total_show_comments = 0

# Adding number of comments for 'Show' posts
for show in show_posts:
    n_comments = int(show[4])
    total_show_comments += n_comments

# Finding the average of 'Show' posts
avg_show_comments = total_show_comments / len(show_posts)
print('Average number of comments for Show HN posts',avg_show_comments)

Average number of comments for Ask HN posts 14.038417431192661
Average number of comments for Show HN posts 10.31669535283993


From our analysis, we see that 'Ask HN' posts receive more comments on average than 'Show HN' posts (about 14.04 versus 10.32).
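
The two averaging loops above repeat the same logic, so they could be folded into a single helper. The following is a sketch only, not the code used to produce the figures above. `average_comments` and `sample_ask` are hypothetical names, and the three rows are copied from the Ask HN sample output shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for a list of rows.

    Each row is a list of strings; comments_index is the position of
    the num_comments column (4 in this dataset).
    """
    if not posts:
        raise ValueError("cannot average an empty list of posts")
    return sum(int(post[comments_index]) for post in posts) / len(posts)


# Three Ask HN rows taken from the sample output shown earlier.
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]

print(average_comments(sample_ask))  # (6 + 29 + 1) / 3 = 12.0
```

Calling the helper once with `ask_posts` and once with `show_posts` would replace both loops.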

We will further analyse the 'Ask HN' posts to see whether posts created at a certain time of day receive more comments.

We will use the following steps to perform this analysis:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received
2. Calculate the average number of comments ask posts receive by hour created

In [7]:
import datetime as dt

result_lists = []

# Extracting the time and number of comments data
for ask in ask_posts:
    created_at = ask[6]
    n_comments = int(ask[4])
    result_lists.append([created_at, n_comments])

counts_by_hour = {}
comments_by_hour = {}

# Count posts and total comments for each hour of the day
date_format = "%m/%d/%Y %H:%M"
for result in result_lists:
    date = result[0]
    n_comments = result[1]
    hour = dt.datetime.strptime(date, date_format).strftime("%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += n_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = n_comments
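
The same tally can be written more compactly. The sketch below is an alternative, not the code used in this analysis. It assumes the `%m/%d/%Y %H:%M` format shown above, uses `collections.defaultdict` to avoid the if/else branch, and keys by the integer `hour` attribute rather than a zero-padded string. `sample_results` is a small illustrative stand-in for `result_lists`.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"

# A few [created_at, n_comments] pairs taken from the Ask HN sample rows.
sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
]

counts = defaultdict(int)     # posts created in each hour
comments = defaultdict(int)   # total comments on posts created in each hour

for created_at, n_comments in sample_results:
    hour = dt.datetime.strptime(created_at, DATE_FORMAT).hour
    counts[hour] += 1
    comments[hour] += n_comments

for hour in sorted(counts):
    print(f'{hour:02d}:00  posts={counts[hour]}  comments={comments[hour]}')
```

Integer keys sort numerically, which makes it easier to list the hours in order.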


In [8]:
# Number of Ask HN posts created in each hour
counts_by_hour

{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}

In [9]:
# Total number of comments on Ask HN posts created in each hour
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

# Calculating the Average Number of Comments for Ask HN Posts by Hour

In [10]:
# Calculate the average number of comments received by `Ask HN` posts created in each hour of the day
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])

In [11]:
#Top 5 hours for Ask HN posts
top_5_time = sorted(avg_by_hour, key=lambda x:x[1], reverse=True)[:5]

print("Top 5 hours for Ask HN posts")
for top_5 in top_5_time:
    print(f'{top_5[0]}:00, {top_5[1]} average comments per post')

Top 5 hours for Ask HN posts
15:00, 38.5948275862069 average comments per post
02:00, 23.810344827586206 average comments per post
20:00, 21.525 average comments per post
16:00, 16.796296296296298 average comments per post
21:00, 16.009174311926607 average comments per post


The hour that receives the most comments per post on average is 15:00, with an average of 38.59 comments per post.

The 15:00 average is about 62% higher than the second-highest average, 23.81 comments per post at 02:00.
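
As a quick check of that figure, here is a small sketch using the two top averages printed above. The variable names are illustrative.

```python
# Averages for the top two hours, copied from the output above.
top_avg = 38.5948275862069        # 15:00
second_avg = 23.810344827586206   # 02:00

# Relative increase of the top hour over the runner-up.
relative_increase = (top_avg - second_avg) / second_avg
print(f'{relative_increase:.0%}')  # 62%
```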

According to the data set [documentation](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts), the timezone used is Eastern Time in the US, so 15:00 corresponds to 3:00 pm ET.
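
Readers in other time zones may want to convert that hour. The sketch below assumes a fixed UTC-5 offset (Eastern Standard Time) and ignores daylight saving time, during which Eastern Time is UTC-4, so treat the result as approximate. The date is arbitrary and the names are illustrative.

```python
import datetime as dt

# Assumption: fixed UTC-5 offset (Eastern Standard Time), ignoring daylight saving.
EASTERN_STANDARD = dt.timezone(dt.timedelta(hours=-5), 'EST')

# 15:00 on an arbitrary winter date, interpreted as Eastern Standard Time.
posting_time = dt.datetime(2016, 1, 26, 15, 0, tzinfo=EASTERN_STANDARD)

# The same instant expressed in UTC.
as_utc = posting_time.astimezone(dt.timezone.utc)
print(as_utc.strftime('%H:%M UTC'))  # 20:00 UTC
```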

# Conclusion

We analyzed ask posts and show posts to determine which type of post and which time of day receive the most comments on average.

Based on our analysis, to maximise the number of comments a post receives, we recommend that it be submitted as an ask post and created between 15:00 and 16:00 (3:00 pm to 4:00 pm ET).

However, it should be noted that the data set we analysed excluded posts without any comments.

Given that, it is more accurate to say that, of the posts that received comments, ask posts received more comments on average than show posts, and ask posts created between 15:00 and 16:00 (3:00 pm to 4:00 pm ET) received the most comments on average.