# Uncovering how types and timings of HN posts affect engagement

This project seeks to reveal two things about _Hacker News_ posts:

1. Whether 'Ask HN' or 'Show HN' posts receive more comments
2. Whether posts created at a certain time receive more comments on average than those created at other times

With this information, content creators and social media managers, for instance, will have valuable insight into which content receives the most traction and at what times it performs best.

The dataset 'Hacker News.csv' (a shortened file of roughly 20,000 entries, rather than the full file of more than 300,000 entries) contains the following columns, which will help us answer these questions:

- `id` – a unique ID attached to each individual post
- `title` – the title of each post
- `url` – the url to which each post links (where applicable)
- `num_points` – the total number of points per post, after subtracting downvotes from upvotes
- `num_comments` – the number of comments per post
- `author` – the username of the person that created the post
- `created_at` – the date and time of the post's creation
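
Because the analysis reads each row as a plain list, every column is reached by its position. The following is a minimal sketch of that mapping, using the header row and one row from the dataset preview in this notebook. The names `COLUMNS`, `COL`, and `sample_row` are illustrative and not part of the original code.

```python
# Column names in the order they appear in the CSV header
COLUMNS = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its list index
COL = {name: index for index, name in enumerate(COLUMNS)}

# One row copied from the dataset preview in this notebook
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Indices used throughout the analysis: title, num_comments, created_at
print(COL['title'], COL['num_comments'], COL['created_at'])

# Values read from a CSV are strings, so counts need int() before arithmetic
print(int(sample_row[COL['num_comments']]))
```

This is why the later cells use `row[1]` for the title, `row[4]` for the comment count, and `row[6]` for the creation time.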

## Preparing the data
First we'll import the relevant modules, verify that the data displays well, and remove the header row, as it would only get in the way during later analyses.

In [3]:
# Importing modules
from csv import reader
from pprint import pprint # for easier list reading
import datetime as dt

# Opening the dataset as a list of lists
# Using a context manager so the file is closed once it has been read
with open("/Users/lux/Library/CloudStorage/GoogleDrive-lucasknowak@gmail.com/My Drive/School/DataQuest/Guided Projects/Exploring Hacker News Posts/Hacker News.csv") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

# Showing the first five rows – commented to save space
# pprint(hn[:5])

In [4]:
# Removing headers so we're left with the bare data
headers = hn[0]
hn = hn[1:]

print('Header:\n', headers)
print('First five rows:\n')
pprint(hn[:5])

Header:
 ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
First five rows:

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/mov

## Filtering out irrelevant posts
Because we're only interested in two specific types of posts, '_Ask HN_' and '_Show HN_', we'll separate them from everything else. Each post goes into one of three new lists, based on how its title starts: one for ask posts, one for show posts, and one for all other posts.

In [6]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Number of ask posts:', len(ask_posts))
print('Number of show posts:', len(show_posts))
print('Number of other posts:', len(other_posts))

Number of ask posts: 1744
Number of show posts: 1162
Number of other posts: 17194


To confirm that the lists contain the right types of posts, we can also pretty-print the first five items of each list. The code to do so is left below in commented form to save space.

In [8]:
# pprint(ask_posts[:5])
# pprint(show_posts[:5])
# pprint(other_posts[:5])
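
Beyond eyeballing the first few rows, the same check could be automated. The sketch below assumes rows shaped like those in this notebook (title in column 1). The helper `has_prefix` and the demo lists with made-up titles are hypothetical and not part of the original analysis.

```python
def has_prefix(rows, prefix):
    """Return True if every title (column 1) starts with `prefix`, ignoring case."""
    return all(row[1].lower().startswith(prefix) for row in rows)

# Tiny stand-in lists with made-up titles; in the notebook you would
# pass ask_posts and show_posts instead
demo_ask = [['1', 'Ask HN: How do you learn?'], ['2', 'ask hn: Favourite editor?']]
demo_show = [['3', 'Show HN: My side project']]

print(has_prefix(demo_ask, 'ask hn'))    # every demo ask title matches
print(has_prefix(demo_show, 'show hn'))  # every demo show title matches
print(has_prefix(demo_show, 'ask hn'))   # show titles do not start with 'ask hn'
```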

## Ask HN posts vs Show HN posts: Which type gets more comments?
To find out whether ask posts or show posts get more comments, we'll first total the number of comments in each category and then calculate the average number of comments per post within that category ('Ask HN' or 'Show HN').

In [10]:
# Calculating total ask comments and average comment per ask post
total_ask_comments = 0

for row in ask_posts:
    num_comment = row[4]
    num_comment = int(num_comment)
    total_ask_comments += num_comment

print('Total ask comments:', total_ask_comments)

avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average number of comments per ask post:', round(avg_ask_comments, 2))
# Rounding for the sake of readability and practicality

Total ask comments: 24483
Average number of comments per ask post: 14.04


In [11]:
# Calculating total show comments and average comment per show post
total_show_comments = 0

for row in show_posts:
    num_comment = row[4]
    num_comment = int(num_comment)
    total_show_comments += num_comment

print('Total show comments:', total_show_comments)

avg_show_comments = total_show_comments / len(show_posts)
print('Average number of comments per show post:', round(avg_show_comments, 2))
# Rounding for the sake of readability and practicality

Total show comments: 11988
Average number of comments per show post: 10.32
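
The two cells above repeat the same loop, so the calculation could be wrapped in a small function. This is only a sketch: `average_comments` and `demo_posts` are hypothetical names, and the demo rows are taken from the dataset preview but trimmed to five columns, with the URL replaced by an empty string.

```python
def average_comments(posts, comments_index=4):
    """Return (total, average) comments for a list of rows.

    The average is 0.0 for an empty list, to avoid dividing by zero.
    """
    total = sum(int(row[comments_index]) for row in posts)
    average = total / len(posts) if posts else 0.0
    return total, average

# Rows from the dataset preview, trimmed for brevity (URL left empty)
demo_posts = [
    ['12224879', 'Interactive Dynamic Video', '', '386', '52'],
    ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", '', '2', '1'],
    ['11919867', 'Technology ventures: From Idea to Enterprise', '', '3', '1'],
]

total, average = average_comments(demo_posts)
print('Total comments:', total)
print('Average comments per post:', round(average, 2))
```

In the notebook, calling `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two separate loops.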


We have a winner: **ask posts**, with an average of about 14 comments per post, versus about 10 per show post. There could be several reasons for this, but what first comes to mind is that a question _invites_ people to answer more than a statement or a showcase does. It quite directly invites readers to share their opinion or expertise.

Either way, this gives our hypothetical content creator or social media manager an idea of what type of post to prioritise when considering their strategy on _Hacker News_.

## Calculating the number of ask posts and comments by 'Hour Created'
Since ask posts receive more comments on average, we'll focus the rest of our analysis on them.

The next step is therefore determining if ask posts created at a certain _time_ are more likely to receive comments. To that end, we'll do the following:

- Determine the number of ask posts created in each hour of the day, as well as the number of comments received in that time
- Calculate the average number of comments on ask posts _by hour created_

We've already imported the `datetime` module (as `dt`), so we can parse the values in the `created_at` column of our dataset.
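
As a small illustration of the parsing step, the sketch below converts one `created_at` value from the dataset preview into a `datetime` object and reads the hour from it. The variable names are illustrative only.

```python
import datetime as dt

# A timestamp string in the same format as the `created_at` column
created_at = '8/4/2016 11:52'

# Parse month/day/year and 24-hour time into a datetime object
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

hour_as_int = parsed.hour            # the hour as an integer
hour_as_str = parsed.strftime('%H')  # the hour as a zero-padded string

print(hour_as_int, hour_as_str)
```

The notebook keeps the zero-padded string form, which is why the dictionary keys below look like `'09'` rather than `9`.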

In [14]:
# Extracting the creation time and comment count of each post in 'ask_posts'
result_list = [] # Empty list the relevant data will go into

for row in ask_posts:
    created_at = row[6]
    num_comment = row[4]
    result_list.append([created_at, int(num_comment)])

# pprint(result_list[:5]) – Commented print; used to verify if the list came out as intended
    
# Calculating the average number of comments received by hour created
# Starting with empty dictionaries all relevant data will go into
counts_by_hour = {}
comments_by_hour = {}

for row in result_list: 
    date_and_hour_string = row[0]
    date_and_hour_dt = dt.datetime.strptime(date_and_hour_string, "%m/%d/%Y %H:%M") # Turning the string into a datetime object
    hour = date_and_hour_dt.strftime("%H") # Extracting the hour as a zero-padded string
    comments_num = row[1]
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments_num
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments_num

print('Number of ask posts by hour created:', counts_by_hour, '\n') 
print('Number of comments by hour created:', comments_by_hour, '\n')

Number of ask posts by hour created: {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58} 

Number of comments by hour created: {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641} 



With the `counts_by_hour` and `comments_by_hour` dictionaries in place, we can work on a final list that will display the average number of comments per hour. This will reveal what the most opportune times are to post an _ask HN_ post on Hacker News if your goal is to maximise engagement (measured through the number of comments your post receives).

So far, 15h (3 PM) stands out with the highest total number of comments, but it also has the most posts. To compare hours fairly, we'll calculate the average number of comments per post for each hour.

We start off with an empty list once more, which we'll populate with our findings.

In [16]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

pprint(avg_by_hour)

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]
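
The same averages could also be kept in a dictionary instead of a list of lists. The sketch below uses a dict comprehension on three hours copied from the output of the two dictionaries above. The names `counts_subset`, `comments_subset`, and `avg_subset` are hypothetical.

```python
# Three hours copied from counts_by_hour and comments_by_hour above
counts_subset = {'15': 116, '02': 58, '20': 80}
comments_subset = {'15': 4477, '02': 1381, '20': 1722}

# Average comments per post for each hour: total comments / number of posts
avg_subset = {
    hour: comments_subset[hour] / counts_subset[hour]
    for hour in counts_subset
}

for hour, avg in avg_subset.items():
    print(hour, avg)
```

A dictionary keeps each hour paired with its average by key, at the cost of needing an explicit sort step before ranking.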


This is the information we're after, but it's hard to read because the list is out of order. We have two options here: we can sort by hour, or by average number of comments per post. Given the intent behind this project and its audience (creative strategists who want to know _what the best time to post Ask HN posts on Hacker News is_), we'll sort by average number of comments and limit the output to the first five results.

That will bring the answer to the central question to the top of the list.

First, we make the number of comments and the hour swap places in the list, so that the number of comments is shown first.

In [18]:
swap_avg_by_hour = [] # Empty list our data will go into

for row in avg_by_hour:
    hour = row[0]
    avg_comments = row[1]
    swap_avg_by_hour.append([avg_comments, hour]) # Average first, so sorting ranks by it

pprint(swap_avg_by_hour)

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]


We're getting there, but now it's time to actually sort the list and print the result (down to two decimals – again, for readability and practicality)!

In [20]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('Top 5 hours for ask posts comments:')

for row in sorted_swap[:5]:
    avg_comment = row[0]
    hour = row[1]
    hour_dt = dt.datetime.strptime(hour, '%H')
    hour_formatted = dt.datetime.strftime(hour_dt, '%H')
    print('{hour}: {comments:.2f} comments per post on average'.format(hour=hour_formatted, comments=avg_comment))

Top 5 hours for ask posts comments:
15: 38.59 comments per post on average
02: 23.81 comments per post on average
20: 21.52 comments per post on average
16: 16.80 comments per post on average
21: 16.01 comments per post on average


## Conclusion
In this notebook, we performed an analysis on a dataset containing posts from the Hacker News website. We began by cleaning and exploring the data, identifying the types of posts and their respective engagement levels. This revealed that 'Ask HN' posts received more comments on average than 'Show HN' posts (about 14 versus about 10 per post).

We then focused on analyzing the 'Ask HN' posts to determine whether posts created at certain times attracted more comments than those created at other times.

We extracted the relevant data, parsed the dates, and calculated the average number of comments for posts created during each hour of the day. This allowed us to identify the best time to create a post to maximize engagement.

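Ask posts created at **15h (3 PM)** performed best by a wide margin, averaging 38.59 comments per post. The rest of the top five (02h, 20h, 16h, and 21h) averaged between roughly 16 and 24 comments per post. In this sample, posts created **later in the day**, from mid-afternoon into the night, seem to perform best as far as garnering comments goes.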