# A brief look at Hacker News engagement
## Hacker News can be found here: https://news.ycombinator.com/

This is a brief look at how engagement compares among Ask HN, Show HN, and other posts. It also examines whether, and how, time of day affects engagement.

Hacker News lets users post and share content on the platform. Users can upvote and downvote submissions, resulting in a point count. It is similar to Reddit. Engagement in this context means (a) the comments users left on a post and (b) the points a post received.

Ask HN posts are a format in which users submit questions for the community to answer. Show HN posts are a format in which users share something they have made with the community.

Answers sought:
* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

This analysis will be done in two parts:
* Exclusion of zero-comment posts, i.e. only posts that received at least one comment are counted.
* Inclusion of zero-comment posts, i.e. all posts are counted, including those with low to no engagement.
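
To illustrate why the two parts can give different answers, here is a minimal sketch with made-up comment counts. The numbers and the `mean` helper are assumptions for illustration only and are not taken from the dataset.

```python
# Hypothetical comment counts for five posts (not taken from the dataset).
comment_counts = [0, 0, 4, 6, 10]


def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Part one: drop posts that received no comments before averaging.
engaged_only = [count for count in comment_counts if count != 0]
avg_excluding_zero = mean(engaged_only)

# Part two: average over every post, commented or not.
avg_including_zero = mean(comment_counts)

print(f"Excluding zero-comment posts: {avg_excluding_zero:.2f}")
print(f"Including zero-comment posts: {avg_including_zero:.2f}")
```

The average that includes zero-comment posts can never be higher than the one that excludes them, because the total number of comments stays the same while the number of posts grows.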

In [123]:
# Defining functions for exploration

# open a csv file
## a function to open datasets
## returns the data set and header row or just returns the data set as a list of lists
## recommended manner to call function:
## variable_data, header_data = open_dataset('filename')
def open_dataset(file_name, has_head=True):
    from csv import reader
    # the with-block closes the file once all rows have been read
    with open(file_name) as opened_file:
        data = list(reader(opened_file))
    if has_head:
        return data[1:], data[0]
    else:
        return data

# explores a dataset
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row, '\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
# checks for shifted columns in a dataset
# see https://community.dataquest.io/t/guided-project-finding-insights-about-popular-hacker-news-posts/557800
def check_shifted(header, dataset):
    header_len = len(header)
    error_count = 0
    for row_index, row in enumerate(dataset):
        if len(row) != header_len:
            error_count += 1
            print(header, '\n')
            print('Row Index: ', row_index, '\n')
            print(row, '\n')
    print('Column Shift Errors: ', error_count)

# checks for null or missing data
# see https://community.dataquest.io/t/guided-project-finding-insights-about-popular-hacker-news-posts/557800
def check_null_data(dataset_header, dataset, index):
    null_count = 0
    # Loop over each row in the dataset to identify any missing values at the given index
    for row_index, row in enumerate(dataset):
        if row[index] == '':
            null_count += 1
            print(dataset_header, '\n')
            print('Row Index: ', row_index, '\n') # Print the row number where the error was found
            print(row, '\n')
    # Print the number of missing values identified at the given index
    print('Missing "{}" Values Identified: {}'.format(dataset_header[index], null_count))


#### below is an excerpt from the dataset

For a brief overview of what we have to work with, I have printed the header information and five rows of information from the dataset.

In [124]:
#using the open_dataset function
hn, hn_header = open_dataset(file_name='HN_posts_year_to_Sep_26_2016.csv')

print(hn_header, '\n')
explore_data(hn, 0, 5, True)
print('\n')
check_shifted(hn_header, hn)
print('\n')
# ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
check_null_data(hn_header, hn, 3)
check_null_data(hn_header, hn, 4)
check_null_data(hn_header, hn, 6)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'] 

['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'] 

['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'] 

['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'] 

['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14'] 

Number of 

In [125]:
# pt 2
# removing header
# completed during import
# pt 3
# filtering the data
# four lists to separate ask, show, others, and all posts
# 'other' posts consist of all non-ask and show posts.

# collects posts with comments
def commented_posts(dataset, index):
    output = []
    for row in dataset:
        num_comments = int(row[index])
        if num_comments != 0:
            output.append(row)
    return output

#calc avg
def average_calc(dataset, index):
    total = 0
    for item in dataset:
        total += int(item[index])    
    return total / len(dataset)

# collect user submissions by title
def collect_titles(dataset, index):
    ask_posts = []
    show_posts = []
    other_posts = []
    for row in dataset:
        title = row[index]
        title = title.lower()
        if title.startswith('ask hn'):
            ask_posts.append(row)
        elif title.startswith('show hn'):
            show_posts.append(row)
        else:
            other_posts.append(row)
    return ask_posts, show_posts, other_posts
    
# collects posts into three lists that include all posts,
# including those without engagement
ask_posts, show_posts, other_posts = collect_titles(hn, 1)

## excludes posts without comments/engagement
ask_comments = commented_posts(ask_posts, 4)
show_comments = commented_posts(show_posts, 4)
other_comments = commented_posts(other_posts, 4)
hn_comments = commented_posts(hn, 4) # all posts with engagement
# avg comments check
avg_ask_comments = average_calc(ask_comments, 4)
avg_show_comments = average_calc(show_comments, 4)
avg_other_comments = average_calc(other_comments, 4)
hn_avg_comments = average_calc(hn_comments, 4) # avg comments of all posts with engagement

## includes posts without engagement
# average number of comments check
avg_ask_posts = average_calc(ask_posts, 4)
avg_show_posts = average_calc(show_posts, 4)
avg_other_posts = average_calc(other_posts, 4)
avg_all = average_calc(hn, 4) # avg comments of all posts


In [126]:
## post stats excluding no-engagement posts
string_format2 = "In the scraped data, excluding posts without direct engagement, there are {ask:,} \"Ask HN\" posts, {show:,} \"Show HN\" posts, {otter:,} \"other posts\", and a total of {totes:,} posts."
string_output2 = string_format2.format(ask=len(ask_comments), show=len(show_comments), otter=len(other_comments), totes=len(hn_comments))
print(string_output2, '\n')

print('\n')
print("An overview of posts, excluding posts without direct engagement:")
print("Ask HN posts: {aposts:,}".format(aposts=len(ask_comments)))
print("Ask HN average comments: {avg:.2f}".format(avg=avg_ask_comments))
print("Ask HN average points: {:.2f}".format(average_calc(ask_comments, 3)))
for row in ask_comments[:5]:
    print(row)
print('\n')
print("Show HN posts: {aposts:,}".format(aposts=len(show_comments)))
print("Show HN average comments: {avg:.2f}".format(avg=avg_show_comments))
print("Show HN average points: {:.2f}".format(average_calc(show_comments, 3)))
for row in show_comments[:5]:
    print(row)
print('\n')
print("Other posts: {aposts:,}".format(aposts=len(other_comments)))
print("Other average comments: {avg:.2f}".format(avg=avg_other_comments))
print("Other posts average points: {:.2f}".format(average_calc(other_comments, 3)))
for row in other_comments[:5]:
    print(row)
print('\n')
print("All posts: {aposts:,}".format(aposts=len(hn_comments)))
print("All posts average comments: {avg:.2f}".format(avg=hn_avg_comments))
print("All posts average points: {:.2f}".format(average_calc(hn_comments, 3)))
for row in hn_comments[:5]:
    print(row)
    
## post stats including no engagement
string_format = "In the scraped data, including posts with no direct engagement, there are {ask:,} \"Ask HN\" posts, {show:,} \"Show HN\" posts, {otter:,} \"other posts\", and a total of {totes:,} posts."
string_output = string_format.format(ask=len(ask_posts), show=len(show_posts), otter=len(other_posts), totes=len(hn))
print(string_output, '\n')
print("An overview of posts, including posts without direct engagement:")
print("Ask HN posts: {aposts:,}".format(aposts=len(ask_posts)))
print("Ask HN average comments: {avg:.2f}".format(avg=avg_ask_posts))
print("Ask HN average points: {:.2f}".format(average_calc(ask_posts, 3)))
for row in ask_posts[:5]:
    print(row)
print('\n')
print("Show HN posts: {aposts:,}".format(aposts=len(show_posts)))
print("Show HN average comments: {avg:.2f}".format(avg=avg_show_posts))
print("Show HN average points: {:.2f}".format(average_calc(show_posts, 3)))
for row in show_posts[:5]:
    print(row)
print('\n')
print("Other posts: {aposts:,}".format(aposts=len(other_posts)))
print("Other average comments: {avg:.2f}".format(avg=avg_other_posts))
print("Other posts average points: {:.2f}".format(average_calc(other_posts, 3)))
for row in other_posts[:5]:
    print(row)
print('\n')
print("All posts: {aposts:,}".format(aposts=len(hn)))
print("All posts average comments: {avg:.2f}".format(avg=avg_all))
print("All posts average points: {:.2f}".format(average_calc(hn, 3)))
for row in hn[:5]:
    print(row)
print('\n')


In the scraped data, excluding posts without direct engagement, there are 6,911 "Ask HN" posts, 5,059 "Show HN" posts, 68,431 "other posts", and a total of 80,401 posts. 



An overview of posts, excluding posts without direct engagement:
Ask HN posts: 6,911
Ask HN average comments: 13.74
Ask HN average points: 14.40
['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']
['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']
['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']
['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']
['12576946', 'Ask HN: How hard would it be to make a cheap, hackable phone?', '', '2', '1', 'hkt', '9/25/2016 19:30']


Show HN posts: 5,059
Show HN average comments: 9.81
Show HN average points: 26.62
['12577142', 'Show HN: Jum

### a look at engagement via comments

"Ask HN" posts show the most consistent average engagement on Hacker News. Their averages change the least between the calculation that excludes zero-comment posts and the one that includes them.

Engagement in this context means another user has seen the post and commented on it. One can presume that many Show HN posts were seen but not commented on. One cannot presume the same for the "others" category.

| Post Type | Post Count (excl. zero-comment) | Average Comments (excl. zero-comment) | Average Points (excl. zero-comment) | Post Count (incl. zero-comment) | Average Comments (incl. zero-comment) | Average Points (incl. zero-comment) |
| --- | --- | --- | --- | --- | --- | --- |
| Ask HN | 6,911 | 13.74 | 14.40 | 9,139 | 10.39 | 11.31 |
| Show HN | 5,059 | 9.81 | 26.62 | 10,158 | 4.89 | 14.84 |
| Others | 68,431 | 25.84 | 53.43 | 273,822 | 6.46 | 15.16 |
| Total | 80,401 | 23.79 | 48.39 | 293,119 | 6.53 | 15.03 |

From the table above, one can see the potential for high engagement in posts that are neither "Ask HN" nor "Show HN". However, there is a stark contrast in average engagement between including and excluding zero-comment posts.

Ask HN posts show the most consistent, direct engagement across the two calculations. These posts generally represent a "safe bet" for creating a post that receives engagement from other users. About 75 percent of all Ask HN posts received direct user engagement.

Show HN posts are closer to a coin flip as to whether a user will directly engage with the post. About 50 percent of Show HN posts received direct engagement.

Posts in the "other" category have the potential for the highest engagement, but represent a significant gamble. Only about 25 percent of those posts (68,431 of 273,822) received engagement.
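
The share of posts that received direct engagement follows from the post counts in the table. The sketch below recomputes it. The counts are copied from the table, while the `post_counts` dictionary and the `engaged_share` helper are names assumed here for illustration.

```python
# Post counts copied from the table: (posts with comments, all posts).
post_counts = {
    "Ask HN": (6911, 9139),
    "Show HN": (5059, 10158),
    "Others": (68431, 273822),
}


def engaged_share(with_comments, total):
    """Return the percentage of posts that received at least one comment."""
    return 100 * with_comments / total


shares = {
    post_type: engaged_share(with_comments, total)
    for post_type, (with_comments, total) in post_counts.items()
}

for post_type, share in shares.items():
    print(f"{post_type}: {share:.1f} percent of posts received comments")
```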

In [127]:
## pt 5 and pt 6 and pt 7 sorta
# generate freq tables for posts per hour and comments per hour
# use freq tables to calc avg comments per hour
# useable for points by hour, also
import datetime as dt

def hourly_comments(dataset, index_comments):
    results = []
    #loop over dataset, append to list of lists
    for t in dataset:
        results.append([t[6], int(t[index_comments])])
    counts_by_hour = {}
    comments_by_hour = {}
    # loop over new list of lists, populate dictionary, parse string for datetime
    for x in results:
        x[0] = dt.datetime.strptime(x[0], "%m/%d/%Y %H:%M")
        hour = x[0].hour
        if hour in counts_by_hour:
            counts_by_hour[hour] += 1
            comments_by_hour[hour] += x[1]
        else:
            counts_by_hour[hour] = 1
            comments_by_hour[hour] = x[1]
    avg_by_hour = []
    # populate list of lists while calcing an average and appending it to aforementioned list
    for i in counts_by_hour:
        avg_by_hour.append([comments_by_hour[i]/counts_by_hour[i],i])
    # sorts the list of lists by average, in descending order
    sorted_avg = sorted(avg_by_hour, reverse=True)
    return sorted_avg

## exclusion of zero-comment submissions
## calc the average comments during a given hour in the day
# ask posts
ex_hourly_ask = hourly_comments(ask_comments, 4)
# show posts
ex_hourly_show = hourly_comments(show_comments, 4)
# other posts
ex_hourly_other = hourly_comments(other_comments, 4)
# all posts
ex_hourly_hn_comments = hourly_comments(hn_comments, 4)
## inclusion of zero-comment submissions
## calc the average comments during a given hour in the day
# ask posts
in_hourly_ask = hourly_comments(ask_posts, 4)
# show posts
in_hourly_show = hourly_comments(show_posts, 4)
# other posts
in_hourly_other = hourly_comments(other_posts, 4)
# all posts
in_hourly_hn = hourly_comments(hn, 4)

##using hourly comments function to calc avg pts per hour
## exclusion of zero-comment submissions
# ask posts
ex_points_ask = hourly_comments(ask_comments, 3)
# show posts
ex_points_show = hourly_comments(show_comments, 3)
# other posts
ex_points_other = hourly_comments(other_comments, 3)
# all posts
ex_points_hn_comments = hourly_comments(hn_comments, 3)
## inclusion of zero-comment submissions
## calc the average points during a given hour in the day
# ask posts
in_points_ask = hourly_comments(ask_posts, 3)
# show posts
in_points_show = hourly_comments(show_posts, 3)
# other posts
in_points_other = hourly_comments(other_posts, 3)
# all posts
in_points_hn = hourly_comments(hn, 3)
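
The frequency-table logic in `hourly_comments` can also be sketched with `collections.defaultdict`, which removes the need for the if/else branch that initialises each hour. This is a minimal sketch rather than the notebook's implementation: `average_by_hour`, its `date_index` parameter, and `sample_rows` are hypothetical names and made-up data that only mirror the dataset's column order.

```python
import datetime as dt
from collections import defaultdict


def average_by_hour(rows, value_index, date_index=6):
    """Return [average, hour] pairs sorted by average, highest first."""
    totals = defaultdict(int)  # sum of the chosen column per hour
    counts = defaultdict(int)  # number of posts per hour
    for row in rows:
        hour = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M").hour
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    averages = [[totals[hour] / counts[hour], hour] for hour in counts]
    return sorted(averages, reverse=True)


# Made-up rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ['1', 'Ask HN: first', '', '3', '10', 'user_a', '9/26/2016 15:05'],
    ['2', 'Ask HN: second', '', '1', '20', 'user_b', '9/25/2016 15:40'],
    ['3', 'Ask HN: third', '', '2', '4', 'user_c', '9/26/2016 3:26'],
]

result = average_by_hour(sample_rows, 4)
for average, hour in result:
    print(f"{hour:02d}:00: {average:.2f} average comments per post.")
```

Passing `3` instead of `4` as `value_index` would average points rather than comments, in the same way the cell above reuses `hourly_comments` for both columns.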

In [128]:
## posts excluding zero-comment submissions, comments
print("An overview of average, high-engagement hours for posts, excluding zero-comment submissions.", '\n')
print("Ask HN: Hours of the average highest engagement")
for item in ex_hourly_ask[:5]:
    item[1] = dt.datetime.strptime(str(item[1]), '%H').strftime('%H:%M')
    strang = "{hour}: {comments:.2f} average comments per post.".format(hour=item[1], comments=item[0])
    print(strang)
print('\n')    
print("Show HN: Hours of the average highest engagement")
for item in ex_hourly_show[:5]:
    item[1] = dt.datetime.strptime(str(item[1]), '%H').strftime('%H:%M')
    strang = "{hour}: {comments:.2f} average comments per post.".format(hour=item[1], comments=item[0])
    print(strang)
print('\n')    
print("Other posts: Hours of the average highest engagement")
for item in ex_hourly_other[:5]:
    item[1] = dt.datetime.strptime(str(item[1]), '%H').strftime('%H:%M')
    strang = "{hour}: {comments:.2f} average comments per post.".format(hour=item[1], comments=item[0])
    print(strang)
print('\n')    
print("All HN: Hours of the average highest engagement")
for item in ex_hourly_hn_comments[:5]:
    item[1] = dt.datetime.strptime(str(item[1]), '%H').strftime('%H:%M')
    strang = "{hour}: {comments:.2f} average comments per post.".format(hour=item[1], comments=item[0])
    print(strang)
print('\n')

# posts excluding zero-comment submissions, points
print("An overview of the average highest points for a post, excluding zero-comment submissions.", '\n')
print("Ask HN: hours of the highest engagement vis-a-vis up/downvotes")
for item in ex_points_ask[:5]:
    item[1] = dt.datetime.strptime(str(item[1]), '%H').strftime('%H:%M')
    strang = "{hour}: {points:.2f} average points per post.".format(hour=item[1], points=item[0])
    print(strang)
print('\n')
print("Show HN: hours of the highest engagement vis-a-vis up/downvotes")
for item in ex_points_show[:5]:
    item[1] = dt.datetime.strptime(str(item[1]), '%H').strftime('%H:%M')
    strang = "{hour}: {points:.2f} average points per post.".format(hour=item[1], points=item[0])
    print(strang)
print('\n')
print("Other posts: hours of the highest engagement vis-a-vis up/downvotes")
for item in ex_points_other[:5]:
    item[1] = dt.datetime.strptime(str(item[1]), '%H').strftime('%H:%M')
    strang = "{hour}: {points:.2f} average points per post.".format(hour=item[1], points=item[0])
    print(strang)
print('\n')
print("All HN: hours of the highest engagement vis-a-vis up/downvotes")
for item in ex_points_hn_comments[:5]:
    item[1] = dt.datetime.strptime(str(item[1]), '%H').strftime('%H:%M')
    strang = "{hour}: {points:.2f} average points per post.".format(hour=item[1], points=item[0])
    print(strang)
print('\n')

# posts including zero-comment submissions, comments
print("An overview of average, high-engagement hours for posts, including zero-comment submissions.", '\n')
print("Ask HN: Hours of the average highest engagement")
for item in in_hourly_ask[:5]:
    item[1] = dt.datetime.strptime(str(item[1]), '%H').strftime('%H:%M')
    strang = "{hour}: {comments:.2f} average comments per post.".format(hour=item[1], comments=item[0])
    print(strang)
print('\n')    
print("Show HN: Hours of the average highest engagement")
for item in in_hourly_show[:5]:
    item[1] = dt.datetime.strptime(str(item[1]), '%H').strftime('%H:%M')
    strang = "{hour}: {comments:.2f} average comments per post.".format(hour=item[1], comments=item[0])
    print(strang)
print('\n')    
print("Other posts: Hours of the average highest engagement")
for item in in_hourly_other[:5]:
    item[1] = dt.datetime.strptime(str(item[1]), '%H').strftime('%H:%M')
    strang = "{hour}: {comments:.2f} average comments per post.".format(hour=item[1], comments=item[0])
    print(strang)
print('\n')    
print("All HN: Hours of the average highest engagement")
for item in in_hourly_hn[:5]:
    item[1] = dt.datetime.strptime(str(item[1]), '%H').strftime('%H:%M')
    strang = "{hour}: {comments:.2f} average comments per post.".format(hour=item[1], comments=item[0])
    print(strang)
print('\n')

#posts including zero-comment submission, points
print("An overview of the average highest points for a post, including zero-comment submissions.", '\n')
print("Ask HN: hours of the highest engagement vis-a-vis up/downvotes")
for item in in_points_ask[:5]:
    item[1] = dt.datetime.strptime(str(item[1]), '%H').strftime('%H:%M')
    strang = "{hour}: {points:.2f} average points per post.".format(hour=item[1], points=item[0])
    print(strang)
print('\n')
print("Show HN: hours of the highest engagement vis-a-vis up/downvotes")
for item in in_points_show[:5]:
    item[1] = dt.datetime.strptime(str(item[1]), '%H').strftime('%H:%M')
    strang = "{hour}: {points:.2f} average points per post.".format(hour=item[1], points=item[0])
    print(strang)
print('\n')
print("Other posts: hours of the highest engagement vis-a-vis up/downvotes")
for item in in_points_other[:5]:
    item[1] = dt.datetime.strptime(str(item[1]), '%H').strftime('%H:%M')
    strang = "{hour}: {points:.2f} average points per post.".format(hour=item[1], points=item[0])
    print(strang)
print('\n')
print("All HN: hours of the highest engagement vis-a-vis up/downvotes")
for item in in_points_hn[:5]:
    item[1] = dt.datetime.strptime(str(item[1]), '%H').strftime('%H:%M')
    strang = "{hour}: {points:.2f} average points per post.".format(hour=item[1], points=item[0])
    print(strang)
print('\n')

An overview of average, high-engagement hours for posts, excluding zero-comment submissions. 

Ask HN: Hours of the average highest engagement
15:00: 39.67 average comments per post.
13:00: 22.22 average comments per post.
12:00: 15.45 average comments per post.
10:00: 13.76 average comments per post.
17:00: 13.73 average comments per post.


Show HN: Hours of the average highest engagement
07:00: 12.42 average comments per post.
12:00: 12.03 average comments per post.
14:00: 11.60 average comments per post.
08:00: 11.07 average comments per post.
04:00: 10.87 average comments per post.


Other posts: Hours of the average highest engagement
13:00: 29.37 average comments per post.
12:00: 29.20 average comments per post.
14:00: 28.09 average comments per post.
15:00: 27.97 average comments per post.
11:00: 27.13 average comments per post.


All HN: Hours of the average highest engagement
15:00: 27.63 average comments per post.
13:00: 27.31 average comments per post.
12:00: 26.76 average 