# Engagement Analysis of Hacker News User Content

**by Gerard Tieng**

Today, we will be analyzing a [data set](https://www.kaggle.com/hacker-news/hacker-news-posts) of user submissions from the popular technology site Hacker News.

Our goal in this project is to determine which types of content gain the most comments from other users and to identify the best times to post in order to gain the most comments.

The following Python skills will be demonstrated:

- String manipulation for data cleansing
- Datetime formatting with the datetime library
- String formatting for easier readability

## Importing the Data

The first thing we'll do is import the data from the hacker_news.csv file using the reader function from the csv library and then convert the resulting object to a list. We'll verify we did this correctly by inspecting the first five rows of the result: the header row followed by four data rows.

In [1]:
import csv

# Open the file in a with block so it is closed once the rows are read
with open('hacker_news.csv') as opened_file:
    read_file = csv.reader(opened_file)
    hn = list(read_file)

hn[0:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

Upon inspection, we find that the first row contains the column headers of the dataset. Let's store the headers in their own variable, separate from the rest of the data.

In [2]:
headers = hn[0] #index 0
headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [3]:
hn = hn[1:] #index 1 through end of rows
hn[0:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]
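As a compact alternative to the two steps above, the header can be split off while reading. The sketch below is only an illustration: `load_rows` and `sample_csv` are hypothetical names, and the in-memory sample reuses two rows shown earlier so the example runs without the real file.

```python
import csv
import io

# Hypothetical in-memory stand-in for hacker_news.csv (two rows from the output above).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,"
    "3,1,hswarna,6/17/2016 0:01\n"
)

def load_rows(file_obj):
    """Return (headers, rows) from an open CSV file object."""
    reader = csv.reader(file_obj)
    headers = next(reader)  # the first row holds the column names
    rows = list(reader)     # every remaining row is data
    return headers, rows

sample_headers, sample_rows = load_rows(sample_csv)
print(sample_headers)
print(len(sample_rows))
```

With the real file, the same helper could be called inside `with open('hacker_news.csv', newline='') as f:` so the file is closed automatically after reading.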

With the headers separated from the data, we can now apply our first content filter. In the Hacker News community, there are two common types of user posts: "Ask Hacker News" (ASK HN for short), in which users submit questions, and "Show Hacker News" (SHOW HN for short), in which users share content. We'll be comparing ASK posts against SHOW posts for this project.

The following code lowercases each post's title, checks how it begins, and categorizes the post as ask, show, or other.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
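The categorization rule can also be isolated in a small function, which makes it easy to check on a few titles. This is a minimal sketch: `categorize` is a hypothetical helper, and the first two titles are made up for illustration.

```python
def categorize(title):
    """Return 'ask', 'show', or 'other' based on how the title begins."""
    lowered = title.lower()  # make the prefix check case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# The first two titles are invented; the third appears in the data above.
examples = [
    "Ask HN: How do you stay focused?",
    "SHOW HN: My weekend project",
    "Interactive Dynamic Video",
]
labels = [categorize(title) for title in examples]
print(labels)
```

Lowercasing before the prefix check is what lets "SHOW HN" and "Show HN" fall into the same category.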


At the very least, this initial filter shows that "Ask Hacker News" posts are more common in this dataset than "Show Hacker News" posts (1,744 versus 1,162). Now that we have our content filtered into separate ASK and SHOW lists, we will use the following code to calculate the average number of comments per post in each category.

In [5]:
total_ask_comments = 0

for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print("Comment average for ASK posts: " + str(avg_ask_comments))

total_show_comments = 0

for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print("Comment average for SHOW posts: " + str(avg_show_comments))

Comment average for ASK posts: 14.038417431192661
Comment average for SHOW posts: 10.31669535283993


In [6]:
ask_vs_show = (avg_ask_comments - avg_show_comments) /  avg_show_comments
print("ASK posts receive " + str(ask_vs_show*100) + "% more comments than SHOW posts.")

ASK posts receive 36.074750208924534% more comments than SHOW posts.


Our analysis shows that ASK posts receive about 3.7 more comments per post on average than SHOW posts (14.04 versus 10.32). For the remainder of this project, we'll concentrate on "Ask Hacker News" posts.
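The absolute and relative gaps can be checked directly from the two averages printed above. The variable names in this short sketch are illustrative only.

```python
# Averages copied from the output of the earlier cell.
avg_ask = 14.038417431192661
avg_show = 10.31669535283993

absolute_gap = avg_ask - avg_show       # extra comments per ASK post
relative_gap = absolute_gap / avg_show  # gap as a fraction of the SHOW average

print(f"Absolute gap: {absolute_gap:.2f} comments per post")
print(f"Relative gap: {relative_gap:.1%}")
```

The relative figure uses the SHOW average as the baseline, which is why it reads as "ASK posts receive X% more comments than SHOW posts".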

## Time Analysis

In this portion of the analysis we will determine which of the 24 hours in the day is expected to yield the most comments when making an Ask Hacker News post. 

The first loop below extracts the creation timestamp and the number of comments from each entry in the ask_posts subset. The second loop uses the strptime (string-parse-time) method from the datetime library to convert each date string to a datetime object, and the strftime (string-format-time) method to save the hour as a two-digit string.

Within that second loop, we also build two frequency tables: one storing the number of posts created in each hour (counts_by_hour) and one storing the total number of comments on those posts (comments_by_hour).

In [7]:
import datetime as dt

result_list = []

for post in ask_posts:
    stamp = post[6]
    comments = int(post[4])
    result_list.append([stamp, comments])
    

counts_by_hour = {}
comments_by_hour = {}

for result in result_list:
    date = result[0]
    comments = result[1]
    
    stamp = dt.datetime.strptime(date, '%m/%d/%Y %H:%M')
    hour = dt.datetime.strftime(stamp, '%H')
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
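To see what the parsing step produces on its own, here is a small sketch that applies the same format string to three timestamps from the rows shown earlier. The names `stamps` and `hours` are illustrative.

```python
import datetime as dt

# Timestamps copied from the rows displayed earlier.
stamps = ['8/4/2016 11:52', '6/17/2016 0:01', '1/26/2016 19:30']

hours = []
for stamp in stamps:
    # Parse month/day/year and 24-hour time into a datetime object
    parsed = dt.datetime.strptime(stamp, '%m/%d/%Y %H:%M')
    # Format only the hour, zero-padded to two digits
    hours.append(parsed.strftime('%H'))

print(hours)
```

Because the hour is stored as a zero-padded string, a post created at 0:01 falls under the key '00' rather than '0'.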

With both the total number of entries and comments by hour, we can calculate the average number of comments a post from each hour will receive.

In [8]:
avg_by_hour = []

for key, value in counts_by_hour.items():
    average = comments_by_hour[key] / counts_by_hour[key]
    avg_by_hour.append([key, average])
    
avg_by_hour

[['04', 7.170212765957447],
 ['02', 23.810344827586206],
 ['15', 38.5948275862069],
 ['22', 6.746478873239437],
 ['11', 11.051724137931034],
 ['17', 11.46],
 ['16', 16.796296296296298],
 ['20', 21.525],
 ['13', 14.741176470588234],
 ['00', 8.127272727272727],
 ['14', 13.233644859813085],
 ['03', 7.796296296296297],
 ['21', 16.009174311926607],
 ['07', 7.852941176470588],
 ['19', 10.8],
 ['18', 13.20183486238532],
 ['12', 9.41095890410959],
 ['09', 5.5777777777777775],
 ['05', 10.08695652173913],
 ['06', 9.022727272727273],
 ['23', 7.985294117647059],
 ['10', 13.440677966101696],
 ['08', 10.25],
 ['01', 11.383333333333333]]
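The per-hour average is simply the comment total divided by the post count for the same key. The sketch below shows the same calculation as a dictionary comprehension; the tallies are hypothetical values chosen for illustration and are not taken from the dataset.

```python
# Hypothetical tallies for illustration only (not from the dataset).
sample_counts = {'09': 4, '15': 2}       # posts created in each hour
sample_comments = {'09': 20, '15': 70}   # total comments on those posts

# Divide total comments by number of posts for each hour
sample_avg = {
    hour: sample_comments[hour] / sample_counts[hour]
    for hour in sample_counts
}
print(sample_avg)
```

A dictionary keeps each hour paired with its average, whereas the list of lists used above is convenient for the sorting step that follows.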

## Displaying the Top 5 Times

Next, we'll swap each pair so that the average sits at index 0 and the hour at index 1, then use the built-in sorted() function with reverse=True to order the pairs from highest to lowest average, giving us our top 5 times. For better readability, the str.format() method will be used to insert the desired values into a template string.

In [9]:
swap_avg_by_hour = []

for average in avg_by_hour:
    swap_avg_by_hour.append([average[1], average[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [10]:
print("Top 5 Hours for Ask Posts Comments:")
print("***********************************")

for hours in sorted_swap[0:5]:
    
    hour_dt = dt.datetime.strptime(hours[1], '%H')
    time = dt.datetime.strftime(hour_dt, '%H')
    average = hours[0]
    
    template = "{t}: {a:.2f} average comments per post"
    output = template.format(t=time, a=average)
    print(output)

Top 5 Hours for Ask Posts Comments:
***********************************
15: 38.59 average comments per post
02: 23.81 average comments per post
20: 21.52 average comments per post
16: 16.80 average comments per post
21: 16.01 average comments per post
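The swap step can be avoided by telling sorted() which element to sort on. The following is a sketch of that alternative, not the approach used above; `pairs` holds a few [hour, average] entries copied from avg_by_hour.

```python
# A few [hour, average] pairs copied from the avg_by_hour output.
pairs = [
    ['04', 7.170212765957447],
    ['02', 23.810344827586206],
    ['15', 38.5948275862069],
    ['20', 21.525],
]

# Sort on the average (index 1) instead of swapping the columns first
top = sorted(pairs, key=lambda pair: pair[1], reverse=True)

template = "{t}: {a:.2f} average comments per post"
for hour, average in top[:3]:
    print(template.format(t=hour, a=average))
```

Using the key argument leaves the original [hour, average] layout intact, so no second list is needed.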


## Conclusion

According to our analysis, posts in the "Ask Hacker News" format receive on average 36% more comments than posts in the "Show Hacker News" format.

Among "Ask Hacker News" posts, those created during the 3 PM hour (15:00 to 15:59) receive the most comments, with an average of 38.59 per post. The next most favorable hours are 2 AM (23.81), 8 PM (21.52), 4 PM (16.80), and 9 PM (16.01).