Basic Data Analytics for Hacker News Posts

This is a simple program based on a Dataquest guided project. Here, I analyze a dataset of Hacker News posts submitted between September 2015 and September 2016. The documentation for the dataset can be found at https://www.kaggle.com/hacker-news/hacker-news-posts

Hacker News has two special types of posts, Ask HN and Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the community a project, a product, or just something generally interesting.

The project sets out to answer two questions:
1. Which of the two, Ask HN or Show HN, receives more comments per post on average?
2. For the post type with the higher average, does the time of posting affect the number of comments? If yes, which 5 hours receive the most comments per post?

In [1]:
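# Open the data file and read every row into a list of lists

from csv import reader

# The with-block closes the file once all rows have been read
with open("hacker_news.csv", newline="") as opened_file:
    hn = list(reader(opened_file))

# Separate the header row from the data rows
hn_header = hn[0]
hn = hn[1:]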

# Look at the header and the first three rows of data

print(hn_header, "\n")
for row in hn[:3]:
    print(row)
    print("\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']




In [2]:
# To obtain separate lists for posts whose titles start with
# ask hn or show hn (case-insensitive), and a list for the rest

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("The number of posts with ask hn:",len(ask_posts))
print("The number of posts with show hn:",len(show_posts))
print("The number of other posts:", len(other_posts))

The number of posts with ask hn: 1744
The number of posts with show hn: 1162
The number of other posts: 17194
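
The title check used above can also be wrapped in a small function, which makes the rule easy to try on individual titles. This is only a sketch: `classify_title` is a hypothetical helper name, and the titles below are made up for illustration rather than taken from the dataset.

```python
def classify_title(title):
    """Return "ask", "show" or "other" based on how the title starts."""
    lowered = title.lower()  # make the prefix check case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Made-up titles, for illustration only
print(classify_title("Ask HN: What are you reading?"))   # ask
print(classify_title("SHOW HN: My weekend project"))     # show
print(classify_title("Interactive Dynamic Video"))       # other
```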


In [3]:
# Obtaining the average number of comments in ask hn and 
# show hn posts

total_ask_comments = 0

for row in ask_posts:
    n_comments = int(row[4])
    total_ask_comments += n_comments
    
avg_ask_comments = total_ask_comments/len(ask_posts)

print("Average number of comments in ask hn posts:",
      round(avg_ask_comments, 2))


total_show_comments = 0

for row in show_posts:
    n_comments = int(row[4])
    total_show_comments += n_comments
    
avg_show_comments = total_show_comments/len(show_posts)

print("Average number of comments in show hn posts:",
      round(avg_show_comments, 2))

Average number of comments in ask hn posts: 14.04
Average number of comments in show hn posts: 10.32
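
The two loops above differ only in the list they read, so the same calculation could be written once as a function. The sketch below assumes the column layout shown in the header (comment count at index 4). `average_comments` is a hypothetical helper name, and the sample rows are made up rather than taken from the dataset.

```python
def average_comments(rows, comments_index=4):
    """Return the mean comment count of `rows`, or 0.0 for an empty list."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)


# Made-up rows in the same column order as the dataset
sample_rows = [
    ["1", "Ask HN: example one", "", "5", "10", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: example two", "", "3", "4", "user_b", "8/4/2016 12:10"],
]
print(average_comments(sample_rows))  # 7.0
```

With the notebook's lists in scope, the same function could be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.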


In [4]:
# To calculate the average number of comments by hour for
# ask hn posts

import datetime as dt

result_list = []

for row in ask_posts:
    n_comments = int(row[4])
    result_list.append([row[6], n_comments])
    
#print(result_list[:3])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_str = row[0]
    date = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M")
    hour = date.hour
    #print(hour)
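    if hour not in counts_by_hour:
        # First post seen for this hour: initialize both tallies
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        # Later posts: accumulate. A second plain `if` here would count
        # the first post of every hour twice (the averages printed below
        # come from a run with that double count, so they differ slightly).
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]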

#print(counts_by_hour)
#print("\n")
#print(comments_by_hour)

avg_by_hour = []

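# Both dictionaries share the same keys, so iterate over one of them;
# sorting the keys lists the hours from 0 to 23
for hour in sorted(counts_by_hour):
    avg_comments = comments_by_hour[hour]/counts_by_hour[hour]
    avg_by_hour.append([hour, avg_comments])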
    
print("The average number of comments per post, by hour, is given below: \n")
for row in avg_by_hour:
    print(row)

The average number of comments per post, by hour, is given below: 

[0, 8.160714285714286]
[1, 11.737704918032787]
[2, 23.45762711864407]
[3, 7.672727272727273]
[4, 7.083333333333333]
[5, 10.48936170212766]
[6, 8.844444444444445]
[7, 7.685714285714286]
[8, 10.142857142857142]
[9, 5.586956521739131]
[10, 13.233333333333333]
[11, 10.898305084745763]
[12, 9.337837837837839]
[13, 14.906976744186046]
[14, 13.13888888888889]
[15, 38.27350427350427]
[16, 16.798165137614678]
[17, 11.356435643564357]
[18, 13.1]
[19, 10.72972972972973]
[20, 21.28395061728395]
[21, 15.9]
[22, 6.680555555555555]
[23, 7.884057971014493]
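
The per-hour tally can also be sketched with `collections.defaultdict`, which removes the need to check whether an hour has been seen before. This is a minimal sketch rather than the notebook's own method: `average_comments_by_hour` is a hypothetical name, the date format is assumed to match the `created_at` column, and the sample pairs are made up.

```python
import datetime as dt
from collections import defaultdict


def average_comments_by_hour(pairs, fmt="%m/%d/%Y %H:%M"):
    """Given [created_at, n_comments] pairs, return [hour, average] pairs."""
    counts = defaultdict(int)    # number of posts per hour
    comments = defaultdict(int)  # total comments per hour
    for created_at, n_comments in pairs:
        hour = dt.datetime.strptime(created_at, fmt).hour
        counts[hour] += 1        # each post is counted exactly once
        comments[hour] += n_comments
    return [[hour, comments[hour] / counts[hour]] for hour in sorted(counts)]


# Made-up [created_at, n_comments] pairs
sample_pairs = [
    ["8/4/2016 11:52", 6],
    ["8/5/2016 11:05", 2],
    ["8/5/2016 15:30", 9],
]
print(average_comments_by_hour(sample_pairs))  # [[11, 4.0], [15, 9.0]]
```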


In [5]:
# To show the 5 hours with the highest average number of
# comments per ask hn post

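def sort_second(val):
    """Sort key: the average, i.e. the second item of an [hour, average] pair."""
    return val[1]

# Sort in place, highest average first
avg_by_hour.sort(key=sort_second, reverse=True)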


print("The top 5 hours for ask post comments are:\n")

# avg_by_hour is sorted in descending order, so the first 5 rows are the top 5
for row in avg_by_hour[:5]:
    # Format the integer hour as HH:MM, e.g. 15 becomes "15:00"
    hour = dt.datetime.strptime(str(row[0]), "%H").strftime("%H:%M")
    avg_comments = row[1]
    output = "{}: {:.2f} average comments per post".format(hour, avg_comments)
    print(output)

The top 5 hours for ask post comments are:

15:00: 38.27 average comments per post
02:00: 23.46 average comments per post
20:00: 21.28 average comments per post
16:00: 16.80 average comments per post
21:00: 15.90 average comments per post
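
When only the top few entries are needed, `heapq.nlargest` from the standard library can pick them without sorting the whole list in place. The sketch below is an alternative to the sort used above, not a replacement for it. `top_hours` is a hypothetical name and the sample averages are made up.

```python
import heapq


def top_hours(avg_by_hour, n=5):
    """Return the n [hour, average] pairs with the highest averages."""
    return heapq.nlargest(n, avg_by_hour, key=lambda pair: pair[1])


# Made-up [hour, average] pairs
sample = [[0, 8.0], [1, 11.5], [2, 23.0], [3, 7.5]]
for hour, avg in top_hours(sample, n=2):
    print("{:02d}:00: {:.2f} average comments per post".format(hour, avg))
```

Unlike `list.sort`, this leaves the original list order untouched, which matters if a later cell still expects the hours in their original order.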


In [7]:
# To convert the top 5 hours with the highest average number
# of comments from EST (UTC-5) to IST (UTC+5:30)


print("The top 5 hours for ask post comments in IST are:\n")

for row in avg_by_hour[:5]:
    # IST (UTC+5:30) is 10 hours 30 minutes ahead of EST (UTC-5)
    time = dt.datetime.strptime(str(row[0]), "%H")
    time_ist = (time + dt.timedelta(hours=10, minutes=30)).strftime("%H:%M")
    avg_comments = row[1]
    output = "{}: {:.2f} average comments per post".format(time_ist, avg_comments)
    print(output)

The top 5 hours for ask post comments in IST are:

01:30: 38.27 average comments per post
12:30: 23.46 average comments per post
06:30: 21.28 average comments per post
02:30: 16.80 average comments per post
07:30: 15.90 average comments per post
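
The same conversion can be sketched with timezone-aware datetimes instead of adding a fixed `timedelta` by hand. The sketch assumes fixed offsets of UTC-5 for EST and UTC+5:30 for IST, so it ignores daylight saving time in the US (during which the gap is 9 hours 30 minutes). `est_hour_to_ist` is a hypothetical helper name, and the date used is arbitrary.

```python
import datetime as dt

# Fixed-offset time zones (assumption: no daylight saving adjustment)
EST = dt.timezone(dt.timedelta(hours=-5), "EST")
IST = dt.timezone(dt.timedelta(hours=5, minutes=30), "IST")


def est_hour_to_ist(hour):
    """Convert a whole hour in EST to an HH:MM string in IST."""
    # The date is arbitrary; only the time of day matters here
    est_time = dt.datetime(2016, 1, 1, hour, 0, tzinfo=EST)
    return est_time.astimezone(IST).strftime("%H:%M")


print(est_hour_to_ist(15))  # 01:30
print(est_hour_to_ist(2))   # 12:30
```

Note that 15:00 EST falls on the following calendar day in IST, which the formatted `HH:MM` string does not show.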
