
# Analysis of [Hacker News](https://news.ycombinator.com) Posts

In this analysis, we take a dataset of Hacker News posts and look for patterns in it. We focus on **Ask HN** and **Show HN** posts to see whether either type shows distinctive patterns.

The dataset comes from [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts). Note that the version used here has been reduced by removing all posts without comments.

In [16]:
from csv import reader

opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
print(hn[:6])


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
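The loading step can also be written with a context manager, so the file is closed as soon as it has been read, and with `next()` to split off the header while reading. The following is a minimal sketch rather than the notebook's own code: it reads from an in-memory sample (two rows copied from the output above) so that it runs without `hacker_news.csv`, and `load_rows` is a hypothetical helper name.

```python
import csv
import io

# In-memory stand-in for hacker_news.csv (header and two rows copied from the output above).
SAMPLE_CSV = """id,title,url,num_points,num_comments,author,created_at
12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,3,1,hswarna,6/17/2016 0:01
"""


def load_rows(file_obj):
    """Return (header, rows) from an open CSV file object (hypothetical helper)."""
    rows = csv.reader(file_obj)
    header = next(rows)         # the first line holds the column names
    return header, list(rows)   # the remaining lines are the data rows


# With the real file this would be: with open('hacker_news.csv', newline='') as f:
with io.StringIO(SAMPLE_CSV) as f:
    sample_header, sample_rows = load_rows(f)

print(sample_header)
print(len(sample_rows))
```

Because the header is separated while reading, no later cell needs to modify the list of rows, which keeps every cell safe to re-run.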

In [20]:
# Separate the header row from the data. Checking the first cell keeps this
# cell safe to re-run: a bare hn.pop(0) removes one more data row on every repeat.
if hn[0][0] == 'id':
    header = hn.pop(0)
print(header)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]

## Extracting Ask HN and Show HN Posts

In [19]:
# Split the posts into three lists: Ask HN posts, Show HN posts, and all other posts

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("The number of Ask HN posts is: " + str(len(ask_posts)) + ".")
print("The number of Show HN posts is: " + str(len(show_posts)) + ".")
print("The number of other posts is: " + str(len(other_posts)) + ".")


The number of Ask HN posts is: 1744.
The number of Show HN posts is: 1162.
The number of other posts is: 17193.


There are 1744 Ask HN posts, 1162 Show HN posts, and 17193 other posts.
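The prefix test used for the split can be isolated in a small function, which makes it easy to check on a few titles before running it over the whole dataset. This is a sketch under assumptions: `classify_title` is a hypothetical name, and the first two example titles are made up for illustration (the third appears in the dataset output above).

```python
from collections import Counter


def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# The first two titles are invented examples; the third is taken from the dataset.
examples = [
    "Ask HN: How do you learn Python?",
    "SHOW HN: My side project",
    "Interactive Dynamic Video",
]

labels = [classify_title(title) for title in examples]
label_counts = Counter(labels)  # tally how many titles fall into each category

print(labels)
print(label_counts)
```

Lowercasing before the comparison means that titles such as `SHOW HN: ...` are still counted, matching the behaviour of the loop above.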

## Calculating the Average Number of Comments for Ask HN and Show HN Posts

We will determine whether Ask HN or Show HN posts receive more comments on average.

In [23]:
#creating a function that will take in a data set and give you the average number of comments
def avg_comments(dataset):
    
    total_comments = 0
    
    for row in dataset:
        num_comments = row[4]
        num_comments = int(num_comments)
        total_comments += num_comments
    
    average_comments = total_comments/len(dataset)
    return average_comments


avg_ask_comments = avg_comments(ask_posts)
avg_show_comments = avg_comments(show_posts)

print("The average number of comments an Ask HN post gets is: " + str(avg_ask_comments))
print("The average number of comments a Show HN post gets is: " + str(avg_show_comments))



The average number of comments an Ask HN post gets is: 14.038417431192661
The average number of comments a Show HN post gets is: 10.31669535283993


From our analysis, it seems that Ask HN posts, on average, get more comments than Show HN posts.
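The same average can be computed with the standard library's `statistics` module. The sketch below is illustrative only: `comment_stats` is a hypothetical helper, the sample rows are invented, and the median is included purely as an optional cross-check, since a mean can be pulled upward by a few heavily commented posts. The helper returns `None` for an empty list instead of dividing by zero.

```python
from statistics import mean, median


def comment_stats(rows, comments_index=4):
    """Return (mean, median) of the comment counts, or None if rows is empty."""
    counts = [int(row[comments_index]) for row in rows]
    if not counts:
        return None  # avoids a division by zero on an empty list
    return mean(counts), median(counts)


# Invented rows in the dataset's column order; only column 4 (num_comments) is used.
sample_rows = [
    ["1", "Ask HN: example A", "", "5", "10", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: example B", "", "3", "2", "user_b", "9/30/2015 4:12"],
    ["3", "Ask HN: example C", "", "9", "30", "user_c", "1/26/2016 19:30"],
]

sample_mean, sample_median = comment_stats(sample_rows)
print(sample_mean, sample_median)
```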

## Finding the Number of Ask Posts and Comments by Hour Created

We will determine if Ask HN posts created at a certain time are more likely to get more comments.

In [38]:
import datetime as dt

# result_list will become a list of lists. Each inner list holds two elements:
# the time the post was created and the number of comments it received.
result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = row[4]
    two_elements = [created_at, int(num_comments)]
    result_list.append(two_elements)

#print(result_list)
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_created = row[0]
    n_comments = row[1]
    dt_object = dt.datetime.strptime(date_created, "%m/%d/%Y %H:%M")
    dt_hour = dt_object.hour #extracting just the hour from the date time object
    
    if dt_hour not in counts_by_hour:
        counts_by_hour[dt_hour] = 1
        comments_by_hour[dt_hour] = n_comments
    else:
        counts_by_hour[dt_hour] += 1
        comments_by_hour[dt_hour] += n_comments
#print(counts_by_hour)
#print(comments_by_hour)


for i, j in counts_by_hour.items():
    print("At hour " + str(i) + ", there were " + str(j) + " posts.")
print("\n\n")    
for i, j in comments_by_hour.items():
    print("At hour " + str(i) + ", there were " + str(j) + " comments.")
    
    
    

At hour 9, there were 45 posts.
At hour 13, there were 85 posts.
At hour 10, there were 59 posts.
At hour 14, there were 107 posts.
At hour 16, there were 108 posts.
At hour 23, there were 68 posts.
At hour 12, there were 73 posts.
At hour 17, there were 100 posts.
At hour 15, there were 116 posts.
At hour 21, there were 109 posts.
At hour 20, there were 80 posts.
At hour 2, there were 58 posts.
At hour 18, there were 109 posts.
At hour 3, there were 54 posts.
At hour 5, there were 46 posts.
At hour 19, there were 110 posts.
At hour 1, there were 60 posts.
At hour 22, there were 71 posts.
At hour 8, there were 48 posts.
At hour 4, there were 47 posts.
At hour 0, there were 55 posts.
At hour 6, there were 44 posts.
At hour 7, there were 34 posts.
At hour 11, there were 58 posts.



At hour 9, there were 251 comments.
At hour 13, there were 1253 comments.
At hour 10, there were 793 comments.
At hour 14, there were 1416 comments.
At hour 16, there were 1814 comments.
At hour 23, there wer
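The grouping above can be written more compactly with `collections.defaultdict`, which removes the need for the `if ... not in` branch. This is a sketch on a tiny sample, not a replacement for the cell above: the first two timestamps are taken from the dataset output, the third timestamp and all comment counts are invented, and the names `posts_per_hour` and `comments_per_hour` are hypothetical.

```python
import datetime as dt
from collections import defaultdict

# [created_at, num_comments] pairs in the dataset's timestamp format.
# The comment counts and the third timestamp are invented for illustration.
sample = [
    ["8/4/2016 11:52", 6],
    ["9/30/2015 4:12", 2],
    ["1/26/2016 11:05", 10],
]

posts_per_hour = defaultdict(int)     # missing hours start at 0
comments_per_hour = defaultdict(int)

for created_at, n_comments in sample:
    # Parse the timestamp and keep only the hour (0-23).
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    posts_per_hour[hour] += 1
    comments_per_hour[hour] += n_comments

# Iterating over sorted keys prints the hours in chronological order.
for hour in sorted(posts_per_hour):
    print(hour, posts_per_hour[hour], comments_per_hour[hour])
```

Sorting the keys before printing gives output ordered by hour, which is easier to scan than dictionary insertion order.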

In [41]:
avg_by_hour = []
# Both dictionaries share the same keys (the hours), so a single loop over
# counts_by_hour can compute the average number of comments for each hour.

for key in counts_by_hour:
    # Named hour_avg so that it does not overwrite the avg_comments() function defined earlier
    hour_avg = comments_by_hour[key] / counts_by_hour[key]
    avg_by_hour.append([key, hour_avg])

for row in avg_by_hour:
    hour = row[0]
    avg = row[1]
    print("The average number of comments for hour " + str(hour) + " is " + str(avg) +".")

The average number of comments for hour 9 is 5.5777777777777775.
The average number of comments for hour 13 is 14.741176470588234.
The average number of comments for hour 10 is 13.440677966101696.
The average number of comments for hour 14 is 13.233644859813085.
The average number of comments for hour 16 is 16.796296296296298.
The average number of comments for hour 23 is 7.985294117647059.
The average number of comments for hour 12 is 9.41095890410959.
The average number of comments for hour 17 is 11.46.
The average number of comments for hour 15 is 38.5948275862069.
The average number of comments for hour 21 is 16.009174311926607.
The average number of comments for hour 20 is 21.525.
The average number of comments for hour 2 is 23.810344827586206.
The average number of comments for hour 18 is 13.20183486238532.
The average number of comments for hour 3 is 7.796296296296297.
The average number of comments for hour 5 is 10.08695652173913.
The average number of comments for hour 19 is 1

## Sorting and Printing Values from a List of Lists


In [49]:
# Swap each [hour, average] pair to [average, hour] so that sorting orders by average
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

# Sort in descending order so the highest averages come first
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Post Comments:")

for row in sorted_swap[:5]:
    hour = str(row[1])
    # Parse the hour number and format it as HH:MM
    dt_hour_str = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(dt_hour_str, row[0]))

Top 5 Hours for Ask Post Comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
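Swapping the columns is one way to sort by the average. An alternative sketch uses the `key` argument of `sorted()` to sort the `[hour, average]` pairs directly. The sample values below are the averages from the output above, rounded to two decimals, and `sample_avg_by_hour` and `top_hours` are hypothetical names.

```python
# [hour, average comments] pairs, rounded from the output above.
sample_avg_by_hour = [
    [16, 16.80],
    [15, 38.59],
    [21, 16.01],
    [2, 23.81],
    [20, 21.52],
    [9, 5.58],
]

# Sort by the second element (the average), highest first, and keep the top five.
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_hours:
    # {:02d} pads the hour to two digits, e.g. 2 becomes 02
    print("{:02d}:00: {:.2f} average comments per post".format(hour, avg))
```

One difference worth knowing: with `key`, pairs that tie on the average keep their original relative order, whereas sorting swapped pairs breaks ties by hour.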


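In this dataset, Ask HN posts created during the 15:00 hour (3:00 PM) received the most comments on average, about 38.59 per post, well ahead of the next-best hour, 02:00, at 23.81. Posting an Ask HN post around 3:00 PM therefore appears most likely to attract comments, keeping in mind that the dataset excludes posts that received no comments.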