# Analyzing Posts on Hacker News
### What time of day is best to post on Hacker News?

In this project we will analyze a Hacker News data set and compare two types of posts that are common on the website: "Ask HN" and "Show HN". For these posts we will look at the number of comments, the hour of the day with the most comments, the number of points, the hour of the day with the most points, and so on.

In [1]:
#Open and read file from Hacker News
from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)

#Create variable hn as a list of lists
hn = list(read_file)

#Check top five rows
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
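
The cell above never closes the file handle. A minimal sketch of the same read using a context manager, which closes the file automatically, might look like the following. The `load_rows` helper name and the explicit `encoding='utf-8'` are assumptions, not part of the original notebook, and the sketch writes a two-row sample file first so that it runs without `hacker_news.csv`.

```python
import csv
import os
import tempfile

def load_rows(path, encoding='utf-8'):
    """Return every row of a CSV file as a list of lists of strings."""
    # newline='' is the setting the csv module documentation recommends for files passed to csv.reader
    with open(path, newline='', encoding=encoding) as f:
        return list(csv.reader(f))

# Build a tiny sample file (header plus the first data row shown above)
sample = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf-8', newline='') as tmp:
    tmp.write(sample)
    sample_path = tmp.name

rows = load_rows(sample_path)
os.remove(sample_path)   # clean up the temporary file

print(rows[0])   # header row
print(rows[1])   # first data row
```

With the real data, `hn = load_rows('hacker_news.csv')` would replace the `open`/`reader`/`list` sequence.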


In [2]:
#Isolate the header row 
headers = hn[:1]
hn = hn[1:]

#Check outcome
print(headers)
print('\n')
print(hn[:5])
print('Number of rows: ', len(hn))

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
Number of ro

Hacker News has two very typical kinds of posts. In "Ask HN" posts, users ask the community a question; in "Show HN" posts, users show the community a new feature, program, or technology. In both cases the post title starts with the phrase "Ask HN" or "Show HN".

For our analysis we will separate the posts into three categories: `ask_posts`, `show_posts` and `other_posts`. To do this we must keep in mind that users do not always capitalize titles the same way.

In [3]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'): #lowercase the title first, then test the prefix with str.startswith()
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
#Check to see if we have the same number of posts as before (20,100)
print('Ask HN Posts: ', len(ask_posts),
     'Show HN Posts: ', len(show_posts),
     'Other Posts: ', len(other_posts)
     )

print('Total Number of Posts: ', len(ask_posts) + len(show_posts) + len(other_posts)) # we do get the same number of rows as before

Ask HN Posts:  1744 Show HN Posts:  1162 Other Posts:  17194
Total Number of Posts:  20100
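
The case-insensitive prefix test can also be pulled out into a small function, which makes it easy to try on a handful of titles. This is only a sketch: `classify_title` is a hypothetical helper name, and the example titles are made up to show different capitalizations.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix, ignoring case."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles written with different capitalization
examples = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'ask hn: any tips?',
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in examples]
print(labels)
```

Note that a title containing "Ask HN" somewhere in the middle is still counted as "other", which matches the `startswith` logic used in the cell above.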


Now that we have separated the types of posts from the Hacker News file, we will check the average number of comments each type of post generates on the site.

In [4]:
#Set empty lists for the totals
total_ask_comments = []
total_show_comments = [] 
total_other_comments = []

for row in ask_posts:
    num_comments = row[4] #get the number of comments and convert it to an integer
    num_comments = int(num_comments)
    total_ask_comments.append(num_comments)

for row in show_posts:
    num_comments = row[4]
    num_comments = int(num_comments)
    total_show_comments.append(num_comments)

for row in other_posts:
    num_comments = row[4]
    num_comments = int(num_comments)
    total_other_comments.append(num_comments)
    
#Set the variables for the averages
avg_ask_comments = sum(total_ask_comments) / len(ask_posts)
avg_show_comments = sum(total_show_comments) / len(show_posts)
avg_other_comments = sum(total_other_comments) / len(other_posts)
    
#Print averages in a nice form
print('Average Comments "Ask HN": {:.2f}'.format(avg_ask_comments), '\n'
     'Average Comments "Show HN": {:.2f}'.format(avg_show_comments),'\n'
     'Average Comments Other Posts: {:.2f}'.format(avg_other_comments))    

Average Comments "Ask HN": 14.04 
Average Comments "Show HN": 10.32 
Average Comments Other Posts: 26.87


From the analysis above we can see that, on average, "Ask HN" posts have more comments (14.04 per post) than "Show HN" posts (10.32 per post). Neither, however, has more comments than the remaining posts, which average 26.87 comments per post.

I would argue that it is not surprising to see a higher average for "Ask HN" than for "Show HN": an "Ask HN" post is looking for answers to a question, and the community responds in the comments.
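
The three loops in the cell above do the same work on different lists, so the averaging could be factored into one function. A minimal sketch, assuming a hypothetical helper named `column_average`; the two sample rows are copied from the data preview at the top of the notebook.

```python
def column_average(rows, index):
    """Average of the integer values stored as strings in column `index`."""
    values = [int(row[index]) for row in rows]
    return sum(values) / len(values)   # raises ZeroDivisionError if rows is empty

# Two rows from the preview; column 3 is num_points, column 4 is num_comments
sample_rows = [
    ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)',
     'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
     '8', '2', 'walterbell', '9/30/2015 4:12'],
]

avg_comments = column_average(sample_rows, 4)
avg_points = column_average(sample_rows, 3)
print('Average comments: {:.2f}'.format(avg_comments))
print('Average points: {:.2f}'.format(avg_points))
```

With the real lists, `column_average(ask_posts, 4)` would give the same value as `avg_ask_comments`.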

*I will focus the remainder of this analysis on these "Ask HN" posts and try to extract more information from the data set.*

In [5]:
import datetime as dt
#Organize data of interest (datetime data and number of comments) in a single list of lists
result_list = []
for row in ask_posts:
    created_at = row[6]
    created_at = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    num_comments = row[4]
    num_comments = int(num_comments)
    result_list.append([created_at, num_comments])
    
#Create dictionaries to set frequency per hour and number of comments per hour
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    hour = row[0].hour
    num_comments = row[1]
    if hour in  counts_by_hour:
        counts_by_hour[hour] += 1              #one more post created in this hour
        comments_by_hour[hour] += num_comments #add to the running total of comments for this hour
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments

#Print results nicely        
for key, value in counts_by_hour.items():
    print('Hour: ', key, '   # of Posts -> ', value)

for key, value in comments_by_hour.items():
    print('Hour: ', key, '   Total Comments -> ', value)    

Hour:  0    # of Posts ->  55
Hour:  1    # of Posts ->  60
Hour:  2    # of Posts ->  58
Hour:  3    # of Posts ->  54
Hour:  4    # of Posts ->  47
Hour:  5    # of Posts ->  46
Hour:  6    # of Posts ->  44
Hour:  7    # of Posts ->  34
Hour:  8    # of Posts ->  48
Hour:  9    # of Posts ->  45
Hour:  10    # of Posts ->  59
Hour:  11    # of Posts ->  58
Hour:  12    # of Posts ->  73
Hour:  13    # of Posts ->  85
Hour:  14    # of Posts ->  107
Hour:  15    # of Posts ->  116
Hour:  16    # of Posts ->  108
Hour:  17    # of Posts ->  100
Hour:  18    # of Posts ->  109
Hour:  19    # of Posts ->  110
Hour:  20    # of Posts ->  80
Hour:  21    # of Posts ->  109
Hour:  22    # of Posts ->  71
Hour:  23    # of Posts ->  68
Hour:  0    Total Comments ->  447
Hour:  1    Total Comments ->  683
Hour:  2    Total Comments ->  1381
Hour:  3    Total Comments ->  421
Hour:  4    Total Comments ->  337
Hour:  5    Total Comments ->  464
Hour:  6    Total Comments ->  397
Hour:  7    T
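
The `if hour in ... else` bookkeeping can be avoided with `collections.defaultdict`, which starts every missing key at zero. The sketch below is one possible alternative, not the notebook's method; the three `(created_at, num_comments)` pairs are hypothetical and only follow the date format of the data set.

```python
from collections import defaultdict
import datetime as dt

# Hypothetical (created_at, num_comments) pairs in the dataset's date format
sample = [('8/4/2016 11:52', 52), ('1/26/2016 19:30', 10), ('6/23/2016 11:05', 4)]

counts_by_hour = defaultdict(int)     # missing hours start at 0
comments_by_hour = defaultdict(int)
for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

# Iterate over the hours in ascending order for a readable table
for hour in sorted(counts_by_hour):
    print('Hour: {:2d}   posts: {}   comments: {}'.format(
        hour, counts_by_hour[hour], comments_by_hour[hour]))
```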

In [6]:
#Create an empty list to hold the average comments per post for each hour of the day
avg_by_hour = []

#loop over the dictionaries to create the averages and append them to the empty list "avg_by_hour"
for key in comments_by_hour:
    avg = comments_by_hour[key] / counts_by_hour[key]
    avg_by_hour.append([key, avg])
    
#Print results nicely
for row in avg_by_hour:
    print('Hour: {}   Average Comments per Ask Post: {:.2f}'.format(row[0], row[1]))

Hour: 0   Average Comments per Ask Post: 8.13
Hour: 1   Average Comments per Ask Post: 11.38
Hour: 2   Average Comments per Ask Post: 23.81
Hour: 3   Average Comments per Ask Post: 7.80
Hour: 4   Average Comments per Ask Post: 7.17
Hour: 5   Average Comments per Ask Post: 10.09
Hour: 6   Average Comments per Ask Post: 9.02
Hour: 7   Average Comments per Ask Post: 7.85
Hour: 8   Average Comments per Ask Post: 10.25
Hour: 9   Average Comments per Ask Post: 5.58
Hour: 10   Average Comments per Ask Post: 13.44
Hour: 11   Average Comments per Ask Post: 11.05
Hour: 12   Average Comments per Ask Post: 9.41
Hour: 13   Average Comments per Ask Post: 14.74
Hour: 14   Average Comments per Ask Post: 13.23
Hour: 15   Average Comments per Ask Post: 38.59
Hour: 16   Average Comments per Ask Post: 16.80
Hour: 17   Average Comments per Ask Post: 11.46
Hour: 18   Average Comments per Ask Post: 13.20
Hour: 19   Average Comments per Ask Post: 10.80
Hour: 20   Average Comments per Ask Post: 21.52
Hour: 21 

In [7]:
#Rearranging the list to be able to sort by the average comments per post
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

#Print results nicely
print("Top 5 Hours for Ask Posts Comments")
for element in sorted_swap[:5]:
    hour_format = dt.datetime.strptime( str(element[1]),'%H')  #converting into a datetime object
    hour_format = hour_format.strftime('%H:%M')          #formatting the hour as HH:MM
    print('{}: {:.2f} average comments per ask post'.format(hour_format, element[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per ask post
02:00: 23.81 average comments per ask post
20:00: 21.52 average comments per ask post
16:00: 16.80 average comments per ask post
21:00: 16.01 average comments per ask post
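
Swapping the columns is one way to sort by the average. Another option, sketched here, is to pass a `key` function to `sorted` and leave the `[hour, average]` pairs as they are. The six pairs below are copied from the output above; the `{:02d}` format pads the hour to two digits without a round trip through `datetime`.

```python
# A few [hour, average] pairs taken from the output above
avg_by_hour = [[0, 8.13], [2, 23.81], [15, 38.59], [16, 16.80], [20, 21.52], [21, 16.01]]

# Sort on the second element (the average), highest first, and keep the top five
top = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

lines = ['{:02d}:00: {:.2f} average comments per ask post'.format(hour, avg)
         for hour, avg in top]
print('\n'.join(lines))
```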


Now that we have analyzed the posts by number of comments and hour of the day, I will extend the analysis to compare the "Ask HN" and "Show HN" posts by the number of points they achieved on the site (upvotes minus downvotes).

In [9]:
#Create empty lists to record the number of points and determine average points per type of post
total_ask_points = []
total_show_points = []
total_other_points = []

#Append number of points to the list
for row in ask_posts:
    points = int(row[3])
    total_ask_points.append(points)
    
for row in show_posts:
    points = int(row[3])
    total_show_points.append(points)
    
for row in other_posts:
    points = int(row[3])
    total_other_points.append(points)
    
#Create average variables    
avg_ask_points = sum(total_ask_points) / len(ask_posts)
avg_show_points = sum(total_show_points) / len(show_posts)
avg_other_points = sum(total_other_points) / len(other_posts)

#print results nicely
print('Average "Ask HN" Post Points: {:.2f}'.format(avg_ask_points), '\n'
     'Average "Show HN" Post Points: {:.2f}'.format(avg_show_points),'\n'
     'Average Other Posts Points: {:.2f}'.format(avg_other_points))

Average "Ask HN" Post Points: 15.06 
Average "Show HN" Post Points: 27.56 
Average Other Posts Points: 55.41


When comparing the number of points (upvotes minus downvotes) of the "Show HN" and "Ask HN" posts, we see that "Show HN" posts have a higher average (27.56 points per post against 15.06). Therefore we will continue our analysis of points exclusively on the "Show HN" posts.

In [11]:
#Create empty dictionaries to create frequency tables
counts_per_hour = {}
points_per_hour = {}

for row in show_posts:
    points = int(row[3])
    created_at = row[6]
    created_at = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    hour = created_at.hour
    if hour in counts_per_hour:
        counts_per_hour[hour] += 1
        points_per_hour[hour] += points
    else:
        counts_per_hour[hour] = 1
        points_per_hour[hour] = points
        
#Print results nicely        
for key, value in counts_per_hour.items():
    print('Hour: ', key, '   # of Posts -> ', value)

for key, value in points_per_hour.items():
    print('Hour: ', key, '   Total Points -> ', value)   


Hour:  0    # of Posts ->  31
Hour:  1    # of Posts ->  28
Hour:  2    # of Posts ->  30
Hour:  3    # of Posts ->  27
Hour:  4    # of Posts ->  26
Hour:  5    # of Posts ->  19
Hour:  6    # of Posts ->  16
Hour:  7    # of Posts ->  26
Hour:  8    # of Posts ->  34
Hour:  9    # of Posts ->  30
Hour:  10    # of Posts ->  36
Hour:  11    # of Posts ->  44
Hour:  12    # of Posts ->  61
Hour:  13    # of Posts ->  99
Hour:  14    # of Posts ->  86
Hour:  15    # of Posts ->  78
Hour:  16    # of Posts ->  93
Hour:  17    # of Posts ->  93
Hour:  18    # of Posts ->  61
Hour:  19    # of Posts ->  55
Hour:  20    # of Posts ->  60
Hour:  21    # of Posts ->  47
Hour:  22    # of Posts ->  46
Hour:  23    # of Posts ->  36
Hour:  0    Total Points ->  1173
Hour:  1    Total Points ->  700
Hour:  2    Total Points ->  340
Hour:  3    Total Points ->  679
Hour:  4    Total Points ->  386
Hour:  5    Total Points ->  104
Hour:  6    Total Points ->  375
Hour:  7    Total Points ->  494
H

In [14]:
points_results = []

#Organize info from dictionaries and create the average points per hour
for key in counts_per_hour:
    hour = key
    posts = counts_per_hour[key]
    points = points_per_hour[key]
    avg_points = points / posts
    points_results.append([avg_points, hour])
    
#Sort and print the top 5 hours with the highest average points per post
points_results = sorted(points_results, reverse = True)
print('Top 5 Hours for Show Posts Points')
for element in points_results[:5]:
    hour = element[1]
    avg_points = element[0]
    print('{}:00: {:.2f} average points per show post'.format(hour, avg_points))

Top 5 Hours for Show Posts Points
23:00: 42.39 average points per show post
12:00: 41.69 average points per show post
22:00: 40.35 average points per show post
0:00: 37.84 average points per show post
18:00: 36.31 average points per show post


Here we have the top five hours with the highest average points per post in the "Show HN" category. We can see that late at night (22:00 - 00:00) is a time when high-scoring posts are created.

Finally, we will analyze the day of the week on which posts receive the most points. My guess is that posts created during weekdays will receive more points than those created on weekends.

In [15]:
#Empty list that I will use to store our information
results_day = []

#Loop the original dataset hn and extract the information we want to analyze
for row in hn:
    created_at = row[6]
    created_at = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    created_at = created_at.strftime('%A')
    num_points = int(row[3])
    results_day.append([num_points, created_at])
    
#Preview list
print(results_day[:5])

[[386, 'Thursday'], [39, 'Tuesday'], [2, 'Thursday'], [3, 'Friday'], [8, 'Wednesday']]
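
`strftime('%A')` returns the day name in the current locale, so the strings can differ between machines. A locale-independent sketch, assuming a hypothetical `day_name` helper and a fixed list of English names, indexes the list with `datetime.weekday()` instead:

```python
import datetime as dt

DAY_NAMES = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

def day_name(created_at):
    """Map a 'm/d/YYYY H:MM' timestamp string to an English weekday name."""
    parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    return DAY_NAMES[parsed.weekday()]   # weekday(): Monday is 0, Sunday is 6

# The first two timestamps from the data preview
print(day_name('8/4/2016 11:52'))
print(day_name('1/26/2016 19:30'))
```

These two calls give the same day names as the first two entries of the preview list above.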


In [16]:
#Create dictionaries for the number of posts per day and the sum of points per day
count_by_day = {}
points_by_day = {}
for row in results_day:
    day_of_week = row[1]
    num_points = row[0]
    if day_of_week in count_by_day:
        count_by_day[day_of_week] += 1
        points_by_day[day_of_week] += num_points
    else:
        count_by_day[day_of_week] = 1
        points_by_day[day_of_week] = num_points

#Print results nicely        
for key, value in count_by_day.items():
    print('{}: {:,} total posts'.format(key, value))
    
for key, value in points_by_day.items():
    print('{}: {:,} total points'.format(key, value))


Monday: 3,084 total posts
Wednesday: 3,328 total posts
Sunday: 2,062 total posts
Saturday: 2,025 total posts
Friday: 2,985 total posts
Thursday: 3,318 total posts
Tuesday: 3,298 total posts
Monday: 155,037 total points
Wednesday: 173,150 total points
Sunday: 106,485 total points
Saturday: 98,739 total points
Friday: 141,102 total points
Thursday: 170,824 total points
Tuesday: 165,614 total points


In [20]:
avg_points_day = []

for key in points_by_day:
    avg_points = points_by_day[key] / count_by_day[key]   #total points divided by number of posts
    avg_points_day.append([avg_points, key])
    
#Sorting and printing in nice format
avg_points_day = sorted(avg_points_day, reverse = True)

for element in avg_points_day:
    day = element[1]
    avg_points = element[0]
    print('{}: {:.2f} average points per post'.format(day, avg_points))

Wednesday: 52.03 average points per post
Sunday: 51.64 average points per post
Thursday: 51.48 average points per post
Monday: 50.27 average points per post
Tuesday: 50.22 average points per post
Saturday: 48.76 average points per post
Friday: 47.27 average points per post


In this last table we can see that my hypothesis only partly holds. Wednesday has the highest average (52.03 points per post), and Friday and Saturday have the lowest (47.27 and 48.76), but Sunday comes in second (51.64), ahead of every weekday except Wednesday. The spread is also small: all seven days fall between roughly 47 and 52 points per post.

### **Conclusion**
In this project we explored a data set containing information about posts on the website Hacker News. We analyzed typical posts like "Show HN" and "Ask HN".