# Hacker News Site Analysis

This is a guided project for the Data Analyst in Python certification at Dataquest.

In this project I analyze a data set of posts from the Hacker News site.

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of its listings can get hundreds of thousands of visitors as a result.

Many posts on this site are categorized either as "ask" posts, which ask the community something, or as "show" posts, which show something to the community.

I'll compare these two types of posts to determine the following:

* Do "ask" posts or "show" posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?




## Reading the CSV file

In [8]:
from csv import reader

opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
opened_file.close()
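
The same read step can also be written with a `with` block, which closes the file automatically even if reading fails. The sketch below is an alternative, not the cell used in this notebook: the helper name `read_csv_rows` and the explicit `encoding="utf-8"` are assumptions, and the demo writes a tiny temporary sample file so that it runs without `hacker_news.csv`.

```python
import csv
import os
import tempfile


def read_csv_rows(path, encoding="utf-8"):
    """Return every row of a CSV file as a list of lists of strings."""
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, newline="", encoding=encoding) as f:
        return list(csv.reader(f))


# Demo on a tiny temporary file (sample rows for illustration only).
sample = "id,title,num_comments\n1,Ask HN: Example question?,3\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="", encoding="utf-8") as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

rows = read_csv_rows(tmp_path)
os.remove(tmp_path)
print(rows)
```

With the real data set, the call would be `hn = read_csv_rows('hacker_news.csv')`, assuming the file is UTF-8 encoded.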

## Printing the first rows of the data set

In [9]:
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


## Removing the headers from the data set

In [10]:
headers = hn[0]
hn = hn[1:]

print(headers)
print('\n')
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Filtering rows

Let's keep only the rows that are "ask" or "show" posts to the Hacker News community.

We can tell them apart because their titles begin with either "Ask HN" or "Show HN". The titles are lowercased before the comparison, so the check is case-insensitive.

In [11]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
        
print('The number of Ask posts is {:,}'.format(len(ask_posts)))
print('The number of Show posts is {:,}'.format(len(show_posts)))
print('The number of Other posts is {:,}'.format(len(other_posts)))

The number of Ask posts is 1,744
The number of Show posts is 1,162
The number of Other posts is 17,194
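
The prefix test used in the loop can be tried on a few titles in isolation. This is a small sketch: `classify_title` is a hypothetical helper name, and the sample titles (apart from one taken from the rows printed earlier) are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # makes the prefix test case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


samples = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: A tiny CSV viewer",
    "Interactive Dynamic Video",
    "Why I ask HN for advice",  # 'ask hn' appears, but not at the start
]
labels = [classify_title(t) for t in samples]
print(labels)
```

Only the start of the title matters, so the last sample falls into the "other" group even though it contains the words "ask HN".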


## Finding the Average Number of Comments for Ask and Show Posts

In the last step we got two different lists, one for the ask posts and another for the show posts.

Now let's calculate the average number of comments per post for each list.

In [12]:
total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])

avg_ask_comments = total_ask_comments / len(ask_posts)

print('The average number of comments on ask posts is {:,.2f}'.format(avg_ask_comments))

The average number of comments on ask posts is 14.04


In [13]:
total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])

avg_show_comments = total_show_comments / len(show_posts)

print('The average number of comments on show posts is {:,.2f}'.format(avg_show_comments))

The average number of comments on show posts is 10.32


Based on this data, ask posts receive more comments on average than show posts (about 14.04 versus 10.32 comments per post).
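
Because the two cells above repeat the same loop, the calculation could be wrapped in one function. This is a minimal sketch, assuming each row keeps `num_comments` at index 4 as in the header row; `average_comments` is a hypothetical helper and the rows are toy data, not values from the data set.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments over a list of rows; 0.0 for an empty list."""
    if not posts:
        return 0.0  # avoid dividing by zero
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Toy rows in the same column order as the data set (values are illustrative).
toy_posts = [
    ["1", "Ask HN: A?", "", "5", "10", "alice", "8/4/2016 11:52"],
    ["2", "Ask HN: B?", "", "2", "4", "bob", "1/26/2016 19:30"],
]
print("{:,.2f}".format(average_comments(toy_posts)))
```

In the notebook this would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.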

## Finding a relation between ask post creation time and the number of comments received

We want to find out whether ask posts created at a certain time of day receive more comments on average. If they do, we could choose when to post in order to increase the number of comments on our own posts.

First, we count the number of ask posts created at each hour of the day.

Then, we calculate the average number of comments received by the ask posts created at each hour.

In [24]:
import datetime as dt

result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])
    
posts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in posts_by_hour:
        comments_by_hour[time] += comment
        posts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        posts_by_hour[time] = 1

print(comments_by_hour)

print(posts_by_hour)

{'23': 543, '01': 683, '11': 641, '02': 1381, '22': 479, '05': 464, '20': 1722, '16': 1814, '17': 1146, '19': 1188, '03': 421, '09': 251, '10': 793, '08': 492, '07': 267, '18': 1439, '00': 447, '21': 1745, '04': 337, '13': 1253, '12': 687, '06': 397, '15': 4477, '14': 1416}
{'23': 68, '01': 60, '11': 58, '02': 58, '22': 71, '05': 46, '20': 80, '16': 108, '17': 100, '19': 110, '03': 54, '09': 45, '10': 59, '08': 48, '07': 34, '18': 109, '00': 55, '21': 109, '04': 47, '13': 85, '12': 73, '06': 44, '15': 116, '14': 107}
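
To make the parsing step concrete, the sketch below runs `strptime` on one `created_at` value taken from the first rows printed earlier and extracts the hour. It also shows `collections.defaultdict` as one possible way to avoid the `if`/`else` branch. This is an alternative under the same date format, not the method used above, and the three sample rows are made up.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# One timestamp from the data set: August 4, 2016 at 11:52.
parsed = dt.datetime.strptime("8/4/2016 11:52", date_format)
hour = parsed.strftime("%H")  # zero-padded hour as a string
print(hour)

# defaultdict(int) starts every missing key at 0, so no branch is needed.
sample_posts_by_hour = defaultdict(int)
sample_comments_by_hour = defaultdict(int)

# [created_at, num_comments] pairs; illustrative values only.
sample_rows = [["8/4/2016 11:52", 6], ["9/1/2016 11:05", 2], ["9/1/2016 0:01", 1]]
for created_at, n_comments in sample_rows:
    h = dt.datetime.strptime(created_at, date_format).strftime("%H")
    sample_posts_by_hour[h] += 1
    sample_comments_by_hour[h] += n_comments

print(dict(sample_posts_by_hour))
print(dict(sample_comments_by_hour))
```

Using `strftime("%H")` keeps the hour as a two-character string such as `'00'`, which is why the dictionary keys above are strings rather than integers.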


## Finding the average number of comments per post by hour

In [36]:
avg_by_hour = []
for hour in posts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / posts_by_hour[hour]])
    
sorted(avg_by_hour)

[['00', 8.127272727272727],
 ['01', 11.383333333333333],
 ['02', 23.810344827586206],
 ['03', 7.796296296296297],
 ['04', 7.170212765957447],
 ['05', 10.08695652173913],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['08', 10.25],
 ['09', 5.5777777777777775],
 ['10', 13.440677966101696],
 ['11', 11.051724137931034],
 ['12', 9.41095890410959],
 ['13', 14.741176470588234],
 ['14', 13.233644859813085],
 ['15', 38.5948275862069],
 ['16', 16.796296296296298],
 ['17', 11.46],
 ['18', 13.20183486238532],
 ['19', 10.8],
 ['20', 21.525],
 ['21', 16.009174311926607],
 ['22', 6.746478873239437],
 ['23', 7.985294117647059]]

Let's do one final step and swap the two columns of the list, so that we can sort it by average number of comments rather than by hour.

In [45]:
inverted_list = []

# Swap each [hour, average] pair so that sorting orders by average.
for hour, avg in avg_by_hour:
    inverted_list.append([avg, hour])

order_avg_hour = sorted(inverted_list, reverse=True)

print('Top 5 hours for "Ask" post comments')

for avg, hr in order_avg_hour[:5]:
    print('Posts created at {}:00 received {:.2f} comments on average'.format(hr, avg))

Top 5 hours for "Ask" post comments
Posts created at 15:00 received 38.59 comments on average
Posts created at 02:00 received 23.81 comments on average
Posts created at 20:00 received 21.52 comments on average
Posts created at 16:00 received 16.80 comments on average
Posts created at 21:00 received 16.01 comments on average
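
Swapping the columns is one way to sort by average. Another option is to leave the `[hour, average]` pairs as they are and pass a `key` function to `sorted`. A short sketch using three of the averages computed above, rounded to two decimals:

```python
# Three [hour, average] pairs taken from the results above (rounded).
avg_sample = [["02", 23.81], ["09", 5.58], ["15", 38.59]]

# Sort by the second element (the average), largest first.
by_avg = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hr, avg in by_avg:
    print("{}:00 -> {:.2f} comments per post on average".format(hr, avg))
```

Both approaches give the same ordering; the `key` version simply avoids building a second list.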


The hours with the highest average number of comments per ask post are 15, 02 and 20. Posts created during hour 15 received about 62% more comments on average than posts created during the second-best hour, 02 (38.59 versus 23.81).
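
The gap between the top two hours can be checked directly from the totals printed earlier (4,477 comments on 116 posts at hour 15, and 1,381 comments on 58 posts at hour 02). A small sketch; `percent_more` is a hypothetical helper name.

```python
def percent_more(a, b):
    """How much larger a is than b, as a percentage of b."""
    return (a / b - 1) * 100


avg_15 = 4477 / 116  # average comments per ask post created at hour 15
avg_02 = 1381 / 58   # average comments per ask post created at hour 02

print("{:.1f}% more comments at 15 than at 02".format(percent_more(avg_15, avg_02)))
```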

## Conclusion

In this guided project I analyzed the "Ask" and "Show" posts to determine which category of post receives more comments on average. I also analyzed the relation between the number of comments an "Ask" post received and the hour it was created.
Based on the results, the recommendation to maximize the number of comments on a post is to create an "Ask" post between 15:00 and 16:00.