# Exploring Hacker News Posts
### In this project we are going to work with strings, object-oriented programming, and dates and times.
We are going to analyse two types of posts from Hacker News: Ask HN and Show HN.

Users submit posts that begin with 'Ask HN' to ask the Hacker News community specific questions. Here are some examples:

`
Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Any recent changes to CSS that broke mobile?`

Users can also submit Show HN posts to show the Hacker News community a project, a product, or something interesting they have found. Here are a few examples:

`
Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm`

We are going to compare these two types of posts to determine the following:
* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?


Let's start by importing the library we need to read the data set into a list of lists.

In [13]:
#first we're going to read the file hacker_news.csv and save it as a list of lists
from csv import reader

opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)

#here we are going to display the first 5 rows to see if everything looks okay
print(hn[:5])



[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
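
Leaving the file open is harmless in a short notebook, but a `with` block closes it automatically once the rows are read. The following is a minimal sketch, assuming a small in-memory sample in place of the real `hacker_news.csv`; the names `sample_csv` and `rows` are illustrative and not part of the project.

```python
import io
from csv import reader

# Hypothetical two-line sample standing in for hacker_news.csv
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

# With the real file this line would be: with open('hacker_news.csv') as f:
with io.StringIO(sample_csv) as f:
    rows = list(reader(f))  # each row becomes a list of strings

print(rows[0])  # header row
print(rows[1])  # first data row
```

Every value comes back as a string, which is why the comment counts are converted with `int()` later on.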


## Removing Headers from a List of Lists
Above we created a list of lists and stored it in the variable hn. Looking at the list, we can see that the first row contains the headers for each column. We are going to remove the header row and store it in its own variable called headers. After that, hn should contain only data rows.

In [14]:
#extract headers from hn and save to new list called headers
headers = hn[0]
print(headers)

#remove headers from list hn. We will print the first 5 rows to see if it worked
hn = hn[1:]
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Extracting Ask HN and Show HN posts
Now that we've removed the headers from hn, we're ready to filter our data. Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.

To find the posts that begin with either Ask HN or Show HN, we'll use the string method startswith. Given a string object, say, string1, we can check whether it starts with, say, dq by inspecting the output of string1.startswith('dq'). If string1 starts with dq, the call returns True; otherwise, it returns False.
`
print('dataquest'.startswith('Data'))
print('dataquest'.startswith('data'))`

`
False
True
`

In the example above, the first print call gives us False because dataquest does not start with Data. The second print call prints True because dataquest does start with data. Capitalization matters.

If we wish to control for case, we can use the lower method, which returns a lowercase version of the starting string. Here's an example:

`print('DataQuest'.lower())`

`dataquest`

Let's use these methods to separate posts beginning with Ask HN and Show HN (and case variations) into two different lists.

In [15]:
#first we create some empty lists
ask_posts = [] 
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'): #check if title starts with ask hn
        ask_posts.append(row) #if title starts with ask hn we add the entire row to the ask_posts list
    elif title.lower().startswith('show hn'): #check if title starts with show hn
        show_posts.append(row) #if title starts with show hn we add the entire row to the show_posts list
    else:
        other_posts.append(row) #if title doesn't start with ask hn or show hn we add the entire row to the other_posts list
        

#now we are going to check how many posts were in each type
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
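
The same prefix check can be wrapped in a small helper so the rule is easy to test on individual titles. This is only a sketch: the function name `post_type` is hypothetical and does not appear in the cells above.

```python
def post_type(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

print(post_type('Ask HN: How to improve my personal website?'))
print(post_type('SHOW HN: Something pointless I made'))
print(post_type('Interactive Dynamic Video'))
```

Lowercasing the title once before both checks keeps the comparison case-insensitive without calling `lower()` twice.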


## Calculating the average number of comments for Ask HN and Show HN posts
Above we separated "ask posts" and "show posts" into two lists of lists. Now we want to determine which posts receive more comments on average.

In [16]:
total_ask_comments = 0 

for row in ask_posts: #loop over each row in ask_posts list of lists
    num_comments = int(row[4]) #set num_comments equal to the 5th column of ask posts and convert to int type
    total_ask_comments += num_comments #add num_comments to the total_ask_comments variable

average_ask_comments = total_ask_comments / len(ask_posts) #calculate average # of comments on ask post
print(average_ask_comments) #average number of comments per ask hn post

total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

14.038417431192661
10.31669535283993


Based on the calculations above, ask posts receive about 14 comments per post on average and show posts receive about 10.3. This means that ask posts receive roughly 36% more comments per post than show posts.
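
The relative difference between the two averages can be checked directly. A short sketch, reusing the two values printed above; the name `relative_difference` is introduced here for illustration.

```python
average_ask_comments = 14.038417431192661  # value printed above
avg_show_comments = 10.31669535283993      # value printed above

# How much larger the ask average is, expressed as a fraction of the show average
relative_difference = (average_ask_comments - avg_show_comments) / avg_show_comments

print("{:.1%}".format(relative_difference))
```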

## Finding the Number of Ask Posts and Comments by Hour Created
Next we are going to determine whether ask posts created at a certain time of day tend to receive more comments than those created at other times. We'll do this analysis in two steps:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.
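
Both steps depend on pulling the hour out of the `created_at` string. The sketch below parses a single timestamp taken from the first data row shown earlier; `sample_created_at` is an illustrative name.

```python
import datetime as dt

sample_created_at = '8/4/2016 11:52'  # created_at value from the first data row

# Parse month/day/year hour:minute, then format just the hour
parsed = dt.datetime.strptime(sample_created_at, '%m/%d/%Y %H:%M')
hour = parsed.strftime('%H')  # zero-padded hour as a string

print(parsed)
print(hour)
```

Because `strftime('%H')` returns a zero-padded string, the hours make consistent dictionary keys such as '09' and '15'.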



In [17]:
import datetime as dt

result_list = []

for row in ask_posts: #iterate over each row in ask_posts
    created_at = row[6] 
    num_comments = int(row[4])
    result_list.append([created_at, num_comments]) #append the time created_at and number of comments to result_list
 
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = row[0]
    comment_num = int(row[1])
    date_dt = dt.datetime.strptime(date, '%m/%d/%Y %H:%M')
    hour = date_dt.strftime('%H')
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment_num
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment_num
    
print('The number of ask_posts in each hour:\n', counts_by_hour, '\n')  
print('The number of comments in each hour:\n', comments_by_hour)

The number of ask_posts in each hour:
 {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58} 

The number of comments in each hour:
 {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
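
As an alternative to the explicit `if hour not in ...` check, `collections.defaultdict` can start each hour at zero automatically. This is a sketch of that variant, not the approach used in the cell above, and the `[created_at, num_comments]` pairs are made up for illustration.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs in the same shape as result_list
sample_results = [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 9:14', 1]]

counts = defaultdict(int)    # posts per hour, missing keys start at 0
comments = defaultdict(int)  # comments per hour, missing keys start at 0

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```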


### Next we are going to calculate the average number of comments per Ask HN post by hour.

In [18]:
avg_by_hour = []

for hour in comments_by_hour:
    average = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, average])
    
print('Average number of comments per post by hour:')
avg_by_hour
    

Average number of comments per post by hour:


[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

## Sorting and Printing Values from a List of Lists

In [20]:
swap_avg_by_hour = []

for row in avg_by_hour:
    hour = row[0]
    avg = row[1]
    swap_avg_by_hour.append([avg,hour])
    
for row in swap_avg_by_hour:
    print(row)
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap

print("\nTop 5 Hours for Ask Posts Comments")
for avg, hour in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post".format(dt.datetime.strptime(hour, "%H").strftime("%H:%M"),avg))
    

[5.5777777777777775, '09']
[14.741176470588234, '13']
[13.440677966101696, '10']
[13.233644859813085, '14']
[16.796296296296298, '16']
[7.985294117647059, '23']
[9.41095890410959, '12']
[11.46, '17']
[38.5948275862069, '15']
[16.009174311926607, '21']
[21.525, '20']
[23.810344827586206, '02']
[13.20183486238532, '18']
[7.796296296296297, '03']
[10.08695652173913, '05']
[10.8, '19']
[11.383333333333333, '01']
[6.746478873239437, '22']
[10.25, '08']
[7.170212765957447, '04']
[8.127272727272727, '00']
[9.022727272727273, '06']
[7.852941176470588, '07']
[11.051724137931034, '11']

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
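
Swapping the columns is one way to sort by the average. Another is to pass a `key` function to `sorted`, which avoids building a second list. A minimal sketch, using three entries copied from `avg_by_hour` as a sample:

```python
# Three [hour, average] entries copied from avg_by_hour, used as a small sample
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

# Sort on the average (index 1), largest first, without a swapped copy
sorted_by_avg = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)

for hour, avg in sorted_by_avg:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```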


Ask posts created during the 15:00 hour receive the most comments on average, at 38.59 comments per post.