# Introduction

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts). Its columns are described below:

*id*: The unique identifier from Hacker News for the post

*title*: The title of the post

*url*: The URL that the post links to, if the post has a URL

*num_points*: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes

*num_comments*: The number of comments that were made on the post

*author*: The username of the person who submitted the post

*created_at*: The date and time at which the post was submitted

Here are the first few rows of the data set:

|id   |title|url|num_points|num_comments|author|created_at|
|:---:|:---:|:---:|----------|------------|:----:|----------|
|12224879|	Interactive Dynamic Video|	http://www.interactivedynamicvideo.com/|	386|	52|	ne0phyte|	8/4/2016 11:52|
|10975351|	How to Use Open Source and Shut the F*ck Up at the Same Time|	http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/|	39|	10|	josep2|	1/26/2016 19:30|
|11964716|	Florida DJs May Face Felony for April Fools' Water Joke|	http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/|	2|	1|	vezycash|	6/23/2016 22:20|
|11919867|	Technology ventures: From Idea to Enterprise|	https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429|	3|	1|	hswarna|	6/17/2016 0:01|
|10301696|	Note by Note: The Making of Steinway L1037 (2007)|	http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0|	8|	2|	walterbell|	9/30/2015 4:12|

We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Below are a few examples:

>Ask HN: How to improve my personal website?
>Ask HN: Am I the only one outraged by Twitter shutting down share counts?
>Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit `Show HN` posts to show the Hacker News community a project, a product, or just something interesting. Below are a few examples:

>Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
>Show HN: Something pointless I made
>Show HN: Shanhu.io, a programming playground powered by e8vm

We'll compare these two types of posts to determine the following:

* Do `Ask HN` or `Show HN` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?



# Part 1
## Instructions
1. Start by adding a title and writing a paragraph in a markdown cell introducing the project and the data set. The title and the introduction are tentative at this point, so don't spend too much time here — you can come back at the end of your work to refine them.

2. Read the `hacker_news.csv` file in as a list of lists.
    * Assign the result to the variable `hn`.

3. Display the first five rows of `hn`.


In [1]:
from csv import reader

# Read the whole file into a list of lists; the with block closes the file afterwards.
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))
print(hn[0:5])


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
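
The reading step can also be wrapped in a small helper. The sketch below is only an illustration: `read_rows` and `sample_csv` are hypothetical names, and the in-memory sample (one real row from the table above plus one made-up row) stands in for `hacker_news.csv` so the code runs on its own.

```python
import csv
import io

# In-memory stand-in for hacker_news.csv (assumption: same column layout).
# The second data row is made up to show how quoted commas and empty URLs parse.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
    "1,\"Title, with a comma\",,39,10,someone,1/26/2016 19:30\n"
)


def read_rows(file_obj):
    """Return every row of an open CSV file as a list of lists of strings."""
    return list(csv.reader(file_obj))


sample_hn = read_rows(sample_csv)
print(sample_hn[0])    # the header row
print(len(sample_hn))  # number of rows, header included

# With the real file, the same helper could be used like this:
# with open('hacker_news.csv', newline='') as f:
#     hn = read_rows(f)
```

Every value comes back as a string, including `num_points` and `num_comments`, which is why the later steps convert with `int()`.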


# Removing Headers from a List of Lists
## Instructions
1. Extract the first row of data, and assign it to the variable headers.
2. Remove the first row from hn.
3. Display headers.
4. Display the first five rows of hn to verify that you removed the header row properly.


In [2]:
headers = hn[0]
print(headers)
hn = hn[1:]
print(hn[0:4])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
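
As an alternative to indexing and slicing, the header can be separated with unpacking. This is a minimal sketch on a made-up three-column sample (`sample`, `sample_headers`, and `sample_rows` are hypothetical names), not the project's required approach.

```python
# Made-up miniature data set with the header in the first row.
sample = [
    ['id', 'title', 'num_comments'],
    ['1', 'first post', '3'],
    ['2', 'second post', '0'],
]

# Star-unpacking puts the first row in one name and the rest in a list.
sample_headers, *sample_rows = sample
print(sample_headers)
print(len(sample_rows))

# Pairing the header with a row makes the columns readable by name.
first_as_dict = dict(zip(sample_headers, sample_rows[0]))
print(first_as_dict['title'])
```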


# Extracting Ask HN and Show HN Posts
## Instructions

1. Create three empty lists called ask_posts, show_posts, and other_posts.

2. Loop through each row in hn.
    * Assign the title in each row to a variable named title.
        * Because the title column is the second column, you'll need to get the element at index 1 in each row.

3. Implement the following steps:
    * If the lowercase version of title starts with ask hn, append the row to ask_posts.
    * Else if the lowercase version of title starts with show hn, append the row to show_posts.
    * Else append to other_posts.

4. Check the number of posts in ask_posts, show_posts, and other_posts.


In [3]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
print(ask_posts[0:4])
print(show_posts[0:4])


1744
1162
17194
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']]
[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'h
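
The prefix test used above can be isolated in a function, which makes it easy to check on a handful of titles. In this sketch `classify` and `buckets` are hypothetical names, and the upper-case title is invented to show that lowercasing makes the match case-insensitive.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' depending on how the title starts."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
    'ASK HN: an invented upper-case title',  # made up for illustration
]

# Group the titles by category.
buckets = {'ask': [], 'show': [], 'other': []}
for t in titles:
    buckets[classify(t)].append(t)

category_counts = {name: len(items) for name, items in buckets.items()}
print(category_counts)
```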

# Calculating the Average Number of Comments for Ask HN and Show HN Posts
## Instructions
1. Find the total number of comments in ask posts and assign it to total_ask_comments.
    * Set total_ask_comments to 0.
2. Use a for loop to iterate over the ask posts.
    * Because the num_comments column is the fifth column in ask_posts, you'll need to get the element at index 4 in each row.
        * You'll also need to convert the value to an integer so that we can calculate the sum of all the comments.
        * Add this value to total_ask_comments.
3. Compute the average number of comments on ask posts and assign it to avg_ask_comments.
4. Print avg_ask_comments.
5. Find the total number of comments in show posts and assign it to total_show_comments.
    * Set total_show_comments to 0.
6. Use a for loop to iterate over the show posts.
    * Because the num_comments column is the fifth column in show_posts, you'll need to get the element at index 4 in each row.
        * You'll also need to convert the value to an integer so that we can calculate the sum of all the comments.
        * Add this value to total_show_comments.
7. Compute the average number of comments on show posts and assign it to avg_show_comments.
8. Print avg_show_comments.
9. Do show posts or ask posts receive more comments on average? Write a markdown cell explaining your findings.



In [4]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments = total_ask_comments + num_comments
avg_ask_comments = total_ask_comments/len(ask_posts)
print(avg_ask_comments)


14.038417431192661


In [5]:
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments = total_show_comments + num_comments
avg_show_comments = total_show_comments/len(show_posts)
print(avg_show_comments)


10.31669535283993


Ask posts receive more comments on average (around 14) than show posts (approximately 10).
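
Since the two averaging cells differ only in the list they loop over, the calculation can be sketched once as a function. `average_comments` is a hypothetical helper, and the sample rows are the first few ask and show posts printed earlier, so the numbers below are not the averages for the full data set.

```python
def average_comments(posts, comments_index=4):
    """Average of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# First four ask posts and first three show posts from the output above.
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
    ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'],
]
sample_show = [
    ['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'],
    ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'],
    ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'],
]

print(average_comments(sample_ask))   # (6 + 29 + 1 + 3) / 4
print(average_comments(sample_show))  # (22 + 102 + 1) / 3
```

The single post with 102 comments dominates the small show sample, a reminder that a mean over few posts is sensitive to outliers.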

# Finding the number of ask posts and comments by hour created
## Instructions
1. Import the datetime module as dt.
2. Create an empty list and assign it to result_list. This will be a list of lists.
3. Iterate over ask_posts and append to result_list a list with two elements:
    * The first element shall be the column created_at.
        * Because the created_at column is the seventh column in ask_posts, you'll need to get the element at index 6 in each row.
    * The second element shall be the number of comments of the post.
        * You'll also need to convert the value to an integer.
4. Create two empty dictionaries called counts_by_hour and comments_by_hour.
5. Loop through each row of result_list.
    * Extract the hour from the date, which is the first element of the row.
        * Use the datetime.strptime() method to parse the date and create a datetime object.
        * Use the string we want to parse as the first argument and a string that specifies the format as the second argument.
        * Use the datetime.strftime() method to select just the hour from the datetime object.
    * If the hour isn't a key in counts_by_hour:
        * Create the key in counts_by_hour and set it equal to 1.
        * Create the key in comments_by_hour and set it equal to the comment number.
    * If the hour is already a key in counts_by_hour:
        * Increment the value in counts_by_hour by 1.
        * Increment the value in comments_by_hour by the comment number.

In [26]:
import datetime as dt

# Pair each post's creation timestamp with its comment count.
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {}    # hour -> number of ask posts created during that hour
comments_by_hour = {}  # hour -> total comments on those posts

for created_at, num_comments in result_list:
    # .hour returns the hour as an int (0-23), so the dictionary keys are ints.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
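
The instructions ask for `strftime()` to pull out the hour, whereas the cell above reads the `.hour` attribute. The sketch below follows the `strftime("%H")` route, which yields zero-padded string keys such as `'09'`, and uses `dict.get` to avoid the explicit if/else. The names `sample_results`, `sample_counts`, and `sample_comments` are hypothetical, and the last sample row is made up.

```python
import datetime as dt

DATE_FORMAT = "%m/%d/%Y %H:%M"

# [created_at, num_comments] pairs; the first four come from the ask posts
# printed earlier, the last one is invented so that one hour appears twice.
sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['1/1/2016 9:05', 4],
]

sample_counts = {}
sample_comments = {}
for created_at, num_comments in sample_results:
    # strptime parses the text into a datetime; strftime("%H") formats the hour.
    hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
    # dict.get returns 0 the first time an hour is seen.
    sample_counts[hour] = sample_counts.get(hour, 0) + 1
    sample_comments[hour] = sample_comments.get(hour, 0) + num_comments

print(sample_counts)
print(sample_comments)
```

Either key type works, as long as the same type is used when the two dictionaries are read back later.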
                                 
    


# Calculating the average number of comments of Ask HN posts by hour
## Instructions
1. Use the example above to calculate the average number of comments per post for posts created during each hour of the day.
2. The result should be a list of lists in which the first element is the hour and the second element is the average number of comments per post. Assign the result to a variable named avg_by_hour. Display the results.

In [28]:
avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([hour,comments_by_hour[hour]/counts_by_hour[hour]])
print(avg_by_hour)

[[0, 8.127272727272727], [1, 11.383333333333333], [2, 23.810344827586206], [3, 7.796296296296297], [4, 7.170212765957447], [5, 10.08695652173913], [6, 9.022727272727273], [7, 7.852941176470588], [8, 10.25], [9, 5.5777777777777775], [10, 13.440677966101696], [11, 11.051724137931034], [12, 9.41095890410959], [13, 14.741176470588234], [14, 13.233644859813085], [15, 38.5948275862069], [16, 16.796296296296298], [17, 11.46], [18, 13.20183486238532], [19, 10.8], [20, 21.525], [21, 16.009174311926607], [22, 6.746478873239437], [23, 7.985294117647059]]
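
The same division can be written as a list comprehension. This is a sketch on toy dictionaries (`toy_counts` and `toy_comments` are made-up values, not figures from the data set); iterating over `sorted(...)` is an optional extra that guarantees the rows come out ordered by hour regardless of insertion order.

```python
# Toy inputs: hour -> number of posts, hour -> total comments.
toy_counts = {13: 1, 9: 2}
toy_comments = {13: 29, 9: 10}

# One [hour, average] pair per hour, ordered by hour.
toy_avg_by_hour = [
    [hour, toy_comments[hour] / toy_counts[hour]]
    for hour in sorted(toy_counts)
]
print(toy_avg_by_hour)
```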


# Sorting and printing values from a list of lists
## Instructions
1. Create a list that equals avg_by_hour with swapped columns.
    * Create an empty list and assign it to swap_avg_by_hour.
    * Iterate over the rows of avg_by_hour and append to swap_avg_by_hour a list whose first element is the second element of the row, and whose second element is the first element of the row.
2. Print swap_avg_by_hour.
3. Use the sorted() function to sort swap_avg_by_hour in descending order. Since the first column of this list is the average number of comments, sorting the list will sort by the average number of comments.
    * Set the reverse argument to True, so that the highest value in the first column appears first in the list.
    * Assign the result to sorted_swap.
4. Print the string "Top 5 Hours for Ask Posts Comments".
5. Loop through each average and each hour (in this order) in the first five lists of sorted_swap.
6. Use the str.format() method to print the hour and average in the following format: 15:00: 38.59 average comments per post.
    * To format the hours, use the datetime.strptime() constructor to return a datetime object and then use the strftime() method to specify the format of the time.
    * To format the average, you can use {:.2f} to indicate that just two decimal places should be used.
7. Which hours should you create a post during to have a higher chance of receiving comments? Refer back to the documentation for the data set to convert the times to the time zone you live in. Write a markdown cell explaining your findings.

In [49]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
#print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")
for avg, hour in sorted_swap[0:5]:
    # Parse the integer hour so that strftime can render it as HH:MM.
    hour_label = dt.datetime.strptime(str(hour), "%H").strftime("%H:%M")
    # {:.2f} rounds the average to two decimal places.
    print("{}: {:.2f} average comments per post".format(hour_label, avg))


Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
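
Swapping the columns is what the exercise asks for, but `sorted()` can also order the original pairs directly through its `key` argument. The sketch below assumes a subset of the `[hour, average]` pairs shown earlier (`some_avg_by_hour`, `top_three`, and `lines` are hypothetical names) and builds the hour label with `datetime.time` instead of `strptime`.

```python
import datetime as dt

# A few [hour, average] pairs copied from the avg_by_hour output above.
some_avg_by_hour = [
    [2, 23.810344827586206],
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [20, 21.525],
]

# Sort on the second element (the average), highest first, without swapping.
top_three = sorted(some_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:3]

lines = []
for hour, avg in top_three:
    label = dt.time(hour=hour).strftime("%H:%M")  # e.g. 15 -> "15:00"
    lines.append("{}: {:.2f} average comments per post".format(label, avg))

print("\n".join(lines))
```

The `.2f` in the format string controls the number of decimals; writing `2f` without the dot would only set a minimum field width and keep the default six decimal places.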


The times in the data set are expressed in US Eastern time. Right now, here in Colombia, we are one hour behind. However, during part of the year (when the US is not on daylight saving time) both share the same clock time. In any case, to target the 15:00 slot, for example, I should try to post at around 14:00 local time.
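
The one-hour difference can be checked with fixed UTC offsets. This is a rough sketch that assumes, as stated above, that the data set's timestamps are US Eastern; the offsets are written out by hand (UTC-4 for Eastern daylight time, UTC-5 for Eastern standard time and for Colombia), and `to_colombia` is a hypothetical helper.

```python
import datetime as dt

# Hand-written fixed offsets (assumptions, not read from the data set).
EDT = dt.timezone(dt.timedelta(hours=-4), "EDT")  # US Eastern, daylight saving time
EST = dt.timezone(dt.timedelta(hours=-5), "EST")  # US Eastern, standard time
COT = dt.timezone(dt.timedelta(hours=-5), "COT")  # Colombia, same offset all year


def to_colombia(naive_eastern, eastern_tz):
    """Attach the given Eastern offset to a naive datetime and convert to Colombia time."""
    return naive_eastern.replace(tzinfo=eastern_tz).astimezone(COT)


summer = to_colombia(dt.datetime(2016, 8, 4, 15, 0), EDT)
winter = to_colombia(dt.datetime(2016, 1, 26, 15, 0), EST)
print(summer.strftime("%H:%M"))  # 15:00 EDT in Colombia time
print(winter.strftime("%H:%M"))  # 15:00 EST in Colombia time
```

On Python 3.9 or later, the standard-library `zoneinfo` module can choose the correct Eastern offset for each date automatically, provided time zone data is available on the system.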