## Exploring 'Hacker News' Posts

Hacker News is a website where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

We will be looking specifically at 'Ask HN' and 'Show HN' posts. 'Ask' posts are submitted to ask a question of the community, while 'Show' posts are meant to share something that others might find interesting, such as a project or product.



We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need and reading the data set into a list of lists. We'll create a separate variable for the header row.

In [13]:
from csv import reader

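# Open the file in a with block so it is closed once the rows are read
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

headers = hn[0]  # header row
hn = hn[1:]      # data rows only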

Now let's look at the first five entries in the data set, using a function:

In [14]:
def explore_data(dataset, start, end, rows_columns=False):
    data_slice = dataset[start:end]
    for row in data_slice:
        print(row)
        print('\n')
    if rows_columns:
        print(f'Number of rows: {len(dataset)}')
        print(f'Number of columns: {len(dataset[0])}')

# explore_data only prints and returns None, so there is nothing to assign
explore_data(hn, 0, 5)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




With the header row out of the way, we can start filtering the data. Since we're only interested in the Ask HN and Show HN posts, we'll separate them from the rest and place each type into its own list of lists.

In [15]:
# initialize the new lists
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]  # pull the title from the second column
    title = title.lower()  # change title to all lower-case, so it can be filtered into the correct list
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# let's check the length of each list.
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


Next, we'll see which type of post has the most comments on average. We'll loop over each list and add the number of comments for each row to the total, then divide by the number of rows.

In [18]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])  # the comment count is in the fifth column
    total_ask_comments += num_comments
# compute the average once, after the loop has summed every row
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of comments on Ask HN posts: ", avg_ask_comments)

total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print("Average number of comments on Show HN posts: ", avg_show_comments)

Average number of comments on Ask HN posts:  14.038417431192661
Average number of comments on Show HN posts:  10.31669535283993
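The two loops above repeat the same logic, so it could be factored into a small helper. The following is only a sketch: the function name `average_comments`, its `comments_index` parameter, and the sample rows are made up for illustration and are not part of the original notebook.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for a list of rows (0.0 if the list is empty)."""
    if not posts:
        return 0.0
    # Comment counts are stored as strings, so convert each one before summing
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows that follow the same column layout as the data set
sample_posts = [
    ['1', 'Ask HN: example question', '', '5', '4', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: another question', '', '3', '8', 'user_b', '1/2/2016 11:30'],
]

print(average_comments(sample_posts))
```

With the real data, the same helper could be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.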


So, Ask HN posts receive a little over 14 comments each on average, while Show HN posts receive about 10.3. Based on this, Ask posts appear to be more popular, or at least tend to generate longer discussions. With this in mind, we'll focus on just the Ask posts moving forward.

Another interesting question is whether the time of day a post is created affects how many comments it receives. Using the `datetime` module, we can parse each post's creation time and group the posts by hour. We'll use the following steps to perform this analysis:

1. Calculate the number of Ask posts created in each hour of the day, along with the number of comments received.

2. Calculate the average number of comments Ask posts receive by hour created.
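
The two steps above can be sketched roughly as follows. This is an illustration on a few made-up rows, not the notebook's own implementation: the names `sample_ask_posts`, `DATE_FORMAT`, `counts_by_hour`, `comments_by_hour`, and `avg_by_hour` are assumptions, and the format string is inferred from timestamps such as `8/4/2016 11:52` in the last column of the data set.

```python
import datetime as dt

# Made-up Ask HN rows in the same column layout as the data set
sample_ask_posts = [
    ['1', 'Ask HN: first', '', '5', '4', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: second', '', '3', '8', 'user_b', '1/26/2016 11:05'],
    ['3', 'Ask HN: third', '', '2', '10', 'user_c', '6/17/2016 0:01'],
]

# Assumed format of the creation timestamp: month/day/year hour:minute
DATE_FORMAT = '%m/%d/%Y %H:%M'

counts_by_hour = {}    # step 1: number of posts created in each hour
comments_by_hour = {}  # step 1: total comments on posts created in each hour

for row in sample_ask_posts:
    created_at = dt.datetime.strptime(row[6], DATE_FORMAT)
    hour = created_at.hour
    counts_by_hour[hour] = counts_by_hour.get(hour, 0) + 1
    comments_by_hour[hour] = comments_by_hour.get(hour, 0) + int(row[4])

# Step 2: average comments per post for each hour
avg_by_hour = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in counts_by_hour
}

for hour in sorted(avg_by_hour):
    print(f'{hour:02d}:00 -> {avg_by_hour[hour]:.2f} comments per post')
```

Keeping the counts and the comment totals in separate dictionaries keyed by hour makes the second step a single division per hour.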


In [None]:
import datetime as dt

result_list = []
