# Exploring Hacker News Posts
In this project, I am working with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/). Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories are voted and commented on, similar to Reddit. Posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

I'm specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Likewise, users submit `Show HN` posts to show the community a project, product, or just generally something interesting.

I am comparing these two types of posts to determine the following: 
- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

I will start by importing the `reader` function from the `csv` module and reading the data set into a list of lists.

In [2]:
# import the reader function from the csv module
from csv import reader

# use the built-in function open() to open the file
opened_file = open('hacker_news.csv')

# use csv.reader() to parse the data from opened file
read_file = reader(opened_file)

# use list() to convert the read file into a list of lists format
hn = list(read_file)

# close the opened file
opened_file.close()

# display the first five rows
print(hn[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]
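
The same read can also be written with a `with` block, which closes the file automatically even if parsing raises an error. The sketch below is a minimal variant under a few assumptions: the helper name `read_csv_rows` is hypothetical, and the default `encoding='utf-8'` is a guess about how `hacker_news.csv` is stored, not something the data set documents.

```python
from csv import reader


def read_csv_rows(path, encoding='utf-8'):
    """Return every row of a CSV file as a list of lists (hypothetical helper)."""
    # newline='' is what the csv module documentation recommends for files passed to reader()
    with open(path, newline='', encoding=encoding) as opened_file:
        # The file is closed automatically when the with block ends
        return list(reader(opened_file))
```

Under those assumptions, `read_csv_rows('hacker_news.csv')` should return the header row followed by the data rows, just like the cell above.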


# Removing Headers from a List of Lists
The first list in `hn` contains the column headers, and each list after it contains the data for one row. I need to remove the row containing the column headers before analyzing the data.

I want to extract the first row and assign it to the variable `headers`, then remove that row from `hn`. Finally, I want to display `headers` followed by the first five rows of `hn`.

In [3]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[0:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
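
The later cells refer to columns by position (for example, `row[1]` for the title). One way to make those positions easier to read is to look them up from the header row. This is only a sketch: `sample_headers` copies the header printed above, and the names `title_index` and `comments_index` are hypothetical.

```python
# Header row copied from the output above
sample_headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# list.index() returns the position of the first matching element
title_index = sample_headers.index('title')
comments_index = sample_headers.index('num_comments')

print(title_index)     # 1
print(comments_index)  # 4
```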


# Extracting Ask HN and Show HN Posts
Now that I've removed the headers from `hn`, I'm ready to filter the data. Since I'm only concerned with posts whose titles begin with `Ask HN` or `Show HN`, I want to create new lists of lists containing only the rows for those titles.

To find posts that begin with either `Ask HN` or `Show HN`, I will use the string method `startswith`. If the string starts with the given prefix, `startswith` returns `True`; otherwise it returns `False`. Because `startswith` is case-sensitive, I convert each title to lowercase before checking it.
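
As a small illustration of why the lowercase conversion matters, the sketch below runs `startswith` on a few made-up titles (they are not rows from the data set) with and without lowercasing first.

```python
# Made-up titles for illustration only
titles = [
    'Ask HN: How do you learn?',
    'ask hn: lowercase prefix',
    'Show HN: My project',
    'Tell me about Ask HN',
]

# For each title, pair the case-sensitive check with the lowercased check
results = [
    (title.startswith('Ask HN'), title.lower().startswith('ask hn'))
    for title in titles
]

for title, (exact, normalized) in zip(titles, results):
    print(title, exact, normalized)
```

Only the lowercased check catches the second title, and neither check matches a title that merely contains the phrase later in the string.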

To do this, I will create three empty lists and then loop through `hn`, appending each post to the list that matches its title.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if (title.lower()).startswith('ask hn'):
        ask_posts.append(row)
    elif (title.lower()).startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Checking number of posts in each list
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822


# Calculating the Average Number of Comments 
Now that I have separated the ask posts and the show posts into two lists of lists, I want to determine whether ask posts or show posts receive more comments on average.

First I want to find the total number of comments in ask posts and then calculate the average number of comments for that list.

In [9]:
total_ask_comments = 0
count = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
    count += 1
avg_ask_comments = total_ask_comments / count
print(avg_ask_comments)


10.393478498741656


Next I want to find the total number of comments in show posts and then calculate the average number of comments for that list, so I can determine which type of post receives more comments on average.

In [10]:
total_show_comments = 0
count = 0
for row in show_posts:
    total_show_comments += int(row[4])
    count += 1
avg_show_comments = total_show_comments / count
print(avg_show_comments)

4.886099625910612


Based on these two calculations, `Ask HN` posts receive more comments on average (about 10.39) than `Show HN` posts (about 4.89).
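
Since the two cells above repeat the same loop, the calculation could be wrapped in a function. This is a minimal sketch, assuming the comment count sits at index 4 as it does in this data set; the function name `average_comments` and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for a list of rows (hypothetical helper)."""
    if not posts:
        # Avoid dividing by zero on an empty list
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the same column order as the data set
sample_posts = [
    ['1', 'Ask HN: Example one', '', '5', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: Example two', '', '3', '4', 'user_b', '9/26/2016 3:24'],
]
print(average_comments(sample_posts))  # 7.0
```

With the real lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages printed above.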