# Exploring Hacker News Posts
In this project, I am working with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/). Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories are voted and commented on, similar to Reddit. Posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

I'm specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Likewise, users submit `Show HN` posts to show the community a project, product, or just generally something interesting.

I am comparing these two types of posts to determine the following: 
- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

I want to start by importing what I need from the `csv` module and reading the data set into a list of lists.

In [1]:
# import the reader function from the csv module
from csv import reader

# use the built-in function open() to open the file
opened_file = open('hacker_news.csv')

# use csv.reader() to parse the data from opened file
read_file = reader(opened_file)

#use list() to convert the read file into a list of lists format
hn = list(read_file)

#close the opened file
opened_file.close()

# display the first five rows
print(hn[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]
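
The same read can also be sketched with a `with` block, which closes the file automatically even if parsing raises an error. This is a minimal sketch, not the notebook's code: it parses a two-row in-memory sample through `io.StringIO` (an assumption, so that it runs without `hacker_news.csv`), and the helper name `read_rows` is hypothetical.

```python
import io
from csv import reader


def read_rows(file_obj):
    """Parse an open CSV file object into a list of lists."""
    return list(reader(file_obj))


# In-memory stand-in for hacker_news.csv (header and one row copied from the output above)
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12578989,algorithmic music,"
    "http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,"
    "1,0,poindontcare,9/26/2016 3:16\n"
)

with sample_csv as f:  # the file object is closed when the block exits
    sample_rows = read_rows(f)

print(sample_rows[0])     # header row
print(sample_rows[1][1])  # title of the first data row
```

With the real file, `with open('hacker_news.csv') as f:` would take the place of the `StringIO` object.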


# Removing Headers from a List of Lists
The first inner list contains the column headers, and each list after it contains the data for one row. I need to remove the row containing the column headers to analyze the data.

I want to extract the first row and assign it to the variable `headers`, then remove that row from `hn`. After that, I want to display the headers and the first five rows of `hn`.

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[0:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


# Extracting Ask HN and Show HN Posts
Now that I've removed the headers from `hn`, I'm ready to filter the data. Since I'm only concerned with post titles beginning with `Ask HN` or `Show HN`, I want to create new lists of lists containing just the data for those titles.

To find posts that begin with either `Ask HN` or `Show HN`, I will use the string method `startswith`, which returns `True` if the string starts with the given prefix and `False` otherwise. Because capitalization varies between titles, I also convert each title to lowercase before checking.
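
As a quick illustration of that check, the sketch below applies `lower()` and `startswith()` to a few made-up titles (the titles are hypothetical and are not taken from the data set).

```python
# Hypothetical titles, used only to illustrate the check
titles = [
    'Ask HN: How do you learn?',
    'ASK HN: Any tips?',
    'Show HN: My project',
    'algorithmic music',
]

checks = []
for title in titles:
    lowered = title.lower()
    # Store a pair: (is it an ask post?, is it a show post?)
    checks.append((lowered.startswith('ask hn'), lowered.startswith('show hn')))
    print(title, checks[-1])
```

`startswith` also accepts a tuple of prefixes, so `lowered.startswith(('ask hn', 'show hn'))` returns `True` when either prefix matches.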

To do this, I will create three empty lists and then loop through `hn`, sorting each post into the list that matches its title.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if (title.lower()).startswith('ask hn'):
        ask_posts.append(row)
    elif (title.lower()).startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Checking number of posts in each list
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822


# Calculating the Average Number of Comments 
Now that I have separated the ask posts and the show posts into two lists of lists, I want to determine whether ask posts or show posts receive more comments on average.

First I want to find the total number of comments in ask posts and then calculate the average number of comments for that list.

In [4]:
total_ask_comments = 0
count = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
    count += 1
avg_ask_comments = total_ask_comments / count
print(avg_ask_comments)


10.393478498741656


Next I want to find the total number of comments in show posts and then calculate the average number of comments for that list, to determine which type of post receives more comments on average.

In [5]:
total_show_comments = 0
count = 0
for row in show_posts:
    total_show_comments += int(row[4])
    count += 1
avg_show_comments = total_show_comments / count
print(avg_show_comments)

4.886099625910612


Based on these two calculations it appears that `Ask HN` posts receive more comments on average than `Show HN` posts.
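
Since both cells repeat the same loop, the calculation could also be wrapped in a small function. This is a sketch under assumptions: `average_comments` is a hypothetical helper name, the sample rows are made up, and the comment count is assumed to sit at index 4, as it does in `hn`.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts stored as strings at comments_index."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows with the same column order as hn
sample_posts = [
    ['1', 'Ask HN: first', '', '3', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: second', '', '5', '4', 'user_b', '9/26/2016 15:02'],
]

print(average_comments(sample_posts))
```

In the notebook, this would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.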

# Finding the Amount of Ask Posts and Comments by Hour Created
I determined that, on average, ask posts receive more comments than show posts. Because of this, I want to focus my remaining analysis on ask posts only.

Now I want to determine whether ask posts created at a certain time are more likely to attract comments. There are two steps to this analysis:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received
2. Calculate the average number of comments ask posts receive by hour created

First I want to count the ask posts and their comments by hour. To do this, I will use the `datetime` module to work with the data in the `created_at` column.
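
To illustrate the parsing step, the sketch below runs `datetime.strptime` on one `created_at` value from the rows printed earlier, using the format string `%m/%d/%Y %H:%M`, and then extracts just the hour.

```python
import datetime as dt

created_at = '9/26/2016 3:26'  # value from the first data row
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

hour_label = parsed.strftime('%H')  # zero-padded hour as a string
print(parsed)
print(hour_label)
print(parsed.hour)  # the same hour as an integer
```

Using the zero-padded string as a dictionary key keeps the hours in a uniform two-character form.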

In [9]:
#import datetime module as dt
import datetime as dt

#create an empty list
result_list = []

#iterate over ask_posts and append to result_list
for row in ask_posts:
    created_at = row[6]
    comments = int(row[4])
    result = [created_at,comments]
    result_list.append(result)
    
#create two empty dictionaries
counts_by_hour = {}
comments_by_hour = {}

#loop through each row of result_list
for row in result_list:
    num_comment = row[1]
    create_time = dt.datetime.strptime(row[0],"%m/%d/%Y %H:%M")
    hour = create_time.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comment
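

The same tally can also be written with `collections.defaultdict`, which removes the need for the `if hour not in counts_by_hour` branch. This is a minimal sketch on made-up `[created_at, comments]` pairs; the sample values and the `sample_` names are assumptions.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, comments] pairs in the same shape as result_list
sample_results = [
    ['9/26/2016 3:26', 4],
    ['9/25/2016 3:05', 6],
    ['9/26/2016 15:40', 10],
]

sample_counts = defaultdict(int)    # missing hours start at 0
sample_comments = defaultdict(int)

for created_at, num_comment in sample_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    sample_counts[hour] += 1
    sample_comments[hour] += num_comment

print(dict(sample_counts))
print(dict(sample_comments))
```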
            

# Calculating Average Number of Comments by Hour
I have now created two dictionaries:
- `counts_by_hour`: contains the number of ask posts created during each hour of the day
- `comments_by_hour`: contains the total number of comments received by ask posts created during each hour

Next I want to use these dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [15]:
#create an empty list
avg_by_hour = []

#iterate over the comments_by_hour dictionary
for row in comments_by_hour:
    average = comments_by_hour[row] / counts_by_hour[row]
    avg_by_hour.append([row, average])
    
#display the list
print(avg_by_hour)

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]


# Sorting and Printing Values
The format of the current results makes it hard to identify the hours with the highest values. I want to finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [17]:
# create an empty list
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[0:5]:
    hour_time = dt.datetime.strptime(row[1],"%H")
    hour_string = hour_time.strftime("%H")
    final_sent = '{hour}: {avg:.2f} average comments per post'.format(hour=hour_string,avg=row[0])
    print(final_sent)

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]
Top 5 Hours for Ask Posts Comments
15: 28.68 average comments per post
13: 16.32 average comments per post
12: 12.38 average comments per post
02: 11.14 average comments per post
10: 10.68 average comments per post
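
Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` accepts a `key` function, so the `[hour, average]` pairs could be sorted directly without building a swapped copy. The sample list below copies four entries from `avg_by_hour`; `sample_avg_by_hour` and `top_hours` are hypothetical names.

```python
sample_avg_by_hour = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['09', 6.653153153153153],
]

# Sort on the second element of each pair (the average), highest first
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print('{hour}: {avg:.2f} average comments per post'.format(hour=hour, avg=avg))
```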


# Final Conclusion
From these results I can see that hour 15 has the highest average number of comments per post, followed by hours 13 and 12. Based on the documentation of the data set, I know that the times are in Eastern Standard Time in the US. Based on my findings, I would predict that an ask post created during the 12 pm to 3 pm hours EST has a higher chance of receiving comments.

# Potential Next Steps
If I wanted to expand on this project, some things I could do are:
- determine if show or ask posts receive more points on average
- determine if posts created at a certain time are more likely to receive points
- compare my results to the average number of comments and points on other posts.