# Exploring Hacker News Posts
In this project, I am working with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/). Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories are voted and commented upon, similar to reddit. Posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

I'm specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Likewise, users submit `Show HN` posts to show the community a project, product, or just generally something interesting.

I am comparing these two types of posts to determine the following: 
- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

I want to start by importing the `reader` function from the `csv` module and reading the data set into a list of lists.

In [2]:
# import the reader function from the csv module
from csv import reader

# use the built-in function open() to open the file
opened_file = open('hacker_news.csv')

# use csv.reader() to parse the data from opened file
read_file = reader(opened_file)

#use list() to convert the read file into a list of lists format
hn = list(read_file)

#close the opened file
opened_file.close()

# display the first five rows
print(hn[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
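
As a possible alternative to calling `open()` and `close()` by hand, the same read can be wrapped in a `with` block so the file is closed automatically. The sketch below is only an illustration: `sample_csv` and `read_rows` are hypothetical names, and an in-memory string stands in for `hacker_news.csv` so the example runs on its own.

```python
import csv
import io

# Hypothetical one-row sample standing in for hacker_news.csv.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

def read_rows(file_obj):
    """Parse an open CSV file object into a list of lists."""
    return list(csv.reader(file_obj))

# With the real file, the `with` block would close it automatically:
#     with open('hacker_news.csv', newline='') as f:
#         hn = read_rows(f)
with io.StringIO(sample_csv) as f:
    sample_rows = read_rows(f)

print(sample_rows[0])
```

The `newline=''` argument in the commented-out call follows the recommendation in the `csv` module documentation for files passed to `csv.reader`.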


# Removing Headers from a List of Lists
The first list in `hn` contains the column headers, and each list after it contains the data for one row. I need to remove the row containing the column headers before I can analyze the data.

I want to extract the first row and assign it to the variable `headers`, then remove that row from `hn`. Finally, I want to display the headers followed by the first five rows of `hn`.

In [3]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[0:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
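
The header split can also be written with extended unpacking. This is a minimal sketch on a hypothetical three-row list (`sample`, `sample_headers`, and `sample_data` are illustrative names, not part of the project):

```python
# Hypothetical miniature of the data set: one header row and two data rows.
sample = [
    ['id', 'title', 'num_comments'],
    ['1', 'Ask HN: Example question?', '5'],
    ['2', 'Show HN: Example project', '3'],
]

# Extended unpacking: the first row goes to sample_headers,
# every remaining row is collected into the list sample_data.
sample_headers, *sample_data = sample

print(sample_headers)
print(len(sample_data))
```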


# Extracting Ask HN and Show HN Posts
Now that I've removed the headers from `hn`, I'm ready to filter the data. Since I'm only concerned with post titles beginning with `Ask HN` or `Show HN`, I want to create new lists of lists containing just the data for those titles.

To find posts that begin with either `Ask HN` or `Show HN`, I will use the string method `startswith`. Called on a string, it returns `True` if the string starts with the given prefix and `False` otherwise. Because capitalization varies between titles, I convert each title to lowercase before checking.

To do this, I will begin by creating three empty lists and then loop through `hn`, identifying posts that meet my criteria and sorting them into their matching list.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if (title.lower()).startswith('ask hn'):
        ask_posts.append(row)
    elif (title.lower()).startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Checking number of posts in each list
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
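
To make the matching rule easier to check in isolation, it could be pulled into a small helper. The function name `classify_title` and the sample titles are hypothetical; the logic mirrors the lowercase `startswith` test used in the loop above.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles covering each case, including mixed capitalization
# and a title that mentions "ask HN" without starting with it.
sample_titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
    'Why I ask HN for advice',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```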


# Calculating the Average Number of Comments 
Now that I have separated the ask posts and the show posts into two lists of lists, I want to determine whether ask posts or show posts receive more comments on average.

First, I want to find the total number of comments on ask posts and then calculate the average number of comments per post for that list.

In [5]:
total_ask_comments = 0
count = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
    count += 1
avg_ask_comments = total_ask_comments / count
print(avg_ask_comments)


14.038417431192661


Next, I want to find the total number of comments on show posts and calculate the average per post, so that I can determine which type of post receives more comments on average.

In [6]:
total_show_comments = 0
count = 0
for row in show_posts:
    total_show_comments += int(row[4])
    count += 1
avg_show_comments = total_show_comments / count
print(avg_show_comments)

10.31669535283993


Based on these two calculations it appears that `Ask HN` posts receive more comments on average than `Show HN` posts.
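
Since the same total-then-divide pattern is repeated for each list, it could be factored into one function. This is a sketch under the assumption that the column holds integers stored as strings, as `num_comments` does; `average_column` and `sample_posts` are hypothetical names.

```python
def average_column(rows, index):
    """Average the integer values stored as strings in one column."""
    if not rows:
        # Avoid dividing by zero when a list of posts is empty.
        return 0.0
    return sum(int(row[index]) for row in rows) / len(rows)

# Hypothetical rows in the same column order as the data set,
# with num_comments at index 4.
sample_posts = [
    ['1', 'Ask HN: A?', '', '3', '2', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B?', '', '5', '4', 'user_b', '8/4/2016 12:10'],
    ['3', 'Ask HN: C?', '', '1', '6', 'user_c', '8/4/2016 13:30'],
]
print(average_column(sample_posts, 4))
```

With the real data, a call such as `average_column(ask_posts, 4)` would be expected to match `avg_ask_comments`.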

# Finding the Amount of Ask Posts and Comments by Hour Created
I determined that, on average, ask posts receive more comments than show posts. Because of this, I want to focus my remaining analysis on just these posts.

Now I want to determine whether ask posts created at a certain time are more likely to attract comments. There are two steps to performing this analysis:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received
2. Calculate the average number of comments ask posts receive by hour created

First, I want to calculate the number of ask posts and comments for each hour. To do this, I will use the `datetime` module to work with the data in the `created_at` column.

In [7]:
#import datetime module as dt
import datetime as dt

#create an empty list
result_list = []

#iterate over ask_posts and append to result_list
for row in ask_posts:
    created_at = row[6]
    comments = int(row[4])
    result = [created_at,comments]
    result_list.append(result)
    
#create two empty dictionaries
counts_by_hour = {}
comments_by_hour = {}

#loop through each row of result_list
for row in result_list:
    num_comment = row[1]
    create_time = dt.datetime.strptime(row[0],"%m/%d/%Y %H:%M")
    hour = create_time.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comment
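
The `if hour not in ...` check can be avoided with `collections.defaultdict`, which starts every missing key at zero. The following is a minimal sketch on three hypothetical `[created_at, comments]` pairs; `sample_pairs`, `counts`, and `comments` are illustrative names.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs in the data set's date format.
sample_pairs = [
    ['8/4/2016 11:52', 52],
    ['1/26/2016 19:30', 10],
    ['6/23/2016 11:05', 1],
]

# defaultdict(int) returns 0 for a missing key, so += works on first sight.
counts = defaultdict(int)
comments = defaultdict(int)

for created_at, num_comments in sample_pairs:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")  # zero-padded hour, e.g. '11'
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```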
            

# Calculating Average Number of Comments by Hour
I have now created two dictionaries:
- `counts_by_hour`: contains the number of ask posts created during each hour of the day
- `comments_by_hour`: contains the total number of comments received by ask posts created during each hour

Next, I want to use these dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [8]:
#create an empty list
avg_by_hour = []

#iterate over the comments_by_hour dictionary
for row in comments_by_hour:
    average = comments_by_hour[row] / counts_by_hour[row]
    avg_by_hour.append([row, average])
    
#display the list
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
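
The same per-hour division can be expressed as a list comprehension. The two small dictionaries below are hypothetical stand-ins for `comments_by_hour` and `counts_by_hour`, chosen so the arithmetic is easy to follow.

```python
# Hypothetical totals: 30 comments over 2 posts at hour 15,
# and 20 comments over 4 posts at hour 02.
sample_counts = {'15': 2, '02': 4}
sample_comments = {'15': 30, '02': 20}

# One [hour, average] pair per hour, in dictionary order.
sample_avg = [
    [hour, sample_comments[hour] / sample_counts[hour]]
    for hour in sample_comments
]
print(sample_avg)
```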


# Sorting and Printing Values
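The format of the current results makes it hard to identify the hours with the highest values. I want to finish by sorting the list of lists and printing the five highest values in a format that's easier to read. Because `sorted()` compares lists by their first element, I first swap each pair so that the average comes before the hour.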

In [9]:
# create an empty list
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[0:5]:
    hour_time = dt.datetime.strptime(row[1],"%H")
    hour_string = hour_time.strftime("%H")
    final_sent = '{hour}: {avg:.2f} average comments per post'.format(hour=hour_string,avg=row[0])
    print(final_sent)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
Top 5 Hours for Ask Posts Comments
15: 38.59 average comments per post
02: 23.81 average comments per post
20: 21.52 average comments per post
16: 16.80 average comments per post
21: 16.01 average comments per post
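
Swapping the columns is one way to sort by the average; another is to pass a `key` function to `sorted()` and leave the pairs as they are. The sketch below uses four of the `[hour, average]` pairs from the output above; `sample_avg_by_hour` and `top_hours` are hypothetical names.

```python
# Four [hour, average] pairs taken from the avg_by_hour output.
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# key picks the average (index 1) as the sort value; reverse=True puts
# the largest average first.
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print('{hour}: {avg:.2f} average comments per post'.format(hour=hour, avg=avg))
```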


# Final Conclusion
From the data set I can see that hour 15 is the top hour for average number of comments per post, followed by hours 02 and 20. Based on the documentation of the data set, I know that the times are in Eastern Standard Time in the US. Based on my findings, I would predict that an ask post created during the 3 PM hour (15:00 to 15:59 EST) has a higher chance of receiving comments, with the 2 AM, 8 PM, 4 PM, and 9 PM hours rounding out the top five.

# Potential Next Steps
If I wanted to expand on this project, some things I could do are:
- change the time zone for the top hours to post Ask HN posts
- determine if show or ask posts receive more points on average
- determine if posts created at a certain time are more likely to receive points
- compare my results to the average number of comments and points on other posts.

## Changing the Time Zone
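The `created_at` times are already in Eastern Standard Time, so the code in this cell does not apply any time zone offset. Instead, I use `strptime` and `strftime` from the `datetime` module to reformat the top hours from the 24-hour clock to the 12-hour clock. I want to do this so that I can better understand what times I should post an Ask HN post to get the most comments.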

In [11]:
template = "{hour}: {comments:.2f} average comments per post"
for row in sorted_swap[:5]:
    hour = row[1]
    comment = row[0]
    hour_dt = dt.datetime.strptime(hour, "%H")
    hour_str = hour_dt.strftime("%I:%M %p")
    final_str = template.format(hour=hour_str, comments=comment)
    print(final_str)

03:00 PM: 38.59 average comments per post
02:00 AM: 23.81 average comments per post
08:00 PM: 21.52 average comments per post
04:00 PM: 16.80 average comments per post
09:00 PM: 16.01 average comments per post
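
If the hours do need to be shifted to another time zone, one option is to attach a fixed offset and convert with `astimezone`. This is a rough sketch, assuming a constant UTC-5 offset for Eastern Standard Time (it ignores daylight saving time) and an arbitrary placeholder year of 2016; `EST` and `est_hour_to_utc` are hypothetical names.

```python
import datetime as dt

# Assumption: Eastern Standard Time as a fixed UTC-5 offset, no daylight saving.
EST = dt.timezone(dt.timedelta(hours=-5), name="EST")

def est_hour_to_utc(hour_str):
    """Convert a two-digit EST hour such as '15' to the matching UTC hour."""
    # strptime with only %H fills in a default date; the year is replaced
    # with a placeholder and the EST offset is attached.
    local = dt.datetime.strptime(hour_str, "%H").replace(year=2016, tzinfo=EST)
    return local.astimezone(dt.timezone.utc).strftime("%H")

for est_hour in ['15', '02', '20']:
    print(est_hour, 'EST ->', est_hour_to_utc(est_hour), 'UTC')
```

Only the hour is returned, so a conversion that crosses midnight (such as 20 EST) wraps around to the early hours of the next UTC day.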


## Determining which posts receive more points
First, I want to use the lists I created earlier for ask, show, and other posts. I need to loop through them to total the number of points and determine which set of posts receives the most points on average.

In [12]:
total_ask_points = 0
for row in ask_posts:
    points = row[3]
    points_int = int(points)
    total_ask_points += points_int
    
avg_ask_points = total_ask_points / len(ask_posts)
print(avg_ask_points)

total_show_points = 0
for row in show_posts:
    points = row[3]
    points_int = int(points)
    total_show_points += points_int

avg_show_points = total_show_points / len(show_posts)
print(avg_show_points)

total_other_points = 0
for row in other_posts:
    points = row[3]
    points_int = int(points)
    total_other_points += points_int

avg_other_points = total_other_points / len(other_posts)
# print the average for other posts (not avg_ask_points again)
print(avg_other_points)

15.061926605504587
27.555077452667813
55.4067698034198


From this calculation I can see that Show HN posts receive more points per post than Ask HN posts. So now I would like to determine at which hours show posts are more likely to receive points.

In [13]:
import datetime as dt
result_list_show = []
for row in show_posts:
    created = row[6]
    points = row[3]
    points_int = int(points)
    list_elem_s = [created, points_int]
    result_list_show.append(list_elem_s)

show_counts_by_hour = {}
points_by_hour = {}
for row in result_list_show:
    date = row[0]
    date_obj = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = date_obj.strftime("%H")
    if hour not in show_counts_by_hour:
        show_counts_by_hour[hour] = 1
        points_by_hour[hour] = row[1]
    else:
        show_counts_by_hour[hour] += 1
        points_by_hour[hour] += row[1]

print(show_counts_by_hour)
print(points_by_hour)

{'14': 86, '22': 46, '18': 61, '07': 26, '20': 60, '05': 19, '16': 93, '19': 55, '15': 78, '03': 27, '17': 93, '06': 16, '02': 30, '13': 99, '08': 34, '21': 47, '04': 26, '11': 44, '12': 61, '23': 36, '09': 30, '01': 28, '10': 36, '00': 31}
{'14': 2187, '22': 1856, '18': 2215, '07': 494, '20': 1819, '05': 104, '16': 2634, '19': 1702, '15': 2228, '03': 679, '17': 2521, '06': 375, '02': 340, '13': 2438, '08': 519, '21': 866, '04': 386, '11': 1480, '12': 2543, '23': 1526, '09': 553, '01': 700, '10': 681, '00': 1173}


Now I want to use these two new dictionaries to find the average points per post for each hour.

- `show_counts_by_hour`: contains the number of show posts created during each hour
- `points_by_hour`: contains the corresponding total number of points for each hour

In [17]:
# points_by_hour is keyed by hour, so name the loop variable accordingly
avgpoints_by_hour = []
for hour in points_by_hour:
    avgpoints_by_hour.append([hour, points_by_hour[hour] / show_counts_by_hour[hour]])

# swap to [average, hour] so sorted() orders by the average
swap_avgpoints_by_hour = []
for row in avgpoints_by_hour:
    swap_avgpoints_by_hour.append([row[1], row[0]])

# use a separate name so the ask-post ranking in sorted_swap is not overwritten
sorted_points_swap = sorted(swap_avgpoints_by_hour, reverse=True)

# points-specific template so the output is not labeled as comments
points_template = "{hour}: {points:.2f} average points per post"
print("Top 5 hours for Show HN post points")
for row in sorted_points_swap[:5]:
    hour = row[1]
    avg_points = row[0]
    hour_dt = dt.datetime.strptime(hour, "%H")
    hour_str = hour_dt.strftime("%I:%M %p")
    final_str = points_template.format(hour=hour_str, points=avg_points)
    print(final_str)

Top 5 hours for Show HN post points
11:00 PM: 42.39 average points per post
12:00 PM: 41.69 average points per post
10:00 PM: 40.35 average points per post
12:00 AM: 37.84 average points per post
06:00 PM: 36.31 average points per post


## Comparing Results
Now that I've calculated the average points for each type of post, I need to calculate the average number of comments on other posts. I then create a list for each set of averages, in the order `[comments, points]`, before combining them into a list of lists.

In [22]:
total_other_comments = 0
avg_other_posts = []
avg_ask_posts = []
avg_show_posts = []
avg_list = []
for row in other_posts:
    comments = row[4]
    comments_int = int(comments)
    total_other_comments += comments_int
    
avg_other_comments = total_other_comments / len(other_posts)

avg_other_posts = [avg_other_comments, avg_other_points]
avg_ask_posts = [avg_ask_comments, avg_ask_points]
avg_show_posts = [avg_show_comments, avg_show_points]
avg_list.append(avg_other_posts)
avg_list.append(avg_ask_posts)
avg_list.append(avg_show_posts)
print(avg_list)

[[26.8730371059672, 55.4067698034198], [14.038417431192661, 15.061926605504587], [10.31669535283993, 27.555077452667813]]
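
Because the nested list carries no labels, it can be easier to read when each pair is keyed by post type. The sketch below copies the averages from the output above into a hypothetical dictionary named `summary` and prints one aligned line per type; `lines` and `most_points` are also illustrative names.

```python
# Averages copied from the output above, as [comments, points] per post type.
summary = {
    'Other': [26.8730371059672, 55.4067698034198],
    'Ask HN': [14.038417431192661, 15.061926605504587],
    'Show HN': [10.31669535283993, 27.555077452667813],
}

lines = []
for label, (avg_comments, avg_points) in summary.items():
    # Left-align the label, right-align each average to two decimals.
    lines.append(
        '{:<8} {:>8.2f} comments {:>8.2f} points'.format(label, avg_comments, avg_points)
    )
print('\n'.join(lines))

# Post type with the highest average points (index 1 of each pair).
most_points = max(summary, key=lambda label: summary[label][1])
print(most_points)
```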


## Additional Analysis
Now that I have all the averages together, I can see that other posts (those that are neither Show HN nor Ask HN) have the highest average comments and the highest average points. Between the remaining two types, Ask HN posts have the second-most comments but the fewest points per post, while Show HN posts have the fewest comments per post but the second-most points.