# Exploring Hacker News Posts

In this project, I analyze a data set from the popular website [Hacker News](https://news.ycombinator.com/). This [data set](https://www.kaggle.com/hacker-news/hacker-news-posts) contains roughly 20,000 rows of user-submitted posts, each of which received at least one comment.

The goal of this project is to determine which type of post, **Ask HN** or **Show HN**, receives more comments, and whether posts created at a certain time of day receive more comments.

* The Ask HN subject allows posters to ask the community about something.

* The Show HN subject allows posters to show the community something.

In [8]:
##Let's begin by opening and reading the data set
from csv import reader

##The with block closes the file once it has been read
with open("hacker_news.csv") as opened:
    hacker = list(reader(opened))

##The data of our set will be labeled hn
hn = hacker[1:]
##And the header will be labeled hnh
hnh = hacker[0]

##Let's print the first five rows to ensure everything is correct
print(hnh, "\n")
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
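As an aside, the rows above are plain lists, so every column has to be looked up by position. A minimal sketch of reading the same layout by column name with `csv.DictReader`, assuming the header shown above; the in-memory `sample_csv` string is only a stand-in for `hacker_news.csv` and holds two rows copied from the output:

```python
import csv
import io

# Stand-in for hacker_news.csv: the header plus two rows copied from the output above.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,3,1,hswarna,6/17/2016 0:01\n"
)

# DictReader treats the first line as field names, so each row becomes a dict.
rows = list(csv.DictReader(sample_csv))

for row in rows:
    # Values are still strings, so the comment count is converted explicitly.
    print(row["title"], "|", int(row["num_comments"]), "comments")
```

The rest of this project keeps the list-of-lists form, so this is only an alternative for readers who prefer named columns.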


Now that we have our data, let's start cleaning it.

## Data Cleaning: Ask HN and Show HN

We are only interested in the data that use the subjects Ask HN and Show HN, so we'll create and fill two new lists, respectively, with this pertinent data.

In [16]:
##We'll create two lists for the pertinent data, 
##and another for the rest of the data.
ask_posts = []
show_posts = []
other_posts = []

##Now we'll loop through our data set and fill each of our new 
##lists with its respective data.
for row in hn:
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
##Let's see how much data we are left with in each list
print("Ask HN Posts :", len(ask_posts), "\n"
      "Show HN Posts :", len(show_posts), "\n"
      "Other Posts :", len(other_posts))


Ask HN Posts : 1744 
Show HN Posts : 1162 
Other Posts : 17194
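The title test in the loop above can also be pulled into a small helper, which makes the case-insensitive matching easy to check on its own. This is a sketch only: `classify` is a hypothetical name, and the example titles are made up for illustration rather than taken from the data set.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the start of the title."""
    lowered = title.lower()  # lowercase first so 'ASK HN' and 'Ask HN' both match
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical titles, made up for illustration only.
examples = [
    "Ask HN: How do you learn?",
    "SHOW HN: My side project",
    "Interactive Dynamic Video",
]
labels = [classify(title) for title in examples]
print(labels)  # ['ask', 'show', 'other']
```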


That leaves 2,906 Ask HN and Show HN posts to work with (1,744 + 1,162). The data is already in a clean, readable format for our first goal: determining which subject receives more comments. So let's begin.

## Analysis: Ask HN and Show HN Average Comments

In [25]:
##Let's begin by creating our variables
total_ask_comments = 0
total_show_comments = 0

##Now let's loop through our respective subjects to determine
##the number of comments in each subject
for row in ask_posts:
    total_ask_comments += int(row[4])

for row in show_posts:
    total_show_comments += int(row[4])

##The averages are computed once, after the loops. Floor division (//)
##rounds them down to whole numbers.
avg_ask_comments = total_ask_comments // len(ask_posts)
avg_show_comments = total_show_comments // len(show_posts)
    
    
##Now that we have our relevant information, let's print it out and view it
print("Total Ask HN Comments :", total_ask_comments, "\n"
      "Average Ask HN Comments :", avg_ask_comments, "\n", "\n"
     
      "Total Show HN Comments :", total_show_comments, "\n"
      "Average Show HN Comments :", avg_show_comments)

Total Ask HN Comments : 24483 
Average Ask HN Comments : 14 
 
Total Show HN Comments : 11988 
Average Show HN Comments : 10


It appears that Ask HN posts receive more comments and discussion than Show HN posts. Although Ask HN has only 582 more posts than Show HN, it has more than double the number of comments.

With the Ask HN average only 4 comments higher than the Show HN average, I suspect that a number of very popular Ask HN posts are outliers compared to the rest.
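The averages printed above are rounded down by floor division, and a mean alone cannot confirm or rule out outliers. Here is a minimal sketch of both points: the first part recomputes the averages from the printed totals with true division, and the second compares the mean with the median on a made-up list. `sample_comments` is hypothetical and is not taken from the data set.

```python
from statistics import mean, median

# Totals and post counts reported in the output above.
ask_mean = 24483 / 1744
show_mean = 11988 / 1162
print(round(ask_mean, 2), round(show_mean, 2))  # 14.04 10.32

# Hypothetical comment counts (NOT from the data set): four quiet posts
# and one very popular post.
sample_comments = [1, 2, 3, 4, 90]

# A mean far above the median is a sign that a few large values dominate.
print("mean:", mean(sample_comments), "median:", median(sample_comments))
```

In the notebook, the same comparison could be run on `[int(row[4]) for row in ask_posts]` to see whether the outlier suspicion holds.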

## Data Cleaning: Time of Day

Now we can begin our second goal: finding the time of day at which posts receive the most comments. Because Ask HN posts receive more than double the comments of Show HN posts, we'll base this analysis on the Ask HN data.

We'll first need to put the data into a standard, readable form, and we'll use dictionaries for this. The data set records when each post was created, not when its comments were made, so we only need the hour of day of each post along with its number of comments. We can create a new list with only this information.

In [88]:
##We'll import the datetime module, which contains many useful classes 
##and methods for managing dates and times. We'll give the module an 
##alias of dt, to help with code readability
import datetime as dt

##We'll create a list of lists that contains the pertinent information
##from our posts: the date created and the number of comments
result_list = []

##Then we'll create a loop to grab this information
for row in ask_posts:
    created_at = row[6]
    comments = int(row[4])
    result_list.append([created_at, comments])
    
##Now we'll populate two dictionaries with this information.
##One dictionary will contain the number of posts made for
##a certain hour of the day, while the other will contain the
##number of comments for a certain hour of the day.
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    comments = row[1]
    dates = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')
    ##Keep only the hour, formatted as HH:00
    hour = dates.strftime("%H") + ":00"
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
        
##Let's view our dictionaries
print("------------------" "\n"
      "||Posts by Hour ||" "\n"
      "------------------" "\n", 
       counts_by_hour, "\n" "\n"
      "---------------------" "\n"
      "||Comments by Hour ||" "\n"
      "---------------------" "\n", 
       comments_by_hour)

------------------
||Posts by Hour ||
------------------
 {'06:00': 44, '13:00': 85, '18:00': 109, '22:00': 71, '05:00': 46, '12:00': 73, '08:00': 48, '15:00': 116, '00:00': 55, '23:00': 68, '04:00': 47, '10:00': 59, '02:00': 58, '17:00': 100, '16:00': 108, '01:00': 60, '07:00': 34, '09:00': 45, '14:00': 107, '03:00': 54, '21:00': 109, '19:00': 110, '20:00': 80, '11:00': 58} 

---------------------
||Comments by Hour ||
---------------------
 {'06:00': 397, '13:00': 1253, '18:00': 1439, '22:00': 479, '05:00': 464, '12:00': 687, '08:00': 492, '15:00': 4477, '00:00': 447, '23:00': 543, '04:00': 337, '10:00': 793, '02:00': 1381, '17:00': 1146, '16:00': 1814, '01:00': 683, '07:00': 267, '09:00': 251, '14:00': 1416, '03:00': 421, '21:00': 1745, '19:00': 1188, '20:00': 1722, '11:00': 641}
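The hour extraction used in the cell above can be checked on a single timestamp. A small sketch, using the `created_at` value of the first row printed at the start of this project; `hour_label` is a name introduced here for illustration:

```python
import datetime as dt

# created_at value copied from the first row of the data set shown earlier.
created_at = "8/4/2016 11:52"

# Parse month/day/year and 24-hour time into a datetime object.
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

# Format the hour with a leading zero and a fixed ':00' suffix.
hour_label = parsed.strftime("%H:00")
print(hour_label)  # 11:00
```

Zero-padding the hour is what lets labels such as `'02:00'` and `'15:00'` compare in chronological order as plain strings.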


## Analysis: Average Comments per Post by Hour of Day

Now that we have our dictionaries, we can use them to find the average number of comments per post for each hour of the day.

In [80]:
##Let's create a list that'll contain our averages. The first column
##will hold the time of day, and the second will contain the average
avg_by_hour = []

for key in counts_by_hour:
    average = comments_by_hour[key] // counts_by_hour[key]
    avg_by_hour.append([key, average])
    
##And let's print our results
print("-------------------------------------" "\n"
      "||Average Comments per Post by Hour||" "\n" 
      "-------------------------------------" "\n", avg_by_hour)

-------------------------------------
||Average Comments per Post by Hour||
-------------------------------------
 [['06:00', 9], ['13:00', 14], ['18:00', 13], ['22:00', 6], ['05:00', 10], ['12:00', 9], ['08:00', 10], ['15:00', 38], ['00:00', 8], ['23:00', 7], ['04:00', 7], ['10:00', 13], ['02:00', 23], ['17:00', 11], ['16:00', 16], ['01:00', 11], ['07:00', 7], ['09:00', 5], ['14:00', 13], ['03:00', 7], ['21:00', 16], ['19:00', 10], ['20:00', 21], ['11:00', 11]]
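Floor division truncates the averages above to whole numbers. A sketch of the same calculation with true division, using a dictionary comprehension on three hours copied from the dictionaries printed earlier. Only this subset is reproduced here, and using a dictionary for `avg_by_hour` is a choice made for this sketch:

```python
# Three hours copied from the two dictionaries printed earlier.
counts_by_hour = {"15:00": 116, "02:00": 58, "09:00": 45}
comments_by_hour = {"15:00": 4477, "02:00": 1381, "09:00": 251}

# True division keeps the fractional part that // discards.
avg_by_hour = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in counts_by_hour
}

for hour, avg in avg_by_hour.items():
    print(f"{hour}: {avg:.2f} average comments per post")
```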


Now we have our results! This is a little difficult to read, however, so let's make it a little nicer.

In [98]:
##Let's swap the columns so that we can sort our data from the
##highest average to the lowest.
swap_avg_by_hour = []

for row in avg_by_hour:
    hour = row[0]
    avg = row[1]
    swap_avg_by_hour.append([avg, hour])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
back_swap = sorted(swap_avg_by_hour, reverse=False)

##Then let's print our results
print("-------------------------------------------" "\n"
      "||Top 5 Best Hours for Ask Posts Comments||" "\n"
      "-------------------------------------------")

for avg in sorted_swap[:5]:
    format_str = "{0}: {1} average comments per post"
    print(format_str.format(avg[1], avg[0]))
    
    
print("--------------------------------------------" "\n"
      "||Top 3 Worst Hours for Ask Posts Comments||" "\n"
      "--------------------------------------------")

for avg in back_swap[:3]:
    format_str = "{0}: {1} average comments per post"
    print(format_str.format(avg[1], avg[0]))

-------------------------------------------
||Top 5 Best Hours for Ask Posts Comments||
-------------------------------------------
15:00: 38 average comments per post
02:00: 23 average comments per post
20:00: 21 average comments per post
21:00: 16 average comments per post
16:00: 16 average comments per post
--------------------------------------------
||Top 3 Worst Hours for Ask Posts Comments||
--------------------------------------------
09:00: 5 average comments per post
22:00: 6 average comments per post
03:00: 7 average comments per post
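Swapping the columns is one way to sort by the average. Another is to pass a `key` function to `sorted`, which sorts on the second column directly. A sketch using a few `[hour, average]` pairs copied from the list of averages above:

```python
# A few [hour, average] pairs copied from the averages printed above.
avg_by_hour = [["06:00", 9], ["15:00", 38], ["02:00", 23], ["09:00", 5], ["20:00", 21]]

# Sort on the second column (the average) without swapping the columns first.
best_first = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, avg in best_first[:3]:
    print(f"{hour}: {avg} average comments per post")
```

One difference to keep in mind: with `key`, rows that share the same average keep their original relative order, whereas the swapped lists break ties by comparing the hour strings.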


It appears that Ask HN posts created at 3PM Eastern Time receive the most comments on average, with 2AM and 8PM in second and third place. Posts created at 9AM receive the fewest. For the best chance of receiving comments on your Ask HN post, you should post at 3PM ET and avoid posting at 9AM.

## Conclusion

In conclusion, we have answered both of our questions. Ask HN posts receive more comments from the community, with over double the total number of comments compared to Show HN posts. To receive the most comments on an Ask HN post, you should post at 3PM ET.