# Analysis of the Hacker News Dataset

[Hacker News](https://news.ycombinator.com/) is a website where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is very popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

We will be analyzing posts that start with 'Ask HN' and 'Show HN'. 
* **Ask HN** is where a user will ask the Hacker News community a question.
* **Show HN** is where a user will show the Hacker News community a project, a product, or just something interesting.


In [52]:
from csv import reader

### The Hacker News data set ###
# Use a context manager so the file is closed once it has been read
with open('hacker_news.csv', encoding="utf-8") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
hn_header = hn[0]              #The header row
hn_minus_header = hn[1:]       #The rest of the data

#Set your values for the starting row and ending row of the dataset.
starting_row = 0
ending_row = 5
total_rows = ending_row - starting_row

Below we define a function to explore the data by slicing it with a start row and an end row. We then print the header row, each row within the slice, and the total number of rows and columns in the dataset.

In [53]:
def explore_data(dataset, start, end, rows_and_columns=False, newline=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        if newline:
            print('\n') # adds a new (empty) line between rows
        
    if rows_and_columns:
        print('\n')
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(hn_header)
print('\nThe first ' + str(total_rows) + ' rows of the dataset:\n')
explore_data(hn_minus_header, starting_row, ending_row, True, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

The first 5 rows of the dataset:

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell

## Now we will extract the posts for Ask HN and Show HN

Below, we separate the data into three lists:

* A list for Ask HN posts
* A list for Show HN posts
* A list for other posts

We then print the size of each list and display the first five rows of `ask_posts` and `show_posts`.

In [62]:
# Create 3 empty lists
ask_posts = []
show_posts = []
other_posts = []

for row in hn_minus_header:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Length of Ask HN posts = ' + str(len(ask_posts)))
print('Length of Show HN posts = ' + str(len(show_posts)))
print('Length of Other posts = ' + str(len(other_posts)))

print('\nHere are 5 rows of Ask HN:')
explore_data(ask_posts, starting_row, ending_row, False, False)
print('\nHere are 5 rows of Show HN:')
explore_data(show_posts, starting_row, ending_row, False, False)

Length of Ask HN posts = 1744
Length of Show HN posts = 1162
Length of Other posts = 17194

Here are 5 rows of Ask HN:
['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']
['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']
['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']
['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']

Here are 5 rows of Show HN:
['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']
['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'd
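
The prefix matching used above can be sketched in isolation. This is a minimal illustration, assuming a hypothetical helper named `classify_post` (not part of the original notebook). The sample titles are adapted from the rows shown earlier, with the casing of one changed to show that the match is case-insensitive.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # lowercase first so the match ignores case
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Illustrative titles only; the second one is deliberately upper-cased
sample_titles = [
    'Ask HN: How to improve my personal website?',
    'SHOW HN: Something pointless I made',
    'Interactive Dynamic Video',
]

labels = [classify_post(title) for title in sample_titles]
print(labels)
```

Because the check only looks at the start of the title, a post that mentions "Ask HN" later in its title would be counted as an other post.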

Next, let's determine whether **ask posts** or **show posts** receive more comments on average.

In [55]:
# Keep track of the total comments for Ask HN
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])  
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

# Keep track of the total comments for Show HN
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])  
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)
    

14.038417431192661
10.31669535283993
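
The two loops above repeat the same logic, so they could be folded into one function. The following is a sketch only: `average_comments` and its `comments_index` parameter are hypothetical names, and returning `0.0` for an empty list is an assumption made here to avoid a division by zero.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for rows that store it as a string."""
    if not posts:
        return 0.0  # assumption: an empty list averages to zero
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Two Ask HN rows copied from the output shown earlier (6 and 29 comments)
sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]

sample_average = average_comments(sample_posts)
print(sample_average)
```

With the full lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two loops.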


### Analysis of above data points:
* There are more **Ask HN** posts (1,744) than **Show HN** posts (1,162).
* **Ask HN** posts also receive more comments on average (about 14.04 vs. 10.32 for **Show HN**).
* We will focus on **Ask HN** posts for the rest of the analysis, since they are more numerous and attract more comments.

### Next, we'll determine if ask posts created at a certain time are more likely to attract comments. 
We'll use the following steps to perform this analysis:

* Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
* Calculate the average number of comments ask posts receive by hour created.

In [63]:
import datetime as dt

result_list = []
for posts in ask_posts:
    result_list.append([posts[6], int(posts[4])])
    
#Create 2 empty dictionaries
counts_by_hour = {}
comments_by_hour = {}

#Loop through result_list
date_format = "%m/%d/%Y %H:%M"

for result in result_list:
    created_date = result[0]
    comment = result[1]
    hour = dt.datetime.strptime(created_date, date_format).strftime("%H")
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
    
print(comments_by_hour)


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
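
The hour keys in the dictionary above come from parsing each `created_at` string. As a small sketch of that step on its own, the example below parses one timestamp taken from the first Ask HN row. The variable names `hour_label` and `hour_number` are illustrative and do not appear in the notebook.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"
created_at = '8/16/2016 9:55'  # timestamp from the first Ask HN row

# strptime accepts month, day, and hour values without leading zeros
parsed = dt.datetime.strptime(created_at, date_format)

hour_label = parsed.strftime("%H")  # zero-padded string, as used for the dictionary keys
hour_number = parsed.hour           # the same hour as an integer

print(hour_label, hour_number)
```

String keys such as `'09'` sort correctly as text because they are zero-padded. Integer hours would work as well if numeric comparisons were needed later.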


Next, we'll use the **counts_by_hour** and **comments_by_hour** dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [57]:
avg_comments_by_hour = []

for hour in comments_by_hour:
    avg_comments_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
avg_comments_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

### Now we will make the list above easier to read by sorting it from the highest to the lowest average.

In [58]:
swap_avg_per_hour = []

for hour in avg_comments_by_hour:
    swap_avg_per_hour.append([hour[1], hour[0]])
    
print(swap_avg_per_hour)

sorted_swap = sorted(swap_avg_per_hour, reverse=True)
print('\n')
print(sorted_swap)


[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'],
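
Swapping the columns is one way to sort by the average. Another option, sketched here as an alternative rather than as the notebook's method, is to pass a `key` function to `sorted` so the `[hour, average]` pairs keep their original order. The list `avg_sample` is a hypothetical four-row excerpt of the averages computed above.

```python
# A few [hour, average] pairs copied from avg_comments_by_hour
avg_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort on the second element (the average), highest first
sorted_by_avg = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print(hour, avg)
```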

In [59]:
# Print the 5 hours with the highest average comments
# (sorted_swap is already sorted in descending order)
print('\nTop 5 Hours for Ask HN Post Comments:')

for avg, hour in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post.".format(
            dt.datetime.strptime(hour, "%H").strftime("%H:%M"), avg
        )
    )



Top 5 Hours for Ask HN Post Comments:
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


According to our sample dataset, posting a question to **Ask HN** during the 15:00 (3 pm) hour gives it the best chance of attracting comments. Posts created in the 15:00 hour receive about 62% more comments on average than posts created in the 02:00 hour, the next-best hour.
According to the [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts/home), the timestamps are in the Eastern time zone.
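
The comparison between the 15:00 and 02:00 hours can be checked with a short calculation. This is only a worked example using the two averages printed earlier. The variable names are illustrative.

```python
avg_15 = 38.5948275862069    # average comments per post, 15:00 hour
avg_02 = 23.810344827586206  # average comments per post, 02:00 hour

# Relative difference, expressed as a percentage of the 02:00 average
percent_more = (avg_15 / avg_02 - 1) * 100

print("{:.1f}% more comments on average".format(percent_more))
```

The result is roughly 62%, which matches the figure quoted above.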

# Conclusion
In this project, we completed the following tasks:
* Read in a sample of 20,100 rows of the Hacker News dataset
* Separated the posts into **Ask HN**, **Show HN** and **Other**
* Determined that **Ask HN** had more posts than **Show HN**
* Determined the average number of comments per post for **Ask HN** and **Show HN**. **Ask HN** had the higher average.
* Focused on the **Ask HN** posts and calculated the average number of comments per post for each hour of the day.
* Sorted the hourly **Ask HN** averages in descending order.
* Finally, displayed the top 5 hours for **Ask HN** posts, based on average number of comments.

Our recommendation for getting the highest visibility for an **Ask HN** post, measured by average comments, is to create the post between 15:00 and 15:59 Eastern time. Note that we excluded posts that did not have any comments.