# Hacker News Website Analysis

Hacker News is a website started by Y Combinator, a startup incubator. Users submit posts (also called stories), which other users can vote and comment on. The site is popular in technology circles, and popular posts can receive hundreds of thousands of visitors.

The data set can be downloaded at this [link](https://www.kaggle.com/hacker-news/hacker-news-posts/download), with an explanation of the data set found [here](https://www.kaggle.com/hacker-news/hacker-news-posts#HN_posts_year_to_Sep_26_2016.csv).

Here is an explanation of the column names and the data each contains:

- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted

In our analysis we are only interested in posts whose titles begin with Ask HN or Show HN. The overall goal of our analysis is to compare these two types of posts to figure out:

- Do Ask HN or Show HN receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Let's begin by cleaning our data set. 

## Cleaning the Data Set

The original data set contains about 300,000 rows. For our purposes, we will reduce it to 20,000 rows by removing all rows without comments and then selecting a random sample from the remaining posts.

Let's begin by reading in the data set. 

In [4]:
# Read In Hacker News Data #

from csv import reader

opened_file = open('hacker_news.csv', encoding='utf8')

read_file = reader(opened_file)

hn = list(read_file)

hn_header = hn[0]

hn_data = hn[1:]
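The cell above leaves the file open after reading it. A minimal sketch of the same read using a `with` block, which closes the file automatically, is shown here. The helper name `read_csv_rows()` and the small temporary file are assumptions made so the example runs on its own; they are not part of the original analysis.

```python
import csv
import os
import tempfile

def read_csv_rows(path):
    """Return (header, rows) from a CSV file, closing the file afterwards."""
    with open(path, encoding='utf8', newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Tiny stand-in file (hypothetical contents) so the sketch runs on its own.
sample_text = "id,title,num_comments\n1,Ask HN: Example?,3\n2,Some story,0\n"
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf8', newline='') as tmp:
    tmp.write(sample_text)
    tmp_path = tmp.name

demo_header, demo_rows = read_csv_rows(tmp_path)
os.remove(tmp_path)  # clean up the temporary file

print(demo_header)
print(demo_rows)
```

With the real data, the call would presumably be `read_csv_rows('hacker_news.csv')`.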

Let's do some preliminary data exploration to see how many rows and columns the data set has, find which column contains the number of comments, and explore the first few rows.

We will first define a function `explore_data()` that will allow us to explore the data set.  

In [6]:
# Define Function explore_data() #

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:   # if you specify True in parameters will print number of rows and columns, default is False
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        print('\n')

We will now print the column names and explore the first 5 rows of data.

In [9]:
# Column Names #

print(hn_header)
print('\n')


# Explore First 5 Rows #

explore_data(hn_data, 0, 5, True)  # Specifying True will print out number of rows and columns

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']


Number of rows: 293119
Number of columns: 7

We can see that the column of interest is `num_comments`, found at index 4 (or -3). Our data contains 293,119 rows and 7 columns. In the five rows shown, `num_comments` is equal to 0.

Let's proceed to remove rows with no comments. Before comparing each value, we will convert it to an integer, as it is currently stored as a string.

In [10]:
# For Loop to Remove Rows with 0 Comments #

hn_clean = []

for row in hn_data:
    
    num_comm = int(row[4])  # change type from str to int
    
    if num_comm != 0: 
        hn_clean.append(row)
        #for loop will add rows with more than 0 comments to hn_clean defined above
        
        
# Explore Clean Data #

explore_data(hn_clean, 0, 5, True)

['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']


['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']


['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26']


['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54']


['12578624', 'Phone Makers Could Cut Off Drivers. So Why Dont They?', 'http://www.nytimes.com/2016/09/25/technology/phone-makers-could-cut-off-drivers-so-why-dont-they.html', '4', '1', 'danso', '9/26/2016 1:37']


Number of rows: 80401
Number of columns: 7




We can see that our number of rows has decreased significantly to 80,401. Also, we can see that none of the rows have a 0 for the column `num_comments`. 
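The same filter can be written more compactly as a list comprehension. The sketch below uses three hypothetical rows in the data set's column order rather than the real file, and the constant name `NUM_COMMENTS` is an assumption introduced for readability.

```python
# Hypothetical rows in the same column order as the data set.
sample_rows = [
    ['1', 'First story', 'http://example.com/a', '1', '0', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: Example question?', '', '4', '7', 'user_b', '9/26/2016 2:53'],
    ['3', 'Second story', 'http://example.com/b', '2', '1', 'user_c', '9/26/2016 1:54'],
]

NUM_COMMENTS = 4  # index of the num_comments column

# Keep only rows whose comment count, converted to int, is greater than 0.
with_comments = [row for row in sample_rows if int(row[NUM_COMMENTS]) > 0]

print(len(with_comments))
```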

Let's proceed to extract a random sample of 20,000 rows. 

In [13]:
# Get Random Sample of 20,000 from hn_clean #

import random # module we need to perform random sampling

hn_20k = random.sample(hn_clean, 20000)

# Explore hn_20k #

explore_data(hn_20k, 0, 5, True)

['10609350', "Show HN: Spamaltman  a Twitter bot based on Sam Altman's blog posts", 'https://twitter.com/spamaltman', '7', '1', 'moakq', '11/22/2015 7:23']


['10645686', 'Bilingual people are twice as likely to recover from a stroke, study finds', 'http://www.sciencealert.com/bilingual-people-are-twice-as-likely-to-recover-from-a-stroke-study-finds', '132', '56', 'hunglee2', '11/29/2015 19:21']


['10681122', 'Intel Talks Thunderbolt 3', 'http://spectrum.ieee.org/computing/hardware/intel-talks-thunderbolt-3', '2', '1', 'ingve', '12/5/2015 6:32']


['12309590', 'Bus1: a new Linux interprocess communication proposal', 'https://lwn.net/SubscriberLink/697191/d5803573a8c5b84c/', '106', '133', 'broodbucket', '8/18/2016 0:22']


['12079257', 'PyCon: Everybody Pays', 'http://jessenoller.com/blog/2011/05/25/pycon-everybody-pays', '1', '1', 'pmoriarty', '7/12/2016 14:27']


Number of rows: 20000
Number of columns: 7




As you can see, our data set `hn_20k` contains 20,000 rows, and each row still has its seven columns in the original order.
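Because `random.sample()` draws a different sample on every run, the counts and averages reported in the rest of this analysis will vary slightly unless the random number generator is seeded. A minimal sketch of seeding, assuming an arbitrary seed value of 1 and a stand-in list in place of `hn_clean`:

```python
import random

population = list(range(100))  # hypothetical stand-in for hn_clean

random.seed(1)                 # the seed value 1 is an arbitrary choice
first_draw = random.sample(population, 5)

random.seed(1)                 # resetting the same seed repeats the draw
second_draw = random.sample(population, 5)

print(first_draw == second_draw)
```

The original analysis did not set a seed, so rerunning it will not necessarily reproduce the exact figures shown here.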

We are only interested in posts with titles beginning with either Ask HN or Show HN.

Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:

- Ask HN: How to improve my personal website?
- Ask HN: Am I the only one outraged by Twitter shutting down share counts?
- Ask HN: Aby recent changes to CSS that broke mobile?

Users also submit Show HN posts to show the Hacker News community a project, product, or something interesting. Below are a few examples:

- Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
- Show HN: Something pointless I made
- Show HN: Shanhu.io, a programming playground powered by e8vm

We will continue by isolating these posts. 

## Isolating Ask HN and Show HN Posts

The overall goal of our analysis is to compare these two types of posts to figure out:

- Do Ask HN or Show HN receive more comments on average?
- Do posts created at a certain time receive more comments on average?

To isolate the posts, we will use the string method `startswith()`. Given a string we can check if it starts with a certain input.

In [14]:
# Example of string method startswith() #

print('dataquest'.startswith('Data'))
print('dataquest'.startswith('data'))

False
True


As you can see, the first check returned False because 'dataquest' does not begin with 'Data', while the second returned True because it does begin with 'data'. The comparison is case-sensitive.

We can control for case by using the string method `lower()` which will return a lowercase version of the string. 

In [15]:
print('DataQuest'.lower())

dataquest


Let us now continue to isolate posts beginning with Ask HN or Show HN by using both the `startswith()` and `lower()` methods. 

In [23]:
# Create Lists to Isolate Each Type of Post #

ask_posts = []
show_posts = []
other_posts = []

# For Loop to Isolate Posts #

for row in hn_20k:
    
    title = row[1]  # index where the title can be found
    
    title = title.lower() # make each title lowercase 
    
    # Append each row to its respective list
    
    if title.startswith('ask hn'):
        ask_posts.append(row)
        
    elif title.startswith('show hn'):
        show_posts.append(row)
        
    else:
        other_posts.append(row)
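The classification logic in the loop above can also be packaged as a small function, which makes it easy to check on individual titles. This is only a sketch; `post_type()` is a hypothetical helper name, not part of the original analysis.

```python
def post_type(title):
    """Classify a title as 'ask', 'show', or 'other' (hypothetical helper)."""
    lowered = title.lower()  # normalize case before comparing prefixes
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

print(post_type('Ask HN: How to improve my personal website?'))
print(post_type('SHOW HN: Something pointless I made'))
print(post_type('Intel Talks Thunderbolt 3'))
```

Lowercasing first means titles such as "ASK HN" or "ask hn" are treated the same way as "Ask HN".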

Now that we have isolated each type of post to its corresponding list, let's explore each.

In [24]:
# Explore ask_posts List #

explore_data(ask_posts, 0, 5, True)

['11483846', 'Ask HN: Is there a command line domain register?', '', '2', '3', 'andrewfromx', '4/12/2016 21:45']


['12478587', "Ask HN: What to look for in a job candidate's GitHub profile", '', '9', '9', 'noaclpo', '9/12/2016 10:50']


['10851088', 'Ask HN: Why do companies struggle to find relevant insights in business data?', '', '3', '3', 'cneumann81', '1/6/2016 15:49']


['11784291', "Ask HN: What's your blog?", '', '12', '22', 'sebg', '5/27/2016 6:33']


['11262125', 'Ask HN: If a previous startup failed with your same idea, is that a bad sign?', '', '5', '7', 'johndoe786', '3/10/2016 20:41']


Number of rows: 1704
Number of columns: 7




We can see that we have a total of 1,704 posts beginning with Ask HN.

Let's continue to explore the `show_posts` list.

In [25]:
# Explore show_posts List #

explore_data(show_posts, 0, 5, True)

['10609350', "Show HN: Spamaltman  a Twitter bot based on Sam Altman's blog posts", 'https://twitter.com/spamaltman', '7', '1', 'moakq', '11/22/2015 7:23']


['12153481', 'Show HN: A Game Boy emulator in C++', 'https://github.com/Dooskington/GameLad', '37', '2', 'dooskington', '7/24/2016 15:05']


['10382596', 'Show HN: Easily Add a CMS to Your JavaScript App', 'https://github.com/cosmicjs/cosmicjs-node', '17', '5', 'tonyspiro', '10/13/2015 18:50']


['12318724', 'Show HN: Preempt Web Attacks', '', '4', '3', 'malleablebyte', '8/19/2016 8:44']


['10192567', 'Show HN: Nodal. An opinionated, full-featured API server for Node 4.0', 'https://github.com/keithwhor/nodal', '13', '2', 'keithwhor', '9/9/2015 16:56']


Number of rows: 1267
Number of columns: 7




We can see that we have a total of 1,267 posts beginning with Show HN.

Next we will determine which type of posts received the most comments. 

## More Comments: Ask or Show Posts?

We will continue by analyzing the average number of comments received by each type of post. 

First, we will need to figure out the total number of comments each type of post received then average it out. 

In [31]:
# Total Number of Comments for Ask HN Posts #

total_ask_comm = 0

for row in ask_posts:
    
    num_comm = int(row[4])  # make sure row is of int type to perform operations on it
    
    total_ask_comm += num_comm
    
# Compute average number of comments #

avg_ask_comm = round(total_ask_comm/len(ask_posts), 2)

print(total_ask_comm)
print(avg_ask_comm)

23601
13.85


We have a total of 23,601 comments for Ask HN posts, with an average of 13.85 comments per post. 

Let's proceed with Show HN posts.

In [32]:
# Total Number of Comments for Show HN Posts #

total_show_comm = 0

for row in show_posts:
    
    num_comm = int(row[4])  # make sure row is of int type to perform operations on it
    
    total_show_comm += num_comm
    
# Compute average number of comments #

avg_show_comm = round(total_show_comm/len(show_posts), 2)

print(total_show_comm)
print(avg_show_comm)

12554
9.91


For Show HN posts we have a total of 12,554 comments with an average of 9.91 comments per post.
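The two cells above differ only in the list they loop over, so the calculation could be shared. A minimal sketch, assuming a hypothetical helper named `comment_stats()` and made-up rows in which only the comment column is filled in:

```python
def comment_stats(rows, comments_index=4):
    """Return (total, average) comments for a list of rows.

    Hypothetical helper; comments_index=4 matches the num_comments column.
    """
    total = sum(int(row[comments_index]) for row in rows)
    # Guard against an empty list to avoid dividing by zero.
    average = round(total / len(rows), 2) if rows else 0.0
    return total, average

# Hypothetical rows: only index 4 (num_comments) carries a value.
demo_rows = [
    ['', '', '', '', '3', '', ''],
    ['', '', '', '', '7', '', ''],
    ['', '', '', '', '22', '', ''],
]
print(comment_stats(demo_rows))
```

Applied to the real lists, this would be called as `comment_stats(ask_posts)` and `comment_stats(show_posts)`.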

From our analysis we can see that Ask HN posts receive on average about 4 more comments per post than Show HN posts (13.85 compared to 9.91). Ask HN posts also have a larger total: 23,601 comments compared to 12,554 for Show HN posts. Our sample contains about 440 more Ask HN posts, which helps explain the gap in total comments but not the gap in the average, since the average already accounts for the number of posts.

The difference in the number of posts could be due to chance in our random sample of 20,000, or it could reflect that Hacker News users submit more Ask HN posts in general. We can check this by repeating the analysis on all posts that received at least one comment (`hn_clean`).

In [36]:
# Create Lists to Isolate Each Type of Post in hn_clean (All Posts with Comments) #

ask_posts_all = []
show_posts_all = []
other_posts_all = []

# For Loop to Isolate Posts #

for row in hn_clean:
    
    title = row[1]  # index where the title can be found
    
    title = title.lower() # make each title lowercase 
    
    # Append each row to its respective list
    
    if title.startswith('ask hn'):
        ask_posts_all.append(row)
        
    elif title.startswith('show hn'):
        show_posts_all.append(row)
        
    else:
        other_posts_all.append(row)
        
        
# Total Number of Comments for Ask HN Posts #

total_ask_comm_all = 0

for row in ask_posts_all:
    
    num_comm = int(row[4])  # make sure row is of int type to perform operations on it
    
    total_ask_comm_all += num_comm
    
# Compute average number of comments #

avg_ask_comm_all = round(total_ask_comm_all/len(ask_posts_all), 2)

print('Ask HN Comms:', total_ask_comm_all)
print('Avg Ask HN Comms:', avg_ask_comm_all)
print('Total Ask HN Posts:', len(ask_posts_all))
print('\n')
        
        
# Total Number of Comments for Show HN Posts #

total_show_comm_all = 0

for row in show_posts_all:
    
    num_comm = int(row[4])  # make sure row is of int type to perform operations on it
    
    total_show_comm_all += num_comm
    
# Compute average number of comments #

avg_show_comm_all = round(total_show_comm_all/len(show_posts_all), 2)

print('Show HN Comms:', total_show_comm_all)
print('Avg Show HN Comms:', avg_show_comm_all)
print('Total Show HN Posts:', len(show_posts_all))

Ask HN Comms: 94986
Avg Ask HN Comms: 13.74
Total Ask HN Posts: 6911


Show HN Comms: 49633
Avg Show HN Comms: 9.81
Total Show HN Posts: 5059


We see that the same overall pattern holds for all posts with at least one comment. On average, Ask HN posts receive about 4 more comments per post than Show HN posts, and there are about 1,850 more Ask HN posts than Show HN posts.

One possible explanation is that Hacker News users prefer to submit Ask HN posts because they tend to attract more comments, although this data alone cannot confirm it.

To take it a step further, does the time of day you post an Ask HN submission affect the number of comments you will receive?

## Does Time of Post Affect Number of Comments?

Since we found that on average, Ask HN posts receive more comments, we will focus the rest of our analysis on only these posts. 

We want to determine if the time at which a submission is posted affects the number of comments it receives. For this we will:

- Calculate the number of Ask HN posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments Ask HN posts receive by hour created.

We will first calculate the number of posts and comments received per hour. To do this, we will need to use the `datetime` module to work with the `created_at` column of our data set.

Because the dates in our data set are stored as strings, we will use the `datetime.strptime()` method to parse them and return datetime objects. An example is shown below.

In [37]:
# Import datetime module #

import datetime as dt  # alias as dt for less typing

date_1_str = "December 24, 1984"
date_1_dt = dt.datetime.strptime(date_1_str, "%B %d, %Y")
print(date_1_dt)

1984-12-24 00:00:00


As you can see our string date is now a datetime object. We will use the same methodology to create datetime objects for the dates in the `created_at` column, and create dictionaries `counts_by_hour` and `comments_by_hour` to perform our analysis. 

In [42]:
# Create empty list to hold 2-element lists: 1) time post created 2) number of comments of post

result_list = []

# For loop for populating result_list

for row in ask_posts:
    
    create_time = row[6]  # index for 'created_at' column (also can use index -1)
    
    num_comm = int(row[4])  # make sure row is of int type to perform operations on it
    
    temp_list = [create_time, num_comm]  # temp list to store values
    
    result_list.append(temp_list)  # append temp list values to result list
    
    
# Create dictionaries to store comments and posts by hour

counts_by_hour = {}
comments_by_hour = {}

# For loop to populate dictionaries

for row in result_list:
    
    num_comm = row[1]  # index where we find number of comments 
    
    str_time = row[0]  # index where we find 'created_at' time, stored as str_time
    
    dt_time = dt.datetime.strptime(str_time, "%m/%d/%Y %H:%M")  # parsing str_time string into datetime object
    
    hour = dt_time.strftime("%H")  # isolating the hour from the datetime object
    
    # if statements to populate dictionaries
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comm
        
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comm
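As an alternative sketch of the same bookkeeping, the hour can be read directly from the parsed object's `.hour` attribute, and `collections.defaultdict` removes the need for the `if`/`else` branch. The pairs below are hypothetical values in the data set's date format, and the `demo_` names are assumptions used to avoid clashing with the variables above.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs in the data set's date format.
demo_pairs = [
    ['9/26/2016 2:53', 7],
    ['9/12/2016 10:50', 9],
    ['5/27/2016 2:33', 22],
]

demo_counts = defaultdict(int)    # missing hours start at 0
demo_comments = defaultdict(int)

for created_at, num_comm in demo_pairs:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.hour            # integer 0-23 instead of a string such as '02'
    demo_counts[hour] += 1
    demo_comments[hour] += num_comm

print(dict(demo_counts))
print(dict(demo_comments))
```

Note that this variant keys the dictionaries by integer hours, whereas the cell above uses zero-padded strings such as '02'.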
    

We will now use the `counts_by_hour` and `comments_by_hour` dictionaries to create a list of lists `avg_by_hour` which will store lists containing two elements: 

- hour of the day 
- average number of comments for that hour rounded to two decimals

In [55]:
# Create list avg_by_hour to store values

avg_by_hour = []

# For loop to populate avg_by_hour list

for hour in counts_by_hour:
    
    avg_by_hour.append([hour, round(comments_by_hour[hour]/counts_by_hour[hour], 2)])

# Print results of for loop

for hour in avg_by_hour:
    print(hour)

['21', 12.36]
['10', 14.12]
['15', 52.38]
['06', 9.29]
['20', 17.13]
['12', 9.23]
['09', 6.24]
['04', 10.57]
['18', 9.39]
['23', 7.35]
['19', 8.87]
['13', 17.73]
['14', 9.7]
['01', 7.98]
['16', 13.75]
['00', 9.23]
['22', 8.76]
['17', 17.51]
['03', 10.14]
['02', 9.45]
['08', 7.98]
['11', 11.43]
['07', 6.87]
['05', 6.81]


As you can see, we now have the average number of comments for each hour, but the unsorted output is difficult to read.

Let's finish by sorting this list and displaying the five hours with the highest average number of comments.

We will begin by making a new list of lists `swap_avg_by_hour` which will store lists containing:

- Average number of comments per hour
- Hour of the day

Basically, just swapping the elements of the `avg_by_hour` list of lists. 

In [58]:
# Create swap_avg_by_hour list

swap_avg_by_hour = []

for hour in avg_by_hour:
    
    swap_avg_by_hour.append([hour[1], hour[0]])
    
for avg_comm in swap_avg_by_hour:    
    print(avg_comm)

[12.36, '21']
[14.12, '10']
[52.38, '15']
[9.29, '06']
[17.13, '20']
[9.23, '12']
[6.24, '09']
[10.57, '04']
[9.39, '18']
[7.35, '23']
[8.87, '19']
[17.73, '13']
[9.7, '14']
[7.98, '01']
[13.75, '16']
[9.23, '00']
[8.76, '22']
[17.51, '17']
[10.14, '03']
[9.45, '02']
[7.98, '08']
[11.43, '11']
[6.87, '07']
[6.81, '05']


We will now create a list `sorted_swap` which will sort the `swap_avg_by_hour` list in descending order by using the `sorted()` function. 

In [70]:
# Create sorted_swap list

sorted_swap = sorted(swap_avg_by_hour, reverse = True)  # reverse = True specifies descending order

print("Top 5 Hours for Ask Posts Comments")

for avg, hour in sorted_swap[0:5]:
    
    str_time = hour
    
    dt_time = dt.datetime.strptime(str_time, "%H")
    
    dt_hour = dt_time.strftime("%H:%M")
    
    output = "{h}: {num} average comments per post".format(h = dt_hour, num = avg)
    
    print(output)
    

Top 5 Hours for Ask Posts Comments
15:00: 52.38 average comments per post
13:00: 17.73 average comments per post
17:00: 17.51 average comments per post
20:00: 17.13 average comments per post
10:00: 14.12 average comments per post
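The swap step is one way to make `sorted()` order by the average. Another option, sketched below, is to pass a `key` function so the original `[hour, average]` pairs can be sorted directly. The list holds a few pairs copied from the `avg_by_hour` output above, and the `demo_` name is an assumption.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
demo_avg_by_hour = [['21', 12.36], ['10', 14.12], ['15', 52.38], ['13', 17.73]]

# Sort on the second element (the average), largest first, without swapping.
by_average = sorted(demo_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in by_average[:3]:
    print("{h}:00: {a:.2f} average comments per post".format(h=hour, a=avg))
```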


## Hacker News Ask Posts Analysis Conclusion

As you can see, 3:00 p.m., 1:00 p.m., 5:00 p.m., 8:00 p.m., and 10:00 a.m. are the top five hours (in that order) for posting if you would like to receive a high number of comments on your post.

We should note, though, that Ask HN posts created at 3:00 p.m. average about three times as many comments as posts created in any of the other top five hours.

So, if your goal is for an Ask HN submission to receive a high number of comments, this sample suggests that posting at around 3:00 p.m. is your best bet!
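One caveat when reading these averages: a mean can be pulled up by a small number of heavily discussed posts, so it may be worth comparing it with the median before relying on the 3:00 p.m. result. The sketch below illustrates the effect with hypothetical comment counts, not values taken from the data set.

```python
import statistics

# Hypothetical comment counts for posts created in a single hour.
demo_counts_for_hour = [2, 3, 5, 4, 250]

# The single large value raises the mean far above the typical post.
print(statistics.mean(demo_counts_for_hour))
print(statistics.median(demo_counts_for_hour))
```

If the mean and median for an hour differ this much in the real data, the high average is likely driven by a few outlier posts rather than by the hour itself.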