# Exploring Hacker News Posts

We will use Hacker News data for our analysis. [Hacker News](https://news.ycombinator.com/) is a social news website focusing on computer science and entrepreneurship. It was started by the startup incubator [Y Combinator](https://www.ycombinator.com/). User-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

# Finding and Understanding the Data Set

We took the data set of Hacker News posts from [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts/data#).

The data set contains Hacker News posts from the 12 months up to September 26, 2016. It has 7 columns, and details about them can be found on the [same page](https://www.kaggle.com/hacker-news/hacker-news-posts/data#).

For this guided project and analysis, it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.
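The reduction described above (drop submissions without comments, then sample at random) can be sketched as follows. This is only an illustration under stated assumptions: the helper name `reduce_posts`, the fixed seed and the toy rows are hypothetical, and the exact procedure used to build the 20,000-row file is not documented here.

```python
import random

def reduce_posts(rows, comments_index, sample_size, seed=0):
    """Keep rows with at least one comment, then draw a random sample.

    Hypothetical helper: the real sampling procedure is not known.
    """
    with_comments = [row for row in rows if int(row[comments_index]) > 0]
    rng = random.Random(seed)
    # Never ask for more rows than are available
    k = min(sample_size, len(with_comments))
    return rng.sample(with_comments, k)

# Toy rows in the form [id, num_comments]
toy_rows = [["1", "0"], ["2", "3"], ["3", "0"], ["4", "7"], ["5", "1"]]
sample = reduce_posts(toy_rows, comments_index=1, sample_size=2)
print(len(sample))
```

Because rows without comments are removed before sampling, every average computed later in this project describes only posts that received at least one comment.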

# Approach Used

Since we will analyse the data set in several ways and reuse the same code multiple times, I have created a number of functions in the project. Every time we repeat a task, we can call the existing function instead of writing the same lines again. The functions I created, with their parameters and return types, are listed below:

* explore_data(file, start, end) - prints the rows in a slice of the data set, followed by the total number of rows and columns **returns None**
    - file - list of lists that we want to display
    - start - starting index
    - end - ending index (exclusive)


* avg_of_col(dataset, index) - finds the average of a numeric column, such as comments or points, across posts **returns the average as a float**
    - dataset - data set for which you want the average of a particular column
    - index - integer index of the column to average


* value_by_hour(dataset, index) - counts the posts created in each hour and totals a numeric column (comments or points) for each hour **returns 2 dictionaries**


* avg_by_hour(dataset, index) - finds the average comments or points per post by hour; it calls value_by_hour internally **returns a list of lists**
    - dataset, index - passed through to value_by_hour, which returns the posts-per-hour dictionary and the per-hour totals


* swap_sort(dataset) - swaps each [hour, average] pair to [average, hour] and sorts the result in descending order before we display it **returns a list of lists**

# Opening the Data Set

Let's start by downloading the .csv file, opening it and saving it as a list of lists.

I have also saved the header and data separately for analysis purposes.

In [163]:
#Import library to read csv file

from csv import reader

#Open, read and save csv file as list of lists
#The with block closes the file once it has been read

with open("C:\\Users\\itika\\Desktop\\Python\\Project 2 - Exploring Hacker News Posts\\hacker_news.csv", encoding = "utf8") as opened_file:
    read_file = reader(opened_file)
    hn_data = list(read_file)

#Separating headers and rest of the data

hn_header = hn_data[0]
hn = hn_data[1:]

# Display and explore the Data Set

I am writing a function to explore and display the data set. If we need to look into the data set later, we can reuse this function.

In [164]:
#Function to explore data set
def explore_data(file, start, end):
    for row in file[start:end]:
        print(row)
        print("\n")
    print('Number of rows:', len(file))
    print('Number of columns:', len(file[0]))
    

In [165]:
#Print and understand the dataset
print(hn_header)
print('\n')
explore_data(hn, 0, 5)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


Number of rows: 20100
Number of columns: 7

# Defining our task and goal

### For our analysis, we are referring to the posts that start with either *Ask HN* or *Show HN*

* Ask HN - Users submit a post to ask the Hacker News community a specific question<br>
**Example -** [Ask HN: How do you take notes when reading a book?](https://news.ycombinator.com/item?id=23596471) 

* Show HN - Users submit a post to show the Hacker News community a project, product, or just generally something interesting
<br>**Example -** [Show HN: My game, Sky Fleet, just went up on Steam](https://news.ycombinator.com/item?id=23609709)

### We'll compare these two types of posts to determine the following:

* Do Ask HN or Show HN receive more comments on average?
* Do posts created at a certain time receive more comments on average?

# Extracting Ask HN and Show HN posts

Before we start analysing Ask HN and Show HN posts, we need to extract them into separate lists.

Let's start working on the data set by storing Ask HN, Show HN and other posts in three separate lists.

In [166]:
#Creating empty lists
ask_posts = []
show_posts = []
other_posts = []

#looping over our data to extract specific posts
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Number of Ask HN posts: ", len(ask_posts))
print("Number of Show HN posts: ", len(show_posts))
print("Number of other posts: ", len(other_posts))

Number of Ask HN posts:  1744
Number of Show HN posts:  1162
Number of other posts:  17194


### Let's start by exploring a few rows of the Ask HN and Show HN lists of lists

In [167]:
#Using the already created function 

print("Ask HN Posts:")
print("\n")
explore_data(ask_posts, 3, 6)

print("\n*********************************************************************************************\n")

print("Show HN Posts:")
print("\n")
explore_data(show_posts, 3, 6)

Ask HN Posts:


['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']


['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']


['10284812', 'Ask HN: Limiting CPU, memory, and I/O usage on a program for testing', '', '2', '1', 'zatkin', '9/26/2015 23:23']


Number of rows: 1744
Number of columns: 7

*********************************************************************************************

Show HN Posts:


['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11']


['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']


['11237259', 'Show HN: Run with Mark (Runkeeper only)', 'http://runwithmark.github.io/#/', '3', '3', 'ecesena', '3/7/2016 5:17']


Number of rows: 1162
Number of columns: 7

# Do Ask HN or Show HN receive more comments on average?

### Step 1 : Calculating the average number of comments for Ask HN and Show HN

Let's create a generic function that iterates over the list of lists and calculates the average of a column such as num_comments or num_points.

From our data set, we know that num_comments is at index 4.
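Rather than reading the position off the printed header, the index can also be looked up by column name. A minimal sketch, assuming the header row shown earlier; the variable names `example_header`, `comments_index` and `points_index` are only illustrative:

```python
# Header row as printed earlier in this notebook
example_header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# list.index returns the position of the first matching element
comments_index = example_header.index('num_comments')
points_index = example_header.index('num_points')

print(comments_index, points_index)  # 4 3
```

Looking the index up this way keeps the analysis working if the column order of the file ever changes.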

In [168]:
#Function to find the average of a particular column

def avg_of_col(dataset, index):
    total_of_col = 0
    for row in dataset:
        #Column values are read from the csv as strings, so convert before adding
        total_of_col += int(row[index])

    return total_of_col/len(dataset)

#Passing ask_posts and show_posts to the function to find average comments on both posts

avg_ask_comments = avg_of_col(ask_posts, 4)
avg_show_comments = avg_of_col(show_posts, 4)

#Print the average number of comments for both type of posts

print("Average number comments posted on Ask HN posts: ", avg_ask_comments)
print("Average number comments posted on Show HN posts: ", avg_show_comments)

Average number comments posted on Ask HN posts:  14.038417431192661
Average number comments posted on Show HN posts:  10.31669535283993


### Step 2 : Analysing the average number of comments for Ask HN and Show HN

From our calculations above, we can see that Ask HN posts receive more comments on average than Show HN posts: about 14.04 versus 10.32 per post, or around 36% more. This suggests that community users are more likely to comment on Ask posts than on Show posts.

Since Ask HN posts are more likely to receive comments, we will continue the rest of our analysis with Ask posts.
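The figure of roughly 36% is the relative difference between the two averages. A small sketch of that arithmetic, using the averages printed above; the helper name `percent_more` is hypothetical and not part of the project's functions:

```python
def percent_more(a, b):
    """Return how much larger a is than b, as a percentage of b."""
    return (a - b) / b * 100

# Averages printed by the cell above
avg_ask = 14.038417431192661
avg_show = 10.31669535283993

ask_vs_show = percent_more(avg_ask, avg_show)
print(round(ask_vs_show))  # 36
```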

# Do posts created at a certain time receive more comments on average?

To determine whether an Ask post created at a certain time of day is more likely to get comments, we need to analyse the data further.

To find the average number of comments per post by hour, we first need the number of Ask posts created in each hour and the total number of comments those posts received. Dividing the second by the first gives the average number of comments per post for each hour.

Once we have the average comments per post for each hour, we can sort them and analyse the results.

Let's work on this step by step:

## Calculate the average number of comments Ask posts receive by hour created

To accomplish this task, we will create two functions, one calling the other. The first function (avg_by_hour) finds the average number of comments per post by hour; it calls a second function (value_by_hour) to find the number of posts and the total comments for each hour.

In [169]:
#Import datetime module as dt

import datetime as dt

#Function to count posts created per hour and total a corresponding column per hour

def value_by_hour(dataset, index):
    
    #Instantiating empty dictionaries

    posts_by_hour = {}
    col_value_by_hour = {}

    for row in dataset:
        hour = row[6]
        hour_dt = dt.datetime.strptime(hour, "%m/%d/%Y %H:%M")
        hour_of_day = hour_dt.strftime("%H")
        if hour_of_day not in posts_by_hour:
            posts_by_hour[hour_of_day] = 1
            col_value_by_hour[hour_of_day] = int(row[index])
        else:
            posts_by_hour[hour_of_day] += 1
            col_value_by_hour[hour_of_day] += int(row[index])
    
    return posts_by_hour, col_value_by_hour

#Function to find average comments by hour
#This function calls value_by_hour to get posts by hour and comments by hour

def avg_by_hour(dataset, index):
    #Use a different name for the result list so it does not shadow the function name
    averages = []
    
    posts_by_hour, total_by_hour = value_by_hour(dataset, index)

    #Iterate over keys to find average
    for hour in total_by_hour:
        value = total_by_hour[hour]
        posts_count = posts_by_hour[hour]
        avg_per_post = value/posts_count
        averages.append([hour,avg_per_post])
    
    return averages
   
#Calling the function to find average comments for ask posts by hour
avg_comments_by_hour = avg_by_hour(ask_posts, 4)
                                  
avg_comments_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]
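The same per-hour average can be computed in a single function with `collections.defaultdict`, which removes the need for the `if ... not in` check. This is only an alternative sketch: the name `avg_by_hour_compact`, the `date_index` parameter and the toy rows are assumptions for illustration, not part of the project's code.

```python
from collections import defaultdict
import datetime as dt

def avg_by_hour_compact(dataset, index, date_index=6):
    """Average of column `index` per hour of creation (hypothetical variant)."""
    totals = defaultdict(int)   # hour -> sum of the column
    counts = defaultdict(int)   # hour -> number of posts
    for row in dataset:
        created = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")
        counts[hour] += 1
        totals[hour] += int(row[index])
    return [[hour, totals[hour] / counts[hour]] for hour in totals]

# Toy rows with the same seven columns as the data set
toy_posts = [
    ['1', 'Ask HN: a', '', '1', '4', 'u1', '8/2/2016 14:20'],
    ['2', 'Ask HN: b', '', '1', '6', 'u2', '8/3/2016 14:45'],
    ['3', 'Ask HN: c', '', '1', '3', 'u3', '9/26/2015 23:23'],
]
result = avg_by_hour_compact(toy_posts, 4)
print(result)  # [['14', 5.0], ['23', 3.0]]
```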

## Sorting, printing and analysing the results

Before we analyse our data to see whether posts created at a particular time received more comments, we need to print the average comments per post by hour in a readable manner.

We will use the steps below:

1. Swap the columns of the list returned by avg_by_hour and sort the swapped list in descending order so that the highest averages come first, i.e. [comments, hour]

2. Print the new list in a formatted way, e.g. 15:00: 38.59 average comments per post

### Step 1 : Swap and Sort 

Below is the function, which takes any data set of [hour, average] pairs and returns it swapped and sorted in descending order.

In [170]:
#Function for Swap and Sort average by hour

def swap_sort(dataset):
    swap_avg_by_hour = []
    for row in dataset:
        swap_avg_by_hour.append([row[1], row[0]])
        
    sorted_swap = sorted(swap_avg_by_hour, reverse = True)
    
    return sorted_swap

sorted_swap_comments = swap_sort(avg_comments_by_hour)

sorted_swap_comments
    

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
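The swap is needed only because `sorted` compares the first element of each inner list by default. As an alternative sketch, the `key` argument of `sorted` can sort on the average directly and leave the [hour, average] order untouched. The function name `sort_by_average` and the three sample rows are illustrative assumptions:

```python
def sort_by_average(avg_by_hour_rows):
    """Sort [hour, average] pairs by average, highest first, without swapping."""
    return sorted(avg_by_hour_rows, key=lambda pair: pair[1], reverse=True)

# Three rows taken from the output above
sample_rows = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]
top = sort_by_average(sample_rows)
print(top[0])  # ['15', 38.5948275862069]
```

One difference to be aware of: when two hours have exactly the same average, swap_sort breaks the tie by hour, while this variant keeps the rows in their original order.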

### Step 2 : Print Top 5 Hours for Ask Posts with highest average Comments

For this, we will use the str.format() method.


In [171]:
#Print top 5 hours with highest average number of comments per post

print("****************Top 5 Hours for 'Ask HN' Comments****************")
for avg, hr in sorted_swap_comments[:5]:
    print(
        "At {} hour, there were {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

****************Top 5 Hours for 'Ask HN' Comments****************
At 15:00 hour, there were 38.59 average comments per post
At 02:00 hour, there were 23.81 average comments per post
At 20:00 hour, there were 21.52 average comments per post
At 16:00 hour, there were 16.80 average comments per post
At 21:00 hour, there were 16.01 average comments per post


# Conclusion Part 1

In the results above, we can see that posts created during the 15:00 hour are likely to receive the most comments, with an average of 38.59 comments per post. This is about 62% more than posts created during the 02:00 hour, which has the second-highest average.

According to the documentation of the data set, the times are in the Eastern time zone. Since I am working from the Pacific time zone, I can say that posts created at 12:00 PM PST (3:00 PM EST) have a higher chance of receiving comments.
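The conversion from Eastern to Pacific time is a fixed three-hour shift, which can be sketched as below. The helper name `eastern_hour_to_pacific` is hypothetical, and the sketch assumes both zones are on the same daylight-saving schedule, so the difference is always three hours.

```python
def eastern_hour_to_pacific(hour_str):
    """Shift an Eastern 'HH' hour label back three hours, wrapping at midnight."""
    pacific_hour = (int(hour_str) - 3) % 24
    return "{:02d}:00".format(pacific_hour)

print(eastern_hour_to_pacific("15"))  # 12:00
print(eastern_hour_to_pacific("02"))  # 23:00
```

Note that early-morning Eastern hours such as 02:00 fall on the previous calendar day in Pacific time.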

In this project, we analysed data for Hacker News posts. Instead of using the whole data set, we used a sample that excludes posts without comments; the full data set could give different results altogether. We then compared the average number of comments for Ask HN and Show HN posts. Based on our results, we worked with Ask HN posts to find whether posts created in any particular hour received more comments on average than those created in any other hour.

Given our analysis, it would be more accurate to say that Ask HN posts created between 12:00 PM and 1:00 PM PST received the most comments on average.

# Part 2 : Some Additional Coding and Analysis for fun

## Determine if show or ask posts receive more points on average.

To find the average num_points received by Ask HN and Show HN posts, we will reuse the existing function **avg_of_col(dataset, index)**, this time passing the index of the num_points column (3).

In [157]:
#Passing ask_posts and show_posts to the function to find average num_points on both posts 
avg_ask_points = avg_of_col(ask_posts, 3)

avg_show_points = avg_of_col(show_posts, 3)

print("Average num_points posted on Ask HN posts: ", avg_ask_points)
print("Average num_points posted on Show HN posts: ", avg_show_points)

Average num_points posted on Ask HN posts:  15.061926605504587
Average num_points posted on Show HN posts:  27.555077452667813


Based on the results above, we can see that Show HN posts received about 83% more points than Ask HN posts on average. So, although Ask HN posts received more comments on average, Show HN posts were more liked by users.

This makes complete sense to me: as the names suggest, in Ask HN posts users are looking for answers to their queries, hence the comments, while Show HN posts are created to showcase work that a user did and are more likely to receive upvotes.

## Determine if posts created at a certain time are more likely to receive more points.

Since Show HN posts have more points on average, we will use Show HN posts to check whether they are more likely to receive upvotes at a certain time.

To determine that, we will use the existing functions:

**value_by_hour(dataset, index)** to find the number of Show HN posts by hour and the number of points received by hour<br><br>
**avg_by_hour(dataset, index)** to find the average points received per post by hour

In [172]:
#passing the show posts data to the function avg_by_hour
#the function calls value_by_hour internally and returns the average points per post by hour

avg_points_by_hour = avg_by_hour(show_posts, 3)

avg_points_by_hour

[['14', 25.430232558139537],
 ['22', 40.34782608695652],
 ['18', 36.31147540983606],
 ['07', 19.0],
 ['20', 30.316666666666666],
 ['05', 5.473684210526316],
 ['16', 28.322580645161292],
 ['19', 30.945454545454545],
 ['15', 28.564102564102566],
 ['03', 25.14814814814815],
 ['17', 27.107526881720432],
 ['06', 23.4375],
 ['02', 11.333333333333334],
 ['13', 24.626262626262626],
 ['08', 15.264705882352942],
 ['21', 18.425531914893618],
 ['04', 14.846153846153847],
 ['11', 33.63636363636363],
 ['12', 41.68852459016394],
 ['23', 42.388888888888886],
 ['09', 18.433333333333334],
 ['01', 25.0],
 ['10', 18.916666666666668],
 ['00', 37.83870967741935]]

Let's swap, sort and print the results.

For the swap and sort, we will use the function **swap_sort(dataset)**.

In [173]:
#Calling the function to swap and sort the average point by hour

sorted_swap_points = swap_sort(avg_points_by_hour)

sorted_swap_points

[[42.388888888888886, '23'],
 [41.68852459016394, '12'],
 [40.34782608695652, '22'],
 [37.83870967741935, '00'],
 [36.31147540983606, '18'],
 [33.63636363636363, '11'],
 [30.945454545454545, '19'],
 [30.316666666666666, '20'],
 [28.564102564102566, '15'],
 [28.322580645161292, '16'],
 [27.107526881720432, '17'],
 [25.430232558139537, '14'],
 [25.14814814814815, '03'],
 [25.0, '01'],
 [24.626262626262626, '13'],
 [23.4375, '06'],
 [19.0, '07'],
 [18.916666666666668, '10'],
 [18.433333333333334, '09'],
 [18.425531914893618, '21'],
 [15.264705882352942, '08'],
 [14.846153846153847, '04'],
 [11.333333333333334, '02'],
 [5.473684210526316, '05']]

In [174]:
#Print top 5 hours with highest average number of points per post

print("****************Top 5 Hours for 'Show HN' Points****************")
for avg, hr in sorted_swap_points[:5]:
    print(
        "At {} hour, there were {:.2f} average upvotes per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

****************Top 5 Hours for 'Show HN' Points****************
At 23:00 hour, there were 42.39 average upvotes per post
At 12:00 hour, there were 41.69 average upvotes per post
At 22:00 hour, there were 40.35 average upvotes per post
At 00:00 hour, there were 37.84 average upvotes per post
At 18:00 hour, there were 36.31 average upvotes per post


From the results above, we can say that on average Show posts received the most points or upvotes when created during the 23:00 hour EST (11:00 PM EST). Also, the top 3 hours have almost the same average number of upvotes per post.

### Compare our results to the average number of comments and points other posts receive

We have the average number of comments and points for Ask HN and Show HN from the earlier analysis.

Let's first find the average number of comments and points for other posts. We will use the existing function **avg_of_col(dataset, index)** for that.


In [161]:
#Passing other posts data to find average number of comments on other posts
avg_other_comments = avg_of_col(other_posts, 4)

#Passing other posts data to find average number of points on other posts
avg_other_points = avg_of_col(other_posts, 3)



In [162]:
#Display average number of comments received by Show HN, Ask HN and Other posts

print("######################Average Number of comments for all the posts######################")
print("\n")
print("On an average Show HN posts received {:.2f} comments per post".format(avg_show_comments))
print("On an average Ask HN posts received {:.2f} comments per post".format(avg_ask_comments))
print("On an average Other posts received {:.2f} comments per post".format(avg_other_comments))
print("\n\n")

#Display average number of points received by Show HN, Ask HN and Other posts

print("######################Average Number of Upvotes for all the posts######################")
print("\n")
print("Average number of upvotes received by Show HN posts are {:.2f}".format(avg_show_points))
print("Average number of upvotes received by Ask HN posts are {:.2f}".format(avg_ask_points))
print("Average number of upvotes received by Other posts are {:.2f}".format(avg_other_points))

######################Average Number of comments for all the posts######################


On an average Show HN posts received 10.32 comments per post
On an average Ask HN posts received 14.04 comments per post
On an average Other posts received 26.87 comments per post



######################Average Number of Upvotes for all the posts######################


Average number of upvotes received by Show HN posts are 27.56
Average number of upvotes received by Ask HN posts are 15.06
Average number of upvotes received by Other posts are 55.41


The results above show that although Ask posts received more comments than Show posts, and Show posts received more upvotes than Ask posts, other posts received far more comments and points per post than Ask and Show posts combined: 26.87 comments per post, which is around 52% of the sum of the three category averages, and 55.41 points per post, or around 57% of that sum.
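The 52% and 57% figures appear to be the "other" average expressed as a share of the sum of the three category averages. A minimal sketch of that calculation, using the rounded averages printed above; the helper name `share_of_total` and the dictionary names are hypothetical:

```python
def share_of_total(value, all_values):
    """Return value as a percentage of the sum of all_values."""
    return value / sum(all_values) * 100

# Rounded averages printed by the cell above
comment_avgs = {"show": 10.32, "ask": 14.04, "other": 26.87}
point_avgs = {"show": 27.56, "ask": 15.06, "other": 55.41}

other_comment_share = share_of_total(comment_avgs["other"], comment_avgs.values())
other_point_share = share_of_total(point_avgs["other"], point_avgs.values())

print(round(other_comment_share), round(other_point_share))  # 52 57
```

These shares compare per-post averages, not total comment or point counts, so they do not say what fraction of all comments or upvotes went to other posts.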

# Final Conclusion

To summarize, other posts received more comments and upvotes per post than Ask and Show posts together (about 52% and 57% of the summed category averages, respectively). Ask posts received more comments per post than Show posts, but Show posts received more upvotes than Ask posts.

It would be a good idea to post an Ask HN between 12:00 PM and 1:00 PM PST, since that was the hour with the highest average number of comments per post. Also, the best time to create a Show HN post is between 9:00 AM and 10:00 AM or between 7:00 PM and 9:00 PM PST; these are the hours when Show posts received the most upvotes on average.