# Exploring Hacker News Posts by Pavel Gladkevich
This project was completed as part of the Data Analyst series of [Dataquest](https://www.dataquest.io/directory/) on 05/31/19 
<br/><br/>
**Goal:** To analyze posts from the [Hacker News Site](https://news.ycombinator.com/), a popular technology-related forum where users submit stories that are voted on and discussed in comment sections. Users of this site can submit ask, show, job, and other posts. In our analysis we will first look at whether ask or show posts get more comments, and then at what time of day posts in the more popular category receive the most comments. 
<br/><br/>
Thus, we will do a per-hour analysis of posts and comments to identify the kind of post that receives the greatest response rate. The data was obtained from the Kaggle repository: [Hacker News Dataset](https://www.kaggle.com/hacker-news/hacker-news-posts/downloads/hacker-news-posts.zip/1). It contains a year's worth of data collected from September 2015 to September 2016. We will reduce the dataset from roughly 300,000 posts to the approximately 80,000 posts that have comments. The timestamps were recorded in US Eastern time (EST/EDT) and will be converted to my local timezone of PST/PDT.
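
As a small illustration of the filtering step described above, here is a minimal sketch that keeps only the posts with at least one comment. It is not the code used in this project: the three sample rows are made up, and `csv.DictReader` is used only so the columns can be read by name.

```python
import csv
import io

# Hypothetical three-row sample in the same column layout as the dataset
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,4,7,alice,9/26/2016 2:53\n"
    "2,Example story,http://example.com,1,0,bob,9/26/2016 2:26\n"
    "3,Show HN: Example project,http://example.org,2,1,carol,9/26/2016 1:54\n"
)

# DictReader consumes the header row and yields one dict per post
rows = list(csv.DictReader(sample_csv))

# Keep only the posts that received at least one comment
with_comments = [row for row in rows if int(row["num_comments"]) > 0]

print(len(rows), "posts,", len(with_comments), "with comments")
```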


In [151]:
# Import reader, which iterates over the lines of the given csv file
from csv import reader

# Load the downloaded csv file of Hacker News data from the file path and
# turn it into list of lists format; the with block closes the file when done
with open("/Users/pgladkevich/Desktop/coding/projects/datasets/HN_posts_year_to_Sep_26_2016.csv") as open_file:
    h_list = list(reader(open_file))
print("The length of the original dataset",len(h_list))

# Keep only the entries that have at least one comment
# (the header row also passes this check and is removed in a later cell)
hn = []
for row in h_list:
    num_comments = row[4]
    if num_comments != '0':
        hn.append(row)
print("Dataset containing only posts with comments",len(hn))

The length of the original dataset 293120
Dataset containing only posts with comments 80402


In [152]:
# Function that prints the desired number of rows from an interval in the dataset and optionally the num rows/columns
def explore_data(dataset, start=1,end=5,r_and_c=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    
    if r_and_c:
        # Number of rows equals the number of entries (subtract 1 if a header row is included)
        print('Number of rows is ', len(dataset))
        # Number of columns equals the number of values in a single row (list of lists format)
        print('Number of columns is ', len(dataset[0]))

explore_data(hn,0,r_and_c=True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']


['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']


['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26']


['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54']


Number of rows is  80402
Number of columns is  7


In [153]:
# Remove header
headers = hn[0]

# Dataset without header
hn = hn[1:]

## Extracting Ask and Show Posts
We now have a dataset containing only posts with comments, so we can proceed to sort the posts into categories for analysis.

In [154]:
# Create lists to bin the different categories of posts
ask_posts, show_posts, other_posts = ([] for i in range(3))
posts = [ask_posts, show_posts, other_posts]

# Loop through the hn data and sort each post based off of its title
for row in hn:
    # Set the post's title in lowercase to a variable
    title = row[1].lower()
    
    # Append post data to list if ask post
    if title.startswith('ask hn'):
        ask_posts.append(row)
    # Append post data to list if show post
    elif title.startswith('show hn'):
        show_posts.append(row)
    # Append to other if neither
    else:
        other_posts.append(row)
        
# Check numbers of posts in each category
posts_s = ["ask_posts", "show_posts", "other_posts"]
for i in range(3):
    name = posts_s[i]
    print("Number of posts in ", name," is ", len(posts[i]))

Number of posts in  ask_posts  is  6911
Number of posts in  show_posts  is  5059
Number of posts in  other_posts  is  68431
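
The same title-prefix rule can also be wrapped in a small helper, which makes it easy to check against a few titles. This is only a sketch: `classify_post` and the sample titles are hypothetical names introduced here for illustration.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the start of the title."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical titles covering each category and mixed capitalization
sample_titles = [
    "Ask HN: What TLD do you use for local development?",
    "SHOW HN: An example project",
    "Saving the Hassle of Shopping",
]

labels = [classify_post(title) for title in sample_titles]
print(labels)
```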


## Calculating the Avg # of Comments
Now that we have the categories we are interested in, we can compute the average number of comments per post in each.

In [155]:
# Create variables to store comment numbers
total_ask_comments, total_show_comments = 0,0

# Iterate over ask posts to count comments on each
for row in ask_posts:
    # set number of comments to integer variable
    num_comments = int(row[4])
    #add to total
    total_ask_comments += num_comments
    
# Iterate over show posts to count comments on each
for row in show_posts:
    # Set number of comments to integer variable
    num_comments = int(row[4])
    # Add to total
    total_show_comments += num_comments
    
# Compute the average number of comments per post for both categories
num_ask_posts = len(ask_posts)
num_show_posts = len(show_posts)
avg_ask_comments = total_ask_comments/num_ask_posts
avg_show_comments = total_show_comments/num_show_posts

print("Average comments per ask and show post are", round(avg_ask_comments, 2),"and", round(avg_show_comments,2), "respectively." )

Average comments per ask and show post are 13.74 and 9.81 respectively.


It appears that the mean number of comments on ask posts (13.74) is noticeably higher than the mean on show posts (9.81), although no statistical test was run to confirm the difference. This result is intuitive, since an ask post invites responses by posing a question. While the number of people viewing the two categories of posts may or may not differ, I believe it is likely that with equal viewership ask posts would still have a greater response rate. 
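
One caveat when comparing means is that a few heavily commented posts can pull the average up. The sketch below uses made-up comment counts, not values from the dataset, to show how the mean and the median can diverge. Running the same comparison on `ask_posts` and `show_posts` would be a natural follow-up check.

```python
from statistics import mean, median

# Hypothetical comment counts: four quiet posts and one very popular post
comment_counts = [1, 2, 2, 3, 100]

avg_comments = mean(comment_counts)
mid_comments = median(comment_counts)

print("mean:", avg_comments, "median:", mid_comments)
```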

## Average Ask Posts and Comments by Hour
Because ask posts receive more comments on average, they will be the focus of the rest of this work: our goal is to identify the type and time of post that has the highest response rate. To this end we will use Python's datetime module to group the ask posts by the hour in which they were created, counting the posts and comments for each hour.

In [156]:
import datetime as dt

# Create a list of lists that will store the time and number of comments for each ask post
result_list =  []
for row in ask_posts:
    # Get the time
    created_at = row[6]
    # Get the num of comments as an integer
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

# Create two dictionaries that will be used to create hourly frequency tables
counts_by_hour, comments_by_hour = {}, {}

# Loop through result_list to populate dictionaries
for row in result_list:
    # Extract the hour from the date and create datetime object
    date = row[0]
    num_comments = row[1]
    dt_h = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(dt_h, "%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else: 
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments  

# Create list of lists with hour and average num comments per post in that hour
hourly_avg_comments = []

for hour in counts_by_hour:
    total_posts = counts_by_hour[hour]
    total_comments = comments_by_hour[hour]
    avg_by_hour = total_comments/total_posts
    hourly_avg_comments.append([hour,avg_by_hour])
    
print(hourly_avg_comments)    

[['02', 13.198237885462555], ['01', 9.367713004484305], ['22', 11.749128919860627], ['21', 11.056511056511056], ['19', 9.414285714285715], ['17', 13.73019801980198], ['15', 39.66809421841542], ['14', 13.153439153439153], ['13', 22.2239263803681], ['11', 11.143426294820717], ['10', 13.757990867579908], ['09', 8.392045454545455], ['07', 10.095541401273886], ['03', 10.160377358490566], ['16', 10.76144578313253], ['08', 12.43157894736842], ['00', 9.857142857142858], ['23', 8.322463768115941], ['20', 11.38265306122449], ['18', 10.789823008849558], ['12', 15.452554744525548], ['04', 12.688172043010752], ['06', 9.017045454545455], ['05', 11.139393939393939]]
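
The hourly tallies above can also be built with `collections.defaultdict`, which removes the need for the if/else that initializes each hour. This is an alternative sketch rather than the code used for the results. The three sample rows and the `sample_` variable names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical [created_at, num_comments] pairs in the dataset's date format
sample_posts = [
    ["8/16/2016 9:55", 6],
    ["8/16/2016 9:10", 4],
    ["11/22/2015 13:43", 29],
]

# Missing hours start at 0 automatically
sample_counts = defaultdict(int)
sample_comments = defaultdict(int)

for created_at, num_comments in sample_posts:
    # Parse the timestamp and keep only the zero-padded hour
    hour = datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

# Average number of comments per post for each hour
sample_avg = {hour: sample_comments[hour] / sample_counts[hour] for hour in sample_counts}
print(sample_avg)
```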


## Identifying Top 5 Times for Ask HN Posts by Comments
We now have the average number of comments per post for each hour, but we have to do some more work to get it into a readable format so that we can easily scan for the best times to post.

In [157]:
# Swap columns of list to prepare for sorting
swap_h_avg_comments = []
for hour in hourly_avg_comments:
    swap_h_avg_comments.append([hour[1],hour[0]])

# Sort the swapped list in descending order of average comments
sorted_swap = sorted(swap_h_avg_comments, reverse=True)

# Print the top 5 hours to post an Ask
print("Top 5 Hours for 'Ask HN' Comments")

# Iterate through the top 5 of the sorted list
for row in sorted_swap[:5]:
    # Shift the Eastern hour back 3 hours to PST/PDT; the modulo wraps
    # hours before 03:00 around to the previous evening (e.g. 02 -> 23)
    hour = (int(row[1]) - 3) % 24
    h_dt = dt.datetime.strptime(str(hour),'%H')
    
    # Convert it to string
    hour = dt.datetime.strftime(h_dt, '%H:%M')
    avg = row[0]
    
    # Create the string to be printed
    hour_avg = "{0} | {1:.2f} comments per post on average".format(hour, avg)
    print(hour_avg)   

Top 5 Hours for 'Ask HN' Comments
12:00 | 39.67 comments per post on average
10:00 | 22.22 comments per post on average
09:00 | 15.45 comments per post on average
07:00 | 13.76 comments per post on average
14:00 | 13.73 comments per post on average
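
Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` accepts a `key` function, so the `[hour, average]` pairs can be ordered without building a second list. The three sample pairs below are rounded values taken from the hourly averages above.

```python
# [hour, average comments per post] pairs, rounded from the hourly results
sample_hourly_avg = [["02", 13.20], ["15", 39.67], ["13", 22.22]]

# Sort by the second element (the average), largest first
by_avg_desc = sorted(sample_hourly_avg, key=lambda pair: pair[1], reverse=True)

for hour, avg in by_avg_desc:
    print("{0}:00 | {1:.2f}".format(hour, avg))
```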


The time in the left column is in 24-hour time PST/PDT, so 12:00 is 12 PM and 23:00 is 11 PM. It appears that the best time to post in order to receive the most comments is 12 PM PST/PDT, which is 3 PM EST/EDT. This is lunchtime on the US West Coast and mid-afternoon on the East Coast. In general the top three hours (9 AM, 10 AM, and 12 PM PST/PDT) fall between mid-morning and mid-afternoon across the US time zones.
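
The three-hour shift can also be done with timezone-aware datetimes instead of subtracting a fixed offset. The sketch below assumes Python 3.9 or later (for the standard-library `zoneinfo` module) and an available time zone database. The zone names `America/New_York` and `America/Los_Angeles` are my choice of representatives for Eastern and Pacific time, and the timestamp is one of the `created_at` values shown earlier.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

# Assumed representative zones for US Eastern and US Pacific time
eastern = ZoneInfo("America/New_York")
pacific = ZoneInfo("America/Los_Angeles")

# Parse a timestamp in the dataset's format and mark it as Eastern time
created_at = datetime.strptime("9/26/2016 3:13", "%m/%d/%Y %H:%M").replace(tzinfo=eastern)

# Convert the same instant to Pacific time
local_time = created_at.astimezone(pacific)
print(local_time.strftime("%m/%d/%Y %H:%M %Z"))
```

Because both zones switch between standard and daylight time on the same dates, the result matches the fixed three-hour offset used in the analysis.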

## Conclusion
We analyzed roughly 80,000 Hacker News posts that each had at least one comment. Our purpose was to identify whether an 'Ask HN' or a 'Show HN' post receives more comments, and when the best time is to post the winner of the two. For this dataset the ask and show posts received 13.74 and 9.81 comments on average respectively. Thus, for our goal of maximizing the likelihood of a high comment count, we focused on 'Ask HN' posts. When we performed a time analysis of the 'Ask HN' comments, the best time to post appeared to be between 9 AM and 12 PM PST/PDT, which is 12 PM to 3 PM EST/EDT. Thus, our results indicate that, among posts that received comments, the best response rate would be achieved with an 'Ask HN' post at 12:00 PM PST/PDT.