# Guided Project: Exploring Hacker News Posts


## 1. Introduction

In this project, we'll work with a data set of submissions to the popular technology site Hacker News.

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

You can find the data set here, but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

Below are descriptions of the columns:

id: The unique identifier from Hacker News for the post

title: The title of the post

url: The URL that the post links to, if the post has a URL

num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes

num_comments: The number of comments that were made on the post

author: The username of the person who submitted the post

created_at: The date and time at which the post was submitted
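
Because each row is read as a plain list, the code later in this notebook reaches columns by position (for example `row[4]` for `num_comments`). Here is a minimal sketch of how those positions follow from the column order listed above. The `column_index` dictionary and the sample row are illustrative assumptions, not part of the original notebook.

```python
# Column names in the order they appear in the file's header row.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper mapping: column name -> position in each row.
column_index = {name: position for position, name in enumerate(headers)}

# Made-up row shaped like the data set's rows (all values are strings).
sample_row = ['1', 'Ask HN: Example question?', '', '5', '3', 'someone', '9/26/2016 3:26']

title = sample_row[column_index['title']]
num_comments = int(sample_row[column_index['num_comments']])

print(column_index['num_points'], column_index['num_comments'], column_index['created_at'])
print(title, num_comments)
```

Looking columns up by name is optional; it mainly guards against off-by-one mistakes when indexing rows by hand.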

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

We'll compare these two types of posts to determine the following:

Do Ask HN or Show HN posts receive more comments on average?
Do posts created at a certain time receive more comments on average?


## 2. Removing Headers from a List of Lists

In [15]:
# We want to first import the libraries that we'll be using


In [19]:
from csv import reader

# We want to import the file and convert it to a list of lists
opened_file = open('HN_posts_year_to_Sep_26_2016.csv')
read_file = reader(opened_file)
hn = list(read_file)
opened_file.close()

# The first row holds the column names; the rest is data
headers = hn[0]
hn = hn[1:]
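
One possible variation on the cell above, sketched with an in-memory sample so it runs without the CSV file: a `with` block closes the file handle automatically, and `next()` pulls the header row off the reader before the remaining rows are collected. The sample text is made up for illustration. With the real file, `open('HN_posts_year_to_Sep_26_2016.csv', newline='')` would take the place of `io.StringIO(sample_csv)`.

```python
import io
from csv import reader

# Made-up two-row sample in the same column layout as the data set.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,\"Ask HN: Commas, quotes and CSV?\",,5,3,someone,9/26/2016 3:26\n"
    "2,Show HN: A tiny project,http://example.com,7,1,someone_else,9/26/2016 3:24\n"
)

with io.StringIO(sample_csv) as opened_file:
    read_file = reader(opened_file)
    headers = next(read_file)   # first row holds the column names
    hn = list(read_file)        # every remaining row is data

print(headers)
print(len(hn))
print(hn[0][1])
```

The quoted title shows why `csv.reader` is used instead of splitting each line on commas: the comma inside the quotes stays part of the title.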

In [20]:
# We want to create a function that can be used repeatedly to print rows
# in a readable fashion.

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print()  # blank line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


In [22]:
explore_data(hn, 0, 5)
print('\n')
print(headers)

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']

['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']

['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']

['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']

['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']



['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


## 3. Extracting Ask HN and Show HN posts

Since we're only interested in posts whose titles begin with Ask HN or Show HN, we want to filter our data. We'll create new lists of lists containing just the rows for those titles.


In [25]:
# We want to use the string method startswith
# given a string object we can check if it starts with a string of interest
# for example
print('dataquest'.startswith('Data'))
print('dataquest'.startswith('data'))

False
True
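
Because `startswith` is case-sensitive, titles written as `ask hn` or `ASK HN` would be missed by a check for `'Ask HN'`. Here is a small sketch of lowercasing the title first, which is the approach the next cell takes. The sample titles are made up.

```python
# Made-up titles with inconsistent capitalization.
titles = [
    'Ask HN: How do you learn?',
    'ask hn: any book tips?',
    'Show HN: My app',
    'Asking for a friend',
]

for title in titles:
    lowered = title.lower()
    # startswith also accepts a tuple of prefixes and returns True if any match.
    is_ask_or_show = lowered.startswith(('ask hn', 'show hn'))
    print(title, '->', is_ask_or_show)

matches = [t for t in titles if t.lower().startswith(('ask hn', 'show hn'))]
```

The last title is a reminder that the prefix has to match in full: "Asking" begins with "ask" but not with "ask hn".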


In [27]:
# we want to create 3 empty lists
ask_posts = []
show_posts = []
other_posts = []

# we loop through each row and append it to one of the three lists
# based on what its title starts with
for row in hn:
    title = row[1].lower()
    
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [28]:
# we see how many items each list has
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822


## 4. Calculating the Average Number of Comments for Ask HN and Show HN Posts

In [33]:
# we need to find total number of comments in ask posts

total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

# we do the same with show posts

total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.393478498741656
4.886099625910612


Ask posts receive more comments on average: about 10.4 per post versus about 4.9 for show posts, a little more than twice as many.
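
The two loops above differ only in the list they read, so they could be folded into one helper. Here is a sketch under that assumption. `average_of_column` is a hypothetical name, and the sample rows are made up rather than taken from the data set.

```python
def average_of_column(rows, column):
    """Return the mean of an integer-valued column, or 0.0 for an empty list."""
    if not rows:
        return 0.0  # avoids ZeroDivisionError when no rows matched
    total = sum(int(row[column]) for row in rows)
    return total / len(rows)

# Made-up rows in the data set's layout; index 4 is num_comments.
sample_ask = [
    ['1', 'Ask HN: A?', '', '2', '10', 'a', '9/26/2016 3:26'],
    ['2', 'Ask HN: B?', '', '1', '4', 'b', '9/26/2016 4:10'],
]
sample_show = [
    ['3', 'Show HN: C', 'http://example.com', '9', '3', 'c', '9/26/2016 5:00'],
]

print(average_of_column(sample_ask, 4))   # 7.0
print(average_of_column(sample_show, 4))  # 3.0
```

Passing the column index as a parameter means the same helper would also work for `num_points` (index 3).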

## 5. Finding the Number of Ask Posts and Comments by Hour Created

We want to focus on ask posts because they're more likely to receive comments.

Question: Are ask posts created at a certain time more likely to attract comments?

We'll answer this question by:
1. Calculating the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculating the average number of comments ask posts receive by hour created.


We'll start with step 1.

In [36]:
# we will use the datetime module to work with the data

# we can use the datetime.strptime() constructor to parse dates stored
# as strings and return datetime objects

#we first import the library

import datetime as dt

# we start by testing it
date_1_str = "December 24, 1984"
date_1_dt = dt.datetime.strptime(date_1_str, "%B %d, %Y")
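
The test above parses a spelled-out date, while the `created_at` column uses a different layout, such as `9/26/2016 3:26`. Here is a short sketch of parsing that layout and pulling out the hour. The format string matches the one used in the next cell, and `strptime` accepts month, day and hour values without leading zeros for these directives.

```python
import datetime as dt

created_at = '9/26/2016 3:26'  # value taken from the first row printed earlier
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

print(parsed)                  # 2016-09-26 03:26:00
print(parsed.strftime('%H'))   # zero-padded hour as a string: 03
print(parsed.hour)             # hour as an integer: 3
```

Using `strftime('%H')` gives two-character string keys such as `'03'`, which is why the dictionaries in the following output are keyed by strings and not integers.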

In [49]:
result_list = []
for row in ask_posts:
    result_list.append([row[6],row[4]])
    


counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    datetime_str = row[0]
    datetime_dt = dt.datetime.strptime(datetime_str, "%m/%d/%Y %H:%M")
    hour = datetime_dt.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = int(row[1])
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += int(row[1])
        
print(counts_by_hour)
print(comments_by_hour)
    

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
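
The `if`/`else` branch above exists only to initialize missing keys. As an alternative sketch, `collections.defaultdict(int)` starts every new key at 0, so both dictionaries can be updated unconditionally. The sample pairs are made up and follow the `[created_at, num_comments]` layout of `result_list`.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs.
sample_result_list = [
    ['9/26/2016 3:26', '4'],
    ['9/25/2016 3:59', '6'],
    ['9/25/2016 15:05', '20'],
]

counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample_result_list:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += int(num_comments)

print(dict(counts_by_hour))    # {'03': 2, '15': 1}
print(dict(comments_by_hour))  # {'03': 10, '15': 20}
```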


## 6. Calculating the Average Number of Comments for Ask HN Posts by Hour

We now want to calculate the average number of comments for posts created during each hour of the day.


In [47]:
# we're going to practice with an example

sample_dict = {
    'apple': 2,
    'banana': 4,
    'orange': 6
}

fruits = []

for fruit in sample_dict:
    fruits.append([fruit, 10*sample_dict[fruit]])

print(fruits)

[['apple', 20], ['banana', 40], ['orange', 60]]


In [51]:
avg_by_hour = []

for hour in counts_by_hour:
    avg = comments_by_hour[hour]/counts_by_hour[hour]
    avg_by_hour.append([hour, avg])

print(avg_by_hour)

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]


In [53]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
print(swap_avg_by_hour)

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]
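
Swapping the columns puts the average first so that `sorted()` orders the rows by it. Here is a sketch of an alternative that leaves the `[hour, average]` order intact by passing a `key` function. The three rows are copied from the `avg_by_hour` output above and rounded to two decimals.

```python
# Three [hour, average] rows taken from the output above, rounded to two decimals.
avg_by_hour = [['02', 11.14], ['15', 28.68], ['13', 16.32]]

# key tells sorted() which element of each row to compare.
sorted_by_avg = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, avg in sorted_by_avg:
    print('{}:00 {:.2f} average comments per post'.format(hour, avg))
```

Either approach gives the same ordering. The `key` version simply skips building a second list.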


In [66]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 hours for Ask Posts Comments")
for row in sorted_swap[0:5]:
    hr = row[1]
    avg = row[0]
    print("{hour}:00 {average:.2f} average comments per post".format(hour=hr, average=avg))

Top 5 hours for Ask Posts Comments
15:00 28.68 average comments per post
13:00 16.32 average comments per post
12:00 12.38 average comments per post
02:00 11.14 average comments per post
10:00 10.68 average comments per post


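Ask posts created at 15:00 receive the most comments on average, about 28.68 per post, well ahead of the next-best hour (13:00, with 16.32). By this measure, 15:00 is the best hour to create an ask post.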
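
An average can be pulled upward by a few posts with very many comments, so one possible follow-up check (not part of the original analysis) is to compare the mean with the median for an hour. The numbers below are made up purely to show the effect and say nothing about the actual data.

```python
from statistics import mean, median

# Made-up comment counts for one hour: mostly small, one very large thread.
comments_in_hour = [2, 3, 1, 4, 2, 500]

print(round(mean(comments_in_hour), 2))  # 85.33
print(median(comments_in_hour))          # 2.5
```

If the mean and median for an hour were far apart in the real data, the hourly ranking would deserve a closer look before drawing conclusions from it.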

## 8. Next Steps

Guided projects can be used to build a portfolio to showcase to potential employers, so we encourage you to keep working on this. Here are some next steps for you to consider:

Determine if show or ask posts receive more points on average.

Determine if posts created at a certain time are more likely to receive more points.

Compare your results to the average number of comments and points other posts receive.

Use Dataquest's data science project style guide to format your project.


In [71]:
total_ask_points = 0
for row in ask_posts:
    total_ask_points += int(row[3])
    
avg_ask_points = total_ask_points / len(ask_posts)
print(avg_ask_points)

total_show_points = 0
for row in show_posts:
    total_show_points += int(row[3])
    
avg_show_points = total_show_points / len(show_posts)
print(avg_show_points)

# We can conclude that show posts receive more points on average

11.31174089068826
14.843571569206537


In [77]:
point_result_list = []
for row in show_posts:
    point_result_list.append([row[6],row[3]])


show_counts_by_hour = {}
points_by_hour = {}

for row in point_result_list:
    datetime_str = row[0]
    datetime_dt = dt.datetime.strptime(datetime_str, "%m/%d/%Y %H:%M")
    hour = datetime_dt.strftime("%H")
    if hour not in show_counts_by_hour:
        show_counts_by_hour[hour] = 1
        points_by_hour[hour] = int(row[1])
    else:
        show_counts_by_hour[hour] += 1
        points_by_hour[hour] += int(row[1])
        
print(show_counts_by_hour)
print(points_by_hour)


    

{'00': 276, '23': 319, '20': 525, '19': 556, '18': 656, '16': 801, '14': 696, '10': 323, '09': 302, '08': 316, '06': 192, '03': 206, '21': 430, '17': 761, '15': 836, '11': 402, '07': 236, '04': 194, '13': 610, '12': 516, '01': 247, '22': 377, '02': 209, '05': 172}
{'00': 4291, '23': 5060, '20': 6948, '19': 8928, '18': 9935, '16': 11487, '14': 10503, '10': 4303, '09': 3762, '08': 4640, '06': 3071, '03': 2168, '21': 5990, '17': 10563, '15': 11657, '11': 7742, '07': 3303, '04': 2707, '13': 10381, '12': 10787, '01': 2931, '22': 5026, '02': 2764, '05': 1834}


In [78]:
avg_points_by_hour = []

for hour in show_counts_by_hour:
    avg = points_by_hour[hour]/show_counts_by_hour[hour]
    avg_points_by_hour.append([hour, avg])
    
swap_avg_points_by_hour = []
for row in avg_points_by_hour:
    swap_avg_points_by_hour.append([row[1],row[0]])
    
print(swap_avg_points_by_hour)
    


[[15.547101449275363, '00'], [15.862068965517242, '23'], [13.234285714285715, '20'], [16.057553956834532, '19'], [15.144817073170731, '18'], [14.340823970037453, '16'], [15.09051724137931, '14'], [13.321981424148607, '10'], [12.456953642384105, '09'], [14.683544303797468, '08'], [15.994791666666666, '06'], [10.524271844660195, '03'], [13.930232558139535, '21'], [13.88042049934297, '17'], [13.94377990430622, '15'], [19.258706467661693, '11'], [13.995762711864407, '07'], [13.95360824742268, '04'], [17.018032786885247, '13'], [20.905038759689923, '12'], [11.866396761133604, '01'], [13.331564986737401, '22'], [13.224880382775119, '02'], [10.662790697674419, '05']]


In [79]:
sorted_swap_points = sorted(swap_avg_points_by_hour, reverse=True)

print("Top 5 hours for Show Posts Points")
for row in sorted_swap_points[0:5]:
    hr = row[1]
    avg = row[0]
    print("{hour}:00 {average:.2f} average points per post".format(hour=hr, average=avg))

Top 5 hours for Show Posts Points
12:00 20.91 average points per post
11:00 19.26 average points per post
13:00 17.02 average points per post
19:00 16.06 average points per post
06:00 15.99 average points per post


Show posts created close to noon receive the most points on average: the top three hours are 12:00, 11:00, and 13:00.
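
The ask-post and show-post analyses repeat the same steps with a different column. Here is a sketch of a single function that covers both, assuming the row layout used throughout this notebook. `average_by_hour` is a hypothetical name, and the sample rows are made up.

```python
import datetime as dt

def average_by_hour(posts, value_column, date_column=6):
    """Return [hour, average] rows, highest average first."""
    totals = {}
    counts = {}
    for row in posts:
        hour = dt.datetime.strptime(row[date_column], '%m/%d/%Y %H:%M').strftime('%H')
        totals[hour] = totals.get(hour, 0) + int(row[value_column])
        counts[hour] = counts.get(hour, 0) + 1
    averages = [[hour, totals[hour] / counts[hour]] for hour in counts]
    return sorted(averages, key=lambda row: row[1], reverse=True)

# Made-up rows; index 3 is num_points, index 4 is num_comments.
sample_posts = [
    ['1', 'Show HN: A', 'http://example.com/a', '10', '2', 'a', '9/26/2016 12:05'],
    ['2', 'Show HN: B', 'http://example.com/b', '30', '4', 'b', '9/25/2016 12:40'],
    ['3', 'Show HN: C', 'http://example.com/c', '6', '9', 'c', '9/25/2016 5:15'],
]

print(average_by_hour(sample_posts, 3))  # [['12', 20.0], ['05', 6.0]]
print(average_by_hour(sample_posts, 4))  # [['05', 9.0], ['12', 3.0]]
```

With the real lists, `average_by_hour(ask_posts, 4)[:5]` and `average_by_hour(show_posts, 3)[:5]` would correspond to the two top-five tables, assuming the lists are built as in the earlier cells.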