# **Exploring Hacker News Posts**

**Final project of the module *Python for Data Science : Intermediate*.**

In this project, we will explore user-submitted stories ("posts") on the site Hacker News. The stories are voted and commented upon, and the posts that make it to the top of the list can attract many visitors as a result.

The dataset we are going to analyse contains the following columns:

- **id:** identifier for the post;
- **title:** title of the post;
- **url:** URL that the post links to;
- **num_points:** number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes;
- **num_comments:** number of comments that were made on the post;
- **author:** username of the person who submitted the post;
- **created_at:** date and time when the post was submitted.

We are going to analyse the posts whose titles begin with either *Ask HN* or *Show HN*. Users submit *Ask HN* posts to ask the Hacker News community a question, and *Show HN* posts to show the community a project or something else they find interesting.

We will compare these two types of posts to determine the following:

- **Which one receives more comments on average?**
- **Do posts created at a certain time receive more comments on average?**

## Exploring the Data

We'll start by opening the dataset and exploring it.

In [1]:
# Read the dataset into a list of lists

from csv import reader

opened_file = open('HN_posts_year_to_Sep_26_2016.csv')
read_file = reader(opened_file)
hn = list(read_file)
hn_header = hn[0]
# Remove the header
hn = hn[1:]           
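
The cell above leaves the file open after reading. As a minimal sketch, assuming the same header-plus-rows layout, the split can be wrapped in a small helper that accepts any open file object, so it can be used inside a `with open(...)` block that closes the file automatically. The name `read_rows` and the in-memory sample are hypothetical and are not part of the original notebook.

```python
import io
from csv import reader

def read_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]

# In-memory sample standing in for the real CSV file (one row copied from the dataset)
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12578908,Ask HN: What TLD do you use for local development?,,4,7,Sevrene,9/26/2016 2:53\n"
)
sample_header, sample_rows = read_rows(sample_csv)
print(sample_header)
print(sample_rows[0])

# With the real file, the same helper could be called as:
# with open('HN_posts_year_to_Sep_26_2016.csv', newline='') as f:
#     hn_header, hn = read_rows(f)
```

Passing `newline=''` to `open()` is what the `csv` module documentation recommends when reading CSV files.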

In [2]:
# explore_data() prints a slice of rows and, optionally, the number of rows and columns of a dataset

def explore_data(dataset, start, end, rows_and_columns = False):
    
    data_sample = dataset[start:end]
    
    for row in data_sample:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
# Explore the hn data and print the first four rows

print(hn_header)
print('\n')
explore_data(hn, 0, 4, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


Number of rows: 293119
Number of columns: 7


## Extracting Ask HN and Show HN posts

In [4]:
# Filter the data and select only the posts that begin with Ask HN or Show HN

ask_posts = []    # list of lists with the posts that begin with Ask HN
show_posts = []   # list of lists with the posts that begin with Show HN
other_posts = []  

for row in hn:
    
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
        
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    
    else:
        other_posts.append(row)
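
The filtering rule in the loop above can also be expressed as a small function, which makes it easy to check on individual titles. This is only a sketch of the same case-insensitive prefix test; the name `classify_title` is hypothetical and does not appear in the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the title's prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Titles taken from the rows printed in this notebook
labels = [
    classify_title('Ask HN: Why join a fund when you can be an angel?'),
    classify_title('Show HN: Finding puns computationally'),
    classify_title('algorithmic music'),
]
print(labels)
```

Because the test only looks at the start of the title, a post that merely mentions "Ask HN" later in its title is counted as an other post.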

In [5]:
# Explore the ask_posts data and print the first four rows

print(hn_header)
print('\n')
explore_data(ask_posts, 0, 4, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']


['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']


['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57']


['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']


Number of rows: 9139
Number of columns: 7


In [6]:
# Explore the show_posts data and print the first four rows

print(hn_header)
print('\n')
explore_data(show_posts, 0, 4, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36']


['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01']


['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44']


['12577991', 'Show HN: Pomodoro-centric, heirarchical project management with ES6 modules', 'https://github.com/jakebian/zeal', '2', '0', 'dbranes', '9/25/2016 23:17']


Number of rows: 10158
Number of columns: 7


In [7]:
# Explore the other_posts data and print the first four rows

print(hn_header)
print('\n')
explore_data(other_posts, 0, 4, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


Number of rows: 273822
Number of columns: 7


## Discovering the most commented posts

Now we are going to determine whether *ask posts* or *show posts* receive more comments on average.

In [8]:
# Find the total and average number of comments on ask_posts

total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments/len(ask_posts)

print("Total number of comments on ask posts:", total_ask_comments)
print("Average number of comments on ask posts: {:.2f}".format(avg_ask_comments))

Total number of comments on ask posts: 94986
Average number of comments on ask posts: 10.39


In [9]:
# Find the total and average number of comments on show_posts

total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments = total_show_comments + num_comments
    
avg_show_comments = total_show_comments/len(show_posts)

print("Total number of comments on show posts:", total_show_comments)
print("Average number of comments on show posts: {:.2f}".format(avg_show_comments))

Total number of comments on show posts: 49633
Average number of comments on show posts: 4.89


By comparing the average number of comments determined above (10.39 for *ask posts* versus 4.89 for *show posts*), we conclude that the ***ask posts* receive, on average, more comments than *show posts***. Therefore, we will focus our remaining analysis only on the *ask posts*.
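
The two cells above repeat the same total-and-average calculation. As a sketch only, the shared logic could live in one helper; `average_comments` and its `comments_index` parameter are hypothetical names, and the sample below reuses the four ask posts printed earlier rather than the full dataset.

```python
def average_comments(posts, comments_index=4):
    """Return (total, average) of the comment counts stored as strings in each row."""
    total = sum(int(row[comments_index]) for row in posts)
    return total, total / len(posts)

# The four ask posts shown earlier in this notebook
sample_ask_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
    ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'],
]

sample_total, sample_avg = average_comments(sample_ask_posts)
print("Total: {}, average: {:.2f}".format(sample_total, sample_avg))
```

On the full lists, the same helper would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.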

## Determining whether ask posts created at a certain time attract more comments

We will determine if *ask posts* created at a certain time are more likely to attract comments by performing this analysis:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments per ask post by hour created.
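
The two steps above can be sketched on a small sample before running them on the full dataset. This is a compact illustration rather than the notebook's own implementation: it reads the hour from the parsed `datetime` as an integer (the cells below use zero-padded strings such as `'02'`), and the names ending in `_sketch` are hypothetical.

```python
import datetime as dt

# [created_at, num_comments] pairs taken from the four ask posts shown earlier
sample = [
    ["9/26/2016 2:53", 7],
    ["9/26/2016 1:17", 3],
    ["9/25/2016 22:57", 0],
    ["9/25/2016 22:48", 3],
]

counts_sketch = {}    # hour -> number of posts created in that hour
comments_sketch = {}  # hour -> total comments on those posts

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour  # integer from 0 to 23
    counts_sketch[hour] = counts_sketch.get(hour, 0) + 1
    comments_sketch[hour] = comments_sketch.get(hour, 0) + num_comments

# Average comments per post for each hour that appears in the sample
avg_sketch = {hour: comments_sketch[hour] / counts_sketch[hour] for hour in counts_sketch}
print(avg_sketch)
```

Using `dict.get(hour, 0)` avoids the separate branch for hours that have not been seen yet.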

In [10]:
# Calculate the number of ask posts and comments by hour created

import datetime as dt

result_list = []
counts_by_hour = {}    # number of ask posts created during each hour of the day
comments_by_hour = {}  # total number of comments received by the ask posts created during each hour


for row in ask_posts:
    
    num_comments = int(row[4])
    created_at = row[6]
    result_list.append([created_at, num_comments])
    


for row in result_list:
    
    date_time = row[0]
    comment = row[1]
    
    d1_dt = dt.datetime.strptime(date_time, "%m/%d/%Y %H:%M")
    time = d1_dt.time()

    # Select just the hour
    hour = time.strftime("%H")

    if hour not in counts_by_hour:   
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment

print("Hour : Number of ask posts")
print(counts_by_hour)
print("\n")
print("Hour : Number of comments on ask posts")
print(comments_by_hour)
    

Hour : Number of ask posts
{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}


Hour : Number of comments on ask posts
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


In [11]:
# Calculate the average number of comments per ask post for each hour of creation

avg_by_hour = list()

for k, v in counts_by_hour.items():
    avg = comments_by_hour[k]/v
    avg_by_hour.append([k, avg])

print("[Hour, Average number of comments per ask post]")
print(avg_by_hour)

[Hour, Average number of comments per ask post]
[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]


In [12]:
# Even though we have the results we need, the format above makes it hard to identify the hours with the highest values.

# Sort the list of lists and print the five highest values in a format that's easier to read.

swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
# sorted() is used to sort swap_avg_by_hour in descending order of average comments
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print("[Average number of comments per ask post, Hour]")
print(sorted_swap)
print("\n")

print("Top 5 Hours for Ask Posts Comments:")

# Keep only the first five entries
top_5 = sorted_swap[0:5]

for row in top_5:
    datetime_object = dt.datetime.strptime(row[1], "%H")
    time_format = datetime_object.strftime("%H:%M")
    final_string = "{}: {:.2f} average comments per post".format(time_format, float(row[0]))
    print(final_string) 
    

[Average number of comments per ask post, Hour]
[[28.676470588235293, '15'], [16.31756756756757, '13'], [12.380116959064328, '12'], [11.137546468401487, '02'], [10.684397163120567, '10'], [9.7119341563786, '04'], [9.692007797270955, '14'], [9.449744463373083, '17'], [9.190661478599221, '08'], [8.96474358974359, '11'], [8.804177545691905, '22'], [8.794258373205741, '05'], [8.749019607843136, '20'], [8.687258687258687, '21'], [7.948339483394834, '03'], [7.94299674267101, '18'], [7.713298791018998, '16'], [7.5647840531561465, '00'], [7.407801418439717, '01'], [7.163043478260869, '19'], [7.013274336283186, '07'], [6.782051282051282, '06'], [6.696793002915452, '23'], [6.653153153153153, '09']]


Top 5 Hours for Ask Posts Comments:
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


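We conclude that *ask posts* created during the hours listed above receive the most comments on average. The 15:00 hour stands out with 28.68 comments per post, well ahead of 13:00 (16.32) and 12:00 (12.38).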
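
As an alternative to swapping the columns before sorting, `sorted()` accepts a `key` function that selects the value to sort by. The sketch below applies it to a hypothetical four-hour subset of the averages computed above; the names `sample_avg_by_hour`, `ranked` and `top_2_lines` are illustrative only.

```python
# Subset of the [hour, average] pairs computed in this notebook
sample_avg_by_hour = [
    ["02", 11.137546468401487],
    ["15", 28.676470588235293],
    ["13", 16.31756756756757],
    ["09", 6.653153153153153],
]

# Sort by the average (second element of each pair), highest first
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

# Format the two highest entries of the subset
top_2_lines = [
    "{}:00: {:.2f} average comments per post".format(hour, avg)
    for hour, avg in ranked[:2]
]
for line in top_2_lines:
    print(line)
```

This keeps each pair in its original `[hour, average]` order, so no second list is needed.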