## Hacker News Project
This is a guided project from DataQuest that explores which kinds of posts on Hacker News receive the most comments, and whether posts created at a certain time receive more comments on average.

We're specifically interested in posts whose titles begin with either **Ask HN** or **Show HN**.
- Users submit **Ask HN** posts to ask the Hacker News community a specific question.
- Users submit **Show HN** posts to show the Hacker News community a project, a product, or something else of general interest.

The full dataset can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts), though this project uses an abridged version. DataQuest reduced it from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.

Below are descriptions of the columns: 

>**id**: Hacker News ID of the post    
>**title**: title of the post  
>**url**: the url of the item being linked to  
>**num_points**: the number of upvotes the post received  
>**num_comments**: the number of comments the post received  
>**author**: the name of the account that made the post  
>**created_at**: the date and time the post was made (the time zone is Eastern Time in the US)

In [51]:
from csv import reader

# open the dataset and save it as a list of lists; the with block closes the file afterwards
with open('hacker_news.csv', newline='') as opened_file:
    hn = list(reader(opened_file))

# separate the header row from the data rows
hn_header = hn[0]
hn = hn[1:]


In [52]:
# function from DataQuest to print a selection of rows from dataset
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line between rows
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(hn_header)
print('\n')
explore_data(hn, 0, 5, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


Number of rows: 20100
Number of columns: 7
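
The later cells refer to columns by position (`row[1]`, `row[4]`, `row[6]`). As a minimal sketch of how those positions relate to the header, the snippet below pairs the header with the first data row printed above. The names `column_index` and `record` are illustrative and not part of the original notebook.

```python
# Header and first data row, copied from the output above
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
first_row = ['12224879', 'Interactive Dynamic Video',
             'http://www.interactivedynamicvideo.com/', '386', '52',
             'ne0phyte', '8/4/2016 11:52']

# Map each column name to its list position
column_index = {name: position for position, name in enumerate(header)}

# Pair each column name with the matching value from the row
record = dict(zip(header, first_row))

print(column_index['title'], column_index['num_comments'], column_index['created_at'])
print(record['title'])
print(int(record['num_comments']))  # values are read as strings, so convert before doing arithmetic
```

Every value read by `csv.reader` is a string, which is why the cells below call `int()` on the comment counts.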

### Isolating the Posts of Interest 
Below we pull out the Ask HN posts and Show HN posts into separate lists for analysis.

In [53]:
ask_posts = []
show_posts = []
other_posts = []

# loop through the dataset and assign ask and show posts to specific lists
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(ask_posts[:3])
print('\n')
print(show_posts[:3])
print('\n')
print("Length of ask_posts: ",len(ask_posts))
print("Length of show_posts: ",len(show_posts))
print("Length of other_posts: ",len(other_posts))

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']]


[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']]


Length of ask_posts:  1744
Length of show_posts:  1162
Length of other_posts:  17194
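
The same prefix test can be wrapped in a small helper, which makes the role of `.lower()` easier to see: titles are matched regardless of capitalization. This is only a sketch; `classify_post` is a hypothetical name, and the sample titles are taken from the outputs shown earlier.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

sample_titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
    'ASK HN: a title in upper case still matches',  # made-up title to illustrate case handling
]

for title in sample_titles:
    print(classify_post(title), '|', title)
```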


### Find Which Type of Post Receives More Comments
As shown below, Ask HN posts in this sample receive more comments on average (14.04 per post) than Show HN posts (10.32 per post). One plausible explanation is that a question directly invites responses.

In [54]:
# a function to find the average number of comments per post in any selection of posts from this dataset
def find_avg_comments(dataset):
    num_posts = len(dataset)
    total_comments = 0
    for row in dataset:
        num_comments = int(row[4])
        total_comments += num_comments
        
    avg_comments = total_comments / num_posts
    return avg_comments

avg_ask_comments = find_avg_comments(ask_posts)
print("Average ask post comments: ", format(avg_ask_comments,'.2f'))

avg_show_comments = find_avg_comments(show_posts)
print("Average show post comments: ", format(avg_show_comments,'.2f'))

Average ask post comments:  14.04
Average show post comments:  10.32


### Do Posts at a Certain Time Get More Comments?
Below we examine whether the number of comments on Ask HN posts varies with the hour at which the post was created.

#### Below we calculate the number of Ask HN posts created in each hour of the day, along with the total number of comments those posts received.

In [55]:
import datetime as dt

# create an empty list of lists to store time created and number of comments for each post
result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])

# create dictionaries of posts and comments by hour
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    time = row[0]
    comment = row[1]
    parsed_time = dt.datetime.strptime(time, "%m/%d/%Y %H:%M")
    hour = parsed_time.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
        
print(counts_by_hour, "\n")
comments_by_hour

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58} 



{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
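
A quick sanity check on these two dictionaries: the per-hour post counts should add up to the 1,744 Ask HN posts found earlier, and total comments divided by total posts should reproduce the overall Ask HN average of 14.04. The sketch below copies the values from the output above; `total_posts`, `total_comments`, and `overall_avg` are illustrative names.

```python
# Values copied from the notebook output above
counts_by_hour = {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68,
                  '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58,
                  '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71,
                  '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
comments_by_hour = {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543,
                    '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381,
                    '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479,
                    '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}

total_posts = sum(counts_by_hour.values())
total_comments = sum(comments_by_hour.values())
overall_avg = total_comments / total_posts

print('Total Ask HN posts:', total_posts)
print('Total comments:', total_comments)
print('Overall average:', format(overall_avg, '.2f'))
```

If either figure disagreed with the earlier cells, that would point to a parsing or counting error in the loop.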

#### Next we calculate the average number of comments per post for each hour.

In [56]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

#### We now sort the list by average number of comments, in descending order.

In [57]:
# create a swapped list where average comments comes first and hour comes second
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

# sort in descending order by avg number of comments
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
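
Swapping the columns works because Python compares lists element by element. An alternative, sketched here on a subset of the averages above, is to leave the `[hour, average]` pairs as they are and pass a `key` function to `sorted`. The name `top_five` is illustrative.

```python
# A subset of avg_by_hour, copied from the output above
avg_by_hour = [['09', 5.5777777777777775],
               ['13', 14.741176470588234],
               ['16', 16.796296296296298],
               ['15', 38.5948275862069],
               ['21', 16.009174311926607],
               ['20', 21.525],
               ['02', 23.810344827586206]]

# Sort on the second element (the average) without swapping the columns
top_five = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_five:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

Either approach yields the same ordering; the `key` version simply avoids building the intermediate swapped list.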

In [58]:
print('Top 5 Hours for Ask Post Comments')

for avg, hr in sorted_swap[:5]:
    template = "{hr}: {cm:.2f} average comments per post"
    hour = dt.datetime.strptime(hr, "%H").strftime('%H:%M')
    comments = avg
    output = template.format(hr = hour, cm=comments)
    print(output)


Top 5 Hours for Ask Post Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


#### **Ask HN** posts created during the 15:00 hour (3:00 p.m. US Eastern Time) receive the most comments on average: 38.59 per post.
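
Because the timestamps are in US Eastern Time, readers elsewhere may want the equivalent hour in UTC. The sketch below is a rough conversion that assumes a fixed UTC-5 offset (Eastern Standard Time); during daylight saving time the offset is UTC-4, so the result is approximate. `EASTERN_STANDARD` is an illustrative name, not part of the original notebook.

```python
import datetime as dt

# Assumption: a fixed UTC-5 offset; this ignores daylight saving time
EASTERN_STANDARD = dt.timezone(dt.timedelta(hours=-5), name="EST")

# The best hour found above, tagged with the assumed Eastern offset
best_hour = dt.datetime.strptime("15", "%H").replace(tzinfo=EASTERN_STANDARD)

# Convert to UTC
utc_hour = best_hour.astimezone(dt.timezone.utc)

print("Eastern (assumed UTC-5):", best_hour.strftime("%H:%M"))
print("UTC:", utc_hour.strftime("%H:%M"))
```

For conversions that follow daylight saving rules on real dates, the standard library's `zoneinfo` module (Python 3.9+) is the more accurate option where time zone data is installed.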