# Analyzing Hacker News Data

In this project, we analyze Hacker News data from the post that are made on the site. We want to ultimately see if the posts of type "Show HN" (show Hacker News) or "Ask NH" (ask Hacker News) produce the most traffic and interactions.

First, let's import the data set, explore its size and remove the entries that have no comments.

In [1]:
from csv import reader
opened_file = open(r'C:\Users\renau\DATA_SCIENCE\project2(HackerNewsData)\HN_posts_year_to_Sep_26_2016.csv', encoding="utf8")
read_file = reader(opened_file)
hn_dataset = list(read_file)

hn_dataset_head = hn_dataset[0]
hn_dataset = hn_dataset[1:]

print(hn_dataset_head)
print("Data set length is: " + str(len(hn_dataset)))
        

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
Data set length is: 293119


In [10]:
index = 0
for row in hn_dataset:
    if row[4] == "0":
        del hn_dataset[index]
    index += 1
        
print("Data set length is: " + str(len(hn_dataset)))

Data set length is: 80401


In [11]:
print(hn_dataset[:5])

[['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13'], ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26'], ['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54'], ['12578624', 'Phone Makers Could Cut Off Drivers. So Why Dont They?', 'http://www.nytimes.com/2016/09/25/technology/phone-makers-could-cut-off-drivers-so-why-dont-they.html', '4', '1', 'danso', '9/26/2016 1:37']]


Now let's separate the posts on wether they are "Ask NH", "Show NH" or "Other"

In [17]:
ask_posts = []
show_posts= []
other_posts = []
for row in hn_dataset:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
    

6911
5059
68431


Now let's calculate the number of comments generated on average on "Ask HN" and "Show HN" posts.

In [20]:
def mean_calc(dataset, index):
    total = 0
    for row in dataset:
        total += int(row[index])
    mean = total / len(dataset)
    return mean

avg_comments_ask = mean_calc(ask_posts, 4)
avg_comments_show = mean_calc(show_posts, 4)

print("Average comments per post ask HN: " + str(round(avg_comments_ask)))
print("Average comments per post show HN: " + str(round(avg_comments_show)))

Average comments per post ask HN: 14
Average comments per post show HN: 10


We can see that "Ask HN" post generate substantially more comments on average. It make sense since the person who write the post expect to receive some help/feedback from the readers.

Now, we'll analyze if there is a correlation between the time at which a post is made and the comments that it receives. 

In [37]:
import datetime as dt
def extract_data(dataset, index_list):
    # Extract desired columns of a dataset 
    # We use this function to extract the dates of the post and the number of
    # comments
    result_list = []
    for row in dataset:
        result = []
        for index in index_list:
            result.append(row[index])
        result_list.append(result)
    return result_list

def change_to_int(dataset, index):
    for row in dataset:
        row[index] = int(row[index])
    return dataset

def change_to_datetime(dataset, index):
    template = "%m/%d/%Y %H:%M"
    for row in dataset:
        formatted = dt.datetime.strptime(row[index], template)
        row[index] = formatted
    return dataset

def count_per_hour(dataset):
    # Takes the formatted dataset with index 0 being a datetime object and 
    # index 1 being an int of the comment number
    counts_by_hour = {}
    comments_by_hour = {}
    for row in dataset:
        time = row[0]
        comments = row[1]
        hour = time.strftime('%H')
        if hour in counts_by_hour:
            counts_by_hour[hour] += 1
            comments_by_hour[hour] += comments
        elif hour not in counts_by_hour:
            counts_by_hour[hour] = 1
            comments_by_hour[hour] = comments
    return counts_by_hour, comments_by_hour

ask_data_extract = extract_data(ask_posts, [6, 4])
ask_data_extract = change_to_int(ask_data_extract, 1)
ask_data_extract = change_to_datetime(ask_data_extract, 0)
counts_by_hour, comments_by_hour = count_per_hour(ask_data_extract)
print("Articles published and each hour of the day:")
print(counts_by_hour)
print("Total number of comments on the articles published at each hour of the day")
print(comments_by_hour)


Articles published and each hour of the day:
{'02': 227, '01': 223, '22': 287, '21': 407, '19': 420, '17': 404, '15': 467, '14': 378, '13': 326, '11': 251, '10': 219, '09': 176, '07': 157, '03': 212, '16': 415, '08': 190, '00': 231, '23': 276, '20': 392, '18': 452, '12': 274, '04': 186, '06': 176, '05': 165}
Total number of comments on the articles published at each hour of the day
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '16': 4466, '08': 2362, '00': 2277, '23': 2297, '20': 4462, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


We can now use these dictionnaries to calculate to average number on comments per article for each hour of the day.

In [39]:
avg_comments_by_hour = []
for key in counts_by_hour:
    hour = key
    average = round(comments_by_hour[key] / counts_by_hour[key]) 
    avg_comments_by_hour.append([hour, average])

print("Average comments per article published at each hour of the day")
print(avg_comments_by_hour)

Average comments per article published at each hour of the day
[['02', 13], ['01', 9], ['22', 12], ['21', 11], ['19', 9], ['17', 14], ['15', 40], ['14', 13], ['13', 22], ['11', 11], ['10', 14], ['09', 8], ['07', 10], ['03', 10], ['16', 11], ['08', 12], ['00', 10], ['23', 8], ['20', 11], ['18', 11], ['12', 15], ['04', 13], ['06', 9], ['05', 11]]


Now let's organize this info a little bit better

In [55]:
swap_avg_by_hour = []
for pair in avg_comments_by_hour:
    swap_avg_by_hour.append([pair[1], pair[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print("Top 5 hours for Ask HN posts comments")

for pair in sorted_swap[0:5]:
    time = dt.datetime.strptime(pair[1], '%H')
    avg = pair[0]
    time = time.strftime('%H')
    template = "{time}:00 with and average comments per post of {avg}"
    print(template.format(time = time, avg = avg))
    

Top 5 hours for Ask HN posts comments
15:00 with and average comments per post of 40
13:00 with and average comments per post of 22
12:00 with and average comments per post of 15
17:00 with and average comments per post of 14
10:00 with and average comments per post of 14


We can see that the best time to post something is at 3 P.M. by far Eastern time!