# HackerNews Analysis

## What is HackerNews?
HackerNews is a news aggregator and discussion website focused on technology and startups. It was started by [Y Combinator](https://www.ycombinator.com/).

## What is this project about?
I will review submission patterns on HackerNews to answer two questions:

1) Do Ask or Show HackerNews posts receive more comments than a regular article/post?

2) Do posts created at certain times of the day get higher engagement?

### Importing The Data

We first import `reader` from the `csv` module so we can read our CSV file, "hacker_news.csv". From there, I separated the header row from the data and gave the resulting list a shorter name, `hn`, for ease of reading.

In [1]:
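from csv import reader

# Read the whole CSV file into a list of rows (each row is a list of strings).
opened_file = open("hacker_news.csv")
read_file = reader(opened_file)
hackernews = list(read_file)

# Split off the header row; hn holds only the data rows.
hn_headers = hackernews[:1]
hn = hackernews[1:]
print(hn_headers)
print(hn[0:6])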

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'], ['10482257', '
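The loading step above can also be written with a `with` block, so that the file is closed automatically once reading finishes. The sketch below is one possible variant, not the notebook's original code. To stay self-contained it reads from an in-memory string, `sample_csv`, a hypothetical stand-in for "hacker_news.csv" that holds the header plus two rows copied from the output above. The helper name `load_rows` is also an assumption.

```python
import io
from csv import reader

# Hypothetical stand-in for "hacker_news.csv": the header plus two rows
# copied from the output above.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,"
    "3,1,hswarna,6/17/2016 0:01\n"
)


def load_rows(file_obj):
    """Return (header, data_rows) from an open CSV file object."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]


# With the real file this would be: with open("hacker_news.csv") as f: ...
with io.StringIO(sample_csv) as f:
    sample_headers, sample_hn = load_rows(f)

print(sample_headers)
print(len(sample_hn))
```

Here the header comes back as a flat list rather than the one-element list of lists that `hackernews[:1]` produces, which makes it easier to look up a column position by name.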

### Isolating The Data

Now that we have our new list, `hn`, which does not include the header row, we want to split the data into three separate lists that we can analyze later. We do this by creating three empty lists and using a for loop to go through the data and place each post in the appropriate bucket.

One thing to note is that we identify posts with the string method `startswith`, which is case-sensitive. For the match to work reliably, we first convert each title to a consistent case with a method such as `upper()` or `lower()`.
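To illustrate why the case matters, here is a small sketch comparing a strict match with a case-normalized one. The titles are made up for this example and do not come from the dataset.

```python
# Hypothetical titles, invented only to illustrate case handling.
titles = [
    "Ask HN: How do you learn?",
    "ASK HN: Favorite editor?",
    "Show HN: My project",
    "Why I ask questions",
]

# For each title: (strict match, case-normalized match).
matches = []
for title in titles:
    strict = title.startswith("Ask HN")
    relaxed = title.lower().startswith("ask hn")
    matches.append((strict, relaxed))
    print(title, strict, relaxed)
```

The second title is missed by the strict check but caught once the title is lowercased, which is the behavior we rely on in the loop that follows.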

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


### Finding Avg Comments

Now that we have our three buckets of data, we want to find the average number of comments per post in the Ask HN and Show HN buckets and see whether there is any difference in engagement between them.

In [3]:
total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [4]:
total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993
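Since the two cells above repeat the same logic, the calculation could be wrapped in a small helper. The sketch below is an assumption about how that might look: the function name `average_comments` and its `comments_index` parameter are hypothetical. The sample rows are four posts from the import output, trimmed to the first five columns with the URL replaced by an empty string for brevity.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Columns: id, title, url (blanked), num_points, num_comments.
sample_posts = [
    ['12224879', 'Interactive Dynamic Video', '', '386', '52'],
    ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", '', '2', '1'],
    ['11919867', 'Technology ventures: From Idea to Enterprise', '', '3', '1'],
    ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', '', '8', '2'],
]

print(average_comments(sample_posts))
```

In the notebook, calling the helper on `ask_posts`, `show_posts`, and `other_posts` in turn would give all three averages, including the one for regular posts that the first question asks about.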


### Further Analysis

Based on the above, Ask HN posts receive more comments on average than Show HN posts (about 14 versus about 10 per post). We will focus on Ask HN posts for now and see whether there are specific hours of the day when they tend to do better.

In [5]:
import datetime as dt

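# Pair each Ask HN post's creation time with its comment count.
result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])

# Per-hour tallies: number of posts and total number of comments.
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"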

for each_row in result_list:
    date = each_row[0]
    comment = each_row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] +=1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1
print(comments_by_hour)
print()
print(counts_by_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


In [6]:
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
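The counting and averaging steps above can also be combined with `collections.defaultdict`, which removes the need for the `if`/`else` branch. This is an alternative sketch rather than the approach used in the notebook, and the `[created_at, num_comments]` pairs in `sample_results` are hypothetical values in the same shape as `result_list`.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs shaped like result_list.
sample_results = [
    ["8/16/2016 9:55", 6],
    ["8/16/2016 15:10", 30],
    ["8/17/2016 15:40", 10],
    ["8/18/2016 9:05", 2],
]

# hour -> [post count, comment total]; missing hours start at [0, 0].
totals = defaultdict(lambda: [0, 0])
for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    totals[hour][0] += 1
    totals[hour][1] += num_comments

sample_avg_by_hour = {
    hour: total / count for hour, (count, total) in totals.items()
}
print(sample_avg_by_hour)
```

Keeping the count and the total in one structure means the two tallies cannot drift out of sync for a given hour.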


### Cleaning Up The Results

Now that we have the average number of comments by hour, let's put the data in a more reader-friendly format so we can draw our conclusions.

In [7]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)
print()
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], 

In [8]:
print("Top 5 Hours for Ask HN Posts Comments")
for avg, hr in sorted_swap[:5]:
    hour_label = dt.datetime.strptime(hr, "%H").strftime("%H:%M")
    print("{}, {:.2f} average comments per post".format(hour_label, avg))


Top 5 Hours for Ask HN Posts Comments
15:00, 38.59 average comments per post
02:00, 23.81 average comments per post
20:00, 21.52 average comments per post
16:00, 16.80 average comments per post
21:00, 16.01 average comments per post
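As an aside, the swap step is not strictly required: `sorted` accepts a `key` function, so the `[hour, average]` pairs can be ordered by their second element directly. The sketch below shows this alternative on a few pairs copied from `avg_by_hour` above. The names `sample_avgs` and `top_three` are assumptions introduced for the example.

```python
# A few [hour, average] pairs copied from avg_by_hour above.
sample_avgs = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
    ['16', 16.796296296296298],
]

# Sort by the average (index 1), highest first, and keep the top three.
top_three = sorted(sample_avgs, key=lambda pair: pair[1], reverse=True)[:3]

for hr, avg in top_three:
    print("{}:00, {:.2f} average comments per post".format(hr, avg))
```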


## Conclusion

Based on our findings, Ask HN posts created between 3pm and 4pm receive the most comments on average, about 38.6 per post, well ahead of the next-best hour. Based on the [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts), the timestamps are recorded in the Eastern Time Zone, so this window corresponds to 12pm to 1pm California time. If your goal is a high level of engagement on an Ask HN post, this would be an optimal time to post.
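The time zone conversion in the conclusion can be sketched with simple `datetime` arithmetic. This assumes a fixed three-hour gap between Eastern and Pacific time, which holds in practice because both zones shift for daylight saving on the same dates. The constant `EASTERN_TO_PACIFIC` and the function `eastern_hour_to_pacific` are hypothetical names introduced for this example.

```python
import datetime as dt

# Assumption: Pacific time is three hours behind Eastern time.
EASTERN_TO_PACIFIC = dt.timedelta(hours=-3)


def eastern_hour_to_pacific(hour_str):
    """Convert an hour string such as '15' (Eastern) to 'HH:MM' Pacific."""
    eastern = dt.datetime.strptime(hour_str, "%H")
    return (eastern + EASTERN_TO_PACIFIC).strftime("%H:%M")


print(eastern_hour_to_pacific("15"))
```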