## Hacker News Analysis

#### Table of Contents
* [Introduction](#1)
* [Open csv file](#2)
* [extract header and separate data rows](#3)
* [Separating ASK HN, SHOW HN, AND OTHER POSTS](#4)
* [Calculate Avg Comments per SHOW vs ASK posts](#5)
* [Focused Analysis on ASK posts](#6)
* [Conclusion](#7)


#### Introduction <a class='anchor' id='1'></a>
In this project, we use a dataset of Hacker News posts to practice basic data analysis in Python.

Specifically, we'll compare two types of posts on the Hacker News website to answer the following questions:

* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

#### Open csv file + view 1st 5 rows <a class='anchor' id='2'></a>

In [18]:
import csv as c

# Read the whole CSV into a list of rows; the with-block closes the file when done.
with open("D:/DataQuest/hacker_news.csv") as opened_file:
    read_file = c.reader(opened_file)
    hacker_list = list(read_file)

for x in hacker_list[:5]:
    print(x)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


#### extract header and separate data rows <a class='anchor' id='3'></a>

In [19]:
header = hacker_list[0]  # the first row holds the column names
print(header)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [20]:
hacker_list = hacker_list[1:]
for x in hacker_list[:5]:
    print(x)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


#### Separating ASK HN, SHOW HN, AND OTHER POSTS <a class='anchor' id='4'></a>

In [23]:
ask_hn = []
show_hn = []
other_hn = []

for row in hacker_list:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_hn.append(title)
    elif title.startswith('show hn'):
        show_hn.append(title)
    else:
        other_hn.append(title)

print(len(ask_hn))
print(len(show_hn))
print(len(other_hn))

1744
1162
17194


#### Let's have a look at each list.

In [24]:
for row in ask_hn[:5]:
    print(row)

ask hn: how to improve my personal website?
ask hn: am i the only one outraged by twitter shutting down share counts?
ask hn: aby recent changes to css that broke mobile?
ask hn: looking for employee #3 how do i do it?
ask hn: someone offered to buy my browser extension from me. what now?


In [25]:
for row in show_hn[:5]:
    print(row)

show hn: wio link  esp8266 based web of things hardware development platform
show hn: something pointless i made
show hn: shanhu.io, a programming playground powered by e8vm
show hn: webscope  easy way for web developers to communicate with clients
show hn: geoscreenshot  easily test geo-ip based web pages


In [26]:
for row in other_hn[:5]:
    print(row)

interactive dynamic video
how to use open source and shut the fuck up at the same time
florida djs may face felony for april fools' water joke
technology ventures: from idea to enterprise
note by note: the making of steinway l1037 (2007)


The lists look correct.
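
As a side note, the same split can be written so that each list keeps the full row instead of only the lower-cased title, which would spare later steps from re-matching titles. The sketch below is only an illustration: `split_posts` and the three sample rows are hypothetical and do not come from the dataset.

```python
def split_posts(rows):
    """Split rows into (ask, show, other) lists, keeping each full row."""
    ask, show, other = [], [], []
    for row in rows:
        title = row[1].lower()  # column 1 is the post title
        if title.startswith('ask hn'):
            ask.append(row)
        elif title.startswith('show hn'):
            show.append(row)
        else:
            other.append(row)
    return ask, show, other

# Made-up rows in the dataset's column order, for illustration only.
sample_rows = [
    ['1', 'Ask HN: Example question?', '', '5', '3', 'user_a', '1/1/2016 10:00'],
    ['2', 'Show HN: Example project', 'http://example.com', '7', '2', 'user_b', '1/1/2016 11:00'],
    ['3', 'Some other story', 'http://example.com/story', '1', '0', 'user_c', '1/1/2016 12:00'],
]

ask_rows, show_rows, other_rows = split_posts(sample_rows)
print(len(ask_rows), len(show_rows), len(other_rows))
```

With full rows stored, the comment count of an Ask HN post is available directly as `int(row[4])`.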

#### Calculate Avg Comments per SHOW vs ASK posts <a class='anchor' id='5'></a>

In [36]:
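total_comments = 0
num_ask_posts = len(ask_hn)  # number of Ask HN posts, not comments
for row in hacker_list:
    # ask_hn holds lower-cased titles, so compare against the lower-cased title
    if row[1].lower() in ask_hn:
        total_comments += int(row[4])  # column 4 is num_comments
avg_ask_comments = total_comments / num_ask_posts
print(avg_ask_comments)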

14.038417431192661


In [37]:
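total_comments = 0
num_show_posts = len(show_hn)  # number of Show HN posts, not comments
for row in hacker_list:
    # show_hn holds lower-cased titles, so compare against the lower-cased title
    if row[1].lower() in show_hn:
        total_comments += int(row[4])  # column 4 is num_comments
avg_show_comments = total_comments / num_show_posts
print(avg_show_comments)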

10.31669535283993


Ask HN posts receive about 14 comments on average (14.04), versus about 10 (10.32) for Show HN posts.
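
The two averaging cells differ only in the title prefix, so the calculation could be folded into one helper. This is a minimal sketch under stated assumptions: `average_comments` is a hypothetical name, and the sample rows are made up rather than taken from the dataset.

```python
def average_comments(rows, prefix):
    """Mean of num_comments (column 4) over rows whose title starts with prefix."""
    counts = [int(row[4]) for row in rows if row[1].lower().startswith(prefix)]
    # Guard against an empty selection to avoid dividing by zero.
    return sum(counts) / len(counts) if counts else 0.0

# Made-up rows in the dataset's column order, for illustration only.
sample_rows = [
    ['1', 'Ask HN: First question?', '', '5', '3', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: Second question?', '', '2', '5', 'user_b', '1/1/2016 11:00'],
    ['3', 'Show HN: Example project', 'http://example.com', '7', '2', 'user_c', '1/1/2016 12:00'],
]

avg_ask = average_comments(sample_rows, 'ask hn')
avg_show = average_comments(sample_rows, 'show hn')
print(avg_ask, avg_show)
```

Filtering by prefix directly on each row avoids the list-membership test on titles, which rescans the title list for every row.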

#### Next, focused analysis on ASK POSTS. <a class='anchor' id='6'></a>
In this part we will:

* Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
* Calculate the average number of comments ask posts receive by hour created.

First, create a dataset containing only the ask posts.

In [68]:
ASK_list = []
for row in hacker_list:
    if row[1].lower() in ask_hn:
        ASK_list.append(row)

We parse each post's creation date into a datetime object, extract the hour, and build a frequency table of posts created per hour, which we then print in descending order.
The lines marked with # were added afterwards to accumulate the number of comments per hour in the same loop.
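
To see what the parsing step does on a single value, here is a small sketch using two `created_at` strings copied from the rows printed earlier. The variable names are illustrative only.

```python
import datetime as dt

# created_at values copied from the sample rows shown earlier.
samples = ['8/4/2016 11:52', '6/17/2016 0:01']

hour_labels = []
hour_numbers = []
for created_at in samples:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour_labels.append(parsed.strftime('%H'))  # zero-padded string, used as the dictionary key
    hour_numbers.append(parsed.hour)           # the same hour as a plain integer

print(hour_labels)
print(hour_numbers)
```

Because `strftime('%H')` returns a zero-padded string, the hour keys sort correctly as text ('02' before '15').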

In [69]:
import datetime as dt
freq_hourly = {}
num_comments = {}
for row in ASK_list:
    ask_date = row[-1]
    parse_date = dt.datetime.strptime(ask_date, "%m/%d/%Y %H:%M")
    hour = parse_date.strftime('%H')
    if hour in freq_hourly:
        freq_hourly[hour] += 1
        num_comments[hour] += int(row[-3]) #
    else:
        freq_hourly[hour] = 1
        num_comments[hour] = int(row[-3]) #

# Build (count, hour) tuples, then sort once after the loop, in descending order.
rank = []
for x in freq_hourly:
    tuple1 = (freq_hourly[x],x)
    rank.append(tuple1)
rank = sorted(rank,reverse=True)
for x in rank:
    print(f"{x[1]}: {x[0]}")

15: 116
19: 110
21: 109
18: 109
16: 108
14: 107
17: 100
13: 85
20: 80
12: 73
22: 71
23: 68
01: 60
10: 59
11: 58
02: 58
00: 55
03: 54
08: 48
04: 47
05: 46
09: 45
06: 44
07: 34


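Most ask posts are created at 15:00 (116 posts). The busiest period runs from 14:00 to 21:00, where every hour except 20:00 has at least 100 posts.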

In [70]:
# Build (comment total, hour) tuples, then sort once after the loop, in descending order.
rank = []
for x in num_comments:
    tuple1 = (num_comments[x],x)
    rank.append(tuple1)
rank = sorted(rank,reverse=True)
for x in rank:
    print(f"{x[1]}: {x[0]}")

15: 4477
16: 1814
21: 1745
20: 1722
18: 1439
14: 1416
02: 1381
13: 1253
19: 1188
17: 1146
10: 793
12: 687
01: 683
11: 641
23: 543
08: 492
22: 479
05: 464
00: 447
03: 421
06: 397
04: 337
07: 267
09: 251


Posts created at 15:00 also collect the most comments in total (4477), more than double the next hour, 16:00 (1814).

#### Calculate the average number of comments per post by hour of the day

In [73]:
avg = {}
for x in freq_hourly:
    avg[x] = round(num_comments[x] / freq_hourly[x],2)

# Build (average, hour) tuples, then sort once after the loop, in descending order.
rank = []
for x in avg:
    tuple1 = (avg[x],x)
    rank.append(tuple1)
rank = sorted(rank,reverse=True)
for x in rank[:5]:
    print(f"{x[1]}: {x[0]}")

15: 38.59
02: 23.81
20: 21.52
16: 16.8
21: 16.01


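Ask posts created at 15:00 receive the most comments on average (38.59 per post), followed by 02:00, 20:00, 16:00, and 21:00. Posting a question during one of these hours appears to give the best chance of receiving comments.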
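
The ranking step can also be written without building swapped tuples, by sorting the dictionary items on their value. The sketch below reuses the five averages printed above, entered in a shuffled order; `top_avgs` and `ranked` are illustrative names.

```python
# The five averages printed above, deliberately out of order.
top_avgs = {'21': 16.01, '15': 38.59, '16': 16.8, '02': 23.81, '20': 21.52}

# Sort (hour, average) pairs by the average, highest first.
ranked = sorted(top_avgs.items(), key=lambda item: item[1], reverse=True)

for hour, average in ranked:
    print(f"{hour}:00 -> {average:.2f} comments per post")
```

One difference from the tuple approach: when two hours tie on the value, `sorted` with a key keeps them in their original dictionary order instead of ordering them by hour.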

#### Conclusion <a class='anchor' id='7'></a>

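In this project, we performed basic data analysis in Python using lists, dictionaries, and tuples. We found that Ask HN posts receive more comments on average than Show HN posts (about 14 versus 10), and that ask posts created at 15:00 receive the most comments per post (38.59 on average).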