# Hacker News Posts Analysis
[Hacker News](https://news.ycombinator.com/) (or HN) is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit.

> _Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result._

In this analysis, we're going to work with a data set of HN submissions, focusing on posts whose titles start with either 'Ask HN' or 'Show HN'.

'Ask HN' posts are meant to ask the HN community a specific question, while 'Show HN' posts are to show the HN community a project, product, or just generally something interesting.

We're comparing 'Ask HN' posts with 'Show HN' posts to check the following:
1. Which one receives more comments on average.
2. If posts created at a certain time receive more comments on average.


The data set is available [here](https://www.kaggle.com/hacker-news/hacker-news-posts). The version used here has been reduced to roughly 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.
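
The reduction described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure actually used to build the file: the function name `reduce_dataset`, the `seed` parameter, and the toy rows are all hypothetical.

```python
import random

def reduce_dataset(rows, sample_size, seed=0):
    """Drop rows with zero comments, then randomly sample from the rest.

    `rows` are lists shaped like the data set rows, with the comment
    count stored as a string at index 4 (the num_comments column).
    """
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    # Never ask for more rows than are available.
    return rng.sample(commented, min(sample_size, len(commented)))

# Toy rows (hypothetical values) in the same column order as the data set.
toy_rows = [
    ['1', 'Post A', '', '5', '0', 'user_a', '1/1/2016 10:00'],
    ['2', 'Post B', '', '3', '4', 'user_b', '1/2/2016 11:00'],
    ['3', 'Post C', '', '9', '12', 'user_c', '1/3/2016 12:00'],
    ['4', 'Post D', '', '1', '0', 'user_d', '1/4/2016 13:00'],
]

reduced = reduce_dataset(toy_rows, sample_size=2)
print(len(reduced))
```

Because zero-comment posts are removed before sampling, averages computed on the reduced file describe posts that received at least one comment, not all submissions.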


## Exploring Data
We will start by exploring the data set using two helper functions: one that reads the data set into a list of rows, and one that displays the rows of a list one at a time.

### Adding required code for the analysis:

In [1]:
#Importing required modules
from csv import reader
import datetime as dt

#Reading data set method
#Reads any CSV data set given, and returns it as a list of rows
def readAFile(file='hacker_news.csv'):
    #the with block closes the file once it has been read
    with open(file, newline='') as openFile:
        lst = list(reader(openFile))
    return lst

#List display method
#Used to display a list in an organized manner at any time
def display(lst):
    for i in lst:
        print(i, end='\n\n')

### Printing sample data:
We will read the data set first, then move the header row into a separate variable, and finally display the first 5 rows of the data set.
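
The read-then-split step can be tried without the CSV file on disk. The sketch below is a hedged illustration: `io.StringIO` stands in for the opened file, and the variable names `csv_text`, `rows`, and `data` are assumptions made for this example only.

```python
import csv
import io

# In-memory CSV text standing in for hacker_news.csv: the header plus
# one row from the data set.
csv_text = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

with io.StringIO(csv_text) as handle:
    rows = list(csv.reader(handle))

# The first row is the header; everything after it is data.
header, data = rows[0], rows[1:]
print(header)
print(len(data))
```

Note that every value comes back as a string, including the numeric columns, so they need converting before any arithmetic.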

In [2]:
#Reading the HN data set
hn = readAFile()

#moving the header to a separate variable
header = hn[0]
hn = hn[1:]

#printing the header
print('Header:\n')
print(header, end='\n\n')

#printing sample data
print('Sample data:', end='\n\n')
display(hn[0:5])

Header:

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

Sample data:

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']

### Data set columns descriptions:
From the header above, and from the data set's source page, we infer the following column descriptions:

- `id`: The unique identifier from Hacker News for the post.
- `title`: The title of the post.
- `url`: The URL that the post links to, if the post has a URL.
- `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes.
- `num_comments`: The number of comments that were made on the post.
- `author`: The username of the person who submitted the post.
- `created_at`: The date and time at which the post was submitted.
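
To make the column layout concrete, a row can be paired with the header so that fields are accessed by name. This is a small sketch, not part of the analysis itself; the helper name `to_record` is hypothetical, and the row is the first one from the sample data.

```python
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/', '386', '52',
       'ne0phyte', '8/4/2016 11:52']

def to_record(header, row):
    """Pair column names with values and convert the numeric columns."""
    record = dict(zip(header, row))
    for column in ('num_points', 'num_comments'):
        record[column] = int(record[column])
    return record

record = to_record(header, row)
print(record['title'], record['num_comments'])
```

The rest of this analysis keeps the rows as plain lists and uses positional indexes instead (for example, index 1 for `title` and index 4 for `num_comments`).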

## Preparing Required Data
As mentioned above, we're interested in posts whose titles start with 'Ask HN' or 'Show HN'. To analyze them, we'll first separate those posts from the rest of the data set.

In [3]:
#creating lists to have only posts with 'Ask HN', 'Show HN', and other posts, separately.
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif title.lower().startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)

#print numbers

print('Number of Ask HN posts:', len(ask_posts))
print('Number of Show HN posts:', len(show_posts))
print('Number of Other HN posts:', len(other_posts))


Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of Other HN posts: 17194
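
As a quick follow-up, the three counts above can be turned into shares of the whole data set. This is a small arithmetic sketch; the names `counts` and `shares` are introduced here for illustration.

```python
# Counts copied from the output above.
counts = {'Ask HN': 1744, 'Show HN': 1162, 'Other': 17194}
total = sum(counts.values())

# Share of each category as a percentage, rounded to two decimals.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

for name, share in shares.items():
    print(f'{name}: {share}% of {total} posts')
```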


In [4]:
#exploring 'Ask HN' posts
display(ask_posts[0:5])

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']

['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']

['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']

['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']

['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']



In [5]:
#exploring 'Show HN' posts
display(show_posts[0:5])

['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']

['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']

['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']

['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11']

['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']



In [6]:
#exploring 'Others' posts
display(other_posts[0:5])

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']



## Analysis

### Determining whether 'Ask HN' or 'Show HN' receive more comments:

In [7]:
total_ask_comments = 0
total_show_comments = 0

#Calculating the average of 'Ask HN' posts comments
for post in ask_posts:
    no_of_comments = int(post[4])
    total_ask_comments += no_of_comments

avg_ask_comments = total_ask_comments / int(len(ask_posts))
print('Average number of comments for \'Ask HN\' posts:', avg_ask_comments)

#Calculating the average of 'Show HN' posts comments
for post in show_posts:
    no_of_comments = int(post[4])
    total_show_comments += no_of_comments

avg_show_comments = total_show_comments / int(len(show_posts))
print('Average number of comments for \'Show HN\' posts:', avg_show_comments)

Average number of comments for 'Ask HN' posts: 14.038417431192661
Average number of comments for 'Show HN' posts: 10.31669535283993


From the above, we see that the average number of comments for 'Ask HN' posts (about 14.04) is higher than the average number of comments for 'Show HN' posts (about 10.32). This was expected, since 'Ask HN' posts explicitly ask the community to respond, but it was still worth checking.

Surprisingly, the average number of comments for 'Show HN' posts is not far below the average for 'Ask HN' posts. This may be a sign of how willing the HN community is to engage with and give feedback on shared projects.
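
The averaging step can also be written as a small reusable helper. The sketch below is illustrative only: the function name `average_comments` is hypothetical, and the inputs are the `num_comments` values from the five 'Ask HN' and five 'Show HN' sample rows displayed earlier, not the full lists.

```python
from statistics import mean

# num_comments values copied from the sample rows shown earlier.
ask_sample_comments = [6, 29, 1, 3, 17]
show_sample_comments = [22, 102, 1, 3, 9]

def average_comments(comment_counts):
    """Return the arithmetic mean of a list of comment counts."""
    return mean(comment_counts)

ask_avg = average_comments(ask_sample_comments)
show_avg = average_comments(show_sample_comments)
print(ask_avg, show_avg)
```

In these five-row samples the ordering is reversed, because a single 'Show HN' post with 102 comments dominates its group. This shows how sensitive a mean is to a few large values, and why the comparison should be made on the full lists and not on a handful of rows.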

### Determining 'when' posts get more comments
To analyze whether posts created at a certain time receive more comments on average, we'll concentrate on the 'Ask HN' posts, since they receive more comments on average.

For that, we'll do the following:
1. Calculate the number of 'Ask HN' posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments 'Ask HN' posts receive by hour created.
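
Both steps depend on extracting the hour from a `created_at` string. A minimal example of that parsing step, using the timestamp of the first 'Ask HN' sample row (the variable names are assumptions for this sketch):

```python
import datetime as dt

# Timestamp copied from the first 'Ask HN' sample row.
created_at = '8/16/2016 9:55'

# %m/%d/%Y %H:%M matches month/day/year followed by a 24-hour clock time.
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

# strftime('%H') gives a zero-padded string; .hour gives an integer.
hour_label = parsed.strftime('%H')
hour_number = parsed.hour
print(hour_label, hour_number)
```

Using the zero-padded string as a dictionary key keeps the hours in chronological order when the keys are sorted as text.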

In [13]:
#Calculate the number of 'Ask HN' posts created in each hour of the day,
#along with the number of comments received.


#Creating a separate list, having only the posts' timestamps and number of comments.

result_list = []

for post in ask_posts:
    created_at = post[6]
    no_of_comments = int(post[4])
    result_list.append([created_at, no_of_comments])


#counts_by_hour:	number of 'Ask HN' posts created in each hour of the day
#comments_by_hour:	number of comments posted in 'Ask HN' posts at each hour of the day

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    hour = row[0]
    hour = dt.datetime.strptime(hour,'%m/%d/%Y %H:%M')
    hour = hour.strftime('%H')
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += int(row[1])
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = int(row[1])

#exploring results
print(counts_by_hour)
print(comments_by_hour)

{'07': 34, '10': 59, '08': 48, '22': 71, '20': 80, '01': 60, '14': 107, '03': 54, '23': 68, '15': 116, '12': 73, '11': 58, '17': 100, '05': 46, '00': 55, '19': 110, '09': 45, '02': 58, '13': 85, '16': 108, '21': 109, '06': 44, '18': 109, '04': 47}
{'07': 267, '10': 793, '08': 492, '22': 479, '20': 1722, '01': 683, '14': 1416, '03': 421, '23': 543, '15': 4477, '12': 687, '11': 641, '17': 1146, '05': 464, '00': 447, '19': 1188, '09': 251, '02': 1381, '13': 1253, '16': 1814, '21': 1745, '06': 397, '18': 1439, '04': 337}
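
A quick sanity check on these two dictionaries: the per-hour counts should add up to the number of 'Ask HN' posts, and total comments divided by total posts should reproduce the overall 'Ask HN' average computed earlier. The sketch below copies the printed values into two variables named `counts` and `comments` (names chosen for this example).

```python
# Values copied from the two dictionaries printed above.
counts = {'07': 34, '10': 59, '08': 48, '22': 71, '20': 80, '01': 60,
          '14': 107, '03': 54, '23': 68, '15': 116, '12': 73, '11': 58,
          '17': 100, '05': 46, '00': 55, '19': 110, '09': 45, '02': 58,
          '13': 85, '16': 108, '21': 109, '06': 44, '18': 109, '04': 47}
comments = {'07': 267, '10': 793, '08': 492, '22': 479, '20': 1722,
            '01': 683, '14': 1416, '03': 421, '23': 543, '15': 4477,
            '12': 687, '11': 641, '17': 1146, '05': 464, '00': 447,
            '19': 1188, '09': 251, '02': 1381, '13': 1253, '16': 1814,
            '21': 1745, '06': 397, '18': 1439, '04': 337}

total_posts = sum(counts.values())
total_comments = sum(comments.values())
overall_average = total_comments / total_posts

print(total_posts, total_comments, overall_average)
```

The totals agree with the 1744 'Ask HN' posts and the average of about 14.04 comments per post reported earlier, which suggests no rows were dropped or double-counted while grouping by hour.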


In [16]:
#Calculate the average number of comments 'Ask HN' posts receive by hour created.

#avg_by_hour:	a list of lists, where each list contains the hour,
#				and the average number of comments per post

avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])

#exploring results
print(avg_by_hour)

[['07', 7.852941176470588], ['10', 13.440677966101696], ['08', 10.25], ['22', 6.746478873239437], ['20', 21.525], ['01', 11.383333333333333], ['14', 13.233644859813085], ['03', 7.796296296296297], ['23', 7.985294117647059], ['15', 38.5948275862069], ['12', 9.41095890410959], ['11', 11.051724137931034], ['17', 11.46], ['05', 10.08695652173913], ['00', 8.127272727272727], ['19', 10.8], ['09', 5.5777777777777775], ['02', 23.810344827586206], ['13', 14.741176470588234], ['16', 16.796296296296298], ['21', 16.009174311926607], ['06', 9.022727272727273], ['18', 13.20183486238532], ['04', 7.170212765957447]]


The list above is hard to read, so we'll sort it by the average number of comments and print the top five hours in a readable format.
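
One way to sort `[hour, average]` pairs without rearranging them is to pass a key function to `sorted`. The sketch below uses a few pairs copied from the output above; the variable names are assumptions made for this example.

```python
from operator import itemgetter

# A few [hour, average] pairs copied from the output above.
pairs = [['07', 7.852941176470588], ['20', 21.525],
         ['15', 38.5948275862069], ['02', 23.810344827586206]]

# Sort by the average (index 1), largest first, leaving each pair intact.
by_average = sorted(pairs, key=itemgetter(1), reverse=True)

# Sort by hour label; zero-padded strings sort in chronological order.
by_hour = sorted(pairs, key=itemgetter(0))

print([hour for hour, _ in by_average])
print([hour for hour, _ in by_hour])
```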

In [38]:
#swapping the hour value with the average value, to prepare it for sorting
swap_avg_by_hour = []

for item in avg_by_hour:
    swap_avg_by_hour.append([item[1],item[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print('Top 5 Hours for \'Ask HN\' Posts Comments',end='\n\n')
template = "{time}: {comment:.2f} average comments per post"
#GMT+3 is 7 hours ahead of U.S. Eastern Daylight Time (8 hours ahead of
#Eastern Standard Time), so the offset is added, not subtracted.
#This assumes the timestamps are in U.S. Eastern Time.
delta = dt.timedelta(hours=7)
for item in sorted_swap[0:5]:
    hour = item[1]
    hour = dt.datetime.strptime(hour, '%H')
    hour = hour + delta #converting to GMT+3
    hour = dt.datetime.strftime(hour, '%H:%M')
    string = template.format(time=hour, comment=item[0])
    print(string)

Top 5 Hours for 'Ask HN' Posts Comments

22:00: 38.59 average comments per post
09:00: 23.81 average comments per post
03:00: 21.52 average comments per post
23:00: 16.80 average comments per post
04:00: 16.01 average comments per post


From the above, we conclude that 'Ask HN' posts created at 15:00 U.S. Eastern Time, which is about 22:00 (10:00 PM) in GMT+3 (Bahrain's time zone), received the most comments on average in this sample. Posting around that time therefore gives the best chance of receiving comments from the HN community.
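
The time zone conversion can be checked with fixed UTC offsets from the standard library. This is a sketch under the assumption that the data set's timestamps are in U.S. Eastern Time, as the code comment in this analysis states; the constant and function names are hypothetical.

```python
import datetime as dt

GMT_PLUS_3 = dt.timezone(dt.timedelta(hours=3))
# U.S. Eastern is UTC-4 during daylight time and UTC-5 during standard time.
EASTERN_DAYLIGHT = dt.timezone(dt.timedelta(hours=-4))
EASTERN_STANDARD = dt.timezone(dt.timedelta(hours=-5))

def to_gmt_plus_3(hour, source_tz):
    """Convert an hour of the day in source_tz to an HH:MM string in GMT+3."""
    # The date is arbitrary; only the time of day matters here.
    moment = dt.datetime(2016, 1, 1, hour, 0, tzinfo=source_tz)
    return moment.astimezone(GMT_PLUS_3).strftime('%H:%M')

summer = to_gmt_plus_3(15, EASTERN_DAYLIGHT)
winter = to_gmt_plus_3(15, EASTERN_STANDARD)
print(summer, winter)
```

A single fixed offset is only right for part of the year, since the gap between Eastern Time and GMT+3 is 7 hours during daylight time and 8 hours during standard time. A per-row conversion with a time zone database would handle this more precisely.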

Todo:
- Determine if show or ask posts receive more points on average.
- Determine if posts created at a certain time are more likely to receive more points.
- Compare your results to the average number of comments and points other posts receive.
- Use Dataquest's data science project style guide to format your project.