# Hacker News Posts Exploration & Analysis


What is Hacker News?

Hacker News is a technology-focused social news site. Users can ask questions or submit stories, and receive votes and comments from other users.

Posts can have labels such as **Ask HN** or **Show HN**. 

Ask HN posts ask the Hacker News community a specific question.

Show HN posts show something to the Hacker News community.

## Questions


1. Do Ask HN or Show HN posts receive more comments on average?

2. Do posts created at a certain time receive more comments on average?


## Dataset 
[Original](https://www.kaggle.com/hacker-news/hacker-news-posts)

[Modified](https://app.dataquest.io/31d43d5f-8b12-4cb8-b62e-c27f99eb7fb4)


# Importing Data

In [1]:
from csv import reader
import datetime as dt

# read the CSV into a list of rows; the with block closes the file afterwards
with open("hacker_news.csv") as o_file:
    r_file = reader(o_file)
    hacker_news = list(r_file)
hacker_news_h = hacker_news[0]  # save header
hacker_news = hacker_news[1:]  # remove header from main dataset

print(hacker_news_h, "\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 



## Important Columns
0 id: the unique identifier from Hacker News for the post

1 title: the title of the post

2 url: the URL that the post links to, if the post has a URL

3 num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes

4 num_comments: the number of comments on the post

5 author: the username of the person who submitted the post

6 created_at: the date and time of the post's submission
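The positions listed above can also be looked up by name instead of being hard-coded. The following is a minimal sketch, assuming the header row printed earlier; `col_index` is a hypothetical helper name and is not part of the original notebook.

```python
# Header row as printed by the import cell
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position (col_index is a hypothetical name)
col_index = {name: position for position, name in enumerate(header)}

# First data row shown in this notebook
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

# Every field is read as a string, so numeric columns need int()
num_comments = int(sample_row[col_index['num_comments']])
print(col_index['num_comments'], num_comments)
```

Looking columns up by name keeps later cells readable, at the cost of one extra dictionary.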

## Sample of Data

In [2]:
print(hacker_news[0], "\n")
print(hacker_news[1], "\n")
print(hacker_news[2], "\n")

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 



# Initial Data Cleaning

We only need posts whose titles are tagged "Ask HN" or "Show HN".

Python's str.startswith() returns True or False depending on whether a string starts with a given prefix, which lets us check and filter the titles.

The dataset will be split into three lists: ask_posts, show_posts, and other_posts.
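Because str.startswith() is case-sensitive, titles are lowercased before the check. A small sketch of that behaviour follows; `classify` is a hypothetical helper and the all-caps title is made up for illustration.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # normalise case so "ASK HN" and "Ask HN" both match
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

print(classify("Ask HN: How to improve my personal website?"))
print(classify("Interactive Dynamic Video"))

# Without lower(), a differently cased prefix is missed
made_up_title = "ASK HN: example"  # illustrative title, not from the dataset
print(made_up_title.startswith("ask hn"))
print(classify(made_up_title))
```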


In [9]:
# 3 lists
ask_posts, show_posts, other_posts = [], [], []

# sort data by post type
for post in hacker_news:
    if post[1].lower().startswith("ask hn"):
        ask_posts.append(post)
    elif post[1].lower().startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)

# check
print("Ask Posts Count: ",len(ask_posts))
print(ask_posts[0], "\n")
print("Show Posts Count: ",len(show_posts))
print(show_posts[0], "\n")
print("Other Posts Count: ",len(other_posts))
print(other_posts[0], "\n")

Ask Posts Count:  1744
['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'] 

Show Posts Count:  1162
['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'] 

Other Posts Count:  17194
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 



## Finding Comment Averages
In order to answer questions about posts and their comments we need to isolate the comments.

Finding the average number of comments per post for each post type is a good place to start.

In [23]:
# comment index is 4

# ask comments
total_ask_comments = 0
avg_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])
avg_ask_comments = total_ask_comments / len(ask_posts)

results = "A ask_hn post gets an avg of {:.2f} comments, with a total of {} comments found across all posts".format(avg_ask_comments, total_ask_comments)
print(results, "\n")

# show comments
total_show_comments = 0
avg_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])
avg_show_comments = total_show_comments / len(show_posts)

results = "A show_hn post gets an avg of {:.2f} comments, with a total of {} comments found across all posts".format(avg_show_comments, total_show_comments)
print(results, "\n")

# other comments
total_other_comments = 0
avg_other_comments = 0

for post in other_posts:
    total_other_comments += int(post[4])
avg_other_comments = total_other_comments / len(other_posts)

results =  "A other_hn post gets an avg of {:.2f} comments, with a total of {} comments found across all posts".format(avg_other_comments, total_other_comments)
print(results, "\n")

A ask_hn post gets an avg of 14.04 comments, with a total of 24483 comments found across all posts 

A show_hn post gets an avg of 10.32 comments, with a total of 11988 comments found across all posts 

A other_hn post gets an avg of 26.87 comments, with a total of 462055 comments found across all posts 



# Question 1 Results


Do Ask HN or Show HN posts receive more comments on average?

Our findings suggest that **Ask** posts receive on average **3.72 more comments** than Show posts (14.04 versus 10.32).
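The 3.72 figure can be checked from the totals and post counts printed earlier. This is only a quick arithmetic sketch that reuses those printed numbers; the variable names are chosen for this example.

```python
# Totals and counts copied from the notebook output
ask_total, ask_count = 24483, 1744
show_total, show_count = 11988, 1162

avg_ask = ask_total / ask_count
avg_show = show_total / show_count
difference = avg_ask - avg_show

print("{:.2f} - {:.2f} = {:.2f}".format(avg_ask, avg_show, difference))
```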

# Next Steps

Now that we know Ask posts receive more comments, we want to know whether posting at a certain hour increases the number of comments received. Using the ask_posts dataset, we can create a frequency table based on the time of posting.

1. Count the ask posts and sum their comments for each hour
2. Find the average number of comments per post for each hour
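The two steps above could be sketched as follows. This is one possible approach, not necessarily the one the notebook goes on to use: `avg_comments_by_hour` is a hypothetical function name, the date format string is inferred from the sample rows, and the two placeholder rows are made up so the example runs on its own.

```python
import datetime as dt

def avg_comments_by_hour(posts, date_format="%m/%d/%Y %H:%M"):
    """Return {hour: average comments per post} for rows in this notebook's layout."""
    posts_per_hour = {}
    comments_per_hour = {}
    for post in posts:
        # created_at is at index 6, num_comments at index 4
        hour = dt.datetime.strptime(post[6], date_format).strftime("%H")
        posts_per_hour[hour] = posts_per_hour.get(hour, 0) + 1
        comments_per_hour[hour] = comments_per_hour.get(hour, 0) + int(post[4])
    return {hour: comments_per_hour[hour] / posts_per_hour[hour]
            for hour in posts_per_hour}

sample_posts = [
    # first Ask HN row from the dataset
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    # made-up placeholder rows, for illustration only
    ['0', 'Ask HN: placeholder title A', '', '1', '10', 'example_user', '8/17/2016 9:05'],
    ['1', 'Ask HN: placeholder title B', '', '1', '4', 'example_user', '8/17/2016 15:30'],
]

averages = avg_comments_by_hour(sample_posts)
for hour in sorted(averages):
    print("{}:00 -> {:.2f} avg comments".format(hour, averages[hour]))
```

In the notebook itself the function would be called with ask_posts instead of sample_posts.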