# Hacker News Posts Exploration & Analysis


## What is Hacker News?

Hacker News is a technology-focused social news site. Users can ask questions or submit stories, and receive votes and comments from other users.

Posts can have labels such as **Ask HN** or **Show HN**. 

Ask HN posts ask the Hacker News community a specific question.

Show HN posts present something to the Hacker News community.

## Questions


1. Do Ask HN or Show HN posts receive more comments on average?

2. Do posts created at a certain time receive more comments on average?


## Dataset 
[Original](https://www.kaggle.com/hacker-news/hacker-news-posts)

[Modified](https://app.dataquest.io/31d43d5f-8b12-4cb8-b62e-c27f99eb7fb4)


# Importing Data

In [1]:
from csv import reader
import datetime as dt

# open the file in a with-block so it is closed once the rows are read
with open("hacker_news.csv", newline="") as o_file:
    r_file = reader(o_file)
    hacker_news = list(r_file)
hacker_news_h = hacker_news[0]  # save header
hacker_news = hacker_news[1:]  # remove header from main dataset

print(hacker_news_h, "\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 



## Important Columns
0 id: the unique identifier from Hacker News for the post

1 title: the title of the post

2 url: the URL that the post links to, if the post has a URL

3 num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes

4 num_comments: the number of comments on the post

5 author: the username of the person who submitted the post

6 created_at: the date and time of the post's submission
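The column positions listed above can also be looked up by name instead of being hard-coded. The following is a minimal sketch, assuming the header row printed by the import cell; the names `header`, `col`, and `row` are illustrative and do not appear in the rest of the notebook.

```python
# Header row as printed by the import cell.
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its position in a row.
col = {name: index for index, name in enumerate(header)}

# One row from the dataset, used here only to demonstrate the lookup.
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte',
       '8/4/2016 11:52']

print(col['num_comments'], col['created_at'])  # 4 6
print(int(row[col['num_comments']]))           # 52
```

Looking up `col['num_comments']` instead of writing `4` makes later cells easier to read, at the cost of one extra dictionary.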

## Sample of Data

In [2]:
print(hacker_news[0], "\n")
print(hacker_news[1], "\n")
print(hacker_news[2], "\n")

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 



# Initial Data Cleaning

We only need posts whose titles begin with "Ask HN" or "Show HN".

Python's str.startswith() returns True or False depending on whether a string starts with a given prefix, which lets us check each title and filter the posts.

The dataset will be sorted into three lists: ask_posts, show_posts, and other_posts.
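As a small illustration of the prefix check described above, here is a sketch of a helper that labels a single title. The function name `classify` is an assumption made for this example; the notebook itself does the same test inline in a loop.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # normalise case so 'ASK HN' and 'Ask HN' both match
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Titles taken from rows in this dataset.
titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform",
    "Interactive Dynamic Video",
]
labels = [classify(t) for t in titles]
print(labels)  # ['ask', 'show', 'other']
```

Because startswith() only inspects the beginning of the string, a title that merely mentions "Ask HN" later on is still labelled "other".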


In [3]:
# 3 lists
ask_posts, show_posts, other_posts = [], [], []

# sort data by post type
for post in hacker_news:
    if post[1].lower().startswith("ask hn"):
        ask_posts.append(post)
    elif post[1].lower().startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)

# check
print("Ask Posts Count: ", len(ask_posts))
print(ask_posts[0], "\n")
print("Show Posts Count: ", len(show_posts))
print(show_posts[0], "\n")
print("Other Posts Count: ", len(other_posts))
print(other_posts[0], "\n")

Ask Posts Count:  1744
['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'] 

Show Posts Count:  1162
['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'] 

Other Posts Count:  17194
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 



## Finding Comment Averages
To answer questions about posts and their comments, we first need to isolate the comment counts.

The average number of comments per post for each post type is a good place to start.

In [4]:
# comment index is 4

# ask comments
total_ask_comments = 0
avg_ask_comments = 0

for comment in ask_posts:
    total_ask_comments += int(comment[4])
avg_ask_comments = total_ask_comments / len(ask_posts)

results = "A ask_hn post gets an avg of {:.2f} comments, with a total of {} comments found across all posts".format(
    avg_ask_comments, total_ask_comments
)
print(results, "\n")

# show comments
total_show_comments = 0
avg_show_comments = 0

for comment in show_posts:
    total_show_comments += int(comment[4])
avg_show_comments = total_show_comments / len(show_posts)

results = "A show_hn post gets an avg of {:.2f} comments, with a total of {} comments found across all posts".format(
    avg_show_comments, total_show_comments
)
print(results, "\n")

# other comments
total_other_comments = 0
avg_other_comments = 0

for comment in other_posts:
    total_other_comments += int(comment[4])
avg_other_comments = total_other_comments / len(other_posts)

results = "A other_hn post gets an avg of {:.2f} comments, with a total of {} comments found across all posts".format(
    avg_other_comments, total_other_comments
)
print(results, "\n")

A ask_hn post gets an avg of 14.04 comments, with a total of 24483 comments found across all posts 

A show_hn post gets an avg of 10.32 comments, with a total of 11988 comments found across all posts 

A other_hn post gets an avg of 26.87 comments, with a total of 462055 comments found across all posts 
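The three loops above repeat the same steps, so they could be folded into one helper. This is a sketch only: `comment_stats` and its `comment_index` parameter are hypothetical names, and the sample rows are the three rows printed in the data sample earlier.

```python
def comment_stats(posts, comment_index=4):
    """Return (total, average) comment counts for a list of rows."""
    total = sum(int(row[comment_index]) for row in posts)
    # Guard against an empty list to avoid dividing by zero.
    average = total / len(posts) if posts else 0.0
    return total, average


sample = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte',
     '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time',
     'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
     '39', '10', 'josep2', '1/26/2016 19:30'],
    ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke",
     'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
     '2', '1', 'vezycash', '6/23/2016 22:20'],
]

total, average = comment_stats(sample)
print("total={} avg={:.2f}".format(total, average))  # total=63 avg=21.00
```

With the full lists loaded, calling `comment_stats(ask_posts)` would be expected to reproduce the Ask HN figures printed above.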



# Question 1 Results


Do Ask HN or Show HN posts receive more comments on average?

Our findings suggest that **Ask** posts receive on average **3.72 more comments** than Show posts (14.04 versus 10.32 comments per post).
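The 3.72 figure follows directly from the totals and counts printed earlier. A short check, with variable names chosen only for this example:

```python
# Totals and post counts as printed by the earlier cells.
ask_total, ask_count = 24483, 1744
show_total, show_count = 11988, 1162

ask_avg = ask_total / ask_count
show_avg = show_total / show_count
difference = ask_avg - show_avg

print("{:.2f} - {:.2f} = {:.2f}".format(ask_avg, show_avg, difference))
# 14.04 - 10.32 = 3.72
```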

# Next Steps

Now that we know Ask posts receive more comments, we want to know whether posting during a certain hour increases the number of comments received. Using the ask_posts list, we can build frequency tables keyed by the hour of posting:

1. Count the Ask posts and sum their comments for each hour
2. Compute the average number of comments per post for each hour
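The two steps can be sketched on a tiny sample before running them on the real data. This version uses `collections.defaultdict` as one possible way to skip the "key not seen yet" branch; the `sample` pairs and the dictionary names are assumptions for illustration, and only the first timestamp comes from the dataset.

```python
from collections import defaultdict
import datetime as dt

# (created_at, num_comments) pairs. The first is from an Ask HN row in the
# dataset; the other two are made-up values for illustration only.
sample = [("8/16/2016 9:55", 6), ("8/16/2016 9:10", 4), ("8/17/2016 15:30", 20)]

posts_per_hour = defaultdict(int)
comments_per_hour = defaultdict(int)

# Step 1: count posts and sum comments for each hour.
for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    posts_per_hour[hour] += 1
    comments_per_hour[hour] += num_comments

# Step 2: average comments per post for each hour.
avg_per_hour = {h: comments_per_hour[h] / posts_per_hour[h] for h in posts_per_hour}
print(avg_per_hour)  # {9: 5.0, 15: 20.0}
```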

In [5]:
# askposts
# index 6, created_at: the date and time of the post's submission
# '8/16/2016 9:55'
# index 4, comments

# print(ask_posts[0])

# print(dt.datetime.strptime(ask_posts[0][6], "%m/%d/%Y %H:%M"))  # 2016-08-16 09:55:00


hour_freq = {}
comment_freq = {}

for post in ask_posts:
    comment = int(post[4])

    # parse the timestamp and keep only the hour of day (0-23)
    hour = dt.datetime.strptime(post[6], "%m/%d/%Y %H:%M").hour

    if hour not in hour_freq:
        hour_freq[hour] = 1
        comment_freq[hour] = comment
    else:
        hour_freq[hour] += 1
        comment_freq[hour] += comment

print(hour_freq, "\n")
print(comment_freq)

{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58} 

{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}


## Finding the Average Comments per Hour

In [6]:
avg_comments_hourly = []

for hour in hour_freq:
    avg_comments_hourly.append([hour, comment_freq[hour] / hour_freq[hour]])

print(avg_comments_hourly)

[[9, 5.5777777777777775], [13, 14.741176470588234], [10, 13.440677966101696], [14, 13.233644859813085], [16, 16.796296296296298], [23, 7.985294117647059], [12, 9.41095890410959], [17, 11.46], [15, 38.5948275862069], [21, 16.009174311926607], [20, 21.525], [2, 23.810344827586206], [18, 13.20183486238532], [3, 7.796296296296297], [5, 10.08695652173913], [19, 10.8], [1, 11.383333333333333], [22, 6.746478873239437], [8, 10.25], [4, 7.170212765957447], [0, 8.127272727272727], [6, 9.022727272727273], [7, 7.852941176470588], [11, 11.051724137931034]]


## Sorting the Averages

In [7]:
# swap each [hour, avg] pair to [avg, hour] so sorted() orders by average

avg_com_swap = []

for avg in avg_comments_hourly:
    # print(avg)
    avg_com_swap.append([avg[1], avg[0]])

# print(avg_com_swap)

# sorting
comments_final = sorted(avg_com_swap, reverse=True)

print("Top 5 Hours for Ask Posts Comments", "\n")

# slice so that only the five highest averages are printed
for avg in comments_final[:5]:
    s = "Hour {} - Average of {} comments".format(avg[1], avg[0])
    print(s)

Top 5 Hours for Ask Posts Comments 

Hour 15 - Average of 38.5948275862069 comments
Hour 2 - Average of 23.810344827586206 comments
Hour 20 - Average of 21.525 comments
Hour 16 - Average of 16.796296296296298 comments
Hour 21 - Average of 16.009174311926607 comments
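Swapping each pair is one way to sort by average. Another option, sketched here, is to pass a `key` function to `sorted()` and leave the pairs as `[hour, average]`. The list below is a subset of the `avg_comments_hourly` output shown earlier, and `top` is a name introduced only for this example.

```python
# A few [hour, average] pairs copied from the avg_comments_hourly output.
avg_comments_hourly = [
    [9, 5.5777777777777775],
    [13, 14.741176470588234],
    [16, 16.796296296296298],
    [15, 38.5948275862069],
    [2, 23.810344827586206],
]

# Sort by the second element (the average), highest first, and keep three.
top = sorted(avg_comments_hourly, key=lambda pair: pair[1], reverse=True)[:3]

for hour, avg in top:
    print("Hour {:02d}: {:.2f} comments on average".format(hour, avg))
```

On this subset the loop lists hours 15, 2, and 16, in that order.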

## Do posts created at a certain time receive more comments on average?

**Yes, in this dataset.**

Ask posts created during hours 15, 2, 20, 16, and 21 receive the most comments on average, with hour 15 well ahead at about 38.59 comments per post.

# Final Conclusions

## Questions
1. Do Ask HN or Show HN posts receive more comments on average?

Ask HN posts receive more comments on average than Show HN posts.

2. Do posts created at a certain time receive more comments on average?

Posting time does affect the average number of comments received.

# Conclusion

Among Ask HN posts in this dataset, those created during hours 15, 2, and 20 averaged more than 20 comments each. This is an observed average, not a guarantee for any individual post.