<a href="https://colab.research.google.com/github/mightyPetra/python_playground/blob/master/HackerNewsProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Hacker News Posts


Hacker News is a message board similar to Reddit, where user-created content can be voted and commented on. In this project we analyze user-submitted stories, specifically posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the community a project, a product, or something generally interesting. Concretely, we want to answer the following questions:

* Do Ask HN or Show HN receive more comments on average?
* Do posts created at a certain time receive more comments on average?

### 1. First, let's load the data

The [dataset documentation](https://www.kaggle.com/hacker-news/hacker-news-posts) is available on Kaggle.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [17]:
from csv import reader

# Read the whole CSV into a list of rows; the with-block closes the file afterwards
with open('/content/drive/My Drive/Colab Notebooks/hacker_news.csv') as file:
  hn = list(reader(file))

header = hn[0]
hn = hn[1:]

print(header)
print(hn[:6])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'], ['10482257', 'Ti
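
The loading step can be tried without the Drive mount. The following is a minimal sketch, assuming a hypothetical in-memory sample with the same column layout as `hacker_news.csv`; the helper name `load_rows` and the sample rows are illustrative and not part of the original notebook.

```python
import csv
import io

# Hypothetical two-row sample using the same columns as hacker_news.csv
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    '1,"Ask HN: Example, with a comma?",,5,3,alice,8/4/2016 11:52\n'
    "2,Show HN: Example project,http://example.com,10,1,bob,1/26/2016 19:30\n"
)

def load_rows(handle):
    """Return (header, rows) from an open CSV file handle."""
    rows = list(csv.reader(handle))
    return rows[0], rows[1:]

header, rows = load_rows(sample_csv)
print(header)
print(len(rows), rows[0][1])
```

Because `csv.reader` handles quoted fields, the comma inside the first title does not split the row.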

### 2. Now split the posts by title to separate *Ask HN*, *Show HN* and all the other posts

In [11]:
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
  title = post[1].lower()
  if title.startswith('ask hn'):
    ask_posts.append(post)
  elif title.startswith('show hn'):
    show_posts.append(post)
  else:
    other_posts.append(post)

print(f'Ask HN posts: {len(ask_posts)}',  f'Show HN posts: {len(show_posts)}', f'Other posts: {len(other_posts)}', sep = '\n')

Ask HN posts: 1744
Show HN posts: 1162
Other posts: 17194
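
The same prefix-based split can be written as a reusable function. This is only a sketch on made-up rows; `split_by_prefix` and the sample data are hypothetical names introduced for illustration.

```python
def split_by_prefix(posts, prefixes, title_index=1):
    """Group posts by the first matching title prefix (case-insensitive)."""
    groups = {prefix: [] for prefix in prefixes}
    groups['other'] = []
    for post in posts:
        title = post[title_index].lower()
        for prefix in prefixes:
            if title.startswith(prefix):
                groups[prefix].append(post)
                break
        else:
            # No prefix matched, so the post goes to the catch-all group
            groups['other'].append(post)
    return groups

# Hypothetical rows: only the id and title columns are filled in
sample_posts = [
    ['1', 'Ask HN: How do you learn?'],
    ['2', 'SHOW HN: My side project'],
    ['3', 'Interactive Dynamic Video'],
    ['4', 'ask hn: best books?'],
]

groups = split_by_prefix(sample_posts, ['ask hn', 'show hn'])
for name, members in groups.items():
    print(name, len(members))
```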


### 3. Now let's determine whether *Ask HN* or *Show HN* posts receive more comments on average

In [15]:
def total_comment_count(list_of_posts):
  total_comments = 0
  for p in list_of_posts:
    comment_count = int(p[4])
    total_comments += comment_count
  return total_comments

def average_comment_count(list_of_posts):
  total = total_comment_count(list_of_posts)
  return total/len(list_of_posts)


total_ask_comments = total_comment_count(ask_posts)
total_show_comments = total_comment_count(show_posts)

avg_ask_comments = average_comment_count(ask_posts)
avg_show_comments = average_comment_count(show_posts)

print('Average Ask HN comment count:', avg_ask_comments)
print('Average Show HN comment count:', avg_show_comments)


Average Ask HN comment count: 14.038417431192661
Average Show HN comment count: 10.31669535283993


> 3.1 As the averages show, *Ask HN* posts receive more comments than *Show HN* posts (about 14.0 versus 10.3 per post), which makes intuitive sense since these posts ask questions. We will therefore focus the rest of the analysis on *Ask HN* posts.
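
As a cross-check of the averaging logic, here is a compact sketch using `statistics.mean` from the standard library. The function name `average_comments` and the sample rows are hypothetical. Unlike the functions above, it returns `0.0` for an empty list instead of raising `ZeroDivisionError`.

```python
from statistics import mean

def average_comments(posts, comments_index=4):
    """Mean comment count of the posts, or 0.0 if the list is empty."""
    counts = [int(post[comments_index]) for post in posts]
    return mean(counts) if counts else 0.0

# Hypothetical rows in the dataset's column order; num_comments is column 4
sample_posts = [
    ['1', 'Ask HN: a', '', '5', '10', 'alice', '8/4/2016 11:52'],
    ['2', 'Ask HN: b', '', '3', '20', 'bob', '1/26/2016 19:30'],
]

print(average_comments(sample_posts))
print(average_comments([]))
```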

### 4. Next we will determine whether posts made at a certain time of day are more likely to receive comments. For that, let's calculate the number of ask posts created per hour, along with the total number of comments.

> * post_count_by_hour: contains the number of ask posts created during each hour of the day.
> * comment_count_by_hour: contains the total number of comments received by ask posts created during each hour.

In [70]:
import datetime as dt

result_list = []

for post in ask_posts:
  posted_at = post[6]
  num_comments = int(post[4])
  result_list.append([posted_at, num_comments])

post_count_by_hour = {}
comment_count_by_hour = {}

for r in result_list:
  # date format => 8/4/2016 11:52
  date_time = dt.datetime.strptime(r[0], '%m/%d/%Y %H:%M')
  hr_posted = date_time.strftime('%H')
  comment_count = r[1]

  if hr_posted in post_count_by_hour:
    post_count_by_hour[hr_posted] += 1
    comment_count_by_hour[hr_posted] += comment_count
  else:
    post_count_by_hour[hr_posted] = 1
    comment_count_by_hour[hr_posted] = comment_count

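# max() on dict items compares the (hour, count) tuples by hour first, so this
# returns the entry for the latest hour key ('23'), not the hour with the highest count.
# To rank by count instead, pass key=lambda item: item[1].
print('Post count for the latest hour:', max(post_count_by_hour.items()))
print('Comment count for the latest hour:', max(comment_count_by_hour.items()))


Post count for the latest hour: ('23', 68)
Comment count for the latest hour: ('23', 543)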
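
Calling `max()` on `dict.items()` compares the tuples key-first, so it picks the largest hour key rather than the largest count. The sketch below shows the difference on a small made-up sample and uses `collections.Counter` for the per-hour tallies. The sample data and variable names are hypothetical.

```python
import datetime as dt
from collections import Counter

# Hypothetical (created_at, num_comments) pairs in the dataset's date format
timestamps_and_comments = [
    ('8/4/2016 11:52', 4),
    ('1/26/2016 19:30', 10),
    ('6/23/2016 11:05', 1),
    ('9/30/2015 23:12', 2),
]

posts_per_hour = Counter()
comments_per_hour = Counter()
for created_at, num_comments in timestamps_and_comments:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    posts_per_hour[hour] += 1
    comments_per_hour[hour] += num_comments

# Without a key, tuples are compared by hour string first
latest_hour_entry = max(posts_per_hour.items())
# With a key, entries are ranked by their count
busiest_hour = max(posts_per_hour.items(), key=lambda item: item[1])
most_commented_hour = max(comments_per_hour.items(), key=lambda item: item[1])

print(latest_hour_entry)     # ('23', 1)
print(busiest_hour)          # ('11', 2)
print(most_commented_hour)   # ('19', 10)
```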


### 5. Now, let's calculate the average comment count per post for each hour

In [86]:
avg_by_hour = []

for hour, comment_count in comment_count_by_hour.items():
   avg_comment_count = comment_count/post_count_by_hour[hour]
   avg_by_hour.append([avg_comment_count, hour])

avg_by_hour = sorted(avg_by_hour, reverse=True)

print('Top 5 Hours for Ask Posts Comments')
for avg in avg_by_hour[:5]:
  hour = dt.datetime.strptime(avg[1],'%H').strftime('%H:%M ET')
  print(f'{hour} : {avg[0]:.2f} average comments per post')

Top 5 Hours for Ask Posts Comments
15:00 ET : 38.59 average comments per post
02:00 ET : 23.81 average comments per post
20:00 ET : 21.52 average comments per post
16:00 ET : 16.80 average comments per post
21:00 ET : 16.01 average comments per post


In [103]:
from pytz import timezone

# UTC offset of US/Eastern at the time the cell runs (-5 h under EST, -4 h under EDT)
delta = dt.datetime.now(timezone('US/Eastern')).utcoffset()

print('Top 5 Hours for Ask Posts Comments (UTC)')
for avg in avg_by_hour[:5]:
  # UTC = local time - UTC offset
  hour = (dt.datetime.strptime(avg[1],'%H') - delta).strftime('%H:%M UTC')
  print(f'{hour} : {avg[0]:.2f} average comments per post')

Top 5 Hours for Ask Posts Comments (UTC)
20:00 UTC : 38.59 average comments per post
07:00 UTC : 23.81 average comments per post
01:00 UTC : 21.52 average comments per post
21:00 UTC : 16.80 average comments per post
02:00 UTC : 16.01 average comments per post


### Conclusions

From the data collected we can conclude that *Ask HN* posts receive the most comments on average when posted around 15:00, 02:00 and 20:00 ET.
Furthermore, converted to UTC with the EST offset of five hours, these peaks fall at 20:00, 07:00 and 01:00, so for readers in Europe most of the activity is concentrated in the evening and the early morning.
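
Converting an Eastern hour to UTC means subtracting the zone's UTC offset, which is negative for US/Eastern. The following is a minimal sketch using only the standard library. It assumes a fixed EST offset of UTC-5 (during daylight saving time the offset is UTC-4), and the helper name `eastern_hour_to_utc` is hypothetical.

```python
import datetime as dt

# Assumption: fixed EST offset (UTC-5); under daylight saving time it would be UTC-4
EST = dt.timezone(dt.timedelta(hours=-5), name='EST')

def eastern_hour_to_utc(hour, tz=EST):
    """Convert an hour string such as '15' in the given zone to 'HH:MM UTC'."""
    # The date is an arbitrary reference day; only the hour matters here
    local = dt.datetime(2016, 1, 1, int(hour), tzinfo=tz)
    return local.astimezone(dt.timezone.utc).strftime('%H:%M UTC')

for hour in ['15', '02', '20']:
    print(hour, '->', eastern_hour_to_utc(hour))
# 15 -> 20:00 UTC
# 02 -> 07:00 UTC
# 20 -> 01:00 UTC
```

Attaching the offset as `tzinfo` and calling `astimezone` avoids the sign mistakes that creep in when a `timedelta` is added or subtracted by hand.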