 # An Exploration of Hacker News Posts
 #### A DataQuest Guided Project

### 1 Introduction

This notebook is a small exploration of the Hacker News (HN) Posts dataset found on [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts/home). The project comes before the DataQuest curriculum begins to cover `numpy` and `pandas`, so I won't be using those libraries.

The dataset consists of a collection of information pertaining to posts made to HN.  We are particularly interested in the subset of posts whose titles begin with the phrases "Ask HN" or "Show HN".  More specifically, we are interested in which sorts of posts garner the most comments.

Later in the project, we will examine these posts more closely to find the times of day at which they attract the most comments.

### 2 Setup

We start by loading and reading the csv file into a list of lists `hn`.

In [1]:
from csv import reader
import datetime as dt

with open('hacker_news.csv', newline='') as fp:
    hn = list(reader(fp))    

for row in hn[:5]:
    print(row, "\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 



Next, we separate the `headers` from our dataset.

In [3]:
headers = hn[0]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [3]:
hn = hn[1:]

This is a bit beyond the project's scope, but I am going to write a simple object-oriented interface for the rows (posts) in `hn`, just to make the columns easier to access. I will build a new version of `hn` called `HNews`, which will be a list of `Post` objects as defined in the next cell. We will use `HNews` as our dataset throughout the remainder of the project.

In [9]:
class Post(object):

    def __init__(self, _id, title, url, num_points, num_comments, author, 
                 created_at):
        self.id = _id
        self.title = title
        self.url = url
        self.num_points = int(num_points)
        self.num_comments = int(num_comments)
        self.author = author
        self.created_at = created_at  
        self.post_time = dt.datetime.strptime(self.created_at, "%m/%d/%Y %H:%M")
        
        
    def __str__(self):
        """
        Tried to do this by accessing attributes from dir and the evaluating
        'self.{}'.format(attr) for each, but was told that self wasn't defined.
        """
        attrs = [self.id, self.title, self.url, self.num_points,
                 self.num_comments, self.author, self.created_at]
        return str([str(attr) for attr in attrs])
    
    __repr__ = __str__

In [10]:
HNews = []
for row in hn:
    post = Post(*row[:7])  # unpack the seven columns into the constructor
    HNews.append(post)
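
The `__str__` docstring above mentions a failed attempt to build the string by evaluating `'self.{}'.format(attr)` for each attribute. One way to get the same effect without evaluating strings is the built-in `vars`, which returns an instance's attribute dictionary. The sketch below uses a hypothetical, cut-down class named `PostSketch` (not part of the original notebook) to show the idea. Applied to the real `Post` class, this approach would also include `post_time` in the output, so it is not a drop-in replacement.

```python
class PostSketch:
    """Hypothetical stand-in for Post, used only to illustrate a generic __str__."""

    def __init__(self, _id, title, num_comments):
        self.id = _id
        self.title = title
        self.num_comments = int(num_comments)

    def __str__(self):
        # vars(self) is the instance's attribute dictionary. Dictionaries
        # preserve insertion order (Python 3.7+), so the values come back
        # in the order they were assigned in __init__.
        return str([str(value) for value in vars(self).values()])

    __repr__ = __str__


example = PostSketch('12296411', 'Ask HN: How to improve my personal website?', '6')
print(example)
```

If only some attributes should appear, `getattr(self, name)` over an explicit list of names would give finer control.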

### 3 Data Cleaning and Organization

The next cell splits the posts into three categories: those beginning with `Ask HN`, those beginning with `Show HN` and everything else.  

In [11]:
ask_posts, show_posts, other_posts = [], [], []

for post in HNews:
    if post.title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif post.title.lower().startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
        
print(ask_posts[:2], "\n")
print(show_posts[:2], "\n")
print(other_posts[:2], "\n")

print("Number of Ask HN Posts", len(ask_posts))
print("Number of Show HN Posts", len(show_posts))
print("Number of Other Posts", len(other_posts))

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']] 

[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']] 

[['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']] 

Number of Ask HN Posts 1744
Number of Sh

Let's now compute the average number of comments for `Ask` and for `Show` posts.  I'll compute the same result for `Other` posts, and for all posts, for completeness' sake.

In [8]:
def compute_avg_comments(posts):
    """Computes average number of comments for a list of Post objects."""
    total_comments = sum([post.num_comments for post in posts])
    return total_comments / len(posts)

avg_ask_comments = compute_avg_comments(ask_posts)
avg_show_comments = compute_avg_comments(show_posts)
avg_other_comments = compute_avg_comments(other_posts)
avg_all_comments = compute_avg_comments(HNews)

print("Average Number of Comments per Type of Post\n")
print("  ", "Ask Posts:".ljust(20), avg_ask_comments)
print("  ", "Show Posts:".ljust(20), avg_show_comments)
print("  ", "Other Posts:".ljust(20), avg_other_comments)
print("  ", "All Posts:".ljust(20), avg_all_comments)

Average Number of Comments per Type of Post

   Ask Posts:           14.038417431192661
   Show Posts:          10.31669535283993
   Other Posts:         26.8730371059672
   All Posts:           24.80228855721393


It seems that neither `Ask` nor `Show` posts are big comment attractors: they net roughly half as many comments as an arbitrary post would.  If racking up the comments is our desire, we should consider posting something else.

Compared to each other, though, we see that `Ask` posts receive on average about 36% more comments than `Show` posts (14.04 versus 10.32).  This doesn't seem particularly surprising, as the former are actively soliciting responses.  The remainder of this notebook will focus on these `Ask HN` posts, but one could make similar analyses for the other categories.
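
The relative comparisons in the two paragraphs above can be checked with a few lines of arithmetic. This is only a sanity check on the printed averages, which are copied in by hand here. The variable names are my own and do not appear elsewhere in the notebook.

```python
# Averages copied from the output of the previous cell.
avg_ask = 14.038417431192661
avg_show = 10.31669535283993
avg_all = 24.80228855721393

# Percentage by which the Ask average exceeds the Show average.
ask_vs_show = (avg_ask / avg_show - 1) * 100

# Each category's average as a fraction of the overall average.
ask_vs_all = avg_ask / avg_all
show_vs_all = avg_show / avg_all

print("Ask over Show: {:.1f}% more comments".format(ask_vs_show))
print("Ask relative to all posts: {:.2f}".format(ask_vs_all))
print("Show relative to all posts: {:.2f}".format(show_vs_all))
```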

### 4 Analysis of Ask Post Comments with Respect to Time of Day

We continue by looking at the number of comments for a given type of post with respect to the time of the day at which the post was created, using the `post_time` datetime attribute of our `Post` class.  We will compute a frequency dictionary for comments by the hour of creation for each of our different types of posts.

In [12]:
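def get_frequency(dataset):
    """Returns per-hour post counts and per-hour comment totals."""
    counts_by_hour = {}
    comments_by_hour = {}
    for post in dataset:
        hour = post.post_time.hour
        counts_by_hour[hour] = counts_by_hour.get(hour, 0) + 1
        comments_by_hour[hour] = comments_by_hour.get(hour, 0) + post.num_comments
    return counts_by_hour, comments_by_hour

def get_average_comments_by_hour(dataset):
    """Returns the average number of comments per post for each posting hour."""
    counts, comments = get_frequency(dataset)
    # Iterate over the hours actually present, so that an hour with no
    # posts cannot raise a KeyError.
    return {hr: comments[hr] / counts[hr] for hr in sorted(counts)}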

show_avg_comments_hourly = get_average_comments_by_hour(show_posts)
other_avg_comments_hourly = get_average_comments_by_hour(other_posts)
all_avg_comments_hourly = get_average_comments_by_hour(HNews)

ask_avg_comments_hourly = get_average_comments_by_hour(ask_posts)
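
To see the frequency-dictionary pattern in isolation, here is a minimal, self-contained sketch that runs on three hand-written rows rather than on `Post` objects. The first two rows reuse timestamps and comment counts from the `Ask HN` output shown earlier. The third row is invented purely so that one hour has more than one post. It uses `collections.defaultdict` instead of `dict.get`, which is a stylistic alternative and not what the cell above does.

```python
import datetime as dt
from collections import defaultdict

# (created_at, num_comments) pairs; the last row is made up for illustration.
sample = [
    ("8/16/2016 9:55", 6),
    ("11/22/2015 13:43", 29),
    ("5/2/2016 13:05", 1),
]

counts = defaultdict(int)    # number of posts created in each hour
comments = defaultdict(int)  # total comments on posts created in each hour

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    counts[hour] += 1
    comments[hour] += num_comments

# Average comments per post, only for hours that actually occur.
averages = {hour: comments[hour] / counts[hour] for hour in sorted(counts)}
print(averages)
```

Hours with no posts simply do not appear in `averages`, which avoids dividing by zero.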

In [13]:
avg_per_hour = ask_avg_comments_hourly
avg_per_hour

{0: 8.127272727272727,
 1: 11.383333333333333,
 2: 23.810344827586206,
 3: 7.796296296296297,
 4: 7.170212765957447,
 5: 10.08695652173913,
 6: 9.022727272727273,
 7: 7.852941176470588,
 8: 10.25,
 9: 5.5777777777777775,
 10: 13.440677966101696,
 11: 11.051724137931034,
 12: 9.41095890410959,
 13: 14.741176470588234,
 14: 13.233644859813085,
 15: 38.5948275862069,
 16: 16.796296296296298,
 17: 11.46,
 18: 13.20183486238532,
 19: 10.8,
 20: 21.525,
 21: 16.009174311926607,
 22: 6.746478873239437,
 23: 7.985294117647059}

Now, we sort the hours by average number of comments, in descending order, and display the top five.

In [21]:
sorted_hours = sorted(avg_per_hour, key=lambda hour : avg_per_hour[hour],
                      reverse=True)
sorted_ask_results = [[hour, avg_per_hour[hour]] for hour in sorted_hours]

print("Top 5 Hours for Ask HN Posts, by Comments Received".center(64))
print("--------------------------------------------------".center(64))
print("Hour of Posting".center(30) + "Average Number of Comments".center(30))
print("===============".center(30) + "==========================".center(30))

for row in sorted_ask_results[:5]:
    print("{}:00".format(row[0]).center(30) + "{:.2f}".format(row[1]).center(30))

       Top 5 Hours for Ask HN Posts, by Comments Received       
       --------------------------------------------------       
       Hour of Posting          Average Number of Comments  
            15:00                         38.59             
             2:00                         23.81             
            20:00                         21.52             
            16:00                         16.80             
            21:00                         16.01             


According to our results, and since the times given are in my own time zone (Eastern, per the dataset's [description](https://www.kaggle.com/hacker-news/hacker-news-posts/home)), it seems that 3:00 p.m. would be the best time for me to attract comments with an `Ask HN` post.  Good to know.

Of the five best times shown above, two are in the afternoon (3 p.m. and 4 p.m.) and two are in the evening (8 p.m. and 9 p.m.), in my time zone at least.  These are all quite convenient for me as an East Coast American and could suggest that many posters (and commenters) on Hacker News are from this time zone, which would not be too surprising.  But I am sure there is much more to it than this.

For instance, I would expect the West Coast of the US to be a big driver of posts and comments on HN as well.  In Pacific time, the four peak times mentioned above correspond to 12 p.m., 1 p.m., 5 p.m. and 6 p.m., respectively: lunch time and dinner time.  Put down your phones and eat a nice meal for once, California!  Spend some time with friends and family!
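
The West Coast conversion is just a three-hour shift with wrap-around at midnight. The sketch below makes that explicit for all five top hours. It assumes a constant three-hour difference between Eastern and Pacific time. The helper `shift_hour` is a name of my own, not something defined earlier in the notebook.

```python
# Top Eastern-time hours, taken from the table above.
eastern_hours = [15, 2, 20, 16, 21]

# Assumption: Pacific time is a constant three hours behind Eastern time.
PACIFIC_OFFSET = -3

def shift_hour(hour, offset):
    """Shifts an hour of the day by `offset` hours, wrapping around midnight."""
    return (hour + offset) % 24

pacific_hours = [shift_hour(hour, PACIFIC_OFFSET) for hour in eastern_hours]

for eastern, pacific in zip(eastern_hours, pacific_hours):
    print("{:>2}:00 Eastern -> {:>2}:00 Pacific".format(eastern, pacific))
```

Changing `PACIFIC_OFFSET` to another value gives the same table for any other zone at a fixed offset from Eastern time.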

But of course, we Americans are not the only users of Hacker News.  In fact, the country of origin of these posts would make for some interesting further analysis.  I expect that many of the posts made around 2:00 a.m. Eastern, and many of the immediate comments these posts garner, originate outside the States.  Then again, hackers stereotypically do keep odd hours.