## Analysis of Hacker News Submissions

_Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result._

_Below are descriptions of the columns of the dataset, collated for submissions that received comments:_

- **id**: The unique identifier from Hacker News for the post
- **title**: The title of the post
- **url**: The URL that the post links to, if the post has a URL
- **num_points**: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- **num_comments**: The number of comments that were made on the post
- **author**: The username of the person who submitted the post
- **created_at**: The date and time at which the post was submitted

### Purpose of the analysis

_We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting._

_We'll compare these two types of posts to determine the following:_

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

_Let's start by importing the libraries we need and reading the data set into a list of lists._

In [2]:
from csv import reader

with open("hacker_news.csv", 'r', errors='ignore') as f:  # the with block closes the file automatically
    hn = list(reader(f))
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]
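As the output above suggests, `reader` returns every value as a string, including the numeric columns. The following is a minimal sketch of that behavior, using a made-up in-memory sample (`sample_csv` is a hypothetical stand-in for `hacker_news.csv`) so that it runs on its own:

```python
import io
from csv import reader

# Hypothetical two-line sample standing in for hacker_news.csv.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,5,3,alice,9/26/2016 2:53\n"
)

# io.StringIO behaves like an open text file, so reader() works on it unchanged.
with io.StringIO(sample_csv) as f:
    rows = list(reader(f))

header, first_row = rows[0], rows[1]
print(header)
print(first_row)

# Every field is a string, so numeric columns need an explicit conversion.
num_comments = int(first_row[4])
print(type(first_row[4]).__name__, num_comments)
```

This is why the later cells wrap `row[4]` and `row[3]` in `int()` before doing arithmetic.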


### Removing Headers from a List of Lists

_Notice that the first inner list contains the column headers, while each list after it contains the data for one row. To analyze the data, we first need to remove the row containing the column headers. Let's remove that first row._

In [3]:
headers = hn[0] # assign first row of the data to the variable headers
hn = hn[1:]      # remove the header by assigning the rest of the data to hn
print(headers)
print()
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Ã‚Â“the-data-vaultÃ‚Â”', '1', '0', 'markgainor1', '9/26/2016 3:14']]


### Extracting Ask HN and Show HN Posts

_Now that we've removed the headers from hn, we're ready to filter our data. Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles._

In [4]:
ask_posts   = []
show_posts  = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("length of 'ask_posts' is {}".format(len(ask_posts)))
print("length of 'show_posts' is {}".format(len(show_posts)))
print("length of 'other_posts' is {}".format(len(other_posts)))

length of 'ask_posts' is 9139
length of 'show_posts' is 10158
length of 'other_posts' is 273822
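The prefix test can also be isolated in a small helper, which makes the case-insensitive matching easy to check on its own. This is only a sketch: `classify` and the sample titles are hypothetical names and data introduced for illustration, not part of the notebook.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Made-up titles for illustration.
titles = ["Ask HN: How do you learn?", "SHOW HN: My side project", "algorithmic music"]
labels = [classify(t) for t in titles]
print(labels)
```

Lower-casing the title first means that variants such as `ASK HN` or `show hn` land in the same bucket, which matches what the loop above does.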


### Calculate the average number of comments for the Ask Posts and the Show Posts

Let us determine if ask posts or show posts receive more comments on average.

In [5]:
def avg_comments(posts):
    """Takes a list of posts (each a list of column values) and
        returns the average number of comments per post"""
    total_comments = 0

    for row in posts:
        num_comment = int(row[4])
        total_comments += num_comment
    
    return total_comments/len(posts)

In [6]:
print("The average number of ask comments is {}".format(int(avg_comments(ask_posts))))
print("The average number of show comments is {}".format(int(avg_comments(show_posts))))
print("The average number of other comments is {}".format(int(avg_comments(other_posts))))

The average number of ask comments is 10
The average number of show comments is 4
The average number of other comments is 6


_On average, the data shows that ask posts receive more comments than show posts (about 10 versus about 4 per post). Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts._
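Note that the printed averages are passed through `int()`, which drops the fractional part. A variant that keeps the precision and also tolerates an empty list could look like the sketch below; `mean_comments` and `toy_posts` are hypothetical names, and the rows are made up in the dataset's column order.

```python
def mean_comments(posts, comments_col=4):
    """Average of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    return sum(int(row[comments_col]) for row in posts) / len(posts)

# Made-up rows: id, title, url, num_points, num_comments, author, created_at.
toy_posts = [
    ["1", "Ask HN: A?", "", "2", "10", "alice", "9/26/2016 2:53"],
    ["2", "Ask HN: B?", "", "1", "5", "bob", "9/26/2016 1:17"],
]

print(round(mean_comments(toy_posts), 2))  # average kept as a float
print(mean_comments([]))                   # no ZeroDivisionError on empty input
```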

_Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:_

 - _Calculate the number of ask posts created in each hour of the day, along with the number of comments received._
 - _Calculate the average number of comments ask posts receive by hour created._
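The two steps above can be sketched on toy data as follows. This is an illustration under assumptions, not the notebook's own implementation: the `(hour, num_comments)` pairs are made up, and `collections.defaultdict` is used here as an alternative to explicit `if`/`else` branches.

```python
from collections import defaultdict

# Made-up (hour, num_comments) pairs; in the notebook the hour comes from created_at.
toy = [("15", 30), ("15", 10), ("09", 4)]

counts = defaultdict(int)    # step 1a: number of posts per hour
comments = defaultdict(int)  # step 1b: total comments per hour
for hour, n in toy:
    counts[hour] += 1
    comments[hour] += n

# Step 2: average comments per post for each hour.
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```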

### Create a list of lists containing the creation time of each post and its number of comments

In [7]:
# CREATE A LIST OF LISTS CONTAINING THE TIME EACH POST WAS CREATED AND ITS NUMBER OF COMMENTS

import datetime as dt

result_list = []
for row in ask_posts:
    date = row[6]
    comment = int(row[4])
    result_list.append([date, comment])        
    
result_list[:5]

[['9/26/2016 2:53', 7],
 ['9/26/2016 1:17', 3],
 ['9/25/2016 22:57', 0],
 ['9/25/2016 22:48', 3],
 ['9/25/2016 21:50', 2]]

In [8]:
counts_by_hour = {}   # number of ask posts created during each hour of the day.
comments_by_hour = {} # corresponding number of comments ask posts created at each hour received.

for row in result_list:
    created = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")  # parse the creation time
    hour = created.strftime("%H")                             # zero-padded hour string, e.g. '02'
    
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
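The hour extraction relies on `datetime.strptime` to parse the timestamp and `strftime("%H")` to format the hour as a zero-padded string. A minimal sketch on one timestamp in the dataset's format:

```python
import datetime as dt

stamp = "9/26/2016 2:53"   # same format as the created_at column
parsed = dt.datetime.strptime(stamp, "%m/%d/%Y %H:%M")

hour_str = parsed.strftime("%H")   # zero-padded string, used as the dictionary key
hour_int = parsed.hour             # the same hour as a plain integer
print(hour_str, hour_int)
```

Using the string form keeps the keys sortable as text (`'02'` before `'10'`), which is what the later `sorted(comments_by_hour)` call depends on.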

In [9]:
print("Frequency of ask posts per hour:")
counts_by_hour

Frequency of ask posts per hour:


{'02': 269,
 '01': 282,
 '22': 383,
 '21': 518,
 '19': 552,
 '17': 587,
 '15': 646,
 '14': 513,
 '13': 444,
 '11': 312,
 '10': 282,
 '09': 222,
 '07': 226,
 '03': 271,
 '23': 343,
 '20': 510,
 '16': 579,
 '08': 257,
 '00': 301,
 '18': 614,
 '12': 342,
 '04': 243,
 '06': 234,
 '05': 209}

In [10]:
print("Total comments received per hour:")
comments_by_hour

Total comments received per hour:


{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '23': 2297,
 '20': 4462,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}

### Calculating the Average Number of Comments for Ask HN Posts by Hour

In [11]:
avg_by_hour = []

for hr in sorted(comments_by_hour):
    avg_by_hour.append([hr, comments_by_hour[hr]/counts_by_hour[hr]])

In [12]:
avg_by_hour[:5]

[['00', 7.5647840531561465],
 ['01', 7.407801418439717],
 ['02', 11.137546468401487],
 ['03', 7.948339483394834],
 ['04', 9.7119341563786]]

### Sorting and Printing Values from a List of Lists

In [13]:
def swap_sort_list(avg_hour):
    """Takes a list of [hour, average] pairs and returns a list of
        [average, hour] pairs sorted by average in descending order."""
    swap_avg_by_hour = []
    
    for row in avg_hour:                # loop through the list of lists
        swap_avg_by_hour.append([row[1], row[0]]) # swap the content of each list

    sorted_swap = sorted(swap_avg_by_hour, reverse=True)
    return sorted_swap
    
sorted_swap         = swap_sort_list(avg_by_hour) # call the function
sorted_swap_top5    = sorted_swap[:5]             # select the top 5 rows
sorted_swap_bottom5 = sorted_swap[-5:]            # select the bottom 5 rows
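Swapping the columns is one way to sort by the average. An alternative, sketched below on made-up values, is to leave the pairs as `[hour, average]` and pass a `key` function to `sorted`. The name `toy_avg` and its values are hypothetical.

```python
# Made-up [hour, average] pairs in the same shape as avg_by_hour.
toy_avg = [["00", 7.5], ["15", 28.7], ["09", 5.6]]

# Sort by the second element (the average), highest first, without swapping columns.
ranked = sorted(toy_avg, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print("{}:00: {:.2f} average comments per post.".format(hour, avg))
```

One difference to be aware of: when two hours have exactly the same average, the swapped version breaks the tie by hour (in descending order), whereas the `key` version keeps the tied pairs in their original order.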

In [14]:
print("Top 5 hours for Ask HN comments:")
for row in sorted_swap_top5:
    hr = dt.datetime.strptime(row[1], "%H").strftime("%H:%M")
    avg = row[0]
    print("{}: {:,.2f} average comments per post.".format(hr, avg))

Top 5 hours for Ask HN comments:
15:00: 28.68 average comments per post.
13:00: 16.32 average comments per post.
12:00: 12.38 average comments per post.
02:00: 11.14 average comments per post.
10:00: 10.68 average comments per post.


In [18]:
print("Bottom 5 hours for Ask HN comments:")
for row in sorted_swap_bottom5:
    hr = dt.datetime.strptime(row[1], "%H").strftime("%H:%M")
    avg = row[0]
    print("{}: {:,.2f} average comments per post.".format(hr, avg))

Bottom 5 hours for Ask HN comments:
07:00: 7.85 average comments per post.
03:00: 7.80 average comments per post.
04:00: 7.17 average comments per post.
22:00: 6.75 average comments per post.
09:00: 5.58 average comments per post.


### Determine if show or ask posts receive more points on average

In [19]:
sum_ask_point = 0
sum_show_point = 0

for row in ask_posts:
    sum_ask_point += int(row[3])
    
for row in show_posts:
    sum_show_point += int(row[3])
    
print("The average points for the ask posts is {:,.2f}".format(sum_ask_point/len(ask_posts)))
print("The average points for the show posts is {:,.2f}".format(sum_show_point/len(show_posts)))

The average points for the ask posts is 15.06
The average points for the show posts is 27.56


### Determine if posts created at a certain time are more likely to receive more points.

We will create helper functions that can be reused for both ask posts and show posts.

In [36]:
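def date_points(date_col, point_col, post):
    """Returns a list of [date, points] pairs, one per row of post.

    date_col and point_col are the column indexes of the creation date
    and the points in each row."""
    pairs = []   # named pairs so the local list does not shadow the function name
    for row in post:
        date = row[date_col]
        points = int(row[point_col])
        pairs.append([date, points])
    return pairs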

In [23]:
def post_hour_point(post):
    """Returns two dictionaries:
        counts_by_hour: number of posts created during each hour of the day
        points_by_hour: corresponding number of points received"""
    
    counts_by_hour = {}   
    points_by_hour = {}   

    for row in post:
        date = row[0]
        date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
        hour = date.strftime("%H")
    
        if hour in counts_by_hour:
            counts_by_hour[hour] += 1
            points_by_hour[hour] += row[1]
        else:
            counts_by_hour[hour] = 1
            points_by_hour[hour] = row[1]
    return [counts_by_hour, points_by_hour]

#### For the Ask Posts

In [42]:
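# assumed definition (not shown in the notebook): [created_at, num_points] pairs from columns 6 and 3
ask_post_date_point = date_points(6, 3, ask_posts)
ask_post_hp = post_hour_point(ask_post_date_point)
ask_post_hp_count = ask_post_hp[0]   # number of ask posts per hour
ask_post_hp_point = ask_post_hp[1]   # total points of ask posts per hour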

print("The number of ask posts created per hour:")
ask_post_hp_count

The number of ask posts created per hour:


{'00': 55,
 '01': 60,
 '02': 58,
 '03': 54,
 '04': 47,
 '05': 46,
 '06': 44,
 '07': 34,
 '08': 48,
 '09': 45,
 '10': 59,
 '11': 58,
 '12': 73,
 '13': 85,
 '14': 107,
 '15': 116,
 '16': 108,
 '17': 100,
 '18': 109,
 '19': 110,
 '20': 80,
 '21': 109,
 '22': 71,
 '23': 68}

In [43]:
print("The points for the ask posts per hour:")
ask_post_hp_point

The points for the ask posts per hour:


{'00': 451,
 '01': 700,
 '02': 793,
 '03': 374,
 '04': 389,
 '05': 552,
 '06': 591,
 '07': 361,
 '08': 515,
 '09': 329,
 '10': 1102,
 '11': 825,
 '12': 782,
 '13': 2062,
 '14': 1282,
 '15': 3479,
 '16': 2522,
 '17': 1941,
 '18': 1741,
 '19': 1513,
 '20': 1151,
 '21': 1721,
 '22': 511,
 '23': 581}

#### Calculating the Average Number of Points for Ask HN Posts by Hour

In [49]:
def avg_count_point_by_hour(count_hr, point_hr):
    """Returns a list of [hour, average points per post] pairs, sorted by hour"""
    avg_hour = []
    for hr in sorted(point_hr):
        avg_hour.append([hr, point_hr[hr]/count_hr[hr]])
    return avg_hour
        
avg_count_point_by_hour_ask_post = avg_count_point_by_hour(ask_post_hp_count, ask_post_hp_point)
ask_post_swap = swap_sort_list(avg_count_point_by_hour_ask_post)
print("The average ask post points per hour:")
ask_post_swap[:5]

The average ask post points per hour:


[[29.99137931034483, '15'],
 [24.258823529411764, '13'],
 [23.35185185185185, '16'],
 [19.41, '17'],
 [18.677966101694917, '10']]

_The result shows that ask posts created in the 15:00 hour (between 15:00 and 16:00) receive the most points on average, about 30 points per post. This is an average over the posts in this dataset, not a guarantee for any single post._

#### For the Show Posts

In [45]:
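# assumed definition (not shown in the notebook): [created_at, num_points] pairs from columns 6 and 3
show_post_date_point = date_points(6, 3, show_posts)
show_post_hp = post_hour_point(show_post_date_point)
show_post_hp_count = show_post_hp[0]   # number of show posts per hour
show_post_hp_point = show_post_hp[1]   # total points of show posts per hour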

print("The number of show posts created per hour:")
show_post_hp_count

The number of show posts created per hour:


{'00': 31,
 '01': 28,
 '02': 30,
 '03': 27,
 '04': 26,
 '05': 19,
 '06': 16,
 '07': 26,
 '08': 34,
 '09': 30,
 '10': 36,
 '11': 44,
 '12': 61,
 '13': 99,
 '14': 86,
 '15': 78,
 '16': 93,
 '17': 93,
 '18': 61,
 '19': 55,
 '20': 60,
 '21': 47,
 '22': 46,
 '23': 36}

In [46]:
print("The points for the show posts per hour:")
show_post_hp_point

The points for the show posts per hour:


{'00': 1173,
 '01': 700,
 '02': 340,
 '03': 679,
 '04': 386,
 '05': 104,
 '06': 375,
 '07': 494,
 '08': 519,
 '09': 553,
 '10': 681,
 '11': 1480,
 '12': 2543,
 '13': 2438,
 '14': 2187,
 '15': 2228,
 '16': 2634,
 '17': 2521,
 '18': 2215,
 '19': 1702,
 '20': 1819,
 '21': 866,
 '22': 1856,
 '23': 1526}

#### Calculating the Average Number of Points for Show HN Posts by Hour

In [50]:
avg_count_point_by_hour_show_post = avg_count_point_by_hour(show_post_hp_count, show_post_hp_point)
show_post_swap = swap_sort_list(avg_count_point_by_hour_show_post)
print("The average show post points per hour:")
show_post_swap[:5]

The average show post points per hour:


[[42.388888888888886, '23'],
 [41.68852459016394, '12'],
 [40.34782608695652, '22'],
 [37.83870967741935, '00'],
 [36.31147540983606, '18']]

_The result shows that show posts created in the 23:00 hour (between 23:00 and 00:00) receive the most points on average, about 42 points per post, closely followed by the 12:00 hour._

### Conclusion

_In this project, we analyzed data from the Hacker News site, which comprises ask posts (questions on specific topics), show posts (posts about projects, products, or generally interesting things), and other general posts. The objective of the analysis was to determine which posts (ask or show) receive more comments from the Hacker News community, and at what hour a post is likely to receive the most comments._

_The analysis shows that, on average, ask posts receive more comments than show posts. This suggests that Hacker News is a reasonable place to recommend to users looking for answers to (possibly technical) questions. Among ask posts, the hour with the most comments per post is 15:00, with an average of 28.68 comments per post, while the hour with the fewest is 09:00, with an average of 5.58 comments per post._

_This implies that, to maximize the chance of receiving comments on an ask post, the recommendation is to post between 15:00 and 16:00._

_With respect to the number of points received, the following recommendations are proposed:_
 - _to get more points on ask posts, post them between 15:00 and 16:00_
 - _to get more points on show posts, post them between 23:00 and midnight_

In [13]:
a = (5, 6, 7)
b = (3, 6, 10)

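def compareTriplets(a, b):
    """Returns (a_cnt, b_cnt): the number of positions where a is larger and where b is larger; ties score nothing"""
    a_cnt=0
    b_cnt=0
    for i in range(len(a)):
        if a[i] > b[i]:
            a_cnt += 1
        elif a[i] < b[i]:
            b_cnt += 1
    return (a_cnt, b_cnt)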

In [12]:
compareTriplets(a, b)

(1, 1)