# Hacker News Posts

Data info:
* `id`: The unique identifier from Hacker News for the post
* `title`: The title of the post
* `url`: The URL that the post links to, if the post has a URL
* `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: The number of comments that were made on the post
* `author`: The username of the person who submitted the post
* `created_at`: The date and time at which the post was submitted

## Objective:
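We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question, and Show HN posts to show the community a project, product, or something else they find interesting.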

We'll compare these two types of posts to determine the following:

* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

### Import the Data

In [1]:
from csv import reader
open_file = open('hacker_news.csv', encoding='utf-8')
read_file = reader(open_file)
hn = list(read_file)
open_file.close()
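
The same load can also be written with a `with` block, which closes the file even if reading fails partway. The sketch below is only an illustration: `load_rows` is a hypothetical helper, and a tiny temporary CSV stands in for `hacker_news.csv` so the snippet runs on its own.

```python
import csv
import os
import tempfile


def load_rows(path):
    """Read every row of a CSV file into a list of lists (hypothetical helper)."""
    # newline='' is the setting the csv module documentation recommends
    with open(path, encoding='utf-8', newline='') as f:
        return list(csv.reader(f))


# Stand-in data so the example is self-contained (not the real dataset)
sample = 'id,title\n1,"Ask HN: Example, with a comma"\n'
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf-8', newline='') as tmp:
    tmp.write(sample)

rows = load_rows(tmp.name)
os.remove(tmp.name)  # clean up the temporary file
print(rows)
```

With the real data, the call would presumably be `hn = load_rows('hacker_news.csv')`.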

In [4]:
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


**Remove Headers**

In [5]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


### Create lists of posts whose titles start with `Ask HN` and `Show HN`

In [9]:
print(list(enumerate(headers)))

[(0, 'id'), (1, 'title'), (2, 'url'), (3, 'num_points'), (4, 'num_comments'), (5, 'author'), (6, 'created_at')]


In [11]:
hn_dataset_len = len(hn)
print(hn_dataset_len)

20100


In [12]:
ask_posts, show_posts, other_posts = [], [], []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
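
One way to make the prefix check reusable is to move it into a small function. This is a sketch under assumptions: `classify_title` and the sample titles are made up for illustration and are not part of the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()  # lowercase once so the match is case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles, for illustration only
titles = ['Ask HN: How do you learn?', 'SHOW HN: My side project', 'Interactive Dynamic Video']
groups = {'ask': [], 'show': [], 'other': []}
for title in titles:
    groups[classify_title(title)].append(title)

print({name: len(items) for name, items in groups.items()})
```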

#### Explore Distribution

In [14]:
print(f'Number of "Ask HN" Posts: {len(ask_posts)}')
print(f'Number of "Show HN" Posts: {len(show_posts)}')
print(f'Number of other Posts: {len(other_posts)}')

Number of "Ask HN" Posts: 1744
Number of "Show HN" Posts: 1162
Number of other Posts: 17194
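
To put these counts in proportion, they can be converted to shares of the whole dataset. The snippet below is a small sketch that hard-codes the three counts printed above rather than recomputing them from `hn`.

```python
# Counts copied from the output above
counts = {'Ask HN': 1744, 'Show HN': 1162, 'Other': 17194}
total = sum(counts.values())  # should equal the dataset length, 20100

# Percentage of all posts in each group, rounded to two decimals
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}
for label, pct in shares.items():
    print(f'{label}: {pct}% of posts')
```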


### Next, let's determine if ask posts or show posts receive more comments on average.

**Counting Comments in `ask_posts`**

In [20]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
rounded_avg_ask_comments = round(avg_ask_comments, 2)
print(f'Average ask_hn comments: {rounded_avg_ask_comments}')

Average ask_hn comments: 14.04


**Counting Comments in `show_posts`**

In [21]:
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
rounded_avg_show_comments = round(avg_show_comments, 2)
print(f'Average show_hn comments: {rounded_avg_show_comments}')

Average show_hn comments: 10.32
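
The two loops above differ only in the list they read, so they could be folded into one function. This is a sketch, not the notebook's code: `average_comments` is a hypothetical helper and the sample rows are made up, using the same column order as the dataset.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; index 4 matches the header order shown earlier."""
    if not posts:
        raise ValueError('posts must not be empty')
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the same column order as the dataset
sample_posts = [
    ['1', 'Ask HN: A', '', '5', '6', 'alice', '8/16/2016 9:55'],
    ['2', 'Ask HN: B', '', '9', '29', 'bob', '11/22/2015 13:43'],
    ['3', 'Ask HN: C', '', '2', '1', 'carol', '5/2/2016 10:14'],
]
print(round(average_comments(sample_posts), 2))
```

Applied to the notebook's lists, this would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.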


#### Synopsis
Based on this dataset, `ask_posts` receive about 3.7 more comments per post on average than `show_posts` (14.04 versus 10.32).
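
The comparison can be checked with a couple of lines of arithmetic. The values are copied from the two outputs above; the ratio is an extra figure added here for context.

```python
# Averages copied from the two outputs above
avg_ask, avg_show = 14.04, 10.32

difference = round(avg_ask - avg_show, 2)  # absolute gap in comments per post
ratio = round(avg_ask / avg_show, 2)       # ask average as a multiple of the show average
print(difference, ratio)
```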

Let's drill down further into `ask_posts` to determine whether ask posts created at a certain time are more likely to attract comments. This takes two steps:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

In [38]:
def enum_headers(headers):
    """Return (index, name) pairs for the header row, for quick reference."""
    return list(enumerate(headers))

In [39]:
print(enum_headers(headers))

[(0, 'id'), (1, 'title'), (2, 'url'), (3, 'num_points'), (4, 'num_comments'), (5, 'author'), (6, 'created_at')]


In [40]:
import datetime as dt


result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

In [41]:
print(result_list[:5])

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]
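
Before building the frequency tables, it can help to see how one of these timestamps is parsed. The sketch below uses the first `created_at` value from the output above.

```python
import datetime as dt

# First timestamp from the result_list output above
created_at = '8/16/2016 9:55'

# When parsing, %m, %d, and %H accept values without a leading zero
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

print(parsed.hour)            # the hour as an integer
print(parsed.strftime('%H'))  # the hour as a zero-padded string
```

The zero-padded string form is convenient as a dictionary key because every hour has the same width.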


**Create Frequency Tables:**

In [52]:
counts_by_hour, comments_by_hour = {}, {}


for row in result_list:
    num_comments = row[1]
    hour = row[0]
    # Convert to datetime object:
    hour = dt.datetime.strptime(hour, "%m/%d/%Y %H:%M")
    
    # Extract the 'hour' as a string:
    hour = hour.strftime('%H')
    
    # Populate Frequency Tables:
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
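
An alternative to the `if`/`else` branch is `collections.defaultdict`, which starts missing keys at zero. This is a sketch of that variant, run on the first five rows of `result_list` shown earlier rather than on the full list.

```python
import datetime as dt
from collections import defaultdict

# First five rows from the result_list output above
sample = [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1],
          ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]

counts = defaultdict(int)    # posts per hour; missing hours start at 0
comments = defaultdict(int)  # total comments per hour

for created_at, num_comments in sample:
    # Parse the timestamp, then keep only the zero-padded hour as the key
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```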

In [53]:
print(counts_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


In [54]:
print(comments_by_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


**Calculate the average number of comments per post for posts created during each hour of the day**

The result should be a list of lists in which the first element is the hour and the second element is the average number of comments per post.

In [59]:
avg_by_hour = []

# avg comments/post by hour = total comments / count, by hour
for hour in counts_by_hour:
    num_posts = counts_by_hour[hour]
    num_comments = comments_by_hour[hour]
    avg_comments = round(num_comments / num_posts, 2)
    avg_by_hour.append([hour, avg_comments])
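
The same calculation can be written as a list comprehension. The sketch below uses three hours copied from the two dictionaries printed above, so it runs without the full dataset.

```python
# Three hours copied from the counts_by_hour and comments_by_hour outputs above
counts_sample = {'15': 116, '02': 58, '09': 45}
comments_sample = {'15': 4477, '02': 1381, '09': 251}

# Average = total comments for the hour / number of posts in that hour
avg_sample = [[hour, round(comments_sample[hour] / counts_sample[hour], 2)]
              for hour in counts_sample]
print(avg_sample)
```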

#### View Results:

In [61]:
for data in avg_by_hour:
    print(data)

['09', 5.58]
['13', 14.74]
['10', 13.44]
['14', 13.23]
['16', 16.8]
['23', 7.99]
['12', 9.41]
['17', 11.46]
['15', 38.59]
['21', 16.01]
['20', 21.52]
['02', 23.81]
['18', 13.2]
['03', 7.8]
['05', 10.09]
['19', 10.8]
['01', 11.38]
['22', 6.75]
['08', 10.25]
['04', 7.17]
['00', 8.13]
['06', 9.02]
['07', 7.85]
['11', 11.05]


#### Sort Results for Easier Viewing

In [63]:
swap_avg_by_hour = []

for hour in avg_by_hour:
    swap_avg_by_hour.append([hour[1], hour[0]])

In [65]:
for hour in swap_avg_by_hour:
    print(hour)

[5.58, '09']
[14.74, '13']
[13.44, '10']
[13.23, '14']
[16.8, '16']
[7.99, '23']
[9.41, '12']
[11.46, '17']
[38.59, '15']
[16.01, '21']
[21.52, '20']
[23.81, '02']
[13.2, '18']
[7.8, '03']
[10.09, '05']
[10.8, '19']
[11.38, '01']
[6.75, '22']
[10.25, '08']
[7.17, '04']
[8.13, '00']
[9.02, '06']
[7.85, '07']
[11.05, '11']


In [66]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
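
Swapping the columns is one way to sort by the average. Another is to pass a `key` function to `sorted`, which leaves each pair in its original `[hour, average]` order. A sketch using four pairs copied from the `avg_by_hour` output:

```python
# A few [hour, average] pairs copied from the avg_by_hour output above
avg_sample = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort on the second element, highest average first, without a swapped copy
ranked = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)
for hour, avg in ranked:
    print(f'{hour}:00: {avg:.2f} average comments per post')
```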

### Top 5 Hours for Ask Posts Comments

In [69]:
print('Top 5 Hours for Ask Posts Comments')
for avg_comments, hour in sorted_swap[:5]:
    # Parse the two-digit hour string and render it as HH:MM
    hour_label = dt.datetime.strptime(hour, '%H').strftime('%H:%M')
    print('{}: {} average comments per post'.format(hour_label, avg_comments))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.8 average comments per post
21:00: 16.01 average comments per post


## Conclusion

Based on these findings, if the goal is to pick the posting time with the highest average number of comments, I'd recommend an **"Ask HN"** post submitted around 3 in the afternoon, when ask posts averaged 38.59 comments each. If you're more of a night owl, 2 AM (23.81) and 8 PM (21.52) are the next best bets, in that order.