## Hacker News Posts Analysis

In this project, we compare two types of posts, Ask HN and Show HN, from [Hacker News](https://news.ycombinator.com/), a popular site for technology-related stories.

Users submit Ask HN posts to ask the Hacker News community a specific question and Show HN posts to show the community a project, product, or just generally something interesting. We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

We will use the datetime module to work with the date and time information of each post, mainly through the datetime.strptime() class method and the datetime.strftime() method.
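
As a minimal sketch of those two calls, the snippet below parses one timestamp in the format used by the dataset and then formats it back into text. The input format string matches the one used later in this project; the `'%H:00'` output format and the variable names are illustrative assumptions only.

```python
import datetime as dt

# A timestamp in the same month/day/year hour:minute layout as the dataset.
raw = '8/4/2016 11:52'

# strptime() parses a string into a datetime object.
created = dt.datetime.strptime(raw, '%m/%d/%Y %H:%M')
print(created.hour)

# strftime() formats a datetime object back into a string.
hour_label = created.strftime('%H:00')
print(hour_label)
```

Parsing first and formatting second keeps the hour available as an integer for grouping, while the label is only needed for display.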

It should be noted that the dataset we're working with was reduced by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

<b>Getting familiar with the dataset</b>

In [2]:
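from csv import reader
import datetime as dt

# Read the whole CSV into a list of rows; the with-block closes the file afterwards.
with open('hnews.csv', newline='') as f:
    hn = list(reader(f))
for i in range(3):
    print(hn[i])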

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


<b>Separating the header from the data</b>

In [3]:
hn_header = hn[0]
hn = hn[1:]
for i in range (0,3):
    print(hn[i])

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


<b>Separating Ask HN, Show HN and other posts</b>

In [4]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17193


<b>Calculating the average number of comments Ask HN and Show HN posts receive</b>

In [5]:
# row[4] is the num_comments column.
total_ask = 0
ask_comments = 0
for row in ask_posts:
    total_ask += 1
    ask_comments += int(row[4])
print('Ask HN posts average no. of comments:', round(ask_comments/total_ask))

total_show = 0
show_comments = 0
for row in show_posts:
    total_show += 1
    show_comments += int(row[4])
print('Show HN posts average no. of comments:', round(show_comments/total_show))

Ask HN posts average no. of comments: 14
Show HN posts average no. of comments: 10


On average, Ask HN posts receive more comments than Show HN posts (about 14 versus 10 per post). Since Ask HN posts attract more discussion, we'll focus the remaining analysis on them.
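
The same average can be sketched as a small reusable helper. The function name `average_comments`, its `comments_index` parameter, and the sample rows are hypothetical and exist only for illustration; the rows follow the dataset's column order, with num_comments at index 4.

```python
def average_comments(rows, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 when there are no rows."""
    if not rows:
        return 0.0
    return sum(int(row[comments_index]) for row in rows) / len(rows)


# Hypothetical rows, not taken from the real dataset.
sample_rows = [
    ['1', 'Ask HN: example question', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: another question', '', '2', '4', 'user_b', '6/23/2016 22:20'],
]

print(average_comments(sample_rows))
```

Returning the unrounded value leaves the choice of rounding to the caller, and the empty-list guard avoids a division by zero.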

<b>Converting the date and time information to datetime objects using datetime.strptime()</b>

In [6]:
for row in ask_posts:
    dtt = row[-1]
    dtt = dt.datetime.strptime(dtt, '%m/%d/%Y %H:%M')
    row[-1] = dtt


### Finding the Number of Ask Posts and Comments by Hour Created

We'll determine whether an ask post can attract more comments if it is created at a certain time. First, we'll count the ask posts created during each hour of the day, along with the number of comments those posts received. Then, we'll calculate the average number of comments received by ask posts created at each hour of the day.

In [7]:
ask_counts = {}
askk_comments = {}

for row in ask_posts:
    hourr = row[-1].hour
    comm = int(row[4])
    if hourr not in ask_counts:
        ask_counts[hourr] = 1
        askk_comments[hourr] = comm
    else:
        ask_counts[hourr] += 1
        askk_comments[hourr] += comm

print(ask_counts)
print(askk_comments)

{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
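
The tallying above could also be written with `collections.defaultdict`, which removes the need for the `if hourr not in ask_counts` check. This is an alternative sketch, not the approach used in this project; the sample pairs and variable names are hypothetical.

```python
from collections import defaultdict
import datetime as dt

# Hypothetical (created_at, num_comments) pairs in the dataset's string format.
sample = [
    ('8/4/2016 11:52', '52'),
    ('6/23/2016 22:20', '1'),
    ('6/17/2016 11:01', '3'),
]

# Missing keys start at 0, so no membership test is needed.
counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += int(num_comments)

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```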


### Calculating the Average Number of Comments for Ask HN Posts by Hour

In [8]:
ask_average = {}
for element in ask_counts:
    ask_average[element] = round(askk_comments[element] / ask_counts[element])
print(ask_average)

{9: 6, 13: 15, 10: 13, 14: 13, 16: 17, 23: 8, 12: 9, 17: 11, 15: 39, 21: 16, 20: 22, 2: 24, 18: 13, 3: 8, 5: 10, 19: 11, 1: 11, 22: 7, 8: 10, 4: 7, 0: 8, 6: 9, 7: 8, 11: 11}


### Sorting and Printing Values from a List of Tuples

In [9]:
sortt = []
for element in ask_average:
    sortt.append((ask_average[element], element))
print(sortt)

sortedd = sorted(sortt, reverse = True)
print(sortedd)

[(6, 9), (15, 13), (13, 10), (13, 14), (17, 16), (8, 23), (9, 12), (11, 17), (39, 15), (16, 21), (22, 20), (24, 2), (13, 18), (8, 3), (10, 5), (11, 19), (11, 1), (7, 22), (10, 8), (7, 4), (8, 0), (9, 6), (8, 7), (11, 11)]
[(39, 15), (24, 2), (22, 20), (17, 16), (16, 21), (15, 13), (13, 18), (13, 14), (13, 10), (11, 19), (11, 17), (11, 11), (11, 1), (10, 8), (10, 5), (9, 12), (9, 6), (8, 23), (8, 7), (8, 3), (8, 0), (7, 22), (7, 4), (6, 9)]
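
An alternative to swapping each pair before sorting is to sort the dictionary items directly with a `key` function. The sketch below uses a handful of the hourly averages printed earlier; the names `ask_average_sample` and `by_average` are illustrative assumptions.

```python
# A few hourly averages copied from the output above (hour -> average comments).
ask_average_sample = {9: 6, 15: 39, 2: 24, 20: 22, 16: 17}

# Sort (hour, average) pairs by the average, highest first.
by_average = sorted(
    ask_average_sample.items(),
    key=lambda item: item[1],
    reverse=True,
)
print(by_average)
```

This keeps each pair in (hour, average) order. Note that ties are handled differently: sorting (average, hour) tuples breaks ties by hour, whereas a key-based sort keeps tied entries in their original order.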


In [10]:
# Print the 5 hours with the highest average comments from the sorted list.

print('Top 5 hours for Ask HN:')
for avg, hr in sortedd[:5]:
    print(
        '{}:00 : {} average comments per post'.format(hr, avg)
    )

Top 5 hours for Ask HN:
15:00 : 39 average comments per post
2:00 : 24 average comments per post
20:00 : 22 average comments per post
16:00 : 17 average comments per post
21:00 : 16 average comments per post


The hour with the highest average number of comments per post is 15:00, at about 39 comments per post. That is roughly 60% more than the second-highest hour, 2:00, which averages about 24 comments per post.
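
As a check on that percentage, the sketch below recomputes it from the unrounded totals printed earlier for hours 15 and 2. The variable names are illustrative.

```python
# Totals for the two leading hours, taken from the printed dictionaries.
comments_15, posts_15 = 4477, 116
comments_2, posts_2 = 1381, 58

avg_15 = comments_15 / posts_15
avg_2 = comments_2 / posts_2

# Relative increase of the top hour over the runner-up, in percent.
pct_increase = (avg_15 - avg_2) / avg_2 * 100
print(round(pct_increase))
```

Using the unrounded averages gives an increase of about 62%, consistent with the rounded figures of 39 and 24.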