# Exploring Hacker News Posts 

This project is a practical data analysis of a dataset of Hacker News posts. We focus on posts whose titles begin with 'Ask HN' or 'Show HN': Ask HN posts ask the community a specific question, while Show HN posts show the community a project, product or something else of interest.

The key skills used in this project are:
* Working with strings
* Object-oriented programming
* Working with dates and times

The 'Ask HN' and 'Show HN' posts will be compared to determine whether:
1. Ask HN or Show HN receive more comments on average
2. Posts created at a certain time receive more comments on average

### Reading data and removing header row

In [11]:
import csv

with open('hacker_news.csv') as file:
    hn = list(csv.reader(file))

header = hn[0]
print(header)
print('\n')
print(hn[1:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [12]:
#Removing the header row from the analysis
hn = hn[1:]

print(hn[:2])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]


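### Extracting two different types of posts and calculating average number of comments

Now it's time to separate the dataset into posts that start with Ask HN and posts that start with Show HN. Titles are lowercased before the comparison so that the match is case-insensitive.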

In [13]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Number of Ask HN posts:', len(ask_posts))
print('\n')
print('Number of Show HN posts:', len(show_posts))
print('\n')
print('Number of Other HN posts:', len(other_posts))

Number of Ask HN posts: 1744


Number of Show HN posts: 1162


Number of Other HN posts: 17194


In [19]:
total_ask_comments = 0

for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)

print('Average Ask HN post comments:', round(avg_ask_comments,2))

total_show_comments = 0

for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments 

avg_show_comments = total_show_comments / len(show_posts)

print('Average Show HN post comments:', round(avg_show_comments,2))


Average Ask HN post comments: 14.04
Average Show HN post comments: 10.32


From this analysis, we can see that Ask HN posts receive more comments on average (14.04 versus 10.32), so by this measure they attract more discussion. The rest of the analysis will focus on the Ask HN posts.
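
The two loops above follow the same pattern, so they could be folded into a single helper. The following is a minimal sketch, assuming rows in the same column order as the dataset (comment count at index 4). The name `average_comments` and the sample rows are hypothetical and are not part of the original notebook.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments across a list of post rows."""
    if not posts:
        return 0.0  # avoid dividing by zero when a category is empty
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Illustrative rows only, in the same column order as hacker_news.csv.
sample_posts = [
    ['1', 'Ask HN: Example question?', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Another question?', '', '3', '4', 'user_b', '1/26/2016 19:30'],
]

print(round(average_comments(sample_posts), 2))  # 7.0
```

With a helper like this, the same call could be applied to `ask_posts` and `show_posts` without repeating the loop.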

### Finding the Number of 'Ask HN' Posts and Comments by Hour
It would be useful to know if posts created at a specific time are likely to get more comments. First, we will calculate the number of Ask HN posts created in each hour of the day, along with the number of comments they received.

In [26]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]          # full timestamp string, e.g. '8/4/2016 11:52'
    num_comments = int(row[4])   # comment count for this post
    result_list.append([created_at, num_comments])
    
#Create two empty dictionaries 
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_object = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date_object.strftime("%H")
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
        
print(counts_by_hour)
print('\n')
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
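
The parse-and-accumulate step above can also be written without the `if`/`else` branch. This is a minimal sketch using `collections.defaultdict`, which is a substitute for the plain dictionaries in the cell above rather than the notebook's own method. The `sample` pairs are made up for illustration and are not rows from the dataset.

```python
import datetime as dt
from collections import defaultdict

# Illustrative (created_at, num_comments) pairs, not real dataset rows.
sample = [
    ('8/4/2016 11:52', 52),
    ('1/26/2016 19:30', 10),
    ('6/23/2016 11:05', 1),
]

# Missing keys start at 0, so no membership check is needed.
counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample:
    parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    hour = parsed.strftime('%H')  # zero-padded string such as '11'
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(counts_by_hour))    # {'11': 2, '19': 1}
print(dict(comments_by_hour))  # {'11': 53, '19': 10}
```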


We can use the two dictionaries to work out the average number of comments for posts created during each hour of the day, dividing the total comments for an hour by the number of posts created in that hour.

In [32]:
avg_by_hour = []

for row in comments_by_hour:
    avg_by_hour.append([row, 
                        round((comments_by_hour[row]/counts_by_hour[row]), 2
                             )])
    
print(avg_by_hour)

[['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8], ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01], ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09], ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17], ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05]]
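
As a quick check of the arithmetic, the same division can be expressed as a dictionary comprehension. This sketch uses only two of the hours from the dictionaries printed earlier, and the name `avg_comments_by_hour` is hypothetical (the notebook itself stores the result as a list of lists).

```python
# Values for two hours, copied from the dictionaries printed earlier.
counts_by_hour = {'09': 45, '15': 116}
comments_by_hour = {'09': 251, '15': 4477}

# Average = total comments in the hour / number of posts in the hour.
avg_comments_by_hour = {
    hour: round(comments_by_hour[hour] / counts_by_hour[hour], 2)
    for hour in counts_by_hour
}

print(avg_comments_by_hour)  # {'09': 5.58, '15': 38.59}
```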


This analysis allows us to see the average number of comments per post made each hour. Although we have the information we need, the result is not in order, so it's harder to identify the hours which have the highest values. 

### Sorting and printing values from a list of lists 
We can finish this project by sorting the list of lists and showing the five highest values. 

In [33]:
swap_avg_by_hour = []

for row in avg_by_hour:
    first_element = row[1]
    second_element = row[0]
    two_list = [first_element, second_element]
    swap_avg_by_hour.append(two_list)
    
print(swap_avg_by_hour)

[[5.58, '09'], [14.74, '13'], [13.44, '10'], [13.23, '14'], [16.8, '16'], [7.99, '23'], [9.41, '12'], [11.46, '17'], [38.59, '15'], [16.01, '21'], [21.52, '20'], [23.81, '02'], [13.2, '18'], [7.8, '03'], [10.09, '05'], [10.8, '19'], [11.38, '01'], [6.75, '22'], [10.25, '08'], [7.17, '04'], [8.13, '00'], [9.02, '06'], [7.85, '07'], [11.05, '11']]


Because each inner list now starts with the average, and Python compares lists element by element, we can use the sorted() function with reverse = True to sort this list in descending order of average comments.

In [36]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap)

[[38.59, '15'], [23.81, '02'], [21.52, '20'], [16.8, '16'], [16.01, '21'], [14.74, '13'], [13.44, '10'], [13.23, '14'], [13.2, '18'], [11.46, '17'], [11.38, '01'], [11.05, '11'], [10.8, '19'], [10.25, '08'], [10.09, '05'], [9.41, '12'], [9.02, '06'], [8.13, '00'], [7.99, '23'], [7.85, '07'], [7.8, '03'], [7.17, '04'], [6.75, '22'], [5.58, '09']]
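
The swap step is not strictly required, because `sorted()` also accepts a `key` argument that selects which element to sort on. The following is a minimal sketch of that alternative, using four of the `[hour, average]` pairs from the output above. It is an optional variation, not the approach the notebook takes.

```python
# A subset of the [hour, average] pairs computed earlier.
avg_by_hour = [['09', 5.58], ['13', 14.74], ['15', 38.59], ['02', 23.81]]

# Sort on the second element (the average) without reordering each pair.
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
# 15:00: 38.59 average comments per post
# 02:00: 23.81 average comments per post
# 13:00: 14.74 average comments per post
```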


In [40]:
#Displaying the results:

print('Top 5 Hours for Ask Post Comments')

for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(row[1], '%H')
    hour = hour.strftime('%H:%M')
    result = "{}: {:.2f} average comments per post".format(hour, row[0])
    print(result)


Top 5 Hours for Ask Post Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


## Conclusion

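Based on the average number of comments per post in this dataset, the best hours to create an Ask HN post are 3pm, 2am, 8pm, 4pm and 9pm, expressed in the time zone of the dataset's created_at column. The 3pm hour stands out, with 38.59 average comments per post compared with 23.81 for the next best hour.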