# Exploring Hacker News Posts

This project demonstrates skills such as:
    
    * How to work with strings
    * Object-oriented programming
    * How to work with dates and times

The project is about Hacker News posts. Hacker News is a site where user-submitted stories (known as 'posts') receive votes and comments. It is widely used in the technology and startup community, and the highest-voted posts can reach hundreds of thousands of users.

The data set used in this project can be found [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts). 

For the purpose of this project, we have reduced the original data set from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining ones.
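
The reduction described above can be sketched as follows. This is only an illustration of the idea, not the procedure actually used to build the data set: the function name `reduce_posts`, the fixed seed, and the tiny made-up rows are all assumptions.

```python
import random

def reduce_posts(rows, sample_size, comments_index=4, seed=1):
    """Drop posts with zero comments, then sample the rest at random.

    Hypothetical helper: the name, seed and defaults are assumptions.
    """
    with_comments = [row for row in rows if int(row[comments_index]) > 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    # never ask for more rows than are available
    k = min(sample_size, len(with_comments))
    return rng.sample(with_comments, k)

# made-up rows: [id, title, url, num_points, num_comments, author, created_at]
rows = [
    ["1", "Post A", "", "5", "0", "alice", "1/1/2016 10:00"],
    ["2", "Post B", "", "7", "3", "bob", "1/2/2016 11:00"],
    ["3", "Post C", "", "2", "1", "carol", "1/3/2016 12:00"],
    ["4", "Post D", "", "9", "8", "dave", "1/4/2016 13:00"],
]

sample = reduce_posts(rows, sample_size=2)
print(len(sample))
```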

We are interested in titles that begin with "Ask HN" or "Show HN". They are used to ask the community specific questions or show a project, product, or just something interesting. Below are some examples:
    
    * Ask HN: How to improve my personal website?
    * Show HN: Shanhu.io, a programming playground powered by e8vm
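
As a small illustration of what "begins with" means here, the sketch below labels a title by its prefix, ignoring case. The function name `classify_title` and the label strings are assumptions made for this example; the third title is taken from the data set itself.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()  # make the comparison case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

examples = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Shanhu.io, a programming playground powered by e8vm",
    "Interactive Dynamic Video",
]

labels = [classify_title(title) for title in examples]
print(labels)
```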
    


Questions to be considered:

    * Do Ask HN or Show HN receive more comments on average?
    * Do posts created at a certain time receive more comments on average?

## Read the data set

In [1]:
from csv import reader

# read the CSV file as a list of lists
with open('hacker_news.csv', 'r') as read_obj:
    # pass the file object to reader() to get the reader object
    csv_reader = reader(read_obj)
    # pass the reader object to list() to get a list of lists
    hn = list(csv_reader)
    # display the first 5 rows
    print(hn[:5])


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


## Extracting headers from data set

In [2]:
headers = hn[:1]
print(headers)

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
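
The header row can also be used to look up column positions by name, which avoids hard-coded indexes such as `row[4]`. The following is a minimal sketch of that idea; the names `header_row`, `col` and `sample_row` are assumptions introduced for this example, and the sample row is copied from the output printed earlier.

```python
# column names as printed above
header_row = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# map each column name to its position in a row
col = {name: index for index, name in enumerate(header_row)}

# first data row from the earlier output
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

num_comments = int(sample_row[col['num_comments']])
created_at = sample_row[col['created_at']]
print(num_comments, created_at)
```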


In [4]:
# remove the header row from hn
# note: run this cell only once; every re-run drops one more row from hn

hn = hn[1:]
print(hn[:5])

[['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'], ['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more', 'http://arstechnica.com/business/2015/10/comcast-and-other-isps-boost-network-investment-despite-net-neutrality/', '53', '22', 'Deinos

## Filtering the data

In [17]:
# find the posts that begin with either "Ask HN" or "Show HN"

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
            
# print the number of rows in ask_posts, show_posts, and other_posts
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17193


In [21]:
# find the average number of comments per ask post

total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])
    
    
avg_ask_comments = total_ask_comments/len(ask_posts)
print(avg_ask_comments)
    

14.038417431192661


In [22]:
# find the average number of comments per show post

total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments/len(show_posts)
print(avg_show_comments)

10.31669535283993


The output shows that, on average, Ask HN posts (14.038 comments) receive more comments than Show HN posts (10.317 comments). For that reason, we'll focus on the Ask HN posts from now on.
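
The two averaging loops above follow the same pattern, so they could be folded into one helper. This is a sketch under assumptions: the function name `average_comments`, its default column index and the made-up sample rows are not part of the original notebook.

```python
def average_comments(posts, comments_index=4):
    """Average number of comments per post (hypothetical helper)."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# made-up rows in the same column order as the data set
sample_posts = [
    ["1", "Ask HN: example one", "", "5", "10", "alice", "1/1/2016 10:00"],
    ["2", "Ask HN: example two", "", "7", "20", "bob", "1/2/2016 11:00"],
]

print(average_comments(sample_posts))
```

With the real lists, this would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.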

Next, we'll determine whether ask posts created at a certain time are more likely to attract comments. To do that, we'll count the ask posts created in each hour of the day and total the comments they received. Then we'll calculate the average number of comments per ask post for each hour.
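
To make the parsing step concrete, here is a single `created_at` value turned into an hour key with `datetime.strptime` and `strftime`. The two timestamps are taken from the rows printed earlier; the variable names are assumptions for this example.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

# timestamps copied from the first rows of the data set
afternoon = dt.datetime.strptime("8/4/2016 11:52", date_format)
midnight = dt.datetime.strptime("6/17/2016 0:01", date_format)

# strftime("%H") gives a zero-padded, two-character hour string
hour_key = afternoon.strftime("%H")
midnight_key = midnight.strftime("%H")
print(hour_key, midnight_key)
```

Because the keys are zero-padded strings, "00" through "23" sort in chronological order.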

In [36]:
# import datetime module
import datetime as dt

result_list = []

for row in ask_posts:
    result_list.append(
        [row[6], int(row[4])]
    )
    
comments_by_hour = {}
counts_by_hour = {}    
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment 
    else:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment
        

comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

Next, we'll use the two dictionaries we created (comments_by_hour and counts_by_hour) to calculate the average number of comments for posts created during each hour of the day.
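
The calculation is a per-key division of one dictionary by the other. A minimal sketch with three hours is shown here; the comment totals come from the output above, while the post counts are back-calculated from this notebook's results, so treat them as illustrative. The names ending in `_sample` are assumptions.

```python
# total comments per hour (from the output above)
comments_sample = {"09": 251, "15": 4477, "20": 1722}
# number of ask posts per hour (back-calculated, illustrative)
counts_sample = {"09": 45, "15": 116, "20": 80}

# divide total comments by post count for every hour
avg_sample = {
    hour: comments_sample[hour] / counts_sample[hour]
    for hour in comments_sample
}

for hour, avg in avg_sample.items():
    print(hour, round(avg, 2))
```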

In [37]:
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [40]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [43]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
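
As an alternative to swapping the columns before sorting, `sorted` accepts a `key` function that selects the value to sort by. The sketch below uses four of the `[hour, average]` pairs from the earlier output; the name `avg_by_hour_sample` is an assumption for this example.

```python
# a few [hour, average] pairs copied from avg_by_hour
avg_by_hour_sample = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["20", 21.525],
    ["02", 23.810344827586206],
]

# sort by the average (second element), highest first, without swapping
top = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```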

In [46]:
print("Top 5 Hours for Ask Posts Comments")

for avg, hour in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%H:%M"), avg
        )
    )

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The result shows that the hour with the highest average number of comments per post is 15:00, with 38.59 comments per post. That is roughly 60% more than the hour with the second-highest average (02:00, with 23.81 comments per post).
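
The relative difference between the top two hours can be checked with a short calculation. The two averages are copied from the sorted output above; the variable names are assumptions.

```python
highest = 38.5948275862069     # average for 15:00
second = 23.810344827586206    # average for 02:00

# percentage increase of the highest average over the second highest
pct_increase = (highest - second) / second * 100
print("{:.1f}%".format(pct_increase))
```

The result comes out a little above 60%.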
    

## Conclusion

This project analysed ask posts and show posts to determine which type of post receives more comments on average, and at what time of day. It is important to note that the data set analysed excluded posts without any comments.
Based on this analysis, to maximize the number of comments a post receives, I'd recommend submitting it as an ask post between 15:00 and 16:00 (3pm - 4pm EST).
