# Exploring Hacker News Posts

## Introduction

[Hacker News](https://news.ycombinator.com/) is a site, similar to Reddit, where users submit posts that receive votes and comments. It is extremely popular in technology and startup circles, and popular posts can attract hundreds of thousands of visitors.

This project explores submissions made on this website. We are specifically interested in submissions whose titles begin with `'Show HN'` or `'Ask HN'`. `'Show HN'` posts are submitted to show the Hacker News community a project, product, or simply something interesting, while `'Ask HN'` submissions ask the community a specific question.

This project will compare these two types of posts to determine:

* Do `'Ask HN'` or `'Show HN'` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

The dataset used for this project is a reduced version of this [Kaggle submission](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts).

## Reading Data, Creating Lists of Lists, and Removing Header

Opening the 'hacker_news.csv' file, reading it, and storing its rows as a list of lists:

In [11]:
from csv import reader

# the with-block closes the file once its rows have been read
with open("hacker_news.csv") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

header = hn[0]   # separating the header row from the data
hn = hn[1:]
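
To make the list-of-lists structure concrete, here is a minimal sketch that runs the same steps on a small in-memory sample instead of the real file. The sample row is made up, and the column names are an assumption; only the positions of the title (index 1), the comment count (index 4), and the creation date (index 6) are implied by the code used later in this project.

```python
import csv
import io

# Hypothetical sample: column names and values are assumed for illustration.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,2,6,someone,8/16/2016 9:55\n"
)

rows = list(csv.reader(sample_csv))
sample_header = rows[0]   # first row holds the column names
sample_rows = rows[1:]    # remaining rows hold the posts

print(sample_header)
print(sample_rows[0][1])        # the title of the first post
print(type(sample_rows[0][4]))  # every value is read as a string
```

Because `csv.reader` returns every field as a string, numeric columns such as the comment count have to be converted with `int()` before doing arithmetic on them.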

## Extracting Ask HN and Show HN Posts

Now to separate the posts whose titles begin with `'Ask HN'` or `'Show HN'` from the rest of our dataset:

In [15]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()         # controlling for case.
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
    
print("Number of 'Ask HN' posts: ", len(ask_posts))
print("Number of 'Show HN' posts: ", len(show_posts))
print("Number of other posts: ", len(other_posts))

Number of 'Ask HN' posts:  1744
Number of 'Show HN' posts:  1162
Number of other posts:  17194


## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Now to determine whether `'Ask HN'` or `'Show HN'` posts receive more comments on average:

In [18]:
## Finding total number of comments in ask posts ##

total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])
    
## Computing average number of comments in ask posts ##

avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of 'Ask HN' comments: ", avg_ask_comments)

Average number of 'Ask HN' comments:  14.038417431192661


In [19]:
## Finding total number of comments in show posts ##

total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
## Computing average number of comments in show posts ##

avg_show_comments = total_show_comments / len(show_posts)
print("Average number of 'Show HN' comments: ", avg_show_comments)

Average number of 'Show HN' comments:  10.31669535283993


From the data, we can see that `'Ask HN'` submissions receive about 14 comments on average, whereas `'Show HN'` posts receive only about 10.

Since `'Ask HN'` posts receive more comments on average, we will focus the rest of our analysis on these submissions.
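
The two averaging cells above repeat the same loop for each list. As a sketch only, the calculation could be wrapped in one helper. The name `average_comments` and its `comments_index` parameter are hypothetical and not part of the original notebook, and the sample rows are made up.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count of the rows in posts.

    comments_index is the column holding the comment count
    (index 4 in this project's rows). An empty list returns 0.0.
    """
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows, shaped like the rows of the dataset.
sample_posts = [
    ["1", "Ask HN: A?", "", "2", "6", "a", "8/16/2016 9:55"],
    ["2", "Ask HN: B?", "", "1", "10", "b", "8/16/2016 10:05"],
]

print(average_comments(sample_posts))  # 8.0
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages printed above.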

## Finding the Number of Ask Posts and Comments by Hour Created


As stated in the introduction, our next goal is to determine if `'Ask HN'` posts created at a certain *time* are more likely to attract comments. 

The first step for this analysis will be to calculate the number of ask posts created in each hour of the day, along with the number of comments those posts received.
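
Grouping by hour relies on turning each creation date, which is stored as a string, into a `datetime` object and then reading off its hour. A minimal sketch of that single step, using a made-up date string in the `"%m/%d/%Y %H:%M"` format:

```python
import datetime as dt

sample_created_at = "8/16/2016 9:55"   # hypothetical value, not taken from the file
date_format = "%m/%d/%Y %H:%M"

parsed = dt.datetime.strptime(sample_created_at, date_format)  # string -> datetime
hour_label = parsed.strftime("%H")                              # datetime -> zero-padded hour string

print(parsed)
print(hour_label)   # 09
```

Formatting the hour with `"%H"` yields a two-character string such as `'09'`, which is why the dictionary keys in the next cell are strings rather than integers.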

In [48]:
import datetime as dt

result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])  # appending a list with two elements: 'created_at' and 'num_comments'

posts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"   # the format our string date data is in

for row in result_list:
    date = row[0]
    comments = row[1]    # number of comments
    time = dt.datetime.strptime(date, date_format)  # parsing the dates stored as strings
    hour = time.strftime("%H")  # extracting the hour from the date object in string format
    if hour not in posts_by_hour:
        posts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else: 
        posts_by_hour[hour] += 1     # sums the number of posts per hour
        comments_by_hour[hour] += comments   # sums the number of comments per hour
    
posts_by_hour

{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}
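
The dictionary is displayed in the order in which each hour was first encountered, which makes it hard to scan. One possible way to read it chronologically is to sort the items by key. The sketch below uses three of the entries shown above as a stand-in for the full dictionary.

```python
# Three entries copied from the output above, standing in for posts_by_hour.
sample_posts_by_hour = {"09": 45, "13": 85, "10": 59}

# Zero-padded hour strings sort in the same order as the hours themselves.
sorted_counts = sorted(sample_posts_by_hour.items())

for hour, count in sorted_counts:
    print(f"{hour}:00 -> {count} posts")
```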