# Hacker News Data Evaluation 

### Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that reach the top of its listings can get hundreds of thousands of visitors as a result.

### You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

* `id`: The unique identifier from Hacker News for the post
* `title`: The title of the post
* `url`: The URL that the post links to, if the post has a URL
* `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: The number of comments that were made on the post
* `author`: The username of the person who submitted the post
* `created_at`: The date and time at which the post was submitted

### In this project we are interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question, and Show HN posts to show the community a project, product, or something else they find interesting. We'll compare these two types of posts to determine the following:

* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

## 1. Importing and reading the data

In [15]:
# Importing csv reader
from csv import reader

# Opening the file (closed automatically by `with`), reading it, and creating a list of lists
with open("hacker_news.csv", encoding="utf-8", newline="") as file:
    hn = list(reader(file))

# The first row contains the title of each column 
headers = hn[0]

# Excludes the title row and stores the rest of the data
hn = hn[1:]
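
To check the header/data split without the real file, here is a minimal sketch that reads an in-memory sample instead. The two sample rows and the names `sample_csv`, `sample_headers`, and `sample_rows` are invented for illustration; only the column layout comes from the data set description above.

```python
import io
from csv import reader

# Hypothetical two-row sample in the same column layout as hacker_news.csv
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    '1,"Ask HN: Example question, with a comma?",,5,3,alice,8/16/2016 9:55\n'
    "2,Show HN: Example project,http://example.com,10,7,bob,9/26/2016 3:26\n"
)

# csv.reader handles the quoted title that contains a comma
rows = list(reader(io.StringIO(sample_csv)))

# Same split as above: first row is the header, the rest is data
sample_headers, sample_rows = rows[0], rows[1:]

print(sample_headers)
print(len(sample_rows))
```

Note that every value comes back as a string, which is why later steps convert `num_comments` with `int()`.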


## 2. Filtering our data

This step separates the posts into three lists according to how their titles begin.

In [16]:
# List of lists that will store the data
ask_posts = []
show_posts = []
other_posts = []

# Iterating through hn and populating each corresponding list
for row in hn:
    title = row[1] # first column is 'id' 
    
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
        
    else:
        other_posts.append(row)

# Printing out the length (number of posts) for each list
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
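
The prefix check can also be pulled into a small helper, which makes it easy to test on a few titles before running it over the whole data set. This is only a sketch: `classify_title` is a hypothetical name and the sample titles are made up.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # lowercase once so the match is case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


print(classify_title("Ask HN: How do you learn?"))
print(classify_title("SHOW HN: My side project"))
print(classify_title("Why Python is popular"))
```

Lowercasing the title first is what lets variants such as "ASK HN" or "Show hn" land in the right list.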


## 3. Determining average comments

Next, let's determine whether ask posts or show posts receive more comments on average.

In [17]:
total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

    

14.038417431192661


In [18]:
total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


#### On average, ask posts receive more comments (about 14.04 per post) than show posts (about 10.32 per post)
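
Because the two averaging cells differ only in the list they loop over, the calculation could be shared through one function. The following is a sketch under that assumption; `average_comments` and the sample rows are hypothetical, and index 4 is the `num_comments` column as in the cells above.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0  # avoid dividing by zero when no posts matched
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical rows in the same column order as the data set
sample_posts = [
    ["1", "Ask HN: A", "", "5", "3", "alice", "8/16/2016 9:55"],
    ["2", "Ask HN: B", "", "2", "9", "bob", "8/16/2016 10:15"],
]

print(average_comments(sample_posts))  # (3 + 9) / 2 = 6.0
```

With the real lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two figures printed above.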

## 4. Analyze ask posts and determine number of comments per hour of the day

We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.

2. Calculate the average number of comments ask posts received by hour created.

In [19]:
# Importing the datetime module
import datetime as dt

# Create list that will hold date of post and number of comments
result_list = []

for post in ask_posts:
    result_list.append([post[6],int(post[4])]) # post[6] = 'created_at' , post[4] = 'num_comments'

# Dictionaries used as frequency tables, keyed by hour of the day (0-23)
counts_by_hour = {} # {hour: number of ask posts created during that hour}
comments_by_hour = {} # {hour: total comments received by posts created during that hour}

# variable that will represent the format of the date
date_format = "%m/%d/%Y %H:%M"

# Iterate through result_list and populate dictionaries to create frequency tables
# for both.
for row in result_list:
    hour_st = row[0]
    hour_dt = dt.datetime.strptime(hour_st,date_format)
    hour = hour_dt.hour
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]


In [20]:
print(counts_by_hour)
print(comments_by_hour)


{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
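
As an alternative to the `if hour not in ...` branch, the same two frequency tables can be built with `collections.defaultdict`, which starts every missing key at zero. This is a sketch on a tiny invented sample (`sample_results`), not the notebook's actual data; the date format string is the one used above.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical sample of [created_at, num_comments] pairs
sample_results = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 9:14", 2],
]

sample_counts = defaultdict(int)    # hour -> number of posts
sample_comments = defaultdict(int)  # hour -> total comments

for created_at, num_comments in sample_results:
    # Parse the timestamp and keep only the hour (0-23)
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))    # {9: 2, 13: 1}
print(dict(sample_comments))  # {9: 8, 13: 29}
```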


## 5. Calculating the Average Number of Comments for Ask HN Posts by Hour

We will use the two dictionaries or frequency tables created to calculate the average number of comments for posts created during each hour of the day. 

In [21]:
# A list of lists that will hold the average number of comments per post for each hour
avg_by_hour = []

# Iterate through the counts_by_hour dictionary and append the hour to the
# avg_by_hour list as well as the average number of comments per hour
for hour in counts_by_hour:
    avg_by_hour.append([hour,round((comments_by_hour[hour]/counts_by_hour[hour]),2)])

# Print the list
print(avg_by_hour)
    

[[9, 5.58], [13, 14.74], [10, 13.44], [14, 13.23], [16, 16.8], [23, 7.99], [12, 9.41], [17, 11.46], [15, 38.59], [21, 16.01], [20, 21.52], [2, 23.81], [18, 13.2], [3, 7.8], [5, 10.09], [19, 10.8], [1, 11.38], [22, 6.75], [8, 10.25], [4, 7.17], [0, 8.13], [6, 9.02], [7, 7.85], [11, 11.05]]
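
To make the arithmetic concrete, the sketch below recomputes the averages for two of the hours using the counts and totals printed in the previous section. The names `counts_subset`, `comments_subset`, and `avg_subset` are hypothetical, and a dictionary comprehension stands in for the loop.

```python
# Values copied from the printed frequency tables for hours 15 and 2
counts_subset = {15: 116, 2: 58}
comments_subset = {15: 4477, 2: 1381}

# average = total comments / number of posts, rounded to 2 decimals
avg_subset = {
    hour: round(comments_subset[hour] / counts_subset[hour], 2)
    for hour in counts_subset
}

print(avg_subset)  # {15: 38.59, 2: 23.81}
```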


## 6. Sorting and Printing Values from a List of Lists

Sorting the values will help in identifying the hours with the highest values.

In [22]:
# List that will store swapped columns of avg_by_hour list
swap_avg_by_hour = []


for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])  # row[0] = hour, row[1] = average comments
    
print(swap_avg_by_hour)

[[5.58, 9], [14.74, 13], [13.44, 10], [13.23, 14], [16.8, 16], [7.99, 23], [9.41, 12], [11.46, 17], [38.59, 15], [16.01, 21], [21.52, 20], [23.81, 2], [13.2, 18], [7.8, 3], [10.09, 5], [10.8, 19], [11.38, 1], [6.75, 22], [10.25, 8], [7.17, 4], [8.13, 0], [9.02, 6], [7.85, 7], [11.05, 11]]


In [25]:
# List that holds the rows sorted in descending order of average comments
sorted_swap = sorted(swap_avg_by_hour,reverse=True)
print("Top 5 Hours for Ask Posts Comments")

# Iterate through sorted list to format time and print statement
for row in sorted_swap[:5]:
    datetime_obj = dt.datetime.strptime(str(row[1]), "%H")
    time_frmt = datetime_obj.strftime("%H:%M")
    print('{}: {} average comments per post'.format(time_frmt,row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.8 average comments per post
21:00: 16.01 average comments per post
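
Swapping the columns is one way to sort by the average; another is to pass a `key` function to `sorted()` and leave the `[hour, average]` layout untouched. The sketch below assumes that approach on a six-row excerpt of the averages printed earlier (`avg_sample` and `top_five` are hypothetical names).

```python
# Excerpt of [hour, average] pairs taken from the output above
avg_sample = [
    [9, 5.58], [15, 38.59], [2, 23.81],
    [20, 21.52], [16, 16.8], [21, 16.01],
]

# Sort on the second element (the average), highest first, and keep five rows
top_five = sorted(avg_sample, key=lambda row: row[1], reverse=True)[:5]

for hour, avg in top_five:
    # {:02d} zero-pads the hour; {:.2f} always shows two decimals
    print("{:02d}:00: {:.2f} average comments per post".format(hour, avg))
```

Formatting the hour directly also avoids the round trip through `strptime` and `strftime`, and `{:.2f}` prints `16.80` rather than `16.8`.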


In this data set, ask posts created at **15:00, 02:00, and 20:00 Eastern Time** received the most comments on average (38.59, 23.81, and 21.52 comments per post, respectively), so these appear to be the best hours to post an Ask HN question.