# Analyzing Posts on Hacker News

In this project, I will analyze a dataset of approximately 20,000 posts from the website __[Hacker News](https://news.ycombinator.com/)__.

From the data, I will be looking to find the following:
- Do ASK HN or SHOW HN posts get more comments?
- Do posts created at a certain time receive more comments on average?


To begin, we import the CSV file 'hacker_news.csv' and save it as the variable **hn**, which is a list of lists. We also keep the header row from the CSV separate in a variable named **headers**.

In [1]:
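from csv import reader

# Read every row of the CSV into a list of lists; the with block closes the file
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

headers = hn[0]  # first row holds the column names
hn = hn[1:]      # remaining rows are the posts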
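
The later cells refer to columns by position (for example `row[1]` for the title). As a minimal sketch, the header row can be turned into a name-to-index lookup so those positions are easy to check. The sample rows and the column names below are assumptions made for illustration; they are not read from hacker_news.csv.

```python
from csv import reader
from io import StringIO

# Hypothetical one-post sample standing in for hacker_news.csv;
# the column names are assumed for illustration.
sample_csv = StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,2,6,someone,8/16/2016 9:55\n"
)

rows = list(reader(sample_csv))
sample_headers = rows[0]
sample_rows = rows[1:]

# Map each column name to its position in a row
column_index = {name: i for i, name in enumerate(sample_headers)}

print(column_index['title'], column_index['num_comments'], column_index['created_at'])
```

If the real header row matches these assumed names, the printed positions should line up with the indexes 1, 4, and 6 used in the cells that follow.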

Next, we separate our data into three tables. Posts whose titles start with ASK HN or SHOW HN are sorted into the **ask_posts** and **show_posts** tables respectively. The remaining posts go into the **other_posts** table.

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
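
The prefix test above can be tried on a few titles before running it over the whole dataset. This is only a sketch: the `classify` helper and the sample titles are hypothetical and do not come from the dataset.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles covering each case, including mixed capitalization
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: A tiny static site generator',
    'Why I stopped reading Ask HN threads',
]

labels = [classify(title) for title in sample_titles]
print(labels)
```

Note that a title which only mentions "Ask HN" somewhere in the middle lands in the "other" group, because `startswith` checks the beginning of the string only.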

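With the posts separated, we compute the average number of comments per post for the ASK HN and SHOW HN tables.

In [3]: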
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4]) 
avg_ask_comments = total_ask_comments / len(ask_posts)

total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])
avg_show_comments = total_show_comments / len(show_posts)

print(avg_ask_comments)
print(avg_show_comments)

14.038417431192661
10.31669535283993


ASK HN posts average **~14 comments**, while SHOW HN posts average **~10.3 comments**.
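
The two averages are computed with the same loop, so the calculation could be wrapped in one small function. The sketch below assumes, as in the cells above, that the comment count sits at index 4 of each row; the `average_comments` name and the sample rows are hypothetical.

```python
def average_comments(posts, comments_col=4):
    """Return the mean comment count over a list of post rows."""
    total = sum(int(post[comments_col]) for post in posts)
    return total / len(posts)

# Made-up rows shaped like the dataset rows (comment count at index 4)
sample_posts = [
    ['1', 'Ask HN: First example?', '', '3', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: Second example?', '', '1', '10', 'user_b', '8/16/2016 10:12'],
    ['3', 'Ask HN: Third example?', '', '5', '2', 'user_c', '8/17/2016 14:03'],
]

print(average_comments(sample_posts))
```

As written, the function raises `ZeroDivisionError` for an empty table, which is the same behavior as the original loop.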

Since ASK HN posts receive more comments on average, we now look at which hour of the day draws the most comments for ASK HN posts.

In [4]:
import datetime as dt

# result_list holds [created time, comment count] pairs, e.g. ['8/16/2016 9:55', 6]
result_list = []
for post in ask_posts:
    created_comments = [post[6], int(post[4])]
    result_list.append(created_comments)
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = row[0]
    date = dt.datetime.strptime(date, '%m/%d/%Y %H:%M')
    if date.hour not in counts_by_hour:
        counts_by_hour[date.hour] = 1
        comments_by_hour[date.hour] = int(row[1])
    else:
        counts_by_hour[date.hour] += 1
        comments_by_hour[date.hour] += int(row[1])

print(counts_by_hour)
print(comments_by_hour)

{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}


Two dictionaries were created: **counts_by_hour**, which maps each hour to the number of ASK HN posts created in that hour, and **comments_by_hour**, which maps each hour to the total number of comments on the posts created in that hour.
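
The same per-hour tallies can be sketched with `collections.defaultdict`, which removes the need for the `if`/`else` branch. This is an alternative to the cell above, not the approach it uses. The first sample pair comes from the comment in that cell; the other two are made up.

```python
import datetime as dt
from collections import defaultdict

# [created time, comment count] pairs; only the first one appears in the notebook
sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 9:14', 4],
]

sample_counts = defaultdict(int)    # hour -> number of posts
sample_comments = defaultdict(int)  # hour -> total comments

for created_at, n_comments in sample_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    sample_counts[hour] += 1
    sample_comments[hour] += n_comments

print(dict(sample_counts))
print(dict(sample_comments))
```

A missing key in a `defaultdict(int)` starts at 0, so the first post seen for an hour is handled by the same `+=` line as every later one.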

In [5]:
avg_by_hour = []

for hour in comments_by_hour:
    average = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, average])

print(avg_by_hour)

[[9, 5.5777777777777775], [13, 14.741176470588234], [10, 13.440677966101696], [14, 13.233644859813085], [16, 16.796296296296298], [23, 7.985294117647059], [12, 9.41095890410959], [17, 11.46], [15, 38.5948275862069], [21, 16.009174311926607], [20, 21.525], [2, 23.810344827586206], [18, 13.20183486238532], [3, 7.796296296296297], [5, 10.08695652173913], [19, 10.8], [1, 11.383333333333333], [22, 6.746478873239437], [8, 10.25], [4, 7.170212765957447], [0, 8.127272727272727], [6, 9.022727272727273], [7, 7.852941176470588], [11, 11.051724137931034]]


Now we have a list of lists called **avg_by_hour**, with the format [hour, average number of comments for posts at that hour]. The code below presents the five hours with the highest average comments per post in a more readable manner.

In [6]:
swap_avg_by_hour = []
for hour_average in avg_by_hour:
    swap_avg_by_hour.append([hour_average[1], hour_average[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")

for average_hour in sorted_swap[:5]:
    template = "{}:00: {:.2f} average comments per post"
    print(template.format(average_hour[1], average_hour[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
2:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
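
The swap step exists only so that `sorted` compares averages first. As a hedged alternative sketch, the `key` argument of `sorted` can pick the average directly, leaving the [hour, average] order untouched. The pairs below are a subset of the **avg_by_hour** output shown earlier; the variable names are hypothetical.

```python
# A few [hour, average] pairs copied from the avg_by_hour output
sample_avg_by_hour = [
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [2, 23.810344827586206],
    [20, 21.525],
]

# Sort by the average (index 1), highest first, without swapping the columns
sample_sorted = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in sample_sorted[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```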


## Results

From our data, ASK HN posts receive more comments on average than SHOW HN posts (~14 vs. ~10.3). For ASK HN posts, the hours with the highest average comments per post are:
- 3:00pm US Eastern Time with 38.59 average comments per post
- 2:00am US Eastern Time with 23.81 average comments per post
- 8:00pm US Eastern Time with 21.52 average comments per post
- 4:00pm US Eastern Time with 16.80 average comments per post
- 9:00pm US Eastern Time with 16.01 average comments per post
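
The list above restates the 24-hour values from the previous cell in 12-hour form. A minimal sketch of that conversion is shown below; the `to_12_hour` helper is hypothetical and only reformats the hour, without any time zone handling.

```python
def to_12_hour(hour):
    """Format an hour from 0 to 23 as a 12-hour label such as '3:00pm'."""
    suffix = 'am' if hour < 12 else 'pm'
    display = hour % 12 or 12  # 0 and 12 both display as 12
    return '{}:00{}'.format(display, suffix)

# The top five hours from the output above
top_hours = [15, 2, 20, 16, 21]
print([to_12_hour(hour) for hour in top_hours])
```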