<h1>Exploring Hacker News Posts</h1>

<p>This project explores a dataset of posts from the news site Hacker News. The data come from this Kaggle dataset: https://www.kaggle.com/hacker-news/hacker-news-posts</p>

In [68]:
from csv import reader

# Use a context manager so the file is closed once it has been read.
with open("hacker_news.csv") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

#An overview of the data:
headers = hn[0]
print(headers)
for row in hn[1:5]:
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


<p>Finding posts that start with Ask HN or Show HN</p>

In [69]:
ask_posts=[]
show_posts=[]
other_posts=[]

for row in hn:
    title= row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("There are {} ask posts".format(len(ask_posts)))
print("There are {} show posts".format(len(show_posts)))
print("There are {} other posts".format(len(other_posts)))

        
        

There are 1744 ask posts
There are 1162 show posts
There are 17195 other posts
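
<p>Note that the loop above iterates over all of <code>hn</code>, so the header row itself is counted among the "other" posts. A minimal sketch of separating the header before classifying, using a small made-up sample in place of the real file (the helper name <code>split_posts</code> and the sample rows are assumptions for illustration, not part of the original analysis):</p>

```python
def split_posts(rows):
    """Split data rows (header already removed) into ask, show, and other posts."""
    ask, show, other = [], [], []
    for row in rows:
        title = row[1].lower()
        if title.startswith("ask hn"):
            ask.append(row)
        elif title.startswith("show hn"):
            show.append(row)
        else:
            other.append(row)
    return ask, show, other


# Hypothetical sample standing in for the contents of hacker_news.csv.
sample = [
    ["id", "title", "url", "num_points", "num_comments", "author", "created_at"],
    ["1", "Ask HN: Example question?", "", "5", "3", "user_a", "8/4/2016 11:52"],
    ["2", "Show HN: Example project", "http://example.com", "7", "2", "user_b", "1/26/2016 19:30"],
    ["3", "Interactive Dynamic Video", "http://example.com/v", "386", "52", "user_c", "8/4/2016 11:52"],
]

header, posts = sample[0], sample[1:]  # keep the header out of the data rows
ask_sample, show_sample, other_sample = split_posts(posts)
print(len(ask_sample), len(show_sample), len(other_sample))
```

<p>With the header held separately, every row in each group has a numeric comment count, which matters for the averaging step that follows.</p>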


<p>Which type of post receives more comments?</p>

In [70]:
def avg_comments(post_list):
    total_comments=0
    for post in post_list:
        num_comments=int(post[4])
        total_comments+=num_comments
    return total_comments/len(post_list)
        
avg_ask_comments= avg_comments(ask_posts)
avg_show_comments= avg_comments(show_posts)

#the "other" data requires more cleaning, I will not analyze it here:
# avg_other_comments= avg_comments(other_posts)

print("There are {:.2f} average comments from an 'Ask' post".format(avg_ask_comments))
print("There are {:.2f} average comments from a 'Show' post".format(avg_show_comments))

There are 14.04 average comments from an 'Ask' post
There are 10.32 average comments from a 'Show' post
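
<p>The code comment above notes that the "other" posts need more cleaning before they can be averaged. One possible approach, offered only as a sketch and not as the method used in this analysis, is to skip any row whose comment field cannot be parsed as an integer (such as a header row). The function name <code>avg_comments_safe</code> and the sample rows below are hypothetical:</p>

```python
def avg_comments_safe(post_list, comments_index=4):
    """Average comment count, ignoring rows whose count is not an integer string."""
    counts = []
    for post in post_list:
        try:
            counts.append(int(post[comments_index]))
        except ValueError:
            # Non-numeric field (for example a header row): skip it.
            continue
    # Return 0.0 for an empty group instead of dividing by zero.
    return sum(counts) / len(counts) if counts else 0.0


# Hypothetical rows: one header-like row and two data rows.
rows = [
    ["id", "title", "url", "num_points", "num_comments", "author", "created_at"],
    ["1", "First post", "", "386", "52", "user_a", "8/4/2016 11:52"],
    ["2", "Second post", "", "39", "10", "user_b", "1/26/2016 19:30"],
]
safe_average = avg_comments_safe(rows)
print("{:.2f}".format(safe_average))
```

<p>Silently skipping rows can hide real data problems, so counting how many rows were dropped would be a sensible addition in practice.</p>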


<p>On average, an "Ask" post receives about 4 more comments than a "Show" post (14.04 versus 10.32). I will focus my remaining analysis on "Ask" posts. Next I will look at whether the time of day a post is created influences the number of comments it receives.</p>

<h2> Which hours have the most comments?</h2> 

In [71]:
# Count the ask posts created during each hour of the day and total the comments they received.
import datetime as dt

result_list = []

for post in ask_posts:
    result_list.append(
        [post[6], int(post[4])]
    )

comments_by_hour = {}
counts_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for each_row in result_list:
    date = each_row[0]
    comment = each_row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1

comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
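
<p>The hour keys above come from parsing each <code>created_at</code> string with <code>date_format</code> and formatting the result with <code>"%H"</code>. A small sketch of that step on two timestamps taken from the sample rows printed earlier (the names <code>samples</code> and <code>hour_keys</code> are introduced here for illustration):</p>

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

# Two created_at values from the rows printed at the top of the notebook.
samples = ["8/4/2016 11:52", "6/17/2016 0:01"]

hour_keys = []
for created_at in samples:
    parsed = dt.datetime.strptime(created_at, date_format)
    # "%H" gives a zero-padded, 24-hour string, so midnight becomes "00".
    hour_keys.append(parsed.strftime("%H"))

print(hour_keys)
```

<p>Because the keys are zero-padded strings, they sort in chronological order, which is why the dictionary display above runs from '00' to '23'.</p>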

In [72]:
# Calculate the average number of comments received by `Ask HN` posts created at each hour of the day.
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])

print(avg_by_hour)

[['06', 9.022727272727273], ['11', 11.051724137931034], ['05', 10.08695652173913], ['23', 7.985294117647059], ['14', 13.233644859813085], ['00', 8.127272727272727], ['20', 21.525], ['09', 5.5777777777777775], ['22', 6.746478873239437], ['02', 23.810344827586206], ['13', 14.741176470588234], ['07', 7.852941176470588], ['12', 9.41095890410959], ['01', 11.383333333333333], ['18', 13.20183486238532], ['08', 10.25], ['04', 7.170212765957447], ['10', 13.440677966101696], ['19', 10.8], ['17', 11.46], ['16', 16.796296296296298], ['15', 38.5948275862069], ['21', 16.009174311926607], ['03', 7.796296296296297]]


In [73]:
swap_avg_by_hour=[]

for row in avg_by_hour:
    new_row=row[::-1]
    swap_avg_by_hour.append(new_row)
sorted_swap= sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    print("{}:00: {:.2f} average comments per post.".format(row[1], row[0]))



Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


<p>These data show that "Ask" posts created at 3 pm received the most comments on average (38.59 per post), presumably because people were at work and in a midday slump. The next highest averages belong to posts created at 2 am and at 8 pm. Could it be that people respond to posts more when they have insomnia?!</p>
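
<p>The ranking step swaps each [hour, average] pair so that <code>sorted</code> orders by the average. An alternative sketch, assuming the same pair layout, passes a <code>key</code> function to <code>sorted</code> and leaves the pairs unswapped. The sample values are rounded from the averages printed earlier, and the name <code>avg_by_hour_sample</code> is hypothetical:</p>

```python
# Rounded [hour, average] pairs taken from the averages computed in this notebook.
avg_by_hour_sample = [
    ["13", 14.74],
    ["21", 16.01],
    ["15", 38.59],
    ["02", 23.81],
    ["16", 16.80],
    ["20", 21.52],
]

# Sort by the average (second element), highest first, and keep five entries.
top_five = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_five:
    print("{}:00: {:.2f} average comments per post.".format(hour, avg))
```

<p>Sorting with a key avoids building a second list and keeps the hour as the first element throughout.</p>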