This project works with a Hacker News dataset. Hacker News is an online forum similar to Reddit where users submit posts and receive votes and comments.

In [1]:
from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
print(hn[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


The code above opens the CSV file containing the data, converts it to a list of lists, and prints the first five rows, the first of which is the header row.
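As a minimal sketch of the same loading step, the rows can be read through a small helper that accepts any file object. The helper name `load_rows` and the in-memory sample are illustrative assumptions, not part of the original notebook. The `encoding` argument in the commented usage is also an assumption about the file.

```python
import csv
import io

def load_rows(file_obj):
    """Return every row of a CSV file object as a list of lists (hypothetical helper)."""
    return list(csv.reader(file_obj))

# In-memory stand-in for hacker_news.csv so the sketch runs without the file.
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)
rows = load_rows(sample)
print(rows[0])  # header row
print(rows[1])  # first data row

# With the real file, a with-block closes the handle automatically
# (the utf-8 encoding here is an assumption):
# with open('hacker_news.csv', newline='', encoding='utf-8') as f:
#     rows = load_rows(f)
```

Every value comes back as a string, which is why the later cells call `int()` on the comment counts.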

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[0:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


The code above separates the header row from the dataset, then prints the header row followed by the first five data rows.

In [13]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(ask_posts[0])
print("Number of ask posts: ", len(ask_posts))
print("Number of show posts: ", len(show_posts))
print("Number of other posts: ", len(other_posts))

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
Number of ask posts:  1744
Number of show posts:  1162
Number of other posts:  17194


We are only concerned with Hacker News posts whose titles begin with Ask HN or Show HN. The code block above splits the original dataset into three lists: ask_posts holds the rows whose titles start with "Ask HN", show_posts holds the rows whose titles start with "Show HN", and other_posts holds every remaining row. Titles are lowercased before the comparison, so the match is case-insensitive.
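The prefix test described above can be sketched as a standalone function, which makes the rule easy to check on a few titles. The name `classify_title` and the third sample title are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()  # lowercase first so the match ignores case
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

titles = [
    'Ask HN: How to improve my personal website?',
    'Interactive Dynamic Video',
    'SHOW HN: a made-up example title',  # invented title to show case-insensitivity
]
print([classify_title(t) for t in titles])  # ['ask', 'other', 'show']
```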

In [15]:
total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    num_comments1 = int(row[4])
    total_ask_comments += num_comments1
avg_ask_comments = total_ask_comments/len(ask_posts)
print("Average number of comments on ask posts: ", avg_ask_comments)

for row in show_posts:
    num_comments2 = int(row[4])
    total_show_comments += num_comments2
avg_show_comments = total_show_comments/len(show_posts)
print("Average number of comments on show posts: ", avg_show_comments)

Average number of comments on ask posts:  14.038417431192661
Average number of comments on show posts:  10.31669535283993


On average, ask posts receive more comments than show posts (about 14.04 versus 10.32 per post). Because of this, the remaining analysis focuses on ask posts only.
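Since the same averaging logic runs twice in the cell above, one possible refactoring is a single helper applied to both lists. The name `average_comments`, its default column index, and the second sample row are assumptions made for this sketch.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column over a list of rows (hypothetical helper)."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['0', 'Ask HN: a made-up example', '', '1', '10', 'example_user', '1/1/2016 0:00'],  # invented row
]
print(average_comments(sample_posts))  # (6 + 10) / 2 = 8.0
```

In the notebook, `average_comments(ask_posts)` and `average_comments(show_posts)` would stand in for the two loops.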

In [22]:
import datetime as dt
result_list = []

for row in ask_posts:
    elements = []
    created_at_str = row[6]
    
    #Convert created_at to datetime object
    created_at_dt = dt.datetime.strptime(created_at_str, "%m/%d/%Y %H:%M")
    elements.append(created_at_dt)
    number_of_comments = int(row[4])
    elements.append(number_of_comments)
    result_list.append(elements)

counts_by_hour = {}
comments_by_hour = {}

for element in result_list:
    hour = element[0].hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = element[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += element[1]

print(counts_by_hour)
print(comments_by_hour)

{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}


The counts_by_hour dictionary maps each hour of the day to the number of ask posts created during that hour. The comments_by_hour dictionary maps each hour to the total number of comments received by the ask posts created during that hour.
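The two tallies can also be built in a single pass with `collections.defaultdict`, which removes the need for the "hour not in dictionary" branch. This is a sketch under the assumption that rows keep the column layout shown earlier. The function name `tally_by_hour` and the second and third sample rows are invented for illustration.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(posts, date_index=6, comments_index=4):
    """Return (post counts, comment totals) keyed by hour of creation (hypothetical helper)."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for row in posts:
        created = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M")
        counts[created.hour] += 1
        comments[created.hour] += int(row[comments_index])
    return dict(counts), dict(comments)

sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['0', 'Ask HN: made-up example A', '', '1', '4', 'example_user', '8/17/2016 9:05'],    # invented row
    ['1', 'Ask HN: made-up example B', '', '1', '10', 'example_user', '8/17/2016 15:30'],  # invented row
]
counts, comments = tally_by_hour(sample_posts)
print(counts)    # {9: 2, 15: 1}
print(comments)  # {9: 10, 15: 10}
```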

In [82]:
avg_comments_by_hour = {}

for hour in comments_by_hour:
    avg_comments_by_hour[hour] = comments_by_hour[hour]/counts_by_hour[hour]

#Sort avg_comments_by_hour in order of number of comments, decreasing
sorted_avg_comments_by_hour = sorted(avg_comments_by_hour.items(), key = lambda x:x[1], reverse = True)

#Convert sorted_avg_comments_by_hour from a list of tuples to a list of lists
new_comments_by_hour = [list(ele) for ele in sorted_avg_comments_by_hour]

#Create a list containing the top 5 hours for comments
top_5_hours = []

for hour in new_comments_by_hour:
    if len(top_5_hours) < 5:
        string = str(hour[0]) + ':00: ' + str(hour[1]) + ' average comments per post'
        top_5_hours.append(string)
    
print(top_5_hours)

['15:00: 38.5948275862069 average comments per post', '2:00: 23.810344827586206 average comments per post', '20:00: 21.525 average comments per post', '16:00: 16.796296296296298 average comments per post', '21:00: 16.009174311926607 average comments per post']


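The top 5 hours for average comments per ask post are 15:00, 2:00, 20:00, 16:00, and 21:00. Posts created during the 15:00 hour stand out, averaging about 38.6 comments per post, compared with about 23.8 for the next-best hour, 2:00.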
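The ranking step can be sketched more compactly by slicing the sorted pairs and formatting each one, which also makes it easy to round the averages. The function name `top_hours`, the zero-padded hour, and the two-decimal rounding are presentation choices assumed here, not the notebook's own output format. The sample averages are computed from six entries of the dictionaries printed earlier.

```python
def top_hours(avg_by_hour, n=5):
    """Return formatted strings for the n hours with the highest averages (hypothetical helper)."""
    ranked = sorted(avg_by_hour.items(), key=lambda item: item[1], reverse=True)
    return [
        f"{hour:02d}:00: {avg:.2f} average comments per post"
        for hour, avg in ranked[:n]
    ]

# Comment totals divided by post counts, taken from the dictionaries printed earlier.
sample_averages = {
    15: 4477 / 116,
    2: 1381 / 58,
    20: 1722 / 80,
    16: 1814 / 108,
    21: 1745 / 109,
    9: 251 / 45,
}
for line in top_hours(sample_averages):
    print(line)
```

The first line printed is `15:00: 38.59 average comments per post`, and hour 9 is dropped because it ranks sixth in this sample.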