# Analysis of Hacker News Posts

In this project I sort data from the Hacker News website into ask posts, show posts, and other posts, then analyze the ask and show groups to determine which engagement metric is most important for each group. I then use these findings to identify the best time of day to submit a post for each group. 

The data has been reduced from roughly 300k rows to about 20k rows by removing posts without comments and then taking a random sample of the rest. The original dataset can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts).
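The reduction step described above is not part of this notebook. As a rough sketch of what it could look like, the hypothetical helper `reduce_rows` below drops rows with zero comments and then draws a random sample. The column position of `num_comments`, the sample size, and the seed are all assumptions made for illustration, and the demo rows are made up.

```python
import random

def reduce_rows(rows, sample_size, comments_index=4, seed=0):
    """Drop rows with zero comments, then return a random sample of the rest."""
    # Keep only posts that received at least one comment.
    with_comments = [row for row in rows if int(row[comments_index]) > 0]
    # Sample without replacement; cap the size so small inputs do not raise.
    rng = random.Random(seed)
    k = min(sample_size, len(with_comments))
    return rng.sample(with_comments, k)

# Made-up rows that follow the column order shown in this notebook.
demo_rows = [
    ['1', 'Ask HN: First example', '', '5', '0', 'user_a', '1/1/2016 10:00'],
    ['2', 'Show HN: Second example', '', '7', '3', 'user_b', '1/2/2016 11:00'],
    ['3', 'Third example', '', '2', '1', 'user_c', '1/3/2016 12:00'],
    ['4', 'Fourth example', '', '9', '6', 'user_d', '1/4/2016 13:00'],
]
reduced = reduce_rows(demo_rows, sample_size=2)
print(len(reduced))
```

Passing a fixed seed keeps the sample reproducible between runs, which is convenient when the reduced file is shared.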


In [1]:
#Import and preview data
from csv import reader
#Use a context manager so the file is closed after reading
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))
hn[0:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [2]:
#Separate column headers from the data
headers = hn[0:1]
hn = hn[1:]
print(headers)
print(hn[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [3]:
#Filter data into ask, show, and other posts
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


In [4]:
#Find which group of posts receives more comments
#find total number of comments for ask posts
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])
    
#find average number of comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

#find total number of comments for show posts
total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])
    
#find avg number of comments
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)


14.038417431192661
10.31669535283993


# How many comments does a post from each group get on average?

Ask posts average about 14 comments, while show posts average about 10.3, so ask posts receive more comments. This makes sense: ask posts pose a question and invite answers, while show posts are primarily there to be viewed.

In [5]:
#Now let's find out which type of post gets more points
#find total number of points for ask posts
total_ask_points = 0
for post in ask_posts:
    total_ask_points += int(post[3])
    
#find average number of points
avg_ask_points = total_ask_points / len(ask_posts)
print(avg_ask_points)

#find total number of points for show posts
total_show_points = 0
for post in show_posts:
    total_show_points += int(post[3])
    
#find avg number of points
avg_show_points = total_show_points / len(show_posts)
print(avg_show_points)

15.061926605504587
27.555077452667813


# How many points does a post from each group get on average?

Ask posts average 15.06 points per post, while show posts average 27.56. This also makes sense, since users submitting show posts are typically sharing content meant to be useful to other users.
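The comment and point averages above are computed with four nearly identical loops. One possible way to avoid the repetition is a small helper such as the hypothetical `average_column` below. This is only a sketch: it assumes the column order used in this notebook (index 3 for `num_points`, index 4 for `num_comments`), and the sample rows are made up.

```python
def average_column(posts, index):
    """Return the mean of the integer values stored at `index` in each row."""
    # Guard against an empty group so the division cannot fail.
    if not posts:
        return 0.0
    total = sum(int(post[index]) for post in posts)
    return total / len(posts)

# Made-up rows in the same column order as the dataset.
sample_posts = [
    ['1', 'Ask HN: Example one', '', '10', '4', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: Example two', '', '20', '8', 'user_b', '1/1/2016 11:00'],
]
print(average_column(sample_posts, 4))  # average comments
print(average_column(sample_posts, 3))  # average points
```

With the real lists, calls such as `average_column(ask_posts, 4)` and `average_column(show_posts, 3)` should give the same averages as the loops.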

# What is the best hour to post an ask post?

We found that ask posts receive more comments on average than show posts. For ask posts, users are looking for answers and engagement, so our quality metric will be the number of comments.

In [6]:
#Determine the number of posts and comments posted at each hour for ask posts
import datetime as dt
result_list = []
for post in ask_posts:
    created_at = post[6]
    comments = int(post[4])
    result_list.append([created_at,comments])
    
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    date = dt.datetime.strptime(row[0],'%m/%d/%Y %H:%M')
    hour = date.strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
print(counts_by_hour)
print(comments_by_hour)

{'01': 60, '19': 110, '10': 59, '18': 109, '21': 109, '08': 48, '09': 45, '07': 34, '06': 44, '17': 100, '12': 73, '00': 55, '14': 107, '15': 116, '03': 54, '13': 85, '20': 80, '05': 46, '02': 58, '22': 71, '11': 58, '23': 68, '16': 108, '04': 47}
{'01': 683, '19': 1188, '10': 793, '18': 1439, '21': 1745, '08': 492, '09': 251, '07': 267, '06': 397, '17': 1146, '12': 687, '00': 447, '14': 1416, '15': 4477, '03': 421, '13': 1253, '20': 1722, '05': 464, '02': 1381, '22': 479, '11': 641, '23': 543, '16': 1814, '04': 337}


In [7]:
#Determine the average number of comments per post at each hour
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
print(avg_by_hour)

[['01', 11.383333333333333], ['19', 10.8], ['10', 13.440677966101696], ['18', 13.20183486238532], ['21', 16.009174311926607], ['08', 10.25], ['09', 5.5777777777777775], ['07', 7.852941176470588], ['06', 9.022727272727273], ['17', 11.46], ['12', 9.41095890410959], ['00', 8.127272727272727], ['14', 13.233644859813085], ['15', 38.5948275862069], ['03', 7.796296296296297], ['13', 14.741176470588234], ['20', 21.525], ['05', 10.08695652173913], ['02', 23.810344827586206], ['22', 6.746478873239437], ['11', 11.051724137931034], ['23', 7.985294117647059], ['16', 16.796296296296298], ['04', 7.170212765957447]]


In [8]:
#Swap order of values in order to sort 
#the lists in descending order of comments
swap_avg_by_hour = []
for hour in avg_by_hour:
    swap_avg_by_hour.append([hour[1],hour[0]])
print(swap_avg_by_hour)

[[11.383333333333333, '01'], [10.8, '19'], [13.440677966101696, '10'], [13.20183486238532, '18'], [16.009174311926607, '21'], [10.25, '08'], [5.5777777777777775, '09'], [7.852941176470588, '07'], [9.022727272727273, '06'], [11.46, '17'], [9.41095890410959, '12'], [8.127272727272727, '00'], [13.233644859813085, '14'], [38.5948275862069, '15'], [7.796296296296297, '03'], [14.741176470588234, '13'], [21.525, '20'], [10.08695652173913, '05'], [23.810344827586206, '02'], [6.746478873239437, '22'], [11.051724137931034, '11'], [7.985294117647059, '23'], [16.796296296296298, '16'], [7.170212765957447, '04']]


In [9]:
#Sort in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap)

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]


In [10]:
#Identify the top 5 hours for ask posts comments
print("Top 5 Hours for Ask Posts Comments")
for avg, hour in sorted_swap[0:5]:
    print(
    '{}: {:.2f} average comments per post'.format(
    dt.datetime.strptime(hour, '%H').strftime('%H:%M'),avg))
    

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


# Ask Post findings
15:00 (3:00 PM EST) is the best time to submit an ask post, with an average of 38.59 comments per post. That is well above the second-best time, 02:00 (2:00 AM EST), which averages 23.81 comments per post.

# What is the best hour to post a show post?

We found that show posts get more points on average than ask posts. For show posts, users are typically more concerned with popularity than with engagement, so here we'll find the hour with the highest average points per post.
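The hourly steps used for ask posts (count, total, average, sort, take the top five) can also be folded into a single function. The hypothetical `top_hours` helper below is a sketch of that idea, not the code this notebook runs. It assumes the `created_at` format `'%m/%d/%Y %H:%M'` seen in the data and sorts with a `key` function instead of swapping the columns. The demo rows are made up.

```python
import datetime as dt

def top_hours(posts, value_index, n=5, date_index=6):
    """Return the n (hour, average) pairs with the highest average value."""
    totals = {}
    counts = {}
    for post in posts:
        created = dt.datetime.strptime(post[date_index], '%m/%d/%Y %H:%M')
        hour = created.strftime('%H')
        # Accumulate the value total and the post count for this hour.
        totals[hour] = totals.get(hour, 0) + int(post[value_index])
        counts[hour] = counts.get(hour, 0) + 1
    averages = [(hour, totals[hour] / counts[hour]) for hour in totals]
    # Sort by the average (second element), highest first.
    return sorted(averages, key=lambda pair: pair[1], reverse=True)[:n]

# Made-up rows in the same column order as the dataset.
demo_posts = [
    ['1', 'Show HN: Demo one', '', '40', '2', 'user_a', '3/5/2016 23:10'],
    ['2', 'Show HN: Demo two', '', '20', '1', 'user_b', '3/6/2016 23:50'],
    ['3', 'Show HN: Demo three', '', '25', '5', 'user_c', '3/7/2016 12:00'],
]
print(top_hours(demo_posts, value_index=3, n=2))
```

One difference from the swap-and-sort approach: when two hours have the same average, this version keeps them in first-seen order rather than ordering them by hour.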

In [11]:
#Determine the number of posts and points posted at each hour for show posts
#can use the same script as with ask posts -
#just swap ask_posts for show_posts and comments for points
result_list = []
for post in show_posts:
    created_at = post[6]
    points = int(post[3])
    result_list.append([created_at,points])
    
counts_by_hour = {}
points_by_hour = {}
for row in result_list:
    date = dt.datetime.strptime(row[0],'%m/%d/%Y %H:%M')
    hour = date.strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        points_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        points_by_hour[hour] += row[1]
print(counts_by_hour)
print(points_by_hour)

{'01': 28, '19': 55, '10': 36, '18': 61, '21': 47, '08': 34, '09': 30, '07': 26, '17': 93, '12': 61, '06': 16, '00': 31, '13': 99, '15': 78, '03': 27, '14': 86, '20': 60, '05': 19, '02': 30, '22': 46, '11': 44, '23': 36, '16': 93, '04': 26}
{'01': 700, '19': 1702, '10': 681, '18': 2215, '21': 866, '08': 519, '09': 553, '07': 494, '17': 2521, '12': 2543, '06': 375, '00': 1173, '13': 2438, '15': 2228, '03': 679, '14': 2187, '20': 1819, '05': 104, '02': 340, '22': 1856, '11': 1480, '23': 1526, '16': 2634, '04': 386}


In [12]:
#Determine the average number of points per post at each hour
avg_by_hour = []
for hour in points_by_hour:
    avg_by_hour.append([hour, points_by_hour[hour] / counts_by_hour[hour]])
print(avg_by_hour)

[['01', 25.0], ['19', 30.945454545454545], ['10', 18.916666666666668], ['18', 36.31147540983606], ['21', 18.425531914893618], ['08', 15.264705882352942], ['09', 18.433333333333334], ['07', 19.0], ['17', 27.107526881720432], ['12', 41.68852459016394], ['06', 23.4375], ['00', 37.83870967741935], ['13', 24.626262626262626], ['15', 28.564102564102566], ['03', 25.14814814814815], ['14', 25.430232558139537], ['20', 30.316666666666666], ['05', 5.473684210526316], ['02', 11.333333333333334], ['22', 40.34782608695652], ['11', 33.63636363636363], ['23', 42.388888888888886], ['16', 28.322580645161292], ['04', 14.846153846153847]]


In [13]:
#Swap order of values in order to sort 
#the lists in descending order of points
swap_avg_by_hour = []
for hour in avg_by_hour:
    swap_avg_by_hour.append([hour[1],hour[0]])
print(swap_avg_by_hour)

[[25.0, '01'], [30.945454545454545, '19'], [18.916666666666668, '10'], [36.31147540983606, '18'], [18.425531914893618, '21'], [15.264705882352942, '08'], [18.433333333333334, '09'], [19.0, '07'], [27.107526881720432, '17'], [41.68852459016394, '12'], [23.4375, '06'], [37.83870967741935, '00'], [24.626262626262626, '13'], [28.564102564102566, '15'], [25.14814814814815, '03'], [25.430232558139537, '14'], [30.316666666666666, '20'], [5.473684210526316, '05'], [11.333333333333334, '02'], [40.34782608695652, '22'], [33.63636363636363, '11'], [42.388888888888886, '23'], [28.322580645161292, '16'], [14.846153846153847, '04']]


In [14]:
#Sort in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap)

[[42.388888888888886, '23'], [41.68852459016394, '12'], [40.34782608695652, '22'], [37.83870967741935, '00'], [36.31147540983606, '18'], [33.63636363636363, '11'], [30.945454545454545, '19'], [30.316666666666666, '20'], [28.564102564102566, '15'], [28.322580645161292, '16'], [27.107526881720432, '17'], [25.430232558139537, '14'], [25.14814814814815, '03'], [25.0, '01'], [24.626262626262626, '13'], [23.4375, '06'], [19.0, '07'], [18.916666666666668, '10'], [18.433333333333334, '09'], [18.425531914893618, '21'], [15.264705882352942, '08'], [14.846153846153847, '04'], [11.333333333333334, '02'], [5.473684210526316, '05']]


In [15]:
#Identify the top 5 hours for show posts points
print("Top 5 Hours for Show Posts Points")
for avg, hour in sorted_swap[0:5]:
    print(
    '{}: {:.2f} average points per post'.format(
    dt.datetime.strptime(hour, '%H').strftime('%H:%M'),avg))

Top 5 Hours for Show Posts Points
23:00: 42.39 average points per post
12:00: 41.69 average points per post
22:00: 40.35 average points per post
00:00: 37.84 average points per post
18:00: 36.31 average points per post


# Show Post findings

23:00 (11:00 PM EST) is the best time to submit a show post, with an average of 42.39 points per post. A close second is 12:00 (12:00 PM EST) at 41.69 points per post, followed by 22:00 (10:00 PM EST) at 40.35 points per post.
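Readers outside the Eastern time zone may want these hours in another zone. The sketch below converts an hour string to UTC with the hypothetical helper `est_hour_to_utc`. It assumes the timestamps are Eastern Standard Time at a fixed UTC-5 offset and ignores daylight saving time, so results for summer dates may be off by an hour.

```python
import datetime as dt

# Assumption: a fixed UTC-5 offset for EST, with no daylight saving adjustment.
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')

def est_hour_to_utc(hour):
    """Convert an hour string such as '23' from EST to an 'HH:MM' UTC string."""
    # The date is arbitrary; only the hour matters for the conversion.
    local = dt.datetime(2016, 1, 1, int(hour), 0, tzinfo=EST)
    return local.astimezone(dt.timezone.utc).strftime('%H:%M')

for hour in ['23', '12', '22']:
    print(hour + ':00 EST ->', est_hour_to_utc(hour), 'UTC')
```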