# Exploring Hacker News Posts

In this project, we'll compare two types of posts from [Hacker News](https://news.ycombinator.com/), a popular site where technology-related stories (or 'posts') are voted and commented upon. We will focus on two types of posts: `Ask HN` and `Show HN`.

Users submit `Ask HN` posts to ask the Hacker News community a specific question, such as "What is the best online course you've ever taken?" Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting.

Our analysis will answer the following questions:
 - Do `Ask HN` or `Show HN` posts receive more comments on average?
 - Do posts created at a certain time receive more comments on average?
 - Do `Ask HN` or `Show HN` posts receive more points on average?
 - Do posts created at a certain time receive more points on average?

Please note that the data set we're working with was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. The dataset is available on [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts).
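The reduction described above could be reproduced along the following lines. This is only a sketch: the helper name `reduce_dataset`, the fixed seed, and the made-up rows are illustrative assumptions, and the procedure actually used to build the data set may have differed.

```python
import random

def reduce_dataset(rows, sample_size, seed=0):
    """Keep only rows with at least one comment, then sample them at random.

    Hypothetical helper: assumes num_comments is the fifth column (index 4).
    """
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(commented, min(sample_size, len(commented)))

# Tiny made-up rows: id, title, url, num_points, num_comments, author, created_at
rows = [
    ['1', 'A', '', '5', '0', 'u1', '1/1/2016 0:00'],
    ['2', 'B', '', '3', '2', 'u2', '1/1/2016 1:00'],
    ['3', 'C', '', '9', '7', 'u3', '1/1/2016 2:00'],
]
sample = reduce_dataset(rows, sample_size=2)
```

Because posts without comments are dropped before sampling, every average computed later describes only posts that received at least one comment.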

In [19]:
# Import all libraries
from csv import reader
from datetime import datetime as dt

In [20]:
# Read in the data and transform it into a list of lists.
# The with block closes the file once it has been read.
with open('hacker_news.csv', newline='') as opened_file:
    hn = list(reader(opened_file))

In [21]:
# Display first five rows of Hacker News dataset
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [22]:
# Remove header and store in variable using pop method 
headers = hn.pop(0)

In [23]:
# Display first five rows of Hacker News dataset
hn[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

### Extracting Ask HN and Show HN Posts

In [24]:
# Identify posts that begin with either `Ask HN` or `Show HN` and separate the data into different lists.
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'): # search for ask posts
        ask_posts.append(row)
    elif title.startswith('show hn'): # search for show posts
        show_posts.append(row)
    else:
        other_posts.append(row)

In [25]:
# Display number of posts for each post type
print("{} Total ask posts\n{} Total show posts\n{} Total other posts".format(len(ask_posts), len(show_posts), len(other_posts)))

1744 Total ask posts
1162 Total show posts
17194 Total other posts
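The split can also be wrapped in a function and sanity-checked, since every row should land in exactly one group. A minimal sketch, where `split_posts` and the sample rows are made up for illustration:

```python
def split_posts(rows, title_index=1):
    """Partition rows into ask, show, and other posts by title prefix."""
    groups = {'ask': [], 'show': [], 'other': []}
    for row in rows:
        title = row[title_index].lower()  # case-insensitive prefix match
        if title.startswith('ask hn'):
            groups['ask'].append(row)
        elif title.startswith('show hn'):
            groups['show'].append(row)
        else:
            groups['other'].append(row)
    return groups

sample_rows = [
    ['1', 'Ask HN: Best online course?'],
    ['2', 'SHOW HN: My weekend project'],
    ['3', 'Interactive Dynamic Video'],
    ['4', 'Why I ask HN for advice'],  # "ask hn" is not a prefix here
]
groups = split_posts(sample_rows)
# The three groups should account for every input row
assert sum(len(g) for g in groups.values()) == len(sample_rows)
```

Applied to the counts printed above, the same check reads 1744 + 1162 + 17194 = 20100 rows.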


### Calculating the Average Number of Comments for Ask HN and Show HN Posts

In [26]:
# Calculate total comments for ask posts  
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4]) # add number of comments to total 

# Calculate average number of comments for ask posts
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of comments on ask posts: ", avg_ask_comments)

Average number of comments on ask posts:  14.038417431192661


In [27]:
# Calculate total comments for show posts  
total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4]) # add number of comments to total

# Calculate average number of comments for show posts
avg_show_comments = total_show_comments / len(show_posts)
print("Average number of comments on show posts: ", avg_show_comments)

Average number of comments on show posts:  10.31669535283993


##### Notes
Ask posts receive more comments on average (about 14.0 versus 10.3 per post). My theory is that answers sometimes lack detail, so users ask for clarification or offer their own answers to the question. We'll focus this portion of the analysis on `Ask HN` posts.
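The two averaging loops differ only in the list they read, so they could share one helper. A minimal sketch, assuming a hypothetical function name `average_column` and made-up sample rows in the data set's column order:

```python
def average_column(rows, index):
    """Return the mean of the integer values stored at `index` in each row."""
    if not rows:
        raise ValueError("cannot average an empty list of rows")
    return sum(int(row[index]) for row in rows) / len(rows)

# Made-up rows: num_points is index 3, num_comments is index 4
sample_posts = [
    ['1', 'Ask HN: A', '', '10', '6', 'u1', '8/16/2016 9:55'],
    ['2', 'Ask HN: B', '', '4', '29', 'u2', '11/22/2015 13:43'],
    ['3', 'Ask HN: C', '', '1', '1', 'u3', '5/2/2016 10:14'],
]
avg_comments = average_column(sample_posts, 4)  # (6 + 29 + 1) / 3
avg_points = average_column(sample_posts, 3)    # (10 + 4 + 1) / 3
```

The explicit check for an empty list turns a bare `ZeroDivisionError` into a clearer message if a post type happens to have no rows.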

### Finding the Number of Ask Posts and Comments by Hour Created

In [28]:
# Create a list of lists pairing each post's creation time with its number of comments
result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])]) # convert comments column to integer
    
result_list[:5]

[['8/16/2016 9:55', 6],
 ['11/22/2015 13:43', 29],
 ['5/2/2016 10:14', 1],
 ['8/2/2016 14:20', 3],
 ['10/15/2015 16:38', 17]]

In [29]:
# Empty dictionaries
posts_by_hour = {}
comments_by_hour = {}

# Calculate total posts and comments per hour  
for row in result_list:
    hour = dt.strptime(row[0], '%m/%d/%Y %H:%M').strftime('%H') # Extract the hour from the created_at value
    if hour in posts_by_hour:
        posts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
    else:
        posts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]

comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
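The if/else bookkeeping in the loop can be avoided with `collections.defaultdict`, which starts every missing key at zero. A minimal sketch of the same aggregation, with a hypothetical function name `totals_by_hour` and a few made-up pairs:

```python
from collections import defaultdict
from datetime import datetime

def totals_by_hour(pairs, fmt='%m/%d/%Y %H:%M'):
    """Return (posts_by_hour, values_by_hour) keyed by two-digit hour strings."""
    posts = defaultdict(int)
    values = defaultdict(int)
    for created_at, value in pairs:
        # Parse the timestamp, then format the hour as '00' through '23'
        hour = datetime.strptime(created_at, fmt).strftime('%H')
        posts[hour] += 1
        values[hour] += value
    return dict(posts), dict(values)

sample = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 9:14', 1],
]
posts, comments = totals_by_hour(sample)
```

Here the two posts created during the 9 o'clock hour are counted under the key `'09'`, matching the key format used in the notebook.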

### Calculating the Average Number of Comments for Ask HN Posts by Hour

In [30]:
# Calculate average comments per hour for `Ask HN`
avg_by_hour = []

for key, value in comments_by_hour.items():
    avg_by_hour.append([key, (value / posts_by_hour[key])])
    
avg_by_hour

[['12', 9.41095890410959],
 ['06', 9.022727272727273],
 ['15', 38.5948275862069],
 ['04', 7.170212765957447],
 ['23', 7.985294117647059],
 ['10', 13.440677966101696],
 ['18', 13.20183486238532],
 ['14', 13.233644859813085],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['02', 23.810344827586206],
 ['20', 21.525],
 ['19', 10.8],
 ['16', 16.796296296296298],
 ['09', 5.5777777777777775],
 ['11', 11.051724137931034],
 ['08', 10.25],
 ['22', 6.746478873239437],
 ['01', 11.383333333333333],
 ['07', 7.852941176470588],
 ['21', 16.009174311926607],
 ['13', 14.741176470588234],
 ['17', 11.46],
 ['00', 8.127272727272727]]

In [31]:
# Reverse elements in average comments per hour list
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
swap_avg_by_hour

[[9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [38.5948275862069, '15'],
 [7.170212765957447, '04'],
 [7.985294117647059, '23'],
 [13.440677966101696, '10'],
 [13.20183486238532, '18'],
 [13.233644859813085, '14'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [10.8, '19'],
 [16.796296296296298, '16'],
 [5.5777777777777775, '09'],
 [11.051724137931034, '11'],
 [10.25, '08'],
 [6.746478873239437, '22'],
 [11.383333333333333, '01'],
 [7.852941176470588, '07'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [11.46, '17'],
 [8.127272727272727, '00']]

In [32]:
# Sort swap avg by hour list in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [33]:
# Print the 5 hours with the highest average comments.

print("Top 5 Hours for Ask Posts Comments")

for row in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post".format(dt.strptime(row[1], '%H').strftime('%H:%M'), row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
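Swapping the columns before sorting works, but `sorted` can also rank the original `[hour, average]` pairs directly through its `key` argument. A minimal sketch, assuming a hypothetical helper `top_hours` and a few pairs copied from the output above:

```python
def top_hours(avg_by_hour, n=5):
    """Return the n [hour, average] pairs with the highest average."""
    return sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:n]

# A few [hour, average] pairs taken from the notebook output
avg_sample = [
    ['12', 9.41095890410959],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]
for hour, avg in top_hours(avg_sample, n=3):
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

One difference worth knowing: with the swapped list, ties on the average are broken by the hour string, whereas the `key` version keeps tied pairs in their input order.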


##### Notes 
These hours are in the Eastern Time Zone.
Top 5 hours for ask post comments, in rank order: 3 pm, 2 am, 8 pm, 4 pm, and 9 pm.
The top hour (3 pm) averages about 62% more comments per post than the second-highest hour (2 am): 38.59 versus 23.81.
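The size of the gap between the two highest hours can be checked with a short calculation. A sketch, where `percent_increase` is a hypothetical helper and the two averages are copied from the output above:

```python
def percent_increase(new, old):
    """Return how much larger `new` is than `old`, as a percentage."""
    return (new - old) / old * 100

highest = 38.5948275862069   # 15:00 average comments per post
second = 23.810344827586206  # 02:00 average comments per post
increase = percent_increase(highest, second)
print("{:.0f}% more comments per post".format(increase))
```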

### Calculating the Average Number of Points for Ask HN and Show HN Posts

In [35]:
# Calculate total points for ask posts  
total_ask_points = 0
for post in ask_posts:
    total_ask_points += int(post[3]) # add number of points to total

# Calculate average number of points for ask posts
avg_ask_points = total_ask_points / len(ask_posts)
print("Average number of points on ask posts: ", avg_ask_points)

Average number of points on ask posts:  15.061926605504587


In [36]:
# Calculate total points for show posts  
total_show_points = 0
for post in show_posts:
    total_show_points += int(post[3]) # add number of points to total

# Calculate average number of points for show posts
avg_show_points = total_show_points / len(show_posts)
print("Average number of points on show posts: ", avg_show_points)

Average number of points on show posts:  27.555077452667813


##### Notes
Show posts receive more points on average (about 27.6 versus 15.1 per post). My theory is that show posts might have more interactive elements, like videos or graphs, that help generate more points. We'll focus this portion of the analysis on `Show HN` posts.

### Finding the Number of Show Posts and Points by Hour Created

In [39]:
# Create a list of lists pairing each post's creation time with its number of points
result_list = []

for post in show_posts:
    result_list.append([post[6], int(post[3])]) # convert points column to integer
    
result_list[:5]

[['11/25/2015 14:03', 26],
 ['11/29/2015 22:46', 747],
 ['4/28/2016 18:05', 1],
 ['7/28/2016 7:11', 3],
 ['1/9/2016 20:45', 1]]

In [40]:
# Empty dictionaries
posts_by_hour = {}
points_by_hour = {}

# Calculate total posts and points per hour  
for row in result_list:
    hour = dt.strptime(row[0], '%m/%d/%Y %H:%M').strftime('%H') # Extract the hour from the created_at value
    if hour in posts_by_hour:
        posts_by_hour[hour] += 1
        points_by_hour[hour] += row[1]
    else:
        posts_by_hour[hour] = 1
        points_by_hour[hour] = row[1]

points_by_hour

{'00': 1173,
 '01': 700,
 '02': 340,
 '03': 679,
 '04': 386,
 '05': 104,
 '06': 375,
 '07': 494,
 '08': 519,
 '09': 553,
 '10': 681,
 '11': 1480,
 '12': 2543,
 '13': 2438,
 '14': 2187,
 '15': 2228,
 '16': 2634,
 '17': 2521,
 '18': 2215,
 '19': 1702,
 '20': 1819,
 '21': 866,
 '22': 1856,
 '23': 1526}

### Calculating the Average Number of Points for Show HN Posts by Hour

In [43]:
# Calculate average points per hour for `Show HN`
avg_by_hour = []

for key, value in points_by_hour.items():
    avg_by_hour.append([key, (value / posts_by_hour[key])])
    
avg_by_hour

[['12', 41.68852459016394],
 ['22', 40.34782608695652],
 ['07', 19.0],
 ['15', 28.564102564102566],
 ['04', 14.846153846153847],
 ['06', 23.4375],
 ['10', 18.916666666666668],
 ['20', 30.316666666666666],
 ['14', 25.430232558139537],
 ['23', 42.388888888888886],
 ['05', 5.473684210526316],
 ['02', 11.333333333333334],
 ['19', 30.945454545454545],
 ['03', 25.14814814814815],
 ['16', 28.322580645161292],
 ['09', 18.433333333333334],
 ['11', 33.63636363636363],
 ['08', 15.264705882352942],
 ['18', 36.31147540983606],
 ['01', 25.0],
 ['21', 18.425531914893618],
 ['13', 24.626262626262626],
 ['17', 27.107526881720432],
 ['00', 37.83870967741935]]

In [44]:
# Reverse elements in average points per hour list
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
swap_avg_by_hour

[[41.68852459016394, '12'],
 [40.34782608695652, '22'],
 [19.0, '07'],
 [28.564102564102566, '15'],
 [14.846153846153847, '04'],
 [23.4375, '06'],
 [18.916666666666668, '10'],
 [30.316666666666666, '20'],
 [25.430232558139537, '14'],
 [42.388888888888886, '23'],
 [5.473684210526316, '05'],
 [11.333333333333334, '02'],
 [30.945454545454545, '19'],
 [25.14814814814815, '03'],
 [28.322580645161292, '16'],
 [18.433333333333334, '09'],
 [33.63636363636363, '11'],
 [15.264705882352942, '08'],
 [36.31147540983606, '18'],
 [25.0, '01'],
 [18.425531914893618, '21'],
 [24.626262626262626, '13'],
 [27.107526881720432, '17'],
 [37.83870967741935, '00']]

In [45]:
# Sort swap avg by hour list in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [48]:
# Print the 5 hours with the highest average points.

print("Top 5 Hours for Show Post Points")

for row in sorted_swap[:5]:
    print("{}: {:.2f} average points per post".format(dt.strptime(row[1], '%H').strftime('%H:%M'), row[0]))

Top 5 Hours for Show Post Points
23:00: 42.39 average points per post
12:00: 41.69 average points per post
22:00: 40.35 average points per post
00:00: 37.84 average points per post
18:00: 36.31 average points per post


##### Notes 
These hours are in the Eastern Time Zone.
Top 5 hours for show post points, in rank order: 11 pm, 12 pm, 10 pm, 12 am, and 6 pm.
Four of the top five hours start between 6 pm and 12 am, a six-hour window. The average across the top 5 hours is 39.7 points per post.
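The 39.7 figure can be reproduced from the five hourly averages printed above. Note that this sketch takes a plain mean of the hourly averages, which is an assumption about how the figure was derived; it does not weight each hour by its number of posts.

```python
# Hourly averages for 23:00, 12:00, 22:00, 00:00, and 18:00 from the output above
top_five = [
    42.388888888888886,
    41.68852459016394,
    40.34782608695652,
    37.83870967741935,
    36.31147540983606,
]
mean_top_five = sum(top_five) / len(top_five)
print("{:.1f} average points per post across the top 5 hours".format(mean_top_five))
```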

# Conclusion
In this project, we analyzed ask posts and show posts to determine which type of post and which creation time receive the most comments and points on average. Based on our analysis, to maximize the number of comments a post receives, we'd recommend submitting it as an ask post and creating it between 15:00 and 16:00 (3:00 pm to 4:00 pm EST). To maximize the number of points a post receives, we'd recommend submitting it as a show post and creating it between 22:00 and 00:00 (10:00 pm to 12:00 am EST).

However, the data set we analyzed excluded posts without any comments. With that in mind, it's more accurate to say that, of the posts that received comments, ask posts received more comments on average, and ask posts created between 15:00 and 16:00 (3:00 pm to 4:00 pm EST) received the most comments on average.