# Hacker News Post Analysis

[Hacker News](https://news.ycombinator.com/) is a site started by the startup incubator Y Combinator where users submit posts, mostly about technology and startups. Other users then vote and comment on these posts, much as they do on Reddit.

The data set that we'll work with can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts). It is important to note that this data set has been reduced from almost 300 000 rows to approximately 20 000 rows. This was done by first filtering out the submissions that did not receive any comments. Thereafter, random sampling was used to select data from the remaining submissions.

In this project we'll analyse the posts whose titles begin with either Ask HN or Show HN. We'll compare these two types of posts to determine the following:
- Do Ask HN or Show HN receive more comments on average?
- Do posts created at a certain time receive more comments on average?

## Introduction

Firstly, we'll open and read our data. We'll then store this as a list of lists and display the first 5 rows (the header row plus four posts).

In [78]:
#Open the file
opened_file = open('hacker_news.csv')

#Read the file and save as a list of lists
from csv import reader
read_file = reader(opened_file)
hn = list(read_file)

#Display the first 5 rows
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
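
As a minimal sketch of the same read step, the helper below wraps `csv.reader` in a function. The name `read_rows` and the in-memory sample are assumptions made for illustration; the sample reuses one row from the output above so the example runs without the data file.

```python
import csv
import io

# Hypothetical in-memory stand-in for hacker_news.csv (header plus one row)
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

def read_rows(file_obj):
    """Return every row of a CSV file object as a list of lists of strings."""
    return list(csv.reader(file_obj))

sample_rows = read_rows(sample_csv)
print(sample_rows[0])  # header row
print(sample_rows[1])  # first data row
```

With the real file, the same helper could be called inside a `with open('hacker_news.csv', newline='') as f:` block, which closes the file automatically once the rows have been read.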


## Removing Headers from a List of Lists

Next, we'll remove the header row and save it into a separate variable. We'll then display the first 5 rows of data and the header separately.

In [79]:
header = hn[0]
hn = hn[1:]

#Display the header
print(header)

#Display the first 5 rows
hn[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

## Extracting Ask HN and Show HN posts

We'll now extract Ask HN and Show HN posts into two separate lists as these are the data we are interested in. We'll also create another list containing other posts. Separating the posts into 3 separate lists will make it easier for us to analyse the data.

In [80]:
# Create empty lists
ask_posts = []
show_posts = []
other_posts = []

# Separate posts into their respective list
for row in hn:
    title = row[1].lower()

    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Display the number of posts in each list
print('No. of Ask HN posts:', len(ask_posts))
print('No. of Show HN posts:', len(show_posts))
print('No. of other posts:', len(other_posts))

No. of Ask HN posts: 1744
No. of Show HN posts: 1162
No. of other posts: 17194
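
The same three-way split can be sketched as a small function, which makes the rule easy to check on individual titles. The function name `classify_title` and the sample titles are hypothetical.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the start of the title."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles used only to exercise the three branches
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: A tiny static site generator',
    'Interactive Dynamic Video',
]
for sample_title in sample_titles:
    print(classify_title(sample_title), '<-', sample_title)
```

Lowercasing the title first means that variants such as `ASK HN` or `ask hn` land in the same group, which matches the behaviour of the loop above.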


It seems as if the number of *Ask HN* and *Show HN* posts is significantly lower than the number of *Other* posts. Let's look at each type's share of the total, expressed as a proportion between 0 and 1.

In [81]:
print('Percentage Ask HN posts:', round(len(ask_posts)/len(hn),2))
print('Percentage Show HN posts:', round(len(show_posts)/len(hn),2))
print('Percentage other posts:', round(len(other_posts)/len(hn),2))

Percentage Ask HN posts: 0.09
Percentage Show HN posts: 0.06
Percentage other posts: 0.86


It is interesting to note that, when combined, *Ask HN* and *Show HN* posts make up only approximately 15% of the posts in this sample. Because the sample excludes submissions that received no comments, the share across the whole Hacker News site may differ.
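
As a quick check on these shares, the sketch below recomputes them as percentages from the three counts printed earlier. The helper name `share` is an assumption for illustration.

```python
# Counts copied from the printed output of the classification step
counts = {'Ask HN': 1744, 'Show HN': 1162, 'Other': 17194}
total = sum(counts.values())

def share(count, total):
    """Return count as a percentage of total."""
    return count / total * 100

for label, count in counts.items():
    print(f"{label}: {share(count, total):.1f}%")

# Share of the two post types of interest taken together
combined = share(counts['Ask HN'] + counts['Show HN'], total)
print(f"Ask HN + Show HN combined: {combined:.1f}%")
```

Computing the combined share from the raw counts avoids adding two figures that have already been rounded.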

## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Firstly, we'll calculate the average number of comments for *Ask HN* posts.

In [82]:
# Initiate variable
total_ask_comments = 0

# Calculate total number of comments for Ask HN posts
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

# Calculate the average number of ask comments per post
avg_ask_comments = total_ask_comments/len(ask_posts)

print('Total no. of comments for Ask HN posts:', total_ask_comments)
print('Average no. of comments per Ask HN post:',round(avg_ask_comments, 2))

Total no. of comments for Ask HN posts: 24483
Average no. of comments per Ask HN post: 14.04


Now, we'll calculate the average number of comments for *Show HN* posts.

In [83]:
# Initiate variable
total_show_comments = 0

# Calculate total number of comments for Show HN posts
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

# Calculate the average number of show comments per post
avg_show_comments = total_show_comments/len(show_posts)

print('Total no. of comments for Show HN posts:', total_show_comments)
print('Average no. of comments per Show HN post:',round(avg_show_comments, 2))

Total no. of comments for Show HN posts: 11988
Average no. of comments per Show HN post: 10.32


From our calculations, we can see that *Ask HN* posts have an average of about 14 comments per post while *Show HN* posts have an average of about 10 comments per post. In this sample, users therefore comment somewhat more on *Ask HN* posts than on *Show HN* posts.

Since this is the case, we'll focus our remaining analysis on *Ask HN* posts.
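
The two cells above repeat the same loop, so the calculation could be factored into one helper. This is a minimal sketch; the name `average_of_column` and the sample rows are hypothetical.

```python
def average_of_column(rows, index):
    """Mean of an integer column stored as strings; 0.0 for an empty list."""
    if not rows:
        return 0.0
    return sum(int(row[index]) for row in rows) / len(rows)

# Hypothetical rows in the same column order as the data set
sample_posts = [
    ['1', 'Ask HN: example one', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: example two', '', '7', '4', 'user_b', '8/4/2016 15:10'],
]

print(average_of_column(sample_posts, 4))  # column 4 holds num_comments
```

Under this sketch, `average_of_column(ask_posts, 4)` and `average_of_column(show_posts, 4)` would stand in for the two loops, and the empty-list guard avoids a division by zero if a category has no posts.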

## Do Posts Created at a Certain Time Attract More Comments?

In the next step of our analysis, we want to determine whether the time of day at which a post is created affects the number of comments it receives. We'll use the following steps to perform our analysis:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

### Finding the Amount of Ask Posts and Comments by Hour Created

We'll first calculate the number of ask posts and comments by hour created. We'll be using the `datetime` module in our code.

In [101]:
# Import module
import datetime as dt

# Initialise list
result_list = []

# Create a list of lists with two elements (creation date and time, no. of comments)
for row in ask_posts:
    created_at = row[6]
    no_comments = row[4]
    
    result_list.append([created_at, no_comments])
    
# Initialise dictionaries
counts_by_hour = {}
comments_by_hour = {}

# Create two frequency tables with no. of posts per hour and no. of comments per hour respectively
for row in result_list:
    date_n_time_str = row[0]
    no_comments = int(row[1]) # Convert str to int as we'll be adding all the no_comments
    
    date_n_time_obj = dt.datetime.strptime(date_n_time_str, "%m/%d/%Y %H:%M") # Parsing string
    hour_str = date_n_time_obj.strftime("%H") # Formatting object
    
    if hour_str not in counts_by_hour:
        counts_by_hour[hour_str] = 1
        comments_by_hour[hour_str] = no_comments
    else:
        counts_by_hour[hour_str] += 1
        comments_by_hour[hour_str] += no_comments
        
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
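
The two frequency tables can also be built with `collections.defaultdict`, which removes the need for the `if`/`else` branch. The sketch below is one possible variant; the function name `tally_by_hour` and the sample pairs are hypothetical.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(pairs):
    """Return (posts per hour, comments per hour) keyed by a two-digit hour."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
        hour = parsed.strftime("%H")
        counts[hour] += 1  # missing keys start at 0
        comments[hour] += int(num_comments)
    return dict(counts), dict(comments)

# Hypothetical [created_at, num_comments] pairs in the data set's format
sample_pairs = [
    ['8/4/2016 11:52', '52'],
    ['1/26/2016 19:30', '10'],
    ['6/23/2016 11:05', '1'],
]
sample_counts, sample_comments = tally_by_hour(sample_pairs)
print(sample_counts)
print(sample_comments)
```

A `defaultdict(int)` creates a missing key with the value 0 on first access, so both tables can be updated with a single `+=` per row.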

### Calculating the Average Number of Comments for Ask HN Post by Hour

Now, we'll use the information above to calculate the average number of comments per Ask HN post by hour.

In [102]:
# Initialise list
avg_by_hour = []

# Calculate the average number of comments per post by hour
for hour, posts in counts_by_hour.items():
    avg_by_hour.append([hour, int(comments_by_hour[hour])/posts])

# Display average number of comments per post by hour
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

We can see from the results above that the average number of comments per post ranges from approximately 6 (at 09:00) to approximately 39 (at 15:00), which is a wide range.

### Sorting and Printing Values from a List of Lists


To analyse this data further, we'll sort the list by the average number of comments per post, from highest to lowest, so that the hours attracting the most comments are easy to identify.

In [110]:
# Initialising list
swap_avg_by_hour = []

# Swap columns
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

# Display list
print(swap_avg_by_hour)

# Sort list by highest number of comments
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


**Top 5 Hours for Ask Posts Comments**

In [119]:
# Display Top 5 Hours for Ask Posts Comments
print("Top 5 Hours for Ask Posts Comments")

template = "{h}:00: {c:.2f} average comments per post."
for row in sorted_swap[:5]:
    print(template.format(h = row[1], c = row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


From our results above, we can see that the best hour to create an Ask HN post is 3 pm, when posts receive about 39 comments on average. Posts created at 2 am and 8 pm receive a bit over half that average, while posts created at 4 pm and 9 pm receive less than half of it.

However, as users are posting comments from all over the world, it is important to note what timezone is used in this dataset. According to this [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts/home), Eastern Time in the US is used. 

Therefore, if I were to create a post while in St. Louis, MO to attract the most comments, I would do it at 2 pm CST (3 pm EST).
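
The conversion can be sketched with fixed UTC offsets from the `datetime` module. This is a simplification: it assumes standard time (UTC-5 for Eastern, UTC-6 for Central) and ignores daylight saving, and the date used is arbitrary.

```python
import datetime as dt

# Assumed fixed offsets for standard time; daylight saving is ignored
EASTERN_STANDARD = dt.timezone(dt.timedelta(hours=-5), "EST")
CENTRAL_STANDARD = dt.timezone(dt.timedelta(hours=-6), "CST")

# 3 pm Eastern on an arbitrary winter date
best_hour_eastern = dt.datetime(2016, 1, 26, 15, 0, tzinfo=EASTERN_STANDARD)
best_hour_central = best_hour_eastern.astimezone(CENTRAL_STANDARD)

print(best_hour_central.strftime("%H:%M %Z"))
```

Because both zones normally change their clocks on the same dates, the one-hour gap between them holds all year, even though the fixed offsets themselves apply only to standard time.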

**Best Time of Day to Post to Attract Most Comments**

Next, we'll look into the best time of day (morning, afternoon, evening, late night/early morning) to create Ask HN posts to get the most comments. We'll split the 24 hours into four 6-hour periods:
- Morning: 6 am to 12 pm
- Afternoon: 12 pm to 6 pm
- Evening: 6 pm to 12 am
- Late Night/Early Morning: 12 am - 6 am

In [133]:
# Initialising dictionary
comments_by_time = {'Morning':0, 'Afternoon':0, 'Evening':0, 'Late Night/Early Morning':0}

# Sort list by time of day
sorted_avg_per_hour = sorted(avg_by_hour)

# Grouping hourly averages into time of day
for row in sorted_avg_per_hour:
    hour = int(row[0]) # Local variable, so the rows shared with avg_by_hour are not modified
    
    if 0 <= hour < 6:
        comments_by_time['Late Night/Early Morning'] += row[1]
    
    elif 6 <= hour < 12:
        comments_by_time['Morning'] += row[1]
        
    elif 12 <= hour < 18:
        comments_by_time['Afternoon'] += row[1]
    
    elif 18 <= hour <= 23:
        comments_by_time['Evening'] += row[1]
    else:
        print("Error")
        
comments_by_time


{'Morning': 57.195848331008364,
 'Afternoon': 104.23690411701412,
 'Evening': 76.26778216519843,
 'Late Night/Early Morning': 68.37441647218513}

Each value above is the sum of the six hourly averages in that period, not a count of comments. On this measure, the best time of day to attract comments is the afternoon, and the worst is the morning, which scores only a bit over half the afternoon value. One possible explanation is that people are generally more focused on work or schoolwork in the morning and have less time to browse and comment on posts.
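
Because the cell above adds up hourly averages, an hour with few posts counts as much as an hour with many. A weighted alternative divides the total comments in each period by the total posts in that period. The sketch below shows this approach under stated assumptions: the function names and the small sample tables are hypothetical.

```python
def period_of(hour):
    """Map an hour (0-23) to one of the four 6-hour periods."""
    if hour < 6:
        return 'Late Night/Early Morning'
    if hour < 12:
        return 'Morning'
    if hour < 18:
        return 'Afternoon'
    return 'Evening'

def pooled_average_by_period(counts_by_hour, comments_by_hour):
    """Total comments divided by total posts for each period."""
    posts = {}
    comments = {}
    for hour_str, count in counts_by_hour.items():
        period = period_of(int(hour_str))
        posts[period] = posts.get(period, 0) + count
        comments[period] = comments.get(period, 0) + comments_by_hour[hour_str]
    return {period: comments[period] / posts[period] for period in posts}

# Hypothetical tables shaped like counts_by_hour and comments_by_hour
sample_counts = {'09': 2, '10': 2, '15': 1}
sample_comments = {'09': 4, '10': 16, '15': 30}

print(pooled_average_by_period(sample_counts, sample_comments))
```

Run on the full `counts_by_hour` and `comments_by_hour` tables, this would give an average number of comments per post for each period, which may rank the periods differently from the sums above.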

## Do Show Posts or Ask Posts Receive More Points on Average?

The total number of points for a post is the difference between its total number of upvotes and downvotes. We'll investigate whether *Show HN* posts or *Ask HN* posts receive more points on average.

In [143]:
# Initiate variables
total_show_points = 0
total_ask_points = 0

for row in show_posts:
    points = int(row[3])
    total_show_points += points
    
for row in ask_posts:
    points = int(row[3])
    total_ask_points += points
    
avg_show_points = total_show_points/len(show_posts)
avg_ask_points = total_ask_points/len(ask_posts)

print("Average points per Show HN posts:", round(avg_show_points, 2))
print("Average points per Ask HN posts:", round(avg_ask_points, 2))

Average points per Show HN posts: 27.56
Average points per Ask HN posts: 15.06


It can be seen that Show HN posts receive approximately 83% more points than Ask HN posts on average (27.56 versus 15.06). Show HN posts might be more popular than Ask HN posts because they share something new that sparks other users' interest.
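
The relative difference between the two averages can be checked directly from the printed values. This short sketch copies the two rounded averages from the output above.

```python
# Rounded averages copied from the printed output
avg_show_points = 27.56
avg_ask_points = 15.06

# How much higher the Show HN average is, relative to the Ask HN average
relative_difference = (avg_show_points - avg_ask_points) / avg_ask_points
print(f"Show HN posts receive {relative_difference:.0%} more points on average.")
```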

## Conclusion
To summarize, in this project we first analysed *Ask HN* posts and *Show HN* posts to determine which type of post receives the higher number of comments on average. We found that *Ask HN* posts received a greater number of comments on average, so we decided to investigate *Ask HN* posts further.

Through our results, we found that the best hour to create an *Ask HN* post to receive the most comments on average is 3 pm Eastern Time. If that is not achievable, the best time of day to create a post is in the afternoon, from 12 pm to 6 pm Eastern Time.

Lastly, we also noted that *Show HN* posts receive more points on average, which suggests that they are better liked by users.