# Guided Project: Exploring Hacker News Posts

<div style="text-align: justify"> In this project, we'll work with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/). Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.</div>
<div style="text-align: justify">The dataset we will use can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts). For the purpose of this project, it has been reduced: the number of rows has been cut from almost 300,000 to roughly 20,000 by removing the posts with no comments.</div>
<div style="text-align: justify">Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. We'll compare these two types of posts to determine the following: </div>

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

In [6]:
import csv

In [7]:
#Let's start by reading in the dataset and displaying the first five rows:

#The with block closes the file once the rows have been read
with open('hacker_news.csv', newline='') as opened:
    read = csv.reader(opened)
    hn = list(read)

hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [8]:
#Let's remove the header from the list

headers = hn[0]
hn = hn[1:]
headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
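
Indexing columns by position (`post[1]`, `post[4]`, `post[6]`) works, but the numbers are easy to misread. As a minimal sketch, assuming the header row printed above, a name-to-index mapping lets later code refer to columns by name. The names `col` and `sample_row` are hypothetical and not part of the original notebook.

```python
# Header row as printed above; sample_row is the first data row of the dataset.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Map each column name to its position in a row (hypothetical helper name: col).
col = {name: index for index, name in enumerate(headers)}

# Look values up by column name instead of a bare number.
num_comments = int(sample_row[col['num_comments']])
created_at = sample_row[col['created_at']]
print(num_comments, created_at)
```

The rest of this notebook keeps the positional indexes, so the mapping is optional.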

In [48]:
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Filtering the data
Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.

In [10]:
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)


In [11]:
print('Number of Ask posts: {}'.format(len(ask_posts)))
print('Number of Show posts: {}'.format(len(show_posts)))
print('Number of Other posts: {}'.format(len(other_posts)))

Number of Ask posts: 1744
Number of Show posts: 1162
Number of Other posts: 17194
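
The filter lowercases each title before calling `startswith`, so `Ask HN`, `ask hn` and `ASK HN` all match, while titles that merely contain the phrase later on do not. The sketch below illustrates that behaviour on made-up titles; the function name `classify` is an assumption introduced for this example only.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles, for illustration only.
sample_titles = [
    'Ask HN: How do you learn Python?',
    'SHOW HN: My weekend project',
    'Why you should ask HN first',
]

labels = [classify(title) for title in sample_titles]
print(labels)
```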


## Average number of comments
Now, let's determine if ask posts or show posts receive more comments on average.

In [74]:
#First we will find the average number of comments:

def avg_comments(rows):
    total_comments = 0
    for post in rows:
        comments = int(post[4])
        total_comments += comments
    
    return total_comments/len(rows)

avg_ask_comments = avg_comments(ask_posts)
avg_show_comments = avg_comments(show_posts)
avg_other_comments = avg_comments(other_posts)
print("The average number of comments on ask posts: {:.2f}".format(avg_ask_comments))
print("The average number of comments on show posts: {:.2f}".format(avg_show_comments))
print("The average number of comments on other posts: {:.2f}".format(avg_other_comments))

The average number of comments on ask posts: 14.04
The average number of comments on show posts: 10.32
The average number of comments on other posts: 26.87


<div style="text-align: justify">On average, Ask posts receive more comments than Show posts (14.04 versus 10.32). Since Ask posts attract more comments on average, we'll focus the remaining analysis on them.</div>
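
To put the gap in perspective, the relative difference between the two averages can be worked out directly. This is a small arithmetic sketch that uses the rounded averages printed above, so the result is approximate; the variable names are introduced here for illustration.

```python
# Rounded averages copied from the output above.
avg_ask = 14.04
avg_show = 10.32

# How many more comments an Ask post receives, relative to a Show post.
relative_difference = (avg_ask - avg_show) / avg_show
print("Ask posts receive about {:.0%} more comments than Show posts".format(relative_difference))
```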

## Ask Posts and Comments by Hour Created

<div style="text-align: justify">
Next, we'll determine if ask posts created at a certain time are more likely to attract comments. To do that, we will:</div>

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.
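
The two steps above can be sketched on a handful of rows before applying them to the full dataset. The sample rows below are made up, and the use of `collections.defaultdict` is an alternative chosen for this sketch, not the approach the notebook itself takes.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs in the dataset's date format.
sample = [
    ['8/4/2016 11:52', 6],
    ['1/26/2016 11:05', 2],
    ['6/23/2016 22:20', 10],
]

# defaultdict(int) starts every missing hour at 0, so no membership check is needed.
sample_counts = defaultdict(int)
sample_comments = defaultdict(int)

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    sample_counts[hour] += 1                 # step 1: posts per hour
    sample_comments[hour] += num_comments    # step 1: comments per hour

# Step 2: average comments per post for each hour.
sample_avg = {hour: sample_comments[hour] / sample_counts[hour]
              for hour in sample_counts}
print(sample_avg)
```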

In [16]:
import datetime as dt

In [17]:
#First we will extract the created at and comments columns for each post and 
#add them to a new list

result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])

In [24]:
#Next, we will count the number of posts per hour
#We will also add up the number of comments per hour

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    created = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = created.hour
    #Both dictionaries share the same keys, so one membership test is enough
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

In [34]:
#Next, we will calculate the average number of comments for posts created 
#during each hour of the day

avg_by_hour = []

for hour in counts_by_hour.keys():
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])

print(avg_by_hour)

[[0, 8.127272727272727], [1, 11.383333333333333], [2, 23.810344827586206], [3, 7.796296296296297], [4, 7.170212765957447], [5, 10.08695652173913], [6, 9.022727272727273], [7, 7.852941176470588], [8, 10.25], [9, 5.5777777777777775], [10, 13.440677966101696], [11, 11.051724137931034], [12, 9.41095890410959], [13, 14.741176470588234], [14, 13.233644859813085], [15, 38.5948275862069], [16, 16.796296296296298], [17, 11.46], [18, 13.20183486238532], [19, 10.8], [20, 21.525], [21, 16.009174311926607], [22, 6.746478873239437], [23, 7.985294117647059]]


Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's sort the list of lists and print the five highest values in a format that's easier to read.
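
One way to sort is to swap the columns so that the average comes first, which is what the next cells do. As an alternative sketch, `sorted` also accepts a `key` function, which sorts on the average without building a swapped list. The sample pairs are copied from the output above; the name `sample_avg_by_hour` is introduced for this example.

```python
# A few [hour, average] pairs copied from the output above.
sample_avg_by_hour = [
    [0, 8.127272727272727],
    [2, 23.810344827586206],
    [15, 38.5948275862069],
    [20, 21.525],
]

# Sort on the second element (the average), highest first.
top = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in top[:3]:
    print("{:02d}:00: {:.2f} average comments per post".format(hour, average))
```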

In [33]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

[[8.127272727272727, 0], [11.383333333333333, 1], [23.810344827586206, 2], [7.796296296296297, 3], [7.170212765957447, 4], [10.08695652173913, 5], [9.022727272727273, 6], [7.852941176470588, 7], [10.25, 8], [5.5777777777777775, 9], [13.440677966101696, 10], [11.051724137931034, 11], [9.41095890410959, 12], [14.741176470588234, 13], [13.233644859813085, 14], [38.5948275862069, 15], [16.796296296296298, 16], [11.46, 17], [13.20183486238532, 18], [10.8, 19], [21.525, 20], [16.009174311926607, 21], [6.746478873239437, 22], [7.985294117647059, 23]]


In [35]:
#Let's sort the list

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

In [43]:
print("Top 5 Hours for Ask Posts Comments:")
for hour in sorted_swap[:5]:
    h = dt.datetime.strptime(str(hour[1]), "%H")
    print("{}: {:.2f} average comments per post".format(h.strftime("%H:%M"), hour[0]))

Top 5 Hours for Ask Posts Comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The five hours listed above are those in which Ask posts attracted the most comments on average, led by 15:00 with 38.59 comments per post. All times are in US Eastern Time.

In [62]:
#Let's create a function that users can call to find
#the best times to post in their own time zone

from datetime import timedelta

def time_zone(offset_hours, tz_name=''):
    """
    Print the top five hours for Ask post comments in another time zone.

    offset_hours: whole-hour difference from US Eastern Time,
                  positive or negative (for example, -3 for Pacific time)
    tz_name: optional name of the time zone, used only in the heading
    """
    
    print("Top 5 Hours for Ask Posts Comments in time zone {}:".format(tz_name))
    for hour in sorted_swap[:5]:
        h = dt.datetime.strptime(str(hour[1]), "%H")
        #Adding a timedelta wraps correctly around midnight
        tz_time = h + timedelta(hours=offset_hours)
        print("{}: {:.2f} average comments per post".format(tz_time.strftime("%H:%M"), hour[0]))

In [64]:
time_zone(-3, 'Pacific time')

Top 5 Hours for Ask Posts Comments in time zone Pacific time:
12:00: 38.59 average comments per post
23:00: 23.81 average comments per post
17:00: 21.52 average comments per post
13:00: 16.80 average comments per post
18:00: 16.01 average comments per post
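
The same shift can be expressed with modular arithmetic, which makes the wrap-around at midnight explicit. This is a minimal sketch with a hypothetical helper named `shift_hour`; like the function above, it assumes a fixed whole-hour offset and so ignores any daylight-saving differences between the two zones.

```python
def shift_hour(hour, offset):
    """Shift an hour of the day (0-23) by a whole-hour offset, wrapping around midnight."""
    return (hour + offset) % 24

# Top five Eastern Time hours from the output above.
eastern_hours = [15, 2, 20, 16, 21]

# Pacific time is three hours behind Eastern Time.
pacific_hours = [shift_hour(hour, -3) for hour in eastern_hours]
print(pacific_hours)
```

Python's `%` operator returns a non-negative result for a positive divisor, so 02:00 Eastern shifted by -3 becomes 23:00 and not -1.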


## Conclusion

In this guided project, we worked with data from posts on the website Hacker News. We separated Ask posts from Show posts and found that Ask posts receive more comments on average. We then determined the hours at which an Ask post is likely to receive the most comments. Next steps to consider:

- Determine if show or ask posts receive more points on average.
- Determine if posts created at a certain time are more likely to receive more points.
- Compare your results to the average number of comments and points other posts receive.

## Next steps

In [71]:
#Determine if show or ask posts receive more points on average

def points_on_average(rows):
    total_points = 0
    for row in rows:
        points = int(row[3])
        total_points += points
    return total_points/len(rows)

ask_avg_points = points_on_average(ask_posts)
show_avg_points = points_on_average(show_posts)
other_avg_points = points_on_average(other_posts)

print("On average, Ask posts receive {:.2f} points.".format(ask_avg_points))
print("On average, Show posts receive {:.2f} points.".format(show_avg_points))
print("On average, Other posts receive {:.2f} points.".format(other_avg_points))

On average, Ask posts receive 15.06 points.
On average, Show posts receive 27.56 points.
On average, Other posts receive 55.41 points.
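
`points_on_average` differs from the earlier `avg_comments` only in the column it reads. As a sketch of how the two could be merged, the hypothetical helper below takes the column index as a parameter and returns `None` for an empty list instead of raising `ZeroDivisionError`. The sample rows are the first three data rows of the dataset.

```python
def average_of_column(rows, column_index):
    """Average the integer values found at column_index, or None if rows is empty."""
    if not rows:
        return None
    total = sum(int(row[column_index]) for row in rows)
    return total / len(rows)

# First three data rows of the dataset, as displayed at the top of the notebook.
sample_rows = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time',
     'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
     '39', '10', 'josep2', '1/26/2016 19:30'],
    ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke",
     'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
     '2', '1', 'vezycash', '6/23/2016 22:20'],
]

sample_avg_points = average_of_column(sample_rows, 3)    # num_points column
sample_avg_comments = average_of_column(sample_rows, 4)  # num_comments column
print("{:.2f} points, {:.2f} comments".format(sample_avg_points, sample_avg_comments))
```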


In [75]:
#Determine if posts created at a certain time 
#are more likely to receive more points

hour_count = {}
points_by_hour = {}

for post in hn:
    created = dt.datetime.strptime(post[-1], "%m/%d/%Y %H:%M")
    hour = created.hour
    points = int(post[3])
    
    #Both dictionaries share the same keys, so one membership test is enough
    if hour not in hour_count:
        hour_count[hour] = 1
        points_by_hour[hour] = points
    else:
        hour_count[hour] += 1
        points_by_hour[hour] += points
    

In [85]:
average_points_by_hour = []

for hour in hour_count.keys():
    average_points_by_hour.append([hour, points_by_hour[hour]/hour_count[hour]])
    
swap_points = []
for row in average_points_by_hour:
    swap_points.append([row[1], row[0]])
    
sorted_points = sorted(swap_points, reverse=True)

print("On average you are more likely to receive more points if you post at:")
for row in sorted_points[:5]:
    h = dt.datetime.strptime(str(row[1]), "%H")
    print("{}: {:.2f} points per post".format(h.strftime("%H:%M"), row[0]))
    

On average you are more likely to receive more points if you post at:
13:00: 56.17 points per post
15:00: 55.65 points per post
10:00: 54.71 points per post
14:00: 54.44 points per post
19:00: 54.17 points per post
