# Project 2: Exploring Hacker News Posts


In this project, we'll work with a data set of submissions to the popular technology site **Hacker News**.

We're **specifically interested** in posts whose titles begin with either Ask HN or Show HN. 

Users submit **Ask HN** posts to ask the Hacker News community a specific question. 

Users submit **Show HN** posts to show the Hacker News community a project, product, or just generally something interesting. 


### We'll compare these two types of posts to determine the following:

* Do Ask HN or Show HN receive **more comments on average**?
* Do posts created at a certain time receive **more comments on average**?

> Download the dataset [here](https://www.kaggle.com/hacker-news/hacker-news-posts)

#### Column Descriptions:
* `id`: The unique identifier from Hacker News for the post

* `title`: The title of the post

* `url`: The URL that the post links to, if the post has a URL

* `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes

* `num_comments`: The number of comments that were made on the post

* `author`: The username of the person who submitted the post

* `created_at`: The date and time at which the post was submitted



## 1.0 Open and interpret the csv file

In [1]:
from csv import reader

# Open the file in a context manager so it is closed once the rows are read
with open("hacker_news.csv") as opened_file:

    read_file = reader(opened_file)

    hn = list(read_file)

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


### Split the header from the data

#### Extract the first row (the header row) and assign it to the variable `headers`. Because `hn[:1]` is a slice, `headers` is a list containing that single row.

In [2]:
headers = hn[:1]

print(headers)

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]


#### Remove the first row from hn. Display the first five rows of hn to verify that you removed the header row properly.

In [3]:
hn = hn[1:]

print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## 2.0 Filter the data

Now that we've removed the headers from hn, we're ready to filter our data. 

Since we're only concerned with **post titles beginning with Ask HN or Show HN**, we'll create new lists of lists containing just the data for those titles.

To find the **posts that begin with either Ask HN or Show HN**, we'll use the string method `startswith`. 

Given a string object, say `string1`, we can check whether it starts with, say, `dq` by inspecting the return value of `string1.startswith('dq')`. 

If `string1` starts with `dq`, it will return `True`, otherwise it will return `False`.

> **Capitalization matters!** If we wish to control for case, we can use the `lower` method which returns a lowercase version of the starting string.  
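
As a small illustration of the case sensitivity noted above, the sketch below checks a made-up title with and without lowercasing it first. The name `sample_title` and its text are hypothetical and are not taken from the data set.

```python
# Hypothetical title used only for illustration.
sample_title = "Ask HN: How do you learn a new language?"

# startswith is case-sensitive, so a lowercase prefix does not match.
case_sensitive = sample_title.startswith("ask hn")

# Lowercasing the title first makes the check case-insensitive.
case_insensitive = sample_title.lower().startswith("ask hn")

print(case_sensitive)    # False
print(case_insensitive)  # True
```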

In [4]:
ask_posts = []

show_posts = []

other_posts = []

for row in hn:
    
    title = row[1]
    
    if title.lower().startswith("ask hn"):
        
        ask_posts.append(row)
    
    elif title.lower().startswith("show hn"):
        
        show_posts.append(row)
        
    else:
        
        other_posts.append(row)
        
print("The length of ask_posts is: ", len(ask_posts))
print("\n")
print("The length of show_posts is: ", len(show_posts))
print("\n")
print("The length of other_posts is: ", len(other_posts))

The length of ask_posts is:  1744


The length of show_posts is:  1162


The length of other_posts is:  17194


## 3.0 Determine which posts receive more comments on average

#### Find the total number of comments in ask posts

In [5]:
total_ask_comments = 0

for row in ask_posts:
    
    num_comments = int(row[4])
    
    total_ask_comments += num_comments

print(total_ask_comments)    

24483


#### Compute the average number of comments on ask posts

In [6]:
avg_ask_comments = total_ask_comments/len(ask_posts)

print("The average number of comments on ask posts is: ", avg_ask_comments)

The average number of comments on ask posts is:  14.038417431192661


#### Find the total number of comments in show posts

In [7]:
total_show_comments = 0

for row in show_posts:
    
    num_comments = int(row[4])
    
    total_show_comments += num_comments
    
print(total_show_comments)
    

11988


#### Compute the average number of comments on show posts

In [8]:
avg_show_comments = total_show_comments/len(show_posts)

print("The average number of comments on show posts is: ", avg_show_comments)


The average number of comments on show posts is:  10.31669535283993


### Summary of findings

| |Total number of comments|Average number of comments per post|
|------|------|------|
|Ask posts |24483     |14.04     |
|Show posts |11988     |10.32     |

### Ask posts receive more comments than show posts on average. 

Since ask posts **receive more comments on average**, we'll focus our remaining analysis just on these posts.

## 4.0 Determine if posts created at a certain time receive more comments on average

We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts **created in each hour** of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive **by hour created**.

We'll use the `datetime` module to work with the data in the `created_at` column.

We can use the `datetime.strptime()` class method to parse dates stored as strings into datetime objects.

Iterate over `ask_posts` and append to `result_list` a list with two elements:
1. The first element shall be the value of the `created_at` column
2. The second element shall be the number of comments of the post.

In [9]:
import datetime as dt

result_list = []

for row in ask_posts:
    
    created_at = row[6]
    num_comments = int(row[4])
    
    this_list = [created_at, num_comments]
    
    result_list.append(this_list)
    
print(result_list[0])


['8/16/2016 9:55', 6]


Loop through each row of `result_list`. 

Use the `datetime.strptime()` method to parse the date and create a datetime object.

Use the `datetime.strftime()` method to select just the hour from the datetime object.
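
The parse-then-format step can be sketched on a single value first. The snippet below assumes a timestamp in the same `month/day/year hour:minute` format as the `created_at` column; the names `sample_created_at`, `parsed`, and `hour` are illustrative only.

```python
import datetime as dt

# Sample timestamp in the same format as the created_at column.
sample_created_at = "8/16/2016 9:55"

# Parse the string into a datetime object.
parsed = dt.datetime.strptime(sample_created_at, "%m/%d/%Y %H:%M")

# Format only the hour as a zero-padded, two-digit string.
hour = parsed.strftime("%H")

print(parsed)  # 2016-08-16 09:55:00
print(hour)    # 09
```

Because `strftime("%H")` returns a string, the hours end up as keys such as `'09'` rather than the integer `9`.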

Create two dictionaries:

* `counts_by_hour`: contains the number of ask posts created during each hour of the day.
* `comments_by_hour`: contains the corresponding number of comments ask posts created at each hour received.
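
One possible way to build two tallies like these is with `dict.get`, which supplies a default of `0` for an hour that has not been seen yet. This is a sketch of an alternative to an explicit `if`/`else` membership check, shown on made-up rows. The names `sample_rows`, `sample_counts`, and `sample_comments` are hypothetical.

```python
# Hypothetical (hour, num_comments) pairs used only for illustration.
sample_rows = [("09", 6), ("13", 29), ("09", 4)]

sample_counts = {}
sample_comments = {}

for hour, num_comments in sample_rows:
    # dict.get returns 0 when the hour has not been seen yet.
    sample_counts[hour] = sample_counts.get(hour, 0) + 1
    sample_comments[hour] = sample_comments.get(hour, 0) + num_comments

print(sample_counts)    # {'09': 2, '13': 1}
print(sample_comments)  # {'09': 10, '13': 29}
```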

In [10]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    
    created_at = row[0]
    
    num_comments = row[1]
    
    dt_time = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    
    dt_hour = dt_time.strftime("%H")
    
    if dt_hour not in counts_by_hour:
        
        counts_by_hour[dt_hour] = 1
        
        comments_by_hour[dt_hour] = num_comments
        
    else:
        
        counts_by_hour[dt_hour] += 1
        
        comments_by_hour[dt_hour] += num_comments
        

In [11]:
counts_by_hour

{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}

In [12]:
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

#### Calculate the average number of comments for posts created during each hour of the day

Create a list of lists containing the **hours during which posts were created** and the **average number of comments** those posts received.

In [13]:
avg_by_hour = []

for hour in counts_by_hour:
    
    # Both dictionaries share the same keys, so a single loop is enough
    average = comments_by_hour[hour]/counts_by_hour[hour]
    
    avg_by_hour.append([hour, average])

print(avg_by_hour)
            

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


## 5.0 Sorting

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. 

Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

Create a list that equals avg_by_hour with swapped columns.

In [14]:
swap_avg_by_hour = []

for row in avg_by_hour:
    
    swap_avg_by_hour.append([row[1],row[0]])
    
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


Use the `sorted()` function to sort `swap_avg_by_hour` in descending order. 

Since the first column of this list is the average number of comments, sorting the list will **sort by the average number of comments**.
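
Swapping the columns works because `sorted()` compares lists element by element, starting with the first. As an alternative sketch, the `key` parameter of `sorted()` can point at the average directly, leaving the `[hour, average]` order unchanged. The data in `sample_avg_by_hour` is made up for illustration.

```python
# Hypothetical [hour, average] pairs used only for illustration.
sample_avg_by_hour = [["09", 5.58], ["15", 38.59], ["02", 23.81]]

# Sort on the second element (the average) without swapping columns.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

print(sorted_by_avg)  # [['15', 38.59], ['02', 23.81], ['09', 5.58]]
```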

In [15]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

In [16]:
print("Top 5 Hours for 'Ask HN' Comments")
for average, hour in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post".format(dt.datetime.strptime(hour, "%H").strftime("%H:%M"),average))

Top 5 Hours for 'Ask HN' Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


## Conclusion

In this project, we analyzed ask posts and show posts to **determine which type of post, and which hour of creation, receives the most comments on average**.

Among the posts that received comments, ask posts received more comments on average than show posts, and **ask posts created between 15:00 and 16:00 received the most comments on average.**