## Guided Project 2 - Exploring Hacker News Posts

### Brief Introduction:

[Hacker News](https://news.ycombinator.com/) is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/) where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit.

Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

### Project Goal:

In this dataset, we're specifically interested in posts with titles that begin with either "Ask HN" or "Show HN". Users submit `Ask HN` posts to ask the Hacker News community a specific question, and `Show HN` posts to show the community a project, a product, or just something interesting.

In this analysis project, we compare these two types of posts to answer the following questions:

**1)** Do `Ask HN` or `Show HN` posts receive more comments on average?


**2)** Do posts created at a certain time receive more comments on average?

#### A Brief Overview of Our Data:
Description of the columns in the dataset


| Column name    | Description                                                                                                           |
| -------------- | --------------------------------------------------------------------------------------------------------------------- |
| "id"           | The unique identifier from Hacker News for the post                                                                   |
| "title"        | The title of the post                                                                                                 |
| "url"          | The URL that the post links to, if the post has a URL                                                                 |
| "num_points"   | The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes |
| "num_comments" | The number of comments on the post                                                                                    |
| "author"       | The username of the person who submitted the post                                                                     |
| "created_at"   | The date and time of the post's submission                                                                            |


----

### Opening and Exploring our Dataset


In [19]:
# Importing the data
from csv import reader

# A context manager closes the file once all rows have been read
with open("hacker_news.csv", encoding="utf-8") as open_file:
    hn = list(reader(open_file))


In [20]:
# Briefly Displaying a sample of the data
for x in range(6):
    print('\n')
    print(hn[x])
    
print('\n', len(hn))



['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']

 20101


In [21]:
# Extracting and removing the header row from the dataset. ONLY RUN ONCE!
headers = hn[0]
print(headers)

hn = hn[1:]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
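
The "ONLY RUN ONCE" warning exists because re-running the cell would strip the first data row as if it were a header. One way to make the step safe to repeat is to check for the header before removing it. The sketch below assumes the header can be recognised by its first column name; `split_header` and `expected_first_column` are hypothetical names used only for illustration, not part of the original notebook.

```python
def split_header(rows, expected_first_column="id"):
    """Return (header, data_rows); header is None when no header row is found.

    Hypothetical helper: it assumes the header's first cell is "id".
    """
    if rows and rows[0] and rows[0][0] == expected_first_column:
        return rows[0], rows[1:]
    return None, rows


# Two rows copied from the preview above: the header and one post
sample = [
    ["id", "title", "url", "num_points", "num_comments", "author", "created_at"],
    ["12224879", "Interactive Dynamic Video",
     "http://www.interactivedynamicvideo.com/", "386", "52", "ne0phyte",
     "8/4/2016 11:52"],
]

header, data = split_header(sample)

# Calling it again on the already-stripped data leaves the rows untouched
header_again, data_again = split_header(data)
```

Because the second call finds no header, it returns the data unchanged instead of dropping a post.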


---

### Data Preparation: Filtering 

To begin our data analysis, we must separate the posts beginning with 'Ask HN' from the ones beginning with 'Show HN'.

In [26]:
# Initializing the variables
ask_posts = []
show_posts = []
other_posts = []

# Looping through the dataset and applying the string filter using the .startswith() method
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
        
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
        
    else:
        other_posts.append(row)

In [27]:
# Checking the number of posts in each list
print(f"There are {len(ask_posts)} Ask HN posts")
print(f"There are {len(show_posts)} Show HN posts")
print(f"There are {len(other_posts)} other types of posts")

There are 1744 Ask HN posts
There are 1162 Show HN posts
There are 17194 other types of posts
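
The filtering rule can also be written as a small function, which makes it easy to check on individual titles. This is only a sketch of the same case-insensitive prefix test; `classify_title` is a hypothetical name, and the two "HN" titles below are invented for illustration (only "Interactive Dynamic Video" comes from the dataset preview).

```python
def classify_title(title):
    """Label a title as 'ask', 'show' or 'other' using a case-insensitive prefix test."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Illustrative titles; the first two are made up for this example
example_titles = [
    "Ask HN: How do you organise your notes?",
    "SHOW HN: A tiny static site generator",
    "Interactive Dynamic Video",
]

labels = [classify_title(title) for title in example_titles]
print(labels)
```

Note that a plain prefix test also matches any title that merely starts with the same characters, so it is a reasonable first pass rather than a strict parser.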


### Preliminary Data Analysis

Finding the average number of comments for `Ask HN` & `Show HN` posts

In [31]:
# Ask HN posts comment count
total_ask_comments = 0
for row in ask_posts:
    n_comments = int(row[4])
    total_ask_comments += n_comments
    
# Finding the average number of comments per Ask HN post
avg_ask_comments =  total_ask_comments /  len(ask_posts)
print(f"The average number of comments per Ask HN post is {round(avg_ask_comments)}")

The average number of comments per Ask HN post is 14


In [32]:
# Show HN posts comment count
total_show_comments = 0
for row in show_posts:
    n_comments = int(row[4])
    total_show_comments += n_comments
        
# Finding the average number of comments per Show HN post
avg_show_comments =  total_show_comments /  len(show_posts)
print(f"The average number of comments per Show HN post is {round(avg_show_comments)}")

The average number of comments per Show HN post is 10


**Take Away**

These results show that `Ask HN` posts receive more comments on average than `Show HN` posts (about 14 versus 10). Because ask posts are more likely to receive comments, the rest of our analysis focuses on just these posts.
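
The two averaging cells repeat the same loop, so the calculation could be factored into one helper. The following is a minimal sketch, assuming the comment count sits at index 4 as in this dataset; `average_comments` and `comments_index` are hypothetical names. The sample rows are the first three posts from the dataset preview.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# First three posts from the dataset preview (52, 10 and 1 comments)
sample_rows = [
    ["12224879", "Interactive Dynamic Video",
     "http://www.interactivedynamicvideo.com/", "386", "52", "ne0phyte",
     "8/4/2016 11:52"],
    ["10975351", "How to Use Open Source and Shut the Fuck Up at the Same Time",
     "http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/",
     "39", "10", "josep2", "1/26/2016 19:30"],
    ["11964716", "Florida DJs May Face Felony for April Fools' Water Joke",
     "http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/",
     "2", "1", "vezycash", "6/23/2016 22:20"],
]

sample_average = average_comments(sample_rows)
print(sample_average)  # (52 + 10 + 1) / 3 = 21.0
```

With the full lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two separate loops.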

-----

### Data Analysis

Now, we want to determine if `Ask HN` posts created at a certain time are more likely to attract comments. We can do so using the following steps:

1) **Step 1** - Calculate the number of ask posts created in each hour of the day, along with the number of comments received.


2) **Step 2** - Calculate the average number of comments ask posts receive by hour created

In [37]:
# STEP 1
# importing the datetime module
import datetime as dt

# Creating a list of lists pairing each post's creation time with the number of comments it received
results_list = []
for row in ask_posts:
    n_comments = int(row[4])
    date_string = row[-1]
    results_list.append([date_string, n_comments])

# A Preview of data stored in results list
results_list[:6]

[['8/16/2016 9:55', 6],
 ['11/22/2015 13:43', 29],
 ['5/2/2016 10:14', 1],
 ['8/2/2016 14:20', 3],
 ['10/15/2015 16:38', 17],
 ['9/26/2015 23:23', 1]]
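
The timestamps in this preview follow a month/day/year hour:minute layout, which corresponds to the format string `"%m/%d/%Y %H:%M"`. As a quick illustration, here is how a single value from the preview can be parsed and reduced to its hour; the variable names are chosen for this example only.

```python
import datetime as dt

# First timestamp from the preview above
created_at = "8/16/2016 9:55"

# strptime accepts month, day and hour values without a leading zero
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

hour_key = parsed.strftime("%H")  # zero-padded string, handy as a sortable dict key
hour_number = parsed.hour         # the same hour as an integer

print(hour_key, hour_number)  # 09 9
```

Zero-padded string keys such as `"09"` sort in chronological order, which is why sorting the dictionaries by key gives the hours from `'00'` to `'23'`.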

In [60]:
# Time parsing
counts_by_hour = {}
comments_by_hour = {}

for row in results_list:
    dt_string = row[0]
    n_comments = row[1]
    dt_object = dt.datetime.strptime(dt_string, "%m/%d/%Y %H:%M")
    hour = dt_object.strftime("%H")
    
    # Updating the "number of posts by hour" frequency table
    counts_by_hour[hour] = counts_by_hour.get(hour, 0) + 1
    
    # Updating the "number of comments by hour" frequency table
    comments_by_hour[hour] = comments_by_hour.get(hour, 0) + n_comments

# Sorting the dictionaries 
counts_by_hour = dict(sorted(counts_by_hour.items()))
comments_by_hour = dict(sorted(comments_by_hour.items()))

# Previewing the counts_by_hour data
counts_by_hour

{'00': 55,
 '01': 60,
 '02': 58,
 '03': 54,
 '04': 47,
 '05': 46,
 '06': 44,
 '07': 34,
 '08': 48,
 '09': 45,
 '10': 59,
 '11': 58,
 '12': 73,
 '13': 85,
 '14': 107,
 '15': 116,
 '16': 108,
 '17': 100,
 '18': 109,
 '19': 110,
 '20': 80,
 '21': 109,
 '22': 71,
 '23': 68}

In [63]:
# Previewing the comments_by_hour data
comments_by_hour


{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}

In [71]:
# STEP 2 - Calculating the average number of comments received per post by the hour

avg_by_hour = []
for hour in counts_by_hour:
    avg_n_comments = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, round(avg_n_comments)])
    
avg_by_hour = dict(avg_by_hour)

In [74]:
# Sorting this dictionary according to the values so as to immediately see the hour(s) which attract the highest amount of comments per post
avg_by_hour = dict(sorted(avg_by_hour.items(), key = lambda data: data[1], reverse = True))
avg_by_hour

{'15': 39,
 '02': 24,
 '20': 22,
 '16': 17,
 '21': 16,
 '13': 15,
 '10': 13,
 '14': 13,
 '18': 13,
 '01': 11,
 '11': 11,
 '17': 11,
 '19': 11,
 '05': 10,
 '08': 10,
 '06': 9,
 '12': 9,
 '00': 8,
 '03': 8,
 '07': 8,
 '23': 8,
 '04': 7,
 '22': 7,
 '09': 6}
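
Because the averages are rounded before sorting, several hours end up tied (for example, three hours show 13). A possible variation, shown here only as a sketch, is to sort the unrounded averages and round just for display. The counts and comment totals below are copied from the two frequency tables above for the five leading hours; the variable names are illustrative.

```python
# Posts and comments per hour for five hours, taken from the tables above
counts = {"15": 116, "02": 58, "20": 80, "16": 108, "21": 109}
comments = {"15": 4477, "02": 1381, "20": 1722, "16": 1814, "21": 1745}

# Unrounded average comments per post for each hour
averages = {hour: comments[hour] / counts[hour] for hour in counts}

# Sort (hour, average) pairs by the average, highest first
top_hours = sorted(averages.items(), key=lambda item: item[1], reverse=True)

for hour, average in top_hours:
    print(f"{hour}: {average:.2f}")
```

The ordering of these five hours matches the rounded table, with hour `15` averaging roughly 38.59 comments per post.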

In [77]:
# Displaying the results in a more readable format
for hour in avg_by_hour:
    n_comments = int(avg_by_hour[hour])
    
    dt_hour = dt.datetime.strptime(hour, "%H")
    hour_string = dt_hour.strftime("%I:00%p")
    print("At {} there were {} comments per post".format(hour_string, n_comments))

At 03:00PM there were 39 comments per post
At 02:00AM there were 24 comments per post
At 08:00PM there were 22 comments per post
At 04:00PM there were 17 comments per post
At 09:00PM there were 16 comments per post
At 01:00PM there were 15 comments per post
At 10:00AM there were 13 comments per post
At 02:00PM there were 13 comments per post
At 06:00PM there were 13 comments per post
At 01:00AM there were 11 comments per post
At 11:00AM there were 11 comments per post
At 05:00PM there were 11 comments per post
At 07:00PM there were 11 comments per post
At 05:00AM there were 10 comments per post
At 08:00AM there were 10 comments per post
At 06:00AM there were 9 comments per post
At 12:00PM there were 9 comments per post
At 12:00AM there were 8 comments per post
At 03:00AM there were 8 comments per post
At 07:00AM there were 8 comments per post
At 11:00PM there were 8 comments per post
At 04:00AM there were 7 comments per post
At 10:00PM there were 7 comments per post
At 09:00AM there were 6 comments per post

### Analysis Take Away

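`Ask HN` posts created during the 3:00 PM hour receive the highest average number of comments per post (about 39), so that hour appears to be the best time to submit an `Ask HN` post on Hacker News. Posts created during the 2:00 AM and 8:00 PM hours follow, with about 24 and 22 comments per post respectively.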