## Project 2: 

### Data set: [Hacker News Posts](https://www.kaggle.com/hacker-news/hacker-news-posts)
Hacker News is a site where users submit stories (known as "posts") and vote or comment on other people's stories.

This particular data set is Hacker News posts from September 2015 - September 2016. It includes the following columns:

* `id`: The unique identifier from Hacker News for the post
* `title`: The title of the post
* `url`: The URL that the post links to, if the post has a URL
* `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments` : The number of comments that were made on the post
* `author`: The username of the person who submitted the post
* `created_at`: The date and time at which the post was submitted

### Ask the following questions:
* Do posts that ask specific questions (i.e., titles beginning with `Ask HN`) or posts that share something (i.e., titles beginning with `Show HN`) receive more comments on average?
* Do posts created at a certain time receive more comments on average? 

In [2]:
# Read in the data set
from csv import reader

with open("hacker_news.csv") as open_hn: # the file is closed automatically once it has been read
    read_hn = reader(open_hn)
    hn = list(read_hn)

print(hn[0:5]) # display the header row and the first 4 data rows

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [3]:
# data clean-up

# extract the first row of data, and remove it from the data set
headers = hn[0]
hn = hn[1:]

print(headers)
print(hn[0:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
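As an alternative to slicing off the header row by hand, the standard library's `csv.DictReader` reads the header itself and returns each row keyed by column name. The sketch below is only an illustration: it parses a small in-memory sample (two rows copied from the output above) instead of `hacker_news.csv`, and the names `sample_csv` and `rows` are assumptions that do not appear elsewhere in this notebook.

```python
import csv
import io

# Two data rows copied from the output above, plus the header line.
sample_csv = """id,title,url,num_points,num_comments,author,created_at
12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,3,1,hswarna,6/17/2016 0:01
"""

# DictReader consumes the first line as the header, so no manual slicing is needed.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

for row in rows:
    # Columns are accessed by name instead of by position (row[1], row[4], ...).
    print(row["title"], int(row["num_comments"]))
```

The rest of this notebook keeps the list-of-lists layout, so positional indexes such as `row[4]` are used below.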


In [4]:
# filter our data

## we are only concerned with post titles beginning with 'Ask HN' or 'Show HN', so create new lists of lists containing just the data for those titles.

# create 3 empty lists:
ask_posts = []
show_posts = []
other_posts = []

# Loop through each row in the data set.
for row in hn:
    title = row[1] 
    
    if title.lower().startswith('ask hn'):  #if the lowercase version of 'title' starts with 'ask hn', append the row to 'ask_posts'
        ask_posts.append(row)
    elif title.lower().startswith('show hn'): #do the same for 'show_posts'
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


In [6]:
## Take a look at the first 5 rows in ask posts

print(ask_posts[0:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


In [7]:
## Take a look at the first 5 rows in show_posts

print(show_posts[0:5])

[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]


### Determine if `ask` posts or `show` posts receive more comments on average.

In [9]:
# find the average number of comments in ask posts
total_ask_comments = []

for row in ask_posts:
    ask_comments = row[4] # 'num_comments' is the fifth column.
    ask_comments = int(ask_comments) # convert it to integer.
    total_ask_comments.append(ask_comments) # add this value to 'total_ask_comments'

avg_ask_comments = sum(total_ask_comments)/len(total_ask_comments) # compute the average number of comments on ask posts
print(f"the average number of comments on ASK posts is: {avg_ask_comments}") 

the average number of comments on ASK posts is: 14.038417431192661


In [10]:
# find the average number of comments in show posts
total_show_comments = []

for row in show_posts:
    show_comments = row[4]
    show_comments = int(show_comments)
    total_show_comments.append(show_comments)

avg_show_comments = sum(total_show_comments) / len(total_show_comments)
print(f"the average number of comments on SHOW posts is: {avg_show_comments}")

the average number of comments on SHOW posts is: 10.31669535283993


### Result: The average number of comments for ask posts is 14.04; the average number of comments for show posts is 10.32. Thus, `ask` posts receive more comments on average than `show` posts.
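The two averaging cells above repeat the same loop, so they could be folded into one helper. The following is a minimal sketch, not the code that produced the numbers above: `average_comments` and `sample_ask` are hypothetical names, and the sample contains only three rows copied from the `ask_posts` preview.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for a list of rows.

    `comments_index` defaults to 4 because 'num_comments' is the fifth column.
    """
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Three rows copied from the ask_posts preview earlier in the notebook.
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]

print(average_comments(sample_ask))  # (6 + 29 + 1) / 3
```

With the full lists, the same helper would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.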

### Since `ask` posts receive more comments on average, we'll focus the remaining analysis on these posts.

### Next, determine if ask posts created at a certain *time* are more likely to attract comments.

In [11]:
## Calculate the number of ask posts created per hour, along with the total number of comments.

import datetime as dt

result_list = [] # list of lists

for row in ask_posts: 
    result_list.append([
        row[6], int(row[4]) # the first element is 'created_at'; the second element is the number of comments of the post
    ])

result_list[0:5]

[['8/16/2016 9:55', 6],
 ['11/22/2015 13:43', 29],
 ['5/2/2016 10:14', 1],
 ['8/2/2016 14:20', 3],
 ['10/15/2015 16:38', 17]]

In [12]:
# create two empty dictionaries
counts_by_hour = {} # number of ask posts created during each hour of the day.
comments_by_hour = {} # total number of comments received by ask posts created during each hour.

for row in result_list: # loop through each row of 'result_list'
    date_time = row[0]
    comment = row[1]
    hour = dt.datetime.strptime(date_time,"%m/%d/%Y %H:%M").strftime("%H") # parse the date and keep the hour as a two-digit string
    
    if hour not in counts_by_hour: # if the hour isn't a key in 'counts_by_hour' yet
        counts_by_hour[hour] = 1 # create the key in 'counts_by_hour' and set it equal to 1
        comments_by_hour[hour] = comment # create the key in 'comments_by_hour' and set it equal to the number of comments.
        
    else: # otherwise, update the running totals for that hour
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment

comments_by_hour # number of comments by the hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

In [13]:
counts_by_hour

{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}
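The `if hour not in ...` branch above can be avoided with `collections.defaultdict`, which creates a missing key with a default of `0` on first access. This is a sketch under stated assumptions: the first five rows of `sample` are copied from the `result_list` preview, the sixth row is made up purely to show two posts landing in the same hour, and the names `sample`, `counts`, and `comments` are hypothetical.

```python
from collections import defaultdict
import datetime as dt

sample = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
    ['1/1/2016 9:05', 4],  # made-up row (not from the data set) to show accumulation
]

counts = defaultdict(int)    # posts per hour; missing hours start at 0
comments = defaultdict(int)  # total comments per hour; missing hours start at 0

for created_at, n_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += n_comments

print(dict(counts))
print(dict(comments))
```

The resulting dictionaries have the same shape as `counts_by_hour` and `comments_by_hour`, so the averaging step that follows would work unchanged.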

In [14]:
## Use the two dictionaries above to calculate the average number of comments per post for posts created during each hour of the day.

avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, (comments_by_hour[hour] / counts_by_hour[hour])])

avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

In [15]:
# Sort the list of lists and print the five highest values (in a format that's easier to read)

# create a list that equals `avg_by_hour` with swapped columns.
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [16]:
## sort `swap_avg_by_hour` in descending order. 
# sorted() compares the rows element by element, so the first column (the average) determines the order. 

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
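Swapping the columns is one way to sort by the average. Another option, sketched here, is to pass a `key` function to `sorted()` so the original `[hour, average]` layout is kept. The sample uses four rows copied from `avg_by_hour`; `avg_by_hour_sample` and `top_hours` are hypothetical names.

```python
# Four rows copied from avg_by_hour, in their original [hour, average] layout.
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# key=... tells sorted() to compare rows by their second element (the average).
top_hours = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)

for hour, avg in top_hours:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

Both approaches give the same ordering for this data; the `key` version simply avoids building a second list.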

In [17]:
# print the 5 hours with the highest average number of comments per post. 

print("Top 5 Hours for Ask Posts Comments")

import datetime as dt

for avg, hr in sorted_swap[0:5]:
    hour = dt.datetime.strptime(hr, "%H").strftime("%H:%M") # `strptime()` parses the hour string into a datetime object; `strftime()` formats it as HH:MM.
    output = "{hour}: {average:.2f} average comments per post".format(hour=hour, average=avg) # `str.format()` fills in the placeholders; `:.2f` rounds the average to two decimal places. 
    print(output)

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


### Results:
An `ask` post receives the most comments on average when it is created during the 15:00 hour (38.59 comments per post), followed by the 02:00 hour (23.81 comments per post) and the 20:00 hour (21.52 comments per post). These times are in US Eastern Time (according to the data set [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts)). Thus, to attract as many comments as possible, the recommended time to create an ask post is 3pm-4pm EST (12pm-1pm PST).
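The Eastern-to-Pacific conversion can be checked with a few lines of `datetime`. This is a rough sketch that assumes fixed standard-time offsets (UTC-5 for EST, UTC-8 for PST) and ignores daylight saving time; the constants `EASTERN` and `PACIFIC`, the function `eastern_hour_to_pacific`, and the placeholder date are all assumptions made for illustration.

```python
import datetime as dt

# Assumed fixed offsets; daylight saving time is deliberately ignored.
EASTERN = dt.timezone(dt.timedelta(hours=-5), "EST")
PACIFIC = dt.timezone(dt.timedelta(hours=-8), "PST")


def eastern_hour_to_pacific(hour_str):
    """Convert an hour string such as '15' from Eastern to Pacific, as HH:MM."""
    # The date is an arbitrary placeholder; only the hour matters here.
    eastern = dt.datetime(2016, 1, 1, int(hour_str), tzinfo=EASTERN)
    return eastern.astimezone(PACIFIC).strftime("%H:%M")


for hr in ["15", "02", "20"]:
    print(f"{hr}:00 Eastern -> {eastern_hour_to_pacific(hr)} Pacific")
```

Note that 02:00 Eastern falls on the previous calendar day in Pacific time.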

## Conclusion
In this project, I analyzed the Hacker News data set to examine whether ask posts (which ask a specific question) receive more comments on average than show posts (which share something in particular). The analysis indicated that, of the posts that received comments (i.e., excluding the posts that received no comment), `ask` posts received more comments on average than `show` posts. Further, `ask` posts created between 3pm and 4pm EST received the most comments per post compared to ask posts created at other times. 