# Exploring Hacker News Posts

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented on, similar to Reddit. It is extremely popular in technology and startup circles, and posts that reach the top of its listings can attract hundreds of thousands of visitors.

**Our goal is to compute the average number of comments a post receives, broken down by the hour of the day in which it was created. Several posts can be published during the same hour, so the average for an hour is the total number of comments on posts created in that hour divided by the number of posts created in that hour.**
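
As a quick illustration of that definition, here is a minimal sketch with made-up numbers. The three comment counts are hypothetical and are not taken from the dataset.

```python
# Hypothetical comment counts for three posts created during the same hour.
comments_in_hour = [4, 10, 1]

total_comments = sum(comments_in_hour)  # 15 comments in total
total_posts = len(comments_in_hour)     # 3 posts

# Average comments per post for that hour.
average_comments = total_comments / total_posts
print(average_comments)  # 5.0
```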

The dataset for this analysis was downloaded from an online source and contains around 300k rows. The columns are described below:

- `id`: The unique identifier from Hacker News for the post
- `title`: The title of the post
- `url`: The URL that the post links to, if the post has a URL
- `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: The number of comments that were made on the post
- `author`: The username of the person who submitted the post
- `created_at`: The date and time at which the post was submitted
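
Because the analysis reads each row as a plain list, columns are addressed by position. The sketch below shows how the column names listed above map to those positions, using one sample row from the dataset. The names `column_index` and `sample_row` are illustrative only and are not used elsewhere in this notebook.

```python
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position in a row.
column_index = {name: position for position, name in enumerate(header)}

sample_row = ['12579005', 'SQLAR  the SQLite Archiver',
              'https://www.sqlite.org/sqlar/doc/trunk/README.md',
              '1', '0', 'blacksqr', '9/26/2016 3:24']

# csv.reader yields every field as a string, so numeric columns need int().
num_comments = int(sample_row[column_index['num_comments']])
created_at = sample_row[column_index['created_at']]
print(column_index['num_comments'], column_index['created_at'])  # 4 6
```

This is why the cells below use `row[1]` for the title, `row[4]` for the comment count, and `row[6]` for the creation time.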

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question.
Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

### Opening and Reading the Dataset `HN_posts_year_to_Sep_26_2016` 

In [1]:
from csv import reader

# Open the file in a `with` block so it is closed once the rows are read
with open('HN_posts_year_to_Sep_26_2016.csv', encoding="utf8") as opened_file:
    read_file = list(reader(opened_file))

header = read_file[0]  # column names
hn = read_file[1:]     # data rows
print(header)
for row in hn[:3]:
    print('\n')
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


### Creating new lists of lists containing just the data for `Ask HN` or `Show HN` titles

In [2]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Number of ask posts: {:,}".format(len(ask_posts)))
print("Number of show posts: {:,}".format(len(show_posts)))
print("Number of other posts: {:,}".format(len(other_posts)))


Number of ask posts: 9,139
Number of show posts: 10,158
Number of other posts: 273,822


### Do ask posts or show posts receive more comments on average?

In [3]:
def avg_comments(post_type):
    length = 0
    comments_no = 0
    for row in post_type:
        comments_no += int(row[4])
        length += 1
    
    average = comments_no/length
    return average

print("Average number of comments on ask posts:", avg_comments(ask_posts))
print("Average number of comments on show posts:", avg_comments(show_posts))
print("Average number of comments on other posts:", avg_comments(other_posts))


Average number of comments on ask posts: 10.393478498741656
Average number of comments on show posts: 4.886099625910612
Average number of comments on other posts: 6.4572678601427205

On average, ask posts receive about 10.4 comments, roughly twice as many as show posts (about 4.9), so the rest of the analysis focuses on ask posts.


### Are ask posts created at a certain time more likely to attract comments?

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.
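
Both steps depend on extracting the hour from `created_at`. The sketch below shows that extraction in isolation, assuming the timestamp follows the month/day/year hour:minute layout seen in the sample rows. The name `hour_label` is illustrative.

```python
import datetime as dt

# Timestamp layout assumed from the sample rows: month/day/year hour:minute.
created_at = '9/26/2016 3:26'

# Parse the string into a datetime object, then format only the hour.
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
hour_label = parsed.strftime("%H")  # zero-padded two-digit hour
print(hour_label)  # 03
```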

In [4]:
import datetime as dt

# Collect [created_at, num_comments] for every ask post
result_list = []

for row in ask_posts:
    date_time = row[6]
    comments_no = int(row[4])
    result_list.append([date_time, comments_no])

counts_by_hour = {}    # number of ask posts created in each hour
comments_by_hour = {}  # total comments on the posts created in each hour

for row in result_list:
    date_time = row[0]
    comments_no = row[1]
    # strptime parses the timestamp string into a dt.datetime object
    hour_obj = dt.datetime.strptime(date_time, "%m/%d/%Y %H:%M")
    # strftime formats that object as a zero-padded hour string, e.g. '03'
    hour_string = hour_obj.strftime("%H")

    if hour_string in counts_by_hour:
        counts_by_hour[hour_string] += 1
        comments_by_hour[hour_string] += comments_no
    else:
        counts_by_hour[hour_string] = 1
        comments_by_hour[hour_string] = comments_no

print(counts_by_hour)
print(comments_by_hour)

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
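
The two tallies above can also be accumulated with `collections.defaultdict`, which removes the explicit membership checks. This is a sketch of an alternative, not the code that produced the results above. `sample_posts` is a small hypothetical list standing in for `result_list`, and `sample_counts` and `sample_comments` are illustrative names.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs standing in for result_list.
sample_posts = [
    ['9/26/2016 3:26', 2],
    ['9/25/2016 15:05', 7],
    ['9/24/2016 15:40', 3],
]

sample_counts = defaultdict(int)    # posts created in each hour
sample_comments = defaultdict(int)  # comments received by those posts

for created_at, comments_no in sample_posts:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    # Missing keys start at 0, so no "if hour in ..." check is needed.
    sample_counts[hour] += 1
    sample_comments[hour] += comments_no

print(dict(sample_counts))    # {'03': 1, '15': 2}
print(dict(sample_comments))  # {'03': 2, '15': 10}
```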


### Calculating the average number of comments for posts created during each hour of the day.

Let's create a list of lists containing the hours during which posts were created and the average number of comments those posts received.

In [5]:
avg_by_hour = []
for hour in counts_by_hour:
    # [hour, total comments / number of posts] for each hour of the day
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

avg_by_hour

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]

### Sorting the list of lists and printing the five highest values in a format that's easier to read

In [6]:
# Swap the columns so that sorted() orders the rows by average first
swap_avg_by_hour = []

for row in avg_by_hour:
    hour = row[0]
    average = row[1]
    swap_avg_by_hour.append([average, hour])

sorted_avg_by_hour = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")

for row in sorted_avg_by_hour[:5]:
    average = row[0]
    hour = row[1]
    # Parse the two-digit hour and render it as HH:MM
    hour_obj = dt.datetime.strptime(hour, "%H")
    hour_str = hour_obj.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_str, average))

Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
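
Swapping the columns is one way to sort by the average. Another option, sketched here, is to pass a `key` function to `sorted()` so the `[hour, average]` pairs can be sorted directly without building a second list. The three pairs below are copied from the `avg_by_hour` output above. `sample_avg_by_hour` and `top_hours` are illustrative names.

```python
# A few [hour, average] pairs taken from the avg_by_hour output.
sample_avg_by_hour = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
]

# Sort on the second element (the average), highest first.
top_hours = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, average in top_hours:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```

With a `key` function the rows keep their original `[hour, average]` layout, which avoids having to remember which column is which after the swap.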
