# Guided Project: Exploring Hacker News Posts

__Introduction:__ 

Hacker News is an online community forum popular with technology and startup enthusiasts.
Users submit posts, including two special types titled 'Ask HN' and 'Show HN', on varied interests, and other members of the forum vote and comment on them.
This project explores and analyses the dataset resulting from that engagement to answer two questions:
1. Which type of post, 'Ask HN' or 'Show HN', receives more comments on average?
2. Do posts created at a certain time of day receive more comments on average?

## Exploring and opening the dataset

The full Hacker News dataset comprises almost 300,000 rows, but it has been reduced to about 20,000 rows for this project by removing entries without comments.
The data dictionary is as follows:

* `id`: The unique identifier from Hacker News for the post
* `title`: The title of the post
* `url`: The URL that the post links to, if the post has a URL
* `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: The number of comments that were made on the post
* `author`: The username of the person who submitted the post
* `created_at`: The date and time at which the post was submitted


The [data set](https://www.kaggle.com/hacker-news/hacker-news-posts/download) containing the approximately 20,000 rows of Hacker News data can be downloaded directly from this [link](https://www.kaggle.com/hacker-news/hacker-news-posts).
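The later cells address columns by position (`row[1]`, `row[4]`, `row[6]`). As a minimal sketch of how those positions relate to the data dictionary, the example below builds a name-to-index lookup from the header row. The in-memory `sample_csv` and the `col` dictionary are illustrative assumptions and are not part of the original notebook; the two data rows are copied from output printed elsewhere in this notebook.

```python
import csv
import io

# Hypothetical two-row sample in the same column layout as hacker_news.csv.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "12296411,Ask HN: How to improve my personal website?,,2,6,ahmedbaracat,8/16/2016 9:55\n"
)

rows = list(csv.reader(sample_csv))
header, data = rows[0], rows[1:]

# Map each column name to its position so later code can avoid magic numbers.
col = {name: index for index, name in enumerate(header)}

print(col["title"], col["num_comments"], col["created_at"])
print(data[1][col["title"]])
```

With this lookup, `row[col["num_comments"]]` reads the same value as `row[4]`, which can make longer analyses easier to follow.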




In [1]:
from csv import reader

# Open the file and read it into a list of lists;
# the `with` block closes the file once reading is done
with open('hacker_news.csv', encoding="utf-8") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

# Extract the first row into `header` and keep the remaining rows in `hn`
header = hn[0]
hn = hn[1:]

# Exploring the length of the list, header info, and first few rows.
print("Length of list: ")
print(len(hn))
print('\n')
print("Column headers:")
print(header)
print('\n')
print("First 5 rows of data:")
print(hn[:5])

Length of list: 
20100


Column headers:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


First 5 rows of data:
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_

## Extracting Ask HN and Show HN from Posts
Since the analysis compares the two post types, Ask HN and Show HN, to the exclusion of all other posts, the code below creates three lists to hold and count Ask HN posts, Show HN posts, and all remaining posts.

In [2]:
# creating three empty lists
ask_posts = []
show_posts = []
other_posts = []

# Iterating through rows of Hacker News 'hn' list to populate the empty lists
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))       

1744
1162
17194


The exercise above shows:
* Number of ask_posts = 1,744
* Number of show_posts = 1,162
* Other types of posts total = 17,194
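To put these counts in proportion, the sketch below converts them to percentages of all rows. The counts are the ones reported above; the `counts` and `shares` names are illustrative assumptions and do not appear in the original notebook.

```python
# Counts reported above for each category of post.
counts = {"ask_posts": 1744, "show_posts": 1162, "other_posts": 17194}
total = sum(counts.values())  # matches the 20100 rows read from the file

# Share of each category as a percentage of all posts.
shares = {name: 100 * n / total for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

Ask HN and Show HN posts together make up well under a fifth of the rows, so the comparison that follows is based on a fairly small part of the dataset.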

First 3 rows of both `ask_posts` and `show_posts` are shown below 

In [3]:
print("First 3 rows of ask_posts list of lists:")
print(ask_posts[:3])
print('\n')
print("First 3 rows of show_posts list of lists:")
print(show_posts[:3])

First 3 rows of ask_posts list of lists:
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']]


First 3 rows of show_posts list of lists:
[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']]


## Calculating Average Number of Comments for Ask HN and Show HN Posts


In [4]:
# calculating the average number of comments for the Ask HN Posts
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts) 
print(avg_ask_comments)


# calculating the average number of comments for the Show HN Posts
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts) 
print(avg_show_comments)

14.038417431192661
10.31669535283993


* On average, Ask HN posts receive more comments (about 14 per post) than Show HN posts (about 10 per post).
* One possible explanation lies in the nature of the Hacker News community: many members are seeking opportunities or solutions to all kinds of issues, while fewer contributors have solutions to show.
* Another possible reason for the lower average on Show HN posts is that undesirable comments may have been filtered out by administrators; the dataset alone cannot confirm this.
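The two averaging loops above follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that; the function name `average_comments` and its `comments_index` parameter are hypothetical, and the three sample rows are copied from the `ask_posts` output shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    return sum(int(row[comments_index]) for row in posts) / len(posts)


# Three Ask HN rows copied from the earlier output.
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]

print(average_comments(sample_ask))  # (6 + 29 + 1) / 3 = 12.0
```

Called on the full `ask_posts` and `show_posts` lists, the same helper should reproduce the two averages printed above.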

## Finding the Number of Ask HN Posts and Comments Created in Each Hour

Because Ask HN posts receive more comments on average than Show HN posts, the remainder of the analysis focuses on Ask HN posts and does the following:
1. Calculate the number of Ask HN posts created, and the comments they received, for each hour of the day.
2. Calculate the average number of comments per Ask HN post by the hour in which it was created.
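Both steps depend on turning a `created_at` string into an hour of the day. As a minimal sketch of that single conversion, the example below parses one value from the dataset with `datetime.strptime` and formats the hour with `strftime`. The variable names `parsed` and `hour_key` are illustrative only.

```python
import datetime as dt

# One `created_at` value taken from the Ask HN rows shown earlier.
created_at = "8/16/2016 9:55"

# Parse month/day/year and 24-hour time, then keep only the zero-padded hour.
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
hour_key = parsed.strftime("%H")

print(parsed)
print(hour_key)
```

The zero-padded string (`'09'` rather than `'9'`) is what serves as the dictionary key when posts are grouped by hour.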

In [5]:
#import the datetime module and give it the alias dt

import datetime as dt

# Build a list of [created_at, num_comments] pairs from the Ask HN posts

result_list = []
for row in ask_posts:
    time_created = row[6]
    comments_num = int(row[4])
    result_list.append([time_created, comments_num])
    
print("First 5 rows of result_list list of lists:")
print(result_list[:5])        
    

# creating two empty dictionaries

counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    date_format = "%m/%d/%Y %H:%M"
    date_1_str = row[0]
    date_1_dt = dt.datetime.strptime(date_1_str, date_format)
    row[0] = date_1_dt
    
    comments_num = row[1]
    hour_dt = row[0]
    hour_str = hour_dt.strftime("%H")
    if hour_str not in counts_by_hour:
        counts_by_hour[hour_str] = 1
        comments_by_hour[hour_str] = comments_num
    else:
        counts_by_hour[hour_str] += 1
        comments_by_hour[hour_str] += comments_num

print('\n')
print("Amount of ask posts created per hour:")        
print(counts_by_hour)
print('\n')
print("Amount of Comments per hour:")
print(comments_by_hour)     

First 5 rows of result_list list of lists:
[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]


Amount of ask posts created per hour:
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


Amount of Comments per hour:
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


## Calculating the Average Number of Comments for Ask HN Posts per hour

The `counts_by_hour` and `comments_by_hour` dictionaries above give the number of Ask HN posts created, and the comments they received, for each hour of the day.
They are now used to calculate the average number of comments per post for posts created during each hour of the day.
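As a quick check of the arithmetic, the sketch below divides comments by posts for three hours whose values are copied from the dictionaries printed above. The `counts_sample`, `comments_sample`, and `avg_sample` names are illustrative assumptions, not part of the original notebook.

```python
# Three entries copied from counts_by_hour and comments_by_hour above.
counts_sample = {"15": 116, "02": 58, "20": 80}
comments_sample = {"15": 4477, "02": 1381, "20": 1722}

# Average comments per post = total comments / number of posts, per hour.
avg_sample = {
    hour: comments_sample[hour] / counts_sample[hour]
    for hour in counts_sample
}

for hour, avg in avg_sample.items():
    print(f"{hour}:00 -> {avg:.4f}")
```

For example, the 15:00 hour has 4477 comments across 116 posts, or roughly 38.59 comments per post.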

In [6]:
# creating an empty list to hold [hour, average] pairs

avg_by_hour = []

for hour_str in counts_by_hour:
    avg_by_hour.append([hour_str, comments_by_hour[hour_str]/counts_by_hour[hour_str]])

print("Average of Comments posted for each hour:")
print(avg_by_hour)    

Average of Comments posted for each hour:
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


## Sorting and printing values from a List of Lists
The list of lists created above now needs to be sorted so that the results can be summarised in an easy-to-read format.
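The cells that follow sort by swapping the two columns first. An alternative, shown here only as a sketch and not as the notebook's method, is to pass a `key` function to `sorted()` so that the pairs can stay in `[hour, average]` order. The sample pairs are copied from `avg_by_hour` above; the `avg_sample` and `top` names are hypothetical.

```python
# A few [hour, average] pairs copied from avg_by_hour above.
avg_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort on the second element (the average), highest first.
top = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print(f"{hour}:00 {avg:.2f} average comments per post")
```

Either approach should give the same ranking; the `key` version simply avoids building a second list.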

In [7]:
# creating an empty list to hold the swapped [average, hour] pairs, ready for sorting

swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print("Average of Comments posted for each hour - Swapped:")
print(swap_avg_by_hour)    

Average of Comments posted for each hour - Swapped:
[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [8]:

#Using the sorted() function to sort list in descending order

sorted_swap = sorted(swap_avg_by_hour, reverse=True)


print("Top 5 Hours for Ask Posts Comments:")

sorted_swap = sorted_swap[:5]
for row in sorted_swap:
    each_avg = row[0]
       
    hour_format = "%H"
    hour_str = row[1]
    hour_obt = dt.datetime.strptime(hour_str, hour_format)
    row[1] = hour_obt
    
    hour_dt = row[1]
    hour_str = hour_dt.strftime("%H:%M")  
    template = "{hour:} {average:.2f} average comments per post".format(hour = hour_str, average = each_avg)
    print(template)


Top 5 Hours for Ask Posts Comments:
15:00 38.59 average comments per post
02:00 23.81 average comments per post
20:00 21.52 average comments per post
16:00 16.80 average comments per post
21:00 16.01 average comments per post


## Conclusion:
Ask HN posts created at 15:00 (Eastern Time in the US) receive the most comments on average, at about 38.59 comments per post.
Posts created at 02:00 come second, with about 23.81 comments per post.
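Readers outside the US Eastern time zone may want these hours in another zone. The sketch below converts an Eastern hour to UTC under a simplifying assumption: Eastern Time is treated as a fixed UTC-5 offset (EST), and daylight saving time is ignored. The `EASTERN` constant and the `eastern_hour_to_utc` function are hypothetical names introduced for illustration.

```python
import datetime as dt

# Assumption: Eastern Standard Time as a fixed UTC-5 offset; daylight saving is ignored.
EASTERN = dt.timezone(dt.timedelta(hours=-5), "EST")


def eastern_hour_to_utc(hour):
    """Convert an hour of day in (assumed) EST to the matching UTC time string."""
    # The date is arbitrary; only the hour matters for this conversion.
    eastern_time = dt.datetime(2016, 1, 1, hour, 0, tzinfo=EASTERN)
    return eastern_time.astimezone(dt.timezone.utc).strftime("%H:%M")


for hour in (15, 2):
    print(f"{hour:02d}:00 EST is {eastern_hour_to_utc(hour)} UTC")
```

During daylight saving time the offset would be UTC-4 instead, so the converted hours shift by one.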