## Mission 2: Exploring Hacker News Posts

Hacker News Data used for this mission is located at https://www.kaggle.com/hacker-news/hacker-news-posts.

* Goal of the project:

    1. Analyze the Hacker News Data.
    2. Demonstrate the use of strings.
    2. Demonstrate Object-oriented programming.
    3. Demonstrate date and time formatting.


* Requirements for the analysis:

    1. Count the posts beginning with "Ask HN" and "Show HN".
    2. Determine which of the posts have a higher count.
    3. Determine the highest number of posts for "Ash HN".
    4. Show the average result for the one with the highest posts by the hour.


 
* Column descriptions:


    id: The unique identifier from Hacker News for the post

    title: The title of the post

    url: The URL that the posts links to, if it the post has a URL
    
    num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
    
    num_comments: The number of comments that were made on the post
    
    author: The username of the person who submitted the post
    
    created_at: The date and time at which the post was submitted
    


## First task is to print a sample set of the csv file.

In [1]:
#open csv file
opened_file = open('hacker_news.csv')
from csv import reader
read_file = reader(opened_file)
hn = list(read_file)

#check sample data limited to 5 rows
for row in hn[:5]:
    print(row)


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']



## The second task is to remove header from the list.

Below is a function that accepts 2 parameters to read a csv file and a boolean value for the header. The goal is to print the list without the header and to print the first 5 rows. 


In [2]:
#method to read a csv file and a header switch
def open_dataset(file_name, header=False):
    opened_file = open(file_name)
    from csv import reader
    read_file = reader(opened_file)
    data = list(read_file)
    
    if header:
        apps_data = data[0]
        return apps_data
    
    apps_data = data[1:]
    return apps_data

#execute method wit the csv file and don't print header
hn = open_dataset('hacker_news.csv', False)

#print sample date of the result set
for row in hn[:5]:
    print(row)
    

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


## Task is to extract "Ask HN" and "Show HN Posts" from the list generated without the header in the previous task

The goal is to create 3 lists to display posts filtered by "Ask HN", "Show HN" and other posts. The "hn" list will be used for this task.

In [3]:
#initialize arrays
asks_posts = []
show_posts = []
other_posts = []

#loop through the hn list
for row in hn:
    title = row[1]
    title = title.lower()
    
    if title.startswith("ask hn"):
        asks_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)     
    else:
        other_posts.append(row)

#print length for each list
print(len(asks_posts))
print(len(show_posts))
print(len(other_posts))


1744
1162
17194


## The task is to calculate the average number of comments for the "Ask HN" and "Show HN" posts created in the previous task.

The average counts will be displayed after going through the function that performs the iteration through the lists.

In [4]:
#method to get average for the comment list
def get_average_count(comment_list = []):
    total_comment = 0
    avg_comment = 0
    
    for row in comment_list:
        comment_count = int(row[4])
        total_comment = total_comment + comment_count
    
    avg_comment = total_comment / len(comment_list)

    return round(avg_comment,2)

#get average count for "Ask HN"
avg_ask_comments = get_average_count(asks_posts)
print("Average Ask Comments")
print(avg_ask_comments)
print("\n")

#get average count for "Show HN"
avg_show_comments = get_average_count(show_posts)
print("Average Show Comments")
print(avg_show_comments)


        

Average Ask Comments
14.04


Average Show Comments
10.32


#### The average results above shows that the average for "Ask HN" comments is higher than "Show HN".

## The following task is to check if the "Ask HN" attract comments at certain times.

In [5]:
#import datetime
from datetime import datetime
import datetime as df

results_list = []
comment_count = 0
temp_list = []

#loop through the "Ask HN" post list
#extract and save to a new list the create time, and comment count
for row in asks_posts:
    created_at = row[6]
    comment_count = (row[3])
    results_list.append([created_at, int(comment_count)])

counts_by_hour={}
comments_by_hour={}

#print sample data of the results_list
print("Sample Data from the results list")
for row in results_list[:5]:
    print(row)

#initialize date and time formats
date_format = "%m/%d/%Y %H:%M"
time_format = "%H"
#count comments by the hour
for row in results_list:
    comment_dt = row[0]
    comment_dt = datetime.strptime(comment_dt, date_format)
    comment_hr = datetime.strftime(comment_dt, time_format)

    if comment_hr not in counts_by_hour:
        counts_by_hour[comment_hr] = 1
        comments_by_hour[comment_hr] = row[1]
    else:
        counts_by_hour[comment_hr] += 1
        comments_by_hour[comment_hr] += row[1]

#display counts by the hour
print("\n")
print("Count by hour dictionary")
for key, value in counts_by_hour.items():
    print(key, ' : ', value)


#display comments posted by the hour
print("\n")
print("Comments by hour dictionary")
for key, value in comments_by_hour.items():
    print(key, ' : ', value)


Sample Data from the results list
['8/16/2016 9:55', 2]
['11/22/2015 13:43', 28]
['5/2/2016 10:14', 1]
['8/2/2016 14:20', 1]
['10/15/2015 16:38', 28]


Count by hour dictionary
08  :  48
13  :  85
03  :  54
02  :  58
16  :  108
21  :  109
01  :  60
14  :  107
11  :  58
05  :  46
22  :  71
12  :  73
07  :  34
15  :  116
20  :  80
06  :  44
00  :  55
19  :  110
17  :  100
04  :  47
18  :  109
10  :  59
09  :  45
23  :  68


Comments by hour dictionary
08  :  515
13  :  2062
03  :  374
02  :  793
16  :  2522
21  :  1721
01  :  700
14  :  1282
11  :  825
05  :  552
22  :  511
12  :  782
07  :  361
15  :  3479
20  :  1151
06  :  591
00  :  451
19  :  1513
17  :  1941
04  :  389
18  :  1741
10  :  1102
09  :  329
23  :  581


## The task is to calculate the average number of comments posted by the hour.

In [6]:
#initialize variables
avg_by_hour = 0
avg_comments_by_hour = []

#get avg for comments posted by the hour
for chr in comments_by_hour:
    avg_hr = comments_by_hour[chr] / counts_by_hour[chr]
    avg_comments_by_hour.append([chr, avg_hr])

#display avg count by the hour
print("\nAverage comment count by hour\n")   
for row in avg_comments_by_hour:
    print(row)


Average comment count by hour

['08', 10.729166666666666]
['13', 24.258823529411764]
['03', 6.925925925925926]
['02', 13.672413793103448]
['16', 23.35185185185185]
['21', 15.788990825688073]
['01', 11.666666666666666]
['14', 11.981308411214954]
['11', 14.224137931034482]
['05', 12.0]
['22', 7.197183098591549]
['12', 10.712328767123287]
['07', 10.617647058823529]
['15', 29.99137931034483]
['20', 14.3875]
['06', 13.431818181818182]
['00', 8.2]
['19', 13.754545454545454]
['17', 19.41]
['04', 8.27659574468085]
['18', 15.972477064220184]
['10', 18.677966101694917]
['09', 7.311111111111111]
['23', 8.544117647058824]


## The task is to sort and display the list for the average count of comments by the hour.

In [7]:
#import datetime
from datetime import datetime
import datetime as df

#initialize variables
swap_avg_by_hour = []
sorted_swap = []
 
#populate avg list in reverse
for key, value in avg_comments_by_hour:
    swap_avg_by_hour.append([value, key])

#display avg list
print('\n***Results for swap_avg_by_hour****\n')
for row in swap_avg_by_hour:
    print(row)

#sort reverse list in descending order - reverse = True
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

#display results of sorted list
print('\n***Results for sorted_swap****\n')
for row in sorted_swap:
    print(row)
    

#display top 5 rows with a specific format
time_format1 = "%H"
time_format2 = "%H:%M"
print("\nTop 5 Hours for Asks Posts Comments\n") 
for avg, hr in sorted_swap[:5]:
    hr_format = datetime.strptime(hr, time_format1)
    hr_format = datetime.strftime(hr_format, time_format2)
    print(hr_format, ' : ', '{:.2f} average comments per post'.format(avg))



***Results for swap_avg_by_hour****

[10.729166666666666, '08']
[24.258823529411764, '13']
[6.925925925925926, '03']
[13.672413793103448, '02']
[23.35185185185185, '16']
[15.788990825688073, '21']
[11.666666666666666, '01']
[11.981308411214954, '14']
[14.224137931034482, '11']
[12.0, '05']
[7.197183098591549, '22']
[10.712328767123287, '12']
[10.617647058823529, '07']
[29.99137931034483, '15']
[14.3875, '20']
[13.431818181818182, '06']
[8.2, '00']
[13.754545454545454, '19']
[19.41, '17']
[8.27659574468085, '04']
[15.972477064220184, '18']
[18.677966101694917, '10']
[7.311111111111111, '09']
[8.544117647058824, '23']

***Results for sorted_swap****

[29.99137931034483, '15']
[24.258823529411764, '13']
[23.35185185185185, '16']
[19.41, '17']
[18.677966101694917, '10']
[15.972477064220184, '18']
[15.788990825688073, '21']
[14.3875, '20']
[14.224137931034482, '11']
[13.754545454545454, '19']
[13.672413793103448, '02']
[13.431818181818182, '06']
[12.0, '05']
[11.981308411214954, '14']
[11.

## Findings:

Based on the analysis of the data for the "Ask HN" posts, the average with the highest count is at "15:00" hours following a 24-hour clock. Following a 12-hour clock it is at 3:00 pm. Per the data description the time zone used is Eastern Time in the US.

The right time to post in the Hacker News to receive comments following the Mountain Time is at 1:00 pm or at 13:00 hrs.

