# Exploring Hacker News Project

### Finding the Number of Ask Posts and Comments by Hour Created

##### On average, ask posts receive more comments than show posts. 

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.
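The two steps above can be sketched on a tiny hand-made sample before running them on the full dataset. The rows in `sample` are hypothetical, and the names `posts_per_hour`, `comments_per_hour`, and `avg_per_hour` are used only for this illustration.

```python
import datetime as dt

# Hypothetical sample rows in the form [created_at, num_comments]
sample = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 9:14", 2],
]

posts_per_hour = {}
comments_per_hour = {}

# Step 1: count posts and total comments for each hour of the day
for created_at, n_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    posts_per_hour[hour] = posts_per_hour.get(hour, 0) + 1
    comments_per_hour[hour] = comments_per_hour.get(hour, 0) + n_comments

# Step 2: average comments per post for each hour
avg_per_hour = {
    hour: comments_per_hour[hour] / posts_per_hour[hour]
    for hour in posts_per_hour
}

print(avg_per_hour)  # {'09': 4.0, '13': 29.0}
```

Using `dict.get(hour, 0)` avoids a separate branch for hours that have not been seen yet.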

In [1]:
from csv import reader

import datetime as dt

with open("hacker_news.csv", encoding="utf8") as opened_file:
    hn = list(reader(opened_file))

# headers = hn[0]          # the first row holds the column names
# print(headers)
print("=========================================================================================")
print("\n")

hn = hn[1:]                # keep every row except the header row

for row in hn[:6]:
    print(row)
    print("\n")



['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more',

In [2]:
ask_posts = []

show_posts = []

other_posts = []

for row in hn:
    title = row[1]
    
    lowercase_title = title.lower()
    # If the lowercase title starts with 'ask hn', append the row to ask_posts
    if lowercase_title.startswith("ask hn"):
        ask_posts.append(row)
    
    # If the lowercase title starts with 'show hn', append the row to show_posts
    elif lowercase_title.startswith("show hn"):
        show_posts.append(row)
        
    else:
        other_posts.append(row)
        
print("No. of Ask Posts are: " + str(len(ask_posts)) + "\n")

print("No. of Show Posts are: " + str(len(show_posts))+ "\n")

print("No. of Other Posts are: " + str(len(other_posts)) + "\n")

No. of Ask Posts are: 1744

No. of Show Posts are: 1162

No. of Other Posts are: 17194



In [3]:
result_list = []

for row in ask_posts:
    created_dt = row[6]          # creation timestamp, e.g. '8/16/2016 9:55'
    num_comments = int(row[4])   # number of comments on the post

    result_list.append([created_dt, num_comments])


In [4]:
print(result_list[:10])

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17], ['9/26/2015 23:23', 1], ['4/22/2016 12:24', 4], ['11/16/2015 9:22', 1], ['2/24/2016 17:57', 1], ['6/4/2016 17:17', 2]]


In [5]:
counts_by_hour  = {}

comments_by_hour = {}

date_format = "%m/%d/%Y %H:%M"
    
for row in result_list:
    date_str = row[0]
    no_of_comment = row[1]
    
    date_time_obj = dt.datetime.strptime(date_str, date_format)
    time_extracted = date_time_obj.strftime("%H")
    
    if time_extracted not in counts_by_hour:
        counts_by_hour[time_extracted] = 1
        comments_by_hour[time_extracted] = no_of_comment

    else:
        counts_by_hour[time_extracted] += 1
        comments_by_hour[time_extracted] += no_of_comment
    
print(comments_by_hour)
print("\n")
print(counts_by_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


#### Above, we created two dictionaries:

* counts_by_hour: contains the number of ask posts created during each hour of the day.
* comments_by_hour: contains the total number of comments received by ask posts created during each hour.

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

As a brief aside, here is one way to add 15 to each value in a dictionary. For example, suppose we have the following dictionary:

stud_dict = {'16-20':24, '21-30':34, '30-35':45}

>Now we want to add 15 to each value in the dictionary.

In [19]:
stud_dict = {"16-20":24, "21-30":34, "30-35":45}
stud_mod_dict = []

for item in stud_dict:
    stud_mod_dict.append([item, stud_dict[item] + 15])

print(stud_mod_dict)

# Convert the list of [key, value] pairs back into a dictionary
print("\n")

print(dict(stud_mod_dict))


[['16-20', 39], ['21-30', 49], ['30-35', 60]]


{'16-20': 39, '21-30': 49, '30-35': 60}
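The same result can also be reached without the intermediate list. The following is a minimal sketch using a dictionary comprehension; the name `stud_plus_15` is introduced here only for illustration.

```python
stud_dict = {"16-20": 24, "21-30": 34, "30-35": 45}

# Build a new dictionary with 15 added to every value; stud_dict is unchanged
stud_plus_15 = {age_range: count + 15 for age_range, count in stud_dict.items()}

print(stud_plus_15)  # {'16-20': 39, '21-30': 49, '30-35': 60}
```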


##### Now let's calculate the average number of comments per post for posts created during each hour of the day.

In [24]:
print("Number of Comments by hour----" + "\n") 
comments_by_hour

Number of Comments by hour----



{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

In [25]:
print("Number of Ask Posts by hour----" + "\n") 
counts_by_hour

Number of Ask Posts by hour----



{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}

In [27]:
avg_no_of_comments_per_hour = []

for each_row in comments_by_hour:
    avg_no_of_comments_per_hour.append([each_row, comments_by_hour[each_row] / counts_by_hour[each_row]])

In [29]:
print(avg_no_of_comments_per_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


Although we now have the results we need, this format makes it hard to identify the hours with the highest values. 


Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [41]:
swap_avg_by_hour = []
for each_row in avg_no_of_comments_per_hour:
    swap_avg_by_hour.append([each_row[1], each_row[0]])
    
print(swap_avg_by_hour)
print("\n")
#Sorting the swapped list
print("Printing the Sorted Swapped List \n")
sorted_swap_list = sorted(swap_avg_by_hour, reverse=True)
sorted_swap_list

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


Printing the Sorted Swapped List 



[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
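Swapping the columns works because lists are compared element by element. As an alternative sketch, `sorted` also accepts a `key` function, which sorts on the average directly and keeps the original `[hour, average]` order. The values in `avg_sample` are a few rounded averages taken from the output above, and the variable names are illustrative.

```python
# A few [hour, average] pairs, rounded from the results above
avg_sample = [["09", 5.58], ["15", 38.59], ["02", 23.81], ["20", 21.52]]

# Sort by the second element (the average), highest first
sorted_by_avg = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print(f"{hour}: {avg:.2f}")
```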

##### Top 5 Hours for Ask Post Comments


In [58]:
print("Top 5 Hours for Ask Posts Comments are as below:  \n")
template = "{}:  {:.2f} average comments per post"

for avg, hr in sorted_swap_list[:5]:
    print(template.format(hr, avg))

Top 5 Hours for Ask Posts Comments are as below:  

15:  38.59 average comments per post
02:  23.81 average comments per post
20:  21.52 average comments per post
16:  16.80 average comments per post
21:  16.01 average comments per post


#### Points to analyze in the five rows above:

>Which hours should you create a post during to have a higher chance of receiving comments? 

>Refer back to the documentation for the data set to convert the times to the time zone you live in. Write a markdown cell explaining your findings.
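The time-zone conversion can be sketched with fixed UTC offsets. This notebook does not state which time zone the timestamps use, so both offsets below are assumptions: `SOURCE_UTC_OFFSET_HOURS = -5` is a placeholder to be replaced with the value from the dataset documentation, and `LOCAL_UTC_OFFSET_HOURS = 1` stands for a hypothetical reader. Fixed offsets also ignore daylight saving time, so treat the result as approximate.

```python
import datetime as dt

# ASSUMPTION: the dataset's timestamps are 5 hours behind UTC.
# Check the dataset documentation and change this value if needed.
SOURCE_UTC_OFFSET_HOURS = -5
# ASSUMPTION: a hypothetical reader living at UTC+1.
LOCAL_UTC_OFFSET_HOURS = 1


def convert_hour(hour_str, source_offset, local_offset):
    """Shift an 'HH' hour string from one fixed UTC offset to another."""
    source_tz = dt.timezone(dt.timedelta(hours=source_offset))
    local_tz = dt.timezone(dt.timedelta(hours=local_offset))
    # The calendar date is arbitrary; only the hour matters here.
    moment = dt.datetime(2016, 1, 1, int(hour_str), tzinfo=source_tz)
    return moment.astimezone(local_tz).strftime("%H:%M")


for hour in ["15", "02", "20", "16", "21"]:
    local = convert_hour(hour, SOURCE_UTC_OFFSET_HOURS, LOCAL_UTC_OFFSET_HOURS)
    print(hour + ":00 ->", local)
```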

**Analysis point 1:**

According to the results above, the top 5 hours for ask post comments are:

- 15:  38.59 average comments per post
- 02:  23.81 average comments per post
- 20:  21.52 average comments per post
- 16:  16.80 average comments per post
- 21:  16.01 average comments per post

The hour that receives the most comments per post on average is 15:00, with an average of 38.59 comments per post. That is about 62% higher than the second-highest average, 23.81 comments per post at 02:00.
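The percentage gap between the two highest averages can be checked with a short calculation. The totals and counts below are copied from the `comments_by_hour` and `counts_by_hour` outputs earlier in the notebook; the variable names are illustrative.

```python
# Totals and post counts copied from the dictionaries printed earlier
highest_avg = 4477 / 116   # 15:00 -> about 38.59 comments per post
second_avg = 1381 / 58     # 02:00 -> about 23.81 comments per post

# Relative increase of the highest average over the second highest
pct_increase = (highest_avg - second_avg) / second_avg * 100

print(f"{pct_increase:.1f}% higher")
```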

Here's a quick summary of what we accomplished in this guided project:

- We set a goal for the project.
- We collected and sorted the data.
- We reformatted and cleaned the data to prepare it for analysis.
- We analyzed the data.


Next steps for us to consider:

- Determine if show or ask posts receive more points on average.
- Determine if posts created at a certain time are more likely to receive more points.
- Compare our results to the average number of comments and points other posts receive.
- Use Dataquest's data science project style guide to format your project.
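As a starting point for the first next step, the sketch below averages points for two groups of posts. It assumes the points are stored at index 3 of each row, which is inferred from the sample rows printed earlier and not confirmed by a header. The rows in `sample_ask` and `sample_show` are hypothetical, and `average_points` is a helper introduced only for this illustration.

```python
def average_points(posts, points_index=3):
    """Return the mean of the points column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[points_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows: [id, title, url, points, comments, author, created_at]
sample_ask = [
    ["1", "Ask HN: example A", "", "10", "4", "user_a", "8/16/2016 9:55"],
    ["2", "Ask HN: example B", "", "20", "1", "user_b", "5/2/2016 10:14"],
]
sample_show = [
    ["3", "Show HN: example C", "", "30", "2", "user_c", "6/4/2016 17:17"],
    ["4", "Show HN: example D", "", "50", "7", "user_d", "4/22/2016 12:24"],
    ["5", "Show HN: example E", "", "10", "0", "user_e", "2/24/2016 17:57"],
]

print("Ask posts: ", average_points(sample_ask))
print("Show posts:", average_points(sample_show))
```

With the real data, the same helper could be called as `average_points(ask_posts)` and `average_points(show_posts)`.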