# Exploring Hacker News Posts

### Hacker new is a site started by the startup incubated [Y Cominbator](https://www.ycombinator.com/), where user-submitted stories, known as "posts" receive votes and comments. This is popular in technology and startup cirecles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

The dataset can be found [here in this link](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts), but it has been trimmed down to about 20,000 rows by removing all submissions that did not get any comments followed by randomly smapling from the remaining sumbmissions. 

---
We start by reading the ```hacker_news.csv``` file in a list of lists.

In [1]:
from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)

print(hn[:5])


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


The first list in the inner lists contatins headers. In order to analyze our dta, we need to remove this row.

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


If we wish to control for upper and lower case, we can create methods beginning with ```Ask HN```and ```Show HN``` (and case variations) into two different lists.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


---
Next, lets see if ask posts or show posts receive more comments on average.

In [4]:
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

14.038417431192661
10.31669535283993


---
We determined that ask posts receive more comments than show posts. Since most of the focus is on ask posts, we will focus our remaining analysis here.

Now, we will see if ask posts created at a certain *time* are more likely to attract comments. We'll use the following steps to perform this analysis:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

In [5]:
import datetime as dt

result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for each_row in result_list:
    date = each_row[0]
    comment = each_row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = 1
        counts_by_hour[time] = 1

1. Import the ```datetime``` module as ```dt```.
2. Create an empty list, and assign it to ```result_list```. This will be a list of lists.
3. Iterate over ```ask_posts```, and append to ```result_list``` a list with two elements:
    - The first element should be the column ```created_at```.
        - Because the ```created_at``` column is the seventh column in ```ask_posts```, you'll need to get the element at index ```6``` in each row.
    - The second element should be the number of comments of the post.
        - You'll also need to convert the value to an integer.
4. Create two empty dictionaries called ```counts_by_hour``` and ```comments_by_hour```.
5. Loop through each row of ```result_list```.
6. Extract the hour from the date, which is the first element of the row.
7. Use the ```datetime.strptime()``` method to parse the date and create a datetime object.
8. Use the string we want to parse as the first argument and a string that specifies the format as the second argument.
    - Use the ```datetime.strftime()``` method to select just the hour from the datetime object.
    - **If the hour** isn't a key in ```counts_by_hour```:
        - Create the key in ```counts_by_hour```, and set it equal to ```1```.
        - Create the key in ```comments_by_hour```, and set it equal to the ```comment``` number.
    - **If the hour** is already a key in counts_by_hour:
        - Increment the value in ```counts_by_hour``` by ```1```.
        - Increment the value in ```comments_by_hour``` by the ```comment``` number.

---
Next, we create a list of lists containing the hours during which posts were created and the average number of comments those posts received. 

1. Use the example above to calculate the average number of comments per post for posts created during each hour of the day.
2. The result should be a list of lists in which the first element is the hour and the second element is the average number of comments per post. Assign the result to a variable named ```avg_by_hour```. Display the results.

In [6]:
avg_per_hour = []

for hr in comments_by_hour:
    avg_per_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])

avg_per_hour

[['09', 5.466666666666667],
 ['13', 14.411764705882353],
 ['10', 13.440677966101696],
 ['14', 13.214953271028037],
 ['16', 16.64814814814815],
 ['23', 7.985294117647059],
 ['12', 9.36986301369863],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 15.98165137614679],
 ['20', 21.5125],
 ['02', 23.775862068965516],
 ['18', 13.192660550458715],
 ['03', 7.796296296296297],
 ['05', 9.478260869565217],
 ['19', 10.781818181818181],
 ['01', 10.85],
 ['22', 6.732394366197183],
 ['08', 10.166666666666666],
 ['04', 7.127659574468085],
 ['00', 7.963636363636364],
 ['06', 9.022727272727273],
 ['07', 7.823529411764706],
 ['11', 11.03448275862069]]

---
Even though we have the results that we need, the format makes it hard to identify the hours with the highest values. We can finish by sorting the list of lists and printing the five hightes values in a format that is easier to read.

1. Create a list that equals ```avg_by_hour``` with swapped columns.
    - Create an empty list and assign it to ```swap_avg_by_hour```.
    - Iterate over the rows of ```avg_by_hour```, and append to ```swap_avg_by_hour``` a list whose first element is the second element of the row, and whose second element is the first element of the row.
2. Print ```swap_avg_by_hour```.
3. Use the ```sorted()``` function to sort ```swap_avg_by_hour``` in descending order. Since the first column of this list is the average number of comments, sorting the list will sort by the average number of comments.
    - Set the ```reverse``` argument to ```True```, so that the highest value in the first column appears first in the list.
    - Assign the result to ```sorted_swap```.
4. Print the string "Top 5 Hours for Ask Posts Comments".
5. Loop through each average and each hour (in this order) in the first five lists of ```sorted_swap```.
6. Use the ```str.format()``` method to print the hour and average in the following format: ```15:00: 38.59 average comments per post```.
    - To format the hours, use the ```datetime.strptime()``` constructor to return a datetime object, and then use the ```strftime()``` method to specify the format of the time.
    - To format the average, you can use ```{:.2f}``` to indicate only two decimal places.

In [12]:
swap_avg_by_hour = []
for row in avg_per_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

sorted_swap

[[5.466666666666667, '09'], [14.411764705882353, '13'], [13.440677966101696, '10'], [13.214953271028037, '14'], [16.64814814814815, '16'], [7.985294117647059, '23'], [9.36986301369863, '12'], [11.46, '17'], [38.5948275862069, '15'], [15.98165137614679, '21'], [21.5125, '20'], [23.775862068965516, '02'], [13.192660550458715, '18'], [7.796296296296297, '03'], [9.478260869565217, '05'], [10.781818181818181, '19'], [10.85, '01'], [6.732394366197183, '22'], [10.166666666666666, '08'], [7.127659574468085, '04'], [7.963636363636364, '00'], [9.022727272727273, '06'], [7.823529411764706, '07'], [11.03448275862069, '11']]


[[38.5948275862069, '15'],
 [23.775862068965516, '02'],
 [21.5125, '20'],
 [16.64814814814815, '16'],
 [15.98165137614679, '21'],
 [14.411764705882353, '13'],
 [13.440677966101696, '10'],
 [13.214953271028037, '14'],
 [13.192660550458715, '18'],
 [11.46, '17'],
 [11.03448275862069, '11'],
 [10.85, '01'],
 [10.781818181818181, '19'],
 [10.166666666666666, '08'],
 [9.478260869565217, '05'],
 [9.36986301369863, '12'],
 [9.022727272727273, '06'],
 [7.985294117647059, '23'],
 [7.963636363636364, '00'],
 [7.823529411764706, '07'],
 [7.796296296296297, '03'],
 [7.127659574468085, '04'],
 [6.732394366197183, '22'],
 [5.466666666666667, '09']]

In [13]:
# Sort the values and print the top 5 hours with the highest average comments.

print("Top 5 Hours for Ask Posts Comments")
for avg, hr in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post".format(
        dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.78 average comments per post
20:00: 21.51 average comments per post
16:00: 16.65 average comments per post
21:00: 15.98 average comments per post


---
Based of the analysts, it would appear that posting at 3:00pm EST would create the highest change of reciving comments under Ask Posts in Hacker News.