## Guided Project: Exploring Hacker News Posts
## *By Naftali Indongo*

### 1. Introduction

This guided project brings the following skills together for some real-world practice:

- Working with strings
- Object-oriented programming
- Dates and times

In this project, we'll work with a dataset of submissions to the popular technology site [Hacker News](https://news.ycombinator.com).


 
![Hacker News logo][logo]

[logo]: https://s3.amazonaws.com/dq-content/354/hacker_news.jpg "Hacker News"

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com), where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts). Note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

- <font color='red'>id</font>: the unique identifier from Hacker News for the post
- <font color='red'>title</font>: the title of the post
- <font color='red'>url</font>: the URL that the post links to, if the post has a URL
- <font color='red'>num_points</font>: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- <font color='red'>num_comments</font>: the number of comments on the post
- <font color='red'>author</font>: the username of the person who submitted the post
- <font color='red'>created_at</font>: the date and time of the post's submission


We're specifically interested in posts with titles that begin with either <font color='red'>Ask HN</font> or <font color='red'>Show HN</font>. Users submit <font color='red'>Ask HN</font> posts to ask the Hacker News community a specific question. Below are a few examples:

> Ask HN: How to improve my personal website?

> Ask HN: Am I the only one outraged by Twitter shutting down share counts?

> Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit <font color='red'>Show HN</font> posts to show the Hacker News community a project, product, or just something interesting. Below are a few examples:

> Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform'

> Show HN: Something pointless I made

> Show HN: Shanhu.io, a programming playground powered by e8vm

We'll compare these two types of posts to determine the following:

- Do <font color='red'>Ask HN</font> or <font color='red'>Show HN</font> posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need and reading the dataset into a list of lists.

In [211]:
from csv import reader

# Read the CSV into a list of lists; the with-block closes the file afterwards.
with open("hacker_news.csv") as open_file:
    hn = list(reader(open_file))
hn[:5]  # The first five rows

In [None]:
len(hn) # length of our data set

### 2. Removing Headers from a List of Lists

Notice that the first inner list contains the column headers, and each list after it contains the data for one row. To analyze the data, we first need to remove the row containing the column headers. Let's remove that first row next.

In [None]:
headers = hn[0] # The header row
hn = hn[1:]
print(headers)

In [None]:
print(hn[:5]) # The first five rows excluding the header

In [None]:
len(hn) # New length excluding the header row
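The header/data split can also be tried on a small in-memory sample, without the real file. The sketch below assumes two made-up rows (the ids, titles, authors and dates are hypothetical) laid out in the column order described in the introduction.

```python
import csv
import io

# Hypothetical two-row sample mirroring the column layout described above.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,5,3,alice,8/16/2016 9:55\n"
    "2,Show HN: Example project,http://example.com,10,1,bob,9/1/2016 15:30\n"
)

rows = list(csv.reader(io.StringIO(sample_csv)))
sample_headers, sample_rows = rows[0], rows[1:]  # header row vs. data rows

print(sample_headers)
print(len(sample_rows))
```

Every value read this way is a string, which is why the comment counts are converted with `int()` later on.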

### 3. Extracting Ask HN and Show HN Posts

Now that we've removed the headers from <font color='red'>hn</font>, we're ready to filter our data. Since we're only concerned with post titles beginning with <font color='red'>Ask HN</font> or <font color='red'>Show HN</font>, we'll create new lists of lists containing just the data for those titles.


In [None]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Our dataset consists of {} posts beginning with Ask HN, {} posts beginning with Show HN and {} other posts.'.format(len(ask_posts), len(show_posts), len(other_posts)))
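The prefix test used in the loop above can be isolated into a small function, which makes the case-insensitive matching easy to check on individual titles. This is only a sketch; `classify_title` is a hypothetical helper name, not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


print(classify_title("ASK HN: How to improve my personal website?"))
print(classify_title("Show HN: Something pointless I made"))
print(classify_title("Something about asking HN"))
```

Lowercasing the title before calling `startswith` means that variants such as `ASK HN` or `ask hn` land in the same group.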

### 4. Calculating the Average Number of Comments for Ask HN and Show HN Posts

In the previous section, we separated the ask posts and the show posts into two lists of lists named <font color='red'>ask_posts</font> and <font color='red'>show_posts</font>. Below are the first three rows in the ask_posts list of lists:

In [None]:
ask_posts[:3]

Below are the first three rows in the show_posts list of lists:

In [None]:
show_posts[:3]

Next, let's determine if ask posts or show posts receive more comments on average.

In [240]:
# 1) Ask posts
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
avg_ask_comments = round(total_ask_comments / len(ask_posts))

# 2) Show posts (iterate over show_posts, not ask_posts)
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
avg_show_comments = round(total_show_comments / len(show_posts))
print("The average number of comments on ask posts is {}, while the average number of comments on show posts is {}.".format(avg_ask_comments, avg_show_comments))

 
The average number of comments on ask posts is 14, while the average number of comments on show posts is 10.


On average, ask posts receive more comments than show posts, plausibly because a question directly invites replies from the community. For that reason, the rest of the analysis focuses on ask posts.
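Because the two averages come from near-identical loops, one way to reduce the risk of copy-paste slips is to compute both with a single helper. The sketch below is an illustration only: `average_comments` and the demo rows are hypothetical, and it assumes, as in the cells above, that the comment count sits at index 4.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column for a list of rows; 0.0 if the list is empty."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows; only column 4 (num_comments) matters here.
demo_posts = [
    ["1", "Ask HN: A?", "", "2", "6", "alice", "8/16/2016 9:55"],
    ["2", "Ask HN: B?", "", "1", "29", "bob", "11/22/2015 13:43"],
]
print(average_comments(demo_posts))
```

With the real data, the same function would be called once with `ask_posts` and once with `show_posts`.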

### 5. Find the Number of Ask Posts and Comments by Hour Created

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

In this section, we will use the <font color='red'>datetime</font> module to work with the data in the <font color='red'>created_at</font> column and calculate the number of ask posts created per hour, along with the total number of comments.
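As a small illustration of the parsing step, the sketch below converts one timestamp string into a `datetime` object and extracts the hour in two forms. The sample value is hypothetical; the format string is the one used for the created_at column in this project.

```python
import datetime as dt

sample_created_at = "8/16/2016 9:55"  # hypothetical value in the created_at format
parsed = dt.datetime.strptime(sample_created_at, "%m/%d/%Y %H:%M")

hour_as_int = parsed.hour             # integer hour, 0-23
hour_as_text = parsed.strftime("%H")  # zero-padded two-character string

print(hour_as_int, hour_as_text)
```

The zero-padded string form is what ends up as the dictionary key in the analysis, which is why the hours appear as `'09'` rather than `9`.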

In [212]:
import datetime as dt

# Pair each ask post's creation time with its comment count.
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])  # no variable named `list`, which would shadow the built-in

count_by_hour = {}
comments_by_hour = {}
for created_at, comment in result_list:
    date_time = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = date_time.strftime("%H")  # zero-padded hour, e.g. "09"

    if hour not in count_by_hour:
        count_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        count_by_hour[hour] += 1
        comments_by_hour[hour] += comment


In [213]:
print(count_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


In [214]:
print(comments_by_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


### 6. Calculating the Average Number of Comments for Ask HN Posts by Hour

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.



In [215]:
avg_by_hour = []
for hour in count_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/count_by_hour[hour]])
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
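The per-hour average is simply the comment total divided by the post count for the same key. As a compact alternative to the loop, the sketch below uses a dictionary comprehension on two entries copied from the dictionaries printed above; the `_sample` names are hypothetical.

```python
# Two entries copied from the dictionaries printed above.
count_sample = {"15": 116, "02": 58}
comments_sample = {"15": 4477, "02": 1381}

# Divide total comments by number of posts, hour by hour.
avg_sample = {
    hour: comments_sample[hour] / count_sample[hour]
    for hour in count_sample
}
print(avg_sample)
```

A dictionary keeps the hour-to-average mapping directly addressable, whereas the list of lists used in the notebook is convenient for the sorting step that follows.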


### 7. Sorting and Printing Values from a List of Lists

Although we now have the results we need, this format makes it difficult to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [216]:
swap_avg_by_hour = []
for row in avg_by_hour:
    a_list = [row[1], row[0]]
    swap_avg_by_hour.append(a_list)
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [217]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
Top_5 = sorted_swap[:5]
print(Top_5)

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21']]


In [239]:
for row in Top_5:
    avg = row[0]
    hour = row[1]
    dt_obj = dt.datetime.strptime(hour, "%H")
    time_hour = dt_obj.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(time_hour,avg))

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
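Swapping the columns is one way to sort by the average. Another is to pass a `key` function to `sorted`, which leaves each pair in its original `[hour, average]` order. The sketch below uses three pairs copied from `avg_by_hour`; the variable names are hypothetical.

```python
# Three [hour, average] pairs copied from avg_by_hour above.
pairs = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
]

# Sort on the second element (the average), highest first.
by_average = sorted(pairs, key=lambda pair: pair[1], reverse=True)
for hour, avg in by_average:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

Because the hour keys are already zero-padded strings, appending `:00` gives the same display format without a round trip through `strptime`.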


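The 15:00 hour has the highest average, at 38.59 comments per ask post, followed by 02:00 (23.81) and 20:00 (21.52). An ask post therefore has the best chance of receiving comments if it is created at around 15:00.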

### 8. Conclusion

In conclusion, ask posts receive more comments on average than show posts: about 14 comments per ask post (24,483 comments across 1,744 posts, per the hourly totals above) versus about 10 per show post. Furthermore, among ask posts, those created at around 15:00 receive the most comments on average (38.59 per post), so that is the best time to post for a high chance of receiving comments.


