# About The Project

In this project, we work with a [data set](https://www.kaggle.com/hacker-news/hacker-news-posts) of submissions to the popular technology site **Hacker News**. The columns are described below:

- `id`: The unique identifier from Hacker News for the post
- `title`: The title of the post
- `url`: The URL that the post links to, if the post has a URL
- `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: The number of comments that were made on the post
- `author`: The username of the person who submitted the post
- `created_at`: The date and time at which the post was submitted

We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. `Ask HN` posts ask the community a specific question, while `Show HN` posts show a project, a product, or just something interesting. We will compare these two types of posts to determine the following:

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?


# Cleaning the Data Set

As suggested, the data set is first cleaned to remove every post that has no comments. The script [`data_cleaning.py`](data_cleaning.py) performs this step, and from here on we work with its output, `hacker_news_cleaned.csv`. This is what the data set looks like:

In [1]:
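from csv import reader

# Read the cleaned data set into a list of rows (each row is a list of strings).
# The `with` block closes the file as soon as the rows have been read.
with open('hacker_news_cleaned.csv') as opened_file:
    hn = list(reader(opened_file))

print (hn[:5])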

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13'], ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26'], ['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54']]
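
The cleaning step itself lives in `data_cleaning.py`, which is not reproduced here. The following is only a minimal sketch of such a filter, assuming the comment count sits at column index 4 as in the header above. The function name `drop_posts_without_comments` and the sample rows are hypothetical and are not taken from the script:

```python
def drop_posts_without_comments(rows, comments_index=4):
    """Return the header plus every row whose comment count is above zero.

    Hypothetical helper: the actual data_cleaning.py may work differently.
    """
    header, body = rows[0], rows[1:]
    kept = [row for row in body if int(row[comments_index]) > 0]
    return [header] + kept


# Made-up sample rows, for illustration only
sample = [
    ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
    ['1', 'Ask HN: Example question?', '', '3', '2', 'alice', '1/1/2016 9:00'],
    ['2', 'A post nobody commented on', 'http://example.com', '1', '0', 'bob', '1/1/2016 10:00'],
]
cleaned_sample = drop_posts_without_comments(sample)
print(len(cleaned_sample))  # 2: the header and the one commented post
```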


The header row is then separated from the data and stored in the variable `headers`:

In [2]:
headers = hn[0]
hn_no_headers = hn[1:]
print (headers, end=' ')
print (hn_no_headers[:5], end = '\n')


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] [['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13'], ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26'], ['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54'], ['12578624', 'Phone Makers Could Cut Off Drivers. So Why Dont They?', 'http://www.nytimes.com/2016/09/25/technology/phone-makers-could-cut-off-drivers-so-why-dont-they.html', '4', '1', 'danso', '9/26/2016 1:37']]


# Handling the Data Set

## Separating `Ask HN` and `Show HN` posts

With the header row removed, we now separate the posts whose titles begin with `Ask HN` or `Show HN` from the rest.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn_no_headers:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print (len(ask_posts))
print (len(show_posts))
print (len(other_posts))

6911
5059
68431


We now have three separate lists with the following lengths: `ask_posts`: 6911, `show_posts`: 5059, `other_posts`: 68431.


## Average comments per type of post
   

In [4]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = round(total_ask_comments/len(ask_posts))
print ('The average ask posts comments are:', avg_ask_comments)

total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = round(total_show_comments/len(show_posts))
print ('The average show posts comments are:', avg_show_comments)



The average ask posts comments are: 14
The average show posts comments are: 10


Comparing the average number of comments per ask post and per show post, we can conclude that **ask posts receive more comments on average.**
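
The two loops above differ only in the list they iterate over, so the calculation could be factored into one helper. The following is a minimal sketch rather than the notebook's own code: the name `average_comments` and the sample rows are made up for illustration, and the comment count is assumed to be at column index 4. It returns the unrounded mean, so small differences between post types are not hidden by rounding to a whole number:

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments over a list of rows (0.0 for an empty list)."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up sample rows, for illustration only
sample_posts = [
    ['1', 'Ask HN: A?', '', '1', '4', 'alice', '1/1/2016 9:00'],
    ['2', 'Ask HN: B?', '', '1', '7', 'bob', '1/1/2016 10:00'],
]
print(round(average_comments(sample_posts), 2))  # 5.5
```

With the real lists, the call would be `average_comments(ask_posts)` and `average_comments(show_posts)`.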

## Analysing the impact of post creation time

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Count the number of ask posts created in each hour of the day, along with the total number of comments those posts received.
2. Calculate the average number of comments ask posts receive by hour created.
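
Both steps depend on extracting the hour from the `created_at` string. As a small illustration, here is how one timestamp from the sample rows shown earlier is parsed with `datetime.strptime` and reduced to a zero-padded hour. The variable names are chosen for this sketch only:

```python
import datetime as dt

# Timestamp copied from the sample rows shown earlier
created_at = '9/26/2016 2:53'

# Month/day/year followed by a 24-hour time
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
hour_string = parsed.strftime('%H')  # zero-padded hour, suitable as a dictionary key
print(hour_string)  # 02
```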


In [10]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at,num_comments])

# print (result_list[400])    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    datetime_object = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour_string = datetime_object.strftime("%H")
    if hour_string not in counts_by_hour:
        counts_by_hour[hour_string] = 1
        comments_by_hour[hour_string] = row[1]
    else:
        counts_by_hour[hour_string] += 1
        comments_by_hour[hour_string] += row[1]

print(counts_by_hour)
print(comments_by_hour)


{'02': 227, '01': 223, '22': 287, '21': 407, '19': 420, '17': 404, '15': 467, '14': 378, '13': 326, '11': 251, '10': 219, '09': 176, '07': 157, '03': 212, '16': 415, '08': 190, '00': 231, '23': 276, '20': 392, '18': 452, '12': 274, '04': 186, '06': 176, '05': 165}
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '16': 4466, '08': 2362, '00': 2277, '23': 2297, '20': 4462, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


Now, we will calculate the average number of comments per post for posts created during each hour of the day.

In [26]:
avg_by_hour = []

for hour in comments_by_hour:
    avg = round(comments_by_hour[hour]/counts_by_hour[hour])
    avg_by_hour.append([hour,avg])
#     print (avg)

print (avg_by_hour)

[['02', 13], ['01', 9], ['22', 12], ['21', 11], ['19', 9], ['17', 14], ['15', 40], ['14', 13], ['13', 22], ['11', 11], ['10', 14], ['09', 8], ['07', 10], ['03', 10], ['16', 11], ['08', 12], ['00', 10], ['23', 8], ['20', 11], ['18', 11], ['12', 15], ['04', 13], ['06', 9], ['05', 11]]
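
To answer the second question, the hourly averages can be ranked. The following is a minimal sketch that sorts the `avg_by_hour` values printed above, copied here so that the snippet runs on its own. The name `top_hours` is introduced for illustration:

```python
# Hourly averages copied from the output above
avg_by_hour = [
    ['02', 13], ['01', 9], ['22', 12], ['21', 11], ['19', 9], ['17', 14],
    ['15', 40], ['14', 13], ['13', 22], ['11', 11], ['10', 14], ['09', 8],
    ['07', 10], ['03', 10], ['16', 11], ['08', 12], ['00', 10], ['23', 8],
    ['20', 11], ['18', 11], ['12', 15], ['04', 13], ['06', 9], ['05', 11],
]

# Sort by the average (second element), highest first, and keep the top five
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_hours:
    print(f'{hour}:00 -> {avg} comments per post on average')
```

With these figures, ask posts created in the 15:00 hour average the most comments (40), followed by 13:00 (22) and 12:00 (15). The column descriptions do not state the time zone of `created_at`, so the hours should be read in whatever zone the data set uses.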
