# Explore posts from the Hacker News dataset
**This analysis aims to answer two questions:**
* Do Ask HN posts or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

The data set contains Hacker News posts from the 12 months up to September 26, 2016.

It includes the following columns:

* title: title of the post (self explanatory)

* url: the url of the item being linked to

* num_points: the number of upvotes the post received

* num_comments: the number of comments the post received

* author: the name of the account that made the post

* created_at: the date and time the post was made (the time zone is Eastern Time in the US)

## Import needed libraries

In [1]:
from csv import reader
import datetime as dt

## Create utility functions

The `extract_data` function takes a file path as an argument and returns the CSV file converted to a Python list.

In [2]:
def extract_data(file_path):
    # The with-block closes the file once all rows have been read
    with open(file_path, encoding='utf8') as open_dataset:
        read_dataset = reader(open_dataset)
        return list(read_dataset)
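
As a minimal sketch of what the CSV reader returns, the snippet below parses a small in-memory sample instead of the real file. `SAMPLE_CSV` and `read_rows` are hypothetical names used only for illustration; the column layout is assumed to match the data set described above.

```python
import csv
import io

# Hypothetical one-post sample in the same column layout as the data set.
SAMPLE_CSV = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    '1,"Ask HN: Example, with a comma",,4,7,alice,9/26/2016 2:53\n'
)


def read_rows(text_stream):
    """Return every CSV row (header included) as a list of strings."""
    return list(csv.reader(text_stream))


rows = read_rows(io.StringIO(SAMPLE_CSV))
header, posts = rows[0], rows[1:]

print(header)
print(posts[0][1])  # the quoted comma stays inside the title field
```

Every value comes back as a string, which is why the later cells call `int()` on the comment counts. When reading a real file, the `csv` documentation recommends opening it with `newline=''`.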

The `display_list_data` function prints a slice of a list in a readable format. It can also print additional information: the number of rows and the number of columns.

In [3]:
def display_list_data(dataset, start, end, rows_and_columns):
    sliced_data = dataset[start:end]
    for row in sliced_data:
        print(row)
    print('\n')
    if rows_and_columns:
        print('Number of rows: {:,}'.format(len(dataset)))
        print('Number of columns:', len(dataset[0]))    

The `display_dictionary_data` function prints dictionary data in a more readable format. It can also print the maximum and minimum values.

In [4]:
def display_dictionary_data(dictionary, topic, max_and_min_comments=False):
    reversed_dataset = list(zip(dictionary.values(), dictionary.keys()))
    sorted_dict = sorted(zip(dictionary.keys(), dictionary.values()))
    for row in sorted_dict:
        print('At {} there was {} {} added'.format(row[0],row[1],topic))
    print('\n')    
    if max_and_min_comments:
        max_value, max_key = max(reversed_dataset)
        min_value, min_key = min(reversed_dataset)
        print("At {} o'clock we noticed a maximum number of {} which was in total {}".format(max_key, topic, max_value))
        print("At {} o'clock we noticed the smallest number of {} which was in total {}".format(min_key, topic, min_value))
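
The maximum and minimum lookup can also be written without building the swapped `(value, key)` list. The sketch below is one possible alternative, not the notebook's method; the three totals are copied from the ask-post output further down and the variable names are hypothetical.

```python
# Three per-hour totals; keys are zero-padded hour strings.
comments_by_hour = {"09": 1477, "13": 7245, "15": 18525}

# With key=dict.get, max/min compare the values but return the key.
busiest_hour = max(comments_by_hour, key=comments_by_hour.get)
quietest_hour = min(comments_by_hour, key=comments_by_hour.get)

print(busiest_hour, comments_by_hour[busiest_hour])
print(quietest_hour, comments_by_hour[quietest_hour])
```

One difference worth knowing: when two hours share the same value, this version returns the first one in iteration order, while comparing `(value, key)` tuples breaks the tie by key.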

## Import and clean data set

We use the `extract_data` function to convert the CSV file into a Python list.<br/>
We use the `display_list_data` function to display the first five rows of the resulting list of Hacker News posts.

In [5]:
hacker_news_posts = extract_data('HN_posts_year_to_Sep_26_2016.csv')
display_list_data(hacker_news_posts, 0, 5, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


Number of rows: 293,120
Number of columns: 7


### Separate headers from the rest of the data

In [6]:
headers = hacker_news_posts[:1]
hn = hacker_news_posts[1:]

### Clean the data

In this step we will remove posts without comments

In [7]:
print(headers)
cleaned_hn = []
for row in hn:
    num_comments = int(row[4])
    if num_comments > 0:
        cleaned_hn.append(row)
        
print(len(hn))
print(len(cleaned_hn))
print(len(hn) - len(cleaned_hn))

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
293119
80401
212718


**Conclusion:**<br/>
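After removing posts without comments, the dataset shrinks from 293,119 to 80,401 rows: 212,718 posts (about 73%) received no comments. Note that the cells below still iterate over the full `hn` list rather than `cleaned_hn`, so posts without comments remain part of the later counts and averages.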
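
The same filter can be expressed as a list comprehension. This is a small sketch on hypothetical rows (the `example.com` URLs and author names are made up); only the column order is taken from the data set.

```python
NUM_COMMENTS = 4  # column index of num_comments in each row

# Hypothetical rows in the data set's column order.
sample_rows = [
    ["1", "Post A", "http://example.com/a", "1", "0", "ann", "9/26/2016 3:26"],
    ["2", "Post B", "http://example.com/b", "4", "7", "bob", "9/26/2016 2:53"],
    ["3", "Post C", "http://example.com/c", "6", "3", "cy", "9/26/2016 1:17"],
]

# Keep only the rows whose comment count is greater than zero.
commented = [row for row in sample_rows if int(row[NUM_COMMENTS]) > 0]
removed = len(sample_rows) - len(commented)
share_removed = removed / len(sample_rows)

print(len(commented), removed, "{:.1%}".format(share_removed))
```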

## Extract only the posts that begin with either Ask HN or Show HN
We are only concerned with post titles beginning with `Ask HN` or `Show HN`. Thus, we will create new lists containing just the data for those titles and store them in corresponding variables.

In [8]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(show_posts))  
print(len(other_posts))

10158
273822
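
The prefix test used above can be isolated in a small helper, which makes the case-insensitive matching easy to check. `classify_title` is a hypothetical name, and the upper-case `SHOW HN` title is an altered sample chosen to show that casing does not matter.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


sample_titles = [
    "Ask HN: What TLD do you use for local development?",
    "SHOW HN: Finding puns computationally",
    "algorithmic music",
]

# Group the titles by their category.
groups = {"ask": [], "show": [], "other": []}
for title in sample_titles:
    groups[classify_title(title)].append(title)

for name, titles in groups.items():
    print(name, len(titles))
```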


Explore `ask_posts` dataset

In [9]:
display_list_data(ask_posts,0, 3, True)

['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']
['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']
['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57']


Number of rows: 9,139
Number of columns: 7


Explore `show_posts` dataset

In [10]:
display_list_data(show_posts,0, 3, True)

['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36']
['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01']
['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44']


Number of rows: 10,158
Number of columns: 7


Explore `other_posts` dataset

In [11]:
display_list_data(other_posts,0, 3, True)

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


Number of rows: 273,822
Number of columns: 7


## Determine if ask posts or show posts receive more comments on average

In [12]:
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
print(total_ask_comments)
avg_ask_comments = round(total_ask_comments / len(ask_posts), 2)
print(avg_ask_comments)

94986
10.39


In [13]:
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
avg_show_posts = round(total_show_comments / len(show_posts), 2)
print(avg_show_posts)

4.89


In [14]:
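total_other_comments = 0
for row in other_posts:  # iterate over other_posts, not show_posts
    total_other_comments += int(row[4])
avg_other_posts = round(total_other_comments / len(other_posts), 2)
print(avg_other_posts)

The value originally recorded for this cell, 0.18, was obtained by summing the comments of `show_posts` and dividing by the number of other posts, so it is not the average for other posts. Re-run the cell to obtain that average.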


**Conclusion**<br/>
Based on this analysis, Ask HN posts receive more comments on average than Show HN posts. The results are as follows:
1. Ask posts received 10.39 comments on average
2. Show posts received 4.89 comments on average
3. Other posts: the 0.18 reported above is the Show HN comment total divided by the number of other posts, so it does not measure the average for other posts
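
Because the same averaging loop is needed for every group, it can be wrapped in one helper and reused. The sketch below assumes the comment count sits at index 4, as in the data set; `average_comments` and the sample rows are hypothetical.

```python
NUM_COMMENTS = 4  # column index of num_comments in each row


def average_comments(rows):
    """Mean of the num_comments column, rounded to two decimals."""
    if not rows:
        return 0.0  # avoid dividing by zero for an empty group
    total = sum(int(row[NUM_COMMENTS]) for row in rows)
    return round(total / len(rows), 2)


# Hypothetical rows; only the comment column matters here.
ask_sample = [
    ["1", "Ask HN: A", "", "4", "7", "ann", "9/26/2016 2:53"],
    ["2", "Ask HN: B", "", "6", "3", "bob", "9/26/2016 1:17"],
    ["3", "Ask HN: C", "", "1", "0", "cy", "9/25/2016 22:57"],
]
show_sample = [
    ["4", "Show HN: D", "http://example.com/d", "2", "0", "dee", "9/26/2016 0:36"],
    ["5", "Show HN: E", "http://example.com/e", "1", "2", "eve", "9/26/2016 0:01"],
]

print(average_comments(ask_sample))
print(average_comments(show_sample))
```

Passing each list to the same function keeps the numerator and the denominator tied to the same group.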

## Determine the number of posts and comments created in each hour

Calculate the number of ask posts created in each hour of the day, along with the number of comments received.

In [15]:
result_list_ask_posts = []
for row in ask_posts:
    # Keep only the creation time and the comment count as an integer
    result_list_ask_posts.append([row[6], int(row[4])])

counts_by_hour_ask = {}
comments_by_hour_ask = {}

for row in result_list_ask_posts:
    dt_obj = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = dt_obj.strftime("%H")  # zero-padded hour, e.g. '09'
    if hour in counts_by_hour_ask:
        counts_by_hour_ask[hour] += 1
        comments_by_hour_ask[hour] += row[1]
    else:
        counts_by_hour_ask[hour] = 1
        comments_by_hour_ask[hour] = row[1]
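
The per-hour tallies can be built without the `if`/`else` branch by using `collections.defaultdict`. This is a sketch under the assumption that the timestamps follow the `month/day/year hour:minute` format shown in the data; the sample pairs are hypothetical.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"

# Hypothetical (created_at, num_comments) pairs.
sample = [
    ("9/26/2016 2:53", 7),
    ("9/26/2016 1:17", 3),
    ("9/25/2016 2:05", 5),
]

# Missing hours start at 0, so no membership test is needed.
counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```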

Explore comments added for each hour<br/>
**comments_by_hour_ask:** contains the total number of comments received by ask posts created during each hour of the day.

In [16]:
display_dictionary_data(comments_by_hour_ask, 'comments', True)

At 00 there was 2277 comments added
At 01 there was 2089 comments added
At 02 there was 2996 comments added
At 03 there was 2154 comments added
At 04 there was 2360 comments added
At 05 there was 1838 comments added
At 06 there was 1587 comments added
At 07 there was 1585 comments added
At 08 there was 2362 comments added
At 09 there was 1477 comments added
At 10 there was 3013 comments added
At 11 there was 2797 comments added
At 12 there was 4234 comments added
At 13 there was 7245 comments added
At 14 there was 4972 comments added
At 15 there was 18525 comments added
At 16 there was 4466 comments added
At 17 there was 5547 comments added
At 18 there was 4877 comments added
At 19 there was 3954 comments added
At 20 there was 4462 comments added
At 21 there was 4500 comments added
At 22 there was 3372 comments added
At 23 there was 2297 comments added


At 15 o'clock we noticed a maximum number of comments which was in total 18525
At 09 o'clock we noticed the smallest number of comments which was in total 1477

**Conclusions**<br/>
The totals above count comments on Ask HN posts, grouped by the hour in which the post was created. Posts created between 12:00 and 21:00 collect the most comments (at least 3,954 per hour), and the totals fall from 22:00 onward. The lowest totals belong to posts created from 05:00 to 07:00 and at 09:00 (all below 1,900). The peak is at 15:00 (18,525 comments) and the minimum is at 09:00 (1,477 comments).

Explore posts added for each hour<br/>
**counts_by_hour_ask:** contains the number of ask posts created during each hour of the day.

In [17]:
display_dictionary_data(counts_by_hour_ask, 'posts', True)

At 00 there was 301 posts added
At 01 there was 282 posts added
At 02 there was 269 posts added
At 03 there was 271 posts added
At 04 there was 243 posts added
At 05 there was 209 posts added
At 06 there was 234 posts added
At 07 there was 226 posts added
At 08 there was 257 posts added
At 09 there was 222 posts added
At 10 there was 282 posts added
At 11 there was 312 posts added
At 12 there was 342 posts added
At 13 there was 444 posts added
At 14 there was 513 posts added
At 15 there was 646 posts added
At 16 there was 579 posts added
At 17 there was 587 posts added
At 18 there was 614 posts added
At 19 there was 552 posts added
At 20 there was 510 posts added
At 21 there was 518 posts added
At 22 there was 383 posts added
At 23 there was 343 posts added


At 15 o'clock we noticed a maximum number of posts which was in total 646
At 05 o'clock we noticed the smallest number of posts which was in total 209


**Conclusions**<br/>
The number of Ask HN posts varies over the day: it climbs from 10:00, stays above 500 posts per hour from 14:00 to 21:00, and is lowest in the early morning.
The peak is at 15:00 (646 posts) and the minimum is at 05:00 (209 posts).

Calculate the number of show posts created in each hour of the day, along with the number of comments received.

In [18]:
result_list_show_posts = []
for row in show_posts:
    # Keep only the creation time and the comment count as an integer
    result_list_show_posts.append([row[6], int(row[4])])

counts_by_hour_show = {}
comments_by_hour_show = {}

for row in result_list_show_posts:
    dt_obj = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = dt_obj.strftime("%H")  # zero-padded hour, e.g. '09'
    if hour in counts_by_hour_show:
        counts_by_hour_show[hour] += 1
        comments_by_hour_show[hour] += row[1]
    else:
        counts_by_hour_show[hour] = 1
        comments_by_hour_show[hour] = row[1]

Explore `counts_by_hour_show` dataset

In [19]:
display_dictionary_data(counts_by_hour_show, 'posts', True)

At 00 there was 276 posts added
At 01 there was 247 posts added
At 02 there was 209 posts added
At 03 there was 206 posts added
At 04 there was 194 posts added
At 05 there was 172 posts added
At 06 there was 192 posts added
At 07 there was 236 posts added
At 08 there was 316 posts added
At 09 there was 302 posts added
At 10 there was 323 posts added
At 11 there was 402 posts added
At 12 there was 516 posts added
At 13 there was 610 posts added
At 14 there was 696 posts added
At 15 there was 836 posts added
At 16 there was 801 posts added
At 17 there was 761 posts added
At 18 there was 656 posts added
At 19 there was 556 posts added
At 20 there was 525 posts added
At 21 there was 430 posts added
At 22 there was 377 posts added
At 23 there was 319 posts added


At 15 o'clock we noticed a maximum number of posts which was in total 836
At 05 o'clock we noticed the smallest number of posts which was in total 172


Explore `comments_by_hour_show` dataset

In [20]:
display_dictionary_data(comments_by_hour_show, 'comments', True)

At 00 there was 1283 comments added
At 01 there was 1006 comments added
At 02 there was 1076 comments added
At 03 there was 934 comments added
At 04 there was 978 comments added
At 05 there was 592 comments added
At 06 there was 904 comments added
At 07 there was 1577 comments added
At 08 there was 1771 comments added
At 09 there was 1411 comments added
At 10 there was 1228 comments added
At 11 there was 2413 comments added
At 12 there was 3609 comments added
At 13 there was 3314 comments added
At 14 there was 3839 comments added
At 15 there was 3824 comments added
At 16 there was 3769 comments added
At 17 there was 3236 comments added
At 18 there was 3242 comments added
At 19 there was 2791 comments added
At 20 there was 2183 comments added
At 21 there was 1759 comments added
At 22 there was 1450 comments added
At 23 there was 1444 comments added


At 14 o'clock we noticed a maximum number of comments which was in total 3839
At 05 o'clock we noticed the smallest number of comments which was in total 592

**Conclusions**<br/>
Show HN posts created between 11:00 and 19:00 collect the most comments (at least 2,413 per hour). The totals decline through the evening and are lowest for posts created from 03:00 to 06:00 (below 1,000 per hour).
The peak is at 14:00 (3,839 comments) and the minimum is at 05:00 (592 comments).

## Calculate the average number of comments per post for posts created during each hour of the day

In [21]:
# Average comments per post = total comments for the hour / number of posts for the hour
avg_by_hour_ask = []
for hour in comments_by_hour_ask:
    rounded_avg_ask = round(comments_by_hour_ask[hour] / counts_by_hour_ask[hour], 2)
    avg_by_hour_ask.append([hour, rounded_avg_ask])
print(avg_by_hour_ask)

[['02', 11.14], ['01', 7.41], ['22', 8.8], ['21', 8.69], ['19', 7.16], ['17', 9.45], ['15', 28.68], ['14', 9.69], ['13', 16.32], ['11', 8.96], ['10', 10.68], ['09', 6.65], ['07', 7.01], ['03', 7.95], ['23', 6.7], ['20', 8.75], ['16', 7.71], ['08', 9.19], ['00', 7.56], ['18', 7.94], ['12', 12.38], ['04', 9.71], ['06', 6.78], ['05', 8.79]]


In [22]:
display_list_data(avg_by_hour_ask, 0, len(avg_by_hour_ask), True)

['02', 11.14]
['01', 7.41]
['22', 8.8]
['21', 8.69]
['19', 7.16]
['17', 9.45]
['15', 28.68]
['14', 9.69]
['13', 16.32]
['11', 8.96]
['10', 10.68]
['09', 6.65]
['07', 7.01]
['03', 7.95]
['23', 6.7]
['20', 8.75]
['16', 7.71]
['08', 9.19]
['00', 7.56]
['18', 7.94]
['12', 12.38]
['04', 9.71]
['06', 6.78]
['05', 8.79]


Number of rows: 24
Number of columns: 2
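
The same averages can be produced with a dictionary comprehension, iterating over sorted keys so the result comes out in hour order. The sketch uses three hours copied from the ask-post output above; the shortened variable names are hypothetical.

```python
# Totals and post counts for three hours, copied from the ask-post output.
comments_by_hour = {"15": 18525, "13": 7245, "09": 1477}
counts_by_hour = {"15": 646, "13": 444, "09": 222}

# Divide total comments by post count for every hour, in hour order.
avg_by_hour = {
    hour: round(comments_by_hour[hour] / counts_by_hour[hour], 2)
    for hour in sorted(comments_by_hour)
}

print(avg_by_hour)
```

The three results agree with the corresponding entries in the list printed above (6.65, 16.32 and 28.68).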


## Display the five hours with the highest average number of comments per post

In [23]:
swap_avg_by_hour = []
for row in avg_by_hour_ask:
    swap_avg_by_hour.append([row[1], row[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap)

[[28.68, '15'], [16.32, '13'], [12.38, '12'], [11.14, '02'], [10.68, '10'], [9.71, '04'], [9.69, '14'], [9.45, '17'], [9.19, '08'], [8.96, '11'], [8.8, '22'], [8.79, '05'], [8.75, '20'], [8.69, '21'], [7.95, '03'], [7.94, '18'], [7.71, '16'], [7.56, '00'], [7.41, '01'], [7.16, '19'], [7.01, '07'], [6.78, '06'], [6.7, '23'], [6.65, '09']]


In [24]:
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    print('{}:00: {}'.format(row[1], row[0]))
    

Top 5 Hours for Ask Posts Comments
15:00: 28.68
13:00: 16.32
12:00: 12.38
02:00: 11.14
10:00: 10.68
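
Swapping the columns before sorting is one way to rank the hours. An alternative sketch, shown here on a few `[hour, average]` pairs copied from the output above, passes a `key` function to `sorted` so the original pair order is kept.

```python
# A few [hour, average] pairs copied from the ask-post averages.
avg_by_hour = [
    ["02", 11.14],
    ["15", 28.68],
    ["13", 16.32],
    ["09", 6.65],
    ["12", 12.38],
]

# Sort by the average (second element), highest first, and keep the top three.
top_three = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:3]

for hour, average in top_three:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```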


### Convert time to CET time zone

In [25]:
ask_post_avg_by_hour_cet = []
for row in sorted_swap:
    hour = int(row[1])
    dt_hour = dt.datetime(2022, 1, 1, hour, 0)
    dt_cet = dt_hour + dt.timedelta(hours = 6)
    ask_post_avg_by_hour_cet.append([row[0],dt_cet.hour])

In [26]:
print("Top 5 Hours for Ask Posts Comments in CET")
for row in ask_post_avg_by_hour_cet[:5]:
    print('{}:00: {}'.format(row[1], row[0])) 

Top 5 Hours for Ask Posts Comments in CET
21:00: 28.68
19:00: 16.32
18:00: 12.38
8:00: 11.14
16:00: 10.68
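
The six-hour shift can also be expressed with timezone-aware datetimes. This sketch assumes fixed standard-time offsets (EST = UTC-5, CET = UTC+1) and ignores daylight saving time; `eastern_hour_to_cet` is a hypothetical helper name.

```python
import datetime as dt

# Assumption: fixed standard-time offsets, daylight saving time ignored.
EASTERN = dt.timezone(dt.timedelta(hours=-5), "EST")
CENTRAL_EUROPEAN = dt.timezone(dt.timedelta(hours=1), "CET")


def eastern_hour_to_cet(hour):
    """Map an hour of the day in Eastern Time to the matching hour in CET."""
    eastern_time = dt.datetime(2022, 1, 1, hour, 0, tzinfo=EASTERN)
    return eastern_time.astimezone(CENTRAL_EUROPEAN).hour


for hour in (15, 13, 2, 20):
    print("{:02d}:00 ET -> {:02d}:00 CET".format(hour, eastern_hour_to_cet(hour)))
```

Hours late in the Eastern evening wrap past midnight, so 20:00 Eastern Time maps to 02:00 CET on the following day.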


## Determine whether Show HN or Ask HN posts receive more points on average