# DQ Step 1

## Hacker News Posts - Data Analysis

This is a guided project from the Dataquest educational platform. It examines how many comments posts receive on the popular technology site *Hacker News*, and at what times of day they receive them. Such an analysis can help in understanding general user habits and in improving services or marketing success.

In [1]:
# Opening the dataset for use in the project, then utilizing it as a list of lists.

import pandas as pd      # Importing pandas to view tables more clearly.
from csv import reader
with open('HN_data.csv', encoding="utf8") as opened_file:    # The file is closed automatically after reading.
    read_file = reader(opened_file)
    hn = list(read_file)

pd.DataFrame(hn[:5])

Unnamed: 0,0,1,2,3,4,5,6
0,id,title,url,num_points,num_comments,author,created_at
1,12579008,You have two days to comment if you want stem ...,http://www.regulations.gov/document?D=FDA-2015...,1,0,altstar,9/26/2016 3:26
2,12579005,SQLAR the SQLite Archiver,https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24
3,12578997,What if we just printed a flatscreen televisio...,https://medium.com/vanmoof/our-secrets-out-f21...,1,0,pavel_lishin,9/26/2016 3:19
4,12578989,algorithmic music,http://cacm.acm.org/magazines/2011/7/109891-al...,1,0,poindontcare,9/26/2016 3:16


# DQ Step 2

In [2]:
# Separating out the headers from the data.

headers = hn[0]

hn = hn[1:]

pd.DataFrame(hn)

Unnamed: 0,0,1,2,3,4,5,6
0,12579008,You have two days to comment if you want stem ...,http://www.regulations.gov/document?D=FDA-2015...,1,0,altstar,9/26/2016 3:26
1,12579005,SQLAR the SQLite Archiver,https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24
2,12578997,What if we just printed a flatscreen televisio...,https://medium.com/vanmoof/our-secrets-out-f21...,1,0,pavel_lishin,9/26/2016 3:19
3,12578989,algorithmic music,http://cacm.acm.org/magazines/2011/7/109891-al...,1,0,poindontcare,9/26/2016 3:16
4,12578979,How the Data Vault Enables the Next-Gen Data W...,https://www.talend.com/blog/2016/05/12/talend-...,1,0,markgainor1,9/26/2016 3:14
...,...,...,...,...,...,...,...
293114,10176919,Ask HN: What is/are your favorite quote(s)?,,15,20,kumarski,9/6/2015 6:02
293115,10176917,Attention and awareness in stage magic: turnin...,http://people.cs.uchicago.edu/~luitien/nrn2473...,14,0,stakent,9/6/2015 6:01
293116,10176908,Dying vets fuck you letter (2013),http://dangerousminds.net/comments/dying_vets_...,10,2,mycodebreaks,9/6/2015 5:56
293117,10176907,"PHP 7 Coolest Features: Space Ships, Type Hint...",https://www.zend.com/en/resources/php-7,2,0,Garbage,9/6/2015 5:55


# DQ Step 3

In [3]:
# Splitting the data into posts whose titles start with 'Ask HN', 'Show HN', or neither,
# while accommodating capitalization differences.

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
    
print('The length of "ask posts" data is: {}'.format(len(ask_posts)))
print('The length of "show posts" data is: {}'.format(len(show_posts)))
print('The length of "other posts" data is: {}'.format(len(other_posts)))

The length of "ask posts" data is: 9139
The length of "show posts" data is: 10158
The length of "other posts" data is: 273822
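The prefix test above can be checked on a handful of titles before running it over the whole dataset. The following is a minimal sketch: `classify_title` is a hypothetical helper that is not part of the original notebook, and the third sample title is invented to exercise the lowercase case.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


sample_titles = [
    'Ask HN: What is/are your favorite quote(s)?',   # title taken from the dataset preview
    'SQLAR the SQLite Archiver',                     # title taken from the dataset preview
    'show hn: a made-up lowercase title',            # invented title, for illustration only
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Lowercasing once and then testing the prefix keeps the comparison independent of how the author capitalized "Ask HN" or "Show HN".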


# DQ Step 4

In [4]:
# Finding the average number of comments per post for 'ask' versus 'show' posts.

total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average ask post comments: {}'.format(avg_ask_comments))

for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments
    
avg_show_comments = total_show_comments / len(show_posts)
print('Average show post comments: {}'.format(avg_show_comments))

Average ask post comments: 10.393478498741656
Average show post comments: 4.886099625910612


### Comparing the Averages

On average, posts that begin with "Ask HN" receive more than twice as many comments as posts that begin with "Show HN" (about 10.4 versus 4.9 comments per post). One plausible explanation is that people are more inclined to share their own knowledge in reply to a question than to comment on someone else's work, although this dataset alone cannot confirm that motive.
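To make the size of the gap concrete, the ratio between the two averages can be computed directly. This sketch simply reuses the two values printed in Step 4; the variable `ratio` is introduced here for illustration.

```python
# Averages copied from the Step 4 output.
avg_ask_comments = 10.393478498741656
avg_show_comments = 4.886099625910612

# How many times more comments an average Ask HN post receives.
ratio = avg_ask_comments / avg_show_comments
print('Ask HN posts average {:.2f} times as many comments as Show HN posts.'.format(ratio))
```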

# DQ Step 5

In [5]:
# Segregating 'Ask' posts by hour of day, and comparing total number of comments.

import datetime as dt

result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])    # Creating a time/comment list for each post.
    
    
counts_by_hour = {}
comments_by_hour = {}

#print(result_list[0])

for row in result_list:    # Extracting the hour
    # Example format:    9/6/2015 6:02
   
    post_hour = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')
    
    # Note: post_hour.hour would return an int (e.g. 6), which strftime cannot format.
    # strftime is used instead so the hour is a zero-padded string (e.g. '06').
    
    post_hour = post_hour.strftime('%H')    # Using strftime to isolate hour

    if post_hour not in counts_by_hour:
        counts_by_hour[post_hour] = 1
        comments_by_hour[post_hour] = row[1]
    else:
        counts_by_hour[post_hour] += 1
        comments_by_hour[post_hour] += row[1]
    
    
print('At 12 noon there were {} posts and {} comments.'.format(counts_by_hour['12'], comments_by_hour['12']))

At 12 noon there were 342 posts and 4234 comments.
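The hour extraction and tallying in this step can be tried on a few rows in isolation. The sketch below is an alternative formulation that assumes `collections.defaultdict` in place of the explicit `if`/`else` branch; the third sample row is invented so that one hour appears twice.

```python
import datetime as dt
from collections import defaultdict

# [created_at, num_comments] pairs in the dataset's date format.
result_list = [
    ['9/6/2015 6:02', 20],    # from the dataset preview
    ['9/26/2016 3:26', 0],    # from the dataset preview
    ['9/6/2015 6:45', 5],     # invented row, for illustration only
]

counts_by_hour = defaultdict(int)      # missing hours start at 0
comments_by_hour = defaultdict(int)

for created_at, num_comments in result_list:
    parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    hour = parsed.strftime('%H')       # zero-padded string such as '06'
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```

Because `defaultdict(int)` supplies 0 for a key that has not been seen yet, both dictionaries can be updated with `+=` without first checking whether the hour exists.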


# DQ Step 6

In [6]:
# Converting the comments_by_hour and counts_by_hour dictionaries to a list of lists
# that contains the average number of comments per post for each hour instead of a total.

avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

avg_by_hour = sorted(avg_by_hour)

print(pd.DataFrame(avg_by_hour))

     0          1
0   00   7.564784
1   01   7.407801
2   02  11.137546
3   03   7.948339
4   04   9.711934
5   05   8.794258
6   06   6.782051
7   07   7.013274
8   08   9.190661
9   09   6.653153
10  10  10.684397
11  11   8.964744
12  12  12.380117
13  13  16.317568
14  14   9.692008
15  15  28.676471
16  16   7.713299
17  17   9.449744
18  18   7.942997
19  19   7.163043
20  20   8.749020
21  21   8.687259
22  22   8.804178
23  23   6.696793
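As a quick consistency check, the hour-12 figures printed in Step 5 (342 posts and 4234 comments) should reproduce the 12.380117 average shown in the table. The sketch below rebuilds that single entry with a list comprehension; the one-entry dictionaries are stand-ins for the full ones.

```python
# Single-hour stand-ins for the full dictionaries, using the Step 5 output.
counts_by_hour = {'12': 342}
comments_by_hour = {'12': 4234}

# One [hour, average comments per post] pair per hour, in chronological order.
avg_by_hour = [
    [hour, comments_by_hour[hour] / counts_by_hour[hour]]
    for hour in sorted(comments_by_hour)
]

print(avg_by_hour)
```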


# DQ Step 7

In [7]:
# Creating a new list of averages to sort by highest average instead of chronologically.

swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(pd.DataFrame(swap_avg_by_hour))

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print('\n')
print('Top 5 Hours For Ask Posts Comments (US EST)')

for row in sorted_swap[:5]:    # Slicing to 5 rows to match the 'Top 5' heading.
    formatted_time = dt.datetime.strptime(row[1], '%H')
    formatted_time = formatted_time.strftime('%H:%M')
    print('{}: {:.2f} average comments per post'.format(formatted_time, row[0]))

            0   1
0    7.564784  00
1    7.407801  01
2   11.137546  02
3    7.948339  03
4    9.711934  04
5    8.794258  05
6    6.782051  06
7    7.013274  07
8    9.190661  08
9    6.653153  09
10  10.684397  10
11   8.964744  11
12  12.380117  12
13  16.317568  13
14   9.692008  14
15  28.676471  15
16   7.713299  16
17   9.449744  17
18   7.942997  18
19   7.163043  19
20   8.749020  20
21   8.687259  21
22   8.804178  22
23   6.696793  23


Top 5 Hours For Ask Posts Comments (US EST)
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


## Analysis Results:

According to this data, Ask HN posts created during the 3 PM hour (EST) receive the most comments on average, at about 28.68 per post.
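The same ranking can be obtained without building a swapped copy of the list, by passing a `key` function to `sorted`. This is an alternative sketch rather than the method used in the notebook, and it runs on a six-hour subset of the averages printed in Step 6.

```python
# Subset of [hour, average comments per post] pairs copied from the Step 6 output.
avg_by_hour = [
    ['02', 11.137546],
    ['10', 10.684397],
    ['12', 12.380117],
    ['13', 16.317568],
    ['15', 28.676471],
    ['16', 7.713299],
]

# Sort by the average (second element), highest first, and keep the top five.
top_hours = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)[:5]

for hour, average in top_hours:
    print('{}:00: {:.2f} average comments per post'.format(hour, average))
```

Sorting with a key leaves the original `[hour, average]` layout untouched, so no second list has to be kept in sync.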

Interestingly, the averages drop off sharply in the hours immediately before and after (2 PM and 4 PM). Considering only the time zone in which the data is recorded (EST), there is no obvious reason why posts created at this hour would attract more attention, and consequently more comments.

To understand this better, it would be necessary to examine what 3 PM EST corresponds to in other time zones. My hypothesis is that 3 PM EST is a confluence of several circumstances that favor commenting globally, such as typical wake-up and sleep times, lunch breaks, or the end of the average workday. These factors could then be compared against the population density of those time zones and the availability of internet access within them.
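As a starting point for that comparison, the sketch below converts 3 PM EST into a few other zones. It assumes fixed UTC offsets and ignores daylight saving time, so the results are approximate; the choice of zones and the date are arbitrary and only for illustration.

```python
import datetime as dt

# Fixed-offset zones (assumption: no daylight saving adjustments).
est = dt.timezone(dt.timedelta(hours=-5), 'EST')
zones = {
    'UTC': dt.timezone.utc,
    'PST (UTC-8)': dt.timezone(dt.timedelta(hours=-8)),
    'CET (UTC+1)': dt.timezone(dt.timedelta(hours=1)),
    'IST (UTC+5:30)': dt.timezone(dt.timedelta(hours=5, minutes=30)),
}

# 3 PM EST on an arbitrary date; only the time of day matters here.
peak = dt.datetime(2016, 1, 15, 15, 0, tzinfo=est)

# The same instant expressed as local wall-clock time in each zone.
local_times = {
    name: peak.astimezone(zone).strftime('%H:%M')
    for name, zone in zones.items()
}

for name, local_time in local_times.items():
    print('{}: {}'.format(name, local_time))
```

A fuller analysis would use real time zone rules (for example via the standard `zoneinfo` module) so that daylight saving shifts across the dataset's date range are accounted for.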