# Guided Project: Exploring Hacker News Posts

## Part 1 of 8: Introduction

In [1]:
from csv import reader

open_file = open("/home/lumenetix/Documents/python/data_quest/project2/HN_posts_year_to_Sep_26_2016.csv")
read_file = reader(open_file)
hn = list(read_file)
print(hn[:6])


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


Below are descriptions of the columns:

id: The unique identifier from Hacker News for the post

title: The title of the post

url: The URL that the post links to, if the post has a URL

num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes

num_comments: The number of comments that were made on the post

author: The username of the person who submitted the post

created_at: The date and time at which the post was submitted
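
As a side note, the columns described above can also be looked up by name rather than by position. The following is a minimal sketch, assuming the standard library's `csv.DictReader` is acceptable here; the one-row `sample_csv` is a stand-in built from a row shown later in this notebook, not the full file.

```python
import csv
import io

# Stand-in for the real file: a header plus one row in the same layout.
# With the real dataset, this would be an open file object instead.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12578908,Ask HN: What TLD do you use for local development?,,4,7,Sevrene,9/26/2016 2:53\n"
)

rows = list(csv.DictReader(sample_csv))
first = rows[0]

# Every value is read as a string, so numeric columns need an explicit int().
print(first["title"])
print(int(first["num_comments"]))
```

The rest of this notebook keeps using positional indexes (for example `row[4]` for num_comments), so this is only an alternative way to read the same columns.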

### What we are interested in:

Posts whose titles begin with Ask HN or Show HN.

### Like:

Ask HN: How to improve my personal website?

Ask HN: Am I the only one outraged by Twitter shutting down share counts?

Ask HN: Any recent changes to CSS that broke mobile?

### or:

Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform

Show HN: Something pointless I made



### We want to compare these two types of posts to determine:

A. Do Ask HN or Show HN posts receive more comments on average?

B. Do posts created at a certain time receive more comments on average?

## Part 2 of 8: Removing Headers from a List of Lists

Before analyzing the data, we remove the column header row so that every remaining row in the list is a post.

In [2]:
headers = hn[0]
hn = hn[1:]

print(headers)
print("\n")
print(hn[0])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
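
The slicing above works on a list that is already in memory. An alternative, sketched here under the assumption that the header is always the first line of the file, is to pull the header off the reader with `next()` before building the list. The tiny `sample` text is hypothetical and only stands in for the real file.

```python
from csv import reader
import io

# Hypothetical three-line stand-in for the CSV file.
sample = io.StringIO("id,title\n1,First post\n2,Second post\n")

rows = reader(sample)
header = next(rows)   # consumes only the first line
data = list(rows)     # everything that follows the header

print(header)
print(data)
```

Either approach leaves the header and the data rows in separate variables; the slicing version is what the rest of this notebook relies on.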


## Part 3 of 8: Extracting Ask HN and Show HN Posts

We can now start filtering the data. We need to create new lists containing only the posts we are interested in, i.e. Ask HN and Show HN posts, with everything else collected in a third list.


In [3]:
ask_posts = []
show_posts = []
other_posts = []

# Lowercase each title so the prefix check is case-insensitive,
# then check what the title starts with
# and place the row in the matching list.
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(ask_posts[:5])
print('\n')
print(len(show_posts))
print('\n')
print(len(other_posts))
        

9139
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]


10158


273822
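
The prefix check used in the loop can also be factored into a small function, which makes it easy to try on individual titles. This is a sketch only; `classify_title` is a hypothetical helper name and is not used elsewhere in this notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Titles taken from the examples and output shown earlier.
sample_titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "algorithmic music",
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```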


## Part 4 of 8: Calculating the Average Number of Comments for Ask HN and Show HN Posts

We want to determine whether ask posts or show posts receive more comments on average.

In [4]:
total_ask_comments = 0
ask_counter = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    ask_counter += 1
    
avg_ask_comments = total_ask_comments / ask_counter
print(avg_ask_comments)


total_show_comments = 0
show_counter = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    show_counter += 1
    
avg_show_comments = total_show_comments / show_counter
print(avg_show_comments)

print(avg_ask_comments - avg_show_comments)

10.393478498741656
4.886099625910612
5.507378872831044


Ask posts receive about 5.5 more comments per post on average (10.39 versus 4.89), even though there are about 1,000 more show posts than ask posts. Since ask posts are more likely to receive comments, we'll focus the rest of the comment analysis on them.
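
The two averaging loops follow the same pattern, so they could be collapsed into one function. The sketch below assumes a hypothetical helper named `average_column`; the three sample rows are the first three ask posts printed in Part 3.

```python
def average_column(rows, index):
    """Mean of an integer column; returns 0.0 when there are no rows."""
    if not rows:
        return 0.0
    return sum(int(row[index]) for row in rows) / len(rows)


sample_rows = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
]

avg_comments = average_column(sample_rows, 4)  # column 4 is num_comments
avg_points = average_column(sample_rows, 3)    # column 3 is num_points
print(avg_comments, avg_points)
```

With the full lists, `average_column(ask_posts, 4)` and `average_column(show_posts, 4)` would be expected to reproduce the two averages printed above.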

## Part 5 of 8: Finding the Number of Ask Posts and Comments by Hour Created

In [5]:
import datetime as dt

# we'll calculate for ask posts first
ask_result_list = []

# Pair each ask post's creation time with its number of comments.
for row in ask_posts:
    sub_list = []
    created_at = row[6]
    num_comments = int(row[4])
    sub_list.append(created_at)
    sub_list.append(num_comments)
    ask_result_list.append(sub_list)
    
ask_counts_by_hour = {}
ask_comments_by_hour = {}

# Timestamps look like "9/26/2016 3:26".
format_string = "%m/%d/%Y %H:%M"

for row in ask_result_list:
    date = row[0]
    comment = row[1]
    dt_object = dt.datetime.strptime(date, format_string)
    dt_hour = dt_object.hour
    if dt_hour not in ask_counts_by_hour:
        ask_counts_by_hour[dt_hour] = 1
        ask_comments_by_hour[dt_hour] = comment
    else:
        ask_counts_by_hour[dt_hour] += 1
        ask_comments_by_hour[dt_hour] += comment
        
print(ask_counts_by_hour)
print(ask_comments_by_hour)

#we'll repeat the process for show posts


{2: 269, 1: 282, 22: 383, 21: 518, 19: 552, 17: 587, 15: 646, 14: 513, 13: 444, 11: 312, 10: 282, 9: 222, 7: 226, 3: 271, 23: 343, 20: 510, 16: 579, 8: 257, 0: 301, 18: 614, 12: 342, 4: 243, 6: 234, 5: 209}
{2: 2996, 1: 2089, 22: 3372, 21: 4500, 19: 3954, 17: 5547, 15: 18525, 14: 4972, 13: 7245, 11: 2797, 10: 3013, 9: 1477, 7: 1585, 3: 2154, 23: 2297, 20: 4462, 16: 4466, 8: 2362, 0: 2277, 18: 4877, 12: 4234, 4: 2360, 6: 1587, 5: 1838}


Now we have two dictionaries:

ask_counts_by_hour: contains the number of ask posts created during each hour of the day.

ask_comments_by_hour: contains the total number of comments received by ask posts created during each hour.
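
The "check whether the key exists, then add" pattern used to build these dictionaries can be shortened with `collections.defaultdict`. The following is a sketch of that variant, not the code used in this notebook; the four (created_at, num_comments) pairs come from the ask posts printed in Part 3.

```python
import datetime as dt
from collections import defaultdict

FORMAT = "%m/%d/%Y %H:%M"

# (created_at, num_comments) pairs from four ask posts shown earlier.
sample = [
    ("9/26/2016 2:53", 7),
    ("9/26/2016 1:17", 3),
    ("9/25/2016 22:57", 0),
    ("9/25/2016 22:48", 3),
]

counts = defaultdict(int)    # missing hours start at 0
comments = defaultdict(int)

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, FORMAT).hour
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```

Because a missing key defaults to 0, the `if ... not in ...` branch is no longer needed.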

## Part 6 of 8: Calculating the Average Number of Comments for Ask HN Posts by Hour

We can use these two dictionaries to calculate the average number of comments for ask posts created during each hour of the day.

We start with an empty list, then iterate over the keys of ask_comments_by_hour. For each hour we append a two-element list: the hour itself, followed by the total comments for that hour divided by the number of posts created in that hour.

In [7]:
ac_avg_by_hour = []

# Each entry is [hour, total comments / number of posts] for that hour.
for hour in ask_comments_by_hour:
    ac_avg_by_hour.append([hour, ask_comments_by_hour[hour] / ask_counts_by_hour[hour]])
    
print(ac_avg_by_hour)

[[2, 11.137546468401487], [1, 7.407801418439717], [22, 8.804177545691905], [21, 8.687258687258687], [19, 7.163043478260869], [17, 9.449744463373083], [15, 28.676470588235293], [14, 9.692007797270955], [13, 16.31756756756757], [11, 8.96474358974359], [10, 10.684397163120567], [9, 6.653153153153153], [7, 7.013274336283186], [3, 7.948339483394834], [23, 6.696793002915452], [20, 8.749019607843136], [16, 7.713298791018998], [8, 9.190661478599221], [0, 7.5647840531561465], [18, 7.94299674267101], [12, 12.380116959064328], [4, 9.7119341563786], [6, 6.782051282051282], [5, 8.794258373205741]]
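
As a quick check of the arithmetic, the same division can be written as a dictionary comprehension. This sketch uses only two hours, with counts and totals copied from the Part 5 output, and the variable names are illustrative.

```python
# Two hours copied from the Part 5 output.
counts_by_hour = {15: 646, 13: 444}
comments_by_hour = {15: 18525, 13: 7245}

# Average comments per post = total comments / number of posts.
avg_by_hour = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in comments_by_hour
}

for hour, avg in avg_by_hour.items():
    print(hour, round(avg, 2))
```

The values agree with the corresponding entries in the list printed above (28.68 for hour 15 and 16.32 for hour 13 when rounded to two decimals).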


## Part 7 of 8: Sorting and Printing Values from a List of Lists

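We now have the average number of comments for ask posts created during each hour of the day.

To make the hours with the highest averages easier to identify, we swap each pair so that the average comes first, sort the list in descending order, and print the hours from the highest average to the lowest. The first five lines printed are the five hours with the highest averages.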

In [9]:
swap_avg_by_hour = []
for row in ac_avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

avg_format = "{}:00: {:.2f} average comments per post"
for row in sorted_swap:
    print(avg_format.format(row[1], row[0]))

[[11.137546468401487, 2], [7.407801418439717, 1], [8.804177545691905, 22], [8.687258687258687, 21], [7.163043478260869, 19], [9.449744463373083, 17], [28.676470588235293, 15], [9.692007797270955, 14], [16.31756756756757, 13], [8.96474358974359, 11], [10.684397163120567, 10], [6.653153153153153, 9], [7.013274336283186, 7], [7.948339483394834, 3], [6.696793002915452, 23], [8.749019607843136, 20], [7.713298791018998, 16], [9.190661478599221, 8], [7.5647840531561465, 0], [7.94299674267101, 18], [12.380116959064328, 12], [9.7119341563786, 4], [6.782051282051282, 6], [8.794258373205741, 5]]
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
2:00: 11.14 average comments per post
10:00: 10.68 average comments per post
4:00: 9.71 average comments per post
14:00: 9.69 average comments per post
17:00: 9.45 average comments per post
8:00: 9.19 average comments per post
11:00: 8.96 average comments per post
22:00: 8.80 average comments per post

Based on these averages, an ask post should be created around 15:00 (3 pm) to maximize its chances of receiving comments: posts created during that hour averaged 28.68 comments each.
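
The swap step is one way to sort by the average. Another option, sketched here as an alternative rather than as the approach used above, is to pass a `key` function to `sorted` and slice off the first five results. The sample pairs are a subset of the averages above, rounded to two decimals.

```python
# [hour, average comments] pairs, rounded from the Part 6 output.
avg_by_hour = [
    [2, 11.14], [15, 28.68], [13, 16.32], [12, 12.38],
    [10, 10.68], [4, 9.71], [9, 6.65],
]

# Sort on the second element (the average), highest first, and keep five.
top_five = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_five:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

This avoids building the swapped list and limits the printout to the five highest hours.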

## Part 8 of 8: Extra Steps

Now we want to try to determine other important details:
1. Determine if show or ask posts receive more points on average.
2. Determine if posts created at a certain time are more likely to receive more points.
3. Compare your results to the average number of comments and points other posts receive.

First, we'll determine whether show or ask posts receive more points on average.

In [10]:
total_ask_points = 0

for row in ask_posts:
    num_points = int(row[3])
    total_ask_points += num_points
    
avg_ask_points = total_ask_points / ask_counter

total_show_points = 0

for row in show_posts:
    num_points = int(row[3])
    total_show_points += num_points
    
avg_show_points = total_show_points / show_counter

print(avg_ask_points)
print(avg_show_points)

11.31174089068826
14.843571569206537


On average, show posts receive more points than ask posts: about 14.84 versus 11.31, a difference of roughly 3.5 points.

Next, we'll determine whether posts created at a certain time are more likely to receive more points. We'll start by calculating the totals for ask posts.

In [11]:
ap_rlist = []

# Pair each ask post's creation time with its points (column 3).
for row in ask_posts:
    sub_list = []
    created_at = row[6]
    num_points = int(row[3])
    sub_list.append(created_at)
    sub_list.append(num_points)
    ap_rlist.append(sub_list)
    
ask_counts_by_hour = {}
ask_points_by_hour = {}

for row in ap_rlist:
    date = row[0]
    points = row[1]
    format_string = "%m/%d/%Y %H:%M"
    dt_object = dt.datetime.strptime(date, format_string)
    dt_hour = dt_object.hour
    if dt_hour not in ask_counts_by_hour:
        ask_counts_by_hour[dt_hour] = 1
        ask_points_by_hour[dt_hour] = points
    else:
        ask_counts_by_hour[dt_hour] += 1
        ask_points_by_hour[dt_hour] += points
        
print(ask_counts_by_hour)
print(ask_points_by_hour)


{2: 269, 1: 282, 22: 383, 21: 518, 19: 552, 17: 587, 15: 646, 14: 513, 13: 444, 11: 312, 10: 282, 9: 222, 7: 226, 3: 271, 23: 343, 20: 510, 16: 579, 8: 257, 0: 301, 18: 614, 12: 342, 4: 243, 6: 234, 5: 209}
{2: 2944, 1: 2662, 22: 3601, 21: 5042, 19: 4782, 17: 7155, 15: 13978, 14: 5390, 13: 7962, 11: 2856, 10: 3789, 9: 1763, 7: 2040, 3: 2539, 23: 2616, 20: 4491, 16: 5970, 8: 2744, 0: 2835, 18: 6850, 12: 4643, 4: 2650, 6: 2030, 5: 2046}


Now we want to calculate these dictionaries for show posts.

In [12]:
sp_rlist = []

# Pair each show post's creation time with its points (column 3).
for row in show_posts:
    sub_list = []
    created_at = row[6]
    num_points = int(row[3])
    sub_list.append(created_at)
    sub_list.append(num_points)
    sp_rlist.append(sub_list)
    
show_counts_by_hour = {}
show_points_by_hour = {}

for row in sp_rlist:
    date = row[0]
    points = row[1]
    format_string = "%m/%d/%Y %H:%M"
    dt_object = dt.datetime.strptime(date, format_string)
    dt_hour = dt_object.hour
    if dt_hour not in show_counts_by_hour:
        show_counts_by_hour[dt_hour] = 1
        show_points_by_hour[dt_hour] = points
    else:
        show_counts_by_hour[dt_hour] += 1
        show_points_by_hour[dt_hour] += points
        
print(show_counts_by_hour)
print(show_points_by_hour)


{0: 276, 23: 319, 20: 525, 19: 556, 18: 656, 16: 801, 14: 696, 10: 323, 9: 302, 8: 316, 6: 192, 3: 206, 21: 430, 17: 761, 15: 836, 11: 402, 7: 236, 4: 194, 13: 610, 12: 516, 1: 247, 22: 377, 2: 209, 5: 172}
{0: 4291, 23: 5060, 20: 6948, 19: 8928, 18: 9935, 16: 11487, 14: 10503, 10: 4303, 9: 3762, 8: 4640, 6: 3071, 3: 2168, 21: 5990, 17: 10563, 15: 11657, 11: 7742, 7: 3303, 4: 2707, 13: 10381, 12: 10787, 1: 2931, 22: 5026, 2: 2764, 5: 1834}
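
A natural next step is to turn these totals into averages per hour, as was done for comments. The sketch below assumes a hypothetical helper named `average_by_hour` and, to stay short, uses only the hour-15 entries copied from the ask and show outputs above, so it illustrates the calculation rather than the full ranking.

```python
def average_by_hour(totals, counts):
    """Divide each hour's total by the number of posts created in that hour."""
    return {hour: totals[hour] / counts[hour] for hour in totals}


# Hour-15 entries copied from the printed dictionaries.
ask_counts = {15: 646}
ask_points = {15: 13978}
show_counts = {15: 836}
show_points = {15: 11657}

ask_avg = average_by_hour(ask_points, ask_counts)
show_avg = average_by_hour(show_points, show_counts)

print("ask  15:00 -> {:.2f} points per post".format(ask_avg[15]))
print("show 15:00 -> {:.2f} points per post".format(show_avg[15]))
```

Passing the full `ask_points_by_hour` / `ask_counts_by_hour` and `show_points_by_hour` / `show_counts_by_hour` dictionaries to the same function would give the averages for every hour, which could then be sorted as in Part 7.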
