## Exploring Hacker News Posts
### Introduction
In this project, we'll work with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/).

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

Data set source: [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts).

**Note:**
The data set has been reduced from almost 300,000 rows to approximately 20,000 by removing the submissions without comments, and then randomly sampling from the remaining submissions.


In [1]:
#imports
import csv

In [2]:
# read data
with open('hacker_news.csv', encoding='utf-8') as f:
    hn = list(csv.reader(f))
hn[:4]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20']]

Let's describe the columns:
- `id`: The unique identifier from Hacker News for the post
- `title`: The title of the post
- `url`: The URL that the post links to, if the post has a URL
- `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: The number of comments that were made on the post
- `author`: The username of the person who submitted the post
- `created_at`: The date and time at which the post was submitted
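The cells below access columns by position (for example `row[4]` for `num_comments`). As a minimal sketch, the header row can be turned into a name-to-position lookup so the column names listed above can be used directly. The name `col_index` is an assumption made for illustration and is not used elsewhere in this notebook.

```python
# Header row and one data row copied from the output above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/', '386', '52',
       'ne0phyte', '8/4/2016 11:52']

# Map each column name to its position (col_index is an illustrative name).
col_index = {name: position for position, name in enumerate(headers)}

title = row[col_index['title']]
# csv.reader yields every field as a string, so numbers need converting.
num_comments = int(row[col_index['num_comments']])

print(title, num_comments)
```

This keeps the meaning of each index visible at the point of use, at the cost of one extra dictionary lookup per access.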

### The Project's Target
We are specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question and `Show HN` posts to show a project, product or just something interesting.

We'll compare these two types of posts to determine the following:
- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

### Removing Headers from a List of Lists

In [3]:
headers = hn[0]
headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [4]:
hn = hn[1:] # reassigned the data without headers to the *hn* variable
hn[:2]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30']]

### Extracting *Ask HN* and *Show HN* Posts

Since we are only concerned with posts whose titles begin with *Ask HN* or *Show HN*, we'll create new lists of lists containing just the data for those posts.

In [5]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
ask_posts[:5]

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14'],
 ['12210105',
  'Ask HN: Looking for Employee #3 How do I do it?',
  '',
  '1',
  '3',
  'sph130',
  '8/2/2016 14:20'],
 ['10394168',
  'Ask HN: Someone offered to buy my browser extension from me. What now?',
  '',
  '28',
  '17',
  'roykolak',
  '10/15/2015 16:38']]

In [6]:
print('Ask HN:', len(ask_posts), 'posts')
print('Show HN:', len(show_posts), 'posts')
print('Other:', len(other_posts), 'posts')

Ask HN: 1744 posts
Show HN: 1162 posts
Other: 17194 posts


We separated the posts by title into three different lists: `ask_posts`, `show_posts` and `other_posts`. Next, we want to determine whether ask posts or show posts receive more comments on average.
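The prefix check used for the separation could also be wrapped in a small function, which makes the rule easy to test on individual titles. This is only a sketch: `classify_title` is a hypothetical helper, and the third title is made up to show that the match is case-insensitive.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # compare in lower case, as in the loop above
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


titles = [
    'Ask HN: How to improve my personal website?',  # from the data set
    'Interactive Dynamic Video',                    # from the data set
    'SHOW HN: an example title',                    # hypothetical title
]

labels = [classify_title(t) for t in titles]
print(labels)
```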

### Calculating the Average Number of Comments for `Ask HN` and `Show HN` Posts

In [7]:
# find the average number of comments for *ask posts*
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
        
avg_ask_comments = total_ask_comments / len(ask_posts)

avg_ask_comments

14.038417431192661

In [8]:
# find the average number of comments for *show posts*
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments / len(show_posts)

avg_show_comments

10.31669535283993

On average, `Ask HN` posts receive more comments than `Show HN` posts: about 14 versus about 10. This is good news for the HN community: people help each other when questions come up.
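The two averaging loops have the same shape, so they could share one helper. The sketch below assumes a hypothetical function name, `average_of_column`, and runs on the five `Ask HN` rows printed earlier rather than on the full lists, so its result differs from the averages reported above.

```python
def average_of_column(rows, index):
    """Average the integer values found at position `index` of each row."""
    if not rows:
        return 0.0  # avoid dividing by zero on an empty list
    return sum(int(row[index]) for row in rows) / len(rows)


# The first five ask posts shown earlier in this notebook.
sample_ask_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '',
     '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '',
     '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '',
     '1', '1', 'polskibus', '5/2/2016 10:14'],
    ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '',
     '1', '3', 'sph130', '8/2/2016 14:20'],
    ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '',
     '28', '17', 'roykolak', '10/15/2015 16:38'],
]

sample_avg_comments = average_of_column(sample_ask_posts, 4)  # num_comments
sample_avg_points = average_of_column(sample_ask_posts, 3)    # num_points
print(sample_avg_comments, sample_avg_points)
```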

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments.

### Finding the Amount of Ask Posts and Comments by Hour Created

In [9]:
import datetime as dt

result_list = []
for post in ask_posts:
    created_at = post[6]
    n_comments = int(post[4])
    result_list.append([created_at, n_comments])
    
result_list[:5]

[['8/16/2016 9:55', 6],
 ['11/22/2015 13:43', 29],
 ['5/2/2016 10:14', 1],
 ['8/2/2016 14:20', 3],
 ['10/15/2015 16:38', 17]]

In [10]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    time = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M").strftime("%H")
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = row[1]
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += row[1]
        
import collections
# a sorted version of the dictionaries for a tidy picture
sort_counts = collections.OrderedDict(sorted(counts_by_hour.items()))
sort_comments = collections.OrderedDict(sorted(comments_by_hour.items()))

sort_comments

OrderedDict([('00', 447),
             ('01', 683),
             ('02', 1381),
             ('03', 421),
             ('04', 337),
             ('05', 464),
             ('06', 397),
             ('07', 267),
             ('08', 492),
             ('09', 251),
             ('10', 793),
             ('11', 641),
             ('12', 687),
             ('13', 1253),
             ('14', 1416),
             ('15', 4477),
             ('16', 1814),
             ('17', 1146),
             ('18', 1439),
             ('19', 1188),
             ('20', 1722),
             ('21', 1745),
             ('22', 479),
             ('23', 543)])

Next, we'll use the dictionaries created to calculate the average number of comments for posts created during each hour of the day.
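The counting loop above can also be written with `collections.defaultdict`, which removes the need for the `if time not in ...` branch. The sketch below is one possible variant, not the approach used in this notebook. It uses three rows from `result_list` plus one hypothetical row, added so that one hour has more than one post.

```python
import datetime as dt
from collections import defaultdict

result_list = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/16/2016 9:10', 4],  # hypothetical row, not from the data set
]

counts_by_hour = defaultdict(int)    # missing keys start at 0
comments_by_hour = defaultdict(int)

for created_at, n_comments in result_list:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += n_comments

# Average comments per post for each hour, with keys in sorted order.
avg_by_hour = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in sorted(comments_by_hour)
}
print(avg_by_hour)
```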

### Calculating the Average Number of Comments for Ask HN Posts by Hour

In [11]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

This format is hard to read because the hours appear in no particular order. Let's sort the list in the code below.

### Sorting and Printing Values from a List of Lists

In [12]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [13]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]

In [14]:
print("Top 5 Hours for Ask Posts Comments")
print('\n')

Top 5 Hours for Ask Posts Comments




In [15]:
for avg, hour in sorted_swap[:5]:
    print(
          "{}: {:.2f} average comments per post".format(dt.datetime.strptime(hour, "%H").strftime("%H:%M"), avg))

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
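The swap-then-sort steps above could be condensed by passing a `key` function to `sorted()`, which sorts on the average without reordering the columns. A minimal sketch on four of the `avg_by_hour` entries shown earlier (the name `top_hours` is illustrative):

```python
# A subset of the [hour, average] pairs computed above.
avg_by_hour = [
    ['09', 5.5777777777777775],
    ['13', 14.741176470588234],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

# Sort on the second element (the average), highest first.
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

One difference to keep in mind: when two hours have the same average, this version keeps them in their original order, while the swapped lists fall back to comparing the hour strings.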


### Conclusion
Ask posts created between 3:00 pm and 4:00 pm receive the most comments, with an average of 38.6 comments per post.

According to the data set [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts/home), the `created_at` column uses US Eastern Time.

### Further Exploration

Let's examine the `num_points` column for the ask and show posts.

### Calculating the Average Number of Points for `Ask HN` and `Show HN` Posts

In [16]:
# find the average number of points for *ask posts*
total_ask_points = 0
for row in ask_posts:
    total_ask_points += int(row[3])
        
avg_ask_points = total_ask_points / len(ask_posts)

avg_ask_points

15.061926605504587

In [17]:
# find the average number of points for *show posts*
total_show_points = 0
for row in show_posts:
    total_show_points += int(row[3])
        
avg_show_points = total_show_points / len(show_posts)

avg_show_points

27.555077452667813

Compared to the `comments` analysis, the `points` picture is completely different: `show posts` average 27.6 points, almost twice the average of 15.1 for `ask posts`.

Let's find the hours at which `show posts` earn the most points on average.

### Finding the Amount of Show Posts and Points by Hour Created

In [18]:
headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [19]:
result_list = []

for post in show_posts:
    result_list.append([post[-1], int(post[3])])
    
result_list[:3]

[['11/25/2015 14:03', 26], ['11/29/2015 22:46', 747], ['4/28/2016 18:05', 1]]

In [20]:
counts_by_hour_p = {}
points_by_hour = {}

for post in result_list:
    time = dt.datetime.strptime(post[0], "%m/%d/%Y %H:%M").strftime("%H")
    if time not in counts_by_hour_p:
        counts_by_hour_p[time] = 1
        points_by_hour[time] = post[1]
    else:
        counts_by_hour_p[time] += 1
        points_by_hour[time] += post[1]
        
points_by_hour
    

{'14': 2187,
 '22': 1856,
 '18': 2215,
 '07': 494,
 '20': 1819,
 '05': 104,
 '16': 2634,
 '19': 1702,
 '15': 2228,
 '03': 679,
 '17': 2521,
 '06': 375,
 '02': 340,
 '13': 2438,
 '08': 519,
 '21': 866,
 '04': 386,
 '11': 1480,
 '12': 2543,
 '23': 1526,
 '09': 553,
 '01': 700,
 '10': 681,
 '00': 1173}

### Calculating the Average Number of Points for Show HN Posts by Hour

In [21]:
avg_by_hour_p = []

for hour, value in points_by_hour.items():
    avg_by_hour_p.append([hour, value / counts_by_hour_p[hour]])
    
avg_by_hour_p

[['14', 25.430232558139537],
 ['22', 40.34782608695652],
 ['18', 36.31147540983606],
 ['07', 19.0],
 ['20', 30.316666666666666],
 ['05', 5.473684210526316],
 ['16', 28.322580645161292],
 ['19', 30.945454545454545],
 ['15', 28.564102564102566],
 ['03', 25.14814814814815],
 ['17', 27.107526881720432],
 ['06', 23.4375],
 ['02', 11.333333333333334],
 ['13', 24.626262626262626],
 ['08', 15.264705882352942],
 ['21', 18.425531914893618],
 ['04', 14.846153846153847],
 ['11', 33.63636363636363],
 ['12', 41.68852459016394],
 ['23', 42.388888888888886],
 ['09', 18.433333333333334],
 ['01', 25.0],
 ['10', 18.916666666666668],
 ['00', 37.83870967741935]]

### Sorting the Average Number of Points by Hour

In [22]:
swap_avg_by_hour_p = []

for row in avg_by_hour_p:
    swap_avg_by_hour_p.append([row[1], row[0]])
    
swap_avg_by_hour_p

[[25.430232558139537, '14'],
 [40.34782608695652, '22'],
 [36.31147540983606, '18'],
 [19.0, '07'],
 [30.316666666666666, '20'],
 [5.473684210526316, '05'],
 [28.322580645161292, '16'],
 [30.945454545454545, '19'],
 [28.564102564102566, '15'],
 [25.14814814814815, '03'],
 [27.107526881720432, '17'],
 [23.4375, '06'],
 [11.333333333333334, '02'],
 [24.626262626262626, '13'],
 [15.264705882352942, '08'],
 [18.425531914893618, '21'],
 [14.846153846153847, '04'],
 [33.63636363636363, '11'],
 [41.68852459016394, '12'],
 [42.388888888888886, '23'],
 [18.433333333333334, '09'],
 [25.0, '01'],
 [18.916666666666668, '10'],
 [37.83870967741935, '00']]

In [23]:
# sorting by the average number of points
sorted_swap_p = sorted(swap_avg_by_hour_p, reverse=True)

sorted_swap_p

[[42.388888888888886, '23'],
 [41.68852459016394, '12'],
 [40.34782608695652, '22'],
 [37.83870967741935, '00'],
 [36.31147540983606, '18'],
 [33.63636363636363, '11'],
 [30.945454545454545, '19'],
 [30.316666666666666, '20'],
 [28.564102564102566, '15'],
 [28.322580645161292, '16'],
 [27.107526881720432, '17'],
 [25.430232558139537, '14'],
 [25.14814814814815, '03'],
 [25.0, '01'],
 [24.626262626262626, '13'],
 [23.4375, '06'],
 [19.0, '07'],
 [18.916666666666668, '10'],
 [18.433333333333334, '09'],
 [18.425531914893618, '21'],
 [15.264705882352942, '08'],
 [14.846153846153847, '04'],
 [11.333333333333334, '02'],
 [5.473684210526316, '05']]

In [24]:
print("Top 5 Hours for Show Posts Points")
print('\n')
for avg, hour in sorted_swap_p[:5]:
    print("{}: {:.2f} average points per post".format(dt.datetime.strptime(hour, "%H").strftime("%H:%M"), avg))
    

Top 5 Hours for Show Posts Points


23:00: 42.39 average points per post
12:00: 41.69 average points per post
22:00: 40.35 average points per post
00:00: 37.84 average points per post
18:00: 36.31 average points per post


### Conclusion

Now we have a clear picture of the average number of points for `show posts` by hour. Late-night posts, created from 10:00 pm to 1:00 am, and midday posts, created from 12:00 pm to 1:00 pm (US Eastern Time), earn the most points on average.
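Translating 24-hour keys such as `'12'` or `'00'` into am/pm ranges by hand is easy to get wrong around noon and midnight. As a small sketch, a helper could build the range label from the hour string. The function name `hour_range_label` is an assumption made for illustration.

```python
def hour_range_label(hour):
    """Turn a 24-hour string such as '15' into '3:00 pm to 4:00 pm'."""

    def fmt(h24):
        h24 %= 24                      # 24 wraps around to midnight
        suffix = 'am' if h24 < 12 else 'pm'
        h12 = h24 % 12 or 12           # 0 and 12 both display as 12
        return f"{h12}:00 {suffix}"

    start = int(hour)
    return f"{fmt(start)} to {fmt(start + 1)}"


# The top five hours for show posts, from the output above.
for hour in ['23', '12', '22', '00', '18']:
    print(hour, '->', hour_range_label(hour))
```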