### Objective

We use a data set of Hacker News posts to answer the following two questions:

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Data source: [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts)

We import the `csv` reader and load the data set as a list of lists.

In [8]:
from csv import reader

# Read the whole file into a list of lists; the with-block closes the file.
with open("hacker_news.csv", newline="") as hacker_news_csv:
    hn = list(reader(hacker_news_csv))

Display the first five rows, including the header row:

In [9]:
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


The columns are described below:

- id: the unique identifier from Hacker News for the post
- title: the title of the post
- url: the URL that the post links to, if the post has a URL
- num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: the number of comments on the post
- author: the username of the person who submitted the post
- created_at: the date and time of the post's submission
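
Because each row is a plain list, the later cells address columns by position (for example `post[4]` for `num_comments`). A minimal sketch of one way to look positions up by name instead; the `column_index` dictionary and the two sample rows are illustrative and not part of the original notebook:

```python
# Illustrative sample in the same shape as the data set (header row + rows).
rows = [
    ["id", "title", "url", "num_points", "num_comments", "author", "created_at"],
    ["12224879", "Interactive Dynamic Video",
     "http://www.interactivedynamicvideo.com/", "386", "52", "ne0phyte",
     "8/4/2016 11:52"],
]

header, data = rows[0], rows[1:]

# Hypothetical helper mapping: column name -> position in each row.
column_index = {name: position for position, name in enumerate(header)}

first_post = data[0]
print(column_index["num_comments"])                   # 4
print(int(first_post[column_index["num_comments"]]))  # 52
print(first_post[column_index["created_at"]])         # 8/4/2016 11:52
```

The trade-off is one extra dictionary lookup per access in exchange for code that still reads correctly if the column order changes.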

Store the header row in a separate list and remove it from the main data set.

In [11]:
headers = hn[:1]
print(headers)

hn = hn[1:]
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


Since we're mainly concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles, and collect all remaining posts in a third list for comparison.

In [16]:
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(post)
    elif title.lower().startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)
        
print("Number of Ask HN posts: {}".format(len(ask_posts)))
print("Number of Show HN posts: {}".format(len(show_posts)))
print("Number of other posts: {}".format(len(other_posts)))

Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of other posts: 17194


### Do Ask HN or Show HN posts receive more comments on average?

In [20]:
def count_comments(posts):
    total_comments = 0
    for post in posts:
        total_comments += int(post[4])
    return total_comments

total_comments_on_ask_posts = count_comments(ask_posts)
avg_ask_comments = total_comments_on_ask_posts / len(ask_posts)
print("Average comments on a post with Ask HN: {0:.2f}".format(avg_ask_comments))

total_comments_on_show_posts = count_comments(show_posts)
avg_show_comments = total_comments_on_show_posts / len(show_posts)
print("Average comments on a post with Show HN: {0:.2f}".format(avg_show_comments))

total_comments_on_other_posts = count_comments(other_posts)
avg_other_comments = total_comments_on_other_posts / len(other_posts)
print("Average comments on other posts: {0:.2f}".format(avg_other_comments))

Average comments on a post with Ask HN: 14.04
Average comments on a post with Show HN: 10.32
Average comments on other posts: 26.87


As we can see, Ask HN posts receive more comments on average than Show HN posts (14.04 versus 10.32), but it is the other posts that receive the most: 26.87 on average, almost double the Ask HN figure.

Ask HN posts receive about 36% more comments on average than Show HN posts. Let us focus our remaining analysis on them.
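
The comparison between the three groups can be checked with a little arithmetic. A minimal sketch that starts from the rounded averages printed above; the variable names `ask_vs_show_pct` and `other_vs_ask_ratio` are introduced here for illustration only:

```python
# Rounded averages copied from the output above.
avg_ask_comments = 14.04
avg_show_comments = 10.32
avg_other_comments = 26.87

# Relative difference of Ask HN over Show HN, as a percentage.
ask_vs_show_pct = (avg_ask_comments - avg_show_comments) / avg_show_comments * 100

# Ratio of the average for other posts to the average for Ask HN posts.
other_vs_ask_ratio = avg_other_comments / avg_ask_comments

print("Ask HN vs Show HN: {:.0f}% more comments".format(ask_vs_show_pct))  # 36%
print("Other vs Ask HN: {:.2f}x".format(other_vs_ask_ratio))               # 1.91x
```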

### Let us now determine whether Ask HN posts created at a certain time attract more comments

Let us look at the first few rows.

In [21]:
print(ask_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


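We will build two dictionaries keyed by the hour of the day: one counting the Ask HN posts created in that hour, and one totalling the comments those posts received. Dividing the second by the first then gives the average number of comments per post for each hour.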

In [53]:
import datetime as dt
result_list = []
for post in ask_posts:
    result_list.append([post[-1], int(post[4])])

counts_by_hour = {}
comments_by_hour = {}
for result in result_list:
    date_time_object = dt.datetime.strptime(result[0],
                                            "%m/%d/%Y %H:%M")
    hour = date_time_object.strftime("%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += result[1]
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = result[1]

In [54]:
import json
print("Posts by hour \n", json.dumps(counts_by_hour, indent=2))
print("Comments by hour \n", json.dumps(comments_by_hour, indent=2))

Posts by hour 
 {
  "09": 45,
  "13": 85,
  "10": 59,
  "14": 107,
  "16": 108,
  "23": 68,
  "12": 73,
  "17": 100,
  "15": 116,
  "21": 109,
  "20": 80,
  "02": 58,
  "18": 109,
  "03": 54,
  "05": 46,
  "19": 110,
  "01": 60,
  "22": 71,
  "08": 48,
  "04": 47,
  "00": 55,
  "06": 44,
  "07": 34,
  "11": 58
}
Comments by hour 
 {
  "09": 251,
  "13": 1253,
  "10": 793,
  "14": 1416,
  "16": 1814,
  "23": 543,
  "12": 687,
  "17": 1146,
  "15": 4477,
  "21": 1745,
  "20": 1722,
  "02": 1381,
  "18": 1439,
  "03": 421,
  "05": 464,
  "19": 1188,
  "01": 683,
  "22": 479,
  "08": 492,
  "04": 337,
  "00": 447,
  "06": 397,
  "07": 267,
  "11": 641
}
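
The two dictionaries above can also be built without the `if`/`else` branch. A minimal sketch using `collections.defaultdict`, run on three illustrative rows rather than the full `ask_posts` list; the sample rows, the `sample_` names, and the integer hour keys are assumptions made for this example:

```python
import datetime as dt
from collections import defaultdict

# Illustrative [created_at, num_comments] pairs in the data set's format.
sample_results = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 9:14", 1],
]

# Missing keys start at 0, so no membership check is needed.
sample_counts = defaultdict(int)
sample_comments = defaultdict(int)

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))    # {9: 2, 13: 1}
print(dict(sample_comments))  # {9: 7, 13: 29}
```

Using the integer `.hour` attribute as the key, instead of the `"%H"` string, also makes the keys sort numerically.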


Let us now find the average number of comments per post for each hour.

In [58]:
avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append(
        [hour, 
         round((comments_by_hour[hour]/counts_by_hour[hour]),2)])

The average number of comments per post, by hour of creation:


In [60]:
print(avg_by_hour)

[['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8], ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01], ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09], ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17], ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05]]


Let us sort the list in descending order of average comments and show the five highest values.

In [82]:
# To sort the list by the avg number of comments we will swap the 
# columns of the list and then sort

swapped_avg_by_hour = []
for item in avg_by_hour:
    swapped_avg_by_hour.append([item[1], item[0]])

sorted_swap = sorted(swapped_avg_by_hour, reverse=True)
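
Swapping the columns works because lists compare element by element. As an alternative sketch, `sorted` accepts a `key` function, so `[hour, average]` pairs can be sorted directly without the swap. The five pairs below are copied from the `avg_by_hour` output above, and the `sample_` and `sorted_by_avg` names are illustrative:

```python
# A few [hour, average] pairs copied from the avg_by_hour output.
sample_avg_by_hour = [
    ["09", 5.58], ["15", 38.59], ["02", 23.81], ["20", 21.52], ["16", 16.8],
]

# Sort by the second element of each pair (the average), largest first.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg[:3]:
    print("{}:00 {:.2f} average comments per post".format(hour, avg))
# 15:00 38.59 average comments per post
# 02:00 23.81 average comments per post
# 20:00 21.52 average comments per post
```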



Let us see the five best hours to publish an Ask HN post in order to receive the most comments.

In [84]:
for item in sorted_swap[:5]:
    print("{}:00 {}   average comments per post".
          format(item[1], item[0]))

15:00 38.59   average comments per post
02:00 23.81   average comments per post
20:00 21.52   average comments per post
16:00 16.8   average comments per post
21:00 16.01   average comments per post


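As we can observe, Ask HN posts created during the 15:00 hour (3 pm in the time zone of the data set's timestamps) receive the most comments on average, 38.59 per post. If we want to maximize the comments on our posts, that is the hour to publish.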

### Further explorations

- Convert the posting hours to your own time zone.
- Determine if Show HN or Ask HN posts receive more points on average.
- Determine if posts created at a certain time are more likely to receive more points.
- Compare your results to the average number of comments and points other posts receive.
- [A Detailed Exploration](https://nbviewer.org/urls/community.dataquest.io/uploads/short-url/ue205yI1Q81nWJYcE0BeN8YptoF.ipynb)
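
For the time zone item in the list above, a minimal sketch of shifting one timestamp into another zone with the standard library. It assumes the `created_at` values are in US Eastern Time, which this notebook does not state, and it uses fixed UTC offsets (UTC-5 as the source, and a hypothetical UTC+1 target) that ignore daylight saving time:

```python
import datetime as dt

# Assumption: created_at is US Eastern Time; a fixed UTC-5 offset ignores DST.
source_tz = dt.timezone(dt.timedelta(hours=-5))
# Hypothetical target zone for this example: UTC+1.
target_tz = dt.timezone(dt.timedelta(hours=1))

# Illustrative timestamp in the data set's format.
created_at = dt.datetime.strptime("8/16/2016 15:05", "%m/%d/%Y %H:%M")

# Attach the source zone, then convert to the target zone.
localized = created_at.replace(tzinfo=source_tz)
converted = localized.astimezone(target_tz)

print(converted.strftime("%m/%d/%Y %H:%M"))  # 08/16/2016 21:05
```

For offsets that follow daylight saving rules, the standard library's `zoneinfo` module (Python 3.9+) would be the more accurate choice, at the cost of depending on a time zone database being installed.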