# Introduction
Our aim is to analyze Hacker News (HN) posts whose titles begin with either `Ask HN` or `Show HN`.
- Users submit Ask HN posts to ask the Hacker News community a specific question.
- Users submit Show HN posts to show the Hacker News community a project, a product, or anything else they find interesting.

We want to know:
- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

---
### Given: Data Set
- The Hacker News data set and its documentation [are available on Kaggle.](https://www.kaggle.com/hacker-news/hacker-news-posts)

--------
## Step 1: Open the dataset

We'll start by reading the data set into a list of lists so that it's ready for analysis.

In [1]:
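# global imports
from csv import reader
import datetime as dt

# open the Hacker News .csv file and read every row into a list of lists;
# the `with` block closes the file once reading is done
with open('hacker_news.csv', newline='') as csv_file:
    hn = list(reader(csv_file))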

----------------
## Step 2: Explore the data
Let's print the first few rows of the data set to see what they look like!

In [2]:
# print for Janki's sanity
hn[:3]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24']]

---
## Step 3: Clean the data
Now that we've opened and explored our dataset, we'll need to scrub and clean the data.

This process of preparing our data for analysis is called `data cleaning`. Data cleaning is done before the analysis; it includes removing or correcting wrong data, removing duplicate data, and modifying the data to fit the purpose of our analysis.
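
As one illustration of the cleaning steps named above, here is a minimal sketch of removing duplicate rows. It assumes the first column (`id`) uniquely identifies a post; the helper name `drop_duplicate_rows` and the sample rows are hypothetical and are not part of the analysis below.

```python
def drop_duplicate_rows(rows, key_index=0):
    """Keep the first row seen for each key and drop later repeats."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Hypothetical rows in the same column order as the data set.
sample = [
    ['1', 'Ask HN: example question', '', '4', '7', 'alice', '9/26/2016 2:53'],
    ['2', 'Show HN: example project', 'https://example.com', '1', '0', 'bob', '9/26/2016 3:24'],
    ['1', 'Ask HN: example question', '', '4', '7', 'alice', '9/26/2016 2:53'],
]

deduped = drop_duplicate_rows(sample)
print(len(deduped))
```

Whether the real file contains duplicate ids is not checked here, so treat this as an optional step.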

### Remove header row
Let's start by moving the header row to its own variable so that our `hn` list contains only the post rows.

In [3]:
# separate out the header row
headers = hn[0]
del hn[0]

print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [4]:
# print for Janki's sanity
hn[:2]

[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24']]

### Remove rows with no comments
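Let's remove the rows for posts that have no comments. Note that every average computed from here on therefore covers only posts with at least one comment.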

In [5]:
# separate out articles with and without comments
zero_comm_hn=[]
comm_hn=[]

for row in hn:
    comments=int(row[4])
    if comments==0:
        zero_comm_hn.append(row)
    else:
        comm_hn.append(row)
        
hn = comm_hn

print(len(hn))

80401
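
The same split can also be written with list comprehensions. This is only a sketch on hypothetical sample rows; `COMMENTS_INDEX` is a name introduced here for readability and is not used elsewhere in the notebook.

```python
# Hypothetical rows in the same column order as the data set.
sample_rows = [
    ['1', 'First example post', '', '1', '0', 'alice', '9/26/2016 3:26'],
    ['2', 'Second example post', '', '4', '7', 'bob', '9/26/2016 2:53'],
    ['3', 'Third example post', '', '2', '0', 'carol', '9/26/2016 1:10'],
]

COMMENTS_INDEX = 4  # position of the num_comments column

# Each comprehension keeps only the rows that satisfy its condition.
with_comments = [row for row in sample_rows if int(row[COMMENTS_INDEX]) > 0]
without_comments = [row for row in sample_rows if int(row[COMMENTS_INDEX]) == 0]

print(len(with_comments), len(without_comments))
```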


### Parse out `Ask HN` and `Show HN` post rows
Let's now create three lists to separate out `Ask HN`, `Show HN`, and `Other Posts`.

In [6]:
# create lists that contain only "showHN",
# "askHN", and non-show + non-ask posts ("other")
ask_posts=[]
show_posts=[]
other_posts=[]

for row in hn:
    title=row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# print for Janki's sanity
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
print(ask_posts[:1])
print(show_posts[:1])
print(other_posts[:1])

6911
5059
68431
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']]
[['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']]
[['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']]
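
One way to make sure each post lands in exactly one list is to move the title check into a small function. The sketch below is an illustration only; `classify_title` and the `groups` dictionary are hypothetical names, and the sample titles are taken from the output above.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


sample_titles = [
    'Ask HN: What TLD do you use for local development?',
    'Show HN: Jumble  Essays on the go #PaulInYourPocket',
    'Saving the Hassle of Shopping',
]

# Group the titles by category; every title appears in exactly one group.
groups = {'ask': [], 'show': [], 'other': []}
for title in sample_titles:
    groups[classify_title(title)].append(title)

print({name: len(titles) for name, titles in groups.items()})
```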


---
## Step 4: Analyze the data
### Comments average
Let's determine if ask posts or show posts receive more comments on average.

In [7]:
# calculate the average comments in AskHN
total_ask_comments=0
num_ask_posts=len(ask_posts)

for row in ask_posts:
    comms=int(row[4])
    total_ask_comments+=comms

avg_ask_comments = total_ask_comments/num_ask_posts
ask_print="avg ask comments: " + format(round(avg_ask_comments))
print(ask_print)



# calculate the average comments in ShowHN
total_show_comments=0
num_show_posts=len(show_posts)

for row in show_posts:
    comms=int(row[4])
    total_show_comments+=comms

avg_show_comments = total_show_comments/num_show_posts
show_print="avg show comments: " + format(round(avg_show_comments))
print(show_print)

avg ask comments: 14
avg show comments: 10


It looks like `Ask HN` posts receive more comments on average than `Show HN` posts (about 14 versus 10). One plausible explanation is that a question directly invites replies, while a post that showcases a project does not.

Since ask posts receive more comments on average, we'll focus our remaining analysis on just these posts.
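
The two averaging loops in the cell above repeat the same logic, so they could be folded into one helper. This is a minimal sketch; `average_comments` is a hypothetical name, and it assumes the comment count sits at index 4, as it does in the `hn` rows.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for rows shaped like the `hn` rows."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows in the same column order as the data set.
sample_posts = [
    ['1', 'Ask HN: first example', '', '4', '7', 'alice', '9/26/2016 2:53'],
    ['2', 'Ask HN: second example', '', '2', '3', 'bob', '9/26/2016 3:10'],
]

print(average_comments(sample_posts))
```

With the real lists, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.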

### `Ask HN` posts by time
Now, we want to determine if ask posts created at a certain time are more likely to attract comments.

Here's the method we'll use:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.
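
The two steps above can be sketched on a tiny, hypothetical sample before running them on the full data. This version uses `collections.defaultdict` as a convenience, which is a substitution for the plain dictionaries used in the cell below; the sample timestamps and counts are made up.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs in the data set's date format.
sample = [
    ['9/26/2016 2:53', 7],
    ['9/26/2016 2:10', 3],
    ['9/25/2016 15:06', 20],
]

posts_per_hour = defaultdict(int)
comments_per_hour = defaultdict(int)

# Step 1: count posts and sum comments for each hour of the day.
for created_at, num_comments in sample:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")  # zero-padded hour, e.g. "02"
    posts_per_hour[hour] += 1
    comments_per_hour[hour] += num_comments

# Step 2: divide total comments by post count for each hour.
average_per_hour = {
    hour: comments_per_hour[hour] / posts_per_hour[hour]
    for hour in posts_per_hour
}

print(average_per_hour)
```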


In [18]:
# create a list of when posts were created
result_list=[]
for row in ask_posts:
    created=row[6]
    num_comms=int(row[4])
    result_list.append([created, num_comms])

    

# create two dicts keyed by hour of day ("00" to "23"):
# (1) posts by hour, and (2) comments by hour
counts_by_hour={}
comments_by_hour={}
for row in result_list:
    created=row[0]
    comms=row[1]
    # parse the full timestamp, then keep only the zero-padded hour
    created_dt=dt.datetime.strptime(created,"%m/%d/%Y %H:%M")
    hour=created_dt.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour]=1
        comments_by_hour[hour]=comms
    else:
        counts_by_hour[hour]+=1
        comments_by_hour[hour]+=comms
        
print(counts_by_hour)
print(" ")
print(comments_by_hour)

{'02': 227, '01': 223, '22': 287, '21': 407, '19': 420, '17': 404, '15': 467, '14': 378, '13': 326, '11': 251, '10': 219, '09': 176, '07': 157, '03': 212, '16': 415, '08': 190, '00': 231, '23': 276, '20': 392, '18': 452, '12': 274, '04': 186, '06': 176, '05': 165}
 
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '16': 4466, '08': 2362, '00': 2277, '23': 2297, '20': 4462, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


### Average comments during each hour of the day
Let's use the two dictionaries we've just created (`counts_by_hour` and `comments_by_hour`) to calculate the average number of comments for posts created during each hour of the day.

In [26]:
avg_by_hour=[]
for hour in counts_by_hour:
    comments=comments_by_hour[hour]
    counts=counts_by_hour[hour]
    avg_by_hour.append([hour, comments/counts])

# print for J's sanity
avg_by_hour

[['02', 13.198237885462555],
 ['01', 9.367713004484305],
 ['22', 11.749128919860627],
 ['21', 11.056511056511056],
 ['19', 9.414285714285715],
 ['17', 13.73019801980198],
 ['15', 39.66809421841542],
 ['14', 13.153439153439153],
 ['13', 22.2239263803681],
 ['11', 11.143426294820717],
 ['10', 13.757990867579908],
 ['09', 8.392045454545455],
 ['07', 10.095541401273886],
 ['03', 10.160377358490566],
 ['16', 10.76144578313253],
 ['08', 12.43157894736842],
 ['00', 9.857142857142858],
 ['23', 8.322463768115941],
 ['20', 11.38265306122449],
 ['18', 10.789823008849558],
 ['12', 15.452554744525548],
 ['04', 12.688172043010752],
 ['06', 9.017045454545455],
 ['05', 11.139393939393939]]

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [68]:
swap_avg_by_hour=[]
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap=sorted(swap_avg_by_hour, reverse=True)


print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[0:5]:
    template="{}: {:.2f} average comments per post."
    hour1=dt.datetime.strptime(row[1], "%H")
    hour2=dt.datetime.strftime(hour1,"%H:%M")
    t=template.format(hour2, row[0])
    print(t)

Top 5 Hours for Ask Posts Comments
15:00: 39.67 average comments per post.
13:00: 22.22 average comments per post.
12:00: 15.45 average comments per post.
10:00: 13.76 average comments per post.
17:00: 13.73 average comments per post.
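
As an alternative to swapping the columns before sorting, `sorted` accepts a `key` function that picks the value to sort on. The sketch below uses three of the averages printed earlier as a sample; `sample_avg_by_hour` and `top_hours` are hypothetical names.

```python
# Three [hour, average] pairs copied from the avg_by_hour output.
sample_avg_by_hour = [
    ['02', 13.198237885462555],
    ['15', 39.66809421841542],
    ['13', 22.2239263803681],
]

# Sort by the average (second element), highest first, without swapping columns.
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours:
    print("{}:00: {:.2f} average comments per post.".format(hour, avg))
```

One difference from the swap approach: when two hours have the same average, this version keeps their original order instead of ordering them by hour.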


So, during which hours should you create a post to attract the most comments? Among `Ask HN` posts that received at least one comment, those created in the 3PM hour averaged the most comments (39.67 per post), followed by 1PM, noon, 10AM, and 5PM. These are averages rather than guarantees, and the hours are in whatever time zone the data set's `created_at` column uses.