# Exploring Hacker News Posts

In this project, we explore a dataset from Hacker News, a website similar to Reddit where people can submit stories, vote on them, and comment. The full dataset can be downloaded [here](https://www.kaggle.com/hacker-news/hacker-news-posts). We use a sample of that dataset: stories with no comments were removed, and the remaining rows were randomly sampled. The original dataset has about 300,000 rows; the sample used in this project has about 20,000 rows.

In [12]:
# Read the dataset and display the first five rows
from csv import reader

# Use a context manager so the file is closed after reading
with open('hacker_news.csv') as f:
    hn = list(reader(f))
headers = hn[0]
hn = hn[1:]

print(headers)
hn[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

In [15]:
# Split the rows into three lists by whether the title starts with 'ask hn', 'show hn', or neither
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
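
The prefix check used above can also be wrapped in a small function, which makes the rule easy to try on individual titles. This is only a minimal sketch: `categorize_title` is a hypothetical helper name that does not appear in the notebook, and the titles passed to it below are made up for illustration.

```python
def categorize_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Illustrative titles only (not taken from the dataset)
for example in ['Ask HN: How do you learn?', 'SHOW HN: My side project', 'Interactive Dynamic Video']:
    print(example, '->', categorize_title(example))
```

Lowercasing the title before the check means that variants such as "ASK HN" and "Ask HN" land in the same category.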


In [21]:
# Calculate average number of ask comments
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average number of ask comments is', avg_ask_comments)

# Calculate average number of show comments
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print('Average number of show comments is', avg_show_comments)

Average number of ask comments is 14.038417431192661
Average number of show comments is 10.31669535283993


On average, ask posts receive about 4 more comments than show posts (roughly 14 versus 10 comments per post).
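
The two averaging loops above follow the same pattern, so they could be folded into one function. The sketch below assumes rows shaped like those in `hn`, with the comment count stored as a string at index 4; `average_comments` is a hypothetical name, and the two sample rows are copied from the first rows displayed earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the integer values in column `comments_index` across `posts`."""
    if not posts:
        # Avoid dividing by zero when a category has no posts
        return 0.0
    return sum(int(row[comments_index]) for row in posts) / len(posts)


# Two rows taken from the first-five preview of the dataset
sample_rows = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time',
     'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
     '39', '10', 'josep2', '1/26/2016 19:30'],
]
print(average_comments(sample_rows))
```

With this helper, the two averages would be computed as `average_comments(ask_posts)` and `average_comments(show_posts)`.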

## Analyze Ask Posts

Since ask posts receive more comments on average than show posts, we next examine whether ask posts created at certain hours of the day tend to attract more comments.

In [50]:
from datetime import datetime

# Create an empty list to store the creation time and number of comments of each ask post
result_list = []

for row in ask_posts:
    result_list.append([row[6], row[4]])
    
counts_by_hour = {}
comments_by_hour = {}

# Count the ask posts created in each hour of the day and sum their comments
for row in result_list:
    row[0] = datetime.strptime(row[0], '%m/%d/%Y %H:%M')
    hour = row[0].strftime('%H')
    row[1] = int(row[1])
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]

avg_by_hour = []

# Calculate the average number of comments per post each hour of the day
for hour in counts_by_hour:
    avg_by_hour.append([hour, round(comments_by_hour[hour]/counts_by_hour[hour])])

swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print('Top 5 Hours for Ask Post Comments')
for row in sorted_swap[:5]:
    print(datetime.strftime(datetime.strptime(row[1], '%H'), '%H:%M') + ': ' + str(row[0]) + ' average comments per post')

Top 5 Hours for Ask Post Comments
15:00: 39 average comments per post
02:00: 24 average comments per post
20:00: 22 average comments per post
16:00: 17 average comments per post
21:00: 16 average comments per post


In this sample, ask posts created at 15:00 receive the most comments on average (39 per post), so that hour is the best time to post if the goal is to attract comments.

The top five hours by average comments per ask post are 15:00, 02:00, 20:00, 16:00, and 21:00, with rounded averages of 39, 24, 22, 17, and 16 comments respectively.
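
The hourly aggregation can also be written more compactly with `collections.defaultdict` and a sort key, which removes the need for the swapped list. This is a hedged sketch rather than the notebook's method: `average_comments_by_hour` is a hypothetical name, the date format is assumed to match the `created_at` column shown earlier, and the sample rows are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def average_comments_by_hour(rows, date_format='%m/%d/%Y %H:%M'):
    """Map each hour (0-23) to the mean comment count of posts created in that hour.

    Each row is a (created_at_string, num_comments) pair.
    """
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in rows:
        hour = datetime.strptime(created_at, date_format).hour
        counts[hour] += 1
        comments[hour] += int(num_comments)
    return {hour: comments[hour] / counts[hour] for hour in counts}


# Illustrative rows only (not taken from the dataset)
sample = [
    ('8/4/2016 11:52', '52'),
    ('1/26/2016 19:30', '10'),
    ('6/17/2016 19:05', '4'),
    ('6/23/2016 22:20', '1'),
]

averages = average_comments_by_hour(sample)

# Sort (hour, average) pairs by average, highest first
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for hour, avg in ranked:
    print(f'{hour:02d}:00: {avg:.2f} average comments per post')
```

Applied to the notebook's data, the call would be `average_comments_by_hour(result_list)`, provided it runs before the loop that converts the entries of `result_list` in place. Keeping the unrounded averages until display time also avoids ties created by rounding.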