## Analysing Hacker News Posts

#### In this project we explore a dataset of Hacker News posts to determine which types of posts receive more comments on average and which posting times attract more comments on average.

For this project we will use a simplified version of <a href="https://www.kaggle.com/hacker-news/hacker-news-posts">this</a> Kaggle dataset.

In [1]:
from csv import reader
opened_file=open("hacker_news.csv")
file=reader(opened_file)
hn=list(file)
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [2]:
headers=hn[0]
hn=hn[1:]
print(headers)
print(hn[0:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


For the purpose of this exercise we will focus on two specific types of posts: Ask HN and Show HN.

In [3]:
ask_posts=[]
show_posts=[]
other_posts=[]

for row in hn:
    post=row[1].lower()
    if post.startswith("ask hn"):
        ask_posts.append(row)
    elif post.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
result="There are {} ASK HN posts in the dataset, {} SHOW HN posts in the dataset and {} posts of other types.".format(len(ask_posts), len(show_posts), len(other_posts))
print(result)

There are 1744 ASK HN posts in the dataset, 1162 SHOW HN posts in the dataset and 17194 posts of other types.


In [4]:
## Exploring the first few rows of each list
print(ask_posts[:5])
print("\n")
print(show_posts[:5])
print("\n")
print(other_posts[:5])
print("\n")

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 

In [5]:
## Finding the average number of comments per post for each post type

total_ask_comments=0
total_show_comments=0
total_other_comments=0

for item in ask_posts:
    num_comments=int(item[4])
    total_ask_comments+=num_comments
    
avg_ask_comments=total_ask_comments/len(ask_posts)


for item in show_posts:
    num_comments=int(item[4])
    total_show_comments+=num_comments
    
avg_show_comments=total_show_comments/len(show_posts)

for item in other_posts:
    num_comments=int(item[4])
    total_other_comments+=num_comments
    
avg_other_comments=total_other_comments/len(other_posts)

print(("The average number of comments in ASK HN posts is {:.2f}.").format(avg_ask_comments))
print(("The average number of comments in SHOW HN posts is {:.2f}.").format(avg_show_comments))
print(("The average number of comments in other types of HN posts is {:.2f}.").format(avg_other_comments))

The average number of comments in ASK HN posts is 14.04.
The average number of comments in SHOW HN posts is 10.32.
The average number of comments in other types of HN posts is 26.87.


Comparing the two post types, Ask HN posts receive more comments on average (14.04) than Show HN posts (10.32). Intuitively this makes sense: Ask HN posts are posed as questions and are therefore more likely to draw input or help from other users.

However, compared with the remaining posts in our dataset, both Ask HN and Show HN posts receive fewer comments on average: the remaining posts average 26.87 comments per post.
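The three averages above all come from the same calculation: sum the `num_comments` column and divide by the number of rows. As a minimal sketch, that step could be wrapped in a helper. The name `average_comments` and the `sample_ask` list are illustrative assumptions, not part of the original notebook; the sample rows are the first five Ask HN rows printed earlier.

```python
def average_comments(rows, comments_index=4):
    """Return the mean comment count of `rows`, or 0.0 for an empty list.

    Hypothetical helper: `comments_index` defaults to 4 because
    num_comments is the fifth column in this dataset.
    """
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)


# First five Ask HN rows from the printed output, used as sample data.
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
    ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'],
    ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38'],
]

sample_avg = average_comments(sample_ask)
print("{:.2f}".format(sample_avg))  # (6 + 29 + 1 + 3 + 17) / 5 -> 11.20
```

With the full lists, `average_comments(ask_posts)`, `average_comments(show_posts)` and `average_comments(other_posts)` would replace the three separate loops.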

Let's now check whether Ask HN posts created at certain hours of the day are more likely to attract comments.
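Grouping by time depends on turning the `created_at` string into an hour. A small sketch of that step, assuming the `month/day/year hour:minute` layout seen in the rows printed earlier; the names `DATE_FORMAT` and `hour_of` are illustrative, not from the original notebook.

```python
import datetime as dt

# Format of the created_at column as it appears in the printed rows,
# e.g. '8/16/2016 9:55' (month/day/year, 24-hour clock).
DATE_FORMAT = "%m/%d/%Y %H:%M"


def hour_of(created_at):
    """Parse a created_at string and return its hour as an int (0-23)."""
    return dt.datetime.strptime(created_at, DATE_FORMAT).hour


# Timestamps copied from rows shown in the outputs above.
sample_hours = [hour_of(s) for s in ["8/16/2016 9:55", "11/22/2015 13:43", "6/17/2016 0:01"]]
print(sample_hours)  # [9, 13, 0]
```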

In [6]:
import datetime as dt
results_list=[]

for item in ask_posts:
    pair=[item[6], int(item[4])]
    results_list.append(pair)

    
counts_by_hour={}
comments_by_hour={}

for row in results_list:
    
    time=dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    created=time.hour
    if created in counts_by_hour:
        counts_by_hour[created]+=1
    else:
        counts_by_hour[created]=1
    if created in comments_by_hour:
        comments_by_hour[created]+=row[1]
    else:
        comments_by_hour[created]=row[1]

print(counts_by_hour)
print(comments_by_hour)

{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}


Finding the average number of comments per Ask HN post for each hour of the day
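To make the arithmetic concrete, here is a short sketch that divides total comments by post count for three of the hours printed above. The dictionaries `sample_counts` and `sample_comments` are hand-copied subsets of `counts_by_hour` and `comments_by_hour`, introduced only for illustration.

```python
# Subsets of the two dictionaries printed above (hours 15, 2 and 20).
sample_counts = {15: 116, 2: 58, 20: 80}
sample_comments = {15: 4477, 2: 1381, 20: 1722}

# Average comments per post = total comments in the hour / posts in the hour.
sample_averages = {
    hour: round(sample_comments[hour] / sample_counts[hour], 2)
    for hour in sample_counts
}
print(sample_averages)
```

Hour 15, for example, works out to 4477 / 116, which rounds to 38.59 comments per post.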

In [7]:
average_per_hour=[]
for item in counts_by_hour:
    number_of_posts=counts_by_hour[item]
    number_of_comments=comments_by_hour[item]
    average_per_hour.append([item, round(number_of_comments/number_of_posts,2)])

average_per_hour
    

[[9, 5.58],
 [13, 14.74],
 [10, 13.44],
 [14, 13.23],
 [16, 16.8],
 [23, 7.99],
 [12, 9.41],
 [17, 11.46],
 [15, 38.59],
 [21, 16.01],
 [20, 21.52],
 [2, 23.81],
 [18, 13.2],
 [3, 7.8],
 [5, 10.09],
 [19, 10.8],
 [1, 11.38],
 [22, 6.75],
 [8, 10.25],
 [4, 7.17],
 [0, 8.13],
 [6, 9.02],
 [7, 7.85],
 [11, 11.05]]

In [23]:
swap_avg_per_hour=[]

for item in average_per_hour:
    swap_avg_per_hour.append([item[1], item[0]])


sorted_swap=sorted(swap_avg_per_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")
for item in sorted_swap[:5]:  # keep only the five highest averages
    hr=dt.datetime.strptime(str(item[1]),"%H")
    h1=hr.hour
    result="{}: {} average comments per post".format(h1,item[0])
    print(result)

Top 5 Hours for Ask Posts Comments
15: 38.59 average comments per post
2: 23.81 average comments per post
20: 21.52 average comments per post
16: 16.8 average comments per post
21: 16.01 average comments per post
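The swap-then-sort step can also be expressed with a `key` function, which avoids building a second list. This is an alternative sketch rather than the notebook's own method; `sample_average_per_hour` is a hand-copied subset of `average_per_hour`, and the `HH:00` label format is an assumption made for readability.

```python
# Subset of average_per_hour from the output above: [hour, average comments].
sample_average_per_hour = [
    [9, 5.58], [13, 14.74], [16, 16.8], [15, 38.59],
    [21, 16.01], [20, 21.52], [2, 23.81],
]

# Sort by the average (second element), highest first, and keep five entries.
top_five = sorted(sample_average_per_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_five:
    print("{:02d}:00: {:.2f} average comments per post".format(hour, avg))
```

On this sample the hours come out in the order 15, 2, 20, 16, 21.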
