# Exploring 'Hacker News' posts
The aim of this project is to examine a dataset of posts submitted to the technology news site 'Hacker News'. Each row records the post's id, title, URL, points (upvotes minus downvotes), number of comments, the username of the person who posted it, and the date and time the post was created.

We are going to focus on the posts whose titles start with 'Ask HN' or 'Show HN'. 'Ask HN' posts put a question to the community, while 'Show HN' posts present a product, an idea, or something similar.
Let us begin by reading in the dataset and storing it as a list of lists.

In [2]:
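from csv import reader

# Read the whole CSV file into a list of lists (one inner list per row).
# The with block closes the file once it has been read.
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

# Print a few rows to inspect the dataset.
print(hn[:5])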

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [3]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:3])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]


In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
        

1744
1162
17194
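
The prefix check in the cell above can also be wrapped in a small helper, which makes the rule easy to try on individual titles. This is only a sketch: `classify_title` and `sample_titles` are hypothetical names that do not appear in the notebook, and the first two sample titles are made up.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Illustrative titles only; the first two are invented for this example.
sample_titles = ['Ask HN: Any tips?', 'SHOW HN: My project', 'Interactive Dynamic Video']
labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Lowercasing the title first means that 'Ask HN', 'ask hn' and 'ASK HN' are all treated the same way, as in the original loop.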


In [5]:
print(ask_posts[:5])
print('\n')
print(show_posts[:5])
# Print the first 5 rows of the ask and show post lists.

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 

In [6]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
# Compute the average once, after the loop has summed every post.
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)


14.038417431192661


In [7]:
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
# Compute the average once, after the loop has summed every post.
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)


10.31669535283993
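
The two cells above repeat the same loop with a different list, so the calculation could be factored into one function. The following is a minimal sketch under that assumption. `average_column` and `sample_rows` are hypothetical names, the sample rows are shortened stand-ins for real rows, and returning 0.0 for an empty list is a choice made here, not something the notebook specifies.

```python
def average_column(rows, index):
    """Average the integer values stored at position `index` across `rows`."""
    if not rows:
        return 0.0  # assumption: an empty list averages to zero instead of raising
    return sum(int(row[index]) for row in rows) / len(rows)

# Two stand-in rows in the same column order as the dataset.
sample_rows = [
    ['1', 'Ask HN: a', '', '2', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: b', '', '28', '29', 'user_b', '11/22/2015 13:43'],
]
avg_comments = average_column(sample_rows, 4)  # column 4 is num_comments
avg_points = average_column(sample_rows, 3)    # column 3 is num_points
print(avg_comments, avg_points)
```

With such a helper, the averages for ask and show posts would each be a single call, for example `average_column(ask_posts, 4)`.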


The averages show that 'Ask HN' posts receive more comments on average (about 14.04 per post) than 'Show HN' posts (about 10.32), so we will focus the hourly comment analysis on Ask posts.
Next, we will find out whether Ask posts created at a certain time of day are more likely to attract comments. We will do so by:
- counting the Ask posts created in each hour of the day, along with the comments they received.
- calculating the average number of comments per Ask post for each hour.
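
Both steps depend on extracting the hour from the `created_at` column. As a small sketch of that step, the snippet below parses one timestamp copied from the first Ask HN row printed earlier. The variable names are illustrative only.

```python
import datetime as dt

created_at = '8/16/2016 9:55'  # created_at value of the first Ask HN row shown earlier

# %m/%d/%Y %H:%M matches month/day/year followed by a 24-hour time.
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

# strftime('%H') returns the hour as a zero-padded two-character string.
hour = parsed.strftime('%H')
print(parsed)
print(hour)
```

Because `%H` is zero-padded, the hour comes back as a string such as '09', which is why the dictionary keys in the later cells are strings rather than integers.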

In [8]:
print(ask_posts[:2])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']]


In [9]:
import datetime as dt

# Keep only the creation timestamp and the comment count of each ask post.
result_list = []
for post in ask_posts:
    result_list.append([post[-1], int(post[4])])

counts_by_hour = {}    # number of ask posts created in each hour
comments_by_hour = {}  # total comments on ask posts created in each hour
for row in result_list:
    date_and_time = row[0]
    date_time_object = dt.datetime.strptime(date_and_time, '%m/%d/%Y %H:%M')
    hour = date_time_object.strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
comments_by_hour


{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

We have created two dictionaries:
- counts_by_hour is a frequency table holding the number of ask posts created during each hour.
- comments_by_hour holds the total number of comments received by the ask posts created during each hour.

Now we are going to calculate the average number of comments per post created during each hour of the day.
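
To make the relationship between the two dictionaries concrete, here is a small worked sketch on made-up data. It uses `collections.defaultdict` instead of the explicit `if hour not in ...` check, which is a substitution made for this example, and all names and sample values are hypothetical.

```python
from collections import defaultdict
import datetime as dt

# Made-up (created_at, num_comments) pairs, for illustration only.
sample = [('8/16/2016 9:55', 6), ('11/22/2015 13:43', 29), ('5/2/2016 9:14', 2)]

counts = defaultdict(int)    # posts created per hour
comments = defaultdict(int)  # total comments per hour
for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

# Average comments per post = total comments in the hour / posts in the hour.
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```

Here two posts fall in the 09 hour with 6 and 2 comments, so that hour averages 4.0 comments per post, while the single post in the 13 hour averages 29.0.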


In [10]:
# Build [hour, average comments per post] pairs.
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

In [11]:
# Swap each pair to [average, hour] so that sorting orders the list by the average.
swap_avg_hour = []
for avg in avg_by_hour:
    swap_avg_hour.append([avg[1], avg[0]])
print(swap_avg_hour)
sorted_swap = sorted(swap_avg_hour, reverse=True)
print('Top 5 hours for Ask Post comments')
for avg, hour in sorted_swap[:5]:
    text = '{} : {:.2f} average comments per post.'
    hour_label = dt.datetime.strptime(hour, '%H').strftime('%H:%M')
    print(text.format(hour_label, avg))

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
Top 5 hours for Ask Post comments
15:00 : 38.59 average comments per post.
02:00 : 23.81 average comments per post.
20:00 : 21.52 average comments per post.
16:00 : 16.80 average comments per post.
21:00 : 16.01 average comments per post.
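
The swap step exists only so that `sorted` compares the averages first. An alternative, sketched below, is to pass a `key` function to `sorted` and leave the pairs in their original [hour, average] order. The sample list holds rounded values taken from the output above, and the names are illustrative.

```python
# Rounded [hour, average] pairs in the same shape as avg_by_hour.
sample_avg_by_hour = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# key=... tells sorted to order by the second element (the average).
top_three = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:3]
for hour, avg in top_three:
    print('{}:00 : {:.2f} average comments per post.'.format(hour, avg))
```

Either approach gives the same ranking. The `key` version simply avoids building a second list.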


# Conclusion #1
From the cell above, 'Ask HN' posts created during the 15:00 (3:00 PM) hour, that is between 15:00 and 16:00, receive the most comments on average: about 38.59 per post.
We have conducted the analysis only on 'Ask HN' and 'Show HN' posts that received comments, so it is more precise to say that, among the posts that received comments, 'Ask HN' posts created during the 15:00 hour received the highest average number of comments.

# Further Analysis on Show HN & Ask HN posts

Now we will find out the average number of points received by Ask posts and Show posts.

In [12]:
ask_posts_points = 0
for post in ask_posts:
    ask_posts_points += int(post[3])
# Compute the average once, after the loop has summed every post.
avg_ask_points = ask_posts_points / len(ask_posts)
avg_ask_points
    


15.061926605504587

In [13]:
show_posts_points = 0
for post in show_posts:
    show_posts_points += int(post[3])
# Compute the average once, after the loop has summed every post.
avg_show_points = show_posts_points / len(show_posts)

avg_show_points

27.555077452667813

Now we will find out the average number of points received by posts created during each hour of the day.
Let's start with Ask posts.


In [18]:
# Keep only the creation timestamp and the points of each ask post.
new_ask_list = []
for post in ask_posts:
    new_ask_list.append([post[-1], int(post[3])])

ask_counts = {}  # number of ask posts created in each hour
ask_points = {}  # total points of ask posts created in each hour
for ask in new_ask_list:
    time_object = dt.datetime.strptime(ask[0], '%m/%d/%Y %H:%M')
    hour = time_object.strftime('%H')
    if hour not in ask_counts:
        ask_counts[hour] = 1
        ask_points[hour] = ask[1]
    else:
        ask_counts[hour] += 1
        ask_points[hour] += ask[1]

# Average points per post for each hour, stored as [average, hour].
avg_ask_points_by_hour = []
for hour in ask_points:
    avg_ask_points_by_hour.append([ask_points[hour] / ask_counts[hour], hour])

sorted_avg_pointsby_hour = sorted(avg_ask_points_by_hour, reverse=True)
print(sorted_avg_pointsby_hour[:5])
print('\n')
print('Hours at which Ask Posts received the most points on average:')
for avg, hour in sorted_avg_pointsby_hour[:5]:
    hour_label = dt.datetime.strptime(hour, '%H').strftime('%H:%M')
    text = '{} : {:.2f} points received on average at this hour'
    print(text.format(hour_label, avg))

[[29.99137931034483, '15'], [24.258823529411764, '13'], [23.35185185185185, '16'], [19.41, '17'], [18.677966101694917, '10']]


Hours at which Ask Posts received the most points on average:
15:00 : 29.99 points received on average at this hour
13:00 : 24.26 points received on average at this hour
16:00 : 23.35 points received on average at this hour
17:00 : 19.41 points received on average at this hour
10:00 : 18.68 points received on average at this hour


'Ask HN' posts created between 3pm and 4pm received the most points on average (about 29.99 per post).
Let's do the same for Show posts now.

In [19]:
# Keep only the creation timestamp and the points of each show post.
new_show_list = []
for post in show_posts:
    new_show_list.append([post[-1], int(post[3])])

show_counts = {}  # number of show posts created in each hour
show_points = {}  # total points of show posts created in each hour
for post in new_show_list:
    time_object = dt.datetime.strptime(post[0], '%m/%d/%Y %H:%M')
    hour = time_object.strftime('%H')
    if hour not in show_counts:
        show_counts[hour] = 1
        show_points[hour] = post[1]
    else:
        show_counts[hour] += 1
        show_points[hour] += post[1]

# Average points per post for each hour, stored as [average, hour].
avg_show_points_by_hour = []
for hour in show_points:
    avg_show_points_by_hour.append([show_points[hour] / show_counts[hour], hour])
sorted_avg_show_points = sorted(avg_show_points_by_hour, reverse=True)

print(sorted_avg_show_points[:5])
print('Hours at which Show Posts received the most points on average:')
for avg, hour in sorted_avg_show_points[:5]:
    hour_label = dt.datetime.strptime(hour, '%H').strftime('%H:%M')
    text = '{} : {:.2f} points on average'
    print(text.format(hour_label, avg))

[[42.388888888888886, '23'], [41.68852459016394, '12'], [40.34782608695652, '22'], [37.83870967741935, '00'], [36.31147540983606, '18']]
Hours at which Show Posts received the most points on average:
23:00 : 42.39 points on average
12:00 : 41.69 points on average
22:00 : 40.35 points on average
00:00 : 37.84 points on average
18:00 : 36.31 points on average
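
The hourly cells for Ask and Show posts follow the same pattern: group by hour, total a column, divide by the count, and sort. As a sketch of how that pattern might be shared, the function below takes the column index as a parameter. `average_by_hour` and `sample_posts` are hypothetical names, and the sample rows (including the example.com URLs) are invented for illustration.

```python
import datetime as dt

def average_by_hour(posts, value_index, date_index=-1):
    """Return [average, hour] pairs for `posts`, highest average first."""
    totals = {}
    counts = {}
    for post in posts:
        hour = dt.datetime.strptime(post[date_index], '%m/%d/%Y %H:%M').strftime('%H')
        totals[hour] = totals.get(hour, 0) + int(post[value_index])
        counts[hour] = counts.get(hour, 0) + 1
    return sorted(([totals[h] / counts[h], h] for h in totals), reverse=True)

# Invented rows in the same column order as the dataset.
sample_posts = [
    ['1', 'Show HN: a', 'http://example.com/a', '10', '3', 'user_a', '11/25/2015 14:03'],
    ['2', 'Show HN: b', 'http://example.com/b', '30', '5', 'user_b', '11/29/2015 14:46'],
    ['3', 'Show HN: c', 'http://example.com/c', '4', '1', 'user_c', '4/1/2016 22:10'],
]
points_by_hour = average_by_hour(sample_posts, 3)    # column 3 is num_points
comments_by_hour_avg = average_by_hour(sample_posts, 4)  # column 4 is num_comments
print(points_by_hour)
print(comments_by_hour_avg)
```

Under this design, the same call would serve ask or show posts and points or comments, with only the list and the column index changing.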


# Conclusion #2
'Show HN' posts created between 11pm and 12am received the most points on average (about 42.39 per post).
'Show HN' posts also receive more points on average than 'Ask HN' posts (about 27.56 versus 15.06), even though 'Ask HN' posts receive more comments. A possible reason is that 'Ask HN' posts raise questions and doubts, which leads to further debate and explanation in the comments, whereas 'Show HN' posts show off an idea, a technology, etc., so more of the interaction happens through voting.