# Exploring Hacker News Posts

In this project, we analyze whether 'Ask HN' posts or 'Show HN' posts receive more comments; in other words, whether Hacker News visitors are more interested in answering questions posed by the community or in learning from what the community shares. We also assess how the time at which a post is created relates to the number of comments it receives.

The dataset can be found [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts). The dataset comprises 300,000 rows but has been pared down to approximately 20,000 for this project. The columns are described below:

id: the unique identifier from Hacker News for the post

title: the title of the post

url: the URL that the post links to, if the post has a URL

num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes

num_comments: the number of comments on the post

author: the username of the person who submitted the post

created_at: the date and time of the post's submission

Opening and reading the csv file.

In [1]:
from csv import reader

# Read every row of the CSV file into a list of lists.
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
opened_file.close()
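
A context manager is a common alternative for this step, since it closes the file automatically. The sketch below is illustrative only: it reads from an in-memory `io.StringIO` buffer (an assumption made so the example runs without `hacker_news.csv`), and the names `sample_csv`, `sample_rows`, `sample_header`, and `sample_data` are hypothetical.

```python
import csv
import io

# In-memory stand-in for the CSV file (assumption: lets the sketch run
# without hacker_news.csv). The data row is the first row of the dataset.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

# The with-block closes the handle as soon as the rows have been read.
with sample_csv as handle:
    sample_rows = list(csv.reader(handle))

# Split the header row from the data rows.
sample_header, sample_data = sample_rows[0], sample_rows[1:]
print(sample_header)
print(sample_data[0])
```

With a real file, `open('hacker_news.csv', newline='')` would take the place of the buffer.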

Taking a cursory look at the data for better understanding by printing the first few rows.

In [2]:
print(hn[:4])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]


Separating the data from the headers.

In [3]:
headers = hn[0]
hn = hn[1:]

print(headers)
print(hn[:3])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]


Calculating how many posts are of each type.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(title)
        
    elif title.lower().startswith("show hn"):
        show_posts.append(title)
    else:
        other_posts.append(title)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
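
The prefix check used above can also be wrapped in a small helper so the same rule is reused in later steps. This is a minimal sketch, not the notebook's own code: `classify_title`, `sample_titles`, and `type_counts` are hypothetical names, and the second sample title is made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Sample titles; the "Show HN" title is hypothetical.
sample_titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: A tiny static site generator",
    "Interactive Dynamic Video",
]

# Tally how many titles fall into each category.
type_counts = {"ask": 0, "show": 0, "other": 0}
for sample_title in sample_titles:
    type_counts[classify_title(sample_title)] += 1

print(type_counts)
```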


Building a frequency table of the number of comments to see if there are any odd characters or entries.

In [5]:
ask_posts_freq ={}
show_posts_freq = {}
other_posts_freq = {}

In [6]:
for row in hn:
    count = row[4]
    title = row[1]
    if title.lower().startswith("ask hn"):
        if count in ask_posts_freq:
            ask_posts_freq[count] += 1
        else:
            ask_posts_freq[count] = 1 
    elif title.lower().startswith("show hn"):
        if count in show_posts_freq:
            show_posts_freq[count] += 1
        else:
            show_posts_freq[count] = 1 
    else:
        if count in other_posts_freq:
            other_posts_freq[count] += 1
        else:
            other_posts_freq[count] = 1 


print(ask_posts_freq)
print(show_posts_freq)
print(other_posts_freq)

{'6': 101, '29': 8, '1': 332, '3': 195, '17': 10, '4': 137, '2': 321, '7': 66, '22': 16, '20': 10, '33': 4, '5': 98, '11': 32, '9': 52, '37': 5, '182': 1, '8': 49, '24': 4, '10': 37, '140': 1, '30': 4, '12': 30, '72': 1, '130': 1, '15': 19, '43': 5, '19': 9, '234': 1, '25': 6, '71': 1, '61': 2, '13': 10, '185': 1, '55': 2, '35': 2, '250': 1, '93': 1, '92': 1, '112': 1, '16': 15, '32': 5, '28': 4, '60': 4, '62': 2, '18': 9, '34': 2, '266': 1, '183': 1, '14': 14, '26': 5, '46': 3, '41': 4, '85': 2, '42': 3, '910': 1, '95': 1, '66': 1, '125': 1, '40': 3, '51': 2, '31': 5, '162': 1, '69': 3, '117': 1, '83': 1, '49': 2, '65': 2, '109': 1, '90': 1, '81': 1, '514': 1, '21': 6, '23': 3, '78': 1, '128': 2, '50': 4, '520': 1, '94': 1, '118': 1, '27': 3, '53': 2, '477': 1, '111': 1, '96': 1, '135': 1, '101': 1, '52': 1, '45': 2, '47': 1, '147': 1, '144': 1, '131': 1, '190': 1, '383': 1, '44': 1, '73': 2, '947': 1, '91': 1, '691': 1, '58': 2, '283': 1, '138': 1, '97': 1, '57': 1, '102': 1, '281': 

The output does not appear to contain any odd characters.
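
Scanning a printed dictionary by eye is easy to get wrong, so a programmatic check can complement it. The sketch below uses `collections.Counter` for the frequency table and `str.isdigit` to flag entries that are not plain non-negative integers. The `find_non_numeric` helper and the `sample_counts` values (including the two deliberately bad entries) are hypothetical.

```python
from collections import Counter


def find_non_numeric(values):
    """Return the values that are not made up solely of digits."""
    return [value for value in values if not value.isdigit()]


# Hypothetical comment-count strings, including two deliberately bad entries.
sample_counts = ["6", "29", "1", "1", "n/a", "3", " 4"]

# Counter builds the same kind of frequency table as the loop above.
sample_freq = Counter(sample_counts)
bad_entries = find_non_numeric(sample_counts)

print(sample_freq)
print(bad_entries)
```

An empty `bad_entries` list would indicate that every value can be passed safely to `int()`.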

Calculating the average number of comments for Ask HN and Show HN posts.

In [7]:
total_ask_comments = 0
total_show_comments = 0

for row in hn:
    count = int(row[4])
    title = row[1]

    if title.lower().startswith("ask hn"):
        total_ask_comments += count
       
    if title.lower().startswith("show hn"):
        total_show_comments += count
 
#print(total_ask_comments)
#print(total_show_comments)
print("Average Ask HN posts")
avg_ask_comments = total_ask_comments/(len(ask_posts))
print(avg_ask_comments)

print("\n")

print('Average Show HN posts')
avg_show_comments = total_show_comments/(len(show_posts))

print(avg_show_comments)


Average Ask HN posts
14.038417431192661


Average Show HN posts
10.31669535283993


On average, Ask HN posts receive more comments than Show HN posts (about 14.04 versus 10.32 per post). Since Ask HN posts receive more comments on average, the rest of the analysis focuses on them.
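
The average used here is simply the total number of comments divided by the number of posts of that type. As a hedged illustration, the sketch below computes it in one helper; `average_comments` and `sample_posts` are hypothetical, and the rows are made-up examples laid out like the dataset (title at index 1, `num_comments` at index 4).

```python
def average_comments(rows, prefix):
    """Mean of num_comments (index 4) for rows whose title starts with prefix."""
    counts = [int(row[4]) for row in rows if row[1].lower().startswith(prefix)]
    # Guard against division by zero when no title matches the prefix.
    return sum(counts) / len(counts) if counts else 0.0


# Hypothetical rows in the same column order as the dataset.
sample_posts = [
    ["1", "Ask HN: First question?", "", "2", "6", "user_a", "8/16/2016 9:55"],
    ["2", "Ask HN: Second question?", "", "28", "30", "user_b", "11/22/2015 13:43"],
    ["3", "Show HN: A side project", "http://example.com", "5", "4", "user_c", "1/26/2016 19:30"],
]

sample_ask_avg = average_comments(sample_posts, "ask hn")
sample_show_avg = average_comments(sample_posts, "show hn")
print(sample_ask_avg, sample_show_avg)
```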

Checking what a date entry looks like and what type it has.

In [8]:
for element in hn[:2]:
    print(element[-1])
    print(type(element[-1]))

8/4/2016 11:52
<class 'str'>
1/26/2016 19:30
<class 'str'>


Testing how to identify Ask HN posts and append them to a new list.

In [9]:
result_list = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        result_list.append(row)
        
print(result_list[:2])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']]


Building frequency tables of the number of Ask HN posts and the total number of comments for each hour of creation.

In [10]:
import datetime

counts_by_hour = {}
comments_by_hour = {}

for element in result_list:
    date_1 = element[-1]
    comments = int(element[4])
    
    date_fmt = datetime.datetime.strptime(date_1, "%m/%d/%Y %H:%M")
    #print(date_fmt)
    time_hr = datetime.datetime.strftime(date_fmt,"%H")
    #print(time_hr)
    if time_hr in counts_by_hour:
        counts_by_hour[time_hr] += 1
    else:
        counts_by_hour[time_hr] = 1
    if time_hr in comments_by_hour:
        comments_by_hour[time_hr] += comments
    else:
        comments_by_hour[time_hr] = comments
        
print(counts_by_hour)
print(comments_by_hour)


{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
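
The same two tallies can be written without the `if`/`else` branches by using `collections.defaultdict`. This is a sketch under stated assumptions rather than the notebook's method: `sample_created` pairs a `created_at` string with a comment count, its first two entries come from the Ask HN rows printed earlier, and the third is hypothetical.

```python
import datetime as dt
from collections import defaultdict

# (created_at, num_comments) pairs; the last pair is hypothetical.
sample_created = [
    ("8/16/2016 9:55", 6),
    ("11/22/2015 13:43", 29),
    ("5/2/2016 9:10", 4),
]

sample_counts_by_hour = defaultdict(int)
sample_comments_by_hour = defaultdict(int)

for created_at, num_comments in sample_created:
    # Parse the timestamp, then keep only the zero-padded hour, e.g. "09".
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")
    # Missing keys start at 0, so no membership test is needed.
    sample_counts_by_hour[hour] += 1
    sample_comments_by_hour[hour] += num_comments

print(dict(sample_counts_by_hour))
print(dict(sample_comments_by_hour))
```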


result_list.append(
        [post[6], int(post[4])]
    )
    
The snippet above shows a way to append two values to a list at once, as a single two-element list.

Making a dictionary of average comments by the hour.

In [11]:
avg_by_hour_dictionary = {}

# Both dictionaries share the same hour keys, so a single loop is enough.
for key in counts_by_hour:
    avg_by_hour_dictionary[key] = comments_by_hour[key]/counts_by_hour[key]
            
print(avg_by_hour_dictionary)

{'09': 5.5777777777777775, '13': 14.741176470588234, '10': 13.440677966101696, '14': 13.233644859813085, '16': 16.796296296296298, '23': 7.985294117647059, '12': 9.41095890410959, '17': 11.46, '15': 38.5948275862069, '21': 16.009174311926607, '20': 21.525, '02': 23.810344827586206, '18': 13.20183486238532, '03': 7.796296296296297, '05': 10.08695652173913, '19': 10.8, '01': 11.383333333333333, '22': 6.746478873239437, '08': 10.25, '04': 7.170212765957447, '00': 8.127272727272727, '06': 9.022727272727273, '07': 7.852941176470588, '11': 11.051724137931034}


Making a list of average comments instead of a dictionary.

In [12]:
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])
#appending two items in a list at the same time
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

Swapping the columns so that the list can be sorted by average and is easier to read.

In [13]:
swap_avg_by_hour = []

for element in avg_by_hour:
    swap_avg_by_hour.append([element[1],element[0]])
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap)

print(sorted_swap[:5])

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [
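
Swapping the columns is one way to sort by average; another is to pass a `key` function to `sorted`, which leaves each `[hour, average]` pair in its original order. The sketch below assumes a hypothetical `sample_avg_by_hour` list holding four of the averages from above, rounded to two decimals.

```python
# Four [hour, average] pairs from the results above, rounded for brevity.
sample_avg_by_hour = [
    ["09", 5.58],
    ["15", 38.59],
    ["02", 23.81],
    ["16", 16.80],
]

# Sort by the average (index 1) in descending order without swapping columns.
sample_top_hours = sorted(
    sample_avg_by_hour, key=lambda pair: pair[1], reverse=True
)

for hour, average in sample_top_hours:
    print(hour, average)
```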

Formatting the hours and rounding the averages to two decimal places.

In [14]:
print("Top 5 Hours for Ask Posts Comments")
for element in sorted_swap[:5]:
    hour = element[1]
    average = element[0]
    fmt_hr = datetime.datetime.strptime(hour,"%H").strftime("%H:%M")
    output = "{} : {avg:,.2f} average comments per post.".format(fmt_hr,avg= average)
    print(output)
    

Top 5 Hours for Ask Posts Comments
15:00 : 38.59 average comments per post.
02:00 : 23.81 average comments per post.
20:00 : 21.52 average comments per post.
16:00 : 16.80 average comments per post.
21:00 : 16.01 average comments per post.
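
The same output line can be produced with an f-string instead of `str.format`. This is a small sketch for a single entry, using the top hour and its average from the results above; the variable names `sample_hour`, `sample_average`, `hour_label`, and `formatted_line` are hypothetical.

```python
import datetime as dt

# Top entry from the sorted results above.
sample_hour, sample_average = "15", 38.5948275862069

# Turn the two-digit hour string into an "HH:MM" label.
hour_label = dt.datetime.strptime(sample_hour, "%H").strftime("%H:%M")

# The ",.2f" format spec rounds to two decimals and adds thousands separators.
formatted_line = f"{hour_label} : {sample_average:,.2f} average comments per post."
print(formatted_line)
```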


The best times to create an Ask HN post to maximize comments are 3 PM and 2 AM, followed by 8 PM (all times are in US Eastern Time, essentially New York time). The 3 PM and 8 PM results make sense: perhaps members of the HN community browse HN around 3 PM as they wrap up the day's work, and sit down at their computers for some leisure browsing at 8 PM after dinner. The 2 AM result is puzzling; it may reflect a 'night owl' part of the HN community that reads and comments on posts before going to bed.

# Conclusion:

The aim of this project was to determine which of the two post types, "Ask HN" or "Show HN", garners more comments from the HN community, and to identify the best times to post to receive the most comments.

We found that, on average, "Ask HN" posts garner more comments per post than "Show HN" posts. We also found that 3 - 4 PM US Eastern Time is the best time to create "Ask HN" posts to receive the most comments, with about 38.59 comments per post on average.