### A COMPARISON OF REACTIONS TO POSTS ON HACKER NEWS

The aim of this analysis was to find out which kind of posts receive more comments and points on average. A key factor that was considered was the time of post creation.

The dataset contains posts from Hacker News listings. This analysis focuses on two categories: ask posts and show posts. Ask posts put a specific question to the Hacker News community, while show posts present a project, a product, or something generally interesting.

The data can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

In [75]:
from csv import reader 

In [76]:
# Opening and reading the source file (the with block closes the file afterwards)

with open('hacker_news.csv', encoding='utf8') as opened_file:
    hn = list(reader(opened_file))

### Exploring the Data

In [77]:
#Displaying the first five rows

first_five_rows = hn [:5]
print (first_five_rows)

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [78]:
#Removing the header row

headers = hn [0]
print (headers)
print ('\n')
hn = hn [1:]
print (hn [:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


### Cleaning and Preparing the Data
Since we're mainly concerned with ask posts and show posts, whose titles begin with Ask HN or Show HN, we'll create separate lists of lists containing just the rows for those titles. All remaining rows go into a third list.

In [79]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row [1]
    title = title.lower ()  # lowercase so the prefix check is case-insensitive
                              
    if title.startswith ('ask hn'):
        ask_posts.append (row)
        
    elif title.startswith ('show hn'):
        show_posts.append (row)
        
    else:
        other_posts.append (row)
        
n_ask_posts = len (ask_posts)
print (n_ask_posts) 

n_show_posts = len (show_posts)
print (n_show_posts)

n_other_posts = len (other_posts)
print (n_other_posts)
print ('\n')

#Verifying the lengths
print (n_show_posts + n_ask_posts + n_other_posts)
print (len (hn))

1744
1162
17194


20100
20100
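
The prefix check above could also be wrapped in a small helper, which makes the rule easy to test on its own. This is only a sketch: `classify_title` is a hypothetical name that does not appear in the notebook, and the first two sample titles are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # case-insensitive match, as in the loop above
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# The first two titles are hypothetical; the third appears in the dataset output
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'Interactive Dynamic Video',
]
labels = [classify_title(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```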


### Analyzing the Data


#### 1. Which category of posts received the most comments on average?

In [80]:
total_ask_comments = 0

for row in ask_posts:
    comments = int (row[4])
    total_ask_comments = total_ask_comments + comments

avg_ask_comments = total_ask_comments/n_ask_posts
print (avg_ask_comments)
     

14.038417431192661


In [81]:
total_show_comments = 0

for row in show_posts:
    comments = int (row[4])
    total_show_comments = total_show_comments + comments
    
avg_show_comments = total_show_comments/n_show_posts
print (avg_show_comments)   

10.31669535283993


#### Conclusion 1:
On average, ask posts receive more comments than show posts (about 14.04 versus 10.32 comments per post).
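
The two averaging loops follow the same pattern, so they could be expressed as one function. The sketch below assumes a hypothetical helper named `average_column`; the three sample rows are copied from the notebook output shown earlier, so the result describes only those rows, not the full dataset.

```python
def average_column(rows, index):
    """Mean of the integer values stored as strings in column `index`."""
    if not rows:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[index]) for row in rows)
    return total / len(rows)


# Three rows copied from the output above (column 3 = points, column 4 = comments)
sample_rows = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke",
     'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
     '2', '1', 'vezycash', '6/23/2016 22:20'],
    ['11919867', 'Technology ventures: From Idea to Enterprise',
     'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
     '3', '1', 'hswarna', '6/17/2016 0:01'],
]

sample_avg_comments = average_column(sample_rows, 4)
print(sample_avg_comments)  # 18.0
```

With the notebook's own lists, `average_column (ask_posts, 4)` and `average_column (show_posts, 4)` would be expected to reproduce the two averages printed above.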

#### 2. What is the relationship between the time that an ask post is created and the number of comments it attracts?

First, isolate the two columns that are relevant at this stage ('created_at' and 'num_comments'):

In [82]:
result_list = []

for row in ask_posts:
    post_time = row [-1]
    comments = int (row[4])
    comments_time = [post_time, comments]
    result_list.append (comments_time)

print (result_list[:5])    

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]


Second, extract the hour from each date:

In [83]:
from datetime import datetime as dt

In [84]:
str_date_format = '%m/%d/%Y %H:%M' 
    
#change string date to datetime object
for row in result_list:
    post_time = row [0]
    post_time_object = dt.strptime (post_time, str_date_format)

#extract 'hour' from datetime object    
    post_hour = dt.strftime (post_time_object, '%H')
    row [0] = post_hour
    
print (result_list [:5])   

[['09', 6], ['13', 29], ['10', 1], ['14', 3], ['16', 17]]
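
As a side note, a parsed `datetime` object also exposes the hour directly as an integer through its `hour` attribute, which avoids the round trip through `strftime`. A minimal sketch, using the first timestamp from the output above:

```python
from datetime import datetime

date_format = '%m/%d/%Y %H:%M'
timestamp = '8/16/2016 9:55'  # first ask-post timestamp in the output above

parsed = datetime.strptime(timestamp, date_format)
hour_as_int = parsed.hour             # integer hour
hour_as_text = parsed.strftime('%H')  # zero-padded string, as used in the notebook

print(hour_as_int)   # 9
print(hour_as_text)  # 09
```

The notebook keeps the zero-padded string form, which sorts correctly as text and is reused later when formatting the hours.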


Third, count the ask posts created in each hour, along with the total number of comments they received:

In [85]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    hour = row [0]
    comments = row [1]
          
    if hour not in counts_by_hour:
        counts_by_hour [hour] = 1   
        comments_by_hour [hour] = comments
    else:  
        counts_by_hour [hour] += 1  
        comments_by_hour [hour] += comments
        
print (counts_by_hour)
print ('\n')
print (comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
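
The if/else bookkeeping in the loop above can be shortened with `collections.defaultdict`, which supplies a starting value of 0 for any hour not seen yet. The sketch below uses the first five pairs from the earlier output plus one made-up pair (marked in the comment) to show two posts accumulating in the same hour.

```python
from collections import defaultdict

# First five [hour, comments] pairs from the output above,
# plus one hypothetical pair for hour '09' to show accumulation
pairs = [['09', 6], ['13', 29], ['10', 1], ['14', 3], ['16', 17], ['09', 4]]

post_counts = defaultdict(int)     # hour -> number of posts
comment_totals = defaultdict(int)  # hour -> total comments

for hour, comments in pairs:
    post_counts[hour] += 1
    comment_totals[hour] += comments

print(dict(post_counts))     # {'09': 2, '13': 1, '10': 1, '14': 1, '16': 1}
print(dict(comment_totals))  # {'09': 10, '13': 29, '10': 1, '14': 3, '16': 17}
```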


Finally, calculate the average number of comments per ask post for each hour of the day:

In [86]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_comments_per_post= comments_by_hour[hour]/counts_by_hour [hour]
    avg_by_hour.append ([hour, avg_comments_per_post])

#Sorting the avg_by_hour list
swap_avg_by_hour = []
for row in avg_by_hour:
    avg_comments_per_post = row [1]
    hour = row [0]
    swap_avg_by_hour.append ([avg_comments_per_post, hour])

swap_avg_by_hour.sort (reverse = True)
sorted_swap = swap_avg_by_hour
print (sorted_swap) 

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
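
Swapping the columns before sorting works, but `sorted` with a `key` function gives the same ordering without building a second list. The sketch below assumes the averages are held in a dictionary named `hourly_avg` (a hypothetical name); the four values are taken from the output above and rounded to two decimals.

```python
# A few hourly averages from the output above, rounded to two decimals
hourly_avg = {'09': 5.58, '15': 38.59, '16': 16.80, '02': 23.81}

# Sort (hour, average) pairs by the average, highest first
ranked = sorted(hourly_avg.items(), key=lambda item: item[1], reverse=True)

print(ranked)
# [('15', 38.59), ('02', 23.81), ('16', 16.8), ('09', 5.58)]
```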


The top 5 hours by average comments:

In [87]:
print ("Top 5 Hours for Ask Posts Comments")
print ('\n')
for row in sorted_swap [:5]:
    avg_comments_per_post = row [0]
    hour = row [1]
    print (row)

Top 5 Hours for Ask Posts Comments


[38.5948275862069, '15']
[23.810344827586206, '02']
[21.525, '20']
[16.796296296296298, '16']
[16.009174311926607, '21']


Formatting the hours:

In [88]:
str_hour_format = '%H'

for row in sorted_swap:
    hour = row [1]
    hour_object = dt.strptime (hour, str_hour_format)
    formatted_hour = dt.strftime (hour_object, '%H:%M')
    row [1] = formatted_hour
print (sorted_swap [1:5])

[[23.810344827586206, '02:00'], [21.525, '20:00'], [16.796296296296298, '16:00'], [16.009174311926607, '21:00']]


Formatting the average comments per post:

In [89]:
for row in sorted_swap:
    avg_comments_per_post = row [0]
    formatted_avgs = "{:.2f}".format(avg_comments_per_post)
    row [0] = formatted_avgs
print (sorted_swap [1:5])

[['23.81', '02:00'], ['21.52', '20:00'], ['16.80', '16:00'], ['16.01', '21:00']]


In [90]:
for row in sorted_swap:
    final_avgs = row [0]
    final_hours = row [1]
    print ("{} : {} average comments per post."
           .format (final_hours, final_avgs))

15:00 : 38.59 average comments per post.
02:00 : 23.81 average comments per post.
20:00 : 21.52 average comments per post.
16:00 : 16.80 average comments per post.
21:00 : 16.01 average comments per post.
13:00 : 14.74 average comments per post.
10:00 : 13.44 average comments per post.
14:00 : 13.23 average comments per post.
18:00 : 13.20 average comments per post.
17:00 : 11.46 average comments per post.
01:00 : 11.38 average comments per post.
11:00 : 11.05 average comments per post.
19:00 : 10.80 average comments per post.
08:00 : 10.25 average comments per post.
05:00 : 10.09 average comments per post.
12:00 : 9.41 average comments per post.
06:00 : 9.02 average comments per post.
00:00 : 8.13 average comments per post.
23:00 : 7.99 average comments per post.
07:00 : 7.85 average comments per post.
03:00 : 7.80 average comments per post.
04:00 : 7.17 average comments per post.
22:00 : 6.75 average comments per post.
09:00 : 5.58 average comments per post.
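
The two formatting passes and the final print could also be combined in a single f-string, which leaves the underlying numbers untouched. A minimal sketch, using two rows from the sorted output above:

```python
# Two [average, hour] rows copied from the sorted output above
rows = [[38.5948275862069, '15'], [16.796296296296298, '16']]

# Format the hour and the average in one step, without modifying the rows
lines = [f"{hour}:00 : {avg:.2f} average comments per post."
         for avg, hour in rows]

for line in lines:
    print(line)
# 15:00 : 38.59 average comments per post.
# 16:00 : 16.80 average comments per post.
```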


#### Conclusion 2:
Based on the averages, the best time to create an ask post is between 3:00 PM and 3:59 PM EST. The least favorable time is between 9:00 AM and 9:59 AM EST. Each of these is an hour earlier in CST (2:00 PM and 8:00 AM, respectively), as the comparison of the two time zones in the code below shows:

In [91]:
from pytz import timezone
string_format = "%Y-%m-%d %H:%M:%S %Z%z"
# Current time in the US/Eastern time zone
now_eastern = dt.now(timezone('US/Eastern'))
print (now_eastern.strftime(string_format))
# Convert to the US/Central time zone
now_central = now_eastern.astimezone(timezone('US/Central'))
print (now_central.strftime(string_format))

2020-01-26 04:36:29 EST-0500
2020-01-26 03:36:29 CST-0600
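
The same offset can be checked for a specific posting hour instead of the current time. The sketch below uses only the standard library with fixed standard-time offsets, so it is an assumption-laden simplification: daylight saving time is ignored, and the date is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed standard-time offsets (assumption: daylight saving time is ignored)
EASTERN_STANDARD = timezone(timedelta(hours=-5), 'EST')
CENTRAL_STANDARD = timezone(timedelta(hours=-6), 'CST')

# Hypothetical date; 15:00 is the top hour for ask-post comments
eastern_time = datetime(2016, 1, 26, 15, 0, tzinfo=EASTERN_STANDARD)
central_time = eastern_time.astimezone(CENTRAL_STANDARD)

print(eastern_time.strftime('%H:%M %Z'))  # 15:00 EST
print(central_time.strftime('%H:%M %Z'))  # 14:00 CST
```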


#### 3. What category of posts received more points on average?

In [92]:
total_ask_points = 0

for row in ask_posts:
    points = int (row[3])
    total_ask_points = total_ask_points + points

avg_ask_points = total_ask_points/n_ask_posts
print (avg_ask_points)

15.061926605504587


In [93]:
total_show_points = 0

for row in show_posts:
    points = int (row[3])
    total_show_points = total_show_points + points
    
avg_show_points = total_show_points/n_show_posts
print (avg_show_points)   

27.555077452667813


#### Conclusion 3:
On average, show posts receive more points than ask posts (about 27.56 versus 15.06 points per post).

### Further Analysis
#### 4. What is the relationship between the time that a show post is created and the number of points it receives?

In [94]:
points_list = []

for row in show_posts:
    post_time = row [-1]
    points = int (row[3])
    points_time = [post_time, points]
    points_list.append (points_time)

print (points_list[:5])  

[['11/25/2015 14:03', 26], ['11/29/2015 22:46', 747], ['4/28/2016 18:05', 1], ['7/28/2016 7:11', 3], ['1/9/2016 20:45', 1]]


Then extract the hour from each date:

In [95]:
str_date_format = '%m/%d/%Y %H:%M' 
    
#change string date to datetime object
for row in points_list:
    post_time = row [0]
    post_time_object = dt.strptime (post_time, str_date_format)

#extract 'hour' from datetime object    
    post_hour = dt.strftime (post_time_object, '%H')
    row [0] = post_hour
    
print (points_list [:5]) 

[['14', 26], ['22', 747], ['18', 1], ['07', 3], ['20', 1]]


Calculate the number of show posts created per hour, along with the number of points they received:

In [96]:
counts_by_hour = {}
points_by_hour = {}

for row in points_list:
    hour = row [0]
    show_points = row [1]
          
    if hour not in counts_by_hour:
        counts_by_hour [hour] = 1   
        points_by_hour [hour] = show_points
    else:  
        counts_by_hour [hour] += 1  
        points_by_hour [hour] += show_points
        
print (counts_by_hour)
print ('\n')
print (points_by_hour)

{'14': 86, '22': 46, '18': 61, '07': 26, '20': 60, '05': 19, '16': 93, '19': 55, '15': 78, '03': 27, '17': 93, '06': 16, '02': 30, '13': 99, '08': 34, '21': 47, '04': 26, '11': 44, '12': 61, '23': 36, '09': 30, '01': 28, '10': 36, '00': 31}


{'14': 2187, '22': 1856, '18': 2215, '07': 494, '20': 1819, '05': 104, '16': 2634, '19': 1702, '15': 2228, '03': 679, '17': 2521, '06': 375, '02': 340, '13': 2438, '08': 519, '21': 866, '04': 386, '11': 1480, '12': 2543, '23': 1526, '09': 553, '01': 700, '10': 681, '00': 1173}


Get the average number of points per show post for each hour of the day:

In [97]:
avg_points_by_hour = []

for hour in points_by_hour:
    avg_points_per_post= points_by_hour[hour]/counts_by_hour [hour]
    avg_points_by_hour.append ([hour, avg_points_per_post])

#sorting the swap_avg_points_by_hour    
swap_avg_points_by_hour = []

for row in avg_points_by_hour:
    avg_points_per_post = row [1]
    hour = row [0]
    swap_avg_points_by_hour.append ([avg_points_per_post, hour])
swap_avg_points_by_hour.sort (reverse = True)
sorted_points = swap_avg_points_by_hour
print (sorted_points)

[[42.388888888888886, '23'], [41.68852459016394, '12'], [40.34782608695652, '22'], [37.83870967741935, '00'], [36.31147540983606, '18'], [33.63636363636363, '11'], [30.945454545454545, '19'], [30.316666666666666, '20'], [28.564102564102566, '15'], [28.322580645161292, '16'], [27.107526881720432, '17'], [25.430232558139537, '14'], [25.14814814814815, '03'], [25.0, '01'], [24.626262626262626, '13'], [23.4375, '06'], [19.0, '07'], [18.916666666666668, '10'], [18.433333333333334, '09'], [18.425531914893618, '21'], [15.264705882352942, '08'], [14.846153846153847, '04'], [11.333333333333334, '02'], [5.473684210526316, '05']]


In [98]:
print ("Top 5 Hours for Show Posts Points")
print ('\n')
for row in sorted_points [:5]:
    avg_points_per_post  = row [0]
    hour = row [1]
    print (row)

Top 5 Hours for Show Posts Points


[42.388888888888886, '23']
[41.68852459016394, '12']
[40.34782608695652, '22']
[37.83870967741935, '00']
[36.31147540983606, '18']


#### Conclusion 4:
Based on the averages, the show posts that receive the most points are created between 11:00 PM and 11:59 PM EST. The posts that receive the fewest points are created between 5:00 AM and 5:59 AM EST.

#### 5. What is the relationship between the time that an ask post is created and the number of points it receives?

In [99]:
ask_points_list = []

for row in ask_posts:
    post_time = row [-1]
    points = int (row[3])
    ask_points_time = [post_time, points]
    ask_points_list.append (ask_points_time)

print (ask_points_list[:5])  

[['8/16/2016 9:55', 2], ['11/22/2015 13:43', 28], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 1], ['10/15/2015 16:38', 28]]


In [100]:
str_date_format = '%m/%d/%Y %H:%M' 
    
#change string date to datetime object
for row in ask_points_list:
    post_time = row [0]
    post_time_object = dt.strptime (post_time, str_date_format)

#extract 'hour' from datetime object    
    post_hour = dt.strftime (post_time_object, '%H')
    row [0] = post_hour
    
print (ask_points_list [:5])  

[['09', 2], ['13', 28], ['10', 1], ['14', 1], ['16', 28]]


In [101]:
counts_by_hour = {}
ask_points_by_hour = {}

for row in ask_points_list:
    hour = row [0]
    ask_points = row [1]
          
    if hour not in counts_by_hour:
        counts_by_hour [hour] = 1   
        ask_points_by_hour [hour] = ask_points
    else:  
        counts_by_hour [hour] += 1  
        ask_points_by_hour [hour] += ask_points
        
print (counts_by_hour)
print ('\n')
print (ask_points_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


{'09': 329, '13': 2062, '10': 1102, '14': 1282, '16': 2522, '23': 581, '12': 782, '17': 1941, '15': 3479, '21': 1721, '20': 1151, '02': 793, '18': 1741, '03': 374, '05': 552, '19': 1513, '01': 700, '22': 511, '08': 515, '04': 389, '00': 451, '06': 591, '07': 361, '11': 825}


What is the average number of points received on the ask posts created in each hour of the day?

In [102]:
avg_ask_points_by_hour = []

for hour in ask_points_by_hour:
    ask_points_per_post= ask_points_by_hour[hour]/counts_by_hour [hour]
    avg_ask_points_by_hour.append ([hour, ask_points_per_post])

# sorting avg_ask_points_by_hour
swap_ask_points_by_hour = []

for row in avg_ask_points_by_hour:
    ask_points_per_post = row [1]
    hour = row [0]
    swap_ask_points_by_hour.append ([ask_points_per_post, hour])
swap_ask_points_by_hour.sort (reverse = True)
sorted_ask_points = swap_ask_points_by_hour
print (sorted_ask_points)

[[29.99137931034483, '15'], [24.258823529411764, '13'], [23.35185185185185, '16'], [19.41, '17'], [18.677966101694917, '10'], [15.972477064220184, '18'], [15.788990825688073, '21'], [14.3875, '20'], [14.224137931034482, '11'], [13.754545454545454, '19'], [13.672413793103448, '02'], [13.431818181818182, '06'], [12.0, '05'], [11.981308411214954, '14'], [11.666666666666666, '01'], [10.729166666666666, '08'], [10.712328767123287, '12'], [10.617647058823529, '07'], [8.544117647058824, '23'], [8.27659574468085, '04'], [8.2, '00'], [7.311111111111111, '09'], [7.197183098591549, '22'], [6.925925925925926, '03']]


In [103]:
print ("Top 5 Hours for Ask Posts Points")
print ('\n')
for row in sorted_ask_points [:5]:
    ask_points_per_post = row [0]
    hour = row [1]
    print (row)

Top 5 Hours for Ask Posts Points


[29.99137931034483, '15']
[24.258823529411764, '13']
[23.35185185185185, '16']
[19.41, '17']
[18.677966101694917, '10']


#### 6. Other Posts

In [104]:
total_other_comments = 0

for row in other_posts:
    other_comments = int (row[4])
    total_other_comments = total_other_comments + other_comments

avg_other_comments = total_other_comments/n_other_posts

print ("Other Posts Average Number of Comments :")
print ('\n')
print (avg_other_comments)

Other Posts Average Number of Comments :


26.8730371059672


In [105]:
total_other_points = 0

for row in other_posts:
    other_points = int (row[3])
    total_other_points = total_other_points + other_points

avg_other_points = total_other_points/n_other_posts

print ("Other Posts Average Number of Points :")
print ('\n')
print (avg_other_points)

Other Posts Average Number of Points :


55.4067698034198
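
To compare the three categories side by side, the averages printed throughout the notebook can be gathered into one small table. The sketch below hard-codes those averages, rounded to two decimals, rather than recomputing them; the dictionary name `averages` is an assumption for illustration.

```python
# Averages copied from the notebook output, rounded to two decimals
averages = {
    'ask':   {'comments': 14.04, 'points': 15.06},
    'show':  {'comments': 10.32, 'points': 27.56},
    'other': {'comments': 26.87, 'points': 55.41},
}

# Print a simple aligned table
print(f"{'category':<10}{'comments':>10}{'points':>10}")
for category, stats in averages.items():
    print(f"{category:<10}{stats['comments']:>10.2f}{stats['points']:>10.2f}")

# Category with the highest average for each measure
most_commented = max(averages, key=lambda c: averages[c]['comments'])
most_points = max(averages, key=lambda c: averages[c]['points'])
print(most_commented, most_points)  # other other
```

On these figures, posts outside the ask and show categories receive the most comments and the most points on average.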
