# Exploring Hacker News Posts

The aim of this project is to compare types of posts from **Hacker News**, a site where user-submitted, technology-related stories (posts) receive votes and comments from other users.  
  
The types of posts I'll be comparing are:
* **Ask HN**: posts that are submitted by users when they want to ask the Hacker News community a specific question,
* **Show HN**: posts that are submitted by users when they want to show the Hacker News community a project, product or something interesting.

## Opening the dataset

In [3]:
from csv import reader
import datetime as dt

In [4]:
# Read the whole CSV into a list of lists; the with-block closes the file afterwards.
with open("hacker_news.csv", newline="") as opened_file:
    hn = list(reader(opened_file))

In [5]:
def explore_data(dataset, start, end, rows_and_columns=False):
    
    """
    Displays a specified slice of the dataset and optionally prints the dataset's dimensions.
    
        Parameters:
            dataset (list of lists): The dataset to be explored, where each inner list represents a row.
            start (int): Starting index of the slice.
            end (int): Ending index of the slice (non-inclusive).
            rows_and_columns (bool, optional): If True, prints the number of rows and columns in the dataset.

        Returns:
            None
            
    """
        
    dataset_slice = dataset[start:end] 
    
    for row in dataset_slice:
        print(row)
        print('\n') 

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [6]:
hn_header = hn[0]
hn_data = hn[1:]

In [7]:
print(hn_header)
print("\n")
explore_data(hn_data, 0, 5, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


Number of rows: 20100
Number of columns: 7

As we can see, the dataset contains 20100 rows, each of which represents one post.  
  
Each row contains the following information:
* **id**: the unique identifier from Hacker News for the post
* **title**: the title of the post
* **url**: the URL that the post links to, if the post has a URL
* **num_points**: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* **num_comments**: the number of comments on the post
* **author**: the username of the person who submitted the post
* **created_at**: the date and time of the post's submission
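
The column order above matters because the rest of the analysis reads fields by position (for example, index 4 for `num_comments`). As a small illustrative sketch, a row can be paired with the header so that fields are read by name instead. The names `sample_row` and `post` are hypothetical and not part of the original notebook; the row is the first one printed above.

```python
# Header and first data row, copied from the output above.
hn_header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

# Pair each column name with its value so fields can be read by name.
post = dict(zip(hn_header, sample_row))

# Every value comes out of the CSV as a string, so numeric fields need int().
num_comments = int(post['num_comments'])
print(post['title'], num_comments)
```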

## Extracting Ask HN and Show HN posts

In [8]:
ask_posts = []
show_posts = []
other_posts = []

In [9]:
for row in hn_data:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
    

In [10]:
explore_data(ask_posts, 0, 5, True)

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']


['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']


['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']


['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']


['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']


Number of rows: 1744
Number of columns: 7


In [11]:
explore_data(show_posts, 0, 5, True)

['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']


['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']


['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']


['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11']


['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']


Number of rows: 1162
Number of columns: 7


In [12]:
explore_data(other_posts, 0, 5, True)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


Number of rows: 17194
Number of columns: 7


We can see that in our dataset there are: 
* 1744 Ask HN posts,
* 1162 Show HN posts,
* 17194 other posts.
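
The prefix check used in the loop above could also be wrapped in a small helper, which makes it easy to try on individual titles. This is only a sketch: `classify_title` and `sample_titles` are hypothetical names, and the titles are taken from the output shown earlier.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()  # lowercase once so the check is case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

sample_titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "Interactive Dynamic Video",
]
labels = [classify_title(title) for title in sample_titles]
print(labels)

# Sanity check: the three groups should add up to the full dataset.
print(1744 + 1162 + 17194)
```

The last line is a quick consistency check: the three counts sum to 20100, the number of rows in the dataset.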

## Calculating the average number of comments for Ask HN and Show HN posts

In [13]:
total_ask_comments = 0

In [14]:
for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments

In [15]:
avg_ask_comments = total_ask_comments / len(ask_posts)

In [16]:
avg_ask_comments

14.038417431192661

In [17]:
total_show_comments = 0 

In [18]:
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
    

In [19]:
avg_show_comments = total_show_comments / len(show_posts)

In [20]:
avg_show_comments

10.31669535283993

In [21]:
total_other_comments = 0

In [22]:
for post in other_posts:
    num_comments = int(post[4])
    total_other_comments += num_comments

In [23]:
avg_other_comments = total_other_comments / len(other_posts)
                                                

In [24]:
avg_other_comments

26.8730371059672

**Ask HN** posts receive about 14 comments on average, while **Show HN** posts receive about 10. Other posts receive about 27 comments on average, roughly twice as many as Ask HN posts and about 2.6 times as many as Show HN posts.
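
The three averages above are computed with the same total-then-divide pattern, so it could be expressed once as a function. The following is a minimal sketch, assuming the comment count sits at index 4 as in this dataset; `average_comments` and `sample_ask` are hypothetical names, and the rows are the first three Ask HN rows printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Mean of the comment counts in a list of rows; 0.0 for an empty list."""
    if not posts:
        return 0.0
    return sum(int(row[comments_index]) for row in posts) / len(posts)

# Three Ask HN rows copied from the output above: 6, 29 and 1 comments.
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]
print(average_comments(sample_ask))  # (6 + 29 + 1) / 3 = 12.0
```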

## Finding the Number of Ask HN Posts and Comments by Hour Created

In [25]:
result_list = []

In [26]:
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])
    

In [27]:
counts_by_hour = {}
comments_by_hour = {}

In [28]:
for result in result_list:
    date = result[0]
    date_dt = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(date_dt, "%H")
    num_comments = result[1]
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments

Now we have two dictionaries:
* **counts_by_hour**: the number of Ask HN posts created during each hour of the day
* **comments_by_hour**: the total number of comments received by the Ask HN posts created during each hour
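
As an aside, the same two tallies can be built without the `if hour not in ...` branch by using `collections.defaultdict`. This is an alternative sketch, not the notebook's method; `sample_results`, `counts` and `comments` are hypothetical names, and the date/comment pairs come from the five Ask HN rows printed earlier.

```python
import datetime as dt
from collections import defaultdict

# [created_at, num_comments] pairs from the five Ask HN rows shown above.
sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
]

counts = defaultdict(int)    # posts per hour; missing keys start at 0
comments = defaultdict(int)  # total comments per hour

for created_at, num_comments in sample_results:
    # Parse the timestamp, then keep only the zero-padded hour, e.g. '09'.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```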

In [29]:
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

## Calculating the Average Number of Comments for Ask HN Posts by Hour


Now I will calculate the average number of comments for Ask HN posts by hour.

In [30]:
avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

Next, it is time to sort these numbers in descending order.

In [33]:
swap_avg_by_hour = []
for hour in avg_by_hour:
    swap_avg_by_hour.append([hour[1], hour[0]])

In [35]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [36]:
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
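
The swap step above exists only so that `sorted` compares the averages first. An alternative sketch is to leave the `[hour, average]` pairs as they are and pass a `key` function to `sorted`. The name `sample_avg_by_hour` is hypothetical, and its four pairs are taken from the results above.

```python
# A few [hour, average] pairs taken from the results above.
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort on the second element (the average), largest first, without swapping.
sorted_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_avg:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```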

In [37]:
for row in sorted_swap:
    print(f"{dt.datetime.strptime(row[1], '%H').strftime('%H:%M')}: {row[0]:.2f} average comments per post")

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
13:00: 14.74 average comments per post
10:00: 13.44 average comments per post
14:00: 13.23 average comments per post
18:00: 13.20 average comments per post
17:00: 11.46 average comments per post
01:00: 11.38 average comments per post
11:00: 11.05 average comments per post
19:00: 10.80 average comments per post
08:00: 10.25 average comments per post
05:00: 10.09 average comments per post
12:00: 9.41 average comments per post
06:00: 9.02 average comments per post
00:00: 8.13 average comments per post
23:00: 7.99 average comments per post
07:00: 7.85 average comments per post
03:00: 7.80 average comments per post
04:00: 7.17 average comments per post
22:00: 6.75 average comments per post
09:00: 5.58 average comments per post


We can see that the hour with the most comments per post is 15:00, with an average of 38.59 comments per post. The hour with the fewest is 09:00, with an average of 5.58 comments per post.  
  
Let's see whether the same pattern holds for Show HN posts.

## Finding the Number of Show HN Posts and Comments by Hour Created

In [38]:
show_result_list = []

In [39]:
for post in show_posts:
    created_at = post[6]
    num_comments = int(post[4])
    show_result_list.append([created_at, num_comments])
    

In [40]:
show_counts_by_hour = {}
show_comments_by_hour = {}

In [42]:
for result in show_result_list:
    date = result[0]
    date_dt = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(date_dt, "%H")
    num_comments = result[1]
    if hour not in show_counts_by_hour:
        show_counts_by_hour[hour] = 1
        show_comments_by_hour[hour] = num_comments
    else:
        show_counts_by_hour[hour] += 1
        show_comments_by_hour[hour] += num_comments

In [43]:
show_comments_by_hour

{'14': 1156,
 '22': 570,
 '18': 962,
 '07': 299,
 '20': 612,
 '05': 58,
 '16': 1084,
 '19': 539,
 '15': 632,
 '03': 287,
 '17': 911,
 '06': 142,
 '02': 127,
 '13': 946,
 '08': 165,
 '21': 272,
 '04': 247,
 '11': 491,
 '12': 720,
 '23': 447,
 '09': 291,
 '01': 246,
 '10': 297,
 '00': 487}

## Calculating the Average Number of Comments for Show HN Posts by Hour

In [57]:
show_avg_by_hour = []
for hour in show_counts_by_hour:
    show_avg_by_hour.append([hour, show_comments_by_hour[hour] / show_counts_by_hour[hour]])

In [58]:
show_swap_avg_by_hour = []
for hour in show_avg_by_hour:
    show_swap_avg_by_hour.append([hour[1], hour[0]])

In [59]:
show_sorted_swap = sorted(show_swap_avg_by_hour, reverse=True)

In [62]:
for row in show_sorted_swap:
    print(f"{dt.datetime.strptime(row[1], '%H').strftime('%H:%M')}: {row[0]:.2f} average comments per post")

18:00: 15.77 average comments per post
00:00: 15.71 average comments per post
14:00: 13.44 average comments per post
23:00: 12.42 average comments per post
22:00: 12.39 average comments per post
12:00: 11.80 average comments per post
16:00: 11.66 average comments per post
07:00: 11.50 average comments per post
11:00: 11.16 average comments per post
03:00: 10.63 average comments per post
20:00: 10.20 average comments per post
19:00: 9.80 average comments per post
17:00: 9.80 average comments per post
09:00: 9.70 average comments per post
13:00: 9.56 average comments per post
04:00: 9.50 average comments per post
06:00: 8.88 average comments per post
01:00: 8.79 average comments per post
10:00: 8.25 average comments per post
15:00: 8.10 average comments per post
21:00: 5.79 average comments per post
08:00: 4.85 average comments per post
02:00: 4.23 average comments per post
05:00: 3.05 average comments per post


For Show HN posts, 15:00 is a far less popular hour: posts created at 15:00 received on average 8.10 comments, about 4.8 times fewer than Ask HN posts created at the same hour (38.59).  
  
The hour with the fewest comments per post is 05:00, with an average of 3.05.  
  
The hour with the most comments per post is 18:00, with an average of 15.77.  
  
We can also see that the best hour for Ask HN posts (15:00) draws about 2.4 times as many comments per post as the best hour for Show HN posts (18:00).
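
The comparisons above can be checked with a little arithmetic on the rounded averages printed earlier. This is only a quick sketch; the variable names are hypothetical.

```python
# Rounded averages taken from the printed results above.
ask_at_15 = 38.59   # Ask HN posts created at 15:00
show_at_15 = 8.10   # Show HN posts created at 15:00
show_best = 15.77   # Show HN posts created at 18:00 (their best hour)

# Ask HN vs Show HN at the same hour (15:00).
ratio_same_hour = ask_at_15 / show_at_15
# Best Ask HN hour vs best Show HN hour.
ratio_best_hours = ask_at_15 / show_best

print(f"{ratio_same_hour:.2f}")   # 4.76
print(f"{ratio_best_hours:.2f}")  # 2.45
```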