# Exploring Hacker News Posts
***
This is a guided project from Dataquest's curriculum, completed for personal development.
***
Hacker News is a Reddit-like website that is popular in technology and startup circles. There are many types of posts on Hacker News, for example, sharing information or news, asking questions, or showing interesting projects.

This project analyzes posts whose titles begin with *Ask HN* or *Show HN* to determine which type receives more comments on average and whether posts created at a particular time of day receive more comments. A title beginning with *Ask HN* is used when a user wants to ask the community a question, and a title beginning with *Show HN* is used when a user wants to show a project or something else of interest.


## Step 1 - Data preparation

First, the data is downloaded from <a href="https://www.kaggle.com/hacker-news/hacker-news-posts">Kaggle</a> in CSV format. The data will be imported, cleaned, and separated into 3 groups (ask posts, show posts, and other posts) per our goal.

### Step 1a - Import data
The data set contains Hacker News posts from a 12-month period (up to 26 September 2016). It has 7 columns.

|  Column name  | Description                              |
|---------------|------------------------------------------|
| id            | ID of the post                           |
| title         | Title of the post                        |
| url           | URL of the item being linked to          |
| num_points    | Number of votes the post received        |
| num_comments  | Number of comments the post received     |
| author        | User that submitted the post             |
| created_at    | Date and time the post was made (in EST) |
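
The code in this project refers to columns by position (for example, `row[4]` for `num_comments`). As a minimal sketch of an alternative, `csv.DictReader` exposes each row by column name. The in-memory `sample_csv` below is an assumption for illustration only; it stands in for the real file and holds one row taken from the data set.

```python
import csv
import io

# Hypothetical in-memory sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12578908,Ask HN: What TLD do you use for local development?,,4,7,Sevrene,9/26/2016 2:53\n"
)

rows = list(csv.DictReader(sample_csv))
first = rows[0]

# Columns are read by name instead of by index.
num_comments = int(first["num_comments"])
print(first["title"], num_comments)
```

Reading by name makes the later filtering steps less dependent on column order, at the cost of slightly more verbose rows.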

In [1]:
# Print the rows of a data set slice, one row per line.
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)

In [2]:
from csv import reader
# 1. Open the file and convert its rows into a list (the file is closed automatically)
with open("HN_posts_year_to_Sep_26_2016.csv", encoding="utf-8") as data:
    hn = list(reader(data))
# 2. Separate the header row from the data rows
headers = hn[0]
hn = hn[1:]

print(headers)
print()
explore_data(hn, 0 , 5)
print()
print(f"Number of rows: {len(hn):,.0f}")
print(f"Number of columns: {len(hn[0]):,.0f}")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']
['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']

Number of rows: 293,

### Step 1b - Clean data
First, we check that every row contains all columns. Then we remove posts with zero comments, since we focus only on posts that received at least one comment.

In [3]:
# Check that each row contains all columns
col_row = True

for index, row in enumerate(hn):
    if len(headers) != len(row):
        # Report every row whose column count differs from the header
        print(row, "\n")
        print("Index of the row is " + str(index))
        col_row = False

print("Check ended.")
print(f"Each rows contain all columns: {col_row}")

Check ended.
Each rows contain all columns: True
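
A more compact way to express the same check is to collect the indexes of rows whose length differs from the header. This is only a sketch on a small made-up list: `sample_headers` and `sample_rows` are hypothetical stand-ins for `headers` and `hn`.

```python
# Hypothetical stand-ins for `headers` and `hn`.
sample_headers = ["id", "title", "num_comments"]
sample_rows = [
    ["1", "first post", "0"],
    ["2", "second post"],  # malformed: one column missing
    ["3", "third post", "5"],
]

# Indexes of rows whose column count differs from the header.
bad_indexes = [
    index
    for index, row in enumerate(sample_rows)
    if len(row) != len(sample_headers)
]

print(f"All rows complete: {not bad_indexes}")
print(f"Malformed row indexes: {bad_indexes}")
```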


In [4]:
# Remove posts with zero comments
hn_final = []

for row in hn:
    if int(row[4]) != 0:
        hn_final.append(row)

print(f"Hacker News posts with comment(s): {len(hn_final):,.0f} posts")
print()
explore_data(hn_final, 0, 3)

Hacker News posts with comment(s): 80,401 posts

['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']
['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']
['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26']


### Step 1c - Separate data
The data set will now be separated into 3 groups: ask posts, show posts, and other posts. For our analysis, we will focus only on ask posts and show posts.

In [5]:
# Separate Ask HN, Show HN and other posts into different lists
ask_posts = []
show_posts = []
other_posts = []

for row in hn_final:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

assert len(hn_final) == len(ask_posts) + len(show_posts) + len(other_posts)

print(f"Ask HN : {len(ask_posts):,.0f} posts")
explore_data(ask_posts, 0, 3)
print()
print(f"Show HN : {len(show_posts):,.0f} posts")
explore_data(show_posts, 0, 3)
print()
print(f"Other posts : {len(other_posts):,.0f} posts")
explore_data(other_posts, 0, 3)
print()

Ask HN : 6,911 posts
['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']
['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']
['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']

Show HN : 5,059 posts
['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']
['12576813', 'Show HN: Learn Japanese Vocab via multiple choice questions', 'http://japanese.vul.io/', '1', '1', 'soulchild37', '9/25/2016 19:06']
['12576090', 'Show HN: Markov chain Twitter bot. Trained on comments left on Pornhub', 'https://twitter.com/botsonasty', '3', '1', 'keepingscore', '9/25/2016 16:50']

Other posts : 68,431 posts
['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/wha
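
The grouping rule relies on a case-insensitive prefix match. As a small sketch of that rule in isolation, the hypothetical helper `classify` below (not part of the original notebook) labels a few titles; the last title is invented to show that a lower-case prefix is still matched.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Titles taken from the sample output, plus one hypothetical variant.
titles = [
    "Ask HN: What TLD do you use for local development?",
    "Show HN: Learn Japanese Vocab via multiple choice questions",
    "Saving the Hassle of Shopping",
    "ask hn: lower-case prefix",  # hypothetical title
]

labels = [classify(t) for t in titles]
print(labels)
```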

## Step 2 - Data analysis
Our analysis aims to answer 2 questions:

1. Which group has more comments on average?
2. Is there a point in time where posts receive more comments?

### Which group has more comments on average?
For each group, the average number of comments is calculated, and the two averages are then compared.

In [6]:
# Find total number of comments for Ask HN posts
total_ask_comments = 0

for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments

print(f"Total comments in Ask HN: {total_ask_comments:,.0f} comments")

# Find average number of comments for Ask HN posts
avg_ask_comments = total_ask_comments / len(ask_posts)

print(f"Average comments per an Ask HN post: {avg_ask_comments:,.0f} comments")

Total comments in Ask HN: 94,986 comments
Average comments per an Ask HN post: 14 comments


In [7]:
# Find total number of comments for Show HN posts
total_show_comments = 0

for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments

print(f"Total comments in Show HN: {total_show_comments:,.0f} comments")

# Find average number of comments for Show HN posts
avg_show_comments = total_show_comments / len(show_posts)

print(f"Average comments per a Show HN post: {avg_show_comments:,.0f} comments")

Total comments in Show HN: 49,633 comments
Average comments per a Show HN post: 10 comments


On average, *Ask HN* posts receive more comments than *Show HN* posts: about 14 versus 10 comments per post.
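
The averages printed above are rounded to whole numbers. As a quick check that uses only the totals and post counts reported in the outputs, the unrounded values can be recomputed:

```python
# Totals and post counts copied from the outputs above.
total_ask_comments, ask_count = 94_986, 6_911
total_show_comments, show_count = 49_633, 5_059

avg_ask = total_ask_comments / ask_count
avg_show = total_show_comments / show_count

print(f"Ask HN : {avg_ask:.2f} comments per post")
print(f"Show HN: {avg_show:.2f} comments per post")
print(f"Ratio  : {avg_ask / avg_show:.2f}")
```

To two decimals the averages are 13.74 and 9.81, so the gap is slightly smaller than the rounded figures suggest.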

### Is there a point in time where posts receive more comments?
Here we use the ask posts as our sample data set and analyze the hour of day at which ask posts receive the most comments on average.

In [8]:
# Create a new list containing selected columns (created_at and num_comments) for the analysis
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[-1]
    comments = int(row[4])
    result = [created_at, comments]
    result_list.append(result)

explore_data(result_list, 0, 3)

['9/26/2016 2:53', 7]
['9/26/2016 1:17', 3]
['9/25/2016 22:48', 3]
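
Each `created_at` value is a string such as `'9/26/2016 2:53'`. A minimal sketch of how such a string can be turned into an hour bucket with `datetime.strptime`, using the three sample rows printed above:

```python
import datetime as dt

# Sample rows copied from the output above: [created_at, num_comments].
sample = [
    ["9/26/2016 2:53", 7],
    ["9/26/2016 1:17", 3],
    ["9/25/2016 22:48", 3],
]

dt_format = "%m/%d/%Y %H:%M"  # month/day/year hour:minute, 24-hour clock

hours = []
for created_at, comments in sample:
    parsed = dt.datetime.strptime(created_at, dt_format)
    hour_key = parsed.strftime("%H")  # zero-padded hour, e.g. "02"
    hours.append(hour_key)
    print(hour_key, comments)
```

The zero-padded hour string is what the dictionaries in the next cell use as their key.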


In [9]:
# Create 2 dictionaries to keep track of (1) number of posts created during each hour, and (2) number of comments for posts created during each hour.
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    # number of posts created during each hour
    created_dt = row[0]
    dt_format = "%m/%d/%Y %H:%M"
    created_dt = dt.datetime.strptime(created_dt, dt_format)
    created_hour = dt.datetime.strftime(created_dt, "%H")
    counts_by_hour.setdefault(created_hour, 0)
    counts_by_hour[created_hour] += 1
    # number of comments for posts created during each hour
    comments = row[1]
    comments_by_hour.setdefault(created_hour, 0)
    comments_by_hour[created_hour] += comments

check_tot_posts = 0
for k, v in counts_by_hour.items():
    check_tot_posts += v

check_tot_comments = 0
for k, v in comments_by_hour.items():
    check_tot_comments += v

assert check_tot_posts == len(ask_posts)
assert check_tot_comments == total_ask_comments

print("Number of posts created during each hour:")
print(counts_by_hour)
print()
print("Number of comments for posts created during each hour:")
print(comments_by_hour)

Number of posts created during each hour:
{'02': 227, '01': 223, '22': 287, '21': 407, '19': 420, '17': 404, '15': 467, '14': 378, '13': 326, '11': 251, '10': 219, '09': 176, '07': 157, '03': 212, '16': 415, '08': 190, '00': 231, '23': 276, '20': 392, '18': 452, '12': 274, '04': 186, '06': 176, '05': 165}

Number of comments for posts created during each hour:
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '16': 4466, '08': 2362, '00': 2277, '23': 2297, '20': 4462, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


In [10]:
# Calculate average number of comments for posts created during each hour
avg_by_hour = []

for hour in comments_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])

# Swap the row to sort by average number of comments
swap_avg_by_hour = []

for row in range(len(avg_by_hour)):
    swap = avg_by_hour[row][1], avg_by_hour[row][0]
    swap_avg_by_hour.append(swap)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Show Top 5 result
print("Top 5 hours for Ask HN comments (on average):")
for row in sorted_swap[0:5]:
    dt_hour = row[1]
    hour_format = "%H"
    dt_hour1 = dt.datetime.strptime(dt_hour, hour_format)
    dt_hour2 = dt.datetime.strftime(dt_hour1, "%H:00")
    print(f"{dt_hour2} >> {row[0]:,.2f} comments per post")

Top 5 hours for Ask HN comments (on average):
15:00 >> 39.67 comments per post
13:00 >> 22.22 comments per post
12:00 >> 15.45 comments per post
10:00 >> 13.76 comments per post
17:00 >> 13.73 comments per post


Based on the data above, *Ask HN* posts created during the 15:00 hour (EST) receive the most comments on average, so a user seeking more comments or answers should post around that time.
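
The swap step in the cell above exists only so that `sorted` compares the averages first. As an alternative sketch, `sorted` accepts a `key` function, which avoids building the swapped list. The figures below are a subset of the `counts_by_hour` and `comments_by_hour` output shown earlier.

```python
# Subset of the per-hour figures printed above (hour -> value).
counts_by_hour = {"15": 467, "13": 326, "12": 274, "10": 219, "17": 404, "09": 176}
comments_by_hour = {"15": 18525, "13": 7245, "12": 4234, "10": 3013, "17": 5547, "09": 1477}

avg_by_hour = [
    (hour, comments_by_hour[hour] / counts_by_hour[hour])
    for hour in comments_by_hour
]

# Sort on the average (second element of each pair), largest first.
top = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:5]:
    print(f"{hour}:00 >> {avg:,.2f} comments per post")
```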

## Conclusion
This project analyzed posts whose titles begin with *Ask HN* or *Show HN*, comparing which type receives more comments on average and whether posts created at a particular time receive more comments.

Per our analysis, *Ask HN* posts receive more comments on average than *Show HN* posts, and a user should submit an *Ask HN* post around 15:00 (EST) to get more comments.
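
Because `created_at` is recorded in EST, readers elsewhere may want the equivalent hour in another time zone. A small sketch, assuming EST can be treated as a fixed UTC-5 offset (no daylight-saving adjustment) and using an arbitrary example date:

```python
import datetime as dt

# Assumption: EST treated as a fixed UTC-5 offset, ignoring daylight saving.
EST = dt.timezone(dt.timedelta(hours=-5), name="EST")

# Arbitrary example date; only the hour matters here.
best_hour_est = dt.datetime(2016, 1, 15, 15, 0, tzinfo=EST)
best_hour_utc = best_hour_est.astimezone(dt.timezone.utc)

print(best_hour_est.strftime("%H:%M %Z"))
print(best_hour_utc.strftime("%H:%M %Z"))
```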

## Next steps
### a. Calculate percentage of posts in each group that receive comments
There are 293,119 posts in our data set, but only 80,401 of them have comments. We would also like to know the percentage of posts that received comments for both *Ask HN* and *Show HN*.

In [11]:
# Find total number of posts for each group
ask_posts_all = []
show_posts_all = []
other_posts_all = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts_all.append(row)
    elif title.lower().startswith("show hn"):
        show_posts_all.append(row)
    else:
        other_posts_all.append(row)

assert len(hn) == len(ask_posts_all) + len(show_posts_all) + len(other_posts_all)

# Note: the sample rows printed below come from the comment-filtered lists
# (ask_posts, show_posts, other_posts), not from the *_all lists.
print(f"Total Ask HN : {len(ask_posts_all):,.0f} posts")
explore_data(ask_posts, 0, 3)
print()
print(f"Total Show HN : {len(show_posts_all):,.0f} posts")
explore_data(show_posts, 0, 3)
print()
print(f"Total Other posts : {len(other_posts_all):,.0f} posts")
explore_data(other_posts, 0, 3)
print()

Total Ask HN : 9,139 posts
['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']
['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']
['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']

Total Show HN : 10,158 posts
['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']
['12576813', 'Show HN: Learn Japanese Vocab via multiple choice questions', 'http://japanese.vul.io/', '1', '1', 'soulchild37', '9/25/2016 19:06']
['12576090', 'Show HN: Markov chain Twitter bot. Trained on comments left on Pornhub', 'https://twitter.com/botsonasty', '3', '1', 'keepingscore', '9/25/2016 16:50']

Total Other posts : 273,822 posts
['12578975', 'Saving the Hassle of Shopping', 'https://blog.mensw

In [12]:
# Calculate percentage of posts with comment(s)

perc_ask_post = (len(ask_posts) / len(ask_posts_all)) * 100
perc_show_post = (len(show_posts) / len(show_posts_all)) * 100
perc_other_post = (len(other_posts) / len(other_posts_all)) * 100

print(f"Rate which Ask HN posts received comment(s): {perc_ask_post:,.2f}%")
print(f"Rate which Show HN posts received comment(s): {perc_show_post:,.2f}%")
print(f"Rate which other posts received comment(s): {perc_other_post:,.2f}%")

Rate which Ask HN posts received comment(s): 75.62%
Rate which Show HN posts received comment(s): 49.80%
Rate which other posts received comment(s): 24.99%


The percentages above strengthen our conclusion that an *Ask HN* post should be submitted to get comments: about 76% of all *Ask HN* posts received comments, compared with about 50% of *Show HN* posts. Surprisingly, only about 25% of other posts received comments.
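
The rates can be reproduced from the post counts printed in the earlier outputs. A short check, with the counts hard-coded from those outputs:

```python
# (posts with comments, all posts), copied from the outputs above.
groups = {
    "Ask HN": (6_911, 9_139),
    "Show HN": (5_059, 10_158),
    "Other": (68_431, 273_822),
}

rates = {
    name: with_comments / total * 100
    for name, (with_comments, total) in groups.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f}%")

# The group sizes should add up to the full data set.
total_posts = sum(total for _, total in groups.values())
print(f"Total posts: {total_posts:,}")
```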

### b. Determine whether Ask HN or Show HN posts receive more points on average

In [13]:
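# Keep only posts with non-zero points. The count printed below equals the
# full data set (293,119), so the *_all lists are used unfiltered afterwards.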
hn_points = []

for row in hn:
    if int(row[3]) != 0:
        hn_points.append(row)

print(f"Hacker News posts with point(s): {len(hn_points):,.0f} posts")

Hacker News posts with point(s): 293,119 posts


In [14]:
# Find total number of points for Ask HN posts
total_ask_points = 0

for row in ask_posts_all:
    points = int(row[3])
    total_ask_points += points

print(f"Total points in Ask HN: {total_ask_points:,.0f} points")

# Find average number of points for Ask HN posts
avg_ask_points = total_ask_points / len(ask_posts_all)

print(f"Average points per an Ask HN post: {avg_ask_points:,.0f} points")

Total points in Ask HN: 103,378 points
Average points per an Ask HN post: 11 points


In [15]:
# Find total number of points for Show HN posts
total_show_points = 0

for row in show_posts_all:
    points = int(row[3])
    total_show_points += points

print(f"Total points in show HN: {total_show_points:,.0f} points")

# Find average number of points for Show HN posts
avg_show_points = total_show_points / len(show_posts_all)

print(f"Average points per an show HN post: {avg_show_points:,.0f} points")

Total points in show HN: 150,781 points
Average points per an show HN post: 15 points


On average, *Show HN* posts receive more points than *Ask HN* posts: about 15 versus 11 points per post.

In conclusion, a user who wants more comments should create an *Ask HN* post, and a user who wants more votes should create a *Show HN* post.

### c. Determine the time at which posts receive more points

In [17]:
# Create a new list containing selected columns (created_at and num_points) for the analysis
import datetime as dt

# Ask posts
point_ask_list = []

for row in ask_posts_all:
    created_at = row[-1]
    points = int(row[3])
    result = [created_at, points]
    point_ask_list.append(result)

print("Ask posts:")
explore_data(point_ask_list, 0, 3)
print()

# Show posts
point_show_list = []

for row in show_posts_all:
    created_at = row[-1]
    points = int(row[3])
    result = [created_at, points]
    point_show_list.append(result)

print("Show posts:")
explore_data(point_show_list, 0, 3)


Ask posts:
['9/26/2016 2:53', 4]
['9/26/2016 1:17', 6]
['9/25/2016 22:57', 1]

Show posts:
['9/26/2016 0:36', 2]
['9/26/2016 0:01', 1]
['9/25/2016 23:44', 1]


In [28]:
# Ask posts

# Create 2 dictionaries to keep track of (1) number of posts created during each hour, and (2) number of points for posts created during each hour.
counts_ask_by_hour = {}
points_ask_by_hour = {}

for row in point_ask_list:
    # number of posts created during each hour
    created_dt = row[0]
    dt_format = "%m/%d/%Y %H:%M"
    created_dt = dt.datetime.strptime(created_dt, dt_format)
    created_hour = dt.datetime.strftime(created_dt, "%H")
    counts_ask_by_hour.setdefault(created_hour, 0)
    counts_ask_by_hour[created_hour] += 1
    # number of points for posts created during each hour
    points = row[1]
    points_ask_by_hour.setdefault(created_hour, 0)
    points_ask_by_hour[created_hour] += points

check_tot_posts = 0
for k, v in counts_ask_by_hour.items():
    check_tot_posts += v

check_tot_points = 0
for k, v in points_ask_by_hour.items():
    check_tot_points += v

assert check_tot_posts == len(ask_posts_all)
assert check_tot_points == total_ask_points

print("Number of ask posts created during each hour:")
print(counts_ask_by_hour)
print()
print("Number of points for ask posts created during each hour:")
print(points_ask_by_hour)

Number of ask posts created during each hour:
{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}

Number of points for ask posts created during each hour:
{'02': 2944, '01': 2662, '22': 3601, '21': 5042, '19': 4782, '17': 7155, '15': 13978, '14': 5390, '13': 7962, '11': 2856, '10': 3789, '09': 1763, '07': 2040, '03': 2539, '23': 2616, '20': 4491, '16': 5970, '08': 2744, '00': 2835, '18': 6850, '12': 4643, '04': 2650, '06': 2030, '05': 2046}


In [27]:
# Show posts

# Create 2 dictionaries to keep track of (1) number of posts created during each hour, and (2) number of points for posts created during each hour.
counts_show_by_hour = {}
points_show_by_hour = {}

for row in point_show_list:
    # number of posts created during each hour
    created_dt = row[0]
    dt_format = "%m/%d/%Y %H:%M"
    created_dt = dt.datetime.strptime(created_dt, dt_format)
    created_hour = dt.datetime.strftime(created_dt, "%H")
    counts_show_by_hour.setdefault(created_hour, 0)
    counts_show_by_hour[created_hour] += 1
    # number of points for posts created during each hour
    points = row[1]
    points_show_by_hour.setdefault(created_hour, 0)
    points_show_by_hour[created_hour] += points

check_tot_posts = 0
for k, v in counts_show_by_hour.items():
    check_tot_posts += v

check_tot_points = 0
for k, v in points_show_by_hour.items():
    check_tot_points += v

assert check_tot_posts == len(show_posts_all)
assert check_tot_points == total_show_points

print("Number of show posts created during each hour:")
print(counts_show_by_hour)
print()
print("Number of points for show posts created during each hour:")
print(points_show_by_hour)

Number of show posts created during each hour:
{'00': 276, '23': 319, '20': 525, '19': 556, '18': 656, '16': 801, '14': 696, '10': 323, '09': 302, '08': 316, '06': 192, '03': 206, '21': 430, '17': 761, '15': 836, '11': 402, '07': 236, '04': 194, '13': 610, '12': 516, '01': 247, '22': 377, '02': 209, '05': 172}

Number of points for show posts created during each hour:
{'00': 4291, '23': 5060, '20': 6948, '19': 8928, '18': 9935, '16': 11487, '14': 10503, '10': 4303, '09': 3762, '08': 4640, '06': 3071, '03': 2168, '21': 5990, '17': 10563, '15': 11657, '11': 7742, '07': 3303, '04': 2707, '13': 10381, '12': 10787, '01': 2931, '22': 5026, '02': 2764, '05': 1834}


In [29]:
# Ask posts

# Calculate average number of points for posts created during each hour
avg_ask_by_hour = []

for hour in points_ask_by_hour:
    avg = points_ask_by_hour[hour] / counts_ask_by_hour[hour]
    avg_ask_by_hour.append([hour, avg])

# Swap the row to sort by average number of points
swap_avg_ask_by_hour = []

for row in range(len(avg_ask_by_hour)):
    swap = avg_ask_by_hour[row][1], avg_ask_by_hour[row][0]
    swap_avg_ask_by_hour.append(swap)

sorted_swap_ask = sorted(swap_avg_ask_by_hour, reverse=True)

# Show Top 5 result
print("Top 5 hours for Ask HN points (on average):")
for row in sorted_swap_ask[0:5]:
    dt_hour = row[1]
    hour_format = "%H"
    dt_hour1 = dt.datetime.strptime(dt_hour, hour_format)
    dt_hour2 = dt.datetime.strftime(dt_hour1, "%H:00")
    print(f"{dt_hour2} >> {row[0]:,.2f} points per post")

Top 5 hours for Ask HN points (on average):
15:00 >> 21.64 points per post
13:00 >> 17.93 points per post
12:00 >> 13.58 points per post
10:00 >> 13.44 points per post
17:00 >> 12.19 points per post


In [30]:
# Show posts

# Calculate average number of points for posts created during each hour
avg_show_by_hour = []

for hour in points_show_by_hour:
    avg = points_show_by_hour[hour] / counts_show_by_hour[hour]
    avg_show_by_hour.append([hour, avg])

# Swap the row to sort by average number of points
swap_avg_show_by_hour = []

for row in range(len(avg_show_by_hour)):
    swap = avg_show_by_hour[row][1], avg_show_by_hour[row][0]
    swap_avg_show_by_hour.append(swap)

sorted_swap_show = sorted(swap_avg_show_by_hour, reverse=True)

# Show Top 5 result
print("Top 5 hours for Show HN points (on average):")
for row in sorted_swap_show[0:5]:
    dt_hour = row[1]
    hour_format = "%H"
    dt_hour1 = dt.datetime.strptime(dt_hour, hour_format)
    dt_hour2 = dt.datetime.strftime(dt_hour1, "%H:00")
    print(f"{dt_hour2} >> {row[0]:,.2f} points per post")

Top 5 hours for Show HN points (on average):
12:00 >> 20.91 points per post
11:00 >> 19.26 points per post
13:00 >> 17.02 points per post
19:00 >> 16.06 points per post
06:00 >> 15.99 points per post


Per our results above, *Ask HN* posts created during the 15:00 hour (EST) received the most points on average, which aligns with our result for comments. *Show HN* posts, on the other hand, received the most points when created during the 12:00 hour (20.91 points per post), slightly ahead of the 11:00 hour (19.26 points per post).
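
The hourly analysis above repeats the same steps for comments and for points. As a sketch of how that repetition could be folded into one function, the grouping can be parameterised by column index. `average_by_hour` is a hypothetical helper, not part of the original notebook, and the three rows are samples taken from the earlier outputs.

```python
import datetime as dt
from collections import defaultdict

def average_by_hour(rows, value_index, date_index=-1):
    """Return {hour: average value} for rows shaped like the data set."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        created = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}

# Three Ask HN rows copied from the outputs above.
sample_rows = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'],
]

avg_comments = average_by_hour(sample_rows, value_index=4)  # num_comments
avg_points = average_by_hour(sample_rows, value_index=3)    # num_points
print(avg_comments)
print(avg_points)
```

With the full lists, the same call with `value_index=3` or `value_index=4` would cover the points and comments analyses respectively.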