<h1> Finding the best time to post a new topic on Hacker News </h1>

Hacker News is a popular site among startup and technology circles, where people can post stories and receive votes and comments from others. 
Users submit Ask HN posts to ask the Hacker News community a specific question. Similarly, users submit Show HN posts to share a project, product, or something interesting. All other types of posts are captured under Other.

In this project, we focus on answering the following two questions:
- Do Ask HN or Show HN posts receive more points and comments on average?
- Do posts created at a certain time receive more points and comments on average?

Source of data: https://www.kaggle.com/hacker-news/hacker-news-posts

---------------
Below are descriptions of the columns:

- id: the unique identifier from Hacker News for the post
- title: the title of the post
- url: the URL that the post links to, if the post has a URL
- num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: the number of comments on the post
- author: the username of the person who submitted the post
- created_at: the date and time of the post's submission

## I. Reading and exploring data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

In [2]:
hn = pd.read_csv('HN_posts_year_to_Sep_26_2016.csv')

In [3]:
hn.shape

(293119, 7)

In [4]:
hn.columns

Index(['id', 'title', 'url', 'num_points', 'num_comments', 'author',
       'created_at'],
      dtype='object')

In [5]:
hn.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12579008,You have two days to comment if you want stem ...,http://www.regulations.gov/document?D=FDA-2015...,1,0,altstar,9/26/2016 3:26
1,12579005,SQLAR the SQLite Archiver,https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24
2,12578997,What if we just printed a flatscreen televisio...,https://medium.com/vanmoof/our-secrets-out-f21...,1,0,pavel_lishin,9/26/2016 3:19
3,12578989,algorithmic music,http://cacm.acm.org/magazines/2011/7/109891-al...,1,0,poindontcare,9/26/2016 3:16
4,12578979,How the Data Vault Enables the Next-Gen Data W...,https://www.talend.com/blog/2016/05/12/talend-...,1,0,markgainor1,9/26/2016 3:14


In [6]:
hn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293119 entries, 0 to 293118
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            293119 non-null  int64 
 1   title         293119 non-null  object
 2   url           279256 non-null  object
 3   num_points    293119 non-null  int64 
 4   num_comments  293119 non-null  int64 
 5   author        293119 non-null  object
 6   created_at    293119 non-null  object
dtypes: int64(3), object(4)
memory usage: 15.7+ MB


## II. Clean data

#### While exploring the dataset, we can see that the "created_at" column is stored as a generic object (string) type rather than a datetime type, so we'll start our cleaning by converting this field to the correct format.

In [7]:
# Convert "created_at" column to datetime 

hn.created_at = pd.to_datetime(hn.created_at)

hn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293119 entries, 0 to 293118
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   id            293119 non-null  int64         
 1   title         293119 non-null  object        
 2   url           279256 non-null  object        
 3   num_points    293119 non-null  int64         
 4   num_comments  293119 non-null  int64         
 5   author        293119 non-null  object        
 6   created_at    293119 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(3), object(3)
memory usage: 15.7+ MB
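
The conversion above lets pandas infer the timestamp layout. As a minimal sketch, assuming every value follows the month/day/year hour:minute pattern visible in the `hn.head()` output (for example `9/26/2016 3:26`), an explicit `format` can be passed so that a value in a different layout raises an error instead of being guessed. The `sample` Series below is made-up illustration data, not rows from the dataset.

```python
import pandas as pd

# Hypothetical sample written in the same layout as the created_at column
sample = pd.Series(['9/26/2016 3:26', '9/25/2016 20:06', '12/1/2015 0:05'])

# Explicit format: month/day/year hour:minute (24-hour clock)
parsed = pd.to_datetime(sample, format='%m/%d/%Y %H:%M')

print(parsed.dt.hour.tolist())       # hour of day for each timestamp
print(parsed.dt.dayofweek.tolist())  # 0 = Monday ... 6 = Sunday
```

Whether the explicit format is worth it depends on how much we trust the file; on a clean export it mainly serves as a safeguard.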


#### For the purposes of our analysis, it makes more sense to focus only on posts that received comments. So in this step, we'll remove all submissions that didn't get any comments.

In [8]:
hn = hn[hn.num_comments > 0]

hn.shape

(80401, 7)

## III. Analyse data

#### Now that we have the cleaned list of 80401 records (reduced from 293119), let's start analysing data by separating it into 2 lists of Ask and Show posts first.

In [9]:
# Converting the title to lower case so that matching on the title prefix is not affected by case sensitivity

hn.title = hn.title.str.lower()

In [10]:
# Creating separate DataFrames for Ask HN and Show HN posts
# (.copy() gives each subset its own data, so new columns can be added to it safely later)

ask_hn = hn[hn.title.str.startswith('ask hn')].copy()
show_hn = hn[hn.title.str.startswith('show hn')].copy()

### 1. More engagement by post type

#### In this step, we'll calculate the average number of points and comments for each type of post to see which one received more.

In [11]:
avg_ask_point = round(ask_hn.num_points.mean(),2)
avg_show_point = round(show_hn.num_points.mean(),2)

print('Average number of points on ask posts: ', avg_ask_point)
print('Average number of points on show posts: ', avg_show_point)

Average number of points on ask posts:  14.4
Average number of points on show posts:  26.62


In [12]:
avg_ask_comments = round(ask_hn.num_comments.mean(),2)
avg_show_comments = round(show_hn.num_comments.mean(),2)

print('Average number of comments on ask posts: ', avg_ask_comments)
print('Average number of comments on show posts: ', avg_show_comments)

Average number of comments on ask posts:  13.74
Average number of comments on show posts:  9.81
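
The same comparison can be sketched with a single `groupby` instead of separate variables per post type. The example below is only an illustration: the `posts` rows are made up, and the `post_type` helper and column are names introduced here, not part of the notebook. It also reports the median next to the mean, since a few very popular posts can pull a mean upward.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the dataset
posts = pd.DataFrame({
    'title': ['ask hn: a question', 'show hn: my project', 'some article',
              'ask hn: another question', 'show hn: a tool'],
    'num_points': [10, 40, 5, 20, 10],
    'num_comments': [12, 4, 1, 8, 6],
})

def post_type(title):
    """Label a lower-cased title as 'ask', 'show', or 'other' (hypothetical helper)."""
    if title.startswith('ask hn'):
        return 'ask'
    if title.startswith('show hn'):
        return 'show'
    return 'other'

posts['post_type'] = posts['title'].map(post_type)

# Mean and median of both engagement measures for every post type
summary = posts.groupby('post_type')[['num_points', 'num_comments']].agg(['mean', 'median'])
print(summary)
```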


<p style='background:yellow'>We can see that Ask posts received fewer points but more comments than Show posts on average. Our purpose is to identify which type of content gets more community engagement, so comments seem to be more important than points for this analysis. But for now, let's continue by analyzing other factors in both lists and decide once we have more insight.</p>

### 2. More engagement by day and time created

Moving forward, we'll determine if posts created at a certain day and time are more likely to attract people's engagement. 
We'll follow the below steps to perform this analysis:
- Identify the day of week and hour of the day a post was created
- Calculate the average number of points and comments each post received based on day and hour created

In [13]:
# Creating a new column to mark the day a post was created (with 0 and 6 representing Monday and Sunday respectively)
# Each subset reads its own created_at column rather than relying on index alignment with hn
ask_hn['day_of_week'] = ask_hn.created_at.dt.dayofweek
show_hn['day_of_week'] = show_hn.created_at.dt.dayofweek

# Creating a new column to identify which hour of the day each post was created
ask_hn['created_hour'] = ask_hn.created_at.dt.hour
show_hn['created_hour'] = show_hn.created_at.dt.hour

ask_hn.head(3)
show_hn.head(3)

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at,day_of_week,created_hour
140,12577142,show hn: jumble essays on the go #paulinyourp...,https://itunes.apple.com/us/app/jumble-find-st...,1,1,ryderj,2016-09-25 20:06:00,6,20
177,12576813,show hn: learn japanese vocab via multiple cho...,http://japanese.vul.io/,1,1,soulchild37,2016-09-25 19:06:00,6,19
246,12576090,show hn: markov chain twitter bot. trained on ...,https://twitter.com/botsonasty,3,1,keepingscore,2016-09-25 16:50:00,6,16
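
The numeric `day_of_week` codes are easy to misread when scanning the tables that follow. A small sketch, using made-up timestamps, shows how `dt.day_name()` could be kept next to the code so that each row carries its spelled-out weekday; the `labels` frame is a name introduced only for this illustration.

```python
import pandas as pd

# Made-up timestamps for illustration
stamps = pd.Series(pd.to_datetime(['2016-09-25 20:06', '2016-09-26 03:26']))

labels = pd.DataFrame({
    'day_of_week': stamps.dt.dayofweek,   # 0 = Monday ... 6 = Sunday
    'day_name': stamps.dt.day_name(),     # spelled-out weekday
    'created_hour': stamps.dt.hour,
})
print(labels)
```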


In [15]:
# Calculating the average number of points Ask posts received by day and hour created.

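# Select the column before aggregating: the first positional argument of mean() is numeric_only, not a column name
ask_pointsByHour = (
    ask_hn.groupby(['day_of_week', 'created_hour'])[['num_points']]
    .mean()
    .sort_values(['day_of_week', 'num_points'], ascending=False)
    .num_points
)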
top_ask_pointsByHour = ask_pointsByHour.groupby('day_of_week').head(3)

top_ask_pointsByHour

day_of_week  created_hour
6            12              43.562500
             17              38.666667
             13              26.184211
5            2               33.800000
             17              28.531915
             22              26.821429
4            7               41.000000
             15              40.726027
             13              35.603774
3            15              26.975610
             11              22.296296
             18              22.257576
2            13              37.076923
             15              25.406977
             10              23.735294
1            15              22.776119
             13              22.636364
             10              17.758621
0            15              50.828125
             4               41.291667
             1               26.848485
Name: num_points, dtype: float64

<p style='background:yellow'>Interestingly, the best single timeslot for an Ask post, based on the average number of votes (represented by num_points), was Monday at 15:00 (50.83). The 12:00 timeslot on Sunday came second (43.56).</p>
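
The table above reports averages only, so it does not show how many posts fall into each day and hour slot; a slot with very few posts can reach a high average because of a single popular post. The sketch below, on made-up data, computes the post count alongside the mean and keeps only slots above a threshold. `MIN_POSTS` is a hypothetical cut-off introduced for this example, not a value used in the analysis.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'day_of_week':  [0, 0, 0, 0, 6, 6],
    'created_hour': [15, 15, 15, 4, 12, 12],
    'num_points':   [30, 60, 90, 200, 10, 50],
})

# Mean and number of posts for every (day, hour) slot
stats = (
    toy.groupby(['day_of_week', 'created_hour'])['num_points']
       .agg(['mean', 'count'])
)

MIN_POSTS = 2  # hypothetical threshold; pick one that suits the real data
reliable = stats[stats['count'] >= MIN_POSTS].sort_values('mean', ascending=False)
print(reliable)
```

In this toy data the Monday 4:00 slot has the highest mean but rests on one post, so it is dropped before ranking.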

In [19]:
# Calculating the average number of comments Ask posts received by day and hour created

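# Select the column before aggregating: the first positional argument of mean() is numeric_only, not a column name
ask_commentsByHour = (
    ask_hn.groupby(['day_of_week', 'created_hour'])[['num_comments']]
    .mean()
    .sort_values(['day_of_week', 'num_comments'], ascending=False)
    .num_comments
)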
top_ask_commentsByHour = ask_commentsByHour.groupby('day_of_week').head(3)

top_ask_commentsByHour

day_of_week  created_hour
6            12              36.718750
             20              24.250000
             22              21.675676
5            2               46.440000
             22              22.107143
             12              20.611111
4            15              52.561644
             13              39.283019
             7               25.666667
3            15              41.768293
             11              18.407407
             14              14.909091
2            15              29.674419
             13              20.980769
             10              20.823529
1            15              34.477612
             13              27.054545
             17              19.962500
0            15              79.703125
             4               34.208333
             5               27.111111
Name: num_comments, dtype: float64

<p style='background:yellow'>15:00 on Monday seems to be the golden time for asking questions on Hacker News: it is also the slot with the highest average number of comments (79.7). The 15:00 timeslot on Friday came second (52.56).</p>
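
Another way to look at the same averages is as a day-by-hour grid, which makes it easier to see whether an hour such as 15:00 stands out across several days. This is a minimal sketch on made-up data using `pivot_table`; the `toy` and `grid` names are introduced only for the example.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'day_of_week':  [0, 0, 4, 4, 4],
    'created_hour': [15, 15, 15, 13, 13],
    'num_comments': [80, 60, 50, 30, 40],
})

# Rows are days, columns are hours, cells hold the mean number of comments
grid = toy.pivot_table(index='day_of_week', columns='created_hour',
                       values='num_comments', aggfunc='mean')
print(grid)

# Slot with the highest average; slots with no posts are NaN and are skipped
best_day, best_hour = grid.stack().idxmax()
print(best_day, best_hour)
```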

In [21]:
# Calculating the average number of points Show posts received by day and hour created.

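# Select the column before aggregating: the first positional argument of mean() is numeric_only, not a column name
show_pointsByHour = (
    show_hn.groupby(['day_of_week', 'created_hour'])[['num_points']]
    .mean()
    .sort_values(['day_of_week', 'num_points'], ascending=False)
    .num_points
)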
top_show_pointsByHour = show_pointsByHour.groupby('day_of_week').head(3)

top_show_pointsByHour

day_of_week  created_hour
6            22              87.428571
             19              49.437500
             23              35.666667
5            8               61.250000
             20              48.185185
             17              36.444444
4            18              58.909091
             13              54.864865
             23              49.916667
3            2               61.818182
             6               51.300000
             7               47.142857
2            11              58.978723
             17              40.327273
             15              33.149254
1            21              63.800000
             6               50.722222
             16              43.367816
0            0               77.000000
             12              57.423077
             23              47.190476
Name: num_points, dtype: float64

<p style='background:yellow'>As noted earlier, Show posts received more points than Ask posts in general. The five best timeslots for sharing your work or story with the Hacker News community, from highest to lowest average points, were: Sunday 22:00, Monday 00:00, Tuesday 21:00, Thursday 2:00, and Saturday 8:00.</p>

In [23]:
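# Calculating the average number of comments Show posts received by day and hour created.

# Select the column before aggregating: the first positional argument of mean() is numeric_only, not a column name
show_commentsByHour = (
    show_hn.groupby(['day_of_week', 'created_hour'])[['num_comments']]
    .mean()
    .sort_values(['day_of_week', 'num_comments'], ascending=False)
    .num_comments
)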
top_show_commentsByHour=show_commentsByHour.groupby('day_of_week').head(3)

top_show_commentsByHour

day_of_week  created_hour
6            19              14.750000
             3               14.583333
             22              13.142857
5            8               31.812500
             16              14.576923
             15              13.000000
4            18              20.500000
             12              19.064516
             4               18.875000
3            7               35.500000
             2               16.090909
             19              13.934783
2            11              15.680851
             15              14.417910
             8               13.967742
1            6               15.888889
             21              14.300000
             14              13.983607
0            2               20.750000
             23              20.523810
             12              19.403846
Name: num_comments, dtype: float64

<p style='background:yellow'>The average number of comments on Show posts was considerably lower than on Ask posts. For Show posts, the timeslots that received the most comments were Thursday 7:00 (35.5) and Saturday 8:00 (31.81). So if we want to balance points and comments, 8:00 on Saturday seems to be the best time for a Show post.</p>
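
The balance between points and comments is judged by eye above. One possible way to make it explicit, shown here only as a sketch on made-up slot averages, is to rank every slot on each measure and average the two ranks. The `slots` frame, its column names, and the equal weighting of both measures are assumptions of this example, not something the analysis itself uses.

```python
import pandas as pd

# Made-up slot averages for illustration only
slots = pd.DataFrame(
    {'avg_points': [80.0, 60.0, 45.0], 'avg_comments': [10.0, 30.0, 20.0]},
    index=pd.MultiIndex.from_tuples([(6, 22), (5, 8), (3, 7)],
                                    names=['day_of_week', 'created_hour']),
)

# Rank 1 = best on each measure, then average the two ranks per slot
ranks = slots.rank(ascending=False)
slots['mean_rank'] = ranks.mean(axis=1)

# Lowest mean rank = best balance between points and comments
balanced = slots.sort_values('mean_rank')
print(balanced)
```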

## IV. Conclusion

Our main goal for this project is to compare two types of posts to answer the two questions:
- Do Ask HN or Show HN posts receive more points and comments on average?
- Do posts created at a certain time receive more points and comments on average?

Based on our analysis, we found that on average, **Ask HN received more comments than Show HN** (13.74 vs 9.81), but **Show HN received more points (26.62) compared to Ask HN (14.4)**. The choice between an Ask post and a Show post therefore depends on the type of engagement we want to achieve.

**Regarding the second question, we also found two timeslots that were likely to get the highest user engagement for each type of post:**
- Ask HN: Monday 15:00
- Show HN: Saturday 8:00

**Some other good days and timeslots to consider:**

To get more points:
- Ask HN: Sunday 12:00 (43.56), Monday 4:00 (41.29), and Friday 7:00 (41)
- Show HN: Sunday 22:00 (87.43), Monday 00:00 (77), Tuesday 21:00 (63.8), Thursday 2:00 (61.82)

To get more comments:
- Ask HN: Friday 15:00 (52.56), Saturday 2:00 (46.44), and Thursday 15:00 (41.77)
- Show HN: Thursday 7:00 (35.5), Monday 2:00 (20.75), and Monday 23:00 (20.53)
