## HackerNews Posts Analysis 💻 
[Dataset](https://www.kaggle.com/hacker-news/hacker-news-posts)
- Check if questions (Ask HN) or project presentation (Show HN) receives more attention
- Inspect if the post creation time influences on its relevance

In [10]:
from csv import reader
import datetime as dt

In [6]:
hacker_news_dataset = list(reader(open('../dataset/HN_posts_year_to_Sep_26_2016.csv')))

### Displaying the first five rows of the dataset

In [7]:
header = hacker_news_dataset.pop(0) if hacker_news_dataset[0][0] == 'id' else header
nl = "\n"
print(f"Columns: {header}\n")
for row in hacker_news_dataset[:5]:
    print(f'{row}\n')

Columns: ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']

['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']

['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']

['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']

['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']



### Filtering the dataset by separating into Ask HN, Show HN and other based on the post title

In [8]:
def filter_dataset_post(dataset: list):
    ask_posts = []
    show_posts = []
    other_posts = []
    
    for row in dataset:
        title = row[1]
        if title.startswith('Ask HN'):
            ask_posts.append(row)
        elif title.startswith('Show HN'):
            show_posts.append(row)
        else:
            other_posts.append(row)
            
    return ask_posts, show_posts, other_posts

ask_posts, show_posts, other_posts = filter_dataset_post(dataset=hacker_news_dataset)
print(f'Ask posts: {len(ask_posts)}\nShow posts: {len(show_posts)}\nOther: {len(other_posts)}')

Ask posts: 9122
Show posts: 10150
Other: 273847


### By averaging how many comments per post Ask HN and Show HN have we can see that Ask HN posts contains, on average, more than the twice the number of Show HN posts

In [9]:
def count_row(dataset: list):
    comments_sum = 0
    for row in dataset:
        comments = int(row[4])
        comments_sum += comments
        
    avg_comments = comments_sum/len(dataset)
    return comments_sum, round(avg_comments,2)

comments_ask_posts, avg_comments_ask_posts = count_comments(ask_posts)
commnets_show_posts, avg_comments_show_posts = count_comments(show_posts)

print(f'Ask HN average number of comments: {avg_comments_ask_posts}\n\
Show HN average number of comments: {avg_comments_show_posts}')

Ask HN average number of comments: 10.41
Show HN average number of comments: 4.89


### At first the quantity of posts made at that time and the amount of comments are obtained, then, fetching the hours that received the most comments and posts, afterwards average the amount of comments by the post comments and displays the Top 10 largest average comments hours 

In [54]:
def get_average_number_of_comments(post_comments: dict):
    avg_comments_per_hour = []
    for post_hour in post_comments:
        avg_comments = post_comments[post_hour]['comments_by_hour']/post_comments[post_hour]['quantity_of_posts_at_hour']
        avg_comments_per_hour.append([post_hour, round(avg_comments, 2)])
    return sorted(avg_comments_per_hour, key=lambda x: x[1], reverse=-1)

In [58]:
def extract_post_hours(dataset: list, question_type: str = "Ask HN"):
    posts_per_hour = {}
    for row in dataset:
        comments = int(row[4])
        post_creation_time = dt.datetime.strptime(row[-1], "%m/%d/%Y %H:%M")
        post_hour = post_creation_time.hour
        if post_hour in posts_per_hour:
            posts_per_hour[post_hour] = {
                'quantity_of_posts_at_hour': posts_per_hour[post_hour]['quantity_of_posts_at_hour']+1,
                'comments_by_hour': posts_per_hour[post_hour]['comments_by_hour'] \
                                    + comments
            }
        else:
            posts_per_hour[post_hour] = {
                'quantity_of_posts_at_hour': 1,
                'comments_by_hour': comments
            }
    
    max_quantity, max_comments = (0, 0), (0, 0)
    for k in posts_per_hour: 
        max_quantity = (k, max(posts_per_hour[k]['quantity_of_posts_at_hour'], max_quantity[1]))
        max_comments = (k, max(posts_per_hour[k]['comments_by_hour'], max_comments[1]))
    
    print(f"The most common hour of {question_type} posts is: {max_quantity[0]}h, containing {max_quantity[1]} posts 🧤\nThe hour that people comment the most is: {max_quantity[0]}h, with {max_comments[1]} comments 📦")
    avg_comments_per_hour = get_average_number_of_comments(posts_per_hour)
    print(f"\nBut by averaging the values we can see that the hour in which a post receives the largest amount of comments given the amount of posts is at: {avg_comments_per_hour[0][0]}h. With an average of {avg_comments_per_hour[0][1]} comments 🦥")
    print(f"\nTop 10 largest comments average:")
    for average_comments in avg_comments_per_hour[:10]:
        time = dt.datetime.strptime(str(average_comments[0]), "%H")
        time_to_str = dt.datetime.strftime(time, "%H:00")
        print(f"[{time_to_str}]: {average_comments[1]} average comments per post")

### As we can see the hours that people will most likely respond to your questions on Hacker News are after midday, yielding a larger comments average, notice that this time zone is Eastern Time US ⏱

In [59]:
extract_post_hours(dataset=ask_posts)

The most common hour of Ask HN posts is: 5h, containing 646 posts 🧤
The hour that people comment the most is: 5h, with 18525 comments 📦

But by averaging the values we can see that the hour in which a post receives the largest amount of comments given the amount of posts is at: 15h. With an average of 28.68 comments 🦥

Top 10 largest comments average:
[15:00]: 28.68 average comments per post
[13:00]: 16.35 average comments per post
[12:00]: 12.38 average comments per post
[02:00]: 11.14 average comments per post
[10:00]: 10.68 average comments per post
[04:00]: 9.74 average comments per post
[14:00]: 9.71 average comments per post
[17:00]: 9.45 average comments per post
[08:00]: 9.19 average comments per post
[11:00]: 9.01 average comments per post


### For the Show HN posts the pattern is similar, also having 5h as the hour with the most comments and posts, whereas the average is also close to midday (although not in the same pattern as the Ask HN posts)

In [62]:
extract_post_hours(dataset=show_posts, question_type='Show HN')

The most common hour of Show HN posts is: 5h, containing 834 posts 🧤
The hour that people comment the most is: 5h, with 3839 comments 📦

But by averaging the values we can see that the hour in which a post receives the largest amount of comments given the amount of posts is at: 12h. With an average of 6.99 comments 🦥

Top 10 largest comments average:
[12:00]: 6.99 average comments per post
[07:00]: 6.69 average comments per post
[11:00]: 6.0 average comments per post
[08:00]: 5.62 average comments per post
[14:00]: 5.52 average comments per post
[13:00]: 5.43 average comments per post
[02:00]: 5.15 average comments per post
[04:00]: 5.04 average comments per post
[19:00]: 5.02 average comments per post
[18:00]: 4.94 average comments per post
