# Exploring Hacker News Posts

## 1. Introduction 

In this project we'll explore a sample [dataset](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts) of Hacker News posts. Each post comes with two statistics:
- number of points : The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes.
- number of comments : The number of comments on the post.

For the purpose of this project, we'll only look at posts whose titles begin with 'Ask HN' or 'Show HN'. Users submit 'Ask HN' posts to ask the Hacker News community a specific question. Similarly, users submit 'Show HN' posts to show the Hacker News community a project, a product, or just something interesting. We'll compare these two types of posts to determine the following:
- Do 'Ask HN' or 'Show HN' posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

We'll start by opening and reading our dataset.


## 2. Data Cleaning - Filtering Ask HN and Show HN Posts 

In this step we'll read the dataset and filter it to include only the 'Ask HN' and 'Show HN' posts.

In [8]:
from csv import reader

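# A forward slash works on every OS and avoids backslash escape issues;
# the with block closes the file once it has been read.
with open("Downloads/hacker_news.csv", encoding='utf8') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)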
headers = hn[0]
hn_data = hn[1:]

print(headers)
print(hn_data[:5])


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [9]:
ask_posts = []
show_posts = []
other_posts = []


for row in hn_data:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
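
The case-insensitive prefix check used above can also be written as a small helper, which makes the rule easy to test on its own. This is only a sketch: `classify_post` is a hypothetical name, and most of the sample titles are made up for illustration.

```python
from collections import Counter

def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Illustrative titles; only 'Interactive Dynamic Video' comes from the dataset preview.
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "ASK HN: Does the prefix case matter?",
    "Show HN: My weekend project",
    "Interactive Dynamic Video",
]

# Count how many sample titles fall into each category.
counts = Counter(classify_post(title) for title in sample_titles)
print(counts)
```

Lowercasing the title first means 'Ask HN', 'ASK HN', and 'ask hn' all land in the same bucket, which matches the behavior of the loop above.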


## 3. Average number of comments for Ask HN and Show HN posts 


In [12]:
ask_total_comments = 0
ask_total_posts = 0

for row in ask_posts:
    num_comments = int(row[4])
    ask_total_comments += num_comments
    ask_total_posts += 1
    
avg_ask_comments = ask_total_comments/ask_total_posts
print('The average number of comments per Ask HN post are:',avg_ask_comments)
    

The average number of comments per Ask HN post are: 14.038417431192661


In [13]:
show_total_comments = 0
show_total_posts = 0

for row in show_posts:
    num_comments = int(row[4])
    show_total_comments += num_comments
    show_total_posts += 1
    
avg_show_comments = show_total_comments/show_total_posts
print('The average number of comments per Show HN post are:',avg_show_comments)

The average number of comments per Show HN post are: 10.31669535283993


In [14]:
print(ask_total_posts)   # cross-check: should equal len(ask_posts)
print(show_total_posts)  # cross-check: should equal len(show_posts)

1744
1162


On average, **Ask HN** posts receive more comments than Show HN posts (about 14.04 versus 10.32 comments per post).
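
Since the two averaging loops differ only in the list they iterate over, the calculation could be factored into one function. The sketch below assumes the same row layout as the dataset (comment count at index 4); `average_comments` is a hypothetical helper and the sample rows are made up for illustration.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments across rows; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows following the column order:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ["1", "Ask HN: example A", "", "5", "10", "user_a", "1/1/2016 10:00"],
    ["2", "Ask HN: example B", "", "3", "4", "user_b", "1/1/2016 11:00"],
]

sample_avg = average_comments(sample_posts)  # (10 + 4) / 2
print(sample_avg)
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages printed above.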

## 4. Finding the number of Ask posts and comments by hour 

In [12]:
import datetime as dt 
result_list = []

for row in ask_posts:
    created_at = row[6]
    n_comments = int(row[4])
    result_list.append([created_at, n_comments])
    
posts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    n_comments = row[1]
    
    if hour not in posts_by_hour:
        posts_by_hour[hour] = 1
        comments_by_hour[hour] = n_comments
        
    else:
        posts_by_hour[hour] += 1
        comments_by_hour[hour] += n_comments 
print("Number of posts by hour:", '\n', posts_by_hour)
print("Number of comments by hour:", '\n', comments_by_hour)

Number of posts by hour: 
 {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
Number of comments by hour: 
 {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


Comment activity peaks at hour **15**, when Ask HN posts received a total of **4477** comments, and is lowest at hour **09**, with a total of **251** comments.
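
One way to double-check the busiest and quietest hours is to let `max` and `min` search the dictionary by value rather than reading it by eye. The dictionary below is copied from the output above; the variable names `busiest_hour` and `quietest_hour` are only illustrative.

```python
# Total comments on Ask HN posts, keyed by creation hour (copied from the output above).
comments_by_hour = {
    '09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543,
    '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381,
    '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479,
    '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641,
}

# key=comments_by_hour.get compares the hours by their comment totals.
busiest_hour = max(comments_by_hour, key=comments_by_hour.get)
quietest_hour = min(comments_by_hour, key=comments_by_hour.get)

print("Most comments:", busiest_hour, comments_by_hour[busiest_hour])
print("Fewest comments:", quietest_hour, comments_by_hour[quietest_hour])
```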

## 5. Calculating the Average Number of Comments for Ask HN Posts by hour. 


In [27]:
avg_num_of_comments_by_hour = {}

for hour in comments_by_hour:
    avg_num_of_comments_by_hour[hour] = round(comments_by_hour[hour] / posts_by_hour[hour], 2)
    
print(avg_num_of_comments_by_hour)



{'09': 5.58, '13': 14.74, '10': 13.44, '14': 13.23, '16': 16.8, '23': 7.99, '12': 9.41, '17': 11.46, '15': 38.59, '21': 16.01, '20': 21.52, '02': 23.81, '18': 13.2, '03': 7.8, '05': 10.09, '19': 10.8, '01': 11.38, '22': 6.75, '08': 10.25, '04': 7.17, '00': 8.13, '06': 9.02, '07': 7.85, '11': 11.05}


In [29]:
from operator import itemgetter 

sorted_dict = dict(sorted(avg_num_of_comments_by_hour.items(), key=itemgetter(1), reverse=True))

print(sorted_dict)

{'15': 38.59, '02': 23.81, '20': 21.52, '16': 16.8, '21': 16.01, '13': 14.74, '10': 13.44, '14': 13.23, '18': 13.2, '17': 11.46, '01': 11.38, '11': 11.05, '19': 10.8, '08': 10.25, '05': 10.09, '12': 9.41, '06': 9.02, '00': 8.13, '23': 7.99, '07': 7.85, '03': 7.8, '04': 7.17, '22': 6.75, '09': 5.58}


In [30]:
for i, (hour, avg_comments) in enumerate(sorted_dict.items()):
    if i < 5:
        print(f"{hour}: {avg_comments:.2f} average comments per post")

15: 38.59 average comments per post
02: 23.81 average comments per post
20: 21.52 average comments per post
16: 16.80 average comments per post
21: 16.01 average comments per post
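
As an alternative sketch, the top five hours can be taken by slicing the sorted list of items directly, which avoids looping over all 24 entries. The averages are copied from the output above; `top_five` is an illustrative name.

```python
# Average comments per Ask HN post by creation hour (copied from the output above).
avg_by_hour = {
    '09': 5.58, '13': 14.74, '10': 13.44, '14': 13.23, '16': 16.8, '23': 7.99,
    '12': 9.41, '17': 11.46, '15': 38.59, '21': 16.01, '20': 21.52, '02': 23.81,
    '18': 13.2, '03': 7.8, '05': 10.09, '19': 10.8, '01': 11.38, '22': 6.75,
    '08': 10.25, '04': 7.17, '00': 8.13, '06': 9.02, '07': 7.85, '11': 11.05,
}

# Sort (hour, average) pairs by average, highest first, and keep the first five.
top_five = sorted(avg_by_hour.items(), key=lambda item: item[1], reverse=True)[:5]

for hour, avg_comments in top_five:
    print(f"{hour}:00  {avg_comments:.2f} average comments per post")
```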


## 6. Conclusion

In this project, we analyzed user interaction on Hacker News for two post categories: Ask HN and Show HN.

Focusing on Ask HN posts, the data indicates that the average number of comments per post peaks for posts created between **15:00** and **16:00**, at **38.59 comments**. Notably, posts created during the early hour of **02:00** also draw high activity, with an average of **23.81 comments**. Note that these hours are in US Eastern Time.
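
Because the hours above are in US Eastern Time, readers elsewhere may want to translate them into their own time zone. The following is a minimal sketch using the standard-library `zoneinfo` module (Python 3.9+). It assumes `America/New_York` is the right zone for the dataset's timestamps; `to_local` is a hypothetical helper, and `Europe/London` is just an example target.

```python
import datetime as dt
from zoneinfo import ZoneInfo

# Assumption: the dataset's created_at values are US Eastern Time.
EASTERN = ZoneInfo("America/New_York")

def to_local(created_at, target_zone):
    """Convert a 'm/d/YYYY H:MM' Eastern timestamp to the given IANA time zone."""
    naive = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    eastern_time = naive.replace(tzinfo=EASTERN)  # attach the zone, keep wall-clock time
    return eastern_time.astimezone(ZoneInfo(target_zone))

# Example: a 15:00 Eastern post on a summer date, viewed from London.
local = to_local("8/4/2016 15:00", "Europe/London")
print(local.strftime("%m/%d/%Y %H:%M"))
```

The offset between zones shifts when daylight saving time starts or ends, so converting full timestamps is safer than adding a fixed number of hours to the hour buckets.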