## Exploring Hacker News Posts

In this project we analyze a dataset of submissions to Hacker News, a site where posts receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that reach the top of its listings can get hundreds of thousands of visitors as a result. We want to answer two questions:

1) Do Ask HN posts (users asking the Hacker News community a question) or Show HN posts (users showing the community a project, product, etc.) receive more comments on average?
2) Do posts created at a certain time of day receive more comments on average?

The dataset is available at https://www.kaggle.com/datasets/hacker-news/hacker-news-posts. It contains posts submitted to Hacker News over a 12-month period and was put together in 2016. In this project we analyze a sample of approximately 20,000 rows taken from the original 300,000 rows.

#### Reading the CSV file into a list of lists

In [1]:
from csv import reader
file_1=open("C:/Users/Denisa/Desktop/Project Apps/project 2/hacker_news.csv")
file_read=reader(file_1)
hacker_list=list(file_read)
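
The cell above leaves the file handle open and relies on the platform's default encoding. A minimal sketch of the same step with a context manager might look like the following; the `read_csv_rows` helper name, the UTF-8 encoding, and the relative file name in the usage comment are assumptions, not details confirmed by the dataset.

```python
from csv import reader

def read_csv_rows(path, encoding="utf-8"):
    """Return every row of a CSV file as a list of lists of strings."""
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, newline="", encoding=encoding) as csv_file:
        return list(reader(csv_file))

# Hypothetical usage; adjust the path to wherever hacker_news.csv is stored.
# hacker_list = read_csv_rows("hacker_news.csv")
```

The `with` block closes the file as soon as the rows have been read, even if parsing raises an error.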

#### Displaying the first rows

In [2]:
print(hacker_list[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]


### Reformatting and cleaning the data to prepare it for analysis

#### Removing the header

In [3]:
headers=hacker_list[0]
hacker_list=hacker_list[1:]
print(headers, "\n")
print(hacker_list[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


#### Separating Ask HN and Show HN posts

In [15]:
ask_posts=[]
show_posts=[]
other_posts=[]
for row in hacker_list:
    title=row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("Number of ASK posts:",len(ask_posts))
print("Number of SHOW posts:",len(show_posts))
print("Number of OTHER posts:",len(other_posts))

Number of ASK posts: 1744
Number of SHOW posts: 1162
Number of OTHER posts: 17194
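
The prefix rule used in the loop above can also be factored into a small function, which makes it easy to check on a few titles before running it over the whole dataset. This is only a sketch: `classify_title` is a hypothetical helper name and the sample titles are made up for illustration.

```python
def classify_title(title):
    """Label a title as 'ask', 'show', or 'other' by its case-insensitive prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Made-up titles, used only to illustrate the rule.
sample_titles = [
    "Ask HN: How do you learn?",
    "SHOW HN: My side project",
    "Interactive Dynamic Video",
]
labels = [classify_title(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```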


### Analyzing the data

#### Determining if ask posts or show posts receive more comments on average

In [13]:
# Sum the comment counts (column index 4) for each group of posts.
total_ask_comments=0
for row in ask_posts:
    total_ask_comments+=int(row[4])
# Average = total comments in the group divided by the number of posts in the group.
avg_ask_comments=total_ask_comments/len(ask_posts)
print("Average comments for Ask posts: {:.2f}".format(avg_ask_comments))
total_show_comments=0
for row in show_posts:
    total_show_comments+=int(row[4])
avg_show_comments=total_show_comments/len(show_posts)
print("Average comments for Show posts: {:.2f}".format(avg_show_comments))

Average comments for Ask posts: 14.04
Average comments for Show posts: 10.32


Our findings show that, on average, Ask posts receive more comments than Show posts: roughly 14 comments per Ask post versus roughly 10 per Show post.

#### Calculating the number of ask posts and comments by the hour when they were created

Having concluded that Ask posts receive more comments on average than Show posts, we next determine whether Ask posts created at a certain time of day are more likely to receive comments. We first count the Ask posts created in each hour of the day and the comments they received, then compute the average number of comments per Ask post for each hour.
The values in the created_at column are strings, so we use the datetime.strptime() method to parse them into datetime objects.
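
As a small illustration of the parsing step, the sketch below converts one `created_at` value (taken from the first data row shown earlier) and extracts its hour. The variable names are chosen for this example only.

```python
import datetime as dt

# A created_at value taken from the first data row shown earlier.
created_at = "8/4/2016 11:52"

# When parsing, %m, %d and %H also accept values without a leading zero.
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

# strftime("%H") returns the hour as a zero-padded two-character string.
hour_string = parsed.strftime("%H")

print(parsed)       # 2016-08-04 11:52:00
print(hour_string)  # 11
```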

In [6]:
print(type(hacker_list[1][6]))

<class 'str'>


In [7]:
import datetime as dt
result_list=[]
for row in ask_posts:
    result_list.append([row[6],int(row[4])])

counts_by_hour={}
comments_by_hour={}
                       
for row in result_list:
    dt_object=dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour_string=dt_object.strftime("%H")
    if hour_string not in counts_by_hour:
        counts_by_hour[hour_string]=1
        comments_by_hour[hour_string]=row[1]
    else:
        counts_by_hour[hour_string]+=1
        comments_by_hour[hour_string]+=row[1]

In [17]:
print("Number of posts by hour",counts_by_hour,"\n")
print("Number of comments by hour",comments_by_hour)

Number of posts by hour {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58} 

Number of comments by hour {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
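
The same tallies can be built without the explicit `if`/`else` branch by using `collections.defaultdict`. The following is a sketch of that alternative, not the method used in this notebook; `sample_rows` holds made-up values standing in for `result_list`.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs standing in for result_list.
sample_rows = [["8/16/2016 9:55", 6], ["11/22/2015 13:43", 29], ["5/2/2016 9:14", 1]]

# Missing keys start at 0, so no "not in" check is needed.
counts = defaultdict(int)
comments = defaultdict(int)
for created_at, num_comments in sample_rows:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {'09': 2, '13': 1}
print(dict(comments))  # {'09': 7, '13': 29}
```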


#### Calculating the average number of comments per post created during each hour of the day

In [9]:
avg_by_hour=[]

for hour in counts_by_hour:
    avg_by_hour.append([hour,comments_by_hour[hour]/counts_by_hour[hour]])
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


For a clearer view, we swap the columns and use the sorted() function to order the rows by average comments per post, in descending order, with the hour alongside.

In [10]:
swap_avg_by_hour=[]
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
sorted_swap=sorted(swap_avg_by_hour,reverse=True)
print(sorted_swap)

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
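
Swapping the columns is one way to sort on the average. Another option, sketched here as an alternative rather than as the approach used in this notebook, is to pass a `key` function to `sorted()` so the original `[hour, average]` layout is kept. The `avg_sample` list holds three pairs copied from the `avg_by_hour` output above.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_sample = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
]

# Sort on the second element (the average) without building a swapped copy.
sorted_by_avg = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print("{}:00 {:.2f}".format(hour, avg))
# 15:00 38.59
# 02:00 23.81
# 09:00 5.58
```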


In [11]:
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    hour=dt.datetime.strptime(row[1], "%H")
    h_string=hour.strftime("%H:%M: ")
    print(h_string, "{:.2f} average comments per post".format(row[0]))
    

Top 5 Hours for Ask Posts Comments
15:00:  38.59 average comments per post
02:00:  23.81 average comments per post
20:00:  21.52 average comments per post
16:00:  16.80 average comments per post
21:00:  16.01 average comments per post


### Conclusion

From our findings we conclude that:

- Ask HN posts attract more comments on average than Show HN posts
- users should create an Ask HN post around 3 pm or around 2 am (in the time zone of the created_at timestamps) to increase their chance of receiving more comments.