## Exploring Hacker News Posts

This __project__ centers around Hacker News, working with a [dataset](https://www.kaggle.com/hacker-news/hacker-news-posts) of __submissions__ to the site. Specifically, we are interested in posts with titles that _begin_ with either _Ask HN_ or _Show HN_:
- *Ask HN* is used when users ask the Hacker News community a question.
- *Show HN* is used when users want to show the community a project, product or something of the sort. 

The main __objective__ of this project is to answer these questions:
1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average? 

### 01. Opening and Exploring the Data

In [1]:
#import libraries needed
from csv import reader

#open and read dataset (the with block closes the file once it has been read)
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file) #transform the read file into a list of lists

#explore data displaying five rows
for row in hn[:5]:
    print(row)
    print("\n")


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']




In [2]:
#removing header and assigning it to a variable
header = hn[0]
hn=hn[1:]

#display to check if everything is correct

print(header)         #header
print("\n")

for row in hn[:5]:    #data
    print(row)
    print("\n")


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
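
The load-and-split step above could also be wrapped in a small function. The following is a minimal sketch, not the notebook's own code: `load_rows` is a hypothetical helper, and an in-memory string with two rows copied from the output above stands in for `hacker_news.csv`.

```python
import csv
import io

def load_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# In-memory stand-in for hacker_news.csv (assumption: same column order as the real file)
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0,8,2,walterbell,9/30/2015 4:12\n"
)

sample_header, sample_rows = load_rows(sample_csv)
print(sample_header)
print(len(sample_rows))

# With the real file, the call would look like this:
# with open('hacker_news.csv') as f:
#     header, hn = load_rows(f)
```

Returning the header separately keeps the data rows free of column names, which matters later when `num_comments` is converted with `int()`.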




### 02. Filtering Data 

Now that we understand how the data is stored, we need to classify the submissions according to their titles. We are only interested in posts beginning with either "Ask HN" or "Show HN", as explained before. To do that, we will create __three lists__:
* A list to save submissions with a title beginning with "Ask HN" (ask_posts)
* A list to save submissions with a title beginning with "Show HN" (show_posts)
* A list to save other submissions (other_posts)


In [3]:
#creating the three lists
ask_posts=[]
show_posts=[]
other_posts=[] 

#classifying the submissions
for row in hn:
    
    title=row[1]                 #saving title into variable
    title=title.lower()          #making all the title lowercase to avoid problems using .startswith method
    
    if title.startswith("ask hn"):  
        ask_posts.append(row)   
    
    elif title.startswith("show hn"):
        show_posts.append(row)
    
    else:
        other_posts.append(row)
        
#checking length
print("The number of 'Ask HN' posts is: ", len(ask_posts))
print("The number of 'Show HN' posts is: ", len(show_posts))
print("The number of other posts is: ", len(other_posts))


The number of 'Ask HN' posts is:  1744
The number of 'Show HN' posts is:  1162
The number of other posts is:  17194
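
The prefix test used for the classification can be isolated in a function, which makes it easy to try on individual titles. This is only a sketch: `classify_title` is a hypothetical helper and the titles below are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Hypothetical titles, not taken from the dataset
example_titles = [
    "Ask HN: How do you learn?",
    "SHOW HN: My side project",
    "Interactive Dynamic Video",
    "I want to ask HN something",
]
labels = [classify_title(title) for title in example_titles]
print(labels)  # ['ask', 'show', 'other', 'other']
```

The last title shows why `startswith` is used rather than a substring search: a title that only mentions "ask HN" somewhere in the middle is not an Ask HN post.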


### 03. Checking for Wrong Data 

Before continuing the analysis, we should check whether any data is missing. To do so, we can check that the number of elements (columns) in each row equals the number of columns in the header. We'll only do that for the ask_posts and show_posts lists: 

In [4]:
#create an empty list to save the rows with missing data 
missing_data_ask= [] 
missing_data_show = [] 

#check the length of every row in the lists 

total_length = len(header)  #save the total length of the header list
n_row = 0                  #to know in which row there's missing data 

for row in ask_posts:     #loop over ask_posts and check the length
    row_length=len(row)
    if row_length != total_length:
        missing_data_ask.append(n_row)
    n_row+=1
    
print("The number of rows with missing data on ask_posts list is:", missing_data_ask)

n_row = 0 
for row in show_posts:     #loop over show_posts and check the length
    row_length=len(row)
    if row_length != total_length:
        missing_data_show.append(n_row)
    n_row+=1
    
print("The number of rows with missing data on show_posts list is:", missing_data_show)      

The number of rows with missing data on ask_posts list is: []
The number of rows with missing data on show_posts list is: []


Now we know there will be no problems further exploring the data. 

We can also check whether the columns we are interested in (comments and time) contain empty values by repeating the same procedure. 

In [5]:
#create an empty list to save the rows with missing data in comments or time
missing_data_ask= [] 
missing_data_show = [] 

#check the comments and time columns of every row in the lists 

n_row = 0                  #to know in which row there's missing data 

for row in ask_posts:     #loop over ask_posts and check for empty values
    time = row[6]
    comments = row[4]
    if time == "" or comments =="":
        missing_data_ask.append(n_row)
    n_row+=1
    
print("The number of rows with missing comments or datetime on ask_posts list is:", missing_data_ask)

n_row = 0 
for row in show_posts:     #loop over show_posts and check for empty values
    time = row[6]
    comments = row[4]
    if time == "" or comments =="":
        missing_data_show.append(n_row)
    n_row+=1
    
print("The number of rows with missing comments or datetime on show_posts list is:", missing_data_show) 

The number of rows with missing comments or datetime on ask_posts list is: []
The number of rows with missing comments or datetime on show_posts list is: []
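
Both checks (row length and empty values) follow the same pattern, so they could be combined into one reusable function. The sketch below assumes the dataset's column order; `find_bad_rows` is a hypothetical helper and the sample rows are invented to trigger each case.

```python
def find_bad_rows(rows, expected_length, required_columns=()):
    """Return the indices of rows with the wrong length or an empty required column."""
    bad = []
    for index, row in enumerate(rows):
        wrong_length = len(row) != expected_length
        # Only inspect columns that actually exist in the row
        has_empty = any(col < len(row) and row[col] == "" for col in required_columns)
        if wrong_length or has_empty:
            bad.append(index)
    return bad

# Hypothetical rows, for illustration only
sample_posts = [
    ["1", "Ask HN: a?", "", "5", "6", "alice", "8/4/2016 11:52"],  # complete
    ["2", "Ask HN: b?", "", "3", "", "bob", "8/4/2016 12:10"],     # empty num_comments
    ["3", "Ask HN: c?", "", "1", "0"],                             # too short
]
print(find_bad_rows(sample_posts, expected_length=7, required_columns=(4, 6)))  # [1, 2]
```

Using `enumerate` removes the need to maintain a separate `n_row` counter, and the same call works for both `ask_posts` and `show_posts`.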


### 04. Popularity by number of comments 

In order to answer the first question proposed in this project, we need to know the average number of comments that "Ask HN" posts and "Show HN" posts receive.
* We will first loop over both lists to find out how many comments they have in _total_, saving the results in total_ask_comments and total_show_comments respectively. 
* Then we will divide each total by the number of posts of that type, which we get with the len() function.

In [6]:
#Ask comments
total_ask_comments=0

for row in ask_posts:
    n_comments=int(row[4])
    total_ask_comments+=n_comments
    
avg_ask_comments=total_ask_comments/len(ask_posts)

#Show comments
total_show_comments=0

for row in show_posts:
    n_comments=int(row[4])
    total_show_comments+=n_comments
    
avg_show_comments=total_show_comments/len(show_posts)  

#Show results
print("The average of comments on 'Ask HN' posts is %.2f"%(avg_ask_comments))
print("The average of comments on 'Show HN' posts is %.2f"%(avg_show_comments))

The average of comments on 'Ask HN' posts is 14.04
The average of comments on 'Show HN' posts is 10.32
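
Since the same total-then-divide calculation is applied to both lists, it could be written once as a function. This is a minimal sketch assuming `num_comments` sits at index 4, as in the header; `average_comments` is a hypothetical helper and the rows are made up.

```python
def average_comments(rows, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not rows:
        return 0.0
    return sum(int(row[comments_index]) for row in rows) / len(rows)

# Hypothetical rows in the dataset's column order (only num_comments matters here)
sample_ask_rows = [
    ["1", "Ask HN: first?", "", "5", "6", "alice", "8/4/2016 11:52"],
    ["2", "Ask HN: second?", "", "3", "2", "bob", "8/4/2016 12:10"],
]
print("%.2f" % average_comments(sample_ask_rows))  # 4.00
```

The guard for an empty list avoids a `ZeroDivisionError` if a filter ever returns no posts.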


#### Conclusions
Ask HN posts receive more comments on average than Show HN posts. From these results we could say that people interact more with Ask HN posts. However, we should check how the comments are distributed across posts before drawing firmer conclusions. 

Let's plot the list for each type of post - __To do later__
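
Until the plot is done, one quick way to see whether a few very popular posts dominate the average is to compare the mean with the median. The sketch below uses the standard library `statistics` module on invented comment counts; it is not a result from the dataset.

```python
from statistics import mean, median

# Hypothetical comment counts, chosen to show the effect of one very popular post
sample_comment_counts = [1, 2, 2, 3, 62]

sample_mean = mean(sample_comment_counts)
sample_median = median(sample_comment_counts)

print("mean:", sample_mean)
print("median:", sample_median)
```

Here the mean is 14 while the median is only 2, so a single post with 62 comments accounts for most of the average. A large gap like this in the real data would suggest the averages above should be read with caution.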

 

### 05. Popularity by time posted 

We continue our analysis with only the 'Ask HN' submissions, since they had a higher average number of comments. In order to do this, we need to calculate: 
1. Number of posts submitted in each hour of the day 
2. Number of comments in each hour of the day
3. Average number of comments received by hour 

Before calculating all of these, we will parse the created_at column with datetime.strptime.

_Note_: It is important to know that the format of the created_at is the following: MONTH/DAY/YEAR HOUR:MINUTE 
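
As a small worked example of that format, the sketch below parses two `created_at` values that appear in the rows printed earlier and extracts the hour. The variable names are illustrative only.

```python
import datetime as dt

# MONTH/DAY/YEAR HOUR:MINUTE maps to the format string below
created_at_format = "%m/%d/%Y %H:%M"

afternoon = dt.datetime.strptime("8/4/2016 11:52", created_at_format)
midnight = dt.datetime.strptime("6/17/2016 0:01", created_at_format)

print(afternoon)                 # 2016-08-04 11:52:00
print(afternoon.strftime("%H"))  # 11
print(midnight.strftime("%H"))   # 00
```

Note that `strptime` accepts months, days and hours without a leading zero, while `strftime("%H")` always returns a two-digit string, so the hours can be used as consistent dictionary keys.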

In [11]:
#import datetime 

import datetime as dt 

#Create a list of lists to save the number of comments and date 
result_list = [] 

for row in ask_posts:
    comments = row[4]
    date = row[6]
    result_list.append([date, comments])
    

#Create two dictionaries to save the number of comments and number of posts in each hour
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    #save the number of comments and date in variables
    comments = int(row[1])
    date = row[0]
    
    #Transform date using datetime  
    date= dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    date=date.strftime("%H")
    
    #count the number of comments and counts in each hour 
    if date in counts_by_hour:             #if hour is a key
        counts_by_hour[date]+=1            #add up one 
    else:
        counts_by_hour[date]=1             #if not, equal to one 
        
    if date in comments_by_hour:           #if hour is a key
        comments_by_hour[date]+=comments   #add up number of comments
    else:
        comments_by_hour[date]=comments    #equal to number of comments 




Now we will calculate the average of comments by hour. We will store the results in a list of lists called avg_by_hour.

In [22]:
avg_by_hour=[] #create empty list 

for hour in comments_by_hour: #this loops over every key in the dictionary 
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])

#Once we have avg_by_hour, to improve visualization, we will sort the rows in descending order
#To do that we will first swap the columns

swap_avg_by_hour = []  #Empty list where we will store the swapped columns 

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse = True)  #We set reverse to True so the highest value appears first.

#Display the results 
print("Top 5 Hours for Ask Posts Comments")

for row in sorted_swap[:5]:
    
    time = row[1] 
    time = dt.datetime.strptime(time, "%H") #Convert time to a datetime class
    time = time.strftime("%H:%M") #Format to show hours and minutes 
    
    print("{0}: {1:.2f} average comments per post".format(time, row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
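
The column swap is one way to sort by the average. An alternative is to pass a `key` function to `sorted`, which sorts on the second element without building a second list. The sketch below uses invented totals per hour, not values from the dataset.

```python
# Hypothetical totals per hour, for illustration only
sample_comments_by_hour = {"09": 30, "15": 120, "20": 45}
sample_counts_by_hour = {"09": 10, "15": 4, "20": 9}

# Build (hour, average) pairs
sample_avg_by_hour = [
    (hour, sample_comments_by_hour[hour] / sample_counts_by_hour[hour])
    for hour in sample_comments_by_hour
]

# Sort by the average (second element), highest first
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print("{0}:00: {1:.2f} average comments per post".format(hour, avg))
```

With these sample values the hours are printed in the order 15, 20, 09.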


Before drawing our conclusions, we'll convert these hours to CET (Central European Time), which is 6 hours ahead of EST.  

In [27]:
for row in sorted_swap[:5]:
    
    time = row[1] 
    time = dt.datetime.strptime(time, "%H") #Convert time to a datetime class
    time = time + dt.timedelta(hours=6) #Use dt.timedelta to add 6 hours (datetime was imported as dt)
    time = time.strftime("%H:%M") #Format to show hours and minutes 
    
    print("{0}: {1:.2f} average comments per post".format(time, row[0]))

21:00: 38.59 average comments per post
08:00: 23.81 average comments per post
02:00: 21.52 average comments per post
22:00: 16.80 average comments per post
03:00: 16.01 average comments per post
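
Because only the hour is being shifted, the same conversion can be sketched with modular arithmetic. `shift_hour` is a hypothetical helper, and the fixed 6-hour offset is an assumption: the United States and Europe change to and from daylight saving time on different dates, so for a few weeks each year the difference is 5 hours.

```python
def shift_hour(hour, offset_hours):
    """Shift a two-digit hour string by a whole number of hours, wrapping at midnight."""
    shifted = (int(hour) + offset_hours) % 24
    return "{:02d}:00".format(shifted)

# Assumed fixed offset between EST and CET
EST_TO_CET_HOURS = 6

print(shift_hour("15", EST_TO_CET_HOURS))  # 21:00
print(shift_hour("20", EST_TO_CET_HOURS))  # 02:00
```

The second call shows the wrap-around: 20:00 EST falls on the next calendar day in CET, which the hour alone does not reveal.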


### 06. Conclusions
We can say the best time to post an Ask HN post is 3 PM EST (9 PM CET), with an average of 38.59 comments per post. This value is well above the others: it is about 1.6 times the second-highest average (23.81). So posting at that time would definitely be recommended. 