                Guided Project: Exploring Hacker News Posts
              
In this project, I look at a data set of submissions to the Hacker News website and analyse posts and their comments from the last 12 months (up to September 26, 2016).

The dataset includes the following columns:

title: title of the post

url: the url of the item being linked to

num_points: the number of upvotes the post received

num_comments: the number of comments the post received

author: the name of the account that made the post

created_at: the date and time the post was made
(the time zone is Eastern Time in the USA)
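
Because created_at is recorded in US Eastern Time, readers in other time zones may want to shift the hours before interpreting them. The sketch below is one way to do that, assuming Python 3.9 or later (for the standard-library zoneinfo module), available time zone data on the system, and that "Eastern Time" corresponds to the IANA zone America/New_York. The helper name parse_created_at is hypothetical and is not part of the project code.

```python
import datetime as dt
from zoneinfo import ZoneInfo  # standard library from Python 3.9

# Assumption: "Eastern Time" maps to the IANA zone America/New_York.
EASTERN = ZoneInfo("America/New_York")

def parse_created_at(value):
    """Parse a created_at string such as '8/4/2016 11:52' as Eastern Time."""
    naive = dt.datetime.strptime(value, "%m/%d/%Y %H:%M")
    return naive.replace(tzinfo=EASTERN)

local = parse_created_at("8/4/2016 11:52")
utc = local.astimezone(dt.timezone.utc)

print(local.isoformat())  # 2016-08-04T11:52:00-04:00
print(utc.isoformat())    # 2016-08-04T15:52:00+00:00
```

Attaching the zone at parse time means daylight saving time is taken into account, which a fixed hour offset would not do.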

In this project I will analyse posts whose titles begin with Ask HN or Show HN.
Users submit Ask HN posts to ask the Hacker News community a specific question.
Users submit Show HN posts to show the Hacker News community a project or to present something interesting to the public.

The two questions for this exercise are:

Do Ask HN or Show HN receive more comments on average?
Do posts created at a certain time receive more comments on average?

In [2]:
import csv

# Read hacker_news.csv into a list of lists; the with block closes the file afterwards.
with open('hacker_news.csv') as file:
    hn = list(csv.reader(file))

# Display the first five rows of the file.
hn[:5]


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [3]:
# Extract the first row (the column headers) and assign it to a variable.
headers = hn[0]
# Remove the header row from hn.
hn = hn[1:]
# Display the headers.
print(headers)
# Display the first five data rows.
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [4]:
# Create empty lists for ask_posts, show_posts, and other_posts.

ask_posts = []
show_posts = []
other_posts = []

# Sort each post from hn into the relevant list based on its title.

for post in hn:
    title = post[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(post)
    elif title.lower().startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)

#checking number of posts in each list.
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


In [5]:
#finding the total number of comments in ask posts.
total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [6]:
#finding the total number of comments in show posts.
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
    
#computing the average number of comments on show posts
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)    


10.31669535283993


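From the calculations above, I can see that Ask HN posts received about 14 comments on average, while Show HN posts received only about 10. Since Ask HN posts attract more comments, the rest of the analysis focuses on them.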
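
The two cells above repeat the same computation for different lists. As a minimal sketch, the averaging could be moved into one helper; the function name average_comments and the sample rows are hypothetical, and the default index 4 assumes the column order shown in the headers.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column over a list of rows."""
    if not posts:
        return 0.0  # avoid ZeroDivisionError on an empty list
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Hypothetical sample rows in the same column order as hacker_news.csv.
sample_posts = [
    ["1", "Ask HN: Example question?", "", "5", "10", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: Another question?", "", "3", "4", "user_b", "8/4/2016 12:10"],
]

print(average_comments(sample_posts))  # 7.0
```

With this helper, average_comments(ask_posts) and average_comments(show_posts) would give the two averages from the same code path.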

In [7]:
# Import the datetime module.
import datetime as dt
result_list = []

# Collect the creation time and number of comments for each ask post.

for row in ask_posts:
    created_at = row[6]
    num_comments = row[4]
    result_list.append([created_at, int(num_comments)])

counts_by_hour = {}
comments_by_hour = {}

# Format of the created_at strings, e.g. '8/4/2016 11:52'.

date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    post_date = row[0]
    comments = row[1]
    date = dt.datetime.strptime(post_date, date_format)
    hour = date.strftime("%H")

    # Build the frequency tables: posts per hour and comments per hour.
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
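
As an alternative sketch of the frequency tables, collections.defaultdict removes the need for the "if hour not in ..." check, because missing keys start at 0. The function name comments_by_hour_table and the sample rows are hypothetical; the date format is assumed to match the one used above.

```python
import datetime as dt
from collections import defaultdict

def comments_by_hour_table(rows, date_format="%m/%d/%Y %H:%M"):
    """Return (posts per hour, comments per hour) for [created_at, num_comments] rows."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)

# Hypothetical rows in the same shape as result_list.
sample_rows = [["8/4/2016 11:52", 52], ["1/26/2016 11:05", 10], ["6/17/2016 0:01", 1]]
counts, comments = comments_by_hour_table(sample_rows)

print(counts)    # {'11': 2, '00': 1}
print(comments)  # {'11': 62, '00': 1}
```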

In [8]:
# Calculate the average number of comments per ask post for each hour of creation.
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
print(avg_by_hour)

[['14', 13.233644859813085], ['21', 16.009174311926607], ['23', 7.985294117647059], ['10', 13.440677966101696], ['13', 14.741176470588234], ['05', 10.08695652173913], ['16', 16.796296296296298], ['02', 23.810344827586206], ['11', 11.051724137931034], ['20', 21.525], ['18', 13.20183486238532], ['00', 8.127272727272727], ['22', 6.746478873239437], ['17', 11.46], ['15', 38.5948275862069], ['04', 7.170212765957447], ['09', 5.5777777777777775], ['01', 11.383333333333333], ['19', 10.8], ['07', 7.852941176470588], ['06', 9.022727272727273], ['12', 9.41095890410959], ['03', 7.796296296296297], ['08', 10.25]]


In [15]:

# Swap each [hour, average] pair to [average, hour] so the list sorts by average.
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

# Sort in descending order of average comments.
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap[0:4], "\n")

print("Average comments per post: \n")

# Format and display the five hours with the highest averages.
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg
        )
    )

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16']] 

Average comments per post: 

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
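
The swap step is only needed because sorted compares lists by their first element. A minimal sketch of the same ranking using the key argument of sorted is shown here; the sample list is a hypothetical subset of avg_by_hour with values rounded for brevity.

```python
# Hypothetical subset of avg_by_hour: [hour, average comments per post].
sample_avg_by_hour = [["14", 13.23], ["15", 38.59], ["02", 23.81], ["20", 21.52]]

# Sort by the second element (the average) without swapping the columns.
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print(f"{hour}:00: {avg:.2f} average comments per post")
# 15:00: 38.59 average comments per post
# 02:00: 23.81 average comments per post
# 20:00: 21.52 average comments per post
```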


Conclusion:
In this project, I worked with a data set from Hacker News, a site where user-submitted posts are voted and commented on. I analysed two types of posts: Ask HN posts, where users ask the Hacker News community a question, and Show HN posts, where users share a project or something else of interest.
The analysis shows that Ask HN posts receive more comments on average (about 14) than Show HN posts (about 10).
I also found that the average number of comments on Ask HN posts varies with the hour of submission. For example, posts created at 15:00 averaged about 38.6 comments, whereas posts created at 21:00 averaged about 16. All times are Eastern Time in the USA.
