# Data Analysis Project: Exploring Hacker News Submissions

This guided project brings together several essential skills, including working with strings, object-oriented programming, and handling dates and times. By applying these skills, we will conduct a data analysis on a dataset of submissions from the popular technology site Hacker News.

## Introduction

[Hacker News](https://news.ycombinator.com/) is a well-known platform where users submit stories that receive votes and comments, similar to Reddit. It is particularly popular within the technology and startup communities. Posts that reach the top of the Hacker News listings can attract hundreds of thousands of visitors.

## Dataset Overview

The dataset we will be using is a curated subset of Hacker News submissions. It includes approximately 20,000 rows, focusing on posts that received comments. The dataset contains the following columns:

- **id**: Unique identifier of the post on Hacker News.
- **title**: Title of the post.
- **url**: URL linked to the post (if available).
- **num_points**: Total points acquired by the post (calculated as upvotes minus downvotes).
- **num_comments**: Number of comments on the post.
- **author**: Username of the post submitter.
- **created_at**: Date and time of the post's submission.
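
To make the column layout concrete, the sketch below pairs the column names listed above with one row from the dataset. The variable names `columns`, `sample_row`, and `post` are illustrative only and are not used elsewhere in this notebook.

```python
# Column names as listed above.
columns = ["id", "title", "url", "num_points", "num_comments", "author", "created_at"]

# One row from the dataset; the csv module yields every field as a string.
sample_row = [
    "12224879",
    "Interactive Dynamic Video",
    "http://www.interactivedynamicvideo.com/",
    "386",
    "52",
    "ne0phyte",
    "8/4/2016 11:52",
]

# Pair each column name with its value for easier lookup.
post = dict(zip(columns, sample_row))

# Numeric fields must be converted before doing arithmetic on them.
num_comments = int(post["num_comments"])
print(post["title"], num_comments)
```

The rest of the notebook indexes rows by position (for example, `row[4]` for `num_comments`), which is equivalent but relies on remembering the column order.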

## Project Objectives

In this project, we will:

1. Explore the dataset and familiarize ourselves with its structure.
2. Perform data cleaning and preparation as necessary.
3. Conduct various analyses to gain insights into the popularity and engagement of Hacker News posts.
4. Utilize string manipulation techniques to extract relevant information.
5. Apply object-oriented programming principles to create efficient and reusable code.
6. Use date and time functions to analyze temporal patterns in post submissions.

By completing this project, you will practice data analysis, string manipulation, object-oriented programming, and working with dates and times on a real dataset of Hacker News submissions.
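
As one possible illustration of how these skills fit together, the sketch below wraps a single row in a small class. The `Post` class, its `kind` method, and the `DATE_FORMAT` constant are hypothetical names introduced here; the date format is an assumption based on values such as `8/4/2016 11:52` in the `created_at` column.

```python
import datetime as dt


class Post:
    """Hypothetical wrapper around one row of the dataset."""

    # Assumed format of the created_at column, e.g. "8/4/2016 11:52".
    DATE_FORMAT = "%m/%d/%Y %H:%M"

    def __init__(self, row):
        # Column positions follow the dataset layout:
        # title is index 1, num_comments is index 4, created_at is index 6.
        self.title = row[1]
        self.num_comments = int(row[4])
        self.created_at = dt.datetime.strptime(row[6], self.DATE_FORMAT)

    def kind(self):
        # Classify the post by its title prefix, ignoring case.
        lowered = self.title.lower()
        if lowered.startswith("ask hn"):
            return "ask"
        if lowered.startswith("show hn"):
            return "show"
        return "other"


row = ["12224879", "Interactive Dynamic Video",
       "http://www.interactivedynamicvideo.com/", "386", "52",
       "ne0phyte", "8/4/2016 11:52"]
post = Post(row)
print(post.kind(), post.num_comments, post.created_at.hour)
```

The cells that follow work with plain lists and dictionaries instead; a class like this is only one way to package the same string, number, and date handling.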

In [10]:
import csv

# Open the file
with open('hacker_news.csv', 'r') as file:

    # Create a reader object
    reader = csv.reader(file)
    
    # Convert the reader object into a list
    hn = list(reader)

print(hn[0:9])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'], ['10482257', 'T

In [11]:
# Store the header row separately
headers = hn[0]

# Remove the header row from the data
del hn[0]

# Display the header and the first few rows
print(headers)
print(hn[0:3])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]


In [22]:
# Create one list per post category
ask_posts = []
show_posts = []
other_posts = []

# Sort each post into a category based on its title prefix (case-insensitive)
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
# Print the number of posts in each category
print("Ask: {}, Show: {}, Other: {}".format(len(ask_posts), len(show_posts), len(other_posts)))

Ask: 1744, Show: 1162, Other: 17194


In [25]:
# Sum the comments on all Ask HN posts
total_ask_comments = 0
for row in ask_posts:
    comment = int(row[4])
    total_ask_comments += comment

# Average comments per Ask HN post
avg_ask_comments = total_ask_comments / len(ask_posts)

# Sum the comments on all Show HN posts
total_show_comments = 0
for row in show_posts:
    comment = int(row[4])
    total_show_comments += comment

# Average comments per Show HN post
avg_show_comments = total_show_comments / len(show_posts)

print("Average of ask: {}, Average of show: {}".format(avg_ask_comments, avg_show_comments))


Average of ask: 14.038417431192661, Average of show: 10.31669535283993


In [28]:
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

# Collect [created_at, num_comments] pairs for every Ask HN post
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

# Total comments and number of posts, keyed by hour of creation ("00" to "23")
comments_by_hour = {}
counts_by_hour = {}

# Iterate through the pairs and update both dictionaries
for row in result_list:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")

    if time not in counts_by_hour:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1
    else:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1

print(comments_by_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
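
The explicit `if`/`else` check for a missing key can also be avoided. The sketch below shows one alternative using `collections.defaultdict`; it is a variation on the cell above rather than the approach used in this notebook, and the `demo` pairs are illustrative values.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# Illustrative (created_at, num_comments) pairs.
demo = [("8/4/2016 11:52", 52), ("1/26/2016 19:30", 10), ("6/23/2016 11:05", 8)]

# Missing keys start at 0, so no membership check is needed.
comments_by_hour = defaultdict(int)
counts_by_hour = defaultdict(int)

for created_at, num_comments in demo:
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    comments_by_hour[hour] += num_comments
    counts_by_hour[hour] += 1

print(dict(comments_by_hour))  # {'11': 60, '19': 10}
print(dict(counts_by_hour))    # {'11': 2, '19': 1}
```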


In [32]:
# Calculate the average number of comments per post for each hour

avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


In [34]:
# Swap each pair to [average, hour] so that sorting orders by average, highest first
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
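
Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` also accepts a `key` function, which leaves each pair in its original `[hour, average]` order. The list `avg_by_hour_demo` below is a hypothetical name holding four of the pairs computed earlier.

```python
# Four [hour, average] pairs taken from the averages computed earlier.
avg_by_hour_demo = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
    ["20", 21.525],
]

# Sort by the second element (the average), highest first.
sorted_by_avg = sorted(avg_by_hour_demo, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg[:3]:
    print(hour, avg)
```

One difference to keep in mind: with a `key`, pairs that have equal averages keep their original relative order, whereas sorting the swapped lists breaks ties by comparing the hour strings.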

In [36]:
# Print the five hours with the highest average number of comments

print("Top 5 Hours for 'Ask HN' Comments")
for avg, hr in sorted_swap[:5]:
    hour_label = dt.datetime.strptime(hr, "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_label, avg))

Top 5 Hours for 'Ask HN' Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
