# **Hacker News Posts - Data Cleaning**


In this project, we're diving into a dataset of submissions made to the renowned tech website, Hacker News.

Hacker News, founded by the startup incubator Y Combinator, operates much like Reddit. Users submit stories, called "posts," which can garner votes and comments. The site enjoys significant popularity in tech and startup spheres. Stories that climb to the top of the Hacker News rankings can draw in immense traffic, sometimes reaching hundreds of thousands of views.

This is one of the guided projects from Dataquest, and the data for it can be downloaded from [here](https://dq-content.s3.amazonaws.com/356/hacker_news.csv).


Skills I learnt from this project -
1.   How to work with strings
2.   Object-oriented programming
3.   Dates and times








Below are descriptions of the columns:

1. id: the unique identifier from Hacker News for the post
2. title: the title of the post
3. url: the URL that the post links to, if the post has a URL
4. num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
5. num_comments: the number of comments on the post
6. author: the username of the person who submitted the post
7. created_at: the date and time of the post's submission

## Importing necessary libraries

In [5]:
import pandas as pd
import numpy as np
import csv
import datetime as dt

We're specifically interested in posts with titles that begin with either **Ask HN** or **Show HN**.

Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:

> Ask HN: How to improve my personal website?

> Ask HN: Am I the only one outraged by Twitter shutting down share counts?

> Ask HN: Aby recent changes to CSS that broke mobile?


Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just something interesting. Below are a few examples:


> Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform

> Show HN: Something pointless I made

> Show HN: Shanhu.io, a programming playground powered by e8vm


We'll compare these two types of posts to determine the following:

1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

With the libraries imported above, let's start by reading the dataset into a list of lists.

## Load Dataset

In [6]:
# First, we specify the path to the data we downloaded
# I am using Colab, so the path points to the folder I uploaded the dataset to
file_path = '/content/sample_data/hacker_news.csv'

# Next, we load the dataset
df = pd.read_csv(file_path)

# Display the first few rows of the dataframe
print(df.head())

         id                                              title  \
0  12224879                          Interactive Dynamic Video   
1  10975351  How to Use Open Source and Shut the Fuck Up at...   
2  11964716  Florida DJs May Face Felony for April Fools' W...   
3  11919867       Technology ventures: From Idea to Enterprise   
4  10301696  Note by Note: The Making of Steinway L1037 (2007)   

                                                 url  num_points  \
0            http://www.interactivedynamicvideo.com/         386   
1  http://hueniverse.com/2016/01/26/how-to-use-op...          39   
2  http://www.thewire.com/entertainment/2013/04/f...           2   
3  https://www.amazon.com/Technology-Ventures-Ent...           3   
4  http://www.nytimes.com/2007/11/07/movies/07ste...           8   

   num_comments      author       created_at  
0          52.0    ne0phyte   8/4/2016 11:52  
1          10.0      josep2  1/26/2016 19:30  
2           1.0    vezycash  6/23/2016 22:20  
3     

In [7]:
#Read the hacker_news.csv file in as a list of lists

# Initialize an empty list to hold the data
data = []

# Open the CSV file
with open(file_path, mode='r', encoding='utf-8') as file:
    # Create a CSV reader object
    csv_reader = csv.reader(file)

    # Iterate over the rows in the CSV reader
    for row in csv_reader:
        # Append each row to the data list
        data.append(row)

# Print the first few rows to verify
for row in data[:5]:
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


In [8]:
#storing data in hn
hn = data
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


## Split Data for processing

Notice that the first inner list contains the column headers, and each list after it contains the data for one row. To analyze our data, we first need to remove the row containing the column headers. Let's remove that first row next.
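
As a small aside, the header split can also be sketched with star-unpacking. The miniature dataset and the names sample, sample_headers, and sample_rows below are hypothetical and only mirror the list-of-lists shape used in this notebook.

```python
# Hypothetical miniature dataset in the same list-of-lists shape
sample = [
    ["id", "title", "num_comments"],
    ["1", "Ask HN: Example question?", "6"],
    ["2", "Show HN: Example project", "3"],
]

# Star-unpacking: the first inner list becomes the header, the rest the rows
sample_headers, *sample_rows = sample

# Look up a column position by name instead of hard-coding the index
comments_idx = sample_headers.index("num_comments")
comment_counts = [int(row[comments_idx]) for row in sample_rows]

print(sample_headers)
print(comment_counts)
```

Looking up the index by name is optional, but it can make later code less dependent on the column order.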

In [9]:
# Extract the first row as headers
headers = data[0]

# Remove the first row from the data
hn = data[1:]

# Display the headers
print("Headers:")
print(headers)

# Display the first five rows of the dataset to verify
print("\nFirst five rows of hn:")
for row in hn[:5]:
    print(row)

Headers:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

First five rows of hn:
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2

## Filtering data

Now that we've removed the headers from hn, we're ready to filter our data. Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.

To find the posts that begin with either Ask HN or Show HN, we'll use the string method startswith. Given a string object, say, string1, we can check if it starts with, say, dq by inspecting the output of string1.startswith('dq'). If string1 starts with dq, it will return True; otherwise, it will return False. Because startswith is case-sensitive, we lowercase each title before checking its prefix.
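
Here is a minimal sketch of that prefix check. The sample titles and the classify helper are hypothetical and exist only for illustration.

```python
# Hypothetical sample titles, used only for illustration
titles = [
    "Ask HN: How to improve my personal website?",
    "ask hn: a lowercase variant",
    "Show HN: Something pointless I made",
    "Interactive Dynamic Video",
]

def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # normalize case before comparing prefixes
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

labels = [classify(t) for t in titles]
print(labels)
```

Lowercasing first means that a title written as "ask hn" is still counted as an ask post.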

In [10]:
# Create three empty lists, one per post category
ask_posts = []
show_posts = []
other_posts = []

# Assign the lowercased title in each row to a variable named title
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Check the number of posts in each list
print("Number of ask posts:", len(ask_posts))
print("Number of show posts:", len(show_posts))
print("Number of other posts:", len(other_posts))

Number of ask posts: 1192
Number of show posts: 785
Number of other posts: 11577


In [11]:
# Print the first five rows
for row in show_posts[:5]:
    print(row)


for row in ask_posts[:5]:
    print(row)


['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']
['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']
['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']
['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11']
['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']
['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']
['11610310', 'Ask HN: Aby rec

## Analyzing posts

Next, let's determine if ask posts or show posts receive more comments on average.

In [12]:
# total number of comments in ask posts

total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

In [13]:
avg_ask_comments = total_ask_comments/len(ask_posts)
avg_ask_comments

14.819630872483222

In [14]:
# total number of comments in show posts

total_show_comments = 0
for row in show_posts:
  num_comments = int(row[4])
  total_show_comments += num_comments

In [15]:
avg_show_comments = total_show_comments/ len(show_posts)
avg_show_comments

9.750318471337579

## Findings

In [16]:
# Comparison
if avg_ask_comments > avg_show_comments:
    print("Ask posts receive more comments on average.")
elif avg_ask_comments < avg_show_comments:
    print("Show posts receive more comments on average.")
else:
    print("Both ask and show posts receive the same average number of comments.")

Ask posts receive more comments on average.


Hence we can see that ask posts receive more comments on average than show posts.
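
The two averages above follow the same pattern, so one possible refactoring is a small helper. The function name average_column and the sample rows are hypothetical, not part of the original notebook.

```python
def average_column(rows, index):
    """Mean of the integer values found at position `index` in each row."""
    if not rows:
        return 0.0  # avoid dividing by zero on an empty list
    return sum(int(row[index]) for row in rows) / len(rows)

# Hypothetical rows: only position 4 (num_comments) matters here
sample_ask = [
    ["1", "Ask HN: first", "", "2", "6"],
    ["2", "Ask HN: second", "", "28", "29"],
]

avg_sample = average_column(sample_ask, 4)
print(avg_sample)  # 17.5
```

With the lists built earlier, this could be called as average_column(ask_posts, 4) and average_column(show_posts, 4).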

## Ask posts Analysis

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.




**To calculate the number of ask posts and comments by hour created.**
We'll use the datetime module to work with the data in the created_at column.
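
As a quick illustration, this sketch parses one created_at value and extracts its hour. The format string assumes the month/day/year hour:minute layout seen in the rows printed earlier.

```python
import datetime as dt

raw = "8/16/2016 9:55"  # same layout as the created_at column

# strptime turns the string into a datetime object
parsed = dt.datetime.strptime(raw, "%m/%d/%Y %H:%M")

# strftime("%H") gives a zero-padded hour string; .hour gives an integer
hour_label = parsed.strftime("%H")

print(parsed)       # 2016-08-16 09:55:00
print(hour_label)   # 09
print(parsed.hour)  # 9
```

The zero-padded string form sorts correctly as text, which is convenient when the hour is used as a dictionary key.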

In [17]:
result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])


In [18]:
counts_by_hour = {}
comments_by_hour = {}

# counts_by_hour: contains the number of ask posts created during each hour of the day.
# comments_by_hour: contains the corresponding number of comments ask posts created at each hour received.

for row in result_list:
    created_at = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = created_at.strftime("%H")

    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]


Next, we'll use these two dictionaries **to calculate the average number of comments for posts created during each hour of the day**.

In [19]:
avg_by_hour = []

for avg in comments_by_hour:
  avg_by_hour.append([avg, comments_by_hour[avg]/counts_by_hour[avg]])

In [20]:
avg_by_hour

[['09', 6.5],
 ['13', 12.50909090909091],
 ['10', 17.225],
 ['14', 13.246753246753247],
 ['16', 18.226666666666667],
 ['23', 8.075],
 ['12', 9.745098039215685],
 ['17', 13.402777777777779],
 ['15', 43.1025641025641],
 ['21', 15.933333333333334],
 ['20', 29.942307692307693],
 ['02', 10.130434782608695],
 ['18', 12.77027027027027],
 ['03', 9.31578947368421],
 ['05', 11.037037037037036],
 ['19', 13.52],
 ['01', 11.333333333333334],
 ['22', 6.672727272727273],
 ['08', 12.82857142857143],
 ['04', 7.722222222222222],
 ['00', 7.914285714285715],
 ['06', 8.225806451612904],
 ['07', 7.541666666666667],
 ['11', 13.512195121951219]]

We now have the results we need, but this format makes it difficult to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [21]:
swap_avg_by_hour = []

for swap in avg_by_hour:
  swap_avg_by_hour.append([swap[1], swap[0]])

In [22]:
swap_avg_by_hour

[[6.5, '09'],
 [12.50909090909091, '13'],
 [17.225, '10'],
 [13.246753246753247, '14'],
 [18.226666666666667, '16'],
 [8.075, '23'],
 [9.745098039215685, '12'],
 [13.402777777777779, '17'],
 [43.1025641025641, '15'],
 [15.933333333333334, '21'],
 [29.942307692307693, '20'],
 [10.130434782608695, '02'],
 [12.77027027027027, '18'],
 [9.31578947368421, '03'],
 [11.037037037037036, '05'],
 [13.52, '19'],
 [11.333333333333334, '01'],
 [6.672727272727273, '22'],
 [12.82857142857143, '08'],
 [7.722222222222222, '04'],
 [7.914285714285715, '00'],
 [8.225806451612904, '06'],
 [7.541666666666667, '07'],
 [13.512195121951219, '11']]

In [23]:
sorted_swap= sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap)

[[43.1025641025641, '15'], [29.942307692307693, '20'], [18.226666666666667, '16'], [17.225, '10'], [15.933333333333334, '21'], [13.52, '19'], [13.512195121951219, '11'], [13.402777777777779, '17'], [13.246753246753247, '14'], [12.82857142857143, '08'], [12.77027027027027, '18'], [12.50909090909091, '13'], [11.333333333333334, '01'], [11.037037037037036, '05'], [10.130434782608695, '02'], [9.745098039215685, '12'], [9.31578947368421, '03'], [8.225806451612904, '06'], [8.075, '23'], [7.914285714285715, '00'], [7.722222222222222, '04'], [7.541666666666667, '07'], [6.672727272727273, '22'], [6.5, '09']]


**Top 5 Hours for Ask Posts Comments**

In [24]:
print(sorted_swap[:5])


[[43.1025641025641, '15'], [29.942307692307693, '20'], [18.226666666666667, '16'], [17.225, '10'], [15.933333333333334, '21']]


In [25]:
for avg, hour in sorted_swap[:5]:
    hour_dt = dt.datetime.strptime(hour, "%H")
    hour_formatted = hour_dt.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_formatted, avg))

15:00: 43.10 average comments per post
20:00: 29.94 average comments per post
16:00: 18.23 average comments per post
10:00: 17.23 average comments per post
21:00: 15.93 average comments per post
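
The swap-and-sort steps above can also be sketched with a key function, which sorts on the average without building a swapped copy. The values in sample_avg_by_hour are hypothetical, rounded stand-ins for illustration.

```python
# Hypothetical [hour, average] pairs in the same shape as avg_by_hour
sample_avg_by_hour = [["09", 6.5], ["15", 43.1], ["20", 29.9], ["10", 17.2]]

# Sort on the second element (the average), highest first, and keep three
top = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:3]

for hour, avg in top:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

Both approaches give the same ordering here. The key function just makes explicit which column drives the sort.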


To increase the chances of receiving comments on ask posts, one should consider posting during the hours with the highest average comments per post. According to the analysis, the top five hours for ask posts comments are:

15:00 (3:00 PM) - 43.10 average comments per post

20:00 (8:00 PM) - 29.94 average comments per post

16:00 (4:00 PM) - 18.23 average comments per post

10:00 (10:00 AM) - 17.23 average comments per post

21:00 (9:00 PM) - 15.93 average comments per post

These hours are based on the Eastern Time (ET) time zone. Depending on the time zone you live in, you may need to adjust these hours accordingly.
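
A rough sketch of that adjustment follows. It assumes fixed UTC offsets (UTC-5 for Eastern Standard Time and UTC+1 for a hypothetical reader in Central Europe), which ignores daylight saving time. For daylight-saving-aware conversion, the standard library zoneinfo module (Python 3.9+) would be a better fit.

```python
import datetime as dt

# Assumption: fixed offsets, no daylight-saving handling
EASTERN_STANDARD = dt.timezone(dt.timedelta(hours=-5), "EST")
CENTRAL_EUROPEAN = dt.timezone(dt.timedelta(hours=1), "CET")  # hypothetical reader's zone

def convert_hour(hour, source_tz, target_tz):
    """Convert an hour of the day between two fixed-offset time zones."""
    # The calendar date is arbitrary; only the time of day matters here
    moment = dt.datetime(2016, 1, 15, hour, 0, tzinfo=source_tz)
    return moment.astimezone(target_tz).strftime("%H:%M")

local = convert_hour(15, EASTERN_STANDARD, CENTRAL_EUROPEAN)
print(local)  # 21:00
```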

# Thank You

Here are some next steps for you to consider:

Determine if show or ask posts receive more points on average.

Determine if posts created at a certain time are more likely to receive more points.

Compare your results to the average number of comments and points other posts receive.
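
These next steps reuse the same group-by-hour pattern as the comment analysis, so one way to approach them is a reusable function. This is only a sketch: the name average_by_hour, its parameters, and the demo rows are hypothetical, and the column positions assume the layout described at the top of this notebook.

```python
import datetime as dt
from collections import defaultdict

def average_by_hour(rows, value_index, date_index=6):
    """Return (hour, average) pairs sorted by average, highest first."""
    totals = defaultdict(int)  # sum of the chosen column per hour
    counts = defaultdict(int)  # number of posts per hour
    for row in rows:
        created = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    averages = [(hour, totals[hour] / counts[hour]) for hour in totals]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)

# Hypothetical rows: id, title, url, num_points, num_comments, author, created_at
demo_rows = [
    ["1", "Show HN: A", "", "10", "2", "u1", "1/9/2016 20:45"],
    ["2", "Show HN: B", "", "30", "4", "u2", "7/28/2016 20:11"],
    ["3", "Show HN: C", "", "5", "1", "u3", "4/28/2016 7:05"],
]

points_by_hour = average_by_hour(demo_rows, value_index=3)
print(points_by_hour)  # [('20', 20.0), ('07', 5.0)]
```

Passing value_index=4 instead would average comments rather than points.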

**Determine if show or ask posts receive more points on average.**


In [26]:
show_num_points = []
for row in show_posts:
  show_num_points.append(int(row[3]))

In [27]:
total_show_num_points = 0
for points in show_num_points:
  total_show_num_points += points

In [28]:
total_show_num_points

20035

In [29]:
avg_show_points = total_show_num_points/len(show_posts)
avg_show_points

25.522292993630572

In [30]:
ask_num_points = []
for row in ask_posts:
  ask_num_points.append(int(row[3]))

In [31]:
total_ask_num_points = 0
for points in ask_num_points:
  total_ask_num_points += points

In [32]:
total_ask_num_points

18252

In [33]:
avg_ask_points = total_ask_num_points/len(ask_posts)
avg_ask_points

15.312080536912752

In [34]:
# Comparison
if avg_ask_points > avg_show_points:
    print("Ask posts receive more points on average.")
elif avg_ask_points < avg_show_points:
    print("Show posts receive more points on average.")
else:
    print("Both ask and show posts receive the same average number of points.")

Show posts receive more points on average.


**Determine if posts created at a certain time are more likely to receive more points.**

Show posts Analysis

In [35]:
test_list = []

for row in show_posts:
  num_points = int(row[3])
  created_at = row[6]
  test_list.append([created_at, num_points])

In [36]:
count_by_hour = {}
point_by_hour = {}

for row in test_list:
    created_at = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = created_at.strftime("%H")

    if hour not in count_by_hour:
        count_by_hour[hour] = 1
        point_by_hour[hour] = row[1]
    else:
        count_by_hour[hour] += 1
        point_by_hour[hour] += row[1]

In [37]:
avg_by_hour = []

for avg in point_by_hour:
  avg_by_hour.append([avg, point_by_hour[avg]/count_by_hour[avg]])

In [38]:
avg_by_hour

[['14', 17.32758620689655],
 ['22', 52.93103448275862],
 ['18', 24.857142857142858],
 ['07', 28.833333333333332],
 ['20', 12.324324324324325],
 ['05', 4.7272727272727275],
 ['16', 29.566037735849058],
 ['19', 35.86046511627907],
 ['15', 29.962962962962962],
 ['03', 22.272727272727273],
 ['17', 22.06451612903226],
 ['06', 10.0],
 ['02', 13.285714285714286],
 ['13', 27.582089552238806],
 ['08', 17.307692307692307],
 ['21', 14.78125],
 ['04', 14.571428571428571],
 ['11', 28.70967741935484],
 ['12', 43.26190476190476],
 ['23', 35.083333333333336],
 ['09', 20.923076923076923],
 ['01', 18.61111111111111],
 ['10', 14.037037037037036],
 ['00', 44.05882352941177]]

In [39]:
# Create a list with columns swapped
swap_avg_by_hour = [[row[1], row[0]] for row in avg_by_hour]

# Sort the list by the average number of points in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Print the results
print("Top 5 Hours for Show Posts Points")
for avg, hour in sorted_swap[:5]:
    print("{}: {:.2f} average points per post".format(hour, avg))


Top 5 Hours for Show Posts Points
22: 52.93 average points per post
00: 44.06 average points per post
12: 43.26 average points per post
19: 35.86 average points per post
23: 35.08 average points per post
