# Analyzing Hacker News

The goal of the project is to explore the hacker_news dataset and capture some insights about:

- Engagement (comments on average) across "Ask HN" and "Show HN"
- Most engaged users (based on commeting activity) in both 'Ask HN' and 'Show HN'
- Engagement (comments on average) based on post time

## 1. Exploring the dataset

The dataset used for this project can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

In [1]:
from csv import reader

f = open("hacker_news.csv", encoding="utf-8")
hn_raw = list(reader(f))
hn = hn_raw[1:]
hn_header = hn_raw[0]

In [2]:
# just in case I need to explore shape of subset of data
def explore_dataset_shape(dataset):
    rows = len(dataset)
    columns = len(dataset[0])
    return (rows, columns)

hn_shape = explore_dataset_shape(hn)
print(f"The dataset has {hn_shape[0]:,} rows and {hn_shape[1]:,} columns")

The dataset has 293,119 rows and 7 columns


### 1.1. Mapping the header

In [3]:
# useful in case I need a refresher on the label and index
def print_header_info():
    print("These are the columns labels of the dataset")
    for index, column_label in enumerate(hn_header):
        print(index, column_label)
        
print_header_info()

These are the columns labels of the dataset
0 id
1 title
2 url
3 num_points
4 num_comments
5 author
6 created_at


### 1.2. Creating the data subset

Since, the focus of the analysis is on either the 'Ask HN' or 'Show HN' categories, we isolate these posts from the rest of the dataset.

The title of these posts begin with either 'Ask HN' or 'Show HN'. We can use this information to filter out the information.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

ask = "ask hn"
show = "show hn"

for post in hn:
    title = post[1].lower()
    
    if title.startswith(ask):
        ask_posts.append(post)
    elif title.startswith(show):
        show_posts.append(post)
        
    else:
        other_posts.append(post)

Now, let's explore the first 3 entries for the ask_posts and show_posts lists.

In [5]:
ask_posts[:3]

[['12578908',
  'Ask HN: What TLD do you use for local development?',
  '',
  '4',
  '7',
  'Sevrene',
  '9/26/2016 2:53'],
 ['12578522',
  'Ask HN: How do you pass on your work when you die?',
  '',
  '6',
  '3',
  'PascLeRasc',
  '9/26/2016 1:17'],
 ['12577908',
  'Ask HN: How a DNS problem can be limited to a geographic region?',
  '',
  '1',
  '0',
  'kuon',
  '9/25/2016 22:57']]

In [6]:
show_posts[0:3]

[['12578335',
  'Show HN: Finding puns computationally',
  'http://puns.samueltaylor.org/',
  '2',
  '0',
  'saamm',
  '9/26/2016 0:36'],
 ['12578182',
  'Show HN: A simple library for complicated animations',
  'https://christinecha.github.io/choreographer-js/',
  '1',
  '0',
  'christinecha',
  '9/26/2016 0:01'],
 ['12578098',
  'Show HN: WebGL visualization of DNA sequences',
  'http://grondilu.github.io/dna.html',
  '1',
  '0',
  'grondilu',
  '9/25/2016 23:44']]

By comparing the entries, we can observe the that ask posts *don't have* an associated URL with the post. The data types for the other columns seems to be consistent.

## 2. Cleaning the data

Now, that the data is isolated we can proceed with the cleaning process.

In [7]:
print_header_info()

These are the columns labels of the dataset
0 id
1 title
2 url
3 num_points
4 num_comments
5 author
6 created_at


Every datapoint is currently stored as a string. We'll a subset of the data points in the appropriate data type. This to faciliate further analysis.

In [8]:
import datetime as dt # make the package available globally

In [9]:
def converting_data_types(dataset):
    for entry in dataset:
        post_id = entry[0]
        post_points = entry[3]
        post_comments = entry[4]
    
        date_format = "%m/%d/%Y %H:%M"
        post_date = entry[6]
        post_date = dt.datetime.strptime(post_date, date_format)
    
        entry[0] = int(post_id)
        entry[3] = int(post_points)
        entry[4] = int(post_comments)
        entry[6] = post_date
    return dataset

In [10]:
ask_posts = converting_data_types(ask_posts)
show_posts = converting_data_types(show_posts)

In [11]:
print(f"There are {len(ask_posts):,} ask posts.")
print(f"And, {len(show_posts):,} show posts.")

There are 9,139 ask posts.
And, 10,158 show posts.


In [12]:
difference_ask_show = len(ask_posts) - len(show_posts)
print("The difference between ask and show posts is {}".format(difference_ask_show))

The difference between ask and show posts is -1019


## 3. Comparing Data

### 3.1. Comments between Ask HN and Show HN

In [13]:
def average_value(a_list,index): # I assume that the data is in string format
    list_length = len(a_list)
    total_value = 0
    for item in a_list:
        total_value += item[index]
    average = total_value/list_length
    return average

In [14]:
average_ask_comments = average_value(ask_posts,4)
average_ask_comments

10.393478498741656

In [15]:
average_show_comments = average_value(show_posts,4)
average_show_comments

4.886099625910612

In [16]:
# show posts receive more engagements that ask
difference_average = average_ask_comments - average_show_comments
difference_average

5.507378872831044

**RESULT OF COMMENTS COMPARISON**: On average, the ask posts receive more comments compared to show posts.

### 3.2. Points between Ask HN and Show HN

In [17]:
ask_average_points = average_value(ask_posts,3)
show_average_points = average_value(show_posts, 3)

In [18]:
ask_average_points

11.31174089068826

In [19]:
show_average_points

14.843571569206537

### 3.3. Most engaged users

In [20]:
from operator import itemgetter

def most_engaged_user(dataset):
    most_engaged_table = {}
    for post in dataset:
        author = post[5]
        if author in most_engaged_table:
            most_engaged_table[author] += 1
        else:
            most_engaged_table[author] = 1
            
    most_engaged_ranking = most_engaged_table.items()
    most_engaged_ranking = sorted(most_engaged_ranking, key=itemgetter(1), reverse=True)
    return most_engaged_ranking

In [21]:
ask_most_engaged_user = most_engaged_user(ask_posts)

print("The TOP 10 Most engaged (comments) users in the Ask HN are:")
for rank, user in enumerate(ask_most_engaged_user[:10], start=1):
    print(rank, user[0], f"with {user[1]} comments")

The TOP 10 Most engaged (comments) users in the Ask HN are:
1 hoodoof with 70 comments
2 tmaly with 48 comments
3 tixocloud with 41 comments
4 a_lifters_life with 37 comments
5 whoishiring with 36 comments
6 sharemywin with 31 comments
7 philippnagel with 31 comments
8 chirau with 27 comments
9 rayalez with 26 comments
10 baccheion with 23 comments


In [22]:
show_most_engaged_user = most_engaged_user(show_posts)

print("The TOP 10 Most engaged (comments) users in the Ask HN are:")
for rank, user in enumerate(show_most_engaged_user[:10], start=1):
    print(rank, user[0], f"with {user[1]} comments")

The TOP 10 Most engaged (comments) users in the Ask HN are:
1 bdehaaff with 36 comments
2 soheil with 33 comments
3 brakmic with 30 comments
4 fiatjaf with 29 comments
5 tonyspiro with 28 comments
6 nealmydataorg with 25 comments
7 max0563 with 23 comments
8 bucaran with 21 comments
9 alexellisuk with 20 comments
10 afshinmeh with 19 comments


### 3.4. Time with the greatest engagment in the Ask HN dataset (on average)

The goal of this part of the analysis is to **find the time with the highest numbers of comments on average**.

In [23]:
print_header_info()

These are the columns labels of the dataset
0 id
1 title
2 url
3 num_points
4 num_comments
5 author
6 created_at


In [24]:
hours_dictionary = {}

for post in ask_posts:
    
    comments = post[4]
    hour = post[-1]
    hour = hour.strftime("%H")
    
    if hour in hours_dictionary:
        hours_dictionary[hour]['counts'] += 1
        hours_dictionary[hour]['comments'] += comments
    else:
        hours_dictionary[hour] = {}
        hours_dictionary[hour]["counts"] = 1
        hours_dictionary[hour]["comments"] = comments

In [25]:
for hour in hours_dictionary:
    hours_dictionary[hour]['average'] = hours_dictionary[hour]['comments'] / hours_dictionary[hour]['counts']
    
hours_dictionary

{'02': {'counts': 269, 'comments': 2996, 'average': 11.137546468401487},
 '01': {'counts': 282, 'comments': 2089, 'average': 7.407801418439717},
 '22': {'counts': 383, 'comments': 3372, 'average': 8.804177545691905},
 '21': {'counts': 518, 'comments': 4500, 'average': 8.687258687258687},
 '19': {'counts': 552, 'comments': 3954, 'average': 7.163043478260869},
 '17': {'counts': 587, 'comments': 5547, 'average': 9.449744463373083},
 '15': {'counts': 646, 'comments': 18525, 'average': 28.676470588235293},
 '14': {'counts': 513, 'comments': 4972, 'average': 9.692007797270955},
 '13': {'counts': 444, 'comments': 7245, 'average': 16.31756756756757},
 '11': {'counts': 312, 'comments': 2797, 'average': 8.96474358974359},
 '10': {'counts': 282, 'comments': 3013, 'average': 10.684397163120567},
 '09': {'counts': 222, 'comments': 1477, 'average': 6.653153153153153},
 '07': {'counts': 226, 'comments': 1585, 'average': 7.013274336283186},
 '03': {'counts': 271, 'comments': 2154, 'average': 7.9483394

In [26]:
ask_most_engaged_time_ranking = []
for hour, details in hours_dictionary.items():
    ask_most_engaged_time_ranking.append([hour, details['average']])

ranking_avg_by_hour = sorted(ask_most_engaged_time_ranking, key=itemgetter(1), reverse=True)

In [27]:
print("TOP 5 Hours For ASK HN Post Comments")

for rank, hour_details in enumerate(ask_most_engaged_time_ranking[:5], start=1):
    hour = hour_details[0]
    avg_comments = hour_details[1]
    ranking_message = "{}. {}:00 with {:.2f} average comments per post".format(rank,hour,avg_comments)
    print(ranking_message)

TOP 5 Hours For ASK HN Post Comments
1. 02:00 with 11.14 average comments per post
2. 01:00 with 7.41 average comments per post
3. 22:00 with 8.80 average comments per post
4. 21:00 with 8.69 average comments per post
5. 19:00 with 7.16 average comments per post
