Project guide: https://www.dataquest.io/m/63/guided-project%3A-transforming-data-with-python


The purpose of this project is to demonstrate writing Python scripts (`.py`) that are run from a shell command line.

# 1. The dataset
The dataset is from https://github.com/arnauddri/hn and contains info on submissions to [Hacker News](https://news.ycombinator.com/) from 2006 to 2015.

The following is an excerpt from DataQuest.
***
We've sampled 10000 rows from the data randomly, and removed all extraneous columns. Our dataset only has four columns:

*   `submission_time` -- when the story was submitted.
*   `upvotes` -- number of upvotes the submission got.
*   `url` -- the base domain of the submission.
*   `headline` -- the headline of the submission. Users can edit this, and it doesn't have to match the headline of the original article.
***

Each of the following sections comprises two parts:

1. Code for a Python script that is to be run from a shell command line.
2. An in-notebook demonstration of the functions inside the script.


# 2. Reading the data

## 2.1. Code for `read.py`

`read.py` will read in data and assign column labels. `read.py` provides a helper function `load_data` which will be imported by other Python scripts.

In [1]:
import pandas as pd

# column labels for the header-less CSV file
columns = ["submission_time", "upvotes", "url", "headline"]

def load_data():
    """
    Read in data and add column labels
    """

    # read in dataset
    df = pd.read_csv("hn_stories.csv", header=None)

    # add columns
    df.columns = columns

    return df

if __name__ == "__main__":
    load_data()
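
As a side note, the two steps in `load_data` (read the file, then assign labels) can be combined by passing the labels to `read_csv` through its `names` parameter. The sketch below is only an illustration: it reads from an in-memory string instead of `hn_stories.csv`, and the two rows are adapted from the excerpt shown in section 2.2.

```python
import io

import pandas as pd

columns = ["submission_time", "upvotes", "url", "headline"]

# in-memory stand-in for hn_stories.csv (assumption: the real file has no header row)
raw = io.StringIO(
    "2014-06-24T05:50:40.000Z,1,flux7.com,8 Ways to Use Docker in the Real World\n"
    "2011-04-03T15:43:44Z,67,algorithm.com.au,Immutability and Blocks Lambdas and Closures\n"
)

# names= assigns the column labels while reading
df = pd.read_csv(raw, header=None, names=columns)
print(df.head())
```

Whether to label columns during or after reading is a matter of taste; the result should be the same dataframe.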

## 2.2. Demonstration

In [2]:
# read in data
df = load_data()

# display first 5 rows of data
df.head()

Unnamed: 0,submission_time,upvotes,url,headline
0,2014-06-24T05:50:40.000Z,1,flux7.com,8 Ways to Use Docker in the Real World
1,2010-02-17T16:57:59Z,1,blog.jonasbandi.net,Software: Sadly we did adopt from the construc...
2,2014-02-04T02:36:30Z,1,blogs.wsj.com,Google’s Stock Split Means More Control for L...
3,2011-10-26T07:11:29Z,1,threatpost.com,SSL DOS attack tool released exploiting negoti...
4,2011-04-03T15:43:44Z,67,algorithm.com.au,Immutability and Blocks Lambdas and Closures


# 3. Which words appear in the headlines often?

## 3.1. Code

`count.py` contains the following code. It will (1) count the number of occurrences of each word in the headlines and (2) print the 100 most frequently occurring words.

To run it, type `python count.py` into a shell command line.

The [`Counter`](https://docs.python.org/3/library/collections.html#collections.Counter) class will be used.

In [3]:
from collections import Counter, OrderedDict
from read import load_data
from sys import argv

import pandas as pd
import re

def printTop100HeadlineWords():
    """
    Prints 100 words which appear most often
    in headlines, in a descending order
    """
    
    # ensure no argument is provided in command line
    if len(argv) > 1:
        print("Usage: python count.py")
        return

    # read in dataset
    df = load_data()

    # concatenate all headlines separated by a space
    h_con = ""

    for val in df[df["headline"].notnull()]["headline"].values:
        h_con += val.lower() + " "

    # split concatenated headlines into a list of strings
    h_con_l = h_con.split()

    # drop tokens that begin with a non-word character or an underscore
    h_con_l = [string for string in h_con_l if re.match(r"\W|_", string) is None]

    # count number of occurrences of each word in headlines
    count = Counter(h_con_l)

    # print 100 most often occurring words
    for word, n in count.most_common(100):
        print(word + ": " + str(n))

if __name__ == "__main__":
    printTop100HeadlineWords()

Usage: python count.py
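
The script above keeps trailing punctuation attached to words, so a token such as `hn:` is counted separately from `hn`. A possible variation, sketched here with a hypothetical helper `top_words` and made-up sample headlines, extracts runs of word characters with `re.findall` before counting. This is an alternative tokenisation, not what `count.py` does, and it would split a word like "don't" into two tokens.

```python
import re
from collections import Counter

def top_words(headlines, n=100):
    """Return the n most common lower-cased words across the given headlines."""
    words = []
    for headline in headlines:
        # \w+ keeps runs of letters, digits and underscores, so "HN:" becomes "hn"
        words.extend(re.findall(r"\w+", headline.lower()))
    return Counter(words).most_common(n)

# made-up headlines, for illustration only
sample = ["Ask HN: How do you learn?", "Show HN: My app", "How to learn Python"]

for word, n in top_words(sample, 3):
    print(word + ": " + str(n))
```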


## 3.2. Demonstration

The following is equivalent to running `python count.py` from a shell command line.

In [4]:
%run count.py

the: 2046
to: 1642
a: 1278
of: 1170
for: 1140
in: 1037
and: 936
is: 620
on: 568
hn:: 537
with: 537
how: 526
your: 480
you: 392
ask: 371
from: 310
new: 304
google: 303
why: 262
what: 258
an: 243
are: 223
by: 219
at: 213
show: 205
web: 192
it: 192
do: 183
app: 178
i: 173
as: 161
not: 160
that: 160
data: 157
about: 154
be: 154
facebook: 150
startup: 147
my: 131
using: 125
free: 125
online: 123
apple: 123
get: 122
can: 115
open: 114
will: 112
android: 110
this: 110
out: 109
we: 106
its: 102
now: 101
best: 101
up: 100
code: 98
have: 97
or: 96
one: 95
more: 93
first: 93
all: 93
software: 93
make: 92
iphone: 91
twitter: 91
should: 91
video: 90
social: 89
internet: 88
us: 88
mobile: 88
use: 87
has: 84
world: 80
just: 80
design: 79
business: 79
apps: 78
5: 78
source: 77
cloud: 76
into: 76
api: 75
top: 74
tech: 73
javascript: 73
like: 72
programming: 72
windows: 72
when: 71
ios: 70
live: 69
future: 69
most: 68
company: 68
startups: 67
project: 67
news: 67
game: 67


# 4. Which domains were submitted most often?

## 4.1. Code

`domain.py` contains the following code. It will (1) count the number of occurrences of each domain in the `url` column and (2) print the 100 most frequently occurring domains.

To run it, type `python domain.py` into a shell command line.

Mainly, the [`value_counts`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) and [`sort_values`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort_values.html#pandas.Series.sort_values) methods of the `pandas.Series` class will be used.

In [5]:
import pandas as pd
from read import load_data, columns
from sys import argv

def printTop100urls():
    """
    Prints 100 most often appearing domains
    in a descending order
    """

    # ensure no argument is provided in command line
    if len(argv) > 1:
        print("Usage: python domain.py")
        return

    # read in dataset
    df = load_data()

    # get domains, dropping null values
    urls = df["url"].dropna()

    # count domains and sort results by count then by domain
    url_val_c = urls.value_counts().rename_axis("url").reset_index(name="count")
    url_val_c = url_val_c.sort_values(by=["count", "url"], ascending=[False, True])

    # print 100 most often occurring domains
    for index, row in url_val_c.iloc[:100].iterrows():
        print(row["url"] + ": " + str(row["count"]))

if __name__ == "__main__":
    # print 100 most-often-appearing domains
    printTop100urls()

Usage: python domain.py


## 4.2. Demonstration

The following is equivalent to running `python domain.py` from a shell command line.

In [6]:
%run domain.py

github.com: 174
techcrunch.com: 172
youtube.com: 142
nytimes.com: 109
medium.com: 88
wired.com: 76
arstechnica.com: 73
bbc.co.uk: 53
en.wikipedia.org: 49
online.wsj.com: 41
businessinsider.com: 37
forbes.com: 36
theverge.com: 36
venturebeat.com: 36
readwriteweb.com: 35
gigaom.com: 34
mashable.com: 34
thenextweb.com: 34
bloomberg.com: 32
google.com: 32
theatlantic.com: 32
twitter.com: 31
engadget.com: 28
washingtonpost.com: 26
bit.ly: 25
news.cnet.com: 25
networkworld.com: 24
npr.org: 24
reddit.com: 24
economist.com: 23
kickstarter.com: 22
plus.google.com: 21
quora.com: 21
slate.com: 20
theguardian.com: 20
blogs.wsj.com: 19
theregister.co.uk: 19
gizmodo.com: 18
spectrum.ieee.org: 18
stackoverflow.com: 18
bbc.com: 17
businessweek.com: 17
technologyreview.com: 17
guardian.co.uk: 16
zdnet.com: 16
blogs.hbr.org: 15
facebook.com: 15
itworld.com: 15
sfgate.com: 15
techdirt.com: 15
vimeo.com: 15
bits.blogs.nytimes.com: 14
computerworld.com: 14
dailymail.co.uk: 14
huffingtonpost.com: 14
newscie

<span style="color:red">Note</span>: DataQuest also suggested counting core domains (e.g. counting "google.com" rather than "groups.google.com"). However, this has not been done, because extracting core domains is currently too tricky a task for me.
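
For reference, a rough first attempt at core-domain extraction might look like the sketch below. The helper `core_domain` and the set `TWO_LABEL_SUFFIXES` are hypothetical names introduced here, and the suffix set is deliberately tiny and incomplete. A robust solution would need a full list of public suffixes (for example the Public Suffix List), which is exactly what makes the task tricky.

```python
# Hypothetical, deliberately incomplete set of two-label public suffixes.
TWO_LABEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}

def core_domain(host):
    """Return a naive 'core' domain for a host name such as 'groups.google.com'."""
    labels = host.lower().strip(".").split(".")
    if len(labels) <= 2:
        return ".".join(labels)
    last_two = ".".join(labels[-2:])
    # keep three labels when the last two form a known public suffix
    keep = 3 if last_two in TWO_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])

for host in ["groups.google.com", "bbc.co.uk", "bits.blogs.nytimes.com"]:
    print(host + " -> " + core_domain(host))
```

Any suffix missing from the set is handled incorrectly: a host under an unlisted two-label suffix would be reduced to the suffix itself.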

# 5. When are the most articles submitted?

`times.py` contains the following code. It will (1) extract the requested unit (year, month, day or hour) from each timestamp and count its occurrences, and (2) print the results.

This can be run from a shell command line, e.g. `python times.py hour`, where `hour` can be replaced by `year`, `month` or `day`.

Time will be extracted using [`dateutil.parser.parse`](http://dateutil.readthedocs.io/en/stable/parser.html#dateutil.parser.parse).

In [7]:
from dateutil.parser import parse
from read import load_data
from sys import argv

def extractUnit(timestamp, unit):
    """
    return unit from timestamp in UTC format.
    A unit can be year, month, day or hour.
    """

    # attribute lookup instead of eval() on a user-supplied string
    return getattr(parse(timestamp), unit)


def printSubmissionTimes(unit=None):
    
    err_mes = "Usage: python times.py [unit]\n\
        \n\
        A unit can be year, month, day or hour.\n\
        \n\
        e.g. python times.py hour"

    # stop if no unit is given by user
    if len(argv) == 1:
        print(err_mes)
        return

    procs = {"year", "month", "day", "hour"}
    
    unit = argv[1] if argv[1] != "-f" else unit

    if unit not in procs:
        print(err_mes)
        return        
    
    # read in dataset
    df = load_data()

    # get the requested unit from each row of submission_time column
    h = df["submission_time"].apply(extractUnit, args=(unit, ))

    # get value counts and sort it in descending order
    h = h.value_counts().sort_values(ascending=False)

    # print series
    print("{}: Occurrences".format(unit))
    for index, value in h.items():
        print(str(index) + ": " + str(value))

# note: the guard below prevents printSubmissionTimes()
# from running automatically when another script
# imports a function from this module.
if __name__ == "__main__":
    printSubmissionTimes()

Usage: python times.py [unit]
        
        A unit can be year, month, day or hour.
        
        e.g. python times.py hour


The following is equivalent to running `python times.py hour` from a shell command line.

In [8]:
%run times.py hour

hour: Occurrences
17: 646
16: 627
15: 618
14: 602
18: 575
19: 563
20: 538
13: 531
21: 497
12: 398
23: 394
22: 386
11: 347
10: 324
7: 320
0: 317
1: 314
2: 298
9: 298
3: 296
4: 282
6: 279
5: 276
8: 274


# 6. More questions

## 6.1. What leads to the most upvotes?

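`upvotes.py` contains the following code. It will (1) compute the mean number of upvotes for each distinct value of the specified variable (headline length for `headline`, hour of day for `submission_time`, domain for `url`) and (2) display up to 100 of those values with the highest means.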

This can be run from a shell command line. E.g. `python upvotes.py headline` where `headline` can be replaced by `submission_time` or `url`.

In [9]:
from read import load_data
from times import extractUnit
from read import columns

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sys import argv

def meanUpvotesForVar(var=None):
    """
    var is a column name in dataframe:
    headline, submission_time or url.

    Prints up to 100 var entries with the highest
    mean number of upvotes, in descending order.
    """

    err_mes = "Usage: python upvotes.py [variable]\n\
        \n\
        A variable can be headline, submission_time or url.\n\
        \n\
        e.g. python upvotes.py headline"

    # stop if no variable is given by user
    if len(argv) == 1:
        print(err_mes)
        return

    var = argv[1] if argv[1] != "-f" else var

    # upvotes is a column too, but it cannot be grouped against itself
    if var not in columns or var == "upvotes":
        print(err_mes)
        return
    
    # load data
    df = load_data()

    # process var data
    # 1. headline length
    if var == "headline":
        x = df[var].apply(lambda x: len(str(x)))
        var_new = "Headline length"
    # 2. submission time (hour of day)
    elif var == "submission_time":
        x = df[var].apply(extractUnit, args=("hour", ))
        var_new = "Submission time"
    # 3. domain
    elif var == "url":
        x = df[var]
        var_new = "Domain"

    # create dataframe of upvotes and var data
    up_label = "upvotes"
    up_x = pd.concat([x, df[up_label]], axis=1)

    # average number of upvotes across var data
    up_x_piv = up_x.pivot_table(index=var, values=up_label, aggfunc="mean")

    # rename column and index of new dataframe
    up_label_new = "Mean number of upvotes"
    up_x_piv = up_x_piv.rename(columns={up_label: up_label_new})
    up_x_piv.index.name = var_new

    # show mean number of upvotes per var data entry
    pd.set_option("display.max_rows", len(up_x_piv))
    print(up_x_piv.sort_values(by=up_label_new, ascending=False).iloc[:100])

if __name__ == "__main__":
    meanUpvotesForVar()

Usage: python upvotes.py [variable]
        
        A variable can be headline, submission_time or url.
        
        e.g. python upvotes.py headline
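
The aggregation step can be illustrated on a tiny dataframe. The sketch below uses `groupby` instead of `pivot_table`, which should give the same means, and the four rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# made-up rows, for illustration only
df = pd.DataFrame({
    "headline": ["Ask HN", "Show HN: x", "Ask PG", "Hello"],
    "upvotes": [10, 4, 20, 1],
})

# group by headline length and average the upvotes in each group
headline_length = df["headline"].str.len().rename("Headline length")
mean_upvotes = (
    df.groupby(headline_length)["upvotes"]
      .mean()
      .sort_values(ascending=False)
)
print(mean_upvotes.head(100))
```

Note that `str.len()` yields a missing value for a missing headline, whereas `len(str(x))` in `upvotes.py` counts it as a three-character string, so the two approaches can differ on rows without a headline.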


## 6.2. What headline length leads to the most upvotes?

The following is equivalent to running `python upvotes.py headline` from a shell command line.

In [10]:
%run upvotes.py headline

                 Mean number of upvotes
Headline length                        
67                            19.535433
6                             19.181818
23                            18.991667
79                            18.836066
83                            18.400000
7                             18.136364
22                            17.018182
27                            16.237762
21                            15.609524
70                            14.227273
16                            14.119048
86                            13.500000
90                            13.500000
73                            13.277778
57                            12.961538
33                            12.265193
77                            12.244681
75                            12.000000
32                            11.898305
51                            11.801047
68                            11.793103
55                            11.643216
30                            10.839572


## 6.3. What submission time leads to the most upvotes?

The following is equivalent to running `python upvotes.py submission_time` from a shell command line.

In [11]:
%run upvotes.py submission_time

                 Mean number of upvotes
Submission time                        
21                            13.368209
16                            11.524721
18                            11.412174
4                             11.120567
17                            11.034056
0                             10.823344
3                             10.756757
1                             10.643312
13                            10.069680
19                             9.856128
23                             9.687817
2                              9.342282
11                             9.046110
20                             9.018587
14                             8.418605
12                             8.238693
5                              7.713768
15                             7.443366
22                             6.300518
6                              6.025090
9                              5.825503
10                             5.657407
7                              5.490625


# 7. TODO

DataQuest has also suggested tackling the following question. This is to be done later.

* How are the total numbers of upvotes changing over time?
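
One possible starting point for this question, sketched with the standard library only, is to sum the upvotes per submission year. The helper `total_upvotes_by_year` is hypothetical, and the sample rows are the five rows shown in the excerpt in section 2.2, so the totals say nothing about the full dataset.

```python
from collections import defaultdict

def total_upvotes_by_year(rows):
    """Sum upvotes per submission year.

    rows is an iterable of (submission_time, upvotes) pairs.
    """
    totals = defaultdict(int)
    for submission_time, upvotes in rows:
        # ISO 8601 timestamps start with the four-digit year
        year = int(submission_time[:4])
        totals[year] += int(upvotes)
    return dict(sorted(totals.items()))

# the five rows shown in the excerpt in section 2.2
sample = [
    ("2014-06-24T05:50:40.000Z", 1),
    ("2010-02-17T16:57:59Z", 1),
    ("2014-02-04T02:36:30Z", 1),
    ("2011-10-26T07:11:29Z", 1),
    ("2011-04-03T15:43:44Z", 67),
]
totals = total_upvotes_by_year(sample)

for year, total in totals.items():
    print(str(year) + ": " + str(total))
```

With the full dataset, the pairs could be taken from the `submission_time` and `upvotes` columns returned by `load_data`.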