This notebook refactors
[05-pipeline/tf-05.py](https://github.com/crista/exercises-in-programming-style/blob/master/05-pipeline/tf-05.py),
starting with the version from
[d521abd 2016-05-21 08:14:41 -0700 Merge pull request #29 from bmistree/master (HEAD, origin/master, origin/HEAD, master, cohpy-20160829) [crista]](https://github.com/crista/exercises-in-programming-style/blob/d521abd5d7aac14af19aa7794aca9ee23c0f8cc5/05-pipeline/tf-05.py).

The original code runs only in Python 2.

The license in the following cell covers only this notebook
and is in addition to the LICENSE file in the parent directory
of this notebook.

The MIT License (MIT)

Copyright (c) 2016 James Prior

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


First we run the program and save its output in the file good_output.

In [1]:
!python2 tf-05.py ../pride-and-prejudice.txt | tee good_output

mr  -  786
elizabeth  -  635
very  -  488
darcy  -  418
such  -  395
mrs  -  343
much  -  329
more  -  327
bennet  -  323
bingley  -  306
jane  -  295
miss  -  283
one  -  275
know  -  239
before  -  229
herself  -  227
though  -  226
well  -  224
never  -  220
sister  -  218
soon  -  216
think  -  211
now  -  209
time  -  203
good  -  201


When people create new cells interactively,
they know exactly what the changes are because they just made them.
But when one looks at the cells later,
how does one know what all the little changes were?
It would be nice to see the differences
between one cell and another as we refactor.
So I use some cell magic to show the difference
between a cell and the previously executed cell.
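
As a rough illustration of the idea (not the mechanism this notebook uses, which shells out to the `diff` command), the standard library's `difflib` can show what changed between two versions of a cell. The names `old_source`, `new_source`, and `cell_diff` are hypothetical, and the two sources are made-up stand-ins for consecutive cells.

```python
import difflib

# Hypothetical stand-ins for the source of two consecutively executed cells.
old_source = "words = text.split()\nprint(words[2:25])\n"
new_source = "words = text.split()\nprint(words[0:25])\n"


def cell_diff(old, new):
    """Return a unified diff of two cell sources as a list of lines."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="old.py", tofile="new.py", lineterm=""))


# Print the diff; identical sources would produce no lines at all.
for line in cell_diff(old_source, new_source):
    print(line)
```

Lines starting with `-` come from the older cell and lines starting with `+` from the newer one.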

After that, any difference between the expected output
and the actual output is shown.

One complication is that since my trickery
executes cells outside Jupyter notebook,
the cells do not have access to variables
from Jupyter notebook and vice versa.
That is why the expected output is kept in the file good_output:
when cells are executed externally,
their output is compared with that file.

One nice thing about running the cells outside Jupyter
is that we know each cell has everything it needs
and does not rely on some result from a previous cell.
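
The isolation described above can be sketched in Python as well. This is only an illustration under stated assumptions: the helper name `run_isolated` and the sample snippets are hypothetical, and the notebook itself uses a bash script rather than `subprocess`. Each snippet is written to a temporary file and run in a fresh interpreter, so a name defined by one snippet is not visible to the next.

```python
import os
import subprocess
import sys
import tempfile


def run_isolated(source):
    """Run source in a fresh Python process; return (exit code, stdout)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            universal_newlines=True)
        return result.returncode, result.stdout
    finally:
        os.remove(path)


# The first snippet defines a name; the second one cannot see it.
first_code, first_out = run_isolated("x = 41\nprint(x + 1)\n")
second_code, second_out = run_isolated("print(x)\n")

expected = "42\n"  # plays the role of the saved good_output file
print("GOOD" if first_out == expected else "ERROR")
print("second snippet failed:", second_code != 0)
```

The second snippet fails with a `NameError` in its own process, which is the point: nothing leaks from one run to the next.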

---
Create the script that the %%script magic will execute to show
differences between cells and differences between the actual and expected output.
The script is written to the first directory in PATH
so that %%script can find it by name.

In [2]:
%%script bash

# As we refactor, it would be nice to see the difference between
# one cell and the previously executed cell.
# This script creates a shell script that
# does that when a cell begins with
# %%script diff_python
#
# To disable the diff command,
# put a : and a space in front of it, i.e.,
#     : diff old.py new.py
#
# meld yields a beautiful diff,
# but pops up a window for each cell executed.

program_name="${PATH%%:*}/diff_python"

cat >"$program_name" <<EOF
#!/usr/bin/env bash
cat >new.py
chmod +x new.py
if [ -e old.py ]; then
    diff old.py new.py
fi
time ./new.py ../pride-and-prejudice.txt >new_output
echo
if cmp -s new_output good_output; then
    echo GOOD: the output is good
else
    echo ERROR: new_output is different from good_output
    # md5sum good_output new_output
    diff good_output new_output
fi
mv new.py old.py
EOF
rm -f old.py
chmod +x "$program_name"

From now on,
each cell will start with the %%script diff_python magic.
The original code is repeated below with three changes:
the %%script diff_python magic is added at the beginning,
#!/usr/bin/env python is changed to #!/usr/bin/env python2,
and the final slice is changed from [0:25] to [2:25]
to deliberately cause a bug for the cmp to catch.
Because there is no old.py yet, this cell shows no code differences;
it only saves the code as the baseline for the next cell's diff.

In [3]:
%%script diff_python
#!/usr/bin/env python2
import sys, re, operator, string

#
# The functions
#
def read_file(path_to_file):
    """
    Takes a path to a file and returns the entire
    contents of the file as a string
    """
    with open(path_to_file) as f:
        data = f.read()
    return data

def filter_chars_and_normalize(str_data):
    """
    Takes a string and returns a copy with all nonalphanumeric
    chars replaced by white space
    """
    pattern = re.compile('[\W_]+')
    return pattern.sub(' ', str_data).lower()

def scan(str_data):
    """
    Takes a string and scans for words, returning
    a list of words.
    """
    return str_data.split()

def remove_stop_words(word_list):
    """
    Takes a list of words and returns a copy with all stop
    words removed
    """
    with open('../stop_words.txt') as f:
        stop_words = f.read().split(',')
    # add single-letter words
    stop_words.extend(list(string.ascii_lowercase))
    return [w for w in word_list if not w in stop_words]

def frequencies(word_list):
    """
    Takes a list of words and returns a dictionary associating
    words with frequencies of occurrence
    """
    word_freqs = {}
    for w in word_list:
        if w in word_freqs:
            word_freqs[w] += 1
        else:
            word_freqs[w] = 1
    return word_freqs

def sort(word_freq):
    """
    Takes a dictionary of words and their frequencies
    and returns a list of pairs where the entries are
    sorted by frequency
    """
    return sorted(word_freq.iteritems(), key=operator.itemgetter(1), reverse=True)

def print_all(word_freqs):
    """
    Takes a list of pairs where the entries are sorted by frequency and print them recursively.
    """
    if(len(word_freqs) > 0):
        print word_freqs[0][0], ' - ', word_freqs[0][1]
        print_all(word_freqs[1:]);

#
# The main function
#
print_all(sort(frequencies(remove_stop_words(scan(filter_chars_and_normalize(read_file(sys.argv[1]))))))[2:25])



ERROR: new_output is different from good_output
1,2d0
< mr  -  786
< elizabeth  -  635



real	0m0.351s
user	0m0.324s
sys	0m0.024s


diff_python correctly detected the change in output,
so we know it works. 
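
The deliberate bug was the slice `[2:25]` in the last line of the cell, which skips the first two entries of the sorted list. A small stand-alone illustration, using the first five rows of the saved output as sample data (the variable names are hypothetical):

```python
# First five (word, count) pairs from the saved output, used as sample data.
pairs = [("mr", 786), ("elizabeth", 635), ("very", 488),
         ("darcy", 418), ("such", 395)]

good = pairs[0:25]   # slicing past the end is fine; all five pairs are kept
buggy = pairs[2:25]  # starts at index 2, so the first two pairs are lost

# The pairs present in the good slice but absent from the buggy one.
missing = [p for p in good if p not in buggy]
print(missing)
```

The missing pairs correspond to the two lines that diff reported as deleted.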

Next we undo that change so the output is good again.

In [4]:
%%script diff_python
#!/usr/bin/env python2
import sys, re, operator, string

#
# The functions
#
def read_file(path_to_file):
    """
    Takes a path to a file and returns the entire
    contents of the file as a string
    """
    with open(path_to_file) as f:
        data = f.read()
    return data

def filter_chars_and_normalize(str_data):
    """
    Takes a string and returns a copy with all nonalphanumeric
    chars replaced by white space
    """
    pattern = re.compile('[\W_]+')
    return pattern.sub(' ', str_data).lower()

def scan(str_data):
    """
    Takes a string and scans for words, returning
    a list of words.
    """
    return str_data.split()

def remove_stop_words(word_list):
    """
    Takes a list of words and returns a copy with all stop
    words removed
    """
    with open('../stop_words.txt') as f:
        stop_words = f.read().split(',')
    # add single-letter words
    stop_words.extend(list(string.ascii_lowercase))
    return [w for w in word_list if not w in stop_words]

def frequencies(word_list):
    """
    Takes a list of words and returns a dictionary associating
    words with frequencies of occurrence
    """
    word_freqs = {}
    for w in word_list:
        if w in word_freqs:
            word_freqs[w] += 1
        else:
            word_freqs[w] = 1
    return word_freqs

def sort(word_freq):
    """
    Takes a dictionary of words and their frequencies
    and returns a list of pairs where the entries are
    sorted by frequency
    """
    return sorted(word_freq.iteritems(), key=operator.itemgetter(1), reverse=True)

def print_all(word_freqs):
    """
    Takes a list of pairs where the entries are sorted by frequency and print them recursively.
    """
    if(len(word_freqs) > 0):
        print word_freqs[0][0], ' - ', word_freqs[0][1]
        print_all(word_freqs[1:]);

#
# The main function
#
print_all(sort(frequencies(remove_stop_words(scan(filter_chars_and_normalize(read_file(sys.argv[1]))))))[0:25])


74c74
< print_all(sort(frequencies(remove_stop_words(scan(filter_chars_and_normalize(read_file(sys.argv[1]))))))[2:25])
---
> print_all(sort(frequencies(remove_stop_words(scan(filter_chars_and_normalize(read_file(sys.argv[1]))))))[0:25])

GOOD: the output is good



real	0m0.378s
user	0m0.352s
sys	0m0.012s


---
Now we start refactoring, one thing at a time.

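Python 2 is [scheduled to retire in 2020](https://pythonclock.org/),
so let's port the program to Python 3.
The first step is to change the shebang line from python2 to python
(which runs Python 3 on this machine) and see what breaks.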

In [5]:
%%script diff_python
#!/usr/bin/env python
import sys, re, operator, string

#
# The functions
#
def read_file(path_to_file):
    """
    Takes a path to a file and returns the entire
    contents of the file as a string
    """
    with open(path_to_file) as f:
        data = f.read()
    return data

def filter_chars_and_normalize(str_data):
    """
    Takes a string and returns a copy with all nonalphanumeric
    chars replaced by white space
    """
    pattern = re.compile('[\W_]+')
    return pattern.sub(' ', str_data).lower()

def scan(str_data):
    """
    Takes a string and scans for words, returning
    a list of words.
    """
    return str_data.split()

def remove_stop_words(word_list):
    """
    Takes a list of words and returns a copy with all stop
    words removed
    """
    with open('../stop_words.txt') as f:
        stop_words = f.read().split(',')
    # add single-letter words
    stop_words.extend(list(string.ascii_lowercase))
    return [w for w in word_list if not w in stop_words]

def frequencies(word_list):
    """
    Takes a list of words and returns a dictionary associating
    words with frequencies of occurrence
    """
    word_freqs = {}
    for w in word_list:
        if w in word_freqs:
            word_freqs[w] += 1
        else:
            word_freqs[w] = 1
    return word_freqs

def sort(word_freq):
    """
    Takes a dictionary of words and their frequencies
    and returns a list of pairs where the entries are
    sorted by frequency
    """
    return sorted(word_freq.iteritems(), key=operator.itemgetter(1), reverse=True)

def print_all(word_freqs):
    """
    Takes a list of pairs where the entries are sorted by frequency and print them recursively.
    """
    if(len(word_freqs) > 0):
        print word_freqs[0][0], ' - ', word_freqs[0][1]
        print_all(word_freqs[1:]);

#
# The main function
#
print_all(sort(frequencies(remove_stop_words(scan(filter_chars_and_normalize(read_file(sys.argv[1]))))))[0:25])


1c1
< #!/usr/bin/env python2
---
> #!/usr/bin/env python

ERROR: new_output is different from good_output
1,25d0
< mr  -  786
< elizabeth  -  635
< very  -  488
< darcy  -  418
< such  -  395
< mrs  -  343
< much  -  329
< more  -  327
< bennet  -  323
< bingley  -  306
< jane  -  295
< miss  -  283
< one  -  275
< know  -  239
< before  -  229
< herself  -  227
< though  -  226
< well  -  224
< never  -  220
< sister  -  218
< soon  -  216
< think  -  211
< now  -  209
< time  -  203
< good  -  201


  File "./new.py", line 68
    print word_freqs[0][0], ' - ', word_freqs[0][1]
                   ^
SyntaxError: Missing parentheses in call to 'print'

real	0m0.033s
user	0m0.024s
sys	0m0.004s
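
The traceback points at the Python 2 `print` statement. As a hedged sketch of where the port is heading (not necessarily the exact change made next in this notebook), the Python 3 spellings of the two Python 2-only constructs in the cell look like this. The sample dictionary is made up from the first rows of the expected output.

```python
import operator

# Sample data taken from the first rows of the expected output.
word_freqs = {"elizabeth": 635, "mr": 786, "very": 488}

# Python 2: word_freqs.iteritems()  ->  Python 3: word_freqs.items()
pairs = sorted(word_freqs.items(), key=operator.itemgetter(1), reverse=True)

# Python 2: print word, ' - ', count  ->  Python 3: print(word, ' - ', count)
# Both separate the arguments with a single space, so the format is unchanged.
for word, count in pairs:
    print(word, ' - ', count)
```

Because `print` separates its arguments the same way in both versions, the output should still match good_output once the port is complete.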
