# Week 3 lecture notes


## Assignment 1 review

### Including directories in paths

If you create a file in a lower directory and then want to modify, move, or delete it, you have to include the directory in the path you use to refer to it.

In [None]:
!mkdir mydirectory

In [None]:
!ls > mydirectory/myfiles.txt

In [None]:
!rm myfiles.txt

In [None]:
!rm mydirectory/myfiles.txt

In [None]:
!ls mydirectory

### ">" vs ">>"

Both ">" and ">>" redirect output from the screen to a file.  Both will create new files if none yet exists.  Only ">" will overwrite an existing file; ">>" will append to an existing file.

In [None]:
!date > datefile.txt

In [None]:
!cat datefile.txt

In [None]:
!date > datefile.txt

In [None]:
!cat datefile.txt

In [None]:
!date >> datefile.txt

In [None]:
!date >> datefile.txt

In [None]:
!cat datefile.txt
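The same distinction shows up in Python's file modes, which may help it stick: `"w"` behaves like `>` and `"a"` behaves like `>>`. This is only a sketch; the helper names and the scratch file in a temporary directory are made up for illustration.

```python
import os
import tempfile


def write_lines(path, lines, mode):
    """Write each line to path using the given open() mode ('w' or 'a')."""
    with open(path, mode) as f:
        for line in lines:
            f.write(line + "\n")


def read_lines(path):
    """Return the file's contents as a list of lines."""
    with open(path) as f:
        return f.read().splitlines()


# Hypothetical scratch file, kept out of the working directory.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "datefile.txt")

write_lines(path, ["first"], "w")   # like: echo first > datefile.txt
write_lines(path, ["second"], "w")  # like: echo second > datefile.txt (overwrites)
after_overwrite = read_lines(path)

write_lines(path, ["third"], "a")   # like: echo third >> datefile.txt (appends)
after_append = read_lines(path)

print(after_overwrite)
print(after_append)
```

Both modes create the file if it does not exist yet, just like the two redirects.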

### "." and ".."

These are shorthand names for the current working directory and the parent of the current directory.  You can use them in a variety of ways.

In [None]:
!ls .

In [None]:
!ls ..

In [None]:
%cd .

In [None]:
%cd ..

In [None]:
%cd lecture/

In [None]:
!ls ../lecture

In [None]:
!ls .././lecture

In [None]:
!ls .././lecture/../lecture/../lecture/.
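To see why those last three `ls` commands all list the same directory, here is a small sketch that simplifies each path with Python's `posixpath.normpath`. It works purely on the text of the path (it never touches the file system), so it assumes there are no symbolic links involved.

```python
import posixpath

# The three paths used in the cells above.
paths = [
    "../lecture",
    ".././lecture",
    ".././lecture/../lecture/../lecture/.",
]

# normpath drops "." components and cancels "name/.." pairs.
normalized = [posixpath.normpath(p) for p in paths]

for original, simple in zip(paths, normalized):
    print(f"{original!r} -> {simple!r}")
```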

### lower|sort|uniq or sort|lower|uniq

Order matters!  Consider this text:

In [None]:
!wget https://s3.amazonaws.com/2017-dmfa/assignment-1/siddhartha.txt

In [None]:
!grep -oE '\w{{2,}}' siddhartha.txt | grep -v '^[0-9]' | uniq -c | head

Among the set of three functions: {uniq, lower, sort} there are six orderings.  Which produce which results, and why?

 * uniq, lower, sort
 * uniq, sort, lower
 * sort, lower, uniq
 * sort, uniq, lower
 * lower, sort, uniq
 * lower, uniq, sort

In [None]:
!grep -oE '\w{{2,}}' siddhartha.txt | grep -v '^[0-9]' | uniq -c | tr '[:upper:]' '[:lower:]' | sort | head

In [None]:
!grep -oE '\w{{2,}}' siddhartha.txt | grep -v '^[0-9]' | uniq -c | sort | tr '[:upper:]' '[:lower:]' | head

In [None]:
!grep -oE '\w{{2,}}' siddhartha.txt | grep -v '^[0-9]' | sort | tr '[:upper:]' '[:lower:]' | uniq -c | head

In [None]:
!grep -oE '\w{{2,}}' siddhartha.txt | grep -v '^[0-9]' | sort | uniq -c | tr '[:upper:]' '[:lower:]' | head

In [None]:
!grep -oE '\w{{2,}}' siddhartha.txt | grep -v '^[0-9]' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | head

In [None]:
!grep -oE '\w{{2,}}' siddhartha.txt | grep -v '^[0-9]' | tr '[:upper:]' '[:lower:]' | uniq -c | sort | head 
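One way to reason about the six orderings is to remember that `uniq` only collapses *adjacent* duplicate lines. The sketch below imitates `uniq -c` with `itertools.groupby`, which has the same adjacent-only behavior. The word list is a made-up sample, and Python's `sorted` orders by code point (capitals first), which may differ from what `sort` does under your shell's locale.

```python
from itertools import groupby


def uniq_c(items):
    """Count runs of adjacent equal items, like `uniq -c`."""
    return [(len(list(group)), key) for key, group in groupby(items)]


words = ["The", "river", "the", "river", "The"]  # made-up sample

# lower | sort | uniq -c : every copy of a word ends up adjacent.
lower_sort_uniq = uniq_c(sorted(w.lower() for w in words))

# sort | lower | uniq -c : "The" and "the" were sorted apart before lowering.
sort_lower_uniq = uniq_c([w.lower() for w in sorted(words)])

# uniq -c alone: nothing is adjacent, so nothing is collapsed.
uniq_only = uniq_c(words)

print(lower_sort_uniq)
print(sort_lower_uniq)
print(uniq_only)
```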

### Reviewing grep

`grep` is a lot more powerful than what you've seen so far.  More than anything else, it's commonly used to find text within files.  For example, to find lines with "Romeo" in Romeo and Juliet:

In [None]:
!wget https://s3.amazonaws.com/2017-dmfa/week-3/romeo.txt

In [None]:
!grep Romeo romeo.txt | head

There are many, many options, such as case-insensitivity:

In [None]:
!grep -i what romeo.txt | head

Another useful one is to print line numbers for matching lines:

In [None]:
!grep -n Juliet romeo.txt | head

We can also negate certain terms - show non-matches.

In [None]:
!grep -n Juliet romeo.txt | grep -v Romeo | head

Here's an alternate version of the "one word at a time" pattern.  It splits lines into words that are at least two characters long, leaving out "I" and "a" as well as the fragments left over from contractions, like the "s" in "it's" and the "t" in "don't".

In [None]:
!cat romeo.txt | grep -oE '\w{{2,}}' | head

And one more useful tip is to match more than one thing:

In [None]:
!grep "Romeo\|Juliet" romeo.txt | head
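If it helps to see these patterns outside the shell, here is a rough Python equivalent using the `re` module. The sample lines are made up (they are not quoted from `romeo.txt`), and Python's `\w` may treat some non-ASCII characters differently than `grep` does in your locale.

```python
import re

# Made-up sample lines for illustration.
lines = [
    "Romeo, I'll see thee",
    "O Juliet!",
    "a plague",
]


def words_of(line):
    """Like `grep -oE '\\w{2,}'`: runs of two or more word characters."""
    return re.findall(r"\w{2,}", line)


def mentions_either(line):
    """Like `grep "Romeo\\|Juliet"`: True if the line contains either name."""
    return re.search(r"Romeo|Juliet", line) is not None


all_words = [words_of(line) for line in lines]
matching_lines = [line for line in lines if mentions_either(line)]

print(all_words)
print(matching_lines)
```

Note how "I'll" leaves behind only "ll": the "I" is too short and the apostrophe is not a word character.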

### Reviewing wildcards

Sometimes you need to perform a task with a set of files that share a characteristic like a file extension.  The shell lessons had examples with `.pdb` files.  This is common.

The `*` (asterisk, or just "star") is a wildcard that matches zero or more characters.

In [None]:
!ls *.txt

The `?` (question mark) is a wildcard that matches exactly one character.

In [None]:
!cp romeo.txt womeo.txt

In [None]:
!ls ?omeo.txt

In [None]:
!ls wome?.txt

The difference is subtle: `*` would have worked in place of `?` in both examples above.  But note:

In [None]:
!ls wo*.txt

In [None]:
!ls wo?.txt

See the difference?  The `*` can match more than one character; `?` only matches one. 
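You can experiment with wildcard patterns without creating any files by using Python's `fnmatch` module, which follows the same `*` and `?` rules. In this sketch, `wo.txt` and `wom.txt` are hypothetical file names added to make the difference visible.

```python
from fnmatch import fnmatchcase

# romeo.txt and womeo.txt come from the cells above;
# wo.txt and wom.txt are hypothetical extras.
names = ["romeo.txt", "womeo.txt", "wo.txt", "wom.txt"]


def matching(pattern):
    """Return the names that the wildcard pattern matches."""
    return [name for name in names if fnmatchcase(name, pattern)]


print(matching("?omeo.txt"))  # one character, then "omeo.txt"
print(matching("wo*.txt"))    # "wo", then zero or more characters
print(matching("wo?.txt"))    # "wo", then exactly one character
```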

### Reviewing csvkit

`csvkit` is a dream to work with.  It's well-documented, does a bunch of things we need to do all the time with data, and its interface is consistent across its many commands.  Please explore its documentation when you can; it does more than just what you'll see in class.

Let's look at a large dataset with csvkit.  This is trip data from Capital Bikeshare:

In [None]:
!wget https://s3.amazonaws.com/2017-dmfa/week-3/2017-Q1-cabi-trips-history-data.zip

In [None]:
!unzip 2017-Q1-cabi-trips-history-data.zip

In [None]:
!mv 2017-Q1-Trips-History-Data.csv q1.csv

What shape does the data take?

In [None]:
!csvcut -n q1.csv

What do a few records look like?

In [None]:
!head q1.csv | csvlook

Let's do that again with a cleaner view of just a few columns:

In [None]:
!head q1.csv | csvcut -c2-3,5,7 | csvlook

We can specify the same thing with the opposite set of columns and `-C` instead of `-c`:

In [None]:
!head q1.csv | csvcut -C1,4,6,8-9| csvlook

`csvgrep` combines the beauty of `grep` with csv files:

In [None]:
!csvgrep -c5 -r "8th & F St NE" q1.csv | head | csvcut -c2-3,5,7 | csvlook

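We can combine csvkit commands with everything else we've learned so far.  Keep in mind that csvkit commands pass the header row along, so `wc -l` reports one more than the number of matching trips: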

In [None]:
!csvgrep -c5 -r "8th & F St NE" q1.csv | csvcut -c2-3,5,7 | csvgrep -c1 -r "3/31/2017" | wc -l

In [None]:
!csvgrep -c7 -r "Eastern Market" q1.csv | csvcut -c2-3,5,7 | csvgrep -c1 -r "1/1/2017" | wc -l
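For comparison, here is a sketch of the same kind of count using Python's `csv` module. The sample rows (including the station numbers and bike numbers) are fabricated for illustration, and the header is an assumption about the file's layout rather than something copied from `q1.csv`. Columns are numbered from 1 to match csvkit.

```python
import csv
import io
import re

# Fabricated sample data; the column layout is an assumption.
SAMPLE = """Duration,Start date,End date,Start station number,Start station,End station number,End station,Bike number,Member type
100,3/31/2017 23:59,4/1/2017 0:10,31631,8th & F St NE,31613,Eastern Market Metro,W00001,Registered
200,3/30/2017 8:00,3/30/2017 8:20,31631,8th & F St NE,31200,Other Station,W00002,Casual
300,3/31/2017 9:00,3/31/2017 9:30,31200,Other Station,31631,8th & F St NE,W00003,Registered
"""


def count_trips(csv_file, station_col, station_pattern, date_col, date_pattern):
    """Count data rows where both columns match their regular expressions."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row so it is not counted
    count = 0
    for row in reader:
        station_ok = re.search(station_pattern, row[station_col - 1])
        date_ok = re.search(date_pattern, row[date_col - 1])
        if station_ok and date_ok:
            count += 1
    return count


# Trips starting at "8th & F St NE" (column 5) on 3/31/2017 (column 2).
n = count_trips(io.StringIO(SAMPLE), 5, r"8th & F St NE", 2, r"3/31/2017")
print(n)
```

Unlike the `wc -l` pipeline, this count skips the header row explicitly.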

## Writing Python filters

Starting with the `simplefilter.py` filter, let's write some of our own.

In [None]:
!wget https://s3.amazonaws.com/2017-dmfa/week-3/simplefilter.py

In [None]:
!chmod +x simplefilter.py

In [None]:
!head romeo.txt | ./simplefilter.py

At first, our filter does nothing.  Let's use it as a template to lower-case text.  We'll copy it to a better-named version to be clear what it's for, and we'll update the comments within the script to match.

In [None]:
!cp simplefilter.py lower.py

In [None]:
!chmod +x lower.py

In [None]:
!head romeo.txt | ./lower.py
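Here is a minimal sketch of what a lower-casing filter could look like. It is not the actual contents of `simplefilter.py` or `lower.py`; the function names are made up, and the demo uses in-memory streams so it runs anywhere. A real script would start with a `#!/usr/bin/env python3` line and call `main()`.

```python
import io
import sys


def lower_filter(instream, outstream):
    """Copy instream to outstream one line at a time, lower-casing each line."""
    for line in instream:
        outstream.write(line.lower())


def main():
    """What the script would do when used in a pipeline: stdin in, stdout out."""
    lower_filter(sys.stdin, sys.stdout)


# Demo with in-memory streams standing in for stdin and stdout.
demo_out = io.StringIO()
lower_filter(io.StringIO("ROMEO AND JULIET\nAct I\n"), demo_out)
print(demo_out.getvalue(), end="")
```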

What if we wanted to eliminate some words from our counting?
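One possible answer, as a sketch: a filter that expects one word per line (like the output of the `grep -oE` step) and drops any word found in a stop list. The `STOP_WORDS` set and the function name here are hypothetical; you would choose your own list.

```python
import io

# Hypothetical stop list; pick whatever words you want to ignore.
STOP_WORDS = {"the", "and", "of", "to"}


def drop_stop_words(instream, outstream, stop_words=STOP_WORDS):
    """Copy one-word-per-line input to outstream, skipping stop words."""
    for line in instream:
        word = line.strip()
        if word and word.lower() not in stop_words:
            outstream.write(word + "\n")


# Demo with in-memory streams; a script would pass sys.stdin and sys.stdout.
filtered = io.StringIO()
drop_stop_words(io.StringIO("The\nriver\nand\nthe\nferryman\n"), filtered)
print(filtered.getvalue(), end="")
```

Placed between the lower-casing step and `sort`, a filter like this would keep those words out of the counts entirely.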

## Working with GNU Parallel

GNU Parallel is an easy-to-use but very powerful tool with a lot of options.  You can use it to process a lot of data easily, and it can also make a big mess in a hurry.  For more examples, see the [tutorial page](https://www.gnu.org/software/parallel/parallel_tutorial.html).

Let's start with something we've seen before:  splitting a text file up and counting its unique words.  But let's use a lot of files to do it.

**Note**: `parallel` sometimes won't work through the notebook due to an interactive request it makes regarding citation.  If this happens to you, use the terminal directly instead of the notebook if you want to try it out.

In [None]:
!wget https://s3.amazonaws.com/2017-dmfa/week-3/texts.zip

In [None]:
!unzip -l texts.zip

In [None]:
!mkdir many-texts

In [None]:
%cd many-texts/

In [None]:
!unzip ../texts.zip

In [None]:
!wc *.txt

That's 668,517 lines and 5,607,822 words from over 100 texts.

We can split them up into word counts one at a time like we did in exercise-02:

In [None]:
!grep -oE '\w{{2,}}' *.txt \
    | tr '[:upper:]' '[:lower:]' \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -10

Note that I've wrapped lines around by using the `\` character.  To me, this looks easier to read - you can see each step of the pipeline one at a time.  The `\` only means "this shell line continues on the next line".  The `|` still acts as the pipe.

### Hold on there

That's a lot more (100x) data, and we have to be thoughtful about how we approach it.  

Before going any further, let's look at two specific texts first, Romeo and Juliet and Little Women.

In [None]:
!wget https://s3.amazonaws.com/2017-dmfa/week-3/romeo.txt

In [None]:
!wget https://s3.amazonaws.com/2017-dmfa/week-3/women.txt

In [None]:
!grep -oE '\w{{2,}}' romeo.txt \
    | tr '[:upper:]' '[:lower:]' \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -10

In [None]:
!grep -oE '\w{{2,}}' women.txt \
    | tr '[:upper:]' '[:lower:]' \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -10

It looks like Little Women is much longer, which makes sense - it's a novel, not a play.  More text!

To compare the two directly:

In [None]:
!wc -l romeo.txt women.txt

We can run through both files at once by giving both file names to `grep`:

In [None]:
!grep -oE '\w{{2,}}' romeo.txt women.txt \
    | tr '[:upper:]' '[:lower:]' \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -10

Do those numbers look right?  

Let's take a closer look at what's going on.

In [None]:
!grep -oE '\w{{2,}}' romeo.txt women.txt \
    | tr '[:upper:]' '[:lower:]' \
    | sort \
    | uniq -c \
    | grep "and" \
    | tail -10

Aha!  When given more than one file, `grep` not-so-helpfully prefixes each match with the name of the file it came from, so `romeo.txt:and` and `women.txt:and` are counted as different words.  That's why the counts are off.

The `-h` option tells `grep` not to do that.  But let's try something completely different.
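To make the effect of those file name prefixes concrete, here is a small sketch that counts a few made-up lines of `grep` output twice: once as-is, and once with the prefix stripped.

```python
from collections import Counter

# Made-up lines in the "filename:match" form grep prints for multiple files.
grep_output = [
    "romeo.txt:and",
    "romeo.txt:and",
    "women.txt:and",
    "women.txt:the",
]

# Counting the raw lines treats each file's "and" as a different word.
with_prefix = Counter(grep_output)

# Stripping everything up to the first colon merges them again.
without_prefix = Counter(line.split(":", 1)[1] for line in grep_output)

print(with_prefix)
print(without_prefix)
```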

First, let's separate out the **data parallel** piece.  Which part of this pipeline is completely data parallel?

In [None]:
!rm -f all-words.txt

In [None]:
!ls women.txt romeo.txt \
    | parallel --will-cite -j+0 "grep -oE '\w{2,}' {} | tr '[:upper:]' '[:lower:]' >> all-words.txt"

In [None]:
!sort all-words.txt \
    | uniq -c \
    | sort -rn \
    | head -10

See what we did there?  We ran the data parallel step on each file in parallel, then brought the results back together for the rest of the pipeline.

Okay, now let's try it on that bigger dataset.

In [None]:
!wc *.txt

In [None]:
!rm -f all-words.txt

In [None]:
!ls *.txt \
    | parallel --will-cite --eta -j+0 "grep -oE '\w{2,}' {} | tr '[:upper:]' '[:lower:]' >> all-words.txt"

In [None]:
!sort all-words.txt \
    | uniq -c \
    | sort -rn \
    | head -10

Let's say that in words:

* Show progress and an estimated time to completion (`--eta`);
* Get a list of all the `*.txt` files in `many-texts/`;
* In parallel, extract their words, lower case them, and append them to `many-texts/all-words.txt`;
* Sort, count the unique words, rank them in reverse numeric order, and keep the top 10 most frequently occurring words.

More precisely on that parallel step:

* Among all those files listed;
* Whenever there is an available core for processing (`-j+0` runs one job per core), give it one file to process through the pipeline;
* When each job is done, the core is available for processing again;
* Continue until there are no jobs waiting.

That's data parallelism.
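The same idea can be sketched in Python: hand each file to a worker, count its words independently, then merge the results at the end. This is a different technique from the shell version (it merges per-file counts instead of appending to one shared file), and it is only a sketch: the function names and the two tiny sample files are made up. It uses a thread pool so it runs anywhere; passing `ProcessPoolExecutor` as `pool_cls` from a script is what would spread the work across cores.

```python
import os
import re
import tempfile
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

WORD = re.compile(r"\w{2,}")


def count_file(path):
    """The data parallel step: count lower-cased words in one file."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            counts.update(word.lower() for word in WORD.findall(line))
    return counts


def top_words(paths, n=10, pool_cls=ThreadPoolExecutor):
    """Count each file in a worker pool, then merge and rank the counts."""
    total = Counter()
    with pool_cls() as pool:
        # Each file is an independent job; results come back in input order.
        for counts in pool.map(count_file, paths):
            total.update(counts)
    return total.most_common(n)


# Two made-up sample files in a temporary directory.
workdir = tempfile.mkdtemp()
samples = {"a.txt": "The cat and the hat\n", "b.txt": "The end, and THE END\n"}
paths = []
for name, text in samples.items():
    path = os.path.join(workdir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    paths.append(path)

top = top_words(paths, n=3)
print(top)
```

Only the merge at the end needs to see every file's results; everything before it can run independently per file.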

### Questions for you

How much faster or slower would we go if we did each file one at a time?

What's the bottleneck here?