# Week 3 lecture notes


## Exercise 2 review - common mistakes

### Including directories in paths

If you create a file in a subdirectory and then want to modify, move, or delete it, you have to include the directory in the path you use to refer to it.

In [None]:
!mkdir mydirectory

In [None]:
!ls > mydirectory/myfiles.txt

In [None]:
!rm myfiles.txt

In [None]:
!rm mydirectory/myfiles.txt

In [None]:
!ls mydirectory

### ">" vs ">>"

Both ">" and ">>" redirect output from the screen to a file.  Both will create new files if none yet exists.  Only ">" will overwrite an existing file; ">>" will append to an existing file.

In [None]:
!date > datefile.txt

In [None]:
!cat datefile.txt

In [None]:
!date > datefile.txt

In [None]:
!cat datefile.txt

In [None]:
!date >> datefile.txt

In [None]:
!date >> datefile.txt

In [None]:
!cat datefile.txt

### lower|sort|uniq or sort|lower|uniq

Order matters!  Consider the text from exercise-02.

In [None]:
!wget https://github.com/gwsb-istm-6212-fall-2016/syllabus-and-schedule/raw/master/exercises/pg2500.txt

In [None]:
!grep -oE '\w{{2,}}' pg2500.txt | grep -v '^[0-9]' | uniq -c | head

The three steps {uniq, lower, sort} can be arranged in six orderings.  Which produce correct results, and why?

 * uniq, lower, sort
 * uniq, sort, lower
 * sort, lower, uniq
 * sort, uniq, lower
 * lower, sort, uniq
 * lower, uniq, sort

In [None]:
!grep -oE '\w{{2,}}' pg2500.txt | grep -v '^[0-9]' | uniq -c | tr '[:upper:]' '[:lower:]' | sort | head

In [None]:
!grep -oE '\w{{2,}}' pg2500.txt | grep -v '^[0-9]' | uniq -c | sort | tr '[:upper:]' '[:lower:]' | head

In [None]:
!grep -oE '\w{{2,}}' pg2500.txt | grep -v '^[0-9]' | sort | tr '[:upper:]' '[:lower:]' | uniq -c | head

In [None]:
!grep -oE '\w{{2,}}' pg2500.txt | grep -v '^[0-9]' | sort | uniq -c | tr '[:upper:]' '[:lower:]' | head

In [None]:
!grep -oE '\w{{2,}}' pg2500.txt | grep -v '^[0-9]' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | head

In [None]:
!grep -oE '\w{{2,}}' pg2500.txt | grep -v '^[0-9]' | tr '[:upper:]' '[:lower:]' | uniq -c | sort | head 
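
A tiny input makes the answer easier to check by hand.  The sketch below uses a made-up five-word list (not the exercise text) and sets `LC_ALL=C` so that `sort` orders by byte value on any machine; both are assumptions for illustration, and the variable names are hypothetical.

```shell
# Hypothetical five-word input; "The" and "the" should end up as one word.
export LC_ALL=C   # assumption: fixed locale so sort order is the same everywhere
words() { printf '%s\n' The cat the dog The; }

# lower, sort, uniq: fold case, bring duplicates together, then count them.
lower_sort_uniq=$(words | tr '[:upper:]' '[:lower:]' | sort | uniq -c)

# sort, uniq, lower: counting happens before case is folded, so "The" and
# "the" are counted separately and only look alike afterwards.
sort_uniq_lower=$(words | sort | uniq -c | tr '[:upper:]' '[:lower:]')

# lower, uniq, sort: uniq only merges adjacent lines, so duplicates that
# are not yet next to each other survive as separate entries.
lower_uniq_sort=$(words | tr '[:upper:]' '[:lower:]' | uniq -c | sort)

echo "lower | sort | uniq:"; echo "$lower_sort_uniq"
echo "sort | uniq | lower:"; echo "$sort_uniq_lower"
echo "lower | uniq | sort:"; echo "$lower_uniq_sort"
```

Of these three, only the first reports `the` on a single line with a count of 3.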

### More about grep

`grep` is a lot more powerful than what you've seen so far.  More than anything else, it's commonly used to find text within files.  For example, to find lines with "Romeo" in Romeo and Juliet:

In [None]:
!grep Romeo romeo.txt | head

There are many, many options, such as case-insensitivity:

In [None]:
!grep -i what romeo.txt | head

Another useful one is to print line numbers for matching lines:

In [None]:
!grep -n Juliet romeo.txt | head

We can also negate certain terms - show non-matches.

In [None]:
!grep -n Juliet romeo.txt | grep -v Romeo | head

And one more useful tip is to match more than one thing:

In [None]:
!grep "Romeo\|Juliet" romeo.txt | head
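
The backslash in `\|` is needed because plain `grep` uses basic regular expressions.  With `-E` (extended regular expressions, the same flag used in the word-splitting pipeline), the alternation can be written as a bare `|`.  A minimal sketch on a made-up three-line file (the file name and contents are hypothetical):

```shell
# Hypothetical sample file in a scratch directory, so nothing real is touched.
scratch=$(mktemp -d)
printf '%s\n' 'Romeo enters' 'Juliet appears' 'Nurse speaks' > "$scratch/sample.txt"

# -E switches to extended regular expressions, where | means "or"
# without needing a backslash.
either=$(grep -E 'Romeo|Juliet' "$scratch/sample.txt")
echo "$either"

# -c counts the matching lines instead of printing them.
count=$(grep -cE 'Romeo|Juliet' "$scratch/sample.txt")
echo "$count"

rm -r "$scratch"
```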

### Wildcards with "*"

Sometimes you need to perform a task with a set of files that share a characteristic like a file extension.  The shell lessons had examples with `.pdb` files.  This is common.

The `*` (asterisk, or just "star") is a wildcard, which matches zero-to-many characters.

In [None]:
!ls *.txt

The `?` (question mark) is a wildcard that matches exactly one character.

In [None]:
!cp romeo.txt womeo.txt

In [None]:
!ls ?omeo.txt

In [None]:
!ls wome?.txt

The difference is subtle - `*` and `?` would have worked interchangeably in the examples above.  But note:

In [None]:
!ls wo*.txt

In [None]:
!ls wo?.txt

See the difference?  The `*` can match more than one character; `?` only matches one. 
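
It can help to remember that the shell, not `ls`, expands the wildcard: the pattern is replaced by the list of matching file names before the command runs.  A small sketch in a scratch directory (the file names are made up for illustration):

```shell
# Work in a throwaway directory with three hypothetical files.
scratch=$(mktemp -d)
cd "$scratch"
touch wo.txt wom.txt womeo.txt

# echo just prints its arguments, so this shows what the shell
# expanded each pattern to before the command ran.
star=$(echo wo*.txt)    # * matches zero or more characters
one=$(echo wo?.txt)     # ? matches exactly one character
echo "wo*.txt -> $star"
echo "wo?.txt -> $one"

# Return to the previous directory and clean up.
cd - > /dev/null
rm -r "$scratch"
```

Here `wo*.txt` expands to all three names, while `wo?.txt` expands only to `wom.txt`.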

## Writing Python filters

Starting with the `simplefilter.py` filter, let's write some of our own.

In [None]:
!chmod +x simplefilter.py

In [None]:
!head pg2500.txt | ./simplefilter.py

In [None]:
!cp simplefilter.py lower.py

In [None]:
!head pg2500.txt | ./lower.py

## Working with GNU Parallel

GNU Parallel is an easy to use but very powerful tool with a lot of options.  You can use it to process a lot of data easily and it can also make a big mess in a hurry.  For more examples, see the [tutorial page](https://www.gnu.org/software/parallel/parallel_tutorial.html).

Let's start with something we've seen before:  splitting a text file up and counting its unique words.

In [None]:
!wc *.txt

That's 25,875 lines and 218,062 words in the texts of Romeo and Juliet and Little Women.

We can split them up into word counts one at a time like we did in exercise-02:

In [None]:
!grep -oE '\w{{2,}}' romeo.txt \
    | tr '[:upper:]' '[:lower:]' \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -10

Note that I've wrapped lines around by using the `\` character.  To me, this looks easier to read - you can see each step of the pipeline one at a time.  The `\` only means "this shell line continues on the next line".  The `|` still acts as the pipe.

Let's look at a second book, Little Women.  We'll add `time` to get a sense of how long it takes.

In [None]:
!time grep -oE '\w{{2,}}' women.txt \
    | tr '[:upper:]' '[:lower:]' \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -10

It looks like Little Women is much longer, which makes sense - it's a novel, not a play.  More text!

To compare the two directly:

In [None]:
!wc *.txt

We can run through both files at once by giving both file names to `grep`:

In [None]:
!time grep -oE '\w{{2,}}' romeo.txt women.txt \
    | tr '[:upper:]' '[:lower:]' \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -10

Do those numbers look right?  

Let's take a closer look at what's going on.

In [None]:
!time grep -oE '\w{{2,}}' romeo.txt women.txt \
    | tr '[:upper:]' '[:lower:]' \
    | sort \
    | uniq -c \
    | grep "and" \
    | tail -10

Aha!  When given more than one file, `grep` not-so-helpfully prefixes each match with the name of the file it came from, so `romeo.txt:and` and `women.txt:and` are counted as different words.  That's why the counts are off.

There's probably an option to tell `grep` not to do that.  But let's try something completely different.
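
As an aside, that option is `-h` (long form `--no-filename` in GNU grep), which suppresses the file-name prefix that `grep` adds when it searches more than one file.  A minimal sketch with two made-up files (names and contents are hypothetical):

```shell
# Two hypothetical one-line files in a scratch directory.
scratch=$(mktemp -d)
echo 'and so' > "$scratch/a.txt"
echo 'and then' > "$scratch/b.txt"

# With two files, grep prefixes each match with the file it came from,
# so the same word from different files is counted separately.
with_names=$(grep -oE '\w{2,}' "$scratch/a.txt" "$scratch/b.txt" | sort | uniq -c)

# -h drops the prefix, so identical words are counted together.
without_names=$(grep -ohE '\w{2,}' "$scratch/a.txt" "$scratch/b.txt" | sort | uniq -c)

echo "$with_names"
echo "$without_names"
rm -r "$scratch"
```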

First, let's break out the **data parallel** piece.  Which part of this pipeline is completely data parallel?

In [None]:
!time ls *.txt \
    | parallel -j+0 "grep -oE '\w{2,}' {} | tr '[:upper:]' '[:lower:]' >> all-words.txt"

In [None]:
!time sort all-words.txt \
    | uniq -c \
    | sort -rn \
    | head -10

See what we did there?  We ran the data-parallel step in parallel, then brought the results back together for the rest of the pipeline.

Let's try it on a much bigger dataset.  (Note that we're unzipping into a new directory with `unzip -d`.)

In [None]:
!unzip -d many-texts texts.zip

In [None]:
!ls many-texts | wc -l

In [None]:
!wc many-texts/*.txt

In [None]:
!time ls many-texts/*.txt \
    | parallel --eta -j+0 "grep -oE '\w{2,}' {} | tr '[:upper:]' '[:lower:]' >> many-texts/all-words.txt"

In [None]:
!time sort many-texts/all-words.txt \
    | uniq -c \
    | sort -rn \
    | head -10

Let's say that in words:

* Time this;
* Get a list of all the `*.txt` files in `many-texts/`;
* In parallel, extract their words, lower case them, and append them to `many-texts/all-words.txt`;
* Sort, find unique words, and get a reverse numeric rank of the top 10 most frequently occurring words.

More precisely on that parallel step:

* Among all those files listed;
* Whenever there is an available core for processing, give it one file to process through the pipeline;
* When each job is done, the core is available for processing again;
* Continue until there are no jobs waiting.

That's data parallelism.
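
For comparison, the one-file-at-a-time version of the same step is an ordinary `for` loop: each file goes through the identical pipeline, but the next file does not start until the previous one finishes.  A sketch on two made-up files in a scratch directory (the names and contents are hypothetical; prefixing the loop with `time` on the real texts would give a baseline to compare against):

```shell
# Two hypothetical input files in a scratch directory.
scratch=$(mktemp -d)
echo 'The Cat sat' > "$scratch/one.txt"
echo 'the dog SAT' > "$scratch/two.txt"

# Keep the output outside the input directory so the *.txt glob
# never picks it up as an input.
out=$(mktemp)

# Sequential: one file per iteration, appended to the shared word list.
for f in "$scratch"/*.txt; do
    grep -oE '\w{2,}' "$f" | tr '[:upper:]' '[:lower:]' >> "$out"
done

# The rest of the pipeline is unchanged.
top=$(sort "$out" | uniq -c | sort -rn | head -10)
echo "$top"

rm -r "$scratch" "$out"
```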

### Questions for you

How much faster or slower would we go if we did each file one at a time?

What's the bottleneck here?