# Exercise 2 - shell, pipes, and csvkit

In this exercise, we'll review a few shell commands and explore working with pipes and csvkit.

You will need to fill in commands whenever prompted.  Please replace the text with your solution.

Remember to submit your completed `.ipynb` file to Blackboard and to add/commit it to your Git repository and push it to GitHub.


## Part 1 - shell commands, redirection, and pipes

### Basic shell commands and redirection

Create a directory called `part1` using `mkdir`.

In [None]:
!mkdir part1

Rename `part1` to `partone` using `mv`.

In [None]:
!mv part1 partone
!ls

Create a file named `filelist.txt` using the output from `ls` and the output redirector `>`.

In [None]:
!ls > filelist.txt

In [None]:
!cat filelist.txt

Append to `filelist.txt` using the output appending redirector `>>`.  Note the difference between the single `>` and double `>>`.

In [None]:
!ls >> filelist.txt
!cat filelist.txt

In [None]:
!ls > filelist.txt
!cat filelist.txt

What's the difference between `>` and `>>`?


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Your turn

Complete the following tasks in the cells provided.  All the tests in the testing cells (with the `assert` statements) below should pass without error - be sure to execute those as well, and if you see errors, fix your answer and try testing again until there are no errors.

Create a directory called `mydirectory`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
import os
assert 'mydirectory' in os.listdir('.')

Using `ls` and output redirection, create a file called `myfiles.txt` in the directory `mydirectory` that contains the list of files in the current directory.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert 'myfiles.txt' in os.listdir('mydirectory')

In [None]:
myfiles = open('mydirectory/myfiles.txt').read()
assert 'exercise-02.ipynb' in myfiles

Clean up the directory you just created by removing its contents (the file you created) using `rm`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert 'myfiles.txt' not in os.listdir('mydirectory')

Now remove the directory itself using `rmdir`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert 'mydirectory' not in os.listdir('.')

### Filters and pipes

Let's look at something a little more interesting.  Download the text of Hermann Hesse's *Siddhartha* from [Project Gutenberg](http://www.gutenberg.org/):

In [None]:
!wget https://www.gutenberg.org/cache/epub/2500/pg2500.txt

*Note*: sometimes Project Gutenberg restricts access.  If that creates an error for you, you should be able to `wget` the same file from our class repository on GitHub at the url https://github.com/gwsb-istm-6212-fall-2016/syllabus-and-schedule/raw/master/exercises/pg2500.txt.

However you get the file, let's rename it to something easier to remember.

In [None]:
!mv pg2500.txt siddhartha.txt

`head` and `tail` are very useful.  They let you take a quick peek at the start and end of files.

In [None]:
!head siddhartha.txt

In [None]:
!tail siddhartha.txt

`grep` is one of the most useful filters.  It lets you search for and match lines that contain specific expressions.  For example, to find mentions of "copyright":

In [None]:
!grep copyright siddhartha.txt

Notice anything that those lines have in common?

Let's add a little more information by including the `-n` flag to add matching line numbers.

In [None]:
!grep -n copyright siddhartha.txt
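
To see the effect of `-n` on a small scale, here is a minimal sketch that pipes three made-up lines (not taken from *Siddhartha*) into `grep`.  It assumes a POSIX-style shell, so it could be run in a terminal or in a `%%bash` cell; the variable name `matches` is just for illustration.

```shell
# Three sample lines; only the first and third contain "river"
matches=$(printf 'the river flows\nthe forest is quiet\nby the river he sat\n' | grep -n river)

# Each matching line is prefixed with its line number and a colon
echo "$matches"
```

The line numbers refer to positions in the input, so the skipped second line leaves a gap in the numbering.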

Now let's look for any mention of "river".  This will match a lot of text, so we'll just take the first 10 matching lines by *piping* the output from `grep` into `head`.

In [None]:
!grep -n river siddhartha.txt | head

How many lines contain "river"?  We can count by piping into the word count tool `wc`.

In [None]:
!grep river siddhartha.txt | wc

That's 109 matching lines, containing 1365 words and 7979 bytes (the same as characters for plain ASCII text).  If you want only the line count, use `wc -l`:

In [None]:
!grep river siddhartha.txt | wc -l
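
As a quick check of what `wc` counts, the following sketch feeds it three made-up lines holding five words in total.  The sample text and the variable names are assumptions for illustration; `-w` is the word-count counterpart of `-l`.

```shell
# Count lines only
line_count=$(printf 'one two\nthree\nfour five\n' | wc -l)

# Count words only
word_count=$(printf 'one two\nthree\nfour five\n' | wc -w)

echo "lines: $line_count, words: $word_count"
```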

What if we want to match both upper- and lower-case text?  Use `grep -i`:

In [None]:
!grep time siddhartha.txt | wc -l

In [None]:
!grep -i time siddhartha.txt | wc -l
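
The same comparison can be sketched on three made-up lines that spell "time" with different capitalization.  The sample text and variable names are assumptions for illustration only.

```shell
# Case-sensitive: only the all-lower-case "time" matches
case_sensitive=$(printf 'Time passed\nno time left\nTIME is up\n' | grep time | wc -l)

# Case-insensitive: "Time", "time", and "TIME" all match
case_insensitive=$(printf 'Time passed\nno time left\nTIME is up\n' | grep -i time | wc -l)

echo "without -i: $case_sensitive, with -i: $case_insensitive"
```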

### Your turn

How many lines in *Siddhartha* contain "other" (just lower-case)?  Start by using `grep` to extract the lines that match the word "other" in `siddhartha.txt`, redirecting the output to a file called `other-lines.txt`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
%sc h_other = head -1 other-lines.txt
assert "other" in h_other

In [None]:
%sc t_other = tail -1 other-lines.txt
assert "other" in t_other

Now count up the lines in the file you created using `wc`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert "127" in _

Your answer should be 127!

### Counting words with `grep`

By piping commands together we can do a lot of powerful things right at the command line.  Let's count the most commonly occurring words in *Siddhartha*.  We could write a Python script to count words, but with the shell tools we only need to assemble the right pipeline, and tasks like this often fit in one line.

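First we need to split the lines of text into one word per line.  There are `grep` flags for that:  `-o` prints only the matching part of each line (one match per output line), `-E` enables extended regular expressions, and the pattern `\w{2,}` matches runs of two or more word characters.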

In [None]:
!cat siddhartha.txt | grep -oE '\w{2,}' | head -10
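
To make the tokenizing step concrete, here is a small sketch that runs the same `grep -oE '\w{2,}'` on one made-up sentence.  The sentence and the variable name `words` are assumptions for illustration.

```shell
# Punctuation is dropped, and one-letter words ("I", "a") are too short to match
words=$(printf 'I am a fish, he said.\n' | grep -oE '\w{2,}')

# One word per line
echo "$words"
```

Note that the `{2,}` quantifier silently discards single-character words, which matters if you care about counts for words like "a" or "I".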

Now we need to sort them and count the unique tokens.  `sort` solves the first problem.

In [None]:
!cat siddhartha.txt | grep -oE '\w{2,}' | sort | head -10

And `uniq -c` solves the second problem.

In [None]:
!cat siddhartha.txt | grep -oE '\w{2,}' | sort | uniq -c | head -25

But there's a catch... do you see it?

We need to convert all the words down into lower case so that we are correctly counting unique words.  There's another command, `tr`, for that.

In [None]:
!cat siddhartha.txt | grep -oE '\w{2,}' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | head -25

...and if we want to know only the top 10 words in Siddhartha, we need to sort the output.

In [None]:
!cat siddhartha.txt | grep -oE '\w{2,}' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort | head -10

But that sorts by character, not number.  Fortunately, `sort -n` does what we want.

In [None]:
!cat siddhartha.txt | grep -oE '\w{2,}' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -n | head -10
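
The difference between the two sort orders is easiest to see on bare numbers.  This is a minimal sketch with three made-up values; the variable names are for illustration only.

```shell
# Character-by-character comparison: "100" sorts before "9" because "1" < "9"
by_character=$(printf '10\n9\n100\n' | sort)

# Numeric comparison: values are ordered by magnitude
by_number=$(printf '10\n9\n100\n' | sort -n)

echo "sort:"
echo "$by_character"
echo "sort -n:"
echo "$by_number"
```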

But that's the wrong end of the list!  Two ways to fix that:  (a) use `tail` instead of `head`; (b) use `sort -rn`, which will sort in reverse order.  Let's try the latter.

In [None]:
!cat siddhartha.txt | grep -oE '\w{2,}' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn | head -10
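
Here is the same pipeline as a sketch on a single made-up sentence, small enough to check by hand: "the" appears three times and "river" twice once case is folded.  The sample sentence and the variable name `top` are assumptions for illustration.

```shell
# Tokenize, lower-case, group identical words, count them,
# then sort by count (largest first) and keep the top two
top=$(printf 'The river and the boat. the River\n' \
  | grep -oE '\w{2,}' \
  | tr '[:upper:]' '[:lower:]' \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -2)

echo "$top"
```

The counts printed by `uniq -c` are padded with leading spaces, and the amount of padding can differ between systems.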

### Your turn

Download *Alice in Wonderland* from http://www.gutenberg.org/cache/epub/11/pg11.txt (or https://github.com/gwsb-istm-6212-fall-2016/syllabus-and-schedule/raw/master/exercises/pg11.txt if the first url doesn't work).

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert 'pg11.txt' in os.listdir('.')

Now rename `pg11.txt` to `alice.txt`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert 'alice.txt' in os.listdir('.')

Take a look at the next cell.  Will it successfully find the 25 most common words in *Alice in Wonderland*?

In [None]:
!cat alice.txt | grep -oE '\w{2,}' | sort | uniq -c | head -25

Describe what needs to be done to the previous cell to get it to work correctly.  Describe it using words, explaining the issues, rather than using shell commands!

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Okay, now implement your solution using shell commands with a pipeline.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Part 2 - csvkit basics

Let's look at some CSV data using csvkit.  Download the 2015 fourth quarter trip dataset from [Capital Bikeshare's trip history data](https://www.capitalbikeshare.com/trip-history-data):

In [None]:
!wget https://s3.amazonaws.com/capitalbikeshare-data/2015-Q4-cabi-trip-history-data.zip

Let's unzip it, rename it to something short, and take a look.

In [None]:
!unzip 2015-Q4-cabi-trip-history-data.zip

In [None]:
!mv 2015-Q4-Trips-History-Data.csv q4.csv

In [None]:
!head q4.csv

csvkit gives us great tools for examining and working with CSV data.  We start by looking at the columns:

In [None]:
!csvcut -n q4.csv

We can also extract just a few columns with `csvcut`:

In [None]:
!csvcut -c1,5,7 q4.csv | head -10

...and make it look better with `csvlook`:

In [None]:
!csvcut -c1,5,7 q4.csv | head -10 | csvlook

It gets even better.  Try `csvgrep`:

In [None]:
!csvcut -c1,5,7 q4.csv | csvgrep -c3 -m '21st & I St NW' | head -10 | csvlook

But wait, there's more:

In [None]:
!csvcut -c1,5,7 q4.csv | csvgrep -c3 -m '21st & I St NW' | csvsort -c2 | head -10 | csvlook

And you can perform basic statistics very easily:

In [None]:
!csvcut -c1,5,7 q4.csv | csvgrep -c3 -m '21st & I St NW' | csvcut -c1 | csvstat

### Your turn

Which set of trips had the longer average trip duration:  trips *starting* at "Massachusetts Ave & Dupont Circle NW", or trips *ending* at "Massachusetts Ave & Dupont Circle NW"?

Use as many new cells as you need to compute the answer, and then write your answer below in the "YOUR ANSWER HERE" cell.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()