Redirection
-----------

### The `>` operator

A very important command-line operator is the “redirection” operator “`>`”. With “`>`” you can send the output of a command to a file instead of the screen. So if you are using curl to get your current location from the Telize geoip service (see the previous section) and want to store the output, redirection creates a new file that contains exactly that output:

In [None]:
!curl 'http://www.telize.com/geoip' 

In [None]:
!curl -s 'http://www.telize.com/geoip' > location.json

In [None]:
!curl -s 'http://www.telize.com/geoip' -o location.json

In [None]:
!ls -lA

To see the content of the file, we can use the command `cat` (described below):

In [None]:
!cat location.json

### The `>>` operator

If we want to append to a file (instead of creating a new file from scratch), then we can use the `>>` operator. The operator is useful in cases where we want to collect data over time (e.g., by setting up a script that runs every hour and appends the data to the file, instead of overwriting what is there).

In [None]:
!curl 'http://www.telize.com/geoip' > alldata.txt

In [None]:
!cat alldata.txt

In [None]:
!curl "http://api.openweathermap.org/data/2.5/weather?lat=40.72&lon=-73.98&units=imperial&mode=json" >> alldata.txt 

In [None]:
!cat alldata.txt

In [None]:
!echo 'Open Weather Map Data! Haha!' >>alldata.txt

In [None]:
!cat alldata.txt
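
The difference between `>` and `>>` can also be checked without any network access. The following is a minimal sketch that works in a temporary directory; the file name `demo.txt` and its contents are assumptions made for illustration:

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

echo 'first line' > demo.txt     # creates demo.txt (or empties it if it exists)
echo 'second line' >> demo.txt   # appends, keeping the existing content
lines_after_append=$(wc -l < demo.txt)

echo 'replacement' > demo.txt    # overwrites: the two earlier lines are gone
lines_after_overwrite=$(wc -l < demo.txt)

echo "lines after >>: $lines_after_append"
echo "lines after >:  $lines_after_overwrite"
cat demo.txt
```

The second `>` silently discards what was in the file, which is why `>>` is the safer choice when collecting data over time.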

### Exercise

* Let's now get the "_Restaurant Inspection Results_" data set from the [NYC Open Data](https://nycopendata.socrata.com/) website.

* Click on the top "1100+ Data Sets available" and then search for the term "_Restaurant Inspection Results_".

* If you see multiple data sets with the same title, pick the one with the most views.

* Go to the data set and get the link for downloading the ZIP file. (It is under "Export")


In [None]:
#your code here: download the file and store it under /home/ubuntu/data/restaurants.csv


Note 1: This dataset is approximately 180 MB, so it can take 2-3 minutes to download; a URL to download a zipped version (~9.5 MB) is at https://dl.dropboxusercontent.com/u/16006464/IPDS/restaurant.zip

Note 2: You need to unzip the restaurant.zip file to get the contents. (If needed, the `unzip` command can be installed using `sudo apt-get install unzip`.)

Pipes
-----

Pipes provide a way of connecting the output of one unix program or utility to the input of another, through standard input and output. 

Unix pipes give you the power to compose various utilities into a data flow and use your creativity to solve problems. Utilities are connected together ("piped" together) via the pipe operator, |. 

We will give more examples that use pipes later, after covering a few useful utilities first.

Filters
-------

### `cat`:

Prints the contents of the specified files to standard output. Example:

In [None]:
!curl -s -L 'https://dl.dropboxusercontent.com/u/16006464/IPDS/sample.txt' -o sample.txt #retrieve the file
!cat sample.txt

_Note_: The -L flag tells curl to follow "redirects", -s tells curl not to print progress statistics or error messages, and -o names the file in which the downloaded content is stored.

If we also want to number the lines, we use the `-n` option:

In [None]:
!cat -n sample.txt

### `less`:

**_For use within the UNIX shell; not really useful from within an IPython notebook._**

The command `cat` lets you see the contents of a file, but it is not convenient when the file is big. For that, it is better to use the command `less`, which allows you to scroll and navigate through the contents of a file. When invoked like `less [some big file]`, `less` enters an interactive mode. In this mode, several keys help you navigate the input file. Some key commands are:

+ `(space)`: space navigates forward one screen.
+ `(enter)`: enter navigates forward one line.
+ `b`: navigates backwards one screen
+ `y`: navigates backwards one line.
+ `/[pattern]`: search forwards for the next occurrence of `[pattern]`
+ `?[pattern]`: search backwards for the previous occurrence of `[pattern]`

Here, `[pattern]` can be a basic string or a regular expression. (We will cover regular expressions in the next session.)

### `head/tail`:

The `cat` command above lists the full file. This is problematic when dealing with big files, as it can often block the terminal or the IPython notebook.

The `head` and `tail` commands can be used instead to output the first (last) lines of a file. Typically used like:


In [None]:
#Prints the first five lines of a file
!head -n 5 sample.txt 

In [None]:
#Prints the last five lines of a file
!tail -n 5 sample.txt

The -n option specifies the number of lines to output; the default is 10.

For more advanced usage, `tail`, when used with the `-f` option, will output the end of a file _as it is written to_. This is useful if a program is writing output or logging progress to a file, and you want to read it _live_ as it is happening.
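
As a minimal sketch of the `-n` option, the snippet below builds a small throwaway file (its contents are an assumption for illustration) and takes lines from either end. It also uses the `tail -n +K` form, which is not covered in the text above: it prints everything from line K onward, a convenient way to skip a header row.

```shell
tmpfile=$(mktemp)
printf 'id,name\n1,alpha\n2,beta\n3,gamma\n' > "$tmpfile"

first_two=$(head -n 2 "$tmpfile")    # the header plus the first record
last_one=$(tail -n 1 "$tmpfile")     # only the final line
no_header=$(tail -n +2 "$tmpfile")   # everything from line 2 onward

printf '%s\n' "$no_header"
rm -f "$tmpfile"
```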

### `cut`:

The `cat` command prints the full file. The `head` and `tail` commands select the first and last lines of a file. The `cut` command complements these commands by allowing us to select (or “cut”) certain fields (usually columns) from input. 

Cut is typically used with the `-f` option to specify a comma-separated list of columns to be emitted. Example:

In [None]:
!cat sample.txt

In [None]:
#Selects the first and fourth column. Assumes tab-separated columns.
!cut -f1,4 sample.txt

##### Specifying an alternative delimiter

`-d` option: To specify the character that separates the fields, use the `-d` option. For example, if spaces were used instead of tabs, we could change the above command to:

In [None]:
# Selects the second column, if the delimiter is a space.
# Notice that we put the space within quotes: -d' '. Alternatively, we could escape the space with a backslash.
# Notice that the only space characters appear between the words "foo bar", "biz baz", etc. The earlier separators are tabs.
!cut -f2 -d' ' sample.txt
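
One caveat is worth a quick sketch: `cut` splits on every occurrence of the delimiter and knows nothing about CSV quoting. The two rows below are hypothetical, modelled loosely on the restaurant data, and the second name is quoted because it contains a comma:

```shell
tmpfile=$(mktemp)
# Two hypothetical rows; the second name contains a comma inside quotes.
printf '%s\n' '1001,SUBWAY,BRONX' '1002,"JOE, THE BAKER",QUEENS' > "$tmpfile"

# cut treats the comma inside the quotes as a separator too.
names=$(cut -d, -f2 "$tmpfile")
printf '%s\n' "$names"
rm -f "$tmpfile"
```

The second line of output is the fragment `"JOE` rather than the full name, so results extracted this way from a quoted CSV file should be treated as approximate.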

### Exercise:

* Use the `cut` command to extract the restaurant names from the NYC Restaurant dataset. It is a comma-separated file, so remember to specify the separator correctly. The restaurant name is the second column, and it is called "DBA" (doing business as).
* Use the redirect operator > to save the outcome into a file named "rest-names.txt"

In [None]:
#Your code here

### Using pipes 

Now, let's use pipes for the first time. Pipes are used to "pipe" the output of one program into another. For example, select the second column, and then list only the first 3 entries:

In [None]:
!cut -f 2 sample.txt | head -3

#### Exercise

* Select the restaurant names column, and then list the last 10 entries

In [None]:
#your code here

### `sort`

The `sort` utility is an extremely efficient implementation of [external merge sort](http://dzmitryhuba.blogspot.com/2010/08/external-merge-sort.html). In a nutshell, this means that `sort` can order a dataset far larger than can fit in a system’s main memory. While sorting extremely large files does drastically increase the runtime, smaller files are sorted quickly. It is useful both as a component of larger shell scripts and independently, as a tool to, say, quickly find the most active users or to see the most frequently loaded pages on a domain.

Typically called like: `sort [options] [file]`. Example:

In [None]:
!cat sample.txt

In [None]:
!sort sample.txt



Some useful options:

+ `-r`: reverse order. Sort the input in descending order:

In [None]:
!sort -r sample.txt

+ `-n`: numeric order. Sort the input in numerical order as opposed to the default lexicographical order:

In [None]:
!sort -n sample.txt

+ `-k n`: sort the input according to the values in the n-th column. Useful for columnar  data. 

In [None]:
!sort -k 2 sample.txt

+ `-t`: specifies the character that separates columns. Notice that `sort` uses the `-t` option to specify the delimiter character, while `cut` used `-d`. Yes, it is confusing. (This is mainly the result of developers creating
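
The options above can be combined. The following is a minimal sketch on hypothetical `name,count` pairs; it writes the key as `-k2,2`, which restricts the sort key to the second column only (plain `-k 2` uses everything from the second column to the end of the line), and appends `n` to request numeric order for that key:

```shell
tmpfile=$(mktemp)
# Hypothetical name,count pairs.
printf '%s\n' 'pear,9' 'apple,100' 'fig,25' > "$tmpfile"

by_text=$(sort -t, -k2,2 "$tmpfile")      # lexicographic: "100" < "25" < "9"
by_number=$(sort -t, -k2,2n "$tmpfile")   # numeric: 9 < 25 < 100

printf 'lexicographic:\n%s\n' "$by_text"
printf 'numeric:\n%s\n' "$by_number"
rm -f "$tmpfile"
```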

### Exercise

* Sort the NYC Restaurant dataset by restaurant name and store the result in a separate file `/home/ubuntu/data/sorted.csv`. You will see that this is a comma separated file therefore the character that separates columns is the `,` (comma) character. The restaurant name is the second column in the dataset.
* Repeat the sorting process above, but instead of storing the file, display the first 5 entries using the `head` command and a pipe.

In [None]:
#your code here: Sort the dataset by restaurant name (the "DBA" column)
#and store the result in a separate file /home/ubuntu/data/sorted.csv

In [None]:
#your code here: Use now the head command and a pipe.

### `uniq`

Removes *sequential* duplicates: prints only those unique sequential lines from a file. For example, our sample.txt file contains a duplicate line at the end. See:

In [None]:
!cat sample.txt

By running `uniq` we can remove the duplicate line:

In [None]:
!uniq sample.txt

Used with the `-c` option, uniq will report the number of duplicates of each line in the sequence. Example:

In [None]:
!uniq -c sample.txt
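
Because `uniq` only compares neighbouring lines, duplicates that are not adjacent survive. A minimal sketch on a throwaway file (the contents are an assumption for illustration) shows why `uniq` is usually preceded by `sort`:

```shell
tmpfile=$(mktemp)
printf '%s\n' 'a' 'b' 'a' 'a' > "$tmpfile"

adjacent_only=$(uniq "$tmpfile")         # a, b, a: the first "a" is not adjacent to the others
fully_unique=$(sort "$tmpfile" | uniq)   # a, b: sorting brings all duplicates together

printf 'uniq alone:\n%s\n' "$adjacent_only"
printf 'sort | uniq:\n%s\n' "$fully_unique"
rm -f "$tmpfile"
```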

#### Exercise

For the following three exercises, perform them first by running multiple commands and storing the output of each command in a separate file. Then consolidate the process by using pipes.

* Count the number of times that a restaurant name appears in the dataset. Use the `cut`, `sort`, `uniq` commands.
* Count the number of times that a restaurant name appears in the dataset, and display in descending order of frequency the count and the restaurant name.
* List the ten most frequent restaurant names, *without* displaying their frequency in the dataset.

In [21]:
#your code here
!cat /home/ubuntu/data/restaurants.csv | cut -f2 -d, | \
sort | uniq -c | sort -r -n | head -n 20 

   4600 SUBWAY
   3775 MCDONALD'S
   2646 DUNKIN' DONUTS
   2083 DUNKIN DONUTS
   1906 STARBUCKS COFFEE
   1500 CROWN FRIED CHICKEN
   1482 KENNEDY FRIED CHICKEN
   1413 BURGER KING
   1294 DOMINO'S PIZZA
    973 "DUNKIN' DONUTS
    850 CHIPOTLE MEXICAN GRILL
    763 POPEYES CHICKEN & BISCUITS
    668 GOLDEN KRUST CARIBBEAN BAKERY & GRILL
    545 WENDY'S
    523 AU BON PAIN
    447 LITTLE CAESARS
    444 CARVEL ICE CREAM
    437 PAPA JOHN'S
    435 IHOP
    395 PRET A MANGER
sort: write failed: standard output: Broken pipe
sort: write error


### `wc`: 
Compute word, line, and byte counts for specified files or output of other scripts. Particularly useful when used in concert with other utilities such as grep, sort, and uniq. Example usage:

In [22]:
!cat sample.txt

123	1346699925	11122	foo bar
222	1346699955	11145	biz baz
140	1346710000	11122	hee haw
234	1346700000	11135	bip bop
146	1346699999	11123	foo bar
99	1346750000	11135	bip bop
99	1346750000	11135	bip bop


In [23]:
!wc sample.txt

  7  35 201 sample.txt


The output indicates the number of lines, words, and bytes in the file, respectively. There are some useful flags for wc that will help you answer specific questions quickly:

+ `-l`: get the number of lines from the input. Example:

In [24]:
!wc -l sample.txt

7 sample.txt


+ `-w`: get the number of words in the input. Example:

In [25]:
!wc -w sample.txt

35 sample.txt


+ `-m`: the number of characters in the input. Example:

In [26]:
!wc -m sample.txt

201 sample.txt


+ `-c`: the number of bytes in the input. Example:

In [27]:
!wc -c sample.txt

201 sample.txt


Here, the number of bytes and characters are the same; all characters used are just one byte.
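
When only the number is needed, for example to store it in a shell variable, one option is to feed the file to `wc` through standard input so that no file name is printed. A minimal sketch, using a throwaway file whose contents are an assumption for illustration:

```shell
tmpfile=$(mktemp)
printf 'one two\nthree\n' > "$tmpfile"

with_name=$(wc -l "$tmpfile")      # the count followed by the file name
count_only=$(wc -l < "$tmpfile")   # the count alone: wc never sees a file name
word_count=$(wc -w < "$tmpfile")

echo "with name:  $with_name"
echo "count only: $count_only"
echo "words:      $word_count"
rm -f "$tmpfile"
```

Some implementations pad the number with leading spaces, so numeric comparisons are safer than string comparisons on the result.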

#### Exercise

* List how many entries (inspections) there are in the restaurant data set
* Remove any duplicate names
* Report how many unique restaurants are in the dataset
* "Tricky" questions: 
    * On average, how many words are there in a NYC restaurant name? Compute the answer both with and without duplicate names.
    * On average, how many characters in a NYC restaurant name?

In [32]:
!head /home/ubuntu/data/restaurant.csv

CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,VIOLATION CODE,VIOLATION DESCRIPTION,CRITICAL FLAG,SCORE,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE
30075445,MORRIS PARK BAKE SHOP,BRONX,1007      ,MORRIS PARK AVE                                   ,10462,7188924968,Bakery,03/03/2014,Violations were cited in the following area(s).,10F,"Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.",Not Critical,2,A,03/03/2014,01/14/2015,Cycle Inspection / Initial Inspection
30075445,MORRIS PARK BAKE SHOP,BRONX,1007      ,MORRIS PARK AVE                                   ,10462,7188924968,Bakery,10/10/2013,No violations were recorded at the time of this inspection.,,,Not Applicable,,,,01/14/2015,Trans Fat / Second Compliance Inspection
3007

In [33]:
#your code here
!cat /home/ubuntu/data/restaurant.csv | cut -f2 -d, | sort | uniq | wc -l 

20583


### `find` 

Search directories for matching files. Useful when you know the name of a file (or part of the name), but do not know the file’s location in a directory. Example:

In [None]:
!find ~ -name 'sample.txt'
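
When only part of the name is known, `-name` also accepts shell-style wildcards. The following is a minimal sketch in a temporary directory; the directory layout and file names are assumptions for illustration:

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/a/b"
touch "$workdir/a/sample.txt" "$workdir/a/b/sample-2.txt" "$workdir/a/b/notes.md"

# Quote the pattern so the shell hands it to find unexpanded.
matches=$(find "$workdir" -name 'sample*' | sort)
printf '%s\n' "$matches"
```

Both `sample` files are reported, at any depth below the starting directory, while `notes.md` is not.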

### `grep`:
A utility for pattern matching. grep is by far the most useful unix utility. While grep is conceptually very simple, an effective developer or data scientist will no doubt find themselves using grep dozens of times a day. grep is typically called like this: `grep [options] [pattern] [files]`. With no options specified, this simply looks for the specified pattern in the given files, printing to the console only those lines that match the given pattern. Example:
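
A minimal sketch of this basic form, using a throwaway file with tab-separated rows modelled on sample.txt (the file and its contents are assumptions for illustration):

```shell
tmpfile=$(mktemp)
# Tab-separated rows modelled on sample.txt.
printf '123\t11122\tfoo bar\n222\t11145\tbiz baz\n146\t11123\tfoo bar\n' > "$tmpfile"

# With no options, grep prints only the lines that contain the pattern.
matches=$(grep 'foo bar' "$tmpfile")
printf '%s\n' "$matches"
rm -f "$tmpfile"
```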

This in itself can be very useful, scraping large volumes of data to find what you’re looking for. 

The power of grep really shows when different command options are specified. Below are just a sample of the more useful grep options:

+ `-v`: Inverted matching. In this setting, grep will return all the input lines that do not match the specified pattern. Example:

In [None]:
!cat sample.txt

In [None]:
!grep -v 'biz baz' sample.txt

+ `-R`: Recursive matching. Here grep descends sub folders, applying the pattern on all files encountered. Very useful if you’re looking to see if any logs have lines that you’re interested in, or to find the source code file containing the function you’re interested in. Example:

In [None]:
!cd /home/ubuntu/data/; grep -R 'MORIMOTO' .

More options:

* `-c`: print only a count of matching lines
* `-i`: ignore distinctions between lowercase and uppercase
* `-n`: print each matching line with its line number
* `-v`: negate matches; print the lines that do not match the pattern
* `-r`: recursively search the listed subdirectories
* `-l`: list only the names of files that contain a match
* `-o`: print only the matching part of each line
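
A few of these options can be tried on a small file. This is a minimal sketch; the file contents are an assumption chosen so that case matters:

```shell
tmpfile=$(mktemp)
printf '%s\n' 'Foo bar' 'biz baz' 'foo bar' > "$tmpfile"

count_exact=$(grep -c 'foo' "$tmpfile")       # only the lowercase line matches
count_nocase=$(grep -c -i 'foo' "$tmpfile")   # -i also matches "Foo"
numbered=$(grep -n 'baz' "$tmpfile")          # line number, a colon, then the line
only_match=$(grep -o 'baz' "$tmpfile")        # just the matched text

echo "exact: $count_exact, ignoring case: $count_nocase"
echo "$numbered"
echo "$only_match"
rm -f "$tmpfile"
```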


We will get back to grep once we learn regular expressions. You will see that grep can be extremely useful for searching through data.

### `jq`

The [jq](http://stedolan.github.io/jq/) tool is not one of the "standard" UNIX tools, but it will be useful for parsing the JSON responses of Web API calls.

Since it is not installed by default, we need to first install it:

In [None]:
!sudo apt-get install -y jq

The `jq` command has the format

`jq [filters] filename`



The absolute simplest (and least interesting) filter is `.` 

This is a filter that takes its input and produces it unchanged as output.

Since jq by default "pretty-prints" all output, this trivial program can be a useful way of formatting JSON output from, say, curl.



In [35]:
!curl -s 'http://www.telize.com/geoip' > location.json

In [36]:
!jq . location.json

{
  "country_code3": "USA",
  "country": "United States",
  "offset": "-4",
  "country_code": "US",
  "latitude": 39.0437,
  "city": "Ashburn",
  "asn": "AS14618",
  "ip": "54.174.159.22",
  "dma_code": "0",
  "region_code": "VA",
  "isp": "Amazon.com, Inc.",
  "timezone": "America/New_York",
  "area_code": "0",
  "continent_code": "NA",
  "longitude": -77.4875,
  "region": "Virg

Having learned pipes, we can now avoid storing the output of curl into a file, and instead pass it directly through `jq`: 

In [37]:
!curl -s 'http://www.telize.com/geoip' | jq .

{
  "country_code3": "USA",
  "country": "United States",
  "offset": "-4",
  "country_code": "US",
  "latitude": 39.0437,
  "city": "Ashburn",
  "asn": "AS14618",
  "ip": "54.174.159.22",
  "dma_code": "0",
  "region_code": "VA",
  "isp": "Amazon.com, Inc.",
  "timezone": "America/New_York",
  "area_code": "0",
  "continent_code": "NA",
  "longitude": -77.4875,
  "region": "Virg

(The -s option for curl stands for "silent" and prevents the status messages from appearing in the output).

The simplest useful filter is `.foo`. When given a JSON object as input, it produces the value at the attribute `foo`, or null if there’s none present.

Now, let's try to use such a filter for selecting the "city" attribute listed in the JSON output: 

In [38]:
!curl -s 'http://www.telize.com/geoip' | jq '.city'

"Ashburn"


And we can also combine multiple attributes, using the addition operator `+`: 

In [40]:
!curl -s 'http://www.telize.com/geoip' | jq '.city + ", " + .region + ", " + .country'

"Ashburn, Virginia, United States"


Now, let's try something more complicated: We will use the jq command to read the location from the output of the Telize API, and then create the URL for calling the OpenWeatherMap API (see the previous session for details):

In [41]:
!curl -s 'http://www.telize.com/geoip' | \
jq '"http://api.openweathermap.org/data/2.5/weather?q=" + .city + "," + .region + "&mode=json&units=imperial"'

"http://api.openweathermap.org/data/2.5/weather?q=Ashburn,Virginia&mode=json&units=imperial"


In [None]:
!curl -s 'http://www.telize.com/geoip' | \
jq '"http://api.openweathermap.org/data/2.5/weather?q=" + .lat + "," + .region + "&mode=json&units=imperial"'

This is our first interaction with **variables**. You will notice that we used the `city` and `region` output from one service (Telize), in order to reuse these values later, in the OpenWeatherMap service.
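
The `+` operator concatenates strings, but jq reports an error when asked to add a string and a number, and some attributes (such as `latitude`) are numbers. One possible workaround, sketched below, is to convert the number with jq's `tostring` filter. The JSON document and its latitude value are hypothetical, and the `-r` flag (raw output, without the surrounding quotes) is an addition not used elsewhere in this section:

```shell
# Hypothetical input; the latitude is a number, not a string.
json='{"city": "Ashburn", "latitude": 40.5}'

if command -v jq >/dev/null 2>&1; then
    # tostring turns the number into a string so that + can concatenate it;
    # -r prints the result without the surrounding double quotes.
    label=$(printf '%s' "$json" | jq -r '.city + " (" + (.latitude | tostring) + ")"')
    echo "$label"
else
    echo "jq is not installed"
fi
```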

#### And now let's get advanced with pipes

Now, we will get into a slightly more advanced topic. No worries if you feel lost.

The key trick we will use is the `xargs` command. The `xargs` command reads its standard input and passes it as arguments to the command that follows.
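
To see what `xargs` does without contacting any service, here is a minimal sketch in which `echo` stands in for `curl`, so the assembled command line is printed instead of executed. The URL is a hypothetical placeholder:

```shell
# jq prints strings wrapped in double quotes; xargs removes the quotes
# while splitting its input into arguments.
quoted='"http://example.com/weather?q=Ashburn,Virginia"'

# echo stands in for curl here, so nothing is fetched.
command_line=$(printf '%s\n' "$quoted" | xargs echo curl -s)
echo "$command_line"
```

The printed line is the command that `xargs` would otherwise have run: `curl -s` followed by the URL, with the quotes gone.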

So, what we do below: 
* We first generate the URL, using the commands described above. 
* Then, we use the `xargs` command to pass the URL as a parameter to curl
* Curl can then use this URL, and get the weather in our current location:

In [47]:
!curl -s 'http://www.telize.com/geoip' | \
jq '"http://api.openweathermap.org/data/2.5/weather?q=" + .city + "," + .region + "&mode=json&units=imperial"'  | \
xargs curl -s  | jq .

{
  "cod": 200,
  "name": "Ashburn",
  "id": 4744870,
  "sys": {
    "sunset": 1442531664,
    "sunrise": 1442487209,
    "country": "US",
    "message": 0.0109,
    "id": 2856,
    "type": 1
  },
  "coord": {
    "lat": 39.04,
    "lon": -77.49
  },
  "weather": [
    {
      "icon": "10d",
      "description": "moderate rain",
      "main": 

In [48]:
!curl -s 'http://www.telize.com/geoip' | \
jq '"http://api.openweathermap.org/data/2.5/weather?q=" + .city + "," + .region + "&mode=json&units=imperial"' | \
xargs curl -s | jq '.main.temp'

78.28


Instead of pipes, we can also write and read files (although it is slower and more cumbersome, it may be useful while debugging):

In [None]:
!curl -s 'http://www.telize.com/geoip' > location.json
!jq '"http://api.openweathermap.org/data/2.5/weather?q=" + .city + "," + .region + "&mode=json&units=imperial"' location.json > openweathermap.url
!cat openweathermap.url | xargs curl -s > weather.json
!jq '.main.temp' weather.json

Look at the power of pipes and filters: In three lines and in less than 200 characters, we created a service that reads our current location using the Telize API, parses the output, creates a new API call for OpenWeatherMap, and then gets the data from that service to give us back the temperature in our current location!

The full manual of `jq` is available at http://stedolan.github.io/jq/manual/ and you can use the live demo at https://jqplay.org/

There are numerous options in the manual. For now, you can restrict yourself to the basic operations that we covered.

### Exercise

* Instead of reading the city and region, read the long/lat coordinates from the Telize API, and modify the API call to OpenWeatherMap to use long/lat instead. (See http://openweathermap.org/current for the details of the API calls.)

* Print the description of the weather, instead of the temperature.

In [None]:
# your code here

More examples of using pipes
----------------------------

We discussed earlier in the session the usage of the `|` operator to connect (aka "pipe") the output of one utility and direct it as input to another. Now that we have learned a few tools, let's use these in some examples. For instance, if you want to know how many records in the sample data file do not contain "foo bar", you can compose a data flow like this:

In [None]:
!cat sample.txt | grep -v 'foo bar' | wc -l

Using `wc` at the end of a pipe to count the number of matching output records is a common pattern. Recalling that `uniq` removes any sequential duplicates, we can count the number of unique users making purchases in our file by composing a data flow like this:

In [None]:
!cat sample.txt | cut -f3 | sort | uniq  | wc -l

Or, if you want to count how many transactions each user has appeared in:

In [None]:
!cat sample.txt | cut -f3 | sort | uniq -c

To now order the users by number of transactions made, you can try something like:

In [None]:
!cat sample.txt | cut -f3 | sort | uniq -c | sort -nr

Notice here that the `-r` and `-n` flags for the sort command are combined. This is a common shorthand that most Unix utilities accept for single-letter flags.
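
As a minimal sketch, the combined and separate spellings can be compared directly. The values below are the third-column entries of sample.txt, written to a throwaway file so the snippet is self-contained:

```shell
tmpfile=$(mktemp)
# The third-column values from sample.txt, one per line.
printf '%s\n' 11122 11145 11122 11135 11123 11135 11135 > "$tmpfile"

combined=$(sort "$tmpfile" | uniq -c | sort -nr)
separate=$(sort "$tmpfile" | uniq -c | sort -n -r)

# The most frequent value ends up on the first line.
top_line=$(printf '%s\n' "$combined" | head -n 1)
echo "most frequent: $top_line"
rm -f "$tmpfile"
```

Both spellings produce the same ordering, with the count and value of the most frequent entry first.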