# Shell

How does the shell compare to a desktop interface?

An operating system like Windows, Linux, or Mac OS is a special kind of program. It controls the computer's processor, hard drive, and network connection, but its most important job is to run other programs.

Since human beings aren't digital, they need an interface to interact with the operating system. The most common one these days is a graphical file explorer, which translates clicks and double-clicks into commands to open files and run programs. Before computers had graphical displays, though, people typed instructions into a program called a **command-line shell**. Each time a command is entered, the shell runs some other programs, prints their output in human-readable form, and then displays a prompt to signal that it's ready to accept the next command. (Its name comes from the notion that it's the "outer shell" of the computer.)

Typing commands instead of clicking and dragging may seem clumsy at first, but as you will see, once you start spelling out what you want the computer to do, you can combine old commands to create new ones and automate repetitive operations with just a few keystrokes.

**Contents:**

- Manipulating Files and Directories
    - Working Directory
    - Files and Directories
    - Move Files
    - Copy Files
    - Create Files
    - Delete Files
- Manipulating Data
- Combining Tools
- Batch Processing
- Creating New Tools

## Manipulating Files and Directories

### Working Directory

The filesystem manages files and directories (or folders). Each is identified by an absolute path that shows how to reach it from the filesystem's root directory: `/home/repl` is the directory `repl` in the directory `home`, while `/home/repl/course.txt` is a file `course.txt` in that directory, and `/` on its own is the root directory.

To find out where you are in the filesystem, run the command `pwd` (short for "print working directory"). This prints the absolute path of your **current working directory**, which is where the shell runs commands and looks for files by default.

In [1]:
pwd

'/Users/stb/Documents/GitHub/Shell'

### List/Identify Files and Directories

`pwd` tells you where you are. To find out what's there, type `ls` (which is short for "listing") and press the enter key. On its own, `ls` lists the contents of your current directory (the one displayed by `pwd`). If you add the names of some files, `ls` will list them, and if you add the names of directories, it will list their contents. For example, `ls /home/repl` shows you what's in your starting directory (usually called your home directory).

In [4]:
ls

courses/               shell - intro.ipynb
programming-languages/ shell - intro.sh


### Relative and Absolute Path

An absolute path is like a latitude and longitude: it has the same value no matter where you are. A relative path, on the other hand, specifies a location starting from where you are: it's like saying "20 kilometers north".

For example, if you are in the directory `/home/repl`, the relative path `seasonal` specifies the same directory as `/home/repl/seasonal`, while `seasonal/winter.csv` specifies the same file as `/home/repl/seasonal/winter.csv`. The shell decides if a path is absolute or relative by looking at its first character: if it begins with `/`, it is absolute, and if it doesn't, it is relative.

In [7]:
# Absolute path
ls /Users/stb/Documents/GitHub/Shell/courses/

data-analysis.txt     probability.txt
machine-learning.txt  statistics.txt


In [8]:
# Relative path
ls courses/

data-analysis.txt     probability.txt
machine-learning.txt  statistics.txt


In [9]:
ls programming-languages/

python.txt  r.txt


### Move to Another Directory

Just as you can move around in a file browser by double-clicking on folders, you can move around in the filesystem using the command `cd` (which stands for "change directory").

If you type `cd seasonal` and then type `pwd`, the shell will tell you that you are now in `/home/repl/seasonal`. If you then run `ls` on its own, it shows you the contents of `/home/repl/seasonal`, because that's where you are. If you want to get back to your home directory `/home/repl`, you can use the command `cd /home/repl`.

You are in /home/repl/. Change directory to /home/repl/seasonal using a relative path.

In [1]:
pwd

'/Users/stb/Documents/GitHub/Shell'

In [2]:
cd courses

/Users/stb/Documents/GitHub/Shell/courses


Use `pwd` to check that you're there.

In [3]:
pwd

'/Users/stb/Documents/GitHub/Shell/courses'

Use `ls` without any paths to see what's in that directory.

In [4]:
ls

data-analysis.txt     probability.txt
machine-learning.txt  statistics.txt


### Move Up a Directory

The parent of a directory is the directory above it. For example, `/home` is the parent of `/home/repl`, and `/home/repl` is the parent of `/home/repl/seasonal`. You can always give the absolute path of your parent directory to commands like `cd` and `ls`. More often, though, you will take advantage of the fact that the special path `..` (two dots with no spaces) means "the directory above the one I'm currently in". If you are in `/home/repl/seasonal`, then `cd ..` moves you up to `/home/repl`. If you use `cd ..` once again, it puts you in `/home`. One more `cd ..` puts you in the root directory `/`, which is the very top of the filesystem. (Remember to put a space between `cd` and `..`: it is a command and a path, not a single four-letter command.)

A single dot on its own, `.`, always means "the current directory", so `ls` on its own and `ls .` do the same thing, while `cd .` has no effect (because it moves you into the directory you're currently in).

One final special path is `~` (the tilde character), which means "your home directory", such as `/home/repl`. No matter where you are, `ls ~` will always list the contents of your home directory, and `cd ~` will always take you home.

If you are in `/home/repl/seasonal`, where does `cd ~/../.` take you?
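
One way to check an answer like this without touching your real home directory is to try it in a throwaway directory tree. The sketch below is only an illustration: the `sandbox` variable and the temporary layout are made up, and `HOME` is overridden purely so that `~` points into the sandbox.

```shell
# Build a disposable tree that mimics /home/repl/seasonal (hypothetical layout).
sandbox=$(cd "$(mktemp -d)" && pwd -P)
mkdir -p "$sandbox/home/repl/seasonal"

# Assumption: overriding HOME makes ~ expand to the sandboxed home directory.
HOME="$sandbox/home/repl"

cd ~/seasonal
cd ~/../.          # ~ is the home directory, .. is its parent, . stays put
where=$(pwd)
echo "$where"      # prints the sandboxed stand-in for /home
```

Because the path starts from `~`, the result does not depend on the directory you were in when you ran the command.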

In [10]:
cd ..

/Users/stb/Documents/GitHub/Shell


In [14]:
cd ~

/Users/stb


In [15]:
cd Documents/GitHub/Shell/courses

/Users/stb/Documents/GitHub/Shell/courses


### Copy Files

You will often want to copy files, move them into other directories to organize them, or rename them. One command to do this is `cp`, which is short for "copy". If `original.txt` is an existing file, then:
`cp original.txt duplicate.txt`
creates a copy of `original.txt` called `duplicate.txt`. If there already was a file called `duplicate.txt`, it is overwritten. If the last parameter to `cp` is an existing directory, then a command like:

`cp seasonal/autumn.csv seasonal/winter.csv backup`
copies all of the files into that directory.

In [16]:
ls

data-analysis.txt     probability.txt
machine-learning.txt  statistics.txt


In [17]:
cp probability.txt probability_duplicate.txt

In [19]:
cd ..

/Users/stb/Documents/GitHub/Shell


In [20]:
ls

courses/               programming-languages/ shell - intro.sh
data/                  shell - intro.ipynb    tmp/


In [21]:
cp courses/probability.txt courses/probability_duplicate.txt tmp

In [22]:
ls tmp/

probability.txt            tmp_file.txt
probability_duplicate.txt


### Move Files

While `cp` copies a file, `mv` moves it from one directory to another, just as if you had dragged it in a graphical file browser. It handles its parameters the same way as `cp`, so the command:

`mv autumn.csv winter.csv ..`
moves the files `autumn.csv` and `winter.csv` from the current working directory up one level to its parent directory (because `..` always refers to the directory above your current location).

You are in /home/repl, which has sub-directories seasonal and backup. Using a single command, move spring.csv and summer.csv from seasonal to backup.

In [23]:
mv tmp/probability.txt tmp/probability_duplicate.txt .

### Rename Files

`mv` can also be used to rename files. If you run:

`mv course.txt old-course.txt`
then the file course.txt in the current working directory is "moved" to the file old-course.txt. This is different from the way file browsers work, but is often handy.

One warning: just like `cp`, `mv` will overwrite existing files. If, for example, you already have a file called old-course.txt, then the command shown above will replace it with whatever is in course.txt.
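
The overwrite is easy to see in a scratch directory. This is a minimal sketch with made-up file contents, not part of the course material:

```shell
# Work in a throwaway directory so nothing real is overwritten.
cd "$(mktemp -d)"

echo "new notes" > course.txt        # hypothetical file contents
echo "old notes" > old-course.txt

mv course.txt old-course.txt         # silently replaces old-course.txt

contents=$(cat old-course.txt)
echo "$contents"                     # prints: new notes
ls                                   # only old-course.txt is left
```

If you would rather be asked first, both `mv` and `cp` accept the `-i` flag, which prompts before replacing an existing file.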

Go into the seasonal directory. Rename the file winter.csv to be winter.csv.bck. Run `ls` to check that everything has worked.

In [25]:
mv probability_duplicate.txt probability_2.txt

In [26]:
ls

courses/               probability_2.txt      shell - intro.sh
data/                  programming-languages/ tmp/
probability.txt        shell - intro.ipynb


### Delete Files

We can copy files and move them around; to delete them, we use `rm`, which stands for "remove". As with `cp` and `mv`, you can give `rm` the names of as many files as you'd like, so:

`rm thesis.txt backup/thesis-2017-08.txt`
removes both `thesis.txt` and `backup/thesis-2017-08.txt`.

`rm` does exactly what its name says, and it does it right away: unlike graphical file browsers, the shell doesn't have a trash can, so when you type the command above, your thesis is gone for good.

In [27]:
rm probability.txt probability_2.txt

In [28]:
ls

courses/               programming-languages/ shell - intro.sh
data/                  shell - intro.ipynb    tmp/


### Create and Delete Directories

`mv` treats directories the same way it treats files: if you are in your home directory and run `mv seasonal by-season`, for example, `mv` changes the name of the seasonal directory to by-season. However, `rm` works differently.

If you try to `rm` a directory, the shell prints an error message telling you it can't do that, primarily to stop you from accidentally deleting an entire directory full of work. Instead, you can use a separate command called `rmdir`. For added safety, it only works when the directory is empty, so you must delete the files in a directory before you delete the directory. (Experienced users can use the `-r` option to `rm` to get the same effect; we will discuss command options in the next chapter.)

Since a directory is not a file, you must use the command `mkdir directory_name` to create a new (empty) directory. Use this command to create a new directory called yearly below your home directory.

In [38]:
mkdir thrash

Without changing directories, delete the file agarwal.txt in the people directory.

In [30]:
rm thrash

rm: thrash: is a directory


In [31]:
rmdir thrash

rmdir: thrash: Directory not empty


In [35]:
rm thrash/thrash_file.txt

rm: thrash/thrash_file.txt: No such file or directory


Now that the people directory is empty, use a single command to delete it.

In [36]:
rmdir thrash
# git files inside?

rmdir: thrash: Directory not empty
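
The sequence described in this section (create a directory, fail to remove it while it has content, empty it, then remove it) can be sketched end to end in a scratch area. The names `project` and `notes.txt` are hypothetical, and the error messages are hidden so that only the outcome of each step is shown.

```shell
cd "$(mktemp -d)"                    # scratch area; nothing real is deleted

mkdir project                        # create a new, empty directory
echo "draft" > project/notes.txt     # put one file inside it

rm project 2>/dev/null    || echo "rm refuses to delete a directory"
rmdir project 2>/dev/null || echo "rmdir refuses: the directory is not empty"

rm project/notes.txt                 # empty the directory first...
rmdir project                        # ...then rmdir succeeds

[ -d project ] || echo "project is gone"
```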


### Wrapping up

You will often create intermediate files when analyzing data. Rather than storing them in your home directory, you can put them in /tmp, which is where people and programs often keep files they only need briefly. (Note that /tmp is immediately below the root directory /, not below your home directory.) This wrap-up exercise will show you how to do that.

Use `cd` to go into /tmp. List the contents of /tmp without typing a directory name. Make a new directory inside /tmp called scratch. Move /home/repl/people/agarwal.txt into /tmp/scratch. We suggest you use the ~ shortcut for your home directory and a relative path for the second rather than the absolute path.
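
A possible run of this exercise is sketched below. To keep it safe to try, it assumes a sandbox directory that stands in for `/` and overrides `HOME` so that `~` points inside it; the layout and the file contents are made up.

```shell
# Stand-ins for / and the home directory (hypothetical layout), so the real
# /tmp and home directory are left alone.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/tmp" "$sandbox/home/repl/people"
echo "name,Agarwal" > "$sandbox/home/repl/people/agarwal.txt"
HOME="$sandbox/home/repl"            # assumption: makes ~ point into the sandbox

cd "$sandbox/tmp"                    # plays the role of `cd /tmp`
ls                                   # no directory name: lists the current directory
mkdir scratch
mv ~/people/agarwal.txt scratch      # ~ for the source, relative path for the target

ls scratch                           # prints: agarwal.txt
```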

## Manipulating Data

### View a File's Contents

Before you rename or delete files, you may want to have a look at their contents. The simplest way to do this is with `cat`, which just prints the contents of files onto the screen. (Its name is short for "concatenate", meaning "to link things together", since it will print all the files whose names you give it, one after the other.)

In [14]:
cat courses/statistics.txt

Statistics is a branch of mathematics working with data collection, organization, analysis, interpretation and presentation. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied.
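
The "concatenate" part of the name shows up when `cat` is given more than one file. A small sketch with made-up files:

```shell
cd "$(mktemp -d)"                    # scratch directory with made-up files

printf 'line one\nline two\n' > first.txt
printf 'line three\n' > second.txt

# cat prints the files in the order given, with nothing between them.
combined=$(cat first.txt second.txt)
echo "$combined"
```

The output is the three lines in order, with no marker showing where one file ends and the next begins.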

### View a File's Content Piece by Piece

You can use `cat` to print large files and then scroll through the output, but it is usually more convenient to page the output. The original command for doing this was called `more`, but it has been superseded by a more powerful command called `less`. (This kind of naming is what passes for humor in the Unix world.) When you `less` a file, one page is displayed at a time; you can press `spacebar` to page down or type `q` to quit.

If you give `less` the names of **several files**, you can type `:n` (colon and a lower-case 'n') to move to the next file, `:p` to go back to the previous one, or `:q` to quit.

Use `less seasonal/spring.csv seasonal/summer.csv` to view those two files in that order. Press `spacebar` to page down, `:n` to go to the second file, and `:q` to quit.

In [23]:
!less courses/machine-learning.txt
# Press 'q' to quit

Machine learning is a scientific study of algorithms and statistical models that computer systems use in order to perform a specific task effectively without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. 

Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.[1][2]:2 Machine learning algorithms are used in a wide variety of applications, such as email filtering, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. 

Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine lea

In [22]:
!less courses/machine-learning.txt courses/statistics.txt
# Type ':n' to move to the next file, ':p' for the previous file. Type ':q' to quit.

Machine learning is a scientific study of algorithms and statistical models that computer systems use in order to perform a specific task effectively without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. 

Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.[1][2]:2 Machine learning algorithms are used in a wide variety of applications, such as email filtering, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. 

Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine lea

### Head of a File

The first thing most data scientists do when given a new dataset to analyze is figure out what fields it contains and what values those fields have. If the dataset has been exported from a database or spreadsheet, it will often be stored as comma-separated values (CSV). A quick way to figure out what it contains is to look at the first few rows.

We can do this in the shell using a command called `head`. As its name suggests, it prints the first few lines of a file (where "a few" means 10).

In [28]:
!head courses/machine-learning.txt

Machine learning is a scientific study of algorithms and statistical models that computer systems use in order to perform a specific task effectively without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. 

Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.
Machine learning algorithms are used in a wide variety of applications, such as email filtering, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. 

Machine learning is closely related to computational statistics, which focuses on making predictions using computers. 
The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, 

### Typing Less with `tab`

One of the shell's power tools is tab completion. If you start typing the name of a file and then press the tab key, the shell will do its best to auto-complete the path. For example, if you type `sea` and press `tab`, it will fill in the directory name `seasonal/` (with a trailing slash). If you then type `a` and `tab`, it will complete the path as `seasonal/autumn.csv`.

If the path is ambiguous, such as `seasonal/s`, pressing `tab` a second time will display a list of possibilities. Typing another character or two to make your path more specific and then pressing `tab` will fill in the rest of the name.

### Flag: Control What Commands Do

You won't always want to look at the first 10 lines of a file, so the shell lets you change `head`'s behavior by giving it a command-line flag (or just "flag" for short). If you run the command:

`head -n 3 seasonal/summer.csv`
`head` will only display the first three lines of the file. If you run `head -n 100`, it will display the first 100 (assuming there are that many), and so on.

A flag's name usually indicates its purpose (for example, `-n` is meant to signal "number of lines"). Command flags don't have to be a `-` followed by a single letter, but it's a widely-used convention.

Note: it's considered good style to put all flags before any filenames, so in this course, we only accept answers that do that.

In [29]:
!head -n 3 courses/machine-learning.txt

Machine learning is a scientific study of algorithms and statistical models that computer systems use in order to perform a specific task effectively without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. 

Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.


### List everything below a directory

To see everything underneath a directory, no matter how deeply nested it is, you can give `ls` the flag `-R` (which means "recursive"). This shows every file and directory in the current level, then everything in each sub-directory, and so on.

In [30]:
ls -R

courses/               shell - intro.ipynb    tmp/
programming-languages/ shell - intro.sh

./courses:
data-analysis.txt     probability.txt
machine-learning.txt  statistics.txt

./programming-languages:
python.txt  r.txt

./tmp:
tmp_file.txt


`ls` has another flag `-F` that prints a `/` after the name of every directory and a `*` after the name of every runnable program. Run `ls` with the two flags, `-R` and `-F`, and the absolute path to your home directory to see everything it contains. (The order of the flags doesn't matter, but the directory name must come last.)

In [31]:
ls -R -F

courses/               shell - intro.ipynb    tmp/
programming-languages/ shell - intro.sh

./courses:
data-analysis.txt     probability.txt
machine-learning.txt  statistics.txt

./programming-languages:
python.txt  r.txt

./tmp:
tmp_file.txt


### Get Help for a Command

To find out what commands do, people used to use the `man` command (short for "manual"). For example, the command `man head` brings up this information:

In [32]:
man head


HEAD(1) 		  BSD General Commands Manual		       HEAD(1)

NAME
     head -- display first lines of a file

SYNOPSIS
     head [-n count | -c bytes] [file ...]

DESCRIPTION
     This filter displays the first count lines or bytes of each of the speci-
     fied files, or of the standard input if no files are specified.  If count
     is omitted it defaults to 10.

     If more than a single file is specified, each file is preceded by a
     header consisting of the string ``==> XXX <=='' where ``XXX'' is the name
     of the file.

EXIT STATUS
     The head utility exits 0 on success, and >0 if an error occurs.

SEE ALSO
     tail(1)

HISTORY
     The head command appeared in PWB UNIX.

BSD				 June 6, 1993				   BSD


`man` automatically invokes `less`, so you may need to press `spacebar` to page through the information and `:q` to quit.

The one-line description under NAME tells you briefly what the command does, and the summary under SYNOPSIS lists all the flags it understands. Anything that is optional is shown in square brackets `[...]`, either/or alternatives are separated by `|`, and things that can be repeated are shown by `...`, so head's manual page is telling you that you can either give a line count with `-n` or a byte count with `-c`, and that you can give it any number of filenames.

Read the manual page for the `tail` command to find out what putting a `+` sign in front of the number used with the `-n` flag does. (Remember to press spacebar to page down and/or type `q` to quit.)

Use `tail` with the flag `-n +10` to display all but the first nine lines of seasonal/spring.csv.

In [38]:
# man tail

!tail -n +10 courses/machine-learning.txt

Overview
====
The name machine learning was coined in 1959 by Arthur Samuel. 
Tom M. Mitchell provided a widely quoted, more formal definition of the algorithms studied in the machine learning field: 
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."
This definition of the tasks in which machine learning is concerned offers a fundamentally operational definition rather than defining the field in cognitive terms.
This follows Alan Turing's proposal in his paper "Computing Machinery and Intelligence", in which the question "Can machines think?" is replaced with the question "Can machines do what we (as thinking entities) can do?".
In Turing's proposal the various characteristics that could be possessed by a thinking machine and the various implications in constructing one are exposed.
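
The effect of the `+` is easier to see on a file whose lines are numbered. With `-n +N`, `tail` starts printing at line N instead of counting back from the end. The sketch below uses a made-up 12-line file:

```shell
cd "$(mktemp -d)"

# Write a 12-line file whose content is its own line number (hypothetical data).
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
    echo "line $i"
done > numbered.txt

from_ten=$(tail -n +10 numbered.txt)   # start at line 10, i.e. skip the first 9
last_ten=$(tail -n 10 numbered.txt)    # without the +: the final 10 lines

echo "$from_ten"                       # line 10, line 11, line 12
echo "$last_ten" | head -n 1           # line 3
```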

### Select Columns from a File

`head` and `tail` let you select rows from a text file. If you want to select columns, you can use the command `cut`. It has several options (use `man cut` to explore them), but the most common is something like:

`cut -f 2-5,8 -d , values.csv`

which means "select columns 2 through 5 and column 8, using comma as the separator". `cut` uses `-f` (meaning "fields") to specify columns and `-d` (meaning "delimiter") to specify the separator. You need to specify the latter because some files may use spaces, tabs, or colons to separate columns.

In [41]:
cat data/course_list.csv

course, programming language, semester, year
Probability, R, fall, 2018
Statistics, R, winter, 2018-2019
Data Analysis, R and Python, spring, 2019
Machine Learning, Python, summer, 2019

In [40]:
!cut -f 1-2,4 -d , data/course_list.csv

course, programming language, year
Probability, R, 2018
Statistics, R, 2018-2019
Data Analysis, R and Python, 2019
Machine Learning, Python, 2019


Adding a space after the flag is good style, but not compulsory.

In [39]:
!cut -f1-2,4 -d, data/course_list.csv

course, programming language, year
Probability, R, 2018
Statistics, R, 2018-2019
Data Analysis, R and Python, 2019
Machine Learning, Python, 2019


In [41]:
!cut -d, -f 1-2,4 data/course_list.csv

course, programming language, year
Probability, R, 2018
Statistics, R, 2018-2019
Data Analysis, R and Python, 2019
Machine Learning, Python, 2019


### Repeat Commands

One of the biggest advantages of using the shell is that it makes it easy for you to do things over again. If you run some commands, you can then press the **up-arrow** key to cycle back through them. You can also use the left and right arrow keys and the delete key to edit them. Pressing return will then run the modified command.

Even better, `history` will print a list of commands you have run recently. Each one is preceded by a serial number to make it easy to re-run particular commands: just type `!55` to re-run the 55th command in your history (if you have that many). You can also re-run a command by typing an exclamation mark followed by the command's name, such as `!head` or `!cut`, which will re-run the most recent use of that command.

In [45]:
%history # This is history of the commands used in this notebook

# !head >>> runs the most recent head command
# !1 >>> runs the first command

pwd
cd courses
pwd
ls
cd ~
ls ..
ls
cd Documents/GitHub/Shell/
cd Courses
cd ..
cd ~
cd Documents/GitHub/Shell/
cd Documents/GitHub/Shell/courses
cd ~
cd Documents/GitHub/Shell/courses
ls
cp probability.txt probability_duplicate.txt
ls
cd ..
ls
cp courses/probability.txt courses/probability_duplicate.txt tmp
ls tmp/
mv tmp/probability.txt tmp/probability_duplicate.txt .
ls
mv probability_duplicate.txt probability_2.txt
ls
rm probability.txt probability_2.txt
ls
ls
rm thrash
rmdir thrash
rm thrash/thrash_file.txt
rm thrash
rmdir thrash
rm thrash/thrash_file.txt
rmdir thrash
mkdir thrash
mkdir thrash
!cut -f 1-2,4 -d, data/course_list.csv
!cut -d, -f 1-2,4 data/course_list.csv
!cut -d, -f 1-2,4 data/course_list.csv
history

# !head >>> runs the most recent head command
# !1 >>> runs the first command
!history

# !head >>> runs the most recent head command
# !1 >>> runs the first command
history

# !head >>> runs the most recent head command
# !1 >>> runs the first command
%history

# !he

### Select Lines Containing Specific Values

`head` and `tail` select rows, `cut` selects columns, and `grep` selects lines according to what they contain. In its simplest form, `grep` takes a piece of text followed by one or more filenames and prints all of the lines in those files that contain that text. For example, `grep bicuspid seasonal/winter.csv` prints lines from `winter.csv` that contain "bicuspid".

`grep` can search for patterns as well; we will explore those in the next course. What's more important right now is some of grep's more common flags:

* `-c`: print a count of matching lines rather than the lines themselves
* `-h`: do not print the names of files when searching multiple files
* `-i`: ignore case (e.g., treat "Regression" and "regression" as matches)
* `-l`: print the names of files that contain matches, not the matches
* `-n`: print line numbers for matching lines
* `-v`: invert the match, i.e., only show lines that don't match
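
Several of these flags only make a visible difference when more than one file is searched. The sketch below builds two small made-up data files (the names echo the `seasonal` examples, but the contents are invented) and tries a few of the flags on them:

```shell
cd "$(mktemp -d)"
mkdir seasonal

# Two small made-up data files.
printf 'Date,Tooth\n2017-01-05,canine\n2017-01-17,Molar\n2017-02-01,incisor\n' > seasonal/autumn.csv
printf 'Date,Tooth\n2017-01-03,bicuspid\n2017-01-05,incisor\n2017-01-21,molar\n' > seasonal/winter.csv

grep -c incisor seasonal/autumn.csv seasonal/winter.csv   # one count per file
grep -h incisor seasonal/autumn.csv seasonal/winter.csv   # matching lines, no filenames
grep -i -n molar seasonal/autumn.csv                      # case-insensitive, with line number

with_bicuspid=$(grep -l bicuspid seasonal/autumn.csv seasonal/winter.csv)
echo "$with_bicuspid"                                     # seasonal/winter.csv
```

Note that `-c` with several files reports a separate count for each file rather than a single total.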

Print the contents of all of the lines containing the word molar in seasonal/autumn.csv by running a single command while in your home directory. Don't use any flags.

In [50]:
!grep Python data/course_list.csv

Data Analysis, R and Python, spring, 2019
Machine Learning, Python, summer, 2019


Invert the match to find all of the lines that don't contain the word molar in seasonal/spring.csv, and show their line numbers. Remember, it's considered good style to put all of the flags before other values like filenames or the search term "molar".

In [51]:
!grep -v -n Python data/course_list.csv

1:course, programming language, semester, year
2:Probability, R, fall, 2018
3:Statistics, R, winter, 2018-2019


Count how many lines contain the word incisor in autumn.csv and winter.csv combined. (Again, run a single command from your home directory.)

In [52]:
!grep -c Python data/course_list.csv

2


Why isn't it always safe to treat data as text?

The `SEE ALSO` section of the manual page for `cut` refers to a command called `paste` that can be used to combine data files instead of cutting them up.

Read the manual page for `paste`, and then run paste to combine the autumn and winter data files in a single table using a `comma` as a separator. What's wrong with the output from a **data analysis** point of view?

--The last few rows have the wrong number of columns. Joining the lines with columns creates only one empty column at the start, not two.

In [47]:
# man paste
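
The problem can be reproduced on two tiny files of different lengths. This is only a sketch with invented contents, not the course's seasonal data:

```shell
cd "$(mktemp -d)"

# Made-up files with two columns each but a different number of rows.
printf 'a1,b1\na2,b2\n' > short.csv
printf 'c1,d1\nc2,d2\nc3,d3\n' > long.csv

pasted=$(paste -d , short.csv long.csv)
echo "$pasted"
# a1,b1,c1,d1
# a2,b2,c2,d2
# ,c3,d3        <- three fields instead of four
```

`paste` treats the exhausted file as a single empty string per line, so the last row gets one empty column where a consistent table would need two.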

## Combining Tools

### Redirecting Operators

### Store a Command's Output in a File

All of the tools you have seen so far let you name input files. Most don't have an option for naming an output file because they don't need one. Instead, you can use **redirection** to save any command's output anywhere you want. If you run this command:

`head -n 5 seasonal/summer.csv`
it prints the first 5 lines of the summer data on the screen. If you run this command instead:

`head -n 5 seasonal/summer.csv > top.csv`
nothing appears on the screen. Instead, head's output is put in a new file called top.csv. You can take a look at that file's contents using cat:

`cat top.csv`

The greater-than sign `>` tells the shell to redirect head's output to a file. It isn't part of the `head` command; instead, it works with every shell command that produces output.

In [52]:
ls data

course_list.csv


In [56]:
cat data/course_list.csv

course, programming language, semester, year
Probability, R, fall, 2018
Statistics, R, winter, 2018-2019
Data Analysis, R and Python, spring, 2019
Machine Learning, Python, summer, 2019

In [58]:
!head -n 2 data/course_list.csv

course, programming language, semester, year
Probability, R, fall, 2018


In [66]:
!head -n 2 data/course_list.csv > data/top.csv

In [67]:
cat data/top.csv

course, programming language, semester, year
Probability, R, fall, 2018


### Use a command's output as an input

Suppose you want to get lines from the middle of a file. More specifically, suppose you want to get lines 3-5 from one of our data files. You can start by using `head` to get the first 5 lines and redirect that to a file, and then use `tail` to select the last 3:

`head -n 5 seasonal/winter.csv > top.csv`

`tail -n 3 top.csv`

A quick check confirms that this is lines 3-5 of our original file, because it is the last 3 lines of the first 5.
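
The two-step approach can be tried on a small file where the line numbers are obvious. In this sketch, `winter.csv` is a six-line stand-in created on the spot, not the real data file:

```shell
cd "$(mktemp -d)"

# A six-line stand-in for seasonal/winter.csv (hypothetical data).
printf 'row 1\nrow 2\nrow 3\nrow 4\nrow 5\nrow 6\n' > winter.csv

head -n 5 winter.csv > top.csv       # first five lines go into a file
middle=$(tail -n 3 top.csv)          # last three of those five
echo "$middle"                       # row 3, row 4, row 5

ls                                   # top.csv is left behind as a by-product
```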

### What's a better way to combine commands? (Pipe Operator)

Using redirection to combine commands has two drawbacks:

- It leaves a lot of intermediate files lying around (like `top.csv`).
- The commands to produce your final result are scattered across several lines of history.

The shell provides another tool, called a **pipe**, that solves both of these problems at once. Once again, start by running `head`:

`head -n 5 seasonal/summer.csv`
Instead of sending head's output to a file, add a vertical bar and the tail command without a filename:

`head -n 5 seasonal/summer.csv | tail -n 3`

The pipe symbol tells the shell to use the output of the command on the left as the input to the command on the right.

In [69]:
!head -n 2 data/course_list.csv | tail -n 1

Probability, R, fall, 2018


In [70]:
cat data/course_list.csv

course, programming language, semester, year
Probability, R, fall, 2018
Statistics, R, winter, 2018-2019
Data Analysis, R and Python, spring, 2019
Machine Learning, Python, summer, 2019

In [74]:
!cut -d, -f 3 data/course_list.csv

 semester
 fall
 winter
 spring
 summer


In [77]:
!cut -d, -f 3 data/course_list.csv | grep -v semester

 fall
 winter
 spring
 summer


### Combining Many Commands

You can chain any number of commands together. For example, this command:

`cut -d , -f 1 seasonal/spring.csv | grep -v Date | head -n 10`
will:

- select the first column from the spring data;
- remove the header line containing the word "Date"; and
- select the first 10 lines of actual data.

In [78]:
!cut -d, -f 3 data/course_list.csv | grep -v semester | head -n 1

 fall


### Count the records in a file

The command `wc` (short for "word count") prints the number of characters, words, and lines in a file. You can make it print only one of these using `-c`, `-w`, or `-l` respectively.


Count how many records in `seasonal/spring.csv` have dates in July 2017. To do this, use `grep` with a partial date to select the lines and pipe the result into `wc` with an appropriate flag to count the lines.

In [79]:
!grep Python data/course_list.csv | wc -l

       2
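
To see the three flags side by side, here is a minimal sketch on a made-up two-line file (`notes.txt` is a hypothetical name). It reads the file through `<` so that `wc` prints only the number, and uses `tr` to strip the padding that some versions of `wc` add.

```shell
# Hypothetical two-line file in a scratch directory.
workdir=$(mktemp -d)
printf 'one two\nthree\n' > "$workdir/notes.txt"

# With input from < there is no filename in the output, only the count.
lines=$(wc -l < "$workdir/notes.txt" | tr -d ' ')
words=$(wc -w < "$workdir/notes.txt" | tr -d ' ')
chars=$(wc -c < "$workdir/notes.txt" | tr -d ' ')

echo "lines=$lines words=$words chars=$chars"   # lines=2 words=3 chars=14
```

Note that `-c` counts bytes, which is the same as the number of characters for plain ASCII text like this.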


### Specifying many files at once (wildcards)

Most shell commands will work on multiple files if you give them multiple filenames. For example, you can get the first column from all of the seasonal data files at once like this:

`cut -d , -f 1 seasonal/winter.csv seasonal/spring.csv seasonal/summer.csv seasonal/autumn.csv`

But typing the names of many files over and over is a bad idea: it wastes time, and sooner or later you will either leave a file out or repeat a file's name. To make your life better, the shell allows you to use **wildcards** to specify a list of files with a single expression. The most common wildcard is `*`, which means "match zero or more characters". Using it, we can shorten the `cut` command above to this:

`cut -d , -f 1 seasonal/*`

or:

`cut -d , -f 1 seasonal/*.csv`

In [81]:
!cut -d, -f 1 data/*

course
Probability
Statistics
Data Analysis
Machine Learning
course
Probability
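
The difference between `seasonal/*` and `seasonal/*.csv` only shows up when the directory holds something other than CSV files. The sketch below assumes a scratch directory with two tiny made-up data files and a `README.txt` (all hypothetical) and counts what each pattern expands to.

```shell
# Hypothetical scratch directory standing in for the course's seasonal/ folder.
workdir=$(mktemp -d)
mkdir "$workdir/seasonal"
printf 'Date,Tooth\n2017-01-25,wisdom\n' > "$workdir/seasonal/winter.csv"
printf 'Date,Tooth\n2017-04-12,canine\n' > "$workdir/seasonal/spring.csv"
printf 'not a data file\n' > "$workdir/seasonal/README.txt"

# seasonal/* matches every file, including README.txt.
set -- "$workdir"/seasonal/*
all_count=$#

# seasonal/*.csv matches only names ending in .csv.
set -- "$workdir"/seasonal/*.csv
csv_count=$#

# The shell expands the wildcard before cut starts, so cut simply
# receives two filenames (spring.csv, then winter.csv).
first_column=$(cut -d , -f 1 "$workdir"/seasonal/*.csv)

echo "all files: $all_count, csv files: $csv_count"
echo "$first_column"
```

Because the expansion happens in the shell, `cut` never sees the `*` itself; it behaves exactly as if you had typed the filenames by hand.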


### Other Wildcards

The shell has other wildcards as well, though they are less commonly used:

- `?` matches a single character, so `201?.txt` will match `2017.txt` or `2018.txt`, but not `2017-01.txt`.
- `[...]` matches any one of the characters inside the square brackets, so `201[78].txt` matches `2017.txt` or `2018.txt`, but not `2016.txt`.
- `{...}` matches any of the comma-separated patterns inside the curly brackets, so `{*.txt,*.csv}` matches any file whose name ends with `.txt` or `.csv`, but not files whose names end with `.pdf`. Do not put spaces after the commas: the shell treats the text on either side of a space as separate words, so the braces are not expanded.

Which expression would match `singh.pdf` and `johel.txt` but not `sandhu.pdf` or `sandhu.txt`?

`{singh.pdf,j*.txt}`
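
A quick way to test a pattern is to pass it to `echo`, which prints whatever the shell expands it to. The sketch below tries `?` and `[...]` on a few empty, made-up files in a scratch directory. Curly braces are left out because they are a feature of shells such as bash rather than of every `sh`.

```shell
# Hypothetical empty files in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch 2016.txt 2017.txt 2018.txt 2017-01.txt

# ? stands for exactly one character, so 2017-01.txt is not matched.
one_char=$(echo 201?.txt)

# [78] stands for one character that is either 7 or 8.
seven_or_eight=$(echo 201[78].txt)

echo "$one_char"         # 2016.txt 2017.txt 2018.txt
echo "$seven_or_eight"   # 2017.txt 2018.txt
```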

### Sorting Lines of Text

As its name suggests, `sort` puts data in order. By default it does this in ascending alphabetical order, but the flags `-n` and `-r` can be used to sort numerically and reverse the order of its output, while `-b` tells it to ignore leading blanks and `-f` tells it to fold case (i.e., be case-insensitive). Pipelines often use `grep` to get rid of unwanted records and then `sort` to put the remaining records in order.

In [3]:
!cut -d, -f 3 data/course_list.csv | grep -v semester | sort -r

 winter
 summer
 spring
 fall
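
The `-n` flag matters as soon as the values are numbers of different lengths. The sketch below uses a made-up file of three counts (`counts.txt` is a hypothetical name) to contrast the default ordering with `-n` and `-n -r`.

```shell
# Hypothetical file containing three numbers, one per line.
workdir=$(mktemp -d)
printf '10\n9\n100\n' > "$workdir/counts.txt"

# Default: compare as text, character by character, so "100" sorts before "9".
alphabetical=$(sort "$workdir/counts.txt")

# -n: compare by numeric value.
numeric=$(sort -n "$workdir/counts.txt")

# -n -r: numeric order, largest first.
largest_first=$(sort -n -r "$workdir/counts.txt")

echo "$alphabetical"    # 10, 100, 9 (one per line)
echo "$numeric"         # 9, 10, 100 (one per line)
echo "$largest_first"   # 100, 10, 9 (one per line)
```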


### Removing Duplicate Lines

Another command that is often used with `sort` is `uniq`, whose job is to remove duplicated lines. More specifically, it removes adjacent duplicated lines. If a file contains:
```
2017-07-03
2017-07-03
2017-08-03
2017-08-03
```
then `uniq` will produce:
```
2017-07-03
2017-08-03
```
but if it contains:
```
2017-07-03
2017-08-03
2017-07-03
2017-08-03
```
then `uniq` will print all four lines. The reason is that `uniq` is built to work with very large files. In order to remove non-adjacent duplicate lines from a file, it would have to keep the whole file in memory (or at least, all the unique lines seen so far). By only removing adjacent duplicates, it only has to keep the most recent unique line in memory.
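
This is why `uniq` is usually placed after `sort`. The sketch below writes the interleaved dates from the example above into a hypothetical scratch file (`dates.txt`) and runs `uniq` with and without sorting first.

```shell
# The interleaved dates from the example, in a hypothetical scratch file.
workdir=$(mktemp -d)
printf '2017-07-03\n2017-08-03\n2017-07-03\n2017-08-03\n' > "$workdir/dates.txt"

# No two identical lines are adjacent, so uniq removes nothing.
unsorted=$(uniq "$workdir/dates.txt")

# Sorting first brings the duplicates together, so uniq can drop them.
deduplicated=$(sort "$workdir/dates.txt" | uniq)

echo "$unsorted" | wc -l   # 4 lines
echo "$deduplicated"       # the two distinct dates, one per line
```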

Write a pipeline to:

- get the second column from `seasonal/winter.csv`,
- remove the word "Tooth" from the output so that only tooth names are displayed,
- sort the output so that all occurrences of a particular tooth name are adjacent; and
- display each tooth name once along with a count of how often it occurs.

The start of your pipeline is the same as the previous exercise:

`cut -d , -f 2 seasonal/winter.csv | grep -v Tooth`

Extend it with a `sort` command, and use `uniq -c` to display unique lines with a count of how often each occurs rather than using `uniq` and `wc`.

In [4]:
!cat data/course_list.csv

course, programming language, semester, year
Probability, R, fall, 2018
Statistics, R, winter, 2018-2019
Data Analysis, R and Python, spring, 2019
Machine Learning, Python, summer, 2019

In [9]:
!cut -d, -f 4 data/course_list.csv | grep -v year | sort -r

 2019
 2019
 2018-2019
 2018


In [11]:
!cut -d, -f 4 data/course_list.csv | grep -v year | sort -r | uniq -c

   2  2019
   1  2018-2019
   1  2018


### Saving the Output of a Pipe

The shell lets us redirect the output of a sequence of piped commands:

`cut -d , -f 2 seasonal/*.csv | grep -v Tooth > teeth-only.txt`

However, `>` must appear at the end of the pipeline. If we try to use it in the middle, like this:

`cut -d , -f 2 seasonal/*.csv > teeth-only.txt | grep -v Tooth`

then, in a shell such as bash, all of the output from `cut` is written to `teeth-only.txt`, so nothing is left for `grep`: it receives no input and prints nothing.

What happens if we put redirection at the front of a pipeline as in:

`> result.txt head -n 3 seasonal/winter.csv`

The command's output is redirected to the file as usual.
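
You can confirm this yourself with a sketch like the one below, which assumes a tiny made-up `winter.csv` in a scratch directory. It writes the same `head` output twice, once with the redirection in front and once at the end, and compares the two files with `cmp`.

```shell
# Hypothetical three-line data file in a scratch directory.
workdir=$(mktemp -d)
printf 'Date,Tooth\n2017-01-25,wisdom\n2017-02-19,canine\n' > "$workdir/winter.csv"

# Redirection at the end of a pipeline captures the final command's output.
cut -d , -f 2 "$workdir/winter.csv" | grep -v Tooth > "$workdir/end.txt"

# Redirection written before the command...
> "$workdir/front.txt" head -n 2 "$workdir/winter.csv"

# ...and the usual form with the redirection at the end.
head -n 2 "$workdir/winter.csv" > "$workdir/back.txt"

# cmp -s is silent and succeeds only if the two files are identical.
cmp -s "$workdir/front.txt" "$workdir/back.txt" && echo "identical"
cat "$workdir/end.txt"
```

Putting the redirection at the end is still the more readable choice, since it follows the left-to-right flow of the data.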

### Stop a Running Program

The commands and scripts that you have run so far have all executed quickly, but some tasks will take minutes, hours, or even days to complete. You may also mistakenly start a command that waits for input that never arrives, so that it appears to hang. If you decide that you don't want a program to keep running, you can type `Ctrl` + `C` to end it. This is often written `^C` in Unix documentation; note that the 'c' can be lower-case.

Run the command `head` with no arguments (so that it waits for input that will never come) and then stop it by typing `Ctrl` + `C`.

### Wrapping up

To wrap up, you will build a pipeline to find out how many records are in the shortest of the seasonal data files.

Use `wc` with appropriate parameters to list the number of lines in all of the seasonal data files. (Use a wildcard for the filenames instead of typing them all in by hand.)

In [14]:
!wc -l data/*

       4 data/course_list.csv
       2 data/top.csv
       6 total


Add another command to the previous one using a pipe to remove the line containing the word "total".

In [16]:
!wc -l data/* | grep -v total

       4 data/course_list.csv
       2 data/top.csv


Add two more stages to the pipeline that use `sort -n` and `head -n 1` to find the file containing the fewest lines.

In [17]:
!wc -l data/* | grep -v total | sort -n | head -n 1

       2 data/top.csv
