- Download the data for today's lesson and move it to your desktop.
$ PS1='$ '
$ cd
- At a high level, computers do four things:
- Run programs
- Store data
- Communicate with each other
- Interact with us
- Interacting with computers
- Prior to the 1950s, people interacted with computers by rewiring them.
- Since the 1980s, we have used icons, windows, pointers, and similar technologies (GUI, the graphical user interface).
- Today, speech recognition like Alexa or Google Home is becoming popular.
- Between the 1950s and the 1980s, most people used line printers.
- Line printers limited input and output to the letters, numbers, and punctuation found on a standard keyboard.
- This kind of interface is known as a command-line interface (CLI).
- This interface works on a read-evaluate-print loop:
- The user types a command and presses Enter.
- The computer reads, executes, and prints the output.
- The user types a new command, and the loop repeats.
- A shell is a program that runs on your computer and gives you a CLI to interact with the computer.
- The shell primarily calls other programs on the computer, but gives you a way to interact with those programs.
- Most modern Unix systems use the Bash shell by default.
- Why bother?
- With a few keystrokes, we can combine tools to create powerful workflows.
- It's easy to reproduce work that we have done before.
- It's often the easiest way to connect to a remote computer, including supercomputers.
- Nelle's Pipeline
- Nelle Nemo is a marine biologist who has just returned from a six-month survey of the North Pacific Gyre.
- She has collected 1,520 samples and needs to ...
- Run each sample through an assay machine that will measure the relative abundance of 300 different proteins.
- Calculate statistics for each protein separately using a program called goostats.
- Compare the statistics using a program called goodiff.
- Write up the results for the upcoming issue of Aquatic Goo Letters, which is due at the end of the month.
- It takes half an hour for the assay machine to process each sample, and two minutes to set each one up.
- With 8 assay machines in the lab, this step will only take two weeks.
- If she runs goostats and goodiff by hand, she will need to enter file names and click OK 46,370 times.
- At 30 seconds each, she will need more than two weeks.
- Nelle is not going to make her deadline.
- Let's see if the Bash terminal can help her.
- The $ sign is the prompt. It tells us the shell is waiting for input.
- We can find out who we are.
$ whoami
nelle
$ mycommand
mycommand: command not found
- The shell runs other commands or programs. If the program doesn't exist, it gives us an error.
- We can find out where we are.
$ pwd
/home/nelle/Desktop
- We see the name of the directory we are in and all the directories it is contained within.
- Each directory in the path is separated by a /.
- If you are using a Mac or Linux-based OS, the leading / is the "root" of the file system.
- We call this an absolute path because our location is described starting from the root of the whole file system.
- We can use this path to move to this location from anywhere on the computer.
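- For example, we can jump straight to the data-shell folder from anywhere (a sketch; it assumes the same /home/nelle layout shown above):
$ cd /home/nelle/Desktop/data-shell
$ pwd
/home/nelle/Desktop/data-shell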
- To change locations, use cd (change directory).
$ cd
$ pwd
/home/nelle
- cd by itself always takes us home.
- Give cd a path to change to a specific directory.
$ cd Desktop/data-shell/
$ pwd
/home/nelle/Desktop/data-shell
- Here we gave cd a relative path, since the path was interpreted starting from our current location.
- Notice the path started with Desktop instead of / (or the C drive).
- When using relative paths, our commands can only "see" the directories and files within the directory we are currently in.
- Absolute paths are like street addresses: you can find the location from anywhere.
- Relative paths are like local street directions: "from the park, turn south, then take the third left."
- We can "list" all the files within the current directory.
$ ls
creatures data molecules north-pacific-gyre notes.txt pizza.cfg solar.pdf writing
$ ls -F
creatures/ molecules/ notes.txt solar.pdf
data/ north-pacific-gyre/ pizza.cfg writing/
- -F is a flag (also called an option). It acts like a switch that changes some aspect of a command's behavior.
- ls -F puts a / after every directory name so we can distinguish directories from files.
- We can use the --help flag to find out about the rest of ls's flags.
$ ls --help
Usage: ls [OPTION]... [FILE]...
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.
Mandatory arguments to long options are mandatory for short options too.
-a, --all do not ignore entries starting with .
-A, --almost-all do not list implied . and ..
--author with -l, print the author of each file
-b, --escape print C-style escapes for nongraphic characters
--block-size=SIZE scale sizes by SIZE before printing them; e.g.,
'--block-size=M' prints sizes in units of
1,048,576 bytes; see SIZE format below
-B, --ignore-backups do not list implied entries ending with ~
.
.
.
- What happens if we use an option that doesn't exist?
$ ls -j
ls: invalid option -- 'j'
Try 'ls --help' for more information.
Challenge: What do the -R and -l options do with ls?
$ ls -R
.:
creatures data molecules north-pacific-gyre notes.txt pizza.cfg solar.pdf writing
./creatures:
basilisk.dat unicorn.dat
./data:
amino-acids.txt animals.txt morse.txt planets.txt sunspot.txt
animal-counts elements pdb salmon.txt
./data/animal-counts:
animals.txt
./data/elements:
Ac.xml Bi.xml Co.xml Fm.xml H.xml Md.xml Np.xml Pt.xml Sc.xml Te.xml Y.xml
Ag.xml Bk.xml Cr.xml Fr.xml In.xml Mg.xml N.xml Pu.xml Se.xml Th.xml Zn.xml
Al.xml Br.xml Cs.xml F.xml Ir.xml Mn.xml Os.xml P.xml Si.xml Ti.xml Zr.xml
Am.xml B.xml Cu.xml Ga.xml I.xml Mo.xml O.xml Ra.xml Sm.xml Tl.xml
Ar.xml Ca.xml C.xml Gd.xml Kr.xml Na.xml Pa.xml Rb.xml Sn.xml Tm.xml
As.xml Cd.xml Dy.xml Ge.xml K.xml Nb.xml Pb.xml Re.xml Sr.xml U.xml
At.xml Ce.xml Er.xml He.xml La.xml Nd.xml Pd.xml Rh.xml S.xml V.xml
Au.xml Cf.xml Es.xml Hf.xml Li.xml Ne.xml Pm.xml Rn.xml Ta.xml W.xml
Ba.xml Cl.xml Eu.xml Hg.xml Lr.xml Ni.xml Po.xml Ru.xml Tb.xml Xe.xml
Be.xml Cm.xml Fe.xml Ho.xml Lu.xml No.xml Pr.xml Sb.xml Tc.xml Yb.xml
./data/pdb:
aldrin.pdb cyclopropane.pdb methane.pdb quinine.pdb
ammonia.pdb ethane.pdb methanol.pdb strychnine.pdb
ascorbic-acid.pdb ethanol.pdb mint.pdb styrene.pdb
benzaldehyde.pdb ethylcyclohexane.pdb morphine.pdb sucrose.pdb
camphene.pdb glycol.pdb mustard.pdb testosterone.pdb
cholesterol.pdb heme.pdb nerol.pdb thiamine.pdb
cinnamaldehyde.pdb lactic-acid.pdb norethindrone.pdb tnt.pdb
citronellal.pdb lactose.pdb octane.pdb tuberin.pdb
codeine.pdb lanoxin.pdb pentane.pdb tyrian-purple.pdb
cubane.pdb lsd.pdb piperine.pdb vanillin.pdb
cyclobutane.pdb maltose.pdb propane.pdb vinyl-chloride.pdb
cyclohexanol.pdb menthol.pdb pyridoxal.pdb vitamin-a.pdb
./molecules:
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
./north-pacific-gyre:
2012-07-03
./north-pacific-gyre/2012-07-03:
goodiff NENE01736A.txt NENE01843A.txt NENE01978B.txt NENE02040Z.txt
goostats NENE01751A.txt NENE01843B.txt NENE02018B.txt NENE02043A.txt
NENE01729A.txt NENE01751B.txt NENE01971Z.txt NENE02040A.txt NENE02043B.txt
NENE01729B.txt NENE01812A.txt NENE01978A.txt NENE02040B.txt
./writing:
data haiku.txt thesis tools
./writing/data:
LittleWomen.txt one.txt two.txt
./writing/thesis:
empty-draft.md
./writing/tools:
format old stats
./writing/tools/old:
oldtool
$ ls -l
total 76
drwxrwxr-x 2 nelle nelle 4096 Sep 15 10:00 creatures
drwxrwxr-x 5 nelle nelle 4096 Sep 15 10:00 data
drwxrwxr-x 2 nelle nelle 4096 Sep 15 10:00 molecules
drwxrwxr-x 3 nelle nelle 4096 Sep 15 10:00 north-pacific-gyre
-rw-rw-r-- 1 nelle nelle 86 Sep 15 10:00 notes.txt
-rw-rw-r-- 1 nelle nelle 32 Sep 15 10:00 pizza.cfg
-rw-rw-r-- 1 nelle nelle 21583 Sep 15 10:00 solar.pdf
drwxrwxr-x 5 nelle nelle 4096 Sep 15 10:00 writing
- If you want to find out more about a command's options, you can use man or Google.
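- For example, the following opens the full manual page for ls (press q to quit the pager; no output is shown here):
$ man ls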
- We can use ls to peek inside a directory.
$ ls molecules/
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
$ cd data/
$ pwd
/home/nelle/Desktop/data-shell/data
- We know how to move down a level, but how do we go back up to the containing directory?
$ cd data-shell
bash: cd: data-shell: No such file or directory
- We need to use a shell shortcut.
$ cd ..
$ pwd
/home/nelle/Desktop/data-shell
- .. is shorthand for "the directory above where I am now."
$ ls -a -F
./ .bash_profile data/ north-pacific-gyre/ pizza.cfg writing/
../ creatures/ molecules/ notes.txt solar.pdf
- If we use the -a flag to display hidden files, we see that there is also a . entry; this is shorthand for the current directory.
- We can also put all the flags together when we type a command.
$ ls -aF
./ .bash_profile data/ north-pacific-gyre/ pizza.cfg writing/
../ creatures/ molecules/ notes.txt solar.pdf
- Other useful shortcuts when using the terminal:
- Tab auto-completes file and directory names.
- ~ stands for your home directory.
- The up arrow cycles through your history of commands.
- Ctrl-C stops execution and returns your prompt.
- cd - goes back to the previous directory you were in.
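- A quick sketch of two of these shortcuts, assuming we start in /home/nelle/Desktop/data-shell (cd - also prints the directory it switches back to):
$ cd ~
$ pwd
/home/nelle
$ cd -
/home/nelle/Desktop/data-shell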
- Let's take a look at Nelle's data.
$ ls north-pacific-gyre/2012-07-03/
goodiff NENE01736A.txt NENE01843A.txt NENE01978B.txt NENE02040Z.txt
goostats NENE01751A.txt NENE01843B.txt NENE02018B.txt NENE02043A.txt
NENE01729A.txt NENE01751B.txt NENE01971Z.txt NENE02040A.txt NENE02043B.txt
NENE01729B.txt NENE01812A.txt NENE01978A.txt NENE02040B.txt
- Note the date convention for the directory with the data files.
$ pwd
/home/nelle/Desktop/data-shell
- We're now ready to begin creating and destroying directories.
$ ls
creatures data molecules north-pacific-gyre notes.txt pizza.cfg solar.pdf writing
$ mkdir thesis
$ ls
creatures molecules notes.txt solar.pdf writing
data north-pacific-gyre pizza.cfg thesis
- Don't use spaces in your file names.
- You can use _ or - instead.
- There is nothing in thesis yet.
$ ls thesis/
- nano is a simple text editor that runs in the terminal; we can use it to create a file.
$ cd thesis
$ nano draft.txt
GNU nano 2.5.3 File: draft.txt Modified
It's not "publish or perish" anymore,
it's "share and thrive."
^G Get Help ^O Write Out ^W Where Is ^K Cut Text ^J Justify ^C Cur Pos
^X Exit ^R Read File ^\ Replace ^U Uncut Text ^T To Spell ^_ Go To Line
- Type Ctrl-X to exit.
- Type Y to save changes.
- Hit Enter to keep the file name the same.
$ ls
draft.txt
- To see what's inside the file, use cat.
$ cat draft.txt
It's not "publish or perish" anymore,
it's "share and thrive."
- To delete a file, use rm (remove).
$ rm draft.txt
$ ls
- rm is not like the recycle bin: removal is permanent, so be careful!
- Let's move up a directory and recreate the file.
$ cd ..
$ nano draft.txt
GNU nano 2.5.3 File: draft.txt Modified
"The problem with quotes on the Internet is that it is hard to verify their authenticity."
~ Abraham Lincoln
^G Get Help ^O Write Out ^W Where Is ^K Cut Text ^J Justify ^C Cur Pos
^X Exit ^R Read File ^\ Replace ^U Uncut Text ^T To Spell ^_ Go To Line
- Since draft.txt is no longer in thesis, we can remove the directory.
$ rm thesis/
rm: cannot remove 'thesis/': Is a directory
- rm is designed to remove individual files.
- To remove directories, we need to use the recursive option.
$ rm -r thesis/
$ ls
creatures draft.txt north-pacific-gyre pizza.cfg writing
data molecules notes.txt solar.pdf
- This removes directories and everything in them.
- For extra safety, we can (and should) use the interactive option.
$ mkdir thesis
$ rm -ri thesis/
rm: remove directory 'thesis/'? n
$ ls
creatures draft.txt north-pacific-gyre pizza.cfg thesis
data molecules notes.txt solar.pdf writing
- Let's move draft.txt back into the thesis directory.
$ mv draft.txt thesis/
$ ls thesis/
draft.txt
- draft.txt is not an informative name. mv also allows us to rename files.
$ mv thesis/draft.txt thesis/quotes.txt
$ ls thesis/
quotes.txt
- mv will silently overwrite existing files; the -i option makes mv ask before overwriting.
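- A sketch of what the -i prompt looks like (the file names here are hypothetical, and the exact wording can vary between systems):
$ mv -i old-notes.txt notes.txt
mv: overwrite 'notes.txt'? n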
- Let's move quotes.txt back to the current working directory.
$ mv thesis/quotes.txt .
$ ls
creatures molecules notes.txt quotes.txt thesis
data north-pacific-gyre pizza.cfg solar.pdf writing
$ ls thesis/
- cp (copy) works similarly to mv, except it makes a copy of the file instead of moving it.
$ cp quotes.txt thesis/quotations.txt
$ ls
creatures molecules notes.txt quotes.txt thesis
data north-pacific-gyre pizza.cfg solar.pdf writing
$ ls thesis/
quotations.txt
Challenge: Use what you have learned to create a file tree of all the files and directories in data-shell. Feel free to summarize portions of your tree with "...".
$ tree .
.
├── creatures
│ ├── basilisk.dat
│ └── unicorn.dat
├── data
│ ├── amino-acids.txt
│ ├── animal-counts
│ │ └── animals.txt
│ ├── animals.txt
│ ├── elements
│ │ ├── Ac.xml
│ │ ├── Ag.xml
│ │ ├── ...
│ ├── morse.txt
│ ├── pdb
│ │ ├── aldrin.pdb
│ │ ├── ammonia.pdb
│ │ ├── ascorbic-acid.pdb
│ │ ├── ...
│ ├── planets.txt
│ ├── salmon.txt
│ └── sunspot.txt
├── molecules
│ ├── cubane.pdb
│ ├── ethane.pdb
│ ├── methane.pdb
│ ├── octane.pdb
│ ├── pentane.pdb
│ └── propane.pdb
├── north-pacific-gyre
│ └── 2012-07-03
│ ├── goodiff
│ ├── goostats
│ ├── NENE01729B.txt
│ ├── NENE01736A.txt
│ ├── ...
├── notes.txt
├── pizza.cfg
├── quotes.txt
├── solar.pdf
├── thesis
│ └── quotations.txt
└── writing
├── data
│ ├── LittleWomen.txt
│ ├── one.txt
│ └── two.txt
├── haiku.txt
├── thesis
│ └── empty-draft.md
└── tools
├── format
├── old
│ └── oldtool
└── stats
14 directories, 198 files
- We now know a few basic commands.
- We will now see how to put basic commands together into powerful workflows.
- Unix systems are organized around a "small pieces, loosely joined" philosophy.
- Each tool does one thing (well) and one thing only.
- Multiple tools are put together to do complex tasks.
- We will begin in the molecules directory.
$ cd molecules/
$ ls
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
- We have several "protein database files." Let's do a word count on each of them.
$ wc *.pdb
20 156 1158 cubane.pdb
12 84 622 ethane.pdb
9 57 422 methane.pdb
30 246 1828 octane.pdb
21 165 1226 pentane.pdb
15 111 825 propane.pdb
107 819 6081 total
- wc gives us the number of lines, words, and characters in each file.
- * is a wildcard character. It matches any number of characters.
- wc -l gives us just the number of lines for each file.
$ wc -l *.pdb
20 cubane.pdb
12 ethane.pdb
9 methane.pdb
30 octane.pdb
21 pentane.pdb
15 propane.pdb
107 total
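- The wildcard can sit anywhere in a pattern; for example, p*.pdb matches only files whose names start with "p" (a sketch, assuming the same molecules directory as above):
$ wc -l p*.pdb
21 pentane.pdb
15 propane.pdb
36 total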
- If there were thousands of files, how would we know which one was the shortest?
- Begin by saving the line counts into a file using the redirect operator (>).
$ wc -l *.pdb > lengths.txt
$ ls
cubane.pdb ethane.pdb lengths.txt methane.pdb octane.pdb pentane.pdb propane.pdb
- How do we check that it worked?
$ cat lengths.txt
20 cubane.pdb
12 ethane.pdb
9 methane.pdb
30 octane.pdb
21 pentane.pdb
15 propane.pdb
107 total
- We can now sort the contents of the file using sort; the -n flag sorts numerically rather than alphabetically.
$ sort -n lengths.txt
9 methane.pdb
12 ethane.pdb
15 propane.pdb
20 cubane.pdb
21 pentane.pdb
30 octane.pdb
107 total
- The first file on the list is now the shortest.
- head can be used to get the first line (or first few lines) of a file.
$ sort -n lengths.txt > sorted-length.txt
$ head -n 1 sorted-length.txt
9 methane.pdb
- Saving all these intermediate files along the way is confusing.
- The pipe character (|) takes the output of one command and "pipes" it into the input of the next command.
$ sort -n lengths.txt | head -n 1
9 methane.pdb
- Let's recreate the whole workflow using pipes.
$ wc -l *.pdb
20 cubane.pdb
12 ethane.pdb
9 methane.pdb
30 octane.pdb
21 pentane.pdb
15 propane.pdb
107 total
$ wc -l *.pdb | sort -n
9 methane.pdb
12 ethane.pdb
15 propane.pdb
20 cubane.pdb
21 pentane.pdb
30 octane.pdb
107 total
$ wc -l *.pdb | sort -n | head -n 1
9 methane.pdb
- Notice how we can build the workflow one piece at a time, checking that each step works.
Challenge: All of Nelle's data files should have 300 lines. Change directory into data-shell/north-pacific-gyre/2012-07-03 and create a pipeline that checks if there are any files that are too short.
$ cd ../north-pacific-gyre/2012-07-03/
$ wc -l *.txt
300 NENE01729A.txt
300 NENE01729B.txt
300 NENE01736A.txt
300 NENE01751A.txt
300 NENE01751B.txt
300 NENE01812A.txt
300 NENE01843A.txt
300 NENE01843B.txt
300 NENE01971Z.txt
300 NENE01978A.txt
300 NENE01978B.txt
240 NENE02018B.txt
300 NENE02040A.txt
300 NENE02040B.txt
300 NENE02040Z.txt
300 NENE02043A.txt
300 NENE02043B.txt
5040 total
$ wc -l *.txt | sort -n
240 NENE02018B.txt
300 NENE01729A.txt
300 NENE01729B.txt
300 NENE01736A.txt
300 NENE01751A.txt
300 NENE01751B.txt
300 NENE01812A.txt
300 NENE01843A.txt
300 NENE01843B.txt
300 NENE01971Z.txt
300 NENE01978A.txt
300 NENE01978B.txt
300 NENE02040A.txt
300 NENE02040B.txt
300 NENE02040Z.txt
300 NENE02043A.txt
300 NENE02043B.txt
5040 total
$ wc -l *.txt | sort -n | head -n 5
240 NENE02018B.txt
300 NENE01729A.txt
300 NENE01729B.txt
300 NENE01736A.txt
300 NENE01751A.txt
- One of the files is 60 lines shorter than the others. Nelle sees that she did that assay at 8:00 am on a Monday morning. Someone was probably in using the machine over the weekend and forgot to reset it.
- We can also check whether any files are too long using tail.
$ wc -l *.txt | sort -n | tail -n 5
300 NENE02040B.txt
300 NENE02040Z.txt
300 NENE02043A.txt
300 NENE02043B.txt
5040 total
- All is well, but some of our file names end in "Z" instead of "A" or "B".
- Nelle informs us that her lab uses "Z" to indicate samples with missing information.
Challenge: Help Nelle find all the files that end with "Z".
$ ls *Z.txt
NENE01971Z.txt NENE02040Z.txt
- Nelle checks the log and realizes those two samples are missing depth recordings.
- She informs you there are other analyses she can use them for, so she can't delete them.
- Another wildcard expression will allow us to select just the files that end in "A" or "B".
$ ls *[AB].txt
NENE01729A.txt NENE01751A.txt NENE01843A.txt NENE01978B.txt NENE02040B.txt
NENE01729B.txt NENE01751B.txt NENE01843B.txt NENE02018B.txt NENE02043A.txt
NENE01736A.txt NENE01812A.txt NENE01978A.txt NENE02040A.txt NENE02043B.txt
- Loops allow us to execute the same command repeatedly.
- Loops reduce typing and allow for increased automation.
$ pwd
/home/nelle/Desktop/data-shell/creatures
- Let's create a loop that outputs the first 3 lines of each file in the directory.
$ for filename in basilisk.dat unicorn.dat
> do
> head -n 3 $filename
> done
COMMON NAME: basilisk
CLASSIFICATION: basiliscus vulgaris
UPDATED: 1745-05-02
COMMON NAME: unicorn
CLASSIFICATION: equus monoceros
UPDATED: 1738-11-24
- Explain the different parts of a loop and what they do (see the commented copy of the loop below).
- Talk about good variable naming conventions (filename versus x, or temperature).
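- A commented copy of the loop above (the comments are only for the reader; the shell ignores them):
for filename in basilisk.dat unicorn.dat   # "filename" is the loop variable; the list to loop over follows "in"
do                                          # marks the start of the loop body
    head -n 3 $filename                     # the body runs once per list item; $filename holds the current item
done                                        # marks the end of the loop body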
- Let's make our loop slightly more complicated.
$ for filename in *.dat
> do
> echo $filename
> head -n 100 $filename | tail -n 20
> done
basilisk.dat
CGGTACCGAA
AAGGGTCGCG
CAAGTGTTCC
CGGGACAATA
GTTCTGCTAA
GATAAGTATG
TGCCGACTTA
CCCGACCGTC
TAGGTTATAA
GGCACAACCG
CTTCACTGTA
GAGGTGTACA
AGGATCCGTT
GCGCGGGCGG
CAGTCTATGT
TTTTCGACAC
TGGACTGCTT
CCCTTTGAGG
GTGGATTTTT
CGTAACGGGT
unicorn.dat
CGGTACCGAA
AAGGGTCGCG
CAAGTGTTCC
CGGGACAATA
GTTCTGCTAA
GATAAGTATG
TGCCGACTTA
CCCGACCGTC
TAGGTTATAA
GGCACAACCG
CTTCACTGTA
GAGGTGTACA
AGGATCCGTT
GCGCGGGCGG
CAGTCTATGT
TTTTCGACAC
TGGACTGCTT
CCCTTTGAGG
GTGGATTTTT
CGTAACGGGT
- Discuss shell expansion and when it happens (see the echo example below).
- echo prints whatever it is given.
- Whitespace is used to separate the items in a list.
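- For example, echo lets us see what the shell expands a wildcard into before any command runs (output assumes only the two .dat files above are present):
$ echo *.dat
basilisk.dat unicorn.dat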
Challenge: What happens when we try to use wildcards to copy multiple files?
$ cp *.dat original-*.dat
cp: target 'original-*.dat' is not a directory
Why? How can we fix this with a loop?
- Since wildcards are expanded by the shell before the command is executed, we have confused cp: it receives the list of .dat files plus the literal name original-*.dat (nothing matches that pattern), and with more than two arguments cp expects the last one to be a directory.
$ for filename in *.dat
> do
> cp $filename original-$filename
> done
$ ls
basilisk.dat original-basilisk.dat original-unicorn.dat unicorn.dat
- Nelle is ready to begin analyzing her files using goostats.
- goostats calculates statistics from a protein file.
- The program is run by typing bash goostats and giving it two arguments: the file to analyze and the output file where the results should be saved.
- Let's build the loop in steps using echo.
$ cd ../north-pacific-gyre/2012-07-03/
$ for datafile in NENE*[AB].txt
> do
> echo $datafile
> done
NENE01729A.txt
NENE01729B.txt
NENE01736A.txt
NENE01751A.txt
NENE01751B.txt
NENE01812A.txt
NENE01843A.txt
NENE01843B.txt
NENE01978A.txt
NENE01978B.txt
NENE02018B.txt
NENE02040A.txt
NENE02040B.txt
NENE02043A.txt
NENE02043B.txt
$ for datafile in NENE*[AB].txt; do echo $datafile stats-$datafile; done
NENE01729A.txt stats-NENE01729A.txt
NENE01729B.txt stats-NENE01729B.txt
NENE01736A.txt stats-NENE01736A.txt
NENE01751A.txt stats-NENE01751A.txt
NENE01751B.txt stats-NENE01751B.txt
NENE01812A.txt stats-NENE01812A.txt
NENE01843A.txt stats-NENE01843A.txt
NENE01843B.txt stats-NENE01843B.txt
NENE01978A.txt stats-NENE01978A.txt
NENE01978B.txt stats-NENE01978B.txt
NENE02018B.txt stats-NENE02018B.txt
NENE02040A.txt stats-NENE02040A.txt
NENE02040B.txt stats-NENE02040B.txt
NENE02043A.txt stats-NENE02043A.txt
NENE02043B.txt stats-NENE02043B.txt
$ for datafile in NENE*[AB].txt; do echo bash goostats $datafile stats-$datafile; done
bash goostats NENE01729A.txt stats-NENE01729A.txt
bash goostats NENE01729B.txt stats-NENE01729B.txt
bash goostats NENE01736A.txt stats-NENE01736A.txt
bash goostats NENE01751A.txt stats-NENE01751A.txt
bash goostats NENE01751B.txt stats-NENE01751B.txt
bash goostats NENE01812A.txt stats-NENE01812A.txt
bash goostats NENE01843A.txt stats-NENE01843A.txt
bash goostats NENE01843B.txt stats-NENE01843B.txt
bash goostats NENE01978A.txt stats-NENE01978A.txt
bash goostats NENE01978B.txt stats-NENE01978B.txt
bash goostats NENE02018B.txt stats-NENE02018B.txt
bash goostats NENE02040A.txt stats-NENE02040A.txt
bash goostats NENE02040B.txt stats-NENE02040B.txt
bash goostats NENE02043A.txt stats-NENE02043A.txt
bash goostats NENE02043B.txt stats-NENE02043B.txt
- It looks right, so let's run it!
$ for datafile in NENE*[AB].txt; do bash goostats $datafile stats-$datafile; done
^C
- It's running ... I think. Use Ctrl-C to stop the loop.
- Let's try again, but add an echo statement so that we get a status update.
$ for datafile in NENE*[AB].txt; do echo $datafile; bash goostats $datafile stats-$datafile; done
NENE01729A.txt
NENE01729B.txt
NENE01736A.txt
NENE01751A.txt
NENE01751B.txt
NENE01812A.txt
NENE01843A.txt
NENE01843B.txt
NENE01978A.txt
NENE01978B.txt
NENE02018B.txt
NENE02040A.txt
NENE02040B.txt
NENE02043A.txt
NENE02043B.txt
- To make our workflow reproducible, we can save it into a text file (a script) and recall the entire workflow any time we need it.
- This makes it much faster to repeat a task the next time we need to do it.
- Let's create a script that returns the middle lines of a file.
$ cd ../../molecules/
$ ls
cubane.pdb lengths.txt octane.pdb propane.pdb
ethane.pdb methane.pdb pentane.pdb sorted-length.txt
$ nano middle.sh
GNU nano 2.5.3 File: middle.sh
head -n 15 octane.pdb | tail -n 5
[ Read 1 line ]
^G Get Help ^O Write Out ^W Where Is ^K Cut Text ^J Justify ^C Cur Pos
^X Exit ^R Read File ^\ Replace ^U Uncut Text ^T To Linter ^_ Go To Line
$ ls
cubane.pdb lengths.txt middle.sh pentane.pdb sorted-length.txt
ethane.pdb methane.pdb octane.pdb propane.pdb
- We now ask the shell to execute the script.
$ bash middle.sh
ATOM 9 H 1 -4.502 0.681 0.785 1.00 0.00
ATOM 10 H 1 -5.254 -0.243 -0.537 1.00 0.00
ATOM 11 H 1 -4.357 1.252 -0.895 1.00 0.00
ATOM 12 H 1 -3.009 -0.741 -1.467 1.00 0.00
ATOM 13 H 1 -3.172 -1.337 0.206 1.00 0.00
- It executed the commands in the file.
- This script will always give us the middle of the octane.pdb file. Let's make it more versatile.
$ nano middle.sh
GNU nano 2.5.3 File: middle.sh Modified
head -n 15 "$1" | tail -n 5
[ Read 1 line ]
^G Get Help ^O Write Out ^W Where Is ^K Cut Text ^J Justify ^C Cur Pos
^X Exit ^R Read File ^\ Replace ^U Uncut Text ^T To Linter ^_ Go To Line
- $1 is a special variable. It tells Bash to take the first command-line argument and put it in that place in the script.
- We surround $1 in quotes in case somebody (else) puts a space in their argument.
$ bash middle.sh cubane.pdb
ATOM 9 H 1 1.410 -1.631 0.942 1.00 0.00
ATOM 10 H 1 -0.262 -2.112 -1.024 1.00 0.00
ATOM 11 H 1 -2.224 -0.925 0.328 1.00 0.00
ATOM 12 H 1 -0.468 -0.501 2.315 1.00 0.00
ATOM 13 H 1 2.224 0.892 -0.134 1.00 0.00
$ bash middle.sh propane.pdb
ATOM 9 H 1 -0.914 0.551 -1.359 1.00 0.00
ATOM 10 H 1 -1.396 1.211 0.219 1.00 0.00
ATOM 11 H 1 -2.058 -0.345 -0.332 1.00 0.00
TER 12 1
END
- Let's make our script even more useful by allowing for a change in the range of lines we select from the file.
$ nano middle.sh
GNU nano 2.5.3 File: middle.sh
head -n "$2" "$1" | tail -n "$3"
[ Read 1 line ]
^G Get Help ^O Write Out ^W Where Is ^K Cut Text ^J Justify ^C Cur Pos
^X Exit ^R Read File ^\ Replace ^U Uncut Text ^T To Linter ^_ Go To Line
$ bash middle.sh pentane.pdb 15 5
ATOM 9 H 1 1.324 0.350 -1.332 1.00 0.00
ATOM 10 H 1 1.271 1.378 0.122 1.00 0.00
ATOM 11 H 1 -0.074 -0.384 1.288 1.00 0.00
ATOM 12 H 1 -0.048 -1.362 -0.205 1.00 0.00
ATOM 13 H 1 -1.183 0.500 -1.412 1.00 0.00
$ bash middle.sh pentane.pdb 16 2
ATOM 13 H 1 -1.183 0.500 -1.412 1.00 0.00
ATOM 14 H 1 -1.259 1.420 0.112 1.00 0.00
- It is important to use comments in your scripts so others can use them (and so you can six months later).
- # indicates a comment in Bash.
- The computer ignores lines that start with # (they are for human eyes only).
$ nano middle.sh
GNU nano 2.5.3 File: middle.sh Modified
# Name: middle.sh
# Usage: bash middle.sh filename end_line num_lines
# Description: Select lines from the middle of a file.
head -n "$2" "$1" | tail -n "$3"
^G Get Help ^O Write Out ^W Where Is ^K Cut Text ^J Justify ^C Cur Pos
^X Exit ^R Read File ^\ Replace ^U Uncut Text ^T To Linter ^_ Go To Line
- What if we want to process many files with one script?
Challenge: Use pipes to create a workflow that sorts files by length in both this directory and the data-shell/creatures directory.
$ wc -l *.pdb ../creatures/*.dat
20 cubane.pdb
12 ethane.pdb
9 methane.pdb
30 octane.pdb
21 pentane.pdb
15 propane.pdb
163 ../creatures/basilisk.dat
163 ../creatures/original-basilisk.dat
163 ../creatures/original-unicorn.dat
163 ../creatures/unicorn.dat
759 total
$ wc -l *.pdb ../creatures/*.dat | sort -n
9 methane.pdb
12 ethane.pdb
15 propane.pdb
20 cubane.pdb
21 pentane.pdb
30 octane.pdb
163 ../creatures/basilisk.dat
163 ../creatures/original-basilisk.dat
163 ../creatures/original-unicorn.dat
163 ../creatures/unicorn.dat
759 total
- Since the Bash shell is interactive, we can play with things until they work and then save our history to a file to make setting up a script faster.
- Let's create a script that sorts any number of files by length.
$ history | tail -n 10 > sorted.sh
- We can use the special variable $@, which stands for all of the arguments given to the script (we don't have to know how many there are).
$ nano sorted.sh
GNU nano 2.5.3 File: sorted.sh Modified
# Name: sorted.sh
# Usage: bash sorted.sh one_or_more_filenames
# Description: Sort filenames by their length.
wc -l "$@" | sort -n
^G Get Help ^O Write Out ^W Where Is ^K Cut Text ^J Justify ^C Cur Pos
^X Exit ^R Read File ^\ Replace ^U Uncut Text ^T To Linter ^_ Go To Line
$ bash sorted.sh *.pdb ../creatures/*.dat ../data/elements/*.xml
7 ../data/elements/Bk.xml
7 ../data/elements/Cf.xml
7 ../data/elements/Cm.xml
7 ../data/elements/Es.xml
7 ../data/elements/Fm.xml
7 ../data/elements/Lr.xml
7 ../data/elements/Md.xml
7 ../data/elements/No.xml
8 ../data/elements/Ac.xml
8 ../data/elements/At.xml
8 ../data/elements/La.xml
9 ../data/elements/Ag.xml
9 ../data/elements/Al.xml
9 ../data/elements/Am.xml
9 ../data/elements/Ar.xml
9 ../data/elements/As.xml
9 ../data/elements/Au.xml
9 ../data/elements/Ba.xml
.
.
.
.
9 ../data/elements/U.xml
9 ../data/elements/V.xml
9 ../data/elements/W.xml
9 ../data/elements/Xe.xml
9 ../data/elements/Yb.xml
9 ../data/elements/Y.xml
9 ../data/elements/Zn.xml
9 ../data/elements/Zr.xml
9 methane.pdb
12 ethane.pdb
15 propane.pdb
20 cubane.pdb
21 pentane.pdb
30 octane.pdb
163 ../creatures/basilisk.dat
163 ../creatures/original-basilisk.dat
163 ../creatures/original-unicorn.dat
163 ../creatures/unicorn.dat
1667 total
Homework (optional): Create a script that makes Nelle's analysis from the previous section reproducible.
# Calculate stats for Site A and Site B data files.
for datafile in NENE*[AB].txt
do
echo $datafile
bash goostats $datafile stats-$datafile
done
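- Save the loop above into a file (for example run-goostats.sh, a name chosen here just for illustration) and run it from the 2012-07-03 directory to reproduce the whole analysis with one command:
$ bash run-goostats.sh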
- One final skill that is useful on the command line is finding things.
- There are two main tools for finding things: grep and find.
$ cd ../writing/
$ cat haiku.txt
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.
With searching comes loss
and the presence of absence:
"My Thesis" not found.
Yesterday it worked
Today it is not working
Software is like that.
- grep searches for content within files.
- We use it by typing grep, followed by the search term, and then the file that we want to look in.
$ grep not haiku.txt
Is not the true Tao, until
"My Thesis" not found.
Today it is not working
$ grep The haiku.txt
The Tao that is seen
"My Thesis" not found.
- grep does not pay attention to word boundaries by default. We can change this by using the -w option.
$ grep -w The haiku.txt
The Tao that is seen
- With -w, the search term must match a whole word. If we want to search for a phrase containing spaces, we need to put it in quotes so the shell passes it to grep as a single search term.
$ grep -w "is not" haiku.txt
Today it is not working
- The -n option makes grep report the line number where each match was found.
$ grep -n "it" haiku.txt
5:With searching comes loss
9:Yesterday it worked
10:Today it is not working
- grep is case-sensitive by default. The -i option makes the search case-insensitive.
$ grep -nwi "the" haiku.txt
1:The Tao that is seen
2:Is not the true Tao, until
6:and the presence of absence:
- The -v option inverts our search, finding all the lines that don't contain "the".
$ grep -nwv "the" haiku.txt
1:The Tao that is seen
3:You bring fresh toner.
4:
5:With searching comes loss
7:"My Thesis" not found.
8:
9:Yesterday it worked
10:Today it is not working
11:Software is like that.
- These options are only the beginning ...
$ grep --help
Usage: grep [OPTION]... PATTERN [FILE]...
Search for PATTERN in each FILE or standard input.
PATTERN is, by default, a basic regular expression (BRE).
Example: grep -i 'hello world' menu.h main.c
Regexp selection and interpretation:
-E, --extended-regexp PATTERN is an extended regular expression (ERE)
-F, --fixed-strings PATTERN is a set of newline-separated strings
-G, --basic-regexp PATTERN is a basic regular expression (BRE)
-P, --perl-regexp PATTERN is a Perl regular expression
-e, --regexp=PATTERN use PATTERN for matching
-f, --file=FILE obtain PATTERN from FILE
-i, --ignore-case ignore case distinctions
-w, --word-regexp force PATTERN to match only whole words
-x, --line-regexp force PATTERN to match only whole lines
-z, --null-data a data line ends in 0 byte, not newline
Miscellaneous:
-s, --no-messages suppress error messages
-v, --invert-match select non-matching lines
-V, --version display version information and exit
--help display this help text and exit
.
.
.
- find is used to find files and directories themselves.
- To use find, type find, followed by where you want to look, and then the search terms/options.
$ find .
.
./data
./data/two.txt
./data/LittleWomen.txt
./data/one.txt
./thesis
./thesis/empty-draft.md
./haiku.txt
./tools
./tools/stats
./tools/old
./tools/old/oldtool
./tools/format
- We can look for only directories.
$ find . -type d
.
./data
./thesis
./tools
./tools/old
- We can search for only files.
$ find . -type f
./data/two.txt
./data/LittleWomen.txt
./data/one.txt
./thesis/empty-draft.md
./haiku.txt
./tools/stats
./tools/old/oldtool
./tools/format
- We can also search for files by name. Let's find all the text files.
$ find . -name *.txt
./haiku.txt
- What happened?
- Remember that the shell expands wildcards before executing commands: *.txt was expanded to haiku.txt (the only match in the current directory), so that is the only name find looked for. Here's how to fix it:
$ find . -name '*.txt'
./data/two.txt
./data/LittleWomen.txt
./data/one.txt
./haiku.txt
- We can use the $() construction (command substitution) to use grep and find together.
$ grep "FE" $(find .. -name '*.pdb')
../data/pdb/heme.pdb:ATOM 25 FE 1 -0.924 0.535 -0.518
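- $() runs the inner command first and substitutes its output into the outer command line. A smaller sketch (the count assumes the four .txt files found above):
$ echo Found $(find . -name '*.txt' | wc -l) text files
Found 4 text files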