# The Unix Shell: File and Directory Management

## Listing files

In [None]:
ls

In [None]:
ls scripts

### Include hidden files

In [None]:
ls -a

### Include hidden files but exclude current and parent directory (`.` and `..`)

In [None]:
ls -A

### Show details

In [None]:
ls -l

### Show only directories

In [None]:
ls -d */

#### Alternative using grep

In [None]:
ls -l | grep -E '^d'

#### Show only non-directories

In [None]:
ls -l | grep -Ev '^d'

### Sort by last modified time

In [None]:
ls -lt

### Human readable output

In [None]:
ls -lth

### Recursive listing

In [None]:
ls -R

## Globbing

The use of wildcards to specify Unix paths is known as globbing.

### `*` represents any number of characters (including none)

In [None]:
ls *ipynb

In [None]:
ls *Text*

### `?` represents exactly one character

In [None]:
ls data/iris??.csv

### Character sets

- `[abc]` represents a or b or c
- `[a-z]` represents any lower case character
- `!` as the first character inside the brackets negates the set, e.g. `[!abc]`

In [None]:
ls data/[X-Z]*

In [None]:
ls data/[!X-Z]*
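The patterns above can be tried safely in a scratch directory. The following is a minimal sketch, assuming bash; the file names are hypothetical and are created in a temporary directory from `mktemp -d`. `LC_ALL=C` is set only so that the range `[X-Z]` and the sort order behave the same on every system.

```shell
# Use the C locale so ranges such as [X-Z] mean plain ASCII ranges
export LC_ALL=C

# Scratch directory with a few hypothetical files
demo=$(mktemp -d)
cd "$demo"
touch iris01.csv iris02.csv iris100.csv Xray.txt Yeast.txt apple.txt

# `*` matches any run of characters, including none
star=$(echo iris*.csv)

# `?` matches exactly one character, so iris100.csv is excluded
two=$(echo iris??.csv)

# `[X-Z]` matches one character in the range, `[!X-Z]` one outside it
in_set=$(echo [X-Z]*)
not_in_set=$(echo [!X-Z]*)

echo "$star"
echo "$two"
echo "$in_set"
echo "$not_in_set"
```

The shell expands each pattern before the command runs, which is why `echo` is enough to see what a glob matches.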

## Directory navigation

### Show current directory

In [None]:
pwd

### Move to parent directory

In [None]:
cd ..

In [None]:
pwd

### Move back to last directory

In [None]:
cd -

### Move using relative addressing

In [None]:
cd data

In [None]:
pwd

### Move using absolute addressing

#### Move to yesterday's folder

In [None]:
cd /Users/cliburn/_teach/HTS_SummerCourse_2017/Materials/Computation/Wk4_Day2_PM/

In [None]:
pwd

#### Move back to today's folder

In [None]:
cd /Users/cliburn/_teach/HTS_SummerCourse_2017/Materials/Computation/Wk4_Day3_PM/

In [None]:
pwd

## Making and removing new directories

In [None]:
mkdir foo

In [None]:
ls

### Making intermediate directories automatically

In [None]:
mkdir -p a/b/c/d

In [None]:
ls -R a

### Deleting directories

In [None]:
rmdir foo

#### Only works if directory is empty

The `| cat` part is not needed on the command line. It is used here only so that Run All Cells does not halt, since Jupyter stops at a non-zero exit code. The `| cat` syntax "pipes" the output of `rmdir data` to the `cat` program, and the exit code reported for the cell is then that of `cat` rather than that of the failing `rmdir`.

In [None]:
rmdir data | cat
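The effect of `| cat` on the exit code can be seen in isolation. This is a small sketch, assuming bash with its default settings (the `pipefail` option is off); `false` is used here as a stand-in for any command that fails, such as `rmdir` on a non-empty directory.

```shell
# `false` always fails; record its exit status
false || alone=$?

# In a pipeline the shell reports the status of the last command (`cat`)
false | cat
piped=$?

echo "alone: $alone, piped: $piped"
```

With default settings this prints `alone: 1, piped: 0`, which is why the notebook keeps running past the failed `rmdir`.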

### Remove intermediate directories as well

In [None]:
rmdir -p a/b/c/d

In [None]:
ls

## Working with files

### Making an empty file

In [None]:
touch foo.txt

In [None]:
ls

### Deleting a file

In [None]:
rm foo.txt

In [None]:
ls

### Viewing a file

In [None]:
cat scripts/avg.sh

In [None]:
head -n 3 scripts/avg.sh

In [None]:
tail -n 3 scripts/avg.sh

### Start `tail` from a specified line number with `+`

In [None]:
tail -n +4 scripts/avg.sh
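To see exactly which lines `head` and `tail` return, it can help to run them on a file whose contents are known. The sketch below assumes a hypothetical five-line file created with `mktemp`; it also shows how the two commands might be combined to extract a range of lines.

```shell
# Build a hypothetical five-line file to experiment on
demo_file=$(mktemp)
printf 'line1\nline2\nline3\nline4\nline5\n' > "$demo_file"

first_two=$(head -n 2 "$demo_file")    # line1 and line2
last_two=$(tail -n 2 "$demo_file")     # line4 and line5
from_four=$(tail -n +4 "$demo_file")   # starts AT line 4: line4 and line5

# Combine the two to extract a range, here lines 2 to 3
middle=$(head -n 3 "$demo_file" | tail -n +2)

echo "$middle"
rm "$demo_file"
```

Note that `tail -n 4` and `tail -n +4` differ: the first gives the last four lines, the second gives everything from line 4 onward.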

### Copying and moving files

In [None]:
ls

#### Copying files

In [None]:
cp "The_Unix_Shell_01___File_and_Directory_Management.ipynb" foo.ipynb

In [None]:
ls

#### Copying directories (Recursive copy)

In [None]:
cp -R data data2

In [None]:
ls -R

#### Renaming a file

In [None]:
mv foo.ipynb foo_Copy.ipynb

In [None]:
ls

#### Move a file to a new location

In [None]:
mv foo_Copy.ipynb scripts

In [None]:
ls -R

## File compression and archival

### Combine multiple files into single file

In [None]:
ls data

In [None]:
man tar | head -n 20

In [None]:
tar -cvf data.tar data

In [None]:
rm -rf data/

In [None]:
ls data*

### Compress concatenated file

In [None]:
gzip data.tar

In [None]:
ls data*

### Uncompress

In [None]:
gunzip data.tar.gz

In [None]:
ls data*

### Recover original files

In [None]:
tar -xvf data.tar

In [None]:
ls data*

In [None]:
rm data.tar

### Concatenate and compress

In [None]:
tar -cvzf data.tar.gz data

In [None]:
rm -rf data/

In [None]:
ls data*

### Uncompress and recover

In [None]:
tar -xvzf data.tar.gz

In [None]:
ls data*

In [None]:
rm data.tar.gz
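Because the examples above delete `data/` right after archiving it, it may be worth listing the archive contents first. The following is a sketch under stated assumptions: it works in a temporary directory on a hypothetical `data_demo` folder, and uses the `-t` option of `tar`, which lists an archive without extracting it.

```shell
# Scratch area with a hypothetical data folder
work=$(mktemp -d)
cd "$work"
mkdir data_demo
echo "a,b" > data_demo/x.csv
echo "c,d" > data_demo/y.csv

# Concatenate and compress in one step
tar -czf data_demo.tar.gz data_demo

# -t lists the contents of the archive without extracting anything
listing=$(tar -tzf data_demo.tar.gz)
echo "$listing"

# Delete the originals only if the archive can be read back
if tar -tzf data_demo.tar.gz > /dev/null; then
    rm -rf data_demo
fi

# Recover the files and look at one of them
tar -xzf data_demo.tar.gz
restored=$(cat data_demo/x.csv)
echo "$restored"
```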

### Checksums

When working with genomic data, we deal with very large files. There is a small risk that these files will be corrupted over time or during data transfer. To check that a file has not changed, we use a "checksum" function. This is a function that generates a long, seemingly random number, called a checksum, from the contents of the file. When the file contents change, so will the checksum. In theory, two different files can generate the same checksum, but in practice the probability of this happening by accident is too small to worry about.

There are several different algorithms for generating checksums, and at least three Unix commands to do so (`cksum`, `md5sum` and `sha1sum`), but they all work very similarly for our purposes.

The strategy is:

- Generate and store a checksum together with a data file whose integrity you care about
- When you use or download the data, re-generate the checksum (using the same algorithm, e.g. MD5) and compare it with the stored checksum
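The two steps above can be sketched as a short script. This is only an illustration, assuming GNU `md5sum` is available (on macOS the equivalent command is usually `md5`); the file `reads_demo.txt` is hypothetical and is created in a temporary directory.

```shell
# Scratch area with a hypothetical data file
work=$(mktemp -d)
cd "$work"
echo "ACGT" > reads_demo.txt

# Step 1: generate a checksum and store it next to the data file
md5sum reads_demo.txt > reads_demo.txt.md5

# Step 2: later, re-generate the checksum and compare the hash fields
stored=$(cut -d ' ' -f 1 reads_demo.txt.md5)
current=$(md5sum reads_demo.txt | cut -d ' ' -f 1)

if [ "$stored" = "$current" ]; then
    result="unchanged"
else
    result="CHANGED"
fi
echo "$result"
```

Comparing the hash strings directly is useful when the checksum comes from elsewhere, for example a value published on a download page.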

In [None]:
cat hello.txt

In [None]:
cksum hello.txt

In [None]:
md5sum hello.txt

In [None]:
sha1sum hello.txt

### If we alter `hello.txt` in any way, the checksum will be different

In [None]:
cat hello.txt

In [None]:
md5sum hello.txt > hello.md5

In [None]:
cat hello.md5

Now make a small change to `hello.txt` (`Hello` becomes `Hella` on line 2).

In [None]:
cat > test1.txt << EOF
One, two buckle my shoe
Three, four lock the door
EOF

In [None]:
cat > hello.txt << EOF
1 Hello, bash
2 Hella, again
3 Hello
4 again
EOF

In [None]:
cat hello.txt

In [None]:
md5sum hello.txt

In [None]:
md5sum -c hello.md5

#### Restore original text

In [None]:
cat > hello.txt << EOF
1 Hello, bash
2 Hello, again
3 Hello
4 again
EOF

In [None]:
md5sum hello.txt > test.md5

In [None]:
md5sum -c hello.md5

### Checksums for multiple files

In [None]:
echo "aaaaa" > a.txt
echo "bbbbb" > b.txt
echo "ccccc" > c.txt

#### Generate md5 checksum file

In [None]:
md5sum a.txt b.txt c.txt > MD5_CHECKSUM

In [None]:
cat MD5_CHECKSUM

#### Modify one file

In [None]:
echo "bbcbb" > b.txt

#### Check file integrity for all files

In [None]:
md5sum -c MD5_CHECKSUM
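In a script, the exit status of `md5sum -c` is often more convenient than its printed output. The sketch below assumes GNU `md5sum`, whose `--quiet` option suppresses the `OK` lines so that only failures are reported; the files and the checksum file name `MD5_DEMO` are hypothetical and live in a temporary directory.

```shell
# Scratch area with two hypothetical files and their checksums
work=$(mktemp -d)
cd "$work"
echo "aaaaa" > a.txt
echo "bbbbb" > b.txt
md5sum a.txt b.txt > MD5_DEMO

# All files intact: exit status stays 0 and nothing is printed
before=0
md5sum -c --quiet MD5_DEMO || before=$?

# Modify one file, then check again
echo "bbcbb" > b.txt

# Only the failing file is reported; a non-zero status signals the mismatch
after=0
failed=$(md5sum -c --quiet MD5_DEMO 2>/dev/null) || after=$?

echo "status before: $before, status after: $after"
echo "$failed"
```

A non-zero status can then be used to stop a pipeline before corrupted data is processed.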