# Intro To Unix Basics

### Why learn Unix?

"Unix" is a term for a family of operating systems (just like "Windows"). Apple's macOS (formerly OS X) is part of the Unix family. You will also hear the term "Linux" a lot: Linux is a Unix-like operating system, and for our purposes it behaves the same way. Unix operating systems are very popular for running servers.

While you may be used to interacting with your laptop using a mouse and graphical interface, Unix servers like Stanford's Sherlock cluster do not work that way (graphical interfaces are a LOT of work to build and are less flexible). Instead, we need to interact with these servers through the command line or terminal. These servers can store enormous files (such as files containing millions of ATAC-seq reads), and they have many CPUs and even GPUs too, which allow us to perform powerful bioinformatic analyses that our laptops just aren't capable of.

Don't worry, it's easy once you get the hang of it, and it looks really cool to people who don't know what you are doing!

![title](images/slide1.png)

![title](images/slide2.png)

### Exercises

1. Log into the VM set up for training camp and open a terminal from the Jupyter homepage (click the "New" drop-down menu on the right and then select Terminal). Let's explore your home directory and beyond and get familiar with paths.

      When you first open a terminal, by default you will be in your home directory. The command prompt (the bottom line in the terminal, where you type commands) shows you where you are, via the `~` just before the `$` prompt. `~` is short-hand for "your home directory." Let's find out what the absolute path of your home directory is. The command for retrieving the absolute path of your current directory is `pwd`. Type this command into the command prompt line and then hit enter to run:

In [None]:
pwd

You should see the terminal print out one line of output, and then display a fresh command prompt line under that. This line of output is the absolute path of your home directory -- it should start with `/`, which is how we write "root directory" or the top-most level of a computer's filesystem, followed by `homes/` and then your username. We can interpret this as: "your current location is the directory `[your username]`, which is inside the directory `homes`, which is just under the root."
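
As a quick check of how `~` and absolute paths relate, here is a minimal sketch, assuming a bash-like shell. It uses two things this tutorial has not introduced: the `$HOME` variable (which normally holds the path that `~` stands for) and `$( ... )`, which captures a command's output in a variable. The variable names are just for illustration.

```shell
# The shell replaces ~ with the absolute path of your home directory
# before the command runs, so echo prints the expanded path.
home_from_tilde=$(echo ~)
echo "$home_from_tilde"

# $HOME is the variable that ~ stands for (assumed to be set, as it is
# in a normal login session), so this should print the same path.
echo "$HOME"

# pwd always reports an absolute path, so its output starts with /.
current_dir=$(pwd)
echo "$current_dir"
```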

2. Now let's look at what's inside your home directory. We can do this using the command `ls` (short for "list", or "list out the contents of a directory"). `ls` accepts the path of the directory whose contents you want to list, but running `ls` with no additional input is short-hand for "list out the contents of my current directory". Go ahead and type `ls` into the terminal and hit enter to run:

In [None]:
ls

You should see that the output of running `ls` is a list of files that match the files seen on your Jupyter homepage, including several Jupyter notebooks (files ending in `.ipynb`), a text (`.txt`) file and a couple JSON-formatted files (ending in `.json`), plus a directory called `images`.

3. This `ls` output is nice, but what if we want more information, such as when the file was last edited, who owns the file, or how big it is? To retrieve this info, we can use the same `ls` command, but this time we'll add the flag `-l`.

    Flags are special inputs to Unix commands that modify the way the command behaves in some pre-determined way. Flags always start with `-` and they're command-specific: which flags a command accepts (if any), and what each flag does, differs from command to command. But don't worry -- most commands don't need flags for normal usage, and for others, there will only be one or two flags that are good to know.

    The `ls` command can take the flag `-l`, which will cause it to output additional information about each file it lists. The way you add a flag to a command is by typing `[command] [flag]`, where the command name and flag are separated by a space. Let's try it out:

In [None]:
ls -l

You should see the same files listed as when you ran `ls` by itself, but with additional columns of information. The first column displays permissions -- who can read ("r"), write ("w"), and execute ("x") this file -- but don't worry about that for now. The third column is the file owner's username; the fifth is the size of the file in bytes (e.g. a file with 1000 bytes = 1 KB, 1000000 bytes = 1 MB, and 10^9 bytes = 1 GB); and the remaining columns show the date and time of last edit.
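
To make the columns concrete, the sketch below builds a tiny file of known size and lists it with `ls -l`. It is only an illustration: the directory from `mktemp -d` is a throwaway sandbox, `tiny.txt` is a made-up file name, and `printf`, `>`, and `$( ... )` (which stores a command's output in a variable) are not covered at this point in the tutorial.

```shell
# Work in a temporary sandbox directory so your own files are untouched.
start_dir=$(pwd)
sandbox=$(mktemp -d)
cd "$sandbox"

# Write 5 bytes into tiny.txt: the four letters ACGT plus a newline.
printf 'ACGT\n' > tiny.txt

# Long listing of that one file. Reading left to right: permissions,
# link count, owner, group, size in bytes, last-edit date/time, name.
long_listing=$(ls -l tiny.txt)
echo "$long_listing"

# Leave the sandbox and delete it.
cd "$start_dir"
rm -r "$sandbox"
```

The fifth column of the printed line should read `5`, matching the number of bytes written.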

Another flag `ls` accepts is `-t`, which tells `ls` to sort its output by edit time, so the most recent file is on top. Unix commands with multiple flags like `ls` can take those flags simultaneously, by typing them one after the other, separated by spaces. Run the command below and confirm that the files are listed in sorted order.

In [None]:
ls -l -t

The info printed due to the `-l` flag helps us to see that indeed, the files have been sorted, but we don't need it for the `-t` flag to work. Run `ls -t` by itself and check that even though the extra info columns aren't output, the files are still listed in sorted order.

In [None]:
ls -t
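
If the files on your VM were all edited at about the same time, the effect of `-t` can be hard to see. The sketch below sets up three files with clearly different edit times. It rests on a few assumptions: the sandbox directory from `mktemp -d` and the file names are made up, and `touch -t` (which sets an explicit timestamp) is not part of this tutorial. It also shows that single-letter flags can be written together, so `ls -lt` means the same as `ls -l -t`.

```shell
# Work in a temporary sandbox directory so your own files are untouched.
start_dir=$(pwd)
sandbox=$(mktemp -d)
cd "$sandbox"

# touch -t sets the edit time explicitly (format: YYYYMMDDhhmm).
touch -t 202001010000 oldest.txt
touch -t 202101010000 middle.txt
touch -t 202201010000 newest.txt

# Plain ls sorts by name; ls -t sorts by edit time, most recent first.
by_name=$(ls)
by_time=$(ls -t)
echo "$by_name"
echo "$by_time"

# Flags can be given separately or combined behind a single dash.
flags_separate=$(ls -l -t)
flags_combined=$(ls -lt)
echo "$flags_combined"

# Leave the sandbox and delete it.
cd "$start_dir"
rm -r "$sandbox"
```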

4. Now that we have checked out the home directory, let's investigate `images`. If you add as input to `ls` a path to a directory, it will print the contents of that directory, instead of defaulting to your current directory. Inputs are added to commands the same way flags are: after the command name, separated by a space. Try it out by running:

In [None]:
ls images

The output should show a bunch of files ending in `.png`, a common image file extension. Note that here, `images` is a very short example of a relative path. Since we are in your home directory, and `images` is also inside your home directory, the relative path from where we are to the `images` folder is simply `images`. We could also use the full path that includes `~`, the short-hand for your home directory, or we could use the absolute path (from root):

In [None]:
ls ~/images

In [None]:
ls /homes/[insert your username here!]/images
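
The same idea can be tried without knowing your username. This is a minimal sketch under stated assumptions: it creates its own throwaway directory with `mktemp -d` (which prints an absolute path), and it uses `mkdir` and `touch`, which are introduced later in this tutorial, to build a stand-in `images` folder with two made-up file names.

```shell
# Build a small sandbox that contains an images directory.
start_dir=$(pwd)
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir images
touch images/slide1.png
touch images/slide2.png

# Relative path: interpreted starting from the current directory.
relative_listing=$(ls images)

# Absolute path: starts at the root (/), so it works from anywhere.
absolute_listing=$(ls "$sandbox/images")

echo "$relative_listing"
echo "$absolute_listing"

# Leave the sandbox and delete it.
cd "$start_dir"
rm -r "$sandbox"
```

Both listings name the same two files, because both paths point at the same directory.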

Now, check that you can combine one or more flags and directory path inputs by running each of these commands:

In [None]:
ls -l images

In [None]:
ls -l -t images

In [None]:
ls -t -l images

In [None]:
ls images -l

Note that you get the same output regardless of the order of the flags and the input. This will not necessarily be true for all commands and flags, but in this case it is.

5. So far, we haven't moved out of our home directory. To change directories, we can use the command `cd`. `cd` expects one input: the path to the directory you want to move to. Let's move to the `images` directory:

In [None]:
cd images

After running this command, you will notice that on the command prompt where `~` used to be, you now see `~/images`. The command prompt is reflecting that your location has changed. Another way you can confirm that you moved directories successfully is by running:

In [None]:
pwd

Now that we have moved, `pwd` no longer prints the absolute path of the home directory (where we were). Now the output is the absolute path of `images`.

Since we are now inside `images`, to list the contents of this directory, we can simply run `ls` without any input path:

In [None]:
ls

Confirm that this outputs the same list of `.png` files that you saw earlier when you ran `ls images` from the home directory.

To get back to our home directory, we have a couple options for the path we can input to `cd`. We could use `~` or the absolute path to the home directory (`/homes/[your username]`). Another option is to use `..`, which means "up one directory" or "parent directory". Try all three options (you'll need to go back to images between each command!):

In [None]:
cd ~

In [None]:
cd /homes/[your username]

In [None]:
cd ..
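
Here is a compact version of the round trip, written as a sketch you can paste in one go. The sandbox directory from `mktemp -d` and the variable names are assumptions made for the example, and `mkdir` is introduced later in this tutorial.

```shell
# Build a sandbox that contains an images directory.
start_dir=$(pwd)
sandbox=$(mktemp -d)
mkdir "$sandbox/images"

# Move in using an absolute path, then record where we are.
cd "$sandbox/images"
inside_images=$(pwd)

# .. means "parent directory", so this moves up one level.
cd ..
after_dotdot=$(pwd)

# Go back down with a relative path, then up again with an absolute one.
cd images
cd "$sandbox"
after_absolute=$(pwd)

echo "$inside_images"
echo "$after_dotdot"
echo "$after_absolute"

# Return to where we started and delete the sandbox.
cd "$start_dir"
rm -r "$sandbox"
```

The last two printed paths are identical: `cd ..` and `cd` with the absolute path both land in the directory that contains `images`.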

![title](images/slide3.png)

### Exercises

Let's practice creating, moving, and deleting directories. First, make sure you're in your home directory:

In [None]:
cd ~

Let's create a new directory using the `mkdir` command. `mkdir` expects one input: the path of the directory you want to create. Let's call the new directory `practice`. Run `mkdir practice` below and then use `ls` to confirm that this new directory shows up in the `ls` output:

In [None]:
mkdir practice

In [None]:
ls

We can double-check that this new directory is empty (this should output nothing):

In [None]:
ls practice

Let's say we don't like what we named this directory and want to "rename" it by moving it. We can use `mv`, a command that expects two inputs: first, an existing directory's path, and second, a path to where you want that directory to be moved.

In [None]:
mv practice new_dir

The directory formerly known as `practice` is now `new_dir`. Confirm that the change happened:

In [None]:
ls

You'll notice that `practice` no longer exists. What if we wanted to copy a directory to a new location, but also keep the original directory? For that we can use `cp`, which expects the same inputs as `mv`. There is one key difference though: while `mv` and `cp` are used identically for files, copying a directory with `cp` requires the flag `-r` (short for "recursive"). Try the two versions of the command below -- the first one should give you an error complaining that the `-r` flag is missing.

In [None]:
cp new_dir new_dir_copy

In [None]:
cp -r new_dir new_dir_copy
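
The difference is easiest to see when the directory is not empty. The sketch below is an illustration under a few assumptions: it works in a throwaway directory from `mktemp -d`, `notes.txt` is a made-up file, and `||` (run the next command only if the previous one failed) is not covered in this tutorial.

```shell
# Build a sandbox with one directory that holds one file.
start_dir=$(pwd)
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir new_dir
touch new_dir/notes.txt

# Without -r, cp refuses to copy a directory and prints an error.
cp new_dir copy_without_r || echo "cp refused: a directory needs -r"
after_plain_cp=$(ls)

# With -r, cp copies the directory and everything inside it.
cp -r new_dir new_dir_copy
after_cp_r=$(ls)
copied_contents=$(ls new_dir_copy)

echo "$after_plain_cp"
echo "$after_cp_r"
echo "$copied_contents"

# Leave the sandbox and delete it.
cd "$start_dir"
rm -r "$sandbox"
```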

Confirm that you now can see both the original and the copied directory:

In [None]:
ls

In practice, bioinformaticians often use `mv` more than `cp`, because our files can be pretty huge, and having unnecessary copies of those files will waste precious disk space on a computer. One exception is when you are processing files and you want to leave a copy of the original files somewhere safe, just in case something goes wrong!

Let's imagine we don't need the directory `new_dir_copy` any more. We can delete directories using the command `rm`. `rm` expects one input: the path to the thing being deleted. And as with `cp`, when working with directories, `rm` also needs the flag `-r`:

In [None]:
rm -r new_dir_copy

In [None]:
ls

NOTE: `rm` is permanent!!! There is no undo. Be cautious whenever you are typing `rm` to make sure you don't delete something you need!

To finish, let's clean up:

In [None]:
rm -r new_dir

In [None]:
ls

### Additional Exercises To Experiment With

1. What would be the output of each of the `ls` commands below? Think about it and then run each command to check your guesses. Note that the `-p` flag for `mkdir` allows creation of nested directories.

In [None]:
mkdir -p exercise/a_dir/a_dir/a_dir/a_dir
cd exercise
cd a_dir/a_dir
touch a_dir/../../a_dir/a_dir/../hi.txt
cd ../..
echo "ls a_dir"
ls a_dir
echo "ls a_dir/a_dir"
ls a_dir/a_dir
echo "ls a_dir/a_dir/a_dir"
ls a_dir/a_dir/a_dir
echo "ls a_dir/a_dir/a_dir/a_dir"
ls a_dir/a_dir/a_dir/a_dir

cd ..
rm -r exercise

2.  What is the absolute path of `hi.txt` in the example below?

In [None]:
mkdir -p exercise/a_dir/a_dir/
echo "blah" > exercise/a_dir/a_dir/hi.txt
cat exercise/a_dir/a_dir/hi.txt

Check if you're right by running the command below (replacing with your guess), which will throw an error if your absolute path is incorrect.

In [None]:
cat /replace/with/absolute/path/to/hi.txt

In [None]:
rm -r exercise

3. What happens if you try... (Note that 2 of 3 will throw an error)

In [None]:
cd directory_that_doesnt_exist

In [None]:
ls directory_that_doesnt_exist

In [None]:
cd

4. What happens if you try... (Note that 2 of 5 will throw an error, and the duplicate command is meant to be run twice)

In [None]:
mkdir test_dir

mkdir test_dir

In [None]:
mkdir -p test_dir

In [None]:
mkdir dir1/dir2/dir3/dir4

In [None]:
mkdir -p dir1/dir2/dir3/dir4

In [None]:
rm -r test_dir

In [None]:
rm -r dir1/dir2/dir3/dir4

5. What happens when you try...

In [None]:
mkdir test_dir

mkdir test_dir2

mv test_dir test_dir2

In [None]:
mkdir other_test_dir

In [None]:
cp -r other_test_dir test_dir2

(Hint: the destination directory is not overwritten.) Can you imagine how this behavior could cause problems if you weren't aware the destination directory already existed?
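
To see why this matters, here is a small sketch in a throwaway directory. The sandbox from `mktemp -d` and the file `already_here.txt` are assumptions added for the example; the directory names follow the exercise above.

```shell
# Build a sandbox where the destination directory already exists
# and already contains a file.
start_dir=$(pwd)
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir test_dir
mkdir test_dir2
touch test_dir2/already_here.txt

# Because test_dir2 exists, mv does not rename test_dir to test_dir2.
# It moves test_dir inside test_dir2 instead.
mv test_dir test_dir2

# cp -r behaves the same way: the copy lands inside test_dir2.
mkdir other_test_dir
cp -r other_test_dir test_dir2

top_level=$(ls)
inside_destination=$(ls test_dir2)
echo "$top_level"
echo "$inside_destination"

# Leave the sandbox and delete it.
cd "$start_dir"
rm -r "$sandbox"
```

If you expected `test_dir2` to be a renamed copy of `test_dir`, you would now be looking for your files one level too high.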

![title](images/slide4.png)

### Exercises

Let's first create a directory to play around in and move into that directory:

In [None]:
mkdir example_files
cd example_files

If we run `ls` in this directory, we'll see that it is empty. Let's fill it up with some files! To start, let's create a couple of empty files using the command `touch`. `touch` expects one input: the path of a file you want to create. In our case, we can simply type the name we want the file to have; this input will be interpreted as a relative path, so the file will simply be created inside the current directory with that name.

In [None]:
touch file_a.txt

In [None]:
touch file_b.txt

In [None]:
touch file_c.txt

In [None]:
ls

The output of `ls` shows that we have successfully created 3 files. To check what is written in any of these files, we can use `cat`. `cat` takes as input the name of one (or more) files, and it prints out all the contents of the input file(s). We should see that `cat` doesn't output anything at the moment for any of our files because they're still empty:

In [None]:
cat file_a.txt
cat file_b.txt
cat file_c.txt

Let's add a few lines to each file using `echo` and double-arrows `>>`. The command `echo` does just that -- it prints out whatever you type as input. The arrows will then append the output of echo to a file. First, let's test echo out:

In [None]:
echo "Helloooooo"

In [None]:
echo "Repeat after me..."

Now let's use echo to add a line of text to `file_a.txt`:

In [None]:
echo "This is line 1." >> file_a.txt

In [None]:
cat file_a.txt

When you ran the line of code that sent the output of `echo` into the file, you might have noticed that nothing printed out. That's because the output of `echo` was "re-directed" so that it no longer went to your terminal, and instead it ended up inside the file. The output of `cat` should show that indeed, `"This is line 1."` is written inside the file.
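
A single arrow `>` also re-directs output to a file, and it appears in some later exercises in this tutorial. The difference is that `>` replaces whatever the file already contained, while `>>` adds to the end. The sketch below contrasts the two; the sandbox directory from `mktemp -d` and the file names are made up for the example, and `$( ... )` stores a command's output in a variable.

```shell
# Work in a temporary sandbox directory so your own files are untouched.
start_dir=$(pwd)
sandbox=$(mktemp -d)
cd "$sandbox"

# >> appends: each echo adds a new line to the end of the file.
echo "first line" >> appended.txt
echo "second line" >> appended.txt
appended_contents=$(cat appended.txt)

# > overwrites: the file is emptied before the new output is written.
echo "first line" > overwritten.txt
echo "second line" > overwritten.txt
overwritten_contents=$(cat overwritten.txt)

echo "$appended_contents"
echo "$overwritten_contents"

# Leave the sandbox and delete it.
cd "$start_dir"
rm -r "$sandbox"
```

Like `rm`, `>` has no undo, so double-check the file name before overwriting something you need.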

Let's add some more lines!

In [None]:
echo "This is line 2." >> file_a.txt
echo "This is line 3." >> file_a.txt
echo "This is line 4." >> file_a.txt
echo "This is line 5." >> file_a.txt

In [None]:
cat file_a.txt

`echo` is not the only command whose output you can send to a file. You can do this with any Unix command that prints output! Let's try it with the `cat` command we just ran to add the same lines to another file.

In [None]:
cat file_a.txt >> file_b.txt

In [None]:
cat file_a.txt
cat file_b.txt

Now `file_a.txt` and `file_b.txt` look identical. One additional feature of `cat` is that if you give multiple files as input, `cat` will print out the contents of each of them, one after another. Let's try it out, and then send the output of that command to our third file:

In [None]:
cat file_a.txt file_b.txt

In [None]:
cat file_a.txt file_b.txt >> file_c.txt

In [None]:
cat file_c.txt

You can imagine that for any Unix command you run, it can be helpful to save the output to a file. This can be better than letting the command print its output directly to the terminal for you to read, because when you close a terminal, that output will disappear. Thus, it is common practice in bioinformatic analysis to run commands and save the result of each command to a new file.

Another common occurrence in bioinformatic analysis is having really, really large files. Imagine you perform an ATAC-seq experiment and generate 100 million sequencing reads. Even if you could put all the information you ever need about each read into one line per read, a file containing all that information for all the reads would have 100 million lines! You and your computer do not want to print out 100 million lines of text for manual reading. So, here are a couple of commands that prevent you from needing to use `cat` on large files.

First, one thing that can be helpful to check for large files is the number of lines in the file. We can use the command `wc` for this. By default, `wc` will print out the number of lines, words (characters separated by whitespace), and bytes (for plain text, usually the same as the number of characters) in the file, but in practice we're usually mostly interested in the line count. We use the flag `-l` to tell `wc` to only print out the number of lines. (Note that here the flag `-l` means something different than it does for `ls`!)

Based on what we just did, we expect `file_a.txt` and `file_b.txt` to have 5 lines each, and `file_c.txt` to have 10 lines. Let's confirm this with `wc -l`:

In [None]:
wc -l file_a.txt

In [None]:
wc -l file_b.txt

In [None]:
wc -l file_c.txt
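
To see what the default `wc` output looks like next to `wc -l`, here is a small sketch on a made-up two-line file. The sandbox directory from `mktemp -d`, the file name `toy.bed`, and the use of `printf` to write tab-separated lines are assumptions for the example.

```shell
# Work in a temporary sandbox directory so your own files are untouched.
start_dir=$(pwd)
sandbox=$(mktemp -d)
cd "$sandbox"

# Two lines, three tab-separated columns each (\t is a tab, \n a newline).
printf 'chrI\t10\t20\nchrII\t5\t9\n' > toy.bed

# Default wc: number of lines, words, and bytes, followed by the file name.
wc_all=$(wc toy.bed)
echo "$wc_all"

# wc -l: only the number of lines, followed by the file name.
wc_lines=$(wc -l toy.bed)
echo "$wc_lines"

# Leave the sandbox and delete it.
cd "$start_dir"
rm -r "$sandbox"
```

For this file the three numbers are 2 lines, 6 words, and 21 bytes; the amount of spacing in front of the numbers can differ between systems.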

It would also be great if we could peek at just a small part of a really large file. For example, maybe we aren't sure what format the data is in, and being able to look at the first couple of lines of the file would tell us that. We can use the commands `head` or `tail` to do this. Both `head` and `tail` expect you to type one file name/path as input. By default, `head` will print the top 10 lines of the input file, while `tail` will print the bottom 10 lines. But you can change the number of lines printed using a flag: just add `-5` to print 5 lines, or `-3` to print 3 lines, etc. Let's try this on one of our test files:

In [None]:
head -3 file_a.txt

In [None]:
tail -2 file_a.txt
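
With a 5-line file you never see the default of 10 lines, so here is a sketch on a longer made-up file. It assumes the `seq` command is available (it prints a range of numbers, one per line) and works in a throwaway directory from `mktemp -d`. It also uses the longer spelling `-n 3`, which means the same as `-3` and is the form you will see most often in scripts.

```shell
# Work in a temporary sandbox directory so your own files are untouched.
start_dir=$(pwd)
sandbox=$(mktemp -d)
cd "$sandbox"

# Create a 20-line file containing the numbers 1 to 20.
seq 1 20 > numbers.txt

# No flag: head prints the first 10 lines.
first_default=$(head numbers.txt)

# -n 3 and -n 2 choose how many lines to print.
first_three=$(head -n 3 numbers.txt)
last_two=$(tail -n 2 numbers.txt)

echo "$first_default"
echo "$first_three"
echo "$last_two"

# Leave the sandbox and delete it.
cd "$start_dir"
rm -r "$sandbox"
```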

Finally, the commands `mv`, `cp`, and `rm` work the same with files as with directories -- for `cp` and `rm`, though, you no longer need the `-r` flag. Let's "rename" each file by moving them to a new name:

In [None]:
mv file_a.txt file_1.txt

In [None]:
mv file_b.txt file_2.txt

In [None]:
mv file_c.txt file_3.txt

In [None]:
ls

The output of `ls` should show you the new names for each of the files. Now we'll try out `cp`:

In [None]:
cp file_3.txt copy_of_file_3.txt

In [None]:
cat copy_of_file_3.txt

In [None]:
ls

And finally, let's clean up all the files we've created.

In [None]:
rm file_1.txt

In [None]:
rm file_2.txt

In [None]:
rm file_3.txt

In [None]:
rm copy_of_file_3.txt

![title](images/slide5.png)

### Additional Exercises To Experiment With

Make sure that you're still in the directory we've been using so far:

In [None]:
cd ~/example_files

We're going to use a file for the next few exercises that you'll create yourself later on in training camp. This file is a bed file, meaning it has the "BED" format -- a common format for bioinformatics files. It is still a text file, but the "bed file" designation means that we expect there to be columns in this file. The first column will be the name of a chromosome and the second and third columns will contain integer coordinates, such as the beginning and end coordinates of an ATAC-seq peak, a gene, or really any biologically meaningful region.

The location of this file is `/outputs/all_merged.peaks.bed` (this is an absolute path). To copy it over to where you are in the `example_files` directory, use `cp`:

In [None]:
cp /outputs/all_merged.peaks.bed all_merged.peaks.bed

This file describes the locations of a bunch of ATAC-seq peaks. Each line represents one peak. This file is too big for a human to read through all the lines -- use `head` and `wc -l` to peek at the contents of the file and check how long it is (a.k.a. the number of peaks):

In [None]:
head all_merged.peaks.bed

In [None]:
wc -l all_merged.peaks.bed

Let's find out how many peaks are on chromosome 16. In this file, chromosome 16 is written as "chrXVI". We can use `grep` to get only the lines in this file that have "chrXVI" in them. The way `grep` works is by scanning each line of the file, one at a time, for the input "pattern" (the first input given). If it finds the pattern, it prints the line; otherwise, it doesn't print the line. If we use "chrXVI" as the `grep` pattern, and input our file of ATAC-seq peaks, the output should be only the lines that contained chrXVI -- only the peaks on chromosome 16.
 
We'll send the output of grep to a file because it will be fairly long:

In [None]:
grep "chrXVI" all_merged.peaks.bed > chr16.peaks.bed

Let's double-check that we got the data we wanted to by peeking at the results with `head` and `tail`, and then counting the number of lines in the file to find the number of peaks on chromosome 16.

In [None]:
head chr16.peaks.bed

In [None]:
tail chr16.peaks.bed

In [None]:
wc -l chr16.peaks.bed

Another way to do this is using the `grep` flag `-c`, which asks grep to simply output the number of lines that had a match to the input pattern.

In [None]:
grep -c "chrXVI" all_merged.peaks.bed

Confirm that this gives you the same result as the previous command.
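
If you do not have the peaks file at hand, the same two `grep` calls can be tried on a toy file. Everything in this sketch is made up for illustration: the sandbox directory from `mktemp -d`, the file name `toy.peaks.bed`, and the four peaks in it.

```shell
# Work in a temporary sandbox directory so your own files are untouched.
start_dir=$(pwd)
sandbox=$(mktemp -d)
cd "$sandbox"

# A toy bed file with four peaks (columns are separated by tabs).
printf 'chrXV\t100\t200\nchrXVI\t300\t450\nchrXVI\t500\t650\nchrI\t10\t90\n' > toy.peaks.bed

# grep prints every line that contains the pattern.
matching_lines=$(grep "chrXVI" toy.peaks.bed)
echo "$matching_lines"

# grep -c prints only how many lines matched.
match_count=$(grep -c "chrXVI" toy.peaks.bed)
echo "$match_count"

# Leave the sandbox and delete it.
cd "$start_dir"
rm -r "$sandbox"
```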

What would have happened if you used an incomplete chromosome name to `grep`, such as "chrXV", that could also match other chromosome names? How would you know whether or not this has happened?
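
One way to investigate this (try answering the question yourself first!) is sketched below on a made-up file. It uses the `grep` flag `-w`, which this tutorial does not cover: it only counts a match when the pattern is a whole word rather than part of a longer one. The sandbox directory and file contents are assumptions for the example.

```shell
# Work in a temporary sandbox directory so your own files are untouched.
start_dir=$(pwd)
sandbox=$(mktemp -d)
cd "$sandbox"

# One peak on chrXV and two peaks on chrXVI (tab-separated columns).
printf 'chrXV\t100\t200\nchrXVI\t300\t450\nchrXVI\t500\t650\n' > toy.peaks.bed

# "chrXV" is also the start of "chrXVI", so all three lines match.
loose_count=$(grep -c "chrXV" toy.peaks.bed)

# -w requires a whole-word match, so only the true chrXV line counts.
exact_count=$(grep -c -w "chrXV" toy.peaks.bed)

echo "$loose_count"
echo "$exact_count"

# Leave the sandbox and delete it.
cd "$start_dir"
rm -r "$sandbox"
```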

Next, let's try out using `sed`, which is a very powerful command that (in its most straightforward usage) looks for patterns in each line similar to grep, but instead of printing out lines when it finds matches, it replaces the pattern match (by default, only the first match on each line) with whatever you tell it to use as the replacement.

As an example, say we want to replace the "chrXVI" at the beginning of each line in our file of chromosome 16 peaks with "chr16". We can do that with `sed` like so:

In [None]:
sed 's/chrXVI/chr16/' chr16.peaks.bed > chr16.fixedname.peaks.bed

In [None]:
head chr16.fixedname.peaks.bed
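
The pieces of the `sed` expression are `s/pattern/replacement/`, where `s` stands for "substitute". The sketch below tries it on made-up files in a throwaway directory (both are assumptions for the example). It also shows the optional `g` at the end of the expression, which replaces every match on a line instead of only the first one.

```shell
# Work in a temporary sandbox directory so your own files are untouched.
start_dir=$(pwd)
sandbox=$(mktemp -d)
cd "$sandbox"

# Rename the chromosome in a two-line toy bed file.
printf 'chrXVI\t300\t450\nchrXVI\t500\t650\n' > toy.chr16.bed
sed 's/chrXVI/chr16/' toy.chr16.bed > toy.renamed.bed
renamed=$(cat toy.renamed.bed)
echo "$renamed"

# sed wrote its result to the new file; the input file is unchanged.
original=$(cat toy.chr16.bed)

# Without g only the first match on a line is replaced; with g, all.
printf 'AAA\n' > repeats.txt
first_only=$(sed 's/A/T/' repeats.txt)
every_match=$(sed 's/A/T/g' repeats.txt)
echo "$first_only"
echo "$every_match"

# Leave the sandbox and delete it.
cd "$start_dir"
rm -r "$sandbox"
```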

Finally, let's try out using `awk` on this file. `awk` is designed for files that have columns, so it works great with bed files! At its simplest, `awk` can be used to isolate a single column from a file. Try it with a few columns in our bed file:

In [None]:
awk '{ print $1 }' chr16.fixedname.peaks.bed > awk_output1.txt

In [None]:
head awk_output1.txt

In [None]:
awk '{ print $2 }' chr16.fixedname.peaks.bed > awk_output2.txt
head awk_output2.txt

All you need to do is change the number after the `$` to the number of the column you want to print out.

`awk` can also do column math! In our peaks file, the width of each peak is equal to the coordinate in column 3 minus the coordinate in column 2. We can calculate this for each line:

In [None]:
awk '{ print $3 - $2 }' chr16.fixedname.peaks.bed > awk_output3.txt

In [None]:
head awk_output3.txt
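
Because the real file is long, it can help to check the arithmetic on a file small enough to verify by hand. The sketch below uses two made-up peaks in a throwaway directory (both are assumptions for the example). The last command prints two things per line by separating them with a comma, which `awk` outputs with a space in between.

```shell
# Work in a temporary sandbox directory so your own files are untouched.
start_dir=$(pwd)
sandbox=$(mktemp -d)
cd "$sandbox"

# Two toy peaks: chromosome, start coordinate, end coordinate.
printf 'chr16\t300\t450\nchr16\t500\t725\n' > toy.bed

# Column 1 only.
first_column=$(awk '{ print $1 }' toy.bed)

# Peak width = column 3 minus column 2.
widths=$(awk '{ print $3 - $2 }' toy.bed)

# Chromosome name and width side by side.
name_and_width=$(awk '{ print $1, $3 - $2 }' toy.bed)

echo "$first_column"
echo "$widths"
echo "$name_and_width"

# Leave the sandbox and delete it.
cd "$start_dir"
rm -r "$sandbox"
```

The widths come out as 150 and 225, which you can confirm from the coordinates above.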

`awk` in particular is really powerful and has so many uses beyond printing columns. If you have time, read through this notebook full of examples: https://colab.research.google.com/drive/1VOC7CVLWNvj59VAlpazbQI1dKM_YUXAN?usp=sharing