## Introduction to Unix: Working with Files and Directories

### Questions:
- How can I view and search file contents?
- How can I create, copy and delete files and directories?
- How can I control who has permission to modify a file?
- How can I repeat recently used commands?
### Objectives:
- View, search within, copy, move, and rename files. Create new directories.
- Use wildcards (`*`) to perform operations on multiple files.
- Use the `history` command to view recently used commands.

In [None]:
# Make sure you have the most up to date code
%cd ~/be487-fall-2024
!git stash
!git pull

In [None]:
'''
Set a variable for your netid
Replace "MY_NETID" with your actual netid
Notice that the `cd` command is a magic command in Jupyter Notebooks and starts with %.
Run this cell to set your netid
'''
netid = "MY_NETID"
work_dir = "/xdisk/bhurwitz/bh_class/" + netid
%cd $work_dir

### Section 1: Working with Files

### Wildcards

Now that we know how to navigate around our directory structure, let's
start working with **our sequencing files**. We did a sequencing experiment and have four result files, which are stored in our `untrimmed_fastq` directory. 

Navigate to your `untrimmed_fastq` directory.

In [None]:
'''
Type the commands below, and run the cell
%cd /xdisk/bhurwitz/bh_class/$netid/exercises/data/untrimmed_fastq
'''

#### What is a fastq file? 

We are interested in looking at the FASTQ files in this directory. We can list all files with the `.fastq` extension using the command:

```
$ ls *.fastq
```

In [None]:
'''
Type the command below, and run the cell
!ls *.fastq
'''

#### Wow, that was wild! What is a wildcard?

You should see something like this:

```
JC1A_R1.fastq JC1A_R2.fastq JP4D_R1.fastq JP4D_R2.fastq
```

The `*` character is a special type of character called a wildcard, which can be used to represent any number of any type of character. 
Thus, `*.fastq` matches every file that ends with `.fastq`. 

This command: 

```
$ ls *R1.fastq
```

```
JC1A_R1.fastq JP4D_R1.fastq
```

lists only the files that end with `R1.fastq`.
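
To experiment with the matching rule without touching any files, here is a minimal Python sketch. It uses the standard-library `fnmatch` module, which approximates shell-style wildcards. The `matching` helper is a hypothetical name used only for illustration, and the file names are copied from the example listing above.

```python
from fnmatch import fnmatchcase

# File names copied from the example `ls *.fastq` output above.
filenames = ["JC1A_R1.fastq", "JC1A_R2.fastq", "JP4D_R1.fastq", "JP4D_R2.fastq"]


def matching(pattern, names):
    """Return the names that match a shell-style wildcard pattern (hypothetical helper)."""
    return [name for name in names if fnmatchcase(name, pattern)]


# `*` stands for any run of characters, so this matches all four names.
all_fastq = matching("*.fastq", filenames)

# Only names whose last characters are exactly `R1.fastq` match here.
r1_only = matching("*R1.fastq", filenames)

print(all_fastq)
print(r1_only)
```

Changing the pattern to `JC1A*` would select by sample name instead of by read direction.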

### Section 2: Command History

If you want to repeat a command that you've run recently, you can access previous commands using the up arrow on your keyboard to go back to the most recent command. Likewise, the down arrow takes you forward in the command history.

A few more useful shortcuts: 

- <kbd>Ctrl</kbd>+<kbd>C</kbd> will cancel the command you are writing, and give you a 
fresh prompt.
- <kbd>Ctrl</kbd>+<kbd>R</kbd> will do a reverse-search through your command history.  This
is very useful.
- <kbd>Ctrl</kbd>+<kbd>L</kbd> or the `clear` command will clear your screen.

You can also review your recent commands with the `history` command, by entering:

```
$ history
```

In [None]:
'''
Type the command below, and run the cell
%history
'''

### Section 3: Examining Files

We now know how to switch directories, and look at the contents of directories, but how do we look at the contents of files?

One way to examine a file is to print out on the screen all of the
contents using the program `cat`.
```
$ cat JC1A_R2.fastq
```

We aren't going to run this here, because the file is large and printing all of its contents may freeze Jupyter.

#### I love my cat, but is less more...

`cat` is a terrific program, but when the file is really big (as our files are), it can be annoying to use. If you do run it on the command line, you can always press <kbd>Ctrl</kbd>+<kbd>C</kbd> to stop the command.

The program, `less`, is useful for this case. `less` opens the file as read only, and lets you navigate through it. The navigation commands are identical to the `man` program.

Enter the following command:

```
$ less JC1A_R2.fastq
```

Some navigation commands in `less`

| key     | action |
| ------- | ---------- |
| <kbd>Space</kbd> | to go forward |
|  <kbd>b</kbd>    | to go backward |
|  <kbd>g</kbd>    | to go to the beginning |
|  <kbd>G</kbd>    | to go to the end |
|  <kbd>q</kbd>    | to quit |

`less` also gives you a way of searching through files. Use the
"/" key to begin a search. Enter the word you would like
to search for and press `enter`. The screen will jump to the next location where that word is found. 

**Shortcut:** If you hit "/" then "enter", `less` will repeat
the previous search. `less` searches from the current location and
works its way forward. So if you are at the end of the file and search for the sequence "CAA", `less` will not find it. You either need to go to the beginning of the file (by typing `g`) and search again using `/`, or you can use `?` to search backwards in the same way you used `/` previously.

Note that we are not going to use `less` in the Jupyter notebook, because it is an interactive command that can only be run on the command line. Still, it is useful to know once you move into the real world and the command line!

#### Head and tail!

Another way to explore files is to use the `head` and `tail` commands. `head` shows the first lines of a file and `tail` shows the last lines (10 lines by default for both). Use the `-n` flag to set the number of lines you want to look at.

In [None]:
'''
Type the command below, and run the cell
!head -n 1 JC1A_R2.fastq
''' 

In [None]:
'''
Type the command below, and run the cell
!tail -n 5 JC1A_R2.fastq
''' 
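
For comparison, here is a rough Python sketch of what `head` and `tail` do. The function names and the 20-line demo file are made up for illustration; this is not how the real commands are implemented. Both helpers read the file line by line, so they never hold a large file in memory all at once.

```python
import tempfile
from collections import deque
from itertools import islice
from pathlib import Path


def head(path, n=10):
    """Return the first n lines of a file, like `head -n`."""
    with open(path) as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


def tail(path, n=10):
    """Return the last n lines of a file, like `tail -n`."""
    with open(path) as handle:
        # A deque with maxlen=n keeps only the most recent n lines seen.
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


# Build a small throwaway file so the example does not depend on the class data.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.txt"
    demo.write_text("".join(f"line {i}\n" for i in range(1, 21)))
    first_line = head(demo, 1)
    last_five = tail(demo, 5)

print(first_line)
print(last_five)
```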

### Section 4: Details on the FASTQ format

Since we are learning while using FASTQ files, let's understand what they are. Although the [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) format looks complicated (and it is), it's easy to understand with a little decoding. Some rules about the format include:

|Line|Description|
|----|-----------|
|1|Always begins with '@' and then information about the read|
|2|The actual DNA sequence|
|3|Always begins with a '+' and sometimes repeats the info from line 1|
|4|Has a string of characters which represent the quality scores; must have same number of characters as line 2|

We can view the first complete read in one of the files in our dataset by using `head` to look at the first four lines.

```
$ head -n 4 JC1A_R2.fastq
```

```
@MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:13417:1998 2:N:0:TCGNAG
CGCGATCAGCAGCGGCCCGGAACCGGTCAGCCGCGCCNTGGGGTTCAGCACCGGCNNGGCGAAGGCCGCGATCGCGGCGGCGGCGATCAGGCAGCGCAGCAGCAGGAGCCACCAGGGCGTGCGGTCGGGCGTCCGTTCGGCGTCCTCGCGCCCCAGCAGCAGGCGCACGCCAGGGAATCCGACCCGCCGCCGGCTCGGCCGCGTCNCCCGCNCCCGCCCCCCGAGCACCCGNAGCCNCNCCACCGCCGCCC
+
1>AAADAAFFF1G11AA0000AAFE/AAE0FBAEGGG#B/>EF/EGHHHHHHG?C##???/FE/ECHCE?C<FGGGGCCCGGGG@?AE.BFFEAB-9@@@FFFFFEEEEFBFF--99A-;@B=@A@@?@@>-@@--/B--@--@@-F----;@--:F---9-AB9=-@-9E-99A-;:BF-9-@@-;@-@#############################################################
```
 
Most of the nucleotides in this read are called, although it contains some unknown bases (`N`). How much we can trust the calls depends on the quality scores in line 4.
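
The four-line layout can also be checked in code. The sketch below splits one record into its parts and applies the rules from the table. `parse_record` is a hypothetical helper, not part of any library. The read is the one shown above, shortened here to its first 39 bases so that it fits on the page.

```python
# One FASTQ record: the read shown above, shortened to its first 39 bases.
record_text = """@MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:13417:1998 2:N:0:TCGNAG
CGCGATCAGCAGCGGCCCGGAACCGGTCAGCCGCGCCNT
+
1>AAADAAFFF1G11AA0000AAFE/AAE0FBAEGGG#B
"""


def parse_record(text):
    """Split a single four-line FASTQ record into named fields (hypothetical helper)."""
    header, sequence, separator, quality = text.strip().split("\n")
    if not header.startswith("@"):
        raise ValueError("line 1 must begin with '@'")
    if not separator.startswith("+"):
        raise ValueError("line 3 must begin with '+'")
    if len(sequence) != len(quality):
        raise ValueError("lines 2 and 4 must have the same length")
    return {"name": header[1:], "sequence": sequence, "quality": quality}


record = parse_record(record_text)
unknown_bases = record["sequence"].count("N")

print(record["name"])
print(len(record["sequence"]), "bases,", unknown_bases, "unknown")
```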
 
Line 4 shows the quality for each nucleotide in the read. Quality is interpreted as the probability of an incorrect base call (e.g. 1 in 10) or, equivalently, the base call accuracy (e.g. 90%). To make it possible to line up each individual nucleotide with its quality score, the numerical score is converted into a code where each individual character represents the numerical quality score for an individual nucleotide. For example, in the read above, the quality score line begins with:

```
1>AAADAAFFF1G11AA0000AAFE/AAE0FBAEGGG#B
```

Each character in this line, such as `#` or `A`, represents the encoded quality for an individual nucleotide. The numerical value assigned to each of these characters depends on the sequencing platform that generated the reads. The sequencing machine used to generate our data uses the standard Sanger quality PHRED score encoding (used by Illumina version 1.8 onwards). Each character is assigned a quality score between 0 and 42 as shown in the chart below.

```
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJK
                  |         |         |         |         |
Quality score:    0........10........20........30........40.
```
 
Each quality score represents the probability that the corresponding nucleotide call is incorrect. This quality score is logarithmically based, so a quality score of 10 reflects a base call accuracy of 90%, but a quality score of 20 reflects a base call accuracy of 99%. These probability values are the results from the base calling algorithm and depend on how much signal was captured for the base incorporation.
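
The logarithmic relationship is easier to see with a few numbers. The sketch below assumes the usual PHRED definition, where the error probability is 10 raised to the power of minus the score divided by 10. It also assumes an ASCII offset of 33, which is what the chart implies (`!` is ASCII character 33 and encodes a score of 0). The function names are illustrative only.

```python
import math

# Assumption: Sanger / Illumina 1.8+ encoding, where '!' (ASCII 33) means score 0.
PHRED_OFFSET = 33


def char_to_score(char):
    """Convert one quality character to its numerical quality score."""
    return ord(char) - PHRED_OFFSET


def error_probability(score):
    """Probability that a base call with this score is wrong."""
    return 10 ** (-score / 10)


def accuracy(score):
    """Probability that a base call with this score is right."""
    return 1 - error_probability(score)


# Every 10 points of quality divides the error probability by 10.
for score in (10, 20, 30, 40):
    print(f"Q{score}: error {error_probability(score):.4f}, accuracy {accuracy(score):.2%}")

print(char_to_score("#"), char_to_score("K"))
```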

Looking back at the beginning of our read:

```
@MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:13417:1998 2:N:0:TCGNAG
CGCGATCAGCAGCGGCCCGGAACCGGTCAGCCGCGCCNT
+
1>AAADAAFFF1G11AA0000AAFE/AAE0FBAEGGG#B
```

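We can now see that the quality of the `N` is very poor (`#` = a quality score of 2). Most of the other bases shown here score in the 30s (for example, `A` = 32 and `F` = 37), with a few dips into the mid-teens (`/`, `0`, and `1`). The long run of `#` characters at the end of the full read shows that quality drops off sharply toward the end of the read.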
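
To check a reading like this by hand, each base can be paired with its decoded score. This is a minimal sketch using the 39 bases shown above and assuming the offset-33 encoding from the chart. The threshold of 20 is an arbitrary cutoff chosen for illustration, not a recommendation.

```python
# The first 39 bases of the read and their quality characters, as shown above.
sequence = "CGCGATCAGCAGCGGCCCGGAACCGGTCAGCCGCGCCNT"
quality = "1>AAADAAFFF1G11AA0000AAFE/AAE0FBAEGGG#B"

# Assumption: offset-33 encoding, so '!' decodes to 0 and '#' decodes to 2.
scores = [ord(char) - 33 for char in quality]

# Scores of the unknown bases only.
n_scores = [score for base, score in zip(sequence, scores) if base == "N"]

# 1-based positions of bases below an illustrative cutoff of 20.
low_quality_positions = [
    position
    for position, score in enumerate(scores, start=1)
    if score < 20
]

print("scores of N bases:", n_scores)
print("highest score:", max(scores))
print("positions below Q20:", low_quality_positions)
```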

### Section 5: Creating, moving, copying, and removing files

Now we can move around in the file structure, look at files, and search files. But what if we want to copy files, move them around, or get rid of them? Most of the time, you can do these sorts of file manipulations without the command line, but there will be some cases (like when you're working with a remote computer, as we are for this lesson) where it will be impossible. You'll also find that you may be working with hundreds of files and want to do similar manipulations to all of those files. In cases like this, it's much faster to do these operations at the command line.

#### Copying Files

When working with computational data, it's important to keep a safe copy of that data that can't be accidentally overwritten or deleted. For this lesson, our raw data are our FASTQ files. We don't want to accidentally change the original files, so we'll make a copy of them and change the file permissions so that we can read from, but not write to, the files.

First, let's make a copy of one of our FASTQ files using the `cp` command.

In [None]:
'''
Run the commands below:
%cd /xdisk/bhurwitz/bh_class/$netid/exercises/data/untrimmed_fastq
!cp JC1A_R2.fastq JC1A_R2-copy.fastq
!ls -F
'''

#### So what happened?

We now have two copies of the `JC1A_R2.fastq` file, one of them named `JC1A_R2-copy.fastq`. We'll move this file to a new directory called `backup` where we'll store our backup data files.

#### Creating Directories

The `mkdir` command is used to make a directory. Enter `mkdir`
followed by a space, then the directory name you want to create.

In [None]:
'''
Run the command below:
!mkdir backup
'''

#### Moving / Renaming 

We can now move our backup file to this backup directory. We can
move files around using the command `mv`. 

In [None]:
'''
Run the commands below:
!mv JC1A_R2-copy.fastq backup
!ls backup
'''

#### What just happened?

You should see something like this:

```
JC1A_R2-copy.fastq
```

The `mv` command moved your JC1A_R2-copy.fastq file into the backup directory.

The `mv` command is also how you rename files. Let's rename this file to make it clear that this is a backup.

In [None]:
'''
Run the commands below:
%cd backup
!mv JC1A_R2-copy.fastq JC1A_R2-backup.fastq
!ls
'''

#### What am I left with?

You should see this:

```
JC1A_R2-backup.fastq
```

We renamed the file in the backup directory!
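
The same copy, create, move, and rename sequence can be written in Python, which is handy when these steps belong inside a larger script. The sketch below uses the standard `shutil` and `pathlib` modules. It works in a throwaway temporary directory with a tiny stand-in file, so it does not touch the class data.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)

    # Stand-in for the real FASTQ file (made-up content, for illustration only).
    original = workdir / "JC1A_R2.fastq"
    original.write_text("@read1\nACGT\n+\nIIII\n")

    # cp JC1A_R2.fastq JC1A_R2-copy.fastq
    copy = workdir / "JC1A_R2-copy.fastq"
    shutil.copy(original, copy)

    # mkdir backup
    backup_dir = workdir / "backup"
    backup_dir.mkdir()

    # mv JC1A_R2-copy.fastq backup
    shutil.move(str(copy), str(backup_dir))
    moved = backup_dir / copy.name

    # mv JC1A_R2-copy.fastq JC1A_R2-backup.fastq (a rename inside backup)
    renamed = backup_dir / "JC1A_R2-backup.fastq"
    moved.rename(renamed)

    # ls backup
    backup_listing = sorted(path.name for path in backup_dir.iterdir())
    original_still_there = original.exists()

print(backup_listing)
print(original_still_there)
```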

#### Removing files and directories

When we want to remove a file or a directory we use the `rm` command. By default, `rm` will not delete directories. You can tell `rm` to delete a directory using the `-r` (recursive) option. Or, you can use `rmdir` if the directory is empty.

Let's delete the backup directory we just made. 

In [None]:
'''
Run the commands below:
%cd ..
!rm -r backup
'''

#### Proceed with Caution!

**Important**: The `rm` command permanently removes the file. Be careful with this command. It doesn't just nicely put the files in the Trash. They're really gone. The command above will delete not only the directory, but all files within the directory. 

If you have write-protected files in the directory, you will be asked whether you want to override your permission settings before each of those files is deleted. For example:

```
rm: remove write-protected regular file ‘example.fastq’? 
```

If you enter `n` (for no), the file will not be deleted. If you enter `y`, you will delete the file. This gives us an extra measure of security, as there is one more step between us and deleting our data files.
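
Whether a file counts as write-protected is recorded in its permission bits. As a rough sketch, assuming a POSIX system such as the class server, the Python snippet below removes write permission from a throwaway file and then checks the bits. `is_write_protected` is a hypothetical helper that reports whether write permission is switched off for everyone, which is a stricter test than the one `rm` applies.

```python
import os
import stat
import tempfile
from pathlib import Path


def is_write_protected(path):
    """Return True if no one (user, group, or other) has write permission."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    return (mode & write_bits) == 0


with tempfile.TemporaryDirectory() as tmp:
    example = Path(tmp) / "example.fastq"
    example.write_text("@read1\nACGT\n+\nIIII\n")  # made-up content

    # Start from a known state: owner can read and write, others can read.
    os.chmod(example, 0o644)
    before = is_write_protected(example)

    # Read-only for everyone: the file can be read but not modified.
    os.chmod(example, 0o444)
    after = is_write_protected(example)

    # Restore write permission so the temporary directory cleans up normally.
    os.chmod(example, 0o644)

print(before, after)
```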

#### Exercise 1: Make a backup folder with write-protected permissions

Starting in the `/xdisk/bhurwitz/bh_class/your_netid/exercises/data/untrimmed_fastq` directory, do the following:
1. Create a copy of each of your FASTQ files.
2. Use a wildcard to move all of your backup files to a new backup directory.
3. Change the permissions on all of your backup files to be write-protected.

In [None]:
'''
Write your commands below:
'''

### Summary

Unix is a great way to quickly work with and modify files programmatically.

### Keypoints:
- You can view file contents using `less`, `cat`, `head` or `tail`.
- The commands `cp`, `mv`, and `mkdir` are useful for manipulating existing files and creating new directories.
- You can view file permissions using `ls -l`.
- The `history` command and the up arrow on your keyboard can be used to repeat recently used commands.


-----

In [None]:
# The End
!cp ~/be487-fall-2024/exercises/02_intro_unix/ex02-02_working_with_files.ipynb  /xdisk/bhurwitz/bh_class/$netid/exercises/02_intro_unix/ex02-02_working_with_files.ipynb