# Writing Scripts and Working with Data

### Questions:
- What is a for loop, and how can we use it to iterate through many files?
- What is basename, and why is it useful?
- How can we automate a commonly used set of commands?
- How can we transfer files between local and remote computers?

### Objectives:
- Write a for loop to iterate through datasets.
- Use basename to quickly get the name of the file without an extension.
- Write a basic shell script.
- Use the `bash` command to execute a shell script.
- Use `chmod` to make a script an executable program.

### Keypoints:
- Loops are great for automating tasks and iterating through many files.
- basename is a great way to create new files with different extensions after processing them.
- Scripts are a collection of commands executed together.
- Scripts are executable text files.
- In a terminal, `scp` transfers files between remote and local computers.


### Getting Started

In [None]:
'''
Set a variable for your netid
Replace "MY_NETID" with your actual netid
'''
netid = "MY_NETID"
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/exercises/03_bash_scripting"
%cd $work_dir

### Section 1: Writing "for loops"

Loops are key to productivity improvements through automation as they allow us to execute commands repeatedly. Similar to wildcards and tab completion, using loops also reduces the amount of typing (and typing mistakes). 

Loops are helpful when performing operations on groups of sequencing files, such as unzipping or trimming multiple files. We will use loops for these purposes in subsequent analyses, but will cover the basics of them for now.

When the shell sees the keyword `for`, it knows to repeat a command (or group of commands) once for each item in a list. Each time the loop runs (called an iteration), an item in the list is assigned in sequence to the **variable**, and the commands inside the loop are executed, before moving on to  the next item in the list. Inside the loop, we call for the variable's value by putting `$` in front of it. The `$` tells the shell interpreter to treat the **variable** as a variable name and substitute its value in its place, rather than treat it as text or an external command. In shell programming, this is usually called "expanding" the variable.

Let's write a for loop to show us the first two lines of the fastq files in the untrimmed_fastq directory. A semicolon, `;`, can be used to separate two commands written on a single line.

The for loop begins with the formula `for <variable> in <group to iterate over>`. In this case, the word `filename` is designated as the variable to be used over each iteration. Each file that matches the pattern `*.fastq` in the directory we've specified (for example, `JC1A_R1.fastq` and `JC1A_R2.fastq`) will be substituted for `filename` in turn. The next part of the for loop is `do`, followed by the code that we want to execute. We are telling the loop to print the first two lines of each file we iterate over and save the information to a file. Finally, the word `done` ends the loop.

Note that we are using `>>` to append the text to our `seq_info.txt` file. If we used `>`, the `seq_info.txt` file would be rewritten every time the loop iterates, so it would only have text from the last file processed. Instead, `>>` adds to the end of the file.
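To make the difference between `>>` and `>` concrete, here is a minimal sketch that runs the same kind of loop on two tiny stand-in files in a temporary directory. The file names `sampleA.fastq`, `sampleB.fastq`, `appended.txt`, and `overwritten.txt` are made up for illustration and are not part of the class data.

```shell
# Work in a throwaway directory so no real data is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Two tiny stand-in files (hypothetical names and contents).
printf '@read1\nACGT\n+\nIIII\n' > sampleA.fastq
printf '@read2\nTTGA\n+\nIIII\n' > sampleB.fastq

# The loop written across several lines instead of using semicolons.
for filename in *.fastq
do
    head -n 2 "$filename" >> appended.txt     # >> keeps what earlier iterations wrote
    head -n 2 "$filename" > overwritten.txt   # > replaces the file on every iteration
done

# Compare how many lines each file ended up with.
wc -l appended.txt overwritten.txt
```

`appended.txt` ends up with two lines per input file, while `overwritten.txt` holds only the two lines from the last file.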

In [None]:
'''
Type the commands below, and run the cell
%cd /xdisk/bhurwitz/bh_class/$netid/exercises/data/untrimmed_fastq
'''

In [None]:
'''
Type the command below, and run the cell
!for filename in *.fastq; do head -n 2 $filename >> seq_info.txt; done
'''

#### Let's check out the resulting file we created

To see the contents of the small file we just made, we can use the `cat` command.

In [None]:
'''
Type the command below, and run the cell
!cat seq_info.txt
'''

#### You should see something like this...

```
@MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:13417:1998 1:N:0:TCGNAG
CTACGGCGCCATCGGCGNCCCCGGACGGTAGGAGACGGCGATGCTGGCCCTCGGCGCGGTCGCGTTCCTGAACCCCTGGCTGCTGGCGGCGCTCGCGGCGCTGCCGGTGCTCTGGGTGCTGCTGCGGGCGACGCCGCCGAGCCCGCGGCGGGTCGGATTCCCCGGCGTGCGCCCCCCGCTCCGGCTCGAGGACGCCGCACCGACGCCCCACCCCCCCCCCCGGTGGCTCCTCCTGCCGCCCTGCCTGATCC
@MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:13417:1998 2:N:0:TCGNAG
CGCGATCAGCAGCGGCCCGGAACCGGTCAGCCGCGCCNTGGGGTTCAGCACCGGCNNGGCGAAGGCCGCGATCGCGGCGGCGGCGATCAGGCAGCGCAGCAGCAGGAGCCACCAGGGCGTGCGGTCGGGCGTCCGTTCGGCGTCCTCGCGCCCCAGCAGCAGGCGCACGCCAGGGAATCCGACCCGCCGCCGGCTCGGCCGCGTCNCCCGCNCCCGCCCCCCGAGCACCCGNAGCCNCNCCACCGCCGCCC
@MISEQ-LAB244-W7:156:000000000-A80CV:1:1101:12622:2006 1:N:0:CTCAGA
CCCGTTCCTCGGGCGTGCAGTCGGGCTTGCGGTCTGCCATGTCGTGTTCGGCGTCGGTGGTGCCGATCAGGGTGAAATCCGTCTCGTAGGGGATCGCGAAGATGATCCGCCCGTCCGTGCCCTGAAAGAAATAGCACTTGTCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCTCAGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAGCAAACCTCTCACTCCCTCTACTCTACTCCCTT
@MISEQ-LAB244-W7:156:000000000-A80CV:1:1101:12622:2006 2:N:0:CTCAGA
GACAAGTGCTATTTCTTTCAGGGCACGGACGGGCGGATCATCTTCGCGATCCCCTACGAGACGGATTTCACCCTGATCGGCACCACCGACGCCGAACACGACATGGCAGACCGCAAGCCCGACTGCACGCCCGAGGAACGGGAGATCGGAAGAGCGTCGTGTAGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAGCGATCAACTCGACCGACCTGTCTTATTATATCTCACGTAA
```

### Section 2: Using Basename

`basename` is a UNIX command that is helpful for removing a uniform part of a name from a list of files. In this case, we will use `basename` to remove the `.fastq` extension from the files that we’ve been working with.

In [None]:
'''
Type the command below, and run the cell
!basename JC1A_R2.fastq .fastq
'''

We see that this returns just the file prefix, and no longer has the .fastq file extension on it.

```
JC1A_R2
```

If we try the same thing but use `.fasta` as the file extension instead, the file name comes back unchanged. This is because `basename` only removes the suffix when it exactly matches the end of the file name.

```
$ basename JC1A_R2.fastq .fasta
```

```
JC1A_R2.fastq
```
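As a small sketch of how the suffix matching behaves, the commands below run `basename` on three names. The directory `/tmp/data` and the name `JC1A.fastq.gz` are hypothetical; `basename` only works on the text of the name, so the files do not need to exist.

```shell
# basename also drops any leading directory (the path here is hypothetical).
with_dir=$(basename /tmp/data/JC1A_R2.fastq .fastq)
echo "$with_dir"

# The suffix is only removed when it matches the very end of the name.
no_match=$(basename JC1A_R2.fastq .fasta)
echo "$no_match"

# A suffix that appears in the middle of the name is left alone.
middle=$(basename JC1A.fastq.gz .fastq)
echo "$middle"
```

The three lines printed are `JC1A_R2`, `JC1A_R2.fastq`, and `JC1A.fastq.gz`.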

`basename` is really powerful when used in a for loop. It allows you to access just the file prefix, which you can use to name things. Let's try this.

Inside our for loop, we create a new variable called `name`. We call the `basename` command inside `$( )`, give it our variable from the for loop, in this case `$filename`, and finally state that `.fastq` should be removed from the file name. It’s important to note that we’re not changing the actual files; we’re creating a new variable called `name`. The command `echo ${name}` prints the value of the variable to the terminal each time the for loop runs. Because we are iterating over four files, we expect to see four lines of output.

In [None]:
'''
Type the command below, and run the cell
!for filename in *.fastq; do name=$(basename $filename .fastq); echo ${name}; done
'''

#### You should get something like this...

```
JC1A_R1
JC1A_R2
JP4D_R1
JP4D_R2
```
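One common use of the prefix is to build the name of an output file. The sketch below assumes two made-up input files (`sampleA.fastq` and `sampleB.fastq`) in a temporary directory, and the `_first_read.txt` suffix is an arbitrary choice for illustration.

```shell
# Work in a throwaway directory with two stand-in FASTQ files.
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > sampleA.fastq
printf '@r3\nTTAA\n+\nIIII\n@r4\nCCGG\n+\nIIII\n' > sampleB.fastq

for filename in *.fastq
do
    # Quoting "$filename" keeps names containing spaces in one piece.
    name=$(basename "$filename" .fastq)
    # Save the first FASTQ record (4 lines) under a name built from the prefix.
    head -n 4 "$filename" > "${name}_first_read.txt"
done

ls
```

Each input file now has a matching `<prefix>_first_read.txt` file next to it, and the original `.fastq` files are untouched.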

#### Exercise 1: Using `basename`

Print the file prefix of all of the `.txt` files in our current directory.

In [None]:
'''
Type the commands below, and run the cell
'''

#### Solution

```
!for filename in *.txt; do name=$(basename $filename .txt); echo ${name}; done
```

#### What else can we do?

One way this is really useful is for renaming files. Let's rename all of our .txt files using `mv` so that they have the year in their names, which will document when we created them.

In [None]:
'''
Type the commands below, and run the cell
!for filename in *.txt; do name=$(basename $filename .txt); mv ${filename} ${name}_2024.txt; done
!ls -l
'''
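Because `mv` changes files immediately, it can help to preview a rename loop before running it. The sketch below is one way to do that, using two made-up files (`notes.txt` and `summary.txt`) in a temporary directory: putting `echo` in front of `mv` prints each command instead of executing it.

```shell
# Work in a throwaway directory with two empty stand-in files.
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch notes.txt summary.txt

# Dry run: echo prints each mv command instead of running it.
for filename in *.txt
do
    name=$(basename "$filename" .txt)
    echo mv "$filename" "${name}_2024.txt"
done

# Once the printed commands look right, drop the echo to rename for real.
for filename in *.txt
do
    name=$(basename "$filename" .txt)
    mv "$filename" "${name}_2024.txt"
done

ls
```

Note that running the real loop a second time would add `_2024` again, since the renamed files still end in `.txt`.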

### Section 3: Writing Scripts

A really powerful thing about the command line is that you can write scripts. Scripts let you save commands so you can run them again, and they also let you put multiple commands together. Though writing scripts may require an additional time investment initially, this can save you time as you run them repeatedly. Scripts can also address the challenge of reproducibility: if you need to repeat an analysis, you retain a record of your commands within the script.

One thing we will commonly want to do with sequencing results is pull out bad reads and write them to a file to see if we can figure out what is going on with them. We are going to look for reads with long sequences of N's like we did before, but now we are going to write a script, so we can run it each time we get new sequences rather than type the code in by hand each time.

Bad reads have a lot of N's, so we are going to look for  `NNNNNNNNNN` with `grep`. We want the whole FASTQ record, so we are also going to get the one line above the sequence and the two lines below. We also want to look at all the files that end with `.fastq`, so we will use the `*` wildcard.

In [None]:
'''
Type the command below, and run the cell
!grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads.txt
'''

#### How can we create a script?

If we were on the HPC and using the shell, we would use a text editor like `nano` or `vim` to create and edit a file. But because we are inside a Jupyter Notebook, we are going to use Python, a programming language, to write the script for us. We will call it `bad-reads-script.sh`. The `.sh` extension is not required, but using it tells us that the file is a shell script.

In [None]:
my_code = '''grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads_from_script.txt'''

with open('bad-reads-script.sh', mode='w') as file:
    file.write(my_code)
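If you are working in a plain shell rather than a notebook and do not want to open an editor, one alternative is a "here-document", which writes the lines between two markers into a file. This is only a sketch of that approach, run in a temporary directory so it does not overwrite the script created by the Python cell.

```shell
# Work in a throwaway directory.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Everything between the two EOF markers is written to the file.
# Quoting 'EOF' stops the shell from expanding anything inside.
cat > bad-reads-script.sh << 'EOF'
grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads_from_script.txt
EOF

# Show what ended up in the script.
cat bad-reads-script.sh
```

The resulting file has the same single `grep` line that the Python cell writes.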

#### We can run this script!

Now comes the neat part. We can run this script. Type:

In [None]:
'''
Type the command below, and run the cell
!bash bad-reads-script.sh
'''

#### What happened?

It will look like nothing happened, but if you look at `bad_reads_from_script.txt`, you can see that there are now reads in the file.

In [None]:
'''
Type the command below, and run the cell
!ls -l bad_reads*
'''

#### Exercise 2: Edit a script

We want the script to tell us when it is done. Try using the `echo` command to let the user know the script is done.

1. Re-create the script `bad-reads-script.sh` and add the line `echo "Script finished!"` after the `grep` command.
2. Run the updated script.


In [None]:
my_code = '''grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads_from_script.txt
# add a unix command here to say "Script finished!"
'''

with open('bad-reads-script.sh', mode='w') as file:
    file.write(my_code)

In [None]:
'''
Write a command below to run the bad-reads-script.sh script, and run the cell
'''

### Section 4: Making the script into a program

We had to type `bash` because we needed to tell the computer what program to use to run this script. Instead, we can turn this script into its own program. We need to tell it that it is a program by making it executable. We can do this by changing the file permissions. We talked about permissions in an earlier class.

First, let us look at the current permissions.

In [None]:
'''
Type the command below, and run the cell
!ls -l bad-reads-script.sh
'''

#### What do you see?

```
-rw-rw-r-- 1 user user 0 Aug 25 21:46 bad-reads-script.sh
```

We see that it says `-rw-rw-r--`. This combination shows that the file can be read by any user and written to by the file owner (you) and members of your group. We want to change these permissions so the file can be executed as a program. We use the command `chmod` as we did earlier when we removed write permissions. Here we are adding (`+`) executable permissions (`x`).

In [None]:
'''
Let's change the file permissions to executable
Type the command below, and run the cell
!chmod +x bad-reads-script.sh
'''


In [None]:
'''
What do you see now?
Type the command below, and run the cell
!ls -l bad-reads-script.sh
'''

#### You should get something like this...

```
-rwxrwxr-x 1 user user 0 Aug 25 21:46 bad-reads-script.sh
```

Now we see that it says `-rwxrwxr-x`. The `x`'s there now tell us we can run it as a program. So, let us try it! We will need to put `./` at the beginning, so the computer knows to look here in this directory for the program.

In [None]:
'''
Type the command below, and run the cell
!./bad-reads-script.sh
'''
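As a hedged sketch of what `chmod +x` changes, the snippet below creates a throwaway script (`hello.sh`, a made-up name) in a temporary directory and captures the permission string before and after.

```shell
# Work in a throwaway directory with a one-line stand-in script.
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'echo "Hello from a script"\n' > hello.sh

# The first 10 characters of `ls -l` are the file type and permissions.
before=$(ls -l hello.sh | cut -c1-10)
chmod +x hello.sh
after=$(ls -l hello.sh | cut -c1-10)

echo "before: $before"
echo "after:  $after"

# With the x permission set, the script can be run directly.
./hello.sh
```

The exact strings depend on your system's default permissions, but the `after` string should show `x` where the `before` string shows `-`.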

#### What next?

The script should run the same way as before, but now we have created our own computer program!

You can also add a first line to the script (often called a "shebang") that tells the system to use the bash shell to run the commands:

In [None]:
my_code = '''#!/bin/bash
grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads_from_script.txt
# add a unix command here to say "Script finished!"
'''

with open('bad-reads-script.sh', mode='w') as file:
    file.write(my_code)

In [None]:
'''
Type the command below, and run the cell
!./bad-reads-script.sh
'''

### Section 5: Compressing large files with gzip

It is good practice to keep large files compressed while you are not using them. This saves storage space, which you will appreciate as your analysis grows. Since we will not use the FASTQ files for now, let us compress them, then run `ls -lh` to confirm that they are compressed.

Warning: the command below will take a few minutes to run.

In [None]:
'''
Type the command below, and run the cell
%cd /xdisk/bhurwitz/bh_class/$netid/exercises/data/untrimmed_fastq
!gzip *.fastq
!ls -lh  *.fastq.gz
'''

#### What do your zipped files look like?

```
total 428M
-rw-r--r-- 1 user user  24M Aug 26 12:36 JC1A_R1.fastq.gz
-rw-r--r-- 1 user user  24M Aug 26 12:37 JC1A_R2.fastq.gz
-rw-r--r-- 1 user user 179M Aug 26 12:44 JP4D_R1.fastq.gz
-rw-r--r-- 1 user user 203M Aug 26 12:51 JP4D_R2.fastq.gz
```
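Compressed files do not have to be unpacked on disk before you can look inside them. The sketch below uses a tiny made-up file (`sample.fastq`) in a temporary directory to show the round trip: `gzip` to compress, `gzip -dc` to stream the decompressed contents to the screen, and `gunzip` to restore the original file.

```shell
# Work in a throwaway directory with a tiny stand-in FASTQ file.
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '@read1\nACGT\n+\nIIII\n' > sample.fastq

# gzip replaces sample.fastq with sample.fastq.gz.
gzip sample.fastq

# -d decompresses and -c writes to standard output, so the .gz file stays as it is.
first_line=$(gzip -dc sample.fastq.gz | head -n 1)
echo "$first_line"

# gunzip restores sample.fastq and removes the .gz file.
gunzip sample.fastq.gz
ls
```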

### Section 6: Moving and downloading data

So far, we have worked with pre-loaded data on the HPC. Most analyses, however, begin with moving data onto the HPC. Below we will show you some commands to download data onto the HPC or to move data between your computer and the HPC.

#### Getting data to/from the HPC

Two programs will download data from a remote server to your local
(or remote) machine: ``wget`` and ``curl``. They were designed to do slightly different tasks by default, so you will need to give the programs somewhat different options to get the same behavior, but they are mostly interchangeable.

 - ``wget`` is short for "world wide web get", and its basic function is to *download* web pages or data at a web address.

 - ``cURL`` is a pun. It is supposed to be read as "see URL", so its primary function is to *display* web pages or data at a web address.

Which one you need to use mainly depends on your operating system, as most computers will
only have one or the other installed by default.

Let us say you want to download some data from Ensembl. We will download a tiny
tab-delimited file that tells us what data is available on the Ensembl bacteria server.
Before starting our download, we need to know whether we are using ``curl`` or ``wget``.

To see which program you have, type:

In [None]:
'''
Type the command below, and run the cell
!which curl
!which wget
'''

#### Which which?

``which`` is a command that searches the directories listed in your ``PATH`` for the program you name and tells you what folder it is installed in. If it cannot find the program you asked for, it typically gives you no results.

On the HPC, you will likely get the following output:

```
$ which curl
```

```
/usr/bin/curl
```


```
$ which wget
```

```
/usr/bin/wget
```

This output means that you have both ``curl`` and ``wget`` installed.

Now we can use one of the following commands to download the file to your home directory:

In [None]:
'''
Type the command below, and run the cell
%cd
!wget ftp://ftp.ensemblgenomes.org/pub/release-37/bacteria/species_EnsemblBacteria.txt

or

%cd
!curl -O ftp://ftp.ensemblgenomes.org/pub/release-37/bacteria/species_EnsemblBacteria.txt

'''

#### What just happened?

Since we wanted to *download* the file rather than view it, we used ``wget`` without any modifiers. With ``curl``, however, we had to use the ``-O`` flag, which simultaneously tells ``curl`` to download the page instead of showing it to us **and** specifies that it should save the file using the same name it had on the server: species_EnsemblBacteria.txt.

It's important to note that both ``curl`` and ``wget`` download to the computer that the command line belongs to. So, if you are logged into the HPC on the command line and execute the ``curl`` command above in the terminal, the file will be downloaded to the HPC, not to your local computer. Also, note that we used `%cd` with no arguments to change into your home directory.
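In a script, you may want to pick whichever tool is available rather than check by hand. The following is a minimal sketch, assuming you prefer `wget` when both are present. It uses `command -v`, a shell built-in that serves the same purpose as `which`. The variable names are made up, and the snippet only prints the command it would run rather than downloading anything.

```shell
# The same Ensembl file used above.
url="ftp://ftp.ensemblgenomes.org/pub/release-37/bacteria/species_EnsemblBacteria.txt"

# command -v succeeds only if the named program can be found.
if command -v wget > /dev/null 2>&1; then
    downloader="wget"
elif command -v curl > /dev/null 2>&1; then
    downloader="curl -O"
else
    downloader=""
fi

if [ -n "$downloader" ]; then
    # Print the command instead of running it; remove the echo to download for real.
    echo "Would run: $downloader $url"
else
    echo "Neither wget nor curl was found."
fi
```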

#### Moving files between your laptop and the HPC

What if the data you need is on your local computer, but you need to get it *onto* the HPC? Or, what if you need to download files from the HPC to your laptop? Here is how you can do it!
       
#### Upload/Download small files between the HPC and your laptop.  
    
If your files are small, you can use the online [HPC portal](https://ood.hpc.arizona.edu/) and select the file browser from the top menu to get easy access for transferring files to/from your /home, /xdisk, and /groups directories. You can also view, edit, copy, and rename your files. 
    
#### Transferring larger data files between your local machine and the HPC with `scp`  
    
`scp` stands for 'secure copy protocol' and is a widely used UNIX tool for moving files between computers. The simplest way to use `scp` is to run it in your local terminal and use it to copy a single file. 

You will need to use an SSH v2 compliant terminal to move files to/from the HPC. For more information on using `scp`, run `man scp`.

Moving a File or Directory to the HPC:

<details>
  <summary markdown="span">Mac OS</summary>
  <ul>
In your terminal, navigate to the desired working directory on your local machine (laptop or desktop usually). To move a file or directory to a designated subdirectory in your account on HPC:

```
$ scp -rp filenameordirectory NetId@filexfer.hpc.arizona.edu:subdirectory
```

Getting a File or Directory From the HPC:

In your terminal, navigate to the desired working directory on your local machine. To copy a remote file from HPC to your current directory:

```
$ scp -rp NetId@filexfer.hpc.arizona.edu:filenameordirectory .
```

**The space followed by a period at the end means the destination is the current directory.**

</details>

<details>
  <summary markdown="span">PC</summary>
  <ul>
Windows users can use software like WinSCP to make SCP transfers. To use WinSCP, first download/install the software from: https://winscp.net/eng/download.php

To connect, enter filexfer.hpc.arizona.edu in the Host Name field, enter your NetID under User name, and enter your password. Accept by clicking Login. You'll be prompted to Duo Authenticate.

</details>

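Wildcards can be used for multiple file transfers (e.g. all files with the .fastq extension). Note the backslash `\` preceding the `*`, which stops your local shell from expanding the wildcard so that it is matched on the HPC instead:

`$ scp YOUR_NETID@filexfer.hpc.arizona.edu:subdirectory/\*.fastq ~/Downloads`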

> #### Exercise 4: Downloading data with `scp`  
> Let's say we want to download a text file `/xdisk/bhurwitz/bh_class/your_netid/exercises/data/untrimmed_fastq/bad_reads_from_script.txt` from the HPC to your local computer. Note you will perform this action from a shell on your local computer.

> Which of the following commands would download the file?  
> A)  
> ```
> $  scp local_file.txt your_netid@filexfer.hpc.arizona.edu:/xdisk/bhurwitz/bh_class/your_netid/exercises/data/untrimmed_fastq/
> ```

> B)  
> ```
> $ scp your_netid@filexfer.hpc.arizona.edu:/xdisk/bhurwitz/bh_class/your_netid/exercises/data/untrimmed_fastq/bad_reads_from_script.txt ~/Downloads
> ```

<details>
  <summary markdown="span">Solution</summary>
  <ul>
<li>A) False. This command will upload the file `local_file.txt` to the /xdisk/bhurwitz/bh_class/your_netid/exercises/data/untrimmed_fastq/ directory.</li>
<li>B) True. This option downloads the bad reads file in `/xdisk/bhurwitz/bh_class/your_netid/exercises/data/untrimmed_fastq/` to your local `~/Downloads` directory. Be sure to execute this from your local machine.</li>
  </ul>
</details>

In [None]:
# The End!
!cp ~/be487-fall-2024/exercises/03_bash_scripting/ex03-01_writing_scripts.ipynb  /xdisk/bhurwitz/bh_class/$netid/exercises/03_bash_scripting/ex03-01_writing_scripts.ipynb