# Practical Exercise

Let's apply these constructs to something you can use in a script. We're going to download files from the NCBI directory for the honey bee genome. 

Navigate to https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/7460/104/GCF_003254395.2_Amel_HAv3.1 to see a directory listing (clicking will open in a new tab).

It looks like:

<img src="https://onishdata.bmb.colostate.edu/jupyterlab_icons/html_dirlisting.png" width="534px">


Let's save the URL in a variable.

```BASH

baseUrl='https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/7460/104/GCF_003254395.2_Amel_HAv3.1'

```

**I'm quoting mine in single quotes** so I don't have to check for things the shell will process.
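To see what the quoting choice changes, here is a minimal sketch. The variable `name` and the two strings are made up for this demo; they are not part of the exercise.

```BASH
name=world               # hypothetical variable, only for this demo

single='hello $name'     # single quotes: stored exactly as typed
double="hello $name"     # double quotes: $name is expanded first

echo "$single"
echo "$double"
```

The URL in this exercise has no characters the shell would expand, so either style would work here. Single quotes just mean you don't have to check.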

In [None]:
baseUrl='https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/7460/104/GCF_003254395.2_Amel_HAv3.1'

The files hold different data in a variety of formats. This lecture is more about Linux mechanics, so we're just grabbing a small one.

Choose a file by saving its name to a variable called `datafile`. I'm going to use "GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz".

In [None]:
datafile="GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz"
echo $datafile
# fully qualified URL
echo $baseUrl/$datafile

If you have defined the variables correctly, the following command downloads the specified file. This will place a file in your current directory with the same name as the string stored in `$datafile`.

In [None]:
wget $baseUrl/$datafile

Now, get the checksum of the file by running the command `md5sum` on `$datafile`

In [None]:
# on macOS, you might need `md5 -r` instead of md5sum
md5sum $datafile
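If you switch between Linux and macOS, the note in the cell above can be folded into a small helper. This is only a sketch: the function name `md5_of` and the demo file are hypothetical, and it assumes that one of `md5sum` or `md5` is installed.

```BASH
# Hypothetical helper: print the MD5 line for a file with whichever tool exists
md5_of() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1"
    else
        md5 -r "$1"      # macOS/BSD fallback, as noted above
    fi
}

# Throwaway demo file so the sketch runs without the real download
demo_file=$(mktemp)
printf 'hello\n' > "$demo_file"
md5_of "$demo_file"
```

With the real download you would call it as `md5_of "$datafile"`.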

**What is md5sum?** It is a command that runs the MD5 algorithm on a file and prints a digest called a *hash*. The hash is a signature of the file's contents that you can check against an expected value. A hash used this way to check file integrity is called a "checksum".
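A quick way to get a feel for this is to hash a few short strings. The sketch below assumes `md5sum` is available; the strings and variable names are made up for the demo.

```BASH
# Same bytes in, same hash out
hash_a=$(printf 'hello\n' | md5sum | cut -d' ' -f1)
hash_b=$(printf 'hello\n' | md5sum | cut -d' ' -f1)

# Change a single character and the hash is different
hash_c=$(printf 'hellp\n' | md5sum | cut -d' ' -f1)

echo "a: $hash_a"
echo "b: $hash_b"
echo "c: $hash_c"
```

The first two lines of output match and the third does not. That is the property we rely on: if the downloaded file differs from the original at all, its hash will not match the published one.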

**But, how do we check if it's right?** We need the checksum file. Looking at the directory listing above, the filename is `md5checksums.txt`. Sometimes it's called `checksum.txt`, or `checksum.md5` or similar.

Assign this name, `md5checksums.txt`, to the variable `checksumfile`.

In [None]:
checksumfile=md5checksums.txt

Now download the file as you did with `wget $baseUrl/$datafile`, but use `$checksumfile` instead of `$datafile`.

In [None]:
wget $baseUrl/$checksumfile

In [None]:
head $checksumfile

---
Is the checksum right? How do we tell? The checksum file has too much information to scan by eye.

Try using `grep` with information you got from the `md5sum` command above. 

It will take the form `grep PATTERN $checksumfile`.

**Challenge!** Once you figure out how to get the information with grep, can you run the commands in succession to get a more readable answer?

In [None]:
# md5sum command
md5sum $datafile
# grep command
grep $datafile $checksumfile

Did it work? Can you formulate those two commands to use only `$datafile` and `$checksumfile` as the arguments to the commands?
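Once both commands work, you could go one step further and let the shell do the comparison. The following is a sketch, not the only way to do it. The function name `verify_md5` and the demo files are hypothetical, and it assumes each line of the checksum file starts with the hash followed by the file name, with exactly one line matching the file.

```BASH
# Hypothetical helper: compare a file's MD5 with its entry in a checksum file
verify_md5() {
    local file=$1 sums=$2 actual expected
    actual=$(md5sum "$file" | awk '{print $1}')
    # -F treats the file name as a literal string, so its dots are not wildcards
    expected=$(grep -F "$file" "$sums" | awk '{print $1}')
    if [ -n "$expected" ] && [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file"
        return 1
    fi
}

# Self-contained demo with throwaway files instead of the real download
demo_dir=$(mktemp -d)
printf 'hello\n' > "$demo_dir/demo.txt"
md5sum "$demo_dir/demo.txt" > "$demo_dir/demo_sums.txt"
verify_md5 "$demo_dir/demo.txt" "$demo_dir/demo_sums.txt"
```

With the files from this exercise, the call would be `verify_md5 "$datafile" "$checksumfile"`. GNU `md5sum` also has a `-c` option that checks files against a list of hashes, which may be worth exploring as an alternative.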

Let's use JupyterLab to convert this notebook into a script.

Go to File -> Export Notebook As... -> Export Notebook to Executable Script

This will save it to your computer. If you're still figuring out the directory structure, here is my saved script:
```BASH
baseUrl='https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/7460/104/GCF_003254395.2_Amel_HAv3.1'

datafile="GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz"
echo $datafile
# fully qualified URL
echo $baseUrl/$datafile

wget $baseUrl/$datafile

md5sum $datafile

checksumfile=md5checksums.txt

wget $baseUrl/$checksumfile

head $checksumfile

grep $datafile $checksumfile

# md5sum command
md5sum $datafile
# grep command
grep $datafile $checksumfile
```

# Let's use the script we created from the notebook.

---

If you can find that exported script on your home computer, then upload it using the up-arrow icon on the JupyterLab icon menu.

**Now, switch over to a terminal and make sure you are in our current directory.**

You can get the current directory of this notebook via: `pwd`

In [None]:
pwd

1. Copy the output from the above command, and paste it after the `cd` command *in your terminal*. Remember to leave a space between `cd` and the pasted text.
1. Check to see if you have the script using `ls checksum.sh`
1. If not, copy the text from my finished script below.
1. Type `nano checksum.sh` *in your terminal*, paste the text, then save and exit (Ctrl+O, Enter, Ctrl+X).

*My finished script*

```BASH
baseUrl='https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/7460/104/GCF_003254395.2_Amel_HAv3.1'

datafile="GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz"
echo $datafile
# fully qualified URL
echo $baseUrl/$datafile

wget $baseUrl/$datafile

md5sum $datafile

checksumfile=md5checksums.txt

wget $baseUrl/$checksumfile

head $checksumfile

grep $datafile $checksumfile

# md5sum command
md5sum $datafile
# grep command
grep $datafile $checksumfile
```

Let's make it executable and run it:
```BASH

chmod -v a+x checksum.sh
./checksum.sh

```
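The script above has no first line naming an interpreter, so the shell that launches it decides how to run it. A common convention is to start scripts with a "shebang" line such as `#!/usr/bin/env bash`. The sketch below shows the same chmod-then-run steps on a throwaway script. The name `hello.sh` and its contents are hypothetical and only there so the example runs on its own.

```BASH
# Write a tiny throwaway script into a temporary directory
script_dir=$(mktemp -d)
cat > "$script_dir/hello.sh" <<'EOF'
#!/usr/bin/env bash
echo "hello from a script"
EOF

# Add execute permission for all users, then run it by path
chmod a+x "$script_dir/hello.sh"
output=$("$script_dir/hello.sh")
echo "$output"
```

If you like, add the same first line to `checksum.sh` so it always runs under Bash.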