# Practical Exercise

Let's apply these constructs to something you can use in a script. We're going to download files from the NCBI directory for the honey bee genome. 

Navigate to https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/7460/104/GCF_003254395.2_Amel_HAv3.1 to see a directory listing (clicking will open in a new tab).

It looks like:

<img src="html_dirlisting.png" alt="Directory listing" style="width: 534px;"/>


Let's save the URL in a variable.

```BASH

baseUrl='https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/7460/104/GCF_003254395.2_Amel_HAv3.1'

```

**I'm quoting mine in single quotes** so the shell treats every character literally and I don't have to check the URL for characters it would otherwise interpret.
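
To see what the two quoting styles do, here is a minimal sketch. The variable `greeting` is made up for illustration and is not part of the exercise.

```BASH
# Hypothetical variable, used only to illustrate quoting.
greeting='hello'

# Single quotes: the shell leaves everything inside untouched.
single='$greeting costs $5'

# Double quotes: the shell still expands variables such as $greeting,
# so a literal dollar sign has to be escaped with a backslash.
double="$greeting costs \$5"

echo "$single"   # prints: $greeting costs $5
echo "$double"   # prints: hello costs $5
```

The URL in this exercise contains no `$`, backticks, or other special characters, so either style would work here. Single quotes just remove the need to check.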

In [2]:
baseUrl='https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/7460/104/GCF_003254395.2_Amel_HAv3.1'

The files hold different data in a variety of formats. This lecture is more about Linux mechanics, so we're just grabbing a small one.

Choose by saving one of the file names to a variable called `datafile`. I'm going to use "GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz".

In [3]:
datafile="GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz"
echo $datafile
# fully qualified URL
echo $baseUrl/$datafile

GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/7460/104/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz
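
When you join variables into a longer string, as with the URL above, quoting and braces help avoid surprises. The sketch below uses made-up names (`demoBase`, `demoFile`) and a placeholder URL so it does not overwrite the variables from the exercise.

```BASH
# Hypothetical values standing in for the real URL and file name.
demoBase='https://example.org/data'
demoFile='sample.gff.gz'

# Double quotes keep the expanded result together as a single word.
demoUrl="${demoBase}/${demoFile}"
echo "$demoUrl"      # prints: https://example.org/data/sample.gff.gz

# Braces matter when text follows the name directly:
# $demoFile_old would look up a different variable named demoFile_old.
backupName="${demoFile}_old"
echo "$backupName"   # prints: sample.gff.gz_old
```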


If you have defined the variables correctly, the following command downloads the specified file. It places a file in your current directory with the same name as the string stored in `$datafile`.

In [4]:
wget $baseUrl/$datafile

--2020-08-26 17:07:12--  https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/7460/104/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.12, 2607:f220:41e:250::11, 2607:f220:41e:250::12, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.12|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6755398 (6.4M) [application/x-gzip]
Saving to: ‘GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz’


2020-08-26 17:07:14 (16.5 MB/s) - ‘GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz’ saved [6755398/6755398]
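
If you run the `wget` cell a second time, wget does not overwrite the existing file. By default it saves another copy with a `.1` suffix. One way to avoid piling up copies is to download only when the file is missing. This is a sketch, and the function name `fetch_if_missing` is made up for illustration.

```BASH
# fetch_if_missing URL
# Download URL with wget unless a file with the same name already exists
# in the current directory. (Hypothetical helper, not part of the exercise.)
fetch_if_missing() {
    local url="$1"
    local name="${url##*/}"   # strip everything up to the last slash

    if [ -e "$name" ]; then
        echo "$name already exists, skipping download"
    else
        wget "$url"
    fi
}

# Usage, with the variables from this exercise:
# fetch_if_missing "$baseUrl/$datafile"
```

wget's own `-nc` (`--no-clobber`) option has a similar effect without a helper function.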



Now, get the checksum of the file by running the command `md5sum` on `$datafile`.

In [5]:
md5sum $datafile

c081b74001f46055b9f5710be2c67f33  GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz


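**What is md5sum?** It is a command that computes the MD5 *cryptographic hash* of a file: a short, fixed-length signature (32 hexadecimal characters) derived from the file's contents. Comparing that hash against the value the provider published tells you whether your copy matches theirs. A hash used this way to check file integrity is called a "checksum".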
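
A small experiment shows why a hash works as a signature: identical contents give identical hashes, and changing a single character gives a different one. The sketch below works in a temporary directory with made-up file names, so it does not touch the downloaded data.

```BASH
# Work in a throwaway directory so nothing in the exercise is touched.
demoDir=$(mktemp -d)

printf 'honey bee\n' > "$demoDir/a.txt"
printf 'honey bee\n' > "$demoDir/b.txt"   # same content as a.txt
printf 'honey bed\n' > "$demoDir/c.txt"   # one character changed

# md5sum prints "<hash>  <filename>"; cut keeps only the hash.
hashA=$(md5sum "$demoDir/a.txt" | cut -d' ' -f1)
hashB=$(md5sum "$demoDir/b.txt" | cut -d' ' -f1)
hashC=$(md5sum "$demoDir/c.txt" | cut -d' ' -f1)

echo "a: $hashA"
echo "b: $hashB"
echo "c: $hashC"
```

The hashes for `a.txt` and `b.txt` should be identical, while the one for `c.txt` differs.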

**But how do we check if it's right?** We need the checksum file. Looking at the directory listing above, the filename is `md5checksums.txt`. Sometimes it's called `checksum.txt`, `checksum.md5`, or similar.

Set this name, `md5checksums.txt`, to the variable `checksumfile`. 

In [6]:
checksumfile=md5checksums.txt

Now download the file as you did with `wget $baseUrl/$datafile`, but use `$checksumfile` instead of `$datafile`.

In [7]:
wget $baseUrl/$checksumfile

--2020-08-26 17:07:24--  https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/7460/104/GCF_003254395.2_Amel_HAv3.1/md5checksums.txt
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.12, 2607:f220:41e:250::7, 2607:f220:41e:250::12, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.12|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19101 (19K) [text/plain]
Saving to: ‘md5checksums.txt’


2020-08-26 17:07:25 (433 KB/s) - ‘md5checksums.txt’ saved [19101/19101]



In [8]:
head $checksumfile

4e40d7b3a70329dee14aeb2f658564ea  ./Annotation_comparison/GCF_003254395.2_Amel_HAv3.1_compare_prev.gbp.gz
67a2f793585e04aab9ec5d059b9b4130  ./Annotation_comparison/GCF_003254395.2_Amel_HAv3.1_compare_prev.txt.gz
d3b0bf799204e1b34020a1dac1f5686c  ./Evidence_alignments/GCF_003254395.2_Amel_HAv3.1_cross_species_tx_alns.gff.gz
390a28099c7478fcb914efbe2cf3a855  ./Evidence_alignments/GCF_003254395.2_Amel_HAv3.1_same_species_tx_alns.gff.gz
a8349bcc6a5733d23e5d289dea93e720  ./GCF_003254395.2_Amel_HAv3.1_assembly_report.txt
0fad70b254b53b42894b1132a6def686  ./GCF_003254395.2_Amel_HAv3.1_assembly_stats.txt
6e1ec842ae5a24ed53a6d866ac613d17  ./GCF_003254395.2_Amel_HAv3.1_assembly_structure/Primary_Assembly/assembled_chromosomes/AGP/chrLG1.agp.gz
671eb4b73fa352819654beefe17f465a  ./GCF_003254395.2_Amel_HAv3.1_assembly_structure/Primary_Assembly/assembled_chromosomes/AGP/chrLG1.comp.agp.gz
c60d55f350e165e09ce79ec9fbab3851  ./GCF_003254395.2_Amel_HAv3.1_assembly_structure/Primary_Assembly/assembled_c

---
Is the checksum right? How do we tell? The checksum file lists far more than the one file we care about.

Try using `grep` with information you got from the `md5sum` command above. 

It will take the form `grep PATTERN $checksumfile`.

In [9]:
grep c081b74001f46055b9f5710be2c67f33 $checksumfile

c081b74001f46055b9f5710be2c67f33  ./GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz
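
By default `grep` treats PATTERN as a regular expression. That makes no difference for a hash made of letters and digits, but it can for a pattern that contains dots, such as a file name, because `.` matches any character. The sketch below uses a made-up two-line checksum file to show the effect of the `-F` (fixed string) option.

```BASH
# A made-up two-line checksum file, just for this demonstration.
demoSums=$(mktemp)
printf '%s\n' 'aaaa  ./sample.gff.gz' 'bbbb  ./sampleXgffXgz' > "$demoSums"

# Default: "." matches any character, so both lines match.
regexCount=$(grep -c 'sample.gff.gz' "$demoSums")

# -F: the pattern is a fixed string, so only the first line matches.
fixedCount=$(grep -cF 'sample.gff.gz' "$demoSums")

echo "regex matches: $regexCount, fixed-string matches: $fixedCount"
# prints: regex matches: 2, fixed-string matches: 1
```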


**Challenge!** Once you figure out how to get the information with grep, can you run the commands in succession to get a more readable answer?

In [10]:
# md5sum command
md5sum $datafile
# grep command
grep $datafile $checksumfile

c081b74001f46055b9f5710be2c67f33  GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz
c081b74001f46055b9f5710be2c67f33  ./GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz


Did it work? Can you formulate those two commands to use only `$datafile` and `$checksumfile` as the arguments to the commands?
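
Comparing two 32-character strings by eye is error-prone, so you may want the shell to do the comparison. The following is one possible sketch, not the only way. The function name `check_md5` is made up, and it assumes the checksum file lists top-level files as `./NAME`, as in the `head` output above.

```BASH
# check_md5 FILE CHECKSUMFILE
# Succeeds (exit status 0) only if FILE's MD5 hash equals the hash listed
# for "./FILE" in CHECKSUMFILE. (Hypothetical helper for this sketch.)
check_md5() {
    local file="$1" sums="$2"
    local actual expected

    # The first field of md5sum's output is the hash.
    actual=$(md5sum "$file" | awk '{print $1}')

    # Find the line whose second field is exactly "./FILE"; keep its hash.
    expected=$(awk -v f="./$file" '$2 == f {print $1}' "$sums")

    if [ -n "$expected" ] && [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file" >&2
        return 1
    fi
}

# Usage, with the variables from this exercise:
# check_md5 "$datafile" "$checksumfile"
```

GNU `md5sum` also has a `-c` (check) mode that reads lines in the same `hash  filename` format, so `grep -F "$datafile" "$checksumfile" | md5sum -c -` is another route to the same answer.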

Let's use JupyterLab to convert this notebook into a script.

Go to File -> Export Notebook As... -> Export Notebook to Executable Script

This will save it to your computer. If you're still figuring out the directory structure, here is my saved script:
```BASH
baseUrl='https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/7460/104/GCF_003254395.2_Amel_HAv3.1'

datafile="GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz"
echo $datafile
# fully qualified URL
echo $baseUrl/$datafile

wget $baseUrl/$datafile

md5sum $datafile

checksumfile=md5checksums.txt

wget $baseUrl/$checksumfile

head $checksumfile

grep c081b74001f46055b9f5710be2c67f33 $checksumfile

# md5sum command
md5sum $datafile
# grep command
grep $datafile $checksumfile
```
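
One thing to be aware of in an exported script like this: bash keeps going after a command fails, so `md5sum` would still run even if `wget` could not download anything. A common guard is to put `set -e` near the top of a script. The sketch below is a standalone demonstration using the always-failing command `false`, not a change to the script above.

```BASH
# Run two tiny inline scripts: one without and one with "set -e".
# "false" stands in for a command that fails, such as a broken download.
withoutE=$(bash -c 'false; echo "kept going"')
withE=$(bash -c 'set -e; false; echo "kept going"' || true)

echo "without set -e: '$withoutE'"   # prints: without set -e: 'kept going'
echo "with set -e:    '$withE'"      # prints: with set -e:    ''
```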

# Let's use the script we created from the notebook.

---

If you can find that exported script on your home computer, then upload it using the up-arrow icon on the JupyterLab icon menu.

**Now, switch over to a terminal and make sure you are in our current directory.**

You can get the current directory of this notebook via: `pwd`

1. Copy the output from the above command and paste it after the `cd` command *in your terminal*. Remember to leave a space between `cd` and the pasted text.
1. Check to see if you have the script using `ls checksum.sh`
1. If not, copy the text of my finished script below.
1. Type `nano checksum.sh` *in your terminal*, paste the text, and save the file.

*My finished script*
---
```BASH

baseUrl='https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/7460/104/GCF_003254395.2_Amel_HAv3.1'

datafile="GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz"
echo $datafile
# fully qualified URL
echo $baseUrl/$datafile

wget $baseUrl/$datafile

md5sum $datafile

checksumfile=md5checksums.txt

wget $baseUrl/$checksumfile

head $checksumfile

grep $datafile $checksumfile

# md5sum command
md5sum $datafile
# grep command
grep $datafile $checksumfile

```

See how we made a script from the Jupyter notebook? Do we run it?
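
A script file like this is typically run by handing it to `bash`. The sketch below shows the idea with a tiny stand-in script (`hello.sh`, a made-up name) written to a temporary directory, so it makes no network requests and leaves `checksum.sh` alone.

```BASH
# Create a throwaway directory and a two-line stand-in script.
runDir=$(mktemp -d)
cat > "$runDir/hello.sh" <<'EOF'
name="honey bee"
echo "hello from a script, $name"
EOF

# Hand the file to bash; it runs the lines in order, top to bottom.
scriptOutput=$(bash "$runDir/hello.sh")
echo "$scriptOutput"   # prints: hello from a script, honey bee
```

For the exercise, the equivalent would be `bash checksum.sh`, run in the terminal from the directory that contains the script.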