## Chapter 6. Bioinformatics data

### Retrieving Bioinformatics Data

#### Downloading Data with wget and curl

**wget:**
`wget` is useful for quickly downloading a file from the command line — for example, human chromosome 22 from the GRCh37 (also known as hg19) assembly version:

```bash
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz
```

- In general, FTP is preferable to HTTP for large files (and is often recommended by websites like the UCSC Genome Browser).
- One of `wget`’s strengths is that it can download data recursively. With the `--recursive` (`-r`) option, `wget` also follows and downloads the pages linked from the starting page, and then the links on those pages. By default it goes up to five links deep; the depth can be changed with the `--level` (`-l`) option.
- We can restrict `wget` to files that match a pattern with the `--accept` option, and keep it from ascending into parent directories with the `--no-parent` option. See the example below:

```bash
wget --accept "*.gtf" --no-directories --no-parent --recursive \
  http://genomics.someuniversity.edu/labsite/annotation.html
```
- The remote host may block your IP address if you download too much too quickly, so be careful with the `--recursive` option. To stay within a server's download limits, throttle the download speed with the `--limit-rate` option.
- Here are some useful options for `wget` command:

| Option | Values | Use |
|--------|--------|-----|
| `-A`, `--accept` | Either a suffix like “.fastq” or a pattern with `*`, `?`, or `[` and `]`; optionally a comma-separated list | Only download files matching these criteria. |
| `-R`, `--reject` | Same as with `--accept` | Don’t download files matching this; for example, to download all files on a page except Excel files, use `--reject ".xls"`. |
| `-nd`, `--no-directories` | No value | Don’t place locally downloaded files in the same directory hierarchy as the remote files. |
| `-r`, `--recursive` | No value | Follow and download links on a page, to a maximum depth of five by default. |
| `-np`, `--no-parent` | No value | Don’t move above the parent directory. |
| `--limit-rate` | Number of bytes to allow per second | Throttle download bandwidth. |
| `--user=user` | FTP or HTTP username | Username for HTTP or FTP authentication. |
| `--ask-password` | No value | Prompt for a password for HTTP or FTP authentication; `--password=` could also be used, but then your password is in your shell’s history. |
| `-O` | Output filename | Download the file to the filename specified; useful if the link doesn’t have an informative name (e.g., http://lims.sequencingcenter.com/seqs.html?id=sample_A_03). |


**curl:**
`curl` behaves similarly to `wget`, but by default it writes the file to standard output. To download chromosome 22 as we did with `wget`, we’d use:

```bash
curl http://[...]/goldenPath/hg19/chromosomes/chr22.fa.gz > chr22.fa.gz
```

- We can set the download file name with the `-o <file_name>` option. With `-O` (uppercase, no argument), `curl` instead saves the file under its remote name.
- It can transfer files using more protocols than `wget`, such as `SFTP` (secure `FTP`) and `SCP` (secure copy).
- curl is also a library: in addition to the command-line program, its functionality is wrapped by software libraries like `RCurl` and `pycurl`.

**Note:** `wget` and `curl` are generally used to download individual files from a web server. When you use them on a GitHub URL, it's important to get the <u>raw file URL</u>, not the standard file or directory page URL. The standard URL for a file on GitHub leads to an HTML page with the file contents, not the file itself.

**Rsync and Secure Copy (scp):**

While `wget` and `curl` are good for quickly downloading files from the command line, `rsync` is a better fit for heavy-duty file transfers:
- Often faster because it only sends the difference between file versions.
- Can compress files during transfers.
- Has an archive option that preserves links, modification timestamps, permissions, ownership, and other file attributes.
- The basic rsync command is `rsync source destination`, where `source` is the source of the files or directories you’d like to copy, and `destination` is the destination you’d like to copy these files to. Either `source` or `destination` can be a remote host specified in the format `user@host:/path/to/directory/`.
- For a remote source or destination, the connection protocol (such as `ssh`) is specified with the `-e` option. For example:

```bash
rsync -azv -e ssh source_dir/ user@host:path/to/dir
```
In the command above, `-a` enables archive mode, `-z` enables compression during transfer, and `-v` enables verbose output. You can omit `-e ssh` if you connect to a host through an SSH host alias (recent versions of rsync also default to `ssh` as the remote shell), as below:

```bash
rsync -azv source_dir/ host_alias:path/to/dir
```

**Trailing slashes in rsync:**
Trailing slashes (e.g., `data/` versus `data`) are meaningful when specifying paths in rsync. A trailing slash in the source path means copy the contents of the source directory, whereas no trailing slash means copy the entire directory itself.

- rsync only transmits files if they don’t exist at the destination or they’ve changed. Rerunning rsync will ensure everything is synchronized between the two directories.
- If we are not transferring large files between source and destination, rsync is overkill. Instead, we can use `scp`, as below:

```bash
# -r is needed to copy a directory (and its contents) rather than a single file
scp -r source_dir/ user@host:path/to/dir
```

### Data Integrity

#### Checksums
- Used for checking the integrity of transferred data.
- Are very compact summaries of data that change even when a single bit of the data changes.
- Are also helpful for keeping track of data versions, such as intermediate data.
- Facilitate reproducibility.
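
The sensitivity to small changes can be seen with a minimal sketch. It uses `md5sum` (introduced in the next subsection, and assumed here to be installed) on two made-up sequences that differ in a single character:

```bash
# Two hypothetical inputs that differ only in the last base.
sum_a=$(printf 'ACGTACGT\n' | md5sum | cut -d ' ' -f 1)
sum_b=$(printf 'ACGTACGA\n' | md5sum | cut -d ' ' -f 1)

echo "$sum_a"
echo "$sum_b"

# A one-character change is enough to yield a different checksum.
if [ "$sum_a" != "$sum_b" ]; then
    echo "checksums differ"
fi
```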

#### SHA and MD5 Checksums
- SHA-1 and MD5 are the two most common checksum algorithms. For example, Git commit IDs are SHA-1 checksums.
- SHA-1 is newer than MD5 and is generally preferred, but MD5 is still more common.
- We can use `md5sum` or `shasum` to generate checksums for files.

```bash
echo "bioinformatics is fun" | md5sum  # using md5sum
echo "bioinformatics is fun" | shasum  # using shasum
```
- Checksums are reported in hexadecimal format, where each digit can be one of 16 characters: the digits 0 through 9 and the letters a, b, c, d, e, and f.
- The trailing dash in the output indicates that the checksum was computed from <u>standard in</u>; when a file is given as input, the filename appears in place of the dash.
- When we have many files, `shasum` can create, and validate against, a file containing the checksums of those files. For example, for the FASTQ files in the `data/` directory, we can run:

```bash
shasum data/*fastq > fastq_checksums.sha
```

- We can use shasum’s check option (`-c`) to validate that these files match the original versions:

```bash
shasum -c fastq_checksums.sha
```

- Some servers use an antiquated checksum implementation such as `sum` or `cksum`.
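
The create-then-validate workflow can be tried end to end on throwaway data. This is only a sketch: it assumes GNU coreutils' `sha1sum` is installed and uses it in place of `shasum` (both compute SHA-1 by default and accept `-c`), and the FASTQ file names and contents are made up.

```bash
# Work in a temporary directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir"
mkdir data
printf '@read1\nACGT\n+\nIIII\n' > data/sample_a.fastq
printf '@read2\nTTGA\n+\nIIII\n' > data/sample_b.fastq

# Record one checksum line per file.
sha1sum data/*.fastq > fastq_checksums.sha

# Validate: every file is reported as OK.
sha1sum -c fastq_checksums.sha

# Change a single base in one file and validate again.
printf '@read1\nACGA\n+\nIIII\n' > data/sample_a.fastq
if sha1sum -c fastq_checksums.sha > check.log 2>&1; then
    status="unchanged"
else
    status="modified"
fi
cat check.log
echo "status: $status"
```

The check exits with a nonzero status as soon as any file fails, which makes it easy to use as a guard at the top of an analysis script.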

### Looking at Differences Between Data

#### Unix tool `diff`

- Unix’s `diff` works line by line, and outputs blocks (called hunks) that differ between files.

```bash
diff -u file1 file2
```
The `-u` option selects unified format, the same style of output that `git diff` produces.

Here is the breakdown of `diff`'s output (in unified format):

<img src="images/diff_output.png" width=500>

1. These two lines are the header of the unified diff. The original file `gene-1.bed` is prefixed by `---`, and the modified file `gene-2.bed` is prefixed by `+++`. The date and time in these two lines are the modification times of these files.
2. This line indicates the start of a changed hunk. The pairs of integers between `@@` and `@@` indicate where the hunk begins, and how long it is, in the original file `(-1,22)` and modified file `(+1,19)`, respectively.
3. Lines in the diff that begin with a space are unchanged between the original and the modified file.
4. Lines that begin with `+` have been added in the modified file.
5. Similarly, lines that begin with `-` have been removed in the modified file.
6. An adjacent line deletion and line addition indicate that this line was changed in the modified file.
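
To see these elements on a small scale, here is a sketch with two made-up BED files. The names match the figure, but the contents are hypothetical and differ in a single line.

```bash
workdir=$(mktemp -d)
cd "$workdir"

# Three intervals each; only the end coordinate of the second one differs.
printf 'chr1\t100\t200\nchr1\t300\t400\nchr1\t500\t600\n' > gene-1.bed
printf 'chr1\t100\t200\nchr1\t300\t450\nchr1\t500\t600\n' > gene-2.bed

# diff exits with status 1 when the files differ, so "|| true" keeps
# a script running under "set -e" from stopping here.
diff -u gene-1.bed gene-2.bed > genes.diff || true
cat genes.diff
```

The output contains one hunk headed by `@@ -1,3 +1,3 @@`, with the changed interval shown as a `-` line followed by a `+` line and the two unchanged intervals prefixed by a space.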

- `diff`’s output can also be redirected to a file, which creates a *patch file*.
- A patch file acts as a set of instructions for updating a plain-text file with the changes contained in the diff.
- The Unix tool `patch` applies the changes in a patch file to the file that needs to be patched.
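
A minimal sketch of that round trip, assuming the `patch` utility is installed; the file names and contents are hypothetical:

```bash
workdir=$(mktemp -d)
cd "$workdir"
printf 'chr1\t100\t200\nchr1\t300\t400\n' > gene-1.bed
printf 'chr1\t100\t200\nchr1\t300\t450\n' > gene-2.bed

# Save the differences as a patch file (diff exits 1 when files differ).
diff -u gene-1.bed gene-2.bed > gene.patch || true

# Apply the patch to the original file, updating it in place.
patch gene-1.bed < gene.patch

# After patching, the two files are byte-for-byte identical.
if cmp -s gene-1.bed gene-2.bed; then
    patched="yes"
else
    patched="no"
fi
echo "patched: $patched"
```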

### Compressing Data and Working with Compressed Data

- Using pipes and redirection, we can stream compressed data and write compressed files directly to the disk.
- Common Unix tools like `cat`, `grep`, and `less` all have variants that work with compressed data.
- Python’s `gzip` module allows us to read and write compressed data from within Python.

#### gzip vs bzip2

| Feature | Gzip | Bzip2 |
| :--- | :--- | :--- |
| **Speed** | Faster compression and decompression | Slower compression and decompression |
| **Compression Ratio**| Good | Better (smaller file sizes) |
| **Common Use** | Bioinformatics, general-purpose compression | Long-term data archiving |
| **Tools** | `gzip`, `gunzip` | `bzip2`, `bunzip2` |

**Difference between `gzip` and `gunzip`**

While `gzip` (which stands for "GNU zip") compresses files and adds a `.gz` extension, `gunzip` reverses the process to restore the original file. On most Unix-like systems, `gunzip` is equivalent to running `gzip -d` (the decompress flag).

Here is an example of using `gzip` to compress the output of an imaginary program `trimmer`:

```bash
trimmer in.fastq.gz | gzip > out.fastq.gz
```

- `gzip` can also compress files on disk in place: `gzip file` replaces `file` with `file.gz`.
- A nice feature of the gzip compression algorithm is that you can concatenate gzip compressed output directly to an existing gzip file.

```bash
cat *.gz > combined.gz
```

- If you need to compress multiple separate files into a single archive, use the `tar` utility.
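
Both points can be checked on toy data. In this sketch the FASTA file names and sequences are made up: concatenated gzip files decompress to the concatenated contents, whereas `tar` keeps the files separate inside one archive.

```bash
workdir=$(mktemp -d)
cd "$workdir"
printf '>seq1\nACGT\n' > a.fa
printf '>seq2\nGGCC\n' > b.fa

# gzip -c writes to standard output and leaves the originals in place.
gzip -c a.fa > a.fa.gz
gzip -c b.fa > b.fa.gz

# Concatenating the compressed files gives a valid gzip file.
cat a.fa.gz b.fa.gz > combined.fa.gz
combined=$(gzip -dc combined.fa.gz)
expected=$(cat a.fa b.fa)
if [ "$combined" = "$expected" ]; then
    echo "concatenated archive decompresses to both files' contents"
fi

# tar bundles the files as separate members of one compressed archive.
tar -czf sequences.tar.gz a.fa b.fa
n_members=$(tar -tzf sequences.tar.gz | grep -c .)
echo "archive members: $n_members"
```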

#### Working with Gzipped Compressed Files

- Many common Unix and bioinformatics tools can work directly with compressed files.
- If programs cannot handle compressed input, you can use `zcat` and pipe output directly to the standard input of another program.
- There can be a slight performance cost in working with gzipped files, as your CPU must decompress input first. Usually, the convenience of z-tools like `zgrep`, `zless`, and `zcat` and the saved disk space outweigh any potential performance hits.
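
As a small sketch of the piping pattern, `wc` stands in below for a program that only reads uncompressed input, and `gzip -dc` does the same job as `zcat`. The FASTQ file name and records are hypothetical.

```bash
workdir=$(mktemp -d)
cd "$workdir"

# Write two FASTQ records straight into a compressed file.
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip > reads.fastq.gz

# Decompress on the fly and pipe into the downstream program.
line_count=$(gzip -dc reads.fastq.gz | wc -l)

# Each FASTQ record spans four lines.
read_count=$((line_count / 4))
echo "reads: $read_count"
```

No uncompressed copy of the file is ever written to disk; the data is decompressed only as it streams through the pipe.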

Here's a table summarizing the GNU tools and their compressed file counterparts.

| Command | Counterpart for `.gz` files | Counterpart for `.bz2` files | Counterpart for `.xz` files |
| :--- | :--- | :--- | :--- |
| **`cat`** | `zcat` | `bzcat` | `xzcat` |
| **`grep`** | `zgrep` | `bzgrep` | `xzgrep` |
| **`less`** | `zless` | `bzless` | `xzless` |
| **`diff`** | `zdiff` | `bzdiff` | `xzdiff` |

---
**Key Takeaways**

The **`z`**, **`bz`**, and **`xz`** prefixes are a common Unix convention for handling compressed files directly. These tools work by decompressing the input on the fly and then performing their standard function, like searching (`grep`), viewing (`less`), or comparing (`diff`). This saves you a step by eliminating the need to manually decompress the file before using the tool.

### Case Study: Reproducibly Downloading Data

- Genome releases are coordinated through the Genome Reference Consortium. The “GRC” prefix in `GRCm39` refers to the **Genome Reference Consortium**.

- Downloaded mouse genome GRCm39 and checksums from Ensembl's FTP site:

```bash
# -A takes one comma-separated list; -N only re-downloads files newer than the local copy
wget -A "*.fa.gz,CHECKSUMS" -np -nd -r -N ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/dna_index/
```

- Ran a quick sanity check to see if the file contains chromosomes, contigs, and scaffolds:

```bash
zgrep "^>" Mus_musculus.GRCm39.dna.toplevel.fa.gz | less
```
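
The same kind of header check can be rehearsed on a toy file before running it on the full genome. In this sketch the file name and the three records are made up, with header lines that only loosely imitate Ensembl's style, and `gzip -dc ... | grep` is used as a portable stand-in for `zgrep`.

```bash
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical miniature FASTA: two chromosomes and one scaffold.
printf '>1 dna:chromosome\nACGT\n>JH584299.1 dna:scaffold\nGGCC\n>MT dna:chromosome\nTTAA\n' \
    | gzip > toy.fa.gz

# Tally header lines by the sequence type in their second field.
gzip -dc toy.fa.gz | grep '^>' | awk '{print $2}' | sort | uniq -c

# Count all headers, and those describing scaffolds.
n_headers=$(gzip -dc toy.fa.gz | grep -c '^>')
n_scaffolds=$(gzip -dc toy.fa.gz | grep -c '^>.*dna:scaffold')
echo "headers: $n_headers, scaffolds: $n_scaffolds"
```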
Contigs and scaffolds differ as follows:

| Feature | Contigs | Scaffolds |
| :--- | :--- | :--- |
| **Structure** | Continuous sequence with no gaps. | Composed of multiple contigs separated by gaps. |
| **Gaps** | No gaps. | Contains gaps of estimated length. |
| **Assembly Level**| The most basic, fundamental assembled unit. | A higher-level structure that links multiple contigs. |
| **Information** | Based on direct overlapping sequence reads. | Uses additional long-range information to order contigs. |