# Chapter 2. Setting Up and Managing a Bioinformatics Project
---

This is what a typical bioinformatics project directory would look like:

In [1]:
%%bash
cd ../test-project/
tree .

bash: line 1: cd: ../test-project/: No such file or directory


[01;34m.[0m
├── bioinfx_skills.ipynb
├── chapter01.ipynb
├── chapter02.ipynb
├── foundations_of_git.html
└── [01;34mimages[0m
    ├── [01;35mawk_comp_logic.png[0m
    ├── [01;35mawk_funcs.png[0m
    ├── [01;35mcommit_chain.png[0m
    ├── [01;35mdiff_output.png[0m
    ├── [01;35mfindOverlaps.png[0m
    └── [01;35mgrep_speed.png[0m

1 directory, 10 files


A sensible project layout scheme follows these conventions:
- *data/* contains all raw and intermediate data.
- General project-wide scripts go in a *scripts/* directory. If scripts contain many files (e.g., multiple Python modules), they should reside in their own subdirectory.
- The *analysis/* directory contains many smaller analyses, for example the quality of your raw sequences, the aligner output, and the final data that will produce figures and tables for a paper.
- Use relative paths to refer to other files, because absolute paths make your work less portable between collaborators and decrease reproducibility.

### Documentation
We can keep our documentation in a plain-text *README* file because plain-text files:

- Can easily be read, searched, and edited directly from the command line.
- Can be put under version control.
- Are a future-proof format; plain-text files written in the 1960s are still readable today.

What exactly do we need to document?

- Methods and workflows
    - Full command lines that are run through the shell that generate data or intermediate results.
    - When using scripts, include their command-line options.
- Origin of all data in our project directory
    - Keep track of where data was downloaded from, who gave it to you, and any other relevant information.
- When data was downloaded
    - A script that downloads data directly from a database might produce different results if rerun after the external database is updated. Consequently, it’s important to document when data came into your repository.
- Data version information
    - Many databases have explicit release numbers, version numbers, or names (e.g., TAIR10 version of genome annotation for Arabidopsis thaliana, or Wormbase release WS231 for Caenorhabditis elegans). It’s important to record all version information in your documentation, including minor version numbers.
- How data was downloaded
    - Was MySQL used to download a set of genes? Or the UCSC Genome Browser? These details can be useful in tracking down issues, such as data differing between collaborators.
- The versions of the software that was run

A good approach is to keep *README* files in each of your project’s main directories. These *README* files don’t necessarily need to be lengthy, but they should at the very least explain what’s in this directory and how it got there. For example, a *data/README* file would contain metadata about your data files in the *data/* directory.
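
The items above can be captured at download time rather than reconstructed later. The following is a minimal sketch, assuming Bash; the URL and file name are hypothetical placeholders (they do not come from this project), and the `curl` line is commented out so the sketch runs offline:

```bash
# Hypothetical values for illustration only.
url="https://example.org/annotation/genes_v10.gff.gz"
outfile="genes_v10.gff.gz"

workdir=$(mktemp -d)          # stand-in for the project root
mkdir -p "$workdir/data"
cd "$workdir/data"

# curl -L -o "$outfile" "$url"   # the actual download (commented out here)

# Append what was downloaded, from where, when, and in which shell.
{
    echo "File: $outfile"
    echo "Source: $url"
    echo "Downloaded: $(date +%Y-%m-%d)"
    echo "Shell: $(bash --version | head -n 1)"
} >> README
cat README
```

Appending with `>>` keeps earlier entries, so the *README* accumulates a history of every download.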

#### Shell Expansion Tips
*Brace expansion* creates strings by expanding out the comma-separated values inside the braces. Using brace expansion, we can create the *test-project* directory with one command (note that there must be no spaces after the commas, or the shell will not expand the braces):

```bash
mkdir -p test-project/{analysis,scripts,data/seqs}
```
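
Before running a command that uses brace expansion, it can help to preview the expanded words with `echo`. A small sketch, assuming Bash (brace expansion is not part of POSIX `sh`); the scratch directory from `mktemp -d` is used here only so the example does not touch a real project:

```bash
# Preview the expansion before creating anything.
echo test-project/{analysis,scripts,data/seqs}

# Create the layout inside a scratch directory and list the result.
workdir=$(mktemp -d)
mkdir -p "$workdir"/test-project/{analysis,scripts,data/seqs}
find "$workdir/test-project" -type d | sort
```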

Let's create some fake sequencing files and play with them:

In [2]:
%%bash
cd ../test-project/data
touch seqs/zmays{A,B,C}_R{1,2}.fastq
ls -l seqs

bash: line 1: cd: ../test-project/data: No such file or directory
touch: cannot touch 'seqs/zmaysA_R1.fastq': No such file or directory
touch: cannot touch 'seqs/zmaysA_R2.fastq': No such file or directory
touch: cannot touch 'seqs/zmaysB_R1.fastq': No such file or directory
touch: cannot touch 'seqs/zmaysB_R2.fastq': No such file or directory
touch: cannot touch 'seqs/zmaysC_R1.fastq': No such file or directory
touch: cannot touch 'seqs/zmaysC_R2.fastq': No such file or directory
ls: cannot access 'seqs': No such file or directory


In [3]:
%%bash
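# We can use character classes such as [AB] (or ranges such as [A-C]) to select certain files
cd ../test-project/data/seqs/
ls -l zmays[AB]_R{1,2}.fastq
# Note that a character range matches a single character, so it cannot express
# multi-digit numbers (e.g., [10-13]). Instead, use brace expansion such as {10..13}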

bash: line 2: cd: ../test-project/data/seqs/: No such file or directory
ls: cannot access 'zmays[AB]_R1.fastq': No such file or directory
ls: cannot access 'zmays[AB]_R2.fastq': No such file or directory


In general, it’s best to be as restrictive as possible with wildcards to minimize matches with unintended files.
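
To illustrate, here is a sketch comparing a loose and a restrictive pattern, assuming Bash. The file `zmaysB_notes.txt` is a hypothetical stray file added only to show an unintended match:

```bash
# Scratch directory with the example FASTQ files plus one unrelated file.
workdir=$(mktemp -d)
cd "$workdir"
touch zmays{A,B,C}_R{1,2}.fastq zmaysB_notes.txt

# Too permissive: also matches zmaysB_notes.txt.
loose=(zmaysB*)

# More restrictive: only the paired-end FASTQ files for line B.
strict=(zmaysB_R[12].fastq)

echo "zmaysB* matches ${#loose[@]} files: ${loose[*]}"
echo "zmaysB_R[12].fastq matches ${#strict[@]} files: ${strict[*]}"
```

Storing the matches in an array makes it easy to check how many files a pattern selects before passing it to a command such as `rm`.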

#### Leading Zeros and Sorting
Another useful trick is to use leading zeros (e.g., file-0021.txt rather than file-21.txt) when naming files. This is useful because lexicographically sorting files (as ls does) leads to the correct ordering.
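
A short sketch of the difference, assuming Bash; the `file-` and `sample-` names are made up for illustration, and `printf` with the `%03d` format supplies the zero-padding:

```bash
workdir=$(mktemp -d)
cd "$workdir"

# Without padding, file-10.txt sorts before file-2.txt.
touch file-{1,2,10}.txt
ls file-*.txt

# With zero-padding, lexicographic order matches numeric order.
for i in 1 2 10; do
    touch "$(printf 'sample-%03d.txt' "$i")"
done
ls sample-*.txt
```

Choose a padding width that covers the largest number you expect, since renaming files later is more error-prone than padding generously up front.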

### Markdown Formatting Basics

Here is a basic Markdown document illustrating the format:
```
# *Zea Mays* SNP Calling

We sequenced three lines of *zea mays*, using paired-end
sequencing. This sequencing was done by our sequencing core and we
received the data on 2013-05-10. Each variety should have **two**
sequences files, with suffixes `_R1.fastq` and `_R2.fastq`, indicating
which member of the pair it is.

## Sequencing Files

All raw FASTQ sequences are in `data/seqs/`:

    $ find data/seqs -name "*.fastq"
    data/seqs/zmaysA_R1.fastq
    data/seqs/zmaysA_R2.fastq
    data/seqs/zmaysB_R1.fastq
    data/seqs/zmaysB_R2.fastq
    data/seqs/zmaysC_R1.fastq
    data/seqs/zmaysC_R2.fastq

## Quality Control Steps

After the sequencing data was received, our first stage of analysis
was to ensure the sequences were high quality. We ran each of the
three lines' two paired-end FASTQ files through a quality diagnostic
and control pipeline. Our planned pipeline is:

1. Create base quality diagnostic graphs.
2. Check reads for adapter sequences.
3. Trim adapter sequences.
4. Trim poor quality bases.

Recommended trimming programs:

- Trimmomatic
- Scythe
```

Let's create a Markdown file named *notebook.md*, copy the text above into it, and then convert the Markdown file to an HTML file using Pandoc. Pandoc can convert between a variety of different markup and output formats:

In [4]:
%%bash
cd ../test-project/
# Write the text to notebook.md. The quoted heredoc delimiter ('EOF') keeps
# backticks, quotes, and $ in the text from being interpreted by the shell.
cat > notebook.md << 'EOF'
# *Zea Mays* SNP Calling

We sequenced three lines of *zea mays*, using paired-end
sequencing. This sequencing was done by our sequencing core and we
received the data on 2013-05-10. Each variety should have **two**
sequences files, with suffixes `_R1.fastq` and `_R2.fastq`, indicating
which member of the pair it is.

## Sequencing Files

All raw FASTQ sequences are in `data/seqs/`:

    $ find data/seqs -name "*.fastq"
    data/seqs/zmaysA_R1.fastq
    data/seqs/zmaysA_R2.fastq
    data/seqs/zmaysB_R1.fastq
    data/seqs/zmaysB_R2.fastq
    data/seqs/zmaysC_R1.fastq
    data/seqs/zmaysC_R2.fastq

## Quality Control Steps

After the sequencing data was received, our first stage of analysis
was to ensure the sequences were high quality. We ran each of the
three lines' two paired-end FASTQ files through a quality diagnostic
and control pipeline. Our planned pipeline is:

1. Create base quality diagnostic graphs.
2. Check reads for adapter sequences.
3. Trim adapter sequences.
4. Trim poor quality bases.

Recommended trimming programs:

- Trimmomatic
- Scythe
EOF

bash: line 1: cd: ../test-project/: No such file or directory


In [None]:
%%bash
cd ../test-project/
# Convert the Markdown file to HTML using Pandoc, then convert the HTML to PDF
pandoc --from markdown --to html notebook.md > output.html
pandoc --from html --to pdf output.html --output output.pdf