# Workshop 1 - Jupyter notebooks and the bash kernel

This workshop is designed to introduce you to the two core tools that you will be using for the other workshops.  

Before you begin this workshop you should know some very basic unix commands. These are covered in chapters 1-3, 9 and 10 of the [interactive guide](http://bc2023.bioinformatics.guide/lessons/).  Unless you are already familiar with unix it is essential that you read over those chapters before you start (this should take less than 5 minutes). To prepare for later workshops you should also work through all the chapters in the interactive guide, especially chapters 11 and 17-22.

At the end of this workshop you should:

1. Understand what a jupyter notebook is and how it relates to the unix command line
2. Be able to edit text and run unix commands from within a jupyter notebook
3. Know how to assess your learning by using the self-assessment exercises in a jupyter notebook
4. Know how to calculate the extinction coefficient for a short DNA sequence by hand
5. Know how to run a command-line program to calculate the extinction coefficient of any DNA sequence

### Jupyter notebooks

The document you are reading is a jupyter notebook. 

It consists of a series of cells that contain either text or computer code. 

Jupyter notebooks are very useful for bioinformatics because they allow text to be mixed together with code for manipulating data, running programs and creating plots.

### Text cells and Markdown

The cell you are reading is a text cell. Click on it to make it the currently active/selected cell. The active cell will have a thin coloured border around it with a thicker border on the left. If the border is blue the cell is not editable.  

Double click on this cell to make it editable.  

You should see that its border turns green.  You should also see that its content changes to plain text in [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Here-Cheatsheet) format.  Markdown is a way of writing documentation that is very simple but still allows some basic styling (headers, links, images, code, bold, italics, equations, quotes).

### Code cells and the Bash kernel

The text you type into code cells should consist of valid commands that can be interpreted by the notebook's `kernel`. A notebook's `kernel` is the engine it uses to evaluate code cells.  This notebook is running the [Bash](https://en.wikipedia.org/wiki/Bash_%28Unix_shell%29) kernel. This means that when you run code cells they will be interpreted as if you typed the same text at the unix command prompt.

Jupyter [notebooks support many types of kernels](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels) including `Python`, `R` and `Bash` which are particularly useful for bioinformatics.

**Note:** You can tell which kernel a notebook is running by looking at the kernel indicator in the top right corner. 
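
As a rough illustration of what the Bash kernel does with a code cell, the sketch below shows a cell containing several lines. Each line is run in order, just as if it had been typed at the command prompt. The variable name `greeting` is made up for this example.

```bash
# Commands in a code cell run from top to bottom.
greeting="Hello from the Bash kernel"   # store some text in a shell variable
echo "$greeting"                        # print the stored text
echo "Bash version: $BASH_VERSION"      # BASH_VERSION is set automatically by bash
```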

### Running cells


The notebook will not actually run your cells until you tell it to.  You can do this by first selecting the cell and then using the menu to select Cell -> Run Cells. 

The cell immediately below this one is a code cell. 

The `ls` command in this cell should be familiar to you. Try running it.

Try double-clicking on a text cell to set it into edit mode.  Then run the text cell.  When text cells are run they aren't evaluated by the `kernel` but are rendered for display in your web browser.

In [10]:
ls --help

Usage: ls [OPTION]... [FILE]...
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.

Mandatory arguments to long options are mandatory for short options too.
  -a, --all                  do not ignore entries starting with .
  -A, --almost-all           do not list implied . and ..
      --author               with -l, print the author of each file
  -b, --escape               print C-style escapes for nongraphic characters
      --block-size=SIZE      scale sizes by SIZE before printing them; e.g.,
                               '--block-size=M' prints sizes in units of
                               1,048,576 bytes; see SIZE format below
  -B, --ignore-backups       do not list implied entries ending with ~
  -c                         with -lt: sort by, and show, ctime (time of last
                               modification of file status information);
                               with -l:

# IMPORTANT

> ## Run the Setup Code 

In order for this notebook to work properly you need to **run the cell below before doing anything else**. This will load custom functions and settings required to make the self assessment exercises work. 

If you restart your kernel you will also need to rerun the setup code.

> ## Don't use the `cd` command 

The answers to all self assessment exercises assume that you don't change your directory from the default.  You shouldn't ever need to use the `cd` command to answer an exercise.


In [37]:
# Essential Setup Code : Must be run first.
wget 'https://www.dropbox.com/s/zqgacjshllprdcc/setup.sh?dl=0' -O setup.sh
source ./setup.sh


--2022-08-02 12:12:32--  https://www.dropbox.com/s/zqgacjshllprdcc/setup.sh?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.83.18, 2620:100:6033:18::a27d:5312
Connecting to www.dropbox.com (www.dropbox.com)|162.125.83.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/zqgacjshllprdcc/setup.sh [following]
--2022-08-02 12:12:33--  https://www.dropbox.com/s/raw/zqgacjshllprdcc/setup.sh
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc1e2b6f3c5c898f5bb922988d29.dl.dropboxusercontent.com/cd/0/inline/BqMxl4niPd556ztaYI_cHtqL090668QphyTsL5-4e-qQ86KjOzG1xF4AkJB__MEHC3hvFJcv-UQw6w4Ffu1QMvcVDsZ7ns1E8PYcuzc-zKdN40G93lbwpDolCIJt6mE-igdxCKFByOi9XZZfhkfpirLAEqGoSStIAiFv51MbwTc7lw/file# [following]
--2022-08-02 12:12:33--  https://uc1e2b6f3c5c898f5bb922988d29.dl.dropboxusercontent.com/cd/0/inline/BqMxl4niPd556ztaYI_cHtqL090668QphyTsL5-4e-qQ86KjOzG1xF4AkJB__MEHC3hvFJcv

### Keyboard shortcuts

Using the mouse all the time to run cells can be very tedious.  To save time, select the cell and press (command)-(enter). Depending on your keyboard this combination may be slightly different (e.g. (control)-(return) on a Mac).

## Exercise 1

**Your task:** Write a command to list the contents of the current directory

This is deliberately easy (the answer is `ls`) so that you can focus on understanding the self-assessment mechanism. 

Follow these steps for every exercise:

1. Read the text describing the problem and figure out your answer.  Feel free to create new cells to experiment with commands until you get things right. You might also want to use the terminal on [the tutorial site](http://bc2023.bioinformatics.guide/lessons/).
2. Enter your code into the answer cell. The answer cell contains a blank space for your answer, but it is important that you don't change the other code in the cell, as in the example below.
![autograded_example](autograded_answer_example.png)
3. Be sure to run your answer cell.  This will make your answer accessible to the test cell
4. Run the test cell to check your answer. The test cell is locked and always comes immediately after the answer cell. 

In [3]:
e1_answer(){
### BEGIN SOLUTION
ls
### END SOLUTION
}

In [4]:
test_e1

Your answer is correct


In [1]:
# This code cell is for you to experiment with the ls command (see exercise below)
ls -a -F -l

total 2500
drwxr-xr-x. 4 sci-irc@ad.jcu.edu.au domain users@ad.jcu.edu.au     200 Jul 30 08:58 ./
drwxr-xr-x. 8 sci-irc@ad.jcu.edu.au domain users@ad.jcu.edu.au     140 Jul 30 09:14 ../
-rw-r--r--. 1 sci-irc@ad.jcu.edu.au domain users@ad.jcu.edu.au   26858 Jul 30 09:14 autograded_answer_example.png
-rw-r--r--. 1 sci-irc@ad.jcu.edu.au domain users@ad.jcu.edu.au   26750 Jul 30 09:14 blast_gs.png
drwxr-xr-x. 2 sci-irc@ad.jcu.edu.au domain users@ad.jcu.edu.au      42 Jul 30 08:58 blast_input.graffle/
-rw-r--r--. 1 sci-irc@ad.jcu.edu.au domain users@ad.jcu.edu.au 2307914 Jul 30 09:14 blast_input.png
-rw-r--r--. 1 sci-irc@ad.jcu.edu.au domain users@ad.jcu.edu.au  158460 Jul 30 09:14 blastx.png
drwxr-xr-x. 2 sci-irc@ad.jcu.edu.au domain users@ad.jcu.edu.au      44 Jul 30 08:58 .ipynb_checkpoints/
-rw-r--r--. 1 sci-irc@ad.jcu.edu.au domain users@ad.jcu.edu.au   25609 Jul 30 09:14 jupyter_intro.ipynb
-rw-r--r--. 1 sci-irc@ad.jcu.edu.au domain users@ad.jcu.edu.au    1404 Jul 30 09:14 setup.sh


### Extending the `ls` command

Use the code cell above and try various optional arguments to the `ls` command. Eg.

```bash
ls -F
ls -1
ls -a
ls -R
ls -S
```

Now try printing the *help* text for the `ls` command

```bash
ls --help
```

Search through the help and look for each of the options in the commands above. Use the description for each option to understand the output you see when you run each command.

**Note:** Another way to bring up the *help* is the `man` command, but unfortunately this doesn't work well in a jupyter notebook.
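
Single-letter options can usually be combined behind one dash, so the command in the experiment cell above could also be written more compactly. A minimal sketch:

```bash
# These two commands are equivalent: single-letter options can share one dash.
ls -a -F -l
ls -aFl
```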



## Exercise 2

**Your task:** Write a command to list the contents of your current directory (not including hidden files) in reverse order



In [6]:
e2_answer(){
### BEGIN SOLUTION
ls -r
### END SOLUTION
}

In [7]:
test_e2

Your answer is correct


## Exercise 3

**Your task:** Write a command to list the contents of your current directory sorted by size in reverse order (smallest first)



In [17]:
e3_answer(){
### BEGIN SOLUTION
ls -rS
### END SOLUTION
}

In [18]:
test_e3

Your answer is correct


### Exercise 4

**Your task:** Write a command to print your current working directory



In [19]:
e4_answer(){
### BEGIN SOLUTION
pwd
### END SOLUTION
}

In [20]:
test_e4

Your answer is correct


# Extinction Coefficients

A widely used method for estimating DNA concentrations is based on optical absorbance at a wavelength of 260 nm. 

Absorbance is given by the Beer-Lambert equation:

$$ A = \epsilon \times c \times p$$

Where $A$ is absorbance, $c$ is the concentration (in mole/L), $p$ is the path length (in cm) and $\epsilon$ is the extinction coefficient (in L/(mole.cm)).  The path length can be easily measured (it is the size of the vessel holding our sample), and absorbance is measured by the spectrophotometer. The remaining variable, $\epsilon$, is a characteristic of the DNA that we are measuring.  For a short oligonucleotide sequence it can vary a lot depending on the sequence of DNA bases (A, C, G, T). 
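
As a worked illustration, the equation can be rearranged to give the concentration from a measured absorbance. The numbers here are made up for the example: an absorbance of 0.49, a path length of 1 cm and an extinction coefficient of 49,000 L/(mole.cm).

$$ c = \frac{A}{\epsilon \times p} = \frac{0.49}{49,000 \times 1} = 1.0 \times 10^{-5} \text{ mole/L}$$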


### Calculating $\epsilon$

Let's consider the oligonucleotide sequence `AGGCT`.  How do we calculate its extinction coefficient? 

A convenient and fairly accurate method for calculating $\epsilon$ is the nearest neighbour method.  Under the nearest neighbour method $\epsilon$ for an oligo (single stranded DNA) of length $N$ nucleotides is calculated according to the formula:

$$ \epsilon = \sum_{i=1}^{N-1} \epsilon_{i,i+1} - \sum_{i=2}^{N-1}\epsilon_{i}$$

where $i$ is the nucleotide position.  Another way to write this is:

$$ \epsilon = \sum_{Nearest Neighbours} \epsilon_{Nearest Neighbour} - \sum_{Inner Bases}\epsilon_{Inner Base}$$

Let's break this down. 

First let's calculate the contribution from the nearest neighbour pairs.  In `AGGCT` the nearest neighbour pairs are `AG`, `GG`, `GC` and `CT` as shown below

![nearest neighbours](nn.png)

The contributions from these neighbour pairs can be obtained from the following lookup table.

5'-3' | dA | dC | dG | dT
----- | ---| ---| -- | --
dA | 27,400 | 21,200 | 25,000 | 22,800 
dC | 21,200 | 14,600 | 18,000 | 15,200
dG | 25,200 | 17,600 | 21,600 | 20,000
dT | 23,400 | 16,200 | 19,000 | 16,800

When looking up values from this table the first nucleotide in the pair denotes the row and the second nucleotide denotes the column. This gives `AG` = 25,000, `GG` = 21,600, `GC` = 17,600 and `CT` = 15,200.

The nearest neighbour contribution is the sum of these which gives:

$$ \sum_{Nearest Neighbours} \epsilon_{Nearest Neighbour} = 25,000 + 21,600 + 17,600 + 15,200 = 79,400$$

To avoid double counting we need to subtract the contributions from individual bases that appear twice in the neighbour pairs. These are the inner bases (all except for the two ends). Contributions from individual bases are:

$$ 
dA = 15,400 \\
dC = 7,400 \\
dG = 11,500 \\
dT = 8,700 
$$

So for the sequence `AGGCT` the individual base contribution that must be subtracted is:

$$
\sum_{Inner Bases}\epsilon_{Inner Base} = 2 \times dG + dC = 30,400
$$


Finally the extinction coefficient for the oligo `AGGCT` is:

$$\epsilon = 79,400 - 30,400 = 49,000$$
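
The hand calculation above can also be sketched as a short Bash function. This is only an illustration under stated assumptions, not the `extinction.py` program used later in this workshop: the function name `nn_extinction` is hypothetical, the values are copied from the lookup table and the individual base contributions above, and it needs bash version 4 or later for associative arrays.

```bash
# Nearest neighbour pair values from the lookup table
# (first base = row, second base = column).
declare -A PAIR=(
  [AA]=27400 [AC]=21200 [AG]=25000 [AT]=22800
  [CA]=21200 [CC]=14600 [CG]=18000 [CT]=15200
  [GA]=25200 [GC]=17600 [GG]=21600 [GT]=20000
  [TA]=23400 [TC]=16200 [TG]=19000 [TT]=16800
)

# Contributions from individual bases.
declare -A BASE=( [A]=15400 [C]=7400 [G]=11500 [T]=8700 )

# nn_extinction is a hypothetical helper name used only in this sketch.
nn_extinction() {
  local seq=${1^^}      # convert the input to upper case
  local n=${#seq}       # sequence length N
  local total=0 i

  # Only accept sequences of at least two bases made up of A, C, G and T.
  if [[ ! $seq =~ ^[ACGT]{2,}$ ]]; then
    echo "expected a DNA sequence of at least 2 bases" >&2
    return 1
  fi

  # Add the value of every overlapping pair of neighbouring bases.
  for (( i = 0; i < n - 1; i++ )); do
    total=$(( total + ${PAIR[${seq:i:2}]} ))
  done

  # Subtract the inner bases (every base except the two ends).
  for (( i = 1; i < n - 1; i++ )); do
    total=$(( total - ${BASE[${seq:i:1}]} ))
  done

  echo "$total"
}

nn_extinction AGGCT
```

Running this prints 49000, which matches the hand calculation for `AGGCT`.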


### Exercise 5

**Your task:** Using the nearest neighbour method (see above), calculate the extinction coefficient for the single-stranded DNA sequence `TGCATA`

Calculate your answer manually (you might need a calculator).  Once you have your answer write it in the answer field below using the echo command.  For example, if your answer is "50000" you would write

```bash
echo 50000
```


In [32]:
e5_answer(){
### BEGIN SOLUTION
echo 61000
### END SOLUTION
}

In [33]:
test_e5

Your answer is correct



While it is not too difficult to calculate extinction coefficients for short sequences it becomes increasingly tedious and error-prone for long sequences. 

This notebook includes a computer program written in [Python](https://en.wikipedia.org/wiki/Python_%28programming_language%29) that can calculate the extinction coefficient for any sequence.  

You can run the program simply by typing its name (`extinction.py`) and then providing the sequence you wish to analyze.  Try it out by calculating extinction coefficients for a few sequences.

```bash
extinction.py GCCGTCGTTACGAGCATACAATGC
```

```bash
extinction.py AATGCTGCGGTGTACGAGATGGGGGCACACG
```

In [36]:
extinction.py AATGCTGCGGTGTACGAGATGGGGGCACACG

305500


### Exercise 6

**Your task:** Use the `extinction.py` program to calculate the extinction coefficient for the sequence `TGCATATGCTATTT`


In [40]:
e6_answer(){
### BEGIN SOLUTION
extinction.py TGCATATGCTATTT
### END SOLUTION
}

In [41]:
test_e6

Your answer is correct


# Optional

Playing with the command line is the best way to learn.  

1. Try the `fortune` command.
```
    fortune
```
Run it a few times
2. Try the `cowsay` command like this
```
    cowsay "keyboard good, mouse bad"
```
3. To be a bit more faithful to [the original](https://en.wikipedia.org/wiki/Animal_Farm) we need to make the following change
```
    cowsay -f sheep "keyboard good, mouse bad" 
```
4. Now try combining the two commands  
```
    fortune | cowsay
```
    > This introduces a new concept, the pipe operator, `|`.
    > A pipe allows the output of one command to be used as input for another.
    > We will cover pipes in more detail in workshop 2.
5. Try out various cows.  You can find more inside the directory `/usr/share/cowsay`.  
6. Read the [cowsay man page](https://linux.die.net/man/1/cowsay) to see if you can change the appearance of cows in other ways. 
7. If you are truly unsatisfied with the default cows you can find more [here](https://github.com/paulkaefer/cowsay-files/tree/master/cows).
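
The pipe operator from step 4 works with any commands, not only `fortune` and `cowsay`. As a small sketch using two standard tools that are not otherwise covered in this workshop (`tr` and `wc`), the output of one command can be passed along for further processing:

```bash
# Swap each base for its complement by piping the sequence into tr.
echo AGGCT | tr 'ACGT' 'TGCA'        # prints TCCGA

# Count the characters in a sequence by piping it into wc.
printf '%s' AGGCT | wc -c            # counts 5 characters
```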