# Bio334 Practical Bioinformatics

The 2nd module, 14-16 May 2025

## Masaomi Hatakeyama
- GitHub https://github.com/masaomi/bio334_2025
- TAs: Narjes Yousefi, Kenji Yip Tong

# Quick Review

1. Double loop for the combination of Lists
2. *for* + *if* for the comparison of two sequences
3. Nucleotide diversity (Advanced exercise yesterday, a milestone today)

# Tips

1. **Deduction**: Drawing specific conclusions logically from general principles or laws.
2. **Induction**: Observing multiple specific instances to identify patterns and generalize conclusions.
3. **Abduction**: Formulating the most plausible hypothesis based on observed data, often involving intuitive leaps or "Aha!" moments.

# Nucleotide Diversity

<img src="https://raw.githubusercontent.com/masaomi/bio334_2025/main/jupyter_notebooks/png/pi.png" width=50%>

- $d $: number of nucleotide differences
- $n $: number of sequences
- $l $: length of sequence


# Nucleotide Diversity, Mean pairwise difference

$\Pi = \frac{\sum_{i<j}d_{ij}}{{}_n \mathrm{C}_2}$

$\pi = \frac{\sum_{i<j}d_{ij}}{{}_n \mathrm{C}_2l} = \frac{\Pi}{l} = \frac{\sum_{i<j}\pi_{ij}}{{}_n \mathrm{C}_2} = \frac{\sum_{i<j}\frac{d_{ij}}{l}}{{}_n \mathrm{C}_2}$

- $\Pi$: mean pairwise difference
- $\pi$: nucleotide diversity
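
The two formulas above can be sketched in Python as follows. This is only a minimal illustration, not the reference solution of the course: the function names are made up for this example, and all sequences are assumed to have the same length $l$.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """d_ij: number of positions where two equal-length sequences differ."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def mean_pairwise_difference(sequences):
    """Capital Pi: sum of d_ij over all pairs i<j, divided by nC2."""
    pairs = list(combinations(sequences, 2))  # every pair with i<j, once
    total = sum(count_differences(a, b) for a, b in pairs)
    return total / len(pairs)


def nucleotide_diversity(sequences):
    """Small pi: mean pairwise difference divided by the sequence length l."""
    return mean_pairwise_difference(sequences) / len(sequences[0])


sequences = ["ATGC", "ATAT", "ATGC"]
print(mean_pairwise_difference(sequences))
print(nucleotide_diversity(sequences))
```

For these three sequences the pairwise differences are 2, 0, and 2, so $\Pi = 4/3$ and $\pi = (4/3)/4 = 1/3$.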




# Today

1. Part1(9-10) File I/O
2. **Part2(10-11) Method, PI** (<font color=gray>*Milestone*</font>)
3. Part3(11-12) Segregating site
4. **Part4(13-15) Tajima's D** (<font color=gray>*Milestone*</font>)
5. Part5(15-17) Batch process, module


## Lectures (<font color=gray>*plan*</font>)

1.  9:00- (10min)
2. 10:00- (10min)
3. 11:00- (10min)
4. 13:00- (10min)
5. 15:00- (10min)


In [None]:
import IPython.display
IPython.display.Audio("voice/day2_part1.mp3")

Today, we will focus on more details and applications. 

The code will become more complicated, but just keep in mind one thing. 

If you get confused by the code, please break it down into smaller parts, think through the process logically step by step, and then integrate the pieces back into a whole. 

Breaking down and integrating. That's the most important thing in both science and programming.

# Why programming?

1. **Reusability**: Reproducibility, you can repeat a calculation by typing just one command
2. **Batch process**: Automation, the computer can work while you are sleeping
3. **Understanding**: It helps you to learn algorithms behind a problem, and it improves your logical thinking ability



In [None]:
IPython.display.Audio("voice/why_programming.mp3")

First of all, I have a question. 

Why are you learning computer programming?

In my opinion, there are three good points you should keep in mind in computer programming regardless of programming language.

Reusability, batch processing, and understanding.

Programming code is reusable.
Once you have written the source code, you can use it again at another time and in another situation, repeatedly. This maintains the reproducibility of the calculation result.

Programming enables batch processing.
A computer program processes many things at once; in other words, many complicated processes are automated.

Writing code improves your understanding.
By implementing an algorithm and its calculation steps, you understand it more concretely. Being able to implement the code means that you understand the process well. 

You can check this after you finish this module by comparing your knowledge before and after the course. Hopefully, you will feel that you have a clearer idea of nucleotide diversity than before.


# Reusability

How can you improve the reusability? (besides copy&paste)

* Generalization of **Process**
    - $\Rightarrow $ Separation of **Data** and **Process**

What does this mean?



In [None]:
IPython.display.Audio("voice/reusability.mp3")

How can you improve the reusability?

The key concept is generalization.

In order to generalize a process in computer programming, you need to clearly separate the data and process. 

What does this mean?

Let's look at the example below.


# Command line argument

Warm-up exercise

    import sys
    print(sys.argv)

    # Result
    # $ python day2_1_example1.py 123 abc
    # ['day2_1_example1.py', '123', 'abc']




- *sys.argv* is a List object
- An argument becomes a String object
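
Because every argument arrives as a String, numbers have to be converted before calculating with them. The following is a small sketch of that point; the script name `sum_args.py` and the function `sum_arguments` are hypothetical, and the argument list is written out by hand so that the example also runs inside a notebook.

```python
import sys


def sum_arguments(argv):
    """Convert every argument after the script name to int and add them up."""
    numbers = [int(arg) for arg in argv[1:]]  # argv[0] is the script name
    return sum(numbers)


# Hypothetical invocation: $ python sum_args.py 123 456
example_argv = ["sum_args.py", "123", "456"]

print(example_argv[1] + example_argv[2])  # Strings are concatenated: 123456
print(sum_arguments(example_argv))        # Integers are added: 579

# In a real script you would call: sum_arguments(sys.argv)
```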



In [None]:
IPython.display.Audio("voice/command_line_argument.mp3")

The first example shows the separation of command-line argument and script.

It is difficult to show you an example in the Jupyter notebook, so please try executing the example in the terminal.

When you execute a Python script, you can pass additional input data to it. This is called a command-line argument.

The external data can be used in the Python script. 

The next example is another separation of data and process.


# File Input

    import sys
    file = open("input.fa")
    for line in file:
        print(line.rstrip())
    file.close()




* *for* statement reads line by line assigning line data to the variable
* *rstrip()* removes line break from the line
* *open()* and *close()* are needed



In [None]:
IPython.display.Audio("voice/load_fasta.mp3")

This example shows how to load a file. 

It loads a text file and simply shows the file contents with the *print()* function.

This is also a separation of data and process. The Python script is the process, and the FASTA file is the input data.

By separating data and process, you can use this Python code many times for other FASTA files without changing the code.

In other words, the reusability is improved by separating data and process.
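
As a small illustration of this idea, the sketch below applies one unchanged process to two different input files. The function name `read_lines` and the two tiny files are made up for this demonstration; the files are written to a temporary folder so that the example is self-contained.

```python
import os
import tempfile


def read_lines(path):
    """Return the lines of a text file without trailing line breaks."""
    with open(path) as file:
        return [line.rstrip() for line in file]


# Two small hypothetical input files, created only for this demonstration.
tmp_dir = tempfile.mkdtemp()
inputs = {
    "first.fa": ">seq1\nATGC\n",
    "second.fa": ">seq2\nATAT\n>seq3\nATGC\n",
}
for name, text in inputs.items():
    path = os.path.join(tmp_dir, name)
    with open(path, "w") as file:
        file.write(text)
    # The same process handles different data without any code change.
    print(name, read_lines(path))
```

Only the data (the file path) changes between the two calls; the process (`read_lines`) stays the same.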


In [None]:
import sys
file = open("input.fa") # input.fa is expected to be uploaded in jupyterhub folder
for line in file:
    print(line.rstrip())
file.close()

In [None]:
# Another example
import sys
with open("input.fa") as file: # input.fa is expected to be uploaded in jupyterhub folder
    for line in file:
        print(line.rstrip())

# Mini-Summary

Command line argument + File input

* We can separate *Data* from *Process*
  * **Data**: input file
  * **Process**: script file

$\Rightarrow $ **Reusability up!!** You do not have to update the source code again.



# Command line argument + File input

    import sys
    file = open(sys.argv[1])
    for line in file:
        print(line.rstrip())
    file.close()

    # command line
    # $ python day2_1_example2.py input.fa


In [None]:
import sys
# Assuming that the script runs as follows:
# $ python day2_1_example2.py input.fa
sys.argv[1] = "input.fa" # This is needed only for jupyter notebook

file = open(sys.argv[1])
for line in file:
    print(line.rstrip())
file.close()


Yesterday, we used sequences written directly in a Python script:

```
seq1 = "ATGC"   # First sequence
seq2 = "ATAT"   # Second sequence
seq3 = "ATGC"   # Third sequence

sequences = [seq1, seq2, seq3]
```



*Writing data directly in the source code like this is called hard-coding.*


# Day2 Part1 Exercise1

Load a FASTA file and count the genome size of *Arabidopsis thaliana*
```
# command example
$ python day2_1_exercise1.py athal_genome.fa
genome size = 119667750 bp
```
* The script is reusable without any changes
* bp: base pair (larger units: Kb, Mb, Gb)
* *athal_genome.fa* is in the data folder (compressed with gzip)
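
A gzip-compressed text file can either be decompressed first or be read directly with the standard `gzip` module. The sketch below shows the second option under some assumptions: the helper name `open_text` and the tiny demo file are hypothetical, and compressed files are assumed to end in `.gz`.

```python
import gzip
import os
import tempfile


def open_text(path):
    """Open a plain or gzip-compressed text file for line-by-line reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")  # "rt" = read as text, not as bytes
    return open(path)


# Hypothetical tiny file standing in for a compressed genome.
path = os.path.join(tempfile.mkdtemp(), "tiny.fa.gz")
with gzip.open(path, "wt") as file:
    file.write(">chr_demo\nATGC\nAT\n")

with open_text(path) as file:
    for line in file:
        print(line.rstrip())
```

The *for* loop is the same as for an uncompressed file, so the rest of a script does not need to know whether its input was compressed.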


In [None]:
IPython.display.Audio("voice/day2_part1_exercise.mp3")

The first exercise today is just loading a FASTA file. 

The FASTA file contains a lot of nucleotide sequences, actually more than two.

If you do not know what a FASTA file is, please look at the explanation below.

# FASTA format

Nucleotides (*Arabidopsis thaliana*)

	>AT1G51370.2 | F-box/RNI-like/FBD-like domains-containing protein
	ATGGTGGGTGGCAAGAAGAAAACCAAGATATGTGACAAAGTGTCACATGAGGAAGATAGGATAAGCCAGTTTTTGATATCTGAAATACTTTTTCATCTTTCTACCAAGGACTCTGTCAGAACAAGCGCTTTGTCTACCAAATTTTGGCAATCGGTTCCTGGATTGGACTTAGACCCCTACGCATCCTCAAATACCAATACAATTGTGAGTTTT






1. **>** Annotation information 
2. Sequence
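
The two kinds of lines can be told apart by the leading **>**. The following is a minimal sketch of that rule, not a complete FASTA parser: the function name `parse_fasta` and the demo records are made up, and it is assumed that a sequence may be split over several lines.

```python
def parse_fasta(lines):
    """Collect sequences from FASTA lines into a dict: header -> sequence."""
    records = {}
    header = None
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            header = line[1:]          # drop the leading ">"
            records[header] = ""
        elif header is not None and line:
            records[header] += line    # a sequence may span several lines
    return records


example = [
    ">seq1 first demo record\n",
    "ATGGTG\n",
    "GGTGGC\n",
    ">seq2 second demo record\n",
    "ATAT\n",
]
for header, sequence in parse_fasta(example).items():
    print(header, len(sequence))
```

Because the function only needs something it can loop over line by line, an opened file object can be passed in place of the list.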



# FASTQ Format

    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65






1. ID
2. Sequence
3. Separator **+** (no information)
4. Quality (ASCII code)
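
The quality line stores one character per base, and each character encodes a score through its ASCII code. The sketch below decodes the record shown above. It assumes the common Phred+33 encoding (score = ASCII code - 33), which this notebook does not state and which not every FASTQ file uses; the function name `parse_fastq_record` is also made up for this example.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into ID, sequence, and quality scores."""
    seq_id = lines[0].rstrip()[1:]      # line 1: "@" followed by the ID
    sequence = lines[1].rstrip()        # line 2: bases
    quality = lines[3].rstrip()         # line 4: one character per base
    # Assumption: Phred+33 encoding, i.e. score = ASCII code - 33.
    scores = [ord(char) - 33 for char in quality]
    return seq_id, sequence, scores


record = [
    "@SEQ_ID\n",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n",
    "+\n",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n",
]
seq_id, sequence, scores = parse_fastq_record(record)
print(seq_id, len(sequence), len(scores))
print(scores[:4])  # "!" -> 0, "'" -> 6, "'" -> 6, "*" -> 9
```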

