In [1]:
from IPython.display import Image
from IPython.display import clear_output
from IPython.display import FileLink, FileLinks

<img src="img/python-logo-master-flat.png" alt="Python Logo" style="width: 120px; float: right; margin: 0 0 10px 10px;" />

## Introduction to Python with Application in Bioinformatics



### Nanjiang Shu

#### 2024-07-17 (Day 3)

## Review of Day 2

- Writing and running the Python script
  - File extension: `.py`
  - Execution: use `python myscript.py` or `./myscript.py`
- How to read and write files
- Practised file processing and data manipulation with a VCF file
- Introduced the course project

## Review of the quiz from yesterday
<a href="https://forms.office.com/Pages/DesignPageV2.aspx?origin=NeoPortalPage&subpage=design&id=DQSIkWdsW0yxEjajBLZtrQAAAAAAAAAAAAa__Yehr4dURVQ1UU9YSjMzUkdUVEtWMEg4V0VaQ1pXMy4u&analysis=true">Link to the quiz statistics</a>

## Day 3

- __Session 1__
    - Functions and methods
    - Difference between functions and methods
    - Introduction to some useful functions and methods
- __Session 2__    
    - How to write your own functions
    - How to pass arguments from command line using `sys.argv`
    - String formatting
- __Project time__ in the afternoon

## Session 1: Functions and Methods

### What is a function

- A function is a block of code that performs a specific task.
- It can take input (parameters) and return output (results).
- Functions help to reuse code and make programs more modular and readable.

```python
def function_name(parameters):
    # Block of code
    return result
```

In [14]:
print("Hello Python")

Hello Python


In [16]:
len("ACCCCTTGAACCCC")

14


In [17]:
max([87, 131, 69, 112, 147, 55, 68, 130, 119, 50])

147

In [85]:
def gc_content(seq):
    seq = seq.upper()
    gc_count = seq.count("G") + seq.count("C")
    return (gc_count / len(seq)) * 100

sequence = "ACCCCTTGAACCCC"
# Store the result under a new name so the function gc_content is not overwritten
gc = gc_content(sequence)
print(gc)

64.28571428571429


Note: we will describe how to define a function in more detail in the next session

### What is a method
- A method is a function that is associated with an object (an instance of a class)
    - All data types in Python are objects

Note: A method is a function that belongs to an object and is defined in the object's class. We will not cover how to write classes in this beginners' course, but if you are interested, you can find more information in the Python documentation.

In [68]:
class MyClass:
    def my_method(self):
        return "I love Python"

obj = MyClass()
print(obj.my_method())

I love Python


In [69]:
"ACCCGGGT".lower()

'acccgggt'

In [71]:
mylist = [5, 13, 1, 13, 25]
mylist.sort()
print(mylist)

[1, 5, 13, 13, 25]


In [72]:
# A floating-point number also has methods
5.0.is_integer()

True



### What is the difference between a `function` and a `method`?
| __Function__                    | __Method__                             |
|---------------------------------|----------------------------------------|
| Standalone block of code        | Function associated with an object     |
| Called independently            | Called on an instance of a class       |
| Not tied to any object or class | Tied to the object it is called on     |
| Defined outside of a class      | Defined within a class                 |


### What does it matter to me?

For now, you only need to know the different syntax for calling a function and a method:

__A function:__  
`functionName()`

__A method:__  
```<object>.methodName()```
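
As a small illustration of the two calling styles, here is a sketch using a string. The variable `seq` and its value are only example data, not part of the course files.

```python
seq = "ACCCGGGT"

# A function is called on its own, with the object passed as an argument
n_bases = len(seq)

# A method is called on the object itself, using the dot notation
n_c = seq.count("C")

print(n_bases, n_c)
```

Both lines work on the same string; only the way the call is written differs.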


### Introduction to some useful functions
[Python Built-in functions](https://docs.python.org/3/library/functions.html#)

<img src="img/built-in_functions.png" alt="Drawing" style="width: 800px;"/> 

### `help`

In [43]:
help(max)

Help on built-in function max in module builtins:

max(...)
    max(iterable, *[, default=obj, key=func]) -> value
    max(arg1, arg2, *args, *[, key=func]) -> value
    
    With a single iterable argument, return its biggest item. The
    default keyword-only argument specifies an object to return if
    the provided iterable is empty.
    With two or more arguments, return the largest argument.
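
The help text mentions two keyword-only arguments, `key` and `default`. A minimal sketch of what they do, using made-up sequences as example data:

```python
seqs = ["ACGT", "AC", "ACGTTGCA"]

# key=len: compare the items by their length instead of alphabetically
longest = max(seqs, key=len)

# default: the value returned when the iterable is empty
# (without it, max([]) raises a ValueError)
largest_of_nothing = max([], default=0)

print(longest, largest_of_nothing)
```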



Note: we have already talked about the `range()` function and used it in `for` loops.
It is more flexible than that and can be used in other ways, for example with a start, a stop and a step.

### `range`

In [42]:
for i in range(5):
    print(i)

0
1
2
3
4


In [174]:
# range(start, stop, step)
for i in range(1, 10, 2):
    print(i)

1
3
5
7
9
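
One way the step argument can be useful in practice is for walking through a sequence in fixed-size chunks. The following is a sketch only; the sequence is a made-up example.

```python
seq = "ATGGCCTGA"

# Step through the sequence three bases at a time and slice out each codon
codons = []
for i in range(0, len(seq), 3):
    codons.append(seq[i:i + 3])

print(codons)

# range() does not build a list by itself; wrap it in list() to see the numbers.
# A negative step counts downwards.
countdown = list(range(10, 0, -3))
print(countdown)
```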


### `sorted`

In [46]:
li = [5, 4, 1, 3, 10, 13, 13]
print(sorted(li))

[1, 3, 4, 5, 10, 13, 13]


In [47]:
print(sorted(li, reverse=True))

[13, 13, 10, 5, 4, 3, 1]
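
It may help to compare the `sorted()` function with the `list.sort()` method shown earlier. The sketch below uses made-up example values; `key=len` is a standard argument of `sorted()`.

```python
lengths = [355, 458, 120]

new_list = sorted(lengths)   # returns a new list; `lengths` is unchanged
print(new_list, lengths)

result = lengths.sort()      # sorts `lengths` in place and returns None
print(result, lengths)

# key=len sorts strings by their length rather than alphabetically
genes = ["TP53", "KRAS", "MYC", "BRCA1"]
by_length = sorted(genes, key=len)
print(by_length)
```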


### `dir`: return the names in the current scope or, with an argument, the attributes of the given object

In [78]:
dir()

['In',
 'MyClass',
 'Out',
 '_',
 '_1',
 '_10',
 '_11',
 '_12',
 '_15',
 '_17',
 '_18',
 '_19',
 '_2',
 '_20',
 '_21',
 '_23',
 '_25',
 '_3',
 '_4',
 '_48',
 '_5',
 '_50',
 '_52',
 '_53',
 '_54',
 '_56',
 '_57',
 '_59',
 '_6',
 '_60',
 '_62',
 '_63',
 '_65',
 '_69',
 '_7',
 '_72',
 '_77',
 '_8',
 '_9',
 '__',
 '___',
 '__builtin__',
 '__builtins__',
 '__doc__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_dh',
 '_i',
 '_i1',
 '_i10',
 '_i11',
 '_i12',
 '_i13',
 '_i14',
 '_i15',
 '_i16',
 '_i17',
 '_i18',
 '_i19',
 '_i2',
 '_i20',
 '_i21',
 '_i22',
 '_i23',
 '_i24',
 '_i25',
 '_i26',
 '_i27',
 '_i28',
 '_i29',
 '_i3',
 '_i30',
 '_i31',
 '_i32',
 '_i33',
 '_i34',
 '_i35',
 '_i36',
 '_i37',
 '_i38',
 '_i39',
 '_i4',
 '_i40',
 '_i41',
 '_i42',
 '_i43',
 '_i44',
 '_i45',
 '_i46',
 '_i47',
 '_i48',
 '_i49',
 '_i5',
 '_i50',
 '_i51',
 '_i52',
 '_i53',
 '_i54',
 '_i55',
 '_i56',
 '_i57',
 '_i58',
 '_i59',
 '_i6',
 '_i60',
 '_i61',
 '_i62',
 '_i63',
 '_i64',
 '_i65',
 '_i66',
 ...

### Introduction to some useful methods

####  Operations on strings

<img src="img/string_methods.png" alt="Drawing" style="width: 600px;"/> 

#### ```str.split()```

In [86]:
a = '  split a string into a list '
a.split()

['split', 'a', 'string', 'into', 'a', 'list']

In [88]:
a.split(maxsplit=3)

['split', 'a', 'string', 'into a list ']

#### `str.join`

<img src="img/join.png" alt="Drawing" style="width: 800px;"/> 

In [92]:
','.join(["a", "b", "c", "d"]) # convert to comma separated

'a,b,c,d'
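
`split` and `join` are often used together when processing tab-separated files. The following is a sketch under the assumption of a made-up, VCF-like line; the ID `var1` is hypothetical.

```python
# A made-up, tab-separated variant line (not taken from a real file)
line = "5\t1235651\t.\tC\tT\n"

# Remove the newline, then split on tabs to get a list of strings
fields = line.strip().split("\t")

# Replace the placeholder ID with a hypothetical one
fields[2] = "var1"

# Join the list back into one tab-separated string
new_line = "\t".join(fields)

print(fields)
print(new_line)
```

Note that `join` only accepts strings, so numbers have to be converted with `str()` first.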

### Useful operations on mutable sequences, e.g. list

<br></br>

<img src="img/list_methods.png" alt="Drawing" style="width: 400px;"/> 

In [98]:
a = [1, 2, 3, 4, 5, 5, 5, 5]
a.append(6)
a

[1, 2, 3, 4, 5, 5, 5, 5, 6]

In [99]:
a.pop(1)
a

[1, 3, 4, 5, 5, 5, 5, 6]

In [100]:
a.reverse()
a

[6, 5, 5, 5, 5, 4, 3, 1]

In [107]:
a.remove(5)  # removes only the first occurrence of 5
a

[6, 5, 5, 5, 4, 3, 1]

In [159]:
a.insert(0, 13)
a

[13, 6, 5, 5, 5, 4, 3, 1]

In [108]:
### Tuples are immutable
a = (2, 3, 4, 15)
a.pop()

AttributeError: 'tuple' object has no attribute 'pop'

Note: do you still remember the Dictionary data type we talked about on the first day?

### Dictionary
- A dictionary is a mutable collection of key-value pairs (since Python 3.7 it keeps the order in which the keys were inserted).
- Each key in a dictionary must be unique and immutable (e.g., strings, numbers, or tuples), while the values associated with keys can be of any data type and can be duplicated.

In [None]:
sequence_info = {  # a dictionary
    "gene": "TP53",
    "species": "Homo sapiens",
    "length": 2000
}

__Syntax__:  
`a = {}` (create empty dictionary)  
`d = {'key1':value1, 'key2':value2, 'key3':value3}`

### Useful operations on Dictionaries

<img src="img/dictionary.png" alt="Drawing" style="width: 600px;"/>  

In [109]:
sequence_info = {  # a dictionary
    "gene": "TP53",
    "species": "Homo sapiens",
    "length": 2000
}
sequence_info

{'gene': 'TP53', 'species': 'Homo sapiens', 'length': 2000}

In [116]:
len(sequence_info)

3

In [110]:
# Access the value for a given key
gene_id = sequence_info['gene']
print(gene_id)

TP53

In [None]:
sequence_info_list = ["TP53", "Homo sapiens", 2000]
gene_id = sequence_info_list[0]
# imagine having dozens of items in a list and referring to each of them by index

#### Value of a key can be modified

In [111]:
sequence_info['gene'] = "COX2"
sequence_info

{'gene': 'COX2', 'species': 'Homo sapiens', 'length': 2000}

#### Check membership

In [124]:
# check membership for keys
'gene' in sequence_info

True

In [126]:
# check membership for values
2000 in sequence_info.values()

True

#### Loop through a dictionary

In [118]:
for key in sequence_info.keys():
    print(key)

gene
species
length


In [120]:
for value in sequence_info.values():
    print(value)

COX2
Homo sapiens
2000


In [127]:
for key, value in sequence_info.items():
    print(key, value)

gene COX2
species Homo sapiens
length 2000
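
Membership checks, value modification and looping can be combined to count things. The following is a minimal sketch that counts nucleotides in the example sequence used earlier; the variable name `counts` is just an illustration.

```python
seq = "ACCCCTTGAACCCC"

counts = {}
for base in seq:
    if base in counts:
        counts[base] += 1   # key already exists: increase its value
    else:
        counts[base] = 1    # first time we see this base: add the key

for base, n in counts.items():
    print(base, n)
```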


#### Be aware: key must be immutable, value can be any data type

In [114]:
mydict = {
    "simple list": [1, 2, 3]
}

In [115]:
mydict = {
    [1, 2, 3]: "simple list"
}

TypeError: unhashable type: 'list'
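
A tuple, unlike a list, can be used as a key as long as its elements are themselves immutable. The sketch below uses made-up positions and genotypes; `dict.get()` is a standard dictionary method.

```python
# (chromosome, position) tuples as keys; the values are made-up genotypes
genotypes = {
    ("5", 1235651): "1/1",
    ("7", 20034): "0/1",
}
found = genotypes[("5", 1235651)]
print(found)

# .get() returns a default value instead of raising KeyError for a missing key
missing = genotypes.get(("X", 1), "not found")
print(missing)
```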

## Summary

- A method always belongs to an object of a specific class; a function does not have to
- The official Python documentation describes the syntax for all built-in functions and methods
  - https://docs.python.org/3.9/

## Day 3, Exercise 1
- Link: https://python-bioinfo.bioshu.se/exercises.html
___
### Take a break after the exercise

## Session 2: 
- How to define your own function
- How we pass arguments from the command line to the Python script - sys.argv
- String formatting

Note: when we talked about conditional statements, we showed this example.
The program checks whether a genotype is active by comparing it to the target variant and, if it is, further checks whether the phenotype is "expressed".

The outcome determines the status of the variant.

If I want to check another genotype or phenotype, I have to change the values in the code.


In [142]:
# Use nested conditionals to categorize genetic variants based on multiple attributes.
genotype = "AG"
phenotype = "expressed"
if genotype == "AG":
    # Only check phenotype if genotype is "AG"
    if phenotype == "expressed":
        print("Variant " + genotype + " is active and expressed.")
    else:
        print("Variant " + genotype + " is active but not expressed.")
else:
    print("Variant " + genotype + " is a non-target variant.")

Variant AG is active and expressed.


### Syntax of function
```python
def function_name(arg1, arg2, ...):
    # Block of code
    return result
```

Note: live coding to define the function

In [142]:
# Let's start with the previous code
genotype = "AG"
phenotype = "expressed"
if genotype == "AG":
    # Only check phenotype if genotype is "AG"
    if phenotype == "expressed":
        print("Variant " + genotype + " is active and expressed.")
    else:
        print("Variant " + genotype + " is active but not expressed.")
else:
    print("Variant " + genotype + " is a non-target variant.")

Variant AG is active and expressed.


In [134]:
def check_geno_pheno(genotype, phenotype):
    if genotype == "AG":
        # Only check phenotype if genotype is "AG"
        if phenotype == "expressed":
            print("Variant " + genotype + " is active and expressed.")
        else:
            print("Variant " + genotype + " is active but not expressed.")
    else:
        print("Variant " + genotype + " is a non-target variant.")
    
check_geno_pheno("AG", "expressed")

Variant AG is active and expressed.


In [135]:
check_geno_pheno("A", "expressed")
check_geno_pheno("AC", "nonexpressed")

Variant A is a non-target variant.
Variant AC is a non-target variant.


Note the return statement

### Syntax of function
```python
def function_name(arg1, arg2, ...):
    # Block of code
    return result
```

In [138]:
result = check_geno_pheno("A", "expressed")
print("result = ", result)

Variant A is a non-target variant.
result =  None


Note: live coding to modify the following code so that it returns the message

In [None]:
# re-define this function so that it returns a meaningful value
def check_geno_pheno(genotype, phenotype):
    if genotype == "AG":
        # Only check phenotype if genotype is "AG"
        if phenotype == "expressed":
            print("Variant " + genotype + " is active and expressed.")
        else:
            print("Variant " + genotype + " is active but not expressed.")
    else:
        print("Variant " + genotype + " is a non-target variant.")
    
check_geno_pheno("AG", "expressed")

In [139]:
# redefine the function so that it returns a meaningful value
def check_geno_pheno(genotype, phenotype):
    result = ""
    if genotype == "AG":
        # Only check phenotype if genotype is "AG"
        if phenotype == "expressed":
            result = "Variant " + genotype + " is active and expressed."
        else:
            result = "Variant " + genotype + " is active but not expressed."
    else:
        result = "Variant " + genotype + " is a non-target variant."
    
    return result
# live coding

In [140]:
result = check_geno_pheno("A", "expressed")
print("result = ", result)

result =  Variant A is a non-target variant.


In [141]:
r1 = check_geno_pheno("AG", "expressed")
r2 = check_geno_pheno("A", "expressed")
print("result 1 = ", r1)
print("result 2 = ", r2)

result 1 =  Variant AG is active and expressed.
result 2 =  Variant A is a non-target variant.


## Why use functions?

- Cleaner code
- Better defined tasks in code
- Re-usability
- Better structure
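
To illustrate re-usability, the sketch below applies the same function to several samples in a loop. The function follows the same logic as `check_geno_pheno` above, written with early returns; the sample list is hypothetical example data.

```python
def check_geno_pheno(genotype, phenotype):
    """Return a status message (same logic as the version defined earlier)."""
    if genotype != "AG":
        return "Variant " + genotype + " is a non-target variant."
    if phenotype == "expressed":
        return "Variant " + genotype + " is active and expressed."
    return "Variant " + genotype + " is active but not expressed."

# Hypothetical sample data: (genotype, phenotype) pairs
samples = [("AG", "expressed"), ("AG", "silent"), ("CC", "expressed")]

# The function is written once and reused for every sample
messages = []
for genotype, phenotype in samples:
    messages.append(check_geno_pheno(genotype, phenotype))

for message in messages:
    print(message)
```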

### 5 min exercise

Write a function that calculates the length of a sequence
- Name of the function, `get_seq_len`
- Input parameter: `sequence` (String)
- Return value: `length` (Int)

## How to pass arguments to Python script from the command line

Note: yesterday we talked about how to read data from a file and showed this example.
Now we will switch to a terminal 

In [None]:
seqfile = "../files/one_dna_sequence.fa"

with open(seqfile, "r") as file:
    seqlength = 0
    for line in file:
        line = line.strip()
        if not line.startswith(">"):
            seqlength += len(line)

outfile = "../files/output/length_of_dna_sequence.txt"
with open(outfile, "w") as file:
    file.write("Length of DNA sequence: " + str(seqlength))

### `sys.argv`
- `sys.argv` is a list in the `sys` module that contains the command line arguments passed to the script
    - Position 0: the program name
    - Position 1: the first argument
    - Position 2: the second argument
    - etc

### How to use it
```python
import sys

program_name = sys.argv[0]
arg1 = sys.argv[1] # IndexError if the first argument is not provided in the command
arg2 = sys.argv[2] # IndexError if the second argument is not provided in the command
```
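
One detail worth knowing is that every item in `sys.argv` is a string, so numbers have to be converted explicitly. The sketch below shows this with a hypothetical helper `parse_args` and made-up argument names (`seqs.fa`, `min_length`); it takes a plain list so it can be tried without a terminal.

```python
def parse_args(argv):
    """Return (seqfile, min_length) from a list shaped like sys.argv."""
    if len(argv) < 3:
        return None              # not enough arguments were given
    seqfile = argv[1]
    min_length = int(argv[2])    # command line arguments are always strings
    return seqfile, min_length

# Simulate `python myscript.py seqs.fa 100` with an ordinary list.
# In a real script you would `import sys` and call parse_args(sys.argv).
ok = parse_args(["myscript.py", "seqs.fa", "100"])
too_few = parse_args(["myscript.py"])
print(ok)
print(too_few)
```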

In Python, the `import` statement is used to include external modules and libraries into your current script or module.

Note: we will talk more about modules and how to use them tomorrow. Now we will switch to the terminal again

#### Try out `sys.argv`

The Python script is called `print_argv.py` and is in the download folder

In [None]:
# Modify this code so that it takes command line arguments
seqfile = "../files/one_dna_sequence.fa"

with open(seqfile, "r") as file:
    seqlength = 0
    for line in file:
        line = line.strip()
        if not line.startswith(">"):
            seqlength += len(line)

outfile = "../files/output/length_of_dna_sequence.txt"
with open(outfile, "w") as file:
    file.write("Length of DNA sequence: " + str(seqlength))

In [None]:
#!/usr/bin/env python
# answer
import sys

if len(sys.argv) > 2:
    seqfile = sys.argv[1]
    outfile = sys.argv[2]
else:
    print("USAGE: " + sys.argv[0] + " seqfile outfile")
    sys.exit(1)
    
with open(seqfile, "r") as file:
    seqlength = 0
    for line in file:
        line = line.strip()
        if not line.startswith(">"):
            seqlength += len(line)

# outfile = "../files/output/length_of_dna_sequence.txt"
with open(outfile, "w") as file:
    file.write("Length of DNA sequence: " + str(seqlength))
print("Result has been output to ", outfile)

Note: can you explain why we use arguments from the command line?

### Why use arguments from the command line


- Avoid hardcoding the filename in the code
- Easier to re-use code for different input files

## String formatting

Format text for printing or for writing to file.

What we have been doing so far:

In [176]:
chrom = "5"
pos = 1235651
ref = "C"
alt = "T"
geno = "1/1"
info = chrom + ":" + str(pos) + "_" + ref + "-" + alt + " has genotype: "+ geno
print(info)

5:1235651_C-T has genotype: 1/1


### Other (better) ways of formatting strings:

__f-strings (since Python 3.6)__

In [177]:
chrom = "5"
pos = 1235651
ref = "C"
alt = "T"
geno = "1/1"
info = f"{chrom}:{pos}_{ref}-{alt} has genotype: {geno}"
print(info)

5:1235651_C-T has genotype: 1/1


In [None]:
info = chrom + ":" + str(pos) + "_" + ref + "-" + alt + " has genotype: "+ geno

#### `format` method

In [178]:
chrom = "5"
pos = 1235651
ref = "C"
alt = "T"
geno = "1/1"
info = "{}:{}_{}-{} has genotype: {}".format(chrom, pos, ref, alt, geno)
print(info)

5:1235651_C-T has genotype: 1/1


#### `f-strings` formatting is recommended


#### It works for other data types as well

In [180]:
genes = ["TP53", "COX2"]
lengths = [355,  458]
print(f"Lengths of genes {genes} are {lengths}")

Lengths of genes ['TP53', 'COX2'] are [355, 458]


In [160]:
gene = "COX1"
exp_level = 45.123253
print(f"Expression level of gene {gene} is {exp_level}")

Expression level of gene COX1 is 45.123253


In [171]:
print(f"Expression level of gene {gene} is {exp_level:.2f}")

Expression level of gene COX1 is 45.12
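
The part after the colon is a format specification, and `.2f` is only one of several options. A few other common ones are sketched below; the values `n_reads` and `fraction` are made-up example data.

```python
gene = "COX1"
exp_level = 45.123253
n_reads = 1235651
fraction = 0.6428

left = f"{gene:<8}|"            # left-align in a field of 8 characters
fixed = f"{exp_level:8.1f}|"    # field width 8, one decimal
grouped = f"{n_reads:,}"        # thousands separator
percent = f"{fraction:.1%}"     # shown as a percentage with one decimal

print(left)
print(fixed)
print(grouped)
print(percent)
```

Fixed widths like these are handy when printing columns that should line up.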


__The old way (`%` formatting, inherited from Python 2 and still valid in Python 3)__

In [156]:
genotype = "AG"
result = "Variant %s is active and expressed."%(genotype)
print(result)

Variant AG is active and expressed.


## Summary

- A function is a block of organized, reusable code that is used to perform a single, related action
- Use `sys.argv` to deal with arguments passed to the python script from the command line
    - `sys.argv[0]` is the program name
    - `sys.argv[1]` is the first argument and so on
- `f-strings` formatting is a convenient and recommended way to format the string
    - Extra reading about string formatting: https://www.w3schools.com/python/python_string_formatting.asp

## Day 3, Exercise 2
- Link: https://python-bioinfo.bioshu.se/exercises.html
___
##### Break
___

### Quiz for Day 3
- Link: https://python-bioinfo.bioshu.se/quiz.html
___

### Lunch 
___
## Project time after lunch