# <center>**Introduction to Python Workshop**</center>
###### <center>[Harvard Chan Bioinformatics Core](https://bioinformatics.sph.harvard.edu/)</center>
###### <center>2020-08-28</center>


---

# Pre-class preparation 
Welcome to the **Introduction to Python** workshop! Before we start, let's prepare a few things and get familiar with the platform (Google Colab) that we will use during this workshop. Colab provides a free cloud service based on the [Jupyter notebook](https://jupyter.org/) environment. It does not require users to install Python locally, because it runs entirely in the cloud. 

Please follow the instructions below to set up for this workshop:
1. **Create your own copy of this Python notebook (which also contains the workshop materials)**: Click "File" at the top -> Click "Save a copy in Drive". This notebook will be automatically named "Copy of Intro_to_Python_in_class_version" and copied to your Google Drive. You can rename it if you want.
2. **Check where the material is located**: Click "File" again -> click "Locate in Drive". A new window should pop up, showing you its location in your Google Drive.
3. **Get familiar with Colab interface and terminology**: 
*   code cell
*   text cell
*   creating and deleting cells
*   adding comments
*   saving the document

4. **Run the code cell below**. You should see an output console under the code cell.


In [None]:
# Prepare for the lesson
_4_2 = 0.314
ans_final = "ATGAACGCATCGATATATATGTATGATAGCAAATACTATACGTAATCGATCAGT"
print("Done! You are all set for pre-class preparation!")

---

# Section I: Introduction to Python 

## What is Python?
Python is a powerful, open-source, general-purpose programming language with a wide variety of applications. Many [websites and apps](https://codeinstitute.net/blog/7-popular-software-programs-written-in-python/) that we are familiar with, including YouTube, Instagram, and Spotify, are built on Python. In the field of bioinformatics, Python is also widely used in computational programming, data analysis, and pipeline development. 

Below we have listed some examples where Python-based tools are used:

| **Example use case** | **Tool** |
| :---: | :---: |
| Pipeline development | [bcbio](https://bcbio-nextgen.readthedocs.io/en/latest/), [snakemake](https://snakemake.readthedocs.io/en/stable/) |
| Image analysis | [CellProfiler](https://cellprofiler.org/) |
| Molecular visualization | [PyMOL](https://pymol.org/2/) |
| Machine learning | [scikit-learn](https://scikit-learn.org/stable/) |

> Note: Many bioinformatics tools are written in Python, and you may have encountered/used them without knowing how they work. This is okay, but if you want to customize them or modify something, you will need to understand the scripts.

Since it was first released in 1991, Python has been constantly developed and updated. The current major version is Python 3. Features of the Python programming language include:
- Easy-to-understand syntax
- Large number of built-in libraries for common tasks
- Succinct and readable format, friendly to both beginners and professional programmers
- Widely used programming language, with good documentation and lots of tutorials

## What will you learn in this workshop?
To take this workshop, you don't need to have prior experience in Python (or any other programming language(s)). We will start with basic Python syntax, progress through various concepts and end with learning about writing conditional statements. The workshop is designed to set a foundation for advanced topics in Python, including data wrangling/visualization, pipeline development, and machine learning applications.

## Additional tips
- `#` at the beginning of the line denotes that the following statement is a comment or annotation for the code. Essentially, any line starting with a `#` won't be executed by Python. 
- Indentation is very important in Python syntax. Not only is it necessary for Python, it is great for code readability. Essentially, by making indentations essential, Python forces good code-writing. We will be talking more about this throughout the workshop.

---

# Section II: Basic Python syntax

## Variables
A "variable" is a temporary container that stores some information, and is given a name. We can use the `=` operator to assign some value to a variable name/variable. Variable naming in Python has to fulfill the following rules:
- must start with a letter or underscore character
- can contain only alpha-numeric characters or the underscore character (A-Z, a-z, 0-9, _)
- cannot be a reserved keyword in Python. A complete list can be found [here](https://docs.python.org/3.8/reference/lexical_analysis.html#keywords).
- is case-sensitive (e.g. `Year` and `year` are two different variable names)  
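
If you would like to try these rules out, here is a minimal sketch. It relies on two standard-library helpers that are not covered in this workshop, `str.isidentifier()` and `keyword.iskeyword()`, and the names being tested are made up purely for illustration.

```python
import keyword

# Hypothetical names, made up to illustrate each rule
ok_plain = 'genome_length'.isidentifier()   # letters and an underscore
ok_underscore = '_4_2'.isidentifier()       # starts with an underscore
bad_digit = '2fast'.isidentifier()          # starts with a digit
bad_hyphen = 'my-var'.isidentifier()        # contains a hyphen
is_reserved = keyword.iskeyword('for')      # 'for' is a reserved keyword

print(ok_plain, ok_underscore, bad_digit, bad_hyphen, is_reserved)
```

A name is usable as a variable when `isidentifier()` is `True` and `iskeyword()` is `False`.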

In [None]:
# Assign two variables: 2 to x, and 5 to y
x = 2
y = 5

The "x" and "y" variables are now stored in the current Python computing environment. 

We can use the `print()` function to print out the value of variables to the console. Alternatively, you can just type out the variable names. 
> If running multiple lines of code in a code chunk, it is recommended to use the `print()` function; otherwise only the value of the last variable in your code chunk will be printed out to the console.

> In Python (like in R), functions are the workhorses and are written as follows with the open/close parentheses: `function_name()` 

In [None]:
# Print out the value of variables 'x' and 'y'
print(x)
print(y)

> Note: Within a line of code, Python is lenient about whether to use spaces or not. We find space helpful for better readability, but it is personal preference.

New variables can be easily generated by performing some mathematical calculation(s) on existing variables. For example, we can calculate the mean of x and y. 

Python also supports a wide range of common mathematical calculations using operators and functions, e.g. power (the `**` operator) and square root (the `sqrt()` function from the standard `math` module).

In [None]:
# Calculate the mean of x and y, and store the value to a variable called 'mean'
mean = (x + y) / 2
mean
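
Besides the mean, the power operator and the square root function mentioned above can be sketched as follows. Note that `sqrt()` is not available by default: this sketch assumes we load it from the standard `math` module with an `import` line, which is not otherwise covered in this workshop.

```python
import math

x = 2
y = 5

power = x ** y          # 2 to the power of 5
root = math.sqrt(16)    # square root; math.sqrt returns a float
mean = (x + y) / 2      # the / operator also returns a float

print(power, root, mean)
```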

## Data types
Data comes in many different types. For example, we know that the newly created variables `x` and `mean` are both numeric. However, `x` is a whole number, so its data type within Python is `int` or "integer". `mean`, on the other hand, is a number with decimal places, so its data type within Python is called `float`. 

> These are similar to the "integer" and "numeric" data types in R, respectively.

We can use the `type()` function to check what data type a given variable has.

In [None]:
# Check the data type for 'x' and 'mean'
print(type(x))
print(type(mean))

Another commonly used data type is `str`, short for "string". A string stores a sequence of characters, and can be created by enclosing characters inside single quotation marks `''` or double quotation marks `""`.

> This is similar to the "character" data type in R.

In [None]:
# Generate a str variable called 'text' with the value 'hello world!'. Check its data type
text = 'hello world!'
print(text)
print(type(text))

The last data type we will introduce here is called `bool`. This boolean data type can be either `True` or `False`. It is usually used to record whether an expression is true or false. We will cover this data type in the conditional statement section.

> This is similar to the "logical" data type in R.

In [None]:
# Generate a boolean variable called 'z', which judges whether 10 is smaller than 8. Check its data type
z = 10 < 8
print(z)
print(type(z))

## Recap
In this short section, we introduced some very basic terms in Python. We learned **how to assign variables**, and what rules to follow. We also described several important **data types** - `int`, `float`, `str`, `bool`. Hope it has been fun so far!

| **Data type** | **Examples** |
| :---: | :---: |
| int (numeric) | 2 |
| float (numeric) | 3.5 |
| str | 'hello world!' |
| bool | True, False|
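
As a small illustration of how these types show up in practice, the sketch below (with made-up variable names) checks the type of a few values. One detail worth noticing: dividing with `/` gives a `float` even when the numbers divide evenly.

```python
whole = 4
halved = whole / 2               # 2.0, a float rather than an int

type_of_text = type('4')         # quotation marks make this a str, not a number
type_of_test = type(whole == 4)  # a comparison produces a bool

print(type(whole))
print(type(halved))
print(type_of_text)
print(type_of_test)
```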

In the next section, we will focus on one important concept - Python lists. This will be something you use all the time in Python.

---

# Section III: Python List

## Create a List
Next, let's talk about data structures. In languages like Python and R, data is contained within variables in specific ways. A frequently used "data structure" in Python is called a `list`. A Python list is a collection of data stored within square brackets `[]`.  

Lists have the following features:
- order of its elements matters
- can store mixed data types that we introduced above
- can even contain a sublist

> Note: There are other Python data structures, including the tuple (`tuple`), dictionary (`dict`), and set (`set`). We are not going to cover those in this workshop, but they can be very useful in some situations. If you are interested in learning more about them, this [website](https://thomas-cokelaer.info/tutorials/python/data_structures.html) has good introductory information.

In [None]:
# Create an empty list called 'empty'
empty = []
empty

In [None]:
# Create a list called 'species', containing three strings: ecoli, human, corn.
species = ['ecoli', 'human', 'corn']
species

In [None]:
# Create a list called 'glengths', containing three numeric values that correspond to genome lengths (in Mb): 4.6, 3000, 2500
glengths = [4.6, 3000, 2500]
glengths

In [None]:
# Create a list called 'combined', containing all three species and corresponding genome lengths as pairs
combined = ['ecoli', 4.6, 'human', 3000, 'corn', 2500]
combined

In [None]:
# Create a list called 'combined2', with each species and genome length pair as a sublist
combined2 = [['ecoli', 4.6],['human', 3000], ['corn', 2500]]
combined2

## Subsetting a single element from a list
Now that we have created a list, how do we access the data from it? 

We can do so by specifying the "index" number - the location of the data within the list. Similar to some other programming languages, **Python index starts from 0** (it is not intuitive, we know! Please just bear with it). 

The first element of a list is `list[0]`. Alternatively, we can also use `-` to access the data starting from the last element. The last element of a list is `list[-1]`. The image below illustrates the index for each element. 

<p align="center">
<img src="https://github.com/hbctraining/Training-modules/blob/master/Python/img/list1.png?raw=true" width="500"/>
</p>



In [None]:
# Get the 3rd element from list 'combined'
combined[2]

In [None]:
# Get the 3rd element from list 'combined2'. Notice that the result is a sublist!
combined2[2]

In [None]:
# Get the 3rd from the last element from the list 'combined'
combined[-3]
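
To connect the two indexing directions, here is a minimal sketch. It assumes the built-in `len()` function (introduced later in this workshop) to get the number of elements, and it also shows that a sublist can be indexed a second time to reach an element inside it.

```python
combined = ['ecoli', 4.6, 'human', 3000, 'corn', 2500]
combined2 = [['ecoli', 4.6], ['human', 3000], ['corn', 2500]]

# The negative index -3 points to the same element as len(combined) - 3
from_end = combined[-3]
from_start = combined[len(combined) - 3]

# The first index selects the sublist, the second index selects inside it
corn_pair = combined2[2]
corn_length = combined2[2][1]

print(from_end, from_start)
print(corn_pair, corn_length)
```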

## Subsetting multiple elements from a list
Now, what if we want to access multiple elements in a list? 

Here we introduce the slicing `:` operator. The syntax of "slicing" is `[start:stop:step]`. *start* is the index of the first element in the slice. *stop* is the index where the slice ends; the element at *stop* itself is **not** included, so the slice finishes one element before it. *step* is the distance between consecutive elements in the slice.
> Note: You don't have to specify all slicing elements; when it is not specified, Python will use default value - **by default, it will start from the first element, stop at the last element, and use step of 1**.

<p align="center">
<img src="https://github.com/hbctraining/Training-modules/blob/master/Python/img/list2.png?raw=true" width="500"/>
</p>

In [None]:
# Get the first two elements from the list 'combined' - method 1: specify both start and stop position
combined[0:2]

In [None]:
# Get the first two elements from the list 'combined' - method 2: specify only stop position
combined[:2]

In [None]:
# Get the last two elements from the list 'combined' - method 1: use normal index
combined[4:]

In [None]:
# Get the last two elements from the list 'combined' - method 2: use negative index
combined[-2:]

In [None]:
# Get every other element from the list 'combined'
combined[::2]

In [None]:
# Get every element in reverse order from the list 'combined'
combined[::-1]
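
The sketch below puts *start*, *stop* and *step* together in a few more combinations. The variable names on the left are made up for illustration.

```python
combined = ['ecoli', 4.6, 'human', 3000, 'corn', 2500]

middle = combined[1:5]           # indices 1, 2, 3 and 4; index 5 is not included
lengths = combined[1::2]         # start at index 1, then take every second element
species_names = combined[0:6:2]  # start, stop and step all specified
whole_copy = combined[:]         # all defaults: a copy of the whole list

print(middle)
print(lengths)
print(species_names)
print(whole_copy)
```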

## Modifying a list
Next, we will learn several ways to modify elements in an existing list. 

* To change the value of an element, we can simply reassign the element to a new value. 
* To add a new element to a list, we can use the `+` operator. 
* Lastly, to delete an existing element from a list, we can use `del`. (Strictly speaking, `del` is a statement rather than a function, but it can be written with parentheses, as in `del(combined[4:6])`.)

> Note: `del` modifies the original list. Be mindful of this: if you run it more than once, you will remove more values than you initially set out to.

In [None]:
# Change ecoli's genome length from 4.6 to 4.8 in combined
combined[1] = 4.8
combined

In [None]:
# Add information for another genome to combined - fly with genome length of 180
combined = combined + ['fly', 180]
combined

In [None]:
# Delete information about corn and its genome length from combined
del(combined[4:6])
combined
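
To see why running the deletion twice is risky, here is a minimal sketch of the three modifications on a separate, hypothetical list called `genomes` (so that `combined` is left untouched).

```python
# 'genomes' is a made-up list used only for this illustration
genomes = ['ecoli', 4.6, 'human', 3000, 'corn', 2500]

genomes[1] = 4.8                   # reassign one element
genomes = genomes + ['fly', 180]   # + builds a new, longer list
after_add = genomes[:]             # keep a copy for comparison

del genomes[4:6]                   # removes 'corn' and 2500
after_first_del = genomes[:]

del genomes[4:6]                   # run again: now 'fly' and 180 are removed
after_second_del = genomes[:]

print(after_add)
print(after_first_del)
print(after_second_del)
```

The second `del` uses exactly the same indices as the first, but because the list has already shrunk, those indices now point at different elements.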

## Loops
Now that we have defined a list, how can we perform specific tasks on each element of this list? 

You could write many lines of code to do the same thing on every list element, but that would be inefficient, time-consuming and error-prone. To get around this we are introducing an important concept in programming - **iterating over elements with a `for` loop**. 

A loop will **iterate through a defined sequence of elements**, and within each loop, the same lines of code are executed. The basic syntax for a `for` loop is:  
```
for variable_name in iterable_sequence:
    code_line1
    code_line2
    ...
```

>Note: 
- **Indentation is very important #1**. It is what Python uses to define "a chunk of code". By convention, Python code uses four spaces (or the tab key) for indentation, but you can use a different number (at least one space) as long as it is consistent within the chunk. Python does not use `{}` to enclose the code chunk in the loop, like some other languages do.
- **Indentation is very important #2**. It is not for styling purposes, but a requirement. Incorrect indentation will cause errors or return unexpected results. You can find more information in this [article](https://www.faceprep.in/python/python-indentation/).
- Loops can be applied not just to a list, but also to other iterable sequences of elements - for example, a string, tuple, or dictionary.
- The variable_name can be any valid variable name, but we recommend using a name that makes sense in the context of your analysis.

In [None]:
# Print every element in the list 'combined'
for x in combined:
    print(x)

If we want to loop over just some selected elements - not necessarily all the elements in a list, we can do so by specifying a new list, containing the index of the iterable sequence, and then accessing just the data from the corresponding indices.

In [None]:
# Print species names in the list 'combined', by iterating their indices, in the following order: human, ecoli, and fly 
index = [2, 0, 4]
for i in index:
    print(combined[i])

If the number of selected indices we want to iterate over is very large, the above method of subsetting can be very tedious. Luckily, Python provides a built-in function called `range`, which generates a sequence of integers. For example, `range(n)` will return integers from `0` to `n-1`. We can also define a subset of integers using the syntax `range(start, end, step)`. The rule for the "start / end / step" positions is the same as for slicing a list (the *end* value itself is not included), except that we use `,` rather than `:` for separation.

In [None]:
# Use range() to print numbers from 0 to 5
for i in range(6):
    print(i)

In [None]:
# Use range() to print even numbers from 2 to 8
for i in range(2, 9, 2):
    print(i)

Once we know how to quickly define a sequence of integers, we can then use these integers as indices, and access only selected elements from a list.

In [None]:
# Print just the first 4 elements in the list 'combined'
for i in range(4):
    print(combined[i])

In [None]:
# Print all the species names in the list 'combined'
for i in range(0, 5, 2):
    print(combined[i])
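
As a further sketch of `range()` used for indexing, the code below walks through a list two positions at a time, so that each species is paired with its genome length. It assumes a made-up list called `genomes`, and it uses two built-in functions not yet introduced at this point: `len()` for the number of elements and `str()` to turn a number into text.

```python
# 'genomes' is a hypothetical list with the same layout as 'combined'
genomes = ['ecoli', 4.8, 'human', 3000, 'fly', 180]

# Wrapping range() in list() shows the integers it generates
evens = list(range(2, 9, 2))
print(evens)

# Species sits at index i, its genome length at index i + 1
lines = []
for i in range(0, len(genomes), 2):
    lines = lines + [genomes[i] + ': ' + str(genomes[i + 1]) + ' Mb']

for line in lines:
    print(line)
```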

### Exercise
We mentioned that the loop works for any data structure that has a sequence of elements. Now let's try to implement it on a "string" (*Note: you can think of string as a sequence of characters*). Create a variable with the following DNA sequence - 'ACTGAT' (which is a string), and then use a `for` loop to print out every base of this DNA.

In [None]:
#### Insert your code below ####
dna = "ACTGAT"
for base in dna:
    print(base)

## Recap
We have covered a lot of content in this section! 

We first introduced **what a list is** and **how to create a list in Python**. We then learned **how to access and manipulate one or more elements in a list**. Sometimes there are multiple ways to achieve this goal. Finally, we touched upon **a key concept - the `for` loop**, and practiced iterating over lists and strings.

In the next section, we are going to introduce tools that make Python programming much more powerful - functions.

---

# Section IV: Functions

## Built-in function
A function is a collection of reusable code that performs a particular task. Python has a set of built-in [functions](https://docs.python.org/3/library/functions.html). For example, `max()` returns the maximum value of a list of numbers. Another example is `range()`, which we encountered earlier.

In [None]:
# Use the max() function to return maximum value of the list 'glengths'
max(glengths)

Let's take another example - `round()` - this function rounds a numeric value to a given number of decimal places. By default, the output will be a whole number. 

In [None]:
# define a variable with the value of pi, and then output the corresponding whole number using the round() function 
pi = 3.14159
round(pi)

What if we want to round the value to a specific number of decimal places? In that case, we would have to use additional *arguments* when using the function. 

To check the available arguments and usage information for a function, one can use the `help()` function. However, we would recommend that you simply search the web for the function you want to use. For instance, [here](https://www.programiz.com/python-programming/methods/built-in/round) shows some nice examples for the `round()` function. You can easily find similar resources online for most other functions.

In [None]:
# Use the `help()` function to display the usage of the `round()` function
help(round)

We now know that we can specify number of digits using the `ndigits` argument within `round()`. Let's try that with pi!

In [None]:
# round the value of pi to 2 decimal places
round(pi, ndigits=2)
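
The sketch below compares the different ways of calling `round()`. The variable names are made up; the point is that an argument can be passed either by name (`ndigits=2`) or by position (as the second value).

```python
pi = 3.14159

rounded_whole = round(pi)               # no ndigits: the result is an int
rounded_keyword = round(pi, ndigits=2)  # argument given by name
rounded_positional = round(pi, 2)       # same argument given by position

print(rounded_whole, type(rounded_whole))
print(rounded_keyword, rounded_positional)
```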

### Exercise
Another useful built-in function is `sorted()` - it sorts the elements of a given list in a specific order. Use this function to reorder the `glengths` list in **descending** order. Check [here](https://www.programiz.com/python-programming/methods/built-in/sorted) if you are not sure what argument to use.

In [None]:
# Sort the glengths list in descending order
#### Insert your code below ####
sorted(glengths, reverse = True)
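
One thing worth checking for yourself: `sorted()` returns a new list and leaves the original list unchanged. A minimal sketch, with made-up names for the results:

```python
glengths = [4.6, 3000, 2500]

ascending = sorted(glengths)                 # default order: smallest first
descending = sorted(glengths, reverse=True)  # largest first

print(ascending)
print(descending)
print(glengths)   # the original list keeps its original order
```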

## Object-specific function
Python has a lot of functionality beyond the basic built-in functions. Recall the data types and data structures that we learned earlier? They are all called Python **objects**. Depending on the object type, there are functions/methods to perform object-specific tasks. 

Let's take a concrete example. One function for a Python string is `count`. `count` searches the substring in the given string and returns how many times the substring is present within the object. The syntax is `string.count(substring)`.

In [None]:
# Count the number of T in the DNA sequence 'TCAGTT'
DNA = "TCAGTT"
DNA.count("T")

Let's look at another example of object-specific functions: use the `join` function to concatenate all the elements of an iterable object (e.g. list), separated by a string separator. The syntax is `string_separator.join(iterable)`.

In [None]:
# Join the below elements to a full sentence, separated by empty space
words = ['Welcome', 'to', 'the', 'Python', 'workshop']
' '.join(words)

Pretty handy, right? We have just touched the tip of the iceberg so far. There are many more [functions](https://docs.python.org/3/library/stdtypes.html#string-methods) for strings in Python. Below we list a few more functions that you will likely see or may use.

| **Function** | **Description** | **Example** | **Output** |
| :---: | :---: | :---: | :---: |
| capitalize | Converts the first character to upper case | 'atgc'.capitalize() | 'Atgc' |
| count | Returns the number of times a specified value occurs in a string | 'atgc'.count('c') | 1 |
| islower | Returns True if all characters in the string are lower case | 'atgc'.islower() | True |
| join | Joins the elements of an iterable into one string, using the string as the separator | ''.join(['a', 't', 'g', 'c']) | 'atgc' |
| replace | Returns a string where an old value is replaced with a new value | 'atgc'.replace('a', 'g') | 'gtgc' |
| split | Splits the string at the specified separator, and returns a list | 'hello world'.split() | ['hello', 'world'] |
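
The examples in the table can be run as a quick sketch. The variable names are made up; note that none of these functions changes the original string, they all return a new value.

```python
seq = 'atgc'

capitalized = seq.capitalize()
c_count = seq.count('c')
all_lower = seq.islower()
joined = ''.join(['a', 't', 'g', 'c'])
replaced = seq.replace('a', 'g')
pieces = 'hello world'.split()   # with no separator given, splits on whitespace

print(capitalized, c_count, all_lower, joined, replaced, pieces)
print(seq)                       # still 'atgc'
```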

Similarly, for Python lists, there is a collection of [functions available](https://docs.python.org/3/tutorial/datastructures.html). To use a function with a list, the syntax is `list_name.function_name()`. Below we have listed a few examples (but the list is far from exhaustive). 

| **Function** | **Description** |
| :---: | :---: |
| append | Adds an element at the end of the list |
| count | Returns the number of elements with the specified value |
| index | Returns the index of the first element with the specified value |
| reverse | Reverses the order of the list |
| sort | Sorts the list |
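
Here is a minimal sketch of these list functions on a made-up list called `bases`. Unlike the string functions, `append`, `reverse` and `sort` change the list itself rather than returning a new one.

```python
bases = ['A', 'T', 'G']

bases.append('C')            # bases is now ['A', 'T', 'G', 'C']
bases.append('A')            # bases is now ['A', 'T', 'G', 'C', 'A']

a_count = bases.count('A')   # how many times 'A' occurs
g_index = bases.index('G')   # index of the first 'G'

bases.reverse()              # reverses the list in place
reversed_bases = bases[:]    # keep a copy of the reversed order

bases.sort()                 # sorts the list in place

print(a_count, g_index)
print(reversed_bases)
print(bases)
```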

### Exercise
One built-in Python function is `len()`. `len()` returns the length of an object (e.g. string or list). For example, `len('TCAGT')` will output 5. 

Now let's calculate the GC content (the fraction of bases that are G or C) for a DNA sequence (assuming all letters are capitalized)!

In [None]:
# This is the example DNA sequence we want you to use
dna = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'

# First, store the length of this DNA sequence into a variable called 'length'. 
# Hint: use built-in len() function
#### Insert your code below ####
length = len(dna)

# Next, create two variables, 'count_c' and 'count_g', which store the number of 'C's and 'G's in this DNA sequence. 
# Hint: use the string-specific count() function we learnt in the previous section
#### Insert your code below ####
count_c = dna.count("C")
count_g = dna.count("G")

# Finally, calculate the GC content of this DNA sequence, and store the result to variable called "GC_content"
#### Insert your code below ####
GC_content = (count_c + count_g) / length
GC_content

In [None]:
# Once done, run this cell to check your answer
assert abs(GC_content - _4_2) < 0.01
print("Correct!")

## User-defined functions
Besides using pre-defined functions from Python, we can also create our own user-defined function. 

As our programming task becomes larger, functions can help to break the task into smaller, modular chunks. Applying functions in a program avoids repetition of larger code chunks that are doing the same tasks over and over again. It makes the code easily readable, reusable and easy to organize. To write a user-defined function, the basic syntax is as follows:

```
def function_name(input):
    code_line1
    code_line2
    code_line3
    ...
    return output
```

> Note: The **return statement** in the last line ends the function and hands control back to the place from which the function was called. `output` is the value that we want the function to return.

Now let's write a simple function that calculates the square of an input number.

In [None]:
# Create a number squaring function called 'square_it'
def square_it(x):
    y = x * x
    return y

> Note: Python uses **indentation** to define a block of code. Therefore, it is very important to **keep consistent indentation** throughout the block.

In [None]:
# Test the 'square_it' function with a number of your choosing
square_it(5)
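
Functions become especially handy in combination with loops. The sketch below applies `square_it` to every element of a made-up list. It also illustrates that the `y` inside the function is separate from any `y` defined outside of it.

```python
def square_it(x):
    y = x * x
    return y

y = 5                # a variable outside the function

squares = []
for n in [1, 2, 3, 4]:
    squares = squares + [square_it(n)]

print(squares)
print(y)             # unchanged by the calls to square_it
```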

### Exercise
Now that we know how to create user-defined functions, let's stitch together our previous exercises, and write a REAL function that calculates GC content of an input DNA sequence (assuming all letters are capitalized)!

In [None]:
# Write a function called 'GC_calculator', which calculates GC content of an input DNA sequence
#### Insert your code below ####
def GC_calculator(x):
    length = len(x)
    count_c = x.count("C")
    count_g = x.count("G")
    GC_content = (count_c + count_g) / length
    return GC_content

In [None]:
# Run this code to see if you can obtain expected result 0.5
GC_calculator('ATGC')

In [None]:
# Once done, run this cell to check your answer
assert abs(GC_calculator('ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT') - _4_2) < 0.01
print("Correct!")
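
The function above assumes that all letters are capitalized. As one possible way to relax that assumption, the sketch below defines a hypothetical helper called `gc_fraction` (not part of the workshop materials). It uses the string function `upper()` to capitalize the input first, and an `if` statement (covered in the next section) to avoid dividing by zero for an empty sequence. Returning `0.0` for an empty sequence is our own choice here, not a rule.

```python
def gc_fraction(seq):
    # Hypothetical helper: a more forgiving variant of GC_calculator
    seq = seq.upper()              # accept lower-case input as well
    if len(seq) == 0:
        return 0.0                 # assumption: an empty sequence has GC content 0
    return (seq.count('G') + seq.count('C')) / len(seq)

print(gc_fraction('ATGC'))
print(gc_fraction('atgc'))
print(gc_fraction(''))
```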

## Recap
In this section, we introduced different types of functions: 
* **built-in**
* **object-specific**
* **user-defined**

We also walked through some examples of very practical tasks, using these functions. Hopefully by now you agree that Python is at least useful for something!

In the next section, we will finish the last piece of our knowledge building - how to make our code "smarter" with conditional statements.

---

# Section V: Conditional statement

## Boolean logic and operator
Let's talk a little more about the data type we learned about earlier, `bool`. One situation where we use boolean logic is in a conditional test. A condition is simply a statement, which can be either `True` or `False`. 

Example #1:  `3 <= 5` is `True`, and `3 > 5` is `False`. 

Example #2: We can build a more complicated statement too, such as `len('ATGC') > 5` is `False`. 

Conditional testing is an important tool in programming, where we guide computers to make decisions based on the result of a given test. Below are some common comparison operators (judgements) used in conditional statements.

| **Sign** | **Judgement** |
| :---: | :---: |
| == | equals |
| > | greater than |
| < | less than |
| >= | greater than or equal to |
| <= | less than or equal to |
| != | not equal to |

We are now comfortable judging a single statement. How about multiple statements? Multiple statements are usually combined by boolean/logic operators. Common boolean operators are AND, OR, NOT, and in Python, the corresponding operators are `and`, `or`, `not`.

In [None]:
# What is the output of the "AND" operator to combine these two statements: 3 <= 5, len('ATGC') > 5?
3 <= 5 and len("ATGC") > 5

In [None]:
# What is the output if we use the "OR" operator to combine these two statements: 3 <= 5, len('ATGC') > 5?
3 <= 5 or len("ATGC") > 5

In [None]:
# What is the output when we use the "NOT" operator on this statement: len('ATGC') > 5?
not len("ATGC") > 5
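
The three operators can be summarized in one short sketch, using the same two statements stored in made-up variables `a` and `b`.

```python
a = 3 <= 5             # True
b = len('ATGC') > 5    # False

both = a and b         # True only if both statements are True
either = a or b        # True if at least one statement is True
negated = not b        # flips False to True (and True to False)

# Comparisons are evaluated before 'not', so these two lines mean the same thing
without_parentheses = not len('ATGC') > 5
with_parentheses = not (len('ATGC') > 5)

print(both, either, negated)
print(without_parentheses, with_parentheses)
```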

## if statement

Now we are ready to create our first conditional statement. The simplest conditional statement is an `if` statement. The syntax is:

```
if statement_of_interest:
    code (to execute when the statement is True)
```

Two scenarios can happen: if the statement of interest is **True**, the code chunk will be executed; if the statement of interest is **False**, the code chunk will not be executed.

Let's take an example: we want to check if the expression of a gene is high or low, and our threshold for highly expressed genes is 100 (an arbitrary unit). We can write a program to do that. 

In [None]:
# Define a variable with the arbitrary gene expression value of 125
gene_expression = 125

# Next, use the if statement to test if "The gene is highly expressed"
if gene_expression > 100:
    print("The gene is highly expressed.")

## else statement

Now, what if we also want to take some action when the statement of interest is False? We can add an additional `else` statement, following the `if` statement. The combination of `if` and `else` is very frequently used in a yes/no decision-making step during programming. 

The updated conditional statement syntax is:

```
if statement_of_interest:
    code (to execute when the statement is True)
else:
    alternative_code (to execute when the statement is False)
```    

> Note: The `else` statement does not have any condition. It is executed automatically when the `if` statement is **not true**. And don't forget the **colon sign ':'** after `else`!

Following our previous example, we now want to output something even when the gene is not highly expressed.

In [None]:
# Write an if-else statement to evaluate the value in the "gene_expression" variable, after changing its value to 75.
gene_expression = 75

if gene_expression > 100:
    print("The gene is highly expressed.")
else:
    print("The gene is lowly expressed.")

## elif statement

Ok, the program looks reasonably "smart" now. But there is one issue left: what happens when we have more than two possible outcomes in our decision-making scenario? In that case the simple if-else statement is not enough, and we need to add another conditional statement, called `elif` (short for "else if"). It follows immediately after the `if` statement, and you can add as many `elif` components as you need. When you have only one outcome left, you use `else`; essentially you end with an `else` statement. 

The updated conditional statement syntax now is:

```
if statement1:
    code (to execute when the statement1 is True)
elif statement2:
    code (to execute when the statement2 is True)
elif statement3:
    code (to execute when the statement3 is True)
...
else:
    code (to execute when all the above statements are False)
```

Now we can be gene expression experts, and categorize genes into four types: super highly expressed (more than 300), highly expressed (between 100 and 300), lowly expressed (between 20 and 100), and super lowly expressed (less than 20).

In [None]:
# Write an updated version of the if-elif-else statement using the above directives to evaluate the existing "gene_expression" value
# Test the if-elif-else statement with a few different values in the gene_expression variable
gene_expression = 350

if gene_expression > 300:
    print("The gene is super highly expressed")
elif gene_expression > 100:
    print("The gene is highly expressed")
elif gene_expression > 20:
    print("The gene is lowly expressed")
else:
    print("The gene is super lowly expressed")
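
To test several values at once, one option is to wrap the same if-elif-else chain in a user-defined function. The sketch below uses a hypothetical function name, `classify_expression`, and also shows what happens exactly at the thresholds: since each test uses `>`, a value equal to a threshold falls into the lower category.

```python
def classify_expression(value):
    # Hypothetical helper that wraps the if-elif-else chain above
    if value > 300:
        return 'super highly expressed'
    elif value > 100:
        return 'highly expressed'
    elif value > 20:
        return 'lowly expressed'
    else:
        return 'super lowly expressed'

for value in [350, 300, 100, 20, 5]:
    print(value, classify_expression(value))
```

The order of the tests matters: Python stops at the first statement that is True, so the largest threshold has to be checked first.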

### Exercise
Now let's use a conditional statement to print out the complementary base when given a nucleotide base as input. Basically, we are following the standard base pairing rule: A-T, C-G. 
> Bonus: Can you increase the robustness of the code by outputting a warning message when a non-nucleotide base is used as input?

In [None]:
# Define a nucleotide base
base = 'T'

# Print its complementary base using conditional statements
#### Insert your code below ####
if base == 'A':
    print('T')
elif base == 'T':
    print('A')
elif base == 'G':
    print('C')
elif base == 'C':
    print('G')
else:
    print('Warning: the input base does not exist!')

## Recap
In this section, we introduced **logic operators** and the use cases of **conditional statements**. We gradually built up the complexity of the conditional statement, and now we should be able to handle multiple decision-making scenarios.

---

# Section VI: Final exercise
Now let's apply what we have learnt in the whole workshop and write a small script from scratch! 

Write a function called "rev_comp". Given a DNA sequence as input, this function will generate the reverse complementary sequence as output. For example, `rev_comp('ATCGT')` will output `'ACGAT'`. 

>Hint: We can dissect this task into several steps:
>1. Create an empty list to store the complementary sequence
>2. Iterate through every nucleotide base of the DNA sequence (using the `for` loop)
>3. In each iteration, check what the nucleotide is (using a conditional statement), and add its complementary base to the list you created in step 1.
>4. After the iteration is done, reverse the complementary list
>5. Convert the list into a string (using an object-specific join function)

In [None]:
#### Insert your code below ####
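def rev_comp(dna):
    # Step 1: empty list to store the complementary sequence
    comp = []
    # Steps 2-3: look at each base and append its complement
    for base in dna:
        if base == 'A':
            comp.append('T')
        elif base == 'T':
            comp.append('A')
        elif base == 'C':
            comp.append('G')
        elif base == 'G':
            comp.append('C')
        else:
            # Do not silently treat unknown input as 'G'; flag it and
            # use 'N', the conventional symbol for an unknown base
            print('Warning: the input base does not exist!')
            comp.append('N')
    # Step 4: reverse the complementary list
    reversed_comp = comp[::-1]
    # Step 5: join the list into a single string
    return ''.join(reversed_comp)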

In [None]:
# Once you are done, test the function to see if you obtain the expected result 'ACGAT'
rev_comp('ATCGT')

In [None]:
# Once done, run this cell to check your answer
dna = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
assert rev_comp(dna) == ans_final
print("Your function is correct! Yay!")

---

# Section VII: Class survey
That is almost the end of our workshop! Before we conclude with some final remarks, please take some time to complete this class [survey](http://tinyurl.com/hbc-modules). We very much appreciate your comments and feedback.

If you have any suggestions or questions in the future, please contact us at [hbctraining (at) hsph.harvard.edu](mailto:hbctraining@hsph.harvard.edu).

---

# Section VIII: Final remarks

## Installing Python
In this workshop, we used [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb), an online cloud-based notebook. Colab allows us to easily write Python code and execute it just using our internet browser. It is rapidly gaining popularity for collaborative research and teaching (so, you may encounter it again soon!).

If you would like to use Python offline though, you will need to install Python on your computer. Python can be downloaded and installed from [here](https://www.python.org/downloads/). The latest release version is Python 3.8 (as of June 2020), and it is recommended that you install the latest version.
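
Once Python is installed, one quick way to confirm which version you are running is to ask the interpreter itself. This is a minimal sketch using the standard `sys` module; the variable name `is_recent` and the 3.8 threshold (taken from the release mentioned above) are choices made for this example.

```python
import sys

# Full version string of the interpreter that is running this code
print(sys.version)

# sys.version_info can be compared against a (major, minor) tuple
is_recent = sys.version_info >= (3, 8)
print("Python 3.8 or newer:", is_recent)
```

The same lines work in Colab, in a Jupyter notebook, or in a script, so they can also be used to compare environments.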

After Python is installed, there are multiple platforms where you can run Python code. The choice of platform is a personal preference and does not affect how your code runs, because at the core every platform uses the same native Python environment. Below we introduce some other popular platforms (from among many other options):
1. Jupyter Notebook/JupyterLab: [Jupyter Notebook](https://jupyter.org/) is an open-source web application for interactive computing. It allows users to create and share documents that contain code, equations, plots and narrative text. Since Colab is based on the Jupyter notebook environment, we have already seen this kind of interface in the workshop. The extension of the notebook file is always `.ipynb`. JupyterLab is the next-generation version of Jupyter Notebook, with some improvements. You can install them from [here](https://jupyter.org/install).
2. Spyder: [Spyder](https://www.spyder-ide.org/) is an open-source integrated development environment (IDE) for scientific programming in Python. It offers a combination of script editing, data analysis, debugging, and visualization. If you are already used to coding in an IDE interface, like RStudio or Matlab, you will find that using Spyder is very familiar and intuitive.
3. **(Recommended)** Anaconda: Many Python users rely on a one-stop-shop package manager like [Anaconda](https://docs.anaconda.com/anaconda/user-guide/getting-started/). Anaconda allows users to launch applications and manage conda packages, environments and channels without using the command line. After installing Anaconda, you can access its desktop graphical user interface (GUI), the Anaconda Navigator. Applications like Jupyter Notebook, JupyterLab, and Spyder are installed by default and are available from the Navigator. Anaconda for Python can be downloaded from [here](https://www.anaconda.com/distribution/).

## Python vs R
Both Python and R are very popular programming languages, with ample training materials and community support. There are many online discussions about which one is better for [bioinformatics](https://www.reddit.com/r/bioinformatics/comments/af7wjv/r_language_vs_python_which_is_the_most_necessary/), or for [data analysis](https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis) in general. We should keep in mind that both have strengths and weaknesses - one language is not better than the other. It all depends on your use case and on which tools/packages within each language are available to best handle your task. Python is a general-purpose programming language with easy-to-understand syntax and is a good starting point for learning basic programming. R is widely used among statisticians and has field-specific advantages, such as for [RNA-seq analysis](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) or for [data visualization](https://ggplot2.tidyverse.org/index.html).

## Future learning
If you are interested in learning more about the basics of Python programming, we have listed a few additional resources below:
- [Python course on kaggle](https://www.kaggle.com/learn/python)
- [Python course on codecademy](https://www.codecademy.com/learn/learn-python)
- [Python course on software carpentry](https://swcarpentry.github.io/python-novice-inflammation/)
- [A Byte of Python](https://python.swaroopch.com/)
- [Python for Biologists](http://userpages.fu-berlin.de/digga/p4b.pdf)


---

**Authors**: Jihe Liu, Radhika Khetani

*This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*