# Microbial Genomics: Laboratory 1
## Topic: Working with strings and string finding algorithms
#### Tools used: Anaconda, Jupyter notebook

In [1]:
## Compatibility code for running in Google Colab - feel free to ignore
import os

if "colab" in str(get_ipython()):
    from google.colab import drive

    drive.mount("/content/drive")
    os.chdir("/content/drive/My Drive/microbial_genomics_labs/labs")

## Part A: Exercises (10 pts)
Python is one of the most commonly used languages for bioinformatic analysis because it handles strings easily and supports interactive work. We will use Python for many topics in this course, so it is important to (re)familiarize yourself with the basics. As with all programming, an understanding of basic syntax should allow you to use available online resources to implement more complex logic.

All Python documentation can be found [here](https://docs.python.org/3/).

### Exercise 1: Variable types and simple math operators (1 pt)
* Variables do not display automatically after assignment. Enter the variable name again (or use `print`) to examine the value stored in it
* Typical variable naming rules apply (letters, digits, and underscores; a name cannot start with a digit)
* A single equals sign assigns a value to a variable name
* To execute a cell: shift+return
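
As a minimal sketch of the points above, the following shows assignment with a single equals sign, displaying a value with `print`, checking a data type with the built-in `type` function, and looking up a dictionary value by its key. All variable names and values here are illustrative and are not part of the lab files.

```python
# Assign values with a single equals sign (names and values are illustrative)
gene_name = "dnaA"                           # string
copies = 3                                   # integer
fraction = 0.25                              # float
codon_table = {"ATG": "Met", "TGG": "Trp"}   # dictionary

# print() displays a value; type() reports the data type
print(gene_name)
print(type(fraction))

# Look up a dictionary value by its key using square brackets
start_codon_aa = codon_table["ATG"]
print(start_codon_aa)
```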

In [2]:
## Some example variable assignments using different data types

my_string = "This is a string"
my_int = 10
my_int2 = 10 + 20
my_list = ["cat", "dog", "duck"]
my_dict = {"Biology": 80, "Chemistry": 75, "Physics": 30}

Practice
1. Define one new variable of each type (string, integer, list) below
2. Define three new variables using each operation: addition, subtraction, and multiplication, respectively. These can be found in Built-in Types --> Numeric Types in the documentation above.
3. Define a dictionary of any number of cities (keys) and associated states (values), and practice looking up a state using its city key

In [None]:
# Practice for exercise 1

### Exercise 2: Common and useful functions (1 pt)
* Python functions are called using parentheses
* [Built-in functions](https://docs.python.org/3/library/functions.html) perform useful operations such as rounding, taking an absolute value, and finding the number of characters in a string. Each variable type also has its own methods (for strings: replacing, splitting, joining, etc.), which are called with a dot after the variable name and can be found in the documentation under the corresponding type.
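
To make the distinction concrete, here is a small sketch that contrasts built-in functions, which take the value as an argument, with string methods, which are called on the value itself. The sequence and variable names are made up for illustration.

```python
seq = "ATGCGATAG"  # illustrative sequence, not from the lab data

# Built-in functions take the value as an argument
seq_length = len(seq)        # number of characters
largest = max(3, 7, 5)       # largest of several numbers

# Methods are called on the value itself with a dot
lower_seq = seq.lower()      # lowercase copy; seq itself is unchanged
n_a = seq.count("A")         # number of non-overlapping occurrences of "A"
gat_index = seq.find("GAT")  # index of the first match, or -1 if absent

print(seq_length, largest, lower_seq, n_a, gat_index)
```

Calling `help(str.count)` in a cell prints the documentation for a method without leaving the notebook.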

In [None]:
## For numbers:
myNum = -20.4

# Round myNum and print the resulting value
round_num = round(myNum)
print(round_num)
# Find the absolute value of myNum and print the resulting value
pos = abs(myNum)
print(pos)

In [None]:
## For strings:
strng = "A dog with a cat with a dog"

# Print the string
print(strng)
# Find the length of the string and print it
print(len(strng))
# Replace 'dog' with 'fish' and print the resulting string
str2 = strng.replace("dog", "fish")
print(str2)
# Split strng into a list of words, using spaces as the separator
str_list = strng.split(" ")
print(str_list)

Practice

Define a new string or list of strings (your choice). Choose any three string functions from the documentation that we didn't use above and implement them below.

In [None]:
# Practice for exercise 2

### Exercise 3: Control logic (2 pts)
* As with any language, Python uses relational and boolean operators to implement logical statements
* Python uses indentation (by convention, four spaces) to mark the body of a control structure (if statement, for loop, or user-defined function)
* The first line of a control structure must end with a colon
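
The sketch below illustrates how the colon and the indentation work together, using a made-up sequence and counting its G and C nucleotides. It is only one possible way to write such a loop.

```python
sequence = "ATGCGCTA"  # illustrative sequence

gc_count = 0
for nucleotide in sequence:          # the colon opens the for loop
    if nucleotide in ("G", "C"):     # the colon opens the if block
        gc_count = gc_count + 1      # runs only when the test is true

# Back at the left margin: this runs once, after the loop has finished
gc_fraction = gc_count / len(sequence)
print(gc_count, gc_fraction)
```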

In [None]:
## Relational operators:
out1 = 4 == 4  # 4 equals 4
print(out1)
out2 = 10 > 5  # 10 is greater than 5
print(out2)
out3 = 10 != 5  # 10 is not equal to 5
print(out3)
out4 = "dog" != "fish"
print(out4)

In [None]:
## Boolean operators:
bool1 = out2 and out1
print(bool1)
bool2 = out2 and not out1
print(bool2)

In [None]:
## Control flow (if/else, for)

# if statements
if out2 and out1:
    print("This if statement is true!")

# if else-if statements
if out1 and not out2:
    print("This first statement is true!")
elif out1 or out3:
    print("This second statement is true!")

# for loops: what is this loop doing?
food_list = ["apple", "carrot", "chocolate", "jam", "eggs", "salmon"]
count = 1
for item in food_list:
    if item == "carrot":
        print("Item number " + str(count))
        break
    else:
        count = count + 1

In [None]:
## Functions


# Make a function that takes in a DNA sequence and returns the reverse complement
def reverse_comp(sequence):
    DNA_dict = {"A": "T", "T": "A", "G": "C", "C": "G"}
    comp_sequence = []
    for nucleotide in sequence:
        comp_sequence.append(DNA_dict[nucleotide])
    output = "".join(reversed(comp_sequence))
    return output


sequence = "ACCGTC"
revcom_seq = reverse_comp(sequence)
print("Reverse complemented sequence: " + revcom_seq)

Practice
1. Use relational operators to determine if 10 is less than 20
2. Make your own if/elif/else statement that uses both number comparisons and string comparisons
3. Write a for loop that iterates through characters of the generic DNA sequence AGCCTAT, and creates a new sequence consisting only of the A's and T's in the order they originally occur

In [None]:
# Practice for exercise 3

### Exercise 4: List indexing and comprehensions (2 pts)
* List comprehensions are a more concise way to write logic that generates a list
* Indexing in Python begins at 0
* Indexing in Python uses square brackets
* List comprehensions are always enclosed in square brackets
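
As an illustration of these points, the sketch below builds the same list twice, once with an explicit loop and once with a list comprehension, and then shows zero-based indexing and slicing. The gene names are placeholders chosen for the example.

```python
genes = ["dnaA", "recA", "gyrB", "rpoB", "dnaN"]  # illustrative list

# Explicit loop: collect the names that start with "dna"
dna_genes_loop = []
for gene in genes:
    if gene.startswith("dna"):
        dna_genes_loop.append(gene)

# Equivalent list comprehension, enclosed in square brackets
dna_genes = [gene for gene in genes if gene.startswith("dna")]
print(dna_genes)

first_gene = genes[0]      # indexing starts at 0
last_gene = genes[-1]      # -1 refers to the last item
middle_genes = genes[1:3]  # slice: indices 1 and 2 (the end index is excluded)
print(first_gene, last_gene, middle_genes)
```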

In [None]:
## Working with lists

# Create a list
fruits = ["orange", "apple", "pear", "banana", "kiwi", "apple", "banana"]
# Display the first item in the list
print(fruits[0])
# Print first two elements of fruits
print(fruits[0:2])
# Display the last item in the list
print(fruits[-1])
# Display every other fruit in the list
print(fruits[0 : len(fruits) : 2])
print(fruits[::2])
# Insert grapes into second position
fruits.insert(1, "grapes")
print(fruits)

In [None]:
## Create a list comprehension

# Create a list of the squares of the integers 0 to 9
squares = [x**2 for x in range(10)]
print(squares)
# Create a new list for all pairs of x and y where the two don't equal each other
[(x, y) for x in [1, 2, 3] for y in [3, 1, 4] if x != y]

Practice
1. Define a list of 5 vegetables
2. Append eggplant to the end of your list
3. Display the second, third, and fourth elements of your list
4. Re-define fruits as above, and use list comprehension to create a new list of fruits only containing fruits that start with a vowel

In [None]:
# Practice for exercise 4

### Exercise 5: Data input/output (2 pts)
* Never open a file without closing it
* To avoid forgetting to close files and corrupting the content, use 'with' and 'open' together; the file is closed automatically at the end of the 'with' block
* Files can be opened for reading only ('r'), writing a new file ('w', which overwrites any existing file), or appending to an existing one ('a')
* 'write' writes text to the file regardless of its extension. Plain-text formats such as 'txt' and 'csv' work directly; binary formats such as 'xlsx' need a dedicated library
* "\n" adds a new line
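
The following sketch shows the three file modes and the automatic closing done by `with`. It writes to a temporary directory so that no lab files are touched; the file name `example.txt` is hypothetical.

```python
import os
import tempfile

# Work in a temporary directory so no lab files are modified
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.txt")  # hypothetical file name

with open(path, "w") as f:   # "w" creates the file (or overwrites it)
    f.write("first line\n")
print(f.closed)              # the with block has already closed the file

with open(path, "a") as f:   # "a" adds to the end of the existing file
    f.write("second line\n")

with open(path, "r") as f:   # "r" opens the file for reading only
    lines = [line.rstrip("\n") for line in f]
print(lines)
```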

In [None]:
## Working with files

# opening a file
filename = "lab1/test.txt"
grocery_list = ["apple", "banana", "orange"]
with open(filename, "w") as f:
    for fruit in grocery_list:
        f.write(fruit + "\n")

with open(filename, "r") as f:
    for line in f:
        print(line, end="")  # each line already ends with "\n"

In [None]:
# just so we can see that the file exists outside python...
!cat lab1/test.txt

Practice
1. Open test.txt to append
2. Add 5 more fruits to the document and save the resulting file

In [None]:
# Practice for exercise 5

### Exercise 6: Running scripts externally (2 pts)
* Importing libraries is a useful way to expand Python's functionality. Argument parsing, for example, is provided by a standard-library module called argparse
* Import statements, such as 'import argparse', go at the top of a script
* By default, positional arguments are read in the order in which they are defined in the script
* Follow the syntax in reverse_complement.py 
* To create strings with multiple variable types, use .format (see example below)
* To access arguments, use options.<argument_name> as defined in the get_options() function
* `if __name__ == "__main__":` is the standard Python idiom for code that should run only when the file is executed as a script, not when it is imported
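
For orientation, here is a hedged sketch of how these pieces typically fit together in a script. It is not the contents of `reverse_complement.py`: the task (reporting a sequence length), the `--upper` flag, and the `main` function are assumptions made for illustration. Only the `get_options()` name and the `options.<argument_name>` access pattern follow the conventions listed above.

```python
import argparse


def get_options(argv=None):
    """Define and parse command-line arguments.

    With argv=None, argparse reads sys.argv; passing a list is handy for testing.
    """
    parser = argparse.ArgumentParser(description="Report the length of a DNA sequence")
    parser.add_argument("sequence", help="DNA sequence, e.g. ACGT")
    parser.add_argument("--upper", action="store_true", help="convert to uppercase first")
    return parser.parse_args(argv)


def main(argv=None):
    options = get_options(argv)
    # Arguments are accessed as options.<argument_name>
    seq = options.sequence.upper() if options.upper else options.sequence
    message = "Sequence {} has {} nucleotides".format(seq, len(seq))
    print(message)
    return message


if __name__ == "__main__":
    # In a real script this line would be main(), which reads sys.argv.
    # An explicit list is passed here so the sketch also runs inside a notebook cell.
    main(["acggtgc", "--upper"])
```

Saved as a file, with the last line changed to `main()`, a script like this could be called from a notebook cell using `%run` followed by the script path and a sequence.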

In [None]:
## Print a string with string formatting

fruits = ['apples','bananas']
string = "I love to eat {} and {}".format(fruits[0],fruits[1])
print(string)
## Calling a python script from Jupyter
seq = "ACGGTGC"
%run lab1/reverse_complement.py "$seq"

## PRACTICE: Open up reverse_complement.py and walk through it - 
## which aspects are unique to this script, and which are general that 
## you should use every time you write a python script and call from Jupyter?
## Get a feel for the syntax! You'll need it in the homework, so play around with the code
## and try things out

## Part B: Homework exercises (30 pts)

### Question 1: Pattern matching (6 pts)

Write an external Python script that takes in an arbitrary DNA sequence `s` containing only A's, C's, G's, and T's, as well as an arbitrary pattern `p` with a length between 3 and the length of `s`. The script should count the number of times the pattern appears within the sequence, including overlaps, and print the number of matches it finds.

Demonstrate that your script works by calling it from within your Jupyter notebook with the following variables:

`sequence = "ATTTGCGATAATTATAGCGATATACCCTG"`

`pattern = "ATA"`

In [None]:
# Question 1

### Question 2: Hamming distance (6 pts)

The Hamming distance between two strings having the same length is the minimum number of symbol substitutions required to transform one string into the other. If two strings are given by `s1` and `s2`, then we write the Hamming distance between them as `dH(s1,s2)`. For example, if `s1=ATTA` and `s2=ATTG`, `dH(s1,s2)=1`. 

Write an external Python script that computes the Hamming distance between two strings of the same length and prints the distance as the output.

Demonstrate that your script works by calling it from within your Jupyter notebook with the following two sequences:

`sequence1 = "ATTCGCTGCAAATGCTT"`

`sequence2 = "AATCTGTGCATAACCTT"`

In [None]:
# Question 2

### Question 3: Reverse complement with list comprehensions (6 pts)

Re-write the reverse complement function from exercise 6 above using list comprehension syntax, and demonstrate that it works by reverse-complementing the same sequence we used before. This can be done in the cell below and does not need to be in a separate script.

In [None]:
# Question 3

### Question 4: DNA manipulation (6 pts)

Bacterial chromosomes have a single origin of replication that indicates where polymerases should bind for initiating replication. These are called _dnaA_ boxes, and are typically ~1000 bp long. _dnaA_ boxes contain short repeated segments of DNA, called _k_-mers, that specify polymerase binding sites, where _k_ is the number of nucleotides in the repeated region, i.e. a 3-mer contains 3 nucleotides.

The _dnaA_ box of Vibrio cholerae is found in `lab1/dnaA.txt`. Write code in your Jupyter notebook below to do the following:
1. Open the file and append each line to obtain a single continuous DNA sequence
2. Search for all unique _k_-mers between the lengths of 3 and 20 nucleotides long
3. Count the number of times each unique _k_-mer appears in the _dnaA_ sequence, and determine the most commonly occurring _k_-mer
4. Print the most commonly occurring _k_-mer, and the number of times it appears

In [None]:
# Question 4

### Question 5: Approximate _k_-mers (6 pts)

Often, _k_-mer repeats are almost identical but not exact. How would you update your code from Question 4 to allow for up to X mismatches?

Provide your answer as a comment in the cell below, being specific about the logic you would use.

In [None]:
# Question 5