# Practical Activity: DNA Strings

This notebook is designed to reinforce the concepts introduced in Unit 1 of the Introduction to Biology and Programming course and to give you some experience of using Python to solve biological problems with minimal guidance.

Please work through the material presented here and add code to the cells as indicated. There are cells you can use to check your answers throughout. Note that, as with all programming, there are many solutions to these problems, but as long as your code works and is readable, that's good enough!

## Initial Setup

Before you start working through the exercises below, please make sure you run the Python cell below that will set up everything you will need. This code will provide access to the following functions:

* `dna1`, `dna2`, `dna3`: Each of these functions takes no arguments and returns a different DNA sequence
* `generate_dna_sequence(...)`: This function takes an integer and returns a random DNA string with this number of bases
* `dna_analyser_v1(...)`: Version 1 of a DNA analyser function that takes a DNA string and performs some analysis
* `dna_analyser_v2(...)`: Version 2 of a DNA analyser function that takes a DNA string and performs some analysis
* `get_time()`: Returns the current time as the number of seconds since a fixed reference point (1st Jan, 1970 if you're interested!)
* `dna_experiment_output()`: Produces some DNA data as if produced by an experiment.


In [None]:
from unit_1_library import *

## Exercise 1: Manipulating DNA Strings

DNA consists of sequences of four bases represented by the letters A, C, G and T. In Python, a DNA sequence is most easily represented by a string of these letters. In this first exercise you will use external functions and string manipulation routines to analyse some DNA strings. Note that for now you'll need to copy and paste some of the code, but don't worry - we'll learn techniques to avoid that later in the course!

Can you:

* Create 3 variables that contain the DNA strings returned by the functions `dna1`, `dna2` and `dna3`
* Find the length of all the sequences
* Combine the shortest two sequences into a new sequence
* Find the number of each type of base in the longest sequence
* One of the DNA strings is included in one of the others. Find the base number (index) where this sub-sequence occurs

After this, please fill in the cell underneath with the answers and run it to check if you've got it right!

### Important Info
* To find the length of a string or collection, use the `len()` function, e.g.:
```
In [1]: len("test")
Out[1]: 4
```
* To combine strings you can just add them together:
```
In [1]: "abc" + "def"
Out[1]: 'abcdef'
```
* You can find the number of times a character or substring occurs in a string using `string.count(...)`, e.g.:
```
In [1]: mystr = "my testing string"
In [2]: mystr.count("i")
Out[2]: 2
```
* To find the index at which a substring occurs in another string, use `string.find(...)`, e.g.:
```
In [1]: mystr = "my testing string"
In [2]: mystr.find("ing")
Out[2]: 7
```
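
As a rough illustration, the routines above can be combined as in the sketch below. The strings are made up and the names `short_seq`, `long_seq` and `joined` are hypothetical; they are not the exercise data. The sketch also shows that `find` counts positions from zero and returns `-1` when the substring is not present.

```python
# Made-up example strings, not the sequences used in the exercise
short_seq = "GATT"
long_seq = "ACGATTACA"

# Length of each string
print(len(short_seq), len(long_seq))

# Adding two strings gives a new, longer string
joined = short_seq + long_seq
print(joined)

# Count how many times one base appears
print(long_seq.count("A"))

# find() returns the zero-based index of the first match...
print(long_seq.find(short_seq))

# ...and -1 if the substring does not occur at all
print(long_seq.find("TTTT"))
```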

There is more information about the routines provided by the string type here:

https://docs.python.org/3/library/stdtypes.html#string-methods


In [None]:
len_dna1 = 
len_dna2 = 
len_dna3 = 

combined_sequence = 

num_A = 
num_C = 
num_G = 
num_T = 

base_pos = 

check_answers_l4_ex1(globals())

## Exercise 2: Processing a large DNA string

We're now going to look at large DNA sequences and how their size can affect the way you do your processing. As you scale up to larger datasets, you may start hitting efficiency problems or find that your chosen algorithm doesn't work as well as it did with smaller datasets. It's good to approach these kinds of issues systematically to find out where the bottlenecks are. You then know where to concentrate when optimising the code or the algorithm.

Can you:

* Use the given function `generate_dna_sequence(...)` to generate a random DNA string that contains 5 million bases and store it in a variable. Don't print this out as it will be very large!
* Call the functions `dna_analyser_v1(...)` and `dna_analyser_v2(...)` in turn with this generated DNA variable and check that they give the same answers
* You may have noticed that one takes longer than the other. Using the `get_time()` function, use variables to record the time before and after each call to the analyser functions and see how long each one takes. Don't include the time it takes to generate the DNA sequence!

After this, please fill in the cell underneath with the answers and run it to check if you've got it right!
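
The before-and-after timing pattern described above might look something like the sketch below. This is only an illustration under two assumptions: it uses Python's standard `time.time()` in place of `get_time()` (assuming the two behave alike), and `slow_count` is a hypothetical stand-in rather than one of the course analyser functions.

```python
import time


def slow_count(sequence):
    """Hypothetical stand-in for an analysis function: counts 'A' bases one at a time."""
    total = 0
    for base in sequence:
        if base == "A":
            total += 1
    return total


# Build the input first so that creating it is not part of the timing
sequence = "ACGT" * 250_000

start = time.time()            # time just before the call
result = slow_count(sequence)
end = time.time()              # time just after the call

elapsed = end - start          # seconds taken by the call alone
print(result, elapsed)

# Results from two approaches can be compared with == to confirm they agree
print(result == sequence.count("A"))
```

The key point is that only the call being measured sits between the two time readings.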

In [None]:
time_v1 = 
time_v2 = 

check_answers_l4_ex2(globals())

## Exercise 3: Filtering data

Quite often, the output from a program or function contains the information you want mixed in with other output that you don't. In these cases you will have to prepare the data, or filter out the information you need, before moving on to the next stage of your analysis. This exercise looks at an example of this kind of filtering.

Can you:

* Store the output from the function `dna_experiment_output()` in a variable
* Look at this data and see what issues there are with it - you want a continuous DNA string with no newlines or other characters
* Use `string` functions to fix the DNA data (don't forget the newline characters `'\n'` and spaces `' '`)
* Check this works by passing it through the `dna_analyser_v1(...)` function (NOT the v2 function as this won't check the input DNA sequence!)

### Important Info
* To replace a part of a string, use the `string.replace(...)` function, e.g.:
```
In [1]: mystr = "my testing string"
In [2]: mystr.replace("testing", "new")
Out[2]: 'my new string'
```
* To remove a substring from a string you can just `replace` it with an empty string (`""`)
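
Strings cannot be changed in place, so `replace` returns a new string and the result needs to be stored in a variable. A brief sketch on a made-up string follows; `messy` and `clean` are hypothetical names, and `-` is just an example of an unwanted character, not necessarily one that appears in the exercise data.

```python
# Made-up string containing two kinds of unwanted characters
messy = "AC-GT\nTT-A"

# replace() leaves the original untouched and returns a new string
clean = messy.replace("-", "")

# Calls can be applied one after another to strip several unwanted pieces
clean = clean.replace("\n", "")

print(clean)
print(messy == "AC-GT\nTT-A")  # the original string is unchanged
```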

Again, you may want to look at this page for more info:

https://docs.python.org/3/library/stdtypes.html#string-methods


Hopefully this has given you some more hands-on experience with Python and allowed you to start solving problems with code. We will be building on what you've done here over the next few weeks!