# Strings and DNA

Strings are one of the most basic data types in programming. A string is a sequence of characters, and strings have many practical applications in the computational sciences. In this tutorial, we will learn how to manipulate strings, using a DNA sequence as an example.
DNA is composed of four distinct bases: adenine, cytosine, guanine, and thymine. Generally, these bases are presented as single characters, A, C, G, and T, respectively. From the many combinations of these four letters, an organism's genetic code is produced. 
Here's an example sequence:

AGGGCCCTTTT

Store this sequence in the variable 'sequence' below. *Remember to add quotation marks "" around your sequence; this tells the program it is a string.* 
Print out 'sequence' to confirm you successfully stored your sequence.

In [None]:
sequence = # Store your sequence here


# Use replace() to create RNA from the Coding Strand
DNA exists as a molecule with two strands. In general terms, only one of these strands carries the sequence that matters for making a protein. This is called the coding strand, and its sequence matches the RNA molecule that is translated to form a protein.

![DNA%20dogma.jpg](attachment:DNA%20dogma.jpg)

Notice how the RNA is just the coding sequence with the "T"s replaced by "U"s? RNA uses a base called uracil instead of thymine. Imagine our sequence is the coding sequence of a gene. What would the RNA look like?

We'd use the replace() function. It returns a new string in which every occurrence of one substring has been swapped for another; the original string is left unchanged.
See this in action below:

In [None]:
sentence = "This is a test." # Stored a string in the variable "sentence"
new_sentence = sentence.replace("This", "That") # Tell replace() what is being replaced and give it a new substring.
print(new_sentence) # Look at how our sentence has changed

Notice how we have "sentence." before the replace() function. The function must know what it is acting on, so we tell it to act on the sentence variable.
Now, try creating an RNA molecule from your sequence above (*Assume your sequence is the coding strand*). Store the RNA in a variable "RNA" and print it.
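
As a further illustration, here is a minimal sketch on a different, made-up sequence (`example_dna` and `example_rna` are hypothetical names, not part of the exercise). It shows that replace() hands back a new string, so the result must be stored in a variable to be kept.

```python
# Hypothetical example sequence, not the one from the exercise
example_dna = "ATGTTC"

# replace() returns a new string; the original is left untouched
example_rna = example_dna.replace("T", "U")

print(example_dna)  # ATGTTC
print(example_rna)  # AUGUUC
```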

# For Loops
We've just printed out the full sequence. What if you wanted to perform a particular function on each nucleotide (*iterate* over them)? What if you wanted to only print out each individual letter? This means you would need to *loop* over each nucleotide in the sequence.
This is what a for loop does.
An example is presented below.

In [None]:
k = "word"
for letter in k: # for every letter in "word"
    print(letter) # print the letter

Of course, it does not matter what we name the loop variable, which holds one character of the string at a time. In this case, "letter" was used, but 'n', 'x', etc. would work just as well. 
Now, attempt to do the same for your "sequence" variable, printing out each individual nucleotide.
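
If the position of each character is also needed, Python's built-in enumerate() yields an index alongside each character. A minimal sketch on the same "word" string, where `position` and `positions` are names chosen only for illustration:

```python
k = "word"
positions = []  # collect (index, character) pairs so they can be inspected later
for position, letter in enumerate(k):  # position counts up from 0
    print(position, letter)
    positions.append((position, letter))
```

Here `letter` still holds the character itself, while `position` holds where that character sits in the string.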

## Making Multiple Replacements
Suppose we wanted to change multiple characters in a string. Take this sequence for example:

*AAAATTTT*

Let's say we wanted to replace "T" with "A" and "A" with "T", to read:

*TTTTAAAA*

However, the code below will not work:

In [None]:
seq1 = "AAAATTTT"
seq2 = seq1.replace("A", "T") # Replace A's with T's
seq2 = seq2.replace("T", "A") # Take output (seq2) of previous line and replace T's with A's
print(seq2) 
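
Printing the intermediate result makes the failure visible. A short sketch of the same two replacements, using the hypothetical names `step1` and `step2` for the two stages:

```python
seq1 = "AAAATTTT"

step1 = seq1.replace("A", "T")   # every A becomes T
print(step1)                     # TTTTTTTT

step2 = step1.replace("T", "A")  # every T, including the new ones, becomes A
print(step2)                     # AAAAAAAA
```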

We're trying to swap two things that both exist in the string. Because the first replace() turns every A into a T, the second replace() cannot tell the original T's from the new ones. It seems impossible, but it can be solved with a for loop! **Study the code carefully below to understand how the for loop handles each sequence character**. At the end of each iteration of the loop, "new_sequence" will be printed to show you the process.

Notice how new_sequence starts as a blank string, and characters/nucleotides are added to it one at a time. **Adding two strings to one another is called concatenation**. After being concatenated, the result is stored back into new_sequence, effectively "updating" the new_sequence variable. 

In [None]:
new_sequence = '' # This is a blank string
for n in seq1:
    if n == 'A': # Check if the nucleotide 'n' is 'A'
        new_sequence = new_sequence + "T" # If the nucleotide is 'A', let's add 'T' to the new_sequence string 
    if n == 'T': # Check if the nucleotide 'n' is 'T'
        new_sequence = new_sequence + "A" # If the nucleotide is 'T', add 'A' to the new sequence string
    print(new_sequence) # This is to show how the new_sequence is "growing" as the for loop continues
print(new_sequence)
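
As an aside, Python strings also offer a built-in way to swap characters in a single pass: str.maketrans() builds a translation table and translate() applies it. This is an alternative to the loop, not the approach this tutorial asks you to practice; `swap_table` and `swapped` are names chosen for illustration.

```python
seq1 = "AAAATTTT"

# Map A -> T and T -> A; each character is looked up once, so the swaps do not collide
swap_table = str.maketrans("AT", "TA")
swapped = seq1.translate(swap_table)

print(swapped)  # TTTTAAAA
```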

## Creating the Complement/Template Strand
As shown in the image at the beginning of this tutorial, the template strand actually interacts with protein machinery to produce RNA (a copy of the coding strand). DNA has a specific way of binding two strands together: "A" binds to "T," and "C" binds to "G". These bases complement one another. 
Let's change our "sequence" string from the start to a template strand. You must replace all A's with T's, T's with A's, C's with G's, and G's with C's, like so:

ACGTGTG<br>
TGCACAC

Store your template sequence in a variable, "template":
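
One way to keep four replacements manageable is to store the base pairs in a dictionary and look each nucleotide up inside the loop. The sketch below applies that idea to the ACGTGTG example above rather than to your own sequence; `pairs`, `example`, and `example_complement` are hypothetical names, and dictionaries are not otherwise covered in this tutorial.

```python
# Each base maps to the base it binds to
pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}

example = "ACGTGTG"
example_complement = ""  # start with a blank string, as before
for n in example:
    example_complement = example_complement + pairs[n]  # append the partner base

print(example_complement)  # TGCACAC
```

The same result can be reached with four if statements in the style of the A/T loop shown earlier.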

# Creating the Reverse Complement Strand
DNA has direction when being read by protein machinery, just like words on a page are read from left to right. Reading in the wrong direction could produce an entirely different protein! 
This left-right direction is referred to as 5' to 3' (pronounced "5 prime" and "3 prime"). These numbers refer to the position of a carbon in the deoxyribose sugar of the DNA backbone (*however, this knowledge is not relevant to the exercise*). 
DNA is antiparallel, meaning its two strands run in opposite directions, as our example sequences demonstrate:

5' - AGGGCCCTTTT - 3'  *Coding strand*
<br>3' - TCCCGGGAAAA - 5'  *Template strand*

So, the template strand you printed out above is backwards! Instead, we want to see the template as:

5' - AAAAGGGCCCT - 3'

DNA is always presented in the 5' to 3' direction. Python has a reversed() function, but it returns an iterator rather than a string, so it will not give you a reversed string on its own. Use your code above to create the reverse complement strand. <br> *Hint: The order in which you write things in your Python code is very important. Look closely.*
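
For reference, here is a hedged sketch of three common ways to reverse a string in Python, shown on an unrelated word so the exercise is left to you. The names `text`, `backwards`, `joined`, and `sliced` are chosen only for illustration.

```python
text = "stressed"

# Option 1: prepend each character so later characters end up in front
backwards = ""
for letter in text:
    backwards = letter + backwards

# Option 2: reversed() yields characters last-to-first; join() glues them into a string
joined = "".join(reversed(text))

# Option 3: a slice with step -1 walks the string from the end to the start
sliced = text[::-1]

print(backwards, joined, sliced)  # desserts desserts desserts
```

Option 1 reuses the concatenation pattern from the earlier loops; the only change is which side of the existing string the new character is added to.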