<h1 id="toctitle">Text manipulation</h1>
<ul id="toc"/>

##Parts of code
###Statements, functions, arguments
Start with the very short Python program from the last session:

In [3]:
print("Hello World")

Hello World


`print` is the name of a **function**.

`"Hello World"` is the function's **argument**.

The whole thing `print("Hello World")` is a **statement** (one instruction).

Things in quotes are strings (of characters) - either single or double is fine:

In [4]:
print("Hello world")

Hello world


In [5]:
print('Hello world')

Hello world


In [6]:
print("She said, 'Hello world'")

She said, 'Hello world'
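
As a small sketch of why having both kinds of quotes is handy: whichever kind we use on the outside, we can use the other kind inside the string. The variable names `quote_a` and `quote_b` are made up for this example.

```python
# double quotes on the outside, single quotes inside the string
quote_a = "She said, 'Hello world'"

# single quotes on the outside, double quotes inside the string
quote_b = 'She said, "Hello world"'

print(quote_a)
print(quote_b)
```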


###Comments
To include a bit of text for humans to read, start with `#`. We call this a __comment__:

In [9]:
# print a friendly greeting
# print("Hello world")
print("Hello martin")

Hello martin


Some combinations of characters inside a string have a special meaning: for example `\n` means _start a new line_.

In [10]:
# print the same greeting, over two lines
print("Hello\nworld") 

Hello
world
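
A possible use for `\n`, sketched here with a made-up sequence name (`seq1`) and variable name (`record`): one string can hold a header line followed by a sequence line.

```python
# one string, two lines: a header line, then the sequence
record = ">seq1\nATGCGTA"
print(record)
```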


###Variables
To store a bit of text to use later, we use a __variable__:

In [12]:
# store a DNA sequence in the variable my_dna
my_dna = "ATGCGTA"

# now print the DNA sequence
print(my_dna)

# change the value of my_dna
my_dna = "TGGTCCA"

# print it again
print(my_dna)

ATGCGTA
TGGTCCA


Variable names can be (almost) whatever we want - this code does exactly the same thing as the previous example:

In [13]:
# store a DNA sequence in the variable banana
banana = "ATGCGTA"

# now print the DNA sequence
print(banana)

# change the value of banana
banana = "TGGTCCA"

# print it again
print(banana)

ATGCGTA
TGGTCCA


##Manipulating text (strings)
###Concatenation
We can join two strings together with `+`:

In [19]:
"AAXXXXTT" + "GGCC"

'AAXXXXTTGGCC'

__Important: anywhere we can use a string written with quotes, we can also use a variable that stores a string:__

In [20]:
upstream = "AAA"
my_dna = upstream + "ATGC"
my_dna

'AAAATGC'

In [21]:
upstream = "AAA"
downstream = "GGG"
my_dna = upstream + "ATGC" + downstream
my_dna

'AAAATGCGGG'

The result of concatenating two (or more) strings is also a string:

In [24]:
print("Hello" + " " + "world")

Hello world


###Lengths

To find the length of a string, use the `len()` function:

In [25]:
len("AGTC")

4

What you need to know about `len()`:

- `len()` is a function
- it takes one argument
- the argument here is a string (a number will not work)
- it returns the length of the string
- we have to do something with the returned value
- __the length is a number, not a string!__
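
The points above can be sketched as follows. This is only an illustration; the variable names are arbitrary.

```python
my_dna = "ATGCGAGT"

# store the returned value in a variable so we can use it later
dna_length = len(my_dna)
print(dna_length)

# the length is a number, so we can do arithmetic with it
print(dna_length * 2)
```

This prints `8` and then `16`.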

Numbers and strings are different:

In [27]:
2 + 2

4

In [26]:
"abc" + "def"

'abcdef'

In [30]:
2 + "abc"

TypeError: unsupported operand type(s) for +: 'int' and 'str'

###Division is strange in Python

By default, Python 2 throws away the fractional part when it divides one whole number by another:

In [31]:
10 / 3

3

To make it behave properly, we need to include this line at the start of our programs:

In [33]:
from __future__ import division

Notice that the word `future` has __two__ underscores on each side.

Now division will give the correct answer:

In [34]:
10 / 3

3.3333333333333335

Remember to put this line at the start of all your Python 2 programs!

In Python 3 this has been fixed: `/` always gives the exact answer, so the line is not needed.
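
For comparison, here is a minimal sketch that assumes you are running Python 3. The `//` operator is not covered in the text above; it is the operator for whole-number division that rounds down.

```python
# assuming Python 3: / gives the exact (decimal) result
exact = 10 / 3

# // divides and rounds down to a whole number
floored = 10 // 3

print(exact)
print(floored)
```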

###Converting between numbers and strings

`str()` will take a number and turn it into a string:

In [35]:
4 + "abc"

TypeError: unsupported operand type(s) for +: 'int' and 'str'

In [36]:
str(4) + "abc"

'4abc'

We often want to do this so that we can print a number:

In [39]:
my_dna = "ATGCGAGT"
dna_length = len(my_dna)
print("length of the DNA sequence is " + str(dna_length))

length of the DNA sequence is 8


`int()` will take a string and turn it into a number:

In [40]:
"6" + 7

TypeError: cannot concatenate 'str' and 'int' objects

In [42]:
int("6") + 7
6 + 7

13

We will see later why this is useful.
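
As a rough illustration of the difference, imagine two numbers that arrive as text (the variable names are invented for this sketch). Joining them with `+` and adding them after `int()` give very different results.

```python
first_text = "6"
second_text = "7"

# + on two strings joins them together
joined = first_text + second_text
print(joined)

# converting to numbers first gives ordinary addition
total = int(first_text) + int(second_text)
print(total)
```

This prints `67` and then `13`.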

###Changing case

To turn a string into lower case, use a __method__ called `lower()`:

In [47]:
my_dna = "ATGC"
# print my_dna in lower case
print(my_dna.lower())

"ABC4".lower()

atgc


'abc4'

Look closely at how this works:

- `lower()` is a __method__, not a __function__
- we write the name of the variable first, then a dot, then the method name
- `lower()` has no arguments
- `lower()` only works on strings, not numbers
- __`lower()` doesn't change the variable, it returns a lower case version__

In [48]:
my_dna = "ATGC"

# print the variable
print("before: " + my_dna)

# run the lower method and store the result
lowercase_dna = my_dna.lower()
print(lowercase_dna)

# print the variable again
print("after: " + my_dna)

before: ATGC
atgc
after: ATGC
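
Strings also have an `upper()` method that works in the same way but goes in the other direction. A short sketch (variable names are arbitrary):

```python
my_dna = "atgc"

# upper() returns an upper case copy, just as lower() returns a lower case one
upper_dna = my_dna.upper()
print(upper_dna)

# the original variable is unchanged
print(my_dna)
```

This prints `ATGC` and then `atgc`.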


###Find and replace in strings

Use the `replace()` method:

In [53]:
protein = "vlspadktnv"
print(protein) 

# replace valine with tyrosine
print(protein.replace("v", "y"))

# we can replace more than one character
print(protein.replace("vls", "ymt" ))

# the original variable is not affected
print(protein)

vlspadktnv
ylspadktny
ymtpadktnv
vlspadktnv
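
By default `replace()` changes every occurrence. It also accepts an optional third argument that limits how many replacements are made; that argument is not used elsewhere in this session, so treat the following as a side note.

```python
protein = "vlspadktnv"

# replace only the first valine with tyrosine
first_only = protein.replace("v", "y", 1)
print(first_only)
```

This prints `ylspadktnv`: the valine at the end is left alone.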


###Extracting part of a string

To get just part of a string, put square brackets after the variable name with the start and stop positions separated by a colon. 

We start counting at __zero__ rather than one.
Positions are __inclusive__ at the start and __exclusive__ at the end.

Get positions 3 and 4 of the protein (the slice stops just before position 5):

In [54]:
protein = "vlspadktnv"
protein[3:5]

'pa'

To go to the end of the string, just leave out the stop position:

In [58]:
protein[5:]

'dktnv'
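
A few more variations, sketched with arbitrary variable names: leaving out the start position begins at the start of the string, and a single position without a colon gives one character.

```python
protein = "vlspadktnv"

# leave out the start position to begin at the start of the string
first_three = protein[:3]
print(first_three)

# a single position (no colon) gives just one character
single = protein[3]
print(single)

# the length of a slice is stop minus start
middle = protein[3:5]
print(len(middle))
```

This prints `vls`, `p` and `2`.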

###Counting substrings

The `count()` method will tell you how many times a short string occurs inside a long one:

In [59]:
protein = "vlspadktnv"
# count amino acid residues
valine_count = protein.count('v')
lsp_count = protein.count('lsp')
tryptophan_count = protein.count('w')
 
# now print the counts
print("valines: " + str(valine_count ))
print("lsp: " + str(lsp_count ))
print("tryptophans: " + str(tryptophan_count ))

valines: 2
lsp: 1
tryptophans: 0
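
One detail worth knowing is that `count()` only counts occurrences that do not overlap. A tiny sketch with a made-up sequence:

```python
dna = "AAAA"

# "AA" fits at positions 0, 1 and 2, but overlapping matches are not counted
aa_count = dna.count("AA")
print(aa_count)
```

This prints `2`, not `3`.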


###Finding positions of substrings

The `find()` method will tell you the position of the first occurrence of a short string inside a long one:

In [64]:
protein = "vlspadktnv"
print("the position of the first proline is " + str(protein.find('p')))
print(protein.find('kt'))
print(protein.find('w'))

the position of the first proline is 3
6
-1


In [70]:
protein = "vlspadktnv"
residue = "v"

# the wrong way round: this looks for the whole protein inside the single residue
residue.find(protein)

-1

Remember:

- we start counting from zero
- if the short string isn't there, we get `-1`
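
Because `find()` reports only the first match, one way to find a later match is to pass a second argument saying where the search should start. That optional argument is not covered above, so take this as a sketch:

```python
protein = "vlspadktnv"

# the first valine is at position zero
first_v = protein.find("v")
print(first_v)

# start searching just after the first match to find the next one
second_v = protein.find("v", first_v + 1)
print(second_v)
```

This prints `0` and then `9`.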

There is no way to subtract one string from another with `-`:

In [71]:
"hello" - "o"

TypeError: unsupported operand type(s) for -: 'str' and 'str'
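
One possible way to get the effect of "subtracting" a substring is to replace it with an empty string. The variable names here are only for illustration.

```python
greeting = "hello"

# replacing "o" with an empty string removes it
without_o = greeting.replace("o", "")
print(without_o)
```

This prints `hell`.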

##Summary of all the things!

- Functions
    - `print()` for showing text on the screen
    - `len()` for finding the length of a string
    - `str()` for converting a number to a string
    - `int()` for converting a string to a number
- String methods
    - `lower()` for changing to lower case
    - `replace()` for changing parts of a string
    - `count()` for counting substrings
    - `find()` for locating substrings
- Types of things
    - strings are always surrounded by quotes
    - numbers are always written without quotes
- Other tools
    - `+` can add numbers together (also `/`, `*` and `-`)
    - `[1:2]` can get parts of a string
    - `\n` means start a new line
- Parts of code
    - function
    - method
    - arguments
    - statement
    - quotes
    - comments
    - variables

---

##Exercises

###Calculating AT content

Here's a short DNA sequence:

`ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT`

Write a program that will print out the AT content of this DNA sequence. Hint: you can use normal mathematical symbols like add (`+`), subtract (`-`), multiply (`*`), divide (`/`) and parentheses to carry out calculations on numbers in Python. 

###Complementing DNA

Here's a short DNA sequence:

`ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT`

Write a program that will print the complement of this sequence. (__Not the reverse complement!__)

###Restriction fragment lengths

Here's a short DNA sequence:

`ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT`

The sequence contains a recognition site for the _EcoRI_ restriction enzyme, which cuts at the motif `G*AATTC` (the position of the cut is indicated by an asterisk). Write a program which will calculate the size of the two fragments that will be produced when the DNA sequence is digested with _EcoRI_.

###Splicing out introns

Here's a short section of genomic DNA:

`ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT`

It comprises two exons and an intron. The first exon runs from the start of the sequence to the sixty-third character, and the second exon runs from the ninety-first character to the end of the sequence. Write a program that will print just the coding regions of the DNA sequence. 

Write a program that will calculate what percentage of the DNA sequence is coding.

Write a program that will print out the original genomic DNA sequence with coding bases in uppercase and non-coding bases in lowercase. 
