<small><small><i>
All of these Python notebooks are available at https://github.com/kipkurui/Python4Bioinformatics
</i></small></small>

# Working with strings

## The Print Statement

As seen previously, the **print()** function prints all of its arguments as strings, separated by spaces and followed by a line break:

    - print("Hello World")
    - print("Hello",'World')
    - print("Hello", <Variable Containing the String>)

Note that **print** is different in old versions of Python (2.7), where it was a statement and did not need parentheses around its arguments.

In [1]:
print("Hello","World")

Hello World


The **print()** function has optional arguments that control where and how it prints: `sep`, the separator placed between values (a space by default); `end`, the string appended after the last value (a newline by default); and `file`, the stream or file to write to.

In [1]:
dna="ACGTATA"

In [2]:
dna.count('A')

3

In [2]:
print("Hello","World",sep='...',end='!!')

Hello...World!!
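The `file` argument is not exercised in the cells above. As a minimal sketch, the example below redirects the output of print into an in-memory buffer from the standard `io` module; the buffer and the sample strings are assumptions chosen so that nothing is written to disk.

```python
import io

# An in-memory text buffer stands in for a real file here (an assumption for
# illustration); any object with a write() method can be passed as file.
buffer = io.StringIO()

# sep joins the values, end is appended after the last value,
# and file redirects the output away from the screen.
print("ACGT", "TTGA", sep="-", end="\n", file=buffer)
print("done", file=buffer)

captured = buffer.getvalue()
print(repr(captured))  # 'ACGT-TTGA\ndone\n'
```

Passing an open file object (for example one returned by `open(..., 'w')`) works the same way.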

You can find the additional arguments and usage help for print, or for any other function, by putting a `?` before its name in a Jupyter notebook or IPython session.

In [3]:
?print()

Docstring:
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file:  a file-like object (stream); defaults to the current sys.stdout.
sep:   string inserted between values, default a space.
end:   string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.
Type:      builtin_function_or_method


## String Formatting

There are lots of methods for formatting and manipulating strings built into Python. Some of these are illustrated here.

String concatenation is the "addition" of two strings. Note that concatenation does not insert a space between the strings.

In [4]:
string1='World'
string2='!'
print('Hello' + " "+ string1 + string2 + str(267.00))

Hello World!267.0
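Concatenation only works between strings, which is why the number above was wrapped in `str()`. The sketch below, using made-up example values, shows the error that appears without the conversion and how `str.join()` concatenates several strings with a chosen separator.

```python
count = 3
label = "Number of A bases: "

# Adding a str and an int raises TypeError, so convert explicitly with str().
try:
    message = label + count
except TypeError as err:
    print("Concatenation failed:", err)
    message = label + str(count)

print(message)  # Number of A bases: 3

# str.join() concatenates a list of strings, placing the separator between them.
codons = ["AAG", "GGC", "TTA"]
joined = "-".join(codons)
print(joined)   # AAG-GGC-TTA
```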


The **%** operator formats a string by inserting the value that follows it. The string must contain a format specifier that marks where the value goes and how it is displayed. The most common format specifiers are:

    - %s -> string
    - %d -> Integer
    - %f -> Float
    - %o -> Octal
    - %x -> Hexadecimal
    - %e -> exponential

In [15]:
print("Hello %s" % string1)
print("Actual Number = %d" %18)
print("Float of the number = %.3f" % 18.87687)
print("Exponential equivalent of the number = %e" %18)

Hello World
Actual Number = 18
Float of the number = 18.877
Exponential equivalent of the number = 1.800000e+01


When inserting multiple values, they are wrapped in parentheses. Values are inserted in the order in which they appear in the parentheses (more on tuples in the next lecture).

In [5]:
print("Hello %s %s. The meaning of life is %d" % (string1,string2,42))

Hello World !. The meaning of life is 42
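Two further details of the `%` operator may be useful here: `%%` prints a literal percent sign, and a dictionary can be supplied so that placeholders are referenced by name. The values below are hypothetical and serve only as illustration.

```python
# A single value can be passed bare; %% produces a literal percent sign.
one = "GC content: %.1f%%" % 43.478
print(one)    # GC content: 43.5%

# Several values must be wrapped in a tuple, in the order of the specifiers.
two = "%s has %d bases" % ("trna", 69)
print(two)    # trna has 69 bases

# With a dictionary, each specifier names the key it should read.
named = "%(base)s occurs %(n)d times" % {"base": "G", "n": 21}
print(named)  # G occurs 21 times
```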


We can also specify the width of the field and the number of decimal places to be used. For example:

In [20]:
print('Print width 10: |%10s|'%'my')
print('Print width 10: |%10s|'%'name') # right justified within 10 characters
print("The number pi = %.2f to 2 decimal places"%3.1415)
print("More space pi = %10.2f"%3.1415)
print("Pad pi with 0 = %010.2f"%3.1415) # pad with zeros

Print width 10: |        my|
Print width 10: |      name|
The number pi = 3.14 to 2 decimal places
More space pi =       3.14
Pad pi with 0 = 0000003.14
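By default a value is right justified within its field. As a short sketch, the example below shows the `-` flag for left justification, followed by the same layouts written with `str.format()` and an f-string (Python 3.6+). These two alternatives are not used elsewhere in this notebook and are included only for comparison.

```python
name = "name"
pi = 3.1415

# A minus sign in the specifier left-justifies the value within the field.
left = '|%-10s|' % name
right = '|%10s|' % name
print(left)   # |name      |
print(right)  # |      name|

# The same kind of layout with str.format(): < is left, > is right alignment.
with_format = '|{:<10}|{:>10.2f}|'.format(name, pi)
print(with_format)   # |name      |      3.14|

# And with an f-string; a leading 0 in the width pads with zeros.
with_fstring = f'|{name:<10}|{pi:010.2f}|'
print(with_fstring)  # |name      |0000003.14|
```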


## Other String Methods

Multiplying a string by an integer simply repeats it:

In [8]:
print("Hello World! "*5)

Hello World! Hello World! Hello World! Hello World! Hello World! 


Strings can be transformed by a variety of methods. A few of them are shown here on a short piece of text:

In [9]:
s="hello wOrld"
print(s.capitalize())
print(s.upper())
print(s.lower())
print('|%s|' % "Hello World".center(30)) # center in 30 characters
print('|%s|'% "     lots of space             ".strip()) # remove leading and trailing whitespace
print("Hello World".replace("World","Class"))

Hello world
HELLO WORLD
hello world
|         Hello World          |
|lots of space|
Hello Class
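The same methods are handy for cleaning up sequence data. The sketch below assumes a hypothetical raw input line with mixed case and stray whitespace, and chains `strip()`, `upper()` and `replace()` to normalise it.

```python
# Hypothetical raw input, e.g. a line read from a file.
raw = "  aaggGCttag\n"

# strip() drops the surrounding whitespace and newline; upper() normalises the case.
dna = raw.strip().upper()
print(dna)  # AAGGGCTTAG

# replace() returns a new string; replacing T with U gives the RNA alphabet.
rna = dna.replace("T", "U")
print(rna)  # AAGGGCUUAG
```

Each method returns a new string, so the calls can be chained from left to right.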


There are also lots of ways to inspect or check strings. Examples of a few of these are given here:

In [21]:
help(str)

Help on class str in module builtins:

class str(object)
 |  str(object='') -> str
 |  str(bytes_or_buffer[, encoding[, errors]]) -> str
 |  
 |  Create a new string object from the given object. If encoding or
 |  errors is specified, then the object must expose a data buffer
 |  that will be decoded using the given encoding and error handler.
 |  Otherwise, returns the result of object.__str__() (if defined)
 |  or repr(object).
 |  encoding defaults to sys.getdefaultencoding().
 |  errors defaults to 'strict'.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __format__(...)
 |      S.__format__(format_spec) -> str
 |      
 |      Return a formatted version of S as described by format_spec.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getatt

In [8]:
trna='AAGGGCTTAGCTTAATTAAAGTGGCTGATTTGCGTTCAGTTGATGCAGAGTGGGGTTTTGCAGTCCTTA'
print("The length of the sequence is %i" % len(trna),"nucleotides") # len() gives length

# count occurrences of a substring
print("There are %d 'G's but only %d C's in the sequence" % (trna.count('G'),trna.count('C')))
print('The "ATTAA" motif is at index',trna.find('ATTAA')) # index counts from 0; find() returns -1 if the motif is absent

The length of the sequence is 69 nucleotides
There are 21 'G's but only 9 C's in the sequence
The "ATTAA" motif is at index 14
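A few more inspection methods, sketched on the first 21 nucleotides of the sequence above. All of these are standard `str` methods; the variable names are chosen only for this illustration.

```python
seq = "AAGGGCTTAGCTTAATTAAAG"

starts = seq.startswith("AAG")   # True
ends = seq.endswith("TTT")       # False: the fragment ends with AAG
upper = seq.isupper()            # True: every letter is uppercase
found = seq.find("ATTAA")        # 14
missing = seq.find("CCCC")       # -1: find() signals a missing substring with -1
print(starts, ends, upper, found, missing)

# index() behaves like find() but raises ValueError when nothing is found.
try:
    seq.index("CCCC")
    raised = False
except ValueError:
    raised = True
print("index() raised ValueError:", raised)
```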


### Exercise

Calculate the % GC and % AT content in the trna sequence

In [11]:
#?trna.count()

Object `trna.count()` not found.
AAGGGCTTAGCTTAATTAAAGTGGCTGATTTGCGTTCAGTTGATGCAGAGTGGGGTTTTGCAGTCCTTA


In [24]:
A_count=trna.count('A')
C_count=trna.count('C')
G_count=trna.count('G')
T_count=trna.count('T')

length_trna = len(trna)

# GC Content %
total_gc_content = G_count + C_count
percentage_gc = (total_gc_content / length_trna) * 100
print(percentage_gc)

# AT Content %
total_AT_content = T_count + A_count 
percentage_AT = (total_AT_content / length_trna) * 100
print(percentage_AT)


43.47826086956522
56.52173913043478
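The calculation above can be wrapped in a small reusable function. This is one possible sketch, not the only solution: the function name `gc_content` is hypothetical, and deriving the AT percentage as 100 minus GC assumes the sequence contains only A, C, G and T.

```python
def gc_content(sequence):
    """Return the percentage of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0  # avoid dividing by zero for an empty sequence
    gc = sequence.count("G") + sequence.count("C")
    return gc / len(sequence) * 100


trna = 'AAGGGCTTAGCTTAATTAAAGTGGCTGATTTGCGTTCAGTTGATGCAGAGTGGGGTTTTGCAGTCCTTA'
gc = gc_content(trna)
at = 100 - gc  # assumes only A, C, G and T occur in the sequence
summary = "GC content = %.2f%%, AT content = %.2f%%" % (gc, at)
print(summary)  # GC content = 43.48%, AT content = 56.52%
```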


## String comparison operations
Strings can be compared in lexicographical order with the usual comparisons. In addition the `in` operator checks for substrings:

In [12]:
'abc' < 'bbc' <= 'bbc'

True

In [13]:
"ABC" in "This is the ABC of Python"

True
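Both comparisons and the `in` operator are case-sensitive, and comparison proceeds character by character. The following sketch collects a few illustrative cases (the strings are arbitrary examples).

```python
first_difference = "ACG" < "ACT"        # True: the first differing characters are G < T
prefix_first = "AC" < "ACG"             # True: a prefix sorts before the longer string
upper_before_lower = "Z" < "a"          # True: uppercase letters have smaller code points
case_sensitive = "acg" == "ACG"         # False
normalised = "acg".upper() == "ACG"     # True after converting to a common case
motif_found = "TATA" in "GGTATAAC"      # True
motif_lower = "tata" in "GGTATAAC"      # False: the case does not match

print(first_difference, prefix_first, upper_before_lower)
print(case_sensitive, normalised)
print(motif_found, motif_lower)
```

Converting both strings with `upper()` or `lower()` before comparing avoids surprises with mixed-case sequence data.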

## Accessing parts of strings

Strings can be indexed with square brackets. Indexing starts from zero in Python. 

In [14]:
s = 'AAGGGCTTAGCTTAATTAAAGTGGCTGATTTGCGTTCAGTTGATGCAGAGTGGGGTTTTGCAGTCCTT'
print('First nucleotide of the sequence is',s[0])
print('Last nucleotide of the sequence is',s[len(s)-1])

First nucleotide of the sequence is A
Last nucleotide of the sequence is T


Negative indices count from the end of the string:

In [15]:
print('First nucleotide of the sequence is',s[-len(s)])
print('Last nucleotide of the sequence is',s[-1])

First nucleotide of the sequence is A
Last nucleotide of the sequence is T


### Slicing
Finally, a substring (a range of characters) can be selected with the slice $a:b$, which covers the characters at index $a,a+1,\ldots,b-1$. Note that the character at index $b$ is *not* included. Now we can find the first codon in the sequence:

In [16]:
print("First codon in the sequence is",s[0:3])
print("The second codon in the sequence is",s[3:6])

First codon in the sequence is AAG
The second codon in the sequence is GGC


An empty beginning and end of the range denotes the beginning/end of the string:

In [17]:
print("First codon in the sequence is", s[:3])
print("Last codon in the sequence is", s[-3:])

First codon in the sequence is AAG
Last codon in the sequence is CTT


A colon without any index returns the whole string.

In [18]:
s[:]

'AAGGGCTTAGCTTAATTAAAGTGGCTGATTTGCGTTCAGTTGATGCAGAGTGGGGTTTTGCAGTCCTT'
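A slice also accepts an optional third value, the step. As a brief sketch on a short fragment (the name `fragment` is used only here), a step of 3 picks every third character, a step of -1 reverses the string, and slicing in steps of three splits a sequence into codons.

```python
fragment = "AAGGGCTTAGCT"

# A third slice value is the step: every third base, starting at index 0.
every_third = fragment[::3]
print(every_third)   # AGTG

# A negative step walks backwards, which reverses the string.
backwards = fragment[::-1]
print(backwards)     # TCGATTCGGGAA

# Split the sequence into codons by slicing in steps of three.
codons = [fragment[i:i + 3] for i in range(0, len(fragment), 3)]
print(codons)        # ['AAG', 'GGC', 'TTA', 'GCT']
```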

## Strings are immutable

It is important to remember that strings are constant, immutable values in Python. New strings can easily be created, but an existing string cannot be modified in place:

In [19]:
s='012345'
sX=s[:2]+'X'+s[3:] # this creates a new string with 2 replaced by X
print("creating new string",sX,"OK")
sX=s.replace('2','X') # the same thing
print(sX,"still OK")
s[2] = 'X' # an error!!!

creating new string 01X345 OK
01X345 still OK


TypeError: 'str' object does not support item assignment
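Because of this, "changing" a sequence always means building a new string. The sketch below shows two common approaches; the helper name `point_mutation` is hypothetical and not part of any library.

```python
def point_mutation(sequence, position, new_base):
    """Return a copy of sequence with the base at position replaced."""
    return sequence[:position] + new_base + sequence[position + 1:]


original = "AAGGGC"
mutated = point_mutation(original, 2, "T")
print(original)  # AAGGGC (the original string is unchanged)
print(mutated)   # AATGGC

# For several edits, convert to a list (which is mutable), change it, and join it back.
bases = list(original)
bases[0] = "C"
bases[-1] = "A"
edited = "".join(bases)
print(edited)    # CAGGGA
```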

### Exercise:

1. Given the following amino acid sequence (MNKMDLVADVAEKTDLSKAKATEVIDAVFA), find the first, last, and fifth amino acids in the sequence.
2. The amino acid sequence above is from a bacterial restriction enzyme that recognizes "TCCGGA". Find the first restriction site in the following sequence: AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA

In [32]:
a = list('ATCG')

print(reversed(a))

<list_reverseiterator object at 0x7f30400a38b0>
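The output above is an iterator object rather than the reversed items, because `reversed()` is lazy. A minimal sketch of three ways to get the actual reversed content:

```python
a = list('ATCG')

# reversed() returns a lazy iterator; list() consumes it into a new list.
as_list = list(reversed(a))
print(as_list)     # ['G', 'C', 'T', 'A']

# join() consumes the iterator and builds a string.
as_string = "".join(reversed(a))
print(as_string)   # GCTA

# For a string, a slice with step -1 gives the same result directly.
by_slice = "ATCG"[::-1]
print(by_slice)    # GCTA
```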
