<small><small><i>
All of these Python notebooks are available at https://github.com/kipkurui/Python4Bioinformatics
</i></small></small>

# Working with strings

## The Print Statement

As seen previously, the **print()** function prints all of its arguments as strings, separated by spaces and followed by a line break:

    - print("Hello World")
    - print("Hello",'World')
    - print("Hello", <Variable Containing the String>)

Note that **print** is different in old versions of Python (2.7), where it was a statement and did not need parentheses around its arguments.

In [1]:
print("Hello","World")

Hello World


The print function has optional arguments that control where and how to print: `sep`, the separator placed between the arguments (default: a space); `end`, the string written after the last argument (default: a newline); and `file`, the stream or file to write to.

In [8]:
dna="ACGTATA"
dna.count('A')

3

In [None]:
print("Hello","World",sep='...',end='!!')
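The three optional arguments can be tried together in a small sketch. It assumes an in-memory `io.StringIO` buffer as a stand-in for a real file, so nothing is written to disk; the names `buffer` and `captured` are only illustrative.

```python
import io

# An in-memory text buffer stands in for a real file here (an assumption
# made so that the example leaves nothing on disk).
buffer = io.StringIO()

# sep replaces the default space between arguments, end replaces the
# default newline, and file redirects the output away from the screen.
print("Hello", "World", sep="...", end="!!\n", file=buffer)
print("A", "C", "G", "T", sep="", file=buffer)

# Read back everything that was "printed" into the buffer.
captured = buffer.getvalue()
print(captured)
```

Passing an open file object instead of the buffer should work the same way, since `print` only needs an object with a `write` method.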

In IPython and Jupyter you can find the additional arguments and usage help for print, or for any other function or type, by placing a ? before (or after) its name. For example, for `tuple`:

In [11]:
?tuple

Init signature: tuple(iterable=(), /)
Docstring:     
Built-in immutable sequence.

If no argument is given, the constructor returns an empty tuple.
If iterable is specified the tuple is initialized from iterable's items.

If the argument is a tuple, the return value is the same object.
Type:           type
Subclasses:     int_info, float_info, UnraisableHookArgs, hash_info, version_info, flags, thread_info, asyncgen_hooks, ExceptHookArgs, waitid_result, ...


## String Formatting

There are lots of methods for formatting and manipulating strings built into Python. Some of these are illustrated here.

String concatenation is the "addition" of two strings. Observe that while concatenating there will be no space between the strings.

In [3]:
string1='World'
string2='!'
print('Hello' + " "+ string1 + string2 +' '+ str(267.00))

Hello World! 267.0


The **%** operator is used to format a string inserting the value that comes after. It relies on the string containing a format specifier that identifies where to insert the value. The most common types of format specifiers are:

    - %s -> string
    - %d -> Integer
    - %f -> Float
    - %o -> Octal
    - %x -> Hexadecimal
    - %e -> exponential

In [8]:
print("Hello %s" % string1)
print("Octal equivalent of 18 = %o" % 18)
print("Float of the number = %.3f" % 18.87687)
print("Exponential equivalent of the number = %e" % 18)

Hello World
Octal equivalent of 18 = 22
Float of the number = 18.877
Exponential equivalent of the number = 1.800000e+01


When several values are inserted, they are grouped in parentheses after the % operator. Values are inserted in the order in which they appear in the parentheses (more on tuples in the next lecture).

In [13]:
print("Hello %s%s. The meaning of life is %d" % (string1,string2,42))

Hello World!. The meaning of life is 42


We can also specify the width of the field and the number of decimal places to be used. For example:

In [39]:
print('Print width 10: |%10s|'%'my')
print('Print width 10: |%10s|'%'name') # right justified (use %-10s to left justify)
print("The number pi = %.2f to 2 decimal places"%3.1415)
print("More space pi = %10.2f"%3.1415)
print("Pad pi with 0 = %010.2f"%3.1415) # pad with zeros

Print width 10: |        my|
Print width 10: |      name|
The number pi = 3.14 to 2 decimal places
More space pi =       3.14
Pad pi with 0 = 0000003.14
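The same width and precision settings can also be written with f-strings or `str.format`. The following is a minimal sketch, assuming Python 3.6 or newer; the variable names are only illustrative. One difference worth noting is that strings are left-justified by default in this style, whereas `%10s` right-justifies.

```python
name = "my"
pi = 3.1415

right = f"|{name:>10}|"    # right-justified in 10 characters, like %10s
left = f"|{name:<10}|"     # left-justified in 10 characters, like %-10s
default = f"|{name:10}|"   # strings are left-justified by default here
two_places = f"{pi:.2f}"   # two decimal places, like %.2f
padded = f"{pi:010.2f}"    # width 10, padded with zeros, like %010.2f

# str.format uses the same format specifiers after the colon.
via_format = "|{:>10}|".format(name)

print(right, left, default, two_places, padded, via_format, sep="\n")
```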


## Other String Methods

Multiplying a string by an integer simply repeats it.

In [14]:
print("Hello World! "*5)

Hello World! Hello World! Hello World! Hello World! Hello World! 


Strings can be transformed by a variety of methods:

In [21]:
s="hello world"
print(s.capitalize())
print(s.upper())
print(s.lower())
print('|%s|' % "Hello World".center(30)) # center in 30 characters
print('|%s|'% "     lots of space             ".strip()) # remove leading and trailing whitespace
print("Hello World".replace("World","Class"))

Hello world
HELLO WORLD
hello world
|         Hello World          |
|lots of space|
Hello Class


There are also lots of ways to inspect or check strings. `help(str)` lists all of the string methods (the output is truncated here), and a few of them are then applied to the trna sequence:

In [22]:
help(str)

Help on class str in module builtins:

class str(object)
 |  str(object='') -> str
 |  str(bytes_or_buffer[, encoding[, errors]]) -> str
 |  
 |  Create a new string object from the given object. If encoding or
 |  errors is specified, then the object must expose a data buffer
 |  that will be decoded using the given encoding and error handler.
 |  Otherwise, returns the result of object.__str__() (if defined)
 |  or repr(object).
 |  encoding defaults to sys.getdefaultencoding().
 |  errors defaults to 'strict'.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __format__(self, format_spec, /)
 |      Return a formatted version of the string as described by format_spec.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  

In [None]:
trna='AAGGGCTTAGCTTAATTAAAGTGGCTGATTTGCGTTCAGTTGATGCAGAGTGGGGTTTTGCAGTCCTTA'
print("The length of the sequence is %i" % len(trna),"nucleotides") # len() gives length

# count occurrences of a substring
print("There are %d 'G's but only %d C's in the sequence" % (trna.count('G'),trna.count('C')))
print('The "ATTAA" motif is at index',trna.find('ATTAA')) # find() returns the 0-based index of the first match, or -1 if there is none
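A few more inspection methods can be tried on the same sequence. This is a small sketch using only built-in `str` methods; the variable names are illustrative and not part of the original notebook.

```python
trna = 'AAGGGCTTAGCTTAATTAAAGTGGCTGATTTGCGTTCAGTTGATGCAGAGTGGGGTTTTGCAGTCCTTA'

starts_with_aag = trna.startswith('AAG')   # does the sequence begin with AAG?
ends_with_tta = trna.endswith('TTA')       # does it end with TTA?
only_letters = trna.isalpha()              # True if every character is a letter
all_upper = trna.isupper()                 # True if all letters are uppercase
has_motif = 'ATTAA' in trna                # substring test
motif_index = trna.find('ATTAA')           # 0-based index of the first match
missing_index = trna.find('NNNN')          # find() gives -1 when nothing matches

print(starts_with_aag, ends_with_tta, only_letters, all_upper)
print(has_motif, motif_index, missing_index)
```

Checking for -1 before using the result of `find()` avoids silently slicing from the wrong position.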

### Exercise

Calculate the % GC and % AT content in the trna sequence

In [None]:
trna.count('G') # count() needs the substring to count as its argument

In [None]:
A_count=trna.count('A')
C_count=trna.count('C')
G_count=trna.count('G')
T_count=trna.count('T')

In [52]:
# calculating the % of AT and GC content in trna data.
# determine the total length of trna and individual counts of the nucleotides.
trna='AAGGGCTTAGCTTAATTAAAGTGGCTGATTTGCGTTCAGTTGATGCAGAGTGGGGTTTTGCAGTCCTTA'
print(len(trna))
A_count=trna.count('A')
C_count=trna.count('C')
G_count=trna.count('G') 
T_count=trna.count('T')
print(A_count, C_count, G_count, T_count)
# find the sum of the GC and AT and assign the variable to the values.
GC_content = G_count + C_count
AT_content = A_count + T_count
GC_percent = (GC_content/len(trna))*100
AT_percent = (AT_content/len(trna))*100
print ('GC and AT percent values: %s %s' % (GC_percent, AT_percent))
# other ways to solve this,
print('The AT content  is %.3f'% (AT_percent),'%')


69
15 9 21 24
GC and AT percent values: 43.47826086956522 56.52173913043478
The AT content  is 56.522 %


## String comparison operations
Strings can be compared in lexicographical order with the usual comparisons. In addition the `in` operator checks for substrings:

In [None]:
'abc' < 'bbc' <= 'bbc'

In [None]:
"ABC" in "This is the ABC of Python"

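Since the two cells above are shown without output, here is a short sketch of what such comparisons evaluate to. It assumes plain ASCII text, where comparison follows character code points, so every uppercase letter sorts before every lowercase letter; the variable names are illustrative.

```python
# Chained comparison: 'abc' < 'bbc' and 'bbc' <= 'bbc'
chained = 'abc' < 'bbc' <= 'bbc'

# Uppercase letters have smaller code points than lowercase letters.
upper_before_lower = 'Z' < 'a'

# A string sorts before any longer string that it is a prefix of.
prefix_first = 'abc' < 'abcd'

# The in operator is case sensitive.
found = "ABC" in "This is the ABC of Python"
found_lowercase = "abc" in "This is the ABC of Python"

print(chained, upper_before_lower, prefix_first, found, found_lowercase)
```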
## Accessing parts of strings

Strings can be indexed with square brackets. Indexing starts from zero in Python. 

In [26]:
s = 'AAGGGCTTAGCTTAATTAAAGTGGCTGATTTGCGTTCAGTTGATGCAGAGTGGGGTTTTGCAGTCCTT'
print('First nucleotide of the sequence is',s[0])
print('Last nucleotide of the sequence is',s[len(s)-1])

First nucleotide of the sequence is A
Last nucleotide of the sequence is T


Negative indices can be used to start counting from the back

In [28]:
print('First nucleotide of the sequence is',s[-len(s)])
print('Last nucleotide of the sequence is',s[-1])

First nucleotide of the sequence is A
Last nucleotide of the sequence is T


#### Slicing
Finally, a substring (range of characters) can be selected with the slice $a:b$, which picks the characters at index $a,a+1,\ldots,b-1$. Note that the character at index $b$ is *not* included. Now we can find the first codon in the sequence:

In [29]:
print("First codon in the sequence is",s[0:3])
print("The second codon in the sequence is",s[3:6])

First codon in the sequence is AAG
The second codon in the sequence is GGC


An empty beginning and end of the range denotes the beginning/end of the string:

In [33]:
print("First codon in the sequence is", s[:3])
print("Last codon in the sequence is", s[-3:])

First codon in the sequence is AAG
Last codon in the sequence is CTT
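Slices also accept an optional third number, the step. The sketch below uses it, and additionally splits the sequence into codons with a list comprehension and `range`, which are assumed here even though they are covered in later lectures. Incomplete trailing bases are dropped.

```python
s = 'AAGGGCTTAGCTTAATTAAAGTGGCTGATTTGCGTTCAGTTGATGCAGAGTGGGGTTTTGCAGTCCTT'

third_codon = s[6:9]      # characters at index 6, 7 and 8
every_third = s[0:9:3]    # step of 3: characters at index 0, 3 and 6
reversed_s = s[::-1]      # a negative step walks the string backwards

# Split into codons, ignoring any leftover bases at the end.
usable = len(s) - len(s) % 3
codons = [s[i:i + 3] for i in range(0, usable, 3)]

print(third_codon, every_third)
print(len(codons), codons[:3])
```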


A colon without any index returns the whole string.

In [53]:
s[:]

'AAGGGCTTAGCTTAATTAAAGTGGCTGATTTGCGTTCAGTTGATGCAGAGTGGGGTTTTGCAGTCCTT'

## Strings are immutable

It is important to note that strings are constant, immutable values in Python. While new strings can easily be created, it is not possible to modify an existing string:

In [54]:
s='012345'
sX=s[:2]+'X'+s[3:] # this creates a new string with 2 replaced by X
print("creating new string",sX,"OK")
sX=s.replace('2','X') # the same thing
print(sX,"still OK")
s[2] = 'X' # an error!!!

creating new string 01X345 OK
01X345 still OK


TypeError: 'str' object does not support item assignment
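Because strings cannot be changed in place, a "mutation" has to build a new string. A minimal sketch, assuming a hypothetical helper named `point_mutation` (not part of the original notebook), which also catches the error instead of letting it stop the cell:

```python
def point_mutation(seq, position, new_base):
    """Return a copy of seq with the base at position replaced (hypothetical helper)."""
    return seq[:position] + new_base + seq[position + 1:]

original = 'ACGTATA'
mutant = point_mutation(original, 2, 'T')

# Direct item assignment is not allowed on a str; catch the error to show it.
try:
    original[2] = 'T'
except TypeError as error:
    message = str(error)

print(original, mutant)
print(message)
```

The original string is left untouched, which is exactly what immutability guarantees.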

### Exercise:

1. Given the following amino acid sequence (MNKMDLVADVAEKTDLSKAKATEVIDAVFA), find the first, last and 5th amino acids in the sequence.
2. The above amino acid sequence is from a bacterial restriction enzyme that recognizes "TCCGGA". Find the first restriction site in the following sequence: AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA

In [12]:
# qn 01: find the first, last and fifth amino acids in the given sequence.
# let the sequence be X
X='MNKMDLVADVAEKTDLSKAKATEVIDAVFA'
print(X[0]) # indexing starts at 0, so the first amino acid is X[0]
print(X[-1])
print(X[4]) # the fifth amino acid is at index 4
# qn 02: find the first restriction site
BA='AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA'
RE='TCCGGA'
print(len(BA))
print(len(RE))
print(BA.find(RE)) # 0-based index where the first site starts
print(BA.find(RE) + len(RE)) # index just past the end of the site

# how to reverse the given sequence BA,
BA[::-1]


M
A
D
47
6
27
33


'AAAATATAATGCGGAGGCCTCGGGATATATCGGCGGAGCCCTAAAAA'

In [18]:
### In this part, I am going to write functions that automate the above two exercises.

### Function 01; Exercise 01, book02



In [119]:
#function 1
trna='AAGGGCTTAGCTTAATTAAAGTGGCTGATTTGCGTTCAGTTGATGCAGAGTGGGGTTTTGCAGTCCTTA'
#BA='AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA'
def percent_Base_content(dna):
    length=len(dna)
    A_count=dna.count('A')
    C_count=dna.count('C')
    G_count=dna.count('G') 
    T_count=dna.count('T')
    GC_content = G_count + C_count
    AT_content = A_count + T_count
    GC_percent = (GC_content/length)*100 
    AT_percent = (AT_content/length)*100
    # print() itself returns None, so the result is printed rather than returned
    print(' AT percent value:%.3f%%\n GC percent value:%.3f%%' % (AT_percent, GC_percent))

In [120]:
percent_Base_content(trna)

 AT percent value:56.522%
 GC percent value:43.478%


In [121]:
BA='AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA'
percent_Base_content(BA)

 AT percent value:53.191%
 GC percent value:46.809%
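A function that returns its results, rather than printing them, is easier to reuse in later calculations. The following is a sketch of that variant, assuming a hypothetical function name `base_percentages` and an uppercase DNA sequence containing only A, C, G and T.

```python
def base_percentages(dna):
    """Return (AT percent, GC percent) for an uppercase DNA string (hypothetical helper)."""
    length = len(dna)
    if length == 0:
        # Avoid dividing by zero on an empty sequence.
        raise ValueError("the sequence is empty")
    gc_count = dna.count('G') + dna.count('C')
    at_count = dna.count('A') + dna.count('T')
    return at_count / length * 100, gc_count / length * 100

trna = 'AAGGGCTTAGCTTAATTAAAGTGGCTGATTTGCGTTCAGTTGATGCAGAGTGGGGTTTTGCAGTCCTTA'
at_percent, gc_percent = base_percentages(trna)
print('AT percent value: %.3f%%' % at_percent)
print('GC percent value: %.3f%%' % gc_percent)
```

The caller decides how to format or store the two numbers, so the same function can feed a report, a plot, or a comparison between sequences.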


In [30]:
trna='AAGGGCTTAGCTTAATTAAAGTGGCTGATTTGCGTTCAGTTGATGCAGAGTGGGGTTTTGCAGTCCTTA'
A_count=trna.count('A') # no leading space: unexpected indentation is a syntax error
C_count=trna.count('C')
G_count=trna.count('G') 
T_count=trna.count('T')
print(A_count, C_count, G_count, T_count) # each name must be listed; *_count is not valid Python

15 9 21 24

In [107]:
#function02,exercise02,book2
##part one; to find the first, last and fifth amino acid of the given protein sequence.
aa='MNKMDLVADVAEKTDLSKAKATEVIDAVFA'
def aa_positions(seq):
    # work on the sequence passed in rather than on a global variable
    first_pos=seq[0]
    last_pos=seq[-1]
    fifth_pos=seq[4]
    print(' aa_first_position: %s\n aa_last_position: %s\n aa_fifth_position: %s' % (first_pos, last_pos, fifth_pos))

In [108]:
aa_positions(aa)

 aa_first_position: M
 aa_last_position: A
 aa_fifth_position: D


In [126]:
##part two; to find the first restriction site for a given dna sequence BA and the specific restriction site RE
BA='AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA'
RE='TCCGGA'
BA_length=len(BA)
RE_length=len(RE)
RE_position1=BA.find(RE)
RE_position2=BA.find(RE)+RE_length
#first_RE_postion = (RE_position1,RE_position2)
#print(BA_length, RE_length, first_RE_postion)
print(' RE_start_position:%s%s%s' %(RE_position1,'th base','\n'), 'RE_stop_position:%s%s' %(RE_position2,'th base'))

 RE_start_position:27th base
 RE_stop_position:33th base


In [133]:
##function02, on identifying the restriction site on the dna sequence.
BA='AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA'
RE='TCCGGA'
RX='TCC'
def first_restriction_site(seq, site):
    # find() returns the 0-based start index, or -1 if the site is absent
    RE_position1=seq.find(site)
    RE_position2=RE_position1+len(site)
    print(' RE_start_position:%sth base\n RE_stop_position:%sth base' % (RE_position1, RE_position2))


In [134]:
first_restriction_site(BA, RX)

 RE_start_position:5th base
 RE_stop_position:8th base
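`find()` only reports the first match. To list every occurrence, the search can be restarted just after each hit by using the optional start argument of `find()`. This is a sketch with a hypothetical helper `find_all_sites`; it relies on a `while` loop and a list, which are assumed here ahead of the lectures that introduce them.

```python
def find_all_sites(seq, site):
    """Return the 0-based start index of every occurrence of site in seq (hypothetical helper)."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        # Continue one base after the last hit so overlapping sites are found too.
        start = seq.find(site, start + 1)
    return positions

BA = 'AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA'
full_sites = find_all_sites(BA, 'TCCGGA')
short_sites = find_all_sites(BA, 'TCC')
no_sites = find_all_sites(BA, 'NNN')
print(full_sites, short_sites, no_sites)
```

An empty list signals that the site does not occur, which is easier to test for than the -1 returned by `find()`.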
