## Working with text: Strings

In biology, many types of data are predominantly text, for example DNA sequences, protein sequences, survey replies or gene identifiers. Even where you're predominantly handling numeric data, you'll most likely need to deal with text components such as column headers, metadata or other label information.

Python allows for very convenient handling, analysis and processing of text and has a wide variety of inbuilt methods and functions specialised for the task. From the previous section on variables, you’ll remember that we refer to text as ‘strings’ and that string variables can be set using either double or single quotes, as long as both quotes are the same:

In [2]:
protein_sequence = 'CAWEWPRHRYT' #string assignment using single quotes (')

protein_sequence = "CAWEWPRHRYT" #string assignment using double quotes (")

#### Accessing characters in strings

Individual characters in a string can be accessed by specifying the character's index in square brackets after the string name. The index refers to the position of the character from the beginning, with that position being position 0. Therefore the first character is at index 0, the second character at index 1 and so on. This principle is known as zero-indexing - position counting always starts at zero in Python. 

It is also possible to specify characters based on their position relative to the end of the string: the last character can be accessed using index -1, the second last at index -2, etc. 

In [2]:
protein_sequence = 'CAWEWPRHRYT'
print(protein_sequence[0]) #Prints the first character (at index 0) - i.e. C
print(protein_sequence[3]) #Prints the fourth character (at index 3) - i.e. E
print(protein_sequence[-1]) #Prints the character at the end of the string (T)

C
E
T


Similarly, parts of strings (known as substrings) can be accessed using the slice operator, denoted by `[ : ]`. Start and stop indices are specified before and after the colon respectively. The character at the stop index is not included, so `[3:7]` will return the 4th to 7th characters (indices 3 to 6). Where no start index is given before the colon, e.g. `[:5]`, the slice automatically starts from position 0 (the first character). If only a start index is given, e.g. `[5:]`, the substring runs to the last position in the string.

In [1]:
protein_sequence='CAWEWPRHRYT'
print(protein_sequence[0]) #Prints the first character (at index 0) - i.e. C
print(protein_sequence[1:5]) #Prints from the 2nd character (index 1) to the 5th (index 4) (AWEW)
print(protein_sequence[:7]) #Prints from the first character to the seventh, i.e. index 6 (CAWEWPR)
print(protein_sequence[2:]) #Prints from the 3rd character to the end of the string (WEWPRHRYT)
print(protein_sequence[-1]) #Prints the character at the end of the string (T)

C
AWEW
CAWEWPR
WEWPRHRYT
T
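
Negative indices can also be used inside a slice. The short sketch below (the variable names are only illustrative) takes the last three residues of the sequence and everything except the last three:

```python
protein_sequence = 'CAWEWPRHRYT'
# A negative start index counts back from the end of the string
last_three = protein_sequence[-3:]
# A negative stop index leaves off characters at the end
all_but_last_three = protein_sequence[:-3]
print(last_three)          # RYT
print(all_but_last_three)  # CAWEWPRH
```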


#### Extended Slices

Python also has an extended slice operator, denoted by `[ : : ]`. Here the first and second numbers work as for the regular slice operator (start and stop indices), but the third number indicates the step size. Thus `[0:10:2]` will return **every second** character from index 0 to index 9.

In [12]:
protein_sequence='CAWEWPRHRYT'
print(protein_sequence[0:10:2]) #CWWRR
print(protein_sequence[0:10:3]) #CERY

CWWRR
CERY
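
As one possible biological use of the step size, the sketch below pulls out the first and third base of every codon from a short reading frame. The sequence and variable names are made up for illustration, and the frame is assumed to start at index 0:

```python
dna_sequence = 'ATGGCTGCG'  # three codons: ATG GCT GCG
# Start at index 0 and take every third base: first position of each codon
first_positions = dna_sequence[0::3]
# Start at index 2 and take every third base: third position of each codon
third_positions = dna_sequence[2::3]
print(first_positions)  # AGG
print(third_positions)  # GTG
```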


#### Reversing strings 

A special use of extended slices is to reverse a string: the step size is set to negative 1 (`[::-1]`). There is no need to specify the range, but start and stop points can be given to limit it if necessary (e.g. `protein_sequence[5:2:-1]`); with a negative step, the start index must be greater than the stop index.

In [13]:
protein_sequence='CAWEWPRHRYT'
print(protein_sequence[::-1]) #TYRHRPWEWAC (reverse the string)
print(protein_sequence[::-2]) #TRRWWC (step backwards in twos)

TYRHRPWEWAC
TRRWWC
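
A minimal sketch of reversing only part of a string (the indices here are chosen purely for illustration). With a negative step the slice runs from the higher index down to the lower one, and the character at the stop index is still excluded:

```python
protein_sequence = 'CAWEWPRHRYT'
# Step backwards from index 5 down to, but not including, index 2
reversed_part = protein_sequence[5:2:-1]
print(reversed_part)  # PWE
# If start is smaller than stop and the step is negative, nothing is selected
empty_part = protein_sequence[2:5:-1]
print(repr(empty_part))  # ''
```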


#### Joining Strings Together (concatenation)

Strings can be joined together (referred to as 'concatenation') using the plus (+) sign.

In [4]:
protein_sequence = 'CAWEWPRHRYT' + 'GATTCGAWRPQY' 
print(protein_sequence)

CAWEWPRHRYTGATTCGAWRPQY
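
The plus sign only concatenates strings with other strings. The sketch below, using illustrative names and values, converts a number to a string with `str()` before joining it to text:

```python
sequence_name = 'Peptide A'
sequence_length = 11  # an integer, not a string
# sequence_name + sequence_length would raise a TypeError,
# so the number is converted to a string first
label = sequence_name + ' has ' + str(sequence_length) + ' residues'
print(label)  # Peptide A has 11 residues
```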


#### Escape Characters

Sometimes text includes 'invisible' characters that indicate how the text should be formatted, such as tabs or newline characters. Since it would be difficult to see these characters in our code, we use escape characters in Python strings instead. Escape characters are character sequences prefixed with a backslash; for example, tabs and newline characters are represented by the following escape characters:

<table>
<tr>
<td>\n</td><td>Newline</td>
</tr>
<tr>
<td>\t</td><td>Tab</td>
</tr>
</table>

In [3]:
print("Line 1\nLine 2\nLine 3\n") #\n inserts a new line character

Line 1
Line 2
Line 3



Characters are also escaped if they would disrupt the string. For example to include a quotation mark inside a string, it needs to be escaped to prevent it prematurely terminating the string. In the case of quotation marks, this can also be achieved by using the alternative type of quotation mark, as shown in the following example which uses the common symbolic nomenclature for feet (') and inches (").

In [5]:
tree_girth = '3"' #this works because the alternative type of quotation marks can be used
tree_girth = "3'6\"" #double quotation marks must be escaped
tree_girth = '3\'6"' #single quotation marks must be escaped
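
To check what is actually stored, the strings can be printed and compared. This sketch uses illustrative variable names and also shows that an escape sequence such as `\t` counts as a single character:

```python
girth_a = "3'6\""  # double quote escaped inside double quotes
girth_b = '3\'6"'  # single quote escaped inside single quotes
print(girth_a)     # 3'6"
# Both spellings produce exactly the same four-character string
same_string = girth_a == girth_b
print(same_string)  # True
# The backslash and the letter together represent one tab character
tab_length = len("\t")
print(tab_length)   # 1
```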

## Useful string methods and functions

There are a variety of inbuilt Python functions and methods that can operate on strings.  A list of methods appropriate to an object can be shown by running the dir() function on a string - e.g. `dir(protein_sequence)`.

The main difference between functions and methods that you will notice just now is the way that they are written:

**Functions** are called as the function name followed by brackets. In the following example, calling the function `len()` on a string returns the length of the string. `print()` is also a function: it prints whatever is in the brackets.

In [1]:
protein_sequence = 'CAWEWPRHRYT'
#The len() function returns the length of the string
length_of_sequence = len(protein_sequence) 
print(length_of_sequence)

11


**Methods** are specified using dot notation after the object (variable) name, e.g.:

In [9]:
protein_sequence = 'CAWEWPRHRYT'
#The .lower() method returns a string converted to lower case
lower_case_seq = protein_sequence.lower()
print(lower_case_seq)

cawewprhryt


A full list of string functions and methods is available in the [Python 3 documentation](https://docs.python.org/3/library/stdtypes.html#string-methods). Most methods have additional arguments that can be specified, such as start and end points; please see the Python documentation for this information.

A few commonly used string functions and methods are described here:

#### str.capitalize()

This method returns a copy of the string with its first character capitalized and the remainder converted to lower case.

In [8]:
protein_sequence = 'agthyaHYctgr'
print(protein_sequence.capitalize())

Agthyahyctgr


#### str.lower()

Return a copy of the string with all the cased characters converted to lowercase.

In [9]:
dna_sequence = 'ATGGAGTCGATACGAATTCTGATAA'
print(dna_sequence.lower())

atggagtcgatacgaattctgataa


#### str.upper()
Return a copy of the string with all the cased characters converted to uppercase. 

In [10]:
dna_sequence = 'atggagtcgatacgaattctgataa'
print(dna_sequence.upper())

ATGGAGTCGATACGAATTCTGATAA


#### str.count(sub) 

Counts the number of times a substring (sub) occurs within a string. Only non-overlapping occurrences are counted, so a character that has been matched as part of one occurrence cannot be part of the next. The example below counts the number of times that the EcoRI restriction enzyme site, GAATTC, occurs in the sequence:

In [11]:
dna_sequence = 'ATCGTAGCGAATTCGATTCGAAGCTTAGGAATTCGTAG'
ecori_count = dna_sequence.count('GAATTC')
print(ecori_count)

2
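
The non-overlapping behaviour is easiest to see with a repetitive sequence. In this small sketch (an illustrative string, not real data), `AA` could start at index 0, 1 or 2, but only two matches are counted:

```python
repeat_sequence = 'AAAA'
# The first match uses indices 0-1 and the second uses indices 2-3;
# the overlapping candidate at index 1 is not counted
aa_count = repeat_sequence.count('AA')
print(aa_count)  # 2
```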


#### str.endswith(sub)

This method returns True if the string ends with the specified suffix, otherwise it returns False. This method is useful for things such as checking filename extensions to limit which files are processed, or testing whether a primer sequence is present at the end of a string. 

In [13]:
filename = 'confocal_image.jpg' 
if filename.endswith(".jpg"): #this example uses a conditional statement - you'll learn about them in the next workbook
    print(filename + " is a JPEG image file")

confocal_image.jpg is a JPEG image file
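
`str.endswith()` also accepts a tuple of suffixes and returns True if the string ends with any one of them. Tuples have not been introduced yet, so treat this as a preview; the filenames and extensions are illustrative:

```python
filename = 'confocal_image.tif'
# The extra pair of brackets makes a tuple of possible suffixes
is_image = filename.endswith(('.jpg', '.png', '.tif'))
print(is_image)  # True

# The comparison is case-sensitive, so convert to lower case first if needed
upper_name = 'CONFOCAL_IMAGE.JPG'
is_jpeg = upper_name.lower().endswith('.jpg')
print(is_jpeg)  # True
```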


#### str.find(sub)
Return the lowest index in the string where substring 'sub' is found. Returns -1 if sub is not found.

In [1]:
dna_sequence = "ATGGAGTCGATACGAATTCTGATAA"
found_site = dna_sequence.find("GAATTC")
print("EcoRI site is at position " + str(found_site))

EcoRI site is at position 13
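
Two further behaviours of `str.find()` are sketched here: the -1 returned when the substring is absent, and the optional second argument that sets where the search starts. The BamHI site (GGATCC) is used only because it does not occur in this example sequence:

```python
dna_sequence = "ATGGAGTCGATACGAATTCTGATAA"
# GGATCC does not occur in this sequence, so find() returns -1
missing_site = dna_sequence.find("GGATCC")
print(missing_site)  # -1

# The optional second argument is the index at which to start searching
first_ga = dna_sequence.find("GA")
next_ga = dna_sequence.find("GA", first_ga + 1)
print(first_ga)  # 3
print(next_ga)   # 8
```

Because -1 is also a valid index (the last character), it is worth checking for -1 before using the result to index or slice the string.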


#### str.format()

This method formats the string on which it is called. Replacement fields delimited by braces `{}` can be replaced by variables, and can be simultaneously formatted, for example by rounding to two decimal places. The method returns a copy of the string where each replacement field is replaced with the string value of the corresponding argument.

<div class="alert alert-info">
We will use the format method extensively because it allows us to very easily combine our programmatic output into a sentence in human-readable text, complete with units. This is shown in the examples below.
</div>


In [14]:
print("The sum of 1 + 2 is {0}".format(1+2))

The sum of 1 + 2 is 3


Replacement fields can be indexed by numbers, or if no index is specified then fields will be replaced in the order they are listed in the format method. 

In [15]:
print('{0}, {1}, {2}'.format('a', 'b', 'c')) #replace based on index 
print('{}, {}, {}'.format('a', 'b', 'c')) #replace based on position
print('{2}, {1}, {0}'.format('a', 'b', 'c')) #replace based on index

a, b, c
a, b, c
c, b, a


Replacements can also be made by accessing arguments by name:

In [14]:
print('Mean was {mean_cell_count} and standard deviation was {std_dev}'.format(std_dev = 23.2, mean_cell_count=1294))

Mean was 1294 and standard deviation was 23.2


Format can also change the display of numbers or strings, for example, it can be used to limit the number of decimal places or display numbers in scientific notation.

In [17]:
mol_wt = 892092.429084
print("The molecular weight of the protein is {0:.2f} Da".format(mol_wt))

The molecular weight of the protein is 892092.43 Da


In [18]:
avogadros_constant = 602214076000000000000000 #exact value: 6.02214076 x 10^23 per mole
print("The number of atoms or molecules in one mole of a substance is equal to {:.3E}".format(avogadros_constant))

The number of atoms or molecules in one mole of a substance is equal to 6.022E+23


More information on the very wide range of options is available in the [official Python 3 documentation](https://docs.python.org/3/library/string.html#formatspec).
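
Named replacement fields and number formatting can be combined in one field. The sketch below uses made-up values and illustrative names; `.1f` rounds to one decimal place and `.1%` multiplies by 100 and appends a percent sign:

```python
mean_cell_count = 1294.456
fraction_viable = 0.8731
# Each field has the form {name:format_spec}
summary = "Mean count: {mean:.1f} cells, viability: {viable:.1%}".format(
    mean=mean_cell_count, viable=fraction_viable)
print(summary)  # Mean count: 1294.5 cells, viability: 87.3%
```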



#### str.replace(old, new)
Return a copy of the string with all occurrences of substring old replaced by new. 

In [22]:
dna_sequence = "ATGGAGTCGATACGAATTCTGATAA"
rna_sequence = dna_sequence.replace("T", "U")
print(rna_sequence)

AUGGAGUCGAUACGAAUUCUGAUAA


#### str.rstrip([chars])
Return a copy of the string with specified trailing characters removed. The chars argument is a string specifying the set of characters to be removed. If omitted or None, the chars argument defaults to removing whitespace. The chars argument is not a suffix; rather, all combinations of its values are stripped.

This method is particularly useful for removing newline characters `\n` from the end of lines when processing files, as we'll cover later in this course. 

In [20]:
sentence = 'Minions love bananas'
new_sentence = sentence.rstrip('sna') #removes all trailing s, n or a characters
print(new_sentence)  #Minions love b

print('AN213214, TCR receptor homolog, Mus musculus\n'.rstrip()) #removes the newline character from the end of the line
print('Mus musculus'.rstrip('lums') ) #strips trailing l, u, m or s characters, stopping at the first other character (c)

Minions love b
AN213214, TCR receptor homolog, Mus musculus
Mus musc


A similar method, `str.lstrip()`, returns a copy of the string with leading characters removed.
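
The sketch below compares the two methods on an illustrative line with whitespace at both ends. It also uses `str.strip()`, a related built-in method not described above, which removes characters from both ends. `repr()` is used so that the whitespace characters are visible in the output:

```python
raw_line = '  \tAN213214, TCR receptor homolog\n'
left_stripped = raw_line.lstrip()   # leading spaces and tab removed
right_stripped = raw_line.rstrip()  # trailing newline removed
both_stripped = raw_line.strip()    # whitespace removed from both ends
print(repr(left_stripped))
print(repr(right_stripped))
print(repr(both_stripped))
```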

#### str.split(delimiter)

This is an extremely useful method that splits a string at each occurrence of the delimiter to produce a list. In the example below, the delimiter is set to a space, so the method splits at every occurrence of a space. Other common delimiters would be commas `","` or tabs `"\t"`.

We haven't covered lists yet - they are a data type in Python that will be covered soon, but like strings the individual elements in a list can be accessed by specifying the element index, as shown in the example below:

In [1]:
dna_sequence = "ATT ACG AAT TCT GAT AAT"
codons = dna_sequence.split(" ")
print("The third codon is " + codons[2])

The third codon is AAT


The split method is also really useful for processing data from common biological data files such as comma-separated values files, as it can split a line of data into subcomponents. We will make extensive use of this method when we come to the section on "Working with Files".
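
A minimal sketch of processing one line of comma-separated data. The line contents, column meanings and variable names are invented for illustration. The trailing newline is removed first so that it does not end up attached to the last field:

```python
line = "geneA,chr1,1500\n"  # hypothetical line: name, chromosome, length
# Remove the newline, then split at each comma to get a list of fields
fields = line.rstrip("\n").split(",")
print(fields)  # ['geneA', 'chr1', '1500']
description = "{} is on {} and is {} bp long".format(fields[0], fields[1], fields[2])
print(description)  # geneA is on chr1 and is 1500 bp long
```

Note that every field is still a string after splitting, including the length `'1500'`.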


# Exercises

* Complete the code below to remove the newline (\n) character from the string, and convert the string to UPPERCASE.

In [25]:
dna_seq = "gctacgatcgatcgatcgattagctagctgat\n"

* Convert the following DNA sequence into RNA (replace thymine (T) bases with uracil (U) bases), then remove the spaces between codons.

In [26]:
dna_sequence = "ATG GCT GCG TCA ACT CGC AAA TGC GTA GGC CAA CCG"

* Complete the code below using the .format method to insert the variables into the printed string:

In [3]:
mean_beak_depth_mm = 9.2
max_beak_depth_mm = 11.1
min_beak_depth_mm = 7.6
#use the .format method to insert variables into the string below
print("Beak depth measurements of Ground Finches ranged from {} mm to {} mm, with a mean of {} mm") 

Beak depth measurements of Ground Finches ranged from {} mm to {} mm, with a mean of {} mm
