# Practical Python Programming for Biologists
Author: Dr. Daniel Pass | www.CompassBioinformatics.com

---

# String Methods
There are a large number of methods to manipulate strings. These are simple single-word methods that achieve difficult (but often boring) tasks, and they can be incredibly useful. We have already seen some of them because they are so common, but let's look at the most useful ones together:

Cleaning and printing outputs
* ```.strip()``` cleans off whitespace, or other noise, from the beginning and end of a string (whitespace meaning spaces, tabs (\t), or newlines (\n))
* ```.upper()```, ```.title()```, and ```.lower()``` adjust the casing of your string.

Searching and modifying the string
* ```.replace()``` replaces all instances of a character/string in a string with another character/string.
* ```.find()``` searches a string for a character/string and returns the index of the first place it is found (or -1 if it is not found).

Making/breaking lists
* ```.split()``` takes a string and creates a list of substrings.
* ```.join()``` takes a list of strings and creates a string.

A few examples of them in action:

In [None]:
my_gene = "ATGTCGACCAACTCGACCAATGCTCGACCAACGGCaaaaaaaaaaaaaa"
article = "\n  a study to show how one little bit of dna became the most important thing   \n\n"
start_codon = "ATG"  # ATG is the start codon; the stop codons are TAA, TAG and TGA

# Format strings
print(my_gene.upper())
print(article.strip())
print(article.title())
print(article.strip().title())

ATGTCGACCAACTCGACCAATGCTCGACCAACGGCAAAAAAAAAAAAAA
a study to show how one little bit of dna became the most important thing

  A Study To Show How One Little Bit Of Dna Became The Most Important Thing   


A Study To Show How One Little Bit Of Dna Became The Most Important Thing
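
The "other noise" that ```.strip()``` can remove is given as a string of characters to strip, and ```.rstrip()``` / ```.lstrip()``` do the same job on one end only. Here is a minimal sketch; the header and sequence are made-up values for illustration, not course data:

```python
# Hypothetical example values, invented for illustration
header = ">gene_001\n"
tailed_gene = "ATGTCGACCAACGGCaaaaaaaaaaaaaa"

# Strip the leading ">" and the trailing newline in one call
gene_id = header.strip(">\n")
print(gene_id)

# Remove only the trailing run of lowercase "a" (a poly-A tail)
trimmed_gene = tailed_gene.rstrip("a")
print(trimmed_gene)

# .lower() is the counterpart of .upper()
print(trimmed_gene.lower())
```

Note that the argument is treated as a set of characters, not as a word: ```.strip(">\n")``` removes any mix of ```>``` and newline characters from both ends.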


Let's use ```.split()``` to break up the sequence. We can use anything as a split delimiter (usually a comma (```,```) or tab (```\t```) character), but here let's be bioinformatic and use a codon as the delimiter:

In [None]:
# Split the sequence at every ATG (strictly speaking the start codon, not a stop codon)
splitted_gene = my_gene.split("ATG")
print(splitted_gene)

# Output the second element - We'll see more of this in lists
middle_CDS = splitted_gene[1]
print(middle_CDS)

# For fun let's use the replace function to convert to RNA
print(middle_CDS.replace("T", "U"))


['', 'TCGACCAACTCGACCA', 'CTCGACCAACGGCaaaaaaaaaaaaaa']
TCGACCAACTCGACCA
UCGACCAACUCGACCA
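
The examples so far have not used ```.find()``` or ```.join()```, so here is a short sketch of both on the same sequence. The variable names ```first_atg```, ```second_atg```, ```pieces``` and ```rejoined``` are just illustrative choices:

```python
my_gene = "ATGTCGACCAACTCGACCAATGCTCGACCAACGGCaaaaaaaaaaaaaa"

# .find() returns the index of the first match, counting from 0
first_atg = my_gene.find("ATG")
print(first_atg)

# An optional second argument tells .find() where to start searching
second_atg = my_gene.find("ATG", first_atg + 1)
print(second_atg)

# .find() returns -1 when there is no match at all
print(my_gene.find("GGGGGG"))

# .join() is the reverse of .split(): glue the pieces back together
pieces = my_gene.split("ATG")
rejoined = "ATG".join(pieces)
print(rejoined == my_gene)
```

The string you call ```.join()``` on is the glue that goes between the list elements, which is why joining with the same delimiter you split on gives back the original string.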


---

# Exercise - Extreme strings & loops

Data is messy. Biologists' data even more so. Here we have some data on bacterial abundance, collected by some well-meaning scientists, but unfortunately it's a bit of a mess. It is technically in a four-column format like this, however when you look at it, it's mixed up:

```
| Collector | Percentage abundance | Dominant Phyla | Date |
```

Delimiters:
- Between collected data samples: ```,``` 
- Between data fields per sample: ```-```

We want to clean it up and make some sense out of it. The objective is to output a count of samples dominated by each phylum.

Here is a list of suggested steps. I recommend using ```print()``` functions after each step to check the output is as expected.

1. Look at the text file first so that you know what we are looking at!
2. Read in the file ```MessyData.txt``` as one object (it is too mixed-up to read line-by-line)
3. Split the data by commas into a list of records
4. Within a loop, split each record into the 4 data elements
5. Within a nested loop, clean the whitespace off each element (while keeping each experiment's elements together)
6. Create a list of all the dominant phyla per sample - some samples have multiple, so they have to be split first!
7. Output a count of samples dominated by each phylum. Here is an example final line of code for you to use:

```
for p in phyla:
  # count is for you to work out: the number of samples dominated by phylum p
  print("There are {} samples dominant in {}".format(count, p))
```

The list of phyla is below. To create this list I used the function ```set``` on the list of all phyla to output only the unique ones, like this: ```list(set(all_phyla))```. You can use that in your code if you want to generate the list yourself, or copy this list into your code below.
```
phyla = ['Actinomycetes', 'Proteobacteria', 'Cyanobacteria', 'Firmicutes', 'Chloroflexi', 'Acidobacteria', 'Bacillus']
```
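
If you get stuck, the suggested steps can be sketched on a tiny made-up string. This is only one possible approach, not the model answer: the string ```raw``` below is invented and is not the contents of ```MessyData.txt```, reading the file is left out, and the sketch assumes that dates contain no ```-``` and that multiple phyla within one sample are separated by ```/```. Check both assumptions against the real file before reusing any of it.

```python
# Made-up miniature data, NOT the contents of MessyData.txt.
# Assumptions: dates contain no "-" and multiple phyla are separated by "/".
raw = (" Ann - 40 -Firmicutes- 01.02.2020 ,\n"
       "Bob-55- Cyanobacteria / Firmicutes -03.02.2020,"
       "  Cat - 12 - Chloroflexi-04.02.2020 ")

all_phyla = []
for record in raw.split(","):
    # Split the record into its fields and clean the whitespace off each one
    fields = []
    for field in record.split("-"):
        fields.append(field.strip())

    # The dominant phyla are the third field (index 2)
    dominant = fields[2]

    # A sample can list several phyla, so split again and clean each name
    for phylum in dominant.split("/"):
        all_phyla.append(phylum.strip())

# Unique phyla, sorted so the output order is the same on every run
phyla = sorted(set(all_phyla))
for p in phyla:
    print("There are {} samples dominant in {}".format(all_phyla.count(p), p))
```

Keeping the cleaned fields of each record together in their own list (```fields```) is what stops the experiments from getting mixed up with each other.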

Extension: If you've completed it and want more of a challenge, create a graphical output of the data.