# Practical Python Programming for Biologists
Author: Dr. Daniel Pass | www.CompassBioinformatics.com

---

In this session we will look at reading from and writing to files. File handling is often not a priority in beginners' coding courses, but biological data typically means reading in lots of large files, so let's find out how.

Note: You will need to add the Classdata files to the Colab runtime to access them.

# Python I/O Handling

In Python, input/output (I/O) handling is the process of reading data from external sources and writing data to external destinations. This is a fundamental aspect of programming, especially in bioinformatics, given the number of data files we work with.

## Reading a Text File

One of the most common I/O operations in Python is reading data from a file. In Python, we can read a text file using the `open()` function and the `read()` method. This will load the whole file into one variable.


In [1]:
# Open the file in read mode
with open('/content/CanisLupisCO1.fasta') as inputFile:
  # Read the contents of the file
  data = inputFile.read()

print(data)

>U96639.2:5349-6893 Canis familiaris mitochondrion, cytochrome c oxidase subunit I
ATGTTCATTAACCGATGACTGTTCTCCACTAATCACAAGGATATTGGTACTTTATACTTACTATTTGGAG
CATGAGCCGGTATAGTAGGCACTGCTTTGAGCCTCCTCATCCGAGCCGAACTAGGTCAGCCCGGTACTTT
ACTAGGTGACGATCAAATTTATAATGTCATCGTAACCGCCCATGCTTTCGTAATAATCTTCTTCATAGTC
ATGCCCATCATAATTGGGGGCTTTGGAAACTGACTAGTGCCGTTAATAATTGGTGCTCCGGACATGGCAT
TCCCCCGAATAAATAACATGAGCTTCTGACTCCTTCCTCCATCCTTTCTTCTACTATTAGCATCTTCTAT
GGTAGAAGCAGGTGCAGGAACGGGATGAACCGTATACCCCCCACTGGCTGGCAATCTGGCCCATGCAGGA
GCATCCGTTGACCTTACAATTTTCTCCTTACACTTAGCCGGAGTCTCTTCTATTTTAGGGGCAATTAATT
TCATCACTACTATTATCAACATAAAACCCCCTGCAATATCCCAGTATCAAACTCCCCTGTTTGTATGATC
AGTACTAATTACAGCAGTTCTACTCTTACTATCCCTGCCTGTACTGGCTGCTGGAATTACAATACTTTTA
ACAGACCGGAATCTTAATACAACATTTTTTGATCCCGCTGGAGGAGGAGACCCTATCCTATATCAACACC
TATTCTGATTCTTCGGACATCCTGAAGTTTACATTCTTATCCTGCCCGGATTCGGAATAATTTCTCACAT


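An alternative is to work through the file one line at a time with a loop, holding each line in a temporary variable. This lets you be more selective about what you print or keep, but a line is gone once the loop moves on unless you add it to a variable. One caveat: `.readlines()`, used below, still reads the whole file into a list before the loop starts. Looping directly over the file object (`for line in dogDNA:`) is what avoids holding the whole file in memory, which is good for big files. Here is a basic example, followed by a more complex one to come back to when we've learned more about loops and conditionals.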

In [2]:
with open('/content/CanisLupisCO1.fasta') as dogDNA:
  for line in dogDNA.readlines():
    print(line)

>U96639.2:5349-6893 Canis familiaris mitochondrion, cytochrome c oxidase subunit I

ATGTTCATTAACCGATGACTGTTCTCCACTAATCACAAGGATATTGGTACTTTATACTTACTATTTGGAG

CATGAGCCGGTATAGTAGGCACTGCTTTGAGCCTCCTCATCCGAGCCGAACTAGGTCAGCCCGGTACTTT

ACTAGGTGACGATCAAATTTATAATGTCATCGTAACCGCCCATGCTTTCGTAATAATCTTCTTCATAGTC

ATGCCCATCATAATTGGGGGCTTTGGAAACTGACTAGTGCCGTTAATAATTGGTGCTCCGGACATGGCAT

TCCCCCGAATAAATAACATGAGCTTCTGACTCCTTCCTCCATCCTTTCTTCTACTATTAGCATCTTCTAT

GGTAGAAGCAGGTGCAGGAACGGGATGAACCGTATACCCCCCACTGGCTGGCAATCTGGCCCATGCAGGA

GCATCCGTTGACCTTACAATTTTCTCCTTACACTTAGCCGGAGTCTCTTCTATTTTAGGGGCAATTAATT

TCATCACTACTATTATCAACATAAAACCCCCTGCAATATCCCAGTATCAAACTCCCCTGTTTGTATGATC

AGTACTAATTACAGCAGTTCTACTCTTACTATCCCTGCCTGTACTGGCTGCTGGAATTACAATACTTTTA

ACAGACCGGAATCTTAATACAACATTTTTTGATCCCGCTGGAGGAGGAGACCCTATCCTATATCAACACC

TATTCTGATTCTTCGGACATCCTGAAGTTTACATTCTTATCCTGCCCGGATTCGGAATAATTTCTCACAT
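
As a minimal sketch of truly line-by-line reading: looping over the file object itself (rather than calling `.readlines()`) hands you one line at a time without first building a list of the whole file. Each line still ends with a newline character, which is why the output above is double-spaced; `.rstrip('\n')` removes it. The small file written here is a made-up stand-in so that the cell runs anywhere. In class you would point `open()` at the real FASTA path instead.

```python
import os
import tempfile

# Hypothetical stand-in file so this example is self-contained
example_path = os.path.join(tempfile.mkdtemp(), 'example.fasta')
with open(example_path, 'w') as handle:
  handle.write('>seq1 example header\nATGTTCATTA\nCATGAGCCGG\n')

line_count = 0
base_count = 0

with open(example_path) as dna_file:
  # Looping over the file object reads one line at a time
  for line in dna_file:
    line = line.rstrip('\n')  # drop the trailing newline
    print(line)
    line_count += 1
    if not line.startswith('>'):
      base_count += len(line)  # only count sequence lines

print('Lines read:', line_count)
print('Bases counted:', base_count)
```

Only the running totals are kept here, so memory use stays small however long the file is.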


Something a bit more complex. We won't look at how the `if` statement works yet (although Python is easy to read!), but notice that it skips the first line because that line begins with a ```>``` character. This example is here as a good reference to come back to later.

In [None]:
my_lines = []

with open('/content/CanisLupisCO1.fasta') as dogDNA:
  for line in dogDNA.readlines():
    if not line.startswith('>'):
      my_lines.append(line.strip())

print(my_lines)

['ATGTTCATTAACCGATGACTGTTCTCCACTAATCACAAGGATATTGGTACTTTATACTTACTATTTGGAG', 'CATGAGCCGGTATAGTAGGCACTGCTTTGAGCCTCCTCATCCGAGCCGAACTAGGTCAGCCCGGTACTTT', 'ACTAGGTGACGATCAAATTTATAATGTCATCGTAACCGCCCATGCTTTCGTAATAATCTTCTTCATAGTC', 'ATGCCCATCATAATTGGGGGCTTTGGAAACTGACTAGTGCCGTTAATAATTGGTGCTCCGGACATGGCAT', 'TCCCCCGAATAAATAACATGAGCTTCTGACTCCTTCCTCCATCCTTTCTTCTACTATTAGCATCTTCTAT', 'GGTAGAAGCAGGTGCAGGAACGGGATGAACCGTATACCCCCCACTGGCTGGCAATCTGGCCCATGCAGGA', 'GCATCCGTTGACCTTACAATTTTCTCCTTACACTTAGCCGGAGTCTCTTCTATTTTAGGGGCAATTAATT', 'TCATCACTACTATTATCAACATAAAACCCCCTGCAATATCCCAGTATCAAACTCCCCTGTTTGTATGATC', 'AGTACTAATTACAGCAGTTCTACTCTTACTATCCCTGCCTGTACTGGCTGCTGGAATTACAATACTTTTA', 'ACAGACCGGAATCTTAATACAACATTTTTTGATCCCGCTGGAGGAGGAGACCCTATCCTATATCAACACC', 'TATTCTGATTCTTCGGACATCCTGAAGTTTACATTCTTATCCTGCCCGGATTCGGAATAATTTCTCACAT']


There is also the similarly named ```.readline()``` (notice it's singular, not plural), which reads just one line from the file. This is handy for extracting a header or title line without looping over the whole file.

In [4]:
with open('/content/CanisLupisCO1.fasta') as dogDNA:
  header = dogDNA.readline()
  print(header)

>U96639.2:5349-6893 Canis familiaris mitochondrion, cytochrome c oxidase subunit I
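
One detail worth knowing, sketched below with a small made-up file: after `.readline()` the file remembers where it stopped, so a loop that follows picks up from the second line. The header also keeps its trailing newline (hence the blank line in the output above) until you `.strip()` it.

```python
import os
import tempfile

# Hypothetical stand-in file so this example is self-contained
example_path = os.path.join(tempfile.mkdtemp(), 'example.fasta')
with open(example_path, 'w') as handle:
  handle.write('>seq1 example header\nATGTTCATTA\nCATGAGCCGG\n')

with open(example_path) as dna_file:
  header = dna_file.readline().strip()  # first line only, newline removed
  sequence = ''
  for line in dna_file:  # continues from the second line
    sequence += line.strip()

print(header)
print(sequence)
print(len(sequence))
```

This keeps the header and the sequence in separate variables, which is often what you want before doing anything with the DNA itself.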





## Writing to a Text File

In addition to reading data from a file, we can of course also write data to a file in Python. Usually this will be more useful than just putting information onto the screen when working with big bioinformatic files.

To write to a file, we need to open the file in write mode using the `open()` function and use the `write()` method to choose what to output. 

Note: The only change in the ```open()``` function is the second parameter of 'w'. We didn't need a second parameter when reading as 'r' is the default.


In [5]:
# Open the file in write mode
with open('declaration.txt', 'w') as output_file:
  output_file.write('Hello, world! Python4Lyf!')

Check the file that you just created!
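
Two things often surprise people about write mode, shown here in a minimal sketch with a made-up file name and made-up sample IDs: `write()` does not add a newline for you, and opening an existing file with 'w' wipes its previous contents.

```python
import os
import tempfile

# Hypothetical output location so nothing in your working folder is touched
out_path = os.path.join(tempfile.mkdtemp(), 'samples.txt')

sample_ids = ['sample_A', 'sample_B', 'sample_C']  # made-up example data

with open(out_path, 'w') as output_file:
  for sample in sample_ids:
    output_file.write(sample + '\n')  # write() needs an explicit newline

# Opening in 'w' mode again replaces the old contents entirely
with open(out_path, 'w') as output_file:
  output_file.write('only this line remains\n')

with open(out_path) as check_file:
  contents = check_file.read()

print(contents)
```

So double-check the file name before opening anything in write mode, especially if it could match one of your data files.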


## Appending to a Text File

We can also append data to an existing file in Python using the `open()` function in append mode and the `write()` method. Here is an example:

In [6]:
# Open the file in append mode
with open('declaration.txt', 'a') as outputFile:
  # Append to the file
  outputFile.write('\nand again.')
  outputFile.write('\nand again..')
  outputFile.write('\nand again...')
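
To see the effect of append mode without opening the file browser, you can read the file straight back in. This sketch uses a temporary, made-up log file rather than `declaration.txt`.

```python
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), 'run_log.txt')  # hypothetical file

with open(log_path, 'w') as log_file:
  log_file.write('first entry')

# 'a' keeps what is already there and adds to the end
with open(log_path, 'a') as log_file:
  log_file.write('\nsecond entry')
  log_file.write('\nthird entry')

with open(log_path) as log_file:
  entries = log_file.read().splitlines()  # one list item per line

print(entries)
```

Append mode also creates the file if it does not exist yet, so it is a safe choice for log-style output that builds up over several runs.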


## In closing...
In Python, it is important to close a file after reading from or writing to it. Older code does this by calling the `close()` method explicitly, but the recommended approach is the ```with``` statement (a context manager), which closes the file for you and gives simpler, more readable code. I'm mostly including this here in case you see or use older code that uses the explicit format.

Basically, because you're interacting with files outside of Python, it needs to be told when you're finished, for a few reasons:

1. Memory management: Each open file takes up a file handle and some buffer memory. These are only released when the file is closed, and the operating system limits how many files a program can have open at once, so a script that opens many files without closing them can slow down or crash.

2. Data corruption: Python collects written data in a buffer and sends it to the disk in chunks. If a file is not closed properly, anything still waiting in the buffer may be lost (think of a USB stick unplugged too soon). This can result in corrupted or incomplete data, which can cause issues when the data is later read or used.

3. Resource management: While a file is open, other programs may be restricted in what they can do with it. How strict this is depends on the operating system: Windows, for example, will often refuse to let another program delete or rename a file that is still open, while Linux and macOS usually do not lock files automatically. Closing the file promptly avoids these conflicts when the file is needed by another program or process.

To summarise, always close the file! But if you're using ```with```, then it's automatic.
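
For reference, here is a sketch of the older explicit style next to the ```with``` version, using a temporary made-up file. Every file object has a `.closed` attribute that you can print to check its state.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')  # hypothetical file

# Older style: you must remember to call close() yourself
old_style = open(demo_path, 'w')
old_style.write('written the old way\n')
old_style.close()

# Recommended style: the file is closed when the indented block ends
with open(demo_path, 'a') as new_style:
  new_style.write('written using with\n')
  open_inside = new_style.closed  # still open inside the block

print(old_style.closed)   # True
print(open_inside)        # False
print(new_style.closed)   # True
```

In the older style, an error between `open()` and `close()` would leave the file open, which is one more reason to prefer ```with```.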

## Exercises

**1 - Read in the ```Homo-sapiens-chr8.txt``` file and convert it into a fasta formatted file with a header line beginning with ```>``` (remember the correct ```.fasta``` suffix)**

**2 - Write your converted data to a file that has the length of the sequence in its filename**

In [None]:
# Write your script here