<h1 id="toctitle">Working with files exercise solutions</h1>
<ul id="toc"/>

##Splitting genomic DNA

We already have some of the solution from the previous session:

In [2]:
my_dna = "ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT"
exon1 = my_dna[0:63]
intron = my_dna[63:90]
exon2 = my_dna[90:]
print(exon1 + intron.lower() + exon2)

ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGAtcgatcgatcgatcgatcgatcatgctATCATCGATCGATATCGATGCATCGACTACTAT


What changes do we need to make? First, read the sequence from a file instead of writing it in the code:

In [3]:
dna_file = open("genomic_dna.txt")
my_dna = dna_file.read()
my_dna

'ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT\n'
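
Note the `\n` at the end of the string: `read()` returns the trailing newline along with the sequence, so a slice such as `my_dna[90:]` would carry it into the second exon. Here is a minimal sketch of one way to handle this, assuming the file holds a single line of sequence. The short sequence and the temporary file are made up for illustration and stand in for `genomic_dna.txt`:

```python
import os
import tempfile

# Hypothetical stand-in for genomic_dna.txt: a short sequence plus a newline
path = os.path.join(tempfile.mkdtemp(), "genomic_dna.txt")
dna_file = open(path, "w")
dna_file.write("ATCGATCGTT\n")
dna_file.close()

# read() returns everything in the file, including the final newline
dna_file = open(path)
raw_dna = dna_file.read()
dna_file.close()

# rstrip("\n") returns a copy with any trailing newline characters removed
my_dna = raw_dna.rstrip("\n")

print(len(raw_dna))  # 11: ten bases plus the newline
print(len(my_dna))   # 10: just the bases
```

Whether to strip the newline depends on what the output files should contain; the point is only that it is there unless we remove it.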

Create two new file objects to hold the output:

In [4]:
coding_file = open("coding_dna.txt", "w")
noncoding_file = open("noncoding_dna.txt", "w")

Concatenate the two exon sequences and write them to the coding file, then write the intron sequence to the noncoding file:

In [6]:
coding_file.write(exon1 + exon2)
noncoding_file.write(intron)

Putting it all together:

In [8]:
# open the file and read its contents
dna_file = open("genomic_dna.txt")
my_dna = dna_file.read()
# extract the different bits of DNA sequence
exon1 = my_dna[0:63]
intron = my_dna[63:90]
exon2 = my_dna[90:]
# open the two output files
coding_file = open("coding_dna.txt", "w")
noncoding_file = open("noncoding_dna.txt", "w")
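# write the sequences to the output files
coding_file.write(exon1 + exon2)
noncoding_file.write(intron)
# close all three files so that the output is saved to disk
dna_file.close()
coding_file.close()
noncoding_file.close()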
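
It is easy to forget to close a file. As an aside, here is a sketch of the same solution using `with` blocks, a common Python idiom that closes each file automatically when the block ends. The exercise does not require it. The sketch uses the exon coordinates from the first cell (`[0:63]`, `[63:90]`, `[90:]`) and assumes we want to strip the trailing newline. The temporary folder and the step that recreates the input file are only there so the example runs on its own:

```python
import os
import tempfile

# Recreate the input file in a temporary folder (illustration only)
folder = tempfile.mkdtemp()
in_path = os.path.join(folder, "genomic_dna.txt")
sequence = "ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT"
with open(in_path, "w") as dna_file:
    dna_file.write(sequence + "\n")

# open the file and read its contents, dropping the trailing newline
with open(in_path) as dna_file:
    my_dna = dna_file.read().rstrip("\n")

# extract the different bits of DNA sequence
exon1 = my_dna[0:63]
intron = my_dna[63:90]
exon2 = my_dna[90:]

# write the sequences to the output files; each file is closed
# automatically at the end of its with block
coding_path = os.path.join(folder, "coding_dna.txt")
noncoding_path = os.path.join(folder, "noncoding_dna.txt")
with open(coding_path, "w") as coding_file:
    coding_file.write(exon1 + exon2)
with open(noncoding_path, "w") as noncoding_file:
    noncoding_file.write(intron)

print(coding_file.closed)     # True
print(noncoding_file.closed)  # True
```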

##Writing a FASTA file

First, create some variables, three for headers and three for sequences:

In [9]:
header_1 = "ABC123"
header_2 = "DEF456"
header_3 = "HIJ789"
seq_1 = "ATCGTACGATCGATCGATCGCTAGACGTATCG"
seq_2 = "actgatcgacgatcgatcgatcacgact"
seq_3 = "ACTGAC-ACTGT--ACTGTA----CATGTG"

Start by printing FASTA output to the screen to check if it works:

In [10]:
print(header_1)
print(seq_1)
print(header_2)
print(seq_2)
print(header_3)
print(seq_3)

ABC123
ATCGTACGATCGATCGATCGCTAGACGTATCG
DEF456
actgatcgacgatcgatcgatcacgact
HIJ789
ACTGAC-ACTGT--ACTGTA----CATGTG


We forgot the `>` character that must start each FASTA header. We can also print each header/sequence pair with a single statement by joining the two parts with a newline character:

In [11]:
print('>' + header_1 + '\n' + seq_1)
print('>' + header_2 + '\n' + seq_2)
print('>' + header_3 + '\n' + seq_3)

>ABC123
ATCGTACGATCGATCGATCGCTAGACGTATCG
>DEF456
actgatcgacgatcgatcgatcacgact
>HIJ789
ACTGAC-ACTGT--ACTGTA----CATGTG


Next, fix the sequences: convert the second one to upper case and remove the dashes from the third one:

In [12]:
print('>' + header_1 + '\n' + seq_1)
print('>' + header_2 + '\n' + seq_2.upper())
print('>' + header_3 + '\n' + seq_3.replace('-', ''))

>ABC123
ATCGTACGATCGATCGATCGCTAGACGTATCG
>DEF456
ACTGATCGACGATCGATCGATCACGACT
>HIJ789
ACTGACACTGTACTGTACATGTG
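
A detail worth checking here: `upper()` and `replace()` do not change the original strings, they return new ones. That is why we can call them directly inside the `print()` statements. A short sketch using the same sequences (the names `upper_seq_2` and `clean_seq_3` are made up for this example):

```python
seq_2 = "actgatcgacgatcgatcgatcacgact"
seq_3 = "ACTGAC-ACTGT--ACTGTA----CATGTG"

# each method call returns a new string and leaves the original alone
upper_seq_2 = seq_2.upper()
clean_seq_3 = seq_3.replace('-', '')

print(seq_2)                   # still lower case
print(upper_seq_2)
print(seq_3.count('-'))        # 7
print(clean_seq_3.count('-'))  # 0
```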


Now switch from printed output to file output. Unlike `print()`, `write()` does not add a newline for us, so we need to add one to the end of each string we write. Also, remember to close the file:

In [14]:
output = open("sequences.fasta", "w")
output.write('>' + header_1 + '\n' + seq_1 + '\n')
output.write('>' + header_2 + '\n' + seq_2.upper() + '\n')
output.write('>' + header_3 + '\n' + seq_3.replace('-', '') + '\n')
output.close()
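
To see why the final `'\n'` matters, here is a small sketch that leaves it out of the first `write()` call and then reads the file back. The headers, the short sequences and the temporary file location are made up for illustration:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sequences.fasta")

output = open(path, "w")
# first record: no newline after the sequence
output.write('>' + "ABC123" + '\n' + "ATCG")
# second record: newline after the sequence
output.write('>' + "DEF456" + '\n' + "GGCC" + '\n')
output.close()

# read the file back to check what was written
check = open(path)
contents = check.read()
check.close()
print(contents)
```

The second header ends up on the same line as the first sequence (`ATCG>DEF456`), which is not valid FASTA.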

All steps together:

In [16]:
# set the values of all the header variables
header_1 = "ABC123"
header_2 = "DEF456"
header_3 = "HIJ789"

# set the values of all the sequence variables

seq_1 = "ATCGTACGATCGATCGATCGCTAGACGTATCG"
seq_2 = "actgatcgacgatcgatcgatcacgact"
seq_3 = "ACTGAC-ACTGT--ACTGTA----CATGTG"

# make a new file to hold the output
output = open("sequences.fasta", "w")
# write the header and sequence for seq1
output.write('>' + header_1 + '\n' + seq_1 + '\n')
# write the header and uppercase sequence for seq2
output.write('>' + header_2 + '\n' + seq_2.upper() + '\n')
# write the header and sequence for seq3 with hyphens removed
output.write('>' + header_3 + '\n' + seq_3.replace('-', '') + '\n')
# close the file so that the output is saved to disk
output.close()

##Writing multiple FASTA files

Only a slight modification to the previous solution is needed. We construct the filename for each output file by joining the header name with the `.fasta` extension:

In [17]:
output_1 = open(header_1 + ".fasta", "w")
output_2 = open(header_2 + ".fasta", "w")
output_3 = open(header_3 + ".fasta", "w")
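
The filename is just an ordinary string built by concatenation, so we can check it before opening anything. A tiny sketch (the variable name `filename_1` is made up for this example):

```python
header_1 = "ABC123"

# join the header and the extension; the dot is part of the extension string
filename_1 = header_1 + ".fasta"
print(filename_1)  # ABC123.fasta
```

Leaving the dot out of the second string would give `ABC123fasta`, which is why the extension needs a little care.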

Now we write each sequence out to a separate file:

In [18]:
# write one sequence to each output file
output_1.write('>' + header_1 + '\n' + seq_1 + '\n') 
output_2.write('>' + header_2 + '\n' + seq_2.upper() + '\n') 
output_3.write('>' + header_3 + '\n' + seq_3.replace('-', '') + '\n') 

Finally, close each file:

In [22]:
output_1.close()
output_2.close()
output_3.close()
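
If we are unsure whether a file has been closed, the file object's `closed` attribute tells us. A minimal sketch, using a made-up record and a temporary folder so that it runs on its own:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "ABC123.fasta")

output_1 = open(path, "w")
output_1.write('>' + "ABC123" + '\n' + "ATCG" + '\n')
print(output_1.closed)  # False: the file is still open
output_1.close()
print(output_1.closed)  # True

# once the file is closed we can safely read the result back
check = open(path)
contents = check.read()
check.close()
print(contents)
```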

All together:

In [None]:
# set the values of all the header variables 
header_1 = "ABC123" 
header_2 = "DEF456" 
header_3 = "HIJ789" 
 
# set the values of all the sequence variables 
seq_1 = "ATCGTACGATCGATCGATCGCTAGACGTATCG" 
seq_2 = "actgatcgacgatcgatcgatcacgact" 
seq_3 = "ACTGAC-ACTGT--ACTGTA----CATGTG"
 
# make three files to hold the output 
output_1 = open(header_1 + ".fasta", "w") 
output_2 = open(header_2 + ".fasta", "w") 
output_3 = open(header_3 + ".fasta", "w") 
 
# write one sequence to each output file
output_1.write('>' + header_1 + '\n' + seq_1 + '\n') 
output_2.write('>' + header_2 + '\n' + seq_2.upper() + '\n') 
output_3.write('>' + header_3 + '\n' + seq_3.replace('-', '') + '\n') 

output_1.close()
output_2.close()
output_3.close()

There's a lot of repeated code here; next session we will look at how to avoid that.
