# Python for biologists, chapter 3

Geert Jan Bex ([geertjan.bex@uhasselt.be](mailto:geertjan.bex@uhasselt.be))

Exercises taken from 'Python for biologists', Martin Jones.

## Splitting genomic DNA

The file `genomic_dna_3.txt` contains a DNA sequence with two exons separated by an intron. The first exon starts at the beginning of the sequence and ends at the 63rd character; the second exon starts at the 91st character and runs to the end of the sequence, so the intron spans characters 64 through 90. Read the file, then write the coding and non-coding regions to `genomic_dna_3_coding.txt` and `genomic_dna_3_noncoding.txt`, respectively.
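
The positions in the exercise are 1-based and inclusive, whereas Python slices are 0-based with an exclusive end. A minimal sketch of the conversion, using a hypothetical helper `region` and a made-up toy sequence rather than the exercise data:

```python
def region(seq, start, end):
    """Return the 1-based, inclusive region start..end of seq.

    Hypothetical helper for illustration; not part of the exercise.
    """
    # 1-based inclusive [start, end] corresponds to the slice [start - 1:end]
    return seq[start - 1:end]


# Toy sequence (made up): exon 1 is characters 1-4, the intron 5-7, exon 2 8-10
toy = 'AAAACCCGGG'
exon1 = region(toy, 1, 4)
intron = region(toy, 5, 7)
exon2 = region(toy, 8, len(toy))
print(exon1, intron, exon2)
```

With this convention, characters 1 through 63 correspond to `dna[:63]`, characters 64 through 90 to `dna[63:90]`, and character 91 onwards to `dna[90:]`.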

In [1]:
seq_filename = 'genomic_dna_3.txt'
coding_seq_filename = 'genomic_dna_3_coding.txt'
noncoding_seq_filename = 'genomic_dna_3_noncoding.txt'

In [4]:
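# Read the whole sequence; drop a trailing newline, if any, so it does not end up in exon2
with open(seq_filename, 'r') as seq_file:
    dna = seq_file.read().rstrip('\n')
# Characters 1-63 are exon 1, 64-90 the intron, 91 to the end exon 2
exon1, intron, exon2 = dna[:63], dna[63:90], dna[90:]
with open(coding_seq_filename, 'w') as seq_file:
    seq_file.write(exon1 + exon2)
with open(noncoding_seq_filename, 'w') as seq_file:
    seq_file.write(intron)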

In [6]:
%cat genomic_dna_3_coding.txt

ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGAATCATCGATCGATATCGATGCATCGACTACTAT


In [7]:
%cat genomic_dna_3_noncoding.txt

TCGATCGATCGATCGATCGATCATGCT

## Writing a FASTA file

Write a file `sequences_3.fasta` in FASTA format containing the following three sequences with their respective headers. Sequences must consist only of the symbols A, C, G, and T, so lowercase bases have to be converted to uppercase and gap characters (`-`) removed.
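
The alphabet requirement can also be checked explicitly before anything is written. The following is only a sketch: `clean_sequence` is a hypothetical helper, and raising a `ValueError` for any other symbol is an assumption, since the exercise does not say how such symbols should be handled.

```python
def clean_sequence(raw):
    """Uppercase raw, drop '-' gap characters, and check the alphabet.

    Hypothetical helper for illustration; raising ValueError on other
    symbols is an assumption, not something the exercise prescribes.
    """
    seq = raw.replace('-', '').upper()
    # Any symbol left over that is not A, C, G, or T is reported
    unexpected = set(seq) - set('ACGT')
    if unexpected:
        raise ValueError('unexpected symbols: {0}'.format(sorted(unexpected)))
    return seq


print(clean_sequence('ac-TG--a'))
```

A sequence that contains, say, an `N` would raise an error here instead of silently ending up in the FASTA file.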

In [8]:
headers = ['ABC123', 'DEF456', 'HIJ789']
seqs = ['ATCGTACGATCGATCGATCGCTAGACGTATCG', 'actgatcgacgatcgatcgatcacgact', 'ACTGAC-ACTGT--ACTGTA----CATGTG']
fasta_filename = 'sequences_3.fasta'

In [9]:
with open(fasta_filename, 'w') as fasta_file:
    for header, raw_seq in zip(headers, seqs):
        # Remove gap characters and convert to uppercase
        seq = raw_seq.replace('-', '').upper()
        fasta_file.write('>{0}\n{1}\n'.format(header, seq))

In [10]:
%cat sequences_3.fasta

>ABC123
ATCGTACGATCGATCGATCGCTAGACGTATCG
>DEF456
ACTGATCGACGATCGATCGATCACGACT
>HIJ789
ACTGACACTGTACTGTACATGTG


## Writing multiple FASTA files

Using the data from the previous part, write three files rather than one, i.e., one FASTA file per sequence. Use each sequence header as the file name, with the `.fasta` extension appended.

In [8]:
headers = ['ABC123', 'DEF456', 'HIJ789']
seqs = ['ATCGTACGATCGATCGATCGCTAGACGTATCG', 'actgatcgacgatcgatcgatcacgact', 'ACTGAC-ACTGT--ACTGTA----CATGTG']

In [18]:
for header, raw_seq in zip(headers, seqs):
    file_name = '{0}.fasta'.format(header)
    with open(file_name, 'w') as fasta_file:
        # Remove gap characters and convert to uppercase
        seq = raw_seq.replace('-', '').upper()
        fasta_file.write('>{0}\n{1}\n'.format(header, seq))

In [19]:
%cat ABC123.fasta
%cat DEF456.fasta
%cat HIJ789.fasta

>ABC123
ATCGTACGATCGATCGATCGCTAGACGTATCG
>DEF456
ACTGATCGACGATCGATCGATCACGACT
>HIJ789
ACTGACACTGTACTGTACATGTG
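
Instead of inspecting the files with `%cat`, they can be read back and checked in Python. The sketch below assumes one record per file; `read_fasta` is a hypothetical helper, and the files are written to a temporary directory so that existing files are left untouched.

```python
import os
import tempfile


def read_fasta(path):
    """Return (header, sequence) for a FASTA file holding a single record.

    Hypothetical helper for illustration; it assumes one header line
    followed by one or more sequence lines.
    """
    with open(path, 'r') as fasta_file:
        lines = fasta_file.read().splitlines()
    return lines[0].lstrip('>'), ''.join(lines[1:])


headers = ['ABC123', 'DEF456', 'HIJ789']
seqs = ['ATCGTACGATCGATCGATCGCTAGACGTATCG', 'actgatcgacgatcgatcgatcacgact', 'ACTGAC-ACTGT--ACTGTA----CATGTG']

# Write one file per sequence into a temporary directory, then read each back
records = []
with tempfile.TemporaryDirectory() as tmp_dir:
    for header, raw_seq in zip(headers, seqs):
        path = os.path.join(tmp_dir, '{0}.fasta'.format(header))
        with open(path, 'w') as fasta_file:
            seq = raw_seq.replace('-', '').upper()
            fasta_file.write('>{0}\n{1}\n'.format(header, seq))
        records.append(read_fasta(path))

for header, seq in records:
    print(header, seq)
```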
