![rmotr](https://user-images.githubusercontent.com/7065401/52071918-bda15380-2562-11e9-828c-7f95297e4a82.png)
<hr style="margin-bottom: 40px;">

![filo_virion](https://user-images.githubusercontent.com/22747792/73687685-7111bc00-467f-11ea-906e-e16132529840.png)

# Python for Genomics 
## Section 5: Writing and Converting Sequence Files


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

In the last section, we created a large list containing GP gene sequences from a BioProject.  

We can keep working with these sequences within Biopython, but sometimes we need to save them to a file for analysis in external programs.

In that case, we write (save) the sequences to a file.

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

Here is the code from the previous tutorial:

In [15]:
from Bio import SeqIO

bioproject = SeqIO.parse('data/PRJNA257197.gb', 'genbank')
GP_list = []

for record in bioproject:
    for feature in record.features:
        if feature.type == 'gene' and feature.qualifiers['gene'] == ['GP']:
            GP_gene = feature.extract(record.seq)
            GP_list.append(GP_gene)

GP_list[0:5]

[Seq('GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGT...AAA', IUPACAmbiguousDNA()),
 Seq('GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGT...AAA', IUPACAmbiguousDNA()),
 Seq('GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGT...AAA', IUPACAmbiguousDNA()),
 Seq('GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGT...AAA', IUPACAmbiguousDNA()),
 Seq('GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGT...AAA', IUPACAmbiguousDNA())]

Let's write this list into a file. 

In the following tutorials we are going to perform an alignment, which is typically done in external programs that require you to load in a file.

👉🏼 The `SeqIO.write()` function requires the following:

1. sequences as SeqRecord objects (a single record or a list of records)
2. a filename for the new file
3. the file format for your output


In [16]:
SeqIO.write(GP_list, 'GP_list.fasta', 'fasta')

AttributeError: 'Seq' object has no attribute 'id'

The write fails because a fasta file needs an id for every sequence, and `SeqIO.write()` looks for one on each object it is given.

(We've only created Seq objects, which have no identifier associated with them.)


We have to create SeqRecords.

Creating a SeqRecord is just like creating a Seq object, as we did in our first notebook, just with more data (or attributes!).

Let's create a made-up SeqRecord.

In [17]:
from Bio.SeqRecord import SeqRecord  # SeqRecord must be imported before it is used

first_record = SeqRecord('ATCG', id='myid', name='some virus', description='my description')
first_record

SeqRecord(seq='ATCG', id='myid', name='some virus', description='my description', dbxrefs=[])
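
To make the difference between the two object types concrete, here is a minimal sketch (assuming Biopython is installed) that checks which attributes each one carries. The sequence, id, name, and description are made up for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# A bare Seq holds only the sequence letters; it carries no identifier.
bare_seq = Seq('ATCG')
print(hasattr(bare_seq, 'id'))

# A SeqRecord wraps a Seq and adds the metadata that file writers need.
demo_record = SeqRecord(bare_seq, id='myid', name='some_virus',
                        description='my description')
print(demo_record.id)
print(demo_record.description)
print(str(demo_record.seq))
```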

Let's rerun our script, but this time create SeqRecords (instead of Seq objects).

In [18]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


bioproject = SeqIO.parse('data/PRJNA257197.gb', 'genbank')
GP_record_list = []

for record in bioproject:
    for feature in record.features:
        # keep only the gene features annotated as GP
        if feature.type == 'gene' and feature.qualifiers['gene'] == ['GP']:
            GP_gene_seq = feature.extract(record.seq)
            # wrap the extracted Seq in a SeqRecord, reusing the parent record's metadata
            new_record = SeqRecord(GP_gene_seq,
                                   id=record.id, name=record.name,
                                   description=record.description)
            GP_record_list.append(new_record)


print(len(GP_record_list))
GP_record_list[:4]

249


[SeqRecord(seq=Seq('GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGT...AAA', IUPACAmbiguousDNA()), id='KR105345.1', name='KR105345', description='Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G6089.1, partial genome', dbxrefs=[]),
 SeqRecord(seq=Seq('GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGT...AAA', IUPACAmbiguousDNA()), id='KR105328.1', name='KR105328', description='Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G5844.1, partial genome', dbxrefs=[]),
 SeqRecord(seq=Seq('GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGT...AAA', IUPACAmbiguousDNA()), id='KR105323.1', name='KR105323', description='Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G5743.1, partial genome', dbxrefs=[]),
 SeqRecord(seq=Seq('GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGT...AAA', IUPACAmbiguousDNA()), id='KR105302.1', name='KR105302', description='Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G5295.1, pa
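
The extraction loop can also be tried on a small hand-built record, without the GenBank file. This is only a sketch: the sequence, the coordinates, and the ids are invented, and it uses `qualifiers.get('gene')` rather than `qualifiers['gene']` so that a gene feature lacking that qualifier would be skipped instead of raising a `KeyError`.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.SeqRecord import SeqRecord

# Toy record: positions 2 to 6 (0-based, end-exclusive) are annotated as the GP gene.
toy_record = SeqRecord(Seq('AAATGCCC'), id='toy1', name='toy1',
                       description='made-up record')
toy_record.features.append(
    SeqFeature(FeatureLocation(2, 6), type='gene', qualifiers={'gene': ['GP']})
)

toy_gp_records = []
for feature in toy_record.features:
    # .get() returns None when the 'gene' qualifier is absent
    if feature.type == 'gene' and feature.qualifiers.get('gene') == ['GP']:
        gene_seq = feature.extract(toy_record.seq)
        toy_gp_records.append(
            SeqRecord(gene_seq, id=toy_record.id, name=toy_record.name,
                      description=toy_record.description)
        )

print(toy_gp_records[0].id, str(toy_gp_records[0].seq))
```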

Now let's try writing the file again:

In [19]:
SeqIO.write(GP_record_list, 'data/output/PRJNA257197_GPgenes.fasta', 'fasta')

249
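
The number returned by `SeqIO.write()` is the count of records written. `SeqIO.write()` also accepts an open handle in place of a filename, which makes it easy to experiment without touching the disk. The sketch below uses two made-up records, writes them to an in-memory `StringIO` handle, and parses the text back to confirm the count.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Two made-up records standing in for GP_record_list.
demo_records = [
    SeqRecord(Seq('ATCG'), id='rec1', description='first demo record'),
    SeqRecord(Seq('GGCC'), id='rec2', description='second demo record'),
]

# Write to an in-memory text handle instead of a file on disk.
handle = StringIO()
written = SeqIO.write(demo_records, handle, 'fasta')
print(written)
print(handle.getvalue())

# Parse the text back to check that every record survived the round trip.
handle.seek(0)
read_back = list(SeqIO.parse(handle, 'fasta'))
print([rec.id for rec in read_back])
```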

In [20]:
! pwd

/app/05 Converting-Writing Files


Let's see what the file looks like:

In [21]:
! head -50 data/output/PRJNA257197_GPgenes.fasta

>KR105345.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G6089.1, partial genome
GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA
GAGTAGGGGTCATCAGGTCCTTTTCAATTGGATAACCAAAATAAGCTTCACTAGAAGGAT
ATTGTGAGGCGACAACACAATGGGTGTTACAGGAATATTGCAGTTACCTCGTGATCGATT
CAAGAGGACATCATTCTTTCTTTGGGTAATTATCCTTTTCCAAAGAACATTTTCCATCCC
GCTTGGAGTTATCCACAATAGTACATTACAGGTTAGTGATGTCGACAAACTAGTTTGTCG
TGACAAACTGTCATCCACAAATCAATTGAGATCAGTTGGACTGAATCTCGAGGGGAATGG
AGTGGCAACTGACGTGCCATCTGTGACTAAAAGATGGGGCTTCAGGTCCGGTGTCCCACC
AAAGGTGGTCAATTATGAAGCTGGTGAATGGGCTGAAAACTGCTACAATCTTGAAATCAA
AAAACCTGACGGGAGTGAGTGTCTACCAGCAGCGCCAGACGGGATTCGGGGCTTCCCCCG
GTGCCGGTATGTGCACAAAGTATCAGGAACGGGACCATGTGCCGGAGACTTTGCCTTCCA
CAAAGAGGGTGCTTTCTTCCTGTATGATCGACTTGCTTCCACAGTTATCTACCGAGGAAC
GACTTTCGCTGAAGGTGTCGTTGCATTTCTGATACTGCCCCAAGCTAAGAAGGACTTCTT
CAGCTCACACCCCTTGAGAGAGCCGGTCAATGCAACGGAGGACCCGTCGAGTGGCTATTA
TTCTACCACAATTAGATATCAGGCTACCGGTTTTGGAACTAATGAGACAGAGTACTTGTT
CGAGGTTGACAATTTGACCTACGTCCAACTTGAATCAAGATTCA

Now we have a fasta file containing all of our GP gene sequences, ready to be aligned!
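
The layout seen in the `head` output (a `>` header line carrying the id and description, followed by the sequence wrapped at 60 characters per line) can be sketched in plain Python. `format_fasta` is a hypothetical helper written for illustration, not a Biopython function, and the id and sequence are made up.

```python
def format_fasta(seq_id, description, sequence, width=60):
    """Return one fasta record as text (illustrative helper, not part of Biopython)."""
    header = f">{seq_id} {description}".rstrip()
    # Cut the sequence into chunks of at most `width` characters.
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return header + "\n" + "\n".join(chunks) + "\n"


record_text = format_fasta('myid', 'my description', 'ATCG' * 40)
print(record_text)
```

This also shows why the id matters: without it there is nothing to put on the header line that separates one sequence from the next.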

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## How do we convert from one file format into another?

A common requirement is to convert from one file format to another.


### 👉🏼 `SeqIO.convert()` parses, converts, and writes files.

It requires 4 arguments:
1. input_filename
2. in_format
3. output_filename
4. out_format

`SeqIO.convert()` creates a new file at the output path you specify and returns the number of records it has converted.

In [None]:
count = SeqIO.convert('data/PRJNA257197.gb', 'genbank', 'data/output/PRJNA257197.fasta', 'fasta')

In [None]:
count

In [None]:
! head -100 data/output/PRJNA257197.fasta
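
Like `SeqIO.write()`, `SeqIO.convert()` accepts open handles as well as filenames, so a conversion can be tried entirely in memory. The sketch below is a stand-in for the GenBank example: it converts two made-up fasta records into Biopython's simple tab-separated format (`'tab'`), chosen here only because it keeps the example short.

```python
from io import StringIO

from Bio import SeqIO

# Made-up fasta text used as the input "file".
fasta_text = ">rec1 first demo record\nATCG\n>rec2 second demo record\nGGCC\n"

in_handle = StringIO(fasta_text)
out_handle = StringIO()

# Same four arguments as before: input, input format, output, output format.
converted = SeqIO.convert(in_handle, 'fasta', out_handle, 'tab')
print(converted)
print(out_handle.getvalue())
```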