# Get *C. merolae* proteins
In this notebook, we'll walk through pulling specific protein sequences from the [*C. merolae* genome website](http://czon.jp).

In [3]:
import pandas as pd
from Bio import SeqIO

## Reading in the IDs to retrieve
Next, we need the IDs of the proteins we want to retrieve from the genome's protein fasta. They are in an Excel file that I put in the same directory as this notebook.

In [4]:
enriched = pd.read_excel('CmerolaeCoIP_WistarProteomics.xlsx', sheet_name='141 Enriched proteins')
enriched.head()

FileNotFoundError: [Errno 2] No such file or directory: 'CmerolaeCoIP_WistarProteomics.xlsx'

The IDs that correspond to the *C. merolae* genome IDs are in the `Accession` column:

In [4]:
proteins_to_search = enriched.Accession.tolist()
proteins_to_search[:5]

['CMH170CT', 'CML232CT', 'CMR341CT', 'CMJ081CT', 'CMD011CT']

## Parsing the fasta
Now we use `SeqIO` to read in the fasta iteratively, keeping only the proteins that appear in our accession list.

In [12]:
records = [
    r for r in SeqIO.parse('proteins.fasta', 'fasta')
    if r.id.split('|')[-1] + 'T' in proteins_to_search  # fasta IDs lack the trailing 'T' that the accession list uses
]
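
The matching rule in the cell above can be pulled into a small helper, which makes it easier to check on its own. This is a minimal sketch, assuming the gene ID is the last `|`-separated field of the fasta ID. The function name `fasta_id_to_accession` and the example IDs are hypothetical and not taken from the real file. It also converts the accession list to a set, so each lookup no longer scans the whole list.

```python
def fasta_id_to_accession(record_id, suffix='T'):
    """Map a fasta record ID to the accession format used in the spreadsheet.

    Assumes the gene ID is the last '|'-separated field and that the
    spreadsheet accession is that gene ID plus a trailing 'T'.
    """
    return record_id.split('|')[-1] + suffix


# The first five accessions shown earlier in this notebook
proteins_to_search = ['CMH170CT', 'CML232CT', 'CMR341CT', 'CMJ081CT', 'CMD011CT']
wanted = set(proteins_to_search)  # set membership is O(1) on average

# Hypothetical fasta IDs, for illustration only
example_ids = ['gnl|CMER|CMH170C', 'CMA001C', 'CMD011C']
matches = [i for i in example_ids if fasta_id_to_accession(i) in wanted]
print(matches)
```

With only 141 accessions the list lookup is fast enough either way; the set mainly matters if the accession list grows.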

In [14]:
print(f'{len(records)} of {len(proteins_to_search)} protein sequences have been recovered.')

141 of 141 protein sequences have been recovered.
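
If the two counts ever differ, a set difference shows which accessions had no match. This is a minimal sketch, assuming the ID convention used in the filtering cell (gene ID plus a trailing `T`). The lists here are hypothetical stand-ins for `proteins_to_search` and for the gene IDs of the recovered records.

```python
# Hypothetical stand-ins for the real inputs
proteins_to_search = ['CMH170CT', 'CML232CT', 'CMR341CT']
recovered_ids = ['CMH170C', 'CMR341C']  # e.g. [r.id.split('|')[-1] for r in records]

# Rebuild spreadsheet-style accessions and compare against the search list
recovered = {rid + 'T' for rid in recovered_ids}
missing = sorted(set(proteins_to_search) - recovered)

print(f'{len(missing)} accession(s) not found: {missing}')
```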


## Saving out the fasta
Now we just need to save our results:

In [16]:
SeqIO.write(records, 'coIP_proteins.fasta', 'fasta')

141

One thing to note: I've written the records out with the IDs that came from the genome fasta, so they also lack the trailing T. That can be dealt with later on if need be.
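
If the trailing T is needed later, the headers can be rewritten after the fact. The following is a rough sketch in plain Python rather than Biopython, assuming the first whitespace-delimited token of each header line is the ID. The function name `add_suffix_to_headers` and the sample records are hypothetical.

```python
def add_suffix_to_headers(fasta_text, suffix='T'):
    """Append `suffix` to the ID (first token) of every fasta header line."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith('>'):
            # Split the header into the ID and the optional description
            parts = line[1:].split(None, 1)
            # Skip IDs that already end with the suffix so re-running is safe
            # (assumption: no original ID ends with the suffix on its own)
            if parts and not parts[0].endswith(suffix):
                parts[0] += suffix
            line = '>' + ' '.join(parts)
        out.append(line)
    return '\n'.join(out) + '\n'


# Hypothetical records, for illustration only
sample = '>CMH170C hypothetical protein\nMKV\n>CMD011C\nMAA\n'
print(add_suffix_to_headers(sample), end='')
```

The same effect should be achievable in Biopython by updating each record's `id` before calling `SeqIO.write`, which avoids reparsing the file.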