Tutorial 3: Handling Protein Data
This tutorial will guide you through importing protein data from FASTA files and running a homology search from Metingear. It is also possible to link the proteins into an existing reconstruction, which is explained in Tutorial: Handling Genome Data.
Before you begin the tutorial there are a couple of files to download and some optional configuration steps. Some of these files are large and may take some time to download and configure. We will be integrating data on Bacillus subtilis and require several data files for the basic part of the tutorial.
- SLR16.1_prot.txt: FASTA file of Bacillus subtilis proteins, available from SubtiList
- gb-2009-10-6-r69-s1.xls: iBsu1103 reconstruction, available from Henry et al. 2009
Optional
- uniprot_sprot.xml.gz (800MB): UniProt SwissProt XML used to index cross-references. Download the compressed file to your computer, then from the Resources menu configure the UniProt Cross-references loader to use this file. On an uncompressed file the index creation should take less than ten minutes, but this will depend on your machine.
- blast+2.2.28: blast+ suite required for Local Homology; download the appropriate version for your operating system.
- uniprot_sprot.fasta.gz: SwissProt FASTA sequences, required for Local Homology (alternatively ftp://ftp.ncbi.nlm.nih.gov/blast/db/swissprot.tar.gz)
In order to search for sequences locally we need to build a database of SwissProt sequences to search against.
Install blast+2.2.28 for your operating system and then navigate to, or create, the location where you would like your database.
$ cd /Users/johnmay/db/blast/2.2.28+
Unzip the uniprot_sprot.fasta.gz sequences
$ gunzip uniprot_sprot.fasta.gz
Make the database using the makeblastdb command
$ /Applications/blast/bin/makeblastdb -in uniprot_sprot.fasta -dbtype prot -out uniprot_sprot -parse_seqids -hash_index
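Before configuring Metingear it can be worth confirming that the database was actually written. The sketch below is not part of Metingear; blast_db_exists is a hypothetical helper name, and it assumes a single-volume protein database, for which makeblastdb writes .phr, .pin and .psq files next to the chosen output prefix.

```shell
# Hypothetical helper: succeed only if the three core files of a
# single-volume protein BLAST database exist for the given path prefix.
blast_db_exists() {
    prefix=$1
    status=0
    for ext in phr pin psq; do
        if [ ! -f "$prefix.$ext" ]; then
            echo "missing: $prefix.$ext" >&2
            status=1
        fi
    done
    return "$status"
}

# Example usage with the location used in this tutorial:
#   blast_db_exists /Users/johnmay/db/blast/2.2.28+/uniprot_sprot \
#       && echo "database files found"
```

If the blast+ tools are on your path, running blastdbcmd -db uniprot_sprot -info from the database folder should also print a short summary of the database.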
With the database made we must now configure Metingear with the correct locations. Open the preferences window using Edit > Preferences (or Metingear > Preferences on OS X) and select the Tools menu.
Enter the location of the blastp executable and the version of blast you installed. As I installed blast at /Applications/blast/, my path for blastp is in the bin/ directory of this location: /Applications/blast/bin/blastp.
We also need to enter the folder where we built the SwissProt database: /Users/johnmay/db/blast/2.2.28+/
Unfortunately the original SLR16.1_prot.txt from SubtiList does not have correctly formatted header lines. The FASTA format requires that "the word following the ">" symbol is the identifier"; however, in the original file the word following the > is just B.
Example:
>B. subtilis 168|BG11037|AadK: aminoglycoside 6-adenylyltransferase
MRSEQEMMDIFLDFALNDERIRLVTLEGSRTNRNIPPDNFQDYDISYFVTDVESFKENDQ
WLEIFGKRIMMQKPEDMELFPPELGNWFSYIILFEDGNKLDLTLIPIREAEDYFANNDGL
VKVLLDKDSFINYKVTPNDRQYWIKRPTAREFDDCCNEFWMVSTYVVKGLARNEILFAID
HLNEIVRPNLLRMMAWHIASQKGYSFSMGKNYKFMKRYLSNKEWEELMSTYSVNGYQEMW
KSLFTCYALFRKYSKAVSEGLAYKYPDYDEGITKYTEGIYCSVK
>B. subtilis 168|BG12556|AapA: amino acid permease
MIGNSSKDNFGQQQKLSRGLKNRHIQLMAIGGAIGTGLFLGSGKSIHFAGPSILFAYLIT
GVFCFFIIRSLGELLLSNAGYHSFVDFVRDYLGNMAAFITGWTYWFCWISLAMADLTAVG
IYTQYWLPDVPQWLPGLLALIILLIMNLATVKLFGELEFWFALIKVIAILALIVTGILLI
To fix this, the original file was modified with a command along the following lines, and the corrected file is provided for this tutorial:
$ sed -E 's/^>B\. subtilis 168\|(BG[0-9]+)\|[^:]*:/>\1/' SLR16.1_prot.txt > SLR16.1_prot.fasta
The corrected headers start with the identifier:
>BG11037 aminoglycoside 6-adenylyltransferase
MRSEQEMMDIFLDFALNDERIRLVTLEGSRTNRNIPPDNFQDYDISYFVTDVESFKENDQ
WLEIFGKRIMMQKPEDMELFPPELGNWFSYIILFEDGNKLDLTLIPIREAEDYFANNDGL
VKVLLDKDSFINYKVTPNDRQYWIKRPTAREFDDCCNEFWMVSTYVVKGLARNEILFAID
HLNEIVRPNLLRMMAWHIASQKGYSFSMGKNYKFMKRYLSNKEWEELMSTYSVNGYQEMW
KSLFTCYALFRKYSKAVSEGLAYKYPDYDEGITKYTEGIYCSVK
>BG12556 amino acid permease
MIGNSSKDNFGQQQKLSRGLKNRHIQLMAIGGAIGTGLFLGSGKSIHFAGPSILFAYLIT
GVFCFFIIRSLGELLLSNAGYHSFVDFVRDYLGNMAAFITGWTYWFCWISLAMADLTAVG
IYTQYWLPDVPQWLPGLLALIILLIMNLATVKLFGELEFWFALIKVIAILALIVTGILLI
You could also use the mirrored SubtiList files, which have correct FASTA headers: ftp.ebi.ac.uk/pub/databases/SubtiList/
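Whichever file you use, a quick look at the headers can confirm that each record starts with a distinct identifier. The following is a minimal sketch using standard awk, sort and uniq; the function names fasta_ids and fasta_duplicate_ids are made up for this illustration.

```shell
# Print the identifier (first word after ">") of every record in a FASTA file.
fasta_ids() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Print any identifier that occurs more than once.
# No output means every identifier is unique.
fasta_duplicate_ids() {
    fasta_ids "$1" | sort | uniq -d
}

# Example usage, assuming the corrected file is in the current directory:
#   fasta_ids SLR16.1_prot.fasta | head
#   fasta_duplicate_ids SLR16.1_prot.fasta
```

Run against the original SLR16.1_prot.txt, the duplicate check reports B. because every header starts with that word, which is exactly the problem described above.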
With Metingear open, select the menu item File > New Reconstruction. In the organism name field start typing bacillus subtilis and click Bacillus subtilis (strain 168) from the drop-down menu.
You may change the reconstruction identifier or leave it as it is (see Creating a Reconstruction). Click create to add a new reconstruction to Metingear.
The reconstruction should appear in the sidebar.
To import the protein sequences select the menu item File > Import... > Peptides (.fasta).
Select the location of SLR16.1_prot.fasta on your file system and open the file. Each sequence will create a new gene product. The product identifier will be the identifier from the FASTA header, whilst the rest of the header will be loaded as the product name. As an example,
>BG11037 aminoglycoside 6-adenylyltransferase
MRSEQEMMDIFLDFALNDERIRLVTLEGSRTNRNIPPDNFQDYDISYFVTDVESFKENDQ
...
will create an entity with the id BG11037 and the name aminoglycoside 6-adenylyltransferase.
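The identifier/name split described above can be illustrated in plain shell. This is only a sketch of the rule as stated in this tutorial (first word is the identifier, the remainder is the name), not Metingear's own parsing code, and split_fasta_header is a hypothetical name.

```shell
# Split a FASTA header line into identifier and name, separated by a tab.
split_fasta_header() {
    line=${1#>}                     # drop the leading ">"
    id=${line%% *}                  # everything before the first space
    case $line in
        *" "*) name=${line#* } ;;   # everything after the first space
        *)     name= ;;             # header with no description
    esac
    printf '%s\t%s\n' "$id" "$name"
}

# Prints the id BG11037, a tab, then the name
# aminoglycoside 6-adenylyltransferase.
split_fasta_header '>BG11037 aminoglycoside 6-adenylyltransferase'
```

Applied to an unmodified SubtiList header, the same rule yields the id B., which is why the headers need correcting before import.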
With sequences loaded, it's a good idea to save the reconstruction.
With one or more protein products selected, Metingear can run a sequence homology search for those sequences. In the gene product view select a couple of sequences in the table and choose the menu item Tools > Sequence Homology.
The available databases depend on the location we set earlier; ensure the database is set to SwissProt. Selecting okay here will not run the job but will instead add one or more Tasks to a queue.
We can view the current tasks by clicking Tasks in the sidebar.
From any view we can start the tasks by invoking Run > Run Queued Tasks. Selecting a task in the table will show information about it and the command line task being run. You can also check the output file from this information.
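If a task fails, it can help to run a comparable search by hand from a terminal. The sketch below is an assumption about what such a search might look like, not the exact command Metingear issues: the E-value cutoff (1e-30) and the tabular output format (-outfmt 6) are illustrative choices, and run_blastp is a hypothetical helper. BLASTP and BLAST_DB_DIR mirror the two locations entered in the preferences.

```shell
# Locations as configured earlier in this tutorial; override as needed.
BLASTP=${BLASTP:-/Applications/blast/bin/blastp}
BLAST_DB_DIR=${BLAST_DB_DIR:-/Users/johnmay/db/blast/2.2.28+}

# Hypothetical helper: search a query FASTA file against the local
# SwissProt database and write tabular hits to the given output file.
run_blastp() {
    query_file=$1
    hits_file=$2
    "$BLASTP" -query "$query_file" \
              -db "${BLAST_DB_DIR%/}/uniprot_sprot" \
              -evalue 1e-30 \
              -outfmt 6 \
              -out "$hits_file"
}

# Example usage (query.fasta is a placeholder for a file of your own):
#   run_blastp query.fasta hits.tsv
```

Compare the flags with the command shown in the task information; the parameters Metingear actually uses take precedence over the values assumed here.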
Once the task has finished our proteins will have been updated with several homology hits.
Now that we have some proteins with homology information attached, we can transfer annotations from these homologues. Select the entries in the table and choose the menu item Tools > Transfer Annotations. Currently there are no rules for selecting which hits to transfer, as the assumption is that specific enough parameters were used when creating the search.
When the annotation transfer is complete we can see there are multiple annotations which now have an evidence button. Clicking this button will highlight those sequences which provided this annotation.
With information transferred from the homologous sequences, one can now link this information to reactions. The Handling Genome Data tutorial provides an explanation of how this can be achieved.