Tutorial 3: Handling Protein Data

This tutorial will guide you through importing protein data from FASTA files and running a homology search from Metingear. It is also possible to link the proteins into an existing reconstruction, which is explained in Tutorial: Handling Genome Data.

Required files

Before you begin the tutorial there are a couple of files to download and some optional configuration steps. Some of these files are large and may take some time to download and configure. We will be integrating data on Bacillus subtilis and require several data files for the basic part of the tutorial.

SLR16.1_prot.txt: FASTA file of Bacillus subtilis proteins - available from SubtiList
gb-2009-10-6-r69-s1.xls: iBsu1103 reconstruction - available from Henry et al. 2009

Optional

uniprot_sprot.xml.gz (800MB): UniProt SwissProt XML used to index cross-references. Download the compressed file to your computer, then from the Resources menu configure the UniProt Cross-references loader to use this file. On an uncompressed file the index creation should take less than ten minutes, but this will depend on your machine.
blast+2.2.28: blast+ suite required for Local Homology; download the appropriate version for your operating system
uniprot_sprot.fasta.gz: SwissProt FASTA sequences, required for Local Homology (alternatively, ftp://ftp.ncbi.nlm.nih.gov/blast/db/swissprot.tar.gz)

## Configuring BLAST in Metingear

In order to search for sequences locally we need to make a database of SwissProt sequences to search.

Install blast+2.2.28 for your operating system and then navigate to, or create, the location where you would like your database.

$ cd /Users/johnmay/db/blast/2.2.28+

Decompress the uniprot_sprot.fasta.gz sequences:

$ gunzip uniprot_sprot.fasta.gz

Then make the database using the makeblastdb command:

$ /Applications/blast/bin/makeblastdb -in uniprot_sprot.fasta -dbtype prot -out uniprot_sprot -parse_seqids -hash_index
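
As a quick sanity check once makeblastdb has finished, the sketch below looks for the three core files that a protein BLAST database normally consists of (.phr, .pin and .psq). The function name check_blast_db is hypothetical, and the check assumes a single-volume database; very large databases may be split into numbered volumes, which this sketch does not handle.

```shell
#!/bin/sh
# check_blast_db DIR NAME
# Hypothetical helper: succeeds only if DIR contains NAME.phr, NAME.pin and
# NAME.psq (the header, index and sequence files of a protein database).
check_blast_db() {
    dir=$1
    name=$2
    for ext in phr pin psq; do
        if [ ! -f "$dir/$name.$ext" ]; then
            echo "missing: $dir/$name.$ext" >&2
            return 1
        fi
    done
    echo "ok: $dir/$name"
}

# Example call (path taken from this tutorial; adjust to your own location):
# check_blast_db /Users/johnmay/db/blast/2.2.28+ uniprot_sprot
```

If a file is reported missing, re-running makeblastdb in the same directory is usually the simplest fix.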

With the database built, we must now configure Metingear with the correct locations. Open up the preferences window using Edit > Preferences (or Metingear > Preferences on OS X) and select the Tools menu.

blast-bin

Enter the location of the blastp executable and the version of BLAST you installed. As I installed BLAST at /Applications/blast/, my path for blastp is in the bin/ directory of this location: /Applications/blast/bin/blastp.

blast-bin

We also need to enter the folder where we built the SwissProt database: /Users/johnmay/db/blast/2.2.28+/

blast-bin

Modifications to the FASTA file

Unfortunately, the original SLR16.1_prot.txt from SubtiList does not have correctly formatted header lines. The FASTA format requires that "the word following the ">" symbol is the identifier"; in the original file, however, the word following the > is just B., for example:

>B. subtilis 168|BG11037|AadK: aminoglycoside 6-adenylyltransferase
MRSEQEMMDIFLDFALNDERIRLVTLEGSRTNRNIPPDNFQDYDISYFVTDVESFKENDQ
WLEIFGKRIMMQKPEDMELFPPELGNWFSYIILFEDGNKLDLTLIPIREAEDYFANNDGL
VKVLLDKDSFINYKVTPNDRQYWIKRPTAREFDDCCNEFWMVSTYVVKGLARNEILFAID
HLNEIVRPNLLRMMAWHIASQKGYSFSMGKNYKFMKRYLSNKEWEELMSTYSVNGYQEMW
KSLFTCYALFRKYSKAVSEGLAYKYPDYDEGITKYTEGIYCSVK
>B. subtilis 168|BG12556|AapA: amino acid permease
MIGNSSKDNFGQQQKLSRGLKNRHIQLMAIGGAIGTGLFLGSGKSIHFAGPSILFAYLIT
GVFCFFIIRSLGELLLSNAGYHSFVDFVRDYLGNMAAFITGWTYWFCWISLAMADLTAVG
IYTQYWLPDVPQWLPGLLALIILLIMNLATVKLFGELEFWFALIKVIAILALIVTGILLI

To fix this the original file was modified with the following command and provided for this tutorial:

$ sed -E 's/^>B\. subtilis 168\|(BG[0-9]+)\|[^:]*: */>\1 /' SLR16.1_prot.txt > SLR16.1_prot.fasta

>BG11037 aminoglycoside 6-adenylyltransferase
MRSEQEMMDIFLDFALNDERIRLVTLEGSRTNRNIPPDNFQDYDISYFVTDVESFKENDQ
WLEIFGKRIMMQKPEDMELFPPELGNWFSYIILFEDGNKLDLTLIPIREAEDYFANNDGL
VKVLLDKDSFINYKVTPNDRQYWIKRPTAREFDDCCNEFWMVSTYVVKGLARNEILFAID
HLNEIVRPNLLRMMAWHIASQKGYSFSMGKNYKFMKRYLSNKEWEELMSTYSVNGYQEMW
KSLFTCYALFRKYSKAVSEGLAYKYPDYDEGITKYTEGIYCSVK
>BG12556 amino acid permease
MIGNSSKDNFGQQQKLSRGLKNRHIQLMAIGGAIGTGLFLGSGKSIHFAGPSILFAYLIT
GVFCFFIIRSLGELLLSNAGYHSFVDFVRDYLGNMAAFITGWTYWFCWISLAMADLTAVG
IYTQYWLPDVPQWLPGLLALIILLIMNLATVKLFGELEFWFALIKVIAILALIVTGILLI

You could also use the mirrored files of SubtiList which have correct FASTA headers - ftp.ebi.ac.uk/pub/databases/SubtiList/
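
Whichever file you use, it may be worth confirming that the first word of every header is a distinct identifier before importing. The following is a minimal sketch; the function names fasta_ids and duplicate_ids are hypothetical and not part of Metingear.

```shell
#!/bin/sh
# fasta_ids FILE: print the identifier (the first word after ">") of every
# record in a FASTA file, one per line.
fasta_ids() {
    awk '/^>/ { print substr($1, 2) }' "$1"
}

# duplicate_ids FILE: print each identifier that occurs more than once.
# For a file with well-formed, unique headers this prints nothing.
duplicate_ids() {
    fasta_ids "$1" | sort | uniq -d
}

# Example calls (file names from this tutorial):
# duplicate_ids SLR16.1_prot.txt     # the unmodified headers all start with "B."
# duplicate_ids SLR16.1_prot.fasta   # expected to print nothing
```

On the unmodified SubtiList file every header yields the same first word, so duplicate_ids reports it; on the corrected file the output should be empty.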

Create a reconstruction

With Metingear open, select the menu item, File > New Reconstruction. In the organism name field start typing bacillus subtilis and click Bacillus subtilis (strain 168) from the drop down menu.

create-recon-form

You may change the reconstruction identifier or leave it as it is (see Creating a Reconstruction). Click create to add a new reconstruction to Metingear.

create-recon-filled

The reconstruction should appear in the side bar.

create-recon-sidebar

Import protein sequences

To import the protein sequences select the menu item File > Import... > Peptides (.fasta).

import-fasta-menu

Select the location of SLR16.1_prot.fasta on your file system and open the file. Each sequence will create a new gene product. The product identifier will be the identifier from the FASTA header whilst the rest of the header will be loaded as the product name. As an example

>BG11037 aminoglycoside 6-adenylyltransferase
MRSEQEMMDIFLDFALNDERIRLVTLEGSRTNRNIPPDNFQDYDISYFVTDVESFKENDQ
...

will create an entity with id BG11037 and the name aminoglycoside 6-adenylyltransferase.
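
To preview how each header will be split before importing, the rule described above can be imitated on the command line. This is only a sketch of that rule, not Metingear's own parser; the function name preview_products is hypothetical.

```shell
#!/bin/sh
# preview_products FILE: for every FASTA header print "identifier<TAB>name",
# where the identifier is the first word after ">" and the name is the rest
# of the header line.
preview_products() {
    awk '/^>/ {
        id = substr($1, 2)                 # first word, without the ">"
        name = $0
        sub(/^>[^ \t]*[ \t]*/, "", name)   # drop identifier and the gap after it
        print id "\t" name
    }' "$1"
}

# Example call (file name from this tutorial):
# preview_products SLR16.1_prot.fasta | head
```

Each output line corresponds to one gene product that the import would create.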

view-gp

With sequences loaded, it's a good idea to save the reconstruction.

Sequence Homology

With one or more protein products selected, Metingear can run a sequence homology search for those sequences. In the gene product view select a couple of sequences in the table and choose the menu item Tools > Sequence Homology.

select-gp

homology-menu

The database selection depends on the location we set earlier, but ensure this is set to SwissProt. Selecting okay here will not run the job but will instead add one or more Tasks to a queue.

homology-dialog

We can view the current tasks by clicking Tasks in the sidebar.

queued-task

From any view we can start the tasks by invoking Run > Run Queued Tasks. Selecting a task in the table will show information about it and the command line being run. You can also check the output file from this information.
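
The exact command Metingear issues is not reproduced in this tutorial, so the following is only a rough sketch of what a comparable manual search might look like. The paths are the ones used earlier in this tutorial; the function name run_homology, the e-value cutoff and the tabular output format are assumptions chosen for illustration.

```shell
#!/bin/sh
# Paths from the configuration section of this tutorial; adjust as needed.
BLASTP=/Applications/blast/bin/blastp
DB=/Users/johnmay/db/blast/2.2.28+/uniprot_sprot

# run_homology QUERY OUT
# Hypothetical wrapper: search the sequences in QUERY (a FASTA file) against
# the local SwissProt database and write tabular hits (-outfmt 6) to OUT.
# The 1e-30 e-value cutoff is an assumed value, not a Metingear default.
run_homology() {
    "$BLASTP" -query "$1" -db "$DB" -evalue 1e-30 -outfmt 6 -out "$2"
}

# Example call (file names are placeholders):
# run_homology selected_proteins.fasta hits.tsv
```

Running a search by hand like this can help when checking whether an unexpected result comes from the database or from the parameters chosen in the dialog.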

run-task

Once the task has finished our proteins will have been updated with several homology hits.

view-alignment

Transfer Annotations

Now that we have some proteins with homology information attached, we can transfer annotations from these homologues. Select the entries in the table and choose the menu item Tools > Transfer Annotations. Currently there are no rules for selecting which hits to transfer, as the assumption is that specific enough parameters were used when creating the search.

view-evidence

When the annotation transfer is complete we can see there are multiple annotations which now have an evidence button. Clicking this button will highlight the sequences which provided the annotation.

Linking to gene products

With information transferred from the homologous sequences, one can now link this information to reactions. The Handling Genome Data tutorial provides an explanation of how this can be achieved.
