
Tutorial 2: Handling Genome Data

johnmay edited this page Apr 10, 2013 · 23 revisions

This tutorial will guide you through importing genome data and linking it to metabolic and reaction information. We will be working on Bacillus subtilis (strain 168).

Required files

Before you begin the tutorial there are a couple of files to download and some optional configuration steps. Some of these files are large and may take some time to download and configure. We will be integrating data on Bacillus subtilis and require several data files for the basic part of the tutorial.

Optional (required for linking reactions)

  • uniprot_sprot.xml.gz (800MB): UniProt SwissProt XML used to index cross-references. Download the compressed file to your computer, then from the Resources menu configure the UniProt Cross-references loader to use this file. On an uncompressed file the index creation should take less than ten minutes, but this will depend on your machine.
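
To give a feel for what indexing the cross-references involves, here is a minimal Python sketch that streams a gzipped UniProt-style XML file and maps each accession to its EC numbers. It is not Metingear's loader. The namespace, the `dbReference type="EC"` layout and the tiny `SAMPLE_XML` stand-in are assumptions made for illustration, so check them against the real file before relying on them.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Namespace assumed for UniProt XML entries; verify against your download.
NS = "{http://uniprot.org/uniprot}"

# Made-up stand-in for uniprot_sprot.xml.gz so the sketch runs anywhere.
SAMPLE_XML = b"""<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>P00001</accession>
    <dbReference type="EC" id="1.1.1.1"/>
    <dbReference type="GO" id="GO:0000001"/>
  </entry>
  <entry>
    <accession>P00002</accession>
    <accession>Q00002</accession>
    <dbReference type="EC" id="2.7.1.1"/>
    <dbReference type="EC" id="2.7.1.2"/>
  </entry>
</uniprot>"""


def build_ec_index(stream):
    """Map every accession found in the stream to its list of EC numbers."""
    index = {}
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != NS + "entry":
            continue
        ecs = [ref.get("id") for ref in elem.iter(NS + "dbReference")
               if ref.get("type") == "EC"]
        for acc in elem.findall(NS + "accession"):
            index[acc.text] = ecs
        elem.clear()  # release the entry's children; the real file is large
    return index


# Compress the sample in memory, then read it back as if it were the download.
compressed = io.BytesIO(gzip.compress(SAMPLE_XML))
with gzip.open(compressed, "rb") as handle:
    ec_index = build_ec_index(handle)

print(ec_index["Q00002"])
```

Streaming with `iterparse` keeps memory use roughly constant, which matters for a file of this size.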

Limitations

  • genes cannot be created individually - internally each gene must be associated with a chromosome, which complicates creation of individual genes
  • currently only complete single-chromosome genomes can be imported from the European Nucleotide Archive (ENA) - this is due to a limitation in how the format is parsed internally
  • some entities will be omitted from ENA import because there is no equivalent value in a Metingear reconstruction - known entities which will be skipped: misc feature, misc RNA
  • genome data cannot currently be exported

What gets imported from ENA

The following outlines which values are imported from ENA and the XML attribute used for each. If possible, each gene product will be associated with its encoding gene.

  • Gene
    • id - automatically generated
    • abbreviation - blank
    • name - locus_tag
    • start, end - loaded and points to the chromosome sequence (also loaded) to give the sequence of the gene
  • Gene Product (tRNA, rRNA and proteins)
    • id - protein_id
    • abbreviation - locus_tag
    • name - product
    • sequence - translation
    • Note annotation - note
    • Cross-reference annotation
      • UniProt
      • InterPro - InterPro, protein sequence analysis & classification
      • GOA - Gene Ontology Annotation (UniProt-GOA) Database
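
The mapping above can be sketched in Python as follows. This is only an illustration, not Metingear's importer. The XML element and attribute names in `SAMPLE` are an assumption about a simplified ENA-like layout, the record itself is made up, and only plain `start..end` locations are handled (no `complement(...)` or `join(...)`).

```python
import xml.etree.ElementTree as ET

# Made-up, heavily simplified entry; the layout is an assumption, not a real record.
SAMPLE = """<entry accession="X00001">
  <feature name="CDS" location="5..13">
    <xref db="UniProtKB/Swiss-Prot" id="P00001"/>
    <qualifier name="locus_tag"><value>BSU00010</value></qualifier>
    <qualifier name="protein_id"><value>CAB00001.1</value></qualifier>
    <qualifier name="product"><value>example protein</value></qualifier>
    <qualifier name="translation"><value>MKL</value></qualifier>
  </feature>
  <feature name="misc_feature" location="20..25"/>
  <sequence>ccccatgaaactgtaaccccccccc</sequence>
</entry>"""

# Feature types that yield a gene product in this sketch; others are skipped.
PRODUCT_FEATURES = {"CDS", "tRNA", "rRNA"}


def qualifiers(feature):
    """Collect a feature's qualifiers as a name -> value dictionary."""
    return {q.get("name"): (q.findtext("value") or "").strip()
            for q in feature.findall("qualifier")}


def load_entry(xml_text):
    """Return (genes, products) built from one simplified entry."""
    entry = ET.fromstring(xml_text)
    chromosome = (entry.findtext("sequence") or "").strip()
    genes, products = [], []
    for feature in entry.findall("feature"):
        if feature.get("name") not in PRODUCT_FEATURES:
            continue
        q = qualifiers(feature)
        start, end = (int(pos) for pos in feature.get("location").split(".."))
        gene = {
            "id": f"gene-{len(genes) + 1}",           # automatically generated
            "abbreviation": "",                        # left blank
            "name": q.get("locus_tag", ""),
            "start": start,
            "end": end,
            "sequence": chromosome[start - 1:end],     # 1-based, inclusive
        }
        product = {
            "id": q.get("protein_id", ""),
            "abbreviation": q.get("locus_tag", ""),
            "name": q.get("product", ""),
            "sequence": q.get("translation", ""),
            "note": q.get("note", ""),
            "xrefs": [(x.get("db"), x.get("id")) for x in feature.findall("xref")],
            "gene": gene["id"],                        # link to the encoding gene
        }
        genes.append(gene)
        products.append(product)
    return genes, products


genes, products = load_entry(SAMPLE)
print(genes[0]["name"], genes[0]["sequence"], products[0]["id"])
```

The gene stores only its coordinates and reads its sequence from the chromosome, which mirrors the reason given under Limitations for why a gene cannot exist without a chromosome.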

Complete Genomes

The Genomes Page lists completed genomes and links to their sequence pages. To import a genome, open Genomes Page - Bacteria, locate the sequence/html in the table and navigate to that page.

ena-genomes-table

On the page you can download the XML from this link.

ena-genome-entry

Create a reconstruction

With Metingear open, select the menu item, File > New Reconstruction. In the organism name field start typing bacillus subtilis and click Bacillus subtilis (strain 168) from the drop down menu.

create-recon-form

You may change the reconstruction identifier or leave it as it is (see Creating a Reconstruction). Click create to add a new reconstruction to Metingear.

create-recon-filled

The reconstruction should appear in the side bar.

create-recon-sidebar

Importing the ENA XML file

With the newly created reconstruction, navigate to and select the menu item File > Import ENA Genome.

file-import-genome

The file will begin importing.

file-import-dial

When the import is finished there will be 4457 genes and 4371 gene products loaded. There will also be several warnings relating to the aforementioned limitations; these can be closed by clicking the cross to the left of each message.

If everything was successful, it is a good idea to save the current state of the reconstruction before we add the other data. Select File > Save As or File > Save (home directory by default) to save the reconstruction.

Importing reactions and metabolites

With the active reconstruction, select the menu item File > Import Excel.

import-excel-menu

Choose the location where you downloaded gb-2009-10-6-r69-s1.xls. With the location selected, a wizard dialog will prompt you for the sheets that contain the reactions and metabolites. If the reactions are spread across multiple sheets, for example a separate table for exchange reactions, you can import these later by rerunning the wizard on a different sheet. Ensure the reactions and metabolites selections are Table S2 and Table S1 respectively and press next.

import-excel-sheets

We now need to configure the metabolites table to indicate the location of each required column. Configure the dialog settings:

  • Data starts: 1
  • Data end: 1140
  • Identifier/Abbreviation: A
  • Name: B
  • Charge: G
  • Molecular Formula: C
  • KEGG cross-reference: D
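
As a rough illustration of what this configuration expresses, the Python sketch below turns column letters into indices and reads a row range from an in-memory table. The function names, field names and sample rows are hypothetical, and treating the row numbers as 1-based and inclusive is an assumption about the dialog, not documented behaviour.

```python
from string import ascii_uppercase

# Column letters copied from the dialog settings above; field names are hypothetical.
METABOLITE_COLUMNS = {
    "identifier": "A",
    "name": "B",
    "formula": "C",
    "kegg": "D",
    "charge": "G",
}


def column_index(letter):
    """Convert a spreadsheet column letter such as 'A' or 'AB' to a zero-based index."""
    index = 0
    for char in letter.upper():
        index = index * 26 + (ascii_uppercase.index(char) + 1)
    return index - 1


def read_rows(rows, first, last, columns):
    """Return one dict per row in the 1-based, inclusive range [first, last]."""
    return [
        {field: row[column_index(letter)] for field, letter in columns.items()}
        for row in rows[first - 1:last]
    ]


# Illustrative rows only; they are not copied from Table S1.
rows = [
    ["cpd00001", "H2O", "H2O", "C00001", "", "", "0"],
    ["cpd00009", "Phosphate", "HO4P", "C00009", "", "", "-2"],
]

metabolites = read_rows(rows, 1, 2, METABOLITE_COLUMNS)
print(metabolites[1]["identifier"], metabolites[1]["charge"])
```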

You can read more about the configuration here.

import-excel-metabolites

When you have configured the dialog, go to the next page to configure the reaction import. The reaction fields should be set to:

  • Start row: 2
  • End row: 1437
  • Identifier/Abbreviation: A
  • Name: B
  • Reaction Equation: D - we could also choose C, which would use metabolite names to identify metabolites (in that case also select the name as the identifier in the metabolite sheet), but names are more ambiguous and in this case will not import properly.
  • Classification: E
  • Subsystem/Reaction Type: G

import-excel-reactions

With the metabolites and reactions configured we click next (twice), and then okay to begin the import. If you have ChEBI Names loaded as a local index (see Resources) then the metabolites will automatically be referenced to ChEBI (in addition to the existing KEGG annotations).

import-wizzard-run

When the import is done you should have 1137 metabolites and 1436 reactions. In this particular case there will be some warnings when the import completes, saying that no information could be found for the metabolite cpd00498. This is due to an error in the input.

import-wizzard-done
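
One way such a warning can arise is a reaction equation that refers to a metabolite identifier missing from the metabolite table. The Python sketch below shows that kind of check. The equation syntax, the `cpd` plus five digits pattern and the sample data are assumptions for illustration; the actual cause of the cpd00498 warning is not detailed here.

```python
import re

# Assumed identifier pattern: "cpd" followed by five digits.
COMPOUND_ID = re.compile(r"cpd\d{5}")


def unresolved_metabolites(equations, known_ids):
    """Return the sorted identifiers used in equations but absent from known_ids."""
    missing = set()
    for equation in equations:
        for compound in COMPOUND_ID.findall(equation):
            if compound not in known_ids:
                missing.add(compound)
    return sorted(missing)


# Made-up data: two known metabolites and one equation that uses a third.
known = {"cpd00001", "cpd00002"}
equations = [
    "cpd00001 + cpd00002 <=> cpd00498",
    "cpd00002 => cpd00001",
]

missing = unresolved_metabolites(equations, known)
print(missing)
```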

Linking reactions to gene products

We now have data imported for genes, gene products, reactions and metabolites. We can navigate between gene products and their encoding genes, and between metabolites and the reactions they participate in, but the reactions are not yet linked to any gene products. There are several ways to achieve this. One easy way is to link by Enzyme Commission (EC) number from the enzyme nomenclature. We can also link by locus: if the reactions have locus tags, we can match them to the locus tags in gene products or to the gene product id (if it is a locus id). Before we can link the reactions to gene products, these annotations need to be present.

We can add EC numbers to gene products:

  • manually - Edit > Add Annotation and select cross-reference
  • sequence homology - Tools > Sequence Homology and Tools > Transfer Annotations (see Tutorial: Handling Protein Data)
  • expanding cross-references - if we have some annotations on our gene products already we can expand out these references (i.e. transfer annotations)

You may have noticed we have UniProt annotations available on many gene products. We can expand out these annotations using the tool Tools > Annotation > Expand UniProt Annotations (see also Tools/Expand UniProt Annotations).

Select the gene product view in Metingear.

view-gp

In this view press ctrl-A (⌘-A on OS X) to select all products.

selecta-all

With the selection active, choose Tools > Annotation > Expand UniProt Annotations. There are no configuration options. If the menu item is not available make sure you have the SwissProt cross-references loaded (see Resources).

exp-run

Sorting by enzyme classification, we can see that around 600 products now have an EC number.

sort-ec
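
Conceptually, expanding the annotations follows each product's UniProt cross-reference into the local index and copies over what it finds. The Python sketch below shows this for EC numbers only. The dictionary shapes, the `expand_uniprot` name and the sample accessions are hypothetical and do not reflect Metingear's internal model.

```python
def expand_uniprot(products, ec_index):
    """Add EC numbers reachable through each product's UniProt cross-references."""
    for product in products:
        ecs = set(product.get("ec", []))
        for accession in product.get("uniprot", []):
            ecs.update(ec_index.get(accession, []))
        product["ec"] = sorted(ecs)
    return products


# Made-up index (accession -> EC numbers) and gene products.
ec_index = {"P00001": ["1.1.1.1"], "P00002": ["2.7.1.1", "2.7.1.2"]}
products = [
    {"id": "prot-1", "uniprot": ["P00001"]},
    {"id": "prot-2", "uniprot": []},
    {"id": "prot-3", "uniprot": ["P00002"], "ec": ["2.7.1.1"]},
]

expand_uniprot(products, ec_index)
annotated = sum(1 for product in products if product["ec"])
print(annotated)
```

Products without a UniProt cross-reference keep an empty EC list, which is why only part of the gene products end up with an EC number.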

With the EC numbers annotated we can now link these to reactions. Select the menu item Tools > Associate > Reactions to Gene Products (see also Tools/Associate, Reactions to Gene Products).

asc-menu

In the dialog, select E.C. for both reactions and gene products and press okay.

asc-dialog

The products with EC numbers will now be updated and associated with reactions.

asc-view
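
The association step amounts to a join on EC number. The Python sketch below is a minimal version of that idea under stated assumptions: the record shapes and identifiers are made up, only exact EC matches are considered, and how Metingear treats partial EC numbers is not covered.

```python
from collections import defaultdict


def associate_by_ec(reactions, products):
    """Map each reaction id to the sorted ids of products sharing an EC number."""
    by_ec = defaultdict(list)
    for product in products:
        for ec in product["ec"]:
            by_ec[ec].append(product["id"])

    links = {}
    for reaction in reactions:
        matched = set()
        for ec in reaction["ec"]:
            matched.update(by_ec.get(ec, []))
        links[reaction["id"]] = sorted(matched)
    return links


# Made-up reactions and gene products.
reactions = [
    {"id": "rxn-1", "ec": ["1.1.1.1"]},
    {"id": "rxn-2", "ec": ["9.9.9.9"]},
]
products = [
    {"id": "prot-1", "ec": ["1.1.1.1"]},
    {"id": "prot-2", "ec": ["1.1.1.1", "2.7.1.1"]},
]

links = associate_by_ec(reactions, products)
print(links)
```

Building the EC lookup once and then scanning the reactions avoids comparing every reaction against every product.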

Export

Currently not available; see Limitations.