Converting non-tabular data into tabular data using Python

Integrating data of different formats is particularly challenging given the many different layouts of data. This project involved modifying a file that had non-tabular data displayed in a single column. Each term was separated by a [Term] entry, and fields for each term were separated by a colon, see below:


[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton. [GOC:mcc PMID:10873824 PMID:11389764]
synonym: mitochondrial inheritance EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome. [GOC:ai GOC:vw]
is_a: GO:0007005 ! mitochondrion organization
[Term]
id: GO:0000003
name: reproduction
namespace: biological_process
alt_id: GO:0019952
alt_id: GO:0050876
def: The production of new individuals that contain some portion of genetic material inherited from one or more parent organisms. [GOC:go_curators GOC:isa_complete GOC:jl
ISBN:0198506732]
subset: goslim_agr
subset: goslim_chembl
subset: goslim_flybase_ribbon
subset: goslim_generic
subset: goslim_pir
subset: goslim_plant
synonym: reproductive physiological process EXACT []
xref: Wikipedia:Reproduction

From above, you can see we have three terms (GO:0000001, GO:0000002, and GO:0000003). However, to join these to another dataset in tabular format we need to process this file into tabular format that appears as this:

GO	Name	Namespace	alt_id	Def
GO:0000001	mitochondrion inheritance	biological_process		The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton. [GOC:mcc, PMID:10873824, PMID:11389764]
GO:0000002	mitochondrial genome maintenance	biological_process		The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome. [GOC:ai, GOC:vw]
GO:0000003	reproduction	biological_process	GO:0019952,GO:0050876
GO:0000005	obsolete ribosomal chaperone activity	molecular_function		OBSOLETE. Assists in the correct assembly of ribosomes or ribosomal subunits in vivo, but is not a component of the assembled ribosome when performing its normal biological function. [GOC:jl, PMID:12150913]
GO:0000006	high-affinity zinc transmembrane transporter activity	molecular_function		Enables the transfer of zinc ions (Zn2+) from one side of a membrane to the other, probably powered by proton motive force. In high-affinity transport the transporter is able to bind the solute even if it is only present at very low concentrations. [TC:2.A.5.1.1]
GO:0000007	low-affinity zinc ion transmembrane transporter activity	molecular_function		Enables the transfer of a solute or solutes from one side of a membrane to the other according to the reaction: Zn2+ = Zn2+, probably powered by proton motive force. In low-affinity transport the transporter is able to bind the solute only if it is present at very high concentrations. [GOC:mtg_transport, ISBN:0815340729]

To do this, we will run a python script that will search through the file for particular breakpoints, then grab selected fields and store them in a new table.

Additionally, this revised file will be used to make a joins to the annotation file in my data wrangling with tidyr and dplyr repository.

Background

This original file is a gene ontology definition file downloaded from http://geneontology.org/ in OBO format. The annotation file that this will be joined to is already in tabular format, so we need to run a simple script that will do the conversion needed.

Python script

Fields pulled from the OBO file include the ID, Name, Namespace, alt_id (these are concatenated into a single column - comma separated), and Def.

The script is the file goconversion.py and needs to be downloaded along with the files in the supporting.files folder. Running the script will create a go.obo.txt file that is tab delimited with the fields indicated above.

Citations

Ashburner, et al. (2000, 2000/05/01). Gene Ontology: tool for the unification of biology. Nature Genetics, 25(1), 25-29. https://doi.org/10.1038/75556

Gene Ontology, C. (2021). The Gene Ontology resource: enriching a GOld mine. Nucleic acids research, 49(D1), D325-D334. https://doi.org/10.1093/nar/gkaa1113

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Converting non-tabular data into tabular data using Python

Background

Python script

Citations

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
supporting.files		supporting.files
README.md		README.md
goconversion.py		goconversion.py

patmendoza330/geneontologyconversion

Folders and files

Latest commit

History

Repository files navigation

Converting non-tabular data into tabular data using Python

Background

Python script

Citations

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages