Handle arbitrary GTFs and FASTA files, both local and remote #99

tavinathanson · 2015-07-16T03:11:33Z

Summary:

New Genome class replaces EnsemblRelease class.
New GenomeSource class specifies URLs to GTFs and FASTAs. It is extracted into its own class to prevent argument overload in Genome and to provide an easy object to pass around for generating python/console install strings (e.g. pyensembl install --release 77 and pyensembl install --gtf_url blah). EnsemblReleaseSource extends GenomeSource to handle Ensembl URL creation and install strings.
GenomeSource accepts both local file paths and remote URLs. For local paths, datacache is bypassed.
Added logic to gene.py, transcript.py, gtf_parsing.py and genome.py to allow for missing gene names, transcript names and biotypes in GTF files. (I noticed that this can happen with UCSC GTF files from genome.ucsc.edu/cgi-bin/hgTables, such as the one included in this PR as a test.) See this commit: a232115
Added one basic test for a UCSC GTF file.
Modified shell.py to allow for specific URLs in addition to release numbers.
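
To make the local-versus-remote handling concrete, here is a minimal sketch of how a source string might be classified and resolved. The name is_url_format appears later in this thread, but its body here is an assumption, as are resolve_source and the download_fn parameter (which stands in for a caching downloader such as datacache). This is not the code from the PR.

```python
from urllib.parse import urlparse


def is_url_format(path_or_url):
    """Return True if the string looks like a remote URL (assumed logic)."""
    scheme = urlparse(path_or_url).scheme
    return scheme in ("http", "https", "ftp")


def resolve_source(path_or_url, download_fn):
    """Return a local path for either kind of source (hypothetical helper).

    Remote URLs go through download_fn, which stands in for a caching
    downloader; local paths are returned unchanged, with no download step.
    """
    if is_url_format(path_or_url):
        return download_fn(path_or_url)
    return path_or_url
```

Passing the downloader in as an argument keeps the sketch free of assumptions about datacache's actual interface.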

This needs more testing, but I want to get the PR started in case you think the overall strategy is problematic.

Also TODO, either in this PR or a follow-up:

Error gracefully when a transcript/protein sequence is requested, yet no URL was provided.
Error gracefully when a user tries to get a gene name when the DB has no gene names (and other issues like that).
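
A rough sketch of what "erroring gracefully" could look like for the first TODO item. SequenceSketch and its attribute names are hypothetical and only illustrate the idea of checking for a missing FASTA source before attempting a lookup; the real Genome class may be structured differently.

```python
class SequenceSketch(object):
    """Hypothetical stand-in for an object that may lack FASTA data."""

    def __init__(self, gtf_path_or_url, transcript_fasta_path_or_url=None):
        self.gtf_path_or_url = gtf_path_or_url
        self.transcript_fasta_path_or_url = transcript_fasta_path_or_url

    def transcript_sequence(self, transcript_id):
        # Fail early with a readable message instead of a confusing
        # error deeper in the FASTA-handling code.
        if self.transcript_fasta_path_or_url is None:
            raise ValueError(
                "No transcript FASTA was provided for %s, so the sequence "
                "of %s is unavailable" % (self.gtf_path_or_url, transcript_id))
        # The actual sequence lookup is out of scope for this sketch.
        raise NotImplementedError("sequence lookup not sketched here")
```

The same pattern would apply to the second item: check whether the DB has a gene name column before querying it, and raise an error that names the missing column.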

tavinathanson · 2015-07-16T03:20:59Z

Since this view doesn't show what's changed with Genome vs. EnsemblRelease, here's a diff of old EnsemblRelease vs. new Genome: https://gist.github.com/tavinathanson/eb786789a022153d9470

tavinathanson · 2015-07-19T21:10:25Z

@iskandr Another thought: I'm thinking a better API would probably involve saving a collection of URLs/paths as a new "release" in the DB, rather than always requiring each path to be specified. This would also more easily allow topiary to refer to that collection of paths.

For example:

pyensembl add-genome --name "mouse_81" --gtf-path <URL> --transcript-fasta-path <URL>
and then
pyensembl install "mouse_81"
topiary --pyensembl-genome "mouse_81"

mouse_81 could be saved to the database, mapped to those paths.

I think I'll save that for a follow-up PR. Thoughts?

P.S. That PR would be a better place, I think, to address the naming situation that isn't currently all that smart. Namely, two local GTF files with the same file name wouldn't be able to co-exist as different DBs, since the DB is just based on the GTF filename. I'll create issues for all the above unless you think it needs to be addressed in this PR.
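
One possible way around the filename collision, sketched under the assumption that the DB name is derived from the GTF path alone: include a short hash of the absolute path in the name. The function db_name_for_gtf and the naming scheme are hypothetical, not what pyensembl does.

```python
import hashlib
import os


def db_name_for_gtf(gtf_path):
    """Build a DB filename that includes a short hash of the full path.

    Hypothetical scheme: two files named genes.gtf in different
    directories get different DB names.
    """
    abs_path = os.path.abspath(gtf_path)
    digest = hashlib.sha256(abs_path.encode("utf-8")).hexdigest()[:8]
    base = os.path.basename(abs_path)
    return "%s.%s.db" % (base, digest)
```

A user-chosen name such as "mouse_81" would sidestep the problem entirely, which is another argument for the add-genome approach.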

tavinathanson · 2015-07-20T19:01:40Z

pyensembl/gtf_parsing.py

-    if 'exon_id' not in df:
-        logging.info("Creating 'exon_id' column")
-        df = reconstruct_exon_id_column(df)
+    #if 'exon_id' not in df:


Not sure how this ended up here! Will uncomment.

iskandr · 2015-07-20T21:18:50Z

pyensembl/biotypes.py

@@ -244,6 +244,8 @@
    'IG_J_pseudogene',
    'IG_pseudogene',
    'IG_V_pseudogene',
+    # Found in ftp://ftp.ensembl.org/pub/release-81/gtf/mus_musculus
+    'IG_D_pseudogene',


Any ideas for automatically generating this biotypes list? I'm terrified of how perpetually out of date this list will always be.

No great ideas at the moment, but this "fix" concerns me too :\
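
One direction for generating the list automatically would be to scan a GTF's attribute column for biotype values. The sketch below assumes the attributes use the keys gene_biotype and transcript_biotype, as Ensembl GTFs do; other sources may use different keys or omit biotypes. The function name is hypothetical.

```python
import re

# Matches attribute entries such as: gene_biotype "IG_D_pseudogene";
BIOTYPE_PATTERN = re.compile(
    r'(?:gene_biotype|transcript_biotype)\s+"([^"]+)"')


def biotypes_in_gtf_lines(lines):
    """Collect distinct biotype values from the attribute column of GTF lines."""
    found = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        found.update(BIOTYPE_PATTERN.findall(fields[8]))
    return found
```

Running something like this over each supported release and comparing the result with the hard-coded list would at least flag new biotypes such as IG_D_pseudogene.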

iskandr · 2015-07-20T21:32:19Z

OK, here's a thought:

"genome" = "genome annotation" (GTF) + "genome sequence" (FASTA files)

So, we may eventually want to add genome sequences separately from annotations.

I like your idea below:

pyensembl add-genome --name "mouse_81" --gtf-path <URL> --transcript-fasta-path <URL>

but might want to even extend it to:

pyensembl add-genome-sequence "mouse" --transcript-fasta-path <URL>

pyensembl add-genome --name "mouse_81" --gtf-path <URL> --genome-sequence "mouse"

Agreed that we can figure this out in a later PR.

iskandr · 2015-07-21T18:36:13Z

pyensembl/genome.py

+        # genome annotations. Presents access to each feature
+        # annotations as a pandas.DataFrame.
+        self.gtf = GTF(
+            genome_source,


Does GTF need all the info from genome_source? It seems like the FASTA URLs shouldn't get passed here.

Discussed offline: genome_source is also used for the error string. For now, we'll pass in both the path/URL and the genome_source.

tavinathanson · 2015-07-21T20:45:38Z

"genome" = "genome annotation" (GTF) + "genome sequence" (FASTA files) sounds reasonable for talking about these entities for the time being, even though as we've discussed "genome" is still confusing.

Made #100 for add-genome

tavinathanson · 2015-07-21T20:49:38Z

TODO before merging, from offline discussion:

Add the actual path to SequenceData and GTF
Remove fasta_path from GenomeSource
Change path to path_or_url in most places
Move local_fasta_filename_func to EnsemblReleaseSource, and local/remote filename logic to GenomeSource
Rename local/remote to original/cached

tavinathanson · 2015-07-22T17:27:06Z

I believe all code review comments are now addressed.

@iskandr I ended up changing the role of GenomeSource: it now represents a single URL or path. It continues to return install messages, but now does other useful things: it handles the original/cached filename transform (like we discussed), and it also handles all the is_url_format and local copying logic. It's purely internal, now; I added the GTF/FASTA arguments to Genome itself.

It's not perfect, but I think/hope it's moving in the right direction. I won't be able to address further comments until next week. If you think it's ready to merge, feel free!

iskandr · 2015-07-22T19:26:07Z

I'm surprised that a Genome now has multiple GenomeSource objects for each GTF and FASTA file. I'm also surprised that EnsemblReleaseSource survived despite that change in the role of source objects. Still, since you're gone for a week and kept the API backward compatible I'd rather merge this now and then discuss this zoo of objects when you get back.

iskandr · 2015-07-22T19:27:16Z

I tried to run the unit tests and got the following error:

  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/Users/iskander/code/pyensembl/test/test_ucsc_gtf.py", line 38, in test_ucsc_refseq
    eq_(len(genome.genes()), 2)
  File "/Users/iskander/code/pyensembl/pyensembl/common.py", line 56, in wrapped_fn
    value = fn(*args, **kwargs)
  File "/Users/iskander/code/pyensembl/pyensembl/genome.py", line 420, in genes
    gene_ids = self.gene_ids(contig=contig, strand=strand)
  File "/Users/iskander/code/pyensembl/pyensembl/common.py", line 56, in wrapped_fn
    value = fn(*args, **kwargs)
  File "/Users/iskander/code/pyensembl/pyensembl/genome.py", line 566, in gene_ids
    strand=strand)
  File "/Users/iskander/code/pyensembl/pyensembl/genome.py", line 218, in all_feature_values
    return cached_object(pickle_path, compute_fn=run_query)
  File "/Users/iskander/code/pyensembl/pyensembl/compute_cache.py", line 120, in cached_object
    obj = compute_fn()
  File "/Users/iskander/code/pyensembl/pyensembl/genome.py", line 212, in run_query
    strand=strand)
  File "/Users/iskander/code/pyensembl/pyensembl/common.py", line 54, in wrapped_fn
    return cache[cache_key]
  File "/Users/iskander/code/pyensembl/pyensembl/database.py", line 61, in __hash__
    return hash(self.gtf)
  File "/Users/iskander/code/pyensembl/pyensembl/gtf.py", line 57, in __hash__
    return hash(self.gtf_source)
nose.proxy.TypeError: unhashable type: 'GenomeSource'
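
For context on the TypeError above: in Python 3, a class that defines __eq__ without also defining __hash__ has its __hash__ set to None, so hash(obj) fails with "unhashable type". Whether that is exactly what happened in GenomeSource is not confirmed here. The sketch below uses a hypothetical SourceSketch class to show the pairing that keeps such an object usable as a cache key.

```python
class SourceSketch(object):
    """Hypothetical stand-in for a source object holding one path or URL."""

    def __init__(self, path_or_url):
        self.path_or_url = path_or_url

    def __eq__(self, other):
        return (
            isinstance(other, SourceSketch) and
            self.path_or_url == other.path_or_url)

    def __hash__(self):
        # Without this method, Python 3 sets __hash__ to None for any
        # class that defines __eq__, and hash(obj) raises TypeError.
        return hash(self.path_or_url)
```

Python 2 keeps the default identity-based hash in this situation, which would explain why the same code fails differently there.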

iskandr · 2015-07-22T19:30:06Z

Running the unit tests with Python 2.7 I get different errors:

ERROR: test_all_gene_names : Make sure some known gene names such as
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nose-1.3.6-py2.7.egg/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/iskander/code/pyensembl/test/test_common.py", line 35, in new_test_fn
    test_fn(ensembl)
  File "/Users/iskander/code/pyensembl/test/test_gene_names.py", line 23, in test_all_gene_names
    gene_names = ensembl.gene_names()
  File "/Users/iskander/code/pyensembl/pyensembl/common.py", line 56, in wrapped_fn
    value = fn(*args, **kwargs)
  File "/Users/iskander/code/pyensembl/pyensembl/genome.py", line 519, in gene_names
    strand=strand)
  File "/Users/iskander/code/pyensembl/pyensembl/genome.py", line 218, in all_feature_values
    return cached_object(pickle_path, compute_fn=run_query)
  File "/Users/iskander/code/pyensembl/pyensembl/compute_cache.py", line 118, in cached_object
    obj = pickle.load(f)
ValueError: unsupported pickle protocol: 4

iskandr · 2015-07-22T19:38:54Z

Problems fixed, though it seems like the pickling issues may require nuking all .pickle files in the cache.
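
On the pickle error: protocol 4 was introduced in Python 3.4, and Python 2.7 reads only protocols 0 through 2, so a cache written under Python 3 with a newer protocol cannot be loaded under 2.7. A minimal sketch of one way to keep cached files readable from both, assuming the cache is written with plain pickle.dump (the helper names are hypothetical, and this is not necessarily the fix applied in this PR):

```python
import pickle

# Protocol 2 is the highest protocol that Python 2 can read.
PY2_COMPATIBLE_PROTOCOL = 2


def dump_compatible(obj, path):
    """Write obj to path using a protocol both Python 2 and 3 understand."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=PY2_COMPATIBLE_PROTOCOL)


def load_pickle(path):
    """Read a pickled object back from path."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Existing .pickle files written with a newer protocol would still need to be deleted once, as noted above.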

tavinathanson assigned iskandr Jul 16, 2015

tavinathanson reviewed Jul 20, 2015
This was referenced Jul 20, 2015

Allow Varcode to work with mouse data via Genome openvax/varcode#106

Merged

Get Topiary working with mice openvax/topiary#13

Closed

iskandr reviewed Jul 20, 2015
iskandr reviewed Jul 21, 2015
tavinathanson added 8 commits July 22, 2015 13:50

Allow arbitrary GTF/FASTA URLs

aeae052

Add local file support (vs. remote server)

dd2b7b1

Handle missing transcript/protein FASTA data

a24460e

Handle GTFs with missing biotypes and names

18891ac

Fix shell.py to actually work with arbitrary GTFs

8b35626

Fixes for shell.py script

e7b2876

Testing, clean up and version bump

36b5748

Address CR and change the role of GenomeSource

595527b

tavinathanson force-pushed the any_gtf branch from c265c3d to 595527b Compare July 22, 2015 17:52

added dependency on six for py2 vs. py3 pickling, fixed small typo in…

f335174

… hash of GenomeSource

iskandr added a commit that referenced this pull request Jul 22, 2015

Merge pull request #99 from hammerlab/any_gtf

79ea5a8

Handle arbitrary GTFs and FASTA files, both local and remote

iskandr merged commit 79ea5a8 into master Jul 22, 2015

tavinathanson mentioned this pull request Jul 31, 2015

Py2 vs. Py3 pickling #103

Closed

iskandr deleted the any_gtf branch July 28, 2018 00:55
