-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle arbitrary GTFs and FASTA files, both local and remote #99
Conversation
Since this view doesn't show what's changed with |
@iskandr Another thought: I'm thinking a better API would probably involve saving a collection of URLs/paths as a new "release" in the DB, rather than always requiring each path to be specified. This would also more easily allow topiary to refer to that collection of paths. For example:
I think I'll save that for a follow-up PR. Thoughts? P.S. That PR would be a better place, I think, to address the naming situation that isn't currently all that smart. Namely, two local GTF files with the same file name wouldn't be able to co-exist as different DBs, since the DB is just based on the GTF filename. I'll create issues for all the above unless you think it needs to be addressed in this PR. |
if 'exon_id' not in df: | ||
logging.info("Creating 'exon_id' column") | ||
df = reconstruct_exon_id_column(df) | ||
#if 'exon_id' not in df: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how this ended up here! Will uncomment.
@@ -244,6 +244,8 @@ | |||
'IG_J_pseudogene', | |||
'IG_pseudogene', | |||
'IG_V_pseudogene', | |||
# Found in ftp://ftp.ensembl.org/pub/release-81/gtf/mus_musculus | |||
'IG_D_pseudogene', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any ideas for automatically generating this biotypes list? I'm terrified of how perpetually out of date this list will always be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No great ideas at the moment, but this "fix" concerns me too :\
OK, here's a thought: "genome" = "genome annotation" (GTF) + "genome sequence" (FASTA files) So, we may eventually want to add genome sequences separately from annotations. I like your idea below:
but might want to even extend it to:
Agreed that we can figure this out in a later PR. |
# genome annotations. Presents access to each feature | ||
# annotations as a pandas.DataFrame. | ||
self.gtf = GTF( | ||
genome_source, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does GTF need all the info from genome_source
? It seems like the FASTA URLs shouldn't get passed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline: genome_source
is also used for the error string. For now, we'll pass in both the path/URL and the genome_source
.
"genome" = "genome annotation" (GTF) + "genome sequence" (FASTA files) sounds reasonable for talking about these entities for the time being, even though as we've discussed "genome" is still confusing. Made #100 for |
TODO before merging, from offline discussion:
|
I believe all code review comments are now addressed. @iskandr I ended up changing the role of It's not perfect, but I think/hope it's moving in the right direction. I won't be able to address further comments until next week. If you think it's ready to merge, feel free! |
I'm surprised that a |
I tried to run the unit tests and got the following error:
|
Running the unit tests with Python 2.7 I get different errors:
|
… hash of GenomeSource
Problems fixed, though it seems like the pickling issues may require nuking all |
Handle arbitrary GTFs and FASTA files, both local and remote
Summary:
Genome
class replacesEnsemblRelease
class.GenomeSource
class specifies URLs to GTFs and FASTAs, extracted into its own class to prevent argument overload inGenome
and to create an easy object to pass around for generating python/console install strings (e.g.pyensembl install --release 77
andpyensembl install --gtf_url blah
).EnsemblReleaseSource
extendsGenomeSource
to handle Ensembl URL creation and install strings.GenomeSource
accepts both local file paths and remote URLs. If the latter, datacache is bypassed.gene.py
,transcript.py
,gtf_parsing.py
andgenome.py
to allow for missing gene names, transcript names and biotypes in GTF files. (I noticed that this can happen with UCSC GTF files from genome.ucsc.edu/cgi-bin/hgTables, such as the one included in this PR as a test.) See this commit: a232115shell.py
to allow for specific URLs in addition to release numbers.This needs more testing, but I want to get the PR started in case you think the overall strategy is problematic.
Also TODO, either in this PR or a follow-up: