Dynamic #9
Conversation
Without these keys, the schema structure cannot be created.
```python
the input file may not have a "sample_name" field, so it should be created
'''
if 'sample_name' not in doc:
    doc['sample_name'] = "sample_{}".format(counter.next())
```
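A runnable Python 3 sketch of this fallback (the diff is Python 2, so `counter.next()` becomes `next(counter)`; `counter` is assumed here to be an `itertools.count`):

```python
import itertools

counter = itertools.count()  # assumed: a shared counter, as implied by the diff

def ensure_sample_name(doc):
    """The input file may not have a "sample_name" field, so create one."""
    if 'sample_name' not in doc:
        # Python 3 spelling of the diff's counter.next()
        doc['sample_name'] = "sample_{}".format(next(counter))
    return doc

docs = [ensure_sample_name(d) for d in [{"strain_name": "A/x"}, {"sample_name": "s1"}]]
print([d.get("sample_name") for d in docs])  # ['sample_0', 's1']
```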
I really like this method of formatting unknown samples and sequences.
Looks good. My biggest comment is that we should consider how to format `schema`: should it be a class with its own methods, or a stand-alone dictionary that is operated on by `dataset` functions?
"strain_id": lambda x: x["strain_name"], | ||
"sample_id": lambda x: join_char.join([x["strain_name"], x["sample_name"]]), | ||
"sequence_id": lambda x: join_char.join([x["strain_name"], x["sample_name"], x["sequence_name"]]), | ||
"attribution_id": lambda x: "{}_{}_{}".format( # why does attribution_id use _ not | ?!? |
For consistency, we should only use one delimiter for everything, unless there is a specific reason to mix `|` and `_`. We would just need to make sure that the `id`-cleaning functions don't mistakenly allow those characters, but that should be a check in either case.
👍 - I was following the spec here.
Cool, probably just an artifact of it getting built up over time.
```python
    }
}

self.references = refs
# def build_references_table(self):
```
We can probably just remove this placeholder function.
```python
self.sequences[sequence_id][field] = doc[field]
for t_name, p_key in schema.tables_primary_keys.iteritems():
    try:
        p_key_val = schema.make_primary_key[p_key](doc)
```
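A Python 3 sketch of this loop (`dict.iteritems()` is Python 2 only; the `schema` attributes here are hypothetical stand-ins for the real module):

```python
# Hypothetical stand-ins for the schema module's mappings
tables_primary_keys = {"strains": "strain_id", "samples": "sample_id"}
make_primary_key = {
    "strain_id": lambda x: x["strain_name"],
    "sample_id": lambda x: "|".join([x["strain_name"], x["sample_name"]]),
}

def primary_keys_for(doc):
    keys = {}
    # Python 3: dict.items() replaces Python 2's iteritems()
    for t_name, p_key in tables_primary_keys.items():
        try:
            keys[t_name] = make_primary_key[p_key](doc)
        except KeyError:
            # doc is missing a field this table's key needs; skip the table
            continue
    return keys

print(primary_keys_for({"strain_name": "H3N2"}))  # {'strains': 'H3N2'}
```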
I would argue for making `schema` into a class, or moving its methods to `dataset`.
I think a class works well - something like: each `schema` instance is a collection of data with methods to make primary keys, reshape & merge, but not clean. So a cleaned input (`docs`) is passed to a `schema` instance, where it's (internally) reshaped etc. and any merge conflicts are handled. Then we merge this `schema` instance into a central `schema` instance, again taking advantage of the merge-resolution methods inside `schema`.
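A rough sketch of the class shape being discussed here - every name is hypothetical, and the merge policy (last-write-wins via `dict.update`) is only a placeholder for real conflict handling:

```python
class Schema:
    """Hypothetical sketch: holds per-table records and owns merge resolution."""

    def __init__(self, docs=None):
        self.tables = {}
        for doc in (docs or []):
            self._reshape(doc)

    def _reshape(self, doc):
        # Internally reshape one cleaned doc into a per-table record.
        # A real version would split fields across several tables.
        self._merge_record("strains", doc["strain_name"], doc)

    def _merge_record(self, table, key, record):
        # Merge resolution lives inside the class; last-write-wins here.
        self.tables.setdefault(table, {}).setdefault(key, {}).update(record)

    def merge(self, other):
        # Merge another Schema instance into this (central) one.
        for table, rows in other.tables.items():
            for key, record in rows.items():
                self._merge_record(table, key, record)
        return self

central = Schema()
central.merge(Schema([{"strain_name": "H3N2", "country": "NZ"}]))
central.merge(Schema([{"strain_name": "H3N2", "date": "2016"}]))
print(central.tables["strains"]["H3N2"])
```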
I guess we should give this some thought before committing.
Yeah, that could be something that is handled in a separate PR. I would still put reshaping and merging into `dataset`, but use a `schema` instance as an argument to those functions. That would make the schema itself more atomic for future changes.
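The alternative being suggested - reshape/merge as `dataset`-level functions that take a `schema` instance as an argument - might look like this (all names hypothetical):

```python
def reshape(schema, docs):
    # dataset-level function: mutates the schema it is handed, so the
    # schema object itself stays a plain data container.
    for doc in docs:
        schema.setdefault("strains", {})[doc["strain_name"]] = doc
    return schema

schema = {}  # the "atomic" schema: just data, no behaviour
reshape(schema, [{"strain_name": "H3N2"}])
print(sorted(schema["strains"]))  # ['H3N2']
```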
Agreed on a separate PR - that would allow us to work out the best place to put those methods.
Thanks @barneypotter24 - are you happy for this to be merged?
This looks good - it can be merged.
Hey @barneypotter24

This PR changes some of the reshape / merge / write code from a static approach to a more dynamic approach. I wrote this as it seemed the most logical way of extracting attribution data from a FASTA file, while still maintaining dictionaries of the fields belonging to each table.

There are multiple merges happening (firstly when we reshape into `docs`, secondly going from `docs` into `self.x`), which prompted the comments at the top of `schema.py` - namely, creating a `schema` class which can handle these merges appropriately (as well as reshaping etc.). Take a look and see if you think this approach is a good one.