Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic #9

Merged
merged 2 commits into from Dec 6, 2017
Merged

Dynamic #9

merged 2 commits into from Dec 6, 2017

Conversation

jameshadfield
Copy link
Member

Hey @barneypotter24

This PR changes some of the reshape / merge / write code from a static-approach to a more dynamic-approach. I wrote this as it seemed the most logical way of extracting attribution data from a fasta file, while still maintaining dictionaries of the fields belonging to each table.

There are multiple merges happening (firstly when we reshape into docs, secondly going from docs into self.x) which prompted the comments at the top of schema.py - namely, creating a schema class which can handle these merges appropriately (as well as reshaping etc).

Take a look and see if you think this approach is a good one.

the input file may not have a "sample_name" field, so it should be created
'''
if 'sample_name' not in doc:
doc['sample_name'] = "sample_{}".format(counter.next())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this method of formatting unknown samples and sequences.

Copy link
Contributor

@barneypotter24 barneypotter24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, my biggest comment is that we should consider how to format schema. Should it be a class with its own methods or should it be a stand-alone dictionary that is operated on by dataset functions?

"strain_id": lambda x: x["strain_name"],
"sample_id": lambda x: join_char.join([x["strain_name"], x["sample_name"]]),
"sequence_id": lambda x: join_char.join([x["strain_name"], x["sample_name"], x["sequence_name"]]),
"attribution_id": lambda x: "{}_{}_{}".format( # why does attribution_id use _ not | ?!?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, we should only use one delimiter for everything, unless there is a specific reason to mix | and _. We would just need to make sure that in id cleaning functions we don't mistakenly allow those characters, but that should be a check in either case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 - i was following the spec here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, probably just an artifact of it getting built up over time.

} }

self.references = refs
# def build_references_table(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably just remove this placeholder function.

self.sequences[sequence_id][field] = doc[field]
for t_name, p_key in schema.tables_primary_keys.iteritems():
try:
p_key_val = schema.make_primary_key[p_key](doc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue for making schema in to a class, or moving its methods to dataset.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a class works well - something like:
Each schema instance is a collection of data with methods to make primary keys, reshape & merge, but not clean. So a cleaned input (docs) is passed to a schema instance where it's (internally) reshaped etc (and any merge conflicts are handled). Then when we merge this schema instance into a central schema instance again taking advantage of the merge resolution methods inside schema.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we should give this some thought before committing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that could be something that is handled in a separate PR. I would still put reshaping and merging into dataset, but use a schema instance as an argument to those functions. That would make the schema itself more atomic for future changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed on separate PR - that would allow us to work out the best place to put those methods

@jameshadfield
Copy link
Member Author

Thanks @barneypotter24 - are you happy that this is merged?

@barneypotter24
Copy link
Contributor

This looks good—it can be merged.

@jameshadfield jameshadfield merged commit 0f6f680 into master Dec 6, 2017
@jameshadfield jameshadfield deleted the dynamic branch January 19, 2018 00:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants