Dynamic #9
Conversation
Without these keys, the schema structure cannot be created.
```python
the input file may not have a "sample_name" field, so it should be created
'''
if 'sample_name' not in doc:
    doc['sample_name'] = "sample_{}".format(counter.next())
```
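A runnable Python 3 sketch of this fallback (the diff is Python 2, so `counter.next()` becomes `next(counter)`; `counter` is assumed here to be an `itertools.count`):

```python
import itertools

counter = itertools.count()  # assumed: a shared counter, as implied by the diff

def ensure_sample_name(doc):
    """The input file may not have a "sample_name" field, so create one."""
    if 'sample_name' not in doc:
        # Python 3 spelling of the diff's counter.next()
        doc['sample_name'] = "sample_{}".format(next(counter))
    return doc

docs = [ensure_sample_name(d) for d in [{"strain_name": "A/x"}, {"sample_name": "s1"}]]
print([d.get("sample_name") for d in docs])  # ['sample_0', 's1']
```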
I really like this method of formatting unknown samples and sequences.
Looks good. My biggest comment is that we should consider how to format `schema`: should it be a class with its own methods, or a stand-alone dictionary that is operated on by `dataset` functions?
"strain_id": lambda x: x["strain_name"], | ||
"sample_id": lambda x: join_char.join([x["strain_name"], x["sample_name"]]), | ||
"sequence_id": lambda x: join_char.join([x["strain_name"], x["sample_name"], x["sequence_name"]]), | ||
"attribution_id": lambda x: "{}_{}_{}".format( # why does attribution_id use _ not | ?!? |
For consistency, we should only use one delimiter for everything, unless there is a specific reason to mix `|` and `_`. We would just need to make sure that the `id`-cleaning functions don't mistakenly allow those characters, but that should be a check in either case.
👍 - I was following the spec here.
Cool, probably just an artifact of it getting built up over time.
```python
    }
}

self.references = refs
# def build_references_table(self):
```
We can probably just remove this placeholder function.
```python
self.sequences[sequence_id][field] = doc[field]
for t_name, p_key in schema.tables_primary_keys.iteritems():
    try:
        p_key_val = schema.make_primary_key[p_key](doc)
```
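A Python 3 sketch of this loop (`dict.iteritems()` is Python 2 only; the `schema` attributes here are hypothetical stand-ins for the real module):

```python
# Hypothetical stand-ins for the schema module's mappings
tables_primary_keys = {"strains": "strain_id", "samples": "sample_id"}
make_primary_key = {
    "strain_id": lambda x: x["strain_name"],
    "sample_id": lambda x: "|".join([x["strain_name"], x["sample_name"]]),
}

def primary_keys_for(doc):
    keys = {}
    # Python 3: dict.items() replaces Python 2's iteritems()
    for t_name, p_key in tables_primary_keys.items():
        try:
            keys[t_name] = make_primary_key[p_key](doc)
        except KeyError:
            # doc is missing a field this table's key needs; skip the table
            continue
    return keys

print(primary_keys_for({"strain_name": "H3N2"}))  # {'strains': 'H3N2'}
```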
I would argue for making `schema` into a class, or moving its methods to `dataset`.
I think a class works well - something like: each `schema` instance is a collection of data with methods to make primary keys, reshape & merge, but not clean. So a cleaned input (`docs`) is passed to a `schema` instance, where it's (internally) reshaped etc. and any merge conflicts are handled. Then we merge this `schema` instance into a central `schema` instance, again taking advantage of the merge-resolution methods inside `schema`.
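A rough sketch of the class shape being discussed here - every name is hypothetical, and the merge policy (last-write-wins via `dict.update`) is only a placeholder for real conflict handling:

```python
class Schema:
    """Hypothetical sketch: holds per-table records and owns merge resolution."""

    def __init__(self, docs=None):
        self.tables = {}
        for doc in (docs or []):
            self._reshape(doc)

    def _reshape(self, doc):
        # Internally reshape one cleaned doc into a per-table record.
        # A real version would split fields across several tables.
        self._merge_record("strains", doc["strain_name"], doc)

    def _merge_record(self, table, key, record):
        # Merge resolution lives inside the class; last-write-wins here.
        self.tables.setdefault(table, {}).setdefault(key, {}).update(record)

    def merge(self, other):
        # Merge another Schema instance into this (central) one.
        for table, rows in other.tables.items():
            for key, record in rows.items():
                self._merge_record(table, key, record)
        return self

central = Schema()
central.merge(Schema([{"strain_name": "H3N2", "country": "NZ"}]))
central.merge(Schema([{"strain_name": "H3N2", "date": "2016"}]))
print(central.tables["strains"]["H3N2"])
```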
I guess we should give this some thought before committing.
Yeah, that could be something that is handled in a separate PR. I would still put reshaping and merging into `dataset`, but use a `schema` instance as an argument to those functions. That would make the schema itself more atomic for future changes.
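The alternative being suggested - reshape/merge as `dataset`-level functions that take a `schema` instance as an argument - might look like this (all names hypothetical):

```python
def reshape(schema, docs):
    # dataset-level function: mutates the schema it is handed, so the
    # schema object itself stays a plain data container.
    for doc in docs:
        schema.setdefault("strains", {})[doc["strain_name"]] = doc
    return schema

schema = {}  # the "atomic" schema: just data, no behaviour
reshape(schema, [{"strain_name": "H3N2"}])
print(sorted(schema["strains"]))  # ['H3N2']
```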
Agreed on a separate PR - that would allow us to work out the best place to put those methods.
Thanks @barneypotter24 - are you happy for this to be merged?
This looks good - it can be merged.
Hey @barneypotter24

This PR changes some of the reshape / merge / write code from a static approach to a more dynamic approach. I wrote this as it seemed the most logical way of extracting attribution data from a FASTA file, while still maintaining dictionaries of the fields belonging to each table.

There are multiple merges happening (firstly when we reshape into `docs`, secondly going from `docs` into `self.x`), which prompted the comments at the top of `schema.py` - namely, creating a `schema` class which can handle these merges appropriately (as well as reshaping etc.). Take a look and see if you think this approach is a good one.