
Allow output types #28

Merged
merged 24 commits into from
Mar 27, 2017

Conversation

jhprinz
Contributor

@jhprinz jhprinz commented Mar 22, 2017

This implements all of the discussion in #23 .

You can now create arbitrary combinations of selection/stride outputs for your engine, and run, extend, etc.

The only missing point is that PyEMMA will need either a reduced PDB or a selection string to work with subsets of atoms rather than the full atom set (my PyEMMA script computed backbone angles, which is difficult without a topology).

This is pretty neat now. There is also intelligent handling of frame numbers, etc.

More tomorrow...

New features

  • output types: An engine has output_types that you can add. These contain information about striding and selections (atom subsets). You can have an arbitrary set of output_types; typically you would have a master output with the full selection and some stride, plus a subset such as protein at the native stride:

    engine.add_output_type('master', 'master.dcd', stride=10)
    engine.add_output_type('protein', 'protein.dcd', stride=1, selection='protein')
    
  • Trajectory objects now require an engine property. I first thought to set this when you actually run a trajectory, but it makes sense to set it upon creation. The engine contains information about the topology and the output types, so the trajectory is useless without this information. It also means that you can create the task directly from the trajectory; there are .run and .extend methods for that now.

    task = project.new_trajectory(pdb_file, 100, engine).run()
    
  • The engine now has two commands, .run() and .extend(), instead of the long names for generating tasks. These are the same for all engines.

  • pyemma feature support: This is tricky. I added a way to express PyEMMA features: you convert the calls of featurizer.add_[something](arg1, arg2, ...) into a dict like {'add_[something]': [arg1, arg2]}, where the args can again be calls to the featurizer object. This allows basic featurizer construction. If you really need something fancy you have to write your own Analysis class.
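To illustrate the idea, here is a toy recorder that turns featurizer-style calls into that dict form (an illustrative sketch only, not the AdaptiveMD implementation; here one dict per call, collected in a list):

```python
class FeatureSpec(object):
    """Toy recorder: intercepts add_* calls and stores them as dicts."""

    def __init__(self):
        self.spec = []

    def __getattr__(self, name):
        # only featurizer-style add_* methods are recorded
        if not name.startswith('add_'):
            raise AttributeError(name)

        def record(*args):
            # store the call as {'add_something': [arg1, arg2, ...]}
            self.spec.append({name: list(args)})
            return self

        return record


f = FeatureSpec()
f.add_backbone_torsions()
f.add_distances([[0, 10], [2, 20]])

print(f.spec)
# [{'add_backbone_torsions': []}, {'add_distances': [[[0, 10], [2, 20]]]}]
```

Such a plain-dict specification can be serialized to the database and replayed against a real featurizer on the worker side.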

@jhprinz
Contributor Author

jhprinz commented Mar 22, 2017

I realized that some PR closed #23. So let's continue here.

@jhprinz
Contributor Author

jhprinz commented Mar 22, 2017

I still need to update the examples. But after that we are good to go and have all the features that we wanted.

@franknoe
Collaborator

franknoe commented Mar 22, 2017 via email

@jhprinz jhprinz changed the title [WIP] Allow output types Allow output types Mar 22, 2017
@jhprinz
Contributor Author

jhprinz commented Mar 22, 2017

So, examples are up. Please have a look! @nsplattner @thempel @franknoe

I think this is much more powerful now. I will update the docs some more and work on a decent webpage.

@franknoe
Collaborator

franknoe commented Mar 22, 2017 via email

@jhprinz jhprinz mentioned this pull request Mar 22, 2017
@thempel
Member

thempel commented Mar 23, 2017

I just tested this PR as described in the tutorial updated in #34. The following happens when I add the engine to the project generators. Did I miss something?

>>> project.generators.add(engine)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-71-d101ab5a5a33> in <module>()
----> 1 project.generators.add(engine)

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/bundle.pyc in add(self, item)
    321         if self._set is not None and item not in self._set:
    322             logger.info('Added file of type `%s`' % item.__class__.__name__)
--> 323             self._set.save(item)
    324 
    325     @property

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/object.pyc in save(self, obj)
    703 
    704         try:
--> 705             self._save(obj)
    706             self.cache[uuid] = obj
    707 

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/object.pyc in _save(self, obj)
    488 
    489     def _save(self, obj):
--> 490         dct = self.storage.simplifier.to_simple_dict(obj)
    491         self._document.insert(dct)
    492         obj.__store__ = self

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/dictify.pyc in to_simple_dict(self, obj, base_type)
    524             '_cls': obj.__class__.__name__,
    525             '_obj_uuid': str(UUID(int=obj.__uuid__)),
--> 526             '_dict': self.simplify(obj.to_dict(), base_type),
    527             '_id': str(UUID(int=obj.__uuid__)),
    528             '_time': int(obj.__time__)}

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/dictify.pyc in simplify(self, obj, base_type)
    557                         '_store': store.name}
    558 
--> 559         return super(UUIDObjectJSON, self).simplify(obj, base_type)
    560 
    561     def build(self, obj):

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/dictify.pyc in simplify(self, obj, base_type)
    166             else:
    167                 result = {
--> 168                     key: self.simplify(o) for key, o in obj.iteritems()
    169                     if key not in self.excluded_keys
    170                 }

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/dictify.pyc in <dictcomp>((key, o))
    167                 result = {
    168                     key: self.simplify(o) for key, o in obj.iteritems()
--> 169                     if key not in self.excluded_keys
    170                 }
    171 

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/dictify.pyc in simplify(self, obj, base_type)
    557                         '_store': store.name}
    558 
--> 559         return super(UUIDObjectJSON, self).simplify(obj, base_type)
    560 
    561     def build(self, obj):

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/dictify.pyc in simplify(self, obj, base_type)
    145                 return None
    146         elif type(obj) is list:
--> 147             return [self.simplify(o, base_type) for o in obj]
    148         elif type(obj) is tuple:
    149             return {'_tuple': [self.simplify(o, base_type) for o in obj]}

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/dictify.pyc in simplify(self, obj, base_type)
    557                         '_store': store.name}
    558 
--> 559         return super(UUIDObjectJSON, self).simplify(obj, base_type)
    560 
    561     def build(self, obj):

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/dictify.pyc in simplify(self, obj, base_type)
    125                         '_cls': obj.__class__.__name__,
    126                         '_obj_uuid': str(UUID(int=obj.__uuid__)),
--> 127                         '_dict': self.simplify(obj.to_dict(), base_type)}
    128                 else:
    129                     return {

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/dictify.pyc in simplify(self, obj, base_type)
    557                         '_store': store.name}
    558 
--> 559         return super(UUIDObjectJSON, self).simplify(obj, base_type)
    560 
    561     def build(self, obj):

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/dictify.pyc in simplify(self, obj, base_type)
    166             else:
    167                 result = {
--> 168                     key: self.simplify(o) for key, o in obj.iteritems()
    169                     if key not in self.excluded_keys
    170                 }

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/dictify.pyc in <dictcomp>((key, o))
    167                 result = {
    168                     key: self.simplify(o) for key, o in obj.iteritems()
--> 169                     if key not in self.excluded_keys
    170                 }
    171 

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/dictify.pyc in simplify(self, obj, base_type)
    552                 if not obj._ignore:
    553                     store = self.storage._obj_store[obj.__class__]
--> 554                     store.save(obj)
    555                     return {
    556                         '_hex_uuid': hex(obj.__uuid__),

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/object.pyc in save(self, obj)
    703 
    704         try:
--> 705             self._save(obj)
    706             self.cache[uuid] = obj
    707 

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/object.pyc in _save(self, obj)
    488 
    489     def _save(self, obj):
--> 490         dct = self.storage.simplifier.to_simple_dict(obj)
    491         self._document.insert(dct)
    492         obj.__store__ = self

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/dictify.pyc in to_simple_dict(self, obj, base_type)
    524             '_cls': obj.__class__.__name__,
    525             '_obj_uuid': str(UUID(int=obj.__uuid__)),
--> 526             '_dict': self.simplify(obj.to_dict(), base_type),
    527             '_id': str(UUID(int=obj.__uuid__)),
    528             '_time': int(obj.__time__)}

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/file.pyc in to_dict(self)
    368     def to_dict(self):
    369         ret = super(File, self).to_dict()
--> 370         if self._file:
    371             ret['_file_'] = base64.b64encode(self._file)
    372 

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/syncvar.pyc in __get__(self, instance, owner)
     38             if instance.__store__ is not None:
     39                 idx = self._idx(instance)
---> 40                 value = self._update(instance.__store__, idx)
     41                 self.values[instance] = value
     42                 return value

/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/syncvar.pyc in _update(self, store, idx)
     23         if store is not None:
     24             return store._document.find_one(
---> 25                 {'_id': idx}).get(self.name)
     26 
     27         return None

AttributeError: 'NoneType' object has no attribute 'get'

@jhprinz
Contributor Author

jhprinz commented Mar 23, 2017

It looks like you did not delete the project before restarting. Some of the internals have changed. Try

Project.delete(proj_name)

If that does not help could you post your engine definition?

What the message says is that you are trying to access an attribute of an object that is marked as stored but does not exist in the DB. That can happen if you reuse a PDB file, e.g. after deleting the project.

@thempel
Member

thempel commented Mar 23, 2017

Ahh, thanks, I tried this but probably mixed something up. Now this error is resolved, but followed by another one:
DocumentTooLarge: BSON document too large (20205194 bytes) - the connected server supports BSON document sizes up to 16777216 bytes.
I'm using the same system as before and never had problems loading it into the DB. It looks like it is loading everything twice.

engine.items()

[('pdb_file_stage', 'init_adaptive.pdb'),
 ('integrator_file', 'integrator.xml'),
 ('_executable_file', 'openmmrun.py'),
 ('system_file_stage', 'system.xml'),
 ('pdb_file', 'init_adaptive.pdb'),
 ('integrator_file_stage', 'integrator.xml'),
 ('_executable_file_stage', 'openmmrun.py'),
 ('system_file', 'system.xml')]

@jhprinz
Contributor Author

jhprinz commented Mar 23, 2017

well, strange... let me see...

@jhprinz
Contributor Author

jhprinz commented Mar 23, 2017

All files were stored twice before, but only the ones without _stage have content. Could you check that?

for k, v in engine.items():
    print len(v._file) if v._file is not None else 0

This works fine for me. So, I suspect that there is something else getting large.

@jhprinz
Contributor Author

jhprinz commented Mar 23, 2017

Can you compare the file sizes with the original files? Just to make sure there is no overhead?

@thempel
Member

thempel commented Mar 23, 2017

This seems to work and also the files on disc show the same number of characters. They have a total size of 11.5 M on disc, so it should be fine.

>>> for k, v in engine.items():
>>>    print v.short, len(v._file) if v._file is not None else 0

staging:///init_adaptive.pdb 0
file://{}/integrator.xml 117
file://{}/openmmrun.py 8828
staging:///system.xml 0
file://{}/init_adaptive.pdb 2204265
staging:///integrator.xml 0
staging:///openmmrun.py 0
file://{}/system.xml 8659243

@thempel
Member

thempel commented Mar 23, 2017

Just scrolled through the above files in my notebook; their content seems fine. Is there anything else being copied?

@thempel thempel mentioned this pull request Mar 23, 2017
@jhprinz
Contributor Author

jhprinz commented Mar 23, 2017

That is the question. When exactly did this error happen? I assume you ran the setup from the top with the PDB, system.xml, etc., and then got the error when storing the engine? So it cannot have been caused by some other files, right? There are no other files present.

@jhprinz jhprinz mentioned this pull request Mar 23, 2017
@jhprinz
Contributor Author

jhprinz commented Mar 23, 2017

Found the bug/storage inefficiency. The file is really stored twice, which is definitely not intended. I will issue a quick fix. Still, we should make it use the new storage option.

@jhprinz
Contributor Author

jhprinz commented Mar 23, 2017

Wow, this was a real tough one. It involves the fact that weakref.WeakKeyDictionary uses hashing, which depends on the pymongo _id, which in my implementation is set only after object creation. Due to the change of the hash you cannot find the same object in the WeakKeyDictionary anymore... I should give the next seminar on that one...

No idea, how I found this one. That was probably the most hidden error so far...
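The gist of the bug can be reproduced in a few lines of plain Python. The Doc class below is a toy stand-in for the stored objects (not the actual AdaptiveMD code): its hash depends on an _id that is only assigned after the object has already been cached.

```python
import weakref


class Doc(object):
    """Toy stand-in for a stored object whose hash depends on its DB id."""

    def __init__(self):
        self._id = None  # assigned only once the object is "saved"

    def __hash__(self):
        # the hash changes as soon as _id is set after creation
        return hash(self._id)

    def __eq__(self, other):
        return self is other


cache = weakref.WeakKeyDictionary()

doc = Doc()
cache[doc] = 'cached value'  # inserted under hash(None)

doc._id = 42                 # "saving" assigns the id and changes the hash

print(doc in cache)          # False: the lookup now probes under hash(42)
print(len(cache))            # 1: the entry is still there, just unreachable
```

The entry is never garbage-collected while doc is alive, but lookups by key silently fail, which is why the object appeared "missing" and got stored a second time.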

Unfortunately, #35 contains the fix and also allows storing arbitrarily large files now.

The problem is that when I merge #35, this one will be merged as well... So let's at least finish this discussion. The additions from #35 are extra features, while this PR changes the general concept of trajectories...

@franknoe
Collaborator

I like the description of this task very much. The only point that concerns me a little bit is the last point (how to featurize), because it creates a relatively hard dependency on PyEMMA and our current naming conventions. There are two issues with this: (1) If you always depend on PyEMMA, this makes the dependencies very heavy (e.g. you also depend on things like matplotlib which are clearly irrelevant for this package) and many dependencies also means there are many ways for the package to break down if dependencies change. (2) Although we don't have a concrete plan for that, it is not impossible that the look+feel of PyEMMA featurization will change at some point. I know there are some deficiencies with the current one.

To address that, please check where you actually depend on PyEMMA and if possible find a way to make that dependency optional to your package, i.e. if the user doesn't need a certain functionality (e.g. writes their own analysis class), it shouldn't automatically install PyEMMA.
For the second point: since you basically have to look up the PyEMMA API in order to write this pseudocode anyway, why not just use the PyEMMA function names directly (with the 'add_')? In any case this needs to be clearly documented, i.e. add a link to the PyEMMA featurizer in the present API docs.

Looking at the examples now...

@jhprinz
Contributor Author

jhprinz commented Mar 27, 2017

merging this

@jhprinz jhprinz merged commit 226e17e into markovmodel:master Mar 27, 2017