Transition Parser and Transformer to use a Serialized object model #735

Closed
mvriel opened this Issue Feb 5, 2013 · 14 comments

@mvriel
phpDocumentor member

I have been working on this item for quite a while, but I need to make the process more transparent and give myself a handle on it, given the size and complexity of this issue.

Currently

phpDocumentor parses the source code using Reflection and exports each reflected file to an XML file, the structure.xml, that acts as a sort of cache.
The transformer then loads the XML and calls on behaviours and transformations to operate on that XML source.

This approach has a few downsides:

  • Does not scale: a single XML file is slow to generate and manipulate.
  • Is not flexible: the transformer is now tied to XML; if you want to export to another format, then you need to write something that converts that to XML in order to use the transformer.
  • Slows down inheritance: because the XML is too sluggish to query dynamically, inheritance needs to copy all inherited elements, slowing down operations even more.
  • Hinders other templating engines: most templating engines in PHP work from an object model. SimpleXMLElements do not provide enough of a handle to use them in practice.

Future

Due to the downsides I am implementing the following scenario:

phpDocumentor parses the source code using Reflection and converts the reflected information to a simple Entity (model) called a Descriptor. The entire project is stored in memory with all references intact. After parsing, the object model is serialized to a single file.
The transformer gets the object model passed (or unserializes it from the cache file), applies behaviours on the object model and applies the transformations on the object model instead of a plain XML file.
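As a rough, language-neutral illustration of the flow described above (not phpDocumentor's actual PHP code), the sketch below keeps descriptors in memory with object references intact, serializes the whole model to a single file, and restores it. The names `ClassDescriptor`, `ProjectDescriptor`, `save` and `load` are hypothetical, and Python's `pickle` stands in for whatever serializer the project ends up using.

```python
import pickle
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class ClassDescriptor:
    """Hypothetical descriptor for one reflected class."""
    name: str
    parent: Optional["ClassDescriptor"] = None  # direct object reference, not a copy
    methods: list = field(default_factory=list)

    def all_methods(self):
        """Resolve inheritance by following references instead of copying elements."""
        inherited = self.parent.all_methods() if self.parent else []
        return inherited + [m for m in self.methods if m not in inherited]


@dataclass
class ProjectDescriptor:
    """Hypothetical root of the object model: class name -> descriptor."""
    classes: dict = field(default_factory=dict)


def save(project, path):
    """Serialize the whole object model to a single cache file."""
    path.write_bytes(pickle.dumps(project))


def load(path):
    """Restore the object model; shared references are preserved."""
    return pickle.loads(path.read_bytes())


if __name__ == "__main__":
    base = ClassDescriptor("Base", methods=["run"])
    child = ClassDescriptor("Child", parent=base, methods=["stop"])
    project = ProjectDescriptor({"Base": base, "Child": child})
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "project.cache"
        save(project, cache)
        restored = load(cache)
    print(restored.classes["Child"].all_methods())
```

Because the parent is a reference rather than a copy, inherited elements can be looked up on demand, which is the point made about inheritance in the list of downsides.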

Before starting on this issue I was worried about the memory usage and performance of keeping the object model in memory and (de)serializing it, so I did an analysis up front. The rough draft of the resulting report can be found here: https://github.com/mvriel/phpDocumentor2/blob/serializing/docs/manual/for-developers/serialization.rst

Benefits

  • More flexible: an object model is easier to manipulate
  • Faster, because memory is faster than disk
  • Twig works much better with the object model
  • If my tests are correct, memory usage is about the same or lower
  • Simpler to work with (fewer XPath queries)

Challenges

There are however issues or challenges to this:

  • BC is completely shattered with this change. The format of the cache file is a serialized or binary format instead of the XML file that people are used to.
  • Huge refactoring: every part of the system is touched and refactored because of the dependency on XML.
  • XSL uses the XML output; this needs a nice solution.

I will try to mitigate the BC impact as much as possible, but I cannot prevent all of it.

Goal

The goal of this change is to make phpDocumentor simpler, more stable and future-proof. With this change we can start migrating the templates to Twig, thereby removing a dependency, and fix all remaining issues with the templates.

Work required

  1. Refactor Parser
  2. Remove Exporter
  3. Create Descriptors
  4. Create Serializer(s)
  5. Refactor transformer
  6. Move behaviour functionality or refactor behaviours
  7. Refactor Writers (specifically the XSL)
  8. Add Twig writer (i.e. to test against)
  9. Create Twig version of the responsive template
@mvriel mvriel was assigned Feb 5, 2013
@ashnazg
phpDocumentor member

(Take this suggestion from the perspective of someone who does not yet know the internal flow of things ;-) )

In considering the BC aspect, what if you leave in a default behavior, though marked as deprecated, that takes the final serialized object model and generates the expected structure.xml? Would the overall XML creation be painful if it was just being created in one big swoop by reading the final serialized object model? This is an extra step, of course, only necessary for BC, but deprecating it for later removal could allow for the refactoring to be done while giving templaters more time to transition.

@mvriel
phpDocumentor member

Sounds like a sane idea; the extra performance cost is 'the cost of deprecation' and perhaps an incentive to switch?

I think that there are limitations; at least: I have not yet figured out a solution.
I can create a writer that outputs the XML as it is used now and have the XSL writer use that; this would give that part a sort of BC. I am having difficulty making behaviours BC, as these have their XML handed down from the transformer, which will have no knowledge of XML after this refactoring.
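To make the "writer that outputs the XML" idea concrete, here is a minimal sketch, assuming a simplified object model of plain dictionaries. The element names (`project`, `file`, `class`, `name`, `method`) and the function `export_structure_xml` are hypothetical and do not claim to match the real structure.xml schema; the sketch is in Python rather than PHP purely for illustration.

```python
import xml.etree.ElementTree as ET


def export_structure_xml(project):
    """Build an XML document in one pass over an in-memory object model.

    `project` maps a file path to a list of class records, each a dict
    with a "name" and a list of "methods" (a hypothetical, simplified model).
    """
    root = ET.Element("project")
    for file_path, classes in project.items():
        file_el = ET.SubElement(root, "file", path=file_path)
        for cls in classes:
            class_el = ET.SubElement(file_el, "class")
            ET.SubElement(class_el, "name").text = cls["name"]
            for method in cls["methods"]:
                ET.SubElement(class_el, "method").text = method
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    model = {"src/Foo.php": [{"name": "Foo", "methods": ["bar", "baz"]}]}
    print(export_structure_xml(model))
```

A writer like this only runs when an XSL-based template asks for it, so the cost of generating the XML is paid only by templates that still depend on it.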

@mvriel
phpDocumentor member

This does however remind me that a solid document on the architecture needs to be written, so much to do, so much to document :)

@boenrobot

The monster that is the project at #637 still haunts me, and it seems that even with the new format, loading the whole super-mega-huge project into memory would still take GBs of RAM. I mean, it's about 10 times (!!!) bigger than Magento 2, and that requires about 1/4 GB of RAM.

Is there any chance that we could implement both RAM storage, and HDD storage, for cases like that where the user is willing to wait longer if it means they can parse the project without running out of memory?

Perhaps if the array (or whatever) where the descriptors are stored is abstracted away in an object, and we implement one which is simply an ArrayObject extension, and a different one that implements ArrayAccess, but uses the HDD instead? Users will be able to switch to the HDD if they need to.

(Obviously, by default, we'll use the RAM descriptor storage, since most projects aren't that big)
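A minimal sketch of the storage abstraction suggested here, written in Python for illustration: `MemoryStore` plays the role of the ArrayObject-style default and `DiskStore` the role of an ArrayAccess implementation backed by the disk. All names are hypothetical, and the standard-library `shelve` module is only one possible on-disk backend.

```python
import shelve
import tempfile
from collections.abc import MutableMapping
from pathlib import Path


class MemoryStore(dict):
    """Default store: every descriptor stays in RAM."""


class DiskStore(MutableMapping):
    """Slower store with the same mapping interface, backed by a file."""

    def __init__(self, path):
        self._db = shelve.open(str(path))

    def __getitem__(self, key):
        return self._db[key]

    def __setitem__(self, key, value):
        self._db[key] = value

    def __delitem__(self, key):
        del self._db[key]

    def __iter__(self):
        return iter(self._db)

    def __len__(self):
        return len(self._db)

    def close(self):
        self._db.close()


def make_store(use_disk=False, path=None):
    """Pick the backend; callers only rely on the mapping interface."""
    return DiskStore(path) if use_disk else MemoryStore()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = make_store(use_disk=True, path=Path(tmp) / "descriptors")
        store["My\\Class"] = {"summary": "Example class"}
        print(store["My\\Class"]["summary"])
        store.close()
```

Code that reads and writes descriptors through the mapping interface does not need to know which backend is active, so the switch can be a single runtime option.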

> This does however remind me that a solid document on the architecture needs to be written

Gee, I wish we had a tool that could, like, read a PHP source, and generate human readable docs out of it... LOL.

@ashnazg
phpDocumentor member

;-) I thought that too, as in "document our own dog food" :-P

@ashnazg
phpDocumentor member

What if the in-memory object tracked the very minimal amount of info needed for a given element so that whatever key-to-key pairing lookup needed by another element could be read from in-memory... and all other aspects of a given element were stored in Sqlite? By default, the Sqlite storage could be :memory:, but a runtime option could set it to do file storage. This runtime option could be advertised as the way around memory limit failures, though obviously slower than default in-memory use.
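The idea above could look roughly like the following sketch, assuming a single hypothetical table and JSON-encoded details. Only the name-to-parent index is kept in a Python dictionary; everything else goes to SQLite, which defaults to `:memory:` and can be pointed at a file instead. The class and column names are illustrative only.

```python
import json
import sqlite3


class SqliteDescriptorStore:
    """Hypothetical store: tiny in-memory index, full details in SQLite."""

    def __init__(self, database=":memory:"):
        # ":memory:" keeps the database in RAM; pass a file path to use disk.
        self._db = sqlite3.connect(database)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS descriptors "
            "(name TEXT PRIMARY KEY, body TEXT NOT NULL)"
        )
        self.parents = {}  # minimal lookup data: element name -> parent name

    def add(self, name, parent, details):
        self.parents[name] = parent
        self._db.execute(
            "INSERT OR REPLACE INTO descriptors (name, body) VALUES (?, ?)",
            (name, json.dumps(details)),
        )
        self._db.commit()

    def details(self, name):
        """Fetch the full record for one element from SQLite."""
        row = self._db.execute(
            "SELECT body FROM descriptors WHERE name = ?", (name,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def ancestors(self, name):
        """Walk the inheritance chain using only the in-memory index."""
        chain = []
        parent = self.parents.get(name)
        while parent is not None:
            chain.append(parent)
            parent = self.parents.get(parent)
        return chain


if __name__ == "__main__":
    store = SqliteDescriptorStore()  # or SqliteDescriptorStore("cache.sqlite")
    store.add("Base", None, {"summary": "Base class"})
    store.add("Child", "Base", {"summary": "Child class"})
    print(store.ancestors("Child"), store.details("Child"))
```

The trade-off is the one stated in the comment: lookups between elements stay fast, while reading the full details of an element costs a query.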

@mvriel
phpDocumentor member

@boenrobot the project in #637 is less than half of magento 2; project #637 is 1.1MLOC and magento is 2.5MLOC.

For the rest, I need to read your responses in detail this evening and reply adequately.

@boenrobot

Oh f.. yeah, it seems my mind added an extra "1" in there... sorry.

@mvriel
phpDocumentor member

> ;-) I thought that too, as in "document our own dog food" :-P

To be honest, our own DocBlocks could use a lot of love as well, they are not the best example of good DocBlocks (though there are much worse examples).

> What if the in-memory object tracked the very minimal amount of info needed for a given element so that whatever key-to-key pairing lookup needed by another element could be read from in-memory... and all other aspects of a given element were stored in Sqlite? By default, the Sqlite storage could be :memory:, but a runtime option could set it to do file storage. This runtime option could be advertised as the way around memory limit failures, though obviously slower than default in-memory use.

This would require a mechanism to swap the current descriptors with another set of descriptors with the same interface. This is doable but requires a bit of additional engineering. As I see it you'd need a series of interfaces, one for each descriptor type, and the ability to use a different builder (which is supported in the new changes) or have the builder select a different set of descriptors (possibly using a factory?).

I am worried that the benefits do not justify the time invested; I already feel pressure to finish this ASAP so that we can go to the beta phase, and this would delay implementation.

Also, if Magento takes 260 MB of RAM (let's say 300 for the sake of rounding) and is 1.3MLOC (sorry @boenrobot, I remembered incorrectly!), then for the cost of 900 MB of RAM we can process almost 4MLOC of code; this is an insane amount!
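The estimate above can be checked with a few lines of arithmetic. This sketch uses only the rounded figures quoted in the comment and assumes that memory use grows linearly with lines of code, which is an assumption and not something that has been measured here.

```python
# Figures quoted in the comment (rounded up from 260 MB).
measured_mb = 300      # RAM used for the Magento run
measured_mloc = 1.3    # size of that code base in millions of lines
budget_mb = 900        # RAM budget under consideration

# Assumption: memory scales linearly with lines of code.
mb_per_mloc = measured_mb / measured_mloc
mloc_in_budget = budget_mb / mb_per_mloc

print(f"{mb_per_mloc:.0f} MB per MLOC")
print(f"{mloc_in_budget:.1f} MLOC fit in {budget_mb} MB")
```

Under that linearity assumption the budget covers about 3.9 MLOC, which matches the "almost 4MLOC" figure.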

It would be nice if we could somehow reduce that amount even more, but that would take a lot of research, as it needs to be performance tested and thus built.

What I can do is extract the interfaces from the current Descriptors, which is a small task, and thus allow us (and other people) to create our own builders and serializers in the future. This may provide the necessary flexibility.
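A minimal sketch of what extracting an interface and letting the builder choose the implementation might look like, written in Python for illustration. Every name here (`ClassDescriptorInterface`, `InMemoryClassDescriptor`, `LazyClassDescriptor`, `Builder`) is hypothetical and not taken from the phpDocumentor code base.

```python
from abc import ABC, abstractmethod


class ClassDescriptorInterface(ABC):
    """The contract that transformers and templates would depend on."""

    @abstractmethod
    def get_name(self): ...

    @abstractmethod
    def get_summary(self): ...


class InMemoryClassDescriptor(ClassDescriptorInterface):
    """Default implementation: all data held directly on the object."""

    def __init__(self, name, summary):
        self._name = name
        self._summary = summary

    def get_name(self):
        return self._name

    def get_summary(self):
        return self._summary


class LazyClassDescriptor(ClassDescriptorInterface):
    """Alternative implementation: the summary is fetched on demand."""

    def __init__(self, name, load_summary):
        self._name = name
        self._load_summary = load_summary  # callable, e.g. a database lookup

    def get_name(self):
        return self._name

    def get_summary(self):
        return self._load_summary()


class Builder:
    """Creates descriptors through a factory so the set can be swapped."""

    def __init__(self, factory=InMemoryClassDescriptor):
        self._factory = factory

    def build_class(self, name, summary):
        return self._factory(name, summary)


def render(descriptor: ClassDescriptorInterface) -> str:
    """Consumer code sees only the interface, never the concrete class."""
    return f"{descriptor.get_name()}: {descriptor.get_summary()}"


if __name__ == "__main__":
    default_builder = Builder()
    lazy_builder = Builder(lambda n, s: LazyClassDescriptor(n, lambda: s))
    print(render(default_builder.build_class("Foo", "Does foo things")))
    print(render(lazy_builder.build_class("Foo", "Does foo things")))
```

As long as consumers depend only on the interface, a different storage backend can be introduced later without breaking them.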

@ashnazg
phpDocumentor member

Fair points. I like the idea of at least preparing that interface hierarchy now, thus allowing other priorities to go forward... optimizing the backend then becomes a BC-able refactoring as long as that interface hierarchy remains.

@mvriel
phpDocumentor member

I will do a new push after I have implemented the interfaces so you can see.

@mvriel
phpDocumentor member

The latest series of commits has been pushed, including the new interfaces for the descriptors: https://github.com/mvriel/phpDocumentor2/tree/serializing

@mvriel
phpDocumentor member

I have created a new milestone, Serializer, and a series of issues on it. I use this to make the development process more transparent, to allow people to comment and provide feedback, and to track progress.

The Serializer milestone is a prerequisite for the 2.0 milestone.

@mvriel
phpDocumentor member

I have completed work on this item and merged it.

@mvriel mvriel closed this Apr 30, 2013