Data storage format #32

Open
jacobwb opened this Issue Apr 16, 2014 · 9 comments

@jacobwb
Owner

jacobwb commented Apr 16, 2014

Edit: Updated to reflect the current state of HashOver's support for different data storage formats. Serialized PHP is no longer being considered.


It seems advantageous to move away from XML as HashOver's default data storage format. The following is a list of candidate replacements. Formats that require additional libraries, that require installation and/or configuration of server-side software (e.g. MySQL), or that are only available in PHP versions higher than 5.3 are not being considered.

I won't use a database for the sake of using a database; most of the data formats below offer more benefits within the context of HashOver than a database would. The only benefit a database provides is a secure way to store password hash salts, but it is recommended that the comment files have no public read, write, or execute permissions, which secures the files just as effectively.

I'm considering the following criteria: human-readability, compatibility, speed, character set support, and interoperability. I will only use the format that best satisfies them.

JSON

JSON seems the most logical since HashOver's default mode is JavaScript. JSON is very human-readable, very compatible, seems to have the same or greater character set support, and would be the easiest format to move to from XML.

The good

  • JSON is typed: in addition to strings it supports integers, floats, and booleans, whereas everything in XML is treated as a string when parsed by PHP. This is inconvenient for a number of reasons. For example, casting a float value read from XML (1.66) to a PHP integer and adding it to another integer (5) yields 6, not 6.66; in other words, the value is floored. It is also impossible to store an actual boolean value in XML, and tags with the same name become arrays instead of overwriting each other.

  • JSON is easy to load asynchronously for those who want to use the comment files that way. Serving the JSON comments as-is would be a security risk and should be avoided; however, loading the JSON comments, decoding them, removing the sensitive data, and returning the re-encoded result would be secure and very simple, while doing the same with XML files requires more code.

  • JSON is also slightly smaller than XML. The current overhead for each XML file is 204 bytes; JSON's overhead would be 153 bytes. 1,000,000 comments containing "Hello, World!" would equal 217,000,000 bytes in XML and 166,000,000 bytes in JSON, therefore...

    let xml = 217,000,000
    let json = 166,000,000
    
    xml ÷ 1,024² = 206.9 megabytes
    json ÷ 1,024² = 158.3 megabytes
    ((json − xml) / xml) × 100 = −23.5%
    

    That means JSON's overhead is 23.5% smaller than XML's, with XML taking up 48.6 megabytes more per million comments. That isn't very significant. Note that in a comment like this: comment #33 the XML overhead makes up only 8.6% of the file, and the JSON overhead only 6.4%.
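The typing difference from the first bullet can be demonstrated in a few lines of PHP. This is a standalone sketch, not HashOver code; it only assumes the SimpleXML and JSON extensions bundled with PHP 5.3:

```php
<?php
// XML values parse as strings; casting one to integer floors it.
$xml = simplexml_load_string('<comment><rating>1.66</rating></comment>');
var_dump(((int) $xml->rating) + 5); // int(6) -- the fractional part is lost

// JSON preserves the original types.
$json = json_decode('{"rating": 1.66, "approved": true}');
var_dump($json->rating + 5); // float(6.66)
var_dump($json->approved);   // bool(true) -- no real equivalent in XML
```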

The bad

  • Although JSON can be stored pretty-printed, which increases its readability, it doesn't support multi-line strings. This means most comments will appear on a single very long line when viewed or edited in a text editor, whereas XML comments are stored with one paragraph per line.
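The decode, strip, re-encode approach described under "The good" takes only a few lines. A sketch, assuming hypothetical field names ("email", "encryption") rather than HashOver's actual comment schema:

```php
<?php
// Sketch: return a comment file's JSON with sensitive fields removed.
// The stripped field names are illustrative assumptions.
function sanitized_comments($file)
{
    $comments = json_decode(file_get_contents($file), true);

    foreach ($comments as &$comment) {
        // These must never reach the client
        unset($comment['email'], $comment['encryption']);
    }

    return json_encode($comments);
}
```

An AJAX endpoint would echo this result with a `Content-Type: application/json` header; the same sanitizing step with XML would require walking and re-serializing a DOM.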

SQLite

SQLite is well supported by PHP, it's a very well known data storage format, and unlike MySQL and many other *SQLs, SQLite doesn't require configuration of server-side software. PHP creates a file with a .sqlite extension and writes the data to that file.

The good

  • SQLite allows comments to be deleted securely by overwriting deleted content with zeros, making it harder to recover intentionally deleted comments.

    This isn't normally true of data formats where individual files are used, such as XML and JSON, unless the web server has an SSD and its filesystem is using TRIM, or if the deleted comments were to be shredded after being deleted.

  • SQLite may provide benefits relating to file size; however, as stated above, the current overhead isn't much, and it is unlikely a database would reduce it significantly.

  • SQLite may provide an increase in performance and querying speed, but not very significantly since all of the comment files are extremely small, modern filesystem overhead is negligible, and the code doesn't need to do any kind of query searches.
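For reference, both the secure delete described above and the later compaction are single statements in SQLite. A sketch assuming the pdo_sqlite extension bundled with PHP:

```php
<?php
// Sketch: secure delete and compaction in SQLite via PDO.
$db = new PDO('sqlite:' . sys_get_temp_dir() . '/comments.sqlite');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Overwrite deleted content with zeros instead of merely freeing pages
$db->exec('PRAGMA secure_delete = ON');

$db->exec('CREATE TABLE IF NOT EXISTS comments (id INTEGER PRIMARY KEY, body TEXT)');
$db->exec('DELETE FROM comments WHERE id = 1');

// Reclaim the space secure deletes leave behind ("compacting" the database)
$db->exec('VACUUM');
```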

The bad

  • SQLite is the least human-readable data format on this list. Every SQL-based database requires specific, often proprietary, software to read its database files in order to access the data. When the data is made available to the user in a human-readable and editable way, it's often in the form of a single line, and editable in only a very limited way.

  • SQLite's secure delete feature has the negative side-effect of increasing the size of the database until the database can be "compacted".

  • All databases and other single-file formats, including SQLite, can be easily corrupted and/or compromised. If the database gets corrupted, everything is corrupted: every comment on every page. If the database gets compromised, every e-mail address and password every user has ever used is compromised at once.

    This isn't true of data formats where individual files are used, such as XML and JSON, since only a single file can be corrupted and/or compromised at a time.

Support custom data format classes

Similar to how the Login class provides the means to load any compatible class that implements a login mechanism (with the DefaultLogin class providing the default one), HashOver could load custom data format classes. Theoretically, this is already the case: ParseXML, ParseJSON, and ParseSQL are classes that provide the XML, JSON, and *SQL data formats, respectively.

What is needed here is documentation on how to write a proper data format class for HashOver, and a template on which a developer may base their class.
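As a starting point, such a template might look like the interface below. The method names are illustrative assumptions, not HashOver's actual API; the real documentation would need to pin down the exact contract that ParseXML, ParseJSON, and ParseSQL fulfill. (Written without scalar type hints to stay PHP 5.3-compatible.)

```php
<?php
// Hypothetical template for a HashOver data format class.
interface DataFormat
{
    /** Read one stored comment; returns an associative array. */
    public function read($key);

    /** Persist a comment array; returns true on success. */
    public function save($key, array $comment);

    /** Remove a comment from storage; returns true on success. */
    public function delete($key);

    /** Return the keys of every comment belonging to a page. */
    public function query($page);
}
```

A custom format class would implement this interface and could then be selected through a setting.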

The good

  • Anyone could write and use additional classes, so long as they return all the information HashOver expects. For example, an OAuth class could provide an OAuth login mechanism.

The bad

  • It is inconvenient for a user to have to write their own class to use a specific data format, therefore a minimum number of data formats should be supported by default.

Others

If anyone has recommendations, please feel free to comment here or open specific issues. It is likely that all of the above data formats will be supported upon release, with an option in the settings for choosing between them. In that case, the question of data formats becomes:

"Which should be the default?"

jacobwb added this to the 2.0 milestone Apr 16, 2014

@Lux-Delux
Contributor

Lux-Delux commented Apr 16, 2014

Just so I understand what's not considered, something like this is out of the question?
https://github.com/dcodeIO/ProtoBuf.js

In any case, I've previously been reading up on XML and JSON... From what I gathered, we would need to be more careful with JSON when it comes to security; some interesting points are raised here:
http://stackoverflow.com/questions/4862310/json-and-xml-comparison

This is a good read
http://www.balisage.net/Proceedings/vol10/html/Lee01/BalisageVol10-Lee01.html#d69706e32


@jacobwb
Owner

jacobwb commented Apr 17, 2014

@Lux-Delux

Protocol Buffers is definitely not out of the question; I really like the overall idea of it. However, Protocol Buffers don't currently seem to be well supported by PHP: "not well supported" as in there is no native way to write or parse .proto files. Third-party libraries would be required, and I am not willing to delegate such an important task to third-party libraries, hence my similar opposition to jQuery.

Many of the libraries aren't actively developed either:

  • pb4php hasn't had a release since March 2009.
  • Protobuf-PHP hasn't had a release since 2011.
  • protoc-gen-php hasn't had a release since 2010-2011.

Developing a custom reader, parser, and writer is possible, but that's a job best left to the developers of PHP itself. Overall, Protocol Buffers seems to be a possible JSON successor and will more than likely gain proper native support in PHP eventually. When that happens it will be a serious consideration; however, it would also mean raising the minimum PHP version to whichever version gets the support first, which is undesirable.

Having it as an optional storage format isn't out of the question, either.


@Lux-Delux
Contributor

Lux-Delux commented May 15, 2014

I launched my website just as you updated the storage formats... and am now considering whether to stick with XML or switch to JSON before comments start to pop up.

Any drawbacks you have found with json?

The size benefits aren't that big but it's still something I appreciate.


@jacobwb
Owner

jacobwb commented May 16, 2014

@Lux-Delux

I haven't found any obvious drawbacks yet. JSON support isn't really well tested, though, meaning certain input might throw errors, which would cause the browser to throw JavaScript syntax errors as well. JSON support has also not been tested very much for security issues.

Though, JSON's main security issues seem to be related to eval() and AJAX, and since HashOver uses neither of those it seems safe to say any major security risks aren't an issue. As long as you keep your HashOver version up-to-date, the likelihood of security and performance issues will only decrease.

Unrelated to JSON: Just to be safe, you might want to make sure people can't view anything under hashover/pages/ by using .htaccess or its equivalent in Nginx, and/or proper file permissions.
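A minimal way to block that access on Apache is an .htaccess file inside hashover/pages/ itself. This assumes Apache 2.4 syntax; 2.2 uses `Order deny,allow` / `Deny from all` instead:

```apache
# hashover/pages/.htaccess -- deny all direct HTTP access
Require all denied
```

On nginx, which ignores .htaccess files, a `location` block with `deny all;` (or `return 404;`) for the pages directory achieves the same effect.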


@Lux-Delux
Contributor

Lux-Delux commented May 16, 2014

I guess I'll stick with XML for now, for stability.

As for security, I have the permissions from the 1.0 tar file, and trying to access XML comments directly returns a 404 error.


@jacobwb
Owner

jacobwb commented May 16, 2014

@Lux-Delux

404 errors are perfect behavior.


@aoloe
Contributor

aoloe commented Oct 25, 2015

A few comments from my side:

  • JSON is, for my case, the perfect solution.
  • For PHP, I'm not sure that SQLite is better than MySQL (which is omnipresent wherever PHP is). SQLite is a really cool technology, but it does not add much if you have JSON on the one side and MySQL on the other.
  • There should be a setting for the directory where the comments are placed (possibly pointing outside of the path served through HTTP; .htaccess should not be the only way of protecting the files).
  • I'd prefer to have all the comments concerning one page in the same JSON file (you always, or mostly, want to read all comments relating to a page). It should at least be an option, but it might also be by design: if you need more, you should use MySQL. JSON would then only be for small lists of comments, which makes sense to me.
  • We could apply a "weak" crypto protection to the fields that are not shown in the UI (like the e-mail address of the commenter).

I can work on a patch for getting all the comments in one single file per page and for defining the target path for those files.


@fuzzy76

fuzzy76 commented Jun 29, 2016

I did a similar comparison a while back and ended up with JSON. The final argument came when I benchmarked json_decode() against unserialize() and found the JSON equivalent to be much faster.

Another pro for you would be that JavaScript talks JSON natively.
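A rough version of that benchmark, for anyone who wants to repeat it. The data shape and iteration count are arbitrary choices, and the relative results vary by PHP version, so no particular numbers are claimed here:

```php
<?php
// Microbenchmark sketch: json_decode() vs unserialize().
$data = array('name' => 'Alice', 'likes' => 3, 'body' => str_repeat('Hello, World! ', 20));
$json = json_encode($data);
$ser  = serialize($data);

$start = microtime(true);
for ($i = 0; $i < 100000; $i++) { json_decode($json, true); }
$jsonTime = microtime(true) - $start;

$start = microtime(true);
for ($i = 0; $i < 100000; $i++) { unserialize($ser); }
$serTime = microtime(true) - $start;

printf("json_decode: %.3fs  unserialize: %.3fs\n", $jsonTime, $serTime);
```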


@mro

mro commented Jun 29, 2016

A pro for XML: it can be rendered to HTML with JS disabled.

A side note concerning YAML: JSON is actually a subset of YAML, so every YAML parser must parse JSON (but not the other way round). That's IMO a pro for JSON.

