ICAT schema extension with columns for investigation and dataset sizes #233
Conversation
We discussed this in the collaboration meeting last week, in particular the question of preventing the client from writing these new attributes. The decision was to do nothing for the moment. The main use case for the size attributes is to give the user a hint on the size of an investigation or a dataset in the web user interface, so that they might think twice before clicking on download for an investigation comprising several tens of gigabytes. For this purpose, accuracy is not critical. Furthermore, it is not clear whether errors due to overwriting the values will be an issue in practice. So we decided to wait and see how it works in production and to fix this only if it turns out to be a problem. It would be rather simple to set up a maintenance task that runs in the background from time to time and checks these sizes. This would even be easier if the attributes are writable. |
Hi, take into account that in some cases the value is, and has to be, exact. It can be used to build a quota system that blocks archiving, to calculate the cost of archiving an industrial experiment for ten years, or to flag in real time when an experiment exceeds the maximum number of files allowed, for instance. |
@antolinos, nothing prevents you from verifying this. You may for instance run a maintenance script to check the sizes. There is even already a prototype for such a script. |
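For illustration, the core of such a check could look roughly like this with python-icat (a sketch only, not the prototype script mentioned above; it assumes a client session with permission to read Datafiles and to update Datasets):

from icat.query import Query

# Recompute each dataset size from its datafiles and fix it if it diverged.
for ds in client.searchChunked(Query(client, "Dataset")):
    q = Query(client, "Datafile", conditions={"dataset.id": "= %d" % ds.id},
              attribute="fileSize", aggregate="SUM")
    res = client.search(q)
    size = res[0] if res and res[0] is not None else 0  # SUM is NULL if there are no files
    if ds.datasetSize != size:
        ds.datasetSize = size
        ds.update()

An analogous loop over Investigations would check investigationSize accordingly.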
The triggers on a fresh install are missing: the database triggers to update the size attributes are only created by the upgrade script, not when the schema is created from scratch. |
Size attributes are not updated if NULL. If the size attributes are set to an integer, the triggers do their job:
>>> # Get an investigation having investigationSize set to zero
>>> inv = client.assertedSearch("Investigation [name='test-Zero']")[0]
>>> inv.investigationSize
0
>>> inv.investigationSize is None
False
>>> # Create a dataset
>>> dataset = client.new("dataset", name="test", investigation=inv, type=ds_type, complete=False, datasetSize=0)
>>> dataset.create()
>>> # Fetch the new dataset from the server to verify datasetSize is set
>>> dataset = dataset.get("Dataset")
>>> dataset.datasetSize
0
>>> dataset.datasetSize is None
False
>>> # Create a datafile
>>> datafile = client.new("datafile", name="test.dat", dataset=dataset, datafileFormat=df_format, fileSize=38)
>>> datafile.create()
>>> # Fetch the dataset and the investigation again to verify the updated size attributes
>>> dataset = dataset.get("Dataset")
>>> dataset.datasetSize
38
>>> inv = inv.get("Investigation")
>>> inv.investigationSize
38
But if the size attributes are not set, the triggers do not work:
>>> inv = client.assertedSearch("Investigation [name='test-None']")[0]
>>> inv.investigationSize
>>> inv.investigationSize is None
True
>>> dataset = client.new("dataset", name="test", investigation=inv, type=ds_type, complete=False)
>>> dataset.create()
>>> dataset = dataset.get("Dataset")
>>> dataset.datasetSize
>>> dataset.datasetSize is None
True
>>> datafile = client.new("datafile", name="test.dat", dataset=dataset, datafileFormat=df_format, fileSize=38)
>>> datafile.create()
>>> dataset = dataset.get("Dataset")
>>> dataset.datasetSize
>>> dataset.datasetSize is None
True
>>> inv = inv.get("Investigation")
>>> inv.investigationSize
>>> inv.investigationSize is None
True
Note that the update script does not initialize the size attributes if a dataset or investigation has no files. I tried this with a MariaDB backend. |
The reason for this is obviously the arithmetic in MariaDB / MySQL:
MariaDB [(none)]> select NULL + 38;
+-----------+
| NULL + 38 |
+-----------+
| NULL |
+-----------+
1 row in set (0.00 sec) |
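For comparison, making the arithmetic NULL-safe with IFNULL() gives the expected result, which is the approach the fix below takes:

MariaDB [(none)]> select IFNULL(NULL, 0) + 38;
+----------------------+
| IFNULL(NULL, 0) + 38 |
+----------------------+
|                   38 |
+----------------------+
1 row in set (0.00 sec)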
Hi @RKrahl, if this is done by triggers, will there be a performance cost? |
My assumption would be that the performance with triggers will be way better than the performance of updating the values from the client. (Obviously, not keeping the values up to date at all will always be the fastest option.) But this is still to be tested. |
I agree, but that does not mean it will be acceptable. If ICAT runs slower because of these calculations, some sites might prefer not to calculate the values at all; with triggers there is no such choice (unless you remove the triggers and thereby diverge from the standard ICAT deployment). Also, a site might want to spend time calculating the sizes of investigations carried out by external users but not care about in-house ones, for instance. |
Tested. This fix seems to work, at least for MariaDB / MySQL. The update script still leaves the size attributes unset if a dataset or investigation has no files. |
I still found a case where the triggers (for MariaDB / MySQL) are not working correctly. If you have a dataset with some datafiles having no fileSize set, and you later update those datafiles to set the fileSize, the size attributes are not updated:
>>> # create a brand new investigation
>>> investigation = client.new("investigation", facility=facility, type=inv_type, name="sizetest", visitId="N/A", title="size attribute test")
>>> investigation.create()
>>> investigation.get("Investigation")
(investigation){
createId = "simple/root"
createTime = 2020-06-22 15:50:31+02:00
id = 10
modId = "simple/root"
modTime = 2020-06-22 15:50:31+02:00
name = "sizetest"
title = "size attribute test"
visitId = "N/A"
}
>>> investigation.investigationSize is None
True
>>> # add a dataset with some files, but do not set the fileSize
>>> dataset = client.new("dataset", investigation=investigation, type=ds_type, name="testds", complete=False)
>>> for df_count in range(10):
... datafile = client.new("datafile", name="df_%04d" % df_count)
... dataset.datafiles.append(datafile)
...
>>> dataset.create()
>>> # check that dataset.datasetSize and investigation.investigationSize are 0
>>> dataset.get("Dataset")
(dataset){
createId = "simple/root"
createTime = 2020-06-22 15:51:31+02:00
id = 771
modId = "simple/root"
modTime = 2020-06-22 15:51:31+02:00
complete = False
datasetSize = 0
name = "testds"
}
>>> investigation.get("Investigation")
(investigation){
createId = "simple/root"
createTime = 2020-06-22 15:50:31+02:00
id = 10
modId = "simple/root"
modTime = 2020-06-22 15:50:31+02:00
investigationSize = 0
name = "sizetest"
title = "size attribute test"
visitId = "N/A"
}
>>> # now update the datafiles, setting the fileSize
>>> query = Query(client, "Datafile", conditions={ "dataset.id": "= %d" % dataset.id }, includes="1")
>>> datafiles = client.search(query)
>>> assert len(datafiles) == 10
>>> for datafile in datafiles:
... datafile.fileSize = 997
... datafile.update()
...
>>> # dataset.datasetSize and investigation.investigationSize should be updated now, but they aren't
>>> dataset.get("Dataset")
(dataset){
createId = "simple/root"
createTime = 2020-06-22 15:51:31+02:00
id = 771
modId = "simple/root"
modTime = 2020-06-22 15:51:31+02:00
complete = False
datasetSize = 0
name = "testds"
}
>>> investigation.get("Investigation")
(investigation){
createId = "simple/root"
createTime = 2020-06-22 15:50:31+02:00
id = 10
modId = "simple/root"
modTime = 2020-06-22 15:50:31+02:00
investigationSize = 0
name = "sizetest"
title = "size attribute test"
visitId = "N/A"
}
>>> # verify the overall size of the datafiles
>>> ds_size_query = Query(client, "Datafile", conditions={ "dataset.id": "= %d" % dataset.id }, attribute="fileSize", aggregate="SUM")
>>> client.assertedSearch(ds_size_query)[0]
9970
I have no idea why this is not working. |
The problem is that the trigger only updates the size attributes if the condition
ELSEIF NEW.FILESIZE != OLD.FILESIZE THEN
holds. Evidently, if either the new or the old fileSize is NULL, this comparison evaluates to NULL rather than true, so the update is skipped. The fix is to change the condition to
ELSEIF IFNULL(NEW.FILESIZE, 0) != IFNULL(OLD.FILESIZE, 0) THEN
Now the trigger should work correctly even if the fileSize is NULL before or after the update. I am not sure if Oracle has this problem too, but just to be safe I will change the Oracle trigger as well. |
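To put the fixed condition in context, an UPDATE trigger along the following lines would propagate a fileSize change to the related Dataset and Investigation (a simplified MariaDB sketch with assumed table and column names, not the exact trigger shipped with icat.server):

DELIMITER //
CREATE TRIGGER DATAFILE_SIZE_UPDATE AFTER UPDATE ON DATAFILE
FOR EACH ROW
BEGIN
    -- NULL-safe comparison: treat an unset fileSize as zero
    IF IFNULL(NEW.FILESIZE, 0) != IFNULL(OLD.FILESIZE, 0) THEN
        UPDATE DATASET
           SET DATASETSIZE = IFNULL(DATASETSIZE, 0)
                             + IFNULL(NEW.FILESIZE, 0) - IFNULL(OLD.FILESIZE, 0)
         WHERE ID = NEW.DATASET_ID;
        UPDATE INVESTIGATION
           SET INVESTIGATIONSIZE = IFNULL(INVESTIGATIONSIZE, 0)
                                   + IFNULL(NEW.FILESIZE, 0) - IFNULL(OLD.FILESIZE, 0)
         WHERE ID = (SELECT INVESTIGATION_ID FROM DATASET WHERE ID = NEW.DATASET_ID);
    END IF;
END //
DELIMITER ;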
> Tested. This fix seems to work, at least for MariaDB / MySQL.
Yes. In any case, I'd say we should try to keep both versions as similar as possible. Thus, I'd opt for adding the analogous fix to the Oracle version as well. |
Now I also did some performance testing. The test script is available in icat-contrib. I ran the same script against the current icat.server release 4.10.0 and against an icat.server built from this branch with the DB triggers in place. I only tested it with a MariaDB backend. Here is the output with the timings for the 4.10.0 release:
And here for this PR code:
As you can see, as expected, there is a performance penalty from the triggers. But it is barely measurable and lies within the range of fluctuation. I'd appreciate someone trying it with an Oracle backend. |
Thanks @RKrahl. Question: what is the difference between test case 1 and test case 2? I see in the code:
I then interpret that in test case 1 the datasets already exist and files are attached to them, while in test case 2 the datasets are created together with their datafiles?
Why is test 2 twice as fast? |
@antolinos, the difference between case 1 and case 2 is that case 2 uses cascading. In case 1, a dataset is created first and then 1000 datafiles are created in that dataset, with a separate call for each. In case 2, the dataset and its 1000 datafiles are created in one single call. |
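For illustration, the two patterns look like this in python-icat (a sketch reusing the inv, ds_type and df_format objects from the sessions above):

# Case 1: create the dataset first, then each datafile with a separate call
ds1 = client.new("dataset", investigation=inv, type=ds_type, name="case1", complete=False)
ds1.create()
for i in range(1000):
    client.new("datafile", name="df_%04d" % i, dataset=ds1,
               datafileFormat=df_format, fileSize=42).create()

# Case 2: cascading, the dataset and all its datafiles in one single call
ds2 = client.new("dataset", investigation=inv, type=ds_type, name="case2", complete=False)
for i in range(1000):
    ds2.datafiles.append(client.new("datafile", name="df_%04d" % i,
                                    datafileFormat=df_format, fileSize=42))
ds2.create()

Case 2 presumably gains its speed from needing only one server round trip and one transaction instead of 1001.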
I'd say it would be rather easy to also add a |
There was a suggestion in the monthly meeting that the triggers should be optional. |
Here are the use cases for DLS:
|
I think it is a good idea. |
At DLS, we already have a trigger on the ICAT Datafile table, but it is used for a different purpose: each time a datafile is added, updated, or deleted, it adds/removes/updates datafiles in another database schema called FUSE.
create or replace TRIGGER "TESTICAT_DLS45"."UPDATE_FILESYSTEM_LOG" after insert or delete or update of location or update of filesize on datafile |
As suggested in #238, this PR also adds fileCount attributes to Dataset and Investigation. The triggers have been modified to automatically update these attributes as well. The upgrade script automatically initializes them. As discussed in the last meeting, the triggers are now optional, so the upgrade script no longer creates them by default. Instead, there are separate scripts to create and to drop the triggers. These scripts are also the easiest way to add the triggers after a fresh ICAT installation, or to remove them at any point. |
It has been decided in today's meeting:
The rationale for the second item is that, in the case that a site decides not to install the triggers, it is better to have the attributes not set at all than to have some value that will never be updated and will soon be completely wrong. That is, any site that chooses not to install the triggers needs to decide for itself whether to initialize the attributes or not. |
In Oracle, there are some errors:
|
I did some performance testing on Oracle in my local environment. We can do more tests on our test server later. I ran the script against the icat.server release 4.11.1 and against an icat.server built from this branch. Here is the output with the timings for the 4.11.1 release:
And here for this PR code (5.0.0-SNAPSHOT, with triggers):
So it's broadly the same trend as for MySQL. |
I'm adding the results from running these tests against our Oracle development database. The database used was a copy of the Diamond ICAT schema. It contained a reduced number of Datafiles, but still over 500 million! The results from the first run, on ICAT 4.10.0 (no triggers):
And from the second run on an ICAT 5.0.0 snapshot (with triggers):
So the strange anomaly here is that Test 1 was actually faster on the second run! This is not unusual: I have done similar tests before and had results like this. I put it down to the fact that the ICAT is running on a VM sharing resources with other VMs on the same hypervisor, and this is connected across the site network to our departmental development Oracle database, which also hosts numerous other databases. So the results vary depending on how busy the VM cluster, the network, and the Oracle database are at any particular point in time. The table below summarises the results from HZB, ESRF and STFC, with the numbers being the percentage increase in time taken to run the test with the triggers in place (a negative number indicating a decrease, as in the first STFC test).
The good news from a DLS point of view is that Test 2, using the createMany method (used to create most Datafiles in the DLS ICAT), shows both the smallest increase in time taken and the smallest variation across the three sites. |
This is a proposal for the schema extension discussed in #211. It includes the following:
- New attributes Dataset.datasetSize and Investigation.investigationSize added to the schema.
- Database triggers that keep these attributes up to date whenever Datafiles are created, updated, or deleted.
- An update script that initializes the attributes for existing content; icat.server should not be running while this is done to avoid inconsistencies.
Note that the update is done incrementally: when a new Datafile is added, its size simply gets added to the size of the related Dataset and Investigation (this obviously assumes that the previous values were already correct). This approach has advantages (good performance) and disadvantages (it might lead to inconsistencies in certain edge cases).
I also considered a different implementation where the sizes of Datasets and Investigations are always re-calculated from scratch (which arguably is more reliable), but I found that this not only comes with a computational overhead, it also leads to issues with Oracle database backends, which apparently don't like it when the table that fired a trigger is accessed inside that trigger (Oracle's "mutating table" error).
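For illustration, the from-scratch variant would do something like the following inside the Datafile trigger (a sketch with assumed column names); on Oracle this fails because the trigger reads from the very table that fired it, raising the ORA-04091 "mutating table" error:

-- inside an AFTER INSERT/UPDATE/DELETE trigger on DATAFILE:
UPDATE DATASET
   SET DATASETSIZE = (SELECT SUM(DF.FILESIZE) FROM DATAFILE DF
                       WHERE DF.DATASET_ID = DATASET.ID)
 WHERE ID = NEW.DATASET_ID;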
I have done a few simple tests so far to see how the Java Persistence API behaves in conjunction with the triggers, and it appears to be working fine.
What still needs to be done:
Closes #211