ICAT schema extension with columns for investigation and dataset sizes #233

Merged
merged 15 commits into icatproject:master on Jun 8, 2022
Conversation

EmilJunker
Contributor

This is a proposal for the schema extension discussed in #211. It includes the following:

  • Added properties Dataset.datasetSize and Investigation.investigationSize to the schema.
  • The upgrade script automatically calculates and initializes the sizes of existing Datasets and Investigations. Note that icat.server should not be running while this is done to avoid inconsistencies.
  • The upgrade script also adds SQL triggers in the database backend: whenever a Datafile is added/modified/deleted, the sizes of the related Dataset and Investigation are updated automatically.

Note that the update is done incrementally: e.g. when a new Datafile is added, its size simply gets added to the size of the related Dataset and Investigation (this obviously assumes that the previous values were already correct). This approach has advantages (good performance) and disadvantages (might lead to inconsistencies in certain edge cases).
I also considered a different implementation where the sizes of Datasets and Investigations are always re-calculated from scratch (which arguably is more reliable), but I found that this not only comes with a computational overhead, it also leads to issues with Oracle database backends (which apparently do not allow the table that fired a trigger to be queried from within that trigger).
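
To illustrate the mechanism, here is a minimal sketch of what the insert case of such a trigger could look like on MariaDB / MySQL. The table and column names (DATAFILE, DATASET, INVESTIGATION, FILESIZE, DATASETSIZE, INVESTIGATIONSIZE, DATASET_ID, INVESTIGATION_ID) are illustrative assumptions; the actual scripts in this PR may differ in naming, in NULL handling (discussed further down in this thread), and in also covering the update and delete cases:

-- Sketch only, not the actual trigger from the upgrade script.
-- Incrementally add a newly inserted Datafile's size to its Dataset and Investigation.
DELIMITER //
CREATE TRIGGER DATAFILE_SIZE_AFTER_INSERT AFTER INSERT ON DATAFILE
FOR EACH ROW
BEGIN
    -- add the new file's size to the owning Dataset
    UPDATE DATASET
       SET DATASETSIZE = DATASETSIZE + NEW.FILESIZE
     WHERE ID = NEW.DATASET_ID;
    -- and to the Investigation that this Dataset belongs to
    UPDATE INVESTIGATION
       SET INVESTIGATIONSIZE = INVESTIGATIONSIZE + NEW.FILESIZE
     WHERE ID = (SELECT INVESTIGATION_ID FROM DATASET WHERE ID = NEW.DATASET_ID);
END //
DELIMITER ;

Analogous triggers for updates and deletes on DATAFILE would subtract the old size and add the new one.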

I have done a few simple tests so far to see how the Java Persistence API behaves in conjunction with the triggers, and it appears to be working fine.

What still needs to be done:

  • Extensive testing (for each database backend) to make sure everything works as expected, including the performance.
  • Come up with a mechanism to prevent a client from overwriting the calculated size of a Dataset or Investigation.

Closes #211

@RKrahl added the enhancement and schema (this involves changes to the ICAT schema) labels on May 28, 2020
@RKrahl
Member

RKrahl commented Jun 3, 2020

We discussed this in the collaboration meeting last week, in particular the question of preventing the client from writing these new attributes. The decision was to do nothing for the moment.

The main use case for the size attributes is to give the user a hint about the size of an investigation or a dataset in the web user interface, so that they might think twice before clicking download on an investigation of several tens of gigabytes. For this purpose, accuracy is not critical. Furthermore, it's not clear whether errors due to overwriting the values will be an issue in practice. So we decided to wait and see how it works in production and to fix it only if it turns out to be a problem.

It would be rather simple to set up a maintenance task that runs in the background from time to time and checks these sizes. This would even be easier if the attributes are writable.
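
For illustration, such a check could be as simple as a SQL query comparing the stored sizes against sums over the Datafile table. Here is a minimal sketch for MariaDB / MySQL, assuming the column names DATASETSIZE, FILESIZE and DATASET_ID; the actual prototype script may well work differently, e.g. via the ICAT API:

-- Sketch only: list datasets whose stored size disagrees with the sum of their datafiles' sizes.
SELECT ds.ID, ds.DATASETSIZE, COALESCE(SUM(df.FILESIZE), 0) AS ACTUAL_SIZE
  FROM DATASET ds
  LEFT JOIN DATAFILE df ON df.DATASET_ID = ds.ID
 GROUP BY ds.ID, ds.DATASETSIZE
-- the NULL-safe comparison <=> also reports datasets whose size attribute is still NULL
HAVING NOT (ds.DATASETSIZE <=> COALESCE(SUM(df.FILESIZE), 0));

An analogous query over Investigation, or an UPDATE based on the same subselect, could then repair any rows that have drifted.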

@antolinos

Hi,

Take into account that in some cases the value is, and has to be, exact. It can be used to implement a quota system that blocks archiving, to calculate the cost of archiving an industrial experiment's data for ten years, or to flag in real time when an experiment is exceeding the maximum number of files allowed, for instance.

@RKrahl
Member

RKrahl commented Jun 12, 2020

@antolinos, nothing prevents you from verifying this. You may for instance run a maintenance script to check the sizes. There is even already a prototype for such a script.

@RKrahl
Member

RKrahl commented Jun 15, 2020

The triggers on a fresh install are missing.

The database triggers to update Dataset.datasetSize and Investigation.investigationSize are defined in the upgrade scripts. So for an upgrade from 4.10 everything looks fine. But when doing a fresh install of the new version, the code initializing the ICAT database does not create the triggers.

@RKrahl
Member

RKrahl commented Jun 15, 2020

Size attributes are not updated if NULL.

If the size attributes are set to an integer, the triggers do their job:

>>> # Get an investigation having investigationSize set to zero
>>> inv = client.assertedSearch("Investigation [name='test-Zero']")[0]
>>> inv.investigationSize
0
>>> inv.investigationSize is None
False
>>> # Create a dataset
>>> dataset = client.new("dataset", name="test", investigation=inv, type=ds_type, complete=False, datasetSize=0)
>>> dataset.create()
>>> # Fetch the new dataset from the server to verify datasetSize is set
>>> dataset = dataset.get("Dataset")
>>> dataset.datasetSize
0
>>> dataset.datasetSize is None
False
>>> # Create a datafile
>>> datafile = client.new("datafile", name="test.dat", dataset=dataset, datafileFormat=df_format, fileSize=38)
>>> datafile.create()
>>> # Fetch the dataset and the investigation again to verify the updated size attributes
>>> dataset = dataset.get("Dataset")
>>> dataset.datasetSize
38
>>> inv = inv.get("Investigation")
>>> inv.investigationSize
38

But if the size attributes are not set, the triggers do not work:

>>> inv = client.assertedSearch("Investigation [name='test-None']")[0]
>>> inv.investigationSize
>>> inv.investigationSize is None
True
>>> dataset = client.new("dataset", name="test", investigation=inv, type=ds_type, complete=False)
>>> dataset.create()
>>> dataset = dataset.get("Dataset")
>>> dataset.datasetSize
>>> dataset.datasetSize is None
True
>>> datafile = client.new("datafile", name="test.dat", dataset=dataset, datafileFormat=df_format, fileSize=38)
>>> datafile.create()
>>> dataset = dataset.get("Dataset")
>>> dataset.datasetSize
>>> dataset.datasetSize is None
True
>>> inv = inv.get("Investigation")
>>> inv.investigationSize
>>> inv.investigationSize is None
True

Note that the update script does not initialize the size attributes if a dataset or investigation has no files.

I tried this with a MariaDB backend.

@RKrahl
Member

RKrahl commented Jun 15, 2020

Size attributes are not updated if NULL.

The reason for this is obviously the arithmetic in MariaDB / MySQL:

MariaDB [(none)]> select NULL + 38;
+-----------+
| NULL + 38 |
+-----------+
|      NULL |
+-----------+
1 row in set (0.00 sec)

@antolinos

Hi @RKrahl

If done by triggers, will there be a performance cost?

@RKrahl
Member

RKrahl commented Jun 15, 2020

If done by triggers, will there be a performance cost?

My assumption would be that the performance with triggers will be way better than the performance of updating the value from the client. (Obviously, not keeping the value up to date at all will always be the fastest option.) But this still needs to be tested.

@antolinos

I agree, but that does not mean it will be acceptable. I was wondering: if ICAT runs slower because of the calculations, someone might prefer not to calculate these values at all; but if it is done with triggers, there is no choice (unless you remove the triggers and thereby diverge from the standard ICAT deployment). Also, you might want to spend time calculating the size of the investigations carried out by users but not care about the in-house ones, for instance.
Just some thoughts.

@EmilJunker
Contributor Author

Size attributes are not updated if NULL

The reason for this is obviously the arithmetic in MariaDB / MySQL

I think I fixed it now by using the MySQL IFNULL function and Oracle's NVL function, respectively.
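
For the MariaDB / MySQL side, the fix presumably amounts to treating NULL as 0 before adding. A sketch of the relevant statement inside the trigger body (names as in the illustrative example above; the actual script may differ):

-- Sketch of the trigger's update with NULL handling; not the actual script.
UPDATE DATASET
   SET DATASETSIZE = IFNULL(DATASETSIZE, 0) + IFNULL(NEW.FILESIZE, 0)
 WHERE ID = NEW.DATASET_ID;

The Oracle version would use NVL in the same places.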

@RKrahl
Member

RKrahl commented Jun 16, 2020

Size attributes are not updated if NULL

I think I fixed it now [...]

Tested. This fix seems to work, at least for MariaDB / MySQL.

The update script still leaves the size attributes NULL for Investigations and Datasets that have no Datafiles. But I guess that is OK, and it is consistent with the behavior of the triggers for Investigations and Datasets created after the upgrade.

@RKrahl
Member

RKrahl commented Jun 22, 2020

I found another case where the triggers (for MariaDB / MySQL) do not work correctly. If you have a dataset with some datafiles whose fileSize is not set, dataset.datasetSize will be 0. If you then update the datafiles, setting a positive fileSize, dataset.datasetSize and investigation.investigationSize are not updated accordingly. Consider the following example:

>>> # create a brand new investigation
>>> investigation = client.new("investigation", facility=facility, type=inv_type, name="sizetest", visitId="N/A", title="size attribute test")
>>> investigation.create()
>>> investigation.get("Investigation")
(investigation){
   createId = "simple/root"
   createTime = 2020-06-22 15:50:31+02:00
   id = 10
   modId = "simple/root"
   modTime = 2020-06-22 15:50:31+02:00
   name = "sizetest"
   title = "size attribute test"
   visitId = "N/A"
 }
>>> investigation.investigationSize is None
True
>>> # add a dataset with some files, but do not set the fileSize
>>> dataset = client.new("dataset", investigation=investigation, type=ds_type, name="testds", complete=False)
>>> for df_count in range(10):                
...     datafile = client.new("datafile", name="df_%04d" % df_count)
...     dataset.datafiles.append(datafile)
... 
>>> dataset.create()
>>> # check that dataset.datasetSize and investigation.investigationSize are 0
>>> dataset.get("Dataset")
(dataset){
   createId = "simple/root"
   createTime = 2020-06-22 15:51:31+02:00
   id = 771
   modId = "simple/root"
   modTime = 2020-06-22 15:51:31+02:00
   complete = False
   datasetSize = 0
   name = "testds"
 }
>>> investigation.get("Investigation")
(investigation){
   createId = "simple/root"
   createTime = 2020-06-22 15:50:31+02:00
   id = 10
   modId = "simple/root"
   modTime = 2020-06-22 15:50:31+02:00
   investigationSize = 0
   name = "sizetest"
   title = "size attribute test"
   visitId = "N/A"
 }
>>> # now update the datafiles, setting the fileSize
>>> query = Query(client, "Datafile", conditions={ "dataset.id": "= %d" % dataset.id }, includes="1")
>>> datafiles = client.search(query)
>>> assert len(datafiles) == 10
>>> for datafile in datafiles:
...     datafile.fileSize = 997
...     datafile.update()
... 
>>> # dataset.datasetSize and investigation.investigationSize should be updated now, but they aren't
>>> dataset.get("Dataset")
(dataset){
   createId = "simple/root"
   createTime = 2020-06-22 15:51:31+02:00
   id = 771
   modId = "simple/root"
   modTime = 2020-06-22 15:51:31+02:00
   complete = False
   datasetSize = 0
   name = "testds"
 }
>>> investigation.get("Investigation")
(investigation){
   createId = "simple/root"
   createTime = 2020-06-22 15:50:31+02:00
   id = 10
   modId = "simple/root"
   modTime = 2020-06-22 15:50:31+02:00
   investigationSize = 0
   name = "sizetest"
   title = "size attribute test"
   visitId = "N/A"
 }
>>> # verify the overall size of the datafiles
>>> ds_size_query = Query(client, "Datafile", conditions={ "dataset.id": "= %d" % dataset.id }, attribute="fileSize", aggregate="SUM")
>>> client.assertedSearch(ds_size_query)[0]
9970
>>> 

I have no idea why this is not working.

@EmilJunker
Contributor Author

EmilJunker commented Jun 23, 2020

I still found a case where the triggers (for MariaDB / MySQL) are not working correctly

The problem is that the trigger only updates the dataset.datasetSize and investigation.investigationSize if the fileSize has changed. This is checked like this:

ELSEIF NEW.FILESIZE != OLD.FILESIZE THEN

Evidently, if either the new fileSize or the old fileSize is NULL, the comparison evaluates to NULL rather than true, so the condition is not met.
This can be fixed by changing the above line to:

ELSEIF IFNULL(NEW.FILESIZE, 0) != IFNULL(OLD.FILESIZE, 0) THEN

Now the trigger should work correctly even if the fileSize is/was NULL.

I am not sure if Oracle has this problem too, but just to be safe I will change the Oracle trigger as well.

@RKrahl
Member

RKrahl commented Jun 23, 2020

Now the trigger should work correctly even if the fileSize is/was NULL.

Tested. This fix seems to work, at least for MariaDB / MySQL.

I am not sure if Oracle has this problem too, but just to be safe I will change the Oracle trigger as well.

Yes. In any case, I'd say we should try to keep both versions as similar as possible. Thus, I'd opt for adding the analogous fix to the Oracle version as well.

@RKrahl
Member

RKrahl commented Jun 23, 2020

Now I have also done some performance testing. The test script is available in icat-contrib. I ran the same script against the current icat.server release 4.10.0 and against an icat.server built from this branch with the DB triggers in place. I only tested it with a MariaDB backend.

Here is the output with the timings for the 4.10.0 release:

INFO: Test case 1: create datafiles having a positive fileSize
INFO: Test case 1: done 100 datasets having 1000 datafiles each
INFO: Test case 1: min/max/avg time per dataset: 23.413 s / 37.519 s / 27.161 s
INFO: Test case 2: create datasets with datafiles having a positive fileSize
INFO: Test case 2: done 100 datasets having 1000 datafiles each
INFO: Test case 2: min/max/avg time per dataset: 11.937 s / 16.303 s / 12.764 s
INFO: Test case 3: update datafile.fileSize from not set to a positive value
INFO: Test case 3: done 100 datasets having 1000 datafiles each
INFO: Test case 3: min/max/avg time per dataset: 13.346 s / 19.657 s / 16.722 s
INFO: Test case 4: update datafile.fileSize to a different positive value
INFO: Test case 4: done 100 datasets having 1000 datafiles each
INFO: Test case 4: min/max/avg time per dataset: 13.322 s / 19.460 s / 16.654 s

And here for this PR code:

INFO: Test case 1: create datafiles having a positive fileSize
INFO: Test case 1: done 100 datasets having 1000 datafiles each
INFO: Test case 1: min/max/avg time per dataset: 25.233 s / 41.447 s / 27.944 s
INFO: Test case 2: create datasets with datafiles having a positive fileSize
INFO: Test case 2: done 100 datasets having 1000 datafiles each
INFO: Test case 2: min/max/avg time per dataset: 12.375 s / 24.787 s / 13.264 s
INFO: Test case 3: update datafile.fileSize from not set to a positive value
INFO: Test case 3: done 100 datasets having 1000 datafiles each
INFO: Test case 3: min/max/avg time per dataset: 13.975 s / 19.984 s / 17.234 s
INFO: Test case 4: update datafile.fileSize to a different positive value
INFO: Test case 4: done 100 datasets having 1000 datafiles each
INFO: Test case 4: min/max/avg time per dataset: 14.281 s / 20.696 s / 16.886 s

As you can see, as expected, there is a performance penalty from the triggers. But it is barely measurable and lies within the range of fluctuation.

I'd appreciate someone trying it with an Oracle backend.

@antolinos

Thanks @RKrahl

Question, what is the difference between test case 1 and test case 2? I see in the code:

1. Create datafiles having a positive fileSize.
2. Create datasets with datafiles (in one call using cascading) having
   a positive fileSize.

So do I interpret correctly that in test case 1 the datasets already exist and the files are attached to them, while in test case 2 the datasets are created together with their datafiles?

INFO: Test case 1: create datafiles having a positive fileSize
INFO: Test case 1: done 100 datasets having 1000 datafiles each
INFO: Test case 1: min/max/avg time per dataset: 25.233 s / 41.447 s / 27.944 s

INFO: Test case 2: create datasets with datafiles having a positive fileSize
INFO: Test case 2: done 100 datasets having 1000 datafiles each
INFO: Test case 2: min/max/avg time per dataset: 12.375 s / 24.787 s / 13.264 s

Why is test 2 twice as fast?

@RKrahl
Member

RKrahl commented Jun 23, 2020

@antolinos, the difference between case 1 and case 2 is that case 2 uses cascading. In case 1, a dataset is created first and then 1000 datafiles are created in that dataset, each with a separate call. In case 2, the dataset and its 1000 datafiles are created in one single call.

@RKrahl
Member

RKrahl commented Jun 25, 2020

I'd say it would be rather easy to also add a fileCount to Investigation and Dataset as suggested in #238.

@RKrahl
Member

RKrahl commented Jun 25, 2020

There has been the suggestion in the monthly meeting that the triggers should be optional.

@dfq16044

Here are the use cases for DLS:

  • Currently a user cannot download an entire investigation; they need to select datasets. The idea was to add a "select all datasets" option, or to let the user select an investigation for download when its size is lower than 10 TB. We cannot do this at the moment because calculating the size of an investigation is not efficient.
  • We started to use the number of files per dataset in TopCat instead of the dataset size, for performance reasons.
  • The calculation of the total size of an investigation sits in a hidden tab, and we need to press a button to trigger the calculation.
  • In the download cart, the number of files and the volume are calculated before the download. To get a sense of progress, the user has to watch the number of files or the size change in the front end before they can continue to the next stage of the download.

@antolinos

There has been the suggestion in the monthly meeting that the triggers should be optional.

I think it is a good idea

@dfq16044

At DLS, we already have a trigger on the ICAT Datafile table, but it is used for a different purpose. Each time a datafile is added, updated or deleted, it adds/removes/updates datafiles in another database schema called FUSE.
FUSE is used to show the datafiles available on tape in the DLS filesystem. It has been there for many years and the ingest performance is OK.
Here is the trigger SQL code present in our test system:

create or replace TRIGGER "TESTICAT_DLS45"."UPDATE_FILESYSTEM_LOG"
after insert or delete or update of location or update of filesize on datafile
for each row
begin
  case
    when inserting then
      insert into TESTICAT_DLS45.filesystem_log(operation, new_location, seq)
        values ('I', :NEW.location, TESTICAT_DLS45.filesystem_seq.nextval);
    when deleting then
      insert into TESTICAT_DLS45.filesystem_log(operation, old_location, seq)
        values ('D', :OLD.location, TESTICAT_DLS45.filesystem_seq.nextval);
      insert into TESTICAT_DLS45.icatdls44_report_deleted_files(dataset_id, filesize, file_id, operation)
        values (:OLD.dataset_id, :OLD.filesize, :OLD.id, 'd');
    when updating('location') then
      insert into TESTICAT_DLS45.filesystem_log(operation, old_location, new_location, seq)
        values ('U', :OLD.location, :NEW.location, TESTICAT_DLS45.filesystem_seq.nextval);
    when updating('filesize') then
      insert into TESTICAT_DLS45.icatdls44_report_deleted_files(dataset_id, filesize, file_id, operation)
        values (:OLD.dataset_id, :OLD.filesize - :NEW.filesize, :OLD.id, 'u');
  end case;
end;

@EmilJunker
Contributor Author

As suggested in #238, this PR also adds fileCount columns to the Investigation and Dataset tables.

The triggers have been modified to automatically update these fileCount columns in addition to the datasetSize and investigationSize columns whenever a datafile or dataset is added/modified/deleted.

The upgrade script automatically initializes the fileCount, datasetSize and investigationSize columns for all existing datasets and investigations.

As discussed in the last meeting, the triggers are now optional, so the upgrade script no longer creates them by default. Instead, there are create_triggers_*.sql and drop_triggers_*.sql scripts available for both MySQL and Oracle that do the job.

These scripts are also the easiest way to add the triggers after a fresh ICAT installation, or to remove them at any point.

@RKrahl
Member

RKrahl commented Aug 11, 2021

It has been decided in today's meeting:

  • to use simply size rather than investigationSize and datasetSize as attribute names,
  • to move the initialization of the new attributes from the upgrade scripts to the create trigger scripts.

The rationale for the second item is that, for a site that decides not to install the triggers, it is better to have the attributes not set at all than to have values that will never be updated and will soon be completely wrong. Any site that chooses not to install the triggers should decide for itself whether or not to initialize the attributes.

@bodinm

bodinm commented Aug 18, 2021

In Oracle, there are some errors (ORA-00972: identifier is too long) when executing the scripts src/main/scripts/upgrade_oracle_5_0.sql and src/main/scripts/create_triggers_oracle.sql, due to the procedure and trigger names (they must be at most 30 characters):

CREATE TRIGGER RECALCULATE_SIZES_FILECOUNT_ON_DATAFILE_INSERT AFTER INSERT ON DATAFILE
               *
ERROR at line 1:
ORA-00972: identifier is too long


CREATE TRIGGER RECALCULATE_SIZES_FILECOUNT_ON_DATAFILE_UPDATE AFTER UPDATE ON DATAFILE
               *
ERROR at line 1:
ORA-00972: identifier is too long


CREATE TRIGGER RECALCULATE_SIZES_FILECOUNT_ON_DATAFILE_DELETE AFTER DELETE ON DATAFILE
               *
ERROR at line 1:
ORA-00972: identifier is too long


CREATE TRIGGER RECALCULATE_SIZES_FILECOUNT_ON_DATASET_UPDATE AFTER UPDATE ON DATASET
               *
ERROR at line 1:
ORA-00972: identifier is too long


CREATE TRIGGER RECALCULATE_SIZES_FILECOUNT_ON_DATASET_DELETE AFTER DELETE ON DATASET
               *
ERROR at line 1:
ORA-00972: identifier is too long

@bodinm

bodinm commented Aug 19, 2021

I did some performance testing on Oracle in my local environment. We can do more tests on our test server later. I ran the script against the icat.server release 4.11.1 and against an icat.server 5.0.0-SNAPSHOT built from this branch (with the triggers, which I locally renamed for the tests; see my previous comment #233 (comment) about the Oracle error).

Here is the output with the timings for the 4.11.1 release:

INFO: Test case 1: create datafiles having a positive fileSize
INFO: Test case 1: done 100 datasets having 1000 datafiles each
INFO: Test case 1: min/max/avg time per dataset: 10.746 s / 21.662 s / 11.686 s
INFO: Test case 2: create datasets with datafiles having a positive fileSize
INFO: Test case 2: done 100 datasets having 1000 datafiles each
INFO: Test case 2: min/max/avg time per dataset: 4.704 s / 6.540 s / 5.373 s
INFO: Test case 3: update datafile.fileSize from not set to a positive value
INFO: Test case 3: done 100 datasets having 1000 datafiles each
INFO: Test case 3: min/max/avg time per dataset: 8.036 s / 10.965 s / 9.818 s
INFO: Test case 4: update datafile.fileSize to a different positive value
INFO: Test case 4: done 100 datasets having 1000 datafiles each
INFO: Test case 4: min/max/avg time per dataset: 8.474 s / 10.484 s / 9.750 s

And here for this PR's code (5.0.0-SNAPSHOT with triggers):

INFO: Test case 1: create datafiles having a positive fileSize
INFO: Test case 1: done 100 datasets having 1000 datafiles each
INFO: Test case 1: min/max/avg time per dataset: 11.552 s / 22.449 s / 13.296 s
INFO: Test case 2: create datasets with datafiles having a positive fileSize
INFO: Test case 2: done 100 datasets having 1000 datafiles each
INFO: Test case 2: min/max/avg time per dataset: 4.935 s / 7.009 s / 5.437 s
INFO: Test case 3: update datafile.fileSize from not set to a positive value
INFO: Test case 3: done 100 datasets having 1000 datafiles each
INFO: Test case 3: min/max/avg time per dataset: 8.692 s / 12.123 s / 10.832 s
INFO: Test case 4: update datafile.fileSize to a different positive value
INFO: Test case 4: done 100 datasets having 1000 datafiles each
INFO: Test case 4: min/max/avg time per dataset: 8.825 s / 11.859 s / 10.784 s

So overall it's the same trend as for MySQL.

@EmilJunker
Contributor Author

use simply size rather than investigationSize and datasetSize as attribute names

Done 04d0b92

move the initialization of the new attributes from the upgrade scripts to the create trigger scripts

Done 7432b51

procedures or triggers names (must be at most 30 characters)

Fixed 6fceb3e

@kevinphippsstfc
Contributor

I'm adding the results from running these tests against our Oracle development database.

The database used was a copy of the Diamond ICAT schema. It contained a reduced number of Datafiles but still over 500 million!

The results from the first run on ICAT 4.10.0 (no triggers):

INFO: Test case 1: create datafiles having a positive fileSize
INFO: Test case 1: done 100 datasets having 1000 datafiles each
INFO: Test case 1: min/max/avg time per dataset: 28.765 s / 32.471 s / 30.265 s
INFO: Test case 2: create datasets with datafiles having a positive fileSize
INFO: Test case 2: done 100 datasets having 1000 datafiles each
INFO: Test case 2: min/max/avg time per dataset: 10.921 s / 11.538 s / 11.158 s
INFO: Test case 3: update datafile.fileSize from not set to a positive value
INFO: Test case 3: done 100 datasets having 1000 datafiles each
INFO: Test case 3: min/max/avg time per dataset: 21.624 s / 25.210 s / 23.240 s
INFO: Test case 4: update datafile.fileSize to a different positive value
INFO: Test case 4: done 100 datasets having 1000 datafiles each
INFO: Test case 4: min/max/avg time per dataset: 21.628 s / 23.580 s / 22.465 s

And from the second run on an ICAT 5.0.0 snapshot (with triggers):

INFO: Test case 1: create datafiles having a positive fileSize
INFO: Test case 1: done 100 datasets having 1000 datafiles each
INFO: Test case 1: min/max/avg time per dataset: 27.483 s / 31.766 s / 29.059 s
INFO: Test case 2: create datasets with datafiles having a positive fileSize
INFO: Test case 2: done 100 datasets having 1000 datafiles each
INFO: Test case 2: min/max/avg time per dataset: 11.325 s / 11.859 s / 11.614 s
INFO: Test case 3: update datafile.fileSize from not set to a positive value
INFO: Test case 3: done 100 datasets having 1000 datafiles each
INFO: Test case 3: min/max/avg time per dataset: 25.899 s / 28.538 s / 26.472 s
INFO: Test case 4: update datafile.fileSize to a different positive value
INFO: Test case 4: done 100 datasets having 1000 datafiles each
INFO: Test case 4: min/max/avg time per dataset: 25.794 s / 29.515 s / 26.600 s

So the strange anomaly here is that Test 1 was actually faster on the second run! This is not unusual though; I have done similar tests before and had results like this. I put it down to the fact that the ICAT is running on a VM sharing resources with other VMs on the same hypervisor, and it is connected across the site network to our departmental development Oracle database, which also hosts numerous other databases. So the results vary depending on how busy the VM cluster, the network and the Oracle database are at any particular point in time.

The table below summarises the results from HZB, ESRF and STFC with the numbers being the percentage increase in time taken to run the test with the triggers in place (the negative number indicating the decrease in the first STFC test).

Site    Test 1    Test 2    Test 3    Test 4
HZB          3         4         3         1
ESRF        13         1        10        11
STFC        -4         4        14        18

The good news from a DLS point of view is that Test 2 using the createMany method (used to create most Datafiles in the DLS ICAT) shows both the smallest increase in time taken and the smallest variation across the 3 sites.

@kevinphippsstfc kevinphippsstfc merged commit fcd8edd into icatproject:master Jun 8, 2022
Successfully merging this pull request may close these issues:

  • Add volume and fileCount to Investigation and Dataset
  • Additional column for investigation and dataset sizes