Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a metadata service to store file metadata #31839

Merged
merged 2 commits into from Apr 14, 2022
Merged

Conversation

CarlSchwan
Copy link
Member

@CarlSchwan CarlSchwan commented Apr 4, 2022

@CarlSchwan CarlSchwan added the 2. developing Work in progress label Apr 4, 2022
@CarlSchwan CarlSchwan added this to the Nextcloud 24 milestone Apr 4, 2022
@CarlSchwan CarlSchwan self-assigned this Apr 4, 2022
@CarlSchwan CarlSchwan marked this pull request as draft April 4, 2022 21:16
@blizzz blizzz mentioned this pull request Apr 7, 2022
@CarlSchwan CarlSchwan marked this pull request as ready for review April 8, 2022 14:37
core/Application.php Outdated Show resolved Hide resolved
lib/private/Server.php Outdated Show resolved Hide resolved
@juliushaertl
Copy link
Member

Just curious, is searching for files by metadata something in scope? I could imagine this a useful scenario for future enhancements of the photos app to search by location for example, as with JSON-only metadata db field that would not scale.

@CarlSchwan
Copy link
Member Author

Just curious, is searching for files by metadata something in scope? I could imagine this a useful scenario for future enhancements of the photos app to search by location for example, as with JSON-only metadata db field that would not scale.

Unfortunately, this seems to be a bit tricky. If we want to search photos by location in an efficient way we need more than just the simple GPS coordinate but we need to map this to a geographic area. This is a bit tricky to implement but we can inspire outself from something similar I have in Koko: https://invent.kde.org/graphics/koko/-/blob/master/src/reversegeocoder.cpp#L30 (this requires to be done in a background job as loading the data is quite expensive already in pure C/C++ with an optimized kdtree data structure)

This geographic area should probably be stored in a separate table e.g. https://invent.kde.org/graphics/koko/-/blob/master/src/imagestorage.cpp#L132

But the biggest issue is that while this is fairly simple to implement in the one user use-case, for a Nextcloud we need to retrieve the list of locations or the list of photos in one location on a per user basis and this is really tricky unless we limit ourself only to files stored in the user storage (so no group folders, no sharing, no circles ,...). @PVince81 @AndyScherzinger Would this is be ok?

In that case, we could adopt the following DB structure for location information.

CREATE TABLE oc_file_metadata (
   id INTEGER,
   group TEXT,
   data TEXT, # nullable
   foreign INTEGER, # nullable
   UNIQUE(id, group)
)

CREATE TABLE oc_location (
    id INTEGER PRIMARY KEY,
    country TEXT,
    state TEXT,
    city TEXT,
    UNIQUE(country, state, city)
)

The PHP API needs to be expanded to describe if the data is directly retrievable in the JSON data column, or to be joined with another table (e.g for locations with the oc_location table). To be able only to retrieve the locations for one user, it should be possible to join with oc_storage and oc_filecache to get only the ids of files from the user. This might still be expensive when a user has a log of photos but I don't really see a better way to do it, unfortunately :(

@AndyScherzinger
Copy link
Member

I think the location data needs a more detailed discussion, so from my perspective we can skip that for now. The only thing we "need" for 24 is height/width for the Clients to implement a different layout for their media views. (kinda like a gallery).

So if there if leaving location out doesn't create a future blocker just skip it for now. Because this needs a lot more thinking from my perspective because with with/height it is just extraction+exposure, for anything else it is likely more than that, like "places" (like you also outlined), "people" only other simple info I can come up with would be certain exif data (like exposure, aperture, cam-model, lense) but since we also don't make use of that for now, this can also be skipped.

So from my point of view for v24 - keep it simple by limiting it to width/height only. What do you think @PVince81 ?

@PVince81
Copy link
Member

yes, keep it simple for now.

if we need to cache/compute more data for GPS search later, it will be a separate task anyway

@PVince81
Copy link
Member

had a quick chat with @CarlSchwan about the location data:

  1. I don't think it's a good idea to link oc_location to oc_file_metadata because the latter table also contains non-GPS data, so it feels a bit weird to link them there

  2. an alternative would be to connect oc_locations with oc_filecache. Since it's a 1-N relationship, we'd need a new table oc_location_mapping or similar to connect: oc_filecache->oc_location_mapping->oc_locations. This way it becomes possible to find all file ids for a specific location

We both agreed that 2) is a better approach and will remove the need to add anything else to oc_file_metadata, thus deblocking this PR.

@tobiasKaminsky
Copy link
Member

@CarlSchwan can I test this somehow already?
How can I regenerate metadata for existing images?

@CarlSchwan
Copy link
Member Author

@CarlSchwan can I test this somehow already? How can I regenerate metadata for existing images?

Just run the branch and you should be able to see the metadata with the PROPFIND/SEARCH parameter.

Unfortunately, the metadata regeneration job is not implemented yet and is task for another PR (probably not 24.0.0) so you need to handle the case where the metadata are not available

@tobiasKaminsky
Copy link
Member

Unfortunately, the metadata regeneration job is not implemented yet and is task for another PR (probably not 24.0.0) so you need to handle the case where the metadata are not available

So it might be that clients receive e.g. 500 images, but only 200 images have a valid metadata.
Is this something clients can now before?
Media searches are done in batches, so it might be that first batch has valid metadata and thus client shows new view (which uses metadata), but then second batch is without valid metadata and thus the view is mixed up…

@jancborchardt ^ as we spoke in our call about it, but I have no idea how to display this hybrid view

@jancborchardt
Copy link
Member

@CarlSchwan @PVince81 btw what is the reason for the metadata capability being optional? Customer requirement (don’t see that in the customer issue), not running on all setups, or something else?
(Just asking cause wouldn’t it be easier and more consistent to just always have it?)

@PVince81
Copy link
Member

@CarlSchwan @PVince81 btw what is the reason for the metadata capability being optional? Customer requirement (don’t see that in the customer issue), not running on all setups, or something else? (Just asking cause wouldn’t it be easier and more consistent to just always have it?)

it's enabled by default.
the config option is like for previews to be able to disable it if there are use cases where metadata is not relevant, or for performance concerns

@CarlSchwan
Copy link
Member Author

@CarlSchwan @PVince81 btw what is the reason for the metadata capability being optional? Customer requirement (don’t see that in the customer issue), not running on all setups, or something else? (Just asking cause wouldn’t it be easier and more consistent to just always have it?)

Also we want to support in the future the possibility for apps to provide their own metadata extractor and dynamically indicate in the capability that metadata is available for each mimetype

Copy link
Member

@PVince81 PVince81 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

please see my one comment about the interface

@tobiasKaminsky
Copy link
Member

@CarlSchwan if there is none metadata, "[ ]" is returned, but with metadata it is:
{"width":960,"height":1280}

Json parser gets a hiccup here, as it expects "{}" in case of empty.

Signed-off-by: Carl Schwan <carl@carlschwan.eu>
@CarlSchwan
Copy link
Member Author

@CarlSchwan if there is none metadata, "[ ]" is returned, but with metadata it is: {"width":960,"height":1280}

Json parser gets a hiccup here, as it expects "{}" in case of empty.

@tobiasKaminsky this should be fixed. Strange that the json parser doesn't support [] :(

And update autoloader

Signed-off-by: Carl Schwan <carl@carlschwan.eu>
@PVince81 PVince81 merged commit 3ca7971 into master Apr 14, 2022
@PVince81 PVince81 deleted the feat/metadata-server branch April 14, 2022 11:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2. developing Work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants