Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(web): determine duplication of upload on client #8825

Merged
merged 4 commits into from
May 2, 2024

Conversation

truongsinh
Copy link
Contributor

This PR brings feature parity of determining asset duplicates to web (only via HTTPS, not HTTP, due to availability crypto API).

That being said, the whole design of checkBulkUpload followed by upload requiring 2 roundtrips to the servers has performance and functionality problems. We instead can have a single endpoint that does both checks and uploads using headers for example, x-immich-checksum. Server can then process headers first, and determine whether to consume body stream, or not at all, and return 201. With that approach, we can save bandwidth for duplicated assets even from different users/owners. That is, on server, we can check the whole DB for the checksum. If checksum is found, but the assets belong to another user, we can do hardlink or copy --reflink=always (instead of uploading the whole file again), and create DB record. It would save both bandwidth and storage.

@truongsinh truongsinh changed the title feat[web]: determine duplication of upload on client feat(web): determine duplication of upload on client Apr 15, 2024
@bo0tzz
Copy link
Member

bo0tzz commented Apr 16, 2024

I like the idea of the checksum header, but I think it might come with a few caveats. Can you create a discussion either here or on discord with a proposal so we can discuss it more in depth?

@truongsinh
Copy link
Contributor Author

I like the idea of the checksum header, but I think it might come with a few caveats. Can you create a discussion either here or on discord with a proposal so we can discuss it more in depth?

Will do. Meanwhile, I hope we can get this merged soon. The checksum header would need most of this PR (calculating the checksum on the client) anyway. There's a prettier build fail. Please let me know if we need any other change for this PR to be merge-ready.

@jrasm91
Copy link
Contributor

jrasm91 commented Apr 16, 2024

Why exactly do we need/want to try to implement hashing in the web client...?

@truongsinh
Copy link
Contributor Author

truongsinh commented Apr 16, 2024

Why exactly do we need/want to try to implement hashing in the web client...?

I would argue it is the same reason we implemented hashing on mobile client or CLI client (hence, feature parity). That is, to be bandwidth efficient. If a 2GB files with the hash already exist on server client-hashing, the client saves 2GB bandwidth because it does not need to transmit those data.

@jrasm91
Copy link
Contributor

jrasm91 commented Apr 16, 2024

When you talk about bandwidth I assume you mean while accessing a remote service, since this isn't really a concern on a local network.

I would argue that we probably don't even want exact parity between the clients. The mobile app, for example, has an offline database and sync process which enables offline usage. Also, the mobile app doesn't use the timebucket apis. It loads everything locally, while the web client depends on the remote server for everything. In general they serve completely different use cases and I don't know if there is valid use case where remote users are going to use the web client to upload assets (that are large files) that are already on the server.

@truongsinh
Copy link
Contributor Author

When you talk about bandwidth I assume you mean while accessing a remote service, since this isn't really a concern on a local network.

Not necessarily. For instance, even with WAN setup assuming a 10MBps upload speed, uploading a single 2GB GoPro file could take as long as 26 minutes. On a LAN, using a somewhat repurposed server, this duration reduces to about 5 minutes for a 2GB file. My typical use case involves uploading large volumes of footage (GoPro, Drone, etc.) as soon as possible when traveling. Often, I'm unable to upload everything daily due to bandwidth constraints. Once home, I continue uploading to ensure all content is transferred to Immich, frequently spending up to 30 minutes to upload hundreds of files, only to discover that fewer than 10% are new assets.

The mobile app, for example, has an offline database and sync process which enables offline usage

So why can't web app (esp moving toward PWA) have the same 😉. Anyway, it's a different topic. When talking about feature parity, I mentioned both Mobile and (Desktop) CLI as well. What is your thought on feature disparity between Web app and CLI upload? Currently, using the CLI can mitigate some of my issues with bandwidth and time, both on LAN and WAN, though it’s not the most user-friendly solution to force upon users.

Also, keep in mind web client hashing is pre-req for #8847 as well.

On the other hand, assuming from your perspective, you see no value of web client hashing. Could you express any concern or harm having this on Web client? That is, I want to learn your perspective regarding any reason "not to" have this on web client. Probably if we can sort out your concern, we can get it merged even if some of us don't find value in it.

@bo0tzz
Copy link
Member

bo0tzz commented Apr 17, 2024

What is your thought on feature disparity between Web app and CLI upload?

We've generally had the stance that for anything beyond uploading just a handful of assets, people should use the CLI rather than the web. The CLI can do folder crawling, make better use of parallelization etc, so for large uploads it's better suited.

@truongsinh
Copy link
Contributor Author

truongsinh commented Apr 17, 2024

We've generally had the stance that for anything beyond uploading just a handful of assets, people should use the CLI rather than the web. The CLI can do folder crawling, make better use of parallelization etc, so for large uploads it's better suited.

It seems what you describe is a large number of small files ("handful of assets," "folder crawling"). My use case is the opposite, a small number of large files (Drone, 360, GoPro) in a single folder (DCIM). Say 10 files of 4GB (40GB), not 20'000 files of 2MB.

make better use of parallelization

It might be another topic, but given current browser technology, I don't think we are limited to anything in terms of parallelization/concurrency. Probably, it's more about whether we want it in the browser and the maintenance effort.

@jrasm91
Copy link
Contributor

jrasm91 commented Apr 17, 2024

Do you mind doing some performance benchmarks to see the implications of hashing vs no hashing? Both on a mobile client and on a desktop client, on a local and remote connection?

The idea is basically in most situations there is little value add to hashing when uploading files on the web. Basically this means unnecessarily hashing files in most situations.

There is a specific situation that you are describing: uploading large, duplicate videos on the web on a slow/remote connection. The consequence is (in addition to more code to maintain/debug/etc) that everything else has to be hashed, always.

Maybe a middle ground would be something like adding a user preference for pre-hashing uploads. It is off by default. If enabled, it does client-side hashing (when available). This is how the CLI is supposed to work afaik. It has pre-hashing options, but that can be disabled (if you know everything is new).

FWIW I would have preferred to see #8847 built first as this is something the mobile and cli can take advantage of right now.

@truongsinh
Copy link
Contributor Author

truongsinh commented Apr 17, 2024

Do you mind doing some performance benchmarks to see the implications of hashing vs no hashing? Both on a mobile client and on a desktop client, on a local and remote connection?

Yes, I can spend some time doing so. However, I am not clear on what metrics you are looking at. This is my assumption, please correct:

Setup:

  • CLI/Mobile client (hashing on) + Server (hashing on)
  • CLI/Mobile client (hashing off) + Server (hashing on)
  • CLI/Mobile client (hashing on) + Server (hashing off) - basically, server blindly trusts mobile to calculate hash correctly
  • Web client (hashing on) + Server (hashing on)
  • Web client (hashing off) + Server (hashing on)
  • Web client (hashing on) + Server (hashing off) - basically, server blindly trusts mobile to calculate hash correctly

Actually, is it fine if I do web and CLI only, but not mobile? It's easier for me to tweak typescript code on server/cli/web, rather than flutter and rebuild android app.

Test data:

  • 10 files of 4GB (40GB)
  • 20'000 files of ~2MB

Maybe a middle ground would be something like adding a user preference for pre-hashing uploads. It is off by default. If enabled, it does client-side hashing (when available). This is how the CLI is supposed to work afaik.

Yes, this makes sense. I would make that option to be ternary:

  • Hashing on server only
  • Hashing on both client and server (that is, even if client send hashing, server will do it one more time, and compare, and probably "scold" client the 2 hashes do no match)
  • Prefer hashing on client (and only hashing on server when client does not provide)

Basically this means unnecessarily hashing files in most situations.

AFAIK we already do hashing (on server) regardless, it's about moving hashing (from server) to (and trust) client if the operator wants to.

@bo0tzz
Copy link
Member

bo0tzz commented Apr 17, 2024

Prefer hashing on client (and only hashing on server when client does not provide)

This isn't really an option. If we rely on the client and it provides a bad hash, a bunch of existing assumptions go out the window.

@NicholasFlamy
Copy link
Contributor

Do you mind doing some performance benchmarks to see the implications of hashing vs no hashing? Both on a mobile client and on a desktop client, on a local and remote connection?

Yes, I can spend some time doing so. However, I am not clear on what metrics you are looking at. This is my assumption, please correct:

Setup:

  • CLI/Mobile client (hashing on) + Server (hashing on)
  • CLI/Mobile client (hashing off) + Server (hashing on)
  • CLI/Mobile client (hashing on) + Server (hashing off) - basically, server blindly trusts mobile to calculate hash correctly
  • Web client (hashing on) + Server (hashing on)
  • Web client (hashing off) + Server (hashing on)
  • Web client (hashing on) + Server (hashing off) - basically, server blindly trusts mobile to calculate hash correctly

Actually, is it fine if I do web and CLI only, but not mobile? It's easier for me to tweak typescript code on server/cli/web, rather than flutter and rebuild android app.

Test data:

  • 10 files of 4GB (40GB)
  • 20'000 files of ~2MB

Maybe a middle ground would be something like adding a user preference for pre-hashing uploads. It is off by default. If enabled, it does client-side hashing (when available). This is how the CLI is supposed to work afaik.

Yes, this makes sense. I would make that option to be ternary:

  • Hashing on server only
  • Hashing on both client and server (that is, even if client send hashing, server will do it one more time, and compare, and probably "scold" client the 2 hashes do no match)
  • Prefer hashing on client (and only hashing on server when client does not provide)

Basically this means unnecessarily hashing files in most situations.

AFAIK we already do hashing (on server) regardless, it's about moving hashing (from server) to (and trust) client if the operator wants to.

I think the point about performance is will having a file on client take longer for a mix of small and large files than how long it takes the way immich currently works. So how will this PR affect performance?
I would suggest an option to enable or disable client hashing. And if you want to implement it to have the server trust the client hash then you'd need to put this option in administrative settings for security purposes. Maybe 2 separate options. One in client for whether or not to hash, and then an administrator setting about trusting the client hash once the server receives the file. (So if you disabled this but the client had client hash enabled, if it determined it was not duplicate and uploaded the file, the server would then have to has the file even though the client already did that because you have the option disabled.) And you could consult with the immich team on whether or not these should be enabled/disabled by default.

FYI, I'm not part of the immich team, I'm just adding my idea.

@jrasm91
Copy link
Contributor

jrasm91 commented Apr 17, 2024

We're not going to turn the server hashing off either way. I'm fine if you want to skip the benchmark and just add clientside hashing to the web behind a flag that defaults to false.

The server PR for dedupe by hash checksum can be separate from the mobile app using it. There is no expectation that you do the mobile implementation either

@NicholasFlamy
Copy link
Contributor

We're not going to turn the server hashing off either way.

That simplifies things.

@SandiyosDev
Copy link
Contributor

SandiyosDev commented Apr 18, 2024

Trying to balance both sides here...
How about performing a size check first (which is a lot less expensive), then proceed with client-side hashing on size match, along with giving users the option to specify whether they want hashing enabled during an upload prompt, and only set hashing as available on certain sizes in Immich's Admin panel.
This will add an additional trip for traffic, but it can be a functionality that can be toggled at will on serverside.

tldr; add size check prior to hashing, make hashing conditional clientside, and if serverside hashing is also enabled, only hash when the file is between a user-defined threshold

@jrasm91
Copy link
Contributor

jrasm91 commented May 2, 2024

We instead can have a single endpoint that does both checks and uploads using headers for example, x-immich-checksum

I actually implemented this, but it doesn't work. Turns out the browser basically dies when the server terminates the request early. Chrome does not provide a way to access the response body (or anything related to the response at all) and just shows a connection reset network error.

https://stackoverflow.com/a/18370751

That is, on server, we can check the whole DB for the checksum. If checksum is found, but the assets belong to another user, we can do hardlink

There are security implications to the optimization, so we won't be doing this one either.

@jrasm91
Copy link
Contributor

jrasm91 commented May 2, 2024

This seems to work pretty good. We can leave it enabled by default for secure contexts, etc. and see how it goes.

web/src/lib/stores/upload.ts Show resolved Hide resolved
@jrasm91 jrasm91 merged commit ec4e6a1 into immich-app:main May 2, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants