Write checksum to filecache table #13397

Conversation

tomasz-grobelny
Contributor

Currently the filecache table contains a checksum column, but it is empty. This pull request updates the value in this column during file modification and when running occ files:scan.

Information about file checksums can be used, e.g., for finding duplicate files.

Signed-off-by: Tomasz Grobelny <tomasz@grobelny.net>
@kesselb
Contributor

kesselb commented Jan 6, 2019

Ref #11138

@icewind1991 icewind1991 left a comment
Member

A naive approach like this will be terribly inefficient when dealing with large or remote files, as the file will have to be downloaded from the remote again just to calculate the checksum.

The only efficient way to get the checksum is to re-use the upload stream and calculate the checksum from that.
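
A minimal sketch of that stream-based approach, assuming only core PHP stream and hash functions; the function name and how the result would reach the filecache are illustrative, not part of this PR:

```php
<?php
// Hash the bytes while they are written, so the file never has to be read
// back (or re-downloaded from a remote storage) just to compute the checksum.
function writeWithChecksum($uploadStream, $targetStream): string {
    $ctx = hash_init('md5');
    while (!feof($uploadStream)) {
        $chunk = fread($uploadStream, 8192);
        if ($chunk === false || $chunk === '') {
            break;
        }
        hash_update($ctx, $chunk);     // feed the same bytes to the hash...
        fwrite($targetStream, $chunk); // ...that are written to storage
    }
    return hash_final($ctx); // hex digest of the uploaded data
}
```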

@MorrisJobke
Member

Also we should make it possible to use different algorithms, or at least encode the algorithm into the checksum to be able to change it in the future.

Signed-off-by: Tomasz Grobelny <tomasz@grobelny.net>
@tomasz-grobelny
Contributor Author

@icewind1991 How about this approach (see my new commit)? It now calculates the MD5 sum on the fly while uploading the file. The implementation is obviously not yet complete (e.g. I still need to handle chunked upload and restructure the code), but I wanted to ask whether the approach would be acceptable.

@tomasz-grobelny
Contributor Author

Also we should make it possible to use different algorithms, or at least encode the algorithm into the checksum to be able to change it in the future.

Just to make it clear: something like "md5:290dbe614e79b16b6204975614939c8e" would be ok?

@MorrisJobke
Member

Just to make it clear: something like "md5:290dbe614e79b16b6204975614939c8e" would be ok?

Yep - something like that I had in mind.
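
For reference, a hypothetical pair of helpers for that prefixed "algo:hexdigest" format; the function names and fallback behaviour are assumptions for illustration only:

```php
<?php
function formatChecksum(string $algo, string $hex): string {
    return $algo . ':' . $hex; // e.g. "md5:290dbe614e79b16b6204975614939c8e"
}

function parseChecksum(string $stored): ?array {
    if ($stored === '' || strpos($stored, ':') === false) {
        return null; // empty or legacy value without an algorithm prefix
    }
    [$algo, $hex] = explode(':', $stored, 2);
    // Only accept algorithms PHP actually supports.
    return in_array($algo, hash_algos(), true) ? [$algo, $hex] : null;
}
```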

@icewind1991
Member

The method of calculating the hash is good.

Keep in mind that there are apps that use the lower-level APIs to write file data; any logic that generates checksums will have to ensure that the checksum is either re-calculated or invalidated on every write.

This is one of the main reasons why this hasn't been added yet: if we end up missing a single code path that writes to the file and forget to update the checksum, it'll lead to Nextcloud thinking the file is corrupted.
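
A sketch of the invalidation side of that concern: any write path that bypasses the hashing code would clear the stored checksum rather than leave a stale one behind. The ICache-style update() call and the empty string as an "unknown" marker are assumptions here, not the PR's actual code:

```php
<?php
use OCP\Files\Cache\ICache;

function invalidateChecksum(ICache $cache, int $fileId): void {
    // An empty checksum means "unknown"; readers should fall back to
    // recalculating instead of flagging the file as corrupted.
    $cache->update($fileId, ['checksum' => '']);
}
```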

@tomasz-grobelny
Contributor Author

Keep in mind that there are apps that use the lower-level APIs to write file data; any logic that generates checksums will have to ensure that the checksum is either re-calculated or invalidated on every write.

Can you suggest any paths/apps that should be checked? What comes to my mind is file copy/move in the Files app, upload, chunked upload, and editing a text file in the text editor app. Anything more?

This is one of the main reasons why this hasn't been added yet: if we end up missing a single code path that writes to the file and forget to update the checksum, it'll lead to Nextcloud thinking the file is corrupted.

Well, for now nothing actually uses those checksums, so a wrong checksum (recalculated on download) could be treated as an internal warning, and the number of such warnings could become part of the usage statistics. Would that make sense?

Signed-off-by: Tomasz Grobelny <tomasz@grobelny.net>
Signed-off-by: Tomasz Grobelny <tomasz@grobelny.net>
@tomasz-grobelny
Contributor Author

Keep in mind that there are apps that use the lower-level APIs to write file data; any logic that generates checksums will have to ensure that the checksum is either re-calculated or invalidated on every write.

Can you suggest any paths/apps that should be checked? What comes to my mind is file copy/move in the Files app, upload, chunked upload, and editing a text file in the text editor app. Anything more?

Now the code is somewhat cleaner and supports the prefix. While writing the code I started wondering whether having up-to-date checksums is viable at all, given that files can be stored externally (SMB, DAV, etc.), and how far we should go when updating them. Right now I tried catching two paths when writing files plus the occ files:scan command, but obviously it may still happen that files are updated outside of Nextcloud. Should we try to recalculate the hash when reading files (download, generating previews, etc.)?

BTW, what is chunked file upload in Nextcloud? In the code I see it depends on $_SERVER['HTTP_OC_CHUNKED'], but this variable is set only in tests, which to my untrained eye makes this code dead in any real-life scenario (and removable). Or am I wrong? Anyway, when uploading a 14 MB file it was sent in chunks of 10 MB + 4 MB, but it didn't use the chunking code in apps/dav/lib/Connector/Sabre/File.php.

This is one of the main reasons why this hasn't been added yet: if we end up missing a single code path that writes to the file and forget to update the checksum, it'll lead to Nextcloud thinking the file is corrupted.

Well, for now nothing actually uses those checksums, so a wrong checksum (recalculated on download) could be treated as an internal warning, and the number of such warnings could become part of the usage statistics. Would that make sense?

After further thought - no :-) As mentioned above - we will never get 100% accuracy.

Signed-off-by: Tomasz Grobelny <tomasz@grobelny.net>
@MorrisJobke
Member

Now the code is somewhat cleaner and supports the prefix. While writing the code I started wondering whether having up-to-date checksums is viable at all, given that files can be stored externally (SMB, DAV, etc.)

The problem with SMB is that there could be multiple TB, or tens of TB, stored in the SMB storage, so calculating checksums for all those files is not something we should trigger at all. The files there can also change without Nextcloud getting any hint of it. So keeping checksums for storage types that can also be accessed without Nextcloud in between does not make any sense.

So this setting (whether the checksum is calculated or not) should be on a per-storage basis, so that you can also say: yes, it is SMB, but it is used solely via Nextcloud and not accessed directly.
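
A hypothetical sketch of such a per-storage switch; the mount option name "checksum_enabled" is made up, and treating getMountOption() as a stable API is an assumption:

```php
<?php
use OCP\Files\Storage\IStorage;

function shouldCalculateChecksum(IStorage $storage): bool {
    // Local storage can be hashed cheaply; external storages (SMB, DAV, ...)
    // only when the admin opted in for that particular mount.
    if ($storage->instanceOfStorage(\OC\Files\Storage\Local::class)) {
        return true;
    }
    return method_exists($storage, 'getMountOption')
        && (bool)$storage->getMountOption('checksum_enabled', false);
}
```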

This was referenced Mar 4, 2019
@rullzer rullzer modified the milestones: Nextcloud 16, Nextcloud 17 Mar 18, 2019
@stale stale bot added the stale Ticket or PR with no recent activity label Jun 5, 2019
@skjnldsv skjnldsv added the 2. developing Work in progress label Jun 12, 2019
@ghost ghost removed the stale Ticket or PR with no recent activity label Jun 12, 2019
@MorrisJobke MorrisJobke mentioned this pull request Jul 16, 2019
@MorrisJobke MorrisJobke removed this from the Nextcloud 17 milestone Jul 16, 2019
@MorrisJobke MorrisJobke added this to the Nextcloud 18 milestone Jul 16, 2019
@jasongwq

  1. Allow a file's hash value to be empty.
  2. Record both the file's modification time and its hash.
  3. When using a file's hash, verify the file's actual modification time against the recorded one. Only if they match is the hash considered trustworthy; if the check fails, recalculate the hash and record the new time.
    3.1. For example, for deduplication: when a file with the same hash is found in the records, verify its timestamp first.
    3.2. For example, when the Android client uploads a file, the hash is calculated on the client side. If a trusted file with the same hash exists in the server's records, the file is considered to already exist. If no file with the same hash exists in the records but a file with the same name is present in the upload directory, its timestamp is verified; if the check fails, the hash is regenerated and compared again.

This strategy solves or avoids the efficiency problems and the inconsistencies caused by file modifications that are not captured. It relies on the file system maintaining file modification times; modifications that bypass the file system are not protected.
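
A sketch of the trust check described above: the recorded hash is only used when the recorded mtime still matches the file's actual mtime, otherwise it is recomputed. The shape of $cached and the recalculation callback are illustrative assumptions:

```php
<?php
function trustedChecksum(array $cached, int $actualMtime, callable $recalculate): string {
    if (($cached['checksum'] ?? '') !== '' && ($cached['mtime'] ?? -1) === $actualMtime) {
        return $cached['checksum']; // timestamps match, hash is trusted
    }
    // Verification failed (or no hash yet): recompute and re-record.
    return $recalculate();
}
```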

@rullzer rullzer mentioned this pull request Dec 11, 2019
@MichaIng
Member

MichaIng commented Dec 11, 2019

This means the oc_filecache table will grow even larger, right? It is already a very large one, usually orders of magnitude larger than the whole remaining database. This leads to large writes on database servers that split data files by table, e.g. MySQL/MariaDB.

I saw cases where, in an office, all users add the same external storage, used to store all office-wide shared files. This leads to every file being duplicated in the cache once per user. When one change is made, the database is updated accordingly on all of those entries in the same way. Of course this is simply an inefficient way of sharing files throughout Nextcloud, but these are edge cases where the oc_filecache table (file) easily grows to GiB sizes, leading to an enormous R/W effort.

So what I want to say is: if there is any chance, IMO this table should hold less information by splitting it, instead of adding more and thereby increasing the issues mentioned above.

@rullzer rullzer mentioned this pull request Dec 12, 2019
@nextcloud nextcloud deleted a comment from stale bot Dec 13, 2019
@kesselb
Contributor

kesselb commented Dec 13, 2019

@tomasz-grobelny could you rebase the branch?

@rullzer rullzer modified the milestones: Nextcloud 18, Nextcloud 19 Dec 16, 2019
This was referenced Apr 4, 2020
This was referenced Apr 15, 2020
@rullzer rullzer mentioned this pull request Apr 23, 2020
@rullzer rullzer removed this from the Nextcloud 19 milestone Apr 24, 2020
@ChristophWurst
Member

Closing due to lack of activity, and because this does not appear to be a patch we want to take for our file index.
