Write checksum to filecache table #13397

Conversation

tomasz-grobelny
Contributor

Currently the filecache table contains a checksum column, but it is empty. This pull request updates the value in this column during file modification and when running occ files:scan.

Information about file checksums can be used, e.g., for finding duplicate files.

Signed-off-by: Tomasz Grobelny <tomasz@grobelny.net>
@kesselb
Contributor

kesselb commented Jan 6, 2019

Ref #11138

@icewind1991 icewind1991 left a comment
Member

A naive approach like this will be terribly inefficient when dealing with large or remote files, as the file will have to be downloaded from the remote again just to calculate the checksum.

The only efficient way to get the checksum is to re-use the upload stream and calculate the checksum from that.
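
A minimal sketch of that stream-based approach, assuming only core PHP stream and hash functions; the function name and how the result would reach the filecache are illustrative, not part of this PR:

```php
<?php
// Hash the bytes while they are written, so the file never has to be read
// back (or re-downloaded from a remote storage) just to compute the checksum.
function writeWithChecksum($uploadStream, $targetStream): string {
    $ctx = hash_init('md5');
    while (!feof($uploadStream)) {
        $chunk = fread($uploadStream, 8192);
        if ($chunk === false || $chunk === '') {
            break;
        }
        hash_update($ctx, $chunk);     // feed the same bytes to the hash...
        fwrite($targetStream, $chunk); // ...that are written to storage
    }
    return hash_final($ctx); // hex digest of the uploaded data
}
```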

@MorrisJobke
Member

Also we should make it possible to use different algorithms, or at least encode the algorithm into the checksum to be able to change it in the future.

Signed-off-by: Tomasz Grobelny <tomasz@grobelny.net>
@tomasz-grobelny
Contributor Author

@icewind1991 How about this approach (see my new commit)? It now calculates the MD5 sum on the fly while uploading the file. The implementation is obviously not yet complete (e.g. I still need to handle chunked upload and restructure the code), but I wanted to ask whether the approach would be acceptable.

@tomasz-grobelny
Contributor Author

Also we should make it possible to use different algorithms, or at least encode the algorithm into the checksum to be able to change it in the future.

Just to make it clear: something like "md5:290dbe614e79b16b6204975614939c8e" would be ok?

@MorrisJobke
Member

Just to make it clear: something like "md5:290dbe614e79b16b6204975614939c8e" would be ok?

Yep - something like that I had in mind.
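
For reference, a hypothetical pair of helpers for that prefixed "algo:hexdigest" format; the function names and fallback behaviour are assumptions for illustration only:

```php
<?php
function formatChecksum(string $algo, string $hex): string {
    return $algo . ':' . $hex; // e.g. "md5:290dbe614e79b16b6204975614939c8e"
}

function parseChecksum(string $stored): ?array {
    if ($stored === '' || strpos($stored, ':') === false) {
        return null; // empty or legacy value without an algorithm prefix
    }
    [$algo, $hex] = explode(':', $stored, 2);
    // Only accept algorithms PHP actually supports.
    return in_array($algo, hash_algos(), true) ? [$algo, $hex] : null;
}
```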

@icewind1991
Member

The method of calculating the hash is good.

Keep in mind that there are apps that use the lower-level APIs to write file data; any logic that generates checksums will have to ensure that the checksum is either re-calculated or invalidated on every write.

This is one of the main reasons why this hasn't been added yet: if we end up missing a single code path that writes to the file and forget to update the checksum, it'll lead to Nextcloud thinking the file is corrupted.
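
A sketch of the invalidation side of that concern: any write path that bypasses the hashing code would clear the stored checksum rather than leave a stale one behind. The ICache-style update() call and the empty string as an "unknown" marker are assumptions here, not the PR's actual code:

```php
<?php
use OCP\Files\Cache\ICache;

function invalidateChecksum(ICache $cache, int $fileId): void {
    // An empty checksum means "unknown"; readers should fall back to
    // recalculating instead of flagging the file as corrupted.
    $cache->update($fileId, ['checksum' => '']);
}
```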

@tomasz-grobelny
Contributor Author

Keep in mind that there are apps that use the lower-level APIs to write file data; any logic that generates checksums will have to ensure that the checksum is either re-calculated or invalidated on every write.

Can you suggest any paths/apps that should be checked? What comes to my mind is file copy/move in the Files app, upload, chunked upload, and editing a text file in the text editor app. Anything more?

This is one of the main reasons why this hasn't been added yet: if we end up missing a single code path that writes to the file and forget to update the checksum, it'll lead to Nextcloud thinking the file is corrupted.

Well, for now nothing actually uses those checksums, so a wrong checksum (recalculated on download) could be treated as an internal warning, and the number of such warnings could become part of the usage statistics. Would that make sense?

Signed-off-by: Tomasz Grobelny <tomasz@grobelny.net>
Signed-off-by: Tomasz Grobelny <tomasz@grobelny.net>
@tomasz-grobelny
Contributor Author

Keep in mind that there are apps that use the lower-level APIs to write file data; any logic that generates checksums will have to ensure that the checksum is either re-calculated or invalidated on every write.

Can you suggest any paths/apps that should be checked? What comes to my mind is file copy/move in the Files app, upload, chunked upload, and editing a text file in the text editor app. Anything more?

Now the code is somewhat cleaner and supports the prefix. While writing the code I started wondering whether having up-to-date checksums is viable at all, given that files can be stored externally (SMB, DAV, etc.), and how far we should go when updating them. Right now I tried catching two paths when writing files plus the occ files:scan command, but obviously it may still happen that files are updated outside of Nextcloud. Should we try to recalculate the hash when reading files (download, generating previews, etc.)?

BTW, what is chunked file upload in Nextcloud? In the code I see it depends on $_SERVER['HTTP_OC_CHUNKED'], but this variable is set only in tests, which to my untrained eye makes this code dead in any real-life scenario (and removable). Or am I wrong? Anyway, when uploading a 14 MB file it was sent in chunks of 10 MB + 4 MB, but it didn't use the chunking code in apps/dav/lib/Connector/Sabre/File.php.

This is one of the main reasons why this hasn't been added yet: if we end up missing a single code path that writes to the file and forget to update the checksum, it'll lead to Nextcloud thinking the file is corrupted.

Well, for now nothing actually uses those checksums, so a wrong checksum (recalculated on download) could be treated as an internal warning, and the number of such warnings could become part of the usage statistics. Would that make sense?

After further thought - no :-) As mentioned above - we will never get 100% accuracy.

Signed-off-by: Tomasz Grobelny <tomasz@grobelny.net>
@MorrisJobke
Member

Now the code is somewhat cleaner and supports the prefix. While writing the code I started wondering whether having up-to-date checksums is viable at all, given that files can be stored externally (SMB, DAV, etc.)

The problem with SMB is that there could be multiple TB, or tens of TB, stored in the SMB storage, so calculating checksums for all those files is not something we should trigger at all. The files there can also change without Nextcloud getting any hint of it. So keeping checksums for storage types that can also be accessed without Nextcloud in between does not make any sense.

So this setting (whether the checksum is calculated or not) should be on a per-storage basis, so that you can also say: yes, it is SMB, but it is used solely via Nextcloud and not accessed directly.
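
A hypothetical sketch of such a per-storage switch; the mount option name "checksum_enabled" is made up, and treating getMountOption() as a stable API is an assumption:

```php
<?php
use OCP\Files\Storage\IStorage;

function shouldCalculateChecksum(IStorage $storage): bool {
    // Local storage can be hashed cheaply; external storages (SMB, DAV, ...)
    // only when the admin opted in for that particular mount.
    if ($storage->instanceOfStorage(\OC\Files\Storage\Local::class)) {
        return true;
    }
    return method_exists($storage, 'getMountOption')
        && (bool)$storage->getMountOption('checksum_enabled', false);
}
```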

This was referenced Mar 4, 2019
@rullzer rullzer modified the milestones: Nextcloud 16, Nextcloud 17 Mar 18, 2019
@stale stale bot added the stale Ticket or PR with no recent activity label Jun 5, 2019
@skjnldsv skjnldsv added the 2. developing Work in progress label Jun 12, 2019
@ghost ghost removed the stale Ticket or PR with no recent activity label Jun 12, 2019
@MorrisJobke MorrisJobke mentioned this pull request Jul 16, 2019
@MorrisJobke MorrisJobke removed this from the Nextcloud 17 milestone Jul 16, 2019
@MorrisJobke MorrisJobke added this to the Nextcloud 18 milestone Jul 16, 2019
@jasongwq

  1. Allow a file's hash value to be empty.
  2. Record both the file's modification time and its hash.
  3. When using a file's hash, verify the file's actual modification time against the recorded one. Only if they match is the hash considered trustworthy; if the check fails, recalculate the hash and record the new time.
    3.1. For example, for deduplication: when a file with the same hash is found in the records, verify its timestamp first.
    3.2. For example, when the Android client uploads a file, the hash is calculated on the client side. If a trusted file with the same hash exists in the server's records, the file is considered to already exist. If no file with the same hash exists in the records but a file with the same name is present in the upload directory, its timestamp is verified; if the check fails, the hash is regenerated and compared again.

This strategy solves or avoids the efficiency problems and the inconsistencies caused by file modifications that are not captured. It relies on the file system maintaining file modification times; modifications that bypass the file system are not protected.
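
A sketch of the trust check described above: the recorded hash is only used when the recorded mtime still matches the file's actual mtime, otherwise it is recomputed. The shape of $cached and the recalculation callback are illustrative assumptions:

```php
<?php
function trustedChecksum(array $cached, int $actualMtime, callable $recalculate): string {
    if (($cached['checksum'] ?? '') !== '' && ($cached['mtime'] ?? -1) === $actualMtime) {
        return $cached['checksum']; // timestamps match, hash is trusted
    }
    // Verification failed (or no hash yet): recompute and re-record.
    return $recalculate();
}
```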

@rullzer rullzer mentioned this pull request Dec 11, 2019
@MichaIng
Member

MichaIng commented Dec 11, 2019

This means the oc_filecache table will grow even larger, right? It is already a very large one, usually orders of magnitude larger than the whole remaining database. This leads to large writes on database servers that split data files by table, e.g. MySQL/MariaDB.

I saw cases where, in an office, all users add the same external storage, used to store all office-wide shared files. This leads to every file being duplicated in the cache once per user. When one change is made, the database is updated accordingly on all of those entries in the same way. Of course this is simply an inefficient way of sharing files throughout Nextcloud, but these are edge cases where the oc_filecache table (file) easily grows to GiB sizes, leading to an enormous R/W effort.

So what I want to say is: if there is any chance, IMO this table should hold less information by splitting it, instead of adding more and thereby increasing the issues mentioned above.

@rullzer rullzer mentioned this pull request Dec 12, 2019
@nextcloud nextcloud deleted a comment from stale bot Dec 13, 2019
@kesselb
Contributor

kesselb commented Dec 13, 2019

@tomasz-grobelny could you rebase the branch?

@rullzer rullzer modified the milestones: Nextcloud 18, Nextcloud 19 Dec 16, 2019
This was referenced Apr 4, 2020
This was referenced Apr 15, 2020
@rullzer rullzer mentioned this pull request Apr 23, 2020
@rullzer rullzer removed this from the Nextcloud 19 milestone Apr 24, 2020
@ChristophWurst
Member

Closing due to lack of activity, and because this does not appear to be a patch we want to take for our file index.
