Skip to content
This repository has been archived by the owner on Apr 16, 2022. It is now read-only.

Only duplicate media on export when there are truly unique #77

Closed
yanokwa opened this issue Mar 22, 2017 · 15 comments · Fixed by #107
Closed

Only duplicate media on export when there are truly unique #77

yanokwa opened this issue Mar 22, 2017 · 15 comments · Fixed by #107

Comments

@yanokwa
Copy link
Member

yanokwa commented Mar 22, 2017

Collect uses timestamps to name files. On a very large campaign, two devices can take images at the same time, upload them both to Aggregate, then download them to Briefcase. When those images are exported, what happens?

@rclakmal
Copy link
Contributor

rclakmal commented Mar 25, 2017

@yanokwa Tested on OS X 10.12 with Birds form

There were two images with the same timestamp in two difference instance folders.

snip20170325_3

After exporting, in the media folder, one was renamed with the suffix -2

snip20170325_4

@shivam-tripathi
Copy link
Contributor

shivam-tripathi commented Mar 25, 2017

@yanokwa: Confirmed the same thing as @rclakmal on Linux Ubuntu 16.04.
selection_016
selection_017

@shivam-tripathi
Copy link
Contributor

shivam-tripathi commented Mar 25, 2017

But noticed something weird :
selection_018
selection_019

It appears that when encountering two separate instance media with same time stamp, it creates a duplicate of only one already copied.
I renamed one of the instance media as one existing in another instance. Also it appears while re-fetching form, if the instance media data has been tampered with - briefcase doesn't verify the contents of instance media folder and skips it.

@shivam-tripathi
Copy link
Contributor

shivam-tripathi commented Mar 25, 2017

Looked at the code, I was mistaken (see the crossed out remarks in the comment above). Making changes to the instances folder makes no difference, as names are read from the XML response. If file is not found, it is skipped.
The redundancy in the image in the comment was (unfortunately causing the confusion) due to it being actually present twice.
The code handles the files with same timestamp by adding suffix.

@icemc
Copy link
Contributor

icemc commented Mar 26, 2017

I confirm what @shivam-tripathi said earlier. When the code encounters files with the same time stamp it solves the problem by adding an incremental suffix.

@yanokwa
Copy link
Member Author

yanokwa commented Apr 5, 2017

Thanks so much for the confirmation, gentlemen! I'm closing this issue because this is exactly the behavior we want.

@yanokwa yanokwa closed this as completed Apr 5, 2017
@yanokwa yanokwa reopened this Apr 6, 2017
@yanokwa
Copy link
Member Author

yanokwa commented Apr 6, 2017

Actually. Instead of adding image-2.jpg, I wonder if we can check the MD5 hash and only append a number if those files are actually different. What do you think @shivam-tripathi @icemc @rclakmal?

@rclakmal
Copy link
Contributor

rclakmal commented Apr 6, 2017

@yanokwa We can store MD5 hashes and file paths in a HashTable. MD5 hash would be the key. This will allow us to skip the duplicates. As far as I know, there is a theoretical possibility, however small, that two different files could return same hash. I think we can ignore this for practocal purposes?

Wouldn't you think we should provide this as an option? This introduce extra overhead to the export process and in a big form result set delay could be noticeble.

@icemc
Copy link
Contributor

icemc commented Apr 6, 2017

@yanokwa following what @rclakmal said an MD5 hash will solve the problem (in most cases) but will decrease the performance during export of forms with a large amount of instances. It is left for the community to decide if this extra cost is necessary.

@yanokwa
Copy link
Member Author

yanokwa commented Apr 6, 2017

I'd rather we decide what is best than add options to the app. I think this should be pretty fast because you'd only be doing the MD5 check when you have a matching filename, no? Either way, this is something that can be tested empirically if either of you are up for it.

@shivam-tripathi
Copy link
Contributor

With everyone's permission, I would be glad to look into this.

@rclakmal
Copy link
Contributor

rclakmal commented Apr 7, 2017

@shivam-tripathi Sure go ahead :-) I did a basic comparison already on MD5 and SHA1. MD5 was faster and looks suitable for our use case. Let me know if any assistance is needed from my side.

@icemc
Copy link
Contributor

icemc commented Apr 7, 2017

@shivam-tripathi sure u can look into it. I don't think someone is currently working on it. Did a quick look and seems like the issue comes from the file ConvertToCSV.java in the method emitSubmissionCsv in case org.javarosa.core.model.Constants.DATATYPE_BINARY: correct me if I'm wrong. Hope I helped

@yanokwa yanokwa changed the title Confirm that images of the same name do not overwrite each other Only duplicate media on export when there are truly unique Apr 7, 2017
shivam-tripathi added a commit to shivam-tripathi/briefcase that referenced this issue Apr 8, 2017
While exporting, if the destination folder already contains a file with the
same name, compute the SHA-1 hash of the files and compare them. If
same, skip copying; else append a suffix to differentiate the files.
For better performance, cache the already computed hash in a HashMap with
fileName-hashValue as key-value pair.

Fix getodk#77
shivam-tripathi added a commit to shivam-tripathi/briefcase that referenced this issue Apr 9, 2017
While exporting, if the destination folder already contains a file with the
same name, compute the SHA-1 hash of the files and compare them. If
same, skip copying; else append a suffix to differentiate the files.
For better performance, cache the already computed hash in a HashMap with
fileName-hashValue as key-value pair.

Fix getodk#77
shivam-tripathi added a commit to shivam-tripathi/briefcase that referenced this issue Apr 9, 2017
While exporting, if the destination folder already contains a file with the
same name, compute the SHA-1 hash of the files and compare them. If
same, skip copying; else append a suffix to differentiate the files.
For better performance, cache the already computed hash in a HashMap with
fileName-hashValue as key-value pair.

Fix getodk#77
shivam-tripathi added a commit to shivam-tripathi/briefcase that referenced this issue Apr 9, 2017
While exporting, if the destination folder already contains a file with the
same name, compute the SHA-1 hash of the files and compare them. If
same, skip copying; else append a suffix to differentiate the files.
For better performance, cache the already computed hash in a HashMap with
fileName-hashValue as key-value pair.

Fix getodk#77
shivam-tripathi added a commit to shivam-tripathi/briefcase that referenced this issue Apr 9, 2017
While exporting, if the destination folder already contains a file with the
same name, compute the MD5 hash of the files and compare them. If
same, skip copying; else append a suffix to differentiate the files.
For better performance, cache the already computed hash in a HashMap with
fileName-hashValue as key-value pair.

Fix getodk#77
@joeflack4
Copy link

Thanks for your efforts with this. Some of our partners have very limited connections, so the duplicate issues can be a huge issue, particularly with our fork of collect which uses form linking, as some images are shared between our household and individual questionnaires.

@shivam-tripathi
Copy link
Contributor

shivam-tripathi commented Apr 10, 2017

@joeflack4 Hi!
Presently we are trying to remove redundancy when you export the data collected using Briefcase after pulling it from the server. This is essentially an offline process.
However, while pulling the data off the server - this remains an issue. I hope some solution surfaces in future. If I am correct, it needs to be done at Aggregate level - as in Briefcase we cannot determine whether or not the media file is duplicate before the fetch.

shivam-tripathi added a commit to shivam-tripathi/briefcase that referenced this issue Apr 18, 2017
While exporting, if the destination folder already contains a file with the
same name, compute the MD5 hash of the files and compare them. If
same, skip copying; else append a suffix to differentiate the files.
For better performance, cache the already computed hash in a HashMap with
fileName-hashValue as key-value pair.

Fix getodk#77
shivam-tripathi added a commit to shivam-tripathi/briefcase that referenced this issue Apr 19, 2017
While exporting, if the destination folder already contains a file with the
same name, compute the MD5 hash of the files and compare them. If
same, skip copying; else append a suffix to differentiate the files.
For better performance, cache the already computed hash in a HashMap with
fileName-hashValue as key-value pair.

Fix getodk#77
shivam-tripathi added a commit to shivam-tripathi/briefcase that referenced this issue Apr 19, 2017
While exporting, if the destination folder already contains a file with the
same name, compute the MD5 hash of the files and compare them. If
same, skip copying; else append a suffix to differentiate the files.
For better performance, cache the already computed hash in a HashMap with
fileName-hashValue as key-value pair.

Fix getodk#77
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.