Only duplicate media on export when there are truly unique #77
Comments
@yanokwa Tested on OS X 10.12 with There were two images with the same timestamp in two different instance folders. After exporting, in the media folder, one was renamed with the |
I looked at the code and I was mistaken (see the crossed out remarks in the comment above). Making changes to the instances folder makes no difference, as names are read from the XML response. If a file is not found, it is skipped. |
I confirm what @shivam-tripathi said earlier. When the code encounters files with the same timestamp, it solves the problem by adding an incremental suffix. |
Thanks so much for the confirmation, gentlemen! I'm closing this issue because this is exactly the behavior we want. |
Actually. Instead of adding image-2.jpg, I wonder if we can check the MD5 hash and only append a number if those files are actually different. What do you think @shivam-tripathi @icemc @rclakmal? |
@yanokwa We can store MD5 hashes and file paths in a HashTable, with the MD5 hash as the key. This will allow us to skip the duplicates. As far as I know, there is a theoretical possibility, however small, that two different files could return the same hash. I think we can ignore this for practical purposes? Wouldn't you think we should provide this as an option? This introduces extra overhead to the export process, and with a big form result set the delay could be noticeable. |
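A minimal sketch of the hash-keyed table suggested above, assuming nothing about Briefcase's actual code: the class `HashKeyedMediaIndex` and its methods are hypothetical names made up for illustration. It hashes every file it sees, which is the overhead the comment worries about, and it reads each file fully into memory, which would need streaming for large media.

```java
import java.io.IOException;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from Briefcase.
final class HashKeyedMediaIndex {
    // MD5 hex digest -> first file seen with that content.
    private final Map<String, Path> firstByHash = new HashMap<>();

    static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            // Zero-pad to 32 hex characters so leading zero bytes are kept.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Returns the first file registered with the same content as {@code file}.
     * If the content is new, registers {@code file} and returns it.
     */
    Path firstSeen(Path file) throws IOException {
        String hash = md5Hex(Files.readAllBytes(file));
        Path existing = firstByHash.putIfAbsent(hash, file);
        return existing != null ? existing : file;
    }
}
```

A caller would copy a file only when `firstSeen(file)` returns that same file, and otherwise point the export at the path that was returned.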
I'd rather we decide what is best than add options to the app. I think this should be pretty fast because you'd only be doing the MD5 check when you have a matching filename, no? Either way, this is something that can be tested empirically if either of you are up for it. |
With everyone's permission, I would be glad to look into this. |
@shivam-tripathi Sure go ahead :-) I did a basic comparison already on MD5 and SHA1. MD5 was faster and looks suitable for our use case. Let me know if any assistance is needed from my side. |
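For anyone who wants to repeat that kind of comparison, here is a rough timing sketch. It is not the test that was actually run; `DigestTimingSketch`, the 4 MiB buffer, and the round counts are all arbitrary assumptions, and results will vary by JVM and hardware.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Random;

// Hypothetical micro-benchmark; treat the numbers as indicative only.
final class DigestTimingSketch {
    /** Nanoseconds spent hashing {@code data} {@code rounds} times with the given algorithm. */
    static long nanosToDigest(String algorithm, byte[] data, int rounds)
            throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            md.update(data);
            md.digest(); // digest() also resets the instance for the next round
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] data = new byte[4 * 1024 * 1024]; // roughly photo-sized payload (assumption)
        new Random(42).nextBytes(data);
        for (String algorithm : new String[] {"MD5", "SHA-1"}) {
            nanosToDigest(algorithm, data, 5); // warm-up so the JIT has compiled the hot path
            long nanos = nanosToDigest(algorithm, data, 20);
            System.out.printf("%s: %d ms for 20 rounds%n", algorithm, nanos / 1_000_000);
        }
    }
}
```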
@shivam-tripathi sure u can look into it. I don't think someone is currently working on it. Did a quick look and seems like the issue comes from the file ConvertToCSV.java in the method emitSubmissionCsv in case org.javarosa.core.model.Constants.DATATYPE_BINARY: correct me if I'm wrong. Hope I helped |
While exporting, if the destination folder already contains a file with the same name, compute the SHA-1 hash of the files and compare them. If same, skip copying; else append a suffix to differentiate the files. For better performance, cache the already computed hash in a HashMap with fileName-hashValue as key-value pair. Fix getodk#77
While exporting, if the destination folder already contains a file with the same name, compute the MD5 hash of the files and compare them. If same, skip copying; else append a suffix to differentiate the files. For better performance, cache the already computed hash in a HashMap with fileName-hashValue as key-value pair. Fix getodk#77
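The behavior this commit message describes can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: `MediaExportSketch` and `export` are hypothetical names, and the `-2`, `-3` suffix format is inferred from the `image-2.jpg` example earlier in the thread. Hashes are computed only when a name collision occurs and are cached by destination file name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not Briefcase's actual implementation.
final class MediaExportSketch {
    // Destination file name -> MD5 hex of the file already exported under that name.
    private final Map<String, String> hashByFileName = new HashMap<>();

    static String md5Hex(Path file) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is required on every Java platform
        }
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // Reading through the stream feeds the digest; nothing else to do.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Exports {@code source} into {@code destDir}; returns the path the export should reference. */
    Path export(Path source, Path destDir) throws IOException {
        String name = source.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String stem = dot < 0 ? name : name.substring(0, dot);
        String ext = dot < 0 ? "" : name.substring(dot);
        String sourceHash = null; // computed lazily, only on a name collision
        for (int n = 1; ; n++) {
            String candidate = n == 1 ? name : stem + "-" + n + ext;
            Path target = destDir.resolve(candidate);
            if (!Files.exists(target)) {
                Files.copy(source, target);
                if (sourceHash != null) {
                    hashByFileName.put(candidate, sourceHash);
                }
                return target;
            }
            if (sourceHash == null) {
                sourceHash = md5Hex(source);
            }
            String existingHash = hashByFileName.get(candidate);
            if (existingHash == null) {
                existingHash = md5Hex(target);
                hashByFileName.put(candidate, existingHash);
            }
            if (existingHash.equals(sourceHash)) {
                return target; // identical content already exported: skip the copy
            }
        }
    }
}
```

With this shape, two identical `image.jpg` files from different instance folders end up as one exported file, while a different image with the same name is written as `image-2.jpg`.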
Thanks for your efforts with this. Some of our partners have very limited connections, so duplicates can be a huge problem, particularly with our fork of Collect, which uses form linking, as some images are shared between our household and individual questionnaires. |
@joeflack4 Hi! |
Collect uses timestamps to name files. On a very large campaign, two devices can take images at the same time, upload them both to Aggregate, then download them to Briefcase. When those images are exported, what happens?