
duplicate_finder

Finds duplicate files in a single directory. The search is shallow (non-recursive): subdirectories are not scanned.

The duplicate files are moved to a separate directory.

A report is generated and saved in that separate directory, alongside the duplicate files. For each duplicate file, the report shows its name/path and the corresponding original file.

Dependencies

  • a compiler supporting the C++17 standard or newer
  • packages: openssl, used for generating file hashes

Build

To build the release version of the executable from the project, run the provided build script in the root directory of this repository:

./build-release.sh

Then launch the executable, passing the directory to search for duplicate files:

./cmake-build-release/duplicate_finder "/path/to/directory/with/possible/duplicate/files/"

The duplicate files will be moved to a subdirectory named DUPLICATE_FILES inside the searched directory.

Design

  • HashGenerator
  • File
  • Hash
  • Directory
  • FilePathsComparator - comparator for the vector of all files in the directory
  • StringComparatorAscending - comparator for string/object-type keys in maps
  • DuplicateFilesHandler
  • ReportGenerator

Algorithm notes

According to the C++ reference docs on map insertion, "the insertion operation checks whether each inserted element has a key equivalent to the one of an element already in the container, and if so, the element is not inserted". This matters when inserting each hashAsText/File key-value pair into the original files: a file whose hash is already present as a key is not inserted.
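The insertion behavior quoted above can be observed through the return value of `std::map::emplace`. The following is a minimal sketch, not the repository's code: `HashedFile` and `collectDuplicates` are hypothetical names, and plain strings stand in for the real File objects.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a hashed file: (hashAsText, file name).
using HashedFile = std::pair<std::string, std::string>;

// Inserts each file into originalFiles keyed by its hash and returns the
// names of the files whose hash was already present (the duplicates).
std::vector<std::string> collectDuplicates(
        const std::vector<HashedFile>& files,
        std::map<std::string, std::string>& originalFiles) {
    std::vector<std::string> duplicateFiles;
    for (const auto& [hashAsText, name] : files) {
        // emplace returns {iterator, bool}; the bool is false when the key
        // already exists, and in that case the map is left unchanged.
        const bool inserted = originalFiles.emplace(hashAsText, name).second;
        if (!inserted) {
            duplicateFiles.push_back(name);
        }
    }
    return duplicateFiles;
}
```

Because the existing element is never overwritten, the first file seen with a given hash stays the "original" and every later file with the same hash is reported as a duplicate.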

  • Currently the vector is a container of unique_ptrs to Files, and the map is a container of pairs of a reference_wrapper to the string (hashAsText in File) and a reference_wrapper to the File. Another solution for storing Files would be to make the vector a container of shared_ptrs to Files and the map a container of weak_ptrs to the string and weak_ptrs to the File itself. The two C++ combinations:

    • 'shared_ptr' and 'weak_ptr'
    • 'unique_ptr' and 'reference_wrapper'
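  • storing references to local variables in the map is undefined behavior, because the references dangle once the variables go out of scope

     originalFiles.emplace(hashAsText, fileReference);
    

    with such references, unreadable characters are printed when iterating over the map and printing out its contents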

  • deciding where a file belongs

    • check whether the original files container already contains the hashAsText key associated with the file. If the key is missing, add the file to the original files. Otherwise add the file to the duplicate files
  • Keep the type, including its constness, consistent across:
    element type in the container <=> element type in the loop when iterating <=> element type in the return value <=> element type when inserting <=> parameter type of the function in which the element is used

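  • reference_wrapper can be constructed directly from an expression result, as long as that result is an lvalue (for example a dereferenced smart pointer, or a getter that returns a reference); it does not bind to temporaries (rvalues)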

    duplicateFiles.emplace(file->gethash(), *(file.get()));
    
  • These two routines for removing a character from a string are equivalent

    Remove characters with replace

    std::size_t position = 0;
    const std::size_t numberOfCharacters = 1;
    // replace each "(" with an empty string until none is left
    while ((position = modifiedAbsolutePath.find("(")) != std::string::npos) {
        modifiedAbsolutePath.replace(position, numberOfCharacters, "");
    }
    

    Remove characters with erase

    // erase each ")" until none is left
    while ((position = modifiedAbsolutePath.find(")")) != std::string::npos) {
        modifiedAbsolutePath.erase(position, numberOfCharacters);
    }
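To check the equivalence of the two removal routines above, they can be wrapped in small functions and compared on the same input. This is only a sketch: `removeWithReplace` and `removeWithErase` are hypothetical helper names, and passing the last position to `find` as the search start is an addition that avoids rescanning the string from the beginning.

```cpp
#include <cstddef>
#include <string>

// Removes every occurrence of `character` by replacing it with an empty string.
std::string removeWithReplace(std::string text, char character) {
    std::size_t position = 0;
    // Continue searching from the last hit instead of from the start.
    while ((position = text.find(character, position)) != std::string::npos) {
        text.replace(position, 1, "");
    }
    return text;
}

// Removes every occurrence of `character` by erasing one character at a time.
std::string removeWithErase(std::string text, char character) {
    std::size_t position = 0;
    while ((position = text.find(character, position)) != std::string::npos) {
        text.erase(position, 1);
    }
    return text;
}
```

Both functions take the string by value, so the caller's string is left untouched and the two results can be compared directly.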
    
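The unique_ptr and reference_wrapper combination from the notes above can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: the minimal `File` class, `FileMap`, `Classification`, and `classify` are hypothetical, `std::less<std::string>` stands in for a comparator such as StringComparatorAscending, and duplicates are collected in a vector for simplicity.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal File; the real class in the repository may differ.
class File {
public:
    File(std::string path, std::string hash)
        : path_(std::move(path)), hash_(std::move(hash)) {}
    const std::string& getPath() const { return path_; }
    // Returns a reference to the member, so a reference_wrapper can bind to it.
    const std::string& getHash() const { return hash_; }
private:
    std::string path_;
    std::string hash_;
};

// Non-owning map: both key and value refer to data owned elsewhere.
// std::less<std::string> compares the referenced strings, not addresses.
using FileMap = std::map<std::reference_wrapper<const std::string>,
                         std::reference_wrapper<const File>,
                         std::less<std::string>>;

struct Classification {
    FileMap originalFiles;
    std::vector<std::reference_wrapper<const File>> duplicateFiles;
};

// The vector of unique_ptrs owns the Files; the result only refers to them,
// so `files` must outlive the returned Classification.
Classification classify(const std::vector<std::unique_ptr<File>>& files) {
    Classification result;
    for (const auto& file : files) {
        // The key already exists when a file with the same hash was seen before.
        const bool inserted =
            result.originalFiles.emplace(file->getHash(), *file).second;
        if (!inserted) {
            result.duplicateFiles.emplace_back(*file);
        }
    }
    return result;
}
```

The references stay valid because they point at heap objects and their members, which live as long as the owning vector does, rather than at local variables of the loop.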
