Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exclude regular expression improvements #6200

Merged
merged 6 commits into from
Dec 7, 2017
Merged

Exclude regular expression improvements #6200

merged 6 commits into from
Dec 7, 2017

Conversation

ckamm
Copy link
Contributor

@ckamm ckamm commented Nov 29, 2017

A collection of refactor and improvement of the regex-based codepath. Calls to fnmatch are entirely gone.

@ckamm ckamm added this to the 2.5.0 milestone Nov 29, 2017
@ckamm ckamm self-assigned this Nov 29, 2017
@ckamm ckamm requested a review from guruz November 29, 2017 13:06
Copy link
Contributor

@ogoffart ogoffart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only looked at the first commit so far.

👎 Because I don't want to introduce more dependency between the syncengine and the config file. I want to move ConfigFile to gui, because i don't want the tools or test to mess or depend on the user's config file

Also, it is a bit strange that the ExcludeFiles class goes in a csync_xxx file, i'd rather it to be named excludesfiles.{cpp,h} and not have a mix of C-style code and proper C++ in the same file.

assert_int_equal(rc, 0);
rc = _csync_exclude_add(&(csync->excludes), "latex/*/*.tex.tmp");
assert_int_equal(rc, 0);
excludedFiles->addManualExclude("*.💩"); // is this source file utf8 encoded?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes!

@@ -26,6 +26,7 @@
#include "propagateremotedelete.h"
#include "propagatedownload.h"
#include "common/asserts.h"
#include "configfile.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid dependency between the syncengine and the config file?
Ideally all the config should be handled by the app.
Please put that in the Folder instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, absolutely.

*
* See ExcludedFiles in csync_exclude.
*/
std::function<CSYNC_EXCLUDE_TYPE(const char *path, int filetype)> exclude_traversal_fn;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not put a (smart?) pointer to the ExcludeFiles here and call the relevant function, instead of using this hook-like indirection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer tying ExcludedFiles and csync together through this function because it's more composable/decoupled. Now csync only ever cares about the traversal exclude match function which makes it easy to hook in a different exclude function for testing.

*/
bool reloadExcludeFiles();

#ifdef CSYNC_TEST
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a friend?

(This breaks ODR, and will not work on windows because it mangle the access in the symbol name)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if (OCC::Utility::fsCasePreserving())
patternOptions |= QRegularExpression::CaseInsensitiveOption;
_regex.setPatternOptions(patternOptions);
_regex.optimize();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why pass the OptimizeOnFirstUsageOption flag if you optimize now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was kept from the initial version. I agree that it can be removed.

_nonRegexExcludes.clear();

// Start out with regexes that would match nothing
QString exclude_only = "a^";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this can be done at the end (if exclude_only.isEmpty()) exclude_only = "a^";)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can do.

QString patternFileDirRemove = "a^";
QString patternDirKeep = "a^";
QString patternDirRemove = "a^";
auto regexAppend = [](QString &pattern, QString v) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const QString &v

match = CSYNC_FILE_SILENTLY_EXCLUDED;
goto out;
if (blen == 11) {
rc = csync_fnmatch("Desktop.ini", bname, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not bname == "Desktop.ini" (with Qt::CaseInsensitive for file system that needs it) ?

But also, why should we not put this and all other builtin in the regexp?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's definitely worthy of further simplification. Some of the patterns above have similar opportunities. Adding to the regex doesn't work easily since these want to be SILENTLY_EXCLUDED.

@ckamm
Copy link
Contributor Author

ckamm commented Dec 6, 2017

@ogoffart Addressed comments (separate commits for easy checking, will be rebased away before merge)

@@ -702,7 +702,7 @@ void ExcludedFiles::prepare()
+ "(?:^|/)(?:" + bnameFileDirRemove + "|" + bnameDirRemove + ")(?:$|/)"
+ ")");

QRegularExpression::PatternOptions patternOptions = QRegularExpression::OptimizeOnFirstUsageOption;
QRegularExpression::PatternOptions patternOptions = QRegularExpression::NoPatternOption;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, i would have kept this flag, but removed the optimize call.
But it does not matter anyway.

Make ExcludedFiles something that is instantiated outside of
the CSYNC context and then given to it as a hook.

ExcludedFiles still lives in csync_exclude and the internal
workings haven't been touched.
Improves full matches by more than an order of magnitude
and also improves speed of traversal matches by roughly 20%,
judging by the check_csync_exclude performance test.
Neither the documentation at
http://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html#tag_02_13
nor the fnmatch source code in <glibc>/posix/fnmatch_loop.c points
toward { causing special matching behavior.
Now all exclude patterns can be translated to regular expressions.
@ckamm ckamm merged commit 287670c into master Dec 7, 2017
@guruz guruz deleted the exclude-regex-5017 branch December 12, 2017 14:42
@guruz
Copy link
Contributor

guruz commented Dec 12, 2017

Thanks a lot @ckamm for taking this up again :)
What were your CPU improvements?

@ckamm
Copy link
Contributor Author

ckamm commented Dec 13, 2017

@guruz Current benchmarks on my machine.

Numbers are 
  2.3 / 2.4 / master
  in ms per call

traversal
  0.00098 / 0.00047 / 0.00046
  reduced to 47% of original

full
  0.0266 / 0.0251 / 0.0019
  reduced to 7% of original

The traversal number is the primarily important one - it speeds up local discovery for which exclude computations had been a bottleneck. The full value is primarily used by the file watcher so the huge improvement should be easily noticable if one copies thousands of files into a synced folder.

Note that these values are for a high-end linux machine. Speedups were even larger when we profiled windows/osx during the initial investigation.

@guruz
Copy link
Contributor

guruz commented Dec 13, 2017

Impressive :-)

FYI @michaelstingl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants