Fix slow execution time when using CharacterSet or CharacterSetComple… #11991

tblanchard · 2022-11-26T02:22:23Z

…ment to express delimiters with String findTokens:

Ducasse · 2022-11-26T14:11:31Z

Thanks. I'm surprised to see that includes: is faster than an = comparison. Maybe I did not understand it :)
Do you have some bench showing the difference?

tblanchard · 2022-11-26T17:03:18Z

> Thanks. I'm surprised to see that includes: is faster than an = comparison. Maybe I did not understand it :)
> Do you have some bench showing the difference?

I didn't time it exactly, but if you paste this into a playground in an image without my change (P10 is good), you will find that it hangs your image.

Bag withAll: ('one, two, three, and four, one, five, one ' findTokens: (CharacterSet newFrom: (Character alphabet, Character alphabet asUppercase)) complement)

After my change it runs instantaneously. The reason is that the current code iterates the delimiters collection to search for a matching character. When you have a CharacterSetComplement based on just alpha characters, that is a HUGE number of characters to walk through when really all you want to do is test membership.

Old Version:

`skipDelimiters: delimiters startingAt: start

start to: self size do: [ :i | 
	(delimiters anySatisfy: [ :delim | delim = (self at: i) ]) 
		ifFalse: [ ^ i ] ].
^ self size + 1`

New Version:
`skipDelimiters: delimiters startingAt: start

"Answer the index of the first character within the receiver, starting at start, that does NOT match any element of delimiters (a collection of characters). If the end of the receiver is reached, answer size + 1."

start to: self size do: [ :i | 
	(delimiters includes: (self at: i)) 
		ifFalse: [ ^ i ] ].
^ self size + 1`

So `(delimiters anySatisfy: [ :delim | delim = (self at: i) ])` ends up calling `delimiters do:` with something like `[ :d | d = c ifTrue: [ ^ true ] ]`, which makes a huge number of comparisons when all you really want to know is whether delimiters includes character c. This is highly inefficient when using a CharacterSet, especially if CharacterSet is cleverly implemented as a bit vector or hashed collection and you just need to test membership. In other words, you are ignoring the available O(1) lookup and doing an O(n) iteration for no good reason.
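As a rough way to see the difference in a playground, here is a minimal sketch rather than a measured benchmark. It assumes `BlockClosure>>bench` is available in the image, and the variable names `delimiters` and `probe` are only illustrative. It uses the plain set of letters rather than its complement, so the iterating version should still finish.

```smalltalk
"Sketch only: compare iteration-based matching with a direct membership test."
| delimiters probe |
delimiters := CharacterSet newFrom: Character alphabet , Character alphabet asUppercase.
probe := $!.

"Old style: enumerates the elements of the set, comparing each one with probe."
[ delimiters anySatisfy: [ :delim | delim = probe ] ] bench.

"New style: asks the set directly whether it contains probe."
[ delimiters includes: probe ] bench.
```

Inspecting each `bench` expression separately shows how many times per second each block ran. The gap should grow with the number of elements the collection has to enumerate.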

I hope that helps.

Ducasse · 2022-11-26T18:48:04Z

Thanks a lot for the explanation!!!! Could you add some of this reasoning to the method comments? I would like us to document such design points. I think this is important for educating readers.

tblanchard · 2022-11-26T19:12:33Z

Sure, I have added this comment to findTokens: (where I think people will be most likely to find it) and the two methods I changed.

"delimiters is any collection of characters and is often passed as a String. This is fine when the number of possible delimiters is small even though String>>includes: is an O(n) operation because n is small. When using a large number of possible delimiters, using a CharacterSet with a lookup efficiency of O(1) will produce much better performance."
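For illustration, the two ways of passing delimiters described in that comment might look like this in a playground. This is a sketch: the variable names are hypothetical, and the second call is only expected to be fast in an image that includes this change.

```smalltalk
| text fewDelimiters letters |
text := 'one, two, three, and four, one, five, one '.

"A handful of delimiters: a plain String is fine, since the linear scan is short."
fewDelimiters := ', '.
text findTokens: fewDelimiters.

"Many delimiters (everything that is not a letter): pass a set, so that each test is a membership lookup."
letters := CharacterSet newFrom: Character alphabet , Character alphabet asUppercase.
text findTokens: letters complement.
```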

Ducasse · 2022-11-27T20:00:11Z

Tx!

Ducasse · 2022-11-27T20:00:36Z

Checking why the build is failing.

Ducasse · 2022-11-27T20:50:09Z

Changes look good to me.

Ducasse · 2022-11-27T20:51:37Z

Broken tests are unrelated

Fix slow execution time when using CharacterSet or CharacterSetComple…

ec1b8e7

…ment to express delimiters with String findTokens:

Added comment explaining why delimiters is better passed as a Charact…

2023480

…erSet when number of delimiters is large.

Ducasse closed this Nov 27, 2022

Ducasse reopened this Nov 27, 2022

Ducasse added this to the 11.0.0 milestone Nov 27, 2022

Ducasse added the Type: Enhancement label Nov 27, 2022

Ducasse added the Status: Tests passed please review! label Nov 27, 2022

MarcusDenker approved these changes Nov 29, 2022

MarcusDenker linked an issue Nov 29, 2022 that may be closed by this pull request

String findTokens: is very slow when passing large CharacterSet or CharacterSetComplement #11990

Closed

MarcusDenker merged commit b6ce452 into pharo-project:Pharo11 Nov 29, 2022
