Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix slow execution time when using CharacterSet or CharacterSetComple… #11991

Merged
merged 2 commits into from Nov 29, 2022
Merged

Fix slow execution time when using CharacterSet or CharacterSetComple… #11991

merged 2 commits into from Nov 29, 2022

Conversation

tblanchard
Copy link
Contributor

…ment to express delimiters with String findTokens:

…ment to express delimiters with String findTokens:
@Ducasse
Copy link
Member

Ducasse commented Nov 26, 2022

Thanks. I'm surprised to see that includes: is faster than a =. May be I did not understand it :)
Do you have some bench showing the difference?

@tblanchard
Copy link
Contributor Author

tblanchard commented Nov 26, 2022

Thanks. I'm surprised to see that includes: is faster than a =. May be I did not understand it :)
Do you have some bench showing the difference?

I didn't time it exactly but paste this into a playground in an image without my change (P10 is good) and you will find it hangs your image.

Bag withAll: ('one, two, three, and four, one, five, one ' findTokens: ((CharacterSet newFrom: (Character alphabet, Character alphabet asUppercase))complement))

After my change it runs instantaneously. The reason is that the current code iterates the delimiters collection to search for a matching character. When you have a CharacterSetComplement based on just alpha characters, that is a HUGE number of characters to walk through when really all you want to do is test membership.

Old Version:

`skipDelimiters: delimiters startingAt: start

start to: self size do: [ :i | 
	(delimiters anySatisfy: [ :delim | delim = (self at: i) ]) 
		ifFalse: [ ^ i ] ].
^ self size + 1`

New Version:
`skipDelimiters: delimiters startingAt: start

"Answer the index of the first character within the receiver, starting at start, that does NOT match any element of delimiters (a collection of characters). If the end of the receiver is reached, answer size + 1."

start to: self size do: [ :i | 
	(delimiters includes: (self at: i)) 
		ifFalse: [ ^ i ] ].
^ self size + 1`

So (delimiters anySatisfy: [ :delim | delim = (self at: i) ]) is going to call delimiters do: with something like [:d | d = c ifTrue: [^true]] which makes a huge number of comparisons but all you really want to know is if delimiters includes character c. So this is highly inefficient when using a CharacterSet - especially if CharacterSet is cleverly implemented as a bit vector or hashed collection and you just need to test membership. eg - you are ignoring the available O(1) lookup and doing an O(n) iteration for no good reason.

I hope that helps.

@Ducasse
Copy link
Member

Ducasse commented Nov 26, 2022

Thanks a lot for the explanation!!!! Could you add some of this logic in the method comments because I would like that we document such design point. I think that this is important to educate readers.

@tblanchard
Copy link
Contributor Author

tblanchard commented Nov 26, 2022

Sure, I have added this comment to findTokens: (where I think people will be most likely to find it) and the two methods I changed.

"delimiters is any collection of characters and is often passed as a String. This is fine when the number of possible delimiters is small even though String>>includes: is an O(n) operation because n is small. When using a large number of possible delimiters, using a CharacterSet with a lookup efficiency of O(1) will produce much better performance."

@Ducasse
Copy link
Member

Ducasse commented Nov 27, 2022

Tx!

@Ducasse Ducasse closed this Nov 27, 2022
@Ducasse Ducasse reopened this Nov 27, 2022
@Ducasse
Copy link
Member

Ducasse commented Nov 27, 2022

Checking why the build is failing.

@Ducasse
Copy link
Member

Ducasse commented Nov 27, 2022

Changes looks good to me.

@Ducasse Ducasse added this to the 11.0.0 milestone Nov 27, 2022
@Ducasse
Copy link
Member

Ducasse commented Nov 27, 2022

Broken tests are unrelated

@MarcusDenker MarcusDenker merged commit b6ce452 into pharo-project:Pharo11 Nov 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

String findTokens: is very slow when passing large CharacterSet or CharacterSetComplement
3 participants