Cache compiled regexes in analyzer #1335

Merged 3 commits into microsoft:main on Mar 19, 2024

Conversation

Contributor

@Edward-Upton Edward-Upton commented Mar 15, 2024

Change Description

This change caches the compiled regexes of Patterns in presidio-analyzer.

A large portion of time is spent compiling the regex patterns when calling /analyze on the presidio-analyzer. For the following request:

curl -d '{"text":"Facere Konopelski alias animi distinctio vel. Ratione corrupti quae consequuntur omnis eligendi. Excepturi et atque voluptas nemo reiciendis. Doloribus odio dolor http://www.nnrhwie.org/ ullam quibusdam. Qui sed 6011237658142640 fugiat labore et. Odio a explicabo ut maiores neque. Quidem neque magni 6011158690503127 qui sapiente. Necessitatibus est molestias Champlin vero voluptates. Tempore voluptate officiis quos error deleniti. Pariatur illum et est ipsum sed. Vitae voluptatem optio doloremque nulla repudiandae. Unde +6732345678 quo explicabo eaque iure. Incidunt bc1qvg883qsjkqh2g6nr3attc0d0yfey4pk2ds06qs odio sit tempora placeat. Officia ipsum quaerat et Lea soluta. Est et pariatur est non et. Officia aperiam quisquam dolores illo occaecati. Officiis et et animi dolores cum. 6011119331161688 quidem cumque libero minus nihil. 356a:ce5d:5eb8:ef5b:1cb1:d717:22e7:2df1 rem libero fuga suscipit voluptas. Reiciendis ea ad consequatur assumenda Ms. Esperanza Sawayn Vero http://smzbxej.edu/FMCoeSQ.svg ut suscipit ab sunt. Velit dolorem Prof. Selena Wehner dolores voluptatibus sit.", "language":"en"}' -H "Content-Type: application/json" -X

the original flame graph (attached as the first image) shows that over 50% of the duration is taken up with these compilations.

The second flame graph (attached as the second image) shows the improvement following these changes after the first request has cached the compiled regexes.

[Image: output (first flame graph, before the change)]
[Image: output_new2 (second flame graph, after the change)]

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

@Edward-Upton
Contributor Author

@microsoft-github-policy-service agree company="Netcraft"

@omri374
Contributor

omri374 commented Mar 15, 2024

Thanks! Does it make sense to compile the regexes during Pattern instantiation instead of during the first call?

@omri374
Contributor

omri374 commented Mar 15, 2024

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@Edward-Upton
Contributor Author

> Thanks! Does it make sense to compile the regexes during Pattern instantiation instead of during the first call?

I had originally implemented it that way. However, analyze can be called with arbitrary flags, so in the worst case we may need to recompile on every call to analyze.
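
The caching idea discussed here can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Pattern` class below is a hypothetical stand-in, `get_compiled`, `compile_count`, and the `default_flags` value are assumptions made for the example, and the standard-library `re` module is used for simplicity.

```python
import re
from typing import Optional


class Pattern:
    """Hypothetical stand-in for a regex pattern holder (not the library's class)."""

    def __init__(self, name: str, regex: str) -> None:
        self.name = name
        self.regex = regex
        # Single cache slot: the compiled object and the flags used to build it.
        self.compiled_regex: Optional[re.Pattern] = None
        self.compiled_with_flags: Optional[int] = None
        # Illustration only: counts how often re.compile is called.
        self.compile_count = 0


def get_compiled(pattern: Pattern, flags: int) -> re.Pattern:
    """Return the cached compiled regex, recompiling only if the flags differ."""
    if pattern.compiled_regex is None or pattern.compiled_with_flags != flags:
        pattern.compiled_regex = re.compile(pattern.regex, flags)
        pattern.compiled_with_flags = flags
        pattern.compile_count += 1
    return pattern.compiled_regex


card = Pattern("card", r"\b\d{16}\b")
default_flags = re.DOTALL | re.MULTILINE | re.IGNORECASE  # assumed defaults

get_compiled(card, default_flags)  # first call: compiles
get_compiled(card, default_flags)  # same flags: reuses the cached object
get_compiled(card, re.IGNORECASE)  # different flags: recompiles
```

With a single cache slot like this, callers that alternate between two flag sets would recompile on every call, which is the worst case mentioned above.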

@omri374
Contributor

omri374 commented Mar 18, 2024

Can we do both? Compile a priori and recompile if the flags change?

@Edward-Upton
Contributor Author

> Can we do both? Compile a priori and recompile if the flags change?

That would mean duplicating the default flags from pattern_recognizer.py into pattern.py, whereas the current approach avoids that and takes more of a 'caching' approach.
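
For comparison, the "do both" alternative might look something like the sketch below. It is an assumption-laden illustration, not code from the repository: `EagerPattern`, `DEFAULT_FLAGS`, and the flag values are hypothetical. It shows why the pattern class would need its own copy of the default flags in order to compile at construction time.

```python
import re
from typing import Optional

# Assumed default flags. In this variant they must be visible to the
# pattern class itself, which is the duplication mentioned above.
DEFAULT_FLAGS = re.DOTALL | re.MULTILINE | re.IGNORECASE


class EagerPattern:
    """Hypothetical pattern that compiles at construction time."""

    def __init__(self, name: str, regex: str) -> None:
        self.name = name
        self.regex = regex
        # Compile up front with the default flags.
        self.compiled_with_flags = DEFAULT_FLAGS
        self.compiled_regex = re.compile(regex, DEFAULT_FLAGS)

    def compiled(self, flags: Optional[int] = None) -> re.Pattern:
        """Return the compiled regex, recompiling only when the flags change."""
        flags = DEFAULT_FLAGS if flags is None else flags
        if flags != self.compiled_with_flags:
            self.compiled_regex = re.compile(self.regex, flags)
            self.compiled_with_flags = flags
        return self.compiled_regex


phone = EagerPattern("phone", r"\+\d{10}")
match = phone.compiled().search("Unde +6732345678 quo explicabo")
before = phone.compiled_regex
after = phone.compiled(re.ASCII)  # different flags trigger a recompile
```

The lazy variant avoids the module-level constant entirely, because the flags arrive with the call rather than being known when the pattern is created.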

@omri374
Contributor

omri374 commented Mar 19, 2024

I see. Thanks! Merged.

@omri374 omri374 merged commit 4db5278 into microsoft:main Mar 19, 2024
8 checks passed