
perf: skip non-matching OFAC government ID regexes during preparation #752

Merged
adamdecaf merged 1 commit into moov-io:master from akamick86:perf/ofac-remark-prefilter on May 29, 2026
@akamick86
Contributor

What

parseGovernmentIDs evaluates every government ID regex against every OFAC remark when grouping SDNs into entities. That is ~20 case-insensitive regexes per remark and is one of the heavier steps when the OFAC list is prepared at startup.

Every pattern is anchored on a fixed leading keyword (Passport, Tax ID, Business Registration, ...), so a remark can only match when that keyword is present. This gates each regex behind a cheap ASCII case-insensitive substring check and only falls through to the regexp engine when the keyword is found. The two dotted abbreviation patterns (CUIT, CURP) have no usable literal prefix and keep running unconditionally.
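The gating step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `containsFold` name matches the helper mentioned in the review, but its body here is an assumption, and `passportRegex` and `findPassport` are hypothetical stand-ins for the real patterns in `pkg/sources/ofac`.

```go
package main

import (
	"fmt"
	"regexp"
)

// containsFold reports whether s contains marker, ignoring ASCII case.
// marker is assumed to be lowercase ASCII already. It does not allocate.
func containsFold(s, marker string) bool {
	for i := 0; i+len(marker) <= len(s); i++ {
		if hasFoldPrefix(s[i:], marker) {
			return true
		}
	}
	return false
}

// hasFoldPrefix assumes len(s) >= len(marker); the caller guarantees it.
func hasFoldPrefix(s, marker string) bool {
	for j := 0; j < len(marker); j++ {
		c := s[j]
		if c >= 'A' && c <= 'Z' {
			c += 'a' - 'A' // fold ASCII uppercase to lowercase
		}
		if c != marker[j] {
			return false
		}
	}
	return true
}

// passportRegex is a simplified, hypothetical pattern for illustration only.
var passportRegex = regexp.MustCompile(`(?i)passport\s+([A-Z0-9]+)`)

// findPassport runs the regex only when the cheap keyword check passes.
func findPassport(remark string) (string, bool) {
	if !containsFold(remark, "passport") {
		return "", false // keyword absent, so the regex cannot match
	}
	m := passportRegex.FindStringSubmatch(remark)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	fmt.Println(findPassport("DOB 01 Jan 1970; Passport A1234567 (Iran)"))
	fmt.Println(findPassport("DOB 01 Jan 1970; nationality Iran"))
}
```

The check is only safe because the keyword is a literal that every match of the regex must contain; a pattern without such a literal has to keep running unconditionally.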

To make the keyword table explicit, baseGovernmentIDs becomes a small ordered slice of {marker, regex, base}. A side benefit is that the output order is now deterministic rather than dependent on map iteration order.
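As a rough sketch of the table shape, assuming simplified types: `idPattern`, `foundID`, `parseIDs`, and the three regexes below are hypothetical and much smaller than the real table. `strings.ToLower` stands in for the allocation-free check to keep the example short. Because Go does not specify map iteration order, ranging over a slice is what makes the output order repeatable.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// idPattern is a hypothetical {marker, regex, base} row.
type idPattern struct {
	marker string         // lowercase keyword; "" means always run the regex
	regex  *regexp.Regexp // simplified pattern, for illustration only
	kind   string         // stands in for the base search.GovernmentID
}

var idPatterns = []idPattern{
	{"passport", regexp.MustCompile(`(?i)passport\s+([A-Z0-9]+)`), "passport"},
	{"tax", regexp.MustCompile(`(?i)tax id no\.\s+([A-Z0-9-]+)`), "tax"},
	{"", regexp.MustCompile(`(?i)C\.U\.I\.T\.\s+([0-9-]+)`), "cuit"},
}

type foundID struct {
	kind, value string
}

// parseIDs walks the table in declaration order, so results always come
// back in table order regardless of where they appear in the remark.
func parseIDs(remark string) []foundID {
	lower := strings.ToLower(remark) // stand-in for an allocation-free check
	var out []foundID
	for _, p := range idPatterns {
		if p.marker != "" && !strings.Contains(lower, p.marker) {
			continue
		}
		if m := p.regex.FindStringSubmatch(remark); m != nil {
			out = append(out, foundID{p.kind, m[1]})
		}
	}
	return out
}

func main() {
	remark := "Tax ID No. 12-345; Passport X998877; C.U.I.T. 20-12345678-9"
	for _, id := range parseIDs(remark) {
		fmt.Println(id.kind, id.value)
	}
}
```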

Correctness

The produced GovernmentID set is unchanged. I compared old vs new output for every SDN in the current OFAC list and they match exactly. Existing pkg/sources/ofac tests pass.
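An old-versus-new comparison of this kind could look something like the sketch below. It is not the script used for the PR: the patterns, the `extract` function, and the sample remarks are all hypothetical, and the real check ran over every SDN in the OFAC list rather than a handful of strings.

```go
package main

import (
	"fmt"
	"reflect"
	"regexp"
	"strings"
)

type pattern struct {
	marker string // lowercase keyword; "" means never gate this regex
	regex  *regexp.Regexp
}

// Simplified, hypothetical patterns for illustration only.
var patterns = []pattern{
	{"passport", regexp.MustCompile(`(?i)passport\s+([A-Z0-9]+)`)},
	{"tax", regexp.MustCompile(`(?i)tax id no\.\s+([A-Z0-9-]+)`)},
	{"", regexp.MustCompile(`(?i)C\.U\.I\.T\.\s+([0-9-]+)`)},
}

// extract runs every regex when gated is false (the old behavior) and
// skips regexes whose marker is absent when gated is true (the new one).
func extract(remark string, gated bool) []string {
	lower := strings.ToLower(remark)
	var out []string
	for _, p := range patterns {
		if gated && p.marker != "" && !strings.Contains(lower, p.marker) {
			continue
		}
		if m := p.regex.FindStringSubmatch(remark); m != nil {
			out = append(out, m[1])
		}
	}
	return out
}

// firstMismatch returns the index of the first remark whose gated and
// ungated outputs differ, or -1 when they agree everywhere.
func firstMismatch(remarks []string) int {
	for i, r := range remarks {
		if !reflect.DeepEqual(extract(r, false), extract(r, true)) {
			return i
		}
	}
	return -1
}

func main() {
	remarks := []string{
		"DOB 01 Jan 1970; Passport A1234567 (Iran)",
		"TAX ID NO. 12-345; C.U.I.T. 20-12345678-9",
		"nationality Iran; Gender Male",
	}
	fmt.Println("first mismatch index:", firstMismatch(remarks))
}
```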

Benchmark

Single core, current OFAC SDN data (~19k records):

parseGovernmentIDs   649ms -> 154ms   (-76%)

Allocations unchanged.
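For readers who want to reproduce this kind of measurement, here is a minimal sketch using `testing.Benchmark` from a standalone program. It is not the benchmark behind the numbers above: the single regex, the synthetic remarks, and the `countMatches` helper are assumptions, so the absolute timings will not match.

```go
package main

import (
	"fmt"
	"regexp"
	"testing"
)

// Simplified, hypothetical pattern for illustration only.
var passportRegex = regexp.MustCompile(`(?i)passport\s+[A-Z0-9]+`)

// containsFold is an ASCII case-insensitive substring test that does not
// allocate. marker is assumed to be lowercase ASCII.
func containsFold(s, marker string) bool {
	for i := 0; i+len(marker) <= len(s); i++ {
		j := 0
		for j < len(marker) {
			c := s[i+j]
			if c >= 'A' && c <= 'Z' {
				c += 'a' - 'A'
			}
			if c != marker[j] {
				break
			}
			j++
		}
		if j == len(marker) {
			return true
		}
	}
	return false
}

// countMatches counts remarks matching the regex, optionally gated.
func countMatches(remarks []string, gated bool) int {
	n := 0
	for _, r := range remarks {
		if gated && !containsFold(r, "passport") {
			continue
		}
		if passportRegex.MatchString(r) {
			n++
		}
	}
	return n
}

// sampleRemarks builds synthetic data: one remark in ten has a passport.
func sampleRemarks() []string {
	remarks := make([]string, 0, 1000)
	for i := 0; i < 1000; i++ {
		if i%10 == 0 {
			remarks = append(remarks, "DOB 01 Jan 1970; Passport A1234567 (Iran)")
		} else {
			remarks = append(remarks, "DOB 01 Jan 1970; nationality Iran; Gender Male")
		}
	}
	return remarks
}

func main() {
	testing.Init() // registers testing flags when not running under "go test"
	remarks := sampleRemarks()
	for _, gated := range []bool{false, true} {
		res := testing.Benchmark(func(b *testing.B) {
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				countMatches(remarks, gated)
			}
		})
		fmt.Printf("gated=%v  %d ns/op  %d allocs/op\n", gated, res.NsPerOp(), res.AllocsPerOp())
	}
}
```

How much the gate helps depends on how rarely the keyword appears in the data, so measurements on synthetic remarks say little about the real list.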

parseGovernmentIDs runs every government ID regex against every OFAC
remark while grouping SDNs into entities. That is roughly twenty
case-insensitive regexes per remark, and it shows up as one of the
heavier steps when the OFAC list is prepared at startup.

Each pattern is anchored on a fixed leading keyword (Passport, Tax ID,
Business Registration, and so on), so a remark cannot match unless that
keyword is present. Gate each regex behind a cheap ASCII case-insensitive
substring check and only run the regexp engine when the keyword is found.
The two dotted abbreviation patterns (CUIT, CURP) have no usable literal
prefix, so they keep running unconditionally.

The set of GovernmentIDs produced is unchanged; I verified the output is
identical for every SDN in the current OFAC list. As a side effect the
lookup table moves from a map to an ordered slice, so the result order is
now deterministic instead of depending on map iteration.

Benchmark over the current OFAC SDN data (about 19k records), single core:

    parseGovernmentIDs   649ms -> 154ms   (-76%, allocations unchanged)
@akamick86 akamick86 requested a review from adamdecaf as a code owner May 29, 2026 16:17
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request optimizes government ID parsing in pkg/sources/ofac/mapper.go by replacing the map-based regex iteration with a slice of patterns containing lowercase markers. A custom, allocation-free containsFold helper is introduced to quickly filter out remarks that do not contain the required marker before executing the regular expression. The reviewer suggested adding an init function to validate that all markers in governmentIDPatterns are strictly lowercase ASCII at startup, preventing potential silent failures if a developer adds an invalid marker in the future.

Comment on lines +371 to +391
var governmentIDPatterns = []governmentIDPattern{
	{"passport", governmentIDPassportRegex, search.GovernmentID{Type: search.GovernmentIDPassport}},
	{"driver", governmentIDDriversLicenseRegex, search.GovernmentID{Type: search.GovernmentIDDriversLicense}},
	{"diplomatic", governmentIDDiplomaticPassRegex, search.GovernmentID{Type: search.GovernmentIDDiplomaticPass}},
	{"national", governmentIDNationalRegex, search.GovernmentID{Type: search.GovernmentIDNational}},
	{"personal", governmentIDPersonalIDRegex, search.GovernmentID{Type: search.GovernmentIDPersonalID}},
	{"tax", governmentIDTaxRegex, search.GovernmentID{Type: search.GovernmentIDTax}},
	{"", governmentIDCUITRegex, search.GovernmentID{Type: search.GovernmentIDCUIT}},
	{"ssn", governmentIDSSNRegex, search.GovernmentID{Type: search.GovernmentIDSSN}},
	{"cedula", governmentIDCedulaRegex, search.GovernmentID{Type: search.GovernmentIDCedula}},
	{"", governmentIDCURPRegex, search.GovernmentID{Type: search.GovernmentIDCURP}},
	{"electoral", governmentIDElectoralRegex, search.GovernmentID{Type: search.GovernmentIDElectoral}},
	{"business", governmentIDBusinessRegistrationRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration}},
	{"company", governmentIDCompanyNumberRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration}},
	{"legal", governmentIDLegalEntityNumberRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration}},
	{"commercial", governmentIDCommercialRegistryRegex, search.GovernmentID{Type: search.GovernmentIDCommercialRegistry}},
	{"isin", chinaISINRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration, Country: "China"}},
	{"social", chinaUSCCRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration}},
	{"birth", governmentIDBirthCertRegex, search.GovernmentID{Type: search.GovernmentIDBirthCert}},
	{"refugee", governmentIDRefugeeRegex, search.GovernmentID{Type: search.GovernmentIDRefugee}},
}

medium

To prevent future bugs where a developer might add a new pattern with an uppercase or non-ASCII marker (which would cause containsFold to silently fail to match), we can add an init function to validate that all markers in governmentIDPatterns are strictly lowercase ASCII (or empty).

var governmentIDPatterns = []governmentIDPattern{
	{"passport", governmentIDPassportRegex, search.GovernmentID{Type: search.GovernmentIDPassport}},
	{"driver", governmentIDDriversLicenseRegex, search.GovernmentID{Type: search.GovernmentIDDriversLicense}},
	{"diplomatic", governmentIDDiplomaticPassRegex, search.GovernmentID{Type: search.GovernmentIDDiplomaticPass}},
	{"national", governmentIDNationalRegex, search.GovernmentID{Type: search.GovernmentIDNational}},
	{"personal", governmentIDPersonalIDRegex, search.GovernmentID{Type: search.GovernmentIDPersonalID}},
	{"tax", governmentIDTaxRegex, search.GovernmentID{Type: search.GovernmentIDTax}},
	{"", governmentIDCUITRegex, search.GovernmentID{Type: search.GovernmentIDCUIT}},
	{"ssn", governmentIDSSNRegex, search.GovernmentID{Type: search.GovernmentIDSSN}},
	{"cedula", governmentIDCedulaRegex, search.GovernmentID{Type: search.GovernmentIDCedula}},
	{"", governmentIDCURPRegex, search.GovernmentID{Type: search.GovernmentIDCURP}},
	{"electoral", governmentIDElectoralRegex, search.GovernmentID{Type: search.GovernmentIDElectoral}},
	{"business", governmentIDBusinessRegistrationRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration}},
	{"company", governmentIDCompanyNumberRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration}},
	{"legal", governmentIDLegalEntityNumberRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration}},
	{"commercial", governmentIDCommercialRegistryRegex, search.GovernmentID{Type: search.GovernmentIDCommercialRegistry}},
	{"isin", chinaISINRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration, Country: "China"}},
	{"social", chinaUSCCRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration}},
	{"birth", governmentIDBirthCertRegex, search.GovernmentID{Type: search.GovernmentIDBirthCert}},
	{"refugee", governmentIDRefugeeRegex, search.GovernmentID{Type: search.GovernmentIDRefugee}},
}

func init() {
	for _, pattern := range governmentIDPatterns {
		for i := 0; i < len(pattern.marker); i++ {
			c := pattern.marker[i]
			if c < 'a' || c > 'z' {
				panic("governmentIDPatterns: marker must be lowercase ASCII: " + pattern.marker)
			}
		}
	}
}

@zhemaituk

I second this and, more broadly, startup performance optimization in general. These optimizations are especially valuable in auto-scaling environments, where additional containers (or Lambda functions) need to spin up quickly to handle traffic bursts. In my current setup, startup takes over 30 seconds, primarily due to reading and preparing data files.

@adamdecaf
Member

This makes a lot of sense. Good call out that maps were randomly iterating before, which does lead to inconsistent results. Thanks for the fix.

@adamdecaf adamdecaf merged commit 96a4c0a into moov-io:master May 29, 2026
12 checks passed
@adamdecaf
Member

I've never liked the government ID regexes... It's a big pain, but I haven't found a small enough model to embed (similar to libpostal, which is also big) for extraction over regexes.

I'd like to have Watchman leverage lots of little models for data processing/extraction/etc rather than hardcoded logic we have now.

@akamick86
Contributor Author

> I've never liked the government ID regexes... It's a big pain, but I haven't found a small enough model to embed (similar to libpostal, which is also big) for extraction over regexes.
>
> I'd like to have Watchman leverage lots of little models for data processing/extraction/etc rather than hardcoded logic we have now.

I opened a PR for libpostal that makes it possible to avoid loading the entire CRF into memory. See openvenues/libpostal#724

If that lands, I think there is an opportunity to follow the same pattern for the other files libpostal loads and bring total memory usage down to about 250 MB, versus 2.5 GB today.
The tradeoff is just disk usage.
