perf: skip non-matching OFAC government ID regexes during preparation by akamick86 · Pull Request #752 · moov-io/watchman

akamick86 · 2026-05-29T16:17:29Z

What

parseGovernmentIDs evaluates every government ID regex against every OFAC remark when grouping SDNs into entities. That is ~20 case-insensitive regexes per remark and is one of the heavier steps when the OFAC list is prepared at startup.

Every pattern is anchored on a fixed leading keyword (Passport, Tax ID, Business Registration, ...), so a remark can only match when that keyword is present. This gates each regex behind a cheap ASCII case-insensitive substring check and only falls through to the regexp engine when the keyword is found. The two dotted abbreviation patterns (CUIT, CURP) have no usable literal prefix and keep running unconditionally.
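The gate described above can be sketched as follows. This is a minimal illustration and not the helper from the PR: the name `containsFold` is taken from the review summary, but the body, and the assumption that the marker is already lowercase ASCII, are mine.

```go
package main

import "fmt"

// containsFold reports whether s contains marker, ignoring ASCII case.
// Assumption of this sketch: marker is already lowercase ASCII.
// It works on bytes and allocates nothing.
func containsFold(s, marker string) bool {
	if len(marker) == 0 {
		return true // empty marker: nothing to gate on
	}
	for i := 0; i+len(marker) <= len(s); i++ {
		j := 0
		for ; j < len(marker); j++ {
			c := s[i+j]
			if c >= 'A' && c <= 'Z' {
				c += 'a' - 'A' // fold ASCII uppercase to lowercase
			}
			if c != marker[j] {
				break
			}
		}
		if j == len(marker) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(containsFold("Passport A1234567 (Iran)", "passport"))
	fmt.Println(containsFold("DOB 01 Jan 1970; POB Tehran", "passport"))
}
```

Only remarks for which this returns true would go on to the regexp engine; everything else is rejected with a plain byte scan.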

To make the keyword table explicit, baseGovernmentIDs becomes a small ordered slice of {marker, regex, base}. A side benefit is that the output order is now deterministic rather than dependent on map iteration order.
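A rough sketch of what an ordered {marker, regex, base} table and its loop might look like. The type names, the three regexes, and the `kind` string are hypothetical stand-ins for the real OFAC patterns and `search.GovernmentID`. The gate here uses `strings.ToLower`, which allocates, unlike the allocation-free check the PR describes.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// idPattern is a hypothetical stand-in for the PR's {marker, regex, base} entry.
type idPattern struct {
	marker string // lowercase keyword; empty means "always run the regex"
	re     *regexp.Regexp
	kind   string // stand-in for the base search.GovernmentID
}

// Illustrative patterns only; the real OFAC regexes are more involved.
var idPatterns = []idPattern{
	{"passport", regexp.MustCompile(`(?i)Passport\s+#?\s*([A-Z0-9]+)`), "passport"},
	{"tax", regexp.MustCompile(`(?i)Tax ID No\.\s+([A-Z0-9-]+)`), "tax-id"},
	{"", regexp.MustCompile(`(?i)C\.U\.I\.T\.\s+([0-9-]+)`), "cuit"},
}

type foundID struct{ Kind, Value string }

// parseIDs walks the slice in declaration order, so the result order
// depends only on the table, never on map iteration.
func parseIDs(remark string) []foundID {
	lower := strings.ToLower(remark) // simple gate; allocates, unlike the PR's helper
	var out []foundID
	for _, p := range idPatterns {
		if p.marker != "" && !strings.Contains(lower, p.marker) {
			continue // keyword absent: skip the regexp engine entirely
		}
		if m := p.re.FindStringSubmatch(remark); m != nil {
			out = append(out, foundID{p.kind, m[1]})
		}
	}
	return out
}

func main() {
	fmt.Println(parseIDs("Tax ID No. 12-345; Passport A1234567"))
}
```

The passport entry comes first in the output even though the tax ID appears first in the remark, because the order follows the table.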

Correctness

The produced GovernmentID set is unchanged. I compared old vs new output for every SDN in the current OFAC list and they match exactly. Existing pkg/sources/ofac tests pass.

Benchmark

Single core, current OFAC SDN data (~19k records):

parseGovernmentIDs   649ms -> 154ms   (-76%)

Allocations unchanged.
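For readers who want to try a comparison like this themselves, here is a toy harness using `testing.Benchmark` from a plain `main` package. Everything in it is an assumption made for illustration: the single regex, the four made-up remarks, and the `strings.ToLower` gate (which allocates, unlike the PR's check). Its numbers say nothing about the 649ms -> 154ms figure above, which was measured on the real SDN data.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"testing"
)

// Illustrative pattern; not one of the real OFAC regexes.
var passportRe = regexp.MustCompile(`(?i)Passport\s+([A-Z0-9]+)`)

// sampleRemarks is made-up data in the rough shape of OFAC remarks.
var sampleRemarks = []string{
	"DOB 01 Jan 1970; POB Tehran, Iran",
	"Passport A1234567 (Iran)",
	"Vessel Registration Identification IMO 1234567",
	"alt. Email Address someone@example.com",
}

// countUngated runs the regex on every remark.
func countUngated(remarks []string) int {
	n := 0
	for _, r := range remarks {
		if passportRe.MatchString(r) {
			n++
		}
	}
	return n
}

// countGated runs the regex only when the keyword is present.
func countGated(remarks []string) int {
	n := 0
	for _, r := range remarks {
		if strings.Contains(strings.ToLower(r), "passport") && passportRe.MatchString(r) {
			n++
		}
	}
	return n
}

func main() {
	ungated := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			countUngated(sampleRemarks)
		}
	})
	gated := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			countGated(sampleRemarks)
		}
	})
	fmt.Println("ungated:", ungated)
	fmt.Println("gated:  ", gated)
}
```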

gemini-code-assist

Code Review

This pull request optimizes government ID parsing in pkg/sources/ofac/mapper.go by replacing the map-based regex iteration with a slice of patterns containing lowercase markers. A custom, allocation-free containsFold helper is introduced to quickly filter out remarks that do not contain the required marker before executing the regular expression. The reviewer suggested adding an init function to validate that all markers in governmentIDPatterns are strictly lowercase ASCII at startup, preventing potential silent failures if a developer adds an invalid marker in the future.

gemini-code-assist · 2026-05-29T16:19:30Z

+var governmentIDPatterns = []governmentIDPattern{
+	{"passport", governmentIDPassportRegex, search.GovernmentID{Type: search.GovernmentIDPassport}},
+	{"driver", governmentIDDriversLicenseRegex, search.GovernmentID{Type: search.GovernmentIDDriversLicense}},
+	{"diplomatic", governmentIDDiplomaticPassRegex, search.GovernmentID{Type: search.GovernmentIDDiplomaticPass}},
+	{"national", governmentIDNationalRegex, search.GovernmentID{Type: search.GovernmentIDNational}},
+	{"personal", governmentIDPersonalIDRegex, search.GovernmentID{Type: search.GovernmentIDPersonalID}},
+	{"tax", governmentIDTaxRegex, search.GovernmentID{Type: search.GovernmentIDTax}},
+	{"", governmentIDCUITRegex, search.GovernmentID{Type: search.GovernmentIDCUIT}},
+	{"ssn", governmentIDSSNRegex, search.GovernmentID{Type: search.GovernmentIDSSN}},
+	{"cedula", governmentIDCedulaRegex, search.GovernmentID{Type: search.GovernmentIDCedula}},
+	{"", governmentIDCURPRegex, search.GovernmentID{Type: search.GovernmentIDCURP}},
+	{"electoral", governmentIDElectoralRegex, search.GovernmentID{Type: search.GovernmentIDElectoral}},
+	{"business", governmentIDBusinessRegistrationRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration}},
+	{"company", governmentIDCompanyNumberRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration}},
+	{"legal", governmentIDLegalEntityNumberRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration}},
+	{"commercial", governmentIDCommercialRegistryRegex, search.GovernmentID{Type: search.GovernmentIDCommercialRegistry}},
+	{"isin", chinaISINRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration, Country: "China"}},
+	{"social", chinaUSCCRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration}},
+	{"birth", governmentIDBirthCertRegex, search.GovernmentID{Type: search.GovernmentIDBirthCert}},
+	{"refugee", governmentIDRefugeeRegex, search.GovernmentID{Type: search.GovernmentIDRefugee}},
+}


To prevent future bugs where a developer might add a new pattern with an uppercase or non-ASCII marker (which would cause containsFold to silently fail to match), we can add an init function to validate that all markers in governmentIDPatterns are strictly lowercase ASCII (or empty).

var governmentIDPatterns = []governmentIDPattern{
	{"passport", governmentIDPassportRegex, search.GovernmentID{Type: search.GovernmentIDPassport}},
	{"driver", governmentIDDriversLicenseRegex, search.GovernmentID{Type: search.GovernmentIDDriversLicense}},
	{"diplomatic", governmentIDDiplomaticPassRegex, search.GovernmentID{Type: search.GovernmentIDDiplomaticPass}},
	{"national", governmentIDNationalRegex, search.GovernmentID{Type: search.GovernmentIDNational}},
	{"personal", governmentIDPersonalIDRegex, search.GovernmentID{Type: search.GovernmentIDPersonalID}},
	{"tax", governmentIDTaxRegex, search.GovernmentID{Type: search.GovernmentIDTax}},
	{"", governmentIDCUITRegex, search.GovernmentID{Type: search.GovernmentIDCUIT}},
	{"ssn", governmentIDSSNRegex, search.GovernmentID{Type: search.GovernmentIDSSN}},
	{"cedula", governmentIDCedulaRegex, search.GovernmentID{Type: search.GovernmentIDCedula}},
	{"", governmentIDCURPRegex, search.GovernmentID{Type: search.GovernmentIDCURP}},
	{"electoral", governmentIDElectoralRegex, search.GovernmentID{Type: search.GovernmentIDElectoral}},
	{"business", governmentIDBusinessRegistrationRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration}},
	{"company", governmentIDCompanyNumberRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration}},
	{"legal", governmentIDLegalEntityNumberRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration}},
	{"commercial", governmentIDCommercialRegistryRegex, search.GovernmentID{Type: search.GovernmentIDCommercialRegistry}},
	{"isin", chinaISINRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration, Country: "China"}},
	{"social", chinaUSCCRegex, search.GovernmentID{Type: search.GovernmentIDBusinessRegisration}},
	{"birth", governmentIDBirthCertRegex, search.GovernmentID{Type: search.GovernmentIDBirthCert}},
	{"refugee", governmentIDRefugeeRegex, search.GovernmentID{Type: search.GovernmentIDRefugee}},
}

func init() {
	for _, pattern := range governmentIDPatterns {
		for i := 0; i < len(pattern.marker); i++ {
			c := pattern.marker[i]
			if c < 'a' || c > 'z' {
				panic("governmentIDPatterns: marker must be lowercase ASCII: " + pattern.marker)
			}
		}
	}
}
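The same check could also be written as a function that returns an error, so that a unit test can call it instead of (or in addition to) a panic in `init`. This is only a sketch of that variation; `validateMarkers` is a hypothetical name and takes a plain slice of markers rather than the real `governmentIDPatterns`.

```go
package main

import "fmt"

// validateMarkers returns an error for the first marker containing a byte
// outside 'a'..'z'. Empty markers are allowed: they mean the regex always runs.
func validateMarkers(markers []string) error {
	for _, m := range markers {
		for i := 0; i < len(m); i++ {
			if c := m[i]; c < 'a' || c > 'z' {
				return fmt.Errorf("marker %q must be lowercase ASCII letters", m)
			}
		}
	}
	return nil
}

func main() {
	fmt.Println(validateMarkers([]string{"passport", "", "tax"}))
	fmt.Println(validateMarkers([]string{"Passport"}))
}
```

Note that the `c < 'a' || c > 'z'` test, here and in the suggestion above, is stricter than "lowercase ASCII": it also rejects digits and spaces, so a marker such as "tax id" would fail validation.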

zhemaituk · 2026-05-29T18:21:52Z

Second this, and more broadly, startup performance optimization. These optimizations are especially valuable in auto-scalable environments, where additional containers (or Lambda functions) need to spin up quickly to handle traffic bursts. In my current setup, startup takes over 30 seconds, primarily due to reading and preparing data files.

adamdecaf · 2026-05-29T18:59:05Z

This makes a lot of sense. Good call out that maps were randomly iterating before, which does lead to inconsistent results. Thanks for the fix.

adamdecaf · 2026-05-29T19:00:33Z

I've never liked the government ID regexes... It's a big pain, but I haven't found a small enough model to embed (similar to libpostal, which is also big) for extraction over regexes.

I'd like to have Watchman leverage lots of little models for data processing/extraction/etc rather than hardcoded logic we have now.

akamick86 · 2026-05-29T23:19:49Z

> I've never liked the government ID regexes... It's a big pain, but I haven't found a small enough model to embed (similar to libpostal, which is also big) for extraction over regexes.
>
> I'd like to have Watchman leverage lots of little models for data processing/extraction/etc rather than hardcoded logic we have now.

Opened a PR for libpostal that avoids loading the entire CRF into memory. See openvenues/libpostal#724

If that lands, I think there is an opportunity to follow the same pattern for the other files libpostal loads and bring total memory usage down to about 250 MB vs 2.5 GB today.
The tradeoff is just disk usage.

akamick86 requested a review from adamdecaf as a code owner May 29, 2026 16:17

gemini-code-assist Bot reviewed May 29, 2026

adamdecaf merged commit 96a4c0a into moov-io:master May 29, 2026
12 checks passed

adamdecaf merged 1 commit into moov-io:master from akamick86:perf/ofac-remark-prefilter
