The goal was to match candidate names from several congressional election files to a master list of congressional members using only name information. The available data included a primary dataset (congress_members_with_parties.csv) with roughly 2,900 members and several year‑specific election datasets (congressional_elections_YYYY.csv for 2019–2025). Each election file contained candidate names, party affiliations and basic metadata; however, the “status” column was considered unreliable and was ignored.
- Data Loading and Inspection. All members were read into a pandas DataFrame. Each member’s canonical name was constructed by concatenating the `first_name` and `last_name` fields and normalised by lowercasing, stripping whitespace and removing punctuation. A dictionary mapping these normalised names to their row indices was built to enable constant‑time exact matching.
- Election File Handling. All election files matching the prescribed pattern were discovered. To prevent processing empty or malformed datasets, files were filtered out if reading them produced an empty DataFrame or raised an exception. Only files with data were retained.
- Exact and Fuzzy Matching. For each candidate in the election data, the candidate’s name was normalised in the same way as the members’ names. An exact match check was performed against the lookup dictionary. If an exact match was found, the candidate was linked to the corresponding member with full confidence. For names without an exact match, a fuzzy match was attempted using `rapidfuzz.fuzz.token_sort_ratio`, which is robust to token reordering (e.g., `Nyajuoga, Joseph` vs. `Joseph Nyajuoga`). The candidate was linked to the member with the highest similarity score if that score exceeded 85%; otherwise it was left unmatched.
- Result Compilation. The matching process produced a list of results containing the candidate’s name, party, year, whether a match was found, the matched member’s details (if any) and a confidence score. A summary of members who were never matched in any election file was also generated. All results were saved to CSV files for reporting.
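The exact-then-fuzzy procedure described in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code that produced the results: the helper names (`normalise`, `token_sort_score`, `build_lookup`, `match_candidate`) and the sample names are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for `rapidfuzz.fuzz.token_sort_ratio` so that the sketch runs without third-party packages. Its scores are similar in spirit to rapidfuzz's but not identical.

```python
import re
import string
from difflib import SequenceMatcher

# Translation table that deletes ASCII punctuation (assumption: names are ASCII).
_PUNCT = str.maketrans("", "", string.punctuation)

def normalise(name: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", name.lower().translate(_PUNCT)).strip()

def token_sort_score(a: str, b: str) -> float:
    """Similarity on a 0-100 scale after sorting tokens (stand-in scorer)."""
    a_sorted = " ".join(sorted(a.split()))
    b_sorted = " ".join(sorted(b.split()))
    return 100.0 * SequenceMatcher(None, a_sorted, b_sorted).ratio()

def build_lookup(member_names):
    """Map each normalised name to the first row index carrying it."""
    lookup = {}
    for idx, name in enumerate(member_names):
        lookup.setdefault(normalise(name), idx)
    return lookup

def match_candidate(candidate, lookup, threshold=85.0):
    """Return (member_index, score, kind); member_index is None if unmatched."""
    key = normalise(candidate)
    if key in lookup:  # constant-time exact match
        return lookup[key], 100.0, "exact"
    best_idx, best_score = None, 0.0
    for member_key, idx in lookup.items():  # linear scan over members
        score = token_sort_score(key, member_key)
        if score > best_score:
            best_idx, best_score = idx, score
    if best_score > threshold:
        return best_idx, best_score, "fuzzy"
    return None, best_score, "unmatched"

# Hypothetical member list for illustration only.
members = ["Joseph Nyajuoga", "Jane Q. Public"]
lookup = build_lookup(members)

for candidate in ["Joseph Nyajuoga", "Nyajuoga, Joseph", "Public, Jane", "Peter Otieno"]:
    print(candidate, "->", match_candidate(candidate, lookup))
```

Because the exact check compares whole normalised strings, a reordered name such as `Nyajuoga, Joseph` falls through to the fuzzy step, where sorting the tokens makes it identical to the member's name.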
- Name Variations and Punctuation. Candidate names appeared in various formats (e.g., with commas, initials or suffixes). Normalisation removed punctuation and collapsed extra spaces, so that forms such as “John A. Smith” and “Smith, John” could be compared on their tokens alone.
- Data Completeness. Some election files were missing or contained no data. Skipping empty files prevented errors and false negatives.
- Duplicate Names. Occasionally multiple members shared the same name. The matching process retained the first occurrence for exact matches and used the highest fuzzy score to select among potential matches, although this could still conflate distinct individuals with identical names.
- Performance. Fuzzy matching all remaining candidates against ~2,900 members could be computationally intensive. Precomputing normalised member names and using the efficient `rapidfuzz` library reduced the overhead.
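One way to make the duplicate-name issue visible is to group members by normalised name once, up front, and list any name shared by more than one row. The sketch below is an illustration only: `group_by_normalised_name` and the sample names are hypothetical, and flagging collisions this way is an assumption about how the ambiguity could be surfaced, not a description of what the matching process did.

```python
import string
from collections import defaultdict

# Translation table that deletes ASCII punctuation.
_PUNCT = str.maketrans("", "", string.punctuation)

def normalise(name):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(name.lower().translate(_PUNCT).split())

def group_by_normalised_name(member_names):
    """Normalise each member name once and collect row indices per name."""
    groups = defaultdict(list)
    for idx, name in enumerate(member_names):
        groups[normalise(name)].append(idx)
    return dict(groups)

# Hypothetical member list containing one collision.
members = ["John Smith", "Jane Doe", "JOHN SMITH."]
groups = group_by_normalised_name(members)

# Names that map to more than one member row are ambiguous.
ambiguous = {name: rows for name, rows in groups.items() if len(rows) > 1}
print(ambiguous)  # {'john smith': [0, 2]}
```

The same pass doubles as the precomputation step: each member name is normalised exactly once, rather than once per candidate comparison.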
Overall, combining a normalised exact match with token‑based fuzzy matching provided good coverage while avoiding many false positives.
[!summary]
- Total candidates compared: 24244
- Number of matches: 7615
- Number of exact matches: 4377
- Number of fuzzy matches: 3238
- Number of unmatched candidates: 16629
- Unique members matched: 1576 of 2873
- Members unmatched: 1297
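
The summary figures can be cross-checked with a few lines of arithmetic. The snippet below uses only the numbers reported above; the variable names are illustrative.

```python
# Figures copied from the summary above.
total_candidates = 24244
matches = 7615
exact_matches = 4377
fuzzy_matches = 3238
unmatched_candidates = 16629
members_total = 2873
members_matched = 1576
members_unmatched = 1297

# Each reported total should equal the sum of its parts.
checks = {
    "exact + fuzzy == matches": exact_matches + fuzzy_matches == matches,
    "matches + unmatched == total": matches + unmatched_candidates == total_candidates,
    "matched + unmatched members == members": members_matched + members_unmatched == members_total,
}
for label, ok in checks.items():
    print(f"{label}: {ok}")

# Derived rates, printed as percentages.
print(f"candidate match rate: {matches / total_candidates:.1%}")
print(f"fuzzy share of matches: {fuzzy_matches / matches:.1%}")
print(f"member coverage: {members_matched / members_total:.1%}")
```

All three identities hold. Roughly 31% of candidates were linked to a member, and roughly 55% of members were matched at least once.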