CURRENT version:
- RegEx
- Set Lookups from JSON files
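As a rough illustration of how the two current techniques can work together, the sketch below combines a couple of regular expressions with a set built from JSON. The name list, the patterns, and the `find_entities` helper are all hypothetical stand-ins, not the project's actual data or code.

```python
import json
import re

# Hypothetical JSON payload; a real setup would read it from a file with json.load().
FIRST_NAMES_JSON = '["anna", "erik", "lars", "maria"]'

# A set gives constant-time membership checks per token.
FIRST_NAMES = set(json.loads(FIRST_NAMES_JSON))

# Illustrative patterns only (assumptions), not the project's real expressions.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "ADDR-POSTAL": re.compile(r"\b\d{3} ?\d{2}\b"),
}


def find_entities(text):
    """Return sorted (start, end, label, surface) tuples from regex and set lookup."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), m.end(), label, m.group()))
    # Word-level, case-insensitive lookup against the name set.
    for m in re.finditer(r"\w+", text):
        if m.group().lower() in FIRST_NAMES:
            hits.append((m.start(), m.end(), "PER-FIRST", m.group()))
    return sorted(hits)
```

Note that this naive version does not resolve overlaps: a name that appears inside an email address is reported under both labels.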
FUTURE version:
- Fine-tuned Named Entity Recognition (NER) model, based on xlm-roberta-base: https://huggingface.co/FacebookAI/xlm-roberta-base
-
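A fine-tuned token-classification model needs a fixed mapping between label ids and label names. The sketch below builds one under the assumption of a BIO tagging scheme, which this document does not confirm, and uses only a subset of the entity tags for brevity. The helper name `build_bio_labels` is hypothetical.

```python
# Subset of the entity tags from this document, shortened for the example.
ENTITY_TAGS = ["PER-FIRST", "PER-LAST", "ID-PNR", "EMAIL", "PHONE"]


def build_bio_labels(tags):
    """Return ["O", "B-<tag>", "I-<tag>", ...] for the given tags (BIO scheme assumed)."""
    labels = ["O"]
    for tag in tags:
        labels.append(f"B-{tag}")
        labels.append(f"I-{tag}")
    return labels


LABELS = build_bio_labels(ENTITY_TAGS)
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

# With the transformers library, mappings like these are typically passed as
# num_labels, id2label and label2id when loading a token-classification model.
```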
Names:
- Personal names: https://github.com/swedishdata/personal-names/tree/master
- SCB 2020 First names: https://www.statistikdatabasen.scb.se/pxweb/en/ssd/START__BE__BE0001__BE0001G/BE0001FNamn10/
- SCB 2020 Last names: https://www.statistikdatabasen.scb.se/pxweb/en/ssd/START__BE__BE0001__BE0001G/BE0001ENamn10/
-
Education Programs:
-
Professions:
-
Marital Status:
-
Sexual Orientation:
-
Addresses:
- Street & Postal (incomplete, does not contain all entries): https://github.com/beshrkayali/sverige_postnummer
- Municipalities & Counties: https://github.com/swedishdata/geography/blob/master/counties-municipalities.csv
-
Most up-to-date addresses (extracted from an OpenStreetMap dump of Sweden with osmium and jq):
osmium tags-filter sweden-latest.osm.pbf -o addresses.osm.pbf n/addr:street n/addr:postcode n/addr:city n/admin_level=7 n/admin_level=4 w/addr:street w/addr:postcode w/addr:city w/admin_level=7 w/admin_level=4
osmium export addresses.osm.pbf -f geojson -o addresses.geojson
jq -r '.features[] | select(.properties["addr:street"] != null and .properties["addr:postcode"] != null and .properties["addr:city"] != null) | "\(.properties["addr:street"]),\(.properties["addr:postcode"]),\(.properties["addr:city"])"' addresses.geojson | sed -E 's/,([0-9]{3})([0-9]{2}),/,\1 \2,/' | sort -u > unique_addresses.txt
jq -r '.features[] | select(.properties.admin_level == "7" and .properties.name != null) | "\(.properties.name)"' addresses.geojson | sort -u > municipalities.txt
jq -r '.features[] | select(.properties.admin_level == "4" and .properties.name != null) | "\(.properties.name)"' addresses.geojson | sort -u > counties.txt
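The generated `unique_addresses.txt` holds one `street,postcode,city` triple per line. A minimal sketch of turning those lines into lookup sets follows. The function name `parse_address_lines` and the choice to lowercase streets and cities are assumptions made for illustration.

```python
def parse_address_lines(lines):
    """Split 'street,postcode,city' lines into three lookup sets."""
    streets, postcodes, cities = set(), set(), set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split from the right so a comma inside the street name stays in the street field.
        parts = line.rsplit(",", 2)
        if len(parts) != 3:
            continue  # Skip malformed lines rather than guessing.
        street, postcode, city = (part.strip() for part in parts)
        streets.add(street.lower())
        postcodes.add(postcode)
        cities.add(city.lower())
    return streets, postcodes, cities
```

The function accepts any iterable of strings, so an open file handle for `unique_addresses.txt` (opened with `encoding="utf-8"`) can be passed in directly.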
Entity labels:
- Person First Name (PER-FIRST)
- Person Last Name (PER-LAST)
- Personnummer (ID-PNR)
- Samordningsnummer (ID-SNR)

- Marital Status (MARITAL)
- Biological Sex (SEX)
- Nationality (NATION)

- Education Program (EDU-PROGRAM)
- Profession (PROF)

- Disabilities (DISAB)
- Ethnicity (ETHNIC)
- Sexual Orientation (SEXOR)

- Political Opinions (POL)
- Religious Beliefs (REL)

- Phone Number (PHONE)
- Email (EMAIL)
- Social Media Profiles (SOCM)

- Street Address (ADDR-STREET)
- Postal Code (ADDR-POSTAL)
- Municipality (ADDR-MUNICIPALITY)
- City (ADDR-CITY)
- County (ADDR-COUNTY)

- Bank Account Number (FIN-BANKNUM)
- IBAN (FIN-IBAN)
- BIC/SWIFT Code (FIN-BIC)
- Credit Card Number (FIN-CC)

- Organization Number (ORG-NUM)
- Company Name (ORG-WORK)
- Education Institute (ORG-EDU)

- IP Address (IP)
- MAC Address (MAC)

- Date (DATE)
- Time (TIME)

- Vehicle Registration Number (VEH)
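For the ID-PNR and ID-SNR labels, a regular expression alone tends to produce false positives, so a checksum test is a common addition. The sketch below assumes the usual rules: a Luhn check digit over the ten-digit form, and a samordningsnummer having 60 added to the day of birth. The function names are hypothetical, and the date check is deliberately simplified (it ignores month lengths).

```python
import re

# Optional century, six date digits, optional "-" or "+" separator, four trailing digits.
_ID_RE = re.compile(r"^(?:\d{2})?(\d{6})[-+]?(\d{4})$")


def luhn_ok(digits):
    """Luhn check over a digit string whose last digit is the check digit."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # Double every second digit, counting from the right.
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def classify_swedish_id(candidate):
    """Return 'ID-PNR', 'ID-SNR', or None for a candidate string."""
    m = _ID_RE.match(candidate.strip())
    if not m:
        return None
    date_part, tail = m.groups()
    month = int(date_part[2:4])
    day = int(date_part[4:6])
    if not 1 <= month <= 12 or not luhn_ok(date_part + tail):
        return None
    if 1 <= day <= 31:
        return "ID-PNR"
    if 61 <= day <= 91:  # Samordningsnummer: day of birth plus 60.
        return "ID-SNR"
    return None
```

The century digits, when present, are accepted but excluded from the checksum, since the check digit is computed over the ten-digit form.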