# Data validation

* https://adventofcode.com/2020/day/4

We get to validate passports. Part 1 asks us to check which fields are present; seven fields are required, and one (`cid`) is optional. This is mostly a parsing task, however.

The data for each passport is separated from the next by a blank line, so we just split the whole text on a doubled newline (`\n\n`). Each passport is then trivially split into key-value pairs by splitting on arbitrary whitespace; the `str.split()` method doesn't care if the separators are newlines, spaces or some other whitespace! Each key-value pair is then split once on `:`, and the resulting pairs turn each passport into a dictionary.
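As a small illustration of that parsing step, here is a minimal sketch using one block from the puzzle's example data; the names `block`, `pairs` and `passport` are only for illustration:

```python
# One passport block; fields are separated by a mix of spaces and newlines.
block = "hcl:#cfa07d eyr:2025 pid:166559648\niyr:2011 ecl:brn hgt:59in"

# str.split() without arguments splits on any run of whitespace.
pairs = block.split()
assert pairs[0] == "hcl:#cfa07d" and len(pairs) == 6

# Splitting each pair once on ":" gives [key, value] lists that dict() accepts.
passport = dict(pair.split(":", 1) for pair in pairs)
assert passport["hgt"] == "59in"
```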

Now that we have dictionaries, we need to validate the keys in them. I'm making use of the fact that Python's [`dict.keys()` keys view object](https://docs.python.org/library/stdtypes.html#dict-views) acts as a *set*, testing that each one is a [superset](https://docs.python.org/3/library/stdtypes.html#frozenset.issuperset) of the required field names as well as a subset of all possible field names. Python's chained comparison operators make this a very simple and elegant expression:

```python
required = frozenset({"byr", "iyr", "eyr", "hgt", "hcl", "ecl", "pid"})  # all required keys
all_ = required | frozenset({"cid"})  # required plus optional keys
all_ >= passport.keys() >= required   # true if all required keys are there, and no unknown keys
```
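To see the chained comparison at work, here is a minimal sketch with a few made-up passports. The sample dictionaries and the `has_valid_keys` helper are hypothetical and only serve to show each outcome:

```python
required = frozenset({"byr", "iyr", "eyr", "hgt", "hcl", "ecl", "pid"})
all_ = required | frozenset({"cid"})

def has_valid_keys(passport: dict) -> bool:
    # dict.keys() is set-like, so it supports the subset/superset operators.
    return all_ >= passport.keys() >= required

complete = dict.fromkeys(required, "x")   # every required key, no cid
with_cid = {**complete, "cid": "x"}       # optional key added
missing = {k: v for k, v in complete.items() if k != "hgt"}  # one key short
unknown = {**complete, "foo": "x"}        # unexpected extra key

assert has_valid_keys(complete)
assert has_valid_keys(with_cid)
assert not has_valid_keys(missing)
assert not has_valid_keys(unknown)
```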

In [1]:
from typing import Callable, Iterable, Mapping

PassportData = Mapping[str, str]
Validator = Callable[[PassportData], bool]

required = frozenset({"byr", "iyr", "eyr", "hgt", "hcl", "ecl", "pid"})
all_ = required | frozenset({"cid"})

def valid_passport(passport: PassportData) -> bool:
    return all_ >= passport.keys() >= required

def read_passports(data: str) -> Iterable[PassportData]:
    for block in data.split("\n\n"):
        yield dict(f.split(':', 1) for f in block.split())

def count_valid(passports: Iterable[PassportData], validator: Validator = valid_passport) -> int:
    return sum(1 for _ in filter(validator, passports))

testdata = """\
ecl:gry pid:860033327 eyr:2020 hcl:#fffffd
byr:1937 iyr:2017 cid:147 hgt:183cm

iyr:2013 ecl:amb cid:350 eyr:2023 pid:028048884
hcl:#cfa07d byr:1929

hcl:#ae17e1 iyr:2013
eyr:2024
ecl:brn pid:760753108 byr:1931
hgt:179cm

hcl:#cfa07d eyr:2025 pid:166559648
iyr:2011 ecl:brn hgt:59in
"""

assert count_valid(read_passports(testdata)) == 2

In [2]:
import aocd
passportdata = aocd.get_data(day=4, year=2020)

In [3]:
print("Part 1:", count_valid(read_passports(passportdata)))

Part 1: 226


## Value validation

To validate the values, I reached for a tool I use quite often: a schema validation library called [Marshmallow](https://marshmallow.readthedocs.io/). It makes it trivial to define validators for each field; only the height validation required 'custom' code:

In [4]:
from marshmallow import fields, validate, RAISE, Schema, ValidationError

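def validate_height(height: str) -> None:
    """Raise ValidationError unless height is a valid cm or in measurement."""
    try:
        # everything before the two-character unit must be an integer
        value = int(height[:-2])
    except ValueError:
        raise ValidationError("Invalid height") from None

    unit = height[-2:]
    if unit == "cm" and 150 <= value <= 193:
        return
    if unit == "in" and 59 <= value <= 76:
        return
    raise ValidationError("Invalid height")
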

class PassportSchema(Schema):
    class Meta:
        unknown = RAISE  # the imported constant, not the string 'RAISE'
    # validate.Range bounds are inclusive by default
    byr = fields.Int(required=True, validate=validate.Range(1920, 2002))
    iyr = fields.Int(required=True, validate=validate.Range(2010, 2020))
    eyr = fields.Int(required=True, validate=validate.Range(2020, 2030))
    hgt = fields.Str(required=True, validate=validate_height)
    hcl = fields.Str(required=True, validate=validate.Regexp(r"^#[0-9a-f]{6}$"))
    ecl = fields.Str(
        required=True,
        validate=validate.OneOf(frozenset("amb blu brn gry grn hzl oth".split())),
    )
    pid = fields.Str(required=True, validate=validate.Regexp(r"^\d{9}$"))
    cid = fields.Str()


def valid_passport_fields(passport: PassportData) -> bool:
    try:
        PassportSchema().load(passport)
        return True
    except ValidationError:
        return False

testinvalid = """\
eyr:1972 cid:100
hcl:#18171d ecl:amb hgt:170 pid:186cm iyr:2018 byr:1926

iyr:2019
hcl:#602927 eyr:1967 hgt:170cm
ecl:grn pid:012533040 byr:1946

hcl:dab227 iyr:2012
ecl:brn hgt:182cm pid:021572410 eyr:2020 byr:1992 cid:277

hgt:59cm ecl:zzz
eyr:2038 hcl:74454a iyr:2023
pid:3556412378 byr:2007
"""
assert count_valid(read_passports(testinvalid), valid_passport_fields) == 0

testvalid = """\
pid:087499704 hgt:74in ecl:grn iyr:2012 eyr:2030 byr:1980
hcl:#623a2f

eyr:2029 ecl:blu cid:129 byr:1989
iyr:2014 pid:896056539 hcl:#a97842 hgt:165cm

hcl:#888785
hgt:164cm byr:2001 iyr:2015 cid:88
pid:545766238 ecl:hzl
eyr:2022

iyr:2010 hgt:158cm hcl:#b6652a ecl:blu byr:1944 eyr:2021 pid:093154719
"""

assert count_valid(read_passports(testvalid), valid_passport_fields) == 4

In [5]:
print("Part 2:", count_valid(read_passports(passportdata), valid_passport_fields))

Part 2: 160
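
For comparison, the puzzle's part 2 rules can also be written without a schema library. This is only a sketch in plain Python, not the approach used above; the names `in_range`, `valid_height`, `RULES` and `valid_values` are hypothetical. Unlike the Marshmallow schema it insists on four-digit years and does not reject unknown keys.

```python
import re
from typing import Callable, Mapping

def in_range(low: int, high: int) -> Callable[[str], bool]:
    # Build a check for a four-digit year between low and high (inclusive).
    def check(value: str) -> bool:
        return re.fullmatch(r"[0-9]{4}", value) is not None and low <= int(value) <= high
    return check

def valid_height(value: str) -> bool:
    # A number followed by a unit; the allowed range depends on the unit.
    match = re.fullmatch(r"([0-9]+)(cm|in)", value)
    if match is None:
        return False
    number, unit = int(match[1]), match[2]
    low, high = (150, 193) if unit == "cm" else (59, 76)
    return low <= number <= high

RULES: Mapping[str, Callable[[str], bool]] = {
    "byr": in_range(1920, 2002),
    "iyr": in_range(2010, 2020),
    "eyr": in_range(2020, 2030),
    "hgt": valid_height,
    "hcl": lambda v: re.fullmatch(r"#[0-9a-f]{6}", v) is not None,
    "ecl": lambda v: v in {"amb", "blu", "brn", "gry", "grn", "hzl", "oth"},
    "pid": lambda v: re.fullmatch(r"[0-9]{9}", v) is not None,
}

def valid_values(passport: Mapping[str, str]) -> bool:
    # Every required field must be present and pass its rule; cid is ignored.
    return all(key in passport and rule(passport[key]) for key, rule in RULES.items())

# Usage: one passport from the valid examples, and a copy with a bad height.
line = "pid:087499704 hgt:74in ecl:grn iyr:2012 eyr:2030 byr:1980 hcl:#623a2f"
good = dict(field.split(":", 1) for field in line.split())
bad = {**good, "hgt": "59cm"}
assert valid_values(good)
assert not valid_values(bad)
```

A table of small predicates keeps each rule next to its field name, which is roughly what the schema class does declaratively.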
