# Email validator

Create an email address validator to ensure the addresses in the database are valid.

Any e-mail address is made of a local part, the symbol `@` and a domain name. Thus, the validation process is done in three steps, using RegEx patterns:
1. Check if the email address has a valid base format, i.e., `localpart@domainname`.
2. Check the local part format.
3. Check the domain name format.
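
The three steps above can be sketched in a few lines. This is only a minimal illustration, not the `EmailValidator` class defined later in this notebook: the function name `validate_in_three_steps` and the two simplified patterns are assumptions made for the example, and they are considerably stricter than the RFC 5322 grammar.

```python
import re

# Simplified, illustrative patterns (assumptions, much stricter than RFC 5322)
LOCAL_PART_RE = re.compile(r"[a-z0-9!#$%&'*+\-/=?^_`{|}~]+(?:\.[a-z0-9!#$%&'*+\-/=?^_`{|}~]+)*")
DOMAIN_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*")


def validate_in_three_steps(address):
    """Return (is_valid, reason) for a hypothetical three-step check."""
    address = str(address).lower()

    # Step 1: base format `localpart@domainname` (split on the last `@`)
    local_part, at_sign, domain = address.rpartition("@")
    if not at_sign or not local_part or not domain:
        return False, "expected `localpart@domainname`"

    # Step 2: local part format
    if LOCAL_PART_RE.fullmatch(local_part) is None:
        return False, "invalid local part"

    # Step 3: domain name format
    if DOMAIN_RE.fullmatch(domain) is None:
        return False, "invalid domain name"

    return True, "ok"


print(validate_in_three_steps("user@example.com"))   # (True, 'ok')
print(validate_in_three_steps("a..b@example.com"))   # (False, 'invalid local part')
print(validate_in_three_steps("user@-example.com"))  # (False, 'invalid domain name')
```

Returning the reason alongside the boolean mirrors the idea of reporting which of the three steps failed.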

Because the vast majority of users have an unquoted address, the validator does not accept addresses with a quoted local part. For more information about the formatting of the local part, see this [link](https://en.wikipedia.org/wiki/Email_address#Local-part). In addition, as IP domain names are extremely rare except in email spam, the validator does not accept them either. For more information about the domain formatting, see this other [link](https://en.wikipedia.org/wiki/Email_address#Domain).
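
As a rough sketch of how these two special cases can be recognised before any further checks, the snippet below flags quoted local parts and bracketed domain literals. The helper names `split_address`, `has_quoted_local_part` and `has_domain_literal` are hypothetical, and they only look at the delimiters, not at the content between them.

```python
def split_address(address):
    """Split on the last `@`, so that an `@` inside a quoted local part is kept."""
    local_part, _, domain = address.rpartition("@")
    return local_part, domain


def has_quoted_local_part(address):
    """True when the local part is wrapped in double quotes."""
    local_part, _ = split_address(address)
    return len(local_part) >= 2 and local_part.startswith('"') and local_part.endswith('"')


def has_domain_literal(address):
    """True when the domain is wrapped in square brackets (IP literal form)."""
    _, domain = split_address(address)
    return domain.startswith("[") and domain.endswith("]")


for address in ['"john..doe"@example.org', "jsmith@[192.168.2.1]", "simple@example.com"]:
    print(address, has_quoted_local_part(address), has_domain_literal(address))
```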

Of course, using RegEx patterns I can only verify that the email address is syntactically correct, not whether it was mistyped or whether it actually exists (which would require checking the SMTP server). For a more complete validator that does extra verifications, one might use the `Python` third-party library `validate-email` (see their GitHub [repo](https://github.com/syrusakbary/validate_email)).

Below I include `EmailValidator`, which is a `Python` class designed to check whether an email address is valid or not. To use it, one only needs to instantiate it and use its convenience function `is_email_valid`, which returns a boolean value indicating whether the input email is valid or not. In case `is_email_valid` returns `False`, it also prints a message indicating the reason why. See the following example:
```python
validator = EmailValidator()
validator.is_email_valid('user@example.com')
# output: True
```

For a faster (and less exhaustive) validation, one can use its class method `fast_validation`, which also returns a boolean according to the validity of the input address but prints no messages (nor raises exceptions). See the following example:
```python
EmailValidator.fast_validation('userexample.com') # notice there is no need to instantiate the class
# output: False
```

In [1]:
import re
import traceback

In [2]:
# NOTE 1
# (?: ) represents a non-capturing group, which allows controlling 
# the expression concatenation order without the overhead of saving 
# it as a matched part of the string
#
# NOTE 2
# the following references allude to RFC5322 
# (https://tools.ietf.org/html/rfc5322)
#
# NOTE 3
# as indicated in 3.1. Introduction, refer to RFC5234 Appendix B.1 
# (https://tools.ietf.org/html/rfc5234#appendix-B.1) for the primitive 
# tokens definitions
#
# NOTE 4
# in 3.2.2., when defining the comment content, RFC includes COMMENT 
# syntax, but adding it would be circular

# see 2.2.2. Structured Header Field Bodies
WSP = r'\s'                                               # White SPace (maybe add square brackets)

# see 2.2.3. Long Header Fields
CRLF = r'(?:\r\n)'                                        # Carriage Return and Line Feed

# see 3.2.1. Quoted characters
QUOTED_PAIR = r'(?:\\.)'

# see 3.2.2. Folding White Space and Comments
FWS = r'(?:(?:' + WSP + '*' + CRLF + r')?' + WSP + r'+)'
CTEXT = r'[\x21-\x27\x2A-\x5B\x5D-\x7E]'                  # comment text: printable US-ASCII chars but `(`, `)`, `\`
CCONTENT = r'(?:' + CTEXT + r'|' + QUOTED_PAIR + r')'
COMMENT = r'\((?:' + FWS + r'?' + CCONTENT + r')*' + FWS + r'?\)'
CFWS = r'(?:(?:(?:' + FWS + r'?' + COMMENT + r')+' + FWS + r'?)' + r'|' + FWS + r')'

# see 3.2.3. Atom
ATEXT = r'[a-zA-Z0-9!#\$%&\'\*\+\-/=\?\^_`\{\|\}~]'       # atom text: printable US-ASCII chars not including specials
ATOM = CFWS + r'?' + ATEXT + r'+' + CFWS + r'?'
DOT_ATOM_TEXT = ATEXT + r'+(?:\.' + ATEXT + r'+)*'
DOT_ATOM = CFWS + r'?' + DOT_ATOM_TEXT + CFWS + r'?'

# see 3.2.4. Quoted string
QTEXT = r'[\x21\x23-\x5B\x5D-\x7E]'                       # quoted text: printable ASCII chars but `"`, `\`
QCONTENT = r'(?:' + QTEXT + r'|' + QUOTED_PAIR + r')'
QUOTED_STRING = CFWS + r'?' + r'"(?:' + FWS + r'?' + QCONTENT + r')*' + FWS + r'?' + r'"' + CFWS + r'?'

# see 3.4.1. Addr-Spec
LOCAL_PART = r'(?:' + DOT_ATOM + r'|' + QUOTED_STRING + r')'
DTEXT = r'[\x21-\x5A\x5E-\x7E]' # domain text: printable US-ASCII chars but `[`, `]`, `\`
DOMAIN_LITERAL = CFWS + r'?' + r'\[' + r'(?:' + FWS + r'?' + DTEXT + r')*' + FWS + r'?\]' + CFWS + r'?'
DOMAIN = r'(?:' + DOT_ATOM + r'|' + DOMAIN_LITERAL + r')'
ADDR_SPEC = r'(' + LOCAL_PART + r')' + '@' + r'(' + DOMAIN + r')'

# compiled regex patterns
VALID_ADDRESS_PATTERN = re.compile(r'^' + ADDR_SPEC + r'$')         # a valid address must match exactly ADDR_SPEC
UNQUOTED_LOCAL_PART_DOTS_PATTERN = re.compile(r"^\.|\.{2,}|\.$")    # dot explicit problems for unquoted local part
UNQUOTED_LOCAL_PART_PATTERN = re.compile(DOT_ATOM)                  # without quotes, it follows the DOT_ATOM syntax
QUOTED_LOCAL_PART_PATTERN = re.compile(r'^' + QUOTED_STRING + r'$') # when quoted, the local part syntax is more flexible
DOMAIN_DNS_LABEL_DOTS_PATTERN = re.compile(r"\.{2,}")               # domain DNS labels must be separated by a single dot
DOMAIN_DNS_LABEL_HYPHENS_PATTERN = re.compile("^-|-$")              # domain DNS labels must not begin nor end with a hyphen
DOMAIN_NAME_PATTERN = UNQUOTED_LOCAL_PART_PATTERN                   # when not a literal, the domain has the ULP syntax 
DOMAIN_LITERAL_PATTERN = re.compile(r'^' + DOMAIN_LITERAL + r'$')   # IP format

# length constants
EMAIL_ADDRESS_MAX_LENGTH = 254
LOCAL_PART_MAX_LENGTH = 64
DOMAIN_MAX_LENGTH = 255
DNS_LABEL_MAX_LENGHT = 63                                 # a DNS label is limited to 63 octets (RFC 1035)
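
NOTE 1 in the cell above has a practical consequence when the patterns are used with `findall`. The sketch below uses two toy patterns (assumptions for illustration, not the ones defined in this notebook) to show how a capturing group changes what `re.findall` returns, while a non-capturing group leaves the whole match intact.

```python
import re

text = "ab.cd ef"

# With a capturing group, findall returns what the group captured
# (its last repetition, or '' if the group did not take part in the match)
capturing = re.compile(r"[a-z]+(\.[a-z]+)*")
print(capturing.groups)             # 1
print(capturing.findall(text))      # ['.cd', '']

# With a non-capturing group, findall returns the whole matched substrings
non_capturing = re.compile(r"[a-z]+(?:\.[a-z]+)*")
print(non_capturing.groups)         # 0
print(non_capturing.findall(text))  # ['ab.cd', 'ef']
```

Any code that joins the results of `findall` to recover the matched characters therefore depends on the pattern having no capturing groups.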

In [3]:
class AddressNotSetError(ValueError):
    pass

class NotValidEmailAddressSyntaxError(ValueError):
    pass

class LocalPartSyntaxError(ValueError):
    pass

class DomainSyntaxError(ValueError):
    pass

In [25]:
# TODOs
#  - include checks beyond syntax
#  - add further distinction between IP versions

class EmailValidator():
    """
    Class used to validate an e-mail address. 
    
    Any e-mail address is made of a local part, the symbol @ and a domain name.
    According to the current standards, the local part is case-sensitive, but the 
    receiving hosts are urged to deliver messages in a case-independent way. Thus, 
    we treat whole addresses in this latter manner.
    
    It uses the syntax rules from RFC 5322 (https://tools.ietf.org/html/rfc5322)
    """
    def __init__(self, email_address = None):
        """
        Parameters
        ----------
        email_address: str, optional
            the address to validate. it can be set either when the 
            class is instantiated or when the method `self.validate`
            is called. however, it is recommended to pass it in
            the latter case
        """
        self.email_address_set_at_init = email_address is not None
        self.email_address = str(email_address).lower() if self.email_address_set_at_init else None
        self.local_part = None
        self.domain = None
        
        # lengths
        self._email_address_max_length = EMAIL_ADDRESS_MAX_LENGTH
        self._local_part_max_length = LOCAL_PART_MAX_LENGTH
        self._domain_max_length = DOMAIN_MAX_LENGTH
        self._dns_label_max_length = DNS_LABEL_MAX_LENGHT
        
        # patterns
        self._valid_address_pattern = VALID_ADDRESS_PATTERN
        self._unquoted_local_part_dots_pattern = UNQUOTED_LOCAL_PART_DOTS_PATTERN
        self._unquoted_local_part_pattern = UNQUOTED_LOCAL_PART_PATTERN
        self._quoted_local_part_pattern = QUOTED_LOCAL_PART_PATTERN
        self._lhd_domain_dns_label_hyphens_pattern = DOMAIN_DNS_LABEL_HYPHENS_PATTERN
        self._lhd_domain_dns_label_dots_pattern = DOMAIN_DNS_LABEL_DOTS_PATTERN
        self._lhd_domain_dns_label_pattern = DOMAIN_NAME_PATTERN
        self._domain_literal_pattern = DOMAIN_LITERAL_PATTERN
    
    def _is_call_valid(self, attr):
        if attr is None:
            print(
                ('WARNING: This private method must not be called directly. ' 
                 'Use `self.is_email_valid()` or `cls.fast_validation()` instead.')
            )
            return False
        
        return True
        
    def _validate_base(self, simple = False):
        """
        Checks if the e-mail address has an appropriate base format. 
        
        NOTE: This (private) method must not be called directly but only via `self.validate`.
        """
        if not self._is_call_valid(self.email_address):
            return
        
        if len(self.email_address) > self._email_address_max_length:
            raise NotValidEmailAddressSyntaxError(
                f"The address cannot be longer than {self._email_address_max_length} characters"
            )
        
        # when called by `cls.fast_validation`
        if simple:
            if self._valid_address_pattern.search(self.email_address) is None:
                raise NotValidEmailAddressSyntaxError((f"Invalid syntax for address `{self.email_address}`"))
            
            local_part, _, domain = self.email_address.rpartition('@')
            if len(local_part) > self._local_part_max_length or len(domain) > self._domain_max_length:
                raise NotValidEmailAddressSyntaxError((f"Invalid syntax for address `{self.email_address}`"))
                   
        # when called by `self.is_email_valid` or by `self.validate()`    
        else:              
            if self.email_address.count('@') == 0:
                raise NotValidEmailAddressSyntaxError("Expecting address syntax like `localpart@domainname`")

            # assume that in case an address has more than 1 `@` character, the one
            # delimiting local part and the domain is the last one
            self.local_part, _, self.domain = self.email_address.rpartition('@')
    
    def _find_invalid_chars(self, text, pattern):
        """Find the characters from `text` not matching the `pattern` (the so-called invalid characters)"""
        valid_characters = ''.join(pattern.findall(text))
        if valid_characters:
            return set(text).difference(valid_characters)
        else:
            return sorted(set(text))
    
    def _validate_local_part_quoted(self):
        """
        Validates the local part format when it is quoted.
        
        NOTE: This (private) method must not be called directly but only via `self.validate`.
        """
        if not self._is_call_valid(self.local_part):
            return
        
        if self.local_part.count('"') > 2:
            raise LocalPartSyntaxError(
                (f"Invalid syntax for quoted local part `{self.local_part}`.\n"
                  "It must contain only one quoted string, i.e., it can't contain "
                """more than 2 `"` characters.""")
            )
            
        invalid_chars = self._find_invalid_chars(self.local_part, self._quoted_local_part_pattern)
        if invalid_chars:
            raise LocalPartSyntaxError(
                (f"Invalid syntax for quoted local part `{self.local_part}`.\n"
                 f"It contains the following non-valid characters: `{''.join(invalid_chars)}`.\n" 
                  """The accepted ones are printable US-ASCII characters except `"` and `\\`.""")
            )
        
    def _validate_local_part_unquoted(self):
        """
        Validates the local part format when it is unquoted.
        
        NOTE: This (private) method must not be called directly but only via `self.validate`.
        """
        if not self._is_call_valid(self.local_part):
            return
        
        if self._unquoted_local_part_dots_pattern.search(self.local_part) is not None:
            raise LocalPartSyntaxError(
                (f"Invalid syntax for local part `{self.local_part}`.\n"
                  "Unquoted local parts cannot begin or end with a dot `.`, nor contain two or more consecutive dots.")
            )
            
        invalid_chars = self._find_invalid_chars(self.local_part, self._unquoted_local_part_pattern)
        if invalid_chars:
            raise LocalPartSyntaxError(
                (f"Invalid syntax for unquoted local part `{self.local_part}`.\n"
                 f"It contains the following non-valid characters: `{''.join(invalid_chars)}`.\n" 
                  "The accepted ones are printable US-ASCII characters not including the specials, i.e.:\n"
                  "  - Latin letters `a` to `z` and `A` to `Z`\n"
                  "  - Digits `0` to `9`\n"
                  "  - Printable characters `!#$%&'*+-/=?^_`{|}~`\n"
                  "  - Dot `.`, as long as it is not the first or last character and that it does not appear consecutively")
            )
            
    def _validate_local_part(self):
        """
        Validates the local part syntax according to the rules extracted from 
        RFC 5322
        
        NOTE: This (private) method must not be called directly but only via `self.validate`.
        """
        if not self._is_call_valid(self.local_part):
            return
        
        if len(self.local_part) > self._local_part_max_length:
            raise LocalPartSyntaxError(f"The local part cannot be longer than {self._local_part_max_length} characters")
        if self.local_part.startswith('"') and self.local_part.endswith('"'):
            self._validate_local_part_quoted()
        else:
            self._validate_local_part_unquoted()
            
    def _validate_domain_literal(self):
        """
        Validates the domain name when it is an IP address.
        
        NOTE: This (private) method must not be called directly but only via `self.validate`.
        """
        if not self._is_call_valid(self.domain):
            return
        
        if self._domain_literal_pattern.search(self.domain) is None:
            raise DomainSyntaxError("Invalid domain literal syntax")
            
        # TODO: add further distinction between IP versions
            
    def _validate_LDH_domain(self):
        """
        Validates the domain name when it follows the Letters, Digits, Hyphen (LDH) rule.
        
        NOTE: This (private) method must not be called directly but only via `self.validate`.
        """
        if not self._is_call_valid(self.domain):
            return
        
        # the domain name is a sequence of dot-separated DNS labels
        if self._lhd_domain_dns_label_dots_pattern.search(self.domain) is not None:
            raise DomainSyntaxError(
                (f"Invalid format for domain name `{self.domain}`.\n"
                  "DNS labels must be separated by a single dot `.` character.")
            )
                
        self.dns_labels = self.domain.split(".")
        for label in self.dns_labels:
            if len(label) > self._dns_label_max_length:
                raise DomainSyntaxError(
                    (f"Invalid syntax for domain name `{self.domain}`.\n"
                     f"DNS labels cannot be longer than {self._dns_label_max_length} characters")
                )
            
            if self._lhd_domain_dns_label_hyphens_pattern.search(label) is not None:
                raise DomainSyntaxError(
                    (f"Invalid syntax for domain name `{self.domain}`.\n"
                      "DNS labels cannot begin or end with a hyphen `-` character.")
                )
                
            invalid_chars = self._find_invalid_chars(label, self._lhd_domain_dns_label_pattern)
            if invalid_chars:
                raise DomainSyntaxError(
                    (f"Invalid format for domain name `{self.domain}`.\n"
                     f"The DNS label `{label}` contains the following non-valid characters: `{''.join(invalid_chars)}`.\n"
                      "The accepted ones are printable US-ASCII characters not including the specials, i.e.:\n"
                      "  - Latin letters `a` to `z` and `A` to `Z`\n"
                      "  - Digits `0` to `9`\n"
                      "  - Printable characters `!#$%&'*+-/=?^_`{|}~`\n"
                      "  - Dot `.`, as long as it is not the first or last character and that it does not appear "
                          "consecutively")
                )    
        
    def _validate_domain(self):
        """
        Validates the domain name of an email address according to the rules extracted from 
        https://en.wikipedia.org/wiki/Email_address#Domain
        
        NOTE: This (private) method must not be called directly but only via `self.validate`.
        """
        if not self._is_call_valid(self.domain):
            return
        
        if len(self.domain) > self._domain_max_length:
            raise DomainSyntaxError(
                (f"Invalid syntax for domain name `{self.domain}`.\n"
                 f"It cannot be longer than {self._domain_max_length} characters")
            )
            
        # when the domain name is an IP address, it is surrounded by square brackets.
        if self.domain.startswith('[') and self.domain.endswith(']'):
            self._validate_domain_literal()
        else:
            self._validate_LDH_domain()
            
    def validate(self, email_address = None):
        """
        The validation process is done in three steps:
          1. Check if the email address has a valid base format, i.e., `localpart@domainname`.
          2. Check the local part format.
          3. Check the domain name format

        If any validation step fails, the function raises an exception explaining what went wrong.
        
        Parameters
        ----------
        email_address: str, optional
            the address to validate. there is no need to pass 
            it if it was set during the class instantiation
        """
        if email_address is None and self.email_address is None:
            raise AddressNotSetError(
                ("The address to validate must be set either when the class "
                 "is instantiated or when this method is called.")
            )
            
        if email_address is None:
            # fall back to the address set during the class instantiation
            email_address = self.email_address
        elif self.email_address_set_at_init and str(email_address).lower() != self.email_address:
            print(
                (f"WARNING: the email address `{self.email_address}` was set when "
                 f"instantiating the class, but now you are trying to validate "
                 f"a different address `{email_address}`. "
                 f"To avoid this warning, do not set any email address during the "
                 f"class instantiation, as indicated in the docs.")
            )
                   
        # for convenience, cast to string (to avoid exceptions due to wrong input types) and lowercase
        self.email_address = str(email_address).lower()
        self._validate_base()
        self._validate_local_part()
        self._validate_domain()
    
    def _handle_error(self, error, show_traceback):
        """
        Controls how to handle errors raised when validating an email address.
        The current process is to print the error, optionally show its traceback
        and return False, as an address that raises an error during its validation
        is assumed to be an invalid one.
        
        Parameters
        ----------
        error: Exception
            the exception to be handled
        show_traceback: bool
            whether to print the error traceback or not
        
        Returns
        -------
        False: bool
            an email address that raises an error during its validation
            is assumed to be an invalid one
        """
        if show_traceback:
            lines = traceback.format_tb(error.__traceback__)
            for line in lines:
                print(line)
        print(f'{type(error).__name__}: {error}')
        return False
    
    @classmethod
    def fast_validation(cls, email_address):
        """
        Convenience classmethod to check whether an email address is valid or not.
        It checks whether the address syntax (including the length of its parts)
        is valid and returns a boolean accordingly.

        For a more exhaustive validation, one can use `self.is_email_valid()`
        
        Parameters
        ----------
        email_address: str
            the address to validate
        
        Returns
        -------
        address_is_valid: bool
            whether the input email address is valid or not
        """
        try:
            return cls(email_address)._validate_base(simple = True) is None
        except NotValidEmailAddressSyntaxError as expected_error:
            return False
        except Exception as unexpected_error:
            print(f'Found the following unexpected error when validating the address `{email_address}`')
            return cls()._handle_error(unexpected_error, show_traceback = True)
        
    def is_email_valid(self, email_address, show_traceback = False):
        """
        Convenience function to check whether an email address is valid or not.
        Unlike `self.validate()`, it does not raise any exceptions but prints them.
        For more information see the `validate` function.

        Parameters
        ----------
        email_address: str
            the address to validate
        show_traceback: bool, optional
            whether to print the traceback of expected errors (default: False)
        
        Returns
        -------
        address_is_valid: bool
            whether the input email address is valid or not
        """
        try:
            return self.validate(email_address) is None
        except (NotValidEmailAddressSyntaxError, LocalPartSyntaxError, DomainSyntaxError) as expected_error:
            return self._handle_error(expected_error, show_traceback = show_traceback)
        except Exception as unexpected_error:
            print(f'Found the following unexpected error when validating the address `{email_address}`')
            return self._handle_error(unexpected_error, show_traceback = True)  

## Timing

In [26]:
%%timeit
validator = EmailValidator()
validator.is_email_valid("user@test.com")

9.79 µs ± 42.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [27]:
%%timeit
EmailValidator.fast_validation("user@test.com")

3.3 µs ± 28.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


## Testing

In [23]:
valid_addresses = [
    'simple@example.com',
    'very.common@example.com',
    'disposable.style.email.with+symbol@example.com',
    'other.email-with-hyphen@example.com',
    'fully-qualified-domain@example.com',
    'user.name+tag+sorting@example.com',
    'x@example.com',
    'example-indeed@strange-example.com',
    'admin@mailserver1', 
    'example@s.example', 
    '" "@example.org',
    '"john..doe"@example.org',
    'mailhost!username@example.org',
    'user%example.com@example.org',
    'user-@example.org',
    '"A@b@c"@example.com',
    
    'jsmith@[192.168.2.1]',
    'jsmith@[IPv6:2001:db8::1]'
]

invalid_addresses = [
    'Abc.example.com', # no @ character
    'A@b@c@example.com', # only one @ is allowed outside quotation marks
   r'a"b(c)d,e:f;g<h>i[j\k]l@example.com', # none of the special characters in this local-part are allowed outside quotation marks
    'just"not"right@example.com', # quoted strings must be the only element making up the local-part
    '"quote"separated"@address.com',
   r'this is"not\allowed@example.com', # spaces, quotes, and backslashes may only exist when within quoted strings and preceded by a backslash
   r'this\ still\"not\\allowed@example.com', # even if escaped (preceded by a backslash), spaces, quotes, and backslashes must still be contained by quotes
    '1234567890123456789012345678901234567890123456789012345678901234+x@example.com', # local-part is longer than 64 characters
]

In [17]:
all(EmailValidator.fast_validation(address) for address in valid_addresses)

validator = EmailValidator()
all(validator.is_email_valid(address) for address in valid_addresses)

True

In [24]:
all(not EmailValidator.fast_validation(address) for address in invalid_addresses)

True

In [19]:
validator = EmailValidator()
all(not validator.is_email_valid(address) for address in invalid_addresses)

NotValidEmailAddressSyntaxError: Expecting address syntax like `localpart@domainname`
LocalPartSyntaxError: Invalid syntax for unquoted local part `a@b@c`.
It contains the following non-valid characters: `@`.
The accepted ones are printable US-ASCII characters not including the specials, i.e.:
  - Latin letters `a` to `z` and `A` to `Z`
  - Digits `0` to `9`
  - Printable characters `!#$%&'*+-/=?^_`{|}~`
  - Dot `.`, as long as it is not the first or last character and that it does not appear consecutively
LocalPartSyntaxError: Invalid syntax for unquoted local part `a"b(c)d,e:f;g<h>i[j\k]l`.
It contains the following non-valid characters: `:,<[;\>]"`.
The accepted ones are printable US-ASCII characters not including the specials, i.e.:
  - Latin letters `a` to `z` and `A` to `Z`
  - Digits `0` to `9`
  - Printable characters `!#$%&'*+-/=?^_`{|}~`
  - Dot `.`, as long as it is not the first or last character and that it does not appear consecutively
LocalPartSyntaxError: Invalid syntax

True