In [1]:
from IPython.display import Image

In [1]:
import re

---------------------------
## Project - on regex
-----------------------

#### Validate Email Addresses

**Problem**
- You have a form on your website or a dialog box in your application that asks the user for an `email address`. 

- You want to use a `regular expression` to validate this email address before trying to send email to it. 
    - This reduces the number of emails returned to you as undeliverable.

Suppose you want to find the email address inside the string 'xyz alice-b@google.com purple monkey'. 

- `\w` (lowercase w) matches a "word" character: a letter, digit, or underscore (`[a-zA-Z0-9_]` for ASCII text). 
- `\W` (uppercase W) matches any non-word character.
- `@` matches a literal at sign, which separates the local part from the domain.
- `[\w.]` is a set of characters to potentially match: `\w` covers all word characters, and the trailing period `.` adds a literal dot to that set.
- `+` means one or more of the preceding item.
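As a minimal sketch of how these pieces combine, the pattern below also adds a hyphen to the set (an assumption on our part, since `alice-b` contains one) and searches the sample string. The variable names are illustrative only.

```python
import re

text = 'xyz alice-b@google.com purple monkey'

# \w+ alone stops at the hyphen and at the dot,
# so only a fragment of the address is found.
partial = re.search(r'\w+@\w+', text)
print(partial.group())

# Adding '.' and '-' to the set lets the match span the whole address.
full = re.search(r'[\w.-]+@[\w.-]+', text)
print(full.group())
```

The first search finds only `b@google`; the second finds `alice-b@google.com`.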

#### Solution - Option 1

**Simple**

- This first solution does a very simple check. 

- It only validates that the string contains an at sign (`@`) that is `preceded` and `followed` by `one or more nonwhitespace characters`.

In [2]:
regex = r'\S+@\S+'
text  = '''purple 
support@openDBtech.com 
monkey 
dishwasher 
bks@bks.com 
bks_@bks.in 
bks!@bks.com 
ak*@ak.in 
ak-00@in'''

re.findall(regex, text)

['support@openDBtech.com',
 'bks@bks.com',
 'bks_@bks.in',
 'bks!@bks.com',
 'ak*@ak.in',
 'ak-00@in']

- `\S` matches any character that is `not a whitespace character`.
    - For ASCII text, equivalent to the negated character class `[^ \t\n\r\f\v]`
        - Not a space
        - Not a tab
        - Not a newline
        - Not a carriage return
        - Not a form feed
        - Not a vertical tab
    - It therefore matches, among others:
        - A-Z
        - a-z
        - 0-9
        - underscore
        - all punctuation characters, which is why `bks!@bks.com` and `ak*@ak.in` are accepted
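`re.findall` extracts every matching substring from a longer text. To validate a single form field, one option is `re.fullmatch`, which succeeds only if the whole string matches. The sketch below assumes that approach; the helper name `looks_like_email` is hypothetical.

```python
import re

SIMPLE_EMAIL = re.compile(r'\S+@\S+')

def looks_like_email(value):
    """Return True if the whole string has the shape <non-space>@<non-space>."""
    return SIMPLE_EMAIL.fullmatch(value) is not None

for candidate in ['bks@bks.com', 'ak-00@in', 'dishwasher', 'two words@bks.com']:
    print(candidate, looks_like_email(candidate))
```

The last candidate fails because `\S+` cannot cross the space, so the pattern cannot cover the whole string.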

#### Solution - Option 2 

**with restrictions on characters**

- The `domain name`, the part after the `@` sign, is restricted to characters allowed in domain names. 

- Internationalized domain names are not allowed. 

- The local part, the part before the @ sign, is restricted to characters commonly used in email local parts, which is more restrictive than what most email clients and servers will accept:

In [3]:
regex = r'[A-Z0-9+_.-]+@[A-Z0-9.-]+'
text  = '''purple 
support@openDBtech.com 
monkey 
dishwasher 
bks@bks.com 
bks_@bks.in 
bks!@bks.com 
ak*@ak.in 
ak-00@in
akash...in@in
'''

re.findall(regex, text, re.IGNORECASE)

['support@openDBtech.com',
 'bks@bks.com',
 'bks_@bks.in',
 'ak-00@in',
 'akash...in@in']

Observe that `bks!@bks.com` and `ak*@ak.in` are rejected, because `!` and `*` are not in the set of characters allowed before the `@` sign.
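The character classes in this pattern list only uppercase letters, so the `re.IGNORECASE` flag is doing real work. A small sketch of the difference, using one address from the sample text (the variable names are illustrative):

```python
import re

pattern = r'[A-Z0-9+_.-]+@[A-Z0-9.-]+'

# Without the flag, lowercase letters are not in either character class.
without_flag = re.findall(pattern, 'bks@bks.com')

# With the flag, A-Z also covers a-z.
with_flag = re.findall(pattern, 'bks@bks.com', re.IGNORECASE)

# The same flag can be written inline at the start of the pattern.
inline_flag = re.findall('(?i)' + pattern, 'bks@bks.com')

print(without_flag, with_flag, inline_flag)
```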

#### Solution - Option 3 

**with all valid local part characters**

- This regular expression expands the previous one by allowing a larger set of rarely used characters in the local part. 
- Not all email software can handle all these characters, but we’ve included all the characters permitted by RFC 5322, which governs the email message format. 
- Among the permitted characters are some that present a security risk if passed directly from user input to an SQL statement, such as the single quote (') and the pipe character (|). 
- Be sure to escape sensitive characters when inserting the email address into a string passed to another program, in order to prevent security holes such as SQL injection attacks:

In [4]:
regex = r"[A-Z0-9_!#$%&'*+/=?`{|}~^.-]+@[A-Z0-9.-]+"

text  = '''purple 
support@openDBtech.com 
monkey 
dishwasher 
bks@bks.com 
bks_@bks.in 
bks!@bks.com 
ak*@ak.in 
ak-00@in
akash...in@in
.bharat@.india.com
'''

re.findall(regex, text, re.IGNORECASE)  

['support@openDBtech.com',
 'bks@bks.com',
 'bks_@bks.in',
 'bks!@bks.com',
 'ak*@ak.in',
 'ak-00@in',
 'akash...in@in',
 '.bharat@.india.com']
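The SQL injection warning above can be illustrated with a parameterized query, which passes the address as data instead of splicing it into the SQL text. This is a sketch using Python's built-in `sqlite3` module; the table name `subscribers` and the sample address are made up for the example.

```python
import re
import sqlite3

regex = r"[A-Z0-9_!#$%&'*+/=?`{|}~^.-]+@[A-Z0-9.-]+"

# This address passes the check even though it contains a quote and a pipe.
email = "o'brien|x@example.com"
is_accepted = re.fullmatch(regex, email, re.IGNORECASE) is not None

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE subscribers (email TEXT)')

# The ? placeholder lets the driver handle quoting; the value is never
# concatenated into the SQL string.
conn.execute('INSERT INTO subscribers (email) VALUES (?)', (email,))

stored = conn.execute('SELECT email FROM subscribers').fetchone()[0]
print(is_accepted, stored == email)
conn.close()
```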

#### Solution - Option 4 

**No leading, trailing, or consecutive dots**

- Both the local part and the domain name can contain one or more dots, but no two dots can appear right next to each other. 

- Furthermore, the first and last characters in the local part and in the domain name must not be dots:


In [5]:
regex = r"[A-Z0-9_!#$%&'*+/=?`{|}~^-]+(?:\.[A-Z0-9_!#$%&'*+/=?`{|}~^-]+)*@[A-Z0-9-]+(?:\.[A-Z0-9-]+)*"

text  = '''purple 
support@openDBtech.com 
monkey 
dishwasher 
bks@bks.com 
bks_@bks.in 
bks!@bks.com 
ak*@ak.in 
ak-00@in
ak...ash...in@in
.bharat@.india.com
'''

re.findall(regex, text, re.IGNORECASE)  

['support@openDBtech.com',
 'bks@bks.com',
 'bks_@bks.in',
 'bks!@bks.com',
 'ak*@ak.in',
 'ak-00@in',
 'in@in']

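Note the use of non-capturing groups. In this pattern, `(?:\.[A-Z0-9-]+)*` matches one literal dot followed by one or more letters, digits, and/or hyphens. The star repeats this group zero or more times, so the dotted part is optional. The next option uses the variant `(?:[A-Z0-9-]+\.)+`, which matches one or more letters, digits, and/or hyphens, followed by one literal dot. The plus sign repeats that group one or more times: it must match at least once, but can match as many times as possible.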
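The output above contains `'in@in'`, which `re.findall` extracted from the tail of `ak...ash...in@in`. A sketch of the difference between extracting substrings and validating a whole string, assuming `re.fullmatch` is used for the latter:

```python
import re

regex = r"[A-Z0-9_!#$%&'*+/=?`{|}~^-]+(?:\.[A-Z0-9_!#$%&'*+/=?`{|}~^-]+)*@[A-Z0-9-]+(?:\.[A-Z0-9-]+)*"

# findall scans for any matching substring, so it picks up the tail.
extracted = re.findall(regex, 'ak...ash...in@in', re.IGNORECASE)

# fullmatch requires the entire string to match, so the consecutive
# dots and the leading dot cause a rejection.
consecutive_dots = re.fullmatch(regex, 'ak...ash...in@in', re.IGNORECASE)
leading_dot = re.fullmatch(regex, '.bharat@.india.com', re.IGNORECASE)

# Single dots between non-empty segments are still accepted.
dotted = re.fullmatch(regex, 'first.last@mail.example.com', re.IGNORECASE)

print(extracted, consecutive_dots, leading_dot, dotted is not None)
```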

#### Solution - Option 5 

**Top-level domain has two to six letters**

- This regular expression adds to the previous versions by specifying that 
    - the domain name must include at least one dot, 
    - and that the part of the domain name after the last dot can only consist of letters. 
    - The top-level domain (.com in these examples) must consist of two to six letters. 
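    - All ASCII country-code top-level domains (.us, .uk, etc.) have two letters. 
    - The original generic top-level domains have between 3 (.com) and 6 letters (.museum). Newer generic top-level domains such as .photography are longer, so the two-to-six rule is stricter than what actually exists: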

In [6]:
regex = r"[\w!#$%&'*+/=?`{|}~^-]+(?:\.[\w!#$%&'*+/=?`{|}~^-]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}"

text  = '''purple 
support@openDBtech.com 
monkey 
dishwasher 
bks@bks.com 
bks_@bks.in 
bks!@bks.com 
ak*@ak.in 
ak-00@in
ak...ash...in@in
.bharat@.india.com
dave_uk@unitedkingdom
'''

re.findall(regex, text, re.IGNORECASE) 

['support@openDBtech.com',
 'bks@bks.com',
 'bks_@bks.in',
 'bks!@bks.com',
 'ak*@ak.in']

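Observe that addresses whose domain contains no dot (`ak-00@in`, `dave_uk@unitedkingdom`) are now rejected. Because the pattern is not anchored, `re.findall` does not reject a top-level domain longer than six letters: it matches only the first six letters of it.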
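A short sketch of how the two-to-six-letter limit behaves on a long top-level domain, comparing `re.findall` with `re.fullmatch`. The sample addresses are made up for the example.

```python
import re

regex = r"[\w!#$%&'*+/=?`{|}~^-]+(?:\.[\w!#$%&'*+/=?`{|}~^-]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}"

# Unanchored search: the match simply stops after six letters of the TLD.
truncated = re.findall(regex, 'curator@gallery.photography', re.IGNORECASE)

# Whole-string check: the long TLD makes the address fail.
long_tld = re.fullmatch(regex, 'curator@gallery.photography', re.IGNORECASE)

# A six-letter TLD is still within the limit.
six_letters = re.fullmatch(regex, 'info@national.museum', re.IGNORECASE)

print(truncated, long_tld, six_letters is not None)
```

Here `findall` returns a truncated address ending in `.photog`, while `fullmatch` rejects the string outright.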