Regular expressions allow us to search for specific patterns of text.

    .       - Any Character Except New Line
    \d      - Digit (0-9)
    \D      - Not a Digit (0-9)
    \w      - Word Character (a-z, A-Z, 0-9, _)
    \W      - Not a Word Character
    \s      - Whitespace (space, tab, newline)
    \S      - Not Whitespace (space, tab, newline)

    \b      - Word Boundary (invisible placeholder)
    \B      - Not a Word Boundary
    ^       - Beginning of a String (line)
    $       - End of a String

    []      - Matches Characters in brackets
    [1-7]   - Matches Number between 1 and 7
    [a-z]   - Matches Lower case letters a-z
    [a-zA-B]- Matches Lower case letters a-z and upper case A-B
    [^ ]    - Matches Characters NOT in brackets
    [^a-b]  - Matches Characters that are NOT lower case a-b
    |       - Either Or
    ( )     - Group

    Quantifiers (how many we try to match):
    *       - 0 or More
    +       - 1 or More
    ?       - 0 or One
    {3}     - Exact Number
    {3,4}   - Range of Numbers (Minimum, Maximum)

    Other
    (?<!@)  - Negative lookbehind: look around (?), the character immediately before (<) must not (!) be @

    abcdefghijklmnopqurtuvwxyz
    ABCDEFGHIJKLMNOPQRSTUVWXYZ
    1234567890

    Ha HaHa

    MetaCharacters (Need to be escaped with \):
    .[{()\^$|?*+

    coreyms.com

    cat
    mat
    pat
    bat
    Kat
    Yat
    ppat

    321-555-4321
    123.555.1234
    123*555+1234
    321--555-4321
    800-555-4321
    900-555-4321

    Mr. Schafer
    Mr Smith
    Ms Davis
    mrs Kane
    Mrs. Robinson
    ms.  Rush
    Mr.   T
    Mr.   jmat

    CoreyMSchafer@gmail.com
    corey.schafer@university.edu
    corey-321-schafer@my-work.net

    https://www.google.com
    http://coreyms.com
    https://youtube.com
    https://www.nasa.gov
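As a small illustration of a few cheat-sheet entries (`\b`, `\B`, and the `(?<!@)` lookbehind), here is a minimal sketch using Python's `re` module on strings taken from the sample text. The variable names and the `\w+\.com` pattern are made up for this example.

```python
import re

sample = "Ha HaHa"

# \b requires a word boundary, so only "Ha" at the start of a word matches.
at_boundary = [m.span() for m in re.finditer(r"\bHa", sample)]
print(at_boundary)       # [(0, 2), (3, 5)]

# \B requires that there is NO boundary, so only the second half of "HaHa" matches.
not_at_boundary = [m.span() for m in re.finditer(r"\BHa", sample)]
print(not_at_boundary)   # [(5, 7)]

# (?<!@) rejects a match that starts right after "@",
# so the domain inside the email address is skipped.
addresses = "coreyms.com CoreyMSchafer@gmail.com"
no_email_domains = re.findall(r"(?<!@)\b\w+\.com", addresses)
print(no_email_domains)  # ['coreyms.com']
```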

#### Phone number

To match `321-555-4321`, with `.` accepting any single character as the separator:

    \d{3}.\d{3}.\d{4}

If we only want to match `-` or `.` as the separator, we need to use a character set.

Inside a character set, we don't need `\` to escape `.`:

    \d{3}[-.]\d{3}[-.]\d{4}
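To see the difference between the two patterns above, this short sketch runs both against three of the sample numbers. The list and variable names are illustrative, and `fullmatch` is used here only so that each entry is tested as a whole.

```python
import re

numbers = ["321-555-4321", "123.555.1234", "123*555+1234"]

# "." accepts any character as the separator, including "*" and "+".
any_sep = re.compile(r"\d{3}.\d{3}.\d{4}")
# The character set accepts only "-" or "." as the separator.
strict_sep = re.compile(r"\d{3}[-.]\d{3}[-.]\d{4}")

loose = [n for n in numbers if any_sep.fullmatch(n)]
strict = [n for n in numbers if strict_sep.fullmatch(n)]

print(loose)   # ['321-555-4321', '123.555.1234', '123*555+1234']
print(strict)  # ['321-555-4321', '123.555.1234']
```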

If we only want to match 800 or 900 numbers

    [89]00[-.]\d{3}[-.]\d{4}

If we also want to match things like `321--555-4321`

    \d{3}[-.]{1,2}\d{3}[-.]{1,2}\d{4}
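The last two phone patterns can be checked against the sample numbers with a sketch like the following. The `phone_text` string is just the phone block from the sample text, and the variable names are assumptions for this example.

```python
import re

phone_text = """321-555-4321
123.555.1234
123*555+1234
321--555-4321
800-555-4321
900-555-4321"""

# [89]00 restricts the first three digits to 800 or 900.
toll_numbers = re.findall(r"[89]00[-.]\d{3}[-.]\d{4}", phone_text)
print(toll_numbers)  # ['800-555-4321', '900-555-4321']

# {1,2} lets each separator appear once or twice, so "321--555-4321" also matches.
flexible = re.findall(r"\d{3}[-.]{1,2}\d{3}[-.]{1,2}\d{4}", phone_text)
print(flexible)
```

With this input, `flexible` contains every number except `123*555+1234`, whose separators are not in the character set.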

#### Characters

Match all three-letter words ending in `at`, except `bat`

    \b[ac-zA-Z]at\b
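A quick way to check this pattern is to run it over the sample words, as in the sketch below (the `words` string is assembled from the sample text for illustration).

```python
import re

words = "cat mat pat bat Kat Yat ppat"

# [ac-zA-Z] is every letter except lower case "b";
# the \b on both sides limits matches to whole three-letter words.
at_words = re.findall(r"\b[ac-zA-Z]at\b", words)
print(at_words)  # ['cat', 'mat', 'pat', 'Kat', 'Yat']
```

`bat` is excluded by the character set, and `ppat` is excluded by the word boundaries.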

Match `Mr xxx`, `Mr. xxx`, `mr xxx`, or `mr. xxx`, where the name starts with an upper case letter

    \b[Mm]r\.?\s+[A-Z]\w*

Use a group to also match `Ms`, `Mrs.`, `mrs`, etc.

    \b[Mm](r|s|rs)\.?\s+[A-Z]\w*
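The sketch below compares the two title patterns on the sample names. It uses `finditer` with `group(0)` because `findall` would return only the contents of the capturing group; the variable names are illustrative.

```python
import re

names = """Mr. Schafer
Mr Smith
Ms Davis
mrs Kane
Mrs. Robinson
ms.  Rush
Mr.   T
Mr.   jmat"""

mr_only = re.compile(r"\b[Mm]r\.?\s+[A-Z]\w*")
any_title = re.compile(r"\b[Mm](r|s|rs)\.?\s+[A-Z]\w*")

# group(0) is the whole match, regardless of any capturing groups.
mr_matches = [m.group(0) for m in mr_only.finditer(names)]
title_matches = [m.group(0) for m in any_title.finditer(names)]

print(mr_matches)     # ['Mr. Schafer', 'Mr Smith', 'Mr.   T']
print(title_matches)
```

With the group, every line matches except `Mr.   jmat`, because `[A-Z]` requires the name to start with an upper case letter.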

#### Email address

    \b[A-Za-z0-9._-]+@[A-Za-z0-9._-]+\.[A-Za-z0-9._-]+
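As a rough check, the pattern can be run over the sample addresses plus one plain domain. Note that this is a simplified pattern for these notes, not a full validator for real-world email addresses; the `lines` string is made up from the sample text.

```python
import re

lines = """coreyms.com
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net"""

email_re = re.compile(r"\b[A-Za-z0-9._-]+@[A-Za-z0-9._-]+\.[A-Za-z0-9._-]+")

# The pattern has no capturing groups, so findall returns whole matches.
emails = email_re.findall(lines)
print(emails)
```

The plain domain `coreyms.com` is skipped because the pattern requires an `@`.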

#### Urls

Using groups makes it easier to capture and retrieve parts of a match

    https?://(www\.)?(\w+)(\.\w+)
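To illustrate what the groups give us, this sketch prints each group for the sample URLs and then uses the backreferences `\2\3` in `re.sub` to keep only the domain name and top-level domain. The variable names are assumptions for this example.

```python
import re

urls = """https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov"""

url_re = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")

# groups() returns one entry per group; an optional group that did not
# participate in the match is None.
url_groups = [m.groups() for m in url_re.finditer(urls)]
print(url_groups[0])  # ('www.', 'google', '.com')
print(url_groups[1])  # (None, 'coreyms', '.com')

# \2 and \3 refer back to the second and third groups.
domains = url_re.sub(r"\2\3", urls)
print(domains.splitlines())  # ['google.com', 'coreyms.com', 'youtube.com', 'nasa.gov']
```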

#### In Python

    import re

    # Raw string, so the backslashes in the sample text are kept as-is
    text_to_search = r"""

      abcdefghijklmnopqurtuvwxyz
      ABCDEFGHIJKLMNOPQRSTUVWXYZ
      1234567890

      Ha HaHa

      MetaCharacters (Need to be escaped with \):
      .[{()\^$|?*+

      coreyms.com

      cat
      mat
      pat
      bat
      Kat
      Yat
      ppat

      321-555-4321
      123.555.1234
      123*555+1234
      321--555-4321
      800-555-4321
      900-555-4321

      Mr. Schafer
      Mr Smith
      Ms Davis
      mrs Kane
      Mrs. Robinson
      ms.  Rush
      Mr.   T
      Mr.   jmat

      CoreyMSchafer@gmail.com
      corey.schafer@university.edu
      corey-321-schafer@my-work.net

      https://www.google.com
      http://coreyms.com
      https://youtube.com
      https://www.nasa.gov

    sentence = 'start a sentence and then Bring it to an end'
    """

Don't forget the `r` prefix (for example `r'\d+'`) so that Python treats the pattern as a raw string and passes the backslashes through to the regex engine unchanged
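The effect of the raw-string prefix can be seen in a small sketch like this one (the strings are made up for illustration). Without the prefix, Python itself interprets `\b` as a backspace character before the regex engine ever sees it.

```python
import re

plain = "\tTab"   # "\t" is turned into a single tab character
raw = r"\tTab"    # backslash and "t" are kept as two characters
print(len(plain), len(raw))  # 4 5

# In a plain string "\b" is a backspace, so this pattern looks for
# backspace characters around "cat" and finds nothing.
without_raw = re.findall("\bcat\b", "cat concat")
# In a raw string "\b" reaches the regex engine as a word boundary.
with_raw = re.findall(r"\bcat\b", "cat concat")

print(without_raw)  # []
print(with_raw)     # ['cat']
```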

    url_pattern = re.compile(r'\b(https?://)?(www\.)?((\w|[-._])+)(\.[a-z]+)\b(?![^@\s]*@)')

    url_matches = url_pattern.finditer(text_to_search)

    for match in url_matches:
        print(match)

Output:

    <re.Match object; span=(149, 160), match='coreyms.com'>
    <re.Match object; span=(413, 422), match='gmail.com'>
    <re.Match object; span=(439, 453), match='university.edu'>
    <re.Match object; span=(474, 485), match='my-work.net'>
    <re.Match object; span=(489, 511), match='https://www.google.com'>
    <re.Match object; span=(514, 532), match='http://coreyms.com'>
    <re.Match object; span=(535, 554), match='https://youtube.com'>
    <re.Match object; span=(557, 577), match='https://www.nasa.gov'>


    email_pattern = re.compile(r'\b[A-Za-z0-9._-]+@[A-Za-z0-9._-]+(\.[A-Za-z0-9._-]+)')

    email_matches = email_pattern.finditer(text_to_search)

    # group(1) returns only the first capturing group of each match
    for match in email_matches:
        print(match.group(1))

Output:

    .com
    .edu
    .net
