# Curriculum

- regex (short for 'regular expressions') is most commonly used in two ways:
    
    1.) to find / extract text that matches a pattern; and 
    
    2.) to replace / substitute text that matches a pattern
    
    
- module is 're' - that's what I import

- the two main functions in 're' that I will use are '.findall', and '.search'
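
As a quick illustration of how those two functions differ, here is a minimal sketch (the string 'abcabc' is just a made-up example): 'findall' returns every match as a list, while 'search' returns a match object for the first match only, or None when nothing matches.

```python
import re

text = "abcabc"  # made-up example string

# findall returns a list with every non-overlapping match
all_matches = re.findall(r'b', text)

# search returns a Match object for the FIRST match only, or None if there is none
first_match = re.search(r'b', text)
no_match = re.search(r'z', text)

print(all_matches)          # ['b', 'b']
print(first_match.group())  # b
print(first_match.start())  # 1
print(no_match)             # None
```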

**A note on raw strings:**
    
    - carry the prefix 'r'
    
    - backslashes are included in the string verbatim, and don't carry any special meaning.  For instance, r'\n' is two characters (a backslash and an 'n'), not a newline
    
    - common in regex, so don't freak out
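
A small sketch of why the 'r' prefix matters (the strings here are made up for illustration): without it, Python turns '\b' into a backspace character before the regex engine ever sees it.

```python
import re

normal = '\n'   # one character: a newline
raw = r'\n'     # two characters: a backslash followed by 'n'
print(len(normal), len(raw))  # 1 2

# Without the r prefix, '\b' becomes a backspace character, so nothing matches.
without_raw = re.findall('\bcat', 'cat concat')

# With the r prefix, the regex engine receives \b and treats it as a word boundary.
with_raw = re.findall(r'\bcat', 'cat concat')

print(without_raw)  # []
print(with_raw)     # ['cat']
```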
    

In [1]:
import re

In [57]:
re.findall(r'b', 'abcd') 

# we're looking for pattern 'b' in the string 'abcd'

['b']

In [3]:
# function to simplify the process of showing many results from regex:

def show_all_matches(regexes, subject, re_length=6):
    """
    function that takes in the regexes, the subject, and the length, and prints out the results
    """
    print("Sentence: ")
    print()
    print("      {}".format(subject))
    print()
    print(" regexp{} | matches".format(' ' * (re_length - 6)))
    print(" ------{} | -------".format(' ' * (re_length - 6)))
    for regexp in regexes:
        fmt = "  {:<%d} | {!r}" % re_length
        matches = re.findall(regexp, subject)
        if len(matches) > 8:
            matches = matches[:8] + ['...']
        print(fmt.format(regexp, matches))

In [4]:
sentence = "Mary had a little lamb. 1 little lamb.  Not 10, not 12, not 22, just one."

In [5]:
show_all_matches([
r'a',
r'm',
r'M', 
r'Mary',
r'little', 
r'1',
r'10', 
r'22'], sentence)

Sentence: 

      Mary had a little lamb. 1 little lamb.  Not 10, not 12, not 22, just one.

 regexp | matches
 ------ | -------
  a      | ['a', 'a', 'a', 'a', 'a']
  m      | ['m', 'm']
  M      | ['M']
  Mary   | ['Mary']
  little | ['little', 'little']
  1      | ['1', '1', '1']
  10     | ['10']
  22     | ['22']


## Metacharacters and Character Classes

**METACHARACTERS** match several different kinds of characters, BUT DON'T MATCH THE CHARACTER ITSELF literally

Not only that, but a metacharacter has to be escaped with a backslash to match the character itself.  For example, '\\.' matches a literal period.

METACHARACTER  |  WHAT IT MATCHES

. _______________ Any character except a newline

\w ______________ ANY letter, number, or underscore

\W ______________ Anything that's NOT a letter, number, or underscore

\d ______________ Any DIGIT

\D ______________ Any NON-DIGIT

\s ______________ Any WHITESPACE character

In [6]:
res = [
    r'\w', # matches every letter, digit, and underscore (not spaces or punctuation)
    r'\d', # matches all the digits (numbers)
    r'\s', # matches all the whitespace characters
    r'.', # matches every character except a newline, including spaces
    r'\.' # matches all the actual, literal periods
]

show_all_matches(res, sentence)

Sentence: 

      Mary had a little lamb. 1 little lamb.  Not 10, not 12, not 22, just one.

 regexp | matches
 ------ | -------
  \w     | ['M', 'a', 'r', 'y', 'h', 'a', 'd', 'a', '...']
  \d     | ['1', '1', '0', '1', '2', '2', '2']
  \s     | [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '...']
  .      | ['M', 'a', 'r', 'y', ' ', 'h', 'a', 'd', '...']
  \.     | ['.', '.', '.']
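
The uppercase classes and the escaping rule can be sketched the same way. This is a minimal example on made-up strings; 're.escape' is a standard-library helper that adds the backslashes for you.

```python
import re

text = "a1 b2."  # made-up example string

non_word = re.findall(r'\W', text)    # not a letter, digit, or underscore
non_digit = re.findall(r'\D', text)   # anything that is not a digit
non_space = re.findall(r'\S', text)   # anything that is not whitespace

print(non_word)   # [' ', '.']
print(non_digit)  # ['a', ' ', 'b', '.']
print(non_space)  # ['a', '1', 'b', '2', '.']

# An unescaped '.' matches any character, so '1.5' also matches '125'.
unescaped = re.findall(r'1.5', '1.5 and 125')

# re.escape backslash-escapes the metacharacters, so only the literal text matches.
escaped = re.findall(re.escape('1.5'), '1.5 and 125')

print(unescaped)  # ['1.5', '125']
print(escaped)    # ['1.5']
```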


### Lookie, lookie!  You can combine regex characters to do multiple things!

In [7]:
show_all_matches([r'l\w\w\w\W', r'\d\d'], sentence, re_length=9) 

Sentence: 

      Mary had a little lamb. 1 little lamb.  Not 10, not 12, not 22, just one.

 regexp    | matches
 ------    | -------
  l\w\w\w\W | ['lamb.', 'lamb.']
  \d\d      | ['10', '12', '22']


**^^ 'l\w\w\w\W' matched 'lamb.'  '\d\d' matched every pair of consecutive digits.  If I added another '\w' to the 'l\w\w\w' part, it would look for a longer word that began with 'l'.**

## Repeating

There are some metacharacters that match the character before them a repeated number of times:
    
\* = zero or more matches

\+ = one or more matches

{n} = exactly n number of matches (repetitions)

{n,} = n or more repetitions

{n,m} = between n and m repetitions (no space after the comma; with a space, Python treats the braces as literal text)

? = the character before the ? is optional (zero or one match)
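
A minimal sketch of each repetition operator on made-up strings, to make the differences between them concrete:

```python
import re

# ? makes the preceding character optional
optional = re.findall(r'colou?r', 'color colour')

# * allows zero or more, so a 'b' with no 'a' in front still matches
zero_or_more = re.findall(r'a*b', 'b ab aab')

# + requires at least one 'a'
one_or_more = re.findall(r'a+b', 'b ab aab')

# {2,3} takes 2 or 3 digits at a time (greedy, so it grabs 3 when it can)
between = re.findall(r'\d{2,3}', '1 22 333 4444')

print(optional)      # ['color', 'colour']
print(zero_or_more)  # ['b', 'ab', 'aab']
print(one_or_more)   # ['ab', 'aab']
print(between)       # ['22', '333', '444']
```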

In [8]:
show_all_matches([r'\d+'], sentence)

Sentence: 

      Mary had a little lamb. 1 little lamb.  Not 10, not 12, not 22, just one.

 regexp | matches
 ------ | -------
  \d+    | ['1', '10', '12', '22']


In [9]:
print('\n---\n')


---



In [58]:
show_all_matches([r'a{2,}', # matches 2 or more a's
                  r'a{2}', # matches a's in groups of 2
                  r'a{3,4}'], # matches where a is in a group of either 3 or 4
                 'aabbaaaa')

Sentence: 

      aabbaaaa

 regexp | matches
 ------ | -------
  a{2,}  | ['aa', 'aaaa']
  a{2}   | ['aa', 'aa', 'aa']
  a{3,4} | ['aaaa']


## Any or None Of

- **SQUARE BRACKETS** = single character that will match any of the values within the square brackets.  Eg: '[ab]' will match EITHER 'a' or 'b'.

IF THE FIRST CHARACTER IN THE BRACKETS IS A '^' (caret), then any character that is NOT inside the [ ] will be matched.  For example, '[^ab]' will match any character that is NOT 'a' or 'b'. 

**VIMP**

Ranges of digits AND letters can be abbreviated with a hyphen, e.g. '[a-d]' or '[0-9]'

In [11]:
show_all_matches([r'[lt]', # matches all l's and t's individually
                  r'[lt]+', # matches runs of one or more consecutive l's and t's, e.g. 'l' or 'ttl'
                  r'[^aeiou\s\.]', # matches anything that's not a vowel or a space or a period
                  r'[a-d]'], # matches anything that's a, b, c, or d.
                 sentence, re_length=12)

Sentence: 

      Mary had a little lamb. 1 little lamb.  Not 10, not 12, not 22, just one.

 regexp       | matches
 ------       | -------
  [lt]         | ['l', 't', 't', 'l', 'l', 'l', 't', 't', '...']
  [lt]+        | ['l', 'ttl', 'l', 'l', 'ttl', 'l', 't', 't', '...']
  [^aeiou\s\.] | ['M', 'r', 'y', 'h', 'd', 'l', 't', 't', '...']
  [a-d]        | ['a', 'a', 'd', 'a', 'a', 'b', 'a', 'b']
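
Ranges can also be combined inside one pair of brackets. A short sketch, using a made-up string:

```python
import re

text = "Room 3B, floor 12, code a7F"  # made-up example string

digits = re.findall(r'[0-9]', text)              # any single digit
capitals = re.findall(r'[A-Z]', text)            # any single uppercase letter
alnum_runs = re.findall(r'[a-zA-Z0-9]+', text)   # three ranges in one class, repeated

print(digits)      # ['3', '1', '2', '7']
print(capitals)    # ['R', 'B', 'F']
print(alnum_runs)  # ['Room', '3B', 'floor', '12', 'code', 'a7F']
```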


## Anchors

- Metacharacters that don't match any individual characters, but ANCHOR the rest of the regular expression

- '^' matches "The start of the string / line"

- '$' matches "The end of the string / line"

- '\b' matches "a word boundary": the position between a word character and a non-word character (or the start / end of the string)

**On word boundaries:**


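- a word boundary is not a character itself - it's the zero-width position where a word character sits next to a non-word character (a space, a dash, a tab, punctuation) or next to the start / end of the string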
    
- useful in finding whole / partial word searches 

- say I want to find words that END WITH 'and', not just the word 'and'.  'and\b' will match the 'and' at the end of any word, because a word boundary follows the 'd' there.  That covers the word 'and' by itself as well as the end of words like 'band' or 'hand'

- If I want to find words that BEGIN WITH 'and', just put the '\b' at the front, '\band'

- If I want to find ONLY longer words that BEGIN WITH 'and' - not 'and' by itself - I would type '\band\w+'.  That basically says 'match words that begin with 'and' and have word characters following.'  This will find 'andouille', 'andromeda', and 'androgen', but not 'and' by itself.  To match 'and' only when it sits INSIDE a word (as in 'sandy'), use '\Band\B', where '\B' means 'NOT a word boundary'


**VIMP**

Digits (and the underscore) are considered word characters in regex, so there is no word boundary between a letter and a digit
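
The word-boundary cases above can be sketched on a made-up string. Note how 'hand2' and 'and_more' behave, since digits and underscores count as word characters.

```python
import re

text = "and band android sandy hand2 and_more"  # made-up example string

whole_word = re.findall(r'\band\b', text)    # 'and' by itself
ends_with = re.findall(r'\w*and\b', text)    # whole words that end in 'and'
begins_with = re.findall(r'\band\w+', text)  # longer words that begin with 'and'
inside = re.findall(r'\Band\B', text)        # 'and' with word characters on both sides

print(whole_word)   # ['and']
print(ends_with)    # ['and', 'band']
print(begins_with)  # ['android', 'and_more']
print(inside)       # ['and', 'and']  (from 'sandy' and 'hand2')
```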


In [12]:
show_all_matches([
    r'\bo\w+', # finds all words that begin with 'o' and have at least one more character
    r'^\s', # matches a whitespace character at the very start of the string
    r'^M', # matches a capital 'M' at the very start of the string
    r'\.$' # matches a period at the very end of the string; no match if the string ends in '!' or '?' instead
], sentence)

Sentence: 

      Mary had a little lamb. 1 little lamb.  Not 10, not 12, not 22, just one.

 regexp | matches
 ------ | -------
  \bo\w+ | ['one']
  ^\s    | []
  ^M     | ['M']
  \.$    | ['.']


### Other Common Features:

- 'match' = Matches from start of string

- 'search' = Finds the first instance of the regular expression

- 'sub' = Make substitutions with regular expression

- 'compile' = Prepare a regex for use ahead of time
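
A minimal sketch of these four functions side by side (the string and the name 'l_word_re' are made up for illustration):

```python
import re

text = "lamb, little lamb"  # made-up example string

# match only looks at the start of the string, so this is None
at_start = re.match(r'little', text)

# search scans the whole string and stops at the first match
first = re.search(r'little', text)

# sub replaces every match
replaced = re.sub(r'lamb', 'goat', text)

# compile builds a reusable pattern object with the same methods
l_word_re = re.compile(r'l\w+')
l_words = l_word_re.findall(text)

print(at_start)       # None
print(first.start())  # 6
print(replaced)       # goat, little goat
print(l_words)        # ['lamb', 'little', 'lamb']
```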

## Capture Groups
 
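- parentheses create a capture group: the text matched by the part of the pattern inside the parentheses can be pulled out separately from the whole match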

In [13]:
sentence = """
You can find us on the web at https://codeup.com.  Our ip address is 123.123.123.123 (maybe).
""".strip()

In [14]:
sentence

'You can find us on the web at https://codeup.com.  Our ip address is 123.123.123.123 (maybe).'

In [15]:
ip_re = r'\d+(\.\d+){3}' # Braces set how many times the group in parentheses must repeat.

# with {3}, it matches a number followed by 3 repetitions of 'period + number':

# ip_re = r'\d+(\.\d+){3}' -> 123.123.123.123

# with {2}, it matches a number followed by 2 repetitions of 'period + number':

# ip_re = r'\d+(\.\d+){2}' -> 123.123.123

# with {1}, it matches a number followed by 1 'period + number':

# ip_re = r'\d+(\.\d+){1}' -> 123.123

match = re.search(ip_re, sentence)
match[0]

'123.123.123.123'
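
One detail worth sketching: a repeated group is still a single group, and it keeps only its last repetition. The example below uses a made-up sentence with distinct octets so that this is visible. It also uses a non-capturing group, '(?:...)', which matches the same text without capturing it.

```python
import re

sentence = "Our ip address is 123.45.67.89 (maybe)."  # made-up example
ip_re = r'\d+(\.\d+){3}'

match = re.search(ip_re, sentence)
whole = match[0]          # the entire match
groups = match.groups()   # one group, holding only its last repetition

# When the pattern contains a capture group, findall returns the group, not the whole match
captured_only = re.findall(ip_re, sentence)

# A non-capturing group (?:...) makes findall return the whole match again
whole_matches = re.findall(r'\d+(?:\.\d+){3}', sentence)

print(whole)          # 123.45.67.89
print(groups)         # ('.89',)
print(captured_only)  # ['.89']
print(whole_matches)  # ['123.45.67.89']
```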

In [16]:
# simplified for demonstration, a real regex to parse urls would be much more complex

# re: 'sentence'

url_re = r'(https?)://(\w+)\.(\w+)' # the '?' makes the 's' optional, so both 'http' and 'https' match

protocol, domain, tld = re.search(url_re, sentence).groups()

print(f"""
protocol: {protocol}
domain: {domain}
tld: {tld}
"""
)


protocol: https
domain: codeup
tld: com



### Non-Capturing ("Shy") Groups

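- add '?:' right after the opening parenthesis, e.g. '(?:\w+)'.  The group still has to match, but its text is not captured

- groups can be named by adding '?P<group_name>' right after the opening parenthesis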

In [17]:
url_re = r'(?P<protocol>https?)://(?:\w+)\.(?P<tld>\w+)'

match = re.search(url_re, sentence)

print(f'''
groups: {match.groups()}
referencing a group by name: {match.group('tld')}
group dictionary: {match.groupdict()}      

''')


groups: ('https', 'com')
referencing a group by name: com
group dictionary: {'protocol': 'https', 'tld': 'com'}      




## Substitution:

- We can use regex to replace / remove parts of a string.  

- Also, if the regex we supply has capture groups in it, the text captured can be referenced when making the substitution


In [18]:
# remove anything that IS NOT a DIGIT

re.sub(r'\D', '', 'abc 123')

'123'

In [19]:
# removing anything that's NOT A LOWERCASE LETTER

re.sub(r'[^a-z]', '', 'abc 123')

'abc'

In [20]:
# keep only the middle character: \1 refers to the text captured by group 1
re.sub(r'.(.).', r'\1', 'abc')

'b'

In [21]:
# reverse the three characters by referencing the groups in the opposite order
re.sub(r'(.)(.)(.)', r'\3\2\1', 'abc')

'cba'

In [22]:
# replace the last two characters of the string with a single 'X'
re.sub(r'.{2}$', 'X', 'abc')

'aX'
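
Named groups can be referenced in a substitution too, using '\g<name>', and the replacement can also be a function that receives the match object. A minimal sketch with made-up strings and group names:

```python
import re

# Reference named groups in the replacement with \g<name>
name = "Lovelace, Ada"  # made-up example
swapped = re.sub(r'(?P<last>\w+), (?P<first>\w+)', r'\g<first> \g<last>', name)

# The replacement may be a function; it is called once per match
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'not 10, not 12')

print(swapped)  # Ada Lovelace
print(doubled)  # not 20, not 24
```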

## Regex Flags

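- pass one of the flags below as the 'flags' argument of a regex function, e.g. re.findall(pattern, string, flags=re.IGNORECASE), and that behavior will be enabled for that use of the regular expression.  Several flags can be combined with '|':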

re.MULTILINE = the ^ and $ anchors will apply line by line instead of working on only the start and end of the string


re.IGNORECASE = IGNORES character casing when matching


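re.VERBOSE = Ignores whitespace in the pattern (except inside a character class or when escaped), and treats '#' to the end of the line as a comment

           = Makes regex more readable
    
           = Can be combined with inline comment groups
        
           = useful b/c you can split things up and comment w/ this special (non-capturing) comment group: (?#comment)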
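
To see MULTILINE and IGNORECASE in action, here is a small sketch on a made-up two-line string. One thing to watch: the fourth positional argument of 're.sub' is 'count', so flags should be passed by keyword there.

```python
import re

text = "Mary had a lamb.\nmary lost the lamb."  # made-up two-line string

no_flags = re.findall(r'^mary', text)
multiline = re.findall(r'^mary', text, re.MULTILINE)
ignorecase = re.findall(r'^mary', text, re.IGNORECASE)
both = re.findall(r'^mary', text, re.MULTILINE | re.IGNORECASE)

print(no_flags)    # []
print(multiline)   # ['mary']
print(ignorecase)  # ['Mary']
print(both)        # ['Mary', 'mary']

# flags passed by keyword, because the 4th positional argument of re.sub is count
renamed = re.sub(r'mary', 'Sue', text, flags=re.IGNORECASE)
print(renamed)
```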
    




In [23]:
regexp = r'''
[aeiou](?# any vowel)
[^aeiou](?# followed by a non-vowel)
'''

In [24]:
regexp

'\n[aeiou](?# any vowel)\n[^aeiou](?# followed by a non-vowel)\n'

**^^ Is the same as v v when the VERBOSE flag is set**

In [25]:
regexp = r'[aeiou][^aeiou]'

In [26]:
regexp

'[aeiou][^aeiou]'
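
A quick check of that equivalence, as a sketch: the verbose pattern below uses '#' comments instead of '(?#...)' groups, and the test string is made up.

```python
import re

verbose_re = re.compile(r'''
    [aeiou]    # any vowel
    [^aeiou]   # followed by a non-vowel
''', re.VERBOSE)

compact_re = re.compile(r'[aeiou][^aeiou]')

text = "little lamb"  # made-up example string
verbose_matches = verbose_re.findall(text)
compact_matches = compact_re.findall(text)

print(verbose_matches)                     # ['it', 'e ', 'am']
print(verbose_matches == compact_matches)  # True
```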

# Exercises

### 1.) Write 'is_vowel' function that accepts a string as input and uses a regex to determine if the passed string is a vowel.  You can treat the result of 're.search' as a boolean that indicates whether or not the regex matches the given string.

In [59]:
# Mine:

# def is_vowel(string):
#     """
#     Function that takes in a string and uses regex to check if the passed string is a vowel.  
#     """
#     regex = '^[aeiouAEIOU][A-Za-z0-9_]*'
#     if(re.search(regex, string)):
#         print(True)
#     else:
#         print(False)
    

# Ryan's:

def is_vowel(string):
    """
    Function to take in a string and use regex to determine if it's a vowel
    """
    # '^' and '$' anchor the pattern to the start and end of the string,
    # so the whole string must be exactly one character from the vowel character class
    regex = r'^[aeiouAEIOU]$'
    return bool(re.search(regex, string))

assert is_vowel("A") == True
assert is_vowel("B") == False

In [60]:
is_vowel("b")

False

### 2.) Write a function named 'is_valid_username' that accepts a string as input.  A valid username starts with a lowercase letter, and only consists of lowercase letters, numbers, or the _ character.  Also, it should be NO LONGER than 32 characters.  This function should return either **True** or **False** depending on whether or not the string passed is a valid username.

In [61]:
# Mine

# def is_valid_username(regexes, username, max_length=32):
#     """
#     Function using regex to check the validity of a given username according to the rules above
#     """
#     alpha = 'abcdefghijklmnopqrstuvwxyz'
#     numb = '_0123456789'
#     print("Your username: ")
#     print()
#     print(f"  {string}")
#     for regexp in regexes:
#         match = re.findall(r'^[alpha]', username)
#         if len(match) <= 32:
            
# Ryan's:

def is_valid_username(string):
    """
    Function to determine if a username is valid based on the rules above
    """
    return bool(re.search(r'^[a-z][a-z0-9_]{,31}$', string))

assert is_valid_username("jane_janeway76") == True
assert is_valid_username("BillKEtas774g") == False
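
A few edge cases are worth checking against that regex. The sketch below repeats the function so it runs on its own. The second function, 'is_valid_username_strict', is a hypothetical variant (not part of the exercise) that uses 're.fullmatch', because '$' also matches just before a trailing newline.

```python
import re

def is_valid_username(string):
    """Same regex as above: one lowercase letter, then up to 31 more allowed characters."""
    return bool(re.search(r'^[a-z][a-z0-9_]{,31}$', string))

assert is_valid_username('a' * 32)       # exactly 32 characters is allowed
assert not is_valid_username('a' * 33)   # 33 characters is too long
assert not is_valid_username('_jane')    # must start with a lowercase letter
assert not is_valid_username('')         # empty string has no first letter

# '$' matches just before a trailing newline, so this one slips through
assert is_valid_username('jane\n')

def is_valid_username_strict(string):
    """Hypothetical stricter variant: fullmatch requires the ENTIRE string to match."""
    return bool(re.fullmatch(r'[a-z][a-z0-9_]{,31}', string))

assert is_valid_username_strict('jane_janeway76')
assert not is_valid_username_strict('jane\n')
```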


### 3.) Write a regex to capture phone numbers.  It should match all the following:

- (210) 867-5309

- +1 210.867.5309

- 867-5309

- 210-867-5309

In [63]:
# Mine

# phone_number = '(210) 867-5309'
# re.sub(r'\b[^0-9]\b', '', phone_number)

# Ryan's:

numbers = [
    "(210) 867-5309", 
    "+1 210.867.5309", 
    "867-5309", 
    "210-867-5309"
    ]

# parts of a phone number: Country Code = +1, Area Code = 210, Exchange Code = 867, Line number = 5309

phone_number_re = re.compile(r'''
^
(?P<country_code>\+\d+)?
\D*?
(?P<area_code>\d{3})?
\D*?
(?P<exchange_code>\d{3})
\D*?
(?P<line_number>\d{4})
\D*
$
''', re.VERBOSE)

# iterate through the list of strings, producing a dictionary containing named groups from each string

phone_numbers = [re.search(phone_number_re, number).groupdict() for number in numbers]

phone_numbers

[{'country_code': None,
  'area_code': '210',
  'exchange_code': '867',
  'line_number': '5309'},
 {'country_code': '+1',
  'area_code': '210',
  'exchange_code': '867',
  'line_number': '5309'},
 {'country_code': None,
  'area_code': None,
  'exchange_code': '867',
  'line_number': '5309'},
 {'country_code': None,
  'area_code': '210',
  'exchange_code': '867',
  'line_number': '5309'}]

In [65]:
# Now that we have a list of dictionaries, we can make a dataframe!  Cool!
import pandas as pd
import numpy as np

df = pd.DataFrame(phone_numbers)
df

  country_code area_code exchange_code line_number
0         None       210           867        5309
1           +1       210           867        5309
2         None      None           867        5309
3         None       210           867        5309


In [66]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
country_code     1 non-null object
area_code        3 non-null object
exchange_code    4 non-null object
line_number      4 non-null object
dtypes: object(4)
memory usage: 256.0+ bytes


### 4.) Standardize the dates below:

In [None]:
# Convert the dates below to the standardized year-month-day format:

