# Regular Expressions
Regular expressions (regexes) are a sort of meta-language for describing patterns in text. They fall within the scope of lexical analysis. Two common uses:
- Find and/or extract text that matches a pattern.
- Replace or substitute text that matches a pattern.

Regular expressions are provided by the `re` module in the Python standard library. This notebook relies mostly on _search_ and _sub_; _findall_ appears in the bonus section.
_findall_ receives a regular expression (the pattern) and the string you wish to search; it returns a list of all non-overlapping matches of the pattern in that string.
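
As a small illustration of the description above, here is a minimal sketch of _findall_ with and without a capture group. The sample sentence and variable names are only for illustration.

```python
import re

text = "Mary had a little lamb. 1 little lamb. Not 10, not 12."

# No capture groups: findall returns the matched substrings themselves.
numbers = re.findall(r'\d+', text)
print(numbers)

# One capture group: findall returns only what the group captured.
after_little = re.findall(r'little (\w+)', text)
print(after_little)
```

When the pattern contains a capture group, the returned list holds the group contents rather than the full match, which is easy to overlook.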
***
#### Raw Strings
##### Any string literal in Python prefixed with an r is a raw string. Backslashes in a raw string are kept verbatim instead of starting an escape sequence. It is very common to use raw strings when writing a regular expression.
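
A short sketch of why the raw-string prefix matters for patterns. The example strings are made up for illustration.

```python
import re

# In a normal string '\n' is one newline character; in a raw string it is
# a backslash followed by the letter n.
normal_length = len('\n')
raw_length = len(r'\n')
print(normal_length, raw_length)

# '\b' in a normal string is a backspace character, so the pattern never matches.
# r'\b' reaches the regex engine intact and means "word boundary".
no_hits = re.findall('\bcat\b', 'cat concat cat.')
word_hits = re.findall(r'\bcat\b', 'cat concat cat.')
print(no_hits, word_hits)
```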
***

In [1]:
import re
import pandas as pd
import numpy as np

***
### ((Reg(ex))ercises)
***
1. Write a function named is_vowel. It should accept a string as input and use a regular expression to determine if the passed string is a vowel. While not explicitly mentioned in the lesson, you can treat the result of re.search as a boolean value that indicates whether or not the regular expression matches the given string.

2. Write a function named is_valid_username that accepts a string as input. A valid username starts with a lowercase letter, and only consists of lowercase letters, numbers, or the _ character. It should also be no longer than 32 characters. The function should return either True or False depending on whether the passed string is a valid username.

```
>>> is_valid_username('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')
False
>>> is_valid_username('codeup')
True
>>> is_valid_username('Codeup')
False
>>> is_valid_username('codeup123')
True
>>> is_valid_username('1codeup')
False
```

3. Write a regular expression to capture phone numbers. It should match all of the following:
```
(210) 867 5309
+1 210.867.5309
867-5309
210-867-5309
```


4. Use regular expressions to convert the dates below to the standardized year-month-day format.

```
02/04/19
02/05/19
02/06/19
02/07/19
02/08/19
02/09/19
02/10/19
```

5. Write a regex to extract the various parts of these logfile lines:

```
GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
```

*** 

Bonus

You can find a list of words on your mac at /usr/share/dict/words. Use this file to answer the following questions:

- How many words have at least 3 vowels?
- How many words have at least 3 vowels in a row?
- How many words have at least 4 consonants in a row?
- How many words start and end with the same letter?
- How many words start and end with a vowel?
- How many words contain the same letter 3 times in a row?
- What other interesting patterns in words can you find?
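
One possible way to approach the counting questions above, sketched against a tiny hypothetical word list rather than the real `/usr/share/dict/words` file. The helper name `count_matching` and the sample words are assumptions for illustration, and `y` is treated as a consonant here.

```python
import re

# Hypothetical stand-in for the lines of /usr/share/dict/words.
words = ['queue', 'rhythm', 'banana', 'strengths', 'area', 'bookkeeper', 'aaa', 'level']

def count_matching(pattern, words):
    """Count the words in which `pattern` is found, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return sum(1 for word in words if regex.search(word))

# At least 3 vowels anywhere in the word.
three_vowels = count_matching(r'[aeiou].*[aeiou].*[aeiou]', words)
# At least 3 vowels in a row.
three_vowels_in_a_row = count_matching(r'[aeiou]{3}', words)
# At least 4 consonants in a row (every letter except a, e, i, o, u).
four_consonants_in_a_row = count_matching(r'[b-df-hj-np-tv-z]{4}', words)
# First and last letter are the same; \1 refers back to the first group.
same_first_and_last = count_matching(r'^(\w).*\1$', words)

print(three_vowels, three_vowels_in_a_row, four_consonants_in_a_row, same_first_and_last)
```

To run this against the real file, the list could be built by reading the file and splitting it into lines.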


***
***
1.
__Write a function named is_vowel__.       
- It should accept a string as input and use a regular expression to determine if the passed string is a vowel. 
- While not explicitly mentioned in the lesson, you can treat the result of re.search as a boolean value that indicates whether or not the regular expression matches the given string.

In [9]:
def is_vowel(string):
    '''
    this function receives a string, and then indicates whether or not it is a vowel
    '''
    # ^ and $ anchor the pattern so the whole string must be exactly one vowel
    regexp = r'^[aeiouAEIOU]$'
    return bool(re.search(regexp, string))
    
#assert is_vowel('k') this returns an assertion error, as it should.
assert is_vowel('i') # this is accepted. 

# an alternative approach 
def check_for_vowels(string):
    '''
    this function receives a string, and then indicates whether or not it has a vowel
    '''
    regexp = r'[aeiouAEIOU]'
    return re.search(regexp, string)

In [13]:
print(f'{check_for_vowels("dog")}')
print(f'{check_for_vowels("try")}') # ignore how y is sometimes a vowel
print(f'{check_for_vowels("bnb")}')

<re.Match object; span=(1, 2), match='o'>
None
None


In [8]:
def startswith_vowel(string):
    '''
    this function receives a string and uses regular expression
    to indicate whether or not that word begins with a vowel.
    '''
    regexp = r'^[aeiouAEIOU]\w+'
    return re.search(regexp, string)

In [16]:
print(f'{startswith_vowel("eagle")}')
print(f'{startswith_vowel("beagle")}')
print(f'{startswith_vowel("seagull")}')
print(f'{startswith_vowel("Europe")}')

<re.Match object; span=(0, 5), match='eagle'>
None
None
<re.Match object; span=(0, 6), match='Europe'>


In [21]:
def endswith_vowel(string):
    '''
    this function receives a string and uses regular expression
    to indicate whether or not that word ends with a vowel.
    '''
    regexp = r'\w+[aeiouAEIOU]$'
    return re.search(regexp, string)

In [22]:
print(f'{endswith_vowel("eagle")}')
print(f'{endswith_vowel("beagle")}')
print(f'{endswith_vowel("seagull")}')
print(f'{endswith_vowel("Europe")}')

<re.Match object; span=(0, 5), match='eagle'>
<re.Match object; span=(0, 6), match='beagle'>
None
<re.Match object; span=(0, 6), match='Europe'>


***
***
2. __Write a function named is_valid_username__
- Accept a string as input. 
- A valid username starts with a lowercase letter, and only consists of lowercase letters, numbers, or the _ character.
- It should also be no longer than 32 characters.
- The function should return either True or False depending on whether the passed string is a valid username.

In [31]:
def is_valid_username(string):
    '''
    This function indicates whether a string passes username requirements.
    A valid username starts with a lowercase letter and contains only lowercase
    letters, numbers, or the _ character. It must be no longer than 32 characters.
    Returns True or False depending on whether it meets these criteria.
    '''
    # ^ (starts with) one lowercase a-z, then [a-z0-9_] (lowercase letters, numbers, or underscores) up to {,31} more characters.
    # \w is avoided here because it also matches uppercase letters.
    regexp = r'^[a-z][a-z0-9_]{,31}$'
    # conduct regex search and return a boolean indicator of its outcome
    return bool(re.search(regexp, string))

In [38]:
# the answers should be False, True, False, True, False, False

print(f" 1: {is_valid_username('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')}")
print(f" 2: {is_valid_username('codeup')}")
print(f" 3: {is_valid_username('Codeup')}")
print(f" 4: {is_valid_username('codeup123')}")
print(f" 5: {is_valid_username('1codeup')}") 
print(f" 6: {is_valid_username('aylma0%?')}")
assert is_valid_username('dolphins_rule69') # this passes, so the underscore is working as well.

 1: False
 2: True
 3: False
 4: True
 5: False
 6: False
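
One detail worth checking with username patterns: `\w` also matches uppercase letters, so a pattern of the form `^[a-z]\w{,31}$` accepts `codeUp`. The sketch below compares that pattern with an explicit character class used through `fullmatch`. The name `is_valid_username_strict` is hypothetical, not part of the exercise.

```python
import re

# \w allows A-Z (and other word characters), and $ also matches just before
# a trailing newline.
loose = re.compile(r'^[a-z]\w{,31}$')
# An explicit class, used with fullmatch so the entire string must match.
strict = re.compile(r'[a-z][a-z0-9_]{,31}')

def is_valid_username_strict(string):
    """Return True only if the whole string is a valid username."""
    return bool(strict.fullmatch(string))

loose_accepts_upper = bool(loose.search('codeUp'))
loose_accepts_newline = bool(loose.search('codeup\n'))
strict_accepts_upper = is_valid_username_strict('codeUp')
strict_accepts_newline = is_valid_username_strict('codeup\n')
print(loose_accepts_upper, loose_accepts_newline)
print(strict_accepts_upper, strict_accepts_newline)
```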


In [74]:
def check_username(string):
    '''
    This function indicates whether a string passes username requirements.
    A valid username starts with a lowercase letter and contains only lowercase
    letters, numbers, or the _ character. It must be no longer than 32 characters.
    Prints a message saying whether the username meets these criteria.
    '''
    # ^ (starts with) one lowercase a-z, then [a-z0-9_] (lowercase letters, numbers, or underscores) up to {,31} more characters.
    regexp = r'^[a-z][a-z0-9_]{,31}$'
    # a match object is truthy and None is falsy, so the search result can be tested directly
    if re.search(regexp, string):
        print(f'Excellent. Username "{string}" is granted.')
    else:
        print(f'The username "{string}" does not meet the requirements. \n Try again.')

In [68]:
# check_username prints its message and returns None, so wrapping the call in print() would also print that None.
check_username('hollabackgirl')

Excellent. Username "hollabackgirl" is granted.


In [71]:
check_username('420blazeit')

The username "420blazeit" does not meet the requirements. 
 Try again.


***
***
3. Write a regular expression to capture phone numbers.
- It should match all of the following:
    - (210) 867 5309
    - +1 210.867.5309
    - 867-5309
    - 210-867-5309


In [93]:
# set variables for phone numbers I will be matching
phonenum = '(210) 867 5309'
phonenum1 = '+1 210.867.5309'
phonenum2 = '867-5309'
phonenum3 = '210-867-5309'

# r'.*\d{,3}' : . (any character) * (zero or more times) absorbs an international code, parentheses, and spaces;
# \d (a digit) {,3} (up to three times) is the optional area code.
# .*\d{3}
# any characters (zero or more), then exactly three digits
# .*\d{4}
# same as before, but with exactly four digits to close out the phone number.

regexp = r'.*\d{,3}.*\d{3}.*\d{4}'
ohno = '867530986753098675309'  # a deliberately bad input: three phone numbers run together
print(f'{re.search(regexp, ohno)}')

<re.Match object; span=(0, 21), match='867530986753098675309'>


In [78]:
print(f'{re.search(regexp, phonenum)}')
print(f'{re.search(regexp, phonenum1)}')
print(f'{re.search(regexp, phonenum2)}')
print(f'{re.search(regexp, phonenum3)}')

<re.Match object; span=(0, 14), match='(210) 867 5309'>
<re.Match object; span=(0, 15), match='+1 210.867.5309'>
<re.Match object; span=(0, 8), match='867-5309'>
<re.Match object; span=(0, 12), match='210-867-5309'>


In [91]:
def check_phonenumber(string):
    '''
    This function indicates whether a string is a valid phone number.
    Handles instances where the area code is not included, and accounts for hyphens, periods,
    international codes and whitespace. Prints whether a valid phone number was detected.
    Note: the pattern is permissive. Any string with three digits followed later
    by four digits is accepted.
    '''
    regexp = r'.*\d{,3}.*\d{3}.*\d{4}'
    # a match object is truthy and None is falsy, so the search result can be tested directly
    if re.search(regexp, string):
        print(f'Phone number "{string}" accepted.')
    else:
        print(f'"{string}" Does not register as a valid phone number')

In [92]:
# check_phonenumber prints its message and returns None, so it is called directly rather than inside print().
okiedokie = '((6665))-2388'
check_phonenumber(phonenum)
check_phonenumber(okiedokie)
ohno = '867530986753098675309'
check_phonenumber(ohno)
dolphin = 'dolphin'
check_phonenumber(dolphin)

Phone number "(210) 867 5309" accepted.
Phone number "((6665))-2388" accepted.
Phone number "867530986753098675309" accepted.
"dolphin" Does not register as a valid phone number
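
The pattern above accepts some strings that are not phone numbers. Below is a minimal sketch of a stricter alternative, under the assumption that only the four formats listed in the exercise need to match. The names `phone_regex` and `is_phone_number` are hypothetical.

```python
import re

phone_regex = re.compile(r'''
    (?:\+\d{1,2}\s)?          # optional country code followed by a space
    (?:\(?\d{3}\)?[\s.-])?    # optional area code, with or without parentheses
    \d{3}                     # first three digits
    [\s.-]                    # separator: whitespace, period, or hyphen
    \d{4}                     # last four digits
''', re.VERBOSE)

def is_phone_number(string):
    """Return True if the whole string looks like one of the four sample formats."""
    return bool(phone_regex.fullmatch(string))

samples = ['(210) 867 5309', '+1 210.867.5309', '867-5309', '210-867-5309']
rejects = ['((6665))-2388', '867530986753098675309', 'dolphin']
sample_results = [is_phone_number(s) for s in samples]
reject_results = [is_phone_number(s) for s in rejects]
print(sample_results, reject_results)
```

This sketch still has gaps. For example, it does not require parentheses around the area code to be balanced.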


***
***
4. Use regular expressions to convert the dates below to the standardized year-month-day format.
    - 02/04/19
    - 02/05/19
    - 02/06/19
    - 02/07/19
    - 02/08/19
    - 02/09/19
    - 02/10/19

The key to success here will be capture groups and substitution

In [94]:
date_list = [
    '02/04/19',
    '02/05/19',
    '02/06/19',
    '02/07/19',
    '02/08/19',
    '02/09/19',
    '02/10/19'
]

In [97]:
date_regex = r'(\d+)/(\d+)/(\d+)'

# testing
re.sub(date_regex, r'20\3-\1-\2',date_list[1])

'2019-02-05'

In [98]:
# apply to all
[re.sub(date_regex, r'20\3-\1-\2',date) for date in date_list]

['2019-02-04',
 '2019-02-05',
 '2019-02-06',
 '2019-02-07',
 '2019-02-08',
 '2019-02-09',
 '2019-02-10']

In [104]:
def convert_mdy_to_ymd(date_list):
    '''
    Receives a list of dates formatted as mm/dd/yy
    and uses substitution and regex to return the list
    reformatted as yyyy-mm-dd
    '''
    # three capture groups: \1 is the month, \2 is the day, \3 is the two-digit year
    regex = r'(\d+)/(\d+)/(\d+)'
    return [re.sub(regex, r'20\3-\1-\2', date) for date in date_list]

In [105]:
convert_mdy_to_ymd(date_list)

['2019-02-04',
 '2019-02-05',
 '2019-02-06',
 '2019-02-07',
 '2019-02-08',
 '2019-02-09',
 '2019-02-10']
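
The same substitution can also be written with named groups, which some readers find easier to follow than `\1`, `\2`, `\3`. This is a sketch, assuming every two-digit year belongs to the 2000s as in the dates above. The names `date_pattern` and `mdy_to_ymd` are hypothetical.

```python
import re

# Named groups make the replacement template self-describing.
date_pattern = re.compile(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{2})')

def mdy_to_ymd(date):
    """Rewrite mm/dd/yy as yyyy-mm-dd, assuming the year is 20yy."""
    return date_pattern.sub(r'20\g<year>-\g<month>-\g<day>', date)

converted = [mdy_to_ymd(d) for d in ['02/04/19', '02/10/19']]
print(converted)
```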

***
***
5. Write a regex to extract the various parts of these logfile lines:
            
                        
```
GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
```


In [109]:
lines = """
GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
"""

In [107]:
regexp = r'''
^
(?P<method>GET|POST)
\s
(?P<path>/[/\w\-\?=]+)
\s
\[(?P<timestamp>.+)\]
\s
(?P<http_version>HTTP/\d+\.\d+)
\s
\{(?P<status_code>\d+)\}
\s
(?P<bytes>\d+)
\s
"(?P<user_agent>.+)"
\s
(?P<ip>\d+\.\d+\.\d+\.\d+)
$
'''

In [118]:
# experimenting with comments 
regexp2 = r'''
^
(?P<method>GET|POST) (?# ?P<> -chevrons- are used to designate the name of the capture-group. GET OR POST)
\s (?# Indicates the whitespace for the lines)
(?P<path>/[/\w\-\?=]+) (?# The path can contain one or more of any of the bracketed symbols, hence the plus)
\s
\[(?P<timestamp>.+)\] (?# The timestamp is anything, and any number of it all, contained within brackets)
\s
(?P<http_version>HTTP/\d+\.\d+) (?# version is HTTP/ then \d+ -one or more digits- then \. -a literal dot- and \d+ again)
\s
\{(?P<status_code>\d+)\} (?# status code is any number of digits contained in curly brackets)
\s
(?P<bytes>\d+) (?# collect all digits after the white space for bytes)
\s
"(?P<user_agent>.+)" (?# All of anything contained in quotation marks for user agent)
\s
(?P<ip>\d+\.\d+\.\d+\.\d+) (?# any number of digits between the period dividers)
$ (?# end of the line)
'''

In [119]:
[re.search(regexp2, line, re.VERBOSE).groupdict() for line in lines.strip().split('\n')]
# comments worked and didn't throw a wrench in the process.

[{'method': 'GET',
  'path': '/api/v1/sales?page=86',
  'timestamp': '16/Apr/2019:193452+0000',
  'http_version': 'HTTP/1.1',
  'status_code': '200',
  'bytes': '510348',
  'user_agent': 'python-requests/2.21.0',
  'ip': '97.105.19.58'},
 {'method': 'POST',
  'path': '/users_accounts/file-upload',
  'timestamp': '16/Apr/2019:193452+0000',
  'http_version': 'HTTP/1.1',
  'status_code': '201',
  'bytes': '42',
  'user_agent': 'User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
  'ip': '97.105.19.58'},
 {'method': 'GET',
  'path': '/api/v1/items?page=3',
  'timestamp': '16/Apr/2019:193453+0000',
  'http_version': 'HTTP/1.1',
  'status_code': '429',
  'bytes': '3561',
  'user_agent': 'python-requests/2.21.0',
  'ip': '97.105.19.58'}]
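
With `re.VERBOSE`, comments can also be written with `#`, which runs to the end of the line, instead of the inline `(?#...)` form. A minimal sketch on a fragment of the log pattern follows. The name `status_regex` is hypothetical.

```python
import re

status_regex = re.compile(r'''
    \{                # a literal opening brace
    (?P<status>\d+)   # one or more digits, captured as the status code
    \}                # a literal closing brace
    \s
    (?P<bytes>\d+)    # the byte count that follows
''', re.VERBOSE)

line = ('GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 '
        '{429} 3561 "python-requests/2.21.0" 97.105.19.58')
fields = status_regex.search(line).groupdict()
print(fields)
```

In verbose mode a literal `#` or space inside the pattern has to be escaped or placed in a character class, since both are otherwise ignored.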

In [111]:
# method, endpoint, date, protocol, http_status_code, some_number, "user_agent", ip_address

regex = r'''
(?P<method>[A-Z]+)
\s
(?P<path>.*)
\s
\[(?P<timestamp>.*)\]
\s
HTTP/1\.1
\s
\{(?P<status>\d+)\}
\s
(?P<bytes_sent>\d+)
\s
"(?P<user_agent>.*)"
\s+
(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
'''

regex = re.compile(regex, re.VERBOSE)

df = pd.DataFrame()
df['line'] = lines.strip().split('\n')
df = pd.concat([df, df.line.str.extract(regex)], axis=1)
df

| | line | method | path | timestamp | status | bytes_sent | user_agent | ip |
|---|---|---|---|---|---|---|---|---|
| 0 | GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58 | GET | /api/v1/sales?page=86 | 16/Apr/2019:193452+0000 | 200 | 510348 | python-requests/2.21.0 | 97.105.19.58 |
| 1 | POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58 | POST | /users_accounts/file-upload | 16/Apr/2019:193452+0000 | 201 | 42 | User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36 | 97.105.19.58 |
| 2 | GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58 | GET | /api/v1/items?page=3 | 16/Apr/2019:193453+0000 | 429 | 3561 | python-requests/2.21.0 | 97.105.19.58 |
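
Every extracted field is a string. As a possible follow-up, the sketch below converts one parsed record to richer types using only the standard library. The helper name `to_typed_record` and the sample dictionary are assumptions for illustration, and parsing the month abbreviation with `%b` assumes an English locale.

```python
from datetime import datetime

def to_typed_record(fields):
    """Convert the string values of one parsed log line to richer types."""
    record = dict(fields)
    record['status'] = int(record['status'])
    record['bytes_sent'] = int(record['bytes_sent'])
    # The timestamps in these log lines have no colons inside the time part,
    # hence %H%M%S rather than %H:%M:%S.
    record['timestamp'] = datetime.strptime(record['timestamp'], '%d/%b/%Y:%H%M%S%z')
    return record

# Hypothetical sample shaped like one row of extracted fields.
parsed = {
    'method': 'GET',
    'path': '/api/v1/sales?page=86',
    'timestamp': '16/Apr/2019:193452+0000',
    'status': '200',
    'bytes_sent': '510348',
    'user_agent': 'python-requests/2.21.0',
    'ip': '97.105.19.58',
}
typed = to_typed_record(parsed)
print(typed['status'], typed['bytes_sent'], typed['timestamp'].isoformat())
```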


***
***
### BONUSES and extra materials

In [2]:
# This function is taken from the DS textbook via Codeup
def show_all_matches(regexes, subject, re_length=6):
    print('Sentence:')
    print()
    print('    {}'.format(subject))
    print()
    print(' regexp{} | matches'.format(' ' * (re_length - 6)))
    print(' ------{} | -------'.format(' ' * (re_length - 6)))
    for regexp in regexes:
        fmt = ' {:<%d} | {!r}' % re_length
        matches = re.findall(regexp, subject)
        if len(matches) > 8:
            matches = matches[:8] + ['...']
        print(fmt.format(regexp, matches))

In [3]:
sentence = 'Mary had a little lamb. 1 little lamb. Not 10, not 12, not 22, just one.'

show_all_matches([
    r'a',
    r'm',
    r'M',
    r'Mary',
    r'little',
    r'1',
    r'10',
    r'22'
], sentence)


Sentence:

    Mary had a little lamb. 1 little lamb. Not 10, not 12, not 22, just one.

 regexp | matches
 ------ | -------
 a      | ['a', 'a', 'a', 'a', 'a']
 m      | ['m', 'm']
 M      | ['M']
 Mary   | ['Mary']
 little | ['little', 'little']
 1      | ['1', '1', '1']
 10     | ['10']
 22     | ['22']
