# Regex

1. Write a function named is_vowel. It should accept a string as input and use a regular expression to determine if the passed string is a vowel. While not explicitly mentioned in the lesson, you can treat the result of re.search as a boolean value that indicates whether or not the regular expression matches the given string.

2. Write a function named is_valid_username that accepts a string as input. A valid username starts with a lowercase letter, and only consists of lowercase letters, numbers, or the _ character. It should also be no longer than 32 characters. The function should return either True or False depending on whether the passed string is a valid username.

3. Write a regular expression to capture phone numbers. It should match all of the following:
    
    (210) 867 5309
    
    +1 210.867.5309
    
    867-5309
    
    210-867-5309

4. Use regular expressions to convert the dates below to the standardized year-month-day format.

5. Write a regex to extract the various parts of these logfile lines:


In [1]:
# imports
import pandas as pd
import re

In [84]:
# Write a function named is_vowel. 
# It should accept a string as input and use a regular expression to determine if the passed string is a vowel.

def is_a_vowel(a_string):
    '''
    Takes a given string and returns any vowels
    '''
    regexp = r'[aeiouAEIOU]'
    
    return re.findall(regexp, a_string)
    

In [85]:
is_a_vowel('tacos')

['a', 'o']

In [86]:
is_a_vowel('TACOS')

['A', 'O']
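The first attempt above returns a list of every vowel it finds, so it answers "which vowels are in this string?" rather than "is this string a vowel?". As a rough sketch of the difference, `re.search` returns a match object or `None`, and wrapping that in `bool` gives a yes/no answer. The helper name `contains_vowel` is made up for this illustration and is not part of the exercise.

```python
import re

def contains_vowel(text):
    # re.search returns a Match object (truthy) or None (falsy)
    return bool(re.search(r'[aeiou]', text, re.IGNORECASE))

# findall gives back the matches themselves, not a boolean
print(re.findall(r'[aeiouAEIOU]', 'tacos'))

# search wrapped in bool gives a plain True / False
print(contains_vowel('tacos'))
print(contains_vowel('rhythm'))
```

Note that this still only checks whether a vowel appears somewhere in the string; anchoring the pattern is what restricts it to a single vowel.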

In [87]:
# solution
def is_vowel(string):
    '''
    returns a boolean value assessing if the passed string is a single vowel
    '''
    regex = r'^[aeiou]$'
    return bool(re.search(regex, string.lower()))

In [89]:
print(is_vowel('A'))
print(is_vowel('e'))
print(is_vowel('b'))
print(is_vowel('ee'))
print(is_vowel('aie'))

True
True
False
False
False
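One edge case worth knowing about: with `re.search`, `$` also matches just before a trailing newline, so `'a\n'` passes the anchored pattern. A possible alternative, sketched here under the assumption that a trailing newline should be rejected, is `re.fullmatch` with the `re.IGNORECASE` flag. The name `is_vowel_fullmatch` is hypothetical.

```python
import re

def is_vowel_fullmatch(string):
    # fullmatch succeeds only if the entire string matches the pattern,
    # so no ^ or $ anchors are needed
    return bool(re.fullmatch(r'[aeiou]', string, re.IGNORECASE))

for candidate in ['A', 'e', 'b', 'ee', '', 'a\n']:
    print(repr(candidate), is_vowel_fullmatch(candidate))

# For comparison: the anchored search accepts a trailing newline
print(bool(re.search(r'^[aeiou]$', 'a\n')))
```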


In [79]:
# Write a function named is_valid_username that accepts a string as input.

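# First attempt, kept commented out for reference. It does not allow the _ character,
# and [a-z]+ is unbounded, so the 32-character limit is not enforced.

# def is_valid_username(a_string):
#     '''
#     Takes a username that starts with a lowercase letter and only consists of
#     lowercase letters and numbers, no longer than 32 characters
#     '''
#     regexp = r'^([a-z]+[0-9a-z]{1,32}$)'
#     if re.match(regexp, a_string):
#         return True
#     else:
#         return False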

In [106]:
# solution
def is_valid_username(some_string):
    '''
    returns a boolean value assessing if the passed string fits the criteria of a username:
    starts with a lowercase letter, 
    only consists of lowercase letters, numbers, or the _ character. 
    no longer than 32 characters. 
    '''
    regex = r'^[a-z][a-z0-9_]{,31}$'
    return bool(re.search(regex, some_string))

In [107]:
is_valid_username('tonymontana01')

True

In [108]:
is_valid_username('Tonymontana01')

False

In [109]:
is_valid_username('abcdefghijklmnopqrstuvwxyzaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa123456789')

False

In [110]:
print(is_valid_username('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'))
print(is_valid_username('codeup'))
print(is_valid_username('Codeup'))
print(is_valid_username('codeup123'))
print(is_valid_username('1codeup'))

False
True
False
True
False


In [112]:
'a' * 33

'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'

In [111]:
print(is_valid_username('a' * 33))

False
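The length limit is the easiest part of this pattern to get wrong by one, so it can help to test right at the boundary. The sketch below is one possible variant, not the solution above: it precompiles the pattern and uses `fullmatch`, which also rejects a trailing newline. `USERNAME_RE` and `is_valid_username_strict` are names invented for this example.

```python
import re

# One leading lowercase letter plus up to 31 more allowed characters = 32 max
USERNAME_RE = re.compile(r'[a-z][a-z0-9_]{0,31}')

def is_valid_username_strict(name):
    # fullmatch requires the whole string to match, so no anchors are needed
    return USERNAME_RE.fullmatch(name) is not None

cases = ['a', 'a' * 32, 'a' * 33, 'a_b_1', '_abc', 'codeup\n']
for case in cases:
    print(repr(case), is_valid_username_strict(case))
```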


In [95]:
# Write a regular expression to capture phone numbers. It should match all of the following:
# (210) 867 5309
# +1 210.867.5309
# 867-5309
# 210-867-5309

# solution
df = pd.DataFrame()
df['number'] = [
    '(210) 867 5309',
    '+1 210.867.5309',
    '867-5309',
    '210-867-5309',
    '2108675309',
]

In [96]:
# solution
phone_regex = re.compile(
r'''
^
(?P<country_code>\+\d+)?
\D*?
(?P<area_code>\d{3})?
\D*?
(?P<exchange_code>\d{3})
\D*?
(?P<line_number>\d{4})
\D*
$
''', re.VERBOSE)

In [97]:
# solution
df.number.str.extract(phone_regex)

,country_code,area_code,exchange_code,line_number
0,,210,867,5309
1,+1,210,867,5309
2,,,867,5309
3,,210,867,5309
4,,210,867,5309


In [98]:
# solution
pd.concat([df, df.number.str.extract(phone_regex)], axis=1)

,number,country_code,area_code,exchange_code,line_number
0,(210) 867 5309,,210,867,5309
1,+1 210.867.5309,+1,210,867,5309
2,867-5309,,,867,5309
3,210-867-5309,,210,867,5309
4,2108675309,,210,867,5309
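The same pattern can be used without pandas. As a minimal sketch, the hypothetical helper `parse_phone` below returns the named groups as a dictionary, or `None` when the input does not look like a phone number. The pattern is copied from the solution above, written as a raw string with inline comments.

```python
import re

PHONE_RE = re.compile(r'''
    ^
    (?P<country_code>\+\d+)?    # optional country code such as +1
    \D*?
    (?P<area_code>\d{3})?       # optional three-digit area code
    \D*?
    (?P<exchange_code>\d{3})
    \D*?
    (?P<line_number>\d{4})
    \D*
    $
''', re.VERBOSE)

def parse_phone(number):
    # groupdict maps each named group to its text, or None if it did not match
    match = PHONE_RE.search(number)
    return match.groupdict() if match else None

for number in ['(210) 867 5309', '+1 210.867.5309', '867-5309', '210-867-5309']:
    print(number, parse_phone(number))
```

Groups that are optional in the pattern come back as `None` when absent, which is the plain-Python counterpart of the missing values in the DataFrame.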


In [99]:
# Use regular expressions to convert the dates below to the standardized year-month-day format.

date_list = [
    '02/04/19',
    '02/05/19',
    '02/06/19',
    '02/07/19',
    '02/08/19',
    '02/09/19',
    '02/10/19',
]

In [100]:
date_reg = r'(\d+)/(\d+)/(\d+)'
re.sub(date_reg, r'20\3-\1-\2', date_list[0])

'2019-02-04'

In [101]:
[re.sub(date_reg, r'20\3-\1-\2', date) for date in date_list]

['2019-02-04',
 '2019-02-05',
 '2019-02-06',
 '2019-02-07',
 '2019-02-08',
 '2019-02-09',
 '2019-02-10']
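Numbered backreferences like `\3-\1-\2` are compact but easy to misread. A possible variation, sketched here, names the groups and refers to them with `\g<name>` in the replacement. It carries the same assumptions as the substitution above: the input is month/day/year and every year falls in the 2000s. `datetime.strptime` is included only as a cross-check, and the function names are made up for this example.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{2})')

def to_iso(date_string):
    # \g<name> refers to a named group in the replacement template
    return DATE_RE.sub(r'20\g<year>-\g<month>-\g<day>', date_string)

def to_iso_strptime(date_string):
    # Cross-check with the standard library's date parser
    return datetime.strptime(date_string, '%m/%d/%y').strftime('%Y-%m-%d')

for date in ['02/04/19', '02/10/19']:
    print(to_iso(date), to_iso_strptime(date))
```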

In [102]:
# Write a regex to extract the various parts of these logfile lines:
# solution
lines = """
GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
"""

In [103]:
# solution
regexp = r'''
^
(?P<method>GET|POST)
\s
(?P<path>/[/\w\-\?=]+)
\s
\[(?P<timestamp>.+)\]
\s
(?P<http_version>HTTP/\d+\.\d+)
\s
\{(?P<status_code>\d+)\}
\s
(?P<bytes>\d+)
\s
"(?P<user_agent>.+)"
\s
(?P<ip>\d+\.\d+\.\d+\.\d+)
$
'''

In [104]:
# solution
[re.search(regexp, line, re.VERBOSE).groupdict() for line in lines.strip().split('\n')]

[{'method': 'GET',
  'path': '/api/v1/sales?page=86',
  'timestamp': '16/Apr/2019:193452+0000',
  'http_version': 'HTTP/1.1',
  'status_code': '200',
  'bytes': '510348',
  'user_agent': 'python-requests/2.21.0',
  'ip': '97.105.19.58'},
 {'method': 'POST',
  'path': '/users_accounts/file-upload',
  'timestamp': '16/Apr/2019:193452+0000',
  'http_version': 'HTTP/1.1',
  'status_code': '201',
  'bytes': '42',
  'user_agent': 'User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
  'ip': '97.105.19.58'},
 {'method': 'GET',
  'path': '/api/v1/items?page=3',
  'timestamp': '16/Apr/2019:193453+0000',
  'http_version': 'HTTP/1.1',
  'status_code': '429',
  'bytes': '3561',
  'user_agent': 'python-requests/2.21.0',
  'ip': '97.105.19.58'}]
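Every value in these dictionaries is a string. As a rough sketch of a follow-up step, the hypothetical `parse_line` helper below converts the status code and byte count to integers and the timestamp to a `datetime`. Reading `193452` as hours, minutes, and seconds is an assumption about the log format, and parsing the month abbreviation with `%b` depends on an English locale.

```python
import re
from datetime import datetime

LOG_RE = re.compile(r'''
    ^
    (?P<method>GET|POST)\s
    (?P<path>/[/\w\-?=]+)\s
    \[(?P<timestamp>[^\]]+)\]\s
    (?P<http_version>HTTP/\d+\.\d+)\s
    \{(?P<status_code>\d+)\}\s
    (?P<bytes>\d+)\s
    "(?P<user_agent>[^"]+)"\s
    (?P<ip>\d+\.\d+\.\d+\.\d+)
    $
''', re.VERBOSE)

def parse_line(line):
    match = LOG_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the numeric fields from strings to integers
    record['status_code'] = int(record['status_code'])
    record['bytes'] = int(record['bytes'])
    # Assumed layout: day/month/year:HHMMSS followed by a UTC offset
    record['timestamp'] = datetime.strptime(record['timestamp'], '%d/%b/%Y:%H%M%S%z')
    return record

sample = ('GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 '
          '{200} 510348 "python-requests/2.21.0" 97.105.19.58')
print(parse_line(sample))
```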

In [105]:
# solution
# method, endpoint, date, protocol, http_status_code, some_number, "user_agent", ip_address

regex = r'''
(?P<method>[A-Z]+)
\s
(?P<path>.*)
\s
\[(?P<timestamp>.*)\]
\s
HTTP/1\.1
\s
\{(?P<status>\d+)\}
\s
(?P<bytes_sent>\d+)
\s
"(?P<user_agent>.*)"
\s+
(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
'''

regex = re.compile(regex, re.VERBOSE)

df = pd.DataFrame()
df['line'] = lines.strip().split('\n')
df = pd.concat([df, df.line.str.extract(regex)], axis=1)
df

,line,method,path,timestamp,status,bytes_sent,user_agent,ip
0,GET /api/v1/sales?page=86 [16/Apr/2019:193452+...,GET,/api/v1/sales?page=86,16/Apr/2019:193452+0000,200,510348,python-requests/2.21.0,97.105.19.58
1,POST /users_accounts/file-upload [16/Apr/2019:...,POST,/users_accounts/file-upload,16/Apr/2019:193452+0000,201,42,User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; ...,97.105.19.58
2,GET /api/v1/items?page=3 [16/Apr/2019:193453+0...,GET,/api/v1/items?page=3,16/Apr/2019:193453+0000,429,3561,python-requests/2.21.0,97.105.19.58
