# Exercises - Regular Expressions
[*Lesson Page*](https://ds.codeup.com/11-nlp/2-regular-expressions/)

Create a directory named `nlp` to do your work for this module. All exercises should live inside of this directory.

Unless a file extension is specified, you may do your work in either a Python script (.py) or a Jupyter notebook (.ipynb).

Do your work for this exercise in a file named `regex`.

In [1]:
import pandas as pd
import re
from zgulde.hl_matches import hl_all_matches_nb as hl

1. **Write a function named `is_vowel`. It should accept a string as input and use a regular expression to determine if the passed string is a vowel. While not explicitly mentioned in the lesson, you can treat the result of `re.search` as a boolean value that indicates whether or not the regular expression matches the given string.**
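
The note about treating the result of `re.search` as a boolean can be illustrated with a minimal sketch. The sample strings here are made up for illustration and are not part of the exercise.

```python
import re

# re.search returns a Match object on success and None on failure,
# so the result can be used directly wherever a boolean is expected.
hit = re.search(r'[aeiou]', 'codeup')
miss = re.search(r'[aeiou]', 'rhythm')

print(hit.group())            # the first vowel found: o
print(miss)                   # None
print(bool(hit), bool(miss))  # True False
```

Comparing against `None` explicitly, as the solution below does, returns a real `True` or `False` rather than a Match object, which is what doctests expect.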

In [2]:
def is_vowel(chars):
    '''
    returns True if test value is one single vowel
    >>> is_vowel('a')
    True
    >>> is_vowel('aa')
    False
    >>> is_vowel('')
    False
    >>> is_vowel(9)
    False
    >>> is_vowel('A')
    True
    '''
    test = str(chars).lower()
    chk = r'^[aeiou]$'
    return re.search(chk, test) is not None

In [3]:
test1 = pd.DataFrame(['a','aa','b','A','A ','$', 9, '9'], columns=['test'])
test1['is_vowel'] = test1.test.apply(is_vowel)
test1[test1.is_vowel]

| | test | is_vowel |
|---|---|---|
| 0 | a | True |
| 3 | A | True |


2. **Write a function named `is_valid_username` that accepts a string as input. A valid username starts with a lowercase letter, and only consists of lowercase letters, numbers, or the _ character. It should also be no longer than 32 characters. The function should return either True or False depending on whether the passed string is a valid username.**


<code>
>>> is_valid_username('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')
False
>>> is_valid_username('codeup')
True
>>> is_valid_username('Codeup')
False
>>> is_valid_username('codeup123')
True
>>> is_valid_username('1codeup')
False
</code>

In [4]:
def is_valid_username(chars):
    '''
    >>> is_valid_username('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')
    False
    >>> is_valid_username('codeup')
    True
    >>> is_valid_username('Codeup')
    False
    >>> is_valid_username('codeup123')
    True
    >>> is_valid_username('1codeup')
    False
    '''
    test = str(chars)
    chk = r'^[a-z][a-z0-9_]{,31}$'
    return re.search(chk, test) is not None
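
One edge case may be worth checking: in Python, `$` also matches just before a trailing newline, so the `re.search` version accepts `'codeup\n'`. The sketch below uses `re.fullmatch` instead. The name `is_valid_username_strict` is hypothetical and only meant to show the difference.

```python
import re

# Same character rules as above; fullmatch needs no ^ or $ anchors
# because the whole string must match the pattern.
USERNAME = re.compile(r'[a-z][a-z0-9_]{0,31}')

def is_valid_username_strict(chars):
    '''Hypothetical variant that also rejects a trailing newline.'''
    return USERNAME.fullmatch(str(chars)) is not None

# '$' matches before a final newline, so the search-based check passes here
print(re.search(r'^[a-z][a-z0-9_]{,31}$', 'codeup\n') is not None)  # True
print(is_valid_username_strict('codeup\n'))                         # False
print(is_valid_username_strict('codeup123'))                        # True
```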

3. **Write a regular expression to capture phone numbers. It should match all of the following:**


(210) 867 5309<br>
+1 210.867.5309<br>
867-5309<br>
210-867-5309<br>

In [5]:
def format_phone_number_uscan(chars):
    '''
    
    >>> format_phone_number_uscan('(210) 867 5309')
    '210-867-5309'
    >>> format_phone_number_uscan('+1 210.867.5309')
    '210-867-5309'
    >>> format_phone_number_uscan('867-5309')
    '867-5309'
    >>> format_phone_number_uscan('-867-5309')
    False
    >>> format_phone_number_uscan('210-867-5309')
    '210-867-5309'
    >>> format_phone_number_uscan('phone number')
    False
    '''
    test = str(chars)
    # seven-digit local number: exchange and line digits only
    chk = re.compile(r'^(?P<exchg>\d{3})[ .-](?P<digits>\d{4})$')
    match = chk.search(test)
    if match is not None:
        return match['exchg'] + '-' + match['digits']
    # optional +1 prefix, area code with optional parentheses, then the local number
    chk = re.compile(r'^(\+1[ .-])?[(]?(?P<area>\d{3})[)]?[ .-](?P<exchg>\d{3})[ .-](?P<digits>\d{4})$')
    match = chk.search(test)
    if match is None:
        return False
    return match['area'] + '-' + match['exchg'] + '-' + match['digits']
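
Since the exercise asks for a single regular expression, here is one possible sketch that covers all four sample formats with one pattern. The name `PHONE` and the group names are illustrative choices, and the pattern is only checked against the examples listed in the exercise.

```python
import re

# In VERBOSE mode, whitespace outside character classes is ignored,
# so the pattern can be laid out one piece per line.
PHONE = re.compile(r'''
    ^
    (?:\+1[ .-])?                     # optional country code
    (?:\(?(?P<area>\d{3})\)?[ .-])?   # optional area code, with or without parentheses
    (?P<exchg>\d{3})[ .-]             # exchange, then a separator
    (?P<digits>\d{4})                 # last four digits
    $
''', re.VERBOSE)

samples = ['(210) 867 5309', '+1 210.867.5309', '867-5309', '210-867-5309']
for sample in samples:
    match = PHONE.search(sample)
    print(sample, '->', match.groupdict())
```

The `area` group is `None` for the seven-digit number, so a caller can decide how to format numbers without an area code.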

4. **Use regular expressions to convert the dates below to the standardized year-month-day format.**


02/04/19<br>
02/05/19<br>
02/06/19<br>
02/07/19<br>
02/08/19<br>
02/09/19<br>
02/10/19<br>

In [47]:
dates = pd.Series(
    ['02/04/19', '02/05/2019', '02/06/19', '02/07/19', '02/08/19', '02/09/19', '02/10/19'], 
    name='dates')
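chk = re.compile(r'^([0123]?\d)/([0123]?\d)/((?:\d{2})?\d{2})$')

def to_ymd(m):
    # assumption: a two-digit year belongs to the 2000s
    year = m[3] if len(m[3]) == 4 else '20' + m[3]
    return f'{year}-{m[1].zfill(2)}-{m[2].zfill(2)}'

[chk.sub(to_ymd, str(date)) for date in dates]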

5. **Write a regex to extract the various parts of these logfile lines:**

<code>GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58<br>POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58<br>GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58</code>

In [49]:
inputs5 = pd.Series([
    r'GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58',
    r'POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58',
    r'GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58'
])
display(inputs5)

outputs5 = inputs5.str.extract(r'^([A-Z]+)\s(\/[^? ]+)(\?\S+)?\s\[(.+)\]\s(\S+)\s(\{[0-9]+\})\s([0-9]+)\s(\".+\")\s(\d+\.\d+\.\d+\.\d+)')
outputs5.columns = ['method', 'path', 'qstring', 'timestamp', 'protocol', 'status', 'bytes', 'useragent', 'ip']
display(outputs5)

0    GET /api/v1/sales?page=86 [16/Apr/2019:193452+...
1    POST /users_accounts/file-upload [16/Apr/2019:...
2    GET /api/v1/items?page=3 [16/Apr/2019:193453+0...
dtype: object

| | method | path | qstring | timestamp | protocol | status | bytes | useragent | ip |
|---|---|---|---|---|---|---|---|---|---|
| 0 | GET | /api/v1/sales | ?page=86 | 16/Apr/2019:193452+0000 | HTTP/1.1 | {200} | 510348 | "python-requests/2.21.0" | 97.105.19.58 |
| 1 | POST | /users_accounts/file-upload | NaN | 16/Apr/2019:193452+0000 | HTTP/1.1 | {201} | 42 | "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora;... | 97.105.19.58 |
| 2 | GET | /api/v1/items | ?page=3 | 16/Apr/2019:193453+0000 | HTTP/1.1 | {429} | 3561 | "python-requests/2.21.0" | 97.105.19.58 |
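
A long positional pattern like the one above can be easier to maintain with named groups and `re.VERBOSE`. The following is a sketch using only the standard library. The name `LOG_LINE` is illustrative, and unlike the notebook cell it captures the status without braces and the user agent without quotes.

```python
import re

# One named group per field; VERBOSE mode ignores the layout whitespace.
LOG_LINE = re.compile(r'''
    ^(?P<method>[A-Z]+)\s
    (?P<path>/[^?\s]*)
    (?P<qstring>\?\S+)?\s
    \[(?P<timestamp>[^\]]+)\]\s
    (?P<protocol>\S+)\s
    \{(?P<status>\d{3})\}\s
    (?P<bytes>\d+)\s
    "(?P<useragent>[^"]*)"\s
    (?P<ip>\d{1,3}(?:\.\d{1,3}){3})$
''', re.VERBOSE)

line = ('GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 '
        '{200} 510348 "python-requests/2.21.0" 97.105.19.58')
parts = LOG_LINE.search(line).groupdict()
for name, value in parts.items():
    print(f'{name:>10}: {value}')
```

When a pattern with named groups is passed to pandas `Series.str.extract`, the group names become the column names, so the separate `outputs5.columns = [...]` assignment would not be needed.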


6. **You can find a list of words on your Mac at `/usr/share/dict/words`. Use this file to answer the following questions:**
   - How many words have at least 3 vowels?
   - How many words have at least 3 vowels in a row?
   - How many words have at least 4 consonants in a row?
   - How many words start and end with the same letter?
   - How many words start and end with a vowel?
   - How many words contain the same letter 3 times in a row?
   - What other interesting patterns in words can you find?

In [14]:
words = pd.read_csv('/usr/share/dict/words', header=None).iloc[:,0].dropna().str.lower()
words.describe()

count     235884
unique    234370
top        daira
freq           2
Name: 0, dtype: object

In [41]:
# How many words have at least 3 vowels?

w3v = words[words.str.count(r'[aeiou]') >= 3]
display(w3v.describe())
display(w3v.sample(4))

count     191365
unique    190744
top       sabine
freq           2
Name: 0, dtype: object

220306          unpathed
219478    unmilitariness
201415        thelyotoky
148411    plectospondyli
Name: 0, dtype: object

In [42]:
# How many words have at least 3 vowels in a row?

w3vc = words[words.str.match(r'.*[aeiou]{3,}')]
display(w3vc.describe())
display(w3vc.sample(4))

count          6182
unique         6172
top       delicious
freq              2
Name: 0, dtype: object

137239    palaeophytology
235488         zooerastia
99036            isozooid
53022        diatomaceous
Name: 0, dtype: object

In [None]:
# How many words have at least 4 consonants in a row?

# assumption: y counts as a consonant
ww = words[words.str.contains(r'[b-df-hj-np-tv-z]{4}')]
display(ww.describe())
display(ww.sample(4))

In [None]:
# How many words start and end with the same letter?

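# assumption: single-letter words count, since their first and last letter coincide
ww = words[words.str.match(r'^(.)(?:.*\1)?$')]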
display(ww.describe())
display(ww.sample(4))

In [None]:
# How many words start and end with a vowel?

# assumption: a single-vowel word such as 'a' counts
ww = words[words.str.match(r'^[aeiou](?:.*[aeiou])?$')]
display(ww.describe())
display(ww.sample(4))

In [None]:
# How many words contain the same letter 3 times in a row?

# \1\1 repeats whatever letter the first group captured
ww = words[words.str.match(r'.*(.)\1\1')]
display(ww.describe())
display(ww.sample(4))

In [43]:
# How many words contain the same doubled letter (such as "ss") twice?

w2lr = words[words.str.contains(r'((.)\2).*\1')]
display(w2lr.describe())
display(w2lr.sample(4))

count               792
unique              792
top       assiduousness
freq                  1
Name: 0, dtype: object

186096      speechlessness
216204    unexpressiveness
198654       tastelessness
231813        whillaballoo
Name: 0, dtype: object
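
The nested groups in `((.)\2).*\1` can be hard to read, so here is a small sketch of how the numbering works, using plain `re` on two words from the sample output plus one made-up non-match (`balloon`).

```python
import re

# Groups are numbered by their opening parenthesis: group 1 is the outer
# pair, group 2 is the inner (.). So (.)\2 is a doubled letter, and the
# trailing \1 requires that same pair to appear again later in the word.
double_twice = re.compile(r'((.)\2).*\1')

for word in ['speechlessness', 'whillaballoo', 'balloon']:
    m = double_twice.search(word)
    print(word, '->', m.group(1) if m else None)
# speechlessness -> ss
# whillaballoo -> ll
# balloon -> None
```

`balloon` has two doubled letters, but they are different pairs, so the backreference `\1` never finds a second copy.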

In [None]:
# What other interesting patterns in words can you find?

# one possibility: words that contain none of the vowels a, e, i, o, u
ww = words[~words.str.contains(r'[aeiou]')]
display(ww.describe())
display(ww.sample(8))

In [None]:
import doctest
doctest.testmod(verbose=False)