### <font color='brown'>Problem Set 5: Regular Expressions</font>

**Square Brackets []:** Square brackets are used to create character classes. They specify a set of characters from which the regex engine can match any single character. For example:

- [abc]: Matches either 'a', 'b', or 'c'.
- [a-z]: Matches any lowercase letter from 'a' to 'z'.
- [0-9]: Matches any digit from '0' to '9'.
- [^abc]: Matches any character except 'a', 'b', or 'c'.
- [a-zA-Z0-9]: Matches any alphanumeric character.
- [aeiou]: Matches any vowel.
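
A minimal sketch of a few of the character classes above, using `re.findall()`. The sample string and variable names are made up for illustration.

```python
import re

sample = "Cab 42 zebra!"  # hypothetical sample text

# [abc] matches one character that is 'a', 'b', or 'c'
abc_hits = re.findall(r'[abc]', sample)
# [0-9] matches a single digit
digit_hits = re.findall(r'[0-9]', sample)
# [^a-z] matches any single character that is NOT a lowercase letter
other_hits = re.findall(r'[^a-z]', sample)

print(abc_hits)
print(digit_hits)
print(other_hits)
```

Note that a character class always consumes exactly one character; a quantifier such as `+` is needed to match a run of them.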


**Parentheses ():** Parentheses are used to create capturing groups or to specify the scope of alternation (the | symbol).

- (abc): Capturing group, captures and remembers 'abc' for later reference.
- (abc|def): Alternation, matches either 'abc' or 'def'.
- (?:abc): Non-capturing group, groups the pattern without capturing it.
- (ab)+: Capturing group with quantifier, matches 'ab' one or more times.
- (?=abc): Positive lookahead, matches if 'abc' is ahead in the string.
- (?!abc): Negative lookahead, matches if 'abc' is not ahead in the string.
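
The grouping constructs above can be sketched as follows. The test strings (`'xxdefxx'`, `'ababab'`, `'foobar'`) are hypothetical examples, not part of the problem set.

```python
import re

# Capturing group with alternation: group(1) holds whichever branch matched
alt = re.search(r'(abc|def)', 'xxdefxx')

# Capturing group with a quantifier: group(1) keeps only the last repetition
rep = re.search(r'(ab)+', 'ababab')

# Non-capturing group: groups the pattern but creates no numbered group
nc = re.search(r'(?:ab)+', 'ababab')

# Positive lookahead: match 'foo' only if 'bar' follows; 'bar' is not consumed
la = re.search(r'foo(?=bar)', 'foobar')

# Negative lookahead: match 'foo' only if 'bar' does NOT follow
nla = re.search(r'foo(?!bar)', 'foobar')

print(alt.group(1), rep.group(), rep.group(1), nc.groups(), la.span(), nla)
```

Lookaheads are zero-width: they check what comes next without adding it to the match, which is why the positive lookahead match spans only `'foo'`.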


**Neither:** You don't always need to use brackets or parentheses in a regular expression. For example:
- abc: Matches the sequence 'abc' exactly.
- a|b: Alternation, matches either 'a' or 'b'.
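- \d: Matches any digit; equivalent to [0-9] for ASCII text (or when the re.ASCII flag is set).
- \w: Matches any word character; equivalent to [a-zA-Z0-9_] for ASCII text (or when the re.ASCII flag is set).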
- \s: Matches any whitespace character.
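
A short sketch of the shorthand classes in use. The input strings are invented for the example.

```python
import re

# \d+ : runs of digits
digits = re.findall(r'\d+', 'Room 101, floor 3')
# \w+ : runs of word characters (letters, digits, underscore)
words = re.findall(r'\w+', 'snake_case and x2')
# \s+ : split on any run of whitespace (spaces, tabs, newlines)
fields = re.split(r'\s+', 'a  b\tc\nd')

print(digits)
print(words)
print(fields)
```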

In [10]:
import re

---

#### Problem 1

Write regular expressions for the following patterns, each of which will be used with the re.search() function on some target string:
1. A sequence for currency in dollars (> 0 and \< 1 million, no leading zeros) with at least one space before and after. The actual number will be preceded by a dollar sign, e.g. "$25".
2. A sequence for currency in dollars (> 0 and \< 1 million, no leading zeros) and cents, with at least one space before and after, e.g. "12.25". Cents must have 2 decimal places and 0 cents must be written as "00".
3. Modify the pattern for the sequence above to extract and print the dollars and cents values
4. A word (lowercase letters only) of length 4 or more that starts and ends with the same letter
5. A sequence that allows for (1) or (2), i.e. dollars with or without cents
6. A string that starts with a letter, followed by at least one character that is not one of '.' (period), ',' (comma) or ';' (semicolon), and ending with any character that is not a digit.

In [49]:
# 1. A sequence for currency in dollars (> 0 and < 1 million, no leading zeros) with at least one space before and after. The actual number will be preceded by a dollar sign, e.g. "$25". 0-1000000

dol = re.compile(r'^\s+\$[1-9]\d{,5}\s+')
res = dol.search(' $3 ')
print(res)
res = dol.search(' $15 ')
print(res)
res = dol.search(' $0 ')
print(res)
res = dol.search(' 355 ')
print(res)
res = dol.search(' $0123 ')
print(res)
res = dol.search(' $300000 ')
print(res)
res = dol.search('$1000000')
print(res)

<re.Match object; span=(0, 4), match=' $3 '>
<re.Match object; span=(0, 5), match=' $15 '>
None
None
None
<re.Match object; span=(0, 9), match=' $300000 '>
None


In [50]:
# 2. A sequence for currency in dollars (> 0 and < 1 million, no leading zeros) and cents, with at least one space before and after, e.g. "12.25". Cents must have 2 decimal places and 0 cents must be written as "00".

# Raw string avoids invalid-escape warnings; [1-9] forbids leading zeros and an empty dollar amount
dolcnt = re.compile(r'^\s+\$[1-9]\d{,5}\.\d{2}\s+')

res = dolcnt.search(' $3 ')
print(res)
res = dolcnt.search(' $15 ')
print(res)
res = dolcnt.search(' $15.65 ')
print(res)
res = dolcnt.search(' $355.06 ')
print(res)
res = dolcnt.search(' $86723.00 ')
print(res)
res = dolcnt.search(' $86723.0 ')
print(res)

None
None
<re.Match object; span=(0, 8), match=' $15.65 '>
<re.Match object; span=(0, 9), match=' $355.06 '>
<re.Match object; span=(0, 11), match=' $86723.00 '>
None


In [54]:
# 3. Modify the pattern for the sequence above to extract and print the dollars and cents values
# Wrap the dollars and cents in () so they can be read back with res.group(1) and res.group(2)
pattern_3 = r'^\s+\$([1-9]\d{,5})\.(\d{2})\s+'
res = re.search(pattern_3, ' $355.09 ')

if res:
    inp = res.group(0)
    dollars = res.group(1)
    cents = res.group(2)
    print('Dollars: ' + dollars)
    print('Cents: ' + cents)
else:
    print('Match not found')

Dollars: 355
Cents: 09


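**Note:** the difference between
- `res.group()`: the entire matched text (same as `res.group(0)`)
- `res.groups()`: a tuple of all captured groups, e.g. `('355', '09')`
- `res.groups()[0]`: the first captured group (same as `res.group(1)`)

Wrap these calls in an if/else block: when there is no match, `re.search()` returns `None`, and calling `.group()` on `None` raises an `AttributeError`.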
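
A minimal sketch of the three accessors and of the `None` guard. The pattern and the two input strings are illustrative assumptions, not taken from the problem set.

```python
import re

price = re.compile(r'\$(\d+)\.(\d{2})')  # hypothetical pattern with two groups

res = price.search('Total: $355.09')
whole = res.group()       # entire match, same as res.group(0)
captured = res.groups()   # tuple of all captured groups
first = res.groups()[0]   # first captured group, same as res.group(1)

miss = price.search('no price here')
# miss is None, so guard before calling .group() to avoid an AttributeError
label = miss.group() if miss else 'Match not found'

print(whole, captured, first, label)
```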

In [24]:
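# [1-9]\d{,5}: first digit is non-zero, later digits may be 0-9 (so "$2045.46" also matches)
dolcnt = re.compile(r'\s+\$([1-9]\d{,5})\.(\d{2})\s+')
res = dolcnt.search("    $2345.46      ")
print(res)

# groups() returns a tuple of the captured groups: (dollars, cents)
money = res.groups()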

print(money)

print("Dollars: " + str(money[0]) + '\nCents: ' + str(money[1]))

print(f'dollars: {res.groups()[0]}, cents: {res.groups()[1]}')

<re.Match object; span=(0, 18), match='    $2345.46      '>
('2345', '46')
Dollars: 2345
Cents: 46
dollars: 2345, cents: 46


In [8]:
# 4. A word (lowercase letters only) of length 4 or more that starts and ends with the same letter
# Use backreference to match the same letter at the beginning and the end
aword = re.compile(r'^([a-z])[a-z]{2,}\1$')

res = aword.search('alpha')
print(res)
res = aword.search('asia')
print(res)
res = aword.search('aja')
print(res)
res = aword.search('2ab2')
print(res)
res = aword.search('xalphay')
print(res)

<re.Match object; span=(0, 5), match='alpha'>
<re.Match object; span=(0, 4), match='asia'>
None
None
None


In [11]:
# 5. A sequence that allows for (1) or (2), i.e. dollars with or without cents
import re

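# The cents part is an optional non-capturing group, so both "$3" and "$15.00" match
pattern_5 = re.compile(r'\s+\$[1-9]\d{,5}(?:\.\d{2})?\s+')

res = pattern_5.search(' $3 ')
print(res)

res = pattern_5.search(' $15.00 ')
print(res)

<re.Match object; span=(0, 4), match=' $3 '>
<re.Match object; span=(0, 8), match=' $15.00 '>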

In [14]:
# 6. A string that starts with a letter, followed by at least one character that is not one of '.' (period) ',' (comma) or ';' (semicolon) and ending with any character that is not a digit.

# + means at least one character; $ anchors the final non-digit to the end of the string
pattern_6 = r'^[a-zA-Z][^.,;]+[^0-9]$'

res = re.search(pattern_6, 'aaa')
print(res)

<re.Match object; span=(0, 3), match='aaa'>
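
Because `re.search()` is satisfied by any matching substring, an "ending with" requirement only holds when the pattern is anchored with `$`. A small sketch comparing the two forms; the test string `'abc12'` and the variable names are made up for illustration.

```python
import re

unanchored = r'^[a-zA-Z][^.,;]+[^0-9]'
anchored = r'^[a-zA-Z][^.,;]+[^0-9]$'

text = 'abc12'  # ends with a digit, so it should be rejected

# Without $, the engine backtracks and accepts the prefix 'abc'
loose = re.search(unanchored, text)
# With $, the last character of the whole string must be a non-digit
strict = re.search(anchored, text)

print(loose)
print(strict)
```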


---

#### Problem 2

Write a function that takes a sentence as an input string parameter. The function should print True if the sentence starts with "The", contains "mountains" and ends with "river.", otherwise it should print False. <br/>Also, if the pattern matches, the function should split the sentence into two parts (using the regular expression split function), one each on either side of the word "mountains", and print the parts. 

In [15]:
def prob_2(string):
    pattern = r'^The.*\bmountains\b.*river\.$'
    res = re.search(pattern, string)

    if res:
        print(True)
        # Split once on the word "mountains" to get the text on either side
        parts = re.split(r'\bmountains\b', string, maxsplit=1)
        print(parts)
    else:
        print(False)

prob_2('The mountains and the river')

False


In [17]:
prob_2("The wild foxes live on the mountains and drink water from the river.")

True
['The wild foxes live on the ', ' and drink water from the river.']


In [16]:
prob_2("The wild foxes live on the plains and drink water from the river.")

False


---

#### Problem 3

You are given a file named [ps4-5_in.txt](ps4-5_in.txt) with multiple records, each on a separate line. Each record has person name, profession, and school name, separated by a comma. You are required to extract only those records for which person name starts with 'a' and profession is student, and write the extracted records to a new output file called *ps5-3_out.txt*. You must use regular expressions to do the extraction for each record.

Note: 'student scholar' is a different profession than 'student'


In [5]:
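import re

def process_records():
    # Context managers close both files even if an error occurs
    with open("./text_files/ps4-5_in.txt", "r") as inf, \
         open("./text_files/ps5-3_out.txt", "w") as outf:
        for line in inf:
            parts = line.split(",")
            if len(parts) < 3:
                continue  # skip blank or malformed lines
            name = parts[0].strip()
            profession = parts[1].strip()
            # fullmatch accepts only 'student', so 'student scholar' is rejected
            if re.match(r'a', name) and re.fullmatch(r'student', profession):
                outf.write(line)

process_records()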

---

#### Problem 4

In this problem, you will implement a user signup functionality with the following requirements: <br/>
<ol>
    <li>A username starting with either a lowercase letter or an underscore.</li>
    <li>A password at least 6 characters long, containing at least one uppercase character.</li> 
    <li>An email address starting with a letter, and containing exactly one '@' immediately followed by a letter. Feel free to try more realistic requirements as well.</li>
</ol>
You must ask the user to input each detail repeatedly until it satisfies its constraint (use a while loop).

In [18]:
def user_signup():
    # Each loop re-prompts until the input satisfies its constraint

    # Username: starts with a lowercase letter or an underscore
    user_pattern = r'^[a-z_]'
    while True:
        username = input("Username: ")
        if re.search(user_pattern, username):
            break

    # Password: at least 6 characters, at least one uppercase letter
    pass_pattern = r'[A-Z]'
    while True:
        password = input("Password: ")
        if len(password) >= 6 and re.search(pass_pattern, password):
            break

    # Email: starts with a letter, exactly one '@', and a letter right after it
    email_pattern = r'^[A-Za-z][^@]*@[A-Za-z][^@]*$'
    while True:
        email = input("Email Address: ")
        if re.search(email_pattern, email):
            break

    print(f"Signed up {username} with email {email}")

# user_signup()

---

#### Problem 5

In this problem, you will write a function to validate IP addresses. You will check and print if the IP address passed to your function is a valid IPv4 or IPv6 address. If it is not a valid IP address, print "Invalid IP address". <br/>
Read more about IPv4 and IPv6 addresses <a href = "https://www.ibm.com/support/knowledgecenter/en/STCMML8/com.ibm.storage.ts3500.doc/opg_3584_IPv4_IPv6_addresses.html">here</a>.


In [None]:
def valid_IP():
    addre = input("Enter IP address to validate: ")
    
    pattern= ''
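
The stub above leaves `pattern` empty. One possible way to fill it in is sketched below, under stated assumptions: the helper name `classify_ip` is hypothetical, it takes the address as a parameter and returns a label rather than calling `input()` and printing, IPv4 octets may not have leading zeros, and only the full eight-group IPv6 form is accepted (the `::` shorthand is not handled).

```python
import re

# One IPv4 octet: 0-255 with no leading zeros (an assumption of this sketch)
OCTET = r'(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])'
# Four octets separated by dots
IPV4 = re.compile(rf'{OCTET}(?:\.{OCTET}){{3}}')
# Full-form IPv6 only: eight groups of 1-4 hex digits separated by colons
IPV6 = re.compile(r'[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4}){7}')

def classify_ip(address):
    """Return 'IPv4', 'IPv6', or 'Invalid IP address' for the given string."""
    # fullmatch requires the whole string to match, so no ^ or $ is needed
    if IPV4.fullmatch(address):
        return 'IPv4'
    if IPV6.fullmatch(address):
        return 'IPv6'
    return 'Invalid IP address'

print(classify_ip('192.168.1.1'))
print(classify_ip('2001:0db8:85a3:0000:0000:8a2e:0370:7334'))
print(classify_ip('256.1.1.1'))
```

Spelling out the octet ranges in the pattern keeps the range check (0 to 255) inside the regular expression; an alternative would be to capture each octet with `\d{1,3}` and compare it as an integer afterwards.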