### <font color='brown'>Problem Set 5: Regular Expressions</font>

**Square Brackets []:** Square brackets are used to create character classes. They specify a set of characters from which the regex engine can match any single character. For example:

- [abc]: Matches either 'a', 'b', or 'c'.
- [a-z]: Matches any lowercase letter from 'a' to 'z'.
- [0-9]: Matches any digit from '0' to '9'.
- [^abc]: Matches any character except 'a', 'b', or 'c'.
- [a-zA-Z0-9]: Matches any ASCII alphanumeric character.
- [aeiou]: Matches any lowercase vowel.
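
The character classes above can be tried out with re.findall, which returns every single-character match. This is a minimal sketch; the sample string and variable names are chosen only for illustration.

```python
import re

# Sample text chosen only for illustration
text = "Cab 42!"

in_abc = re.findall(r'[abc]', text)          # ['a', 'b']
lowercase = re.findall(r'[a-z]', text)       # ['a', 'b']
digits = re.findall(r'[0-9]', text)          # ['4', '2']
not_abc = re.findall(r'[^abc]', text)        # ['C', ' ', '4', '2', '!']
alnum = re.findall(r'[a-zA-Z0-9]', text)     # ['C', 'a', 'b', '4', '2']
vowels = re.findall(r'[aeiou]', text)        # ['a']

print(in_abc, lowercase, digits, not_abc, alnum, vowels)
```

Note that [^abc] also matches the space and the punctuation mark, since a negated class accepts any character not listed.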


**Parentheses ():** Parentheses are used to create capturing groups or to specify the scope of alternation (the | symbol).

- (abc): Capturing group, captures and remembers 'abc' for later reference.
- (abc|def): Alternation, matches either 'abc' or 'def'.
- (?:abc): Non-capturing group, groups the pattern without capturing it.
- (ab)+: Capturing group with quantifier, matches 'ab' one or more times.
- (?=abc): Positive lookahead, succeeds only if 'abc' comes immediately after the current position, without consuming it.
- (?!abc): Negative lookahead, succeeds only if 'abc' does not come immediately after the current position.
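
A short sketch of how these group forms behave with re.search. The target strings and variable names are illustrative assumptions, not part of the problem set.

```python
import re

# (abc|def): alternation inside a capturing group
alt = re.search(r'(abc|def)', 'xxdefxx').group(1)        # 'def'

# (ab)+: the whole match covers every repetition,
# while group(1) keeps only the last repetition
rep = re.search(r'(ab)+', 'ababab!')
rep_whole, rep_last = rep.group(0), rep.group(1)         # 'ababab', 'ab'

# (?:abc): grouping without capturing, so no groups are recorded
noncap = re.search(r'(?:abc)+', 'abcabc').groups()       # ()

# Lookaheads check what follows without consuming it
pos = re.search(r'foo(?=bar)', 'foobar').group(0)        # 'foo'
neg_fail = re.search(r'foo(?!bar)', 'foobar')            # None
neg_ok = re.search(r'foo(?!bar)', 'foobaz').group(0)     # 'foo'

print(alt, rep_whole, rep_last, noncap, pos, neg_fail, neg_ok)
```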


**Neither:** You don't always need to use brackets or parentheses in a regular expression. For example:
- abc: Matches the sequence 'abc' exactly.
- a|b: Alternation, matches either 'a' or 'b'.
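- \d: Matches any digit; equivalent to [0-9] for ASCII text (in Python 3, str patterns also match other Unicode digits unless re.ASCII is set).
- \w: Matches any word character; equivalent to [a-zA-Z0-9_] when re.ASCII is set, otherwise Unicode letters and digits are included as well.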
- \s: Matches any whitespace character.
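
The shorthand classes can be checked the same way. The sketch below also shows the effect of the re.ASCII flag on \w; the sample strings are assumptions made for the example.

```python
import re

# Sample text chosen only for illustration
text = "id_7 = 42\tok"

digits = re.findall(r'\d', text)     # ['7', '4', '2']
words = re.findall(r'\w+', text)     # ['id_7', '42', 'ok']
spaces = re.findall(r'\s', text)     # [' ', ' ', '\t']

# By default \w is Unicode-aware for str patterns; re.ASCII restricts it
# to [a-zA-Z0-9_]
unicode_word = re.findall(r'\w', '\u00e9')            # ['é']
ascii_word = re.findall(r'\w', '\u00e9', re.ASCII)    # []

print(digits, words, spaces, unicode_word, ascii_word)
```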

In [2]:
import re

---

#### Problem 1

Write regular expressions for the following patterns, each of which will be used with the re.search() function on some target string:
1. A sequence for currency in dollars (> 0 and \< 1 million, no leading zeros) with at least one space before and after. The actual number will be preceded by a dollar sign, e.g. "$25".
2. A sequence for currency in dollars (> 0 and \< 1 million, no leading zeros) and cents, with at least one space before and after, e.g. "12.25". Cents must have 2 decimal places and 0 cents must be written as "00".
3. Modify the pattern for the sequence above to extract and print the dollars and cents values
4. A word (lowercase letters only) that starts and ends with the same letter
5. A sequence that allows for (1) or (2), i.e. dollars with or without cents
6. A string that starts with a letter, followed by at least one character that is not '.' (period), ',' (comma), or ';' (semicolon), and ending with any character that is not a digit.

In [8]:
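# Dollars only: > 0 and < 1 million, no leading zeros, whitespace on both sides.
# [1-9] rejects a leading zero (and $0); \d{0,5} caps the amount at six digits.
pattern = r'\s\$[1-9]\d{0,5}\s'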
res = re.search(pattern, ' $100 ')
print(res)
res = re.search(pattern, ' $431')
print(res)
res = re.search(pattern, ' $100000 ')
print(res)

<re.Match object; span=(0, 6), match=' $100 '>
None
<re.Match object; span=(0, 9), match=' $100000 '>


In [10]:
# A sequence for currency in dollars (> 0 and < 1 million, no leading zeros) and cents, with at least one space before and after, e.g. "12.25". Cents must have 2 decimal places and 0 cents must be written as "00".

# Same dollar part as before, followed by a literal '.' and exactly two cent digits
pattern_2 = r'\s\$[1-9]\d{0,5}\.\d\d\s'

res = re.search(pattern_2, ' $100.00 ')
print(res)

res = re.search(pattern_2, ' $100.69 ')
print(res)

<re.Match object; span=(0, 9), match=' $100.00 '>
<re.Match object; span=(0, 9), match=' $100.69 '>


In [11]:
#  Modify the pattern for the sequence above to extract and print the dollars and cents values

# Group 1 captures the dollars, group 2 captures the cents
pattern_3 = r'\s\$([1-9]\d{0,5})\.(\d\d)\s'
res = re.search(pattern_3, ' $100.69 ')

if res:
    inp = res.group(0)
    dollars = res.group(1)
    cents = res.group(2)
    print('Dollars: ' + dollars)
    print('Cents: ' + cents)
else:
    print('Match not found')

Dollars: 100
Cents: 69


In [12]:
# A word (lowercase letters only) that starts and ends with the same letter
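# Use a backreference (\1) to require the same letter at the beginning and the end.
# \b and [a-z]* keep the match inside a single all-lowercase word;
# the optional group lets a one-letter word count as well.
pattern_4 = r'\b([a-z])(?:[a-z]*\1)?\b'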

res = re.search(pattern_4, 'roar')
print(res)

<re.Match object; span=(0, 4), match='roar'>


In [None]:
# A sequence that allows for (1) or (2), i.e. dollars with or without cents

# The cents part is wrapped in an optional non-capturing group
pattern_5 = r'\s\$[1-9]\d{0,5}(?:\.\d\d)?\s'

In [14]:
# A string that starts with letter, followed by at least one character that is not one of '.' (period) ',' (comma) or ';' (semicolon) and ending with any character that is not a digit.

# + requires at least one character; $ anchors the final non-digit to the end of the string
pattern_6 = r'^[a-zA-Z][^.,;]+[^0-9]$'

res = re.search(pattern_6, 'aaa')
print(res)

<re.Match object; span=(0, 3), match='aaa'>
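
Edge cases such as leading zeros or seven-digit amounts are easy to miss when testing by hand. The sketch below runs a small table of cases against two currency patterns. The names DOLLARS, DOLLARS_OPT_CENTS, and matches are hypothetical, and the patterns are one possible reading of requirements (1) and (5), not the only valid answer.

```python
import re

# Assumed patterns for requirement (1) and requirement (5)
DOLLARS = r'\s\$[1-9]\d{0,5}\s'
DOLLARS_OPT_CENTS = r'\s\$[1-9]\d{0,5}(?:\.\d\d)?\s'

def matches(pattern, target):
    """Return True if re.search finds the pattern anywhere in target."""
    return re.search(pattern, target) is not None

# (pattern, target string, expected result)
cases = [
    (DOLLARS, ' $25 ', True),
    (DOLLARS, ' $025 ', False),            # leading zero
    (DOLLARS, ' $0 ', False),              # must be greater than zero
    (DOLLARS, ' $1000000 ', False),        # one million has seven digits
    (DOLLARS, '$25 ', False),              # no space before the dollar sign
    (DOLLARS_OPT_CENTS, ' $12.25 ', True),
    (DOLLARS_OPT_CENTS, ' $12 ', True),
    (DOLLARS_OPT_CENTS, ' $12.5 ', False), # cents need two digits
]

results = [matches(p, t) for p, t, _ in cases]
for (pattern, target, expected), got in zip(cases, results):
    print(repr(target), got, 'ok' if got == expected else 'UNEXPECTED')
```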


---

#### Problem 2

Write a function that takes a sentence as an input string parameter. The function should print True if the sentence starts with "The", contains "mountains" and ends with "river.", otherwise it should print False. <br/>Also, if the pattern matches, the function should split the sentence into two parts (using the regular expression split function), one each on either side of the word "mountains", and print the parts. 

In [24]:
def prob_2(string):
    # \b keeps the check consistent with the split below (whole word "mountains")
    pattern = r'^The.*\bmountains\b.*river\.$'
    res = re.search(pattern, string)
    
    if res:
        parts = re.split(r'\bmountains\b', string)
        print(parts)
        print("True")
    else:
        print("False")

prob_2('The mountains and the river')

False
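
The call above prints False because the sentence lacks the final period. To show the matching branch as well, here is a minimal sketch that returns the parts instead of printing them; split_on_mountains is a hypothetical helper name and the sentences are made up for illustration.

```python
import re

def split_on_mountains(sentence):
    """Return the two parts around 'mountains', or None if the sentence does not qualify."""
    if not re.search(r'^The.*\bmountains\b.*river\.$', sentence):
        return None
    # maxsplit=1 keeps exactly two parts even if 'mountains' appears again
    return re.split(r'\bmountains\b', sentence, maxsplit=1)

matched = split_on_mountains('The mountains feed the river.')
unmatched = split_on_mountains('The mountains and the river')

print(matched)    # ['The ', ' feed the river.']
print(unmatched)  # None
```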


---

#### Problem 3

You are given a file named [ps4-5_in.txt](ps4-5_in.txt) with multiple records, each on a separate line. Each record has person name, profession, and school name, separated by a comma. You are required to extract only those records for which person name starts with 'a' and profession is student, and write the extracted records to a new output file called *ps5-3_out.txt*. You must use regular expressions to do the extraction for each record.

Note: 'student scholar' is a different profession than 'student'
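
Before touching the file, the filtering rule can be tried on a few in-memory records. This is a sketch under assumptions: the sample records are invented, and record_re is a hypothetical single pattern that checks both the name and the profession field in one pass.

```python
import re

# Hypothetical sample records: name, profession, school
records = [
    "alice, student, MIT",
    "adam, student scholar, Yale",
    "bob, student, UCLA",
    "amy,student,Rutgers",
]

# Name starts with 'a'; the second field is exactly 'student',
# allowing optional whitespace around it
record_re = re.compile(r'^\s*a[^,]*,\s*student\s*,')

selected = [rec for rec in records if record_re.search(rec)]
print(selected)
```

The trailing comma in the pattern is what separates 'student' from 'student scholar': after 'student' only whitespace may appear before the next field separator.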


In [34]:
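def process_records():
    # Open both files in one with-statement so they are closed even if an error occurs
    with open("./scripts/ps4-5_in.txt") as inf, open("./scripts/ps4-5_out.txt", "w") as outf:
        for line in inf:
            parts = line.split(",")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            name = parts[0].strip()
            profession = parts[1].strip()

            # re.match anchors at the start: the name must begin with 'a',
            # and the profession must be exactly 'student'
            if re.match('a', name) and re.match('student$', profession):
                outf.write(line)

process_records()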

FileNotFoundError: [Errno 2] No such file or directory: './scripts/ps4-5_in.txt'

---

#### Problem 4

In this problem, you will implement a user signup functionality with the following requirements: <br/>
<ol>
    <li>A username starting with either a lowercase letter or an underscore.</li>
    <li>A password at least 6 characters long, containing at least one uppercase character.</li> 
    <li>An email address starting with a letter, and containing exactly one '@' immediately followed by a letter. Feel free to try more realistic requirements as well.</li>
</ol>
You must ask the user to input the details one by one until each one of them abides by the constraints (while loop!).

In [37]:
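def user_signup():
    # Username: starts with a lowercase letter or an underscore
    user_pattern = r'^[a-z_]'
    # Password: at least 6 characters, with at least one uppercase letter
    pass_pattern = r'^(?=.*[A-Z]).{6,}$'
    # Email: starts with a letter, exactly one '@', immediately followed by a letter
    email_pattern = r'^[A-Za-z][^@]*@[A-Za-z][^@]*$'

    # Keep asking for each field until it satisfies its pattern
    username = input("Username: ")
    while not re.search(user_pattern, username):
        username = input("Invalid username. Username: ")

    password = input("Password: ")
    while not re.search(pass_pattern, password):
        password = input("Invalid password. Password: ")

    email = input("Email Address: ")
    while not re.search(email_pattern, email):
        email = input("Invalid email address. Email Address: ")

    print("Signup successful")

user_signup()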

---

#### Problem 5

In this problem, you will write a function to validate IP addresses. You will check and print if the IP address passed to your function is a valid IPv4 or IPv6 address. If it is not a valid IP address, print "Invalid IP address". <br/>
Read more about IPv4 and IPv6 addresses <a href = "https://www.ibm.com/support/knowledgecenter/en/STCMML8/com.ibm.storage.ts3500.doc/opg_3584_IPv4_IPv6_addresses.html">here</a>.


In [None]:
def valid_IP():
    addre = input("Enter IP address to validate: ")
    
    pattern= ''
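
One possible way to finish the stub above is sketched here, under stated assumptions. classify_ip is a hypothetical helper that returns the message instead of printing it and takes the address as a parameter rather than calling input(). IPv4 octets are assumed to be 0 to 255 without leading zeros, and only the full eight-group IPv6 form is accepted; compressed forms using '::' are deliberately left out.

```python
import re

# One IPv4 octet: 0-255, no leading zeros (assumption)
OCTET = r'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'
# Four octets separated by dots; re.ASCII keeps \d limited to 0-9
IPV4_RE = re.compile(rf'{OCTET}(?:\.{OCTET}){{3}}', re.ASCII)
# Full (uncompressed) IPv6 only: eight groups of 1-4 hex digits
IPV6_RE = re.compile(r'[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4}){7}')

def classify_ip(address):
    """Return a message saying whether address is IPv4, IPv6, or invalid."""
    if IPV4_RE.fullmatch(address):
        return "Valid IPv4 address"
    if IPV6_RE.fullmatch(address):
        return "Valid IPv6 address"
    return "Invalid IP address"

for candidate in ['192.168.0.1', '256.1.1.1',
                  '2001:0db8:85a3:0000:0000:8a2e:0370:7334', '2001:db8::1']:
    print(candidate, '->', classify_ip(candidate))
```

fullmatch is used so that the whole input must be an address, which avoids accepting strings with extra characters before or after it.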