## AIM 5001 Regular Expressions + String Processing Sample Solution

#### Task 1: Using regular expressions, extract the Year Born / Year Deceased for each President from the unformatted text string shown above and store them in two separate Python list objects, i.e., one list containing the Year Born values and one list containing the Year Deceased values. ####

**Unformatted Text:** 
"*B/D1732-1799George Washington---PARTY---Unaffiliated—-SERVED:1789 – 1797, VP = John Adams;#?!*****B/D1735-1826John Adams---PARTY---Federalist—-SERVED:1797 – 1801,VP: Thomas Jefferson;*****B/D1743-1826Thomas Jefferson ---PARTY---Democratic-Republican —-SERVED:1801 – 1809, VP = Aaron Burr, George Clinton;##*****B/D1751-1836James Madison ---PARTY---Democratic-Republican —-SERVED:1809 – 1817, VP = George Clinton, Elbridge Gerry;?!C****B/D1758-1831James Monroe ---PARTY---Democratic-Republican —-SERVED:1817 – 1825, VP = Daniel D Tompkins;"

In [1]:
import re

#initialize data
text_data = '''*B/D1732-1799George Washington---PARTY---Unaffiliated—-SERVED:1789 – 1797, VP = John Adams;#?!*****B/D1735-1826John Adams---PARTY---Federalist—-SERVED:1797 – 1801,VP: Thomas Jefferson;*****B/D1743-1826Thomas Jefferson ---PARTY---Democratic-Republican —-SERVED:1801 – 1809, VP = Aaron Burr, George Clinton;##*****B/D1751-1836James Madison ---PARTY---Democratic-Republican —-SERVED:1809 – 1817, VP = George Clinton, Elbridge Gerry;?!C****B/D1758-1831James Monroe ---PARTY---Democratic-Republican —-SERVED:1817 – 1825, VP = Daniel D Tompkins;'''

text_data #prints initial unformatted text data

'*B/D1732-1799George Washington---PARTY---Unaffiliated—-SERVED:1789 – 1797, VP = John Adams;#?!*****B/D1735-1826John Adams---PARTY---Federalist—-SERVED:1797 – 1801,VP: Thomas Jefferson;*****B/D1743-1826Thomas Jefferson ---PARTY---Democratic-Republican —-SERVED:1801 – 1809, VP = Aaron Burr, George Clinton;##*****B/D1751-1836James Madison ---PARTY---Democratic-Republican —-SERVED:1809 – 1817, VP = George Clinton, Elbridge Gerry;?!C****B/D1758-1831James Monroe ---PARTY---Democratic-Republican —-SERVED:1817 – 1825, VP = Daniel D Tompkins;'

**Approach:** Splitting on the pattern `[*]+` breaks the text at every run of asterisks, so each non-empty piece holds the record for one president (the first piece is an empty string because the text begins with `*`). The for loop then applies `B/D([0-9]{4})-([0-9]{4})` to the start of each piece with `re.match`; the two capture groups hold the year born and the year deceased. A piece contributes to the lists only if it matches, which also skips the empty first piece:

In [2]:
# split the unformatted text into one piece per president, then capture Year Born / Year Deceased from the start of each piece
YB = []
YD = []

for section in re.split(r'[*]+', text_data):
    match = re.match(r'B/D([0-9]{4})-([0-9]{4})', section)
    if match:
        YB.append(match.group(1))
        YD.append(match.group(2))

print(f'Year Born:{YB}')
print(f'Year Deceased:{YD}')

Year Born:['1732', '1735', '1743', '1751', '1758']
Year Deceased:['1799', '1826', '1826', '1836', '1831']
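
As a cross-check, the same two lists can be built in a single pass with `re.findall`, which returns one `(born, deceased)` tuple per match when the pattern has two groups. The sketch below uses a shortened, hypothetical `sample` string in the same layout as `text_data`; the names `sample`, `sections`, and `pairs` are illustrative and not part of the original solution.

```python
import re

# Shortened, hypothetical sample in the same layout as text_data
sample = "*B/D1732-1799George Washington;#?!*****B/D1735-1826John Adams;"

# Splitting on runs of '*' leaves an empty first element because the text starts with '*'
sections = re.split(r"[*]+", sample)
print(sections)

# One-pass alternative: findall returns a (born, deceased) tuple per match
pairs = re.findall(r"B/D([0-9]{4})-([0-9]{4})", sample)
years_born = [born for born, _ in pairs]
years_deceased = [died for _, died in pairs]
print(years_born, years_deceased)
```

The split-then-match version guards against a stray `B/D` in the middle of a record, while the `findall` version is shorter; either is reasonable for this text.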


#### Task 2: Extract the Name of each President from the unformatted text string shown above and store the extracted names in a Python list object. ####

**Approach:** `re.findall` returns a list of the substrings captured by the pattern's group. Each president's name immediately follows the year deceased and consists of a first name and a last name separated by whitespace, so the pattern `B/D[0-9]{4}-[0-9]{4}(\w+\s+\w+)` captures the names.

In [3]:
pattern = r'B/D[0-9]{4}-[0-9]{4}(\w+\s+\w+)'
name = re.findall(pattern, text_data)
print(name)

['George Washington', 'John Adams', 'Thomas Jefferson', 'James Madison', 'James Monroe']


#### Task 3: Using regular expressions, extract the Name of Political Party for each President from the unformatted text string shown above and store the extracted political party names in a Python list object. ####

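**Approach:** The pattern `PARTY---([\w-]+)\s*—-` captures the party name: the run of word characters and hyphens that follows `PARTY---`, up to the optional space and the `—-` delimiter that precedes `SERVED`.

In [4]:
pattern = r'PARTY---([\w-]+)\s*—-'
party = re.findall(pattern, text_data)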
print(party)

['Unaffiliated', 'Federalist', 'Democratic-Republican', 'Democratic-Republican', 'Democratic-Republican']
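
The delimiter before `SERVED` is not a plain pair of hyphens: it appears to be an em dash (U+2014) followed by an ASCII hyphen, which is easy to mistype. A minimal sketch, assuming that is the character in the source text, writes it as the escape `\u2014` so the pattern does not depend on how the file is typed or encoded. The `sample` string is a shortened, hypothetical excerpt.

```python
import re

# Shortened, hypothetical excerpt; "\u2014" is the em dash assumed to precede "-SERVED"
sample = (
    "B/D1732-1799George Washington---PARTY---Unaffiliated\u2014-SERVED:1789;"
    "*****B/D1743-1826Thomas Jefferson ---PARTY---Democratic-Republican \u2014-SERVED:1801;"
)

# [\w-]+ captures word characters and hyphens; \s* absorbs the optional space before the delimiter
party_pattern = r"PARTY---([\w-]+)\s*\u2014-"
parties = re.findall(party_pattern, sample)
print(parties)
```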


#### Task 4: Using regular expressions, extract the Name(s) of Vice Presidents for each President from the unformatted text string shown above and store the extracted names in a Python dictionary object wherein the key:value pairs are created using the name of each of the first five Presidents of the United States as the dictionary’s key values and the names of their associated vice presidents being instantiated as data values for each President. Note that only one key:value pair should appear within the resulting dictionary for each President (i.e., one entry for each President). ####

**Approach:** As in task 2, we extract the name of each president into a list. The pattern `VP[\s:=]+([\w\s,]+);` captures the vice presidents listed for each president: it skips the `:` or `=` and any spaces after `VP`, then captures everything up to the closing `;`. The for loop pairs each president with the corresponding capture and builds the dictionary, using the president's name as the key and the list of vice presidents, obtained by splitting the capture on `', '`, as the value.

In [5]:
dic = {}
pre_match = r'B/D[0-9]{4}-[0-9]{4}(\w+\s+\w+)'
pre_name = re.findall(pre_match, text_data)

vi_match = r'VP[\s:=]+([\w\s,]+);'
vi_name = re.findall(vi_match, text_data)

# pair each president with the vice presidents captured from the same record
for president, vices in zip(pre_name, vi_name):
    dic[president] = vices.split(', ')
print(dic)

{'George Washington': ['John Adams'], 'John Adams': ['Thomas Jefferson'], 'Thomas Jefferson': ['Aaron Burr', 'George Clinton'], 'James Madison': ['George Clinton', 'Elbridge Gerry'], 'James Monroe': ['Daniel D Tompkins']}
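
Pairing the two `findall` results by position relies on both patterns matching exactly once per record. A hedged sketch of a slightly more defensive version follows; the `sample` string is a simplified, hypothetical excerpt, and the names `presidents`, `vp_strings`, and `vp_by_president` are illustrative rather than part of the original solution.

```python
import re

# Simplified, hypothetical excerpt with two presidents
sample = (
    "*B/D1735-1826John Adams---PARTY---Federalist SERVED:1797-1801,VP: Thomas Jefferson;"
    "*****B/D1743-1826Thomas Jefferson ---PARTY---Democratic-Republican "
    "SERVED:1801-1809, VP = Aaron Burr, George Clinton;"
)

presidents = re.findall(r"B/D[0-9]{4}-[0-9]{4}(\w+\s+\w+)", sample)
vp_strings = re.findall(r"VP[\s:=]+([\w\s,]+);", sample)

# Pairing by position is only safe when both patterns matched once per record
if len(presidents) != len(vp_strings):
    raise ValueError("president and vice-president matches are out of step")

# Split on a comma plus optional whitespace so "A,B" and "A, B" both work
vp_by_president = {
    president: re.split(r",\s*", vps.strip())
    for president, vps in zip(presidents, vp_strings)
}
print(vp_by_president)
```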


#### Task 5: Using your newly created list and dictionary objects, complete the following tasks: ####

##### (a). Use your regex and string processing skills to rearrange the content of the list of names of Presidents so that all elements conform to the standard “last name, first name”; then, arrange the list in alphabetical order on the basis of the first letter of the last name of each president. #####

**Approach:** Use `split(' ')` to separate the first and last names, then join them in reverse order to conform to the standard “last name, first name”. `res.sort()` then sorts the list alphabetically; because every element now begins with the last name, the presidents end up ordered by last name.

In [6]:
res = [x.split(' ')[1]+', '+x.split(' ')[0] for x in pre_name] # exchange the order of name to conform “last name, first name”
res.sort()
print(res)

['Adams, John', 'Jefferson, Thomas', 'Madison, James', 'Monroe, James', 'Washington, George']
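
Since the task mentions regex skills, the same rearrangement can also be written with `re.sub` and backreferences. This is a sketch of an alternative, not the original solution; the `names` list repeats the output of task 2, and `swapped` and `ordered` are illustrative names.

```python
import re

names = ["George Washington", "John Adams", "Thomas Jefferson",
         "James Madison", "James Monroe"]

# Capture the first and last name, then emit them in swapped order
swapped = [re.sub(r"^(\w+)\s+(\w+)$", r"\2, \1", name) for name in names]

# A plain string sort compares the whole "last, first" string
ordered = sorted(swapped)
print(ordered)
```

Note that the pattern assumes exactly two name parts; a name with a middle initial would be left unchanged by this substitution.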


##### (b). Use your Python skills to create a new dictionary object containing the total duration of each President’s lifespan. The resulting dictionary object should use the name of each president as key values while their lifespan is used to populate the associated data values for each key:value pair within the dictionary. Then, using your new dictionary object, calculate the AVERAGE lifespan of the first five Presidents of the United States. #####

**Approach:** Convert the year born and the year deceased to integers and subtract the former from the latter to get each lifespan (a difference of calendar years, since the text gives no exact birth or death dates). A for loop builds the dictionary with each president's name as the key and the lifespan as the value. Summing the lifespans and dividing by the number of presidents gives the average lifespan.

In [7]:
lifespan = {}
for i in range(len(pre_name)):
    lifespan[pre_name[i]] = int(YD[i]) - int(YB[i])  # lifespan in calendar years

# average lifespan, computed from the dictionary values
print(sum(lifespan.values()) / len(lifespan))

79.8
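
The arithmetic can be checked by hand: the five lifespans are 67, 91, 83, 85, and 73 years, which sum to 399, and 399 / 5 = 79.8. A compact sketch of the same calculation, using a dictionary comprehension and `statistics.mean`, is shown below. The three input lists repeat the outputs of tasks 1 and 2; the variable names are illustrative.

```python
from statistics import mean

names = ["George Washington", "John Adams", "Thomas Jefferson",
         "James Madison", "James Monroe"]
years_born = ["1732", "1735", "1743", "1751", "1758"]
years_deceased = ["1799", "1826", "1826", "1836", "1831"]

# zip keeps each name aligned with its own pair of years
lifespans = {
    name: int(died) - int(born)
    for name, born, died in zip(names, years_born, years_deceased)
}

average = mean(list(lifespans.values()))
print(lifespans)
print(average)
```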


##### (c). Using your regex skills and the dictionary object created in Question 4 (above), construct a new dictionary object indicating whether each Vice President who served during the terms of the first five Presidents of the United States has either a ‘G’ or a ‘J’ anywhere within their first name. The resulting dictionary should be comprised of one entry for each Vice President, wherein the key value is the Vice President’s name and the associated data value contains either the Python keyword ‘TRUE’ or the Python keyword ‘FALSE’. #####

**Approach:** Each president has one or more vice presidents, so two nested for loops walk through every vice president in the dictionary created in task 4. `split(' ')[0]` extracts the first name, and `'G' in first_name or 'J' in first_name` evaluates to True or False depending on whether that first name contains an uppercase 'G' or 'J'. The test is case-sensitive, so the lowercase 'g' in 'Elbridge' does not count.

In [8]:
vice_contain = {}
for i in dic.keys():
    for vice in dic[i]:
        first_name = vice.split(' ')[0]
        vice_contain[vice] = 'G' in first_name or 'J' in first_name

vice_contain

{'John Adams': True,
 'Thomas Jefferson': False,
 'Aaron Burr': False,
 'George Clinton': True,
 'Elbridge Gerry': False,
 'Daniel D Tompkins': False}
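
The task asks for regex skills, so here is a hedged sketch of a regex-based equivalent. `re.match` anchors at the start of the name and `\w*` cannot cross the space, so only the first name is examined. The `vice_presidents` list repeats the names from task 4; the case-insensitive variant is an assumption about one possible reading of the task, not something the assignment states.

```python
import re

vice_presidents = ["John Adams", "Thomas Jefferson", "Aaron Burr",
                   "George Clinton", "Elbridge Gerry", "Daniel D Tompkins"]

# \w* stays inside the first name; [GJ] is the letter being tested (case-sensitive)
has_g_or_j = {vp: bool(re.match(r"\w*[GJ]", vp)) for vp in vice_presidents}

# Variant that also counts lowercase 'g' and 'j'
has_g_or_j_any_case = {
    vp: bool(re.match(r"\w*[GJ]", vp, flags=re.IGNORECASE))
    for vp in vice_presidents
}
print(has_g_or_j)
print(has_g_or_j_any_case)
```

The only name whose answer depends on case is 'Elbridge Gerry', because 'Elbridge' contains a lowercase 'g'.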

##### (d). Using your regex skills and the dictionary object created in Question 4 (above), construct a new dictionary object indicating whether each Vice President who served during the terms of the first five Presidents of the United States has a middle/second name or middle initial. The resulting dictionary should be comprised of one entry for each Vice President, wherein the key value is the Vice President’s name and the associated data value contains either the Python keyword ‘TRUE’ or the Python keyword ‘FALSE’. #####

**Approach:** To check whether a vice president has a middle/second name or middle initial, we split the name with `split(' ')` and count the resulting parts. More than two parts means the name includes a middle/second name or a middle initial.

In [9]:
vice_has_middle={}
for i in dic.keys():
    for vice in dic[i]:
        n = len(vice.split(' '))
        vice_has_middle[vice]=n>2
vice_has_middle

{'John Adams': False,
 'Thomas Jefferson': False,
 'Aaron Burr': False,
 'George Clinton': False,
 'Elbridge Gerry': False,
 'Daniel D Tompkins': True}
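
A regex-based sketch of the same check is shown below, assuming a name counts as having a middle part when at least one extra whitespace-separated token sits between the first and last name. The pattern and the names `middle_pattern` and `has_middle` are illustrative, not part of the original solution.

```python
import re

vice_presidents = ["John Adams", "Thomas Jefferson", "Aaron Burr",
                   "George Clinton", "Elbridge Gerry", "Daniel D Tompkins"]

# first name, one or more middle tokens (optionally ending in '.'), last name
middle_pattern = re.compile(r"\w+(?:\s+\w+\.?)+\s+\w+")

# fullmatch requires the whole name to fit the pattern
has_middle = {vp: bool(middle_pattern.fullmatch(vp)) for vp in vice_presidents}
print(has_middle)
```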

#### Task 6: Consider the character string ‘FIdD1E7h=’. We would like to match this string using the regular expression “[a-zA-Z]*[^,]=”, but the regular expression fails to match the text string. Explain why the regular expression fails and correct it. ####

Ans: The regex `[a-zA-Z]*[^,]=` matches zero or more letters, then exactly one character that is not a comma, then '='. In 'FIdD1E7h=' the leading run of letters 'FIdD' is followed by four more characters, '1E7h', before the '='. A single `[^,]` cannot cover those four characters, so `re.match` finds no match from the start of the string. Adding `+` after `[^,]` lets that set match one or more non-comma characters, which covers the mixed digits and letters up to the '='.

In [10]:
match = '[a-zA-Z]*[^,]+='
re.match(match,'FIdD1E7h=')

<re.Match object; span=(0, 9), match='FIdD1E7h='>
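
The failure can be observed directly. The sketch below compares the original and corrected patterns; the stricter `alnum_only` pattern is a hypothetical alternative that accepts only letters and digits before the '=', not something the task requires.

```python
import re

text = "FIdD1E7h="
original = r"[a-zA-Z]*[^,]="
corrected = r"[a-zA-Z]*[^,]+="
alnum_only = r"[a-zA-Z0-9]+="  # hypothetical stricter alternative

# Anchored at the start, the original pattern cannot get past the digit '1'
print(re.match(original, text))

# Unanchored, it only finds the last two characters
print(re.search(original, text).group())

# The corrected pattern covers the whole string
print(re.match(corrected, text).group())
print(re.fullmatch(alnum_only, text) is not None)
```

So whether the original pattern "fails" depends on the function used: `re.match` returns `None`, while `re.search` succeeds but returns only 'h='.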

#### Task 7: Consider the character string “The spy was carefully disguised”. We would like to extract only the adverb ‘carefully’ from the string. To do so we write the regular expression “^D\s+ly()+”. Explain why this fails and correct the expression. ####

Ans: The regex `^D\s+ly()+` does not match the string. `^` anchors the match to the start of the string, `D` is a literal uppercase 'D', `\s+` requires one or more whitespace characters, and `ly` must follow immediately; the trailing `()+` is an empty group that captures nothing. The string starts with 'The', so the match fails at the first character. To extract the adverb 'carefully', the regex should be `\b([a-z]+ly)\b`, which matches a complete word of lowercase letters ending in 'ly'.

In [11]:
match = r"\b([a-z]+ly)\b"
re.findall(match,'The spy was carefully disguised')[0]

'carefully'
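
A short sketch comparing the two patterns on the sentence from the task follows. The second sentence and the `re.IGNORECASE` variant are hypothetical additions that illustrate one limitation of `[a-z]`: it misses an adverb that starts with a capital letter.

```python
import re

sentence = "The spy was carefully disguised"
original = r"^D\s+ly()+"
corrected = r"\b([a-z]+ly)\b"

# The original pattern needs the string to start with 'D', so it finds nothing
print(re.search(original, sentence))

# The corrected pattern picks out the whole word ending in 'ly'
adverbs = re.findall(corrected, sentence)
print(adverbs)

# Hypothetical second sentence: a capitalized adverb needs case-insensitive matching
other = "Quietly, the spy left"
strict = re.findall(corrected, other)
relaxed = re.findall(corrected, other, flags=re.IGNORECASE)
print(strict, relaxed)
```

Keep in mind that the pattern selects words by spelling only, so it would also return non-adverbs that end in 'ly'.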