### <font color='brown'>Problem Set 6: Regular Expressions - Solution</font>

In [2]:
import re

---

#### Problem 1

Write a re.sub statement for each of the following:
1. Replace all question marks in a string with '?!'
2. Replace all but the first and last character of a string with '###'
3. Given a string with a single '|' separator between its two parts, replace it with a string that flips the parts and substitutes a ';' (semicolon) for the '|'

#### Solution

1. Replace all question marks in a string with '?!'

In [77]:
res = re.sub(r'\?','?!','hey? ')
print(res)
res = re.sub(r'\?','?!','Some rand?m stuff...')
print(res)

hey?! 
Some rand?!m stuff...


**Since '?' is a metacharacter (a quantifier), we need to escape it with '\\'. Without the '\\', the pattern is invalid and `re.sub` raises an error.**
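
As a minimal sketch of the point above, the cell below checks that the unescaped pattern fails to compile and then shows two other ways to match a literal question mark: `re.escape` and a character class. The helper name `add_bang` is hypothetical and is not part of the problem set.

```python
import re

# An unescaped '?' is a quantifier with nothing to repeat, so compiling it fails.
try:
    re.compile('?')
    compiled = True
except re.error:
    compiled = False
print(compiled)

def add_bang(text):
    """Replace every literal '?' with '?!' (hypothetical helper)."""
    # re.escape builds the escaped pattern, so we do not type the backslash ourselves.
    return re.sub(re.escape('?'), '?!', text)

print(add_bang('hey? '))

# Inside a character class, '?' loses its special meaning.
print(re.sub('[?]', '?!', 'Some rand?m stuff...'))
```

`re.escape` is convenient when the text to match comes from a variable and may contain any metacharacter.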

2. Replace all but the first and last character of a string with '###'

In [80]:
res = re.sub('^(.).*(.)$',r'\1###\2','r2d2')
print(res)

r###2


**Make sure to use a raw string for the replacement, since we are back-referencing the captures with '\\1' and '\\2'.**
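
The sketch below illustrates why the raw string matters here. In a regular string literal, Python reads `'\1'` as the control character with code 1 before `re` ever sees it, so no backreference takes place. The variable names are chosen for illustration only.

```python
import re

# Regular string: '\1' and '\2' are octal escapes (control characters).
plain = '\1###\2'
# Raw string: the backslashes survive and reach the re module as backreferences.
raw = r'\1###\2'

# The regular string is shorter because each escape collapsed to one character.
print(len(plain), len(raw))

broken = re.sub('^(.).*(.)$', plain, 'r2d2')
fixed = re.sub('^(.).*(.)$', raw, 'r2d2')

print(repr(broken))  # control characters instead of the captured letters
print(fixed)
```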

3. Given a string with a single '|' separator between its two parts, replace it with a string that flips the parts and substitutes a ';' (semicolon) for the '|'

In [82]:
res = re.sub(r'^(.*)\|(.*)$',r'\2;\1','food and beverages|$45.55')
print(res)

$45.55;food and beverages


**Since '|' is a metacharacter (alternation), we need to escape it with '\\'. Without the '\\', the pattern becomes a choice between two alternatives and the parts are not flipped.**
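
The solution relies on the input having exactly one '|'. As a hedged variation, the sketch below uses `[^|]*` in place of `.*` so that the pattern only matches when there is a single separator, and leaves any other input unchanged. The helper name `flip_parts` is hypothetical.

```python
import re

def flip_parts(text):
    """Swap the two parts around a single '|' and join them with ';' (hypothetical helper)."""
    # [^|]* cannot cross a '|', so the whole pattern matches only
    # when the string contains exactly one separator.
    return re.sub(r'^([^|]*)\|([^|]*)$', r'\2;\1', text)

print(flip_parts('food and beverages|$45.55'))
print(flip_parts('no separator here'))  # no match, returned unchanged
print(flip_parts('a|b|c'))              # two separators, returned unchanged
```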

---

#### Problem 2
Given a text input that mimics a student table, e.g.<br/>
Sample input: "19100 COM Networks
19101 MAT Calculus
19102 MAT Algebra
19103 BIO Microbiology"
<ol>
<li>Extract all the student IDs (5 digits), department codes (3-letter codes) and majors (more than 3 letters)<br/>
Expected output: ['19100', '19101', '19102', '19103']
['COM', 'MAT', 'MAT', 'BIO']
['Networks', 'Calculus', 'Algebra', 'Microbiology']
</li>
<li>Extract tuples such that each tuple contains comma separated student information.<br/>
Expected output: [('19100', 'COM', 'Networks'), ('19101', 'MAT', 'Calculus'), ('19102', 'MAT', 'Algebra'), ('19103', 'BIO', 'Microbiology')]</li>

</ol>



#### Solution

In [4]:
table = "19100 COM Networks 19101 MAT Calculus 19102 MAT Algebra 19103 BIO Microbiology"
print("********** Part 1 **********")
ids = re.findall('[0-9]{5}', table)
codes = re.findall('[A-Z]{3}', table)
majors = re.findall('[A-Za-z]{4,}', table)
print(ids)
print(codes)
print(majors)
print("********** Part 2 **********")
studentTuple = r'([0-9]{5})\s*([A-Z]{3})\s*([A-Za-z]{4,})'
studentDetails = re.findall(studentTuple, table)
print(studentDetails)

********** Part 1 **********
['19100', '19101', '19102', '19103']
['COM', 'MAT', 'MAT', 'BIO']
['Networks', 'Calculus', 'Algebra', 'Microbiology']
********** Part 2 **********
[('19100', 'COM', 'Networks'), ('19101', 'MAT', 'Calculus'), ('19102', 'MAT', 'Algebra'), ('19103', 'BIO', 'Microbiology')]
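
As an optional variation on Part 2, the sketch below uses named groups and `re.finditer` to build one dictionary per student, which avoids relying on tuple positions. It assumes the newline-separated form of the sample input from the problem statement. The group names (`id`, `dept`, `major`) and the variable names are chosen for illustration.

```python
import re

# Sample input with one student per line, as shown in the problem statement.
table = "19100 COM Networks\n19101 MAT Calculus\n19102 MAT Algebra\n19103 BIO Microbiology"

# \s+ matches spaces and newlines alike, so the same pattern
# also works on the single-line version of the table.
student_pattern = re.compile(
    r'(?P<id>\d{5})\s+(?P<dept>[A-Z]{3})\s+(?P<major>[A-Za-z]{4,})'
)

# groupdict() returns a dict keyed by the group names.
records = [m.groupdict() for m in student_pattern.finditer(table)]
for record in records:
    print(record)
```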


---

#### Problem 3

In this problem, you will perform some HTML code parsing. Given an anchor tag (`<a>..</a>`), with a link in its href attribute and some link text, write a function to extract and print the domain name of the link and the link text. Consider the following example anchor tag to write your code.<br/>

E.g., for the input: `<a href="https://www.foxnews.com/politics/">Fox News</a>`, the domain name will be "foxnews.com" and the link text will be "Fox News".

Another example: `<a href="https://support.apple.com/mac">Mac Support Page</a>`. Here, the domain name will be "apple.com" and the link text will be "Mac Support Page".

Assume the same format for the anchor tag, but the URL pattern can vary as seen in the two examples above. Note that the domain extensions can vary and won't always be `.com`.

<b>Note</b>: The input contains double quotes around the `href` value. If you write the input as a double-quoted Python string, those inner quotes end the string early and cause a syntax error. Replace each `"` with `\"` to avoid this.

#### Solution

In [91]:
def htmlParser(tag):
    domainName = re.sub('.*//', '', tag)
    domainName = re.sub('[/\"].*', '', domainName)
    entities = re.split(r"\.", domainName)
    domainName = ".".join(entities[-2:])
    linkName = re.sub('</a>','', tag)
    linkName = re.sub('.*>','', linkName)
    
    print("Domain Name: " + domainName)
    print("Link Name: " + linkName)
    

In [92]:
htmlParser("<a href=\"https://www.foxnews.com/politics/\">Fox News</a>")
htmlParser("<a href=\"https://support.apple.com/mac\">Mac Support Page</a>")
htmlParser("<a href=\"https://support.apple.com\">Support Page</a>")
htmlParser("<a href=\"https://newbrunswick.rutgers.edu/research\">Rutgers NB Research</a>")

Domain Name: foxnews.com
Link Name: Fox News
Domain Name: apple.com
Link Name: Mac Support Page
Domain Name: apple.com
Link Name: Support Page
Domain Name: rutgers.edu
Link Name: Rutgers NB Research
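
The same extraction can also be sketched with a single pattern and two capture groups, returning the values instead of printing them. This is one possible variation, not the reference solution. The names `ANCHOR` and `parse_anchor` are hypothetical, and the sketch keeps the solution's assumption that the domain is the last two dot-separated labels of the host.

```python
import re

# Group 1: the host (everything after '://' up to the first '/' or '"').
# Group 2: the link text between the opening and closing tags.
ANCHOR = re.compile(r'<a\s+href="[a-z]+://([^/"]+)[^"]*">(.*?)</a>')

def parse_anchor(tag):
    """Return (domain, link_text), or None if the tag does not match (hypothetical helper)."""
    match = ANCHOR.search(tag)
    if match is None:
        return None
    host, link_text = match.groups()
    # Assumption carried over from the solution above: keep the last two labels.
    domain = ".".join(host.split(".")[-2:])
    return domain, link_text

# Single-quoted Python strings avoid having to escape the inner double quotes.
print(parse_anchor('<a href="https://www.foxnews.com/politics/">Fox News</a>'))
print(parse_anchor('<a href="https://support.apple.com">Support Page</a>'))
print(parse_anchor('not an anchor tag'))
```

Keeping only the last two labels gives the wrong answer for hosts such as `www.bbc.co.uk`, where it yields `co.uk`. Handling those cases needs a list of public suffixes, which is outside the scope of this problem.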


---

#### Problem 4

Given an HTML code snippet, e.g. ```"<html><body><h1><div><h2>Responsive Sidebar Example</h2><title><p>First paragraph.</p></ol><p>Second paragraph.</p></li><h3>Resize the browser window to see the effect.</h3></div></body></html>"```:

<ol>

<li>Extract all the distinct opening and closing tags that are present.<br/> 
</li>
<li>Extract all the distinct opening tags that do not have corresponding closing tags and all the distinct closing tags that do not have a corresponding opening tag.</li></ol>

For the sample snippet above,<br/>
The opening tags are: ```'<html>', '<body>', '<div>', '<h1>', '<h2>', '<title>', '<p>', '<h3>' ```<br/>
The closing tags are: ```'</h2>', '</p>', '</ol>', '</li>', '</h3>', '</div>', '</body>', '</html>' ```<br/>
Tags opened but not closed: ```'<title>', '<h1>'``` <br/>
Tags closed but not opened: ```'</li>', '</ol>'```


#### Solution

In [None]:
import re
code = "<html><body><h1><div><h2>Responsive Sidebar Example</h2><title><p>First paragraph.</p></ol><p>Second paragraph.</p></li><h3>Resize the browser window to see the effect.</h3></div></body></html>"
tags = re.findall('<.*?>', code)
distinctTags = set(tags)

openingTags = set(re.findall('<[a-z0-9]*?>', code))
closingTags = set(re.findall('</[a-z0-9]*?>', code))

print(openingTags)
print(closingTags)

openNoClose = []
closeNoOpen = []
for x in openingTags:
    closing = re.sub('<', '</', x)
    if closing not in closingTags:
        openNoClose.append(x)

for x in closingTags:
    opening = re.sub('/', '', x)
    if opening not in openingTags:
        closeNoOpen.append(x)

print(openNoClose)
print(closeNoOpen)

{'<h2>', '<html>', '<title>', '<p>', '<body>', '<div>', '<h3>', '<h1>'}
{'</p>', '</html>', '</li>', '</body>', '</h2>', '</h3>', '</div>', '</ol>'}
['<title>', '<h1>']
['</li>', '</ol>']
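
Part 2 can also be sketched with set differences on the tag names, which replaces the two loops. This is an alternative under the same assumptions as the solution above (lowercase tag names, no attributes). It reports bare tag names such as `h1` in place of `<h1>`, and sorts the results so the output order is stable. The variable names are chosen for illustration.

```python
import re

code = "<html><body><h1><div><h2>Responsive Sidebar Example</h2><title><p>First paragraph.</p></ol><p>Second paragraph.</p></li><h3>Resize the browser window to see the effect.</h3></div></body></html>"

# One pass over the snippet: group 1 is '/' for a closing tag and '' for an
# opening tag, group 2 is the tag name.
found = re.findall(r'<(/?)([a-z0-9]+)>', code)

opened = {name for slash, name in found if not slash}
closed = {name for slash, name in found if slash}

# Set difference gives the unmatched names directly.
open_no_close = sorted(opened - closed)
close_no_open = sorted(closed - opened)

print(open_no_close)
print(close_no_open)
```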


---