# Regular Expressions

Tip: To build and test regular expressions, you can use an online regex tester such as [regex101](https://regex101.com/). It helps you write regular expressions and, by explaining how each part of a pattern matches, also helps you learn them.

In [None]:
import re

<details>    
<summary>
    <font size="4" color="darkgreen"><b>1. re.findall()</b></font>
</summary>
<p>
<ul>Finds all non-overlapping matches of the pattern in the string, scanning from left to right, and returns them as a list. When the pattern has no capture groups, each item in the list is the full text of one match.</ul>
</p>

In [None]:
s = 'Please contact us at: support@textmining.com, info@textmining.com'
addresses = re.findall(r'[\w\.-]+@[\w\.-]+', s)
print(addresses)
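
When the pattern contains capture groups, `re.findall()` returns the groups rather than the full match. The sketch below reuses the sentence from the cell above; the names `full_matches` and `pairs` are illustrative only.

```python
import re

s = 'Please contact us at: support@textmining.com, info@textmining.com'

# No capture groups: each list item is the full matched text.
full_matches = re.findall(r'[\w.-]+@[\w.-]+', s)

# Two capture groups: each list item is a (username, host) tuple.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', s)

print(full_matches)  # ['support@textmining.com', 'info@textmining.com']
print(pairs)         # [('support', 'textmining.com'), ('info', 'textmining.com')]
```

This tuple behavior is worth keeping in mind whenever an expected output is a list of tuples.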

<details>    
<summary>
    <font size="4" color="darkgreen"><b>2. re.compile()</b></font>
</summary>
<p>
<ul>re.compile() compiles a regular expression pattern into a pattern object. The pattern object offers the same operations as the module-level functions (findall(), search(), and so on), so the pattern can be reused without rewriting it.</ul>
</p>

In [None]:
s = 'Please contact us at: support@textmining.com, info@textmining.com'
email = re.compile(r'[\w\.-]+@[\w\.-]+')
addresses = email.findall(s)
print(addresses)
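
The benefit of a compiled pattern shows up when the same pattern is applied to many strings. A minimal sketch, assuming a small list of made-up messages (`messages` and `found` are hypothetical names):

```python
import re

# Compile once, then reuse the same pattern object for every string.
email = re.compile(r'[\w.-]+@[\w.-]+')

# Hypothetical sample messages, for illustration only.
messages = [
    'Write to support@textmining.com',
    'No address here',
    'Sales: info@textmining.com or sales@textmining.com',
]

found = [email.findall(m) for m in messages]
print(found)
# [['support@textmining.com'], [], ['info@textmining.com', 'sales@textmining.com']]

# The original pattern text is still available on the compiled object.
print(email.pattern)
```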

<details>    
<summary>
    <font size="4" color="darkgreen"><b>3. re.search()</b></font>
</summary>
<p>
<ul>The re.search() function takes two arguments: a pattern and a string. It scans the string for the first location where the pattern matches.</ul>
<ul>If the search is successful, re.search() returns a match object; if not, it returns None.</ul>
</p>

In [None]:
txt = 'Text mining is fun'
x = re.search(r'^Text.*fun$', txt)
if x:
    print('Yes! We have a match!')
else:
    print('No match')
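
The match object carries more than a yes/no answer. The following sketch, using the same sentence as above, shows how the matched text and its position can be read from it; `m` and `missing` are illustrative names.

```python
import re

txt = 'Text mining is fun'

m = re.search(r'mining', txt)
if m:  # always check for None before using the match object
    print(m.group())           # mining  (the matched text)
    print(m.start(), m.end())  # 5 11    (start and end indices)
    print(m.span())            # (5, 11) (the same indices as a tuple)

# A pattern that does not occur in the string yields None.
missing = re.search(r'boring', txt)
print(missing)                 # None
```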

<details>    
<summary>
    <font size="4" color="darkgreen"><b>4. re.search() vs re.match()</b></font>
</summary>
<p>
<ul>The re.match() function checks for a match only at the beginning of the string, whereas the re.search() function checks for a match anywhere in the string. This holds even with the re.MULTILINE flag: re.match() never matches at the start of a later line.
</ul>
</p>

In [None]:
s = 'Text mining is fun'
match_obj = re.match(r'fun', s)
if match_obj:
    print('Matched!')
else:
    print('No match')

In [None]:
s = 'Text mining is fun'
search_obj = re.search(r'fun', s)
if search_obj:
    print('Matched!')
else:
    print('No match')
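
The difference can be checked on a two-line string. This sketch also includes `re.fullmatch()`, which is not covered above and is shown here only for comparison: it succeeds only when the whole string matches. The variable names are illustrative.

```python
import re

multi = 'Text mining\nis fun'

# re.match anchors at index 0 of the string, even with re.MULTILINE.
m_match = re.match(r'is', multi, re.MULTILINE)
print(m_match)                # None

# re.search with '^' and re.MULTILINE can match at the start of any line.
m_search = re.search(r'^is', multi, re.MULTILINE)
print(m_search.span())        # (12, 14)

# re.fullmatch requires the pattern to cover the entire string.
m_full_partial = re.fullmatch(r'Text', 'Text mining is fun')
m_full_whole = re.fullmatch(r'Text.*fun', 'Text mining is fun')
print(m_full_partial)         # None
print(m_full_whole.group())   # Text mining is fun
```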

<details>    
<summary>
    <font size="4" color="darkgreen"><b>5. Grouping in Regular Expressions</b></font>
</summary>
<p>
<ul>The re.search() function returns a match object on success. Call group(num) or groups() on the match object to retrieve the matched text.
</ul>
<ul>Groups let you pick out parts of the matched text. Each part of a regular expression enclosed in parentheses () is a group. Groups are numbered from 1 in the order of their opening parentheses, and group() or group(0) returns the whole match.</ul>
</p>

In [None]:
s = "Please contact us at: support@textmining.com"
match = re.search(r'([\w\.-]+)@([\w\.-]+)', s)
print(match.group())   # The whole matched text
print(match.group(1))  # The username (group 1)
print(match.group(2))  # The host (group 2)
print(match.groups())  # The username and host as a tuple
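
Groups can also be given names with the `(?P<name>...)` syntax, which is often easier to read than numeric indices. A minimal sketch on the same sentence; the group names `user` and `host` are chosen here for illustration.

```python
import re

s = 'Please contact us at: support@textmining.com'

# (?P<name>...) defines a named group; it still has a number as well.
m = re.search(r'(?P<user>[\w.-]+)@(?P<host>[\w.-]+)', s)
if m:
    print(m.group('user'))  # support
    print(m.group('host'))  # textmining.com
    print(m.group(1))       # support (numeric access still works)
    print(m.groupdict())    # {'user': 'support', 'host': 'textmining.com'}
```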

<details>    
<summary>
    <font size="4" color="darkgreen"><b>6. re.sub()</b></font>
</summary>
<p>
<ul>re.sub(pattern, repl, string) returns a new string in which every non-overlapping match of pattern is replaced with repl. The original string is not modified.
</ul>
</p>

In [None]:
phone = "2004-959-559 # This is Phone Number"
num = re.sub(r'#.*$', '', phone)  # Delete the Python-style comment (the space before '#' remains)
print(num)
num = re.sub(r'\D', '', phone)  # Remove everything that is not a digit
print(num)
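
The replacement string can refer back to captured groups with `\1`, `\2`, and so on, and the optional `count` argument limits how many matches are replaced. The sketch below uses a made-up sentence (`date`) for illustration and also shows one way to strip the whitespace in front of the comment.

```python
import re

# Hypothetical sample text, for illustration only.
date = 'Published 2016-09-01, updated 2016-09-02'

# \1, \2, \3 in the replacement refer to the captured year, month, and day.
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', date)
print(swapped)     # Published 01/09/2016, updated 02/09/2016

# count=1 replaces only the first match.
first_only = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', date, count=1)
print(first_only)  # Published 01/09/2016, updated 2016-09-02

# Including \s* in the pattern also removes the space before the comment.
phone = "2004-959-559 # This is Phone Number"
clean = re.sub(r'\s*#.*$', '', phone)
print(repr(clean)) # '2004-959-559'
```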

<details>    
<summary>
    <font size="4" color="darkgreen"><b>7. re.split()</b></font>
</summary>
<p>
<ul>The re.split() function splits the string at each match of the pattern and returns the pieces between the matches as a list of strings.
</ul>
</p>

In [None]:
a = 'ab1c26de33tt'
result = re.split(r'\d+', a)
print(result)
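
Three details of `re.split()` are easy to miss: a capture group keeps the separators in the result, `maxsplit` limits the number of splits, and a match at the very end of the string produces a trailing empty string. A short sketch on the same input, with illustrative variable names:

```python
import re

a = 'ab1c26de33tt'

# Plain split: the digit runs are dropped.
plain = re.split(r'\d+', a)
print(plain)          # ['ab', 'c', 'de', 'tt']

# A capture group keeps the separators in the result.
with_separators = re.split(r'(\d+)', a)
print(with_separators)  # ['ab', '1', 'c', '26', 'de', '33', 'tt']

# maxsplit=1 stops after the first split.
once = re.split(r'\d+', a, maxsplit=1)
print(once)           # ['ab', 'c26de33tt']

# A separator at the end of the string leaves a trailing empty string.
trailing = re.split(r'\d+', 'ab1c2')
print(trailing)       # ['ab', 'c', '']
```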

## Exercise 01

Write a Python program to remove leading zeros from an IP address.

In [None]:
ip = "216.08.094.196"
### START CODE HERE ###
string = 
### END CODE HERE ###
print(string)

#### Expected output
```
216.8.94.196
```

## Exercise 02

Write a Python program to extract the year, month, and day from a URL. Use re.findall().

In [None]:
url= "https://www.washingtonpost.com/news/2016/09/01/football-insider/wp/2016/09/02/odell-beckhams-fame-rests-on-one-stupid-little/"
### START CODE HERE ###
dates = 
### END CODE HERE ###
print(dates)

#### Expected output
```
[('2016', '09', '01'), ('2016', '09', '02')]
```

## Exercise 03

'china_bond.csv' is a CSV file in which the first column is the id and the second column is the title. Open it to see the details.

Write a Python program that reads this file and extracts the firm name from the title on each line.

#### Options and Hints
- If you would like more realistic practice, don't open the 'Detailed Hints' below. Try to think the problem through and implement it yourself.
- If you would prefer more guidance, click on the green 'Detailed Hints' section for step by step instructions.

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Detailed Hints</b></font>
</summary>
<p>     
Detailed hints if you're stuck
<ul>
    <li>Use a with statement to open and read the CSV file</li>
    <li>Use next() to skip the header row</li>
    <li>Use a for loop to read the file line by line</li>
    <li>Think of the whole regex as two parts, each enclosed in (): the regex inside the first pair of parentheses matches the text to the left of the firm name, and the regex inside the second pair matches the firm name itself.</li>
    <li>Consider using non-greedy quantifiers such as +? or *?</li>
    <li>Use re.search(pattern, string) or re.findall(pattern, string) to extract the firm name</li>
</ul>
</p>
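
The mechanics in the hints above can be sketched on made-up data. Everything in this example is an assumption for illustration: the in-memory `fake_file`, the English titles, and the `Notice: ... Ltd.` pattern do not come from 'china_bond.csv', whose titles need their own pattern.

```python
import csv
import io
import re

# Hypothetical stand-in for a CSV file with an id column and a title column.
fake_file = io.StringIO(
    'id,title\n'
    '1,Notice: Acme Holdings Ltd. 2019 bond issue\n'
    '2,Notice: Beta Energy Ltd. guaranteed by Gamma Bank Ltd.\n'
)

# Group 1 matches the text to the left of the name; group 2 matches the
# name itself, using the non-greedy .+? to stop at the first 'Ltd.'.
pattern = re.compile(r'(Notice: )(.+?Ltd\.)')

names = []
with fake_file as f:       # for a real file: open(path, encoding='utf-8', newline='')
    reader = csv.reader(f)
    next(reader)           # skip the header row
    for row in reader:
        m = pattern.search(row[1])
        if m:
            names.append(m.group(2))

print(names)   # ['Acme Holdings Ltd.', 'Beta Energy Ltd.']

# For contrast, a greedy .+ runs on to the last 'Ltd.' in the title.
title = 'Notice: Beta Energy Ltd. guaranteed by Gamma Bank Ltd.'
greedy = re.search(r'(Notice: )(.+Ltd\.)', title).group(2)
print(greedy)  # Beta Energy Ltd. guaranteed by Gamma Bank Ltd.
```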

In [None]:
### START CODE HERE ###

### END CODE HERE ###

#### Expected output
```
瑞安市国有资产投资集团有限公司
江苏省美尚生态景观股份有限公司
湖南天易融通创业投资有限公司
安宁发展投资集团有限公司
华晨汽车集团控股有限公司
海宁市尖山新区开发有限公司
海门南黄海建设发展有限公司
瑞安市国有资产投资集团有限公司
桃源县经济开发区开发投资有限公司
当涂县城乡建设投资有限责任公司
江苏省美尚生态景观股份有限公司
安宁发展投资集团有限公司
海宁市尖山新区开发有限公司
海门南黄海建设发展有限公司
淮南市城市建设投资有限责任公司
桃源县经济开发区开发投资有限公司
当涂县城乡建设投资有限责任公司
孟州市投资开发有限公司
彭泽县城市建设投资有限公司
孟州市投资开发有限公司
淮南市城市建设投资有限责任公司
青岛华通国有资本运营（集团）有限责任公司
青岛华通国有资本运营（集团）有限责任公司
景德镇陶瓷文化旅游发展有限责任公司
青岛华通国有资本运营（集团）有限责任公司
彭泽县城市建设投资有限公司
孟州市投资开发有限公司
景德镇陶瓷文化旅游发展有限责任公司
永修县城市建设投资开发有限公司
厦门轨道交通集团有限公司
杭州良渚文化城集团有限公司
```