# Regular Expressions in Python
### Table of contents:
1. **match()**
2. **search()**
3. **findall()**
4. **finditer()**
5. **sub()**
6. **split()**
7. **Groups**

In [1]:
import re

## 1. match()
Checks for a match only at the beginning of the string and returns a match object, or `None` if the pattern is not found there

In [2]:
# Defining a string
string = "Tiger is the national animal of India. Tiger lives in Forest."

# Defining the pattern
pattern = "Tiger"

# Running match() on a string
result = re.match(pattern, string)

# Printing the result
print(result)

<re.Match object; span=(0, 5), match='Tiger'>


In [3]:
# Defining a string
string = "Tiger is the national animal of India. Tiger lives in Forest."

# Defining the pattern
pattern = "Tiger"

# Extracting String from a match object
result = re.match(pattern, string).group()

# Printing the result
print(result)

Tiger


In [4]:
string = "The national animal of India is Tiger. Tiger lives in Forest."
pattern = "Tiger"

# Checking for match
result = re.match(pattern, string)
print(result)

None
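Because `re.match()` returns `None` when the pattern is not found at the start, chaining `.group()` directly onto the call raises an `AttributeError` in that case. The following is a minimal sketch of a guarded version; the helper name `match_text` and the two sample strings are made up for illustration.

```python
import re

def match_text(pattern, string):
    """Return the matched text, or None when the string does not start with the pattern.

    Hypothetical helper, not part of the re module.
    """
    m = re.match(pattern, string)
    return m.group() if m is not None else None

starts_with_tiger = "Tiger is the national animal of India."
tiger_later = "The national animal of India is Tiger."

print(match_text("Tiger", starts_with_tiger))  # Tiger
print(match_text("Tiger", tiger_later))        # None
```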


## 2. search()
Scans the whole string and returns a match object for the first substring that matches the RegEx pattern

In [4]:
string = "The national animal of India is Tiger. Tiger lives in Forest."
pattern = "Tiger"

# Searching a substring using search()
result = re.search(pattern, string)
print(result)

<re.Match object; span=(32, 37), match='Tiger'>


In [5]:
string = "The national animal of India is Tiger. Tiger lives in Forest."
pattern = "Tiger"

# Extracting searched string
result = re.search(pattern, string).group()
print(result)

Tiger
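Note that `search()` reports only the first occurrence, even when the pattern appears several times. A short sketch showing how the position of that first match can be read from the match object:

```python
import re

string = "The national animal of India is Tiger. Tiger lives in Forest."
m = re.search("Tiger", string)

# search() stops at the first occurrence, although "Tiger" appears twice
print(m.span())                   # (32, 37)
print(string[m.start():m.end()])  # Tiger
```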


## 3. findall()
Finds all non-overlapping substrings matching the RegEx pattern and returns them as a list

In [6]:
string = "The national animal of India is Tiger. Tiger lives in Forest."
pattern = "Tiger"

# Using findall() on a string
result = re.findall(pattern, string)
print(result)

['Tiger', 'Tiger']


In [7]:
# Defining the string
text = "India got freedom on 15-08-1947, and it is celebrated as Independence Day.\
        Indian Constitution came into effect on 26-01-1950, and it is celebrated as Republic Day."

# Defining the pattern
date_pattern = r'\d{2}-\d{2}-\d{4}'

# Extracting dates using findall()
re.findall(date_pattern, text)

['15-08-1947', '26-01-1950']
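Unlike `match()` and `search()`, `findall()` does not return `None` when nothing matches; it returns an empty list. A small sketch, using a made-up sentence that contains no dates:

```python
import re

# Illustrative input without any date in it
text_without_dates = "No dates are mentioned in this sentence."
date_pattern = r'\d{2}-\d{2}-\d{4}'

dates = re.findall(date_pattern, text_without_dates)
print(dates)       # []
print(len(dates))  # 0

# An empty list is falsy, so the result can be tested directly
if not dates:
    print("No dates found")
```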

## 4. finditer()
Similar to findall(), but returns an iterator of match objects instead of a list of strings

In [8]:
string = "The national animal of India is Tiger. Tiger lives in Forest."
pattern = "Tiger"

# Using finditer() on a string
result = re.finditer(pattern, string)
print(result)



<callable_iterator object at 0x000002EC583183C8>


In [9]:
# Iterating over the iterator
for m in result:
    # Printing match object
    print(m)
    # Printing starting and ending index with matched substring
    print('Start:',m.start(),' End:',m.end(),' Sub-string:',m.group())

<re.Match object; span=(32, 37), match='Tiger'>
Start: 32  End: 37  Sub-string: Tiger
<re.Match object; span=(39, 44), match='Tiger'>
Start: 39  End: 44  Sub-string: Tiger
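One point worth keeping in mind is that the iterator returned by `finditer()` can be consumed only once. The sketch below loops over it twice to show that the second pass yields nothing; wrapping the call in `list()` is one way to keep the matches for reuse.

```python
import re

string = "The national animal of India is Tiger. Tiger lives in Forest."
matches = re.finditer("Tiger", string)

first_pass = [m.span() for m in matches]
second_pass = [m.span() for m in matches]  # the iterator is already exhausted

print(first_pass)   # [(32, 37), (39, 44)]
print(second_pass)  # []

# Storing the matches in a list allows repeated iteration
stored = list(re.finditer("Tiger", string))
print(len(stored))  # 2
```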


## 5. sub()
Replaces every substring that matches the RegEx pattern with another string

In [10]:
text="Analytics Vidhya is largest Analytics community of India."

# Replacing a substring using sub()
result=re.sub('India', 'the World',text)
print(result)

Analytics Vidhya is largest Analytics community of the World.
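By default `sub()` replaces every occurrence. As a brief sketch, the `count` argument limits the number of replacements, and `re.subn()` also reports how many were made; the replacement word `'Data'` is chosen only for illustration.

```python
import re

text = "Analytics Vidhya is largest Analytics community of India."

# count=1 replaces only the first occurrence
first_only = re.sub('Analytics', 'Data', text, count=1)
print(first_only)
# Data Vidhya is largest Analytics community of India.

# subn() returns the new string together with the number of replacements
new_text, n = re.subn('Analytics', 'Data', text)
print(new_text)  # Data Vidhya is largest Data community of India.
print(n)         # 2
```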


## 6. split()
Splits the string at every match of the given RegEx pattern and returns the pieces as a list

In [11]:
line = "I have a big test tomorrow; I can't go out tonight."

# Splitting a string into multiple substrings
re.split(r'[;]', line)

['I have a big test tomorrow', " I can't go out tonight."]
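Splitting on `;` alone leaves the space that followed the separator at the start of the second piece. A possible variation, sketched below, lets the pattern consume that whitespace as well, and also shows the `maxsplit` argument.

```python
import re

line = "I have a big test tomorrow; I can't go out tonight."

# ';\s*' consumes the separator and any whitespace that follows it
parts = re.split(r';\s*', line)
print(parts)
# ['I have a big test tomorrow', "I can't go out tonight."]

# maxsplit limits the number of splits that are performed
first_words = re.split(r'\s', line, maxsplit=2)
print(first_words)
# ['I', 'have', "a big test tomorrow; I can't go out tonight."]
```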

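## 7. Groups
Parentheses in a pattern create groups, which capture parts of a match so that they can be extracted separately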

In [12]:
# Running a simple pattern on some text
string="Ajay credited $500 to your account on 13-08-2020.\
      Anmol debited $1,700 from your account on 14-08-2020.\
      Alex debited $100 on 16-08-2020 from your account."

# Raw string so that \w, \d and \$ reach the regex engine unchanged
pattern = r"\w+ \w+ \$[\d,]+ [a-zA-Z ]+ \d{2}-\d{2}-\d{4}"

result=re.findall(pattern,string)

print(result)

['Ajay credited $500 to your account on 13-08-2020', 'Anmol debited $1,700 from your account on 14-08-2020', 'Alex debited $100 on 16-08-2020']


In [13]:
string="Ajay credited $500 to your account on 13-08-2020.\
      Anmol debited $1,700 from your account on 14-08-2020.\
      Alex debited $100 on 16-08-2020 from your account."

# Creating groups in the previous pattern (raw string, character class fixed to a-zA-Z)
pattern = r"(\w+) (\w+) (\$[\d,]+) [a-zA-Z ]+ (\d{2}-\d{2}-\d{4})"

result=re.findall(pattern,string)

print(result)

[('Ajay', 'credited', '$500', '13-08-2020'), ('Anmol', 'debited', '$1,700', '14-08-2020'), ('Alex', 'debited', '$100', '16-08-2020')]


In [14]:
import pandas as pd

# Creating a dataframe
df=pd.DataFrame(result,columns=['Name','Type','Amount','Date'])
df

    Name      Type  Amount        Date
0   Ajay  credited    $500  13-08-2020
1  Anmol   debited  $1,700  14-08-2020
2   Alex   debited    $100  16-08-2020


In [15]:
# Using finditer() for getting match objects
string="Ajay credited $500 to your account on 13-08-2020.\
      Anmol debited $1,700 from your account on 14-08-2020.\
      Alex debited $100 on 16-08-2020 from your account."

# Raw string, with the character class fixed to a-zA-Z
pattern = r"(\w+) (\w+) (\$[\d,]+) [a-zA-Z ]+ (\d{2}-\d{2}-\d{4})"

result=re.finditer(pattern,string)

# Accessing groups separately
for i in result:
    print(i.group(0),'=>',i.group(1),'=>',i.group(2),'=>',i.group(3),'=>',i.group(4))

Ajay credited $500 to your account on 13-08-2020 => Ajay => credited => $500 => 13-08-2020
Anmol debited $1,700 from your account on 14-08-2020 => Anmol => debited => $1,700 => 14-08-2020
Alex debited $100 on 16-08-2020 => Alex => debited => $100 => 16-08-2020
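Besides calling `group()` once per index, a match object can return several groups in one call. A short sketch on a shortened, illustrative input string:

```python
import re

m = re.search(r"(\w+) (\w+) (\$[\d,]+)", "Anmol debited $1,700 from your account")

whole = m.group(0)        # the entire match
all_groups = m.groups()   # every captured group as a tuple
selected = m.group(1, 3)  # several groups at once, also as a tuple

print(whole)       # Anmol debited $1,700
print(all_groups)  # ('Anmol', 'debited', '$1,700')
print(selected)    # ('Anmol', '$1,700')
```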


**Note:** Syntax for naming groups: `(?P<name>pattern)`, where `name` must be a valid Python identifier

In [16]:
string="Ajay credited $500 to your account on 13-08-2020.\
      Anmol debited $1,700 from your account on 14-08-2020.\
      Alex debited $100 on 16-08-2020 from your account."

# Naming Groups (raw string, character class fixed to a-zA-Z)
pattern = r"(?P<Name>\w+) (?P<Type>\w+) (?P<Amount>\$[\d,]+) [a-zA-Z ]+ (?P<Date>\d{2}-\d{2}-\d{4})"

result=list(re.finditer(pattern,string))

In [17]:
# Accessing data by group names
for i in result:
    print(i.group('Name'),'=>',i.group('Amount'),'=>',i.group('Date'),'=>',i.group('Type'))

Ajay => $500 => 13-08-2020 => credited
Anmol => $1,700 => 14-08-2020 => debited
Alex => $100 => 16-08-2020 => debited


In [18]:
# Printing data with group names
for i in result:
    print(i.groupdict())

{'Name': 'Ajay', 'Type': 'credited', 'Amount': '$500', 'Date': '13-08-2020'}
{'Name': 'Anmol', 'Type': 'debited', 'Amount': '$1,700', 'Date': '14-08-2020'}
{'Name': 'Alex', 'Type': 'debited', 'Amount': '$100', 'Date': '16-08-2020'}
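The dictionaries from `groupdict()` hold strings only. As one possible follow-up, the sketch below turns the amount into an integer. It is a variation on the pattern above, not the same pattern: the `$` sign is moved outside the `Amount` group so that only digits and commas are captured, and the input is shortened to two sentences.

```python
import re

string = ("Ajay credited $500 to your account on 13-08-2020. "
          "Anmol debited $1,700 from your account on 14-08-2020.")

# Variation for illustration: '\$' sits outside the Amount group
pattern = r"(?P<Name>\w+) (?P<Type>\w+) \$(?P<Amount>[\d,]+) [a-zA-Z ]+ (?P<Date>\d{2}-\d{2}-\d{4})"

records = []
for m in re.finditer(pattern, string):
    record = m.groupdict()
    # Drop the thousands separator before converting to int
    record['Amount'] = int(record['Amount'].replace(',', ''))
    records.append(record)

for record in records:
    print(record)
```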


## Exercise

"Sam started learning NLP on 02-01-2020. He created his first self project on 18-02-2019. After this, he worked hard and got an internship at ABC Pvt. Ltd. on 10-06-2019. Finally. he got his first job at XYZ Pvt. Ltd. on 22-10-2019."

Use the given text and perform the following tasks:
- Task 1: Correct the year in all the dates to 2020.
- Task 2: Extract all the dates in the form of a list.
- Task 3: Extract all the company names.

In [19]:
text = "Sam started learning NLP on 02-01-2020. He created his first self project on 18-02-2019. After this, he worked hard and got an internship at ABC Pvt. Ltd. on 10-06-2019. Finally. he got his first job at XYZ Pvt. Ltd. on 22-10-2019."

In [20]:
# Task 1: Correct year in all the dates to 2020
# Match whole dates so that only the year part of a date is replaced
text_corr_dates = re.sub(r'(\d{2}-\d{2}-)\d{4}', r'\g<1>2020', text)
text_corr_dates

'Sam started learning NLP on 02-01-2020. He created his first self project on 18-02-2020. After this, he worked hard and got an internship at ABC Pvt. Ltd. on 10-06-2020. Finally. he got his first job at XYZ Pvt. Ltd. on 22-10-2020.'

In [21]:
# Task 2: Extract all the dates in the form of a list.
lst_dates = re.findall(r'\d{2}-\d{2}-\d{4}', text_corr_dates)
lst_dates

['02-01-2020', '18-02-2020', '10-06-2020', '22-10-2020']

In [27]:
# Task 3: Extract all the company names
# The dots are escaped so that they match a literal "." only
lst_comp = re.findall(r'[A-Z]+\sPvt\. Ltd\.', text_corr_dates)
lst_comp

['ABC Pvt. Ltd.', 'XYZ Pvt. Ltd.']
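In a pattern, an unescaped `.` matches any character, so a pattern containing `Pvt. Ltd.` with plain dots would also accept text where other characters stand in place of the dots. The sketch below compares the two spellings on a made-up sample string.

```python
import re

# Illustrative sample: the second "company" has letters where the dots should be
sample = "ABC Pvt. Ltd. and ABC Pvtx Ltdx"

unescaped = re.findall(r'[A-Z]+\sPvt. Ltd.', sample)
escaped = re.findall(r'[A-Z]+\sPvt\. Ltd\.', sample)

print(unescaped)  # ['ABC Pvt. Ltd.', 'ABC Pvtx Ltdx']
print(escaped)    # ['ABC Pvt. Ltd.']
```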