# Regular expression
***

Links:
- https://docs.python.org/3/library/re.html
- https://www.analyticsvidhya.com/blog/2015/06/regular-expression-python/

### Purpose of use:
- to determine the required format, for example, a phone number or email address;
- for splitting strings into substrings;
- for searching, replacing and extracting symbols;
- to quickly perform non-trivial operations.

### Module re.
This module provides regular expression matching operations.

The most common uses of regular expressions are:

- Search for a pattern in a string (search and match)
- Find all occurrences of a pattern (findall)
- Break a string into substrings (split)
- Replace part of a string (sub)

In [28]:
import re

# Regular expression using a constant pattern

In [29]:
text = 'SPb is known as the "Cultural Capital of Russia" and the "Northern Capital"'

### 1) re.match(pattern, string)

This method returns a match only if the pattern occurs at the start of the string; otherwise it returns None.

In [30]:
# matching SPb in the given sentence
result = re.match(r"SPb", text)
print("\n", result)
print("\nMatching string :", result.group(0))
print("\nStarting position of the match :", result.start())
print("Ending position of the match :", result.end())


 <re.Match object; span=(0, 3), match='SPb'>

Matching string : SPb

Starting position of the match : 0
Ending position of the match : 3


In [32]:
# Now let's look for 'Russia' in the given string.
# The string does not start with 'Russia', so re.match should return None:

result = re.match(r"Russia", text)
print("\nResult :", result)


Result : None


### 2) re.search(pattern, string)
It is similar to match(), but it does not restrict matches to the beginning of the string. Unlike the previous method, searching for the pattern 'Russia' here returns a match. However, search() returns only the first occurrence of the pattern.

In [33]:
result = re.search(r"Russia", text)
print(result.group(0))

Russia


In [34]:
result

<re.Match object; span=(41, 47), match='Russia'>

### 3) re.findall(pattern, string)
It returns a list of all non-overlapping matches of the pattern. It is not restricted to the start or end of the string. For example, searching for 'Capital' in the given string returns both occurrences. When only the matched text is needed, re.findall() can stand in for both re.search() and re.match(), because it scans the whole string; note that it returns plain strings (or tuples when the pattern contains groups) rather than match objects, and an empty list when nothing matches.
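The behaviour of re.findall() described here can be illustrated with a minimal sketch. It reuses the `text` string from the earlier cells; the search word 'Moscow' is only an assumed example of a pattern that does not occur.

```python
import re

text = 'SPb is known as the "Cultural Capital of Russia" and the "Northern Capital"'

# findall returns every non-overlapping match as a list of strings
capitals = re.findall(r"Capital", text)
print(capitals)   # ['Capital', 'Capital']

# When nothing matches, the result is an empty list rather than None
missing = re.findall(r"Moscow", text)
print(missing)    # []
```

Because the result is always a list, it can be checked with `if capitals:` without first testing for None.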

### 4) re.split(pattern, string, [maxsplit=0])
This method splits the string at each occurrence of the given pattern.

In [35]:
result = re.split(r"of", "Cultural Capital of Russia")
result

['Cultural Capital ', ' Russia']

The split() function has a maxsplit argument with a default value of zero, which means the string is split at every occurrence of the pattern. If maxsplit is given a nonzero value, at most that many splits are made and the remainder of the string is returned as the last element of the list. Let's look at the example below:

In [36]:
result = re.split(r"l", "Cultural Capital of Russia")
result

['Cu', 'tura', ' Capita', ' of Russia']

In [37]:
result = re.split(r"l", "Cultural Capital of Russia", maxsplit=1)
result

['Cu', 'tural Capital of Russia']

### 5) re.sub(pattern, repl, string)
It searches for a pattern and replaces each match with a new substring. If the pattern is not found, the string is returned unchanged.

In [38]:
result = re.sub(r"Northern", "Southern", text)
print(text)
print(result)

SPb is known as the "Cultural Capital of Russia" and the "Northern Capital"
SPb is known as the "Cultural Capital of Russia" and the "Southern Capital"
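As a small sketch of two further properties of re.sub(): a pattern that does not occur leaves the string unchanged, and the optional `count` argument limits how many matches are replaced. The words 'Western' and 'City' are assumed examples, not part of the original notebook.

```python
import re

text = 'SPb is known as the "Cultural Capital of Russia" and the "Northern Capital"'

# Pattern not present: the original string comes back unchanged
unchanged = re.sub(r"Western", "Southern", text)
print(unchanged == text)   # True

# count=1 replaces only the first occurrence of "Capital"
first_only = re.sub(r"Capital", "City", text, count=1)
print(first_only)
```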


### 6) re.compile(pattern, flags=0)

We can compile a regular expression pattern into a pattern object, which can then be used for matching. This also lets us reuse the same pattern without rewriting it.

In [39]:
pattern = re.compile("Capital")
result = pattern.findall(text)
result

['Capital', 'Capital']
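To illustrate the reuse mentioned for re.compile(), here is a minimal sketch in which one compiled pattern object is used with several methods. The replacement word 'City' and the case-insensitive variant are assumptions added for illustration.

```python
import re

text = 'SPb is known as the "Cultural Capital of Russia" and the "Northern Capital"'

# Compile once, then reuse the same pattern object
pattern = re.compile(r"Capital")

first = pattern.search(text)            # first occurrence as a match object
count = len(pattern.findall(text))      # number of occurrences
replaced = pattern.sub("City", text)    # replace every occurrence

print(first.group(0), count)
print(replaced)

# Flags are passed at compile time, e.g. case-insensitive matching
ci_pattern = re.compile(r"capital", re.IGNORECASE)
ci_matches = ci_pattern.findall(text)
print(ci_matches)
```

The pattern object offers the same operations as the module-level functions (search, match, findall, split, sub), just without the pattern argument.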

# Regular expression with a non-constant pattern

Here are the most commonly used operators that help to build an expression representing the required characters in a string or file. They are commonly used in web scraping and text mining to extract required information.

![image](re_pattrern.png)

### Examples

1. Extract each word

In [40]:
text = 'SPb is known as the "Cultural Capital of Russia" and the "Northern Capital"'

In [41]:
re.findall(r'\w+', text)

['SPb',
 'is',
 'known',
 'as',
 'the',
 'Cultural',
 'Capital',
 'of',
 'Russia',
 'and',
 'the',
 'Northern',
 'Capital']

2. Return the first two characters of each word

In [42]:
re.findall(r'\b\w.', text)

['SP', 'is', 'kn', 'as', 'th', 'Cu', 'Ca', 'of', 'Ru', 'an', 'th', 'No', 'Ca']

3. Return the domain type of given email-ids

In [43]:
emails = "abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz"

In [47]:
re.findall(r'@\w+\.\w+', emails)  # the dot is escaped so that it matches a literal "."

['@gmail.com', '@test.in', '@analyticsvidhya.com', '@rest.biz']

4. Return date from given string

In [50]:
dates = "Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009"

In [51]:
re.findall(r'\d{2}-\d{2}-\d{4}', dates)

['12-05-2007', '11-11-2011', '12-01-2009']

5. Return all words of a string that start with a vowel (aeiouAEIOU)

6. Validate a phone number (it must have 10 digits and start with 8 or 9)

In [None]:
numbers = ["8999999999", "999999-999", "99999x9999"]

7. Split a string with multiple delimiters

In [None]:
line = "asdf fjdk;afed,fjek,asdf,foo"  # String has multiple delimiters (";",","," ").


## Answers

In [None]:
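# 1) re.findall(r'\w+', text)
# 2) re.findall(r'\b\w.', text)
# 3) re.findall(r'@\w+\.\w+', emails)       # '@domain.type', with the dot escaped
#    re.findall(r'@\w+\.(\w+)', emails)     # only the domain type: 'com', 'in', ...
# 4) re.findall(r'\d{2}-\d{2}-\d{4}', dates)
# 5) re.findall(r'\b[aeiouAEIOU]\w*', string)     # words that start with a vowel
#    re.findall(r'\b[^aeiouAEIOU ]\w+', string)   # words that do not start with a vowel
# 6) for number in numbers:
#        # [89] fixes the first digit; fullmatch requires exactly 10 digits in total
#        if re.fullmatch(r'[89][0-9]{9}', number):
#            print(number)
# 7) re.split(r'[ ;,]', line)
#    re.sub(r'[;,]', ' ', line)   # alternative: turns the delimiters into spaces (returns a string)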
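For exercises 5 to 7, which have no executed cells above, here is a runnable sketch of one possible solution. It assumes the `text` string from the earlier cells as input for exercise 5, and it uses re.fullmatch() for exercise 6 so that the whole string must consist of exactly 10 digits.

```python
import re

# Exercise 5: words that start with a vowel
text = 'SPb is known as the "Cultural Capital of Russia" and the "Northern Capital"'
vowel_words = re.findall(r"\b[aeiouAEIOU]\w*", text)
print(vowel_words)   # ['is', 'as', 'of', 'and']

# Exercise 6: 10 digits, the first of which is 8 or 9
numbers = ["8999999999", "999999-999", "99999x9999"]
valid = [n for n in numbers if re.fullmatch(r"[89][0-9]{9}", n)]
print(valid)         # ['8999999999']

# Exercise 7: split on space, semicolon, or comma
line = "asdf fjdk;afed,fjek,asdf,foo"
parts = re.split(r"[ ;,]", line)
print(parts)         # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
```

The character class `[89]` is used instead of `8|9` because alternation has the lowest precedence: `8|9[0-9]{9}` means "a single 8, or a 9 followed by nine digits", which would accept strings such as "8xxxxxxxxx" when combined with a length check.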