# Regular expression or pattern matching

Regular expressions are a powerful tool for **working with text in Python**. They allow you to search for patterns within a string, extract specific parts of a string, and replace text.

https://docs.python.org/3/library/re.html



https://regex101.com/

## Metacharacters

| Character | Description | Example |
| ---------|--------------|---------|
|[ ] | A set of characters | [2-5] |
|{ } | Exactly the specified number of occurrences | Sa.{4}p |
| \ | Signals a special sequence | \d|
|.	|Any character (except newline character)|	"Sa....p"|
|^|Starts with|^Hi|
|`$` (dollar symbol)|Ends with|`'End$'`|
|\||Either or|"A\|B"|
|+|One or more occurrences|"he.+o"|
|*|Zero or more occurrences|"he*o"|
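
The rows of the table can be tried out directly. The following is a minimal sketch, not part of the original notebook: the sample strings and variable names are made up for illustration, and each line exercises one metacharacter.

```python
import re

# One small check per metacharacter; all sample strings are illustrative.
in_range = re.findall(r"[2-5]", "a1b3c5d7")            # set: digits 2 to 5
four_between = bool(re.search(r"Sa.{4}p", "Sandeep"))  # exactly four characters between "Sa" and "p"
starts_with_hi = bool(re.search(r"^Hi", "Hi there"))   # anchored at the start
ends_with_end = bool(re.search(r"End$", "The End"))    # anchored at the end
either = re.findall(r"A|B", "CAB")                     # either "A" or "B"
one_or_more = re.findall(r"he.+o", "hello")            # "he", one or more characters, then "o"
zero_or_more = re.findall(r"he*o", "ho heo heeo")      # "h", zero or more "e", then "o"

print(in_range)        # ['3', '5']
print(four_between, starts_with_hi, ends_with_end)  # True True True
print(either)          # ['A', 'B']
print(one_or_more)     # ['hello']
print(zero_or_more)    # ['ho', 'heo', 'heeo']
```

Writing patterns as raw strings (the `r"..."` prefix) keeps backslashes from being interpreted by Python before the regex engine sees them.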



## Special sequence
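- In regular expressions, certain special sequences have a specific meaning and can be used to match particular types of characters or positions.
- A special sequence is a backslash (`\`) followed by a character. Here are some of the most commonly used special sequences in Python's regular expression syntax:
- The bracket equivalents shown in the table are exact for ASCII text. For Python `str` patterns, `\d`, `\s`, and `\w` also match the corresponding Unicode characters unless the `re.ASCII` flag is used.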

| Character | Description | Example |
| ---------|--------------|---------|
|\A|Matches if the specified characters are at the beginning of the string|\AThe|
|\d|Matches any digit character (equivalent to [0-9])|\d|
|\D|Matches any non-digit character (equivalent to [^0-9])|\D|
|\s|Matches any whitespace character (equivalent to [ \t\n\r\f\v])|\s|
|\S|Matches any non-whitespace character (equivalent to [^ \t\n\r\f\v])|\S|
|\w|Matches any word character: a letter, a digit, or an underscore (equivalent to [a-zA-Z0-9_])|\w|
|\b|Matches a word boundary (the position between a word character and a non-word character)|\b|
|\B|Matches a position that is not a word boundary|\B|
|\Z|Matches if the specified characters are at the end of the string|\Z|
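
To see a few of these sequences in action, here is a small sketch. The sample sentence and variable names are assumptions chosen for illustration only.

```python
import re

sample = "Room 42 is open"  # illustrative sentence, not from the notebook

starts_with_room = bool(re.search(r"\ARoom", sample))  # \A: at the beginning of the string
digits = re.findall(r"\d", sample)                      # \d: each digit separately
words = re.findall(r"\w+", sample)                      # \w+: runs of word characters
whitespace_count = len(re.findall(r"\s", sample))       # \s: each whitespace character
whole_word_is = re.findall(r"\bis\b", "this is it")     # \b: "is" only as a separate word
ends_with_open = bool(re.search(r"open\Z", sample))     # \Z: at the end of the string

print(starts_with_room)   # True
print(digits)             # ['4', '2']
print(words)              # ['Room', '42', 'is', 'open']
print(whitespace_count)   # 3
print(whole_word_is)      # ['is']
print(ends_with_open)     # True
```

Note that `\bis\b` skips the "is" inside "this", because there is no word boundary between "h" and "i".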


## Set

In regular expressions, a set is a collection of characters, any one of which can match a single character in a string. To create a set, use **square brackets []** and list the characters you want to match inside.

[aeiou]: This set would **match** any single **vowel** character in a string.
<br>
<br>
[^aeiou]: This set would **match** any single character that is **not a lowercase vowel**, including consonants, digits, and spaces.
<br>
<br>
[0-9]: This set would **match** any **single digit** character in a string.
<br>
<br>
[a-d]: This set would **match** any single **lowercase** character between a and d.
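
The four sets above can be checked against a short string. This is a rough sketch; the string `"bead 42"` and the variable names are made up for the example.

```python
import re

text = "bead 42"  # illustrative input

vowels = re.findall(r"[aeiou]", text)       # lowercase vowels only
non_vowels = re.findall(r"[^aeiou]", text)  # everything else, including the space and digits
digits = re.findall(r"[0-9]", text)         # single digits
a_to_d = re.findall(r"[a-d]", text)         # lowercase letters a, b, c, d

print(vowels)      # ['e', 'a']
print(non_vowels)  # ['b', 'd', ' ', '4', '2']
print(digits)      # ['4', '2']
print(a_to_d)      # ['b', 'a', 'd']
```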

### Import library

Python has a built-in package for regular expressions, called **re**.

In [None]:
# import library
import re

## match

The **match()** function in Python's regular expression module (re) is used to match a regular expression pattern to the beginning of a string. It returns a match object if the pattern is found at the beginning of the string, or None if the pattern is not found.

#### Syntax:
re.match(pattern, string)
<br>
<br>
**pattern**: This is the regular expression pattern you want to match.
<br>
<br>
**string**: This is the string you want to search for the pattern.

#### We want to check whether a sentence starts with a specific word or number.
The sentence is: Data Science is the most demanding course.

In [None]:
pattern = "Data Science"
string = "Data Science is the most demanding course."

match_object = re.match(pattern, string)
print(match_object)

<re.Match object; span=(0, 12), match='Data Science'>


- The output shows the span: the match starts at index 0 and ends just before index 12. It also shows the matched text, match='Data Science'.
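
Because match() only looks at the beginning of the string, a pattern that occurs later in the sentence is not found. A minimal sketch of that behaviour, using the same sentence (the variable names are illustrative):

```python
import re

sentence = "Data Science is the most demanding course."

at_start = re.match("Data", sentence)         # the pattern sits at position 0
not_at_start = re.match("Science", sentence)  # present in the sentence, but not at the start

print(at_start.span())  # (0, 4)
print(not_at_start)     # None
```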


#### The current batch number is DS27, and we want to check that it starts with letters and ends with digits.
- We can build the pattern piece by piece before writing it.
    * The start of the string is marked with: ^
    * Then we want letters, here uppercase only (A-Z). This is a set, so we use [A-Z]
    * There is more than one letter (D, then S), so we add +
    * Then we want to match digits, so we use \d
    * There is more than one digit (2, then 7), so we add + here as well. Without it the pattern would allow only a single digit, and * would also accept a string with no digits at all.
    * The string should end with the digits, so we finish with the dollar symbol
    * We want to match from the beginning of the string, so we use the match function


In [None]:
# define a regular expression pattern to match (raw string, so \d reaches the regex engine unchanged)
pattern = r'^[A-Z]+\d+$'

# define a string to test against the pattern
string = 'DS27'

# use the match() function to search for the pattern in the string
match = re.match(pattern, string)

# check if a match was found
if match:
    print('Match found!')
else:
    print('No match found.')

Match found!


match.group()
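
The choice of quantifier after \d matters. The sketch below compares a pattern with optional digits against one that requires at least one digit. The candidate strings and the names `loose`, `strict`, and `results` are assumptions made for this example.

```python
import re

loose = r'^[A-Z]+\d*$'   # uppercase letters, then zero or more digits
strict = r'^[A-Z]+\d+$'  # uppercase letters, then at least one digit

candidates = ['DS27', 'DS', '27DS', 'ds27']  # illustrative test strings

# For each candidate, record (matches loose, matches strict).
results = {s: (bool(re.match(loose, s)), bool(re.match(strict, s))) for s in candidates}

for s, (is_loose, is_strict) in results.items():
    print(s, is_loose, is_strict)
# DS27 True True
# DS True False
# 27DS False False
# ds27 False False
```

Only the version with `\d+` rejects "DS", which has no trailing number.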

### findall
The findall function is used to search a given string for all occurrences of a particular pattern, specified using a regular expression, and returns a list of all matched substrings.
<br>
<br>
Here's the basic syntax of the findall function in Python:

**re.findall(pattern, string)**
<br>
<br>
pattern: A string that represents the regular expression pattern to search for.
<br>
string: A string that represents the input string to be searched.


#### Sometimes we want to find every occurrence of a specific pattern in a text. Here we look for commas.
Sentence: DS27, is in the Second Week of Python, and doing excellent

In [None]:
sentence = "DS27, is in the Second Week of Python, and doing excellent"

# The findall() function returns a list containing all matches.
x = re.findall(",", sentence)
print(x)


[',', ',']
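
findall() becomes more useful with patterns that match more than a fixed character. Here is a short sketch on the same sentence; the patterns and variable names are illustrative choices, not part of the original notebook.

```python
import re

sentence = "DS27, is in the Second Week of Python, and doing excellent"

comma_count = len(re.findall(",", sentence))            # number of commas
numbers = re.findall(r"\d+", sentence)                  # runs of digits
capitalized = re.findall(r"\b[A-Z][a-z]+\b", sentence)  # words with one capital followed by lowercase letters

print(comma_count)   # 2
print(numbers)       # ['27']
print(capitalized)   # ['Second', 'Week', 'Python']
```

If nothing matches, findall() returns an empty list rather than None.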


### Search
The search function in Python's re module is used to search for a regular expression pattern in a given string. It returns a match object if a match is found, otherwise it returns None.
<br>
<br>
**Syntax**:
<br>
re.search(pattern, string)
<br>
<br>
pattern: The regular expression pattern to search for.
<br>
string: The string in which to search for the pattern.

A DNA sequence consists of the letters A, G, C, and T. Specific patterns within DNA can play an important role in biological functions. Here is a DNA sequence: ATGCGCCTACAATCGGTACGTCATCGCGCGCGCTTACGGCTCGGCTCTCCCCGGGCCTGCGCGCGCCTGATCGA. GCGCGCGC is one such pattern, and we want to check whether it is present in the sequence.

In [None]:
Gene_Seq = "ATGCGCCTACAATCGGTACGTCATCGCGCGCGCTTACGGCTCGGCTCTCCCCGGGCCTGCGCGCGCCTGATCGA"

pattern= "GCGCGCGC"

# The search() function searches the string for a match, and returns a Match object if there is a match.
match = re.search(pattern, Gene_Seq) 

if match:
    print("Found a match at index %d: %s" % (match.start(), match.group()))
else:
    print("No match found.")

Found a match at index 25: GCGCGCGC


In this example, the search function finds the first occurrence of the sequence "GCGCGCGC" in the string Gene_Seq and returns a match object. The start() method of the match object returns the starting index of the match, and the group() method returns the matched text. If no match is found, the search function returns None and the code inside the else block is executed instead.
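
The None case can be wrapped in a small helper. This is a sketch under assumptions: `locate` is a hypothetical helper name, and "TTTTTT" is simply a pattern that does not occur in this sequence.

```python
import re

gene_seq = "ATGCGCCTACAATCGGTACGTCATCGCGCGCGCTTACGGCTCGGCTCTCCCCGGGCCTGCGCGCGCCTGATCGA"

def locate(pattern, sequence):
    """Return (start, end) of the first match, or None if there is no match."""
    m = re.search(pattern, sequence)
    return None if m is None else (m.start(), m.end())

found = locate("GCGCGCGC", gene_seq)
missing = locate("TTTTTT", gene_seq)
at_beginning = re.match("GCGCGCGC", gene_seq)  # match() fails: the pattern is not at index 0

print(found)         # (25, 33)
print(missing)       # None
print(at_beginning)  # None
```

The last line shows the difference from match(): search() scans the whole string, while match() only checks the beginning.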

### finditer() function

To find the index of the matches, you can use the **re.finditer()** function to iterate over the match objects and obtain their start and end indices.

In [None]:
text = "ATGCGCCTACAATCGGTACGTCATCGCGC@@@@@TTACGGCTCGGCTCTCNNNCGGGCCTGCGCGCGCCTGATCGA"
pattern = "[^AGCT]"
matches = re.finditer(pattern, text)

for match in matches:
    print("Match found at index:", match.start())

Match found at index: 29
Match found at index: 30
Match found at index: 31
Match found at index: 32
Match found at index: 33
Match found at index: 50
Match found at index: 51
Match found at index: 52


In [None]:
text = "ATGCGCCTACAATCGGTACGTCATCGCGC@@@@@TTACGGCTCGGCTCTCNNNCGGGCCTGCGCGCGCCTGATCGA"
pattern = "[^AGCT]+"
matches = re.finditer(pattern, text)

for match in matches:
    print("Match found at index:", match.start())

Match found at index: 29
Match found at index: 50


- This gives the starting position of each match, but not the ending position.

### Span
The span() method of a match object returns a tuple containing the start and end positions of the match. The end position is exclusive.

#### Here we want to find the start and end positions of any character other than A, G, C, or T
ATGCGCCTACAATCGGTACGTCATCGCGC@@@@@TTACGGCTCGGCTCTCNNNCGGGCCTGCGCGCGCCTGATCGA

In [None]:
text = "ATGCGCCTACAATCGGTACGTCATCGCGC@@@@@TTACGGCTCGGCTCTCNNNCGGGCCTGCGCGCGCCTGATCGA"
pattern = "[^AGCT]"

x = re.search(pattern, text)

print(x.span())

(29, 30)


- It gives the start and end of the first match only.
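
To get the start and end of every run, span() can be combined with finditer(). A minimal sketch, assuming the same sequence and the `[^AGCT]+` pattern; the name `runs` is illustrative.

```python
import re

text = "ATGCGCCTACAATCGGTACGTCATCGCGC@@@@@TTACGGCTCGGCTCTCNNNCGGGCCTGCGCGCGCCTGATCGA"
pattern = "[^AGCT]+"

# Collect (start, end, matched text) for each run of characters other than A, G, C, T.
runs = [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

for start, end, chunk in runs:
    print(start, end, chunk)
# 29 34 @@@@@
# 50 53 NNN
```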


### group
The group() method of a match object returns the part of the string that matched.

In [None]:
text = "ATGCGCCTACAATCGGTACGTCATCGCGC@@@@@TTACGGCTCGGCTCTCNNNCGGGCCTGCGCGCGCCTGATCGA"
pattern = "[^AGCT]+"

x = re.search(pattern, text)

print(x)
print(x.group())

<re.Match object; span=(29, 34), match='@@@@@'>
@@@@@


In [None]:
print(re.search("((a|b)c)","bcaac").groups())

('bc', 'b')
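
In the pattern above, each pair of parentheses creates a numbered group, counted by the position of its opening parenthesis. The following sketch unpacks the same match step by step; the variable names are illustrative.

```python
import re

m = re.search(r"((a|b)c)", "bcaac")

whole = m.group(0)      # the entire match
outer = m.group(1)      # first opening parenthesis: ((a|b)c)
inner = m.group(2)      # second opening parenthesis: (a|b)
all_groups = m.groups() # all numbered groups, without group 0
inner_span = m.span(2)  # position of the inner group in the string

print(whole)       # bc
print(outer)       # bc
print(inner)       # b
print(all_groups)  # ('bc', 'b')
print(inner_span)  # (0, 1)
```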


### split
- The split() function returns a list where the string has been split at each match:

In [None]:
Information = "There are 60 students in DS15 batch."
pattern = '60'
print(re.split(pattern, Information)) # split the sentence at the number 60

['There are ', ' students in DS15 batch.']


In [None]:
Date = "10-04-2022"
# \D matches any non-digit character, so the date is split at each "-"
result = re.split(r"\D", Date, maxsplit=2) # maxsplit is the maximum number of splits to perform
print(result)

['10', '04', '2022']
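
The date contains only two separators, so maxsplit=2 gives the same result as no limit. The sketch below shows a lower limit and, as an extra, what happens when the separator pattern is wrapped in a capturing group. Variable names are illustrative.

```python
import re

date = "10-04-2022"

no_limit = re.split(r"\D", date)                 # split at every non-digit
one_split = re.split(r"\D", date, maxsplit=1)    # stop after the first split
keep_separators = re.split(r"(\D)", date)        # a capturing group keeps the separators in the result

print(no_limit)         # ['10', '04', '2022']
print(one_split)        # ['10', '04-2022']
print(keep_separators)  # ['10', '-', '04', '-', '2022']
```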


### sub
The sub() function substitutes every match of a pattern with replacement text of your choice.

In [None]:
# The sub() function replaces the matches with the text of your choice:
Gene_Seq = "ATGCGCGTACAATCGGTACGTCATCGCGCGCGCTTAC"
print(re.sub("GCGCGCG", "ATTACHED", Gene_Seq))

ATGCGCGTACAATCGGTACGTCATCATTACHEDCTTAC
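
sub() also accepts a pattern rather than a fixed string, and a `count` argument that limits the number of replacements. The related function re.subn() additionally reports how many replacements were made. A small sketch with made-up sample strings:

```python
import re

noisy = "ATGC@@TTAG"  # illustrative sequence with two invalid characters

cleaned = re.sub(r"[^AGCT]", "N", noisy)        # replace every character that is not A, G, C, or T
first_only = re.sub("GC", "gc", "GCGCGC", count=1)  # replace only the first match
replaced, how_many = re.subn("GC", "gc", "GCGCGC")  # replaced string and number of replacements

print(cleaned)             # ATGCNNTTAG
print(first_only)          # gcGCGC
print(replaced, how_many)  # gcgcgc 3
```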
