# Regular Expressions in Python

Regular expressions (regex for short) are patterns for searching, matching, and manipulating text. They provide a concise and flexible way to describe the strings you want to find.

In Python, regular expressions are supported through the built-in `re` module. Here's how they are used:

**Importing the `re` module**

In [18]:
import re
import warnings
warnings.filterwarnings("ignore")

**Basic Pattern Matching**

In [4]:
text = "Hello, world"
pattern = r"Hello"
match = re.search(pattern, text)
if match:
    print(f"Match found, {match.group()}")
else:
    print("No match found")

Match found, Hello


In this example, `re.search` scans the `text` string for the first occurrence of the pattern `"Hello"`. If a match is found, it returns a match object; otherwise, it returns `None`.
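The difference between `re.search` and the related functions `re.match` and `re.fullmatch` is easy to miss, so here is a minimal sketch that reuses the same `text` string. The pattern `"world"` is chosen only for illustration.

```python
import re

text = "Hello, world"

# re.search scans the whole string for the first occurrence of the pattern.
found = re.search(r"world", text)
print(found.group(), found.span())   # world (7, 12)

# re.match only tries to match at the very beginning of the string.
print(re.match(r"world", text))      # None

# re.fullmatch requires the pattern to cover the entire string.
print(re.fullmatch(r"Hello, world", text) is not None)  # True
```

Because all three functions return `None` on failure, checking the result with `if match:` before calling `match.group()` avoids an `AttributeError`.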

## Using Regex Metacharacters

Regular expressions use metacharacters to define patterns. Some common metacharacters are:

* `.` (dot): matches any character except a newline
* `\d`: matches a digit character
* `\w`: matches a word character (letter, digit, or underscore)
* `\s`: matches a whitespace character
* `^`: matches the start of a string
* `$`: matches the end of a string
* `[]`: matches any single character inside the brackets
* `|`: matches either the expression before or after the `|`
* `*`: matches zero or more occurrences of the preceding pattern
* `+`: matches one or more occurrences of the preceding pattern
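The metacharacters listed above can be tried out on a small string. The `sample` text below is made up for illustration; each line exercises one or two of the metacharacters.

```python
import re

sample = "Order 66 shipped on 2024-05-17 to Ann_B"

digits = re.findall(r"\d+", sample)          # runs of digits
print(digits)                                # ['66', '2024', '05', '17']

with_underscore = re.findall(r"\w+_\w+", sample)  # \w also covers '_'
print(with_underscore)                       # ['Ann_B']

capitals = re.findall(r"[A-Z]", sample)      # any one character from the set
print(capitals)                              # ['O', 'A', 'B']

either = re.findall(r"to|on", sample)        # alternation
print(either)                                # ['on', 'to']

any_char = re.search(r"2024.05", sample).group()  # '.' matches the '-'
print(any_char)                              # 2024-05

print(re.search(r"^Order", sample) is not None)    # True: string starts with 'Order'
print(re.search(r"shipped$", sample) is not None)  # False: string does not end with 'shipped'

# '*' allows zero repetitions, '+' needs at least one.
print(re.fullmatch(r"ab*", "a") is not None)  # True
print(re.fullmatch(r"ab+", "a") is not None)  # False
```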

In [16]:
email_pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
email = "itsgayendiptesh@gmail.com"
if re.match(email_pattern, email):
    print("Valid email address")
else:
    print("Invalid email address")

Valid email address
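One pitfall worth knowing when writing patterns like this: inside a character class, `|` is an ordinary character and not alternation. The sketch below compares two variants of the top-level-domain part. The address `user@example.c|m` and the variable names are hypothetical, and neither pattern is a complete email validator.

```python
import re

# Inside [...], '|' is matched literally.
print(re.findall(r"[A-Z|a-z]", "a|B"))   # ['a', '|', 'B']
print(re.findall(r"[A-Za-z]", "a|B"))    # ['a', 'B']

tld_with_pipe = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}$")
tld_letters_only = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

suspicious = "user@example.c|m"
print(tld_with_pipe.match(suspicious) is not None)     # True: the pipe is accepted
print(tld_letters_only.match(suspicious) is not None)  # False
```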


## Regex in Chemistry: SMILES Tokenizer

In cheminformatics, regular expressions are often used to process chemical representations. A prominent example is the [Molecular Transformer](https://pubs.acs.org/doi/10.1021/acscentsci.9b00576) framework, where a regex is employed to tokenize SMILES strings before they are given to ML models.

SMILES encodes a molecular structure as a linear string. However, the symbols in this string are not simple characters: a single chemical unit may consist of one character (C), multiple characters (Cl), or an entire bracket expression ([Fe+2]). Therefore, splitting a SMILES string character by character breaks such units apart (for example, Cl into C and l) and misrepresents the chemistry.
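A short sketch makes the problem concrete. It compares a plain character split with a deliberately simplified pattern (`simple_pattern`, an illustrative assumption that only knows bracket expressions, `Cl`, and `Br`).

```python
import re

smiles = "ClC(Cl)Cl"  # chloroform

# Character-level split: chlorine falls apart into 'C' and 'l'.
char_level = list(smiles)
print(char_level)
# ['C', 'l', 'C', '(', 'C', 'l', ')', 'C', 'l']

# Simplified, illustrative pattern: bracket expressions, two-letter
# halogens, then any single character as a fallback.
simple_pattern = re.compile(r"\[[^\]]+\]|Cl|Br|.")

unit_level = simple_pattern.findall(smiles)
print(unit_level)
# ['Cl', 'C', '(', 'Cl', ')', 'Cl']

bracket_units = simple_pattern.findall("[Fe+2]")
print(bracket_units)
# ['[Fe+2]']
```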

The `smiles_tokenizer` function splits a SMILES string into chemically meaningful tokens. It uses a regular expression that recognizes atoms, bonds, ring indices, and special constructs, so the SMILES is split into valid chemical components and the original string can be recovered by joining the tokens.

In [19]:
def smiles_tokenizer(smiles):
    """
    Tokenize a SMILES molecule or reaction
    """
    # Raw string, so regex escapes such as \[ are not read as
    # (invalid) Python string escapes.
    pattern = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
    regex = re.compile(pattern)
    tokens = regex.findall(smiles)
    # Joining the tokens must reproduce the input; otherwise some
    # character was not covered by the pattern.
    assert smiles == ''.join(tokens)
    return ' '.join(tokens)

# List of molecules
smiles_list = [
    'C',                                    # Methane
    'CCO',                                  # Ethanol
    'CC(=O)O',                              # Acetic acid
    'c1ccccc1',                             # Benzene
    'N',                                    # Ammonia
    'S',                                    # Hydrogen sulfide
    'P',                                    # Phosphine
    'ClC(Cl)Cl',                            # Chloroform
    'C1CCCCC1',                             # Cyclohexane
    'c1ccc2ccccc2c1',                       # Naphthalene
    'C1C2CC3CC1CC(C2)C3',                   # Adamantane
    '[NH4+]',                               # Ammonium ion
    'CC(=O)[O-].[Na+]',                     # Sodium acetate
    '[NH2+]=C([NH2+])[NH2+]',               # Guanidinium
    '[2H]C([2H])([2H])[2H]',                # Deuterated methane
    '[13CH3][13CH2]O',                      # Carbon-13 labeled ethanol
    'CC(=O)OC1=CC=CC=C1C(=O)O',             # Aspirin
    'CC1(C(=O)NC(C(=O)N2C(C(=O)O)CS2)=C(O)C3=CC=CC=C3)C(=O)N(C)C(=O)N1',  # Penicillin G
    'Cl[Ir](Cl)(P(C3CCCCC3)3)=C(Cl)Cl',     # Vaska's complex
    'C1=CC=CC=C1.C=C=C=C>>C1C2=CC=CC=C2C3C=CC=CC3C1',  # Diels-Alder reaction
    '[NH3+]CC(=O)[O-].[CH3+]>>CC(=O)N.O'    # Nucleophilic substitution
]

# Test the tokenizer
for smiles in smiles_list:
    tokens = smiles_tokenizer(smiles)
    print(f"SMILES: {smiles}")
    print(f"Tokenized: {tokens}")
    print()

SMILES: C
Tokenized: C

SMILES: CCO
Tokenized: C C O

SMILES: CC(=O)O
Tokenized: C C ( = O ) O

SMILES: c1ccccc1
Tokenized: c 1 c c c c c 1

SMILES: N
Tokenized: N

SMILES: S
Tokenized: S

SMILES: P
Tokenized: P

SMILES: ClC(Cl)Cl
Tokenized: Cl C ( Cl ) Cl

SMILES: C1CCCCC1
Tokenized: C 1 C C C C C 1

SMILES: c1ccc2ccccc2c1
Tokenized: c 1 c c c 2 c c c c c 2 c 1

SMILES: C1C2CC3CC1CC(C2)C3
Tokenized: C 1 C 2 C C 3 C C 1 C C ( C 2 ) C 3

SMILES: [NH4+]
Tokenized: [NH4+]

SMILES: CC(=O)[O-].[Na+]
Tokenized: C C ( = O ) [O-] . [Na+]

SMILES: [NH2+]=C([NH2+])[NH2+]
Tokenized: [NH2+] = C ( [NH2+] ) [NH2+]

SMILES: [2H]C([2H])([2H])[2H]
Tokenized: [2H] C ( [2H] ) ( [2H] ) [2H]

SMILES: [13CH3][13CH2]O
Tokenized: [13CH3] [13CH2] O

SMILES: CC(=O)OC1=CC=CC=C1C(=O)O
Tokenized: C C ( = O ) O C 1 = C C = C C = C 1 C ( = O ) O

SMILES: CC1(C(=O)NC(C(=O)N2C(C(=O)O)CS2)=C(O)C3=CC=CC=C3)C(=O)N(C)C(=O)N1
Tokenized: C C 1 ( C ( = O ) N C ( C ( = O ) N 2 C ( C ( = O ) O ) C S 2 ) = C ( O ) C 3 = C C = C
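The `assert` in the tokenizer matters because `findall` silently skips characters that no alternative matches. The sketch below restates the same pattern as a raw string and wraps the round-trip check in a hypothetical helper, `round_trips`. The input `C!C` is an intentionally invalid string used only to trigger the failure case.

```python
import re

SMILES_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)

def round_trips(smiles):
    """Return True if joining the tokens reproduces the input exactly."""
    return "".join(SMILES_PATTERN.findall(smiles)) == smiles

# 'Br?' and 'Cl?' keep two-letter halogens together.
print(SMILES_PATTERN.findall("BrCCl"))    # ['Br', 'C', 'Cl']

# '%' followed by two digits is a single two-digit ring-closure token.
print(SMILES_PATTERN.findall("C%12CC"))   # ['C', '%12', 'C', 'C']

print(round_trips("CC(=O)[O-].[Na+]"))    # True
print(round_trips("C!C"))                 # False: '!' is dropped by findall
```

Note that the check only confirms that every character was consumed by the pattern. It does not verify that the string is chemically valid SMILES.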

This preprocessing step is crucial for tasks such as molecular property prediction, reaction prediction, and other machine learning applications in chemistry and materials science.
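As a rough illustration of what happens after tokenization, the sketch below maps space-separated tokens to integer ids, which is the form most sequence models consume. The `vocab` dictionary and the id assignment are assumptions made for this example and do not reflect how any particular framework builds its vocabulary.

```python
# Space-separated token strings, in the format returned by the tokenizer.
tokenized = ["C C O", "C C ( = O ) O", "Cl C ( Cl ) Cl"]

# Assign each new token the next free integer id, in order of appearance.
vocab = {}
for line in tokenized:
    for token in line.split():
        vocab.setdefault(token, len(vocab))

# Encode every molecule as a list of token ids.
encoded = [[vocab[token] for token in line.split()] for line in tokenized]

print(vocab)       # {'C': 0, 'O': 1, '(': 2, '=': 3, ')': 4, 'Cl': 5}
print(encoded[0])  # [0, 0, 1]
print(encoded[2])  # [5, 0, 2, 5, 4, 5]
```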