# RegEx Use Cases

## 1. Finding Motifs in Genome Sequences

Let’s consider a DNA sequence `ATGCGATCGACGCTAGCGATCGCGATCGAGCGATCGCTAGCGATCGATCGCGATCG`. We can use regular expressions to search for a short sequence motif, such as `CGA`.

In [53]:
import re
seq = "ATGCGATCGACGCTAGCGATCGCGATCGAGCGATCGCTAGCGATCGATCGCGATCG"
motif = "CGA"
pattern = re.compile(motif)
matches = pattern.finditer(seq)
for match in matches:
    print(match.start(), match.end())

3 6
7 10
16 19
22 25
26 29
30 33
40 43
44 47
50 53


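In the above example, we import the `re` module and define the DNA sequence and the motif. We compile the motif with `re.compile` and call the pattern's `finditer` method, which returns an iterator of match objects (not a list) for the non-overlapping occurrences of the motif. Each printed pair is the 0-based start index and the exclusive end index of one match.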
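`finditer` reports non-overlapping matches only. That is harmless for `CGA`, which cannot overlap itself, but a motif such as `ATA` can. A minimal sketch, assuming a toy sequence and a hypothetical helper name `find_overlapping`, wraps the motif in a lookahead so the scan advances one base at a time:

```python
import re

def find_overlapping(motif, seq):
    """Return (start, end) for every occurrence of a literal motif, overlaps included.

    `find_overlapping` is a hypothetical helper name used for illustration.
    """
    # A zero-width lookahead consumes no characters, so the scan resumes at
    # the next position instead of skipping past the matched motif.
    pattern = re.compile("(?=(" + re.escape(motif) + "))")
    return [(m.start(1), m.end(1)) for m in pattern.finditer(seq)]

toy_seq = "ATATATA"  # toy sequence, not taken from the example above
plain = [(m.start(), m.end()) for m in re.finditer("ATA", toy_seq)]
overlapping = find_overlapping("ATA", toy_seq)
print(plain)        # non-overlapping matches only
print(overlapping)  # every occurrence, including overlapping ones
```

The plain search misses the occurrence that starts at index 2, because the first match has already consumed those bases.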

## 2. Finding Transcription Factor Binding Sites (TFBSs)

###### Transcription factors are protein molecules that bind to specific sites on DNA sequences to initiate transcription.

Let’s take the example of a binding site for the transcription factor USF1, represented in this example by the sequence `TCAGGTCA`. Because this motif has no variable positions, the regular expression is simply the literal string:

In [54]:
import re
seq = "GCAGTGCTCAGGTCAACAGTGCTGAGCTCAGGTCA"
motif = "TCAGGTCA"
pattern = re.compile(motif)
matches = pattern.finditer(seq)
for match in matches:
    print(match.start(), match.end())

7 15
27 35


In this example, we define a DNA sequence `seq` and a binding-site motif `motif`, then use `re.compile` and `finditer` to find every occurrence of the motif in the sequence. The `start` and `end` methods of each match object give the 0-based start index and the exclusive end index of the match.
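A binding site can lie on either DNA strand, while the search above only scans the strand as written. The following is a rough sketch, assuming a literal motif (no regex metacharacters) and hypothetical helper names `reverse_complement` and `find_both_strands`; it searches for the motif and for its reverse complement, reporting both in forward-strand coordinates:

```python
import re

# Complement table for unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string made of A, C, G, T."""
    return dna.translate(COMPLEMENT)[::-1]

def find_both_strands(motif, seq):
    """Return sorted (start, end, strand) tuples in forward-strand coordinates."""
    hits = [(m.start(), m.end(), "+") for m in re.finditer(motif, seq)]
    rc_motif = reverse_complement(motif)
    if rc_motif != motif:  # a palindromic motif would otherwise be counted twice
        hits += [(m.start(), m.end(), "-") for m in re.finditer(rc_motif, seq)]
    return sorted(hits)

seq = "GCAGTGCTCAGGTCAACAGTGCTGAGCTCAGGTCA"
hits_doc = find_both_strands("TCAGGTCA", seq)

toy_seq = "AATCAGGTCAGGTGACCTGATT"  # toy sequence with one site on each strand
hits_toy = find_both_strands("TCAGGTCA", toy_seq)
print(hits_doc)
print(hits_toy)
```

For the sequence used in this section, both hits are on the forward strand; the toy sequence adds one reverse-strand hit.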

## 3. Finding Open Reading Frames (ORFs)

###### Open reading frames (ORFs) are regions of DNA that can be translated into proteins.

A common method for identifying ORFs is to search for a start codon, such as `ATG`, followed by any number of complete codons (nucleotide triplets) in the same reading frame. The ORF ends at the first in-frame stop codon: `TAA`, `TAG`, or `TGA`.

In [78]:
import re
seq = "ATGCGATCGACGCTAGCGTAAATCGCGATCGAGCGATCGCTAGCGATCATGGATCGCGATCGTAAAGGCTACGTTGAGTCAGTAA"
start_codon = "ATG"
stop_codons = ["TAA", "TAG", "TGA"]
pattern = re.compile(start_codon + "([ATGC]{3})*?(" + "|".join(stop_codons) + ")")
matches = pattern.finditer(seq)
for match in matches:
    print(match.start(), match.end())

0 21


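In this example, we define a DNA sequence, a start codon, and the stop codons, then compile a regex pattern that matches `ATG`, the fewest possible complete codons (the lazy quantifier `*?`), and a stop codon. The `finditer` method locates all non-overlapping matches, and the `start` and `end` methods provide the indices of each ORF. Only one ORF is reported here: the second `ATG`, at index 48, has no in-frame stop codon before the sequence ends.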
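Because `finditer` never returns overlapping matches, an ORF that starts inside another ORF (in a different reading frame) is skipped. One possible workaround, sketched here with a hypothetical function name `find_all_orfs` and a made-up toy sequence, is to place the whole ORF pattern inside a lookahead so that every `ATG` is tried as a start:

```python
import re

# Same ORF pattern as above, wrapped in a zero-width lookahead.
ORF_LOOKAHEAD = re.compile(r"(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))")

def find_all_orfs(seq):
    """Return (start, end) of the ORF beginning at every ATG that has an in-frame stop."""
    return [(m.start(1), m.end(1)) for m in ORF_LOOKAHEAD.finditer(seq)]

seq = "ATGCGATCGACGCTAGCGTAAATCGCGATCGAGCGATCGCTAGCGATCATGGATCGCGATCGTAAAGGCTACGTTGAGTCAGTAA"
orfs_doc = find_all_orfs(seq)

toy_seq = "ATGAATGCCTAAGTGA"  # toy sequence: a second ATG sits inside the first ORF
orfs_plain = [(m.start(), m.end())
              for m in re.finditer(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)", toy_seq)]
orfs_toy = find_all_orfs(toy_seq)
print(orfs_doc)
print(orfs_plain)
print(orfs_toy)
```

This sketch scans the forward strand only; the reverse strand would need the reverse complement of the sequence.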

## 4. Identifying Restriction Enzyme Recognition Sites

###### Restriction enzymes are commonly used in molecular biology to cleave DNA at specific recognition sites.

In [56]:
import re
seq = "GATATCCTGACTGAACCTAGGTCCATGATTATGTACGAATTCCAGCTTTTACAAGGGTCCACTAGTCTAACAGAGGTCGCAGACGTT"
pattern = re.compile(r"(GATATC)")
matches = pattern.findall(seq)
print(matches)

['GATATC']


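In this example, we define a DNA sequence containing the EcoRV recognition site and call the compiled pattern's `findall` method with the pattern `GATATC` to get all matching sites in the sequence. Because the pattern has a single capturing group, `findall` returns the text of that group for each match.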
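`findall` returns only the matched text, which says little when the pattern is a fixed string. A short sketch, assuming a small hand-picked enzyme table and a hypothetical helper `find_sites`, reports where each recognition site starts for several enzymes at once:

```python
import re

# Recognition sequences for three common restriction enzymes.
ENZYME_SITES = {
    "EcoRV": "GATATC",
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
}

def find_sites(seq, enzyme_sites=ENZYME_SITES):
    """Map each enzyme name to the 0-based start positions of its recognition site."""
    return {
        name: [m.start() for m in re.finditer(site, seq)]
        for name, site in enzyme_sites.items()
    }

seq = "GATATCCTGACTGAACCTAGGTCCATGATTATGTACGAATTCCAGCTTTTACAAGGGTCCACTAGTCTAACAGAGGTCGCAGACGTT"
sites = find_sites(seq)
for name, positions in sites.items():
    print(name, positions)
```

The positions mark where the recognition sequence begins, not where the enzyme cuts. All three sites in the table are palindromic, so a forward-strand search covers both strands.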

## 5. Identifying Protein Motifs

###### Proteins often contain specific amino acid sequences, known as motifs, that are involved in their function or structure.

In [57]:
import re
seq = "MFDYKDDDDKGKRKLSAELGTYYTDKPKLPGDATASYQCLVTQVDIAKNTFIQTKITTGTLMYMAKSYQLFVRVKDNIIDKLVIHLLVDLVVKDDEIEFLVHAQKHFSTLKGVLITDPDNHLYEGLFDRDEMILAAIAGKSSEKQDDQVGYYCVSHRSADPKNLKYGMEMADDLSYVKYGPYHLIKMIEFPEHFRYTNLSSEKINS"
pattern = re.compile(r"(VI\w{2}L\w{2})")
matches = pattern.findall(seq)
print(matches)

['VIHLLVD']


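In this example, we define a protein sequence `seq` and create the regex pattern `VI\w{2}L\w{2}` to match sequences of the form "VIxxLxx", where "x" stands for any amino acid. Note that `\w` also matches digits and underscores, so it is looser than a class restricted to amino-acid letters; for a clean protein sequence the result is the same. The `findall` method returns all matching motifs found in the sequence, here `VIHLLVD`.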
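To make the "any amino acid" positions explicit, the wildcard can be restricted to the 20 standard one-letter codes. The sketch below is one way to do that; the fragment is an excerpt of the sequence above, and the `junk` string is made up purely to show the difference:

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

loose = re.compile(r"VI\w{2}L\w{2}")
# Doubled braces in the f-string produce the literal quantifier {2}.
strict = re.compile(f"VI[{AMINO_ACIDS}]{{2}}L[{AMINO_ACIDS}]{{2}}")

fragment = "KDNIIDKLVIHLLVDLVVKDD"  # excerpt of the protein sequence above
hits = [(m.start(), m.end(), m.group()) for m in strict.finditer(fragment)]
print(hits)

junk = "VI12L_9"  # not a protein sequence at all
print(loose.search(junk) is not None)   # \w accepts digits and underscores
print(strict.search(junk) is not None)  # the explicit class rejects them
```

Using `finditer` instead of `findall` also gives the position of each motif, which is usually needed for downstream annotation.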

## 6. Identifying Protein Domains

###### Proteins can be composed of multiple domains, which are regions of the protein that are independently folded and have specific functions.

In [58]:
import re
seq = "MQYFLFLLGLITLGESRALVFQPNCWHVLGCSWPEITLVQEPRGVLEEFFGVNPAVCKPGYTYDDSTSTNMFVGGKLTIKTTEKGYGYEIGPRIYEISRGYGTDEGAQFLQAKSHTLHKYDSFIELPIDGVKRTQEHQIARWWGTPVIPSSAGGDADIGLGLGETGSIMVITAGASESRITLAPGLVEEAVFDGIIKGAFAGIDSSVMLLGGDYVVL"
pattern = re.compile(r"(SR[AG])")
matches = pattern.findall(seq)
print(matches)

['SRA', 'SRG']


In this example, we define a protein sequence `seq` and use the regex pattern `SR[AG]`, which matches `SRA` or `SRG`, to find occurrences of these short serine-arginine motifs. The `findall` method returns all matches in the sequence.
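A character class and an alternation can express the same set of motifs, but only if the alternatives are spelled out correctly. A tiny sketch on a made-up toy sequence shows which forms are equivalent:

```python
import re

toy = "ESRALSRGTSRS"  # toy sequence, made up for illustration

with_class = re.findall(r"SR[AG]", toy)        # SRA or SRG
with_alternation = re.findall(r"SRA|SRG", toy)  # same set, written out
different = re.findall(r"SRA|SRS", toy)         # SRA or SRS: a different set

print(with_class)
print(with_alternation)
print(different)
```

The first two patterns always return the same matches; the third one picks up `SRS` and misses `SRG`.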

## 7. Identifying MicroRNA Target Sites

###### MicroRNAs are small non-coding RNAs that regulate gene expression by binding to target sequences in messenger RNAs.

In [60]:
import re
seq = "ATGCTGAGCTGCATGAGATGGAGTGACCATCCTGTAGCTCACAGGATTTCCAGTGTTGTACCTGGGAGACTGGTGGGAAGGCCACAGGAACTCAAGGTATGGGGAGCATCTCATGGGCCTCCAAGTGATTAAGGACCTCTGGTGTGGCCTGCCCAAGTACCCATGGTGTTGGAGACCTGGAAGTCTTCAAGACAGAAGTGCTTGTCTCTTAA"
pattern = re.compile(r"(TG\w{6}CA)")
matches = pattern.findall(seq)
print(matches)

['TGGGCCTCCA']


In this example, we define a DNA sequence `seq` and use the regex pattern `TG\w{6}CA` to match potential miRNA target sites in which `TG` is followed by any six characters and then `CA`. The `findall` method retrieves all non-overlapping matches.
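When restricting the six middle positions to nucleotides, the class has to list the bases explicitly. A range such as `[A-T]` looks similar but covers every letter from A through T. The following sketch illustrates the difference; the fragment is an excerpt of the sequence above:

```python
import re

range_class = re.compile(r"[A-T]")     # every uppercase letter from A to T
explicit_class = re.compile(r"[ACGT]")  # exactly the four DNA bases

letters = "ACGTNRU"
range_hits = [c for c in letters if range_class.fullmatch(c)]
explicit_hits = [c for c in letters if explicit_class.fullmatch(c)]
print(range_hits)
print(explicit_hits)

fragment = "CATGGGCCTCCAAG"  # excerpt of the sequence above
sites = re.findall(r"TG[ACGT]{6}CA", fragment)
print(sites)
```

On a sequence that contains only A, C, G, and T, `TG[ACGT]{6}CA` and `TG\w{6}CA` give the same matches; the explicit class simply refuses anything else.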

## 8. Identifying RNA Secondary Structure

###### RNA molecules can fold into complex secondary structures that are important for their function.

In [65]:
import re
seq = "CACGCCGGGUCCACUGUACCAGGUAUCAGUGGAGGCGAAGCGCGCCUUGAAACAGCUGCGUAAAGCUUUCGUUUUUAAGCGU"
pattern = re.compile(r"((?:G|C){3,}|(?:A|U){3,})")
matches = pattern.findall(seq)
print(matches)

['CGCCGGG', 'UAU', 'GGCG', 'GCGCGCC', 'AAA', 'GCG', 'UAAA', 'UUU', 'UUUUUAA', 'GCG']


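We define an RNA sequence `seq` and use the regex pattern `((?:G|C){3,}|(?:A|U){3,})` to match runs of three or more consecutive G/C bases or three or more consecutive A/U bases. Such runs are at most crude hints of where stems or loops might form, since the pattern does not test whether two regions are complementary. The `findall` method retrieves all matching runs in the sequence.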
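Python's `re` module cannot express "this region pairs with that one", so a check for complementarity needs ordinary code. The sketch below is a deliberately naive illustration, not a structure-prediction method: the function name `find_hairpins` and its parameters are assumptions, it allows only Watson-Crick pairs (no G-U wobble), and it ignores folding energy.

```python
def find_hairpins(rna, stem_len=4, min_loop=3, max_loop=8):
    """Return (start, stem, loop, partner) for simple hairpin candidates.

    A candidate is a stem of `stem_len` bases, a loop of `min_loop` to
    `max_loop` bases, and then the exact reverse complement of the stem.
    """
    pair = str.maketrans("ACGU", "UGCA")  # Watson-Crick partners only
    hits = []
    for i in range(len(rna) - 2 * stem_len - min_loop + 1):
        stem = rna[i:i + stem_len]
        partner = stem.translate(pair)[::-1]  # what the 3' side must read
        for loop_len in range(min_loop, max_loop + 1):
            j = i + stem_len + loop_len  # start of the 3' side of the stem
            if rna[j:j + stem_len] == partner:
                hits.append((i, stem, rna[i + stem_len:j], partner))
    return hits

toy_rna = "GGGCAAAAGCCC"  # toy sequence designed to form one hairpin
candidates = find_hairpins(toy_rna)
print(candidates)
```

Dedicated RNA folding tools score many alternative structures; this sketch only shows the kind of pairing test that a character-run regex leaves out.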

## 9. Identifying Conserved Protein Motifs

In [66]:
import re
seq = "MCDPALVRYKSIELRDDKGPLVLYLSQGRRSGVLGLVRFSSLGGNMQGRKNLISENNNSYWYRSFEVKSRLDLDAASGIFVHLGDSQEAPFPTGLLVQNTIIFKKLGGSAHAFYNTYDWDITQELIDGVIACSRGHNEAWHKLW"
pattern = re.compile(r"(L.{1,3}L.{1,3}L)")
matches = pattern.findall(seq)
print(matches)

['LVLYL']


We define a protein sequence `seq` and use the regex pattern `(L.{1,3}L.{1,3}L)` to match motifs in which three leucines are each separated by one to three arbitrary residues, retrieving all matches with `findall`.
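Whether a motif is conserved can only be judged across several sequences. As a rough sketch, the same pattern can be applied to a set of sequences to see which ones contain it. The three short sequences and their names below are hypothetical, made up for illustration (the first is an excerpt of the sequence above):

```python
import re

pattern = re.compile(r"L.{1,3}L.{1,3}L")

# Hypothetical toy sequences; real input would come from an alignment or a FASTA file.
sequences = {
    "seq_a": "GPLVLYLSQ",
    "seq_b": "MKLAALGGLT",
    "seq_c": "MKTAYIAKQR",
}

found = {}
for name, protein in sequences.items():
    match = pattern.search(protein)
    # Store the first matching motif, or None when the pattern is absent.
    found[name] = match.group() if match else None

print(found)
conserved = all(motif is not None for motif in found.values())
print(conserved)
```

Here the pattern occurs in two of the three toy sequences, so under this simple criterion it would not count as conserved across the set.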