# Part 1.3 - Regular Expressions
A regular expression (shortened as regex or `RegEx`), sometimes referred to as a rational expression, is a sequence of characters that specifies a match pattern in text. Usually, such patterns are used by string-searching algorithms for "**find**" or "**find & replace**" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.

**Objectives:**

By the end of this notebook, Parham will have learned what `Regular Expressions` are and their use cases in `NLP`. He will acquire the basic concepts and syntax, such as **Quantifiers**, **Anchors**, **Groups**, etc. Subsequently, he will dive into more practical problems, such as **Matching and Extracting data**, and **Substitution and Splitting**. Last but not least, Parham will grasp some advanced techniques in the aforementioned scope, including **Named Groups**, **Non-Capturing Groups**, and **Conditional Statements**. Finally, he will challenge himself with a set of small tasks in this context.

**Table of Contents:**

1. Import Libraries
2. Introduction to Regular Expressions
3. Basic Concepts and Syntax
4. Practical Applications
5. Advanced Techniques
6. Exercises and Challenges
7. Closing Thoughts

## 1) Import Libraries

import re

## 2) Introduction to Regular Expressions

In this part of the notebook, Parham will gain basic knowledge about `RegEx`, and some of his questions will be answered. These answers will help him understand the core concepts of `RegEx` and prepare him to dive into practical tasks with more confidence.

### 1. What are Regular Expressions? 
Regular expressions (shortened as `regex` or `regexp`) are sequences of characters that define a search pattern. These patterns make it possible to search, match, and manipulate text based on specific conditions. They are integral to string-processing algorithms and are used extensively for tasks such as:

- **Finding and Replacing:** Searching for specific patterns within text and replacing them with new strings.
- **Input Validation:** Ensuring that input strings meet specific formats (e.g., email addresses, phone numbers).
- **Text Parsing and Extraction:** Extracting specific data from larger text structures, such as dates, URLs, or particular text segments.

Regular expressions are supported by most programming languages and many text-processing tools, which makes them a versatile and powerful instrument for text processing and data analysis.
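
As a rough illustration of the three use cases listed above, the following sketch uses Python's built-in `re` module. The sample sentence and the deliberately simplified email pattern are assumptions made for illustration only; real-world email validation is considerably more involved.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Contact us at info@example.com before 2024-05-17."

# Finding and replacing: mask every run of digits with "#".
masked = re.sub(r"\d+", "#", text)

# Input validation: a deliberately simplified email check
# (an assumption for this sketch, not a complete validator).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
is_valid = EMAIL.fullmatch("info@example.com") is not None

# Text parsing and extraction: pull out dates in YYYY-MM-DD form.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

print(masked)    # Contact us at info@example.com before #-#-#.
print(is_valid)  # True
print(dates)     # ['2024-05-17']
```

Note that `fullmatch` requires the whole string to match the pattern, which is usually what validation needs, whereas `findall` scans for every occurrence inside a longer text.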

### 2. How are `RegEx` useful in NLP?

Regular expressions are extremely useful in Natural Language Processing (NLP) for several reasons:

- **Text Cleaning and Preprocessing:** Regular expressions help in removing irrelevant and redundant characters, normalizing text, and preparing data for further analysis.
- **Tokenization:** Splitting text into tokens (words, sentences) based on patterns is a crucial step in many NLP tasks.
- **Pattern Matching:** Identifying specific patterns in text, such as dates, phone numbers, email addresses, etc.
- **Information Extraction:** Extracting structured information from unstructured text data.
- **Data Validation:** Ensuring that input data conforms to expected formats, such as validating email addresses and phone numbers.
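
A minimal sketch of how a few of these NLP steps might look with Python's `re` module. The input string and the patterns (a naive tag stripper, a `\w+` tokenizer, and a simplified URL pattern) are illustrative assumptions, not production-grade preprocessing.

```python
import re

# Hypothetical raw text with markup and irregular whitespace.
raw = "<p>Hello,   World!  Visit https://example.com now.</p>"

# Text cleaning: drop HTML-like tags, then collapse runs of whitespace.
no_tags = re.sub(r"<[^>]+>", "", raw)
clean = re.sub(r"\s+", " ", no_tags).strip()

# Tokenization: a naive word tokenizer that keeps runs of word characters.
tokens = re.findall(r"\w+", clean.lower())

# Information extraction: pull out URLs with a simplified pattern.
urls = re.findall(r"https?://\S+", clean)

print(clean)   # Hello, World! Visit https://example.com now.
print(tokens)  # ['hello', 'world', 'visit', 'https', 'example', 'com', 'now']
print(urls)    # ['https://example.com']
```

The tokenizer output shows a typical limitation of purely pattern-based tokenization: the URL is split into `https`, `example`, and `com`, so the order of the preprocessing steps matters.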

### 3. How does `RegEx` work at its core?

To answer this question, the user needs to understand the regex engine. A regex engine is the software component that interprets and executes regular expressions. It processes the pattern specified by the `RegEx` and performs the matching operation against the target text. Understanding how the engine works helps in writing efficient and correct regular expressions.

There are primarily two types of regex engines:

- **DFA (Deterministic Finite Automaton) Engine:**
  - **Characteristics:** 
    - Processes the input text in a single pass.
    - Always returns the longest possible match at the leftmost starting position.
    - No backtracking is involved.
  - **Advantages:**
    - Fast and efficient, especially for simple patterns.
  - **Disadvantages:**
    - Limited in features and flexibility; for example, backreferences cannot be supported.
  - **Examples:** Lexical analyzers in compilers often use DFA engines.

- **NFA (Non-deterministic Finite Automaton) Engine:**
  - **Characteristics:**
    - Can backtrack to find matches.
    - Supports richer pattern features, such as backreferences and lookarounds.
  - **Advantages:**
    - More flexible and powerful, supporting advanced regex features.
  - **Disadvantages:**
    - Can be slower due to backtracking; in the worst case, poorly written patterns take exponential time.
  - **Examples:** Most modern regex implementations, such as those in Perl, Python, and JavaScript, use backtracking (NFA-style) engines.
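
One observable consequence of this difference is how alternation behaves. Python's `re` module is a backtracking engine, so it tries alternatives from left to right and stops at the first one that leads to a match, even if a later alternative would match more text. The sketch below demonstrates this; the comparison with a longest-match engine appears only in the comments, since the standard library does not ship a DFA engine.

```python
import re

text = "ab"

# A backtracking engine takes the first alternative that succeeds.
# Here "a" succeeds immediately, so the longer "ab" is never tried.
first_wins = re.search(r"a|ab", text).group()

# Reordering the alternatives changes the result.
longer_first = re.search(r"ab|a", text).group()

print(first_wins)    # a
print(longer_first)  # ab

# A leftmost-longest engine, as described for DFA engines above,
# would report "ab" for both patterns.
```

When the order of alternatives matters, placing the longer or more specific alternative first is a common way to get the intended match from a backtracking engine.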

When you use a regular expression to search through text, the regex engine follows these general steps:

- **Compile the Pattern:**
  - The regex pattern is parsed and compiled into an internal representation that the engine can work with.

- **Search the Text:**
  - The engine begins scanning the target text from the start (or from the specified position).
  - It tries to match the pattern against the text character by character.

- **Backtracking (NFA Engines):**
  - If a partial match fails, the engine backtracks to try alternative paths in the pattern. This allows it to handle more complex matching scenarios.

- **Return the Result:**
  - Once a match is found, the engine returns the match object.
  - If no match is found, the engine returns a result indicating no match.
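
The steps above can be sketched with Python's `re` module as follows. The sample sentence and variable names are chosen for illustration; the backtracking step happens inside the engine and is not directly visible from Python code.

```python
import re

text = "coding is a cool hobby."

# Compile the pattern into an internal representation.
pattern = re.compile(r"cool")

# Search the text from the start.
match = pattern.search(text)

# Return the result: a match object on success.
print(match.group())  # cool
print(match.span())   # (12, 16)

# Search from a specified position (index 13 is past the "c" of "cool").
later = pattern.search(text, 13)
print(later)          # None

# No match anywhere: the result is None rather than a match object.
missing = re.compile(r"python").search(text)
print(missing)        # None
```

Because a failed search returns `None`, code that uses the result should check for it before calling methods such as `group()`.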

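**A worked example:**

  Suppose the user is looking for a specific pattern in a text, namely `/coding/`, in the following sentence. The slashes are only a common notation for delimiting a pattern and are not part of it; in Python, the same pattern is written as the string `r"coding"`.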

  `coding is a cool hobby.`

  Steps Followed by the Regex Engine:

  1. **Compiling the Pattern:** The pattern `/coding/` is parsed and compiled into an internal representation that the regex engine can execute.

  2. **Searching for the Pattern:** The engine scans the text from the start. In this example, the first character `c` matches the first character of the pattern, `o` matches the second, and so on, until the entire pattern is matched.

  - > In JavaScript-style notation, a `g` flag after the closing slash (`/coding/g`) requests a **global** match, meaning the engine should find all matches in the text. Without it, the engine returns only the first match. Python's `re` module has no such flag: `re.search` returns the first match, while `re.findall` and `re.finditer` return all of them.

  3. **Backtracking:** If a partial match cannot be completed, the engine abandons it. For example, when trying to match `coding` at `cool`, the engine matches `c` and `o`, but the next character in the text, `o`, does not match the third pattern character `d`. The engine therefore gives up on this starting position, moves one character forward, and continues searching.

  4. **Returning Match Objects:** Once the search is finished, the engine returns the match object(s).

  **Step-by-Step Matching**

  - The engine starts at the beginning of the text: `coding is a cool hobby.`
  - It matches `c`, `o`, `d`, `i`, `n`, `g` with the corresponding characters in `coding` and completes the pattern match.
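  - In a global search (`/coding/g`, or `re.findall` in Python), the engine continues scanning the rest of the text. It finds a partial match at `cool`, which fails at the third character, and since there are no other instances of `coding`, the search ends.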

  **Diagram Illustration**

  1. Initial Matching:

  ```
  coding is a cool hobby.
  ^
  c
  ```

  2. Continuing Match:

  ```
  coding is a cool hobby.
  ^^^^^^
  coding
  ```

  3. Backtracking (if needed):

  ```
  coding is a cool hobby.
                ^
                o (does not match "d", the third character of "coding")
  ```

  4. Completion:
  - Once all matches are found, the engine returns the results.
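
The same example can be reproduced in Python as a quick check. This is a sketch using the standard `re` module; the second sentence, `text2`, is a hypothetical addition that shows what a global search returns when the pattern occurs more than once.

```python
import re

text = "coding is a cool hobby."

# First match only, comparable to a search without the global flag.
first = re.search(r"coding", text)
print(first.span())  # (0, 6)

# All matches, comparable to a global search.
# The partial match at "cool" fails, so it does not appear here.
spans = [m.span() for m in re.finditer(r"coding", text)]
print(spans)         # [(0, 6)]

# Hypothetical second sentence with two occurrences of the pattern.
text2 = "coding is cool; coding is fun."
spans2 = [m.span() for m in re.finditer(r"coding", text2)]
print(spans2)        # [(0, 6), (16, 22)]
```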



## 3) Basic Concepts and Syntax

In this section, Parham will learn about fundamental concepts and syntax used in `RegEx` including:
- Literals and Meta-characters
- Character Classes and Sets
- Quantifiers
- Anchors
- Groups and Ranges
- Escape Sequences

### 3.1) Literals and Meta-characters
The most basic regular expression consists of a single literal character, such as `a`. It matches the first occurrence of that character in the string. If the string is `Jack is a boy`, it matches the `a` after the `J`. The fact that this `a` is in the middle of the word does not matter to the regex engine. If it matters to you, you will need to tell that to the regex engine by using [word boundaries](https://www.regular-expressions.info/wordboundaries.html). We will get to that later.
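
A short sketch of this behavior using Python's `re` module. The `\b` word-boundary syntax is introduced properly later; it appears here only to contrast with the plain literal.

```python
import re

text = "Jack is a boy"

# A single literal matches its first occurrence: the "a" inside "Jack".
literal = re.search(r"a", text)
print(literal.span())  # (1, 2)

# With word boundaries, only the standalone word "a" matches.
whole_word = re.search(r"\ba\b", text)
print(whole_word.span())  # (8, 9)
```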