<img src='https://www.di.uniroma1.it/sites/all/themes/sapienza_bootstrap/logo.png' width="200"/> 

# Part 1.3 - Regular Expressions
A regular expression (shortened as regex or `RegEx`), sometimes referred to as a rational expression, is a sequence of characters that specifies a match pattern in text. Usually, such patterns are used by string-searching algorithms for "**find**" or "**find & replace**" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.
 
### **Objectives:**

By the end of this notebook, Parham will have learned what `Regular Expressions` are and their use cases in `NLP`. He will acquire the basic concepts and syntax, such as **Quantifiers**, **Anchors**, **Groups**, etc. Subsequently, he will dive into more practical problems, such as **Matching and Extracting data**, and **Substitution and Splitting**. Last but not least, Parham will grasp some advanced techniques in the aforementioned scope, including **Named Groups**, **Non-Capturing Groups**, and **Conditional Statements**. Finally, he will challenge himself with a set of small tasks in this context.

### **References**: 
- [https://docs.python.org/3/library/re.html](https://docs.python.org/3/library/re.html)
- [https://www.geeksforgeeks.org/regular-expression-python-examples/](https://www.geeksforgeeks.org/regular-expression-python-examples/)
### **Tutors**:
- Professor Stefano Faralli
    - <img src="https://upload.wikimedia.org/wikipedia/commons/7/7e/Gmail_icon_%282020%29.svg" alt="Logo" width="20" height="20"> **Email**: Stefano.faralli@uniroma1.it
    - <img src="https://www.iconsdb.com/icons/preview/red/linkedin-6-xxl.png" alt="Logo" width="20" height="20"> **LinkedIn**: [LinkedIn](https://www.linkedin.com/in/stefano-faralli-b1183920/) 
- Professor Iacopo Masi
    - <img src="https://upload.wikimedia.org/wikipedia/commons/7/7e/Gmail_icon_%282020%29.svg" alt="Logo" width="20" height="20"> **Email**: masi@di.uniroma1.it  
    - <img src="https://www.iconsdb.com/icons/preview/red/linkedin-6-xxl.png" alt="Logo" width="20" height="20"> **LinkedIn**: [LinkedIn](https://www.linkedin.com/in/iacopomasi/)  
    - <img src="https://upload.wikimedia.org/wikipedia/commons/a/ae/Github-desktop-logo-symbol.svg" alt="Logo" width="20" height="20"> **GitHub**: [GitHub](https://github.com/iacopomasi)  

### **Contributors**:
- Parham Membari
    - <img src="https://upload.wikimedia.org/wikipedia/commons/7/7e/Gmail_icon_%282020%29.svg" alt="Logo" width="20" height="20"> **Email**: p.membari96@gmail.com
    - <img src="https://www.iconsdb.com/icons/preview/red/linkedin-6-xxl.png" alt="Logo" width="20" height="20"> **LinkedIn**: [LinkedIn](https://www.linkedin.com/in/p-mem/)
    - <img src="https://upload.wikimedia.org/wikipedia/commons/a/ae/Github-desktop-logo-symbol.svg" alt="Logo" width="20" height="20"> **GitHub**:  [GitHub](https://github.com/parham075)
    - <img src="https://upload.wikimedia.org/wikipedia/commons/e/ec/Medium_logo_Monogram.svg" alt="Logo" width="20" height="20"> **Medium**: [Medium](https://medium.com/@p.membari96)

**Table of Contents:**

1. Import Libraries
2. Introduction to Regular Expressions
3. Basic Concepts and Syntax
4. Practical Applications
5. Exercises and Challenges
6. Closing Thoughts

## 1. Import Libraries

In [None]:
# @title
import re
from loguru import logger
import os
import sys
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize

# Tell NLTK where to look for locally downloaded data
nltk.data.path.append(os.path.join(os.getcwd(), "datasets", "nltk_data"))

## 2. Introduction to Regular Expressions

In this part of the notebook, Parham will gain basic knowledge about `RegEx`, and some of his questions will be answered. These answers will help him understand the core concepts of `RegEx` and prepare him to dive into practical tasks with more confidence.

### 1. What are Regular Expressions? 
Regular expressions (shortened as `regex` or `regexp`) are sequences of characters that define a search pattern. Operations such as searching, matching, and manipulating text based on specific conditions can be performed with these patterns. Regular expressions are integral to string-processing algorithms and are used extensively for tasks such as:

- **Finding and Replacing:** Searching for specific patterns within text and replacing them with new strings.
- **Input Validation:** Ensuring that input strings meet specific formats (e.g., email addresses, phone numbers).
- **Text Parsing and Extraction:** Extracting specific data from larger text structures, such as dates, URLs, or particular text segments.

Regular expressions are supported by various programming languages and tools, making them a versatile and powerful tool in text processing and data analysis.
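
The three tasks listed above can be sketched with Python's built-in `re` module. The sample sentence and the simplified date and URL patterns are illustrative assumptions, not production-grade validators.

```python
import re

# Hypothetical sample sentence used only for illustration
text = "Contact us at info@example.com or visit https://example.com before 2024-05-20."

# Finding and replacing: mask every digit with '#'
masked = re.sub(r"\d", "#", text)

# Input validation: check that a whole string looks like a YYYY-MM-DD date
is_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2024-05-20") is not None

# Text extraction: pull out every URL-like substring
urls = re.findall(r"https?://\S+", text)

print(masked)
print(is_date)
print(urls)
```

`re.fullmatch` is used for validation because it succeeds only when the entire string fits the pattern, whereas `re.search` would also accept a date embedded in a longer string.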

### 2. How are `RegEx` useful in NLP?

Regular expressions are extremely useful in Natural Language Processing (NLP) for several reasons:

- **Text Cleaning and Preprocessing:** Regular expressions help in removing irrelevant and redundant characters, normalizing text, and preparing data for further analysis.
- **Tokenization:** Splitting text into tokens (words, sentences) based on patterns is a crucial step in many NLP tasks.
- **Pattern Matching:** Identifying specific patterns in text, such as dates, phone numbers, email addresses, etc.
- **Information Extraction:** Extracting structured information from unstructured text data.
- **Data Validation:** Ensuring that input data conforms to expected formats, such as validating email addresses and phone numbers.
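
As a rough illustration of the cleaning and tokenization points above, the sketch below removes URLs, collapses whitespace, and splits the result into word tokens. The input string is made up, and `\w+` is only one simple tokenization rule among many.

```python
import re

# Hypothetical noisy input
raw = "Hello!!!   This is   a TEST... visit http://example.org :)"

# Text cleaning: drop URLs, then collapse runs of whitespace into one space
no_urls = re.sub(r"https?://\S+", "", raw)
cleaned = re.sub(r"\s+", " ", no_urls).strip()

# Tokenization: keep runs of word characters as tokens (punctuation is discarded)
tokens = re.findall(r"\w+", cleaned.lower())

print(cleaned)
print(tokens)
```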

### 3. How does `RegEx` work at its core?

To answer this question, Parham needs to understand the `Regex Engine`. A regex engine is the software component that interprets and executes regular expressions. It processes the pattern specified by the `RegEx` and performs the matching operation against the target text. Understanding how the regex engine works helps in writing efficient and effective regular expressions. 

There are primarily two types of regex engines:

- **DFA (Deterministic Finite Automaton) Engine:**
  - **Characteristics:** 
    - Processes the input text in a single pass.
    - Always finds the longest possible match.
    - No backtracking is involved.
  - **Advantages:**
    - Fast and efficient, especially for simple patterns.
  - **Disadvantages:**
    - Limited in terms of features and flexibility.
  - **Examples:** Lexical analyzers in compilers often use DFA engines.

- **NFA (Non-deterministic Finite Automaton) Engine:**
  - **Characteristics:**
    - Can backtrack to find matches.
    - Supports features that a pure DFA cannot provide, such as backreferences and lookarounds (and, in some engines such as Perl's, recursive patterns).
  - **Advantages:**
    - More flexible and powerful, supporting advanced regex features.
  - **Disadvantages:**
    - Can be slower due to backtracking, especially with complex patterns.
  - **Examples:** Most modern regex implementations, such as those in Perl, Python, and JavaScript, use NFA engines.
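
The behaviour of a backtracking engine can be observed directly in Python. In the sketch below, the alternatives of `a|ab` are tried from left to right and the first one that succeeds wins, whereas a leftmost-longest (DFA-style) engine would be expected to report `ab`. The patterns and input strings are illustrative.

```python
import re

# Alternatives are tried left to right; the first successful one is kept
first = re.search(r"a|ab", "ab")
print(first.group())   # a

# Reordering the alternatives changes the result
second = re.search(r"ab|a", "ab")
print(second.group())  # ab

# Greedy '.*' first consumes the rest of the string, then backtracks
# character by character until the final 'b' can match
greedy = re.search(r"a.*b", "aXbYc")
print(greedy.group())  # aXb
```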

When you use a regular expression to search through text, the regex engine follows these general steps:

- **Compile the Pattern:**
  - The regex pattern is parsed and compiled into an internal representation that the engine can work with.

- **Search the Text:**
  - The engine begins scanning the target text from the start (or from the specified position).
  - It tries to match the pattern against the text character by character.

- **Backtracking (NFA Engines):**
  - If a partial match fails, the engine backtracks to try alternative paths in the pattern. This allows it to handle more complex matching scenarios.

- **Return the Result:**
  - Once a match is found, the engine returns the match object.
  - If no match is found, the engine returns a result indicating no match.
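
The steps above map onto Python's `re` API roughly as follows. This is a minimal sketch that reuses the sentence from the example in this section; the starting position `13` is an arbitrary choice for illustration.

```python
import re

# 1. Compile the pattern into an internal representation
pattern = re.compile(r"cool")

text = "coding is a cool hobby."

# 2. Search the text from the start
match = pattern.search(text)

# 3. A successful search returns a Match object
print(match.group(), match.span())  # cool (12, 16)

# Searching from a later position: no match is left, so the result is None
later = pattern.search(text, 13)
print(later)  # None
```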

**Explain with an example:**

  Suppose the user is looking for a specific pattern in a text, namely `/coding/g` (written in the slash-delimited notation used by Perl and JavaScript), in the following sentence:

  `coding is a cool hobby.`

  Steps Followed by the Regex Engine:

  0. **Encoding the Pattern:** The pattern `/coding/g` is parsed into tokens that can be interpreted by the regex engine.
    
  1. **Searching for the Pattern:** The engine begins to search for the pattern in the text. In this example, the first character `c` is the first matched candidate, followed by `o` as the second, and so on, until the entire pattern is matched. 
  
  - > The `g` after the closing slash stands for a **global** match, indicating that the engine should find all matches in the text. Without the global flag, the regex engine would return only the first match. Python has no `g` flag: `re.search` returns the first match, while `re.findall` and `re.finditer` return all of them.

  2. **Backtracking:** If a partial match is found but cannot be completed, the engine gives up on that attempt. For example, when trying `coding` at the start of `cool`, the engine matches `c` and `o`, but the third character of `cool` is `o`, which does not match the `d` expected by the pattern. The engine therefore abandons this attempt, moves to the next starting position in the text, and continues searching.

  3. **Returning Match Objects:** Once all global matches are found, the engine returns the match object(s).

  **Step-by-Step Matching**

  - The engine starts at the beginning of the text: `coding is a cool hobby.`
  - It matches `c`, `o`, `d`, `i`, `n`, `g` with the corresponding characters in `coding` and completes the pattern match.
  - The global flag `/g` ensures the engine continues searching through the text, but since there are no other instances of `coding`, the search ends.

  **Diagram Illustration**

  1. Initial Matching:

  ```
  coding is a cool hobby.
  ^
  c
  ```

  2. Continuing Match:

  ```
  coding is a cool hobby.
  ^^^^^^
  coding
  ```

  3. Backtracking (if needed):

  ```
  coding is a cool hobby.
                ^
                o (the pattern expects "d" here, so the attempt fails)
  ```

  4. Completion:
  - Once all matches are found, the engine returns the results.
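
In Python the global flag described above does not exist; the choice between "first match" and "all matches" is made by picking a function. The sketch below assumes a slightly longer sentence (not from the original example) so that the pattern occurs twice.

```python
import re

# Hypothetical sentence with two occurrences of the pattern
text = "coding is a cool hobby, and coding is fun."

# Equivalent of a non-global search: only the first match
first_only = re.search(r"coding", text)
print(first_only.span())  # (0, 6)

# Equivalent of a global search: every non-overlapping match
all_matches = re.findall(r"coding", text)
positions = [m.start() for m in re.finditer(r"coding", text)]
print(all_matches)  # ['coding', 'coding']
print(positions)    # [0, 28]
```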



## 3. Basic Concepts and Syntax

In this section, Parham will learn about fundamental concepts and syntax used in `RegEx` including:
- Literals and Meta-characters
- Character Classes and Sets
- Quantifiers
- Anchors
- Groups and Ranges
- Escape Sequences

### 3.1) Literals and Meta-characters


#### 3.1.1) Literals
The most basic regular expression consists of a single literal character, such as `a`. It matches the first occurrence of that character in the string. If the string is `Jack is a boy`, it matches the `a` after the `J`. The fact that this `a` is in the middle of the word does not matter to the regex engine. If it matters to you, you will need to tell that to the regex engine by using [word boundaries](https://www.regular-expressions.info/wordboundaries.html). We will get to that later.
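
A brief sketch of the difference, using the same sentence: without word boundaries the literal `a` matches inside `Jack`, while `\b` on both sides restricts the match to the standalone word `a`.

```python
import re

text = "Jack is a boy"

# Plain literal: the first 'a' is the one inside "Jack"
anywhere = re.search(r"a", text)
print(anywhere.start())    # 1

# With word boundaries: only the standalone word "a" qualifies
whole_word = re.search(r"\ba\b", text)
print(whole_word.start())  # 8
```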

>The re module offers a set of functions that allows us to search a string for a match:
> 
>| Function  | Description                                                   |
>|-----------|---------------------------------------------------------------|
>| findall   | Returns a list containing all matches                         |
>| search    | Returns a Match object if there is a match anywhere in the string |
>| split     | Returns a list where the string has been split at each match  |
>| sub       | Replaces one or many matches with a string                    |
>| str.find  | String method (not an `re` function): returns the lowest index of the substring, or -1 if it is not found |
>| finditer  | Returns an iterator yielding Match objects for all matches    |



In [2]:
# Define the text to search
text = "Jack is a boy"

# Define the pattern to search for
pattern = r"a"

# Use re.search to find the first match of the pattern in the text
match = re.search(pattern, text)

# Check if a match is found and print the matched text
if match:
    print(f"Pattern: '{match[0]}' is found")
else:
    print("No match found.")

Pattern: 'a' is found
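
The other functions from the table can be tried on the same sentence. This is a small sketch for orientation; note that `find` is a method of `str` rather than a function of the `re` module, and it takes a plain substring instead of a pattern.

```python
import re

text = "Jack is a boy"

# findall: list of all matches (here, every lowercase vowel)
vowels = re.findall(r"[aeiou]", text)
print(vowels)   # ['a', 'i', 'a', 'o']

# split: cut the string at each run of whitespace
words = re.split(r"\s+", text)
print(words)    # ['Jack', 'is', 'a', 'boy']

# sub: replace each match with a new string
replaced = re.sub(r"boy", "student", text)
print(replaced)  # Jack is a student

# finditer: iterate over Match objects to read positions
spans = [m.span() for m in re.finditer(r"\w+", text)]
print(spans)    # [(0, 4), (5, 7), (8, 9), (10, 13)]

# str.find: plain substring search on the string itself
index = text.find("boy")
print(index)    # 10
```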


**Exercise 1:**
  
Given the following text, find all occurrences of the letter `a`, and count how many times it appears before the first occurrence of the word `apple`. Return both the matches and their positions.

In [3]:
text = """In a faraway land, 
        there was a magical forest where apple treees grew abundantly. 
        One sunny afternoon, 
        a young adventurer found herself wandering among the apple treees. 
        She marveled at the vibrant colors and sweet scents. 
        The apples seemed to be glowing in the sunlight, 
        and she picked a few to enjoy later. 
        As she continued her journey, 
        she found that the apples had a mysterious quality to them."""
pattern = r"a"
len(text)

470

In [4]:
# @title 🧑🏿‍💻 Your code here

In [5]:
# @title 👀 Solution

# Find the position of the first occurrence of the word 'apple'
apple_position = text.find("apple")

# Use re.finditer to find all matches of the pattern in the text
matches = re.finditer(pattern, text)

# Initialize the count of 'a' before 'apple'
count_before_apple = 0

# Iterate over the matches and print their positions
for match in matches:
    position = match.start()
    # Increment the count if the position is before 'apple'
    if position < apple_position:
        print(f"Matched text: {match[0]}, Location: {position}")
        count_before_apple += 1

print(f"Count of 'a' before the first occurrence of 'apple': {count_before_apple}")

Matched text: a, Location: 3
Matched text: a, Location: 6
Matched text: a, Location: 8
Matched text: a, Location: 10
Matched text: a, Location: 14
Matched text: a, Location: 35
Matched text: a, Location: 38
Matched text: a, Location: 41
Matched text: a, Location: 45
Count of 'a' before the first occurrence of 'apple': 9


**Exercise 2**: 

Find all occurrences of the letter `a`, and count how many times it appears between the first and second occurrences of the word `apple`. Return the matches, their positions, and the count of `a` characters between the two occurrences of `apple`.

In [6]:
# @title 🧑🏿‍💻 Your code here

In [7]:
# @title 👀 Solution
# Find the positions of the first and second occurrences of the word 'apple'
apple_positions = [m.start() for m in re.finditer(r"apple", text)]


first_apple_position = apple_positions[0] + len("apple")
second_apple_position = apple_positions[1]

# Use re.finditer to find all matches of the pattern in the text
matches = re.finditer(pattern, text)

# Initialize the count of 'a' between the two 'apple'
count_between_apples = 0

# Iterate over the matches and count 'a' between the two 'apple'
for match in matches:
    position = match.start()
    if first_apple_position <= position < second_apple_position:
        count_between_apples += 1
        print(f"Matched text: {match[0]}, Location: {position}")

print(
    f"Count of 'a' between the first and second occurrences of 'apple': {count_between_apples}"
)

Matched text: a, Location: 79
Matched text: a, Location: 84
Matched text: a, Location: 110
Matched text: a, Location: 130
Matched text: a, Location: 138
Matched text: a, Location: 164
Matched text: a, Location: 173
Count of 'a' between the first and second occurrences of 'apple': 7


#### 3.1.2) Meta-characters

Meta-characters are special characters in regular expressions that have specific meanings and functions. They help define complex patterns and provide powerful matching capabilities. Here are some common meta-characters and their meanings:

| Symbol                      | Description                                       | Pattern  | Text Examples               | Matches        |
|-----------------------------|---------------------------------------------------|----------|-----------------------------|-----------------|
| **Dot (`.`)**               | Matches any single character except a newline.   | `c.t`    | `cat`, `cot`, `cut`         | Matches all     |
| **Caret (`^`)**             | Matches the start of a string.                   | `^The`   | `The cat`, `A cat`          | `The cat`       |
| **Dollar (`$`)**            | Matches the end of a string.                     | `cat$`   | `The cat`, `The cat is here`| `The cat`       |
| **Asterisk (`*`)**          | It specifies that the preceding element may occur zero or more times. | `ca*t`   | `ct`, `cat`, `caat`         | Matches all     |
| **Plus (`+`)**              | Matches one or more of the preceding character.  | `ca+t`   | `cat`, `caat`, `ct`         | `cat`, `caat`   |
| **Question Mark (`?`)**     | Matches zero or one of the preceding character.  | `ca?t`   | `cat`, `ct`                 | `cat`, `ct`     |
| **Braces (`{n,m}`)**        | Matches between `n` and `m` occurrences of the preceding character. | `a{2,4}` | `aa`, `aaa`, `aaaa` | Matches all     |
| **Square Brackets (`[]`)**  | Matches any one of the characters inside the brackets. | `[abc]` | `a`, `b`, `c`            | Matches all     |
| **Pipe (`\|`)**             | Acts as an OR operator.                         | `cat\|dog` | `cat`, `dog`               | Matches either  |
| **Parentheses (`()`)**      | Groups patterns and captures the matched text.  | `(ab)+`  | `ab`, `abab`               | Matches both    |




In [8]:
# Dot (.)
logger.info("Example for Dot (.)")
pattern = r"a.d"
matches = re.findall(pattern, text)
print(f"Dot matches: {matches if matches else 'No match'}")
sys.stdout.flush()
# Caret (^)
logger.info("Example for Caret (^)")
pattern = r"^I"
match = re.search(pattern, text)
print(f"Caret match: {match[0] if match else 'No match'}")
sys.stdout.flush()
# Dollar ($)
logger.info("Example for Dollar ($)")
pattern = r"them\.$"
match = re.search(pattern, text)
print(f"Dollar match: {match[0] if match else 'No match'}")
sys.stdout.flush()
# Asterisk (*)
logger.info("Example for Asterisk (*)")
pattern = r"ap*"
matches = re.findall(pattern, text)
print(f"Asterisk matches: {matches}")
sys.stdout.flush()

# Plus (+)
logger.info("Example for Plus (+)")
pattern = r"s+"
matches = re.findall(pattern, text)
print(f"Plus matches: {matches}")
sys.stdout.flush()
# Question Mark (?) make the previous character optional
logger.info("Example for Question Mark (?)")
pattern = r"t.?e"  # find zero/one char between "t" and the "e"
matches = re.findall(pattern, text)
print(f"Question Mark matches: {matches}")
sys.stdout.flush()

logger.info("Example for Braces (`{n,m}`)")
# Define the pattern to match sequences of 1 to 3 consecutive 'p' characters
pattern = r"p{1,3}"
# Find all matches of the pattern
matches = re.finditer(pattern, text)
# Print each match together with its start and end positions
for match in matches:
    # Find the start and end indices of the match
    start_index = match.start()
    end_index = match.end()
    print(
        f"Match: '{text[start_index:end_index]}' found in positions {start_index}:{end_index}"
    )

sys.stdout.flush()
logger.info("Example for Square Brackets (`[]`)")
txt = "The rain in Rome"
# Find all lower case characters alphabetically between "a" and "m":
matches = re.findall("[a-m]", txt)
print(matches)
sys.stdout.flush()

logger.info("Example for Pipe (`|`)")
pattern = r"tree|there"
matches = re.findall(pattern, text)
if matches:
    print(matches)
    print(f"Yes, there is/are {len(matches)} match(es)!")
else:
    print("No match")
sys.stdout.flush()
logger.info("Example for Parentheses (`()`)")

# Sample sentence containing a race time
txt = "The runner finished the race in 2 hours and 30 minutes."

# Match the word "race" followed by "in", the hours, and the minutes;
# the final dot is escaped so that it matches a literal "."
pattern = r"(race) in (\d+ hours) and (\d+ minutes\.)"

# Find all matches of the pattern
matches = re.finditer(pattern, txt)

# Print the matches and the groups
for match in matches:
    # Print the entire match
    print(f"Match: '{match.group(0)}'")
    # Print the captured groups
    print(f"Group 1 (race): '{match.group(1)}'")
    print(f"Group 2 (time): '{match.group(2)}'")
    print(f"Group 3 (time): '{match.group(3)}'")

2025-03-12 12:44:24.834 | INFO     | __main__:<module>:2 - Example for Dot (.)

Dot matches: ['and', 'and', 'and', 'and']

2025-03-12 12:44:24.836 | INFO     | __main__:<module>:8 - Example for Caret (^)

Caret match: I

2025-03-12 12:44:24.839 | INFO     | __main__:<module>:14 - Example for Dollar ($)

Dollar match: them.

2025-03-12 12:44:24.842 | INFO     | __main__:<module>:20 - Example for Asterisk (*)

Asterisk matches: ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'app', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'app', 'a', 'a', 'a', 'a', 'app', 'a', 'a', 'a', 'a', 'app', 'a', 'a', 'a']

2025-03-12 12:44:24.843 | INFO     | __main__:<module>:27 - Example for Plus (+)

Plus matches: ['s', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's']

2025-03-12 12:44:24.845 | INFO     | __main__:<module>:33 - Example for Question Mark (?)

Question Mark matches: ['the', 'tre', 'te', 'the', 'tre', 'the', 'the', 'te', 'the', 'te', 'the']

2025-03-12 12:44:24.847 | INFO     | __main__:<module>:39 - Example for Braces (`{n,m}`)

Match: 'pp' found in positions 62:64
Match: 'pp' found in positions 184:186
Match: 'pp' found in positions 273:275
Match: 'p' found in positions 334:335
Match: 'pp' found in positions 431:433

2025-03-12 12:44:24.849 | INFO     | __main__:<module>:54 - Example for Square Brackets (`[]`)

['h', 'e', 'a', 'i', 'i', 'm', 'e']

2025-03-12 12:44:24.851 | INFO     | __main__:<module>:61 - Example for Pipe (`|`)

['there', 'tree', 'tree']
Yes, there is/are 3 match(es)!

2025-03-12 12:44:24.854 | INFO     | __main__:<module>:70 - Example for Parentheses (`()`)


Match: 'race in 2 hours and 30 minutes.'
Group 1 (race): 'race'
Group 2 (time): '2 hours'
Group 3 (time): '30 minutes.'
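
Because meta-characters carry special meaning, matching them literally requires escaping, either with a backslash or with `re.escape`. The sketch below uses made-up strings to show the difference between an unescaped and an escaped dot.

```python
import re

sample = "3.50 and 3x50"

# Unescaped dot: matches any character, so '3x50' is also found
unescaped = re.findall(r"3.50", sample)
print(unescaped)  # ['3.50', '3x50']

# Escaped dot: matches only a literal '.'
escaped = re.findall(r"3\.50", sample)
print(escaped)    # ['3.50']

# re.escape builds a pattern that matches an arbitrary string literally
text = "Total: 3.50 USD (approx.)"
literal_pattern = re.escape("(approx.)")
found = re.search(literal_pattern, text)
print(found.group())  # (approx.)
```

`re.escape` is convenient when the text to search for comes from user input or from data, where meta-characters cannot be escaped by hand.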


Exercise: Write a Python script to extract and format phone number information from a given text using regular expressions. The result must be the extracted information including country names, country codes, and phone numbers.

```
txt = Dario called his mom in the Italy at +39-335-880-5661.
Romina called her dad in the Iran at +98-912-052-0182.
John called his friend in the USA at +1-202-555-0173.
Mary contacted her colleague in the UK at +44-20-7946-0958.
Ravi made a call to his family in India at +91-22-6789-1234.
Anna dialed her cousin in Australia at +61-2-9876-5432.
Carlos phoned his business partner in Brazil at +55-11-2345-6789.
```


In [9]:
# @title 🧑🏿‍💻 Your code here
txt = """
Dario called his mom in the Italy at +39-335-880-5661.
Romina called her dad in the Iran at +98-912-052-0182.
John called his friend in the USA at +1-202-555-0173.
Mary contacted her colleague in the UK at +44-20-7946-0958.
Ravi made a call to his family in India at +91-22-6789-1234.
Anna dialed her cousin in Australia at +61-2-9876-5432.
Carlos phoned his business partner in Brazil at +55-11-2345-6789.

"""

In [10]:
# @title 👀 Solution
# Text with multiple phone numbers including country codes
txt = """
Dario called his mom in the Italy at +39-335-880-5661.
Romina called her dad in the Iran at +98-912-052-0182.
John called his friend in the USA at +1-202-555-0173.
Mary contacted her colleague in the UK at +44-20-7946-0958.
Ravi made a call to his family in India at +91-22-6789-1234.
Anna dialed her cousin in Australia at +61-2-9876-5432.
Carlos phoned his business partner in Brazil at +55-11-2345-6789.

"""

# Regex pattern to extract the country, country code, and phone number.
# The leading "+" is matched outside the "code" group, so only the digits are captured.
pattern = (
    r"(?P<country>\w+)\s+at\s+\+(?P<code>\d{1,3})-(?P<number>\d{1,4}-\d{1,4}-\d{1,4})"
)


# Find all matches in the text
matches = re.finditer(pattern, txt)

# Collect one row per match, removing the dashes from the phone number
rows = [
    {
        "Country": match.group("country"),
        "Code": match.group("code"),
        "Phone Number": re.sub(r"-", "", match.group("number")),
    }
    for match in matches
]

# Build the DataFrame from the collected rows
df = pd.DataFrame(rows, columns=["Country", "Code", "Phone Number"])

df

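|   | Country   | Code | Phone Number |
|---|-----------|------|--------------|
| 0 | Italy     | 39   | 3358805661   |
| 1 | Iran      | 98   | 9120520182   |
| 2 | USA       | 1    | 2025550173   |
| 3 | UK        | 44   | 2079460958   |
| 4 | India     | 91   | 2267891234   |
| 5 | Australia | 61   | 298765432    |
| 6 | Brazil    | 55   | 1123456789   |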


**Exercise**: Write a code to extract the date and time from the given text based on the conditional pattern.
Format the results in a DataFrame with columns "Date" and "Time".
>Hint: The regex conditional is an `IF…THEN…ELSE` construct. Its basic form is this:
>`(?(A)X|Y)`
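
In Python's `re` module the conditional is written `(?(id/name)yes-pattern|no-pattern)`, where `id/name` refers to an earlier group and the `no-pattern` part may be omitted. The sketch below is an illustration unrelated to the date exercise: a closing parenthesis is required only if an opening one was matched. The three-digit pattern and the candidate strings are assumptions chosen for the example.

```python
import re

# Group 1 optionally matches "("; the conditional (?(1)\)) then demands ")"
# only when group 1 took part in the match
pattern = re.compile(r"^(\()?\d{3}(?(1)\))$")

candidates = ["(123)", "123", "(123", "123)"]
results = {c: bool(pattern.match(c)) for c in candidates}

for candidate, ok in results.items():
    print(f"{candidate!r}: {ok}")
```

Balanced inputs such as `(123)` and `123` are accepted, while `(123` and `123)` are rejected.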

In [11]:
# @title 🧑🏿‍💻 Your code here
txt = """
Meeting scheduled on 2023-12-15 at 10:00 AM in the conference room.
Event is set for 01/12/2023 at 2:30 PM at the main hall.
The report is due on 2023-11-30 at 3:00 PM in the office.
Join us on 10/11/2024 for a webinar at 6:45 PM.
Please confirm by 2024-05-20 at 9:00 AM in the boardroom.
"""

In [12]:
# @title 👀 Solution
txt = """
Meeting scheduled on 2023-12-15 at 10:00 AM in the conference room.
Event is set for 01/12/2023 at 2:30 PM at the main hall.
The report is due on 2023-11-30 at 3:00 PM in the office.
Join us on 10/11/2024 for a webinar at 6:45 PM.
Please confirm by 2024-05-20 at 9:00 AM in the boardroom.
"""

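# Regex pattern to extract date and time.
# Alternation (|) inside the "date" group accepts either a 2023-12-15 style
# or a 01/12/2023 style date. The date must be followed directly by "at" and
# a time, so the "Join us on ..." line is not matched.
pattern = r"(?P<date>\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4})\s+at\s+(?P<time>\d{1,2}:\d{2}\s[AP]M)"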

# Find all matches in the text
matches = re.finditer(pattern, txt)

# Initialize an empty DataFrame
df = pd.DataFrame(columns=["Date", "Time"])

# Populate the DataFrame using loc
for idx, match in enumerate(matches):
    df.loc[idx] = [match.group("date"), match.group("time")]


df

|   | Date       | Time     |
|---|------------|----------|
| 0 | 2023-12-15 | 10:00 AM |
| 1 | 01/12/2023 | 2:30 PM  |
| 2 | 2023-11-30 | 3:00 PM  |
| 3 | 2024-05-20 | 9:00 AM  |


## 4. Practical Applications

When discussing the practical applications of regular expressions (regex) in Natural Language Processing (NLP), it is important to show how these tools are used to handle, manipulate, and analyze text data effectively. This section shows how **Substitution and Splitting** can be applied in NLP tasks.

### Substitution and Splitting

In this section, Parham will explore two fundamental `Regex` techniques, **Substitution** and **Splitting**, that are essential in many NLP tasks. These techniques play a critical role in processes like text normalization and tokenization, which are foundational steps in any NLP application. 

For example, text normalization often involves converting text into a standard format, such as splitting paragraphs into one sentence per record, which is useful for tasks like **Sentence-Level Analysis**, **Data Augmentation**, and **Simplifying Models**. However, sentence splitting may not be suitable for applications like **Contextual Understanding**, **Paragraph-Level Classification**, or **Sequence Models**, which rely on the wider context. Additionally, he will learn how to lowercase text, remove special characters, and replace contractions, so that the data is clean and consistent for further analysis.
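
Before moving to the dataset, the two techniques can be tried on a single string. The sketch below uses a made-up sentence, a simplified sentence splitter based on `re.split`, and one generic contraction rule. Contractions are expanded before punctuation is stripped, because removing apostrophes first would leave nothing for the contraction pattern to match.

```python
import re

# Hypothetical input text
text = "Regex is fun! It cleans text. Doesn't it?"

# Splitting: break after '.', '!' or '?' when followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)

# Substitution, step 1: lowercase, then expand "n't" using a backreference
lowered = text.lower()
expanded = re.sub(r"\b(\w+)n't\b", r"\g<1> not", lowered)

# Substitution, step 2: remove everything except letters, digits, and whitespace
cleaned = re.sub(r"[^a-z0-9\s]", "", expanded)
print(cleaned)
```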

In [20]:
from datasets import get_dataset_config_names, get_dataset_split_names, load_dataset, load_from_disk

ds = load_dataset("GroNLP/ik-nlp-22_slp", "paragraphs")["train"].select_columns("text")
nltk.download('punkt_tab', download_dir=os.getcwd() + '/datasets/nltk_data')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/lorenzo/Documenti/GitHub/Computer-Science-
[nltk_data]     Sapienza/NLP/Part_1/datasets/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [21]:
def preprocessed(sample):
    for text in sample["text"]:
        # Step 1: Sentence tokenization
        sentences = sent_tokenize(text)
        # Apply your other preprocessing steps to each sentence
        processed_sentences = []
        # Process each sentence in each sample
        for sentence in sentences:
            # Remove special characters
            normalized_text = re.sub(r"[^a-zA-Z0-9\s]", "", sentence)
            # Lowercase the text
            normalized_text = normalized_text.lower()
            # Replace contractions (specific forms first, then general patterns).
            # Note: the step above has already stripped apostrophes, so these
            # patterns only take effect if this step runs before that removal.
            contractions = {
                r"don't": "do not",
                r"can't": "cannot",
                r"i'm": "i am",
                r"it's": "it is",
                r"won't": "will not",
                r"ain't": "is not",
                r"(\w+)'ll": r"\g<1> will",
                r"(\w+)n't": r"\g<1> not",
                r"(\w+)'ve": r"\g<1> have",
                r"(\w+)'s": r"\g<1> is",
                r"(\w+)'re": r"\g<1> are",
                r"(\w+)'d": r"\g<1> would",
            }
            for contraction, replacement in contractions.items():
                normalized_text = re.sub(
                    r"\b" + contraction + r"\b", replacement, normalized_text
                )

            processed_sentences.append(normalized_text)

        # Return each processed sentence as a new record
        return {"text": processed_sentences}


# Apply the preprocessing function
ds_preprocessed = ds.map(preprocessed, batched=True, batch_size=1)
ds_preprocessed = ds_preprocessed.filter(
    lambda x: len(x["text"]) > 0
)  # Remove any empty records
ds_preprocessed["text"][:5]

['the dialogue above is from eliza an early natural language processing system eliza that could carry on a limited conversation with a user by imitating the responses of a rogerian psychotherapist weizenbaum 1966',
 'eliza is a surprisingly simple program that uses pattern matching to recognize phrases like i need x and translate them into suitable outputs like what would it mean to you if you got x',
 'this simple technique succeeds in this domain because eliza doesnt actually need to know anything to mimic a rogerian psychotherapist',
 'as weizenbaum notes this is one of the few dialogue genres where listeners can act as if they know nothing of the world',
 'elizas mimicry of human conversation was remarkably successful many people who interacted with eliza came to believe that it really understood them and their problems many continued to believe in elizas abilities even after the programs operation was explained to them weizenbaum 1976 and even today such chatbots are a fun diversi

## 5. Exercise: Regex-Based Named Entity Extraction

### **Objective:**  
In this exercise, you'll deepen your understanding of `Regex` by applying it to a named entity extraction task using a real-world dataset. You'll focus on identifying and extracting specific types of information, such as dates, usernames, and hashtags, from text data.

### **Dataset:**  
You will work with the [TweetEval](https://huggingface.co/datasets/cardiffnlp/tweet_eval) dataset, available through **Hugging Face**. This dataset contains a diverse collection of tweets labeled for various tasks.

### **Steps:**

1. **Load the Dataset:**  
   - Load the `sentiment` subset of the TweetEval dataset from Hugging Face; the tweets are stored in its `text` field.

2. **Entity Extraction:**  
   - Use regular expressions to extract the following entities from the tweets:
     - **Dates:** Extract mentions of dates in formats like `DD/MM/YYYY` or `Month Day, Year` or `Day of the week, Year`.
     - **URLs:** Extract any URLs from the tweets.
     - **Usernames:** Extract any usernames mentioned in the tweets.
     - **Hashtags:** Extract any hashtags mentioned in the tweets.

3. **Data Enhancement:**  
   - Store each of the extracted entities in a new field within the dataset.

4. **Export the Results:**  
   - Save the enhanced dataset in `JSONL` (JSON Lines) format.

> **Note:** For more information on using the Hugging Face Dataset module, please refer to this [documentation](https://huggingface.co/docs/datasets/v1.1.1/processing.html). 
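
As a starting point, the steps above can be sketched on a couple of in-memory records before applying them to the full dataset. Everything here is an assumption made for illustration: the second tweet is invented, the patterns are deliberately simple (only the `DD/MM/YYYY` date format is covered), and the helper name `extract_entities` is hypothetical. Only the standard library is used, so the Hugging Face loading and saving steps are left to the exercise.

```python
import json
import re

# Two sample records; the second one is a made-up tweet
tweets = [
    {"text": "QT @user In the original draft of the 7th book, Remus Lupin "
             "survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"},
    {"text": "Meet @alice on 12/03/2021 at https://example.com/event #NLP #regex"},
]

# Simplified patterns, one per entity type
PATTERNS = {
    "dates": r"\b\d{2}/\d{2}/\d{4}\b",
    "urls": r"https?://\S+",
    "usernames": r"@\w+",
    "hashtags": r"#\w+",
}


def extract_entities(record):
    """Return a copy of the record with one new list field per entity type."""
    enriched = dict(record)
    for field, pattern in PATTERNS.items():
        enriched[field] = re.findall(pattern, record["text"])
    return enriched


enriched = [extract_entities(tweet) for tweet in tweets]

# JSONL: one JSON object per line
jsonl = "\n".join(json.dumps(record) for record in enriched)
print(jsonl)
```

A function with this shape, taking one record and returning a dictionary with extra fields, could also be passed to the `map` method of a Hugging Face dataset, as done in the preprocessing cell of section 4.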


In [23]:
# @title 🧑🏿‍💻 Your code here

# Path where the dataset is stored locally
dataset_path = "./datasets/tweet_eval_sentiment"

# Check whether the dataset already exists on disk
if os.path.exists(dataset_path):
    print("Loading dataset from disk...")
    dataset = load_from_disk(dataset_path)
else:
    print("Downloading dataset from Hugging Face...")
    dataset = load_dataset("cardiffnlp/tweet_eval", "sentiment")
    dataset.save_to_disk(dataset_path)  # Save the dataset locally for future use

# Verify that the dataset was loaded correctly
print(dataset)
print(dataset["train"][0])

Downloading dataset from Hugging Face...


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 45615
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 12284
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})
{'text': '"QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"', 'label': 2}


## 6. Closing Thoughts
In this notebook, Parham performed the following activities:
- Acquired knowledge of how `Regex` works.
- Applied different `Regex` syntaxes to different texts.
- Was introduced to a small real-world use of `Regex` in NLP tasks.
- Challenged himself with a real NLP task leveraging `Regex` and the **Hugging Face Datasets** library.
