# Regular Expressions Module

## 📘 Python `re` Module – Notes

### 🔹 What is `re`?

- `re` stands for **regular expression**.
- It's a built-in Python module used to **search**, **match**, **replace**, or **split** text based on patterns.

---

### ✅ How to use it

```python
import re
```

---

### 🔸 Common `re` Functions

| Function            | Description |
|---------------------|-------------|
| `re.findall()`      | Returns all matches in a list |
| `re.search()`       | Returns the first match (as a Match object) |
| `re.match()`        | Checks for a match **only at the beginning** of the string |
| `re.sub()`          | Replaces parts of the string |
| `re.split()`        | Splits string by pattern |
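
To make the differences in the table concrete, here is a minimal sketch comparing `re.search()`, `re.match()`, and `re.findall()` on one string. The sample text and variable names are made up for illustration.

```python
import re

text = "Order 66 shipped on day 7"

first = re.search(r'\d+', text)      # Match object for the first number found
at_start = re.match(r'\d+', text)    # None: the string does not start with a digit
all_nums = re.findall(r'\d+', text)  # every number, returned as strings

print(first.group())  # 66
print(at_start)       # None
print(all_nums)       # ['66', '7']
```

Note that `re.search()` and `re.match()` return `None` when nothing matches, so check the result before calling `.group()`.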

---

### 🔹 What is a **pattern**?

A pattern is a set of characters used to define what you're looking for in a string.

---

### 🔸 Common Regex Patterns

| Pattern   | Meaning                     | Example Match       |
|-----------|-----------------------------|----------------------|
| `\d`      | Any digit (0–9)              | `2` in `I have 2 cats` |
| `\w`      | Any single word character (letters, digits, `_`) | each character of `hello_123` |
| `\s`      | Any whitespace               | Space, tab, newline  |
| `\b`      | Word boundary                | Start/end of words   |
| `.`       | Any character **except newline** | `a`, `1`, `@` etc. |
| `+`       | One or more                  | `\d+` matches `123` |
| `*`       | Zero or more                 | `a*` matches `aaa`, `a`, or `""` |
| `[]`      | Set of characters            | `[aeiou]` matches any vowel |
| `^`       | Start of string              | `^Hello` matches if string starts with Hello |
| `$`       | End of string                | `world$` matches if string ends with world |
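
A few of the patterns in the table can be checked directly. This is a small sketch with made-up sample strings, not an exhaustive test of each pattern.

```python
import re

# `.` matches any character except a newline, so "c\nt" is skipped
dot_matches = re.findall(r'c.t', "cat cut c\nt")

# `[]` matches one character from the set
vowels = re.findall(r'[aeiou]', "Hello")

# `^` and `$` anchor the pattern to the start and end of the string
starts = bool(re.search(r'^Hello', "Hello world"))
ends = bool(re.search(r'world$', "Hello world"))

# `\b` keeps "cat" from matching inside "concatenate"
whole_word = re.findall(r'\bcat\b', "cat concatenate")

print(dot_matches)   # ['cat', 'cut']
print(vowels)        # ['e', 'o']
print(starts, ends)  # True True
print(whole_word)    # ['cat']
```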

---

### 🔹 Raw String (`r''`)

- Write regex patterns using **raw strings**: `r'\d+'`
- In a raw string, Python does **not** treat backslashes as escape characters, so `\d` reaches the regex engine unchanged
- Without `r`, you’d have to write `'\\d+'`, which is harder to read
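
The effect is easiest to see with `\b`, which Python interprets as a backspace character in a normal string. The sketch below uses an illustrative sample string to compare the two forms.

```python
import re

# The raw string and the double-backslash string are the same pattern
same = r'\d+' == '\\d+'

# '\b' is one character (backspace); r'\b' is two (backslash + b)
lengths = (len('\b'), len(r'\b'))

# Without r, the pattern looks for literal backspace characters and finds nothing
no_raw = re.findall('\bcat\b', "a cat sat")
with_raw = re.findall(r'\bcat\b', "a cat sat")

print(same)      # True
print(lengths)   # (1, 2)
print(no_raw)    # []
print(with_raw)  # ['cat']
```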

---

### 🔸 Examples

```python
import re

# Find all words
re.findall(r'\b\w+\b', "Hello world!")  # ['Hello', 'world']

# Find all numbers
re.findall(r'\d+', "I have 2 cats and 10 dogs")  # ['2', '10']

# Replace digits
re.sub(r'\d+', '#', "Room 101")  # 'Room #'

# Split sentence by spaces
re.split(r'\s+', "This is a test")  # ['This', 'is', 'a', 'test']
```

---

### ✅ Tips

- Always use `r''` for regex strings
- Use `re.findall()` if you want all matches
- Use `re.search()` if you only need the first match
- Test your patterns at [regex101.com](https://regex101.com) (select Python flavor)

# re.findall()

The examples below use `re.findall()` for different tasks. Each one shows how `re.findall()` extracts matches based on a specific pattern.

---

### 1. **Extract All Words in a Sentence**
   Pattern: `\b\w+\b`

```python
import re

text = "Hello world! This is a test."
words = re.findall(r'\b\w+\b', text)

print(words)  # ['Hello', 'world', 'This', 'is', 'a', 'test']
```

**Explanation**:
- `\b\w+\b` matches any word (letters/numbers/underscores), and ignores punctuation marks.
  
---

### 2. **Extract All Numbers**
   Pattern: `\d+`

```python
import re

text = "There are 3 apples, 15 bananas, and 42 oranges."
numbers = re.findall(r'\d+', text)

print(numbers)  # ['3', '15', '42']
```

**Explanation**:
- `\d+` matches **one or more digits**. It will extract all numbers in the text.
  
---

### 3. **Extract Email Addresses**
   Pattern: `\w+@\w+\.\w+`

```python
import re

text = "Contact us at support@example.com or sales@company.com."
emails = re.findall(r'\w+@\w+\.\w+', text)

print(emails)  # ['support@example.com', 'sales@company.com']
```

**Explanation**:
- `\w+` matches one or more word characters (letters, digits, underscores).
- `@` and `.` are literal characters to match in an email address.
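- This simplified pattern works for plain addresses like the two above. It does not handle dots, hyphens, or plus signs before the `@`, and it stops after the first dot-separated part of the domain.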
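
The sketch below shows where the simple pattern falls short and one slightly broader alternative. The broader pattern and the sample address are illustrative assumptions; neither pattern is a full email validator.

```python
import re

text = "Write to first.last@mail.example.co.uk today"

# Simple pattern: \w cannot cross the dot in "first.last" or continue past "example"
simple = re.findall(r'\w+@\w+\.\w+', text)

# Broader sketch: allow . + - before the @, and several dot-separated domain parts
broader = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', text)

print(simple)   # ['last@mail.example']
print(broader)  # ['first.last@mail.example.co.uk']
```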

---

### 4. **Extract Words Starting with a Specific Letter**
   Pattern: `\b[Aa]\w+\b` (words starting with 'A' or 'a')

```python
import re

text = "Alice went to the art gallery."
words_with_a = re.findall(r'\b[Aa]\w+\b', text)

print(words_with_a)  # ['Alice', 'art']
```

**Explanation**:
- `[Aa]` matches either an uppercase `A` or a lowercase `a`.
- This pattern extracts all words starting with 'A' or 'a'.
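
An alternative to listing both cases in a set is the `re.IGNORECASE` flag. This is a small sketch of the same search written that way.

```python
import re

text = "Alice went to the art gallery."

# re.IGNORECASE makes the literal `a` match both 'A' and 'a'
words_with_a = re.findall(r'\ba\w+\b', text, flags=re.IGNORECASE)

print(words_with_a)  # ['Alice', 'art']
```

Because of `\w+`, both versions require at least one more character after the first letter, so a standalone `a` would not be returned.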

---

### 5. **Extract Dates (DD-MM-YYYY format)**
   Pattern: `\b\d{2}-\d{2}-\d{4}\b`

```python
import re

text = "The event will be held on 25-12-2022 and 01-01-2023."
dates = re.findall(r'\b\d{2}-\d{2}-\d{4}\b', text)

print(dates)  # ['25-12-2022', '01-01-2023']
```

**Explanation**:
- `\d{2}` matches exactly two digits.
- `-` is the literal separator between day, month, and year.
- `\d{4}` matches exactly four digits for the year.
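
The pattern only checks the shape of the text, so something like `31-02-2023` also matches. One possible way to filter out impossible dates is to pass each candidate through `datetime.strptime`. The helper name `is_real_date` and the sample text are made up for this sketch.

```python
import re
from datetime import datetime

text = "Valid: 25-12-2022. Invalid: 31-02-2023."
candidates = re.findall(r'\b\d{2}-\d{2}-\d{4}\b', text)

def is_real_date(value):
    """Return True if value is a real calendar date in DD-MM-YYYY form."""
    try:
        datetime.strptime(value, "%d-%m-%Y")
        return True
    except ValueError:
        return False

real_dates = [c for c in candidates if is_real_date(c)]

print(candidates)  # ['25-12-2022', '31-02-2023']
print(real_dates)  # ['25-12-2022']
```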

---

### 6. **Extract Hashtags (e.g., #Python, #DataScience)**
   Pattern: `#\w+`

```python
import re

text = "I love #Python and #DataScience!"
hashtags = re.findall(r'#\w+', text)

print(hashtags)  # ['#Python', '#DataScience']
```

**Explanation**:
- `#` is the literal character used in hashtags.
- `\w+` matches one or more word characters after `#`.

---

### 7. **Extract All Words with a Specific Length**
   Pattern: `\b\w{5}\b` (words with exactly 5 characters)

```python
import re

text = "I have a dream of creating great code."
words_with_5 = re.findall(r'\b\w{5}\b', text)

print(words_with_5)  # ['dream', 'great']
```

**Explanation**:
- `\w{5}` matches exactly **5 word characters**; the `\b` on each side ensures the match is a whole word, so longer words such as `creating` are skipped.
  
---

### 8. **Extract URLs**
   Pattern: `https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+`

```python
import re

text = "Visit us at https://www.example.com or http://example.org"
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', text)

print(urls)  # ['https://www.example.com', 'http://example.org']
```

**Explanation**:
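- `https?://` matches the **http** or **https** protocol in URLs.
- `(?:...)` is a **non-capturing group**: it groups part of the pattern without capturing it, so `re.findall()` returns the whole match instead of the group's contents.
- The character class `[-\w.]` does not include `/`, so this pattern matches the protocol and host only, not a path or query string.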
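
The difference matters because `re.findall()` changes what it returns when the pattern contains capturing groups. The sketch below uses simplified patterns (an assumption for brevity) on the same text.

```python
import re

text = "Visit us at https://www.example.com or http://example.org"

# Non-capturing group: each result is the full matched string
full = re.findall(r'(?:https?)://[-\w.]+', text)

# Capturing groups: each result is a tuple of the captured parts
parts = re.findall(r'(https?)://([-\w.]+)', text)

print(full)   # ['https://www.example.com', 'http://example.org']
print(parts)  # [('https', 'www.example.com'), ('http', 'example.org')]
```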
  
---

### 9. **Extract All Words Without Vowels**
   Pattern: `\b[^aeiouAEIOU\s\d\W]+\b`

```python
import re

text = "The shy lynx jumps by my gym."
words_without_vowels = re.findall(r'\b[^aeiouAEIOU\s\d\W]+\b', text)

print(words_without_vowels)  # ['shy', 'lynx', 'by', 'my', 'gym']
```

**Explanation**:
- `[^aeiouAEIOU\s\d\W]` matches any character that is **not** a vowel, whitespace, a digit, or a non-word character, which leaves consonants (and `_`).
- The `\b` on each side forces the whole word to consist of such characters, so words that contain a vowel (`The`, `jumps`) are skipped entirely; their vowels are not stripped out.
- `y` is not in the excluded set, so it counts as a consonant here.

---

### 10. **Extract All Words with Numbers in Them**
   Pattern: `\b\w*\d\w*\b`

```python
import re

text = "My code is v1.0, and I have 100 items."
words_with_numbers = re.findall(r'\b\w*\d\w*\b', text)

print(words_with_numbers)  # ['v1', '0', '100']
```

**Explanation**:
- `\w*\d\w*` matches words that contain **at least one digit**.
- `.` is not a word character, so `v1.0` is returned as two separate matches, `v1` and `0`.
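
Since `\w` does not match a dot, a version string such as `v1.0` is not treated as a single word by this pattern. If dotted tokens should stay together, one possible variation is to allow dots after the first digit. The adjusted pattern is an illustrative assumption, not the only way to do this.

```python
import re

text = "My code is v1.0, and I have 100 items."

# After the first digit, allow word characters or dots; the final \b
# still prevents a trailing dot or comma from being included
versions_and_numbers = re.findall(r'\b\w*\d[\w.]*\b', text)

print(versions_and_numbers)  # ['v1.0', '100']
```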

---

### Final Summary of `re.findall()`:

- **`re.findall(pattern, string)`** returns all non-overlapping matches of the pattern in the string as a list.
- It's great for **searching through strings** and extracting parts that match a given pattern.
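
"Non-overlapping" means that once a match is found, the search resumes after its end. The short sketch below illustrates this, along with a common lookahead trick for the case where overlapping matches are wanted; the sample strings are made up.

```python
import re

# The second 'aba' overlaps the first, so only one is returned
non_overlapping = re.findall(r'aba', 'ababa')

# A lookahead consumes no characters, so every starting position is tested;
# findall returns the captured group for each position
overlapping = re.findall(r'(?=(aba))', 'ababa')

print(non_overlapping)  # ['aba']
print(overlapping)      # ['aba', 'aba']
```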


```python
import re

text = 'Jai Sree Ram.....'
result = re.findall(r'\b\w+\b', text.lower())
print(result)  # ['jai', 'sree', 'ram']
```
