# **11. Advanced Features & Optimization**

## 🔤 7. **Advanced String Operations**

```python
import pandas as pd
```

### 1. **Purpose & When to Use It**

**Purpose:**
Advanced string operations in pandas are used to manipulate, clean, extract, transform, and analyze textual (string) data within `Series` or DataFrame columns using vectorized string methods under `.str`.

**When to Use It:**

* Working with categorical data in textual format
* Cleaning or standardizing text (e.g., lowercasing, trimming)
* Extracting patterns using regex
* Performing tokenization or substring operations
* Encoding or formatting text for feature engineering


### 2. **Different Methods and Techniques**

#### ✅ **String Methods via `.str` accessor** (Vectorized)

* `str.lower()`, `str.upper()`, `str.title()` – case handling
* `str.strip()`, `str.lstrip()`, `str.rstrip()` – remove leading and/or trailing whitespace
* `str.contains()`, `str.startswith()`, `str.endswith()` – condition checks
* `str.replace()` – substring replacement
* `str.extract()` – regex-based extraction
* `str.split()`, `str.get()` – split strings and pick one element from each resulting list
* `str.len()` – length of each string
* `str.pad()`, `str.zfill()`, `str.center()` – pad or align strings to a fixed width
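As a quick illustration of the methods listed above, here is a minimal sketch on made-up data. The sample values and variable names (`s`, `cleaned`, `ids`, and so on) are hypothetical and only serve the example.

```python
import pandas as pd

# Hypothetical sample data, including a missing value
s = pd.Series(["  Alice Smith ", "BOB JONES", None, "carol  "])

# Trim surrounding whitespace, then convert to title case
cleaned = s.str.strip().str.title()

# Length of each string; the missing entry stays missing
lengths = cleaned.str.len()

# Boolean check; na=False reports missing entries as False
starts_with_b = cleaned.str.startswith("B", na=False)

# Split on whitespace and keep the first token of each string
first_word = cleaned.str.split().str.get(0)

# Left-pad numeric strings with zeros to a width of 5
ids = pd.Series(["7", "42", "123"]).str.zfill(5)

print(cleaned.tolist())
print(ids.tolist())
```

Each call returns a new Series, so the original `s` is left unchanged.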

#### ✅ **Regex-Based Techniques**

* Use `str.extract(r'regex')` or `str.contains(r'regex')`; the pattern passed to `str.extract()` must contain at least one capture group
* Useful for parsing structured text patterns (e.g., extracting phone numbers, emails)
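The following sketch shows one way to pull phone numbers out of free text. The sample notes and the `555-123-4567` number format are assumptions made for the example; real data will likely need a more tolerant pattern.

```python
import pandas as pd

# Hypothetical free-text notes
notes = pd.Series([
    "Call 555-123-4567 after 5pm",
    "no phone on file",
    "Office: 555-987-6543",
])

# Named groups become the column names of the extracted DataFrame
phone_pattern = r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})"
parts = notes.str.extract(phone_pattern)

# For a yes/no check, a pattern without capture groups is enough
has_phone = notes.str.contains(r"\d{3}-\d{3}-\d{4}", na=False)

print(parts)
print(has_phone.tolist())
```

Rows that do not match the pattern come back as missing values in every extracted column.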

#### ✅ **Case Manipulation**

* Standardize text by changing case (`str.lower()` or `str.title()`)

#### ✅ **Replacing & Cleaning**

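* `str.replace(old, new)` replaces substrings; pass `regex=True` to treat `old` as a regular expression (since pandas 2.0 the default is a literal match)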
* Clean symbols, typos, or unnecessary characters
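A small cleaning pipeline might look like the sketch below. The product names are invented, and the choice to keep only letters, digits, and spaces is an assumption that may be too aggressive for real catalogs.

```python
import pandas as pd

# Hypothetical messy product names
products = pd.Series(["celllphone  (new!)", "Lap-top**", "tablet"])

# Literal replacement: regex=False treats the pattern as plain text
fixed = products.str.replace("celllphone", "cellphone", regex=False)

# Regex replacement: drop everything except letters, digits, and spaces
no_symbols = fixed.str.replace(r"[^A-Za-z0-9 ]", "", regex=True)

# Collapse runs of whitespace into one space, then trim the ends
tidy = no_symbols.str.replace(r"\s+", " ", regex=True).str.strip()

print(tidy.tolist())
```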

#### ✅ **Splitting & Expanding**

* `str.split(delimiter, expand=True)` splits each string on the delimiter and returns the pieces as separate columns of a DataFrame
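To make the effect of `expand=True` concrete, here is a minimal sketch using hypothetical product codes of the form `ITEM-COLOR-SIZE`; the codes and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical codes with a fixed ITEM-COLOR-SIZE layout
codes = pd.Series(["TSHIRT-RED-L", "HOODIE-BLUE-M"])

# expand=True returns a DataFrame with one column per piece
parts = codes.str.split("-", expand=True)
parts.columns = ["item", "color", "size"]

# Without expand=True the result is a single Series of lists
as_lists = codes.str.split("-")

print(parts)
print(as_lists.iloc[0])
```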

### 3. **Examples with Code**

#### 🔹 Case Standardization

```python
df['name'] = df['name'].str.lower()
```

#### 🔹 Extracting Emails from Text

```python
df['email'] = df['contact_info'].str.extract(r'([\w\.-]+@[\w\.-]+)')
```

#### 🔹 Checking Substring Presence

```python
# na=False treats missing remarks as non-matches, so the mask is purely boolean
df[df['remarks'].str.contains('urgent', case=False, na=False)]
```

#### 🔹 Splitting Column into Multiple

```python
# n=1 splits on the first space only, so names with middle parts still yield two columns
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', n=1, expand=True)
```

#### 🔹 Replacing Typos

```python
# The typo is a fixed string, so a literal (non-regex) replacement is enough
df['product'] = df['product'].str.replace('celllphone', 'cellphone', regex=False)
```


### 4. **Performance Considerations**

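* **Vectorized methods via `.str`** are concise, skip missing values automatically, and are usually at least as fast as `apply()` with a `lambda`. On the default object dtype the speedup is often modest; Arrow-backed string dtypes (e.g. `string[pyarrow]`) tend to benefit more.
* Regex methods (`str.extract`, `str.contains`) are powerful but can be **slower on large datasets**. Pass `regex=False` for literal searches, and keep patterns simple or pre-compile them when they are reused.
* String operations can be **memory-intensive**: every step in a chain materializes a new intermediate Series.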
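Relative speed depends on the pandas version and the column dtype, so measuring is safer than assuming. The sketch below times `.str.lower()` against an equivalent `apply()` on synthetic data and contrasts a literal search with a regex search. The data, the function names `with_str` and `with_apply`, and the repetition counts are all arbitrary choices for the example.

```python
import re
import timeit

import pandas as pd

# Synthetic data: 100,000 rows with some missing values
s = pd.Series(["Error: disk full", "ok", None, "ERROR: timeout"] * 25_000)


def with_str():
    # Vectorized accessor; missing values are skipped automatically
    return s.str.lower()


def with_apply():
    # A plain lambda would fail on missing values, so a guard is needed
    return s.apply(lambda x: x.lower() if isinstance(x, str) else x)


print("str.lower:", timeit.timeit(with_str, number=3))
print("apply    :", timeit.timeit(with_apply, number=3))

# Literal search: regex=False skips the regex engine
literal = s.str.contains("timeout", regex=False, na=False)

# Regex search: needed only when the match rule is a real pattern
by_regex = s.str.contains(r"^error:", flags=re.IGNORECASE, na=False)

print(literal.sum(), by_regex.sum())
```

No timings are shown here because they vary by machine and library version.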

---

### 5. **Common Pitfalls & Mistakes**

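* Using `.apply(lambda x: x.lower())` instead of vectorized `str.lower()`: it is more verbose, typically no faster, and raises `AttributeError` on missing values.
* Misusing regex: unescaped metacharacters such as `.` or `(`, or greedy matching (`.*`) where lazy matching (`.*?`) was intended.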
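* Forgetting to handle `NaN` values: most `.str` methods propagate missing values, so a mask built with `str.contains()` can itself contain `NaN` and fail when used for filtering. Pass `na=False` to the predicate methods (`contains`, `startswith`, `endswith`, `match`):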

  ```python
  df['col'].str.contains('pattern', na=False)
  ```
* Omitting `expand=True` in `split()` when separate columns are needed: the result is a single Series of lists rather than a DataFrame.
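The missing-value pitfall above can be reproduced with a few rows. This is a minimal sketch on invented data; the column is forced to object dtype because newer pandas string dtypes may already treat missing values as `False` in predicate methods.

```python
import pandas as pd

# Hypothetical remarks column with one missing entry (object dtype)
df = pd.DataFrame({
    "remarks": pd.Series(["URGENT: call back", None, "routine check"], dtype=object)
})

# On an object-dtype column, the missing row stays missing in the result
raw_mask = df["remarks"].str.contains("urgent", case=False)

# na=False treats missing text as "no match", giving a clean boolean mask
mask = df["remarks"].str.contains("urgent", case=False, na=False)
urgent_rows = df[mask]

# Alternative: fill missing text first, then apply any .str method
lengths = df["remarks"].fillna("").str.len()

print(raw_mask.tolist())
print(mask.tolist())
```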

---

### 6. **Best Practices**

* Prefer **vectorized `.str` methods** over Python `apply()` for string handling.
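* Define regex patterns that are reused once, for example as **compiled regex patterns** via `re.compile`; this keeps them in one place and avoids repeated compilation.
* Chain operations when it improves clarity, e.g. `.str.lower().str.strip()`.
* Handle **missing values** (`NaN`) explicitly, for example with `na=False` or `fillna('')`.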
* Document or comment on **complex regex** patterns used.
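The practices above can be combined as in the following sketch: a commented pattern written in verbose mode, a short method chain, and a compiled pattern reused across columns. All names (`EMAIL_PATTERN`, `WHITESPACE_RE`, the sample frames) are hypothetical, and the email pattern is deliberately simple rather than a full validator.

```python
import re

import pandas as pd

# Verbose mode allows whitespace and comments inside the pattern
EMAIL_PATTERN = r"""
    (?P<user>[\w.-]+)     # local part: word characters, dot, hyphen
    @
    (?P<domain>[\w.-]+)   # domain: same character set
"""

contacts = pd.Series(["  Jane.Doe@Example.COM ", None, "no email here"])

# Chain simple steps: trim, lowercase, then extract with the documented pattern
normalized = contacts.str.strip().str.lower()
emails = normalized.str.extract(EMAIL_PATTERN, flags=re.VERBOSE)

# A compiled pattern defined once and reused for several columns
WHITESPACE_RE = re.compile(r"\s+")
people = pd.DataFrame({"name": ["Ada   Lovelace"], "city": ["New    York"]})
for col in ["name", "city"]:
    # regex=True is required when passing a compiled pattern
    people[col] = people[col].str.replace(WHITESPACE_RE, " ", regex=True)

print(emails)
print(people)
```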

---

### 7. **Use Cases in Real Projects**

| Use Case                       | Example                                                          |
| ------------------------------ | ---------------------------------------------------------------- |
| **Customer feedback analysis** | Extract keywords, filter negative reviews using `str.contains()` |
| **User profile parsing**       | Extract usernames or email domains using `str.extract()`         |
| **Product catalog cleaning**   | Normalize product names, remove symbols via `str.replace()`      |
| **Web scraping**               | Clean and extract structured info from HTML/text dumps           |
| **Log data analysis**          | Extract timestamps, IP addresses using regex                     |
| **Form entry validation**      | Check for invalid patterns (e.g., missing `@` in emails)         |

