# Topic 05 - Problem 03: Extracting Information Using Patterns

---

## 1. About the Problem

In real datasets, important information is often **embedded inside text**, not neatly separated.

Examples:
- Product codes inside descriptions
- Numbers inside strings
- IDs mixed with text

In this problem, I will extract the **numeric order ID** from a text column.

This is a very common task in:
- Data cleaning
- Feature engineering
- Log analysis
- NLP preprocessing

---


## 2. Solution Code

```python
import pandas as pd

# Sample dataset
data = {
    "order_info": [
        "Order ID: 1234578",
        "Order ID: 9876545",
        "Order ID: 4567869",
        "Order ID: 1122312"
    ]
}

df = pd.DataFrame(data)

# Extract the first run of digits from each string using a regex.
# expand=False returns a Series instead of a one-column DataFrame.
df['order_id'] = df['order_info'].str.extract(r'(\d+)', expand=False)

print(df)
```

Output:

```
          order_info order_id
0  Order ID: 1234578  1234578
1  Order ID: 9876545  9876545
2  Order ID: 4567869  4567869
3  Order ID: 1122312  1122312
```
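
The `order_id` values above are still strings, and real data may contain rows with no number at all. Here is a minimal sketch of one way to handle both cases. The `"Order ID: pending"` row is a hypothetical example I added, not part of the original dataset, and the nullable `Int64` dtype is just one possible choice.

```python
import pandas as pd

# Hypothetical sample: the last row has no digits, so nothing can be extracted.
df = pd.DataFrame({
    "order_info": ["Order ID: 1234578", "Order ID: 9876545", "Order ID: pending"]
})

# expand=False returns a Series instead of a one-column DataFrame.
extracted = df["order_info"].str.extract(r"(\d+)", expand=False)

# Rows without a match come back as NaN.
# Converting to the nullable Int64 dtype keeps them as missing values.
df["order_id"] = pd.to_numeric(extracted).astype("Int64")

print(df.dtypes)
print(df)
```

With this approach the matched rows become real integers, while the row without a number stays missing instead of raising an error.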


---

## 3. Explanation (What is happening)

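- **`str.extract()`**  
  → Returns the part of each string captured by the first match of a regex pattern

- **`r'(\d+)'`**  
  → Regular expression:
  - `\d` = any digit
  - `+` = one or more times
  - `( )` = capture group, the part that `str.extract()` returns

- Only the number part is extracted into a new column, and it is still stored as text (a string), not as an integer

This converts:
- `"Order ID: 12345"` → `"12345"`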
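
To see how the pattern behaves on a single string, the same regex can be tried with Python's built-in `re` module. This is only a sketch: the `tricky` string is a hypothetical example I made up to show that `\d+` picks the first number it finds, and that adding the `Order ID:` label to the pattern is one way to make the match more specific.

```python
import re

text = "Order ID: 1234578"

# \d+ matches the first run of one or more digits; the parentheses capture it.
match = re.search(r"(\d+)", text)
print(match.group(1))  # 1234578

# Hypothetical string with another number before the order ID.
tricky = "2 items, Order ID: 1234578"

# The plain pattern stops at the first number it finds.
first_number = re.search(r"(\d+)", tricky).group(1)
print(first_number)  # 2

# Including the label in the pattern targets the number after "Order ID:".
labeled = re.search(r"Order ID:\s*(\d+)", tricky).group(1)
print(labeled)  # 1234578
```

The same labeled pattern can be passed to `str.extract()` in pandas when a column may contain more than one number per row.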

---

## 4. Summary / Takeaways

By solving this problem, I learned:

1. How to extract data from unstructured text
2. The basics of regex for data science
3. Why pattern extraction is powerful in preprocessing
4. How raw text becomes usable features

This problem practices **intermediate string manipulation skills** and makes a useful addition to a GitHub portfolio.

---

Next, I’ll move toward:
- Replacing values in strings
- Cleaning inconsistent text patterns


