# **Data Transformation**

## **5. Combining columns / splitting strings**

In [13]:
import numpy as np 
import pandas as pd 

## ✅ 1. **What it does and when to use it**

### **What:**

This category of operations allows you to:

* **Combine (concatenate)** multiple string/text columns into a single column.
* **Split** a single column into multiple columns based on delimiters or patterns.
* **Extract** specific string patterns (like parts of a date, phone number, etc.) using regex.
* Clean up or restructure textual data.

### **When to Use:**

* Data cleaning: Normalize inconsistent formats.
* Feature engineering: Split 'City-State' into separate features.
* Parsing logs or semi-structured fields (e.g., `log_level:timestamp:message`).
* Creating unified labels or IDs by merging fields (e.g., `CustomerID-Region`).
* Working with datasets having combined info in one field (e.g., `"John Doe | HR"`).


## ✅ 2. **Syntax and Core Parameters**

### 🔹 `+` operator

The simplest way to combine string columns element-wise.

```python
df['full_name'] = df['first_name'] + ' ' + df['last_name']
```

---

### 🔹 `str.cat()`

Flexible method to concatenate columns/Series with separator.

```python
df['full_name'] = df['first_name'].str.cat(df['last_name'], sep=' ')
```

**Key Parameters:**

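* `others`: Series, Index, DataFrame, or list-like of these to concatenate with
* `sep`: separator string (e.g., `'_'`, `' '`)
* `na_rep`: string used in place of missing values; if omitted, rows with a missing value yield `NaN`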

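As a minimal sketch of these parameters, the example below joins three columns in one call and shows the effect of `na_rep`. The `people` frame and its `middle_name` column are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names are illustrative only.
people = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Alice'],
    'middle_name': ['A.', np.nan, 'B.'],
    'last_name': ['Doe', 'Smith', 'Johnson'],
})

# Pass a list of Series as `others` to join more than two columns.
# Without `na_rep`, a row containing NaN yields NaN in the result.
strict = people['first_name'].str.cat(
    [people['middle_name'], people['last_name']], sep=' '
)

# With `na_rep`, the placeholder is inserted where a value is missing.
filled = people['first_name'].str.cat(
    [people['middle_name'], people['last_name']], sep=' ', na_rep='?'
)

print(strict.tolist())
print(filled.tolist())
```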
---

### 🔹 `str.split()`

Splits strings using a delimiter into a list or columns.

```python
df[['city', 'state']] = df['location'].str.split(',', expand=True)
```

**Key Parameters:**

* `pat`: delimiter or regex pattern
* `expand`: If `True`, returns a DataFrame with one column per piece; if `False` (the default), returns a Series of lists
* `n`: Maximum number of splits (default `-1`, meaning no limit)

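To illustrate `n` and `expand` together, here is a small sketch that parses hypothetical `log_level:date:message` strings. Limiting the split with `n=2` keeps any colon inside the message intact; the sample lines and column names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical log lines; the second message contains its own colon.
logs = pd.Series([
    'ERROR:2023-07-21:Disk full',
    'INFO:2023-07-21:Retry: attempt 2',
])

# n=2 performs at most two splits, so each row yields exactly three pieces.
parts = logs.str.split(':', n=2, expand=True)
parts.columns = ['level', 'date', 'message']

# With the default expand=False, each element is a Python list instead.
as_lists = logs.str.split(':', n=2)

print(parts)
print(as_lists.iloc[1])
```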
---

### 🔹 `str.extract()`

Extracts the first match of each capture group in a **regular expression (regex)**; every group becomes one output column.

```python
df['year'] = df['text'].str.extract(r'(\d{4})')
```
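The sketch below shows two variations that often come in handy: `expand=False` to get a Series back from a single group, and named groups to label the output columns. The `notes` Series is made-up sample data.

```python
import pandas as pd

# Hypothetical free-text notes; the middle one has no date at all.
notes = pd.Series(['Filed 2021-03-15', 'no date here', 'Renewed 2023-11-02'])

# One capture group with expand=False returns a Series rather than a DataFrame.
year = notes.str.extract(r'(\d{4})', expand=False)

# Named groups (?P<name>...) become the column names of the result.
date_parts = notes.str.extract(
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
)

print(year.tolist())
print(date_parts)
```

Rows that do not match the pattern come back as `NaN`, so it is worth checking for missing values after extraction.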

---

### 🔹 `str.replace()`

For targeted replacements in strings.

```python
df['clean_code'] = df['code'].str.replace('-', '', regex=False)
```
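For comparison, here is a short sketch of the same method with `regex=True`, where the pattern is interpreted as a regular expression. The phone numbers are invented sample values.

```python
import pandas as pd

# Hypothetical phone numbers in inconsistent formats.
phones = pd.Series(['+91-9876543210', '(555) 123-4567', '555.987.6543'])

# regex=True: \D matches any non-digit character, so only digits remain.
digits_only = phones.str.replace(r'\D', '', regex=True)

# regex=False: '.' is treated as a literal dot, not as "any character".
no_dots = phones.str.replace('.', '', regex=False)

print(digits_only.tolist())
print(no_dots.tolist())
```

Passing `regex` explicitly avoids surprises with characters such as `.` or `+`, which have special meaning in regex patterns.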


## ✅ 3. **Different Methods and Techniques**

| Operation                    | Method / Function | Description                        |
| ---------------------------- | ----------------- | ---------------------------------- |
| Combine columns (simple)     | `+`               | Concatenate directly               |
| Combine with separator       | `str.cat()`       | Join strings with separator        |
| Split into multiple columns  | `str.split()`     | Split by delimiter                 |
| Extract pattern              | `str.extract()`   | Regex-based partial extract        |
| Separate parts conditionally | `.apply(lambda)`  | Custom logic                       |
| Clean up formats             | `str.replace()`   | Clean/remove patterns              |
| Join multiple columns        | `str.cat([...])`  | Pass a list of Series as `others`  |

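Two of the rows above have no example elsewhere in this section, so here is a hedged sketch of both: a row-wise join across several columns (using `DataFrame.agg` with `str.join`, one common idiom among several) and conditional logic with `.apply(lambda)`. The `df_demo` frame and `contacts` Series are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration.
df_demo = pd.DataFrame({
    'first_name': ['John', 'Jane'],
    'last_name': ['Doe', 'Smith'],
    'dept': ['HR', 'Finance'],
})

# Row-wise join: ' | '.join is applied to the values of each row.
# All selected columns must already hold strings.
cols = ['first_name', 'last_name', 'dept']
df_demo['label'] = df_demo[cols].agg(' | '.join, axis=1)

# Conditional logic: take the e-mail domain only when the value has an '@'.
contacts = pd.Series(['john@example.com', '555-1234', 'jane@example.org'])
domain = contacts.apply(lambda s: s.split('@')[1] if '@' in s else None)

print(df_demo['label'].tolist())
print(domain.tolist())
```

`.apply` runs a Python function per element, so it is usually slower than the vectorized `.str` methods; it is best kept for logic those methods cannot express.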

## ✅ 4. **Common Pitfalls and Best Practices**

### ❌ Pitfalls:

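* **Missing values (`NaN`)** propagate through `+` and `str.cat()`: if any operand is `NaN`, the combined value is `NaN` unless you handle it first.
* **Forgetting `expand=True`** on `str.split()` returns a Series of lists instead of separate columns.
* **Using `+` on numeric columns** by mistake adds the numbers instead of concatenating them, and mixing string and numeric columns raises a `TypeError`.
* **Regex confusion**: `str.extract()` requires at least one capture group `()` in the pattern; each group becomes an output column.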

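The first and third pitfalls can be reproduced with a small sketch. The `names` frame and its `emp_no` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing first name and a numeric column.
names = pd.DataFrame({
    'first_name': ['John', np.nan],
    'last_name': ['Doe', 'Smith'],
    'emp_no': [1001, 1002],
})

# NaN propagates: the second row becomes NaN, not ' Smith'.
full = names['first_name'] + ' ' + names['last_name']

# Adding a string column to an integer column directly typically raises
# a TypeError, so convert the numbers to strings first.
tagged = names['last_name'] + '-' + names['emp_no'].astype(str)

print(full.tolist())
print(tagged.tolist())
```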
### ✅ Best Practices:

* Use `.fillna('')` before combining, or pass `na_rep` to `str.cat()`, to handle `NaN`.
* Always verify column data types using `df.dtypes`.
* Prefer `str.cat()` over `+` for multiple columns.
* Use `str.extract()` for precise string pattern capture.
* Chain `.str.strip()` to remove whitespace after splitting.

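A brief sketch of two of these practices, filling missing values before combining and stripping whitespace after splitting, might look like this. The `raw` frame is invented sample data with deliberately messy spacing.

```python
import numpy as np
import pandas as pd

# Hypothetical messy input: a missing last name and stray spaces.
raw = pd.DataFrame({
    'first_name': ['John', 'Jane'],
    'last_name': ['Doe', np.nan],
    'location': ['New York , NY', 'Austin,  TX'],
})

# Fill NaN with '' before combining, then strip the leftover space.
full_name = (
    raw['first_name'].fillna('') + ' ' + raw['last_name'].fillna('')
).str.strip()

# Split first, then strip each resulting column.
parts = raw['location'].str.split(',', expand=True)
raw['city'] = parts[0].str.strip()
raw['state'] = parts[1].str.strip()

print(full_name.tolist())
print(raw[['city', 'state']])
```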

## ✅ 5. **Examples on Real/Pseudo Data**

In [14]:
df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Alice'],
    'last_name': ['Doe', 'Smith', 'Johnson'],
    'location': ['New York,NY', 'Los Angeles,CA', 'Austin,TX'],
    'code': ['ab-123', 'xy-456', 'zz-789'],
    'info': ['EmpID:1001|HR', 'EmpID:1002|Finance', 'EmpID:1003|IT']
})

df

|   | first_name | last_name | location | code | info |
| --- | --- | --- | --- | --- | --- |
| 0 | John | Doe | New York,NY | ab-123 | EmpID:1001\|HR |
| 1 | Jane | Smith | Los Angeles,CA | xy-456 | EmpID:1002\|Finance |
| 2 | Alice | Johnson | Austin,TX | zz-789 | EmpID:1003\|IT |


In [15]:
# 1. Combine names
df['full_name'] = df['first_name'] + ' ' + df['last_name']
df

|   | first_name | last_name | location | code | info | full_name |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | John | Doe | New York,NY | ab-123 | EmpID:1001\|HR | John Doe |
| 1 | Jane | Smith | Los Angeles,CA | xy-456 | EmpID:1002\|Finance | Jane Smith |
| 2 | Alice | Johnson | Austin,TX | zz-789 | EmpID:1003\|IT | Alice Johnson |


In [16]:
df['full_name1'] = df['first_name'].str.cat(df['last_name'], sep=' ')
df

|   | first_name | last_name | location | code | info | full_name | full_name1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | John | Doe | New York,NY | ab-123 | EmpID:1001\|HR | John Doe | John Doe |
| 1 | Jane | Smith | Los Angeles,CA | xy-456 | EmpID:1002\|Finance | Jane Smith | Jane Smith |
| 2 | Alice | Johnson | Austin,TX | zz-789 | EmpID:1003\|IT | Alice Johnson | Alice Johnson |


In [17]:
# 2. Split location
df['location'].str.split(',')

0       [New York, NY]
1    [Los Angeles, CA]
2         [Austin, TX]
Name: location, dtype: object

In [18]:
df['location'].str.split(',', expand=True)

|   | 0 | 1 |
| --- | --- | --- |
| 0 | New York | NY |
| 1 | Los Angeles | CA |
| 2 | Austin | TX |


In [19]:
df[['city', 'state']] = df['location'].str.split(',', expand=True)
df

|   | first_name | last_name | location | code | info | full_name | full_name1 | city | state |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | John | Doe | New York,NY | ab-123 | EmpID:1001\|HR | John Doe | John Doe | New York | NY |
| 1 | Jane | Smith | Los Angeles,CA | xy-456 | EmpID:1002\|Finance | Jane Smith | Jane Smith | Los Angeles | CA |
| 2 | Alice | Johnson | Austin,TX | zz-789 | EmpID:1003\|IT | Alice Johnson | Alice Johnson | Austin | TX |


In [20]:
# 3. Extract numbers from code
df['num_codes'] = df['code'].str.extract(r'(\d+)')
df

|   | first_name | last_name | location | code | info | full_name | full_name1 | city | state | num_codes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | John | Doe | New York,NY | ab-123 | EmpID:1001\|HR | John Doe | John Doe | New York | NY | 123 |
| 1 | Jane | Smith | Los Angeles,CA | xy-456 | EmpID:1002\|Finance | Jane Smith | Jane Smith | Los Angeles | CA | 456 |
| 2 | Alice | Johnson | Austin,TX | zz-789 | EmpID:1003\|IT | Alice Johnson | Alice Johnson | Austin | TX | 789 |


In [21]:
# 4. Replace hyphen in code
df['clean_code'] = df['code'].str.replace('-', '', regex=False)
df

|   | first_name | last_name | location | code | info | full_name | full_name1 | city | state | num_codes | clean_code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | John | Doe | New York,NY | ab-123 | EmpID:1001\|HR | John Doe | John Doe | New York | NY | 123 | ab123 |
| 1 | Jane | Smith | Los Angeles,CA | xy-456 | EmpID:1002\|Finance | Jane Smith | Jane Smith | Los Angeles | CA | 456 | xy456 |
| 2 | Alice | Johnson | Austin,TX | zz-789 | EmpID:1003\|IT | Alice Johnson | Alice Johnson | Austin | TX | 789 | zz789 |


In [22]:
# 5. Split complex info using regex
df[['emp_id', 'dept']] = df['info'].str.extract(r'EmpID:(\d+)\|(\w+)')
df

|   | first_name | last_name | location | code | info | full_name | full_name1 | city | state | num_codes | clean_code | emp_id | dept |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | John | Doe | New York,NY | ab-123 | EmpID:1001\|HR | John Doe | John Doe | New York | NY | 123 | ab123 | 1001 | HR |
| 1 | Jane | Smith | Los Angeles,CA | xy-456 | EmpID:1002\|Finance | Jane Smith | Jane Smith | Los Angeles | CA | 456 | xy456 | 1002 | Finance |
| 2 | Alice | Johnson | Austin,TX | zz-789 | EmpID:1003\|IT | Alice Johnson | Alice Johnson | Austin | TX | 789 | zz789 | 1003 | IT |


## ✅ 6. **Real World Use Cases**

| Use Case                       | Description                                                               |
| ------------------------------ | ------------------------------------------------------------------------- |
| 🔍 **Building full names**     | Combine `first_name` + `last_name` into `full_name`                       |
| 🏙️ **Extracting geolocation** | Split `location` like `City,State` into separate columns                  |
| 📦 **Product identifiers**     | Split product SKUs like `123-ABC-456`                                     |
| 📝 **Log or message parsing**  | Extract datetime, error code from logs like `ERROR[2023-07-21]: Timeout`  |
| 🧾 **Invoice parsing**         | Split `CustomerID InvoiceAmount Date` into fields                         |
| 📊 **Feature engineering**     | Create new features like `CustomerSegment-Region` by combining            |
| 📞 **Cleaning contact data**   | Remove special characters from phone numbers like `+91-9876543210`        |

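As a rough sketch of two of these use cases, the example below parses log lines shaped like `ERROR[2023-07-21]: Timeout` and splits SKUs shaped like `123-ABC-456`. The second log line, the second SKU, and all output column names are assumptions made for illustration.

```python
import pandas as pd

# Log parsing: named groups label the level, date, and message.
logs = pd.Series([
    'ERROR[2023-07-21]: Timeout',
    'WARN[2023-07-22]: Disk almost full',  # hypothetical second line
])
parsed = logs.str.extract(
    r'^(?P<level>\w+)\[(?P<date>\d{4}-\d{2}-\d{2})\]:\s*(?P<message>.*)$'
)
# Convert the extracted text to real datetimes for later filtering.
parsed['date'] = pd.to_datetime(parsed['date'])

# Product identifiers: split each SKU on '-' into three parts.
skus = pd.Series(['123-ABC-456', '789-XYZ-012'])
sku_parts = skus.str.split('-', expand=True)
sku_parts.columns = ['prefix', 'family', 'suffix']  # hypothetical names

print(parsed)
print(sku_parts)
```

The SKU parts stay as strings, which preserves leading zeros such as `012`.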
