## 🔄 **Data Transformation Techniques (pandas)**

### 1. **Applying Functions**

* **What**: Apply custom or built-in functions element-wise, row-wise, or column-wise.
* **When**: To perform operations like normalization, transformations, or feature creation.
* **Key methods**:

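  * `.apply(func, axis=0)` → passes each column to `func`; `axis=1` passes each row
  * `.map(func)` → element-wise on a Series
  * `.applymap(func)` → element-wise on a DataFrame (renamed to `DataFrame.map` in pandas 2.1, where `applymap` is deprecated)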
* **Example**:

  ```python
  import numpy as np
  df['log_col'] = df['col'].apply(np.log)  # same result as np.log(df['col'])
  ```
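* **Pitfalls**: `apply` hands whole rows or columns to the function, while `applymap` hands it single values; mixing them up produces errors or wrong results. Avoid `.apply()` and explicit loops when a vectorized operation does the job.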
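
To make the difference between the methods above concrete, here is a minimal sketch on a small made-up DataFrame (the column names `a` and `b` and all variable names are illustrative assumptions, not from a real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration only
df = pd.DataFrame({'a': [1.0, 2.0, 4.0], 'b': [10.0, 20.0, 40.0]})

# axis=0 (default): the function receives each column as a Series
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_sum = df.apply(lambda row: row['a'] + row['b'], axis=1)

# Series.map: the function receives one element at a time
a_squared = df['a'].map(lambda x: x ** 2)

# Vectorized call on the whole column; gives the same values as
# df['a'].apply(np.log) without a Python-level call per element
log_a = np.log(df['a'])
```

The row-wise call (`axis=1`) is typically the slowest of the four, so it is worth checking whether plain column arithmetic such as `df['a'] + df['b']` would do instead.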

---

### 2. **Lambda Functions**

* **What**: Anonymous one-line functions used inline.
* **When**: Quick transformations within `.apply()` or `.map()`.
* **Syntax**: `lambda x: x * 2`
* **Use**:

  ```python
  df['adjusted'] = df['price'].apply(lambda x: x * 1.1 if x > 100 else x)
  ```
* **Best Practice**: Keep complex logic readable: replace deeply nested lambdas with a named function.
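
As a sketch of that trade-off, the lambda from the example can be rewritten as a named function. The function name `adjust_price` and its `threshold` and `factor` parameters are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical sample data for illustration only
df = pd.DataFrame({'price': [50.0, 100.0, 200.0]})

def adjust_price(price, threshold=100, factor=1.1):
    """Multiply prices above `threshold` by `factor`; leave others unchanged."""
    return price * factor if price > threshold else price

# Inline lambda: fine for a single short condition
df['adjusted_lambda'] = df['price'].apply(lambda x: x * 1.1 if x > 100 else x)

# Named function: same result, but it can be documented, tested, and reused
df['adjusted_named'] = df['price'].apply(adjust_price)
```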

---

### 3. **Discretization & Binning**

* **What**: Convert continuous values into discrete bins.
* **When**: Useful in feature engineering, decision trees, histogram plots.
* **Methods**:

  * `pd.cut()` → for custom or equal-width bins
  * `pd.qcut()` → for quantile-based bins
* **Example**:

  ```python
  df['age_group'] = pd.cut(df['age'], bins=[0, 18, 60, 100], labels=['child', 'adult', 'senior'])
  ```
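* **Pitfall**: `labels` needs exactly one entry fewer than `bins`, and values outside the bin edges silently become `NaN`. Intervals are right-closed by default, so with `bins=[0, 18, 60, 100]` an age of exactly 0 lands in no group unless you pass `include_lowest=True`.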
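
The edge behaviour is easy to check on a few boundary values. The sketch below uses made-up ages and the same bins as the example; the variable names are illustrative assumptions:

```python
import pandas as pd

# Hypothetical ages chosen to sit on the bin edges
ages = pd.Series([0, 5, 18, 19, 60, 61, 100])
bins = [0, 18, 60, 100]
labels = ['child', 'adult', 'senior']

# Default right=True gives (0, 18], (18, 60], (60, 100]:
# an age of exactly 0 is outside every interval and becomes NaN
default_bins = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 is kept
closed_bins = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

# qcut picks the edges from the data so each bin holds about the same number of rows
values = pd.Series(range(1, 9))
halves = pd.qcut(values, q=2, labels=['low', 'high'])
```

Counting missing values after binning (`default_bins.isna().sum()`) is a cheap way to catch values that fell outside the edges.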

---

### 4. **Sorting**

* **What**: Order your data by index or values.
* **When**: For top-k analysis, rankings, etc.
* **Methods**:

  * `.sort_values(by=..., ascending=...)`
  * `.sort_index()`
* **Example**:

  ```python
  df_sorted = df.sort_values(by='sales', ascending=False)  # returns a new DataFrame
  ```
* **Pitfall**: `.sort_values()` and `.sort_index()` return a new object by default and leave the original untouched; assign the result or pass `inplace=True`.
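
A short sketch of a top-k query and a multi-key sort, using a made-up `region`/`sales` table (all names here are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample data for illustration only
df = pd.DataFrame({'region': ['N', 'S', 'N', 'S'], 'sales': [30, 10, 20, 40]})

# Returns a new, sorted DataFrame; df itself keeps its original order
ranked = df.sort_values(by='sales', ascending=False)
top2 = ranked.head(2)

# Several keys: region ascending, then sales descending within each region
by_region = df.sort_values(by=['region', 'sales'], ascending=[True, False])

# Sorting keeps the old index labels; reset them for a clean 0..n-1 index
ranked_clean = ranked.reset_index(drop=True)
```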

---

### 5. **Combining Columns / Splitting Strings**

* **What**: Combine multiple columns or split a string column.
* **When**: Feature extraction, data parsing, formatting.
* **Methods**:

  * String operations: `.str.split()`, `.str.extract()`, `.str.cat()`
* **Example**:

  ```python
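  df[['first', 'last']] = df['full_name'].str.split(' ', n=1, expand=True)
  ```
* **Best Practice**: Use `expand=True` to get a DataFrame when splitting, and pass `n=1` when a value may contain more than one separator; otherwise the split can produce more columns than the assignment target and raise an error.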
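
The sketch below shows splitting and recombining on made-up names, including one with three parts and one with no space at all. The `label` column is a hypothetical name used only for illustration:

```python
import pandas as pd

# Hypothetical sample data for illustration only
df = pd.DataFrame({'full_name': ['Ada Lovelace', 'Jean Luc Picard', 'Plato']})

# n=1 splits at the first space only, so there are at most two columns
parts = df['full_name'].str.split(' ', n=1, expand=True)
df['first'] = parts[0]
df['last'] = parts[1]   # missing where the name has no space

# str.cat joins two string columns element-wise;
# a missing value on either side gives a missing result
df['label'] = df['last'].str.cat(df['first'], sep=', ')
```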

---

### 6. **Value Transformations**

* **What**: Replace, scale, normalize, or encode values.
* **When**: Clean data, prepare features for ML.
* **Methods**:

  * `.replace()`, `.fillna()`, `.astype()`, `.map()`
  * Scaling via `sklearn.preprocessing`
* **Example**:

  ```python
  df['gender'] = df['gender'].replace({'Male': 0, 'Female': 1})
  ```
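* **Pitfall**: `.map()` with a dict turns every value missing from the dict into `NaN`, whereas `.replace()` leaves unmatched values unchanged. Before pandas 2.1, `.map()` existed only on Series.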
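
One way to handle missing keys is to look at what failed to map before filling it in. This is a minimal sketch; the sample values and the `-1` sentinel are assumptions for illustration:

```python
import pandas as pd

# Hypothetical sample data, including a category the dict does not cover
df = pd.DataFrame({'gender': ['Male', 'Female', 'Other', None]})
codes = {'Male': 0, 'Female': 1}

# Values absent from the dict (and existing missing values) become NaN
mapped = df['gender'].map(codes)

# Original values that were present but had no entry in the dict
unmapped = df.loc[mapped.isna() & df['gender'].notna(), 'gender'].unique()

# Fill with an explicit sentinel before casting back to integers
encoded = mapped.fillna(-1).astype(int)
```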

---

### 7. **Handling Duplicates or Unique Transformations**

* **What**: Remove or identify duplicates/uniques.
* **When**: Clean raw datasets, detect anomalies.
* **Methods**:

  * `.duplicated()`, `.drop_duplicates()`, `.nunique()`, `.unique()`
* **Example**:

  ```python
  df.drop_duplicates(subset=['user_id'])
  ```
* **Best Practice**: Inspect the rows flagged by `.duplicated(subset=..., keep=False)` before dropping them, and choose `keep='first'`, `'last'`, or `False` deliberately.
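
A sketch of that check on a made-up visit log (the `visit` column and the variable names are illustrative assumptions):

```python
import pandas as pd

# Hypothetical sample data for illustration only
df = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'visit':   ['a', 'b', 'c', 'd', 'd'],
})

# keep=False flags every member of a duplicate group, handy for inspection
dupes = df[df.duplicated(subset=['user_id'], keep=False)]

# Keep the first row per user_id; the result is a new DataFrame
first_per_user = df.drop_duplicates(subset=['user_id'], keep='first')

# Without subset, only rows identical in every column are dropped
exact_dedup = df.drop_duplicates()

# Sanity check on how many rows were removed
n_removed = len(df) - len(first_per_user)
```

Here the subset-based call removes two rows, while the exact-match call removes only one, which is the kind of difference worth confirming before overwriting the original data.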

---

### 8. **Conditional Transformations / New Columns**

* **What**: Create new columns based on conditions.
* **When**: Derive features, flag categories, group data.
* **Methods**:

  * `np.where()`, `.apply(lambda)`, `df.loc[cond, 'new_col']`
* **Example**:

  ```python
  df['category'] = np.where(df['price'] > 100, 'Premium', 'Standard')
  ```
* **Best Practice**: Use vectorized operations like `np.where()` over row-wise `.apply()` for speed.
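
For more than two outcomes, nested `np.where()` calls get hard to read. The sketch below uses `np.select()`, which is not mentioned above but is a standard NumPy function, next to the `.loc` approach. The thresholds, labels, and column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration only
df = pd.DataFrame({'price': [20, 100, 150, 500]})

# Two outcomes: np.where(condition, value_if_true, value_if_false)
df['category'] = np.where(df['price'] > 100, 'Premium', 'Standard')

# Several outcomes: np.select uses the first condition that is True per row
conditions = [df['price'] > 300, df['price'] > 100]
choices = ['Luxury', 'Premium']
df['tier'] = np.select(conditions, choices, default='Standard')

# .loc: set a default first, then overwrite the rows that match
df['flag'] = 'no'
df.loc[df['price'] >= 100, 'flag'] = 'yes'
```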

---

### 9. **Pivot / Melt for Long-Wide Formats**

* **What**: Convert between wide and long formats.
* **When**: Reshape data for analysis or plotting.
* **Methods**:

  * `.pivot()`, `.pivot_table()`, `pd.melt()`
* **Example**:

  ```python
  pd.melt(df, id_vars=['ID'], value_vars=['Jan', 'Feb'])
  ```
* **Pitfall**: `pivot()` raises a `ValueError` when the same index/column pair occurs more than once; use `pivot_table()` with an `aggfunc` to aggregate the duplicates.
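
The difference shows up as soon as the long table contains a repeated pair. A minimal sketch on made-up monthly sales (all names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical long-format data; (ID=2, month='Feb') appears twice
long_df = pd.DataFrame({
    'ID':    [1, 1, 2, 2, 2],
    'month': ['Jan', 'Feb', 'Jan', 'Feb', 'Feb'],
    'sales': [10, 20, 30, 40, 60],
})

# pivot() cannot decide which of the two values to place in the cell
try:
    long_df.pivot(index='ID', columns='month', values='sales')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead (here: sum)
wide = long_df.pivot_table(index='ID', columns='month', values='sales', aggfunc='sum')

# melt() goes back from wide to long
back = wide.reset_index().melt(id_vars='ID', var_name='month', value_name='sales')
```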

---

### 10. **Reshaping / Transposing**

* **What**: Rearrange how the data is laid out across rows and columns.
* **When**: Match the layout a model or function expects, prepare for broadcasting, or move index levels into columns and back.
* **Methods**:

  * `.T`, `.stack()`, `.unstack()`, `.melt()`, `.pivot()`
* **Example**:

  ```python
  df.T  # Transpose
  ```
* **Best Practice**: Be aware of index hierarchy when using `.stack()` and `.unstack()`.
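
A small sketch of how `.stack()` and `.unstack()` move a level between the columns and the index, using a made-up store-by-month table (the names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical wide table: one row per store, one column per month
df = pd.DataFrame(
    {'Jan': [10, 30], 'Feb': [20, 40]},
    index=pd.Index(['A', 'B'], name='store'),
)

# Transpose: months become rows, stores become columns
transposed = df.T

# stack: column labels move into the index, giving a Series
# with a two-level index (store, month)
stacked = df.stack()

# unstack: the innermost index level moves back to the columns
restored = stacked.unstack()
```

After `.stack()`, a single value is addressed with a tuple, for example `stacked.loc[('A', 'Feb')]`.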

---

### 11. **Window Functions (Basic)**

* **What**: Perform operations over a sliding window of rows.
* **When**: Time series analysis, moving averages, rolling stats.
* **Methods**:

  * `.rolling(window=n).mean()`, `.expanding().sum()`, `.ewm()`
* **Example**:

  ```python
  df['rolling_avg'] = df['sales'].rolling(window=3).mean()
  ```
* **Pitfall**: `.rolling(window=n)` returns `NaN` for the first `n - 1` rows because the window is not yet full; set `min_periods` to allow partial windows.
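
The effect of `min_periods` and the difference between the three window types can be seen on a four-value series. The values below are made up for illustration:

```python
import pandas as pd

# Hypothetical sales figures for illustration only
sales = pd.Series([10.0, 20.0, 30.0, 40.0])

# Full windows only: the first two results are NaN
strict = sales.rolling(window=3).mean()

# Partial windows allowed: averages over however many rows are available
lenient = sales.rolling(window=3, min_periods=1).mean()

# Expanding window: grows from the first row, giving a running total
running_total = sales.expanding().sum()

# Exponentially weighted mean: recent rows weigh more than older ones
smoothed = sales.ewm(span=3).mean()
```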

---

## ✅ Summary Table

| Technique                      | Common Functions / Methods                | Primary Use Case                |
| ------------------------------ | ----------------------------------------- | ------------------------------- |
| Applying Functions             | `.apply()`, `.map()`, `.applymap()`       | Transform elements/rows/columns |
| Lambda Functions               | `lambda x: ...`                           | Quick inline transformations    |
| Discretization & Binning       | `pd.cut()`, `pd.qcut()`                   | Bucket numeric values           |
| Sorting                        | `.sort_values()`, `.sort_index()`         | Rank/order data                 |
| Combining/Splitting Strings    | `.str.split()`, `.str.cat()`              | Feature engineering             |
| Value Transformations          | `.replace()`, `.fillna()`, `.astype()`    | Cleaning / Encoding             |
| Handling Duplicates/Uniqueness | `.duplicated()`, `.drop_duplicates()`     | De-duplication                  |
| Conditional Columns            | `np.where()`, `.loc[]`, `.apply(lambda)`  | Feature derivation              |
| Pivot/Melt                     | `.pivot()`, `.pivot_table()`, `pd.melt()` | Long ↔ Wide format reshaping    |
| Reshaping/Transposing          | `.T`, `.stack()`, `.unstack()`            | Row/column transformations      |
| Window Functions (Basic)       | `.rolling()`, `.expanding()`, `.ewm()`    | Time-based summaries            |


