# **Data Transformation**

## **4. Sorting**

In [1]:
import numpy as np 
import pandas as pd 

## 1. 🧠 What It Does & When to Use It

### 📌 What?

**Sorting** in pandas refers to **reordering** the rows or columns of a DataFrame or Series based on:

* The **values** of one or more columns (`.sort_values()`), or
* The **index** (`.sort_index()`)

---

### 🎯 When to Use?

Use sorting when:

* You want to **organize data** for display or reporting
* You’re trying to find **top or bottom records** (e.g., top 5 highest-paid employees)
* You need **ordered data** for time series analysis or plotting
* You want to **detect anomalies** (e.g., negative values rising to the top of an ascending sort)
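
As a minimal sketch of the "top or bottom records" case, the snippet below sorts a small, made-up `employees` frame (the names and salaries are illustrative assumptions, not data from this notebook) and keeps the three highest salaries. `nlargest()` is shown as an alternative that the pandas documentation describes as equivalent to sorting descending and taking `head(n)`.

```python
import pandas as pd

# Hypothetical employee data, used only for illustration.
employees = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Evan"],
    "salary": [50000, 60000, 80000, 55000, 72000],
})

# Top 3 earners: sort descending, then keep the first three rows.
top3 = employees.sort_values("salary", ascending=False).head(3)

# Alternative: nlargest() returns the same rows in the same order.
top3_alt = employees.nlargest(3, "salary")

print(top3)
```

For the bottom records, the same pattern works with `ascending=True` (the default) or with `nsmallest()`.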


## 2. 🧾 Syntax & Core Parameters

### 🔹 `df.sort_values()`

```python
df.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
```

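| Parameter     | Description                                                                                                     |
| ------------- | --------------------------------------------------------------------------------------------------------------- |
| `by`          | Column name(s) to sort by (`str` or `list` of `str`)                                                            |
| `axis`        | 0 = reorder rows by column values (default), 1 = reorder columns by the values in the row(s) named in `by`      |
| `ascending`   | `True` (ascending) or `False` (descending); a list of bools sets the direction for each `by` column             |
| `inplace`     | If `True`, sort the object in place and return `None`                                                           |
| `kind`        | Sorting algorithm: `'quicksort'`, `'mergesort'`, `'heapsort'`, `'stable'`; applied only for single-column sorts |
| `na_position` | `'first'` or `'last'`: where to place NaNs (default `'last'`)                                                   |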
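
The `axis` parameter is the least intuitive one in the table, so here is a small sketch of `axis=1`. With `axis=1`, `by` names a row label and the columns are reordered by the values found in that row. The `grid` frame and its labels are assumptions made up for this illustration.

```python
import pandas as pd

# Hypothetical all-numeric frame with named rows.
grid = pd.DataFrame(
    {"a": [3, 1], "b": [1, 2], "c": [2, 3]},
    index=["row1", "row2"],
)

# axis=0 (default): reorder the rows by the values in column "a".
rows_by_a = grid.sort_values(by="a")

# axis=1: reorder the columns by the values in the row labeled "row1".
cols_by_row1 = grid.sort_values(by="row1", axis=1)

print(cols_by_row1)
```

Sorting along `axis=1` only makes sense when the values in the chosen row are mutually comparable, which is why the sketch uses an all-numeric frame.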

---

### 🔹 `df.sort_index()`

```python
df.sort_index(axis=0, ascending=True, inplace=False)
```

| Parameter   | Description                                       |
| ----------- | ------------------------------------------------- |
| `axis`      | 0 = sort rows by index, 1 = sort columns by label |
| `ascending` | Sort ascending or descending                      |
| `inplace`   | Whether to sort in-place                          |
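
A short sketch of `sort_index()` on a time-indexed Series, assuming a few made-up temperature readings that arrived out of order. The variable names and values are illustrative only.

```python
import pandas as pd

# Hypothetical readings recorded out of chronological order.
readings = pd.Series(
    [21.5, 19.8, 20.4],
    index=pd.to_datetime(["2024-01-03", "2024-01-01", "2024-01-02"]),
)

# Oldest first (ascending=True is the default).
oldest_first = readings.sort_index()

# Newest first.
newest_first = readings.sort_index(ascending=False)

print(oldest_first)
```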


## 3. 🧰 Different Methods & Techniques

### 🔸 A. `sort_values()` by a single column

```python
df.sort_values(by='age')
```

---

### 🔸 B. `sort_values()` by multiple columns

```python
df.sort_values(by=['department', 'salary'], ascending=[True, False])
```

Sorts by `department` alphabetically, then by `salary` descending **within** each department.

---

### 🔸 C. `sort_index()` for sorting by row or column index

```python
df.sort_index()              # Sort rows by index
df.sort_index(axis=1)        # Sort columns by label (alphabetically for string labels)
```

---

### 🔸 D. Sorting with `na_position` and `inplace`

```python
# inplace=True modifies df itself and returns None, so do not assign the result
df.sort_values('score', na_position='first', inplace=True)
```

---

### 🔸 E. Sorting a Series

```python
df['score'].sort_values(ascending=False)
```


## 4. ⚠️ Common Pitfalls & Best Practices

### ❌ Pitfalls

| Problem                                            | Explanation                                                                                                  |
| -------------------------------------------------- | ------------------------------------------------------------------------------------------------------------ |
| Sorting not applied                                | Sorting returns a new object by default; assign the result back or pass `inplace=True`                       |
| NaNs affect order                                  | NaNs go last by default, in both ascending and descending sorts; set `na_position` explicitly                |
| Sorting column doesn’t exist                       | A typo in the `by=` column name raises `KeyError`                                                            |
| Unexpected result on multi-column sort             | The `ascending` list must match the `by` list in length and order; a length mismatch raises `ValueError`     |
| Sorting columns when intending rows (wrong `axis`) | Use `axis=0` for rows, `axis=1` for columns                                                                  |
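
The pitfalls in the table can be reproduced on a tiny frame. This is a sketch under the assumption of a made-up `staff` frame; the misspelled column name `"agee"` is deliberate.

```python
import pandas as pd

# Hypothetical data, used only to reproduce the pitfalls.
staff = pd.DataFrame({"age": [32, 25, 29], "salary": [60000, 50000, 55000]})

# Pitfall: the sorted copy is thrown away, so `staff` keeps its original order.
staff.sort_values("age")
order_after_discard = staff["age"].tolist()

# Fix: assign the result to a name.
staff_sorted = staff.sort_values("age")

# inplace=True sorts the object itself and returns None.
staff_copy = staff.copy()
inplace_result = staff_copy.sort_values("age", inplace=True)

# Pitfall: a misspelled column name raises KeyError.
try:
    staff.sort_values("agee")
    key_error_raised = False
except KeyError:
    key_error_raised = True

# Pitfall: `ascending` needs one entry per `by` column.
try:
    staff.sort_values(["age", "salary"], ascending=[True])
    value_error_raised = False
except ValueError:
    value_error_raised = True

print(order_after_discard, inplace_result, key_error_raised, value_error_raised)
```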

---

### ✅ Best Practices

* Prefer `inplace=False` (default) and assign result to a variable for **safe transformations**
* Always verify **column names** used in sorting
* Use `sort_values()` with `kind='mergesort'` (or `kind='stable'`) if a **stable sort** is needed (keeps the original order of equal elements); `kind` is applied only when sorting on a single column
* Use `sort_index()` when working with **indexed time series or hierarchical data**
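
To make the stable-sort recommendation concrete, here is a minimal sketch on a made-up `people` frame with a tie in `salary`. With a stable algorithm, tied rows keep the order they had before sorting.

```python
import pandas as pd

# Hypothetical data: Diana and Alice tie on salary, and Diana comes first.
people = pd.DataFrame({
    "name": ["Diana", "Bob", "Alice", "Charlie"],
    "salary": [50000, 60000, 50000, 80000],
})

# A stable sort keeps Diana ahead of Alice because that was their input order.
stable_sorted = people.sort_values("salary", kind="mergesort")

print(stable_sorted)
```

With the default `kind='quicksort'`, the relative order of tied rows is not guaranteed, so a stable `kind` is the safer choice whenever that order matters.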


## 5. 🧪 Examples on Real/Pseudo Data

In [2]:
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 32, 37, 29],
    'salary': [50000, 60000, 80000, 50000],
    'department': ['HR', 'Tech', 'Tech', 'HR']
}
df = pd.DataFrame(data)

df

|     | name    | age | salary | department |
| --- | ------- | --- | ------ | ---------- |
| 0   | Alice   | 25  | 50000  | HR         |
| 1   | Bob     | 32  | 60000  | Tech       |
| 2   | Charlie | 37  | 80000  | Tech       |
| 3   | Diana   | 29  | 50000  | HR         |


### ▶️ Sort by Age (Ascending)

In [3]:
df.sort_values('age')

|     | name    | age | salary | department |
| --- | ------- | --- | ------ | ---------- |
| 0   | Alice   | 25  | 50000  | HR         |
| 3   | Diana   | 29  | 50000  | HR         |
| 1   | Bob     | 32  | 60000  | Tech       |
| 2   | Charlie | 37  | 80000  | Tech       |


### ▶️ Sort by Department Ascending, then Salary Descending

In [5]:
df.sort_values(['department', 'salary'], ascending=[True, False])

|     | name    | age | salary | department |
| --- | ------- | --- | ------ | ---------- |
| 0   | Alice   | 25  | 50000  | HR         |
| 3   | Diana   | 29  | 50000  | HR         |
| 2   | Charlie | 37  | 80000  | Tech       |
| 1   | Bob     | 32  | 60000  | Tech       |


### ▶️ Sort by Index (Row)

In [7]:
df.sort_index(ascending=False)

|     | name    | age | salary | department |
| --- | ------- | --- | ------ | ---------- |
| 3   | Diana   | 29  | 50000  | HR         |
| 2   | Charlie | 37  | 80000  | Tech       |
| 1   | Bob     | 32  | 60000  | Tech       |
| 0   | Alice   | 25  | 50000  | HR         |


### ▶️ Sort Columns Alphabetically

In [8]:
df.sort_index(axis=1)

|     | age | department | name    | salary |
| --- | --- | ---------- | ------- | ------ |
| 0   | 25  | HR         | Alice   | 50000  |
| 1   | 32  | Tech       | Bob     | 60000  |
| 2   | 37  | Tech       | Charlie | 80000  |
| 3   | 29  | HR         | Diana   | 50000  |


### ▶️ Sort with NaNs First

In [11]:
df.loc[2, 'salary'] = None
df

|     | name    | age | salary  | department |
| --- | ------- | --- | ------- | ---------- |
| 0   | Alice   | 25  | 50000.0 | HR         |
| 1   | Bob     | 32  | 60000.0 | Tech       |
| 2   | Charlie | 37  | NaN     | Tech       |
| 3   | Diana   | 29  | 50000.0 | HR         |


In [12]:
df.sort_values('salary', na_position='first')

|     | name    | age | salary  | department |
| --- | ------- | --- | ------- | ---------- |
| 2   | Charlie | 37  | NaN     | Tech       |
| 0   | Alice   | 25  | 50000.0 | HR         |
| 3   | Diana   | 29  | 50000.0 | HR         |
| 1   | Bob     | 32  | 60000.0 | Tech       |
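
One detail worth checking is that `na_position` works independently of `ascending`: a descending sort still places NaNs last unless told otherwise. The sketch below uses a separate, made-up `pay` frame (with no ties, so the order is unambiguous) rather than the `df` above.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one missing value and no ties.
pay = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "salary": [50000.0, 60000.0, np.nan, 55000.0],
})

# Descending sort: the NaN row still ends up last by default.
desc_nan_last = pay.sort_values("salary", ascending=False)

# Descending sort with NaNs moved to the top.
desc_nan_first = pay.sort_values("salary", ascending=False, na_position="first")

print(desc_nan_last)
print(desc_nan_first)
```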


## 6. 🌍 Real-World Use Cases

### 📈 Finance

* Sort transactions by **date** to prepare for time series analysis:

  ```python
  df.sort_values('transaction_date')
  ```

* Sort customers by **credit score** descending to assign risk:

  ```python
  df.sort_values('credit_score', ascending=False)
  ```
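
For the date-sorting case, the result is only chronological if the column holds real datetimes. The sketch below assumes a made-up `tx` frame whose `transaction_date` arrives as day-first text; the values and the `%d/%m/%Y` format are assumptions for illustration. Sorting the text orders it character by character, while sorting after `pd.to_datetime()` orders it by date.

```python
import pandas as pd

# Hypothetical transactions with dates stored as day/month/year text.
tx = pd.DataFrame({
    "transaction_date": ["15/09/2024", "01/10/2024", "20/01/2024"],
    "amount": [120.0, 75.5, 300.0],
})

# Sorting the raw strings compares characters, not dates.
sorted_as_text = tx.sort_values("transaction_date")

# Convert to datetime first, then sort chronologically.
tx["transaction_date"] = pd.to_datetime(tx["transaction_date"], format="%d/%m/%Y")
sorted_by_date = tx.sort_values("transaction_date")

print(sorted_by_date)
```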

---

### 🏥 Healthcare

* Sort patients by **priority level** and then by **arrival time**:

  ```python
  df.sort_values(by=['priority', 'arrival_time'], ascending=[True, True])
  ```
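
A runnable sketch of the triage sort, assuming `priority` is numeric with 1 as the most urgent level and `arrival_time` is a datetime column. The patient IDs, priority scale, and timestamps are invented for this example.

```python
import pandas as pd

# Hypothetical waiting list; priority 1 is assumed to be the most urgent.
patients = pd.DataFrame({
    "patient": ["P1", "P2", "P3", "P4"],
    "priority": [2, 1, 2, 1],
    "arrival_time": pd.to_datetime([
        "2024-05-01 09:10",
        "2024-05-01 09:30",
        "2024-05-01 08:55",
        "2024-05-01 09:05",
    ]),
})

# Most urgent first; within the same priority, earliest arrival first.
queue = patients.sort_values(by=["priority", "arrival_time"], ascending=[True, True])

print(queue)
```

If priorities were stored as text such as `"high"` or `"low"`, an alphabetical sort would not reflect urgency, so a numeric or ordered encoding is assumed here.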

---

### 🧾 HR Analytics

* Sort employees by **salary** descending to identify top earners:

  ```python
  df.sort_values('salary', ascending=False)
  ```

* Sort by **department and age** for retirement planning:

  ```python
  df.sort_values(['department', 'age'])
  ```

---

### 🛒 E-Commerce

* Sort products by **rating** and **sales volume**:

  ```python
  df.sort_values(['rating', 'units_sold'], ascending=[False, False])
  ```


## ✅ Summary Table

| Method / Parameter | Purpose                      | Notes                                  |
| ------------------ | ---------------------------- | -------------------------------------- |
| `sort_values()`    | Sort by column(s)            | Use for organizing data rows           |
| `sort_index()`     | Sort by index (rows/columns) | Use for indexed data/columns           |
| `ascending`        | Order of sort                | Can be a list for multi-column sorts   |
| `na_position`      | Handle NaNs                  | `'first'` or `'last'` (default)        |

