# **Data Selection & Indexing**

# Best Practices in Data Selection & Indexing

Understanding the mechanics of selection is only half the job; **knowing how to apply them effectively and safely** is what keeps real-world code correct. This section covers **best practices** that help you avoid common pitfalls, write cleaner code, and improve performance and readability.

---

## 📋 Topics Covered

| Subsection | Concept                                                |
| ---------- | ------------------------------------------------------ |
| 1.        | Use `.loc[]` and `.iloc[]` instead of chained indexing |
| 2.        | Avoid chained indexing (`df[col][row]`)                |
| 3.        | Use `.copy()` when slicing                             |
| 4.        | Use `select_dtypes()` for dynamic column selection     |
| 5.        | Use boolean masks for filtering over loops             |
| 6.        | Use `.query()` and `.eval()` for cleaner expressions   |
| 7.        | Don’t rely on row positions in production              |
| 8.        | Always check `.dtypes` before slicing                  |
| 9.        | Avoid using hardcoded column names repeatedly          |
| 10.       | Use vectorized operations over row-wise loops          |

---

### 1 Use `.loc[]` and `.iloc[]` over chained indexing

❌ Bad:

```python
df['col'][0] = 100  # Might not modify original df
```

✅ Good:

```python
df.loc[0, 'col'] = 100
```

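➡️ **Why?** `df['col'][0] = 100` runs as two separate indexing steps, and the second step may write to a temporary copy instead of `df`. pandas usually flags this with a `SettingWithCopyWarning`, but the assignment can still fail silently, and with Copy-on-Write enabled (the default from pandas 3.0) it never updates the original. `.loc[]` performs the selection and the assignment in a single operation.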
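As a minimal sketch of single-step assignment, the example below uses a small hypothetical frame (the column names and values are made up for illustration). It shows the label-based form with `.loc[]` and the position-based form with `.iloc[]`.

```python
import pandas as pd

# Hypothetical frame, for illustration only.
df = pd.DataFrame({'col': [1, 2, 3], 'other': ['a', 'b', 'c']})

# Label-based: row label 0 and column 'col' in one indexing operation.
df.loc[0, 'col'] = 100

# Position-based: second row, with the column position looked up by name.
df.iloc[1, df.columns.get_loc('col')] = 200

print(df)
```

Both assignments go through one indexer call, so pandas always writes to `df` itself.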

---

### 2 Avoid chained indexing (it's ambiguous)

Instead of this:

```python
df['salary'][df['salary'] > 50000]
```

Do this:

```python
df.loc[df['salary'] > 50000, 'salary']
```

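➡️ A single `.loc[]` call selects rows and columns in one step, so it is easier to read and behaves the same way whether you are reading values or assigning to them.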
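The same `.loc[mask, column]` pattern also works for conditional assignment. The sketch below assumes a hypothetical `band` column and made-up salary values, purely to illustrate the idea.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    'name': ['Ann', 'Ben', 'Cy'],
    'salary': [40000, 55000, 72000],
})

# Build the boolean mask once and reuse it.
high_earners = df['salary'] > 50000

# Read: rows where the mask is True, column 'salary'.
selected = df.loc[high_earners, 'salary']

# Write: set a default, then overwrite only the masked rows.
df['band'] = 'junior'
df.loc[high_earners, 'band'] = 'senior'

print(df)
```

Naming the mask (`high_earners`) keeps the condition readable and avoids repeating it.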

---

### 3 Use `.copy()` when creating slices

```python
young_employees = df[df['age'] < 30].copy()
```

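➡️ Without `.copy()`, pandas cannot tell whether a later assignment to `young_employees` is meant to affect `df`, so it may raise a `SettingWithCopyWarning`. An explicit `.copy()` makes the slice an independent DataFrame that you can modify freely.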
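A short sketch of what the explicit copy buys you, assuming a hypothetical frame and an illustrative `age_group` column: changes to the slice stay in the slice.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({'name': ['Ann', 'Ben', 'Cy'], 'age': [25, 31, 28]})

# Independent copy of the filtered rows.
young_employees = df[df['age'] < 30].copy()

# Adding a column to the copy raises no warning and leaves df untouched.
young_employees['age_group'] = 'under 30'

print(young_employees)
print(df.columns.tolist())
```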

---

### 4 Use `select_dtypes()` for type-safe selections

```python
numeric_df = df.select_dtypes(include='number')
```

➡️ Columns are chosen by dtype at runtime, so the code keeps working when columns are added or renamed and avoids hardcoded assumptions.
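`select_dtypes()` accepts both `include` and `exclude`. The sketch below, on a hypothetical frame, splits columns into numeric and non-numeric groups. Note that boolean columns are not treated as `'number'`.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    'name': ['Ann', 'Ben'],
    'age': [25, 31],
    'score': [88.5, 92.0],
    'active': [True, False],
})

# Column names by dtype group; bool is not part of 'number'.
numeric_cols = df.select_dtypes(include='number').columns.tolist()
other_cols = df.select_dtypes(exclude='number').columns.tolist()

# Apply a numeric-only operation without naming columns by hand.
means = df[numeric_cols].mean()
print(numeric_cols, other_cols)
```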

---

### 5 Prefer Boolean Masks to Row-by-Row Filters

❌ Inefficient:

```python
filtered = []
for i in range(len(df)):
    if df.loc[i, 'age'] > 25:
        filtered.append(df.loc[i])
```

✅ Vectorized:

```python
filtered = df[df['age'] > 25]
```

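➡️ Typically much faster on large DataFrames (often by orders of magnitude) and far more readable. The loop version also breaks if the index is not the default `0..n-1` range.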
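Masks can be combined with `&` (and), `|` (or), and `~` (not); each condition needs its own parentheses. A minimal sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    'age': [22, 27, 35, 41],
    'department': ['HR', 'IT', 'HR', 'IT'],
})

# AND: both conditions must hold.
hr_over_25 = df[(df['age'] > 25) & (df['department'] == 'HR')]

# OR: either condition may hold.
it_or_young = df[(df['department'] == 'IT') | (df['age'] < 25)]

# Membership test instead of several chained == comparisons.
hr_or_finance = df[df['department'].isin(['HR', 'Finance'])]
```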

---

### 6 Use `.query()` and `.eval()` for readable filters

```python
df.query('age > 25 and department == "HR"')
```

➡️ Especially useful in notebooks and dynamic queries.
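Inside `.query()`, a local Python variable can be referenced with the `@` prefix, and `.eval()` evaluates a column expression given as a string. The sketch below assumes hypothetical `salary` and `bonus` columns and a made-up threshold.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    'age': [22, 30, 41],
    'department': ['HR', 'HR', 'IT'],
    'salary': [40000, 52000, 65000],
    'bonus': [1000, 2000, 3000],
})

# @min_age refers to the Python variable, not a column.
min_age = 25
hr_over_min = df.query('age > @min_age and department == "HR"')

# eval() computes an expression over columns and returns a Series.
total_comp = df.eval('salary + bonus')
```

Passing the threshold as a variable avoids building query strings by concatenation.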

---

### 7 Don’t Rely on Row Positions in Production

Avoid:

```python
df.iloc[0]
```

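➡️ Row order can change after sorting, shuffling, or merging, so position `0` may no longer be the row you expect. Prefer looking rows up by a stable label or key column.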
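One way to avoid positional lookups is to index by a stable key and use `.loc[]`. The sketch below assumes a hypothetical `employee_id` column that uniquely identifies each row.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    'employee_id': [103, 101, 102],
    'name': ['Cy', 'Ann', 'Ben'],
})

# Use the (assumed unique) key as the index.
by_id = df.set_index('employee_id')

# Label lookup gives the same answer before and after sorting.
before = by_id.loc[101, 'name']
after = by_id.sort_index().loc[101, 'name']

# Positional lookup depends on the current row order.
first_before = by_id.iloc[0]['name']
first_after = by_id.sort_index().iloc[0]['name']
```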

---

### 8 Always Inspect `.dtypes` Before Slicing

```python
print(df.dtypes)
```

➡️ Confirms that each column has the type you expect, e.g. that a column you plan to treat as numeric was not loaded as `object`.
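If a column that should be numeric turns out to hold strings, `pd.to_numeric` can convert it. This is a sketch on hypothetical data; `errors='coerce'` turns unparseable values into `NaN`, which may or may not be what you want.

```python
import pandas as pd

# Hypothetical data: 'price' was read in as text.
df = pd.DataFrame({'price': ['10.5', '20', 'n/a'], 'qty': [1, 2, 3]})

# Inspect the types first.
print(df.dtypes)

# Convert to a numeric dtype; 'n/a' cannot be parsed and becomes NaN.
df['price'] = pd.to_numeric(df['price'], errors='coerce')

print(df.dtypes)
```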

---

### 9 Avoid Hardcoding Column Names

Instead of repeating column names:

```python
avg = df['sales'].mean()
df['sales_normalized'] = df['sales'] / avg
```

You can assign:

```python
col = 'sales'
avg = df[col].mean()
df[f'{col}_normalized'] = df[col] / avg
```

➡️ Makes code reusable and DRY (Don’t Repeat Yourself).
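Taking this one step further, the column name can become a function parameter. `add_normalized` below is a hypothetical helper written for this example, not a pandas function, and the data is made up.

```python
import pandas as pd

def add_normalized(df, col):
    """Hypothetical helper: return a copy of df with '<col>_normalized' added."""
    out = df.copy()
    out[f'{col}_normalized'] = out[col] / out[col].mean()
    return out

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    'sales': [100.0, 200.0, 300.0],
    'returns': [1.0, 2.0, 3.0],
})

# The same logic applies to any column without editing the function.
for col in ['sales', 'returns']:
    df = add_normalized(df, col)

print(df)
```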

---

### 10 Prefer Vectorized Operations Over Row-wise Loops

❌ Inefficient:

```python
df['tax'] = df.apply(lambda row: row['income'] * 0.1, axis=1)
```

✅ Efficient:

```python
df['tax'] = df['income'] * 0.1
```

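➡️ `apply(..., axis=1)` calls a Python function once per row, while the vectorized form operates on the whole column in optimized compiled code. The gap widens as the dataset grows.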
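Row-wise `apply` is often used for conditional logic, but that can usually be vectorized too, for example with `numpy.where`. The sketch below uses made-up income values and hypothetical tax rates and threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({'income': [20000, 60000, 120000]})

# Vectorized if/else: 20% above the (assumed) threshold, 10% otherwise.
df['tax_rate'] = np.where(df['income'] > 50000, 0.2, 0.1)

# Column arithmetic, no per-row Python function involved.
df['tax'] = df['income'] * df['tax_rate']

print(df)
```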

---

## ✅ Real-World Best Practice Scenarios

| Scenario                          | Best Practice                                     |
| --------------------------------- | ------------------------------------------------- |
| Data cleaning or filtering        | Use `.loc[]` with boolean masks                   |
| Subset of data for transformation | Use `.copy()` to avoid SettingWithCopyWarning     |
| Working in ML pipelines           | Use `select_dtypes()` to keep numeric columns     |
| Building dashboards or filters    | Use `.query()` for dynamic user filters           |
| Collaborating in teams            | Avoid ambiguous indexing for readability & safety |

---

## 🧠 Mini Task Challenge

Here's a dataset:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Department': ['HR', 'IT', 'HR', 'Finance'],
    'Salary': [50000, 60000, 70000, 80000]
})
```

### Task:

* Select only numeric columns using best practice
* Filter all employees in `HR` with `Salary > 60000` using `.query()`
* Create a slice for employees under 35 **safely**
* Normalize salary column dynamically
* Use `.copy()` correctly when slicing
