# Indexing and Selection

## Learning Objectives

By the end of this notebook, you will be able to:

1. Select columns using bracket notation and dot notation
2. Use `.loc[]` for label-based selection
3. Use `.iloc[]` for position-based selection
4. Apply boolean indexing to filter data
5. Use the `query()` method for complex filtering
6. Combine multiple selection methods

---

## Setup

In [None]:
import pandas as pd
import numpy as np

# Create a sample DataFrame for demonstrations
np.random.seed(42)

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'department': ['Engineering', 'Marketing', 'Engineering', 'HR', 'Marketing', 'Engineering', 'HR', 'Marketing'],
    'age': [28, 35, 42, 31, 29, 45, 38, 33],
    'salary': [75000, 65000, 90000, 60000, 70000, 95000, 62000, 68000],
    'years_exp': [3, 8, 15, 5, 4, 18, 10, 7],
    'performance': ['A', 'B', 'A', 'B', 'A', 'A', 'C', 'B']
})

# Set a custom index
df.index = ['E001', 'E002', 'E003', 'E004', 'E005', 'E006', 'E007', 'E008']
df.index.name = 'emp_id'

print("Sample DataFrame:")
print(df)

---

## 1. Column Selection

### 1.1 Single Column Selection

In [None]:
# Bracket notation - always works
names = df['name']
print("Using bracket notation:")
print(names)
print(f"\nType: {type(names)}")

In [None]:
# Dot notation - works for valid Python identifiers
salaries = df.salary
print("Using dot notation:")
print(salaries)

In [None]:
# Note: Dot notation doesn't work for columns with spaces or special characters
# df['column name']  # works
# df.column name     # doesn't work

# Dot notation also doesn't work if column name conflicts with a method
# df.count  # returns the count method, not a column named 'count'

print("Best practice: Use bracket notation for reliability")
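
The two limitations above can be seen on a small throwaway frame. This is a minimal sketch; the `demo` DataFrame and its column names are hypothetical and are not part of the sample data.

```python
import pandas as pd

# Hypothetical frame: one column name clashes with a DataFrame method,
# the other contains a space.
demo = pd.DataFrame({'count': [1, 2, 3], 'unit price': [9.5, 4.0, 7.25]})

# Attribute access resolves to the DataFrame.count method, not the column.
attr = demo.count
is_method = callable(attr)

# Bracket notation returns the actual columns as Series.
count_col = demo['count']
price_col = demo['unit price']

print(f"demo.count is a method: {is_method}")
print(f"demo['count'] is a {type(count_col).__name__}")
```

Bracket notation behaves the same way regardless of what the column is called, which is why it is the safer default.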

### 1.2 Multiple Column Selection

In [None]:
# Select multiple columns with a list
subset = df[['name', 'department', 'salary']]
print("Multiple columns:")
print(subset)
print(f"\nType: {type(subset)}")

In [None]:
# Select columns by pattern (using filter)
# like='a' keeps every column whose name contains 'a'
# (in this DataFrame all six column names do, so every column is returned)
cols_with_a = df.filter(like='a')
print("Columns containing 'a':")
print(cols_with_a)
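
Besides `like=`, `filter()` also accepts `regex=` and `items=`. The sketch below uses a small hypothetical `demo` frame so that each pattern keeps only some of the columns.

```python
import pandas as pd

# Hypothetical frame with a subset of the sample columns.
demo = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [28, 35],
    'salary': [75000, 65000],
    'years_exp': [3, 8],
})

# like= matches a substring anywhere in the column label.
has_y = demo.filter(like='y')

# regex= is applied with re.search, so anchors narrow the match.
starts_with_s = demo.filter(regex='^s')
ends_with_e = demo.filter(regex='e$')

# items= keeps only the listed labels that actually exist.
picked = demo.filter(items=['name', 'bonus'])

print(list(has_y.columns))
print(list(starts_with_s.columns))
print(list(ends_with_e.columns))
print(list(picked.columns))
```

Only one of `items`, `like`, and `regex` can be passed in a single call.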

---

## 2. Label-Based Selection with `.loc[]`

`.loc[]` selects data by **labels** (index and column names). The syntax is:
```python
df.loc[row_labels, column_labels]
```

### 2.1 Selecting Rows by Label

In [None]:
# Select single row by index label
employee = df.loc['E003']
print("Single row (as Series):")
print(employee)
print(f"\nType: {type(employee)}")

In [None]:
# Select multiple rows by label list
employees = df.loc[['E001', 'E003', 'E005']]
print("Multiple rows:")
print(employees)

In [None]:
# Select range of rows (inclusive on both ends!)
range_df = df.loc['E002':'E005']
print("Range of rows (E002 to E005, inclusive):")
print(range_df)

### 2.2 Selecting Rows and Columns

In [None]:
# Select specific rows and columns
result = df.loc['E001', 'name']
print(f"Single cell value: {result}")

In [None]:
# Select multiple rows and columns
result = df.loc[['E001', 'E002', 'E003'], ['name', 'salary']]
print("Specific rows and columns:")
print(result)

In [None]:
# Select all rows, specific columns
result = df.loc[:, ['name', 'department']]
print("All rows, specific columns:")
print(result)

In [None]:
# Range of rows and columns
result = df.loc['E002':'E004', 'name':'salary']
print("Range of rows and columns:")
print(result)
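
One behavior worth knowing when passing a list of labels to `.loc[]`: a label that is not in the index raises `KeyError`. The sketch below uses a hypothetical three-row `demo` frame and shows one possible way to keep only the labels that exist, using `Index.intersection`.

```python
import pandas as pd

# Hypothetical frame indexed by employee ID.
demo = pd.DataFrame(
    {'name': ['Alice', 'Bob', 'Charlie']},
    index=['E001', 'E002', 'E003'],
)

wanted = ['E001', 'E999']  # 'E999' does not exist

# A list containing a missing label raises KeyError.
try:
    demo.loc[wanted]
    missing_raised = False
except KeyError:
    missing_raised = True

# Keep only the labels that are actually present before selecting.
present = demo.index.intersection(wanted)
safe = demo.loc[present]

print(f"KeyError raised: {missing_raised}")
print(safe)
```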

---

## 3. Position-Based Selection with `.iloc[]`

`.iloc[]` selects data by **integer positions** (0-based). The syntax is:
```python
df.iloc[row_positions, column_positions]
```

### 3.1 Selecting Rows by Position

In [None]:
# Select single row by position
first_row = df.iloc[0]
print("First row (position 0):")
print(first_row)

In [None]:
# Select multiple rows by position list
rows = df.iloc[[0, 2, 4]]
print("Rows at positions 0, 2, 4:")
print(rows)

In [None]:
# Select range of rows (exclusive on end!)
range_df = df.iloc[1:5]
print("Rows at positions 1-4 (slice 1:5, end exclusive):")
print(range_df)

In [None]:
# Negative indexing
last_row = df.iloc[-1]
print("Last row:")
print(last_row)

print("\nLast 3 rows:")
print(df.iloc[-3:])

### 3.2 Selecting Rows and Columns by Position

In [None]:
# Select specific cell
value = df.iloc[0, 0]
print(f"Cell at (0, 0): {value}")

In [None]:
# Select specific rows and columns
result = df.iloc[[0, 1, 2], [0, 3]]
print("Rows 0-2, columns 0 and 3:")
print(result)

In [None]:
# Range of rows and columns
result = df.iloc[1:4, 0:3]
print("Rows 1-3, columns 0-2:")
print(result)

In [None]:
# Using step in slicing
every_other = df.iloc[::2, :3]
print("Every other row, first 3 columns:")
print(every_other)
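
Position slices follow Python list semantics when they run past the end of the data, while a single out-of-range position is an error. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical frame with three rows (positions 0, 1, 2).
demo = pd.DataFrame({'x': [10, 20, 30]})

# A slice that runs past the end is clipped to the available rows.
clipped = demo.iloc[1:10]

# A single out-of-range position raises IndexError.
try:
    demo.iloc[10]
    out_of_range_raised = False
except IndexError:
    out_of_range_raised = True

print(clipped)
print(f"IndexError raised: {out_of_range_raised}")
```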

### 3.3 Key Difference: loc vs iloc

| Feature | `.loc[]` | `.iloc[]` |
|---------|----------|----------|
| Selection by | Labels | Integer positions |
| End of range | Inclusive | Exclusive |
| Accepts | Labels, lists, slices, boolean masks | Integers, lists, slices, boolean arrays |

In [None]:
# Demonstration of inclusive vs exclusive
print("loc 'E002':'E004' (inclusive):")
print(df.loc['E002':'E004', 'name'])

print("\niloc 1:4 (exclusive):")
print(df.iloc[1:4, 0])
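
The distinction matters most when the index labels are themselves integers, because a number can then mean either a label or a position. The following sketch uses a hypothetical frame whose labels are 10, 20, 30.

```python
import pandas as pd

# Hypothetical frame whose index labels are integers, not 0..n-1.
demo = pd.DataFrame({'value': ['a', 'b', 'c']}, index=[10, 20, 30])

by_label = demo.loc[10, 'value']   # label 10 -> first row
by_pos = demo.iloc[0, 0]           # position 0 -> first row

# Slices: loc includes the end label, iloc excludes the end position.
label_slice = demo.loc[10:20]      # labels 10 and 20
pos_slice = demo.iloc[0:1]         # position 0 only

# 0 is a valid position but not a label in this index.
try:
    demo.loc[0]
    zero_label_missing = False
except KeyError:
    zero_label_missing = True

print(by_label, by_pos)
print(len(label_slice), len(pos_slice))
print(f"loc[0] raised KeyError: {zero_label_missing}")
```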

---

## 4. Boolean Indexing

Boolean indexing allows you to filter rows based on conditions.

### 4.1 Basic Boolean Conditions

In [None]:
# Create a boolean mask
high_salary = df['salary'] > 70000
print("Boolean mask (salary > 70000):")
print(high_salary)

In [None]:
# Apply the mask to filter rows
high_earners = df[high_salary]
print("Employees with salary > 70000:")
print(high_earners)

In [None]:
# Inline (most common pattern)
engineers = df[df['department'] == 'Engineering']
print("Engineering employees:")
print(engineers)

### 4.2 Multiple Conditions

In [None]:
# AND condition (use & and parentheses)
result = df[(df['department'] == 'Engineering') & (df['salary'] > 80000)]
print("Engineering AND salary > 80000:")
print(result)

In [None]:
# OR condition (use |)
result = df[(df['department'] == 'HR') | (df['performance'] == 'A')]
print("HR OR performance A:")
print(result)

In [None]:
# NOT condition (use ~)
result = df[~(df['department'] == 'Marketing')]
print("NOT Marketing:")
print(result)

In [None]:
# Complex condition
result = df[
    ((df['department'] == 'Engineering') | (df['department'] == 'Marketing')) &
    (df['age'] < 40) &
    (df['performance'] == 'A')
]
print("Complex filter:")
print(result)
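
The examples above use `&`, `|`, and `~` rather than Python's `and`, `or`, and `not` because the keyword operators ask for a single truth value of a whole Series, which pandas refuses to give. The parentheses are needed because `&` and `|` bind more tightly than comparison operators such as `<` and `==`. A minimal sketch on a hypothetical `demo` frame:

```python
import pandas as pd

# Hypothetical frame with two numeric columns.
demo = pd.DataFrame({'age': [28, 45, 33], 'salary': [75000, 95000, 68000]})

# Python's `and` needs one True/False for the whole Series -> ValueError.
try:
    demo[(demo['age'] < 40) and (demo['salary'] > 70000)]
    and_failed = False
except ValueError:
    and_failed = True

# Element-wise & with each comparison wrapped in parentheses works.
young_high = demo[(demo['age'] < 40) & (demo['salary'] > 70000)]

print(f"`and` raised ValueError: {and_failed}")
print(young_high)
```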

### 4.3 Using `.isin()` for Multiple Values

In [None]:
# Filter by multiple values
depts = ['Engineering', 'Marketing']
result = df[df['department'].isin(depts)]
print("Engineering or Marketing:")
print(result)

In [None]:
# NOT in list
result = df[~df['department'].isin(['HR'])]
print("Not in HR:")
print(result)

### 4.4 String Methods for Filtering

In [None]:
# Filter using string methods
result = df[df['name'].str.startswith('A')]
print("Names starting with 'A':")
print(result)

In [None]:
# Contains pattern
result = df[df['name'].str.contains('a', case=False)]
print("Names containing 'a' (case insensitive):")
print(result)
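
String methods return a missing value for missing inputs, and depending on the pandas version and column dtype the resulting mask may not be usable for filtering. Passing `na=False` treats missing names as non-matches. The sketch below assumes a hypothetical `demo` frame with one missing name.

```python
import pandas as pd

# Hypothetical frame where one name is missing.
demo = pd.DataFrame({'name': ['Alice', None, 'Bob']})

# na=False makes the missing entry count as "no match",
# so the mask contains only True/False values.
mask = demo['name'].str.contains('a', case=False, na=False)
matches = demo[mask]

print(mask.tolist())
print(matches)
```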

### 4.5 Boolean Indexing with `.loc[]`

In [None]:
# Combine boolean indexing with column selection
result = df.loc[df['salary'] > 70000, ['name', 'salary']]
print("High earners (name and salary only):")
print(result)

In [None]:
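# Modify values using boolean indexing with loc
df_copy = df.copy()
# Cast to float first: the raise produces non-integer values, and recent
# pandas versions warn about (or reject) writing floats into an int64 column
df_copy['salary'] = df_copy['salary'].astype(float)
df_copy.loc[df_copy['performance'] == 'A', 'salary'] *= 1.1
print("After 10% raise for A performers:")
print(df_copy[['name', 'performance', 'salary']])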

---

## 5. The `query()` Method

The `query()` method provides a more readable way to filter data using string expressions.

In [None]:
# Basic query
result = df.query('salary > 70000')
print("query: salary > 70000")
print(result)

In [None]:
# Multiple conditions (use 'and', 'or', 'not')
result = df.query('department == "Engineering" and salary > 80000')
print("query: Engineering and salary > 80000")
print(result)

In [None]:
# Using variables with @
min_salary = 65000
max_age = 35
result = df.query('salary >= @min_salary and age <= @max_age')
print(f"query: salary >= {min_salary} and age <= {max_age}")
print(result)

In [None]:
# Using 'in' operator
result = df.query('department in ["Engineering", "HR"]')
print("query: department in Engineering or HR")
print(result)

In [None]:
# Comparing columns
result = df.query('years_exp > age / 10')
print("query: years_exp > age/10")
print(result)

In [None]:
# Query with index
result = df.query('index in ["E001", "E003", "E005"]')
print("query: specific employee IDs")
print(result)
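
Column names that are not valid Python identifiers (for example, names containing spaces) can still be used in `query()` by wrapping them in backticks. A minimal sketch, using a hypothetical `demo` frame and a hypothetical `threshold` variable:

```python
import pandas as pd

# Hypothetical frame whose column names contain spaces.
demo = pd.DataFrame({
    'unit price': [9.5, 4.0, 7.25],
    'units sold': [120, 300, 80],
})

threshold = 5  # external variable, referenced with @

# Backticks quote column names that are not valid identifiers.
result = demo.query('`unit price` > @threshold and `units sold` >= 100')

print(result)
```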

---

## 6. Advanced Selection Techniques

### 6.1 Selecting by Data Type

In [None]:
# Select only numeric columns
numeric_df = df.select_dtypes(include=['int64', 'float64'])
print("Numeric columns only:")
print(numeric_df)

In [None]:
# Select non-numeric columns
non_numeric_df = df.select_dtypes(exclude=['int64', 'float64'])
print("Non-numeric columns:")
print(non_numeric_df)
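
Listing exact dtypes such as `'int64'` matches only that specific width. If a column might be stored as `int32` or `float32`, the generic `'number'` selector is a more forgiving alternative. The sketch below uses a hypothetical `demo` frame with an `int32` column to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'age' is deliberately stored as int32.
demo = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': np.array([28, 35], dtype='int32'),
    'salary': [75000.0, 65000.0],
})

# Exact dtype names miss the int32 column.
by_width = demo.select_dtypes(include=['int64', 'float64'])

# 'number' matches every numeric dtype regardless of width.
numeric = demo.select_dtypes(include='number')

print(list(by_width.columns))
print(list(numeric.columns))
```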

### 6.2 Using `at[]` and `iat[]` for Scalar Access

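For accessing single values, `at[]` and `iat[]` are typically faster than `loc[]` and `iloc[]` because they accept only a scalar row key and a scalar column key and always return a single value.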

In [None]:
# at[] for label-based scalar access
value = df.at['E001', 'name']
print(f"Value at E001, name: {value}")

# iat[] for position-based scalar access
value = df.iat[0, 0]
print(f"Value at (0, 0): {value}")
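
Both accessors can also be used to set a single value. A minimal sketch on a hypothetical two-row `demo` frame:

```python
import pandas as pd

# Hypothetical frame indexed by employee ID.
demo = pd.DataFrame(
    {'name': ['Alice', 'Bob'], 'salary': [75000, 65000]},
    index=['E001', 'E002'],
)

# Set one cell by label.
demo.at['E002', 'salary'] = 66000

# Set one cell by position (row 0, column 1 is Alice's salary).
demo.iat[0, 1] = 76000

print(demo)
```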

### 6.3 Getting Unique Values and Counts

In [None]:
# Unique values
print(f"Unique departments: {df['department'].unique()}")
print(f"Number of unique departments: {df['department'].nunique()}")

In [None]:
# Value counts
print("Department value counts:")
print(df['department'].value_counts())

---

## Exercises

In [None]:
# Create a fresh DataFrame for exercises
sales_data = pd.DataFrame({
    'product': ['Laptop', 'Phone', 'Tablet', 'Watch', 'Headphones', 'Camera', 'Speaker', 'Charger'],
    'category': ['Electronics', 'Electronics', 'Electronics', 'Wearables', 'Audio', 'Electronics', 'Audio', 'Accessories'],
    'price': [999.99, 699.99, 449.99, 299.99, 149.99, 599.99, 199.99, 29.99],
    'quantity': [50, 150, 80, 200, 300, 40, 120, 500],
    'rating': [4.5, 4.7, 4.3, 4.1, 4.6, 4.4, 4.2, 3.9]
}, index=['P001', 'P002', 'P003', 'P004', 'P005', 'P006', 'P007', 'P008'])
sales_data.index.name = 'product_id'
print("Sales Data:")
print(sales_data)

### Exercise 1: Basic Selection

1. Select only the 'product' and 'price' columns
2. Select the row for product P003 using `.loc[]`
3. Select the first 3 rows using `.iloc[]`

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
# 1. Select columns
print("Product and price:")
print(sales_data[['product', 'price']])

# 2. Select row by label
print("\nProduct P003:")
print(sales_data.loc['P003'])

# 3. Select first 3 rows
print("\nFirst 3 rows:")
print(sales_data.iloc[:3])
```
</details>

### Exercise 2: loc and iloc Practice

1. Use `.loc[]` to select products P002 through P005, but only the 'product', 'price', and 'rating' columns
2. Use `.iloc[]` to select the last 4 rows and the first 3 columns
3. Get the price of product P006 using `.at[]`

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
# 1. loc with row and column ranges
print("Products P002-P005, selected columns:")
print(sales_data.loc['P002':'P005', ['product', 'price', 'rating']])

# 2. iloc with negative indexing
print("\nLast 4 rows, first 3 columns:")
print(sales_data.iloc[-4:, :3])

# 3. at for scalar access
print(f"\nPrice of P006: ${sales_data.at['P006', 'price']}")
```
</details>

### Exercise 3: Boolean Indexing

1. Find all products with price greater than 200
2. Find all Electronics products with quantity less than 100
3. Find products that are either in the 'Audio' category OR have a rating above 4.5

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
# 1. Price > 200
print("Products with price > 200:")
print(sales_data[sales_data['price'] > 200])

# 2. Electronics with quantity < 100
print("\nElectronics with quantity < 100:")
print(sales_data[(sales_data['category'] == 'Electronics') & (sales_data['quantity'] < 100)])

# 3. Audio OR rating > 4.5
print("\nAudio OR rating > 4.5:")
print(sales_data[(sales_data['category'] == 'Audio') | (sales_data['rating'] > 4.5)])
```
</details>

### Exercise 4: Using query()

Use the `query()` method to:
1. Find products with price between 100 and 500
2. Find products in 'Electronics' or 'Audio' categories with rating >= 4.4
3. Find products where quantity is greater than 10 times the price (quantity > price * 10)

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
# 1. Price between 100 and 500
print("Price between 100 and 500:")
print(sales_data.query('100 <= price <= 500'))

# 2. Electronics/Audio with rating >= 4.4
print("\nElectronics/Audio with rating >= 4.4:")
print(sales_data.query('category in ["Electronics", "Audio"] and rating >= 4.4'))

# 3. Quantity > price * 10
print("\nQuantity > price * 10:")
print(sales_data.query('quantity > price * 10'))
```
</details>

### Exercise 5: Combined Selection

Find all products that:
- Have a price less than 300
- Are NOT in the 'Accessories' category
- Have a rating of at least 4.2

Return only the 'product', 'category', and 'price' columns, sorted by price (highest first).

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
result = sales_data[
    (sales_data['price'] < 300) &
    (sales_data['category'] != 'Accessories') &
    (sales_data['rating'] >= 4.2)
][['product', 'category', 'price']].sort_values('price', ascending=False)

print("Filtered and sorted products:")
print(result)

# Alternative using query:
# result = sales_data.query(
#     'price < 300 and category != "Accessories" and rating >= 4.2'
# )[['product', 'category', 'price']].sort_values('price', ascending=False)
```
</details>

---

## Summary

In this notebook, you learned:

1. **Column Selection**:
   - Bracket notation: `df['column']` or `df[['col1', 'col2']]`
   - Dot notation: `df.column` (with limitations)

2. **`.loc[]` - Label-based Selection**:
   - Uses row and column labels
   - Range is inclusive on both ends
   - Syntax: `df.loc[rows, columns]`

3. **`.iloc[]` - Position-based Selection**:
   - Uses integer positions (0-based)
   - Range is exclusive on end
   - Syntax: `df.iloc[rows, columns]`

4. **Boolean Indexing**:
   - Create masks: `df['col'] > value`
   - Combine with `&` (and), `|` (or), `~` (not)
   - Use parentheses for complex conditions
   - `.isin()` for multiple values

5. **`query()` Method**:
   - String-based filtering
   - More readable for complex conditions
   - Use `@` for external variables

6. **Other Selection Tools**:
   - `select_dtypes()` to pick columns by data type
   - `at[]` and `iat[]` for single-value access
   - `unique()`, `nunique()`, and `value_counts()` to inspect a column's values

---

## Next Steps

Continue to the next notebook: **[04_data_cleaning.ipynb](04_data_cleaning.ipynb)** to learn how to handle missing data, duplicates, and data type conversions.