# DataFrame Indexing and Selection

## Overview
Pandas provides multiple methods for selecting data from DataFrames:

### Primary Indexing Methods:
1. **Column Selection**: `df['column']` or `df.column`
2. **Row Selection**:
   - Label-based: `.loc[]` (uses index labels)
   - Position-based: `.iloc[]` (uses integer positions)
3. **Boolean Indexing**: `df[condition]`

### Key Concepts:
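- **Index Alignment**: Operations between pandas objects match rows by index label, not by position
- **View vs Copy**: A selection may return a view of the original data or an independent copy, which determines whether a write reaches the original
- **Chained Indexing**: Why to avoid assigning through patterns like `df[...][...]`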

In [1]:
import pandas as pd
import numpy as np

# Create sample DataFrame for demonstrations
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'Age': [25, 30, 35, 28, 32, 40],
    'Department': ['HR', 'Engineering', 'Marketing', 'Engineering', 'HR', 'Sales'],
    'Salary': [50000, 80000, 60000, 75000, 52000, 90000],
    'Experience': [2, 5, 3, 6, 2, 10],
    'City': ['New York', 'London', 'Tokyo', 'Paris', 'New York', 'London']
}

df = pd.DataFrame(data)
df.index = ['E001', 'E002', 'E003', 'E004', 'E005', 'E006']  # Custom index
print("Sample DataFrame:")
print(df)
print(f"\nShape: {df.shape}")
print(f"Index: {df.index.tolist()}")
print(f"Columns: {df.columns.tolist()}")

Sample DataFrame:
         Name  Age   Department  Salary  Experience      City
E001    Alice   25           HR   50000           2  New York
E002      Bob   30  Engineering   80000           5    London
E003  Charlie   35    Marketing   60000           3     Tokyo
E004    Diana   28  Engineering   75000           6     Paris
E005      Eve   32           HR   52000           2  New York
E006    Frank   40        Sales   90000          10    London

Shape: (6, 6)
Index: ['E001', 'E002', 'E003', 'E004', 'E005', 'E006']
Columns: ['Name', 'Age', 'Department', 'Salary', 'Experience', 'City']


## 1. Selecting Columns

### Methods:
1. **Bracket notation**: `df['column_name']` (returns Series)
2. **Dot notation**: `df.column_name` (returns Series; works only when the name is a valid Python identifier and does not clash with an existing DataFrame attribute or method)
3. **Multiple columns**: `df[['col1', 'col2']]` (returns DataFrame)
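
Dot notation can quietly return something other than a column. The sketch below uses a small hypothetical frame (`demo`, with made-up column names `unit price` and `count`) to illustrate two cases where only bracket notation is reliable.

```python
import pandas as pd

# Hypothetical frame: one column name contains a space,
# the other collides with the DataFrame.count method
demo = pd.DataFrame({"unit price": [1.5, 2.0], "count": [3, 4]})

price = demo["unit price"]   # bracket notation handles any column name
count_col = demo["count"]    # the column, as a Series
count_attr = demo.count      # the bound DataFrame.count method, NOT the column

print(type(count_col).__name__)  # Series
print(callable(count_attr))      # True
```

Because `demo.count` resolves to a method rather than the column, bracket notation is probably the safer default in reusable code.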

In [2]:
print("Original DataFrame:")
print(df)
print()

# Single column - returns Series
print("1. Single column (Series):")
name_series = df['Name']
print(f"Type: {type(name_series)}")
print(name_series)
print()

# Dot notation (only for valid Python identifiers)
print("2. Dot notation:")
age_series = df.Age
print(f"Type: {type(age_series)}")
print(age_series)
print()

# Multiple columns - returns DataFrame
print("3. Multiple columns (DataFrame):")
subset_df = df[['Name', 'Department', 'Salary']]
print(f"Type: {type(subset_df)}")
print(f"Shape: {subset_df.shape}")
print(subset_df)
print()

# Selecting non-existent column (error)
try:
    df['NonExistent']
except KeyError as e:
    print(f"4. Error when selecting non-existent column: {e}")

Original DataFrame:
         Name  Age   Department  Salary  Experience      City
E001    Alice   25           HR   50000           2  New York
E002      Bob   30  Engineering   80000           5    London
E003  Charlie   35    Marketing   60000           3     Tokyo
E004    Diana   28  Engineering   75000           6     Paris
E005      Eve   32           HR   52000           2  New York
E006    Frank   40        Sales   90000          10    London

1. Single column (Series):
Type: <class 'pandas.core.series.Series'>
E001      Alice
E002        Bob
E003    Charlie
E004      Diana
E005        Eve
E006      Frank
Name: Name, dtype: object

2. Dot notation:
Type: <class 'pandas.core.series.Series'>
E001    25
E002    30
E003    35
E004    28
E005    32
E006    40
Name: Age, dtype: int64

3. Multiple columns (DataFrame):
Type: <class 'pandas.core.frame.DataFrame'>
Shape: (6, 3)
         Name   Department  Salary
E001    Alice           HR   50000
E002      Bob  Engineering   80000
E003  C

## 2. Label-based Indexing with .loc[]

### Syntax: `df.loc[row_selection, column_selection]`

### Rules:
1. Uses index labels and column names
2. **Inclusive** of both start and end in slices
3. Accepts:
   - Single label
   - List of labels
   - Slice with labels
   - Boolean array
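
The rules above can be checked on a tiny hypothetical frame (`demo`, which is not part of the employee data). This is a minimal sketch that assumes pandas 1.0 or later, where a list containing a missing label raises `KeyError`.

```python
import pandas as pd

# Hypothetical frame with string labels
demo = pd.DataFrame({"x": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

# Label slices include both endpoints
inclusive = demo.loc["b":"c"]

# A boolean list must have one entry per row
by_mask = demo.loc[[True, False, False, True]]

# A list with a label that does not exist raises KeyError
try:
    demo.loc[["a", "z"]]
    missing_label_raised = False
except KeyError:
    missing_label_raised = True
```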

In [3]:
print("DataFrame with custom index:")
print(df)
print()

# Single row by label
print("1. Single row by index label (.loc['E002']):")
print(df.loc['E002'])
print(f"Type: {type(df.loc['E002'])}")  # Series
print()

# Single cell (row, column)
print("2. Single cell (.loc['E003', 'Salary']):")
print(f"Value: {df.loc['E003', 'Salary']}")
print()

# Multiple rows, single column
print("3. Multiple rows, single column (.loc[['E001', 'E003'], 'Name']):")
print(df.loc[['E001', 'E003'], 'Name'])
print()

# Multiple rows, multiple columns
print("4. Multiple rows & columns (.loc[['E002', 'E004'], ['Name', 'Department']]):")
print(df.loc[['E002', 'E004'], ['Name', 'Department']])
print()

# Slice of rows (inclusive!)
print("5. Slice of rows (.loc['E002':'E004']):")
print(df.loc['E002':'E004'])  # Includes E002, E003, and E004
print()

# Slice of rows and columns
print("6. Slice of rows and columns (.loc['E002':'E004', 'Name':'Salary']):")
print(df.loc['E002':'E004', 'Name':'Salary'])  # All columns from Name to Salary

DataFrame with custom index:
         Name  Age   Department  Salary  Experience      City
E001    Alice   25           HR   50000           2  New York
E002      Bob   30  Engineering   80000           5    London
E003  Charlie   35    Marketing   60000           3     Tokyo
E004    Diana   28  Engineering   75000           6     Paris
E005      Eve   32           HR   52000           2  New York
E006    Frank   40        Sales   90000          10    London

1. Single row by index label (.loc['E002']):
Name                  Bob
Age                    30
Department    Engineering
Salary              80000
Experience              5
City               London
Name: E002, dtype: object
Type: <class 'pandas.core.series.Series'>

2. Single cell (.loc['E003', 'Salary']):
Value: 60000

3. Multiple rows, single column (.loc[['E001', 'E003'], 'Name']):
E001      Alice
E003    Charlie
Name: Name, dtype: object

4. Multiple rows & columns (.loc[['E002', 'E004'], ['Name', 'Department']]):
       Na

## 3. Position-based Indexing with .iloc[]

### Syntax: `df.iloc[row_positions, column_positions]`

### Rules:
1. Uses integer positions (0-based indexing)
2. **Exclusive** of end in slices (like Python lists)
3. Accepts:
   - Single integer
   - List of integers
   - Slice with integers
   - Boolean array (a NumPy array or list; convert a boolean Series first, e.g. with `.to_numpy()`)
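
The difference between labels and positions is easiest to see when the index holds integers that are not in order. The following sketch uses a hypothetical frame (`demo`) built only for this illustration.

```python
import pandas as pd

# Hypothetical frame whose integer labels do not match row positions
demo = pd.DataFrame({"v": ["a", "b", "c"]}, index=[2, 0, 1])

by_label = demo.loc[0, "v"]     # the row LABELLED 0
by_position = demo.iloc[0, 0]   # the FIRST row, whatever its label

# .iloc expects positional booleans, so convert the Series to an array first
mask = (demo["v"] != "b").to_numpy()
kept = demo.iloc[mask]

print(by_label, by_position)    # b a
```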

In [4]:
print("DataFrame (remember: positions start at 0):")
print(df)
print()

# Single row by position
print("1. Single row by position (.iloc[1]):")
print(df.iloc[1])  # Second row (position 1)
print()

# Single cell by position
print("2. Single cell (.iloc[2, 3]):")
print(f"Value at row 2, column 3: {df.iloc[2, 3]}")  # Charlie's Salary
print(f"Column 3 name: {df.columns[3]}")
print()

# Multiple rows, multiple columns
print("3. Multiple rows & columns (.iloc[[0, 2, 4], [1, 3]]):")
print(df.iloc[[0, 2, 4], [1, 3]])  # Rows 0,2,4 and Columns 1,3
print()

# Slice of rows (exclusive of end!)
print("4. Slice of rows (.iloc[1:4]):")
print(df.iloc[1:4])  # Rows at positions 1, 2, 3 (NOT 4)
print()

# Slice of rows and columns
print("5. Slice of rows and columns (.iloc[1:4, 0:3]):")
print(df.iloc[1:4, 0:3])  # Rows 1-3, Columns 0-2
print()

# Using negative indices
print("6. Using negative indices (.iloc[-3:, -2:]):")
print(df.iloc[-3:, -2:])  # Last 3 rows, last 2 columns

DataFrame (remember: positions start at 0):
         Name  Age   Department  Salary  Experience      City
E001    Alice   25           HR   50000           2  New York
E002      Bob   30  Engineering   80000           5    London
E003  Charlie   35    Marketing   60000           3     Tokyo
E004    Diana   28  Engineering   75000           6     Paris
E005      Eve   32           HR   52000           2  New York
E006    Frank   40        Sales   90000          10    London

1. Single row by position (.iloc[1]):
Name                  Bob
Age                    30
Department    Engineering
Salary              80000
Experience              5
City               London
Name: E002, dtype: object

2. Single cell (.iloc[2, 3]):
Value at row 2, column 3: 60000
Column 3 name: Salary

3. Multiple rows & columns (.iloc[[0, 2, 4], [1, 3]]):
      Age  Salary
E001   25   50000
E003   35   60000
E005   32   52000

4. Slice of rows (.iloc[1:4]):
         Name  Age   Department  Salary  Experience    C

## 4. Boolean Indexing

### Concept:
- Create boolean masks (True/False arrays)
- Use masks to filter rows
- Can combine conditions with & (and), | (or), ~ (not)

### Important:
- Use parentheses around each condition
- Use bitwise operators (&, |, ~) not logical operators (and, or, not)
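
To illustrate why the bitwise operators are required, the sketch below compares `&` with `and` on a small hypothetical frame (`demo`, with made-up columns `age` and `dept`). Python's `and` asks for a single truth value of an entire Series, which pandas rejects with a `ValueError`.

```python
import pandas as pd

demo = pd.DataFrame({"age": [25, 35, 45], "dept": ["HR", "IT", "IT"]})

# Correct: element-wise AND, with each comparison in parentheses
both = demo[(demo["age"] > 30) & (demo["dept"] == "IT")]

# Incorrect: `and` needs one True/False for the whole Series
try:
    demo[(demo["age"] > 30) and (demo["dept"] == "IT")]
    and_raised = False
except ValueError:
    and_raised = True
```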

In [5]:
print("Original DataFrame:")
print(df)
print()

# Simple condition
print("1. Employees with Salary > 60000:")
mask1 = df['Salary'] > 60000
print(f"Boolean mask:\n{mask1}")
print(f"\nFiltered DataFrame:\n{df[mask1]}")
print()

# Multiple conditions (AND)
print("2. Employees in Engineering AND Age > 30:")
mask2 = (df['Department'] == 'Engineering') & (df['Age'] > 30)
print(df[mask2])
print()

# Multiple conditions (OR)
print("3. Employees in HR OR Sales:")
mask3 = (df['Department'] == 'HR') | (df['Department'] == 'Sales')
print(df[mask3])
print()

# NOT condition
print("4. Employees NOT in London:")
mask4 = ~(df['City'] == 'London')
print(df[mask4])
print()

# Complex condition
print("5. Employees with (Salary > 70000) OR (Experience >= 5 AND Department != 'HR'):")
mask5 = (df['Salary'] > 70000) | ((df['Experience'] >= 5) & (df['Department'] != 'HR'))
print(df[mask5])
print()

# Using .loc with boolean indexing
print("6. Using .loc with boolean mask (select specific columns):")
mask6 = df['Age'] > 30
print(df.loc[mask6, ['Name', 'Age', 'Salary']])

Original DataFrame:
         Name  Age   Department  Salary  Experience      City
E001    Alice   25           HR   50000           2  New York
E002      Bob   30  Engineering   80000           5    London
E003  Charlie   35    Marketing   60000           3     Tokyo
E004    Diana   28  Engineering   75000           6     Paris
E005      Eve   32           HR   52000           2  New York
E006    Frank   40        Sales   90000          10    London

1. Employees with Salary > 60000:
Boolean mask:
E001    False
E002     True
E003    False
E004     True
E005    False
E006     True
Name: Salary, dtype: bool

Filtered DataFrame:
       Name  Age   Department  Salary  Experience    City
E002    Bob   30  Engineering   80000           5  London
E004  Diana   28  Engineering   75000           6   Paris
E006  Frank   40        Sales   90000          10  London

2. Employees in Engineering AND Age > 30:
Empty DataFrame
Columns: [Name, Age, Department, Salary, Experience, City]
Index: []

3. Em

## 5. Selection Using Callable Functions

### Advanced technique:
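- Pass a function to `.loc[]` or `.iloc[]`; it is called with the DataFrame being indexed
- The function should return a valid indexer (labels, positions, a slice, or a boolean mask)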
- Useful for complex selection logic
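
One situation where a callable helps is a method chain, where the intermediate DataFrame has no variable name to refer to. This is a minimal sketch on a hypothetical frame (`demo`); the `bonus` column and the threshold of 6 are illustrative assumptions.

```python
import pandas as pd

demo = pd.DataFrame({"dept": ["HR", "IT", "IT"], "salary": [50, 80, 70]})

result = (
    demo
    .assign(bonus=lambda d: d["salary"] * 0.1)           # adds a column mid-chain
    .loc[lambda d: d["bonus"] > 6, ["dept", "bonus"]]    # filters on the new column
)
print(result)
```

Without the callable, the chain would have to be broken into two statements so that the new `bonus` column could be referenced by name.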

In [6]:
print("Original DataFrame:")
print(df)
print()

# Using lambda function with .loc
print("1. Using lambda to select high earners:")
high_earners = df.loc[lambda x: x['Salary'] > 70000]
print(high_earners)
print()

# Chain operations with callable
print("2. Chain operations with callable:")
result = df.loc[lambda d: d['Department'] == 'Engineering'].loc[lambda d: d['Age'] < 35]
print(result)
print()

# More complex callable
print("3. Complex selection with callable:")

def senior_employees(frame):
    """Return a boolean mask for employees with high experience or high salary."""
    return (frame['Experience'] > 5) | (frame['Salary'] > 75000)

seniors = df.loc[senior_employees]
print(seniors)
print()

# Using .iloc with callable
print("4. Using .iloc with callable (first 3 rows, first 2 columns):")
subset = df.iloc[lambda x: slice(0, 3), lambda x: slice(0, 2)]
print(subset)

Original DataFrame:
         Name  Age   Department  Salary  Experience      City
E001    Alice   25           HR   50000           2  New York
E002      Bob   30  Engineering   80000           5    London
E003  Charlie   35    Marketing   60000           3     Tokyo
E004    Diana   28  Engineering   75000           6     Paris
E005      Eve   32           HR   52000           2  New York
E006    Frank   40        Sales   90000          10    London

1. Using lambda to select high earners:
       Name  Age   Department  Salary  Experience    City
E002    Bob   30  Engineering   80000           5  London
E004  Diana   28  Engineering   75000           6   Paris
E006  Frank   40        Sales   90000          10  London

2. Chain operations with callable:
       Name  Age   Department  Salary  Experience    City
E002    Bob   30  Engineering   80000           5  London
E004  Diana   28  Engineering   75000           6   Paris

3. Complex selection with callable:
       Name  Age   Departm

## 6. Fast Scalar Access: .at[] and .iat[]

### Purpose:
- **.at[]**: Fast label-based scalar access (like .loc but for single values)
- **.iat[]**: Fast integer-based scalar access (like .iloc but for single values)

### Use when:
- You need to get/set a single value
- Performance matters (faster than .loc/.iloc for single cells)

In [7]:
print("DataFrame:")
print(df)
print()

# .at[] for label-based scalar access
print("1. Using .at[] (label-based):")
print(f"Employee E003 Name: {df.at['E003', 'Name']}")
print(f"Employee E002 Salary: {df.at['E002', 'Salary']}")
print()

# .iat[] for position-based scalar access
print("2. Using .iat[] (position-based):")
print(f"Row 0, Column 1 (position): {df.iat[0, 1]}")
print(f"Row 2, Column 3 (position): {df.iat[2, 3]}")
print()

# Setting values with .at[] and .iat[]
print("3. Setting values:")

# Create a copy to avoid modifying original
df_copy = df.copy()

# Using .at[] to set value
df_copy.at['E001', 'Salary'] = 55000
print("After df.at['E001', 'Salary'] = 55000:")
print(df_copy[['Name', 'Salary']].head(2))

# Using .iat[] to set value
df_copy.iat[2, 3] = 65000  # Row 2, Column 3 (Charlie's Salary)
print("\nAfter df.iat[2, 3] = 65000:")
print(df_copy.loc[['E003'], ['Name', 'Salary']])

# Performance comparison (conceptual)
print("\n4. Performance note:")
print(".at[] and .iat[] are optimized for single value access")
print("Use instead of .loc/.iloc when accessing single cells")

DataFrame:
         Name  Age   Department  Salary  Experience      City
E001    Alice   25           HR   50000           2  New York
E002      Bob   30  Engineering   80000           5    London
E003  Charlie   35    Marketing   60000           3     Tokyo
E004    Diana   28  Engineering   75000           6     Paris
E005      Eve   32           HR   52000           2  New York
E006    Frank   40        Sales   90000          10    London

1. Using .at[] (label-based):
Employee E003 Name: Charlie
Employee E002 Salary: 80000

2. Using .iat[] (position-based):
Row 0, Column 1 (position): 25
Row 2, Column 3 (position): 60000

3. Setting values:
After df.at['E001', 'Salary'] = 55000:
       Name  Salary
E001  Alice   55000
E002    Bob   80000

After df.iat[2, 3] = 65000:
         Name  Salary
E003  Charlie   65000

4. Performance note:
.at[] and .iat[] are optimized for single value access
Use instead of .loc/.iloc when accessing single cells


## 7. Advanced Slicing Techniques

### Types of Slices:
1. **Label slices**: Inclusive of both ends (use with .loc)
2. **Position slices**: Exclusive of end (use with .iloc)
3. **Step slices**: `start:end:step`
4. **Implicit slices**: Using just `:` for all rows/columns

In [9]:
print("DataFrame:")
print(df)
print()

# Basic slices
print("1. Basic row slice (all columns):")
print(df.loc['E002':'E004'])  # Inclusive
print()

print("2. Position slice (all columns):")
print(df.iloc[1:4])  # Exclusive
print()

# Slices with steps
print("3. Slice with step (every other row):")
print(df.iloc[::2])  # Every 2nd row
print()

print("4. Slice with step for columns:")
print(df.iloc[:, ::2])  # Every 2nd column
print()

# Mixed slicing
print("5. Row slice with column selection:")
print(df.loc['E002':'E004', ['Name', 'Salary', 'City']])
print()

print("6. Position slice with column slice:")
print(df.iloc[1:4, 0:3])  # Rows 1-3, Columns 0-2
print()

# Reverse slicing
print("7. Reverse order:")
print(df.iloc[::-1].head())  # All rows in reverse
print()

print("8. Reverse column order:")
print(df.iloc[:, ::-1].head())  # All columns in reverse
print()

# Partial slices
print("9. From start to label:")
print(df.loc[:'E003'])  # All rows up to and including E003
print()

print("10. From position to end:")
print(df.iloc[3:])  # From position 3 to end

DataFrame:
         Name  Age   Department  Salary  Experience      City
E001    Alice   25           HR   50000           2  New York
E002      Bob   30  Engineering   80000           5    London
E003  Charlie   35    Marketing   60000           3     Tokyo
E004    Diana   28  Engineering   75000           6     Paris
E005      Eve   32           HR   52000           2  New York
E006    Frank   40        Sales   90000          10    London

1. Basic row slice (all columns):
         Name  Age   Department  Salary  Experience    City
E002      Bob   30  Engineering   80000           5  London
E003  Charlie   35    Marketing   60000           3   Tokyo
E004    Diana   28  Engineering   75000           6   Paris

2. Position slice (all columns):
         Name  Age   Department  Salary  Experience    City
E002      Bob   30  Engineering   80000           5  London
E003  Charlie   35    Marketing   60000           3   Tokyo
E004    Diana   28  Engineering   75000           6   Paris

3. Sl

## 8. Setting Values Using Indexing Methods

### Methods to set values:
1. Using `.loc[]` or `.iloc[]` on left side of assignment
2. Using `.at[]` or `.iat[]` for single values
3. Using boolean masks

### Important: SettingWithCopyWarning
- Avoid chained indexing: `df[condition]['column'] = value`
- Use `.loc` instead: `df.loc[condition, 'column'] = value`
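
The effect of chained assignment can be observed directly. The sketch below, on a hypothetical frame (`demo`), silences the resulting warning only so the demonstration runs quietly; in real code the warning should be treated as a signal to rewrite the statement.

```python
import warnings
import pandas as pd

demo = pd.DataFrame({"dept": ["HR", "IT"], "salary": [50, 80]})
mask = demo["dept"] == "IT"

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # hide the chained-assignment warning for this demo
    demo[mask]["salary"] = 99         # writes into a temporary copy

after_chained = demo["salary"].tolist()   # the original frame is unchanged

demo.loc[mask, "salary"] = 99             # one indexing call on the original frame
after_loc = demo["salary"].tolist()

print(after_chained, after_loc)   # [50, 80] [50, 99]
```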

In [12]:
# Create a copy to work with
df_modified = df.copy()
print("Original DataFrame (copy):")
print(df_modified)
print()

# 1. Set single value with .loc
print("1. Set single value with .loc:")
df_modified.loc['E001', 'Salary'] = 55000
print(f"E001's new salary: {df_modified.loc['E001', 'Salary']}")
print()

# 2. Set multiple values with .loc
print("2. Set multiple values with .loc:")
df_modified.loc[['E002', 'E004'], 'Salary'] = [85000, 80000]
print(df_modified.loc[['E002', 'E004'], ['Name', 'Salary']])
print()

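# 3. Set with boolean mask
print("3. Give raise to HR department:")
hr_mask = df_modified['Department'] == 'HR'
# Cast to float first: writing float results into an int64 column triggers a dtype warning in recent pandas
df_modified['Salary'] = df_modified['Salary'].astype('float64')
df_modified.loc[hr_mask, 'Salary'] = df_modified.loc[hr_mask, 'Salary'] * 1.10  # 10% raise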
print(df_modified.loc[hr_mask, ['Name', 'Salary']])
print()

# 4. Set entire column
print("4. Create new column:")
df_modified['Bonus'] = df_modified['Salary'] * 0.10  # 10% bonus
print(df_modified[['Name', 'Salary', 'Bonus']].head())
print()

# 5. Set with condition based on other columns
print("5. Set Senior status based on Experience:")
df_modified['Senior'] = df_modified['Experience'] >= 5
print(df_modified[['Name', 'Experience', 'Senior']])
print()

# 6. WARNING: Chained indexing (what NOT to do)
print("6. Chained indexing (problematic):")
# The assignment below targets a temporary copy, so df_modified stays unchanged.
# pandas reports this with a warning, not an exception, so try/except cannot catch it.
df_modified[df_modified['Department'] == 'Engineering']['Salary'] = 99999
print("This might not work as expected!")
print("Use .loc instead: df.loc[mask, 'column'] = value")

Original DataFrame (copy):
         Name  Age   Department  Salary  Experience      City
E001    Alice   25           HR   50000           2  New York
E002      Bob   30  Engineering   80000           5    London
E003  Charlie   35    Marketing   60000           3     Tokyo
E004    Diana   28  Engineering   75000           6     Paris
E005      Eve   32           HR   52000           2  New York
E006    Frank   40        Sales   90000          10    London

1. Set single value with .loc:
E001's new salary: 55000

2. Set multiple values with .loc:
       Name  Salary
E002    Bob   85000
E004  Diana   80000

3. Give raise to HR department:
       Name   Salary
E001  Alice  60500.0
E005    Eve  57200.0

4. Create new column:
         Name   Salary   Bonus
E001    Alice  60500.0  6050.0
E002      Bob  85000.0  8500.0
E003  Charlie  60000.0  6000.0
E004    Diana  80000.0  8000.0
E005      Eve  57200.0  5720.0

5. Set Senior status based on Experience:
         Name  Experience  Senior
E001 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_modified[df_modified['Department'] == 'Engineering']['Salary'] = 99999


## 9. Understanding Index Alignment

### Key Concept:
- Operations between DataFrames/Series align data by index label, not by position
- The result's index is the union of both indexes
- Missing values (NaN) appear where a label exists in only one operand
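
Alignment also applies when a Series is assigned as a new column, which is a common source of unexpected NaN values. The sketch below uses hypothetical objects (`demo` and `other`) to contrast label-aligned assignment with positional assignment of a plain array.

```python
import pandas as pd

demo = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
other = pd.Series([10, 20, 30], index=["c", "b", "z"])

# Series assignment matches by label: "a" has no partner, "z" is dropped
demo["aligned"] = other

# A NumPy array carries no labels, so values are assigned in order
demo["positional"] = other.to_numpy()

print(demo)
```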

In [14]:
# Create two DataFrames with different indices
df1 = pd.DataFrame({
    'Value': [10, 20, 30, 40],
    'Category': ['A', 'B', 'C', 'D']
}, index=['ID1', 'ID2', 'ID3', 'ID4'])

df2 = pd.DataFrame({
    'Multiplier': [1.5, 2.0, 2.5],
    'Active': [True, False, True]
}, index=['ID2', 'ID3', 'ID5'])

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)
print()

# Addition with index alignment
print("1. Addition (aligns by index):")
result = df1['Value'] + df2['Multiplier']  # Series addition
print(result)
print("\nNote: ID1 and ID4 get NaN (not in df2)")
print("      ID5 gets NaN (not in df1)")
print()

# Using .add() with fill_value
print("2. Addition with fill_value (treat missing as 0):")
result_filled = df1['Value'].add(df2['Multiplier'], fill_value=0)
print(result_filled)
print()

# DataFrame operations with alignment
print("3. DataFrame operations (creates union of indices):")
df_combined = pd.DataFrame({
    'Value': df1['Value'],
    'Multiplier': df2['Multiplier']
})
print(df_combined)

DataFrame 1:
     Value Category
ID1     10        A
ID2     20        B
ID3     30        C
ID4     40        D

DataFrame 2:
     Multiplier  Active
ID2         1.5    True
ID3         2.0   False
ID5         2.5    True

1. Addition (aligns by index):
ID1     NaN
ID2    21.5
ID3    32.0
ID4     NaN
ID5     NaN
dtype: float64

Note: ID1 and ID4 get NaN (not in df2)
      ID5 gets NaN (not in df1)

2. Addition with fill_value (treat missing as 0):
ID1    10.0
ID2    21.5
ID3    32.0
ID4    40.0
ID5     2.5
dtype: float64

3. DataFrame operations (creates union of indices):
     Value  Multiplier
ID1   10.0         NaN
ID2   20.0         1.5
ID3   30.0         2.0
ID4   40.0         NaN
ID5    NaN         2.5


## 10. Introduction to MultiIndex

### Concept:
- Hierarchical indexing (index with multiple levels)
- Allows higher-dimensional data in 2D structure
- Useful for grouping and aggregation

### Basic operations:
- Creating MultiIndex
- Selecting from MultiIndex DataFrames
- Cross-section selection with `.xs()`
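
As a compact illustration of these operations, the sketch below builds a hypothetical two-level frame (`demo`, with assumed level names `team` and `member`) and selects on the inner level in two ways. `pd.IndexSlice` is an alternative to writing `slice(None)` inside a tuple.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["A", "B"], ["x", "y"]], names=["team", "member"]
)
demo = pd.DataFrame({"sales": [1, 2, 3, 4]}, index=idx)

# Every team, inner label "y"; both index levels are kept
only_y = demo.loc[pd.IndexSlice[:, "y"], :]

# .xs selects the same rows but drops the selected level by default
y_xs = demo.xs("y", level="member")
```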

In [16]:
# Create DataFrame with MultiIndex
arrays = [
    ['Team A', 'Team A', 'Team A', 'Team B', 'Team B', 'Team B'],
    ['John', 'Sarah', 'Mike', 'Emma', 'David', 'Lisa']
]

index = pd.MultiIndex.from_arrays(arrays, names=['Team', 'Member'])
data = {
    'Q1_Sales': [100, 150, 120, 200, 180, 160],
    'Q2_Sales': [110, 140, 130, 210, 170, 175],
    'Region': ['North', 'North', 'South', 'South', 'North', 'South']
}

df_multi = pd.DataFrame(data, index=index)
print("MultiIndex DataFrame:")
print(df_multi)
print(f"\nIndex names: {df_multi.index.names}")
print(f"Index levels: {df_multi.index.nlevels}")
print()

# Selecting from MultiIndex
print("1. Select all members of Team A:")
print(df_multi.loc['Team A'])
print()

print("2. Select specific member:")
print(df_multi.loc[('Team A', 'John')])
print()

print("3. Select across teams (same member position):")
print(df_multi.xs('Sarah', level='Member'))  # Sarah from all teams
print()

# Partial indexing
print("4. Partial indexing (all teams, then filter):")
print(df_multi.loc[(slice(None), ['John', 'Emma']), :])  # All teams, only John and Emma
print()


MultiIndex DataFrame:
               Q1_Sales  Q2_Sales Region
Team   Member                           
Team A John         100       110  North
       Sarah        150       140  North
       Mike         120       130  South
Team B Emma         200       210  South
       David        180       170  North
       Lisa         160       175  South

Index names: ['Team', 'Member']
Index levels: 2

1. Select all members of Team A:
        Q1_Sales  Q2_Sales Region
Member                           
John         100       110  North
Sarah        150       140  North
Mike         120       130  South

2. Select specific member:
Q1_Sales      100
Q2_Sales      110
Region      North
Name: (Team A, John), dtype: object

3. Select across teams (same member position):
        Q1_Sales  Q2_Sales Region
Team                             
Team A       150       140  North

4. Partial indexing (all teams, then filter):
               Q1_Sales  Q2_Sales Region
Team   Member                           


## 11. Using `.query()` Method

### Advantages:
- SQL-like syntax for filtering
- Can use column names directly (no quotes for simple names)
- Often more readable for complex conditions

### Syntax: `df.query('expression')`
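
Two features of the expression syntax are worth a short sketch: `@name` refers to a local Python variable, and backticks quote column names that are not valid identifiers. The frame `demo`, its column names, and `threshold` are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({"unit price": [5, 15, 25], "qty": [1, 2, 3]})
threshold = 10

# Backticks quote the column with a space; @threshold reads the local variable
out = demo.query("`unit price` > @threshold and qty < 3")
print(out)
```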

In [18]:
print("Original DataFrame:")
print(df)
print()

# Basic query
print("1. Simple query (Age > 30):")
print(df.query('Age > 30'))
print()

# Multiple conditions
print("2. Multiple conditions:")
print(df.query('Department == "Engineering" and Salary > 70000'))
print()

# Using variables in query
print("3. Query with variable:")
min_salary = 60000
print(df.query('Salary > @min_salary'))  # @ references a local Python variable
print()

# Complex conditions
print("4. Complex query:")
print(df.query('(Department in ["HR", "Sales"]) and (City == "New York")'))
print()

# Comparison with boolean indexing
print("5. Same filter with boolean indexing:")
mask = (df['Department'].isin(['HR', 'Sales'])) & (df['City'] == 'New York')
print(df[mask])
print("\nNote: .query() can be more readable for complex conditions")

Original DataFrame:
         Name  Age   Department  Salary  Experience      City
E001    Alice   25           HR   50000           2  New York
E002      Bob   30  Engineering   80000           5    London
E003  Charlie   35    Marketing   60000           3     Tokyo
E004    Diana   28  Engineering   75000           6     Paris
E005      Eve   32           HR   52000           2  New York
E006    Frank   40        Sales   90000          10    London

1. Simple query (Age > 30):
         Name  Age Department  Salary  Experience      City
E003  Charlie   35  Marketing   60000           3     Tokyo
E005      Eve   32         HR   52000           2  New York
E006    Frank   40      Sales   90000          10    London

2. Multiple conditions:
       Name  Age   Department  Salary  Experience    City
E002    Bob   30  Engineering   80000           5  London
E004  Diana   28  Engineering   75000           6   Paris

3. Query with variable:
       Name  Age   Department  Salary  Experience    

## 12. Performance Tips for Indexing

### Fastest to Slowest (rough guide; measure on your own data):
1. `.iat[]` / `.at[]` (single scalar values)
2. `.iloc[]` / `.loc[]` (selecting rows, columns, or blocks)
3. Boolean indexing (builds and applies a mask over every row)
4. `.apply()` (row-wise operations)
5. Iterating with loops (slowest)

### Best Practices:
1. Use vectorized operations when possible
2. Avoid chained indexing `df[][][]`
3. Precompute boolean masks for repeated use
4. Use appropriate dtypes to save memory
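
Two of these practices, reusing a precomputed mask and choosing a compact dtype, can be sketched on hypothetical data (`demo`, 3,000 rows with three distinct city names). Actual memory savings depend on the data and the pandas version.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "city": ["London", "Paris", "Tokyo"] * 1000,
    "value": np.arange(3000),
})

# Compute the mask once, then reuse it for several selections
in_london = demo["city"] == "London"
london_total = demo.loc[in_london, "value"].sum()
london_count = int(in_london.sum())

# A categorical column stores each distinct string once plus small integer codes
plain_bytes = demo["city"].memory_usage(deep=True)
cat_bytes = demo["city"].astype("category").memory_usage(deep=True)
print(f"plain: {plain_bytes} bytes, category: {cat_bytes} bytes")
```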

In [19]:
import time

# Create larger DataFrame for timing
large_df = pd.DataFrame({
    'A': np.random.randn(10000),
    'B': np.random.randn(10000),
    'C': np.random.randn(10000)
})

print("Performance comparison (timing may vary):")
print(f"DataFrame size: {large_df.shape}")
print()

# Time different selection methods (perf_counter is a monotonic, high-resolution clock)
start = time.perf_counter()
for _ in range(1000):
    val = large_df.iloc[5000, 1]  # Single value with .iloc
iloc_time = time.perf_counter() - start

start = time.perf_counter()
for _ in range(1000):
    val = large_df.iat[5000, 1]  # Single value with .iat
iat_time = time.perf_counter() - start

print(f".iloc time for 1000 accesses: {iloc_time:.4f}s")
print(f".iat time for 1000 accesses: {iat_time:.4f}s")
print(f"Speedup: {iloc_time/iat_time:.1f}x faster with .iat")
print()

# Boolean indexing performance
start = time.perf_counter()
mask = large_df['A'] > 0
result = large_df[mask]
bool_time = time.perf_counter() - start
print(f"Boolean indexing time: {bool_time:.6f}s")
print(f"Rows selected: {len(result)}/{len(large_df)}")

Performance comparison (timing may vary):
DataFrame size: (10000, 3)

.iloc time for 1000 accesses: 0.0119s
.iat time for 1000 accesses: 0.0080s
Speedup: 1.5x faster with .iat

Boolean indexing time: 0.000529s
Rows selected: 5013/10000


# Summary: DataFrame Indexing and Selection

## Selection Methods Cheat Sheet:

### Column Selection
- Single column: `df['col']` or `df.col`
- Multiple columns: `df[['col1', 'col2']]`

### Row Selection
- **`.loc[]`** (label-based):
  - Single row: `df.loc['label']`
  - Single cell: `df.loc['row', 'col']`
  - Multiple: `df.loc[['r1','r2'], ['c1','c2']]`
  - Slice: `df.loc['start':'end']` (inclusive!)

- **`.iloc[]`** (position-based):
  - Single row: `df.iloc[0]`
  - Single cell: `df.iloc[0, 1]`
  - Multiple: `df.iloc[[0,2], [1,3]]`
  - Slice: `df.iloc[0:3]` (exclusive!)

### Boolean Indexing
- `df[df['col'] > value]`
- `df[(cond1) & (cond2)]` (use parentheses!)
- `df[~condition]` (NOT operator)

### Scalar Access (Fast)
- `.at['row', 'col']` - label-based single value
- `.iat[0, 1]` - position-based single value

## Best Practices:
1. **Use `.loc` and `.iloc`** for explicit indexing
2. **Avoid chained indexing** (`df[][]=value`)
3. **Remember**: `.loc` slices are inclusive, `.iloc` are exclusive
4. **Use vectorized operations** instead of loops
5. **Be mindful of index alignment** in operations
6. **Use `.copy()`** when you need a separate DataFrame
7. **Prefer `.query()`** for complex, readable conditions

## Common Patterns:
```python
# Select rows meeting condition, specific columns
df.loc[df['col'] > value, ['col1', 'col2']]

# Select first n rows, last m columns
df.iloc[:n, -m:]

# Update values based on condition
df.loc[condition, 'column'] = new_value

# Select using callable
df.loc[lambda x: x['col'] > threshold]
```

---

# Practice Exercises

## Exercise 1: Basic Selection
1. Select the 'Name' and 'Salary' columns for all employees
2. Select employees in the Engineering department
3. Select employees aged between 30 and 35 (inclusive)

## Exercise 2: Using .loc and .iloc
1. Use .loc to select employees E002, E004, and E006
2. Use .iloc to select the 2nd, 4th, and 6th rows
3. Select the Salary of employee E003 using both .loc and .at

## Exercise 3: Boolean Indexing
1. Find employees with Salary > 70000 OR Experience > 5
2. Find employees in London who are NOT in Sales department
3. Create a new column 'High_Earner' (True if Salary > 75000)

## Exercise 4: Setting Values
1. Give a 10% raise to all employees in HR
2. Change Diana's city to 'Berlin'
3. Add a new column 'Total_Comp' = Salary + (Salary * 0.1)

## Exercise 5: Advanced
1. Use .query() to find employees in Engineering with Salary > 65000
2. Select every other employee (1st, 3rd, 5th)
3. Create a boolean mask for employees with names longer than 5 characters