# Pandas Fundamentals II - Part 1: Indexing and Selection

## Week 3, Day 1 (Wednesday) - April 23rd, 2025

### Overview
This is the first part of our second session on Pandas fundamentals, focusing on indexing and selection techniques. Understanding how to efficiently select, access, and extract specific data from DataFrames is essential for effective data analysis.

### Learning Objectives
- Master different ways to select data from Pandas DataFrames
- Understand the difference between label-based and position-based indexing
- Learn how to select data using single and multiple criteria
- Apply SQL-like thinking to Pandas data selection

### Prerequisites
- Python fundamentals (Week 1)
- NumPy basics (Week 2, Day 1)
- Pandas Fundamentals I (Week 2, Day 2)
- SQL knowledge (prior to course)

## 1. Introduction to Indexing and Selection

Selecting and accessing specific data from a DataFrame is one of the most common operations in data analysis. Where SQL has a single `SELECT` statement, Pandas offers several selection mechanisms: by column name, by integer position, by label, and by boolean condition.

In [1]:
# Import libraries
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'product_name': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Monitor'],
    'category': ['Electronics', 'Electronics', 'Electronics', 'Accessories', 'Electronics'],
    'price': [1200, 800, 450, 150, 300],
    'in_stock': [True, True, False, True, True]
}

df = pd.DataFrame(data)
print("Sample DataFrame:")
print(df)

Sample DataFrame:
  product_id product_name     category  price  in_stock
0       P001       Laptop  Electronics   1200      True
1       P002   Smartphone  Electronics    800      True
2       P003       Tablet  Electronics    450     False
3       P004   Headphones  Accessories    150      True
4       P005      Monitor  Electronics    300      True


## 2. Basic Column Selection

### Selecting a Single Column

In SQL, you use: `SELECT column_name FROM table`. In Pandas, there are multiple ways to select columns:

In [10]:
# Method 1: Using dictionary-like notation (returns a Series)
product_names = df['product_name']
print("\nProduct names (Series):")
print(product_names)



Product names (Series):
0        Laptop
1    Smartphone
2        Tablet
3    Headphones
4       Monitor
Name: product_name, dtype: object


In [11]:

# Method 2: Using attribute notation (dot notation)
# Only works if the column name is a valid Python identifier (no spaces or special characters)
# and does not clash with an existing DataFrame attribute or method
categories = df.category
print("\nCategories (Series):")
print(categories)


Categories (Series):
0    Electronics
1    Electronics
2    Electronics
3    Accessories
4    Electronics
Name: category, dtype: object
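
The restriction on attribute notation is easiest to see on a small example. The sketch below uses a hypothetical `inventory` DataFrame (its column names and values are invented for illustration and are not part of the lesson's sample data). A column called `count` collides with the `DataFrame.count` method, and a column name containing a space cannot be written as an attribute at all, so bracket notation is the safer default:

```python
import pandas as pd

# Hypothetical frame, invented for this illustration
inventory = pd.DataFrame({
    'count': [3, 7, 2],
    'unit price': [9.5, 4.0, 12.25],
})

# Dot notation resolves to the DataFrame.count method, not the column
print(callable(inventory.count))

# Bracket notation always returns the column as a Series
counts = inventory['count']
unit_prices = inventory['unit price']
print(counts.tolist())
print(unit_prices.tolist())
```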


### Selecting Multiple Columns

In SQL: `SELECT col1, col2 FROM table`. In Pandas:

In [12]:
# Select multiple columns (returns a DataFrame)
product_info = df[['product_name', 'price']]
print("\nProduct info (multiple columns):")
print(product_info)


Product info (multiple columns):
  product_name  price
0       Laptop   1200
1   Smartphone    800
2       Tablet    450
3   Headphones    150
4      Monitor    300
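
The bracket style decides the return type: a single name gives a Series, while a list of names gives a DataFrame, even when the list holds only one name. A short sketch on a trimmed copy of the sample data (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'product_name': ['Laptop', 'Smartphone', 'Tablet'],
    'price': [1200, 800, 450],
})

price_series = df['price']     # single name -> Series (1-D)
price_frame = df[['price']]    # list with one name -> DataFrame (2-D)

print(type(price_series).__name__, price_series.shape)
print(type(price_frame).__name__, price_frame.shape)
```

This matters when a later step expects a DataFrame, for example when passing the selection to a function that reads `.columns`.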


## 3. Position-Based Indexing with iloc

`iloc` selects by integer position, like array indexing in NumPy: positions start at 0, and slices exclude the end position.

### Basic Row Selection

In [14]:
# Select a single row by position
first_row = df.iloc[0]
print("\nFirst row:")
print(first_row)



First row:
product_id             P001
product_name         Laptop
category        Electronics
price                  1200
in_stock               True
Name: 0, dtype: object


In [15]:

# Select multiple rows
first_three_rows = df.iloc[0:3]
print("\nFirst three rows:")
print(first_three_rows)


First three rows:
  product_id product_name     category  price  in_stock
0       P001       Laptop  Electronics   1200      True
1       P002   Smartphone  Electronics    800      True
2       P003       Tablet  Electronics    450     False


### Row and Column Selection

In [16]:
# Select specific rows and columns by position
# Format: df.iloc[row_selection, column_selection]

# Select first 2 rows and columns 1,2 (0-indexed)
subset = df.iloc[0:2, 1:3]
print("\nSubset with iloc (first 2 rows, columns 1-2):")
print(subset)



Subset with iloc (first 2 rows, columns 1-2):
  product_name     category
0       Laptop  Electronics
1   Smartphone  Electronics


In [17]:

# Select specific rows and columns with lists
specific_cells = df.iloc[[0, 2, 4], [1, 3]]  # Rows 0,2,4 and columns 1,3
print("\nSpecific cells with iloc:")
print(specific_cells)


Specific cells with iloc:
  product_name  price
0       Laptop   1200
2       Tablet    450
4      Monitor    300
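
Because `iloc` follows Python's positional rules, negative positions and slice steps should work as they do for lists. A minimal sketch on a three-column version of the sample data (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'product_name': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Monitor'],
    'price': [1200, 800, 450, 150, 300],
})

last_row = df.iloc[-1]               # last row, returned as a Series
last_two = df.iloc[-2:]              # last two rows
every_other = df.iloc[::2, [0, 2]]   # rows 0, 2, 4 and columns 0 and 2

print(last_row['product_id'])
print(last_two)
print(every_other)
```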


## 4. Label-Based Indexing with loc

`loc` selects by row and column labels instead of positions. Unlike `iloc`, a label slice includes its end point.

### Basic Row Selection with loc

In [18]:
# First, let's set the product_id as the index to demonstrate label-based indexing
df_indexed = df.set_index('product_id')
print("\nDataFrame with product_id as index:")
print(df_indexed)



DataFrame with product_id as index:
           product_name     category  price  in_stock
product_id                                           
P001             Laptop  Electronics   1200      True
P002         Smartphone  Electronics    800      True
P003             Tablet  Electronics    450     False
P004         Headphones  Accessories    150      True
P005            Monitor  Electronics    300      True


In [19]:

# Select a single row by label
product_p002 = df_indexed.loc['P002']
print("\nProduct P002:")
print(product_p002)



Product P002:
product_name     Smartphone
category        Electronics
price                   800
in_stock               True
Name: P002, dtype: object


In [20]:

# Select multiple rows by label
selected_products = df_indexed.loc[['P001', 'P003', 'P005']]
print("\nSelected products:")
print(selected_products)


Selected products:
           product_name     category  price  in_stock
product_id                                           
P001             Laptop  Electronics   1200      True
P003             Tablet  Electronics    450     False
P005            Monitor  Electronics    300      True


### Row and Column Selection with loc

In [None]:
# Select specific rows and columns by label
# Format: df.loc[row_selection, column_selection]

# Select specific products and columns
product_details = df_indexed.loc[['P001', 'P004'], ['product_name', 'price']]
print("\nProduct details with loc:")
print(product_details)


In [21]:

# Slicing with loc (inclusive of end point)
slice_products = df_indexed.loc['P001':'P003', 'product_name':'in_stock']
print("\nSliced products with loc (inclusive):")
print(slice_products)


Sliced products with loc (inclusive):
           product_name     category  price  in_stock
product_id                                           
P001             Laptop  Electronics   1200      True
P002         Smartphone  Electronics    800      True
P003             Tablet  Electronics    450     False
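
The inclusive end point is the most common source of confusion between `loc` and `iloc`, especially on the default integer index where labels and positions look identical. A small sketch comparing the same slice under both accessors (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'product_name': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Monitor'],
    'price': [1200, 800, 450, 150, 300],
})

by_label = df.loc[0:2]       # labels 0, 1, 2 (end point included)
by_position = df.iloc[0:2]   # positions 0, 1 (end point excluded)

print(len(by_label), len(by_position))
```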


## 5. Using Boolean Indexing for Filtering

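One of the most powerful features of Pandas is boolean indexing, which plays the role of SQL's `WHERE` clause: a comparison produces a True/False mask, and indexing with that mask keeps only the rows where it is True.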

In [22]:
# Back to our original DataFrame
print("\nOriginal DataFrame:")
print(df)

# Filter products with price > 500
expensive_products = df[df['price'] > 500]
print("\nExpensive products (price > 500):")
print(expensive_products)

# Multiple conditions (using & for AND, | for OR)
# In SQL: SELECT * FROM products WHERE category = 'Electronics' AND price < 500
affordable_electronics = df[(df['category'] == 'Electronics') & (df['price'] < 500)]
print("\nAffordable electronics:")
print(affordable_electronics)

# Complex filters
# In SQL: SELECT * FROM products WHERE (category = 'Electronics' AND price > 300) OR in_stock = False
# ~ negates a boolean column, so ~df['in_stock'] is equivalent to df['in_stock'] == False
complex_filter = df[((df['category'] == 'Electronics') & (df['price'] > 300)) | ~df['in_stock']]
print("\nComplex filter results:")
print(complex_filter)


Original DataFrame:
  product_id product_name     category  price  in_stock
0       P001       Laptop  Electronics   1200      True
1       P002   Smartphone  Electronics    800      True
2       P003       Tablet  Electronics    450     False
3       P004   Headphones  Accessories    150      True
4       P005      Monitor  Electronics    300      True

Expensive products (price > 500):
  product_id product_name     category  price  in_stock
0       P001       Laptop  Electronics   1200      True
1       P002   Smartphone  Electronics    800      True

Affordable electronics:
  product_id product_name     category  price  in_stock
2       P003       Tablet  Electronics    450     False
4       P005      Monitor  Electronics    300      True

Complex filter results:
  product_id product_name     category  price  in_stock
0       P001       Laptop  Electronics   1200      True
1       P002   Smartphone  Electronics    800      True
2       P003       Tablet  Electronics    450     False
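
Some common filters have shorter spellings. The sketch below is one possible set of alternatives, using the standard pandas methods `isin`, `between`, and `query` plus the `~` negation operator; the variable names are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'product_name': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Monitor'],
    'category': ['Electronics', 'Electronics', 'Electronics', 'Accessories', 'Electronics'],
    'price': [1200, 800, 450, 150, 300],
    'in_stock': [True, True, False, True, True],
})

# ~ negates a boolean mask: products that are not in stock
out_of_stock = df[~df['in_stock']]

# isin matches any value in a list, like SQL's IN
selected = df[df['product_name'].isin(['Laptop', 'Monitor'])]

# between includes both end points by default
mid_range = df[df['price'].between(300, 800)]

# query takes the filter as a string expression
affordable_electronics = df.query("category == 'Electronics' and price < 500")

print(out_of_stock)
print(selected)
print(mid_range)
print(affordable_electronics)
```

Note that `&` and `|` bind more tightly than comparisons such as `==` and `<`, which is why each condition in the cell above is wrapped in parentheses.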

## 6. Combining Different Selection Methods

You can combine these methods for more complex selections:

In [23]:
# Use boolean indexing first, then select specific columns
# In SQL: SELECT product_name, price FROM products WHERE category = 'Electronics'
electronics_names_prices = df[df['category'] == 'Electronics'][['product_name', 'price']]
print("\nElectronics names and prices:")
print(electronics_names_prices)



Electronics names and prices:
  product_name  price
0       Laptop   1200
1   Smartphone    800
2       Tablet    450
4      Monitor    300


In [24]:

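# Alternative using loc: a single indexing call filters the rows and picks the columns,
# which avoids chaining two [] operations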
electronics_names_prices_alt = df.loc[df['category'] == 'Electronics', ['product_name', 'price']]
print("\nElectronics names and prices (using loc):")
print(electronics_names_prices_alt)


Electronics names and prices (using loc):
  product_name  price
0       Laptop   1200
1   Smartphone    800
2       Tablet    450
4      Monitor    300
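
For reading data the two forms give the same result. The single `loc` call becomes more important when assigning, because it addresses rows and columns in one step. A minimal sketch, working on a copy so the original stays untouched (the new price of 135 is an invented value for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'product_name': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Monitor'],
    'category': ['Electronics', 'Electronics', 'Electronics', 'Accessories', 'Electronics'],
    'price': [1200, 800, 450, 150, 300],
})

is_electronics = df['category'] == 'Electronics'

# Reading: chained brackets and a single loc call return equal DataFrames
chained = df[is_electronics][['product_name', 'price']]
with_loc = df.loc[is_electronics, ['product_name', 'price']]
same_result = chained.equals(with_loc)
print(same_result)

# Writing: select rows and column in one loc call, then assign
discounted = df.copy()
discounted.loc[discounted['category'] == 'Accessories', 'price'] = 135
print(discounted)
```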


## 7. Selecting Specific Data Types

Sometimes you want to select columns based on their data types:

In [25]:
# Select numeric columns
numeric_cols = df.select_dtypes(include=['int64', 'float64'])
print("\nNumeric columns:")
print(numeric_cols)



Numeric columns:
   price
0   1200
1    800
2    450
3    150
4    300


In [26]:

# Select object (string) columns
string_cols = df.select_dtypes(include=['object'])
print("\nString columns:")
print(string_cols)



String columns:
  product_id product_name     category
0       P001       Laptop  Electronics
1       P002   Smartphone  Electronics
2       P003       Tablet  Electronics
3       P004   Headphones  Accessories
4       P005      Monitor  Electronics


In [27]:

# Select boolean columns
bool_cols = df.select_dtypes(include=['bool'])
print("\nBoolean columns:")
print(bool_cols)


Boolean columns:
   in_stock
0      True
1      True
2     False
3      True
4      True
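
Listing exact dtypes such as `'int64'` can miss numeric columns stored under another width. As a sketch of a more general spelling, `select_dtypes` also accepts the string `'number'` and an `exclude` argument. The `weight_kg` column and its values below are hypothetical, added only so the example has both an integer and a float column:

```python
import pandas as pd

df = pd.DataFrame({
    'product_name': ['Laptop', 'Smartphone', 'Tablet'],
    'price': [1200, 800, 450],
    'weight_kg': [2.1, 0.2, 0.5],   # hypothetical column for illustration
    'in_stock': [True, True, False],
})

numeric_cols = df.select_dtypes(include='number')   # integer and float columns
non_numeric = df.select_dtypes(exclude='number')    # everything else

print(list(numeric_cols.columns))
print(list(non_numeric.columns))
```

Boolean columns are not treated as numeric here, so `in_stock` lands in the second group.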


## 8. The .at and .iat Accessors for Single Value Selection

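`.at` (label-based) and `.iat` (position-based) access exactly one cell. They accept only a single row and a single column, not slices or lists, and are faster than `loc` and `iloc` for scalar lookups: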

In [28]:
# .at for label-based scalar lookups
# Get the price of product P003 (assuming product_id is the index)
p003_price = df_indexed.at['P003', 'price']
print(f"\nPrice of P003: {p003_price}")



Price of P003: 450


In [29]:

# .iat for integer-based scalar lookups
# Get the value in the first row, third column
first_row_third_col = df.iat[0, 2]
print(f"\nValue at first row, third column: {first_row_third_col}")


Value at first row, third column: Electronics
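
As a quick check, the sketch below shows that `.at` and `.iat` return the same values as `loc` and `iloc` for a single cell, and that `.at` can also write one. It works on a copy so the lookup examples above are unaffected; the `updated` name is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003'],
    'product_name': ['Laptop', 'Smartphone', 'Tablet'],
    'category': ['Electronics', 'Electronics', 'Electronics'],
    'price': [1200, 800, 450],
    'in_stock': [True, True, False],
})
df_indexed = df.set_index('product_id')

# Same cell, two accessors each
price_at = df_indexed.at['P003', 'price']
price_loc = df_indexed.loc['P003', 'price']
cell_iat = df.iat[0, 2]
cell_iloc = df.iloc[0, 2]
print(price_at, price_loc, cell_iat, cell_iloc)

# .at can also set a single value
updated = df_indexed.copy()
updated.at['P003', 'in_stock'] = True
print(updated)
```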


## 9. SQL Translation Guide

Here's a quick reference for translating SQL SELECT operations to Pandas:

| SQL Operation | Pandas Equivalent |
|--------------|-------------------|
| `SELECT * FROM table` | `df` |
| `SELECT col FROM table` | `df['col']` or `df.col` |
| `SELECT col1, col2 FROM table` | `df[['col1', 'col2']]` |
| `SELECT * FROM table WHERE col = value` | `df[df['col'] == value]` |
| `SELECT * FROM table WHERE col1 = val1 AND col2 = val2` | `df[(df['col1'] == val1) & (df['col2'] == val2)]` |
| `SELECT * FROM table WHERE col IN (val1, val2)` | `df[df['col'].isin([val1, val2])]` |
| `SELECT * FROM table LIMIT 5` | `df.head(5)` |
| `SELECT * FROM table ORDER BY col` | `df.sort_values('col')` |
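
The rows of this table can be chained. As one possible translation of `SELECT product_name, price FROM products ORDER BY price LIMIT 2`, the sketch below selects the columns, sorts, and then takes the first rows (the `cheapest_two` name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'product_name': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Monitor'],
    'category': ['Electronics', 'Electronics', 'Electronics', 'Accessories', 'Electronics'],
    'price': [1200, 800, 450, 150, 300],
})

cheapest_two = (
    df[['product_name', 'price']]   # SELECT product_name, price
    .sort_values('price')           # ORDER BY price (ascending by default)
    .head(2)                        # LIMIT 2
)
print(cheapest_two)
```

The result keeps the original row labels (3 and 4 here), since sorting reorders rows without renumbering the index.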

## 10. Practice Exercises

### Exercise 1: Basic Selection
Using the sample DataFrame, select all products that are in stock.

In [None]:
# Your code here

### Exercise 2: Label-Based Selection
Using the indexed version of the DataFrame, select the product name and category for products P002 and P005.

In [None]:
# Your code here

### Exercise 3: SQL to Pandas Translation
Translate the following SQL query to Pandas:
```sql
SELECT product_name, price
FROM products
WHERE category = 'Electronics' 
AND price BETWEEN 300 AND 1000
ORDER BY price DESC
```

In [None]:
# Your code here

### Exercise 4: Complex Selection
Select all electronics products with prices above the average price of all products.

In [None]:
# Your code here

## Next Steps

In the next parts, we'll continue with:
- Part 2: Filtering data
- Part 3: Handling missing values

Continue to Part 2: Filtering Data when you're ready to proceed.