# Pandas Toolkit Series (Part 1) - Selection, Filtering & Transformation

**INST447 Data Source and Manipulation**  
Author: **Wei Ai** (aiwei@umd.edu)  
University of Maryland

---

Today we'll explore the fundamental operations for working with pandas DataFrames:
- **Selection**: Choosing specific rows and columns
- **Filtering**: Finding data that meets certain conditions  
- **Transformation**: Creating new data from existing columns

We'll use a realistic dataset of personal flight history to demonstrate these concepts.

In [2]:
import pandas as pd
import numpy as np

# Personal flight history dataset - notice the mixed data types!
flights_data = [
    ("2024-01-15", "UA1247", "BWI", "ORD", "651", "B737", "12A", 289.50, 15),
    ("2024-01-22", "DL456", "ORD", "LAX", "1745", "A321", "8F", 425.00, None),
    ("2024-02-08", "WN2891", "LAX", "PHX", "370", "B737", "", 149.99, 0),
    ("2024-02-10", "WN1055", "PHX", "DEN", "602", "B737", "15C", None, 45),
    ("2024-03-05", "AA892", "DEN", "DFW", "663", "B737", "21B", 198.75, None),
    ("2024-03-12", "UA634", "DFW", "IAD", "1216", "B777", "9A", 345.25, 12),
    ("2024-04-20", "B61840", "IAD", "BOS", "429", "", "11D", 179.50, 0),
    ("2024-05-15", "DL1123", "BOS", "ATL", "946", "A220", "4A", 267.00, 25),
    ("2024-05-18", "DL2967", "ATL", "MIA", "594", "B737", "", None, 8),
    ("2024-06-02", "AA1456", "MIA", "LGA", "1095", "A321", "18F", 312.80, None)
]

columns = ["flight_date", "flight_number", "origin", "destination",
          "distance", "aircraft", "seat", "price", "delay_min"]

flights = pd.DataFrame(flights_data, columns=columns)
flights

| | flight_date | flight_number | origin | destination | distance | aircraft | seat | price | delay_min |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-01-15 | UA1247 | BWI | ORD | 651 | B737 | 12A | 289.5 | 15.0 |
| 1 | 2024-01-22 | DL456 | ORD | LAX | 1745 | A321 | 8F | 425.0 | NaN |
| 2 | 2024-02-08 | WN2891 | LAX | PHX | 370 | B737 | | 149.99 | 0.0 |
| 3 | 2024-02-10 | WN1055 | PHX | DEN | 602 | B737 | 15C | NaN | 45.0 |
| 4 | 2024-03-05 | AA892 | DEN | DFW | 663 | B737 | 21B | 198.75 | NaN |
| 5 | 2024-03-12 | UA634 | DFW | IAD | 1216 | B777 | 9A | 345.25 | 12.0 |
| 6 | 2024-04-20 | B61840 | IAD | BOS | 429 | | 11D | 179.5 | 0.0 |
| 7 | 2024-05-15 | DL1123 | BOS | ATL | 946 | A220 | 4A | 267.0 | 25.0 |
| 8 | 2024-05-18 | DL2967 | ATL | MIA | 594 | B737 | | NaN | 8.0 |
| 9 | 2024-06-02 | AA1456 | MIA | LGA | 1095 | A321 | 18F | 312.8 | NaN |


In [None]:
# Separate airline lookup table - we'll use this later for joins!
airlines_data = [
    ("UA", "United Airlines"),
    ("DL", "Delta Air Lines"),
    ("WN", "Southwest Airlines"),
    ("AA", "American Airlines"),
    ("B6", "JetBlue Airways")
]

airlines = pd.DataFrame(airlines_data, columns=["code", "airline_name"])
airlines

## 1. Row Selection - Getting Subsets of Data

Let's start with the basics: how do we look at parts of our data?

In [None]:
# Basic exploration
flights.head()

In [None]:
flights.tail(3)

**What if I want to see specific rows by position?**

This is where `.iloc` comes in handy:

In [None]:
# Positional selection with iloc
flights.iloc[2:5]

In [None]:
# Label-based selection with loc - unlike iloc, the end label (5) is included
flights.loc[2:5]
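
The two slices `iloc[2:5]` and `loc[2:5]` look alike but do not return the same rows. Here is a minimal sketch of the difference, using a small stand-in frame (the `demo` DataFrame and its values are made up for illustration, not part of the flight data):

```python
import pandas as pd

# Hypothetical stand-in data with the default integer index 0..5
demo = pd.DataFrame({"origin": ["BWI", "ORD", "LAX", "PHX", "DEN", "DFW"]})

# iloc slices by position: positions 2, 3, 4 (end excluded, like a Python list)
by_position = demo.iloc[2:5]

# loc slices by index label: labels 2, 3, 4, 5 (end label included)
by_label = demo.loc[2:5]

print(len(by_position), len(by_label))
```

With the default index, positions and labels happen to coincide, so the only visible difference is the extra last row. Once the index is something else (dates, strings, or a reordered frame), the two can return entirely different rows.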

## 2. Column Selection - Choosing What Data to Work With

Now let's focus on selecting specific columns from our dataset.

In [None]:
# Single column selection - dot notation
flights.origin

In [None]:
# Single column selection - bracket notation
flights['origin']

**What if I want multiple columns?** Notice the difference between `[]` and `[[]]`:

In [None]:
# Multiple column selection
flights[['flight_number', 'origin', 'destination']]

What is the difference between `flights.origin`, `flights['origin']`, and `flights[['origin']]`?
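
One way to explore that question is to check the type of each result. The sketch below uses a tiny made-up frame called `demo` (an assumption for illustration) so it runs on its own:

```python
import pandas as pd

# Hypothetical two-row frame standing in for the flights data
demo = pd.DataFrame({"origin": ["BWI", "ORD"], "price": [289.50, 425.00]})

dot_result = demo.origin           # a Series
bracket_result = demo["origin"]    # the same Series
double_result = demo[["origin"]]   # a DataFrame with a single column

print(type(dot_result).__name__)      # Series
print(type(bracket_result).__name__)  # Series
print(type(double_result).__name__)   # DataFrame
print(double_result.shape)            # (2, 1)
```

Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so bracket notation is the safer habit.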

In [None]:
# What are all the column names?
flights.columns

## 3. Filtering as Selection

The power of pandas comes from boolean filtering. Let's break this down step by step:

In [None]:
# Boolean filtering - step by step
flights.origin == 'BWI'

Now we use that boolean Series (often called a mask) to filter the rows:

In [None]:
# Using the boolean mask
flights[flights.origin == 'BWI']

**What if I want to check if values are in a list?** Use `.isin()`:

In [None]:
# Using .isin() for membership testing
east_coast = ['BWI', 'BOS', 'LGA', 'IAD']
flights[flights.origin.isin(east_coast)]

**What if I want to combine multiple conditions?** Use `&` (and) or `|` (or):

In [None]:
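# Multiple conditions - wrap each condition in parentheses
# (price > 200 is already False for missing prices; notna() just makes that explicit)
flights[(flights.price > 200) & (flights.price.notna())]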
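
The same pattern extends to `|` (or) and `~` (not). A short sketch follows on a made-up `demo` frame (the data and variable names are illustrative assumptions). The parentheses matter because `&` and `|` bind more tightly than comparisons such as `<` and `==`:

```python
import pandas as pd

# Hypothetical stand-in data; the last price is missing on purpose
demo = pd.DataFrame({
    "origin": ["BWI", "ORD", "LAX", "PHX"],
    "price": [289.50, 425.00, 149.99, None],
})

# OR: keep rows that are cheap or that depart from BWI
cheap_or_from_bwi = demo[(demo.price < 200) | (demo.origin == "BWI")]

# NOT: keep rows whose price is not missing
has_price = demo[~demo.price.isna()]

print(cheap_or_from_bwi)
print(has_price)
```

Comparisons against a missing value evaluate to `False`, which is why the PHX row (no price) does not appear in the first result.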

You may notice that the `[]` operator is overloaded: it does different things depending on what you put inside it.
```
flights['origin']              # Column selection
flights[['origin', 'price']]   # Multiple column selection
flights[flights.price > 200]   # Row selection (boolean filtering)
```

**What if I want to avoid confusion between row and column selection?** Use `.loc`:

In [None]:
# Using .loc for explicit row selection
flights.loc[flights.destination == 'LAX']

In [None]:
# Using .loc for explicit column selection
flights.loc[:, ['origin', 'destination']]

In [None]:
# Using .loc for explicit row and column selection
flights.loc[flights.destination == 'LAX', ['origin', 'destination']]

## 4. SQL-Style Queries

**What if you're more comfortable with SQL syntax?** Pandas has you covered:

In [None]:
# Using query method - familiar to SQL users
flights.query("origin == 'BWI'")

In [None]:
# More complex queries
flights.query("price > 300 and origin == 'ORD'")
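
Query strings can also refer to ordinary Python variables by prefixing the name with `@`. A minimal sketch, assuming a small made-up `demo` frame and a hypothetical `min_price` threshold:

```python
import pandas as pd

# Hypothetical stand-in data
demo = pd.DataFrame({
    "origin": ["BWI", "ORD", "LAX"],
    "price": [289.50, 425.00, 149.99],
})

east_coast = ["BWI", "BOS", "LGA", "IAD"]
min_price = 200  # illustrative threshold, not from the lecture

# @name looks up a Python variable from the surrounding scope
result = demo.query("origin in @east_coast and price > @min_price")

print(result)
```

This keeps the condition readable while avoiding string formatting to splice values into the query.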

## 5. Data Transformation - Creating New Information

Now let's create new columns from existing data:

What if we want to double the distance to calculate the "round-trip" distance of a flight?

In [None]:
flights.distance * 2

Wait, what? Why does "651" become "651651" instead of 1302?

**Pay attention to data types**

In [None]:
# Notice something interesting about the distance column
flights.distance.dtype

**What if I need to convert that distance to a number for calculations?**

In [None]:
flights.distance.astype(int)

In [None]:
# Convert string distance to numeric
flights['distance'] = flights.distance.astype(int)
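
`astype(int)` works here because every distance string is a clean number. When a column might contain values that cannot be parsed, `pd.to_numeric` with `errors="coerce"` is a common alternative. The sketch below uses a made-up `raw` Series (an assumption, not the lecture data) to show the difference:

```python
import pandas as pd

# Hypothetical raw values: two clean numbers and one entry that is not a number
raw = pd.Series(["651", "1745", "unknown"])

# astype(int) stops with a ValueError at the first value it cannot parse
try:
    raw.astype(int)
    astype_failed = False
except ValueError:
    astype_failed = True

# to_numeric with errors="coerce" turns unparseable values into NaN instead
clean = pd.to_numeric(raw, errors="coerce")

print(astype_failed)
print(clean)
```

The coerced result is a float column, because `NaN` cannot be stored in a plain integer column.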

**What if I want to combine text columns?**

In [None]:
# Combining text columns
flights['route'] = flights.origin + '-' + flights.destination
flights[['flight_number', 'route']].head()

In [None]:
'-'.join(['BWI', 'LAX'])

In [1]:
# Extract airline codes from flight numbers - first attempt
# Careful: slicing the Series itself returns its first two rows, not the first two characters of each value
flights.flight_number[:2]

In [None]:
# The .str accessor applies the slice to each string, giving the two-letter airline code
flights.flight_number.str[:2]
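
To see the two behaviours side by side, here is a small sketch on a stand-alone Series (the variable names are illustrative; the three flight numbers are taken from the dataset above):

```python
import pandas as pd

flight_number = pd.Series(["UA1247", "DL456", "WN2891"])

# Slicing the Series selects rows: the first two flight numbers
first_two_rows = flight_number[:2]

# Slicing through .str selects characters within every value
first_two_chars = flight_number.str[:2]

print(first_two_rows.tolist())   # ['UA1247', 'DL456']
print(first_two_chars.tolist())  # ['UA', 'DL', 'WN']
```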

In [None]:
# More string operations
flights.aircraft.str.upper()

In [None]:
# Check if aircraft contains specific text
flights.aircraft.str.contains('B7')

## 6. Lambda Expressions and Apply - Advanced Transformations

**What if I need more complex logic for transformations?** Lambda expressions and `.apply()` give us flexibility:

In [None]:
# Lambda expressions on a single column (cell-wise operations)
flights.distance.apply(lambda x: x * 1.6)  # Convert miles to kilometers (approximately)

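What is happening under the hood: `lambda x: x * 1.6` defines an anonymous (unnamed) function that takes one value `x` and returns `x * 1.6`. `.apply()` calls it once for every value in the column.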
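
To make that concrete, the sketch below writes the same conversion three ways: with a lambda, with a named function, and as plain column arithmetic. The `distance` Series and the `miles_to_km` helper are illustrative names, not part of the lecture code:

```python
import pandas as pd

distance = pd.Series([651, 1745, 370])

# A named function that does exactly what the lambda does
def miles_to_km(x):
    return x * 1.6

with_lambda = distance.apply(lambda x: x * 1.6)
with_def = distance.apply(miles_to_km)

# For simple arithmetic, operating on the whole column avoids apply entirely
vectorized = distance * 1.6

print(with_lambda.tolist())
print(with_def.tolist())
print(vectorized.tolist())
```

All three produce the same numbers. For arithmetic this simple, the column-wide form is usually the shortest and fastest, and `.apply()` earns its place when the logic involves branching.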

**What if I want to categorize flights based on distance?**

In [None]:
# Complex lambda - categorizing flight distances
flights['flight_type'] = flights.distance.apply(lambda x: 'Short' if x < 500 else 'Medium' if x < 1000 else 'Long')
flights[['flight_number', 'distance', 'flight_type']].head()
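
Chained conditionals inside a lambda become hard to read as the number of categories grows. As an alternative technique (not the approach used in this lecture), `pd.cut` can assign the same labels from a list of bin edges. A minimal sketch on a made-up `distance` Series:

```python
import pandas as pd

# Hypothetical distances, including the boundary values 500 and 1000
distance = pd.Series([370, 651, 1745, 500, 1000])

flight_type = pd.cut(
    distance,
    bins=[0, 500, 1000, float("inf")],
    labels=["Short", "Medium", "Long"],
    right=False,  # intervals are [0, 500), [500, 1000), [1000, inf)
)

print(flight_type.tolist())  # ['Short', 'Medium', 'Long', 'Medium', 'Long']
```

Setting `right=False` makes each bin include its left edge, which matches the `x < 500` and `x < 1000` tests in the lambda version. The result is a categorical column rather than plain strings.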

**What if I need to use multiple columns in a single transformation?** Use `axis=1`:

In [None]:
# Lambda expressions on rows (access multiple columns)
flights.apply(lambda row: f"Flight {row['flight_number']} from {row['origin']} to {row['destination']}", axis=1)

**What if the logic gets really complex?** Write a proper function:

In [None]:
# More complex function
def flight_experience(row):
    if pd.isna(row['delay_min']) or row['delay_min'] == 0:
        delay_status = 'On Time'
    elif row['delay_min'] < 30:
        delay_status = 'Minor Delay'
    else:
        delay_status = 'Major Delay'

    airline_code = row['flight_number'][:2]
    return f"{airline_code} flight: {delay_status}"

flights['experience'] = flights.apply(flight_experience, axis=1)
flights[['flight_number', 'delay_min', 'experience']].head()

**Question**: What if I want to know the day of the week of each flight, so that I can tell whether I travel more on weekdays or on weekends?
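
One possible starting point, offered as a sketch and not as the only answer: convert the date strings with `pd.to_datetime` and use the `.dt` accessor. The `demo` frame below reuses three dates from the dataset, and the `day_of_week` and `is_weekend` column names are illustrative choices:

```python
import pandas as pd

demo = pd.DataFrame({"flight_date": ["2024-01-15", "2024-02-10", "2024-05-18"]})

# Parse the strings into real datetime values
demo["flight_date"] = pd.to_datetime(demo["flight_date"])

# Day name for each flight
demo["day_of_week"] = demo["flight_date"].dt.day_name()

# dayofweek numbers Monday as 0 and Sunday as 6, so 5 and 6 are the weekend
demo["is_weekend"] = demo["flight_date"].dt.dayofweek >= 5

print(demo)
```

Counting the values in `is_weekend` would then show how the flights split between weekdays and weekends.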