# Pandas Essentials - Data Science Koans

Welcome to Notebook 02: Pandas Essentials! This notebook builds your intuition for labeled data structures so you can slice, aggregate, and summarize information with confidence.

## What You Will Learn
- Creating Series and DataFrames
- Selecting and filtering data
- Computing descriptive statistics
- Grouping and aggregating data
- Sorting and ranking records

## Prerequisites
- NumPy fundamentals from Notebook 01 (arrays, broadcasting, and the new linear algebra KOANs 1.11-1.24)

## How to Use
1. Run the setup cell below.
2. Read the objective for each koan.
3. Complete the `TODO` marked sections.
4. Run the cell; the `validate()` call at the end of each koan checks your work.
5. Rerun until every koan passes and celebrate your progress!


In [None]:
# Setup - Run first!
import sys
sys.path.append('../..')
import numpy as np
import pandas as pd
from koans.core.validator import KoanValidator
from koans.core.progress import ProgressTracker

validator = KoanValidator("02_pandas_essentials")
tracker = ProgressTracker()
print("Setup complete!")
print(f"Progress: {tracker.get_notebook_progress('02_pandas_essentials')}%")


### Optional reference
Run the cell below to see the toy datasets used throughout the koans. Feel free to experiment with them as you work through each exercise.


In [None]:
# Preview datasets used in multiple koans (optional)
from IPython.display import display
datasets = {
    "employees": pd.DataFrame(
        {
            "name": ["Alice", "Bob", "Carla", "Dan"],
            "city": ["NY", "SF", "NY", "LA"],
            "salary": [120_000, 110_000, 125_000, 105_000],
        }
    ),
    "sales": pd.DataFrame(
        {
            "region": ["East", "West", "East", "West", "South"],
            "rep": ["Alice", "Bob", "Carla", "Dan", "Eve"],
            "revenue": [120, 150, 100, 200, 90],
        }
    ),
}

for name, df in datasets.items():
    print(f"=== {name} ===")
    display(df)
    print()


## KOAN 2.1: Creating Series
**Objective**: Build a labeled one-dimensional structure
**Difficulty**: Beginner
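
**Hint**: a minimal sketch on made-up data, not the koan's answer. `pd.Series` takes the values plus an optional `index` of labels. The name `temps` and the weekday labels below are illustrative assumptions.

```python
import pandas as pd

# Illustrative data: three readings labeled by weekday.
temps = pd.Series([21.5, 23.0, 19.8], index=["mon", "tue", "wed"])

# Values are looked up by label rather than by position.
print(temps["tue"])          # 23.0
print(temps.index.tolist())  # ['mon', 'tue', 'wed']
```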


In [None]:
def create_series():
    '''Return a Series [10, 20, 30] with index ['a', 'b', 'c'].'''
    # TODO: Create a pandas Series with the specified values and index.
    pass


@validator.koan(1, "Creating Series", difficulty="Beginner")
def validate():
    result = create_series()
    assert isinstance(result, pd.Series), "Return a pandas Series."
    assert result.index.tolist() == ['a', 'b', 'c'], "Use the requested index order."
    assert result.tolist() == [10, 20, 30], "Populate the Series with the given values."


validate()


## KOAN 2.2: Series Operations
**Objective**: Apply vectorized arithmetic
**Difficulty**: Beginner
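
**Hint**: a small sketch of vectorized arithmetic on made-up data. An operation between a Series and a scalar applies to every element and returns a new Series, so no loop is needed. The Celsius readings are an assumption chosen for illustration.

```python
import pandas as pd

# Illustrative data: temperatures in Celsius.
celsius = pd.Series([0.0, 100.0, 37.0], name="celsius")

# Scalar arithmetic is applied element by element and returns a new Series.
fahrenheit = celsius * 9 / 5 + 32

print(fahrenheit.tolist())
print(celsius.tolist())  # the original Series is left unchanged
print(fahrenheit.name)   # the name carries over from `celsius`
```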


In [None]:
def double_series():
    '''Return a new Series where each value is doubled.'''
    s = pd.Series([1, 2, 3, 4, 5], name="base")
    # TODO: Produce a new Series with values doubled while leaving `s` unchanged.
    pass


@validator.koan(2, "Series Operations", difficulty="Beginner")
def validate():
    result = double_series()
    assert isinstance(result, pd.Series), "Return a pandas Series."
    assert result.tolist() == [2, 4, 6, 8, 10], "Double every value."
    assert result.name == "base", "Preserve the original Series name."


validate()


## KOAN 2.3: Creating DataFrames
**Objective**: Assemble tabular data
**Difficulty**: Beginner
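
**Hint**: one way to turn row-shaped records into a table, sketched on made-up data. When the input is a list of tuples, the `columns` argument of `pd.DataFrame` names each tuple position. The `rows` and `items` names are illustrative.

```python
import pandas as pd

# Illustrative records: each tuple is one row.
rows = [("pen", 1.20), ("notebook", 3.50)]

# `columns` assigns a name to each tuple position.
items = pd.DataFrame(rows, columns=["item", "price"])
print(items)
```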


In [None]:
def create_dataframe():
    '''Return a DataFrame with columns 'name' and 'age'.'''
    data = [("Alice", 25), ("Bob", 30)]
    # TODO: Build the DataFrame using the tuples above.
    pass


@validator.koan(3, "Creating DataFrames", difficulty="Beginner")
def validate():
    result = create_dataframe()
    expected = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})
    assert isinstance(result, pd.DataFrame), "Return a pandas DataFrame."
    pd.testing.assert_frame_equal(result.reset_index(drop=True), expected)


validate()


## KOAN 2.4: DataFrame Properties
**Objective**: Inspect structure and metadata
**Difficulty**: Beginner
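
**Hint**: the attributes most often used to inspect a DataFrame, sketched on a made-up frame. `shape` is a `(rows, columns)` tuple, `columns` is an Index that converts to a list, and `dtypes` is a Series mapping each column to its dtype.

```python
import pandas as pd

# Illustrative frame with three columns of different types.
df = pd.DataFrame({"x": [1, 2], "label": ["a", "b"], "ok": [True, False]})

print(df.shape)             # (2, 3)
print(df.columns.tolist())  # ['x', 'label', 'ok']

# Convert the dtype objects to strings to get a plain dictionary.
dtype_names = df.dtypes.astype(str).to_dict()
print(dtype_names)
```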


In [None]:
def dataframe_properties():
    '''Return key structural details for the example DataFrame.'''
    df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.5, 6.0]})
    # TODO: Return a dictionary with shape, column names, and dtypes.
    pass


@validator.koan(4, 'DataFrame Properties', difficulty='Beginner')
def validate():
    result = dataframe_properties()
    assert isinstance(result, dict), 'Return a dictionary summarizing the DataFrame.'
    assert result.get('shape') == (3, 2), 'Include the DataFrame shape.'
    assert list(result.get('columns', [])) == ['A', 'B'], 'List column names in order.'
    dtypes = result.get('dtypes')
    assert isinstance(dtypes, dict), 'Provide dtypes as a dictionary.'
    expected_dtypes = {'A': 'int64', 'B': 'float64'}
    assert {k: str(v) for k, v in dtypes.items()} == expected_dtypes, 'Report dtypes using pandas dtype strings.'


validate()


## KOAN 2.5: Column Selection
**Objective**: Access single and multiple columns
**Difficulty**: Beginner
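
**Hint**: the bracket style decides what comes back. A single label returns a Series, while a list of labels returns a DataFrame, even when the list has only one entry. The frame below is made up for illustration.

```python
import pandas as pd

# Illustrative frame with three columns.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

one_column = df["a"]          # single label -> Series
two_columns = df[["a", "c"]]  # list of labels -> DataFrame
still_a_frame = df[["a"]]     # one-item list -> DataFrame with one column

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(still_a_frame.shape)
```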


In [None]:
def select_columns():
    '''Return the name and city columns from the employees DataFrame.'''
    df = pd.DataFrame(
        {
            "name": ["Alice", "Bob", "Carla", "Dan"],
            "city": ["NY", "SF", "NY", "LA"],
            "salary": [120_000, 110_000, 125_000, 105_000],
        }
    )
    # TODO: Return just the name and city columns as a DataFrame.
    pass


@validator.koan(5, "Column Selection", difficulty="Beginner")
def validate():
    result = select_columns()
    assert isinstance(result, pd.DataFrame), "Return a DataFrame with the requested columns."
    assert list(result.columns) == ["name", "city"], "Include only name and city."
    assert len(result) == 4, "Preserve the original row count."


validate()


## KOAN 2.6: Row Selection
**Objective**: Use label-based and position-based indexing
**Difficulty**: Beginner
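
**Hint**: `.loc` selects by index label and `.iloc` selects by integer position. A sketch on made-up data, where both calls happen to pick the same row:

```python
import pandas as pd

# Illustrative frame indexed by student name.
scores = pd.DataFrame(
    {"math": [90, 75, 82], "art": [70, 88, 95]},
    index=["ann", "ben", "cho"],
)

by_label = scores.loc["ben"]  # row whose index label is "ben"
by_position = scores.iloc[1]  # second row, counting from zero

print(by_label.name)          # the row label becomes the Series name
print(by_label["math"])
```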


In [None]:
def select_row_by_label():
    '''Return the record for San Francisco using label-based indexing.'''
    df = pd.DataFrame(
        {
            "population": [880000, 420000, 390000],
            "median_age": [36.5, 37.2, 35.8],
        },
        index=pd.Index(["NY", "SF", "LA"], name="city"),
    )
    # TODO: Use label-based selection to return the SF row as a Series.
    pass


@validator.koan(6, "Row Selection", difficulty="Beginner")
def validate():
    result = select_row_by_label()
    assert isinstance(result, pd.Series), "Return the selected row as a Series."
    assert result.name == "SF", "Preserve the city label."
    assert result.loc["population"] == 420000, "Ensure population matches the SF entry."


validate()


## KOAN 2.7: Boolean Indexing
**Objective**: Filter rows using conditions
**Difficulty**: Beginner
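
**Hint**: a comparison on a column produces a boolean Series (a mask), and indexing the DataFrame with that mask keeps only the rows marked `True`. The city temperatures below are invented for illustration.

```python
import pandas as pd

# Illustrative data: one temperature per city.
df = pd.DataFrame({"city": ["Oslo", "Lima", "Cairo"], "temp_c": [4, 19, 28]})

# Step 1: build the mask, one True/False per row.
mask = df["temp_c"] > 15

# Step 2: use the mask to keep matching rows.
warm = df[mask]

print(mask.tolist())
print(warm["city"].tolist())
```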


In [None]:
def filter_high_earners():
    '''Return employees earning at least 120k.'''
    df = pd.DataFrame(
        {
            "name": ["Alice", "Bob", "Carla", "Dan"],
            "salary": [120_000, 110_000, 125_000, 105_000],
            "department": ["Data", "Sales", "Data", "HR"],
        }
    )
    # TODO: Filter employees with salary >= 120_000.
    pass


@validator.koan(7, "Boolean Indexing", difficulty="Beginner")
def validate():
    result = filter_high_earners()
    assert isinstance(result, pd.DataFrame), "Return a DataFrame of filtered rows."
    assert result.shape[0] == 2, "Two employees should meet the salary threshold."
    assert set(result["name"]) == {"Alice", "Carla"}, "Keep the matching employee names."


validate()


## KOAN 2.8: Basic Statistics
**Objective**: Compute descriptive metrics
**Difficulty**: Beginner
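
**Hint**: pandas and NumPy use different defaults for standard deviation. `Series.std()` uses `ddof=1` (sample), while `np.std()` uses `ddof=0` (population). A sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

# Illustrative data: eight values with mean 5.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()
median = values.median()
sample_std = values.std()                    # pandas default: ddof=1
population_std = np.std(values.to_numpy())   # NumPy default: ddof=0

print(mean, median)
print(sample_std, population_std)
```

The two standard deviations differ because the sample version divides by `n - 1` instead of `n`.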


In [None]:
def revenue_statistics():
    '''Return basic statistics for the revenue column.'''
    df = pd.DataFrame(
        {
            "region": ["East", "West", "East", "West", "South"],
            "revenue": [120, 150, 100, 200, 90],
        }
    )
    # TODO: Return a dictionary with mean, median, and standard deviation of revenue.
    pass


@validator.koan(8, "Basic Statistics", difficulty="Beginner")
def validate():
    result = revenue_statistics()
    assert isinstance(result, dict), "Return a dictionary of summary statistics."
    assert np.isclose(result.get("mean"), 132.0), "Compute the correct mean."
    assert result.get("median") == 120, "Compute the correct median."
    assert np.isclose(result.get("std"), np.std([120, 150, 100, 200, 90], ddof=1)), "Use sample standard deviation (ddof=1)."


validate()


## KOAN 2.9: GroupBy Operations
**Objective**: Aggregate data by category
**Difficulty**: Beginner
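
**Hint**: `groupby` splits rows by key, and selecting a column before aggregating gives a Series indexed by the group keys. A sketch on made-up order data:

```python
import pandas as pd

# Illustrative data: quantities ordered per category.
orders = pd.DataFrame(
    {"category": ["tea", "coffee", "tea", "coffee"], "qty": [2, 1, 4, 3]}
)

# Split by category, pick one column, then aggregate each group.
totals = orders.groupby("category")["qty"].sum()

print(totals)
print(totals.name)        # the aggregated column's name
print(totals.index.name)  # the grouping key's name
```

Note that the result keeps the column name as the Series name and the grouping key as the index name.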


In [None]:
def average_revenue_by_region():
    '''Return average revenue per region as a Series.'''
    df = pd.DataFrame(
        {
            "region": ["East", "West", "East", "West", "South"],
            "revenue": [120, 150, 100, 200, 90],
        }
    )
    # TODO: Group by region and compute the mean revenue for each.
    pass


@validator.koan(9, "GroupBy Operations", difficulty="Beginner")
def validate():
    result = average_revenue_by_region()
    assert isinstance(result, pd.Series), "Return a Series indexed by region."
    expected = pd.Series({"East": 110.0, "West": 175.0, "South": 90.0}).sort_index()
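    # check_names=False: a groupby result carries a Series name ("revenue") and an
    # index name ("region") that the hand-built `expected` Series lacks.
    pd.testing.assert_series_equal(result.sort_index(), expected, check_names=False)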


validate()


## KOAN 2.10: Sorting and Ranking
**Objective**: Order records by value
**Difficulty**: Beginner
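
**Hint**: `sort_values` reorders rows but keeps their original index labels, and `rank` assigns each value its position in the ordering without moving any rows. A sketch on made-up scores:

```python
import pandas as pd

# Illustrative data: points per player.
df = pd.DataFrame({"player": ["a", "b", "c"], "points": [12, 30, 21]})

# Sort from highest to lowest; the old index labels travel with the rows.
ordered = df.sort_values("points", ascending=False)
print(ordered.index.tolist())  # [1, 2, 0]

# Drop the old labels to get a fresh 0..n-1 index.
ordered_clean = ordered.reset_index(drop=True)

# Rank without reordering: 1.0 marks the highest score.
ranks = df["points"].rank(ascending=False)
print(ranks.tolist())          # [3.0, 1.0, 2.0]
```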


In [None]:
def sort_by_column():
    '''Return the DataFrame sorted by the values column in ascending order.'''
    df = pd.DataFrame({"values": [30, 10, 20]})
    # TODO: Sort the DataFrame by values from smallest to largest.
    pass


@validator.koan(10, "Sorting and Ranking", difficulty="Beginner")
def validate():
    result = sort_by_column()
    expected = pd.DataFrame({"values": [10, 20, 30]}).reset_index(drop=True)
    assert isinstance(result, pd.DataFrame), "Return a DataFrame."
    pd.testing.assert_frame_equal(result.reset_index(drop=True), expected)


validate()


## Congratulations!

You completed Pandas Essentials!


In [None]:
progress = tracker.get_notebook_progress('02_pandas_essentials')
print(f"Final Progress: {progress}%")
if progress == 100:
    print("Excellent! You mastered Pandas Essentials!")
