<div style="display: flex; justify-content: space-between; align-items: center;">
    <div style="text-align: left; flex: 4">
        <strong>Author:</strong> Amirhossein Heydari — 
        📧 <a href="mailto:amirhosseinheydari78@gmail.com">amirhosseinheydari78@gmail.com</a> — 
        🐙 <a href="https://github.com/mr-pylin/pandas-workshop" target="_blank" rel="noopener">github.com/mr-pylin</a>
    </div>
    <div style="text-align: right; flex: 1;">
        <a href="https://pandas.pydata.org/" target="_blank" rel="noopener noreferrer">
            <img src="../assets/images/pandas/logo/pandas_white.svg" 
                 alt="Pandas Logo"
                 style="max-height: 48px; width: auto; background-color: #1f1f1f; border-radius: 8px;">
        </a>
    </div>
</div>
<hr>


**Table of contents**<a id='toc0_'></a>    
- [Dependencies](#toc1_)    
- [Create DataFrames](#toc2_)    
- [Indexing, Selection, and Filtering](#toc3_)    
  - [Label-based Indexing](#toc3_1_)    
    - [Select rows by label](#toc3_1_1_)    
    - [Select rows & specific columns](#toc3_1_2_)    
    - [Assigning values with `.loc`](#toc3_1_3_)    
  - [Integer-based Indexing](#toc3_2_)    
    - [Select rows by position](#toc3_2_1_)    
    - [Select rows & specific columns](#toc3_2_2_)    
    - [Using negative indices](#toc3_2_3_)    
    - [Assigning with `.iloc`](#toc3_2_4_)    
  - [Direct Indexing](#toc3_3_)    
    - [Single column](#toc3_3_1_)    
    - [Multiple columns](#toc3_3_2_)    
    - [Row slicing](#toc3_3_3_)    
  - [Boolean Indexing and Conditions](#toc3_4_)    
    - [Basics of Boolean Indexing](#toc3_4_1_)    
    - [Combining Multiple Conditions](#toc3_4_2_)    
    - [Filtering with `.isin()`](#toc3_4_3_)    
    - [Filtering with `.between()`](#toc3_4_4_)    
    - [Using `.query()` as alternative](#toc3_4_5_)    
  - [Setting and Resetting Index](#toc3_5_)    
    - [Setting an Index](#toc3_5_1_)    
    - [Resetting an Index](#toc3_5_2_)    
  - [Chained Indexing vs. Single Indexing](#toc3_6_)    
  - [MultiIndex (Hierarchical Indexing)](#toc3_7_)    
    - [Creating a MultiIndex](#toc3_7_1_)    
    - [Selecting with MultiIndex](#toc3_7_2_)    
    - [Swapping and Sorting Levels](#toc3_7_3_)    
    - [Resetting MultiIndex](#toc3_7_4_)    
  - [Renaming & Naming Index/Columns](#toc3_8_)    
    - [Naming Indexes](#toc3_8_1_)    
    - [Renaming Indexes](#toc3_8_2_)    
    - [Renaming Columns](#toc3_8_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Dependencies](#toc0_)


In [None]:
import pandas as pd

In [None]:
# disable wrapping entirely
pd.set_option("display.expand_frame_repr", False)

# <a id='toc2_'></a>[Create DataFrames](#toc0_)


In [None]:
df1 = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Age": [24, 27, 22, 32],
        "City": ["NY", "LA", "Paris", "London"],
    },
    index=["a", "b", "c", "d"],
)

# log
df1.head(5)

In [None]:
MOVIES_PATH = (
    r"https://raw.githubusercontent.com/mr-pylin/datasets/refs/heads/main/data/tabular-data/movies/csv/dataset.csv"
)
df2 = pd.read_csv(MOVIES_PATH, encoding="UTF-8")

# log
df2.head(5)

# <a id='toc3_'></a>[Indexing, Selection, and Filtering](#toc0_)

- One of the **most powerful features** in Pandas is its **flexible indexing system**.
- Indexing controls **how you access, select, and modify data** inside a `DataFrame` or `Series`.
- Pandas provides multiple approaches:
  - **Label-based indexing** (`.loc`)
  - **Integer-based indexing** (`.iloc`)
  - **Boolean/conditional selection**
  - **Fancy indexing with lists, slices, and masks**
  - **Resetting or changing the index**

<div style="text-align: center; padding-top: 10px;">
    <img src="../assets/images/pandas/tutorials/03/subset_rows.svg" alt="Select Specific Rows" style="min-width: 256px; max-height: 40%; width: auto; background-color: #DBDBDB; border-radius: 16px;">
    <p><em>Figure 1: Select Specific Rows from a <code>DataFrame</code></em> (<a href="https://pandas.pydata.org/docs/getting_started/intro_tutorials/" target="_blank">source</a>)</p>
</div>

<div style="text-align: center; padding-top: 10px;">
    <img src="../assets/images/pandas/tutorials/03/subset_columns.svg" alt="Select Specific Columns" style="min-width: 256px; max-height: 40%; width: auto; background-color: #DBDBDB; border-radius: 16px;">
    <p><em>Figure 2: Select Specific Columns from a <code>DataFrame</code></em> (<a href="https://pandas.pydata.org/docs/getting_started/intro_tutorials/" target="_blank">source</a>)</p>
</div>

<div style="text-align: center; padding-top: 10px;">
    <img src="../assets/images/pandas/tutorials/03/subset_columns_rows.svg" alt="Select Rows and Columns" style="min-width: 256px; max-height: 40%; width: auto; background-color: #DBDBDB; border-radius: 16px;">
    <p><em>Figure 3: Select Specific Rows and Columns from a <code>DataFrame</code></em> (<a href="https://pandas.pydata.org/docs/getting_started/intro_tutorials/" target="_blank">source</a>)</p>
</div>

📝 **Docs**:

- Indexing and selecting data: [pandas.pydata.org/docs/user_guide/indexing.html](https://pandas.pydata.org/docs/user_guide/indexing.html)
- Select a subset: [pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html)


## <a id='toc3_1_'></a>[Label-based Indexing](#toc0_)

- The `.loc` indexer is used for **label-based selection** of rows and columns in Pandas.
- You specify **row labels** and **column labels** instead of numerical positions.
- It works with **strings, dates, categories**, or any custom index you assign.

✍️ **Key Characteristics**

- **Inclusive slicing:** Unlike Python slicing, both the **start** and **end labels** are included.
- **Flexible input:** Accepts single labels, lists of labels, slices, or boolean masks.
- **Two-dimensional:** First part is **rows**, second part is **columns** → `df.loc[row_labels, col_labels]`.

📝 **Docs**

- Selection by label: [pandas.pydata.org/docs/user_guide/indexing.html#selection-by-label](https://pandas.pydata.org/docs/user_guide/indexing.html#selection-by-label)
- `pandas.DataFrame.loc`: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)
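
The characteristics above can be sketched on a tiny frame. The `people` DataFrame below is a hypothetical stand-in (not one of the notebook's datasets), used only to show an inclusive label slice next to a boolean mask passed to `.loc`:

```python
import pandas as pd

# hypothetical toy data, used only for this illustration
people = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "David"], "Age": [24, 27, 22, 32]},
    index=["a", "b", "c", "d"],
)

# label slice: both end points ("a" and "c") are included
label_slice = people.loc["a":"c"]

# boolean mask for rows, single label for columns
older = people.loc[people["Age"] > 25, "Name"]

print(label_slice.index.tolist())  # ['a', 'b', 'c']
print(older.tolist())              # ['Bob', 'David']
```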

### <a id='toc3_1_1_'></a>[Select rows by label](#toc0_)


In [None]:
# single row [series object]
print(df1.loc["a"])

In [None]:
# single row [dataframe object]
print(df1.loc[["a"]])

In [None]:
# multiple rows
print(df1.loc[["a", "c"]])

In [None]:
# slice of rows (label inclusive ✅)
print(df1.loc["a":"c"])

### <a id='toc3_1_2_'></a>[Select rows & specific columns](#toc0_)


In [None]:
# row 'a', column 'Name'
print(df1.loc["a", "Name"])

In [None]:
# rows 'a' to 'c', only 'Name' and 'City'
print(df1.loc["a":"c", ["Name", "City"]])

In [None]:
# all rows, multiple columns
print(df1.loc[:, ["Name", "City"]])

### <a id='toc3_1_3_'></a>[Assigning values with `.loc`](#toc0_)


In [None]:
# change Bob's age
df1.loc["b", "Age"] = 30
print(df1)

In [None]:
# update multiple rows at once
df1.loc["a":"c", "City"] = "Unknown"
print(df1)

## <a id='toc3_2_'></a>[Integer-based Indexing](#toc0_)

- The `.iloc` indexer is used for **integer position-based selection** of rows and columns.
- Instead of labels, it relies purely on **numerical positions**, similar to standard Python and NumPy indexing.
- This makes `.iloc` very predictable, especially when working with **default integer indexes** or when labels are unknown.

✍️ **Key Characteristics**

- **Zero-based indexing:** Positions start at `0` for the first row/column.
- **Exclusive slicing:** Like Python, the **end position is excluded** in slices.
- **Flexible input:** Accepts single integers, lists, NumPy arrays, or slices.
- **Two-dimensional:** `df.iloc[row_positions, col_positions]`.

⚠️ **Common Pitfall**

- `.iloc` is purely positional and ignores index labels. If your index labels are numeric (e.g., `0, 1, 2`), remember that `.iloc[0]` selects the **first row by position**, not necessarily the row with label `0`.

📝 **Docs**

- Selection by position: [pandas.pydata.org/docs/user_guide/indexing.html#selection-by-position](https://pandas.pydata.org/docs/user_guide/indexing.html#selection-by-position)
- `pandas.DataFrame.iloc`: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html)
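
A minimal sketch of the pitfall above, assuming a hypothetical `scores` frame whose integer labels are deliberately out of order so that position and label disagree:

```python
import pandas as pd

# hypothetical frame: integer labels that do not match row positions
scores = pd.DataFrame({"score": [10, 20, 30]}, index=[2, 0, 1])

by_position = scores.iloc[0, 0]     # first row by position (label 2)
by_label = scores.loc[0, "score"]   # row whose label is 0 (second row)

print(by_position)  # 10
print(by_label)     # 20
```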


### <a id='toc3_2_1_'></a>[Select rows by position](#toc0_)


In [None]:
# first row (position 0) [series object]
print(df2.iloc[0])

In [None]:
# first row (position 0) [dataframe object]
print(df2.iloc[[0]])

In [None]:
# multiple rows by positions
print(df2.iloc[[0, 2]])

In [None]:
# slice (end exclusive ❌)
print(df2.iloc[0:2])

### <a id='toc3_2_2_'></a>[Select rows & specific columns](#toc0_)


In [None]:
# row 0, column 1 (Genre)
print(df2.iloc[0, 1])

In [None]:
# rows 0–2, columns 0 and 2 (Film, Lead Studio)
print(df2.iloc[0:3, [0, 2]])

### <a id='toc3_2_3_'></a>[Using negative indices](#toc0_)


In [None]:
# last row
print(df2.iloc[-1])

In [None]:
# last two rows
print(df2.iloc[-2:])

### <a id='toc3_2_4_'></a>[Assigning with `.iloc`](#toc0_)


In [None]:
# Change value at row 1, column 1 (Genre)
df2.iloc[1, 1] = "Romance"
print(df2.head())

In [None]:
# Change last two rows, column 1 (Genre)
df2.iloc[-2:, 1] = "Drama"
print(df2.tail())

## <a id='toc3_3_'></a>[Direct Indexing](#toc0_)

- Direct indexing with square brackets (`[]`) is the **most common** way to select columns in a Pandas `DataFrame`.
- However, the behavior depends on whether you use **single brackets** or **double brackets**.

✍️ **Key Characteristics**

- `df['col']` → returns a **Series** (1D object).
- `df[['col']]` → returns a **DataFrame** (2D object).
- `df[['col1', 'col2']]` → selects **multiple columns** as a DataFrame.
- `df[0:3]` → when using a slice, selects **rows by integer position** (similar to `.iloc`).

⚠️ **Common Pitfall**

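- `df[[0, 2, 4]]` ❌ → does **not** select rows. A list inside `[]` is looked up among the **column labels**, so this raises a `KeyError` unless columns with those labels exist. Select rows with `.iloc[[0, 2, 4]]` (by position) or `.loc` (by label).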
- Always remember: **single/double brackets affect dimensionality** (Series vs. DataFrame).

📝 **Docs**

- Basics: [pandas.pydata.org/docs/user_guide/indexing.html#basics](https://pandas.pydata.org/docs/user_guide/indexing.html#basics)
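
The bracket rules above can be checked on a small, hypothetical `films` frame (the column names are assumptions chosen to resemble the movies dataset):

```python
import pandas as pd

# hypothetical toy data
films = pd.DataFrame({"Film": ["A", "B", "C"], "Year": [2008, 2009, 2010]})

col_series = films["Film"]    # single brackets -> Series
col_frame = films[["Film"]]   # double brackets -> DataFrame

print(type(col_series).__name__, col_series.shape)  # Series (3,)
print(type(col_frame).__name__, col_frame.shape)    # DataFrame (3, 1)

# a list inside [] is treated as column labels, so integers fail here
try:
    films[[0, 2]]
    list_lookup_failed = False
except KeyError:
    list_lookup_failed = True

# positional row selection belongs to .iloc
rows = films.iloc[[0, 2]]
print(list_lookup_failed)          # True
print(rows["Film"].tolist())       # ['A', 'C']
```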


### <a id='toc3_3_1_'></a>[Single column](#toc0_)


In [None]:
print(df2["Film"], end="\n\n")
print(type(df2["Film"]))

In [None]:
print(df2[["Film"]], end="\n\n")
print(type(df2[["Film"]]))

### <a id='toc3_3_2_'></a>[Multiple columns](#toc0_)


In [None]:
print(df2[["Film", "Genre"]], end='\n\n')
print(type(df2[["Film", "Genre"]]))

### <a id='toc3_3_3_'></a>[Row slicing](#toc0_)


In [None]:
print(df2[0:2])

## <a id='toc3_4_'></a>[Boolean Indexing and Conditions](#toc0_)

- Boolean indexing is one of the most powerful features in pandas.
- It allows you to **filter rows or columns** based on logical conditions, similar to SQL `WHERE` clauses.

📝 **Docs**

- Boolean indexing: [pandas.pydata.org/docs/user_guide/indexing.html#boolean-indexing](https://pandas.pydata.org/docs/user_guide/indexing.html#boolean-indexing)


### <a id='toc3_4_1_'></a>[Basics of Boolean Indexing](#toc0_)

- When you apply a condition on a DataFrame/Series, pandas returns a **Boolean mask** (True/False values) that you can use for filtering.


In [None]:
# condition -> returns a Boolean Series
mask = df2["Year"] > 2010
print(mask)

In [None]:
# use mask to filter rows
print(df2[mask])

### <a id='toc3_4_2_'></a>[Combining Multiple Conditions](#toc0_)

- Use `&` for AND, `|` for OR, and `~` for NOT.
- Wrap each condition in parentheses: `&` and `|` bind more tightly than comparison operators such as `==` and `>`.
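
To see why the parentheses matter, here is a small sketch on a hypothetical numeric frame (`films`, `Year` and `Score` are illustrative names, not the notebook's dataset):

```python
import pandas as pd

# hypothetical toy data
films = pd.DataFrame({"Year": [2008, 2010, 2011], "Score": [70, 85, 90]})

# with parentheses: each comparison is evaluated first, then combined
mask = (films["Year"] > 2009) & (films["Score"] > 80)
print(films[mask].index.tolist())   # [1, 2]

# ~ inverts the mask
print(films[~mask].index.tolist())  # [0]

# without parentheses Python reads this as the chained comparison
# films["Year"] > (2009 & films["Score"]) > 80, which needs the truth
# value of a whole Series and therefore fails
try:
    films["Year"] > 2009 & films["Score"] > 80
    failed = False
except ValueError:
    failed = True
print(failed)  # True
```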


In [None]:
# Drama films released in 2011
mask = (df2["Year"] == 2011) & (df2["Genre"] == "Drama")
print(df2[mask])


In [None]:
# Films with Profitability > 10.0 OR Audience score % > 88
mask = (df2["Profitability"] > 10.0) | (df2["Audience score %"] > 88)
print(df2[mask])


In [None]:
# Films NOT from Romance Genre
print(df2[~(df2["Genre"] == "Romance")])


### <a id='toc3_4_3_'></a>[Filtering with `.isin()`](#toc0_)


In [None]:
# Drama or Animation films
print(df2[df2["Genre"].isin(["Drama", "Animation"])])

### <a id='toc3_4_4_'></a>[Filtering with `.between()`](#toc0_)


In [None]:
# Films between 2008 and 2010 (inclusive)
print(df2[df2["Year"].between(2008, 2010)])


### <a id='toc3_4_5_'></a>[Using `.query()` as alternative](#toc0_)


In [None]:
# SQL-like syntax
print(df2.query("Year > 2010 and Genre == 'Romance'"))
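
Two `.query()` features are worth a short sketch: referencing a local Python variable with `@`, and quoting column names that contain spaces with backticks. The `films` frame and `min_year` variable below are hypothetical and only serve the illustration:

```python
import pandas as pd

# hypothetical toy data with a column name containing a space
films = pd.DataFrame(
    {
        "Film": ["A", "B", "C"],
        "Lead Studio": ["Fox", "Disney", "Fox"],
        "Year": [2008, 2010, 2011],
    }
)

min_year = 2009  # local variable, referenced as @min_year inside the query

result = films.query("Year > @min_year and `Lead Studio` == 'Fox'")
print(result["Film"].tolist())  # ['C']
```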

## <a id='toc3_5_'></a>[Setting and Resetting Index](#toc0_)

- Pandas allows you to **control the row labels (index)** of a DataFrame.
- This is useful for **organizing, filtering, and joining data** efficiently.


### <a id='toc3_5_1_'></a>[Setting an Index](#toc0_)

- Use `set_index()` to make **one or more columns the index**.
- This can make **row selection more meaningful**, e.g., using dates, names, or IDs as labels.
- Can create a **hierarchical (MultiIndex)** by passing multiple columns.

✍️ **Key Points**  
- Original columns can be **dropped** or **kept** in the DataFrame.
- Index labels drive **data alignment** between objects, and a unique index allows fast **label lookups**.

📝 **Docs:**

- `pandas.DataFrame.set_index`: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html)
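
The "dropped or kept" point above corresponds to the `drop` parameter of `set_index()`. A minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

# hypothetical toy data
films = pd.DataFrame({"Film": ["A", "B"], "Year": [2008, 2010]})

dropped = films.set_index("Film")            # default: 'Film' leaves the columns
kept = films.set_index("Film", drop=False)   # 'Film' is both index and column

print(dropped.columns.tolist())  # ['Year']
print(kept.columns.tolist())     # ['Film', 'Year']
```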


In [None]:
# set "Film" as index (returns a new DataFrame by default)
df3 = df2.set_index("Film")

# log
print(df3.head(), end="\n\n")
print(df3.index[0])

In [None]:
# set ["Film", "Genre"] as index (returns a new DataFrame by default)
df4 = df2.set_index(["Film", "Genre"])

# log
print(df4.head(), end="\n\n")
print(df4.index[0])

In [None]:
# set "Film" as index in-place
df2.set_index("Film", inplace=True)

# log
print(df2.head(), end="\n\n")
print(df2.index[0])

### <a id='toc3_5_2_'></a>[Resetting an Index](#toc0_)

- Use `reset_index()` to restore the **default integer index**.
- Original index can be turned into a column or discarded.
- Often used after **data manipulations** like filtering or joining.

✍️ **Key Points**
- Resets **single-level** or **MultiIndex** back to flat structure.
- Useful for returning data to a **standard tabular format**.

📝 **Docs:**

- `pandas.DataFrame.reset_index`: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html)
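
A typical case is filtering, which leaves gaps in the integer index. The sketch below uses a hypothetical `films` frame to compare keeping the old index as a column with discarding it:

```python
import pandas as pd

# hypothetical toy data
films = pd.DataFrame({"Year": [2008, 2010, 2011, 2012]})

recent = films[films["Year"] > 2009]
print(recent.index.tolist())       # [1, 2, 3]

# old index becomes a column named 'index'
with_old = recent.reset_index()
print(with_old.columns.tolist())   # ['index', 'Year']

# old index is discarded
renumbered = recent.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1, 2]
```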


In [None]:
# reset index back to default integer index
df2 = df2.reset_index()
print(df2.head())

In [None]:
# reset index and discard the old index instead of keeping it as a column
df3 = df3.reset_index(drop=True)
print(df3.head())

## <a id='toc3_6_'></a>[Chained Indexing vs. Single Indexing](#toc0_)

- Chained indexing occurs when you use **multiple indexing operations in sequence**, like `df['col'][0]`.
- While it may work, it can lead to **unexpected behavior** and is generally **not recommended**.

⚠️ **Why Avoid Chained Indexing**

- Can cause **silent bugs** when modifying data.
- Makes **debugging harder** because pandas may create a temporary object.
- Not guaranteed to work consistently across different pandas versions.

✅ **Best Practices**

- Prefer **`.loc` or `.iloc`** for any row and column selection in **one step**.
- Use chained indexing **only for read-only operations** if necessary.
- Think of **single indexing** as the “safe and reliable” method.

📝 **Docs:**

- Returning a view versus a copy: [pandas.pydata.org/docs/user_guide/indexing.html#indexing-view-versus-copy](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-view-versus-copy)


In [None]:
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [10, 20, 30, 40],
    "C": [100, 200, 300, 400]
}, index=["a", "b", "c", "d"])

print(df)

In [None]:
# try to modify via chained indexing
# ⚠️ emits a warning (SettingWithCopyWarning in pandas 2.x) and leaves df unchanged
df[df["A"] > 2]["C"] = 999

In [None]:
# use .loc instead
df.loc[df["A"] > 2, "C"] = 999
print(df)
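
When the goal is to modify a filtered subset without touching the original, one option is to take an explicit copy first. This is a sketch under that assumption; `frame` and `subset` are hypothetical names:

```python
import pandas as pd

# hypothetical toy data
frame = pd.DataFrame({"A": [1, 2, 3, 4], "C": [100, 200, 300, 400]})

# explicit, independent copy of the filtered rows
subset = frame[frame["A"] > 2].copy()

# single-step assignment on the copy
subset.loc[:, "C"] = 999

print(subset["C"].tolist())  # [999, 999]
print(frame["C"].tolist())   # [100, 200, 300, 400]
```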

## <a id='toc3_7_'></a>[MultiIndex (Hierarchical Indexing)](#toc0_)

- Pandas supports **multi-level (hierarchical) indexing**, known as **MultiIndex**, which allows working with higher-dimensional data in a 2D `DataFrame`.

📝 **Docs:**

- MultiIndex / advanced indexing: [pandas.pydata.org/docs/user_guide/advanced.html](https://pandas.pydata.org/docs/user_guide/advanced.html)


### <a id='toc3_7_1_'></a>[Creating a MultiIndex](#toc0_)


In [None]:
data = {
    "Sales": [200, 220, 180, 210, 500, 480],
    "Revenue": [2000, 2200, 1800, 2100, 5000, 4800],
}

In [None]:
# MultiIndex from tuples: (Store, Product)
index = pd.MultiIndex.from_tuples(
    [
        ("Store_A", "Apples"),
        ("Store_A", "Bananas"),
        ("Store_A", "Oranges"),
        ("Store_B", "Apples"),
        ("Store_B", "Bananas"),
        ("Store_B", "Oranges"),
    ],
    names=("Store", "Product"),
)
index
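
When every combination of the level values is present, as it is here, `pd.MultiIndex.from_product` is a shorter alternative to listing the tuples. The sketch below rebuilds the same six pairs; the variable names are hypothetical:

```python
import pandas as pd

# build the index from the Cartesian product of the two levels
product_index = pd.MultiIndex.from_product(
    [["Store_A", "Store_B"], ["Apples", "Bananas", "Oranges"]],
    names=["Store", "Product"],
)

# the same index written out as explicit tuples
tuple_index = pd.MultiIndex.from_tuples(
    [
        ("Store_A", "Apples"),
        ("Store_A", "Bananas"),
        ("Store_A", "Oranges"),
        ("Store_B", "Apples"),
        ("Store_B", "Bananas"),
        ("Store_B", "Oranges"),
    ],
    names=["Store", "Product"],
)

print(product_index.equals(tuple_index))  # True
```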

In [None]:
df = pd.DataFrame(data, index=index)
df

### <a id='toc3_7_2_'></a>[Selecting with MultiIndex](#toc0_)


In [None]:
# select all products for a single store
df.loc["Store_A"]

In [None]:
# select a single product in a store (the tuple is one row key across both levels)
df.loc[("Store_A", "Apples"), :]

In [None]:
# select across a level (all stores for 'Apples')
df.xs("Apples", level="Product")

In [None]:
# slice multiple values
df.loc[pd.IndexSlice[:, ["Apples", "Oranges"]], :]

### <a id='toc3_7_3_'></a>[Swapping and Sorting Levels](#toc0_)


In [None]:
# swap levels (Store ↔ Product)
df_swapped = df.swaplevel("Store", "Product")
df_swapped

In [None]:
# sort by index
df_sorted = df_swapped.sort_index(ascending=True)
df_sorted

In [None]:
# sort the original frame by the "Product" level
df_sorted_orig = df.sort_index(level="Product")
df_sorted_orig

### <a id='toc3_7_4_'></a>[Resetting MultiIndex](#toc0_)


In [None]:
# reset all index levels
df_reset = df.reset_index()
df_reset

In [None]:
# reset only one level of the index
df_reset_store = df.reset_index(level="Store")
df_reset_store

In [None]:
# move only the "Product" level into a column (drop=False is the default)
df_reset_keep = df.reset_index(level="Product", drop=False)
df_reset_keep

## <a id='toc3_8_'></a>[Renaming & Naming Index/Columns](#toc0_)

- Pandas allows you to **name and rename** both indexes and columns to make your data more readable and meaningful.
- Proper naming helps with **selection, filtering, and readability** in large datasets.

📝 **Docs:**

- `pandas.DataFrame.rename`: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)
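
Note that `rename()` returns a new object by default and leaves the original untouched. A minimal sketch on a hypothetical `films` frame, showing both a dictionary mapping and a function mapper:

```python
import pandas as pd

# hypothetical toy data
films = pd.DataFrame({"Film": ["A", "B"], "Lead Studio": ["Fox", "Disney"]})

# dictionary mapping for columns and row labels
renamed = films.rename(columns={"Lead Studio": "Studio"}, index={0: "first"})
print(renamed.columns.tolist())  # ['Film', 'Studio']
print(renamed.index.tolist())    # ['first', 1]

# the original frame is unchanged
print(films.columns.tolist())    # ['Film', 'Lead Studio']

# a function is applied to every label
upper = films.rename(columns=str.upper)
print(upper.columns.tolist())    # ['FILM', 'LEAD STUDIO']
```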


### <a id='toc3_8_1_'></a>[Naming Indexes](#toc0_)


In [None]:
# assign names to index
df2.index.set_names("New Index", inplace=True)
df2

In [None]:
# name the columns axis
df2.columns.name = "Columns"
print(df2)

### <a id='toc3_8_2_'></a>[Renaming Indexes](#toc0_)

In [None]:
# rename specific row labels with rename() (returns a new DataFrame)
df2.rename(index={0: "a", 1: "b"})

### <a id='toc3_8_3_'></a>[Renaming Columns](#toc0_)

In [None]:
# rename specific column labels with rename() (returns a new DataFrame)
df2.rename(columns={"Lead Studio": "Studio"})