<div style="display: flex; justify-content: space-between; align-items: center;">
    <div style="text-align: left; flex: 4">
        <strong>Author:</strong> Amirhossein Heydari — 
        📧 <a href="mailto:amirhosseinheydari78@gmail.com">amirhosseinheydari78@gmail.com</a> — 
        🐙 <a href="https://github.com/mr-pylin/pandas-workshop" target="_blank" rel="noopener">github.com/mr-pylin</a>
    </div>
    <div style="text-align: right; flex: 1;">
        <a href="https://pandas.pydata.org/" target="_blank" rel="noopener noreferrer">
            <img src="../assets/images/pandas/logo/pandas_white.svg" 
                 alt="Pandas Logo"
                 style="max-height: 48px; width: auto; background-color: #1f1f1f; border-radius: 8px;">
        </a>
    </div>
</div>
<hr>


**Table of contents**<a id='toc0_'></a>    
- [Dependencies](#toc1_)    
- [Load Titanic Dataset](#toc2_)    
- [Data Cleaning and Transformation](#toc3_)    
  - [Handling Missing Data](#toc3_1_)    
    - [Detecting Missing Values](#toc3_1_1_)    
    - [Dropping Missing Data](#toc3_1_2_)    
    - [Filling Missing Data](#toc3_1_3_)    
    - [Interpolate Missing Values](#toc3_1_4_)    
    - [Replacing Specific Values](#toc3_1_5_)    
  - [Duplicates and Unique Values](#toc3_2_)    
    - [Finding Duplicates](#toc3_2_1_)    
    - [Removing Duplicates](#toc3_2_2_)    
    - [Getting Unique Values](#toc3_2_3_)    
    - [Counting Unique Values](#toc3_2_4_)    
  - [String Operations](#toc3_3_)    
    - [Accessing String Methods](#toc3_3_1_)    
    - [Pattern Matching with Regex](#toc3_3_2_)    
    - [Extracting or Splitting Strings](#toc3_3_3_)    
  - [Type Conversions and Categoricals](#toc3_4_)    
    - [Converting Column Types](#toc3_4_1_)    
    - [Handling Mixed Types](#toc3_4_2_)    
    - [Working with Categorical Data](#toc3_4_3_)    
    - [Optimizing Memory with Categoricals](#toc3_4_4_)    
  - [Renaming Columns and Indexes](#toc3_5_)    
    - [Renaming Columns](#toc3_5_1_)    
    - [Renaming Index or Index Levels](#toc3_5_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Dependencies](#toc0_)


In [None]:
import numpy as np
import pandas as pd

In [None]:
# disable wrapping entirely
pd.set_option("display.expand_frame_repr", False)

# <a id='toc2_'></a>[Load Titanic Dataset](#toc0_)


In [None]:
TITANIC_PATH = (
    r"https://raw.githubusercontent.com/mr-pylin/datasets/refs/heads/main/data/tabular-data/titanic/train.csv"
)
df = pd.read_csv(TITANIC_PATH, encoding="UTF8").drop(columns=["Cabin"])

In [None]:
df.head()

In [None]:
df.info()

# <a id='toc3_'></a>[Data Cleaning and Transformation](#toc0_)


## <a id='toc3_1_'></a>[Handling Missing Data](#toc0_)

- Real-world datasets often contain **missing**, **null**, or **undefined** values.
- In Pandas, missing data is usually represented by **`NaN` (Not a Number)** in numeric columns, **`None`** in object columns, and **`NaT`** in datetime columns.
- Properly handling missing data is essential for **data integrity**, **analysis accuracy**, and **model performance**.
- Pandas provides several built-in tools for **detecting**, **removing**, **filling**, and **interpolating** missing values.

✍️ **Key Concepts**

- **Detection:** Identify missing entries using `.isna()` or `.notna()`.
- **Removal:** Drop missing rows or columns with `.dropna()`.
- **Imputation:** Fill missing values with constants or computed statistics using `.fillna()`.
- **Interpolation:** Estimate missing numeric data based on trends with `.interpolate()`.
- **Replacement:** Replace specific placeholders (e.g., `"N/A"`, `"-"`) with `NaN` or valid values.

📝 **Docs**:
- `DataFrame.isna()`: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html)
- `DataFrame.dropna()`: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)
- `DataFrame.fillna()`: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html)
- `DataFrame.interpolate()`: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html)
- `DataFrame.replace()`: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html)
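
The five concepts above can be sketched end to end on a tiny frame before applying them to Titanic. The `toy` DataFrame, its column names, and the `"N/A"` / `"-"` placeholders are made up for illustration only.

```python
import numpy as np
import pandas as pd

# hypothetical toy data; "N/A" and "-" stand in for missing entries
toy = pd.DataFrame(
    {
        "age": [22.0, np.nan, 30.0, np.nan],
        "city": ["Paris", "N/A", "-", "Rome"],
    }
)

# replacement: turn placeholder strings into real missing values
toy = toy.replace(["N/A", "-"], np.nan)

# detection: count missing values per column
missing_per_column = toy.isna().sum()

# removal: keep only rows without any missing value
complete_rows = toy.dropna()

# imputation: fill each column with its own default
filled = toy.fillna({"age": toy["age"].median(), "city": "unknown"})

# interpolation: estimate numeric gaps from neighbouring values
age_interpolated = toy["age"].interpolate(method="linear")

print(missing_per_column)
print(complete_rows)
print(filled)
print(age_interpolated)
```

Replacing placeholders first matters: until `"N/A"` and `"-"` become `NaN`, `.isna()`, `.dropna()`, and `.fillna()` treat them as ordinary strings.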


### <a id='toc3_1_1_'></a>[Detecting Missing Values](#toc0_)


In [None]:
# element-wise boolean mask: True where a value is missing
df.isna()

In [None]:
# summarize missing values per column
df.isna().sum()

In [None]:
# check if any missing values exist in the entire DataFrame
# (a single .any() would instead report this per column)
df.isna().any().any()

In [None]:
# show rows that contain any missing values
df[df.isna().any(axis=1)]

In [None]:
# show rows missing a specific column
df[df["Embarked"].isna()]

### <a id='toc3_1_2_'></a>[Dropping Missing Data](#toc0_)


In [None]:
# drop rows that have any missing value
df_drop_rows = df.dropna()

# log
print(f"df_drop_rows.shape: {df_drop_rows.shape}")

In [None]:
# drop columns that have any missing value
df_drop_columns = df.dropna(axis=1)

# log
print(f"df_drop_columns.shape: {df_drop_columns.shape}")

In [None]:
# drop rows where a specific column is missing
df_drop_embarked = df.dropna(subset=["Embarked"])

# log
print(f"df_drop_embarked.shape: {df_drop_embarked.shape}")

In [None]:
# keep rows with at least a certain number of non-missing values
df_drop_thresh = df.dropna(thresh=10)

# log
print(f"df_drop_thresh.shape: {df_drop_thresh.shape}")

### <a id='toc3_1_3_'></a>[Filling Missing Data](#toc0_)


In [None]:
# fill missing values with a constant
df_fill_constant = df.fillna(0)

# log
df_fill_constant[df[["Age", "Embarked"]].isna().any(axis=1)]

In [None]:
# fill missing values in a specific column
df_fill_age = df.copy()
df_fill_age["Age"] = df["Age"].fillna(99)

# log
df_fill_age[df["Age"].isna()]

In [None]:
# fill missing values with column mean, median, or mode
df["Age"].fillna(df["Age"].mean())
df["Age"].fillna(df["Age"].median())
df["Embarked"].fillna(df["Embarked"].mode()[0])

In [None]:
# forward-fill and backward-fill (propagate nearby values)
df_ffill = df.ffill()
df_bfill = df.bfill()

In [None]:
# fill using a dictionary of column-specific values
df.fillna({"Age": df["Age"].median(), "Embarked": "S"})

### <a id='toc3_1_4_'></a>[Interpolate Missing Values](#toc0_)
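
As a small worked example of what linear interpolation computes, the sketch below uses a made-up four-element Series (not Titanic data). Two consecutive gaps between `1.0` and `4.0` are spaced evenly, and `limit=1` fills only the first `NaN` of the run.

```python
import numpy as np
import pandas as pd

# hypothetical series with two consecutive missing values
s = pd.Series([1.0, np.nan, np.nan, 4.0])

# linear: the gaps are spaced evenly between the surrounding values
linear = s.interpolate(method="linear")

# limit=1: at most one consecutive NaN is filled per gap
limited = s.interpolate(method="linear", limit=1)

print(linear.tolist())   # [1.0, 2.0, 3.0, 4.0]
print(limited.tolist())  # [1.0, 2.0, nan, 4.0]
```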


In [None]:
# simple linear interpolation for numeric columns
age_1 = df["Age"].interpolate(method="linear")

# log
age_1.iloc[26:32]

In [None]:
# interpolation using non-linear methods (method="polynomial" requires SciPy to be installed)
age_2 = df["Age"].interpolate(method="polynomial", order=2)

# log
age_2.iloc[26:32]

In [None]:
# limit the number of consecutive NaNs to fill
age_3 = df["Age"].interpolate(limit=1)

# log
age_3.iloc[26:32]

In [None]:
# directional interpolation
age_4 = df["Age"].interpolate(method="linear", limit_direction="backward", limit=1)

# log
age_4.iloc[26:32]

In [None]:
# apply interpolation to every numeric column of a DataFrame
df_numeric_interp = df.select_dtypes(include="number").interpolate(method="linear")

# log
df_numeric_interp.iloc[26:32]

### <a id='toc3_1_5_'></a>[Replacing Specific Values](#toc0_)


In [None]:
# replace a single value
df['Embarked'].replace('S', 'Southampton')

In [None]:
# replace multiple values at once
df['Embarked'].replace({'C': 'Cherbourg', 'Q': 'Queenstown'})

In [None]:
# replace numeric values
df.replace({'Pclass': {1: 'First', 2: 'Second', 3: 'Third'}})

In [None]:
# replace every value in a list with the same replacement
df.replace([np.nan, "N/A"], 0)

## <a id='toc3_2_'></a>[Duplicates and Unique Values](#toc0_)

- Real-world datasets often contain **duplicate records** or **repeated values** due to merging errors, manual entries, or data collection issues.
- Detecting and removing duplicates is an essential step in **data cleaning**, ensuring accurate **aggregations**, **statistics**, and **model training**.
- Pandas provides intuitive methods to identify, count, and drop duplicate rows or values.
- Similarly, finding **unique values** helps understand **data variability** and is often used in **categorical feature analysis**.

✍️ **Key Concepts**

- **Detect duplicates:** Use `.duplicated()` to check whether a row (or specific subset of columns) is a duplicate.
- **Remove duplicates:** Use `.drop_duplicates()` to remove duplicate rows, optionally keeping the first or last occurrence.
- **Find unique values:** Use `.unique()` for one-dimensional Series or `.nunique()` to count distinct values.
- **Subset selection:** Both `.duplicated()` and `.drop_duplicates()` support the `subset` argument to consider specific columns only.

💡 **Example scenarios:**
- Identifying users who appear multiple times in a transaction log.
- Cleaning a dataset after concatenating multiple CSVs.
- Counting how many unique categories or product IDs exist.

📝 **Docs**:
- `DataFrame.duplicated()`: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html)
- `DataFrame.drop_duplicates()`: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html)
- `Series.unique()`: [pandas.pydata.org/docs/reference/api/pandas.Series.unique.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html)
- `Series.nunique()`: [pandas.pydata.org/docs/reference/api/pandas.Series.nunique.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.nunique.html)
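
The transaction-log scenario above can be sketched as follows. The `log` DataFrame and its `user` / `item` columns are hypothetical and only illustrate how `keep` and `subset` change the result.

```python
import pandas as pd

# hypothetical transaction log
log = pd.DataFrame(
    {
        "user": ["ann", "bob", "ann", "ann"],
        "item": ["pen", "ink", "pen", "cup"],
    }
)

# True for rows that repeat an earlier, fully identical row
dup_mask = log.duplicated()

# True for every row whose user appears more than once
dup_users = log.duplicated(subset=["user"], keep=False)

# drop fully identical rows, keeping the last occurrence
deduped = log.drop_duplicates(keep="last")

# distinct users and per-column distinct counts
unique_users = log["user"].unique()
n_unique = log.nunique()

print(dup_mask.tolist())   # [False, False, True, False]
print(dup_users.tolist())  # [True, False, True, True]
print(deduped)
print(unique_users, n_unique.to_dict())
```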


### <a id='toc3_2_1_'></a>[Finding Duplicates](#toc0_)


In [None]:
# check for duplicate rows in the DataFrame
df.duplicated()

In [None]:
# find duplicates based on specific column(s)
df[df.duplicated(subset=['Survived', 'Pclass', 'Age', 'Sex', 'Embarked'])]

In [None]:
# find all duplicates based on specific column(s)
df[df.duplicated(subset=['Survived', 'Pclass', 'Age', 'Sex', 'Embarked'], keep=False)]

### <a id='toc3_2_2_'></a>[Removing Duplicates](#toc0_)


In [None]:
# remove duplicate rows based on all columns
df.drop_duplicates()

In [None]:
# remove duplicates based on specific column(s)
df.drop_duplicates(subset=['Survived', 'Pclass', 'Age', 'Sex', 'Embarked'])

### <a id='toc3_2_3_'></a>[Getting Unique Values](#toc0_)


In [None]:
# get unique values in a column
df['Pclass'].unique()

### <a id='toc3_2_4_'></a>[Counting Unique Values](#toc0_)


In [None]:
# count unique values in each column
df.nunique()

In [None]:
# count number of unique values in a column
df['Age'].nunique()

In [None]:
# count occurrences of each unique value
df['Survived'].value_counts()

In [None]:
# include NaN in counts
df['Age'].value_counts(dropna=False)

## <a id='toc3_3_'></a>[String Operations](#toc0_)

- Text data is common in datasets — such as names, categories, emails, or addresses — and Pandas provides powerful **vectorized string operations** through the `.str` accessor.
- These methods allow you to manipulate and clean textual data **efficiently**, without using loops.
- Common tasks include **case conversion**, **trimming whitespace**, **extracting substrings**, **pattern matching**, and **regular expression** operations.

✍️ **Key Concepts**

- **Access string methods:** Use `.str.<method>` to apply string functions element-wise on a Series or Index.
- **Case operations:** Convert text to lowercase, uppercase, or title case using `.str.lower()`, `.str.upper()`, `.str.title()`.
- **Whitespace handling:** Remove or strip unwanted spaces with `.str.strip()`, `.str.lstrip()`, `.str.rstrip()`.
- **Splitting and joining:** Split text into lists with `.str.split()` or join elements using `.str.join()`.
- **Substring extraction:** Use `.str.slice()` or `.str.extract()` (with regex) for substring selection.
- **Containment and replacement:** Check for patterns using `.str.contains()` and replace text via `.str.replace()`.
- **Regex support:** Most `.str` methods support powerful pattern matching through regular expressions.

💡 **Example scenarios:**
- Cleaning inconsistent capitalization in names.
- Extracting domains from email addresses.
- Detecting product codes or IDs in unstructured text.
- Normalizing whitespace and removing special symbols.

📝 **Docs**:
- `Series.str`: [pandas.pydata.org/docs/reference/api/pandas.Series.str.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html)
- `Series.str.contains()`: [pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html)
- `Series.str.replace()`: [pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html)
- `Series.str.extract()`: [pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html)
- `Series.str.split()`: [pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html)
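
Two of the scenarios above (cleaning capitalization and extracting email domains) can be sketched on made-up data. The `contacts` frame and its values are hypothetical; Titanic has no email column.

```python
import pandas as pd

# hypothetical contact list with messy names
contacts = pd.DataFrame(
    {
        "name": ["  alice SMITH ", "Bob  jones", "CAROL white"],
        "email": ["alice@example.com", "bob@mail.example.org", "carol@example.com"],
    }
)

# trim the ends, collapse repeated whitespace, then normalize the case
clean_name = (
    contacts["name"]
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.title()
)

# capture everything after the "@"; expand=False returns a Series
domain = contacts["email"].str.extract(r"@(.+)$", expand=False)

# literal substring test: regex=False keeps "." from matching any character
is_example_com = contacts["email"].str.contains("example.com", regex=False)

print(clean_name.tolist())      # ['Alice Smith', 'Bob Jones', 'Carol White']
print(domain.tolist())          # ['example.com', 'mail.example.org', 'example.com']
print(is_example_com.tolist())  # [True, False, True]
```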


### <a id='toc3_3_1_'></a>[Accessing String Methods](#toc0_)


In [None]:
# convert to lowercase
df['Name'].str.lower()

In [None]:
# convert to uppercase
df['Name'].str.upper()

In [None]:
# get length of each string
df['Name'].str.len()

In [None]:
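# check if string contains a literal substring
# (regex=False is needed: by default the pattern is a regex, so "." would match any character, e.g. "Mrs")
df["Name"].str.contains("Mr.", regex=False)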

In [None]:
# access first n characters
df['Name'].str[:10]

In [None]:
# strip leading/trailing whitespace
df['Name'].str.strip()

In [None]:
# replace literal substrings (regex=False makes the literal match explicit across pandas versions)
df["Name"].str.replace("Mr.", "Mister", regex=False)

In [None]:
# concatenate strings with another Series or string
titles = df['Name'].str.extract(r'(\w+)\.')
df["Name"] + ' (' + titles[0] + ')'

In [None]:
# remove punctuation or unwanted characters using regex
df["Name"].str.replace(r'[^\w\s]', '', regex=True)

### <a id='toc3_3_2_'></a>[Pattern Matching with Regex](#toc0_)


In [None]:
# check if a pattern exists in each string
df['Name'].str.contains(r'Mr\.', regex=True)

In [None]:
# extract a pattern using capturing groups
df['Name'].str.extract(r'(\w+)\.')  # Captures 'Mr', 'Mrs', etc.

In [None]:
# replace patterns
df['Name'].str.replace(r'Mr\.|Mrs\.', 'Title', regex=True)

In [None]:
# find all occurrences of a pattern
df['Name'].str.findall(r'\b\w+\b')

In [None]:
# match a pattern anchored at the start of each string (boolean);
# use .str.fullmatch() when the whole string must match
df['Name'].str.match(r'.*Master\..*')

### <a id='toc3_3_3_'></a>[Extracting or Splitting Strings](#toc0_)


In [None]:
# split strings by a delimiter
df['Name'].str.split(',', expand=True)

In [None]:
# split only at the first space (n=1): "Braund," | "Mr. Owen Harris"
df['Name'].str.split(' ', n=1, expand=True)

In [None]:
# get first element after split
df['Name'].str.split(' ', expand=True)[0]

In [None]:
# example of multiple captures in regex
df['Name'].str.extract(r'(?P<LastName>\w+),\s(?P<Title>\w+)\.\s(?P<FirstName>.*)')

## <a id='toc3_4_'></a>[Type Conversions and Categoricals](#toc0_)

- Ensuring that each column in your dataset has the **correct data type** (`dtype`) is critical for accurate analysis and efficient memory usage.
- Pandas supports **automatic type inference**, but manual type conversion is often required — for example, converting strings to numbers or dates.
- Additionally, **categorical data** offers a memory-efficient way to represent repeated string labels and is useful for modeling and grouping operations.

✍️ **Key Concepts**

- **Check dtypes:** Use `.dtypes` or `.info()` to inspect column data types.
- **Convert types:** Use `.astype()` to cast a Series or entire DataFrame to a different dtype (e.g., `int`, `float`, `str`, `bool`).
- **Numeric conversion:** Use `pd.to_numeric()` to safely convert strings to numbers, coercing invalid values to `NaN`.
- **Datetime conversion:** Use `pd.to_datetime()` for parsing strings into proper datetime objects.
- **Categorical conversion:** Convert string columns with limited unique values to `category` dtype for performance and memory efficiency.
- **Categorical ordering:** Use `CategoricalDtype` to specify custom category orderings for sorting or comparisons.

💡 **Benefits of Categoricals:**
- Reduced memory footprint for columns with few unique values.
- Faster groupby and comparison operations.
- Allows explicit ordering (e.g., `low < medium < high`).

📝 **Docs**:
- `DataFrame.astype()`: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html)
- `pandas.to_numeric()`: [pandas.pydata.org/docs/reference/api/pandas.to_numeric.html](https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html)
- `pandas.to_datetime()`: [pandas.pydata.org/docs/reference/api/pandas.to_datetime.html](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html)
- `CategoricalDtype`: [pandas.pydata.org/docs/reference/api/pandas.api.types.CategoricalDtype.html](https://pandas.pydata.org/docs/reference/api/pandas.api.types.CategoricalDtype.html)
- `Categorical`: [pandas.pydata.org/docs/reference/api/pandas.Categorical.html](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html)
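
Datetime parsing and ordered categoricals are not exercised on Titanic below, so here is a minimal sketch on made-up data. The `orders` frame, its columns, and the `low < medium < high` ordering are assumptions for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# hypothetical orders with string dates and string priorities
orders = pd.DataFrame(
    {
        "placed": ["2024-01-05", "2024-02-10", "2024-01-20"],
        "priority": ["medium", "low", "high"],
    }
)

# parse ISO-formatted strings into datetime64 values
orders["placed"] = pd.to_datetime(orders["placed"], format="%Y-%m-%d")

# declare an explicit order: low < medium < high
priority_dtype = CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
orders["priority"] = orders["priority"].astype(priority_dtype)

# sorting and comparisons now follow the declared order, not the alphabet
by_priority = orders.sort_values("priority")
at_least_medium = orders[orders["priority"] >= "medium"]

print(orders.dtypes)
print(by_priority["priority"].tolist())  # ['low', 'medium', 'high']
print(at_least_medium)
```

Without the ordered dtype, sorting the raw strings would give `high, low, medium`, and `>=` would compare alphabetically.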


### <a id='toc3_4_1_'></a>[Converting Column Types](#toc0_)


In [None]:
# convert 'Pclass' from int64 to string
df["Pclass"].astype(str)

In [None]:
# convert 'Fare' to float (if not already)
df["Fare"].astype(float)

In [None]:
# convert 'Survived' to boolean
df["Survived"].astype(bool)

In [None]:
# convert to a pandas-specific type (here a Categorical)
pd.Categorical(df["Embarked"])

In [None]:
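# errors="ignore" suppresses the exception and returns the original Series unchanged
# ('Age' contains NaN, so the cast to int fails and the result is still float64)
df["Age"].astype(int, errors="ignore")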

### <a id='toc3_4_2_'></a>[Handling Mixed Types](#toc0_)
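
The Titanic columns are already uniformly typed, so the cells below mostly show what happens when there is nothing to fix. For contrast, here is a sketch on a made-up column that really does mix strings, numbers, and `None`. The `raw` Series and its values are hypothetical.

```python
import pandas as pd

# hypothetical column as it might arrive from a messy export
raw = pd.Series(["12", "7.5", "n/a", 3, None], dtype=object)

# parse what can be parsed; anything else becomes NaN
numeric = pd.to_numeric(raw, errors="coerce")

# entries that were present but could not be parsed
unparsed = raw[numeric.isna() & raw.notna()]

print(numeric.tolist())   # [12.0, 7.5, nan, 3.0, nan]
print(unparsed.tolist())  # ['n/a']
```

Comparing the coerced result against the original is a simple way to list the offending values before deciding how to handle them.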


In [None]:
df

In [None]:
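# convert to a numeric type; with errors='coerce', unparseable values become NaN
# ('Sex' holds only the strings "male"/"female", so every entry becomes NaN)
pd.to_numeric(df['Sex'], errors='coerce')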

In [None]:
# detect non-numeric entries (empty here, because 'Age' is already float64)
df['Age'][~df['Age'].apply(lambda x: isinstance(x, (int, float)))]

### <a id='toc3_4_3_'></a>[Working with Categorical Data](#toc0_)


In [None]:
# convert a column to categorical
categorical = pd.Categorical(df['Pclass'], ordered=True)
categorical

In [None]:
# check categories
categorical.categories

In [None]:
# reorder categories
categorical.reorder_categories([1, 2, 3], ordered=True)

In [None]:
# rename categories
categorical.rename_categories({1: "1st", 2: "2nd", 3: "3rd"})

In [None]:
# add a new category
categorical.add_categories(['Unknown'])

In [None]:
# remove unused categories
categorical.remove_unused_categories()

### <a id='toc3_4_4_'></a>[Optimizing Memory with Categoricals](#toc0_)


In [None]:
# check memory usage before conversion
print(df['Embarked'].memory_usage(deep=True))

In [None]:
# convert object column to categorical
categorical_embarked = df['Embarked'].astype('category')

# check memory usage after conversion
print(categorical_embarked.memory_usage(deep=True))

## <a id='toc3_5_'></a>[Renaming Columns and Indexes](#toc0_)

- Renaming columns and indexes is an important part of **data cleaning**, helping make your dataset more **readable**, **consistent**, and **analysis-friendly**.
- While Pandas allows you to rename during creation, real-world datasets often require renaming after loading or merging.

✍️ **Key Concepts**

- **Rename columns:** Use `.rename(columns={old_name: new_name})` to rename one or multiple columns.
- **Rename index:** Use `.rename(index={old_index: new_index})` to rename specific row labels.
- **Rename index levels:** Use `.rename_axis()` to set or change the **name** of an index, or the names of the levels of a MultiIndex.
- **In-place modification:** Pass `inplace=True` to modify the existing DataFrame (the method then returns `None`) instead of returning a new one.
- **Other transformations:** Add prefixes or suffixes with `.add_prefix()` and `.add_suffix()` for consistent naming conventions.

💡 **Best Practices:**
- Use **meaningful, descriptive names** for columns and indexes.
- Keep names **short but informative** for better readability.
- Avoid spaces or special characters when possible — consider using underscores.
- Standardize capitalization (e.g., all lowercase) for consistency.

📝 **Docs**:
- `DataFrame.rename()`: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)
- `DataFrame.rename_axis()`: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename_axis.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename_axis.html)
- `DataFrame.add_prefix()`: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.add_prefix.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.add_prefix.html)
- `DataFrame.add_suffix()`: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.add_suffix.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.add_suffix.html)
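
The cells below cover `.rename()` on Titanic. As a complement, this sketch shows `.rename_axis()` and `.add_prefix()` on a made-up frame with a two-level index. The `scores` data and all of its names are hypothetical.

```python
import pandas as pd

# hypothetical scores with an unnamed two-level index
scores = pd.DataFrame(
    {"Math Score": [90, 75], "Art Score": [60, 88]},
    index=pd.MultiIndex.from_tuples([("2023", "ann"), ("2024", "bob")]),
)

# name the index levels (the row labels themselves are unchanged)
scores = scores.rename_axis(index=["year", "student"])

# standardize column labels: lowercase with underscores
scores = scores.rename(columns=lambda c: c.lower().replace(" ", "_"))

# add a common prefix to every column label
prefixed = scores.add_prefix("raw_")

print(list(scores.index.names))  # ['year', 'student']
print(list(scores.columns))      # ['math_score', 'art_score']
print(list(prefixed.columns))    # ['raw_math_score', 'raw_art_score']
```

Note the difference: `.rename(index=...)` changes row labels, while `.rename_axis()` changes the names of the index levels.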


### <a id='toc3_5_1_'></a>[Renaming Columns](#toc0_)


In [None]:
# rename a single column
df.rename(columns={'Pclass':'PassengerClass'}, inplace=False)

In [None]:
# rename multiple columns
df.rename(columns={'Name':'FullName', 'Age':'PassengerAge', 'Fare':'TicketFare'}, inplace=False)

In [None]:
# using a function to rename columns
df.rename(columns=str.lower, inplace=False)

### <a id='toc3_5_2_'></a>[Renaming Index or Index Levels](#toc0_)


In [None]:
# rename index labels
df.rename(index={0:'first_row', 1:'second_row'}, inplace=False)

In [None]:
# rename all index labels using a function
df.rename(index=lambda x: f'Row_{x}', inplace=False)