# Introduction to Pandas

Pandas is one of the most widely used Python libraries for data analysis and manipulation. It provides high-performance, easy-to-use data structures and data analysis tools built on top of NumPy.

The name "pandas" is derived from "panel data", an econometrics term for multidimensional structured datasets.

**Official Documentation:** https://pandas.pydata.org/docs/

## Why Pandas?

### Advantages:
1. **Easy Data Handling**: Work with structured data intuitively
2. **Data Cleaning**: Handle missing data, duplicates, and inconsistencies
3. **Data Transformation**: Reshape, merge, join, and pivot data easily
4. **Time Series**: Built-in support for time series data
5. **Integration**: Works seamlessly with NumPy, Matplotlib, and other libraries
6. **I/O Operations**: Read/write CSV, Excel, SQL, JSON, and more
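
As a minimal sketch of the I/O round trip, the snippet below writes a small DataFrame to CSV text and reads it back. It uses an in-memory `io.StringIO` buffer in place of a file on disk, and the column names and values are made up for illustration.

```python
import io
import pandas as pd

# Illustrative data (not from a real dataset)
inventory = pd.DataFrame({
    'item': ['pen', 'notebook', 'stapler'],
    'quantity': [120, 45, 8],
})

# Write to CSV text; index=False leaves out the row labels
csv_text = inventory.to_csv(index=False)

# Read the CSV text back into a new DataFrame
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```

With a real file you would pass a path instead of the buffer, for example `inventory.to_csv('inventory.csv', index=False)` (the file name here is hypothetical).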

### Key Features:
- DataFrame and Series data structures
- Intelligent data alignment
- Flexible grouping and aggregation
- Built-in plotting through Matplotlib (`df.plot()`)
- Efficient indexing and selection
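
To make "data alignment" concrete, here is a small sketch with made-up values: arithmetic between two Series matches entries by index label rather than by position.

```python
import pandas as pd

# Two Series whose indexes only partly overlap (illustrative values)
q1 = pd.Series({'apples': 10, 'pears': 4})
q2 = pd.Series({'pears': 6, 'plums': 3})

# Addition matches values by label; labels present on only one side give NaN
total = q1 + q2
print(total)

# Series.add with fill_value treats a missing label as 0 instead
total_filled = q1.add(q2, fill_value=0)
print(total_filled)
```

Only `pears` exists in both Series, so it is the only label with a numeric result in `total`.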

## When to Use Pandas vs NumPy

Understanding when to use each library is important for efficient data analysis:

### Use NumPy when:
- Working with numerical arrays and matrices
- Need fast mathematical operations
- Data is homogeneous (all same type)
- Working with multi-dimensional numerical data
- Memory efficiency is critical

### Use Pandas when:
- Working with tabular data (rows and columns)
- Need to handle mixed data types
- Working with labeled data (column names, indices)
- Need data cleaning and transformation tools
- Reading/writing data from files (CSV, Excel, SQL)
- Performing group-by operations and aggregations
- Working with time series data

### Best Practice:
**Use both together!** Pandas is built on top of NumPy, and you can easily convert between them:
- DataFrame → NumPy: `df.to_numpy()` (preferred) or the older `df.values`
- NumPy → DataFrame: `pd.DataFrame(array)`

**Example workflow:** Load data with Pandas → Clean with Pandas → Convert to NumPy for numerical operations → Convert back to Pandas for results

In [None]:
import pandas as pd
import numpy as np

# Example: Pandas for structured data
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 55000]
})
print("Pandas DataFrame (mixed types):")
print(df)
print("\nData types:", df.dtypes.tolist())

# Convert to NumPy for numerical operations
ages_array = df['Age'].values
salaries_array = df['Salary'].values

print("\nNumPy arrays (numerical operations):")
print("Ages:", ages_array)
print("Salaries:", salaries_array)

# Perform NumPy operations
age_mean = np.mean(ages_array)
salary_normalized = (salaries_array - np.mean(salaries_array)) / np.std(salaries_array)

print("\nResults:")
print(f"Average age: {age_mean}")
print(f"Normalized salaries: {salary_normalized}")

# Convert back to Pandas
df['Salary_Normalized'] = salary_normalized
print("\nBack to Pandas with new column:")
print(df)

## Core Data Structures

Pandas has two main data structures:

1. **Series**: One-dimensional labeled array (like a column)
2. **DataFrame**: Two-dimensional labeled data structure (like a table)

**Documentation:** https://pandas.pydata.org/docs/user_guide/dsintro.html

## Pandas Series

A Series is a one-dimensional array with labels (index).

In [None]:
# Create a Series from a list
series_from_list = pd.Series([10, 20, 30, 40, 50])
print("Series from list:")
print(series_from_list)
print("\nData type:", type(series_from_list))

# Create a Series with custom index
series_with_index = pd.Series([10, 20, 30, 40, 50], 
                               index=['a', 'b', 'c', 'd', 'e'])
print("\nSeries with custom index:")
print(series_with_index)

# Create a Series from a dictionary
data_dict = {'apple': 5, 'banana': 3, 'orange': 8}
series_from_dict = pd.Series(data_dict)
print("\nSeries from dictionary:")
print(series_from_dict)

In [None]:
# Series attributes and methods
series = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

print("Values:", series.values)
print("Index:", series.index)
print("Shape:", series.shape)
print("Size:", series.size)
print("Data type:", series.dtype)

# Accessing elements
print("\nAccess by index label:", series['c'])
print("Access by position:", series.iloc[2])  # use .iloc for positions; series[2] on a labeled index is deprecated
print("Access multiple:", series[['a', 'c', 'e']])

# Basic statistics
print("\nMean:", series.mean())
print("Sum:", series.sum())
print("Max:", series.max())
print("Min:", series.min())

## Pandas DataFrame

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. Think of it as a spreadsheet or SQL table.

In [None]:
# Create DataFrame from dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Berlin'],
    'Salary': [50000, 60000, 55000, 65000, 58000]
}

df = pd.DataFrame(data)
print("DataFrame from dictionary:")
print(df)

# Create DataFrame from list of lists
data_list = [
    ['Alice', 25, 'New York', 50000],
    ['Bob', 30, 'London', 60000],
    ['Charlie', 35, 'Paris', 55000]
]
df_from_list = pd.DataFrame(data_list, 
                             columns=['Name', 'Age', 'City', 'Salary'])
print("\nDataFrame from list:")
print(df_from_list)

In [None]:
# DataFrame attributes and methods
print("Shape:", df.shape)
print("\nColumn names:", df.columns.tolist())
print("\nIndex:", df.index.tolist())
print("\nData types:\n", df.dtypes)
print("\nInfo:")
df.info()
print("\nFirst 3 rows:")
print(df.head(3))
print("\nLast 2 rows:")
print(df.tail(2))
print("\nDescriptive statistics:")
print(df.describe())

## Selecting Data

Pandas provides multiple ways to select and access data:

**Documentation:** https://pandas.pydata.org/docs/user_guide/indexing.html

In [None]:
# Create sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Berlin'],
    'Salary': [50000, 60000, 55000, 65000, 58000]
})

# Select single column (returns Series)
print("Select 'Name' column:")
print(df['Name'])
print("\nType:", type(df['Name']))

# Select multiple columns (returns DataFrame)
print("\nSelect multiple columns:")
print(df[['Name', 'Age']])

# Select rows by index position (iloc)
print("\nFirst row (iloc):")
print(df.iloc[0])

print("\nFirst 3 rows:")
print(df.iloc[0:3])

# Select rows by index label (loc)
print("\nSelect by label (loc):")
print(df.loc[1:3, ['Name', 'City']])

# Select specific cells
print("\nSelect specific cell:")
print(df.loc[2, 'Name'])

In [None]:
# Boolean indexing (filtering)
print("People older than 30:")
print(df[df['Age'] > 30])

print("\nPeople in New York or London:")
print(df[df['City'].isin(['New York', 'London'])])

# Multiple conditions (AND)
print("\nAge > 25 AND Salary > 55000:")
print(df[(df['Age'] > 25) & (df['Salary'] > 55000)])

# Multiple conditions (OR)
print("\nAge < 30 OR Salary > 60000:")
print(df[(df['Age'] < 30) | (df['Salary'] > 60000)])

# String methods
print("\nNames starting with 'A' or 'B':")
print(df[df['Name'].str.startswith(('A', 'B'))])
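
Boolean masks also work inside `.loc[]`, which filters rows and picks columns in one step. The sketch below rebuilds a small version of the sample DataFrame (under the name `people`, chosen here for illustration) so that it runs on its own. Each comparison in a combined condition needs its own parentheses because `&` and `|` bind more tightly than `>` or `<`.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'Salary': [50000, 60000, 55000, 65000, 58000],
})

# Build the mask once, then reuse it
over_30 = people['Age'] > 30

# Rows where the mask is True, restricted to two columns
print(people.loc[over_30, ['Name', 'Salary']])

# ~ inverts the mask: everyone aged 30 or younger
print(people.loc[~over_30, 'Name'].tolist())
```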

## Understanding loc vs iloc vs [] - Deep Dive

One of the most confusing aspects of Pandas for beginners is understanding the difference between `loc`, `iloc`, and bracket indexing `[]`. Let's clarify this once and for all:

### The Three Ways to Select Data:

1. **`[]` (bracket notation)**: Mixed behavior, can be confusing
2. **`.loc[]`**: Label-based indexing (explicit index)
3. **`.iloc[]`**: Position-based indexing (integer position)

**Best Practice:** Always use `.loc[]` or `.iloc[]` to be explicit and avoid confusion!

In [None]:
# Create a DataFrame with INTEGER index to show the confusion
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['NYC', 'London', 'Paris', 'Tokyo', 'Berlin']
}, index=[10, 20, 30, 40, 50])  # Non-sequential integer index!

print("DataFrame with custom integer index:")
print(df)
print("\nIndex:", df.index.tolist())

In [None]:
# THE CONFUSION: What does df[10] mean?
# Is it position 10, row label 10, or something else?

# With a single key, bracket notation looks for a COLUMN with that label
try:
    print("df[10] - ERROR! Looks for a column named 10 (doesn't exist)")
    print(df[10])
except KeyError as e:
    print(f"KeyError: {e}")

# Slicing with brackets selects ROWS by POSITION (implicit index)
print("\ndf[0:2] - Uses POSITION (first 2 rows):")
print(df[0:2])

# So a single key means a column label, while a slice means row positions.
# This is CONFUSING!

In [None]:
# SOLUTION 1: Use .loc[] for LABEL-based selection
print("Using .loc[] - LABEL-based (uses index values):")
print("\ndf.loc[10] - Gets row with index label 10:")
print(df.loc[10])

print("\ndf.loc[10:30] - Slice by labels (INCLUDES endpoint!):")
print(df.loc[10:30])

print("\ndf.loc[10, 'Name'] - Get specific cell:")
print(df.loc[10, 'Name'])

print("\ndf.loc[[10, 30, 50], ['Name', 'Age']] - Multiple rows and columns:")
print(df.loc[[10, 30, 50], ['Name', 'Age']])

In [None]:
# SOLUTION 2: Use .iloc[] for POSITION-based selection
print("Using .iloc[] - POSITION-based (like NumPy arrays):")
print("\ndf.iloc[0] - Gets FIRST row (position 0):")
print(df.iloc[0])

print("\ndf.iloc[0:2] - Slice by position (EXCLUDES endpoint):")
print(df.iloc[0:2])

print("\ndf.iloc[0, 1] - Get cell at position [0, 1]:")
print(df.iloc[0, 1])

print("\ndf.iloc[[0, 2, 4], [0, 1]] - Multiple positions:")
print(df.iloc[[0, 2, 4], [0, 1]])

### Key Differences Summary:

| Feature | `.loc[]` | `.iloc[]` | `[]` |
|---------|----------|-----------|------|
| **Selection Type** | Label-based | Position-based | Mixed (confusing!) |
| **Slicing Endpoint** | **Included** | **Excluded** | Excluded |
| **Works with** | Index labels | Integer positions | Depends |
| **Single value** | `df.loc[label]` | `df.iloc[pos]` | `df[col]` (column only) |
| **Slicing** | `df.loc[10:30]` | `df.iloc[0:2]` | `df[0:2]` (position) |
| **Recommended?** | ✅ Yes, explicit | ✅ Yes, explicit | ⚠️ Use with caution |

### The Golden Rule:

**Always prefer `.loc[]` or `.iloc[]` over `[]` for row selection!**

- Use `.loc[]` when you know the index labels
- Use `.iloc[]` when you care about positions
- Use `[]` only for column selection: `df['column']`
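
One way to see why `[]` is ambiguous: with a single key, `df[key]` looks for a column with that label, while a slice selects rows. The sketch below uses a hypothetical DataFrame whose row labels and column labels are both integers to make the difference visible.

```python
import pandas as pd

# Hypothetical frame: integer row labels AND integer column labels
grid = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6]],
    index=[10, 20, 30],
    columns=[10, 99],
)

col = grid[10]         # single key -> column labeled 10
row = grid.loc[10]     # .loc -> row labeled 10
first_two = grid[0:2]  # slice -> rows at positions 0 and 1

print(col.tolist())
print(row.tolist())
print(first_two.index.tolist())
```

The same key `10` returns a column through `[]` and a row through `.loc[]`, which is the main reason to keep `[]` for column selection only.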

In [None]:
# Practical example: Why explicit is better

# Bad practice (confusing):
df_subset = df[0:2]  # Uses position
# df_value = df[10]  # Would cause error!

# Good practice (explicit and clear):
df_subset_loc = df.loc[10:30]  # Clear: using labels
df_subset_iloc = df.iloc[0:2]  # Clear: using positions

print("Using .loc[] (labels 10:30, includes 30):")
print(df_subset_loc)

print("\nUsing .iloc[] (positions 0:2, excludes 2):")
print(df_subset_iloc)

# Both are explicit about what they're doing!

## Handling Missing Data

Real-world data often contains missing values. Pandas provides tools to handle them:

**Documentation:** https://pandas.pydata.org/docs/user_guide/missing_data.html

### Understanding None vs NaN

Pandas uses two Python values to represent missing data:
- **`None`**: A Python singleton object (used in object arrays)
- **`NaN`**: "Not a Number", a special floating-point value (used in numeric arrays)

**Important:** Pandas treats both as essentially interchangeable for indicating missing values, but there are key differences under the hood.

In [None]:
# Understanding None vs NaN

# None - Python object; it is kept as-is only in an object-dtype Series
vals_none = pd.Series([1, None, 3, 4], dtype=object)
print("Series with None (dtype=object):")
print(vals_none)
print("Data type:", vals_none.dtype)  # object dtype

# NaN - Numeric missing value
vals_nan = pd.Series([1, np.nan, 3, 4])
print("\nSeries with NaN:")
print(vals_nan)
print("Data type:", vals_nan.dtype)  # float64 dtype

# Without dtype=object, Pandas converts None to NaN automatically
vals_auto = pd.Series([1, None, 3, 4])
print("\nAutomatic conversion:")
print("Data type when None is mixed with numbers:", vals_auto.dtype)  # float64 dtype
print("Mean (missing value is skipped):", vals_auto.mean())

### Type Conversion with Missing Values

When you introduce missing values, Pandas may automatically convert data types:

| Original Type | Missing Value Added | Resulting Type | NA Representation |
|--------------|---------------------|----------------|-------------------|
| `int` | `np.nan` or `None` | `float64` | `np.nan` |
| `float` | `np.nan` or `None` | `float64` | `np.nan` |
| `bool` | `np.nan` or `None` | `object` | `None` or `np.nan` |
| `object` | `np.nan` or `None` | `object` | `None` or `np.nan` |

**Key Point:** Integer arrays are converted to float when NaN is introduced because there's no "integer NaN" in NumPy.
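
If an integer column must stay integer despite missing values, pandas also offers nullable extension dtypes such as `Int64` (capital I), which store missing entries as `pd.NA` instead of converting to float. A minimal sketch with made-up values:

```python
import pandas as pd

# Default behavior: None becomes NaN and the dtype becomes float64
as_float = pd.Series([1, 2, None, 4])

# Nullable integer dtype: the values stay integers, the gap is pd.NA
as_int = pd.Series([1, 2, None, 4], dtype='Int64')

print(as_float.dtype)
print(as_int.dtype)
print(as_int.isna().tolist())
print(as_int.sum())  # missing values are skipped by default
```

The table above describes the default NumPy-backed dtypes; the nullable dtypes are opt-in.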

In [None]:
# Type conversion example
original = pd.Series([1, 2, 3, 4], dtype='int64')
print("Original integer Series:")
print(original)
print("Data type:", original.dtype)

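# Introduce a missing value: where() replaces entries that fail the condition with NaN
# (assigning NaN directly into an int64 Series triggers a deprecation warning in recent pandas versions)
original_with_nan = original.where(original != 3)
print("\nAfter adding NaN:")
print(original_with_nan)
print("Data type:", original_with_nan.dtype)  # Converted to float64!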

# Boolean conversion
bool_series = pd.Series([True, False, True])
print("\nOriginal boolean Series:")
print(bool_series)
print("Data type:", bool_series.dtype)

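# Replace the second entry with a missing value (where() keeps entries whose condition is True)
bool_with_nan = bool_series.where(np.array([True, False, True]))
print("\nAfter adding a missing value:")
print(bool_with_nan)
print("Data type:", bool_with_nan.dtype)  # Converted to object!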

### Detecting Missing Values

Use `isnull()` or `notnull()` (also available under the names `isna()` and `notna()`) to detect missing values:

In [None]:
# Create DataFrame with missing values
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 35, 28, np.nan],
    'City': ['New York', 'London', None, 'Tokyo', 'Berlin'],
    'Salary': [50000, 60000, None, 65000, 58000]
})

print("DataFrame with missing values:")
print(df)

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

print("\nAny missing values?:", df.isnull().any().any())

# Visualize missing data
print("\nMissing data mask:")
print(df.isnull())

In [None]:
# Using boolean masks to filter
print("Original DataFrame:")
print(df)

# Get rows with NO missing values
print("\nRows with no missing values:")
print(df[df.notnull().all(axis=1)])

# Get rows with ANY missing values
print("\nRows with any missing values:")
print(df[df.isnull().any(axis=1)])

# Get non-null values from a specific column
print("\nNon-null Ages:")
print(df[df['Age'].notnull()])

### Dropping Missing Values

The `dropna()` method removes rows or columns with missing values:

In [None]:
# Drop rows with any missing values
df_dropped = df.dropna()
print("After dropping rows with NaN:")
print(df_dropped)

# Drop columns with any missing values
df_dropped_cols = df.dropna(axis=1)
print("\nAfter dropping columns with NaN:")
print(df_dropped_cols)

# Drop rows only if all values are missing
df_dropped_all = df.dropna(how='all')
print("\nAfter dropping rows where all values are NaN:")
print(df_dropped_all)

# Drop rows with missing values in specific columns
df_dropped_subset = df.dropna(subset=['Age'])
print("\nAfter dropping rows with NaN in 'Age':")
print(df_dropped_subset)

### Advanced dropna(): Using thresh Parameter

The `thresh` parameter is very useful in practice - it specifies the **minimum number of non-null values required** to keep a row/column:

In [None]:
# Create DataFrame with varying amounts of missing data
df_messy = pd.DataFrame({
    'A': [1, 2, np.nan, np.nan, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, np.nan, np.nan, 4, 5],
    'D': [1, 2, 3, 4, 5]
})

print("Messy DataFrame:")
print(df_messy)
print("\nMissing values per row:")
print(df_messy.isnull().sum(axis=1))

# Keep rows with at least 3 non-null values
df_thresh = df_messy.dropna(thresh=3)
print("\nKeep rows with at least 3 non-null values:")
print(df_thresh)

# Keep rows with at least 75% non-null values (round up so the threshold never drops below 75%)
min_count = int(np.ceil(len(df_messy.columns) * 0.75))
df_percent = df_messy.dropna(thresh=min_count)
print(f"\nKeep rows with at least {min_count} non-null values (75% of {len(df_messy.columns)} columns):")
print(df_percent)

In [None]:
# thresh with axis=1 (columns)
print("Original DataFrame:")
print(df_messy)
print("\nMissing values per column:")
print(df_messy.isnull().sum())

# Keep columns with at least 4 non-null values
df_thresh_cols = df_messy.dropna(axis=1, thresh=4)
print("\nKeep columns with at least 4 non-null values:")
print(df_thresh_cols)

### Filling Missing Values

Instead of dropping, you can fill missing values with `fillna()`:

In [None]:
# Create sample data for filling examples
df_fill = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan, 5],
    'B': [np.nan, 2, np.nan, 4, 5],
    'C': [1, 2, 3, 4, 5]
})

print("Original DataFrame:")
print(df_fill)

# Fill missing values with a constant
df_filled = df_fill.fillna(0)
print("\nFill NaN with 0:")
print(df_filled)

# Fill with different values per column
df_filled_dict = df_fill.fillna({'A': df_fill['A'].mean(), 
                                 'B': df_fill['B'].median()})
print("\nFill with different values per column:")
print(df_filled_dict)

### Forward Fill and Backward Fill

For time series or ordered data, you can propagate values forward or backward:

- **Forward fill (`ffill`)**: Use the last valid value
- **Backward fill (`bfill`)**: Use the next valid value

In [None]:
# Create a Series to demonstrate fill methods clearly
data = pd.Series([1, np.nan, np.nan, 4, np.nan, np.nan, 7], 
                 index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

print("Original Series:")
print(data)

# Forward fill - propagate last valid value forward
print("\nForward fill (ffill):")
print(data.ffill())

# Backward fill - propagate next valid value backward
print("\nBackward fill (bfill):")
print(data.bfill())

# Compare all three
print("\n=== Comparison ===")
comparison = pd.DataFrame({
    'Original': data,
    'Forward Fill': data.ffill(),
    'Backward Fill': data.bfill()
})
print(comparison)

In [None]:
# Forward and backward fill with DataFrames
df_dates = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=7, freq='D'),
    'Sales': [100, np.nan, np.nan, 150, np.nan, 180, 200],
    'Customers': [10, 12, np.nan, np.nan, 15, np.nan, 20]
})

print("Sales data with missing values:")
print(df_dates)

# Forward fill
print("\nForward fill:")
print(df_dates.ffill())

# Backward fill
print("\nBackward fill:")
print(df_dates.bfill())

# Limit the number of consecutive fills
print("\nForward fill with limit=1 (only fill 1 consecutive NaN):")
print(df_dates.ffill(limit=1))

### Interpolation for Numerical Data

For numerical data, interpolation can provide more sophisticated filling:

In [None]:
# Interpolation example
df_interp = pd.DataFrame({
    'Value': [1, np.nan, np.nan, 4, np.nan, 6, np.nan, np.nan, 9]
})

print("Original data:")
print(df_interp)

# Linear interpolation (default)
df_interp['Linear'] = df_interp['Value'].interpolate()
print("\nLinear interpolation:")
print(df_interp)

# Polynomial interpolation (this method requires SciPy to be installed)
df_interp['Polynomial'] = df_interp['Value'].interpolate(method='polynomial', order=2)
print("\nPolynomial interpolation:")
print(df_interp)

### Practical Tips for Handling Missing Data

**When to use each approach:**

1. **`dropna()`**: Use when:
   - Missing data is minimal (< 5% of rows)
   - Missing values are random (MCAR - Missing Completely At Random)
   - You have enough data after dropping

2. **`fillna(0)` or constant**: Use when:
   - Zero/constant makes business sense (e.g., no sales = 0)
   - You're creating indicator variables

3. **`fillna(mean/median)`**: Use when:
   - Data is numerical
   - You want to keep the column's center unchanged (note that this shrinks its variance)
   - Mean for roughly symmetric distributions, median for skewed data

4. **Forward/Backward fill**: Use when:
   - Data is time-series
   - Values don't change rapidly
   - Order matters

5. **`interpolate()`**: Use when:
   - Data is numerical and smooth
   - You expect gradual changes
   - Time-series with regular intervals

**Example Decision Tree:**
```
Is the data time-series? 
├─ Yes → Use ffill/bfill or interpolate()
└─ No → Is < 5% missing?
    ├─ Yes → Use dropna()
    └─ No → Use fillna(mean/median) or domain-specific imputation
```
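
The decision tree above can be sketched as a small helper. `suggest_strategy` is a hypothetical function written for this tutorial, not part of pandas, and the 5% cutoff is the rule of thumb from the list above rather than a fixed standard.

```python
import numpy as np
import pandas as pd

def suggest_strategy(series, is_time_series, max_drop_fraction=0.05):
    """Return a suggested approach for one column (hypothetical helper)."""
    missing_fraction = series.isna().mean()  # share of missing entries
    if missing_fraction == 0:
        return 'nothing to do'
    if is_time_series:
        return 'ffill/bfill or interpolate()'
    if missing_fraction < max_drop_fraction:
        return 'dropna()'
    return 'fillna(mean/median) or domain-specific imputation'

readings = pd.Series([1.0, np.nan, 3.0, 4.0])  # 25% missing
print(suggest_strategy(readings, is_time_series=True))
print(suggest_strategy(readings, is_time_series=False))
```

Treat the returned string as a starting point only; the right choice still depends on why the values are missing.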

In [None]:
# Practical example: Complete workflow
print("=== Real-world Missing Data Workflow ===\n")

# 1. Create realistic messy data
df_real = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 150, np.nan, 180, 120, np.nan, np.nan, 200, 110, np.nan],
    'Returns': [5, np.nan, 3, 4, np.nan, 2, 1, np.nan, 2, 3]
})

print("Original data:")
print(df_real)
print(f"\nMissing data summary:")
print(df_real.isnull().sum())

# 2. Analyze missing patterns
missing_pct = (df_real.isnull().sum() / len(df_real)) * 100
print(f"\nMissing percentage:")
print(missing_pct)

# 3. Apply appropriate strategy
df_clean = df_real.copy()

# Sales: use forward fill (time series assumption)
df_clean['Sales'] = df_clean['Sales'].ffill()

# Returns: use median (sporadic missing, numerical)
df_clean['Returns'] = df_clean['Returns'].fillna(df_clean['Returns'].median())

print("\nCleaned data:")
print(df_clean)

# 4. Verify no missing values remain
print(f"\nRemaining missing values:")
print(df_clean.isnull().sum())

## Merging and Joining DataFrames

Combining data from multiple sources is a fundamental task in data analysis. Pandas provides powerful tools for merging and joining DataFrames, similar to SQL database operations.

**Documentation:** https://pandas.pydata.org/docs/user_guide/merging.html

### Understanding Different Types of Joins

Before we dive into the code, it's important to understand the different types of joins and when to use each one:

**1. One-to-One Join**: Each key appears only once in both DataFrames
- Example: Merging employee personal info with employee contact info

**2. Many-to-One Join**: Keys in one DataFrame appear multiple times, but only once in the other
- Example: Merging employees with their department information

**3. Many-to-Many Join**: Keys appear multiple times in both DataFrames
- Example: Merging employees with skills (employees can have multiple skills, skills can belong to multiple employees)

In [None]:
# Example 1: One-to-One Join
# Each employee appears once in each DataFrame

employees = pd.DataFrame({
    'EmployeeID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

salaries = pd.DataFrame({
    'EmployeeID': [1, 2, 3, 4],
    'Salary': [70000, 80000, 75000, 90000]
})

print("One-to-One Join:")
print("Employees:")
print(employees)
print("\nSalaries:")
print(salaries)

# Merge on EmployeeID
result = pd.merge(employees, salaries, on='EmployeeID')
print("\nMerged Result (One-to-One):")
print(result)

In [None]:
# Example 2: Many-to-One Join
# Multiple employees can belong to the same department

employees = pd.DataFrame({
    'EmployeeID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'DepartmentID': [101, 102, 101, 103, 102]
})

departments = pd.DataFrame({
    'DepartmentID': [101, 102, 103],
    'Department': ['Sales', 'IT', 'HR'],
    'Manager': ['John', 'Sarah', 'Mike']
})

print("Many-to-One Join:")
print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)

# Merge - each department info will be repeated for each employee
result = pd.merge(employees, departments, on='DepartmentID')
print("\nMerged Result (Many-to-One):")
print(result)

In [None]:
# Example 3: Many-to-Many Join
# Employees can have multiple skills, skills can belong to multiple employees

employee_skills = pd.DataFrame({
    'EmployeeID': [1, 1, 2, 2, 3, 3],
    'Skill': ['Python', 'SQL', 'Python', 'Excel', 'SQL', 'Tableau']
})

skill_levels = pd.DataFrame({
    'Skill': ['Python', 'Python', 'SQL', 'SQL', 'Excel', 'Tableau'],
    'Level': ['Advanced', 'Beginner', 'Advanced', 'Intermediate', 'Advanced', 'Beginner']
})

print("Many-to-Many Join:")
print("Employee Skills:")
print(employee_skills)
print("\nSkill Levels:")
print(skill_levels)

# Every left row is paired with every right row that has the same Skill
result = pd.merge(employee_skills, skill_levels, on='Skill')
print("\nMerged Result (Many-to-Many):")
print(result)
print("\nNote: Employee 1 with Python skill gets matched with both Python levels!")

### Merging on Index: left_index and right_index

Sometimes you want to merge DataFrames based on their index rather than a column. This is particularly useful when working with time series data or when the index contains meaningful information.

In [None]:
# Create DataFrames with meaningful indices

# Employee data indexed by employee ID
employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28]
}, index=[101, 102, 103, 104])
employees.index.name = 'EmployeeID'

# Salary data also indexed by employee ID
salaries = pd.DataFrame({
    'Salary': [70000, 80000, 75000, 90000],
    'Bonus': [5000, 8000, 6000, 10000]
}, index=[101, 102, 103, 104])
salaries.index.name = 'EmployeeID'

print("Employees (indexed by EmployeeID):")
print(employees)
print("\nSalaries (indexed by EmployeeID):")
print(salaries)

# Merge on index using left_index and right_index
result = pd.merge(employees, salaries, left_index=True, right_index=True)
print("\nMerged on Index:")
print(result)

In [None]:
# Mixing index and column merging

# One DataFrame indexed, another with a column
employees_indexed = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Department': ['Sales', 'IT', 'Sales']
}, index=[101, 102, 103])
employees_indexed.index.name = 'EmployeeID'

# This one has EmployeeID as a column
performance = pd.DataFrame({
    'EmployeeID': [101, 102, 103],
    'Rating': [4.5, 4.8, 4.2],
    'ReviewDate': ['2024-01-15', '2024-01-20', '2024-01-18']
})

print("Employees (indexed):")
print(employees_indexed)
print("\nPerformance (column-based):")
print(performance)

# Merge: left_index=True, right_on='EmployeeID'
result = pd.merge(employees_indexed, performance, 
                  left_index=True, right_on='EmployeeID')
print("\nMerged (index + column):")
print(result)
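
For index-on-index merges like the first example in this section, `DataFrame.join` is a shorter spelling that defaults to a left join on the index. A minimal sketch with the same kind of data (names and values are illustrative):

```python
import pandas as pd

staff = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie']},
    index=pd.Index([101, 102, 103], name='EmployeeID'),
)
pay = pd.DataFrame(
    {'Salary': [70000, 80000]},
    index=pd.Index([101, 102], name='EmployeeID'),
)

# join() aligns on the index and keeps every row of the calling frame
joined = staff.join(pay)

# The same left join written with pd.merge
merged = pd.merge(staff, pay, left_index=True, right_index=True, how='left')

print(joined)
print(merged)
```

Employee 103 has no salary row, so both results keep that employee with a missing `Salary`.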

### Handling Column Name Conflicts with suffixes

When merging DataFrames that have columns with the same name (other than the merge key), Pandas automatically adds suffixes to distinguish them. By default, it uses `_x` and `_y`, but you can customize this.

In [None]:
# DataFrames with conflicting column names

# Employee data from HR system
hr_data = pd.DataFrame({
    'EmployeeID': [1, 2, 3],
    'Name': ['Alice Smith', 'Bob Jones', 'Charlie Brown'],
    'Department': ['Sales', 'IT', 'Sales'],
    'Status': ['Active', 'Active', 'Active']
})

# Employee data from Payroll system
payroll_data = pd.DataFrame({
    'EmployeeID': [1, 2, 3],
    'Name': ['A. Smith', 'R. Jones', 'C. Brown'],  # Different name format
    'Department': ['Sales', 'IT', 'Marketing'],     # Might be different!
    'Status': ['Paid', 'Paid', 'Pending']
})

print("HR Data:")
print(hr_data)
print("\nPayroll Data:")
print(payroll_data)

# Merge with default suffixes (_x and _y)
result_default = pd.merge(hr_data, payroll_data, on='EmployeeID')
print("\nMerged with default suffixes:")
print(result_default)

In [None]:
# Using custom suffixes for clarity

result_custom = pd.merge(hr_data, payroll_data, on='EmployeeID', 
                        suffixes=('_HR', '_Payroll'))
print("Merged with custom suffixes:")
print(result_custom)

# Now it's much clearer which data comes from which system!
print("\nCompare departments:")
print(result_custom[['EmployeeID', 'Name_HR', 'Department_HR', 'Department_Payroll']])

### Types of Joins: how parameter

The `how` parameter in `pd.merge()` determines which keys to include in the result:

- **`inner`** (default): Only keys that appear in both DataFrames
- **`outer`**: All keys from both DataFrames
- **`left`**: All keys from the left DataFrame
- **`right`**: All keys from the right DataFrame

In [None]:
# Sample data with some non-matching keys

employees = pd.DataFrame({
    'EmployeeID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

departments = pd.DataFrame({
    'EmployeeID': [3, 4, 5, 6],
    'Department': ['Sales', 'IT', 'HR', 'Marketing']
})

print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)
print("\nNotice: IDs 1,2 only in employees; IDs 5,6 only in departments; IDs 3,4 in both")

In [None]:
# Inner join - only matching keys (3, 4)
inner_result = pd.merge(employees, departments, on='EmployeeID', how='inner')
print("Inner Join (only IDs 3, 4):")
print(inner_result)

# Outer join - all keys from both
outer_result = pd.merge(employees, departments, on='EmployeeID', how='outer')
print("\nOuter Join (all IDs):")
print(outer_result)

# Left join - all keys from left (employees)
left_result = pd.merge(employees, departments, on='EmployeeID', how='left')
print("\nLeft Join (IDs 1, 2, 3, 4):")
print(left_result)

# Right join - all keys from right (departments)
right_result = pd.merge(employees, departments, on='EmployeeID', how='right')
print("\nRight Join (IDs 3, 4, 5, 6):")
print(right_result)

### Practical Example: Combining Employee Data

Let's put it all together with a realistic example combining multiple data sources.

In [None]:
# Scenario: Combine employee data from multiple systems

# 1. Basic employee info
employees = pd.DataFrame({
    'EmployeeID': [101, 102, 103, 104, 105],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'DepartmentID': [1, 2, 1, 3, 2]
})

# 2. Department information
departments = pd.DataFrame({
    'DepartmentID': [1, 2, 3, 4],
    'Department': ['Sales', 'IT', 'HR', 'Marketing'],
    'Location': ['New York', 'San Francisco', 'Chicago', 'Boston']
})

# 3. Salary information (not all employees have salary data yet)
salaries = pd.DataFrame({
    'EmployeeID': [101, 102, 103, 105],
    'Salary': [70000, 85000, 72000, 90000]
})

print("Step 1: Merge employees with departments (many-to-one)")
df = pd.merge(employees, departments, on='DepartmentID', how='left')
print(df)

print("\nStep 2: Merge with salaries (one-to-one, left join to keep all employees)")
final = pd.merge(df, salaries, on='EmployeeID', how='left')
print(final)

print("\nFinal Result Summary:")
print(f"Total employees: {len(final)}")
print(f"Employees with salary data: {final['Salary'].notna().sum()}")
print(f"Employees missing salary: {final['Salary'].isna().sum()}")

### Key Takeaways for Merging

**When to use each join type:**

1. **Inner join** (`how='inner'`):
   - When you only want records that have matches in both DataFrames
   - Safest option when you want complete data

2. **Outer join** (`how='outer'`):
   - When you want to keep all records from both DataFrames
   - Useful for finding gaps in data

3. **Left join** (`how='left'`):
   - When the left DataFrame is your main data
   - Want to enrich it with additional info from right
   - Most common in practice

4. **Right join** (`how='right'`):
   - Rarely used (can use left join by swapping DataFrames)
   - Included for completeness

**Best Practices:**
- Always check the shape before and after merging
- Use `indicator=True` to track merge status
- Be explicit with column names using `left_on`/`right_on`
- Use meaningful suffixes when columns conflict
- Consider using `validate` parameter to catch unexpected duplicates
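
A short sketch of the first and last points in this list, using made-up tables: comparing row counts before and after a merge catches silent row multiplication, and `validate` makes `pd.merge` raise `pandas.errors.MergeError` when the keys do not have the relationship you expect.

```python
import pandas as pd

orders = pd.DataFrame({'CustomerID': [1, 1, 2], 'Amount': [20, 35, 50]})
customers = pd.DataFrame({
    'CustomerID': [1, 2, 2],  # CustomerID 2 appears twice by mistake
    'Name': ['Ann', 'Ben', 'Benjamin'],
})

# Without validation the duplicate key silently adds a row
unchecked = pd.merge(orders, customers, on='CustomerID', how='left')
print(len(orders), '->', len(unchecked))

# validate='many_to_one' requires the right-hand keys to be unique
try:
    pd.merge(orders, customers, on='CustomerID', how='left',
             validate='many_to_one')
    error_raised = False
except pd.errors.MergeError as exc:
    error_raised = True
    print('MergeError:', exc)
```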

In [None]:
# Advanced: Using indicator to track merge status

result = pd.merge(employees, salaries, on='EmployeeID', how='outer', indicator=True)
print("Merge with indicator:")
print(result)

print("\nMerge statistics:")
print(result['_merge'].value_counts())

# Find employees without salary data
print("\nEmployees missing salary:")
print(result[result['_merge'] == 'left_only'][['EmployeeID', 'Name']])

## Adding and Modifying Data

Before working with data, you often need to add or modify columns and rows:

In [None]:
# Create sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 55000]
})

print("Original DataFrame:")
print(df)

# Add a new column
df['City'] = ['New York', 'London', 'Paris']
print("\nAfter adding 'City' column:")
print(df)

# Add calculated column
df['Salary_k'] = df['Salary'] / 1000
print("\nAfter adding calculated column:")
print(df)

# Modify existing column
df['Age'] = df['Age'] + 1
print("\nAfter incrementing Age:")
print(df)
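
The cell above only changes columns. Rows can be added or modified in a similar way; the following is a minimal sketch using a small hypothetical `people` DataFrame. Using `len(people)` as the new label assumes the default integer index.

```python
import pandas as pd

# Hypothetical data for illustration only
people = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30]
})

# Append one row by assigning to a new index label
# (len(people) is the next free label only with the default 0..n-1 index)
people.loc[len(people)] = ['Charlie', 35]

# Append several rows at once with pd.concat
new_rows = pd.DataFrame({'Name': ['David', 'Eve'], 'Age': [28, 32]})
people = pd.concat([people, new_rows], ignore_index=True)

# Modify a single cell: select the row with a boolean mask, then name the column
people.loc[people['Name'] == 'Bob', 'Age'] = 31
print(people)
```

For more than a handful of rows, collecting them first and calling `pd.concat` once is usually preferable to growing a DataFrame one row at a time.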

## Sorting Data

Sorting is essential for organizing and analyzing data.

**Documentation:** https://pandas.pydata.org/docs/user_guide/basics.html#sorting

In [None]:
# Create sample DataFrame
df = pd.DataFrame({
    'Name': ['Charlie', 'Alice', 'Eve', 'Bob', 'David'],
    'Age': [35, 25, 32, 30, 28],
    'Salary': [55000, 50000, 58000, 60000, 65000]
})

print("Original DataFrame:")
print(df)

# Sort by single column
df_sorted = df.sort_values('Age')
print("\nSorted by Age:")
print(df_sorted)

# Sort in descending order
df_sorted_desc = df.sort_values('Salary', ascending=False)
print("\nSorted by Salary (descending):")
print(df_sorted_desc)

# Sort by multiple columns
df_sorted_multi = df.sort_values(['Age', 'Salary'], ascending=[True, False])
print("\nSorted by Age (asc) then Salary (desc):")
print(df_sorted_multi)

# Sort by index
df_sorted_index = df.sort_index()
print("\nSorted by index:")
print(df_sorted_index)
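
Two related patterns come up often after sorting: taking the top N rows and renumbering the index. A short sketch, using a hypothetical `staff` DataFrame with the same values as the example above:

```python
import pandas as pd

# Hypothetical copy of the sample data, so this block runs on its own
staff = pd.DataFrame({
    'Name': ['Charlie', 'Alice', 'Eve', 'Bob', 'David'],
    'Age': [35, 25, 32, 30, 28],
    'Salary': [55000, 50000, 58000, 60000, 65000]
})

# Top 2 rows by Salary; equivalent to sorting descending and taking head(2)
top2 = staff.nlargest(2, 'Salary')
print("Top 2 salaries:")
print(top2)

# Sort by Age and renumber the index 0..n-1, discarding the old labels
by_age = staff.sort_values('Age').reset_index(drop=True)
print("\nSorted by Age with a fresh index:")
print(by_age)
```

`sort_values` returns a new DataFrame and keeps the original index labels, which is why the index looks shuffled after sorting unless it is reset.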

## Handling Duplicate Data

Duplicate rows are common in real datasets. Pandas provides methods to detect and remove them:

**Documentation:** https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

In [None]:
# Create DataFrame with duplicates
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'David'],
    'Age': [25, 30, 25, 35, 30, 28],
    'City': ['New York', 'London', 'New York', 'Paris', 'London', 'Tokyo']
})

print("DataFrame with duplicates:")
print(df)

# Check whether any row repeats an earlier row
print("\nAre there duplicates?", df.duplicated().any())
print("\nDuplicate rows (boolean mask):")
print(df.duplicated())

# Show duplicate rows
print("\nDuplicate rows:")
print(df[df.duplicated()])

# Count duplicates
print(f"\nNumber of duplicate rows: {df.duplicated().sum()}")

In [None]:
# Remove duplicates (keep first occurrence)
df_no_duplicates = df.drop_duplicates()
print("After removing duplicates (keep first):")
print(df_no_duplicates)

# Remove duplicates (keep last occurrence)
df_keep_last = df.drop_duplicates(keep='last')
print("\nAfter removing duplicates (keep last):")
print(df_keep_last)

# Remove duplicates based on specific columns
df_partial = df.drop_duplicates(subset=['Name'])
print("\nRemove duplicates based on 'Name' only:")
print(df_partial)

# Keep none (remove all duplicates including originals)
df_keep_none = df.drop_duplicates(keep=False)
print("\nRemove all duplicates (including originals):")
print(df_keep_none)
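
`keep='first'` and `keep='last'` depend on row order, so sorting first lets you choose which duplicate survives. A minimal sketch, assuming a hypothetical `orders` table where the row with the largest `Amount` per Customer and Item should be kept:

```python
import pandas as pd

# Hypothetical data for illustration only
orders = pd.DataFrame({
    'Customer': ['Ann', 'Ben', 'Ann', 'Ben'],
    'Item': ['Pen', 'Ink', 'Pen', 'Ink'],
    'Amount': [5, 12, 8, 10]
})

# keep=False marks every member of a duplicate group, not just the repeats
all_dupes = orders[orders.duplicated(subset=['Customer', 'Item'], keep=False)]
print("All rows that share a Customer and Item:")
print(all_dupes)

# Sort so the largest Amount comes first, keep the first row per key,
# then restore the original row order
best = (orders.sort_values('Amount', ascending=False)
              .drop_duplicates(subset=['Customer', 'Item'], keep='first')
              .sort_index())
print("\nLargest Amount per Customer and Item:")
print(best)
```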

## Value Counts and Frequency Analysis

Understanding the distribution of values in your data is crucial for exploratory data analysis:

In [None]:
# Create sample data
df = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone', 'Phone', 'Tablet', 'Laptop'],
    'Region': ['North', 'South', 'North', 'East', 'North', 'South', 'West', 'North'],
    'Status': ['Sold', 'Sold', 'Pending', 'Sold', 'Sold', 'Pending', 'Sold', 'Sold']
})

print("Sales data:")
print(df)

# Count how often each value occurs in a column
print("\nProduct value counts:")
print(df['Product'].value_counts())

# As proportions (fractions that sum to 1; multiply by 100 for percentages)
print("\nProduct value counts (proportions):")
print(df['Product'].value_counts(normalize=True))

# Number of distinct values, and the distinct values themselves
print(f"\nNumber of unique products: {df['Product'].nunique()}")
print(f"Unique products: {df['Product'].unique()}")

# Crosstab for frequency analysis
print("\nCrosstab - Product vs Region:")
print(pd.crosstab(df['Product'], df['Region']))
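
Frequencies are often easier to read as percentages or with totals. The sketch below uses a small hypothetical `sales_demo` DataFrame and two standard options: scaling the output of `value_counts(normalize=True)`, and `margins=True` in `pd.crosstab`.

```python
import pandas as pd

# Hypothetical data for illustration only
sales_demo = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Phone', 'Tablet'],
    'Status': ['Sold', 'Sold', 'Pending', 'Sold']
})

# Proportions scaled to percentages and rounded to one decimal
pct = sales_demo['Product'].value_counts(normalize=True).mul(100).round(1)
print("Share of each product (%):")
print(pct)

# Crosstab with row and column totals in an extra 'All' row and column
table = pd.crosstab(sales_demo['Product'], sales_demo['Status'], margins=True)
print("\nProduct vs Status with totals:")
print(table)
```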

## Grouping and Aggregation

GroupBy allows you to split data into groups and apply functions to each group:

**Documentation:** https://pandas.pydata.org/docs/user_guide/groupby.html

In [None]:
# Create sample DataFrame
df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'IT', 'IT', 'HR', 'HR', 'Sales'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Age': [25, 30, 35, 28, 32, 45, 27],
    'Salary': [50000, 60000, 55000, 65000, 58000, 52000, 54000]
})

print("Employee DataFrame:")
print(df)

# Group by single column
grouped = df.groupby('Department')

# Calculate mean per group
print("\nMean salary by department:")
print(grouped['Salary'].mean())

# Multiple aggregations
print("\nMultiple statistics by department:")
print(grouped['Salary'].agg(['mean', 'min', 'max', 'count']))

# Group by and aggregate different columns differently
print("\nDifferent aggregations per column:")
print(grouped.agg({
    'Age': ['mean', 'min', 'max'],
    'Salary': ['mean', 'sum']
}))

In [None]:
# Multiple grouping columns
df['Experience'] = ['Junior', 'Senior', 'Senior', 'Junior', 'Senior', 'Senior', 'Junior']

print("DataFrame with Experience:")
print(df)

# Group by multiple columns
grouped_multi = df.groupby(['Department', 'Experience'])
print("\nMean salary by Department and Experience:")
print(grouped_multi['Salary'].mean())

# Reset index to make it a regular DataFrame
print("\nWith reset index:")
print(grouped_multi['Salary'].mean().reset_index())

# Size of each group
print("\nCount of employees per group:")
print(grouped_multi.size())
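
Passing a dict of lists to `agg()` produces MultiIndex columns, which can be awkward to work with. Named aggregation is an alternative that yields flat, self-describing column names. A minimal sketch with a hypothetical `team` DataFrame; the output column names (`avg_salary` and so on) are arbitrary choices:

```python
import pandas as pd

# Hypothetical data for illustration only
team = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'IT', 'IT'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 55000, 65000]
})

# Each keyword is an output column: name=(input column, aggregation)
summary = team.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    total_salary=('Salary', 'sum'),
    oldest=('Age', 'max'),
    headcount=('Salary', 'count'),
).reset_index()
print(summary)
```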

## Apply and Lambda Functions

The `apply()` method lets you run a custom function on each element of a Series, or on each row or column of a DataFrame:

**Documentation:** https://pandas.pydata.org/docs/user_guide/basics.html#function-application

In [None]:
# Create sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 55000, 65000]
})

print("Original DataFrame:")
print(df)

# Apply lambda to a single column
df['Age_in_10_years'] = df['Age'].apply(lambda x: x + 10)
print("\nWith Age_in_10_years:")
print(df)

# Apply with conditional logic
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Senior')
print("\nWith Age_Group:")
print(df)

# Apply a function to each row of the DataFrame (axis=1 passes one row at a time)
df['Total_Score'] = df.apply(lambda row: (row['Age'] * 0.3) + (row['Salary'] / 1000), axis=1)
print("\nWith calculated Total_Score:")
print(df)

In [None]:
# Define custom function
def categorize_salary(salary):
    if salary < 55000:
        return 'Low'
    elif salary < 62000:
        return 'Medium'
    else:
        return 'High'

df['Salary_Category'] = df['Salary'].apply(categorize_salary)
print("With Salary_Category:")
print(df)

# Apply with multiple columns
def full_description(row):
    return f"{row['Name']} is {row['Age']} years old and earns ${row['Salary']:,}"

df['Description'] = df.apply(full_description, axis=1)
print("\nWith Description:")
print(df[['Name', 'Description']])
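
`apply()` calls a Python function once per element or row, which becomes slow on large data. Many of the transformations above have vectorized equivalents. The sketch below reproduces three of them on a hypothetical `staff` DataFrame, using the same thresholds as `categorize_salary`:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only
staff = pd.DataFrame({
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 55000, 65000]
})

# Arithmetic on a whole column replaces apply(lambda x: x + 10)
staff['Age_in_10_years'] = staff['Age'] + 10

# np.where replaces a two-way conditional lambda
staff['Age_Group'] = np.where(staff['Age'] < 30, 'Young', 'Senior')

# pd.cut replaces an if/elif/else chain over numeric ranges.
# right=False makes each bin include its left edge: [55000, 62000) is 'Medium'
staff['Salary_Category'] = pd.cut(
    staff['Salary'],
    bins=[-np.inf, 55000, 62000, np.inf],
    labels=['Low', 'Medium', 'High'],
    right=False,
)
print(staff)
```

Note that `pd.cut` returns a categorical column rather than plain strings; call `.astype(str)` if ordinary strings are needed.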

## Pivot Tables and Reshaping

Pivot tables summarize data in a grid, while `melt` and `pivot` convert data between wide and long formats:

**Documentation:** https://pandas.pydata.org/docs/user_guide/reshaping.html

In [None]:
# Create sample sales data
sales = pd.DataFrame({
    'Date': ['2024-01', '2024-01', '2024-02', '2024-02', '2024-03', '2024-03'],
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 150, 120, 180, 110, 160],
    'Region': ['East', 'East', 'West', 'West', 'East', 'East']
})

print("Sales data:")
print(sales)

# Create pivot table
pivot = sales.pivot_table(values='Sales', 
                          index='Date', 
                          columns='Product', 
                          aggfunc='sum')
print("\nPivot table (Sales by Date and Product):")
print(pivot)

# Pivot with multiple aggregations
pivot_multi = sales.pivot_table(values='Sales', 
                                index='Date', 
                                columns='Product', 
                                aggfunc=['sum', 'mean'])
print("\nPivot with multiple aggregations:")
print(pivot_multi)

In [None]:
# Melt (wide to long format)
df_wide = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Math': [90, 85, 95],
    'English': [88, 92, 89]
})

print("Wide format:")
print(df_wide)

df_long = pd.melt(df_wide, 
                  id_vars=['Name'], 
                  value_vars=['Math', 'English'],
                  var_name='Subject', 
                  value_name='Score')
print("\nLong format (melted):")
print(df_long)

# Pivot (long to wide format)
df_wide_again = df_long.pivot(index='Name', columns='Subject', values='Score')
print("\nBack to wide format:")
print(df_wide_again)
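
Two details are worth a short illustration: `pivot_table` can add totals with `margins=True`, and plain `pivot` does not aggregate, so it fails when an index and column pair occurs more than once. The `sales_demo` data and the `'Total'` label below are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only
sales_demo = pd.DataFrame({
    'Date': ['2024-01', '2024-01', '2024-02', '2024-02'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 120, 180]
})

# margins=True adds a totals row and column, named by margins_name
with_totals = sales_demo.pivot_table(values='Sales', index='Date',
                                     columns='Product', aggfunc='sum',
                                     margins=True, margins_name='Total')
print(with_totals)

# pivot() cannot reshape when a (Date, Product) pair is repeated
dupes = pd.concat([sales_demo, sales_demo.iloc[[0]]], ignore_index=True)
pivot_failed = False
try:
    dupes.pivot(index='Date', columns='Product', values='Sales')
except ValueError as exc:
    pivot_failed = True
    print(f"\npivot() failed: {exc}")
```

Use `pivot` when each cell is known to hold exactly one value, and `pivot_table` when values need to be aggregated.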

## Reading and Writing Data

Pandas can read/write data from/to various formats:

**Documentation:** https://pandas.pydata.org/docs/user_guide/io.html

In [None]:
# Create sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Paris', 'Tokyo'],
    'Salary': [50000, 60000, 55000, 65000]
})

# Write to CSV
df.to_csv('employees.csv', index=False)
print("Written to employees.csv")

# Read from CSV
df_from_csv = pd.read_csv('employees.csv')
print("\nRead from CSV:")
print(df_from_csv)

# Write to JSON
df.to_json('employees.json', orient='records', indent=2)
print("\nWritten to employees.json")

# Read from JSON
df_from_json = pd.read_json('employees.json')
print("\nRead from JSON:")
print(df_from_json)

# Common read_csv parameters:
# pd.read_csv('file.csv', sep=',', header=0, names=['col1', 'col2'], 
#             index_col=0, usecols=['col1', 'col2'], nrows=100)
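
The `read_csv` parameters listed above can be tried without creating a file by reading from an in-memory string. This is a sketch; the CSV content and column names are invented for the example.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data for illustration only
csv_text = """id;name;joined;score
1;Alice;2024-01-15;88.5
2;Bob;2024-02-20;92.0
3;Charlie;2024-03-25;79.5
"""

df_csv = pd.read_csv(
    io.StringIO(csv_text),             # any file path or file-like object works here
    sep=';',                           # field delimiter
    index_col='id',                    # use the 'id' column as the index
    usecols=['id', 'name', 'joined'],  # load only these columns
    parse_dates=['joined'],            # convert this column to datetime
    nrows=2,                           # read only the first two data rows
)
print(df_csv)
print(df_csv.dtypes)
```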

## String Operations

Pandas provides powerful string manipulation methods:

**Documentation:** https://pandas.pydata.org/docs/user_guide/text.html

In [None]:
# Create DataFrame with string data
df = pd.DataFrame({
    'Name': ['alice smith', 'BOB JONES', 'Charlie Brown', 'david LEE'],
    'Email': ['alice@example.com', 'BOB@EXAMPLE.COM', 'charlie@test.org', 'david@sample.net']
})

print("Original:")
print(df)

# Convert to lowercase
df['Name_lower'] = df['Name'].str.lower()

# Convert to uppercase
df['Name_upper'] = df['Name'].str.upper()

# Title case
df['Name_title'] = df['Name'].str.title()

print("\nWith case conversions:")
print(df[['Name', 'Name_lower', 'Name_upper', 'Name_title']])

# Extract domain from email
df['Domain'] = df['Email'].str.split('@').str[1]

# Check if contains substring (case=False so that 'BOB@EXAMPLE.COM' also matches)
df['Has_example'] = df['Email'].str.contains('example', case=False)

print("\nWith string operations:")
print(df[['Email', 'Domain', 'Has_example']])
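
Splitting one text column into several, and filtering on a prefix, follow the same `.str` pattern. A short sketch with a hypothetical `contacts` DataFrame; the `First` and `Last` column names are arbitrary:

```python
import pandas as pd

# Hypothetical data for illustration only
contacts = pd.DataFrame({'Name': ['alice smith', 'BOB JONES', 'Charlie Brown']})

# Normalize case first so later comparisons behave consistently
contacts['Name'] = contacts['Name'].str.title()

# expand=True returns one column per piece; n=1 splits at the first space only
contacts[['First', 'Last']] = contacts['Name'].str.split(' ', n=1, expand=True)
print(contacts)

# Boolean mask from a string test, used like any other filter
starts_with_b = contacts[contacts['Name'].str.startswith('B')]
print("\nNames starting with 'B':")
print(starts_with_b)
```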

## Date and Time Operations

Pandas has excellent support for working with dates and times:

**Documentation:** https://pandas.pydata.org/docs/user_guide/timeseries.html

In [None]:
# Create DataFrame with date strings
df = pd.DataFrame({
    'Date': ['2024-01-15', '2024-02-20', '2024-03-25', '2024-04-30'],
    'Sales': [1000, 1500, 1200, 1800]
})

print("Original:")
print(df)
print("\nDate dtype:", df['Date'].dtype)

# Convert to datetime
df['Date'] = pd.to_datetime(df['Date'])
print("\nAfter conversion:")
print(df)
print("Date dtype:", df['Date'].dtype)

# Extract date components
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['DayOfWeek'] = df['Date'].dt.dayofweek  # Monday=0, Sunday=6
df['DayName'] = df['Date'].dt.day_name()

print("\nWith extracted components:")
print(df)

In [None]:
# Date range
date_range = pd.date_range(start='2024-01-01', end='2024-01-10', freq='D')
print("Date range:")
print(date_range)

# Create DataFrame with date range
df_dates = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'Value': np.random.randint(100, 200, 10)
})

print("\nDataFrame with date range:")
print(df_dates)

# Set date as index (reassigning instead of using inplace=True)
df_dates = df_dates.set_index('Date')
print("\nWith date as index:")
print(df_dates)

# Select by date
print("\nData for 2024-01-05:")
print(df_dates.loc['2024-01-05'])
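
Once a column holds datetimes, filtering by period and aggregating by month take only a few lines. The sketch below uses a hypothetical `daily` DataFrame that spans a month boundary; grouping by `dt.to_period('M')` is one of several ways to get monthly totals.

```python
import pandas as pd

# Hypothetical data for illustration only: Jan 30, Jan 31, Feb 1, Feb 2
daily = pd.DataFrame({
    'Date': pd.date_range('2024-01-30', periods=4, freq='D'),
    'Value': [10, 20, 30, 40]
})

# Quarter label built from the numeric quarter (1 to 4)
daily['Quarter'] = 'Q' + daily['Date'].dt.quarter.astype(str)

# Filter one month with a boolean mask on the datetime column
january = daily[(daily['Date'] >= '2024-01-01') & (daily['Date'] < '2024-02-01')]
print("January rows:")
print(january)

# With a DatetimeIndex, a partial date string selects the whole month
january_by_index = daily.set_index('Date').loc['2024-01']

# Monthly totals: group by the calendar month of each date
monthly = daily.groupby(daily['Date'].dt.to_period('M'))['Value'].sum()
print("\nMonthly totals:")
print(monthly)
```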

## Basic Visualization

Pandas has built-in plotting through the `.plot()` method, which uses Matplotlib as its default backend:

In [None]:
import matplotlib.pyplot as plt

# Create sample data
df = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'Sales': [1000, 1500, 1200, 1800, 2100, 1900],
    'Expenses': [800, 900, 850, 950, 1100, 1000]
})

# Line plot
df.plot(x='Month', y=['Sales', 'Expenses'], kind='line', figsize=(10, 5))
plt.title('Sales vs Expenses')
plt.ylabel('Amount ($)')
plt.grid(True)
plt.show()

# Bar plot
df.plot(x='Month', y='Sales', kind='bar', figsize=(10, 5), color='skyblue')
plt.title('Monthly Sales')
plt.ylabel('Sales ($)')
plt.xticks(rotation=0)
plt.show()

## Practical Exercises

These exercises are designed to help you practice the concepts covered in this notebook. They progress from basic to advanced topics.

**How to use these exercises:**
1. Read the problem description carefully
2. Try to solve it on your own first
3. Use the sample data provided
4. Check your understanding by running the code
5. Don't be afraid to look back at the examples above

**Tip:** Create a copy of this notebook and add your solutions in new cells below each exercise!

#### Exercise 1: Create and Explore a DataFrame

**Task:** Create a DataFrame with information about students and calculate basic statistics.

**Requirements:**
1. Create a DataFrame with at least 5 students
2. Include columns: Name, Age, Math_Grade, English_Grade, Science_Grade
3. Calculate the average grade for each student (add a new column called 'Average')
4. Find the student with the highest average grade
5. Calculate the mean, max, and min for each subject across all students

**Sample data to get you started:**

In [None]:
# Exercise 1: Your solution here
# Sample data structure:
students_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [18, 19, 18, 20, 19],
    'Math_Grade': [85, 92, 78, 88, 95],
    'English_Grade': [90, 85, 92, 87, 89],
    'Science_Grade': [88, 90, 85, 91, 93]
}

# TODO: Create DataFrame
# TODO: Add 'Average' column
# TODO: Find student with highest average
# TODO: Calculate statistics for each subject

#### Exercise 2: Selection with loc and iloc

**Task:** Practice different selection methods using the DataFrame below.

**Requirements:**
1. Select the row for 'Bob' using `.loc[]`
2. Select the first 3 rows using `.iloc[]`
3. Select Age and Salary columns for employees older than 30
4. Select Name and City for the employees at integer positions 2 and 4 (zero-based) using `.iloc[]`
5. Explain the difference between your `.loc[]` and `.iloc[]` selections

In [None]:
# Exercise 2: Sample data
employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'Age': [25, 30, 35, 28, 32, 40],
    'City': ['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix', 'Miami'],
    'Salary': [50000, 60000, 55000, 65000, 58000, 70000]
}, index=['E001', 'E002', 'E003', 'E004', 'E005', 'E006'])

print("Employee Data:")
print(employees)

# TODO: Your selections here

#### Exercise 3: Filtering with Boolean Indexing

**Task:** Filter data based on multiple conditions.

**Requirements:**
1. Find all products with price > 500 AND stock < 50
2. Find products that are either 'Electronics' OR have rating >= 4.5
3. Find products whose name starts with 'L'
4. Count how many products meet each condition

In [None]:
# Exercise 3: Sample data
products = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Laptop Pro', 'Headphones'],
    'Category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Electronics', 'Accessories'],
    'Price': [999, 25, 75, 350, 1500, 120],
    'Stock': [30, 200, 150, 45, 15, 80],
    'Rating': [4.5, 4.2, 4.7, 4.6, 4.8, 4.3]
})

print("Product Catalog:")
print(products)

# TODO: Apply filters here

#### Exercise 4: Sorting Data

**Task:** Sort data by different criteria.

**Requirements:**
1. Sort by Salary in descending order
2. Sort by Department (ascending) and then by Salary (descending)
3. Find the top 3 highest-paid employees
4. Sort by Age and reset the index

In [None]:
# Exercise 4: Sample data
company = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
    'Department': ['IT', 'Sales', 'IT', 'Sales', 'HR', 'IT', 'HR'],
    'Age': [25, 30, 35, 28, 32, 40, 27],
    'Salary': [70000, 65000, 80000, 68000, 62000, 85000, 60000]
})

print("Company Data:")
print(company)

# TODO: Perform sorting operations

#### Exercise 5: Handling Missing Data

**Task:** Clean a messy dataset with missing values.

**Requirements:**
1. Identify which columns have missing values and how many
2. Fill missing ages with the median age
3. Fill missing cities with 'Unknown'
4. Drop rows where Salary is missing
5. Calculate the percentage of data that was missing before cleaning

In [None]:
# Exercise 5: Sample data with missing values
messy_data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'Age': [25, np.nan, 35, 28, np.nan, 40],
    'City': ['NYC', 'LA', None, 'Houston', None, 'Miami'],
    'Salary': [50000, 60000, np.nan, 65000, 58000, 70000],
    'Department': ['IT', 'Sales', 'IT', None, 'HR', 'IT']
})

print("Messy Data:")
print(messy_data)
print("\nMissing values:")
print(messy_data.isnull().sum())

# TODO: Clean the data following the requirements

#### Exercise 6: Removing Duplicates

**Task:** Identify and handle duplicate records in a transaction log.

**Requirements:**
1. Find all duplicate transactions (complete duplicates)
2. Find transactions that have the same Customer and Product (partial duplicates)
3. Keep only the first occurrence of each complete duplicate
4. For partial duplicates based on Customer+Product, keep the one with the highest Amount
5. Report how many duplicates were found and removed

In [None]:
# Exercise 6: Sample transaction data
transactions = pd.DataFrame({
    'TransactionID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice', 'Diana', 'Charlie'],
    'Product': ['Laptop', 'Mouse', 'Laptop', 'Keyboard', 'Mouse', 'Laptop', 'Monitor', 'Keyboard'],
    'Amount': [999, 25, 999, 75, 25, 1200, 350, 75],
    'Date': ['2024-01-15', '2024-01-16', '2024-01-15', '2024-01-17', 
             '2024-01-16', '2024-01-18', '2024-01-19', '2024-01-17']
})

print("Transaction Log:")
print(transactions)

# TODO: Handle duplicates following the requirements

#### Exercise 7: Value Counts and Frequency Analysis

**Task:** Analyze sales data to understand product and regional distribution.

**Requirements:**
1. Count how many times each product was sold
2. Show the percentage distribution of sales by region
3. Create a crosstab showing Product vs Status
4. Find the most popular product in each region
5. Identify products that were never marked as 'Returned'

In [None]:
# Exercise 7: Sample sales data
sales_log = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Mouse', 'Monitor', 
                'Laptop', 'Keyboard', 'Mouse', 'Monitor', 'Laptop', 'Mouse'],
    'Region': ['North', 'South', 'North', 'East', 'North', 'West', 
               'South', 'East', 'South', 'North', 'West', 'East'],
    'Status': ['Sold', 'Sold', 'Sold', 'Sold', 'Returned', 'Sold', 
               'Sold', 'Sold', 'Sold', 'Returned', 'Sold', 'Sold'],
    # One quantity per row: all columns must have the same length (12)
    'Quantity': [1, 2, 1, 1, 1, 1,
                 1, 3, 1, 1, 2, 1]
})

print("Sales Log:")
print(sales_log)

# TODO: Perform frequency analysis

#### Exercise 8: String Operations

**Task:** Clean and standardize customer email data.

**Requirements:**
1. Convert all names to Title Case (first letter of each word uppercase, the rest lowercase)
2. Convert all emails to lowercase
3. Extract the domain from each email (part after @)
4. Create a boolean column 'Has_Gmail' that is True if email contains 'gmail'
5. Split the Name into 'First_Name' and 'Last_Name' columns
6. Count how many customers use each email domain

In [None]:
# Exercise 8: Sample customer data
customers = pd.DataFrame({
    'Name': ['alice SMITH', 'bob jones', 'CHARLIE BROWN', 'diana PRINCE'],
    'Email': ['Alice.Smith@GMAIL.com', 'bob@yahoo.com', 'charlie@GMAIL.COM', 'diana@outlook.com'],
    'Phone': ['555-0101', '555-0102', '555-0103', '555-0104']
})

print("Customer Data:")
print(customers)

# TODO: Clean and extract information from strings

#### Exercise 9: Date and Time Operations

**Task:** Analyze sales trends over time.

**Requirements:**
1. Convert the Date column to datetime format
2. Extract Year, Month, Day, and Day of Week
3. Add a column for the Quarter (Q1, Q2, Q3, Q4)
4. Filter sales from January 2024
5. Calculate total sales by month
6. Find the day of the week with the highest average sales

In [None]:
# Exercise 9: Sample time series data
sales_dates = pd.DataFrame({
    'Date': ['2024-01-15', '2024-01-22', '2024-02-10', '2024-02-18', 
             '2024-03-05', '2024-03-20', '2024-04-12', '2024-04-25'],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Laptop', 'Mouse', 'Keyboard', 'Laptop'],
    'Sales_Amount': [999, 25, 75, 350, 1200, 30, 80, 1100]
})

print("Sales with Dates:")
print(sales_dates)

# TODO: Perform date operations and analysis

#### Exercise 10: GroupBy and Aggregation

**Task:** Analyze employee data by department.

**Requirements:**
1. Calculate average, min, and max salary by Department
2. Count the number of employees in each Department
3. Find the department with the highest average salary
4. Group by Department and Experience level, calculate mean salary
5. Create a summary showing total salary cost per department

In [None]:
# Exercise 10: Sample employee data
employees_full = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'Department': ['IT', 'Sales', 'IT', 'Sales', 'HR', 'IT', 'HR', 'Sales'],
    'Experience': ['Junior', 'Senior', 'Senior', 'Junior', 'Senior', 'Senior', 'Junior', 'Senior'],
    'Age': [25, 30, 35, 28, 32, 40, 27, 38],
    'Salary': [60000, 75000, 90000, 65000, 70000, 95000, 58000, 80000]
})

print("Employee Data:")
print(employees_full)

# TODO: Perform group by operations and analysis

#### Exercise 11: Apply and Lambda Functions

**Task:** Create custom calculations and categories.

**Requirements:**
1. Create a 'Salary_Category' column: Low (<65k), Medium (65-85k), High (>85k)
2. Create an 'Age_Group' column: Young (<30), Mid (30-35), Senior (>35)
3. Calculate a 'Performance_Score' using: (Salary/1000) * 0.7 + Age * 0.3
4. Create a 'Full_Description' column with format: "Name, Age years, Dept department"
5. Use apply to add 10% bonus to salaries in the IT department only

In [None]:
# Exercise 11: Use the employees_full data from Exercise 10
print("Employee Data:")
print(employees_full)

# TODO: Create custom columns using apply and lambda

#### Exercise 12: Merging DataFrames

**Task:** Combine data from multiple sources about a company.

**Requirements:**
1. Merge employees with departments (many-to-one)
2. Merge the result with salaries (one-to-one)
3. Perform an outer join to see which employees don't have salary data
4. Use the indicator parameter to track the merge status
5. Calculate the average salary by department location

In [None]:
# Exercise 12: Sample data from multiple sources

# Employee basic info
emp_info = pd.DataFrame({
    'EmployeeID': [101, 102, 103, 104, 105, 106],
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'DepartmentID': [1, 2, 1, 3, 2, 1]
})

# Department information
dept_info = pd.DataFrame({
    'DepartmentID': [1, 2, 3, 4],
    'Department': ['IT', 'Sales', 'HR', 'Marketing'],
    'Location': ['New York', 'Los Angeles', 'Chicago', 'Boston']
})

# Salary information (incomplete - not all employees)
salary_info = pd.DataFrame({
    'EmployeeID': [101, 102, 103, 105],
    'Salary': [70000, 75000, 90000, 70000],
    'Bonus': [5000, 7000, 9000, 7000]
})

print("Employee Info:")
print(emp_info)
print("\nDepartment Info:")
print(dept_info)
print("\nSalary Info:")
print(salary_info)

# TODO: Merge the DataFrames and perform analysis

#### Exercise 13: Pivot Tables

**Task:** Create pivot tables to analyze multi-dimensional sales data.

**Requirements:**
1. Create a pivot table showing total Sales by Product (rows) and Region (columns)
2. Create a pivot table showing average Sales by Month and Product
3. Add row and column totals (margins=True)
4. Find which Product-Region combination has the highest sales
5. Calculate what percentage each region contributes to total sales

In [None]:
# Exercise 13: Sample multi-dimensional sales data
sales_multi = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=20, freq='W'),
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Laptop'] * 4,
    'Region': ['North', 'North', 'South', 'East', 'West'] * 4,
    'Sales': [1200, 45, 120, 680, 1300, 1100, 50, 110, 700, 1250,
              1150, 48, 115, 690, 1280, 1220, 52, 125, 710, 1290],
    'Units': [2, 3, 4, 2, 2, 2, 4, 3, 2, 2,
              2, 3, 3, 2, 2, 2, 4, 4, 2, 2]
})

print("Multi-dimensional Sales Data:")
print(sales_multi.head(10))
print(f"\n... ({len(sales_multi)} total rows)")

# TODO: Create and analyze pivot tables

## Summary

Pandas is essential for:
- Data manipulation and cleaning
- Exploratory data analysis
- Data preparation for machine learning
- Working with structured data

**Key Takeaways:**
- DataFrame and Series are the core data structures
- Always use `.loc[]` or `.iloc[]` for explicit row selection
- Powerful selection, filtering, and indexing capabilities
- GroupBy enables split-apply-combine operations
- Easy handling of missing data and duplicates
- Built-in support for merging, joining, and reshaping
- Excellent time series functionality
- Direct integration with visualization libraries
- `apply()` and lambda for custom transformations
- Use vectorized operations when possible for better performance

**Next Steps:**
- Practice with real datasets (Kaggle, UCI ML Repository)
- Learn data visualization with Matplotlib and Seaborn
- Explore advanced pandas features (MultiIndex, window functions)
- Study data cleaning and preprocessing techniques
- Combine with NumPy for numerical operations

**Additional Resources:**
- Pandas Official Tutorial: https://pandas.pydata.org/docs/getting_started/intro_tutorials/
- 10 Minutes to Pandas: https://pandas.pydata.org/docs/user_guide/10min.html
- Pandas Cheat Sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- Pandas API Reference: https://pandas.pydata.org/docs/reference/index.html