# Merging and Joining DataFrames

## Learning Objectives

By the end of this notebook, you will be able to:

1. Concatenate DataFrames vertically and horizontally using `concat()`
2. Merge DataFrames using `merge()` with different join types
3. Use the `join()` method for index-based joining
4. Handle different key scenarios (one-to-one, one-to-many, many-to-many)
5. Resolve column name conflicts when merging
6. Choose the appropriate method for different use cases

---

## Setup

In [None]:
import pandas as pd
import numpy as np

# Set display options
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 100)

---

## 1. Concatenation with `concat()`

`concat()` combines DataFrames by stacking them either vertically (row-wise) or horizontally (column-wise).

### 1.1 Vertical Concatenation (Stacking Rows)

In [None]:
# Create sample DataFrames
df1 = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [25, 30],
    'city': ['NYC', 'LA']
})

df2 = pd.DataFrame({
    'name': ['Charlie', 'Diana'],
    'age': [35, 28],
    'city': ['Chicago', 'Houston']
})

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

In [None]:
# Concatenate vertically (default axis=0)
result = pd.concat([df1, df2])
print("Concatenated (vertical):")
print(result)

In [None]:
# Reset the index
result = pd.concat([df1, df2], ignore_index=True)
print("Concatenated with reset index:")
print(result)

In [None]:
# Add keys to identify source
result = pd.concat([df1, df2], keys=['batch1', 'batch2'])
print("Concatenated with keys:")
print(result)
print(f"\nIndex: {result.index}")
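
The `keys` argument turns the row index into a two-level `MultiIndex`, which is useful only if you know how to read it back. The sketch below shows two common follow-ups: selecting one source with `.loc`, and flattening the source label into an ordinary column. The frame names and the level names `source` and `row` are illustrative assumptions, not part of the cells above.

```python
import pandas as pd

batch1 = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
batch2 = pd.DataFrame({'name': ['Charlie', 'Diana'], 'age': [35, 28]})

# Label each source and name the two index levels (names are illustrative)
combined = pd.concat(
    [batch1, batch2],
    keys=['batch1', 'batch2'],
    names=['source', 'row'],
)

# Select every row that came from the first DataFrame
first_batch = combined.loc['batch1']
print(first_batch)

# Turn the 'source' level into a regular column and renumber the rows
flat = combined.reset_index(level='source').reset_index(drop=True)
print(flat)
```

Keeping the source as a column is often easier to filter and group on than a `MultiIndex`, at the cost of one extra column.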

### 1.2 Handling Different Columns

In [None]:
# DataFrames with different columns
df1 = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [25, 30],
    'salary': [50000, 60000]
})

df2 = pd.DataFrame({
    'name': ['Charlie', 'Diana'],
    'age': [35, 28],
    'department': ['Sales', 'HR']
})

print("DataFrame 1 (has salary):")
print(df1)
print("\nDataFrame 2 (has department):")
print(df2)

In [None]:
# Default: outer join (keep all columns, fill missing values with NaN)
result = pd.concat([df1, df2], ignore_index=True)
print("Outer join (default):")
print(result)

In [None]:
# Inner join (keep only common columns)
result = pd.concat([df1, df2], join='inner', ignore_index=True)
print("Inner join:")
print(result)

### 1.3 Horizontal Concatenation (Adding Columns)

In [None]:
# Create DataFrames with same index
df1 = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
}, index=['a', 'b', 'c'])

df2 = pd.DataFrame({
    'salary': [50000, 60000, 70000],
    'city': ['NYC', 'LA', 'Chicago']
}, index=['a', 'b', 'c'])

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

In [None]:
# Concatenate horizontally (axis=1)
result = pd.concat([df1, df2], axis=1)
print("Concatenated (horizontal):")
print(result)

In [None]:
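# With different indexes: rows are aligned on index labels (outer join by default),
# so labels missing from one DataFrame are filled with NaN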
df3 = pd.DataFrame({
    'bonus': [5000, 7000]
}, index=['a', 'd'])  # 'd' doesn't exist in df1

result = pd.concat([df1, df3], axis=1)
print("Horizontal concat with different indexes:")
print(result)
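
The `join` option shown earlier for vertical stacking also applies to horizontal concatenation, where it controls which index labels survive. A minimal sketch, using illustrative frame names (`people`, `bonuses`), that compares the default outer alignment with `join='inner'`:

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
}, index=['a', 'b', 'c'])

bonuses = pd.DataFrame({
    'bonus': [5000, 7000]
}, index=['a', 'd'])  # 'd' has no match in people

# Default (outer): keep the union of index labels, fill gaps with NaN
outer = pd.concat([people, bonuses], axis=1)
print(outer)

# Inner: keep only the labels present in both DataFrames
inner = pd.concat([people, bonuses], axis=1, join='inner')
print(inner)
```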

---

## 2. Merging with `merge()`

`merge()` combines DataFrames by matching values in one or more key columns, much like a SQL JOIN.

### 2.1 Basic Merge

In [None]:
# Create sample DataFrames
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'dept_id': [101, 102, 101, 103, 102]
})

departments = pd.DataFrame({
    'dept_id': [101, 102, 103, 104],
    'dept_name': ['Engineering', 'Marketing', 'HR', 'Finance']
})

print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)

In [None]:
# Basic merge (inner join by default)
result = pd.merge(employees, departments, on='dept_id')
print("Merged (inner join):")
print(result)

### 2.2 Join Types

In [None]:
# Create DataFrames with non-matching keys
left = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value_left': [1, 2, 3, 4]
})

right = pd.DataFrame({
    'key': ['B', 'C', 'D', 'E'],
    'value_right': [5, 6, 7, 8]
})

print("Left:")
print(left)
print("\nRight:")
print(right)

In [None]:
# Inner join (only matching keys)
result = pd.merge(left, right, on='key', how='inner')
print("Inner join (default):")
print(result)

In [None]:
# Left join (all keys from left)
result = pd.merge(left, right, on='key', how='left')
print("Left join:")
print(result)

In [None]:
# Right join (all keys from right)
result = pd.merge(left, right, on='key', how='right')
print("Right join:")
print(result)

In [None]:
# Outer join (all keys from both)
result = pd.merge(left, right, on='key', how='outer')
print("Outer join:")
print(result)

### 2.3 Merging on Different Column Names

In [None]:
# DataFrames with different key column names
employees = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'department_code': [101, 102, 101]
})

departments = pd.DataFrame({
    'dept_id': [101, 102, 103],
    'dept_name': ['Engineering', 'Marketing', 'HR']
})

print("Employees (key: department_code):")
print(employees)
print("\nDepartments (key: dept_id):")
print(departments)

In [None]:
# Use left_on and right_on
result = pd.merge(employees, departments, 
                  left_on='department_code', 
                  right_on='dept_id')
print("Merged with different key names:")
print(result)
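
When the key columns have different names, the merged result keeps both of them, even though they hold the same values for every matched row. A short sketch of one way to tidy this up, dropping the redundant right-hand key after the merge (the variable names `merged` and `tidy` are illustrative):

```python
import pandas as pd

employees = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'department_code': [101, 102, 101]
})

departments = pd.DataFrame({
    'dept_id': [101, 102, 103],
    'dept_name': ['Engineering', 'Marketing', 'HR']
})

merged = pd.merge(employees, departments,
                  left_on='department_code',
                  right_on='dept_id')
print(merged.columns.tolist())  # both key columns are present

# Drop the duplicate key so only one department identifier remains
tidy = merged.drop(columns='dept_id')
print(tidy)
```

An alternative is to rename one key column before merging so that a plain `on=` works; which is cleaner depends on which name you want to keep downstream.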

### 2.4 Merging on Multiple Keys

In [None]:
# Create DataFrames with composite keys
sales = pd.DataFrame({
    'year': [2023, 2023, 2024, 2024],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'revenue': [100, 150, 120, 180]
})

targets = pd.DataFrame({
    'year': [2023, 2023, 2024, 2024],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'target': [90, 140, 110, 170]
})

print("Sales:")
print(sales)
print("\nTargets:")
print(targets)

In [None]:
# Merge on multiple columns
result = pd.merge(sales, targets, on=['year', 'quarter'])
result['pct_of_target'] = (result['revenue'] / result['target'] * 100).round(1)
print("Merged on multiple keys:")
print(result)

### 2.5 Handling Duplicate Column Names

In [None]:
# DataFrames with same column names
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'value': [10, 20, 30],
    'date': ['2024-01', '2024-02', '2024-03']
})

df2 = pd.DataFrame({
    'id': [1, 2, 3],
    'value': [100, 200, 300],
    'date': ['2024-01', '2024-02', '2024-03']
})

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

In [None]:
# Default suffixes
result = pd.merge(df1, df2, on='id')
print("Merged with default suffixes:")
print(result)

In [None]:
# Custom suffixes (named after the source DataFrames)
result = pd.merge(df1, df2, on='id', suffixes=('_df1', '_df2'))
print("Merged with custom suffixes:")
print(result)
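
In the example above, `date` holds the same values in both DataFrames, so suffixing it only produces two identical columns. One option, sketched here under the assumption that `id` and `date` together identify a row, is to merge on both columns so that only `value` needs a suffix. The frame names `first` and `second` and the suffixes are illustrative.

```python
import pandas as pd

first = pd.DataFrame({
    'id': [1, 2, 3],
    'value': [10, 20, 30],
    'date': ['2024-01', '2024-02', '2024-03']
})

second = pd.DataFrame({
    'id': [1, 2, 3],
    'value': [100, 200, 300],
    'date': ['2024-01', '2024-02', '2024-03']
})

# Treat 'date' as part of the key: it appears once in the result,
# and only the genuinely conflicting 'value' column gets suffixes
merged = pd.merge(first, second, on=['id', 'date'],
                  suffixes=('_first', '_second'))
print(merged)
```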

---

## 3. Different Key Scenarios

### 3.1 One-to-One Merge

In [None]:
# One employee per record, one salary per employee
employees = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

salaries = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'salary': [70000, 80000, 90000]
})

result = pd.merge(employees, salaries, on='emp_id')
print("One-to-One merge:")
print(result)

### 3.2 One-to-Many Merge

In [None]:
# One department, many employees
departments = pd.DataFrame({
    'dept_id': [1, 2],
    'dept_name': ['Engineering', 'Marketing']
})

employees = pd.DataFrame({
    'emp_id': [101, 102, 103, 104],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'dept_id': [1, 1, 2, 2]
})

result = pd.merge(departments, employees, on='dept_id')
print("One-to-Many merge:")
print(result)

### 3.3 Many-to-Many Merge

In [None]:
# Students can be in multiple courses, courses have multiple students
students = pd.DataFrame({
    'student_id': [1, 1, 2, 2, 3],
    'course': ['Math', 'Science', 'Math', 'History', 'Science'],
    'grade': ['A', 'B', 'B', 'A', 'A']
})

courses = pd.DataFrame({
    'course': ['Math', 'Math', 'Science', 'Science', 'History'],
    'teacher': ['Smith', 'Jones', 'Brown', 'Davis', 'Wilson'],
    'room': [101, 102, 201, 202, 301]
})

print("Students:")
print(students)
print("\nCourses:")
print(courses)

In [None]:
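# Many-to-many: every left row is paired with every right row that shares its key,
# so each key contributes (left matches x right matches) rows to the result
result = pd.merge(students, courses, on='course')
print("Many-to-Many merge (all pairings of matching rows):")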
print(result)
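
Many-to-many merges are sometimes intended, but they are also a common source of accidental row duplication. `merge()` accepts a `validate` argument that checks the key relationship and raises `pandas.errors.MergeError` when the data does not match it. The sketch below reuses the student and course data and shows both a passing and a failing check; the flag name `validation_failed` is illustrative.

```python
import pandas as pd
from pandas.errors import MergeError

students = pd.DataFrame({
    'student_id': [1, 1, 2, 2, 3],
    'course': ['Math', 'Science', 'Math', 'History', 'Science'],
    'grade': ['A', 'B', 'B', 'A', 'A']
})

courses = pd.DataFrame({
    'course': ['Math', 'Math', 'Science', 'Science', 'History'],
    'teacher': ['Smith', 'Jones', 'Brown', 'Davis', 'Wilson'],
    'room': [101, 102, 201, 202, 301]
})

# Explicitly allow many-to-many: 2*2 (Math) + 2*2 (Science) + 1*1 (History) rows
result = pd.merge(students, courses, on='course', validate='many_to_many')
print(len(students), len(courses), len(result))  # 5 5 9

# Require unique keys on the right side: fails because courses repeat
validation_failed = False
try:
    pd.merge(students, courses, on='course', validate='many_to_one')
except MergeError as exc:
    validation_failed = True
    print(f"Merge rejected: {exc}")
```

Adding `validate` to merges where you expect unique keys turns a silent row explosion into an immediate error.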

---

## 4. Index-Based Joining with `join()`

The `join()` method is a convenient shorthand for combining DataFrames on their indexes. It performs a left join by default.

In [None]:
# Create DataFrames with meaningful indexes
df1 = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
}, index=['E001', 'E002', 'E003'])

df2 = pd.DataFrame({
    'salary': [70000, 80000, 90000],
    'department': ['Eng', 'Mkt', 'HR']
}, index=['E001', 'E002', 'E004'])  # Note: E004 instead of E003

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

In [None]:
# Left join on index (default)
result = df1.join(df2)
print("Join (left, default):")
print(result)

In [None]:
# Inner join on index
result = df1.join(df2, how='inner')
print("Join (inner):")
print(result)

In [None]:
# Outer join on index
result = df1.join(df2, how='outer')
print("Join (outer):")
print(result)

In [None]:
# Join a column of the left DataFrame against the index of the right DataFrame
df1 = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'dept_id': ['D1', 'D2', 'D1']
})

df2 = pd.DataFrame({
    'dept_name': ['Engineering', 'Marketing']
}, index=['D1', 'D2'])

result = df1.join(df2, on='dept_id')
print("Join column on index:")
print(result)
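
It can help to see `join()` as a thin layer over `merge()`. The sketch below, using illustrative frame names (`staff`, `dept_names`), expresses the same column-to-index join both ways and checks that the results match.

```python
import pandas as pd

staff = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'dept_id': ['D1', 'D2', 'D1']
})

dept_names = pd.DataFrame({
    'dept_name': ['Engineering', 'Marketing']
}, index=['D1', 'D2'])

# join(): match the 'dept_id' column against the other frame's index
via_join = staff.join(dept_names, on='dept_id')

# merge(): the same operation spelled out explicitly
via_merge = pd.merge(staff, dept_names,
                     left_on='dept_id', right_index=True,
                     how='left')

print(via_join)
print(via_join.equals(via_merge))
```

Note the explicit `how='left'` in the `merge()` call: `join()` defaults to a left join, while `merge()` defaults to an inner join.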

---

## 5. Merge Indicator

In [None]:
# Use indicator to see merge source
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
right = pd.DataFrame({'key': ['B', 'C', 'D'], 'value': [4, 5, 6]})

result = pd.merge(left, right, on='key', how='outer', 
                  indicator=True, suffixes=('_left', '_right'))
print("Outer merge with indicator:")
print(result)

In [None]:
# Filter by merge indicator
print("\nOnly in left:")
print(result[result['_merge'] == 'left_only'])

print("\nOnly in right:")
print(result[result['_merge'] == 'right_only'])

print("\nIn both:")
print(result[result['_merge'] == 'both'])

---

## 6. Practical Examples

In [None]:
# Create a small, realistic dataset (values are fixed, so no random seed is needed)

# Customers table
customers = pd.DataFrame({
    'customer_id': range(1, 6),
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'city': ['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix']
})

# Orders table
orders = pd.DataFrame({
    'order_id': range(101, 109),
    'customer_id': [1, 2, 1, 3, 2, 1, 4, 6],  # Note: customer 6 doesn't exist
    'product': ['Widget', 'Gadget', 'Widget', 'Gizmo', 'Widget', 'Gadget', 'Gizmo', 'Widget'],
    'amount': [100, 200, 150, 300, 120, 180, 250, 90]
})

# Products table
products = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Gizmo'],
    'category': ['Electronics', 'Electronics', 'Home'],
    'unit_cost': [20, 40, 60]
})

print("Customers:")
print(customers)
print("\nOrders:")
print(orders)
print("\nProducts:")
print(products)

In [None]:
# Join orders with customer information
orders_with_customers = pd.merge(orders, customers, on='customer_id', how='left')
print("Orders with customer info:")
print(orders_with_customers)
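
Because this is a left join, the order placed by customer 6 is kept even though no customer record matches, and its customer columns are filled with NaN. A minimal sketch of isolating such rows, using a trimmed-down copy of the tables above (the variable name `unmatched` is illustrative):

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 108],
    'customer_id': [1, 2, 6],  # customer 6 has no customer record
    'amount': [100, 200, 90]
})

orders_with_customers = pd.merge(orders, customers,
                                 on='customer_id', how='left')

# Rows where the customer lookup found nothing have NaN in 'name'
unmatched = orders_with_customers[orders_with_customers['name'].isna()]
print(unmatched)
```

Checking for unmatched rows right after a left join is a cheap way to catch missing reference data before it propagates into later calculations.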

In [None]:
# Add product information
full_orders = pd.merge(orders_with_customers, products, on='product')
print("Full order details:")
print(full_orders)

In [None]:
# Calculate profit per order (amount minus unit cost, assuming one unit per order)
full_orders['profit'] = full_orders['amount'] - full_orders['unit_cost']
print("Orders with profit:")
print(full_orders[['order_id', 'name', 'product', 'amount', 'unit_cost', 'profit']])

In [None]:
# Find customers without orders
all_customers = pd.merge(customers, orders, on='customer_id', how='left', indicator=True)
customers_no_orders = all_customers[all_customers['_merge'] == 'left_only']['name']
print("Customers without orders:")
print(customers_no_orders.values)

---

## Exercises

In [None]:
# Exercise data
# Employees table
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'dept_id': [101, 102, 101, 103, 102],
    'manager_id': [None, 1, 1, 2, 2]
})

# Departments table
departments = pd.DataFrame({
    'dept_id': [101, 102, 103, 104],
    'dept_name': ['Engineering', 'Sales', 'HR', 'Finance'],
    'budget': [500000, 300000, 200000, 400000]
})

# Salaries table (historical)
salaries = pd.DataFrame({
    'emp_id': [1, 1, 2, 2, 3, 4, 5],
    'year': [2023, 2024, 2023, 2024, 2024, 2024, 2024],
    'salary': [70000, 75000, 60000, 65000, 80000, 55000, 62000]
})

# Projects table
projects = pd.DataFrame({
    'project_id': ['P1', 'P2', 'P3'],
    'project_name': ['Alpha', 'Beta', 'Gamma'],
    'dept_id': [101, 102, 101]
})

# Project assignments (many-to-many)
assignments = pd.DataFrame({
    'emp_id': [1, 1, 2, 3, 3, 4],
    'project_id': ['P1', 'P3', 'P2', 'P1', 'P3', 'P2'],
    'hours': [100, 50, 80, 120, 60, 90]
})

print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)
print("\nSalaries:")
print(salaries)
print("\nProjects:")
print(projects)
print("\nAssignments:")
print(assignments)

### Exercise 1: Basic Merge

Create a DataFrame that shows each employee's name along with their department name.

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
result = pd.merge(employees[['emp_id', 'name', 'dept_id']], 
                  departments[['dept_id', 'dept_name']], 
                  on='dept_id')
print("Employees with department names:")
print(result)
```
</details>

### Exercise 2: Find Departments Without Employees

Use a merge with indicator to find departments that have no employees assigned.

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
result = pd.merge(departments, employees, on='dept_id', how='left', indicator=True)
empty_depts = result[result['_merge'] == 'left_only']['dept_name'].unique()
print("Departments without employees:")
print(empty_depts)
```
</details>

### Exercise 3: Get 2024 Salaries

Create a DataFrame showing each employee's name and their 2024 salary. Include employees even if they don't have a 2024 salary record.

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
# Filter salaries for 2024
salaries_2024 = salaries[salaries['year'] == 2024]

# Left join to keep all employees
result = pd.merge(employees[['emp_id', 'name']], 
                  salaries_2024[['emp_id', 'salary']], 
                  on='emp_id', 
                  how='left')
print("Employee 2024 salaries:")
print(result)
```
</details>

### Exercise 4: Project Details

Create a comprehensive project report showing:
- Project name
- Department name
- Employee name
- Hours assigned

This requires multiple merges.

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
# Step 1: Merge assignments with projects
step1 = pd.merge(assignments, projects, on='project_id')

# Step 2: Add employee names
step2 = pd.merge(step1, employees[['emp_id', 'name']], on='emp_id')

# Step 3: Add department names
result = pd.merge(step2, departments[['dept_id', 'dept_name']], on='dept_id')

# Select and order columns
result = result[['project_name', 'dept_name', 'name', 'hours']]
result = result.sort_values(['project_name', 'name'])

print("Project Report:")
print(result)
```
</details>

### Exercise 5: Concatenation

The company acquired a new division. Concatenate the new employees with the existing ones:

```python
new_employees = pd.DataFrame({
    'emp_id': [6, 7, 8],
    'name': ['Frank', 'Grace', 'Henry'],
    'dept_id': [104, 104, 103],
    'manager_id': [None, 6, 6]
})
```

Create a combined employee list with a new column indicating whether each employee is from the 'Original' or 'Acquired' group.

In [None]:
# Your code here
new_employees = pd.DataFrame({
    'emp_id': [6, 7, 8],
    'name': ['Frank', 'Grace', 'Henry'],
    'dept_id': [104, 104, 103],
    'manager_id': [None, 6, 6]
})


<details>
<summary>Click to reveal solution</summary>

```python
# Add source column to each DataFrame
employees_orig = employees.copy()
employees_orig['source'] = 'Original'

new_employees_copy = new_employees.copy()
new_employees_copy['source'] = 'Acquired'

# Concatenate
all_employees = pd.concat([employees_orig, new_employees_copy], ignore_index=True)

print("Combined employee list:")
print(all_employees)
```
</details>

---

## Summary

In this notebook, you learned:

1. **`concat()`**:
   - Vertical concatenation (axis=0): Stack rows
   - Horizontal concatenation (axis=1): Add columns
   - Options: ignore_index, keys, join

2. **`merge()`**:
   - Join types: inner, left, right, outer
   - Key options: on, left_on, right_on
   - Multiple keys for composite joins
   - Handling duplicate columns with suffixes
   - Using indicator for debugging

3. **Key Scenarios**:
   - One-to-one: Unique keys on both sides
   - One-to-many: Unique keys on one side
   - Many-to-many: Duplicate keys on both sides; produces every pairing of matching rows

4. **`join()`**:
   - Convenient for index-based joins
   - Can join a column of the left DataFrame to the index of the right

5. **When to Use What**:
   - `concat()`: Stacking similar DataFrames
   - `merge()`: Combining on columns (like SQL)
   - `join()`: Quick index-based operations

---

## Next Steps

Congratulations! You have completed the Pandas module. You now have a solid foundation in:

- Creating and manipulating Series and DataFrames
- Reading and writing data in various formats
- Selecting and filtering data
- Cleaning and transforming data
- Grouping and aggregating data
- Merging and joining datasets

To continue your learning:
- Practice with real-world datasets
- Explore advanced topics like time series analysis
- Learn data visualization with Matplotlib and Seaborn
- Combine Pandas with machine learning libraries