# Data Reshaping - Part 1: Merge, Join, and Concatenate

## Week 4, Day 1 (Wednesday) - April 30th, 2025

### Overview
This is the first part of our Data Reshaping session, focusing on combining data from multiple sources. Understanding how to merge, join, and concatenate DataFrames is essential for working with real-world datasets that are often split across multiple tables or files.

### Learning Objectives
- Master different ways to combine DataFrames in Pandas
- Understand the difference between concatenation and merging
- Learn various types of joins and their SQL equivalents
- Apply data combination techniques to e-commerce scenarios
- Handle common issues when combining datasets

### Prerequisites
- Python fundamentals (Week 1)
- NumPy basics (Week 2, Day 1)
- Pandas fundamentals (Week 2-3)
- SQL knowledge (prior to course) - especially JOIN operations

## 1. Introduction to Data Combination

In real-world data analysis, you rarely work with a single dataset. Data is often:
- Split across multiple tables (normalized databases)
- Stored in separate files by time period
- Coming from different sources that need to be combined

Pandas provides powerful tools to combine data, similar to SQL's JOIN and UNION operations.

In [None]:
# Import libraries
import pandas as pd
import numpy as np

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("Libraries imported successfully!")

## 2. Creating Sample E-commerce Datasets

Let's create sample datasets similar to what you might find in an e-commerce database:

In [None]:
# Orders dataset
orders = pd.DataFrame({
    'order_id': ['ORD001', 'ORD002', 'ORD003', 'ORD004', 'ORD005'],
    'customer_id': ['CUST001', 'CUST002', 'CUST001', 'CUST003', 'CUST002'],
    'order_date': ['2025-01-15', '2025-01-16', '2025-01-17', '2025-01-18', '2025-01-19'],
    'total_amount': [299.99, 149.50, 89.99, 199.99, 75.25]
})

# Customers dataset
customers = pd.DataFrame({
    'customer_id': ['CUST001', 'CUST002', 'CUST003', 'CUST004'],
    'customer_name': ['Alice Johnson', 'Bob Smith', 'Carol Brown', 'David Wilson'],
    'email': ['alice@email.com', 'bob@email.com', 'carol@email.com', 'david@email.com'],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

# Products dataset
products = pd.DataFrame({
    'product_id': ['PROD001', 'PROD002', 'PROD003', 'PROD004', 'PROD005'],
    'product_name': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Monitor'],
    'category': ['Electronics', 'Electronics', 'Electronics', 'Accessories', 'Electronics'],
    'price': [1299.99, 899.99, 449.99, 199.99, 349.99]
})

# Order items dataset (linking orders to products)
order_items = pd.DataFrame({
    'order_id': ['ORD001', 'ORD001', 'ORD002', 'ORD003', 'ORD004', 'ORD005'],
    'product_id': ['PROD001', 'PROD004', 'PROD002', 'PROD003', 'PROD005', 'PROD004'],
    'quantity': [1, 1, 1, 1, 1, 1],
    'unit_price': [1299.99, 199.99, 899.99, 449.99, 349.99, 199.99]
})

print("Sample datasets created!")
print(f"Orders: {len(orders)} rows")
print(f"Customers: {len(customers)} rows")
print(f"Products: {len(products)} rows")
print(f"Order Items: {len(order_items)} rows")

In [None]:
# Let's examine our datasets
print("ORDERS:")
print(orders)
print("\nCUSTOMERS:")
print(customers)
print("\nPRODUCTS:")
print(products)
print("\nORDER ITEMS:")
print(order_items)

## 3. Concatenation with pd.concat()

### What is Concatenation?
Concatenation combines DataFrames by stacking them vertically (rows) or horizontally (columns). Stacking rows is similar to SQL's `UNION ALL`: every row is kept, including duplicates.

### Vertical Concatenation (Stacking Rows)
SQL equivalent: `SELECT * FROM table1 UNION ALL SELECT * FROM table2`

In [None]:
# Create additional orders data (like from another time period)
new_orders = pd.DataFrame({
    'order_id': ['ORD006', 'ORD007', 'ORD008'],
    'customer_id': ['CUST001', 'CUST004', 'CUST002'],
    'order_date': ['2025-01-20', '2025-01-21', '2025-01-22'],
    'total_amount': [199.99, 299.99, 89.99]
})

print("Original orders:")
print(orders)
print("\nNew orders:")
print(new_orders)

In [None]:
# Concatenate vertically (default behavior)
all_orders = pd.concat([orders, new_orders])

print("All orders combined:")
print(all_orders)
print(f"\nOriginal orders: {len(orders)} rows")
print(f"New orders: {len(new_orders)} rows")
print(f"Combined: {len(all_orders)} rows")

In [None]:
# Reset index to have continuous numbering
all_orders_reset = pd.concat([orders, new_orders], ignore_index=True)

print("All orders with reset index:")
print(all_orders_reset)
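
Because `pd.concat()` keeps every row, it behaves like `UNION ALL` rather than `UNION`. The sketch below uses two small hypothetical batches (`batch_a`, `batch_b`, not part of the datasets above) that overlap on one order, and shows one way to get `UNION`-style behaviour by dropping identical rows afterwards.

```python
import pandas as pd

# Hypothetical batches for illustration; ORD002 appears in both
batch_a = pd.DataFrame({'order_id': ['ORD001', 'ORD002'],
                        'total_amount': [299.99, 149.50]})
batch_b = pd.DataFrame({'order_id': ['ORD002', 'ORD003'],
                        'total_amount': [149.50, 89.99]})

# UNION ALL: every row is kept, including the repeated ORD002
union_all = pd.concat([batch_a, batch_b], ignore_index=True)

# UNION: remove rows that are identical in every column
union_distinct = union_all.drop_duplicates().reset_index(drop=True)

print(len(union_all))       # 4
print(len(union_distinct))  # 3
```

Note that `drop_duplicates()` with no arguments only removes rows that match in all columns; pass `subset=` if a single key column should decide what counts as a duplicate.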

### Horizontal Concatenation (Side by Side)
Useful when you have additional columns for the same rows. Rows are matched by index label, not by position, so both DataFrames need the same index:

In [None]:
# Create additional customer information
customer_details = pd.DataFrame({
    'phone': ['555-0101', '555-0102', '555-0103', '555-0104'],
    'age': [28, 35, 42, 31],
    'membership_tier': ['Gold', 'Silver', 'Gold', 'Bronze']
})

print("Original customers:")
print(customers)
print("\nAdditional details:")
print(customer_details)

In [None]:
# Concatenate horizontally
customers_expanded = pd.concat([customers, customer_details], axis=1)

print("Customers with additional details:")
print(customers_expanded)
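
The example above works because both DataFrames share the default index 0 to 3. As a minimal sketch of what can go wrong otherwise, the hypothetical frames `left` and `right` below have different index labels; `reset_index(drop=True)` is one way to restore positional alignment.

```python
import pandas as pd

# Hypothetical frames for illustration; `right` has a shifted index
left = pd.DataFrame({'customer_id': ['CUST001', 'CUST002']})
right = pd.DataFrame({'phone': ['555-0101', '555-0102']}, index=[1, 2])

# Rows are matched on index labels, so labels 0 and 2 each get a NaN
misaligned = pd.concat([left, right], axis=1)
print(misaligned)

# Resetting both indexes makes the rows line up by position again
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
)
print(aligned)
```

This situation often arises after filtering or sorting one of the DataFrames, since those operations keep the original index labels.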

### Adding Labels to Concatenated Data

In [None]:
# Add labels to identify data sources
labeled_orders = pd.concat([orders, new_orders], 
                          keys=['January_Week_3', 'January_Week_4'],
                          names=['Week', 'Row'])

print("Orders with source labels:")
print(labeled_orders)
print("\nIndex structure:")
print(labeled_orders.index)
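
Once the source labels are in the index, they can be used for selection or turned into an ordinary column. The following sketch assumes two small hypothetical frames (`week3`, `week4`) rather than the full datasets above.

```python
import pandas as pd

# Hypothetical weekly extracts for illustration
week3 = pd.DataFrame({'order_id': ['ORD001', 'ORD002'],
                      'total_amount': [299.99, 149.50]})
week4 = pd.DataFrame({'order_id': ['ORD006'],
                      'total_amount': [199.99]})

labeled = pd.concat([week3, week4],
                    keys=['January_Week_3', 'January_Week_4'],
                    names=['Week', 'Row'])

# Select all rows from one source by its outer index label
week4_only = labeled.loc['January_Week_4']
print(week4_only)

# Move the 'Week' level into a regular column and discard the 'Row' level
flat = labeled.reset_index(level='Week').reset_index(drop=True)
print(flat)
```

A flat column is usually easier to group or filter on than a MultiIndex level, so the second form may be more convenient for later analysis.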

## 4. Merging and Joining with pd.merge()

### What is Merging?
Merging combines DataFrames based on common columns or indices, similar to SQL JOINs. This is used when you have related data across multiple tables.
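
The examples in this section all merge on columns. For the index case, here is a minimal sketch, assuming a hypothetical `customers_indexed` frame whose index holds the customer ID. It shows two equivalent routes: `pd.merge()` with `right_index=True`, and `DataFrame.join()`.

```python
import pandas as pd

# Hypothetical frames for illustration
orders_small = pd.DataFrame({'order_id': ['ORD001', 'ORD002'],
                             'customer_id': ['CUST001', 'CUST002']})
customers_indexed = pd.DataFrame(
    {'customer_name': ['Alice Johnson', 'Bob Smith']},
    index=pd.Index(['CUST001', 'CUST002'], name='customer_id'),
)

# Match a column on the left against the index on the right
by_index = pd.merge(orders_small, customers_indexed,
                    left_on='customer_id', right_index=True)

# DataFrame.join does the same thing (left join by default)
joined = orders_small.join(customers_indexed, on='customer_id')

print(by_index)
print(joined)
```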

### Inner Join (Default)
SQL equivalent: `SELECT * FROM table1 INNER JOIN table2 ON table1.key = table2.key`

In [None]:
# Inner join: Orders with Customer information
orders_with_customers = pd.merge(orders, customers, on='customer_id')

print("Orders with customer information (Inner Join):")
print(orders_with_customers)
print(f"\nOriginal orders: {len(orders)} rows")
print(f"Customers: {len(customers)} rows")
print(f"Merged result: {len(orders_with_customers)} rows")

### Left Join
SQL equivalent: `SELECT * FROM table1 LEFT JOIN table2 ON table1.key = table2.key`

Keeps all records from the left DataFrame, adds matching records from the right:

In [None]:
# Left join: All orders, with customer info where available
orders_left_join = pd.merge(orders, customers, on='customer_id', how='left')

print("Orders with customer information (Left Join):")
print(orders_left_join)
print(f"\nResult has {len(orders_left_join)} rows (same as orders: {len(orders)})")

### Right Join
SQL equivalent: `SELECT * FROM table1 RIGHT JOIN table2 ON table1.key = table2.key`

Keeps all records from the right DataFrame:

In [None]:
# Right join: All customers, with order info where available
customers_right_join = pd.merge(orders, customers, on='customer_id', how='right')

print("Customers with order information (Right Join):")
print(customers_right_join)
print(f"\nResult has {len(customers_right_join)} rows: one per matching order, plus one for each customer with no orders")
print("\nNotice: David Wilson (CUST004) appears with NaN values because he has no orders")

### Outer Join (Full Join)
SQL equivalent: `SELECT * FROM table1 FULL OUTER JOIN table2 ON table1.key = table2.key`

Keeps all records from both DataFrames:

In [None]:
# Outer join: All orders and all customers
full_outer_join = pd.merge(orders, customers, on='customer_id', how='outer')

print("Full outer join - All orders and customers:")
print(full_outer_join)
print(f"\nResult has {len(full_outer_join)} rows")
print("NaN values appear where no match exists")
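
To see which side each row of an outer join came from, `pd.merge()` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses small hypothetical frames (`orders_small`, `customers_small`) with one unmatched row on each side.

```python
import pandas as pd

# Hypothetical frames: CUST999 has no customer record, CUST004 has no order
orders_small = pd.DataFrame({'order_id': ['ORD001', 'ORD999'],
                             'customer_id': ['CUST001', 'CUST999']})
customers_small = pd.DataFrame({'customer_id': ['CUST001', 'CUST004'],
                                'customer_name': ['Alice Johnson', 'David Wilson']})

audit = pd.merge(orders_small, customers_small,
                 on='customer_id', how='outer', indicator=True)

# _merge is 'both', 'left_only', or 'right_only' for each row
print(audit[['customer_id', '_merge']])
print(audit['_merge'].value_counts())
```

Filtering on `_merge == 'left_only'` or `'right_only'` is a quick way to list the records that failed to match.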

## 5. Complex Merging Scenarios

### Merging on Multiple Columns

In [None]:
# Create a scenario where we need to merge on multiple columns
# Let's say we have daily sales data and want to merge with promotional data

daily_sales = pd.DataFrame({
    'date': ['2025-01-15', '2025-01-16', '2025-01-17', '2025-01-15', '2025-01-16'],
    'product_id': ['PROD001', 'PROD001', 'PROD001', 'PROD002', 'PROD002'],
    'sales_qty': [5, 3, 8, 2, 4]
})

promotions = pd.DataFrame({
    'date': ['2025-01-15', '2025-01-16', '2025-01-17', '2025-01-15'],
    'product_id': ['PROD001', 'PROD001', 'PROD001', 'PROD002'],
    'discount_pct': [10, 15, 5, 20]
})

print("Daily Sales:")
print(daily_sales)
print("\nPromotions:")
print(promotions)

In [None]:
# Merge on multiple columns
sales_with_promotions = pd.merge(daily_sales, promotions, 
                                on=['date', 'product_id'], 
                                how='left')

print("Sales with promotions (merged on date AND product_id):")
print(sales_with_promotions)
print("\nNaN in discount_pct means no promotion was available for that date/product combination")
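
A side effect worth knowing: when a left join introduces NaN into an integer column, pandas converts that column to float. The sketch below reproduces this with two hypothetical frames (`sales`, `promos`) and shows one possible clean-up, assuming a missing promotion should be treated as a 0% discount.

```python
import pandas as pd

# Hypothetical frames: only the first sale has a matching promotion
sales = pd.DataFrame({'date': ['2025-01-15', '2025-01-16'],
                      'product_id': ['PROD002', 'PROD002'],
                      'sales_qty': [2, 4]})
promos = pd.DataFrame({'date': ['2025-01-15'],
                       'product_id': ['PROD002'],
                       'discount_pct': [20]})

merged = pd.merge(sales, promos, on=['date', 'product_id'], how='left')

# The integer column is now float because NaN cannot be stored as int64
dtype_after_merge = merged['discount_pct'].dtype
print(dtype_after_merge)  # float64

# Assumption: no promotion means a 0% discount
merged['discount_pct'] = merged['discount_pct'].fillna(0).astype(int)
print(merged)
```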

### Merging with Different Column Names

In [None]:
# Sometimes the join columns have different names
customer_reviews = pd.DataFrame({
    'cust_id': ['CUST001', 'CUST002', 'CUST003', 'CUST001'],  # Different column name
    'product_id': ['PROD001', 'PROD002', 'PROD003', 'PROD004'],
    'rating': [5, 4, 3, 5],
    'review_date': ['2025-01-20', '2025-01-21', '2025-01-22', '2025-01-23']
})

print("Customer Reviews (note: 'cust_id' instead of 'customer_id'):")
print(customer_reviews)

In [None]:
# Merge with different column names using left_on and right_on
reviews_with_customers = pd.merge(customer_reviews, customers, 
                                 left_on='cust_id', right_on='customer_id',
                                 how='inner')

print("Reviews with customer details:")
print(reviews_with_customers)
print("\nNotice: Both 'cust_id' and 'customer_id' columns are kept")
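
Keeping both key columns is rarely useful, since they hold the same values. Two common ways to end up with a single key column are sketched below, using small hypothetical frames (`reviews`, `customers_small`): drop the extra column after merging, or rename it before merging.

```python
import pandas as pd

# Hypothetical frames for illustration
reviews = pd.DataFrame({'cust_id': ['CUST001', 'CUST002'],
                        'rating': [5, 4]})
customers_small = pd.DataFrame({'customer_id': ['CUST001', 'CUST002'],
                                'customer_name': ['Alice Johnson', 'Bob Smith']})

# Option 1: merge on differently named keys, then drop the redundant one
merged = pd.merge(reviews, customers_small,
                  left_on='cust_id', right_on='customer_id'
                  ).drop(columns='cust_id')

# Option 2: rename first so a plain on= merge produces one key column
merged_renamed = pd.merge(reviews.rename(columns={'cust_id': 'customer_id'}),
                          customers_small, on='customer_id')

print(merged)
print(merged_renamed)
```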

### Handling Duplicate Column Names

In [None]:
# Create datasets with overlapping column names
order_summary = pd.DataFrame({
    'order_id': ['ORD001', 'ORD002', 'ORD003'],
    'total_amount': [299.99, 149.50, 89.99],  # Same name as in orders
    'status': ['Delivered', 'Shipped', 'Processing']
})

print("Original orders:")
print(orders[['order_id', 'total_amount']].head(3))
print("\nOrder summary:")
print(order_summary)

In [None]:
# Merge with suffixes to handle duplicate column names
orders_with_status = pd.merge(orders, order_summary, 
                             on='order_id', 
                             suffixes=('_original', '_summary'))

print("Orders merged with status (using suffixes):")
print(orders_with_status)
print("\nNotice: total_amount_original and total_amount_summary")

## 6. Advanced Merging: One-to-Many and Many-to-Many

### One-to-Many Relationship
Each customer can have multiple orders:

In [None]:
# This is a one-to-many relationship: one customer -> many orders
customer_orders = pd.merge(customers, orders, on='customer_id', how='inner')

print("Customer-Order relationship (One-to-Many):")
print(customer_orders)
print("\nNotice: Alice Johnson appears twice because she has two orders")

### Many-to-Many Relationship
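Orders and products have a many-to-many relationship: one order can contain several products, and one product can appear in several orders. The `order_items` table links the two, so we join in two steps: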

In [None]:
# Complex join: Orders -> Order Items -> Products
# First join orders with order items
orders_items = pd.merge(orders, order_items, on='order_id')

print("Orders joined with Order Items:")
print(orders_items)
print(f"\nRows: {len(orders_items)} (more than original {len(orders)} orders because some orders have multiple items)")

In [None]:
# Now add product information
complete_order_info = pd.merge(orders_items, products, on='product_id')

print("Complete order information (Orders + Items + Products):")
print(complete_order_info)
print("\nNow we can see what products were ordered in each order!")

## 7. Performance and Memory Considerations

### When to Use Concatenation vs Merging

| Operation | Use When | SQL Equivalent |
|-----------|----------|----------------|
| `pd.concat()` | Combining DataFrames with same structure | `UNION ALL` |
| `pd.merge()` | Combining related data from different tables | `JOIN` |

### Memory Efficiency Tips

In [None]:
# Check memory usage
print("Memory usage of our datasets:")
print(f"Orders: {orders.memory_usage(deep=True).sum()} bytes")
print(f"Customers: {customers.memory_usage(deep=True).sum()} bytes")
print(f"Products: {products.memory_usage(deep=True).sum()} bytes")
print(f"Complete order info: {complete_order_info.memory_usage(deep=True).sum()} bytes")

# For large datasets, consider:
# 1. Using categorical data types for repeated strings
# 2. Selecting only needed columns before merging
# 3. Using chunking for very large datasets
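
The first two tips can be sketched as follows. The `items` and `prices` frames are hypothetical stand-ins for larger tables; actual savings depend on how often the strings repeat.

```python
import pandas as pd

# Hypothetical larger table with heavily repeated strings
items = pd.DataFrame({
    'product_id': ['PROD001', 'PROD002'] * 5000,
    'category': ['Electronics', 'Accessories'] * 5000,
    'notes': ['unused free text'] * 10000,
})

# Tip 1: store repeated strings as a categorical
before = items['category'].memory_usage(deep=True)
items['category'] = items['category'].astype('category')
after = items['category'].memory_usage(deep=True)
print(f"category column: {before} -> {after} bytes")

# Tip 2: select only the columns you need before merging
prices = pd.DataFrame({'product_id': ['PROD001', 'PROD002'],
                       'price': [1299.99, 199.99],
                       'supplier': ['Acme', 'Globex']})  # not needed below
slim = pd.merge(items[['product_id', 'category']],
                prices[['product_id', 'price']],
                on='product_id', how='left')
print(slim.shape)
```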

## 8. SQL to Pandas Translation Guide

### Common SQL JOIN patterns and their Pandas equivalents:

| SQL Operation | Pandas Equivalent |
|--------------|-------------------|
| `SELECT * FROM a INNER JOIN b ON a.key = b.key` | `pd.merge(a, b, on='key')` |
| `SELECT * FROM a LEFT JOIN b ON a.key = b.key` | `pd.merge(a, b, on='key', how='left')` |
| `SELECT * FROM a RIGHT JOIN b ON a.key = b.key` | `pd.merge(a, b, on='key', how='right')` |
| `SELECT * FROM a FULL OUTER JOIN b ON a.key = b.key` | `pd.merge(a, b, on='key', how='outer')` |
| `SELECT * FROM a UNION ALL SELECT * FROM b` | `pd.concat([a, b])` |
| `SELECT * FROM a JOIN b ON a.k1=b.k1 AND a.k2=b.k2` | `pd.merge(a, b, on=['k1', 'k2'])` |

## 9. Real-World E-commerce Example

Let's create a comprehensive analysis combining all our datasets:

In [None]:
# Create a comprehensive customer analysis
# Step 1: Get orders with customer information
step1 = pd.merge(orders, customers, on='customer_id', how='left')

# Step 2: Add order items
step2 = pd.merge(step1, order_items, on='order_id', how='left')

# Step 3: Add product information
comprehensive_data = pd.merge(step2, products, on='product_id', how='left')

print("Comprehensive E-commerce Analysis Dataset:")
print(comprehensive_data)
print(f"\nShape: {comprehensive_data.shape}")
print(f"Columns: {list(comprehensive_data.columns)}")

In [None]:
# Now we can do interesting analyses
print("Analysis Examples:")
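print("\n1. Total spent by each customer:")
# Multiply by quantity so multi-unit line items are counted in full
line_totals = comprehensive_data['unit_price'] * comprehensive_data['quantity']
customer_spending = line_totals.groupby(comprehensive_data['customer_name']).sum().sort_values(ascending=False)
print(customer_spending)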

print("\n2. Most popular product categories:")
category_popularity = comprehensive_data['category'].value_counts()
print(category_popularity)

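print("\n3. Average order value by city:")
# comprehensive_data has one row per order item, so keep one row per order
# first; otherwise multi-item orders would be counted more than once
orders_once = comprehensive_data.drop_duplicates(subset='order_id')
city_avg_order = orders_once.groupby('city')['total_amount'].mean().sort_values(ascending=False)
print(city_avg_order)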

## 10. Common Issues and Solutions

### Issue 1: Unexpected Number of Rows After Merge

In [None]:
# This can happen with duplicate keys
# Let's create a scenario with duplicate customer IDs
duplicate_customers = pd.DataFrame({
    'customer_id': ['CUST001', 'CUST001', 'CUST002'],  # CUST001 appears twice
    'segment': ['Premium', 'VIP', 'Standard']
})

print("Orders (2 for CUST001):")
print(orders[orders['customer_id'] == 'CUST001'])
print("\nDuplicate customer segments:")
print(duplicate_customers)

# Each CUST001 order matches both CUST001 segment rows (a Cartesian product per key)!
problematic_merge = pd.merge(orders, duplicate_customers, on='customer_id')
print(f"\nProblematic merge result ({len(problematic_merge)} rows):")
print(problematic_merge[problematic_merge['customer_id'] == 'CUST001'])
print("\nNotice: Each order for CUST001 now appears twice!")
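
One way to catch this early is the `validate` argument of `pd.merge()`, which raises `pd.errors.MergeError` when the keys do not have the expected uniqueness. The sketch below uses small hypothetical frames (`orders_small`, `segments`); keeping the first segment per customer is an assumption made purely for illustration.

```python
import pandas as pd

# Hypothetical frames: CUST001 has two conflicting segment rows
orders_small = pd.DataFrame({'order_id': ['ORD001', 'ORD003'],
                             'customer_id': ['CUST001', 'CUST001']})
segments = pd.DataFrame({'customer_id': ['CUST001', 'CUST001'],
                         'segment': ['Premium', 'VIP']})

# Expect many orders per customer but only one segment row per customer
try:
    pd.merge(orders_small, segments, on='customer_id', validate='many_to_one')
    caught = False
except pd.errors.MergeError as exc:
    caught = True
    print(f"Merge rejected: {exc}")

# Assumption for illustration: keep the first segment listed per customer
deduped = segments.drop_duplicates(subset='customer_id', keep='first')
safe = pd.merge(orders_small, deduped, on='customer_id', validate='many_to_one')
print(safe)
```

In practice, which duplicate to keep is a business decision; the point of `validate` is only to make the problem visible before the row count silently grows.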

### Issue 2: Missing Data After Inner Join

In [None]:
# Some orders might not have customer information
orders_with_missing = orders.copy()
orders_with_missing.loc[len(orders_with_missing)] = ['ORD999', 'CUST999', '2025-01-25', 99.99]

print("Orders with missing customer:")
print(orders_with_missing.tail())

# Inner join will drop the order with missing customer
inner_result = pd.merge(orders_with_missing, customers, on='customer_id', how='inner')
print(f"\nInner join result: {len(inner_result)} rows (lost 1 order)")

# Left join will keep all orders
left_result = pd.merge(orders_with_missing, customers, on='customer_id', how='left')
print(f"Left join result: {len(left_result)} rows (kept all orders)")
print("\nOrders without customer info:")
print(left_result[left_result['customer_name'].isna()])

## 11. Practice Exercises

### Exercise 1: Basic Merging
Merge the `orders` and `customers` DataFrames to show all orders with customer information. Include orders even if customer information is missing.

In [None]:
# Your code here


### Exercise 2: Multiple Table Join
Create a dataset that shows:
- Order ID
- Customer Name
- Product Name
- Quantity
- Unit Price

Hint: You'll need to join orders, customers, order_items, and products.

In [None]:
# Your code here


### Exercise 3: Concatenation Challenge
You have sales data from two different months. Combine them and add a column to identify which month each sale came from.

```python
january_sales = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'sales': [100, 150, 200]
})

february_sales = pd.DataFrame({
    'product': ['A', 'B', 'D'],
    'sales': [120, 180, 90]
})
```

In [None]:
# Create the sample data
january_sales = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'sales': [100, 150, 200]
})

february_sales = pd.DataFrame({
    'product': ['A', 'B', 'D'],
    'sales': [120, 180, 90]
})

# Your code here


### Exercise 4: SQL Translation
Translate this SQL query to Pandas:

```sql
SELECT c.customer_name, COUNT(o.order_id) as order_count, SUM(o.total_amount) as total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name
ORDER BY total_spent DESC
```

In [None]:
# Your code here


## Next Steps

Congratulations! You've mastered the fundamentals of merging, joining, and concatenating DataFrames. In the next parts of today's session, we'll continue with:

- **Part 2: Reshape operations (melt, pivot)**
- **Part 3: Time series manipulation basics**

These skills form the foundation for working with real-world, multi-table datasets like the Olist e-commerce database we'll be using throughout the course.

### Key Takeaways
1. **Concatenation** (`pd.concat()`) stacks DataFrames - use for combining similar data
2. **Merging** (`pd.merge()`) joins related data - use for combining different but related datasets
3. **Join types** (inner, left, right, outer) control which records are kept
4. **Always verify** the number of rows after joining to catch unexpected duplications
5. **Plan your joins** - start with the main table and add related information step by step