# Merging Data with Pandas

## Introduction

In real-world data analysis, information is rarely stored in a single monolithic table. Instead, data is typically distributed across multiple sources, each representing different aspects, entities, or time periods of the system under study. Combining these disparate tables - **merging** - is a foundational operation in data wrangling, much like performing SQL joins in a relational database. Effective merging allows you to enrich datasets, assemble complete records, and unlock the analytical power of relational thinking.

## The Essence of Merging

### What is a Merge (Join)?

A **merge** combines two tables (DataFrames) by matching rows on the values of one or more shared columns, called **keys**. To combine more than two tables, chain several merges. This process is analogous to SQL joins, enabling you to bring together related information from separate sources.

### The Most Common Join: **Inner Join**

**Intuition:**  
An *inner join* retrieves only the rows that have matching key values in **both** DataFrames. It acts as a filter - retaining only those records for which the relationship exists in all datasets involved.

- If you imagine each table as a circle in a Venn diagram, an inner join corresponds to the overlapping region: only values present in both circles are included in the result.
- This is the default behaviour of Pandas' `merge` function (`how="inner"`).

**When to use:**  
- When you want to analyse only the “shared universe” of your datasets - i.e., where data is available in all sources.
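
The Venn-diagram intuition can be illustrated with a minimal sketch. The two tables and their column names (`key`, `left_val`, `right_val`) are hypothetical and exist only for this illustration: keys 2 and 3 appear in both tables, while keys 1 and 4 appear in only one.

```python
import pandas as pd

# Hypothetical toy tables: keys 2 and 3 are shared, 1 and 4 are not.
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# how="inner" is the default; it is spelled out here only for clarity.
inner = pd.merge(left, right, on="key", how="inner")

# Only the rows for keys 2 and 3 survive the join.
print(inner)
```

Rows for keys 1 and 4 are absent from `inner` because each has no partner in the other table.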

### Syntax: `pd.merge()`

The principal function for table merging in Pandas is `pd.merge()`. The basic usage is:

```python
import pandas as pd

merged_df = pd.merge(
    left=table1,
    right=table2,
    on="key_column"        # or on=["key1", "key2"] for composite keys
)
```
- By default, this performs an **inner join** on the specified key(s).
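
The composite-key form mentioned in the comment above might look like the following sketch. The `sales` and `staff` tables and their columns are invented for illustration; a row is kept only when **both** `store` and `year` match.

```python
import pandas as pd

# Hypothetical tables keyed by the (store, year) pair.
sales = pd.DataFrame({
    "store": ["A", "A", "B"],
    "year": [2020, 2021, 2020],
    "revenue": [100, 120, 90],
})
staff = pd.DataFrame({
    "store": ["A", "B", "B"],
    "year": [2020, 2020, 2021],
    "headcount": [5, 4, 6],
})

# Passing a list to `on` joins on the combination of both columns.
combined = pd.merge(sales, staff, on=["store", "year"])

# Only (A, 2020) and (B, 2020) exist in both tables.
print(combined)
```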


### Example: Inner Join with a Common Key

Suppose you have two DataFrames, each with a `"user_id"` column representing the entity to join on:

```python
merged_df = pd.merge(left=users, right=purchases, on="user_id")
```
- Only users with at least one purchase record appear in the result; purchases whose `user_id` has no match in `users` are dropped as well.
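
To make this concrete, here is a small runnable sketch. The contents of `users` and `purchases` are made up for illustration (the text above does not define them), and the `name` and `amount` columns are assumptions.

```python
import pandas as pd

# Hypothetical data: user 2 has no purchases, and one purchase
# refers to user 9, who is not in the users table.
users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["Ana", "Ben", "Cleo"]})
purchases = pd.DataFrame({
    "user_id": [1, 1, 3, 9],
    "amount": [10.0, 25.0, 7.5, 99.0],
})

merged_df = pd.merge(left=users, right=purchases, on="user_id")

# User 1 appears twice (one row per purchase); users 2 and 9 are absent.
print(merged_df)
```

Note that the result has one row per matching pair, so a user with several purchases is repeated once for each of them.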

### Handling Overlapping Column Names: The `suffixes` Argument

When both tables have columns with the same name (apart from the join key), Pandas automatically appends suffixes to distinguish them: `_x` for the left table and `_y` for the right, by default. You can customise these using the `suffixes` argument:

```python
merged_df = pd.merge(
    left=table1,
    right=table2,
    on="id",
    suffixes=("_left", "_right")
)
```
- Descriptive suffixes make it clear which source table each overlapping column came from.
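
A short sketch comparing the default and custom suffixes. The tables and the overlapping `score` column are hypothetical, chosen only to trigger a name collision.

```python
import pandas as pd

# Hypothetical tables that share a non-key column name, "score".
table1 = pd.DataFrame({"id": [1, 2], "score": [0.9, 0.4]})
table2 = pd.DataFrame({"id": [1, 2], "score": [0.7, 0.8]})

# Without `suffixes`, pandas falls back to "_x" (left) and "_y" (right).
default_names = pd.merge(table1, table2, on="id")
print(list(default_names.columns))

# Custom suffixes document where each column came from.
custom_names = pd.merge(table1, table2, on="id", suffixes=("_left", "_right"))
print(list(custom_names.columns))
```

The join key `id` is never suffixed, because it is shared by construction.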

## Best Practices

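- **Explicit key columns:** Always specify the `on=` argument (or `left_on=`/`right_on=` when the key columns are named differently). If you omit it, Pandas joins on every column name the two tables share, which can silently produce the wrong result.
- **Review join types:** Inner joins keep only matched rows, so unmatched records are silently dropped. When you need to retain them, consider left, right, or outer joins via the `how=` argument (see Pandas documentation).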
- **Inspect results:** Always check the shape and sample rows of your merged DataFrame (`.shape`, `.head()`) to ensure correctness.
- **Column name hygiene:** Use `suffixes` to manage naming collisions for maximum interpretability.
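
One possible way to put the "inspect results" advice into practice is sketched below. The `customers` and `orders` tables are hypothetical. The `how` and `indicator` parameters are part of `DataFrame.merge`, but using an outer join as an audit step is just one approach, not a prescribed workflow.

```python
import pandas as pd

# Hypothetical tables: customer 1 has no order, order 4 has no customer.
customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ann", "Bo", "Cy"]})
orders = pd.DataFrame({"cust_id": [2, 3, 4], "total": [20.0, 35.5, 12.0]})

# Compare shapes before and after to see how many rows the join kept.
inner = customers.merge(orders, on="cust_id")
print(customers.shape, orders.shape, inner.shape)

# An outer join with indicator=True adds a "_merge" column that labels
# each row as "both", "left_only", or "right_only".
audit = customers.merge(orders, on="cust_id", how="outer", indicator=True)
print(audit["_merge"].value_counts())
```

Rows labelled `left_only` or `right_only` are exactly the ones an inner join would discard, so counting them shows how much data the default join loses.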



In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import os

In [7]:
wards = pd.read_pickle("data/ward.p")


In [5]:
census = pd.read_pickle("data/census.p")

In [8]:
wards.head()

|   | ward | alderman | address | zip |
|---|------|----------|---------|-----|
| 0 | 1 | Proco "Joe" Moreno | 2058 NORTH WESTERN AVENUE | 60647 |
| 1 | 2 | Brian Hopkins | 1400 NORTH ASHLAND AVENUE | 60622 |
| 2 | 3 | Pat Dowell | 5046 SOUTH STATE STREET | 60609 |
| 3 | 4 | William D. Burns | 435 EAST 35TH STREET, 1ST FLOOR | 60616 |
| 4 | 5 | Leslie A. Hairston | 2325 EAST 71ST STREET | 60649 |


In [9]:
census.head()

|   | ward | pop_2000 | pop_2010 | change | address | zip |
|---|------|----------|----------|--------|---------|-----|
| 0 | 1 | 52951 | 56149 | 6% | 2765 WEST SAINT MARY STREET | 60647 |
| 1 | 2 | 54361 | 55805 | 3% | WM WASTE MANAGEMENT 1500 | 60622 |
| 2 | 3 | 40385 | 53039 | 31% | 17 EAST 38TH STREET | 60653 |
| 3 | 4 | 51953 | 54589 | 5% | 31ST ST HARBOR BUILDING LAKEFRONT TRAIL | 60653 |
| 4 | 5 | 55302 | 51455 | -7% | JACKSON PARK LAGOON SOUTH CORNELL DRIVE | 60637 |


In [12]:
wards_census = wards.merge(census, on="ward")
display(wards_census.head())

|   | ward | alderman | address_x | zip_x | pop_2000 | pop_2010 | change | address_y | zip_y |
|---|------|----------|-----------|-------|----------|----------|--------|-----------|-------|
| 0 | 1 | Proco "Joe" Moreno | 2058 NORTH WESTERN AVENUE | 60647 | 52951 | 56149 | 6% | 2765 WEST SAINT MARY STREET | 60647 |
| 1 | 2 | Brian Hopkins | 1400 NORTH ASHLAND AVENUE | 60622 | 54361 | 55805 | 3% | WM WASTE MANAGEMENT 1500 | 60622 |
| 2 | 3 | Pat Dowell | 5046 SOUTH STATE STREET | 60609 | 40385 | 53039 | 31% | 17 EAST 38TH STREET | 60653 |
| 3 | 4 | William D. Burns | 435 EAST 35TH STREET, 1ST FLOOR | 60616 | 51953 | 54589 | 5% | 31ST ST HARBOR BUILDING LAKEFRONT TRAIL | 60653 |
| 4 | 5 | Leslie A. Hairston | 2325 EAST 71ST STREET | 60649 | 55302 | 51455 | -7% | JACKSON PARK LAGOON SOUTH CORNELL DRIVE | 60637 |


In [13]:
print(wards_census.columns)

Index(['ward', 'alderman', 'address_x', 'zip_x', 'pop_2000', 'pop_2010',
       'change', 'address_y', 'zip_y'],
      dtype='object')


In [15]:
wards_census = wards.merge(census, on="ward", suffixes=("_ward", "_cen"))
display(wards_census.head())

|   | ward | alderman | address_ward | zip_ward | pop_2000 | pop_2010 | change | address_cen | zip_cen |
|---|------|----------|--------------|----------|----------|----------|--------|-------------|---------|
| 0 | 1 | Proco "Joe" Moreno | 2058 NORTH WESTERN AVENUE | 60647 | 52951 | 56149 | 6% | 2765 WEST SAINT MARY STREET | 60647 |
| 1 | 2 | Brian Hopkins | 1400 NORTH ASHLAND AVENUE | 60622 | 54361 | 55805 | 3% | WM WASTE MANAGEMENT 1500 | 60622 |
| 2 | 3 | Pat Dowell | 5046 SOUTH STATE STREET | 60609 | 40385 | 53039 | 31% | 17 EAST 38TH STREET | 60653 |
| 3 | 4 | William D. Burns | 435 EAST 35TH STREET, 1ST FLOOR | 60616 | 51953 | 54589 | 5% | 31ST ST HARBOR BUILDING LAKEFRONT TRAIL | 60653 |
| 4 | 5 | Leslie A. Hairston | 2325 EAST 71ST STREET | 60649 | 55302 | 51455 | -7% | JACKSON PARK LAGOON SOUTH CORNELL DRIVE | 60637 |


In [16]:
print(wards_census.shape)

(50, 9)
