# Merging Datasets
## Basic Inner Join

An inner join keeps only the rows whose key exists in BOTH datasets.

In the example below, property 4 (present only in the `properties` DataFrame) and property 6 (present only in the `prices` DataFrame) are dropped from the result.

In [1]:
import pandas as pd

properties = pd.DataFrame({
    "property_id": [1, 2, 3, 4, 5],
    "square_feet": [1800, 2200, 1200, 3000, 2500],
    "num_bedrooms": [3, 4, 2, 5, 4]
})

properties

|   | property_id | square_feet | num_bedrooms |
|---|---|---|---|
| 0 | 1 | 1800 | 3 |
| 1 | 2 | 2200 | 4 |
| 2 | 3 | 1200 | 2 |
| 3 | 4 | 3000 | 5 |
| 4 | 5 | 2500 | 4 |


In [2]:
prices = pd.DataFrame({
    "property_id": [1, 2, 3, 5, 6],
    "price_thousand_usd": [250, 320, 180, 390, 410]
})

prices

|   | property_id | price_thousand_usd |
|---|---|---|
| 0 | 1 | 250 |
| 1 | 2 | 320 |
| 2 | 3 | 180 |
| 3 | 5 | 390 |
| 4 | 6 | 410 |


In [3]:
merged_inner = pd.merge(left=properties, right=prices, on="property_id", how='inner')
merged_inner

|   | property_id | square_feet | num_bedrooms | price_thousand_usd |
|---|---|---|---|---|
| 0 | 1 | 1800 | 3 | 250 |
| 1 | 2 | 2200 | 4 | 320 |
| 2 | 3 | 1200 | 2 | 180 |
| 3 | 5 | 2500 | 4 | 390 |
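The inner join result can be cross-checked with plain set operations on the key column. This is a minimal sketch that rebuilds the two example DataFrames; the names `shared_ids` and `dropped_ids` are illustrative and not part of the original notebook.

```python
import pandas as pd

properties = pd.DataFrame({
    "property_id": [1, 2, 3, 4, 5],
    "square_feet": [1800, 2200, 1200, 3000, 2500],
    "num_bedrooms": [3, 4, 2, 5, 4],
})
prices = pd.DataFrame({
    "property_id": [1, 2, 3, 5, 6],
    "price_thousand_usd": [250, 320, 180, 390, 410],
})

# Keys present in both frames: these are the rows an inner join keeps.
shared_ids = set(properties["property_id"]) & set(prices["property_id"])

# Keys present in exactly one frame: these are the rows an inner join drops.
dropped_ids = set(properties["property_id"]) ^ set(prices["property_id"])

merged_inner = pd.merge(properties, prices, on="property_id", how="inner")

# The merged keys should be exactly the shared keys.
print(set(merged_inner["property_id"]) == shared_ids)
print(sorted(dropped_ids))
```

Checking the key sets before merging is a cheap way to see how many rows an inner join is about to discard.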


## Left Join
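- Keeps ALL rows from the left dataset (`properties`)
- Rows with no match get NaN in the columns from the right dataset, so `price_thousand_usd` becomes a float column
- Property 4 is kept, but its price is missing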

In [4]:
merged_left = pd.merge(left=properties, right=prices, on="property_id", how="left")
merged_left

|   | property_id | square_feet | num_bedrooms | price_thousand_usd |
|---|---|---|---|---|
| 0 | 1 | 1800 | 3 | 250.0 |
| 1 | 2 | 2200 | 4 | 320.0 |
| 2 | 3 | 1200 | 2 | 180.0 |
| 3 | 4 | 3000 | 5 | NaN |
| 4 | 5 | 2500 | 4 | 390.0 |
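The prices in the left join output are shown as floats (`250.0`) because a standard integer column cannot hold NaN. One possible follow-up, sketched here as an assumption rather than a step from the original notebook, is to cast the column to pandas' nullable integer dtype `"Int64"`, which keeps the values as integers and marks the gap as `<NA>`.

```python
import pandas as pd

properties = pd.DataFrame({
    "property_id": [1, 2, 3, 4, 5],
    "square_feet": [1800, 2200, 1200, 3000, 2500],
    "num_bedrooms": [3, 4, 2, 5, 4],
})
prices = pd.DataFrame({
    "property_id": [1, 2, 3, 5, 6],
    "price_thousand_usd": [250, 320, 180, 390, 410],
})

merged_left = pd.merge(properties, prices, on="property_id", how="left")

# The unmatched row (property 4) forces the column to float64 to hold NaN.
dtype_after_merge = merged_left["price_thousand_usd"].dtype

# Assumed follow-up step: the nullable "Int64" dtype stores missing values
# as <NA>, so the remaining prices can stay integers.
merged_left["price_thousand_usd"] = merged_left["price_thousand_usd"].astype("Int64")

print(dtype_after_merge, "->", merged_left["price_thousand_usd"].dtype)
```

Whether to keep floats or switch to a nullable dtype depends on what later steps expect, since some libraries handle `<NA>` differently from NaN.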


## Right Join
- Keeps ALL rows from the right dataset (`prices`)
- Property 6 appears with NaN in the columns that come from `properties`

In [5]:
merged_right = pd.merge(left=properties, right=prices, on="property_id", how="right")
merged_right

|   | property_id | square_feet | num_bedrooms | price_thousand_usd |
|---|---|---|---|---|
| 0 | 1 | 1800.0 | 3.0 | 250 |
| 1 | 2 | 2200.0 | 4.0 | 320 |
| 2 | 3 | 1200.0 | 2.0 | 180 |
| 3 | 5 | 2500.0 | 4.0 | 390 |
| 4 | 6 | NaN | NaN | 410 |
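A right join is the mirror image of a left join: swapping the two DataFrames and using `how="left"` should produce the same rows, with only the column order changed. The sketch below illustrates this on the example data; the variable names `right`, `swapped`, and `same_rows` are illustrative.

```python
import pandas as pd

properties = pd.DataFrame({
    "property_id": [1, 2, 3, 4, 5],
    "square_feet": [1800, 2200, 1200, 3000, 2500],
    "num_bedrooms": [3, 4, 2, 5, 4],
})
prices = pd.DataFrame({
    "property_id": [1, 2, 3, 5, 6],
    "price_thousand_usd": [250, 320, 180, 390, 410],
})

# Right join: keep every row of `prices`.
right = pd.merge(properties, prices, on="property_id", how="right")

# Same idea expressed as a left join with the arguments swapped.
swapped = pd.merge(prices, properties, on="property_id", how="left")

# Only the column order differs, so reorder before comparing.
same_rows = right.equals(swapped[right.columns])
print(same_rows)
```

Many codebases stick to left joins for this reason and simply choose which DataFrame goes on the left.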


## Outer Join (Full Join)
- Keeps all rows from both datasets
- Good for detecting mismatches
- Often used in data cleaning

In [6]:
merged_outer = pd.merge(left=properties, right=prices, on="property_id", how="outer")
merged_outer

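|   | property_id | square_feet | num_bedrooms | price_thousand_usd |
|---|---|---|---|---|
| 0 | 1 | 1800.0 | 3.0 | 250.0 |
| 1 | 2 | 2200.0 | 4.0 | 320.0 |
| 2 | 3 | 1200.0 | 2.0 | 180.0 |
| 3 | 4 | 3000.0 | 5.0 | NaN |
| 4 | 5 | 2500.0 | 4.0 | 390.0 |
| 5 | 6 | NaN | NaN | 410.0 |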


You can also pass `indicator=True` when doing an outer join. The result then gains a categorical column named `_merge` with three possible values, `both`, `left_only`, and `right_only`, which record the dataset each row came from.

In [7]:
merged_outer = pd.merge(left=properties, right=prices,
                        on="property_id", how="outer", indicator=True)
merged_outer

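|   | property_id | square_feet | num_bedrooms | price_thousand_usd | _merge |
|---|---|---|---|---|---|
| 0 | 1 | 1800.0 | 3.0 | 250.0 | both |
| 1 | 2 | 2200.0 | 4.0 | 320.0 | both |
| 2 | 3 | 1200.0 | 2.0 | 180.0 | both |
| 3 | 4 | 3000.0 | 5.0 | NaN | left_only |
| 4 | 5 | 2500.0 | 4.0 | 390.0 | both |
| 5 | 6 | NaN | NaN | 410.0 | right_only |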
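Since the `_merge` column labels every row, the mismatches can be pulled out with an ordinary boolean filter. The following is a minimal sketch on the same example data; `mismatches`, `missing_price`, and `missing_details` are illustrative names, not part of the original notebook.

```python
import pandas as pd

properties = pd.DataFrame({
    "property_id": [1, 2, 3, 4, 5],
    "square_feet": [1800, 2200, 1200, 3000, 2500],
    "num_bedrooms": [3, 4, 2, 5, 4],
})
prices = pd.DataFrame({
    "property_id": [1, 2, 3, 5, 6],
    "price_thousand_usd": [250, 320, 180, 390, 410],
})

merged_outer = pd.merge(properties, prices,
                        on="property_id", how="outer", indicator=True)

# Rows that were found in only one of the two datasets.
mismatches = merged_outer[merged_outer["_merge"] != "both"]

# Properties with no price record, and prices with no property record.
missing_price = merged_outer.loc[
    merged_outer["_merge"] == "left_only", "property_id"].tolist()
missing_details = merged_outer.loc[
    merged_outer["_merge"] == "right_only", "property_id"].tolist()

print(missing_price, missing_details)
```

In a data-cleaning workflow, lists like these are typically what gets sent back to the data owner for correction.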


## Merge on Different Column Names

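- The key columns do not need to share a name
- Use `left_on` and `right_on` to name the key column in each dataset
- Both key columns (`property_id` and `id`) are kept in the result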

In [9]:
prices_renamed = pd.DataFrame({
    "id": [1, 2, 3, 5, 6],
    "price_thousand_usd": [250, 320, 180, 390, 410]
})

pd.merge(left=properties, right=prices_renamed,
         left_on="property_id", right_on="id", how="inner")

|   | property_id | square_feet | num_bedrooms | id | price_thousand_usd |
|---|---|---|---|---|---|
| 0 | 1 | 1800 | 3 | 1 | 250 |
| 1 | 2 | 2200 | 4 | 2 | 320 |
| 2 | 3 | 1200 | 2 | 3 | 180 |
| 3 | 5 | 2500 | 4 | 5 | 390 |
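The output above carries both `property_id` and `id`, which hold identical values after an inner join. Two common ways to end up with a single key column are sketched below; neither appears in the original notebook, and the names `merged_clean` and `merged_renamed` are illustrative.

```python
import pandas as pd

properties = pd.DataFrame({
    "property_id": [1, 2, 3, 4, 5],
    "square_feet": [1800, 2200, 1200, 3000, 2500],
    "num_bedrooms": [3, 4, 2, 5, 4],
})
prices_renamed = pd.DataFrame({
    "id": [1, 2, 3, 5, 6],
    "price_thousand_usd": [250, 320, 180, 390, 410],
})

# Option 1: merge with left_on/right_on, then drop the redundant key column.
merged_clean = pd.merge(properties, prices_renamed,
                        left_on="property_id", right_on="id",
                        how="inner").drop(columns="id")

# Option 2: rename the key first so a plain `on=` merge can be used.
merged_renamed = pd.merge(properties,
                          prices_renamed.rename(columns={"id": "property_id"}),
                          on="property_id", how="inner")

print(merged_clean.equals(merged_renamed))
```

For left, right, or outer joins the two key columns can differ on unmatched rows, so inspect them before dropping either one.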


## Merge on Multiple Keys

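- Real datasets often need more than one column to identify a row (here, `property_id` and `year`)
- Merging on every key column prevents rows from being matched, and duplicated, across the wrong years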

In [12]:
sales = pd.DataFrame({
    "property_id": [1, 1, 2, 3],
    "year": [2023, 2024, 2024, 2024],
    "price_thousand_usd": [240, 250, 320, 180]
})

tax_rates = pd.DataFrame({
    "property_id": [1, 2, 3],
    "year": [2024, 2024, 2024],
    "tax_rate": [0.012, 0.011, 0.013]
})

pd.merge(left=sales, right=tax_rates,
         on=["property_id", "year"], how="left")

|   | property_id | year | price_thousand_usd | tax_rate |
|---|---|---|---|---|
| 0 | 1 | 2023 | 240 | NaN |
| 1 | 1 | 2024 | 250 | 0.012 |
| 2 | 2 | 2024 | 320 | 0.011 |
| 3 | 3 | 2024 | 180 | 0.013 |
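To see what the second key protects against, the sketch below repeats the merge using `property_id` alone. It also adds a hypothetical 2023 tax row for property 1 (the rate 0.010 is invented purely for illustration and is not in the original data) to show how rows multiply once a key repeats on both sides. The `validate` argument of `pd.merge` is used as an optional safety check.

```python
import pandas as pd

sales = pd.DataFrame({
    "property_id": [1, 1, 2, 3],
    "year": [2023, 2024, 2024, 2024],
    "price_thousand_usd": [240, 250, 320, 180],
})
tax_rates = pd.DataFrame({
    "property_id": [1, 2, 3],
    "year": [2024, 2024, 2024],
    "tax_rate": [0.012, 0.011, 0.013],
})

# Merging on property_id alone: the 2023 sale of property 1 is paired with
# the 2024 tax rate, and the clashing year columns become year_x / year_y.
single_key = pd.merge(sales, tax_rates, on="property_id", how="left")

# Hypothetical extra row (NOT in the original data): a 2023 rate for property 1.
extra_row = pd.DataFrame({"property_id": [1], "year": [2023], "tax_rate": [0.010]})
tax_rates_multi = pd.concat([extra_row, tax_rates], ignore_index=True)

# With property_id repeated on both sides, each sale of property 1 matches
# both tax rows, so the result has more rows than `sales`.
duplicated = pd.merge(sales, tax_rates_multi, on="property_id", how="left")

# Merging on both keys restores one row per sale. validate="many_to_one"
# raises an error if the key pairs are not unique in the right-hand frame.
both_keys = pd.merge(sales, tax_rates_multi, on=["property_id", "year"],
                     how="left", validate="many_to_one")

print(len(sales), len(duplicated), len(both_keys))  # 4 6 4
```

Comparing the row count before and after a merge, as in the last line, is a quick sanity check that catches this kind of silent duplication.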
