# 05.02 - Merge and Join

**Author:** Miguel Angel Vazquez Varela  
**Level:** Intermediate  
**Estimated time:** 30 min

---

## What will we learn?

- Combine DataFrames with `merge()`
- Join types: inner, left, right, outer
- Concatenate with `concat()`
- Handle duplicate keys and column-name conflicts

In [1]:
import pandas as pd
import numpy as np

---

## Sample data

In [2]:
# Trips table
trips = pd.DataFrame({
    "trip_id": [1, 2, 3, 4, 5],
    "station_id": ["ST01", "ST02", "ST01", "ST03", "ST04"],
    "duration_min": [12, 25, 8, 45, 15],
    "user_id": [101, 102, 101, 103, 104]
})

print("TRIPS:")
trips

TRIPS:


Unnamed: 0,trip_id,station_id,duration_min,user_id
0,1,ST01,12,101
1,2,ST02,25,102
2,3,ST01,8,101
3,4,ST03,45,103
4,5,ST04,15,104


In [3]:
# Stations table
stations = pd.DataFrame({
    "station_id": ["ST01", "ST02", "ST03", "ST05"],
    "name": ["Sol", "Atocha", "Retiro", "Cibeles"],
    "capacity": [30, 25, 20, 35]
})

print("STATIONS:")
stations

STATIONS:


Unnamed: 0,station_id,name,capacity
0,ST01,Sol,30
1,ST02,Atocha,25
2,ST03,Retiro,20
3,ST05,Cibeles,35


In [4]:
# Users table
users = pd.DataFrame({
    "user_id": [101, 102, 103, 105],
    "name": ["Ana", "Carlos", "Maria", "Pedro"],
    "user_type": ["subscriber", "casual", "subscriber", "casual"]
})

print("USERS:")
users

USERS:


Unnamed: 0,user_id,name,user_type
0,101,Ana,subscriber
1,102,Carlos,casual
2,103,Maria,subscriber
3,105,Pedro,casual


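**Note:** The data contains intentional mismatches:
- `trips` references ST04, which does not exist in `stations`
- `stations` contains ST05, which never appears in `trips`
- Likewise for users: `trips` references user 104, which is missing from `users`, and `users` contains user 105, who has no trips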
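The mismatches listed above can be found programmatically before merging. The following is a minimal sketch using plain set differences on the key columns; the variable names (`stations_only_in_trips`, etc.) are illustrative and not part of the original notebook.

```python
import pandas as pd

# Reduced copies of the sample tables: only the key columns matter here
trips = pd.DataFrame({
    "trip_id": [1, 2, 3, 4, 5],
    "station_id": ["ST01", "ST02", "ST01", "ST03", "ST04"],
    "user_id": [101, 102, 101, 103, 104],
})
stations = pd.DataFrame({"station_id": ["ST01", "ST02", "ST03", "ST05"]})
users = pd.DataFrame({"user_id": [101, 102, 103, 105]})

# Keys present in one table but missing from the other
stations_only_in_trips = set(trips["station_id"]) - set(stations["station_id"])
stations_only_in_stations = set(stations["station_id"]) - set(trips["station_id"])
users_only_in_trips = set(trips["user_id"]) - set(users["user_id"])
users_only_in_users = set(users["user_id"]) - set(trips["user_id"])

print("Stations missing from `stations`:", stations_only_in_trips)
print("Stations without trips:", stations_only_in_stations)
print("Users missing from `users`:", users_only_in_trips)
print("Users without trips:", users_only_in_users)
```

Knowing these sets in advance makes it easier to predict which rows each join type will drop or fill with NaN.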

---

## 1. Inner Join (the default)

Keeps only the rows whose key exists in **both** tables.

In [5]:
# Merge trips with stations (how="inner" is the default)
result = pd.merge(trips, stations, on="station_id")
result

Unnamed: 0,trip_id,station_id,duration_min,user_id,name,capacity
0,1,ST01,12,101,Sol,30
1,2,ST02,25,102,Atocha,25
2,3,ST01,8,101,Sol,30
3,4,ST03,45,103,Retiro,20


**Note:**
- Trip 5 (ST04) disappears because ST04 does not exist in `stations`
- Station ST05 is dropped because no trip references it

---

## 2. Left Join

Keeps **all** rows from the left table; columns from the right table are filled with NaN where there is no match.

In [6]:
result = pd.merge(trips, stations, on="station_id", how="left")
result

Unnamed: 0,trip_id,station_id,duration_min,user_id,name,capacity
0,1,ST01,12,101,Sol,30.0
1,2,ST02,25,102,Atocha,25.0
2,3,ST01,8,101,Sol,30.0
3,4,ST03,45,103,Retiro,20.0
4,5,ST04,15,104,,


**Note:** Trip 5 is kept, but `name` and `capacity` are NaN. Because of the NaN, `capacity` is shown as a float (30.0 instead of 30).
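The float values in `capacity` come from pandas upcasting an integer column to `float64` so it can hold NaN. The sketch below reproduces this on a reduced version of the tables and, as one possible remedy, converts the column to the nullable `Int64` dtype; whether that conversion is desirable depends on the downstream code.

```python
import pandas as pd

# Reduced tables: one matching trip and one trip with an unknown station
trips = pd.DataFrame({"trip_id": [1, 5], "station_id": ["ST01", "ST04"]})
stations = pd.DataFrame({"station_id": ["ST01"], "capacity": [30]})

left = pd.merge(trips, stations, on="station_id", how="left")
dtype_after_merge = left["capacity"].dtype
print(dtype_after_merge)  # float64: the integer column was upcast to hold NaN

# Optional: the nullable integer dtype keeps whole numbers and marks gaps as <NA>
left["capacity"] = left["capacity"].astype("Int64")
print(left["capacity"].dtype)  # Int64
```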

---

## 3. Right Join

Keeps **all** rows from the right table; columns from the left table are filled with NaN where there is no match.

In [7]:
result = pd.merge(trips, stations, on="station_id", how="right")
result

Unnamed: 0,trip_id,station_id,duration_min,user_id,name,capacity
0,1.0,ST01,12.0,101.0,Sol,30
1,3.0,ST01,8.0,101.0,Sol,30
2,2.0,ST02,25.0,102.0,Atocha,25
3,4.0,ST03,45.0,103.0,Retiro,20
4,,ST05,,,Cibeles,35


**Note:** ST05 (Cibeles) appears even though it has no trips, so its trip columns are NaN. Trip 5 (ST04) is dropped because ST04 is not in `stations`.

---

## 4. Outer Join (Full)

Keeps **all** rows from **both** tables, filling with NaN wherever one side has no match.

In [8]:
result = pd.merge(trips, stations, on="station_id", how="outer")
result

Unnamed: 0,trip_id,station_id,duration_min,user_id,name,capacity
0,1.0,ST01,12.0,101.0,Sol,30.0
1,3.0,ST01,8.0,101.0,Sol,30.0
2,2.0,ST02,25.0,102.0,Atocha,25.0
3,4.0,ST03,45.0,103.0,Retiro,20.0
4,5.0,ST04,15.0,104.0,,
5,,ST05,,,Cibeles,35.0


---

## 5. Merging on columns with different names

In [9]:
# Rename the key column for this example
trips_renamed = trips.rename(columns={"station_id": "start_station"})
trips_renamed.head()

Unnamed: 0,trip_id,start_station,duration_min,user_id
0,1,ST01,12,101
1,2,ST02,25,102
2,3,ST01,8,101
3,4,ST03,45,103
4,5,ST04,15,104


In [10]:
# Use left_on and right_on when the key columns have different names
result = pd.merge(
    trips_renamed, 
    stations, 
    left_on="start_station", 
    right_on="station_id"
)
result

Unnamed: 0,trip_id,start_station,duration_min,user_id,station_id,name,capacity
0,1,ST01,12,101,ST01,Sol,30
1,2,ST02,25,102,ST02,Atocha,25
2,3,ST01,8,101,ST01,Sol,30
3,4,ST03,45,103,ST03,Retiro,20
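With `left_on` and `right_on`, both key columns are kept in the result, so `start_station` and `station_id` carry the same values. A minimal sketch of one way to tidy this up is to drop the redundant column after the merge; the reduced tables below are illustrative.

```python
import pandas as pd

trips_renamed = pd.DataFrame({
    "trip_id": [1, 2],
    "start_station": ["ST01", "ST02"],
})
stations = pd.DataFrame({
    "station_id": ["ST01", "ST02"],
    "name": ["Sol", "Atocha"],
})

# Merge on differently named keys, then drop the duplicated key column
result = pd.merge(
    trips_renamed,
    stations,
    left_on="start_station",
    right_on="station_id",
).drop(columns="station_id")

print(result.columns.tolist())  # ['trip_id', 'start_station', 'name']
```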


---

## 6. Merging on multiple columns

In [11]:
# Data with a composite key (station_id + month)
sales_2023 = pd.DataFrame({
    "station_id": ["ST01", "ST01", "ST02"],
    "month": [1, 2, 1],
    "revenue": [1000, 1200, 800]
})

sales_2024 = pd.DataFrame({
    "station_id": ["ST01", "ST01", "ST02"],
    "month": [1, 2, 1],
    "revenue": [1100, 1300, 900]
})

print("Sales 2023:")
display(sales_2023)
print("Sales 2024:")
display(sales_2024)

Sales 2023:


Unnamed: 0,station_id,month,revenue
0,ST01,1,1000
1,ST01,2,1200
2,ST02,1,800


Sales 2024:


Unnamed: 0,station_id,month,revenue
0,ST01,1,1100
1,ST01,2,1300
2,ST02,1,900


In [12]:
# Merge on station_id AND month; suffixes disambiguate the two revenue columns
comparison = pd.merge(
    sales_2023, 
    sales_2024, 
    on=["station_id", "month"],
    suffixes=("_2023", "_2024")
)
comparison

Unnamed: 0,station_id,month,revenue_2023,revenue_2024
0,ST01,1,1000,1100
1,ST01,2,1200,1300
2,ST02,1,800,900


In [13]:
# Compute growth as the absolute difference in revenue
comparison["growth"] = comparison["revenue_2024"] - comparison["revenue_2023"]
comparison

Unnamed: 0,station_id,month,revenue_2023,revenue_2024,growth
0,ST01,1,1000,1100,100
1,ST01,2,1200,1300,100
2,ST02,1,800,900,100


---

## 7. Concatenating DataFrames with `concat()`

Use it to stack DataFrames that share the same schema (the same columns).

In [14]:
# Trips from different days
monday = pd.DataFrame({
    "trip_id": [1, 2],
    "duration": [15, 20],
    "day": ["monday", "monday"]
})

tuesday = pd.DataFrame({
    "trip_id": [3, 4],
    "duration": [12, 25],
    "day": ["tuesday", "tuesday"]
})

print("Monday:")
display(monday)
print("Tuesday:")
display(tuesday)

Monday:


Unnamed: 0,trip_id,duration,day
0,1,15,monday
1,2,20,monday


Tuesday:


Unnamed: 0,trip_id,duration,day
0,3,12,tuesday
1,4,25,tuesday


In [15]:
# Concatenate vertically; ignore_index=True rebuilds the index as 0..n-1
all_trips = pd.concat([monday, tuesday], ignore_index=True)
all_trips

Unnamed: 0,trip_id,duration,day
0,1,15,monday
1,2,20,monday
2,3,12,tuesday
3,4,25,tuesday
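When the schemas do not match exactly, `concat()` does not raise an error: it keeps the union of the columns and fills the gaps with NaN. The sketch below illustrates this with a hypothetical `wednesday` table whose duration column is named `duration_min` instead of `duration`.

```python
import pandas as pd

monday = pd.DataFrame({"trip_id": [1, 2], "duration": [15, 20]})
# Hypothetical table with a mismatched column name (duration_min vs duration)
wednesday = pd.DataFrame({"trip_id": [5], "duration_min": [30]})

stacked = pd.concat([monday, wednesday], ignore_index=True)
print(stacked.columns.tolist())  # ['trip_id', 'duration', 'duration_min']
print(stacked.isna().sum())      # missing values per column

# One way to fix it: align the column names before concatenating
fixed = pd.concat(
    [monday, wednesday.rename(columns={"duration_min": "duration"})],
    ignore_index=True,
)
```

Checking the column names before stacking avoids silently creating half-empty columns.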


### Concatenating horizontally

In [16]:
# Additional columns
extra_info = pd.DataFrame({
    "distance": [3.5, 4.2, 2.8, 5.0]
})

# Concatenate horizontally (rows are aligned by index)
combined = pd.concat([all_trips, extra_info], axis=1)
combined

Unnamed: 0,trip_id,duration,day,distance
0,1,15,monday,3.5
1,2,20,monday,4.2
2,3,12,tuesday,2.8
3,4,25,tuesday,5.0
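Horizontal concatenation aligns rows by index label, not by position. The example above works because both DataFrames share the index 0..3. The following sketch shows what can happen when the indexes differ (the index values 10..13 are made up for illustration) and one possible fix using `reset_index(drop=True)`.

```python
import pandas as pd

all_trips = pd.DataFrame({"trip_id": [1, 2, 3, 4]})  # index 0..3
# Hypothetical extra table whose index does not match all_trips
extra_info = pd.DataFrame(
    {"distance": [3.5, 4.2, 2.8, 5.0]},
    index=[10, 11, 12, 13],
)

# No index labels in common: the result has 8 rows, half of each column NaN
misaligned = pd.concat([all_trips, extra_info], axis=1)
print(misaligned.shape)  # (8, 2)

# Resetting the index makes the rows line up by position
aligned = pd.concat([all_trips, extra_info.reset_index(drop=True)], axis=1)
print(aligned.shape)  # (4, 2)
```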


---

## 8. Validating merges

In [17]:
# Check that the relationship is 1:1, 1:m, m:1, or m:m
try:
    result = pd.merge(
        trips, 
        stations, 
        on="station_id",
        validate="one_to_one"  # Expects 1:1
    )
except pd.errors.MergeError as e:
    print(f"Error: {e}")

Error: Merge keys are not unique in left dataset; not a one-to-one merge
Duplicates in left:
 station_id
      ST01 ...


In [18]:
# many_to_one is the correct relationship (several trips per station)
result = pd.merge(
    trips,
    stations,
    on="station_id",
    validate="many_to_one"
)
print("Merge validated successfully!")

Merge validated successfully!
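Validation matters most when the lookup table unexpectedly contains duplicate keys, because an unchecked merge then silently multiplies rows. The sketch below uses a hypothetical `stations_dup` table with ST01 listed twice to show both the row inflation and how `validate="many_to_one"` catches it.

```python
import pandas as pd

trips = pd.DataFrame({
    "trip_id": [1, 2, 3],
    "station_id": ["ST01", "ST02", "ST01"],
})
# Hypothetical lookup table with a duplicated key (ST01 appears twice)
stations_dup = pd.DataFrame({
    "station_id": ["ST01", "ST01", "ST02"],
    "name": ["Sol", "Sol (duplicate)", "Atocha"],
})

# Without validation, each ST01 trip matches two station rows
unchecked = pd.merge(trips, stations_dup, on="station_id")
print(len(trips), len(unchecked))  # 3 5

# With validation, pandas raises instead of returning inflated data
try:
    pd.merge(trips, stations_dup, on="station_id", validate="many_to_one")
    raised = False
except pd.errors.MergeError as e:
    raised = True
    print(f"Error: {e}")
```

Comparing `len()` before and after a merge is a cheap additional check when `validate` is not used.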


---

## 9. Indicator: see where each row comes from

In [19]:
result = pd.merge(
    trips, 
    stations, 
    on="station_id", 
    how="outer",
    indicator=True
)
result

Unnamed: 0,trip_id,station_id,duration_min,user_id,name,capacity,_merge
0,1.0,ST01,12.0,101.0,Sol,30.0,both
1,3.0,ST01,8.0,101.0,Sol,30.0,both
2,2.0,ST02,25.0,102.0,Atocha,25.0,both
3,4.0,ST03,45.0,103.0,Retiro,20.0,both
4,5.0,ST04,15.0,104.0,,,left_only
5,,ST05,,,Cibeles,35.0,right_only


In [20]:
# Count rows by origin
result["_merge"].value_counts()

_merge
both          4
left_only     1
right_only    1
Name: count, dtype: int64
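The `_merge` column can also be used to select the unmatched rows, a pattern often called an anti-join. The following is a minimal sketch that keeps only the trips whose station is missing from `stations`; the name `orphan_trips` is illustrative.

```python
import pandas as pd

trips = pd.DataFrame({
    "trip_id": [1, 2, 3, 4, 5],
    "station_id": ["ST01", "ST02", "ST01", "ST03", "ST04"],
})
stations = pd.DataFrame({
    "station_id": ["ST01", "ST02", "ST03", "ST05"],
    "name": ["Sol", "Atocha", "Retiro", "Cibeles"],
})

# A left join is enough here: we only care about rows from trips
merged = pd.merge(trips, stations, on="station_id", how="left", indicator=True)

# Keep the rows found only in trips, restricted to the original trips columns
orphan_trips = merged.loc[merged["_merge"] == "left_only", trips.columns]
print(orphan_trips)
```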

---

## 10. Practical example: enriching trips

In [21]:
# Start from trips
enriched = trips.copy()

# Add station info (rename `name` to avoid clashing with the user name later)
enriched = pd.merge(
    enriched,
    stations[["station_id", "name"]],
    on="station_id",
    how="left"
).rename(columns={"name": "station_name"})

# Add user info
enriched = pd.merge(
    enriched,
    users[["user_id", "name", "user_type"]],
    on="user_id",
    how="left"
).rename(columns={"name": "user_name"})

enriched

Unnamed: 0,trip_id,station_id,duration_min,user_id,station_name,user_name,user_type
0,1,ST01,12,101,Sol,Ana,subscriber
1,2,ST02,25,102,Atocha,Carlos,casual
2,3,ST01,8,101,Sol,Ana,subscriber
3,4,ST03,45,103,Retiro,Maria,subscriber
4,5,ST04,15,104,,,
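The example above avoids the `name` clash between `stations` and `users` by renaming after each merge. An alternative sketch, shown here on reduced tables, lets `merge()` resolve the conflict through its `suffixes` parameter; the suffix strings `_station` and `_user` are arbitrary choices.

```python
import pandas as pd

trips = pd.DataFrame({
    "trip_id": [1, 2],
    "station_id": ["ST01", "ST02"],
    "user_id": [101, 102],
})
stations = pd.DataFrame({"station_id": ["ST01", "ST02"], "name": ["Sol", "Atocha"]})
users = pd.DataFrame({"user_id": [101, 102], "name": ["Ana", "Carlos"]})

# First merge: no conflict yet, the station name arrives as `name`
with_station = pd.merge(trips, stations, on="station_id", how="left")

# Second merge: both sides have `name`, so the suffixes are applied
enriched = pd.merge(
    with_station,
    users,
    on="user_id",
    how="left",
    suffixes=("_station", "_user"),
)
print(enriched.columns.tolist())
# ['trip_id', 'station_id', 'user_id', 'name_station', 'name_user']

# A left join against unique keys should never change the row count
assert len(enriched) == len(trips)
```

Without explicit suffixes, pandas would name the two columns `name_x` and `name_y`, which is harder to read.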


---

## Summary

| Type | Keeps |
|------|-------|
| `inner` | Only matching rows |
| `left` | All rows from the left table |
| `right` | All rows from the right table |
| `outer` | All rows from both tables |

| Function | Use |
|----------|-----|
| `pd.merge()` | Combine on key columns |
| `pd.concat()` | Stack DataFrames vertically or horizontally |

---

**Previous:** [05.01 - GroupBy and Aggregations](05_01_groupby.ipynb)  
**Next:** [05.03 - Pivot and Reshape](05_03_pivot_reshape.ipynb)