# 05.01 - GroupBy and Aggregations

**Author:** Miguel Angel Vazquez Varela  
**Level:** Intermediate  
**Estimated time:** 30 min

---

## What will we learn?

- The split-apply-combine concept
- Grouping by one or several columns
- Aggregation functions
- Multiple aggregations with `agg()`
- Transformations with `transform()`

In [1]:
import pandas as pd
import numpy as np

---

## Sample data

In [2]:
trips = pd.DataFrame({
    "trip_id": range(1, 16),
    "station": ["Sol", "Atocha", "Sol", "Retiro", "Cibeles",
                "Sol", "Atocha", "Retiro", "Cibeles", "Sol",
                "Atocha", "Sol", "Retiro", "Cibeles", "Atocha"],
    "user_type": ["subscriber", "casual", "subscriber", "subscriber", "casual",
                  "subscriber", "subscriber", "casual", "subscriber", "casual",
                  "subscriber", "subscriber", "casual", "subscriber", "casual"],
    "duration_min": [12, 25, 8, 45, 15, 30, 18, 22, 35, 10, 14, 28, 40, 12, 20],
    "distance_km": [2.5, 5.0, 1.8, 8.2, 3.1, 6.0, 3.5, 4.2, 7.0, 2.0, 2.8, 5.5, 7.5, 2.3, 4.0]
})

trips

,trip_id,station,user_type,duration_min,distance_km
0,1,Sol,subscriber,12,2.5
1,2,Atocha,casual,25,5.0
2,3,Sol,subscriber,8,1.8
3,4,Retiro,subscriber,45,8.2
4,5,Cibeles,casual,15,3.1
5,6,Sol,subscriber,30,6.0
6,7,Atocha,subscriber,18,3.5
7,8,Retiro,casual,22,4.2
8,9,Cibeles,subscriber,35,7.0
9,10,Sol,casual,10,2.0


---

## 1. Concept: Split-Apply-Combine

GroupBy works in three steps:

1. **Split**: Divide the data into groups according to a key
2. **Apply**: Apply a function to each group independently
3. **Combine**: Combine the per-group results into a single object
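To make the three steps concrete, the following minimal sketch performs them by hand and compares the result with a single `groupby` call. The small `df` frame and the names `pieces`, `means`, and `manual` are illustrative assumptions, not part of the notebook's `trips` data.

```python
import pandas as pd

# Hypothetical mini dataset, separate from `trips`
df = pd.DataFrame({
    "station": ["Sol", "Atocha", "Sol", "Atocha"],
    "duration_min": [10, 20, 30, 40],
})

# 1. Split: one sub-DataFrame per distinct station
pieces = {name: df[df["station"] == name] for name in df["station"].unique()}

# 2. Apply: compute the mean duration of each piece
means = {name: piece["duration_min"].mean() for name, piece in pieces.items()}

# 3. Combine: gather the per-group results into a single Series
manual = pd.Series(means, name="duration_min").sort_index()

# The same computation expressed with groupby
with_groupby = df.groupby("station")["duration_min"].mean()

print(manual)
print(with_groupby)
```

Both Series should hold the same mean per station; `groupby` simply does the splitting and recombining for us.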

In [3]:
# Group by station
grouped = trips.groupby("station")
print(f"Type: {type(grouped)}")
print(f"Groups: {grouped.ngroups}")

Type: <class 'pandas.api.typing.DataFrameGroupBy'>
Groups: 4


In [4]:
# Inspect the groups (group key -> row labels)
grouped.groups

{'Atocha': [1, 6, 10, 14], 'Cibeles': [4, 8, 13], 'Retiro': [3, 7, 12], 'Sol': [0, 2, 5, 9, 11]}

In [5]:
# Access a specific group
grouped.get_group("Sol")

,trip_id,station,user_type,duration_min,distance_km
0,1,Sol,subscriber,12,2.5
2,3,Sol,subscriber,8,1.8
5,6,Sol,subscriber,30,6.0
9,10,Sol,casual,10,2.0
11,12,Sol,subscriber,28,5.5


---

## 2. Basic aggregations

In [6]:
# Mean duration per station
trips.groupby("station")["duration_min"].mean()

station
Atocha     19.250000
Cibeles    20.666667
Retiro     35.666667
Sol        17.600000
Name: duration_min, dtype: float64

In [7]:
# Total distance per station
trips.groupby("station")["distance_km"].sum()

station
Atocha     15.3
Cibeles    12.4
Retiro     19.9
Sol        17.8
Name: distance_km, dtype: float64

In [8]:
# Number of trips per station
trips.groupby("station")["trip_id"].count()

station
Atocha     4
Cibeles    3
Retiro     3
Sol        5
Name: trip_id, dtype: int64

In [9]:
# Also with size(), which counts rows per group (including rows with missing values)
trips.groupby("station").size()

station
Atocha     4
Cibeles    3
Retiro     3
Sol        5
dtype: int64
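`count()` and `size()` agree above because `trips` has no missing values. As a small sketch on hypothetical data (the `df` frame below is an assumption, not part of the notebook), the two differ once a value is missing: `count()` skips nulls, while `size()` counts every row.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing duration
df = pd.DataFrame({
    "station": ["Sol", "Sol", "Atocha"],
    "duration_min": [12.0, np.nan, 25.0],
})

counts = df.groupby("station")["duration_min"].count()  # non-null values only
sizes = df.groupby("station").size()                    # all rows in the group

print(counts["Sol"], sizes["Sol"])  # 1 2
```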

### Multiple columns at once

In [10]:
# Mean of the selected numeric columns
trips.groupby("station")[["duration_min", "distance_km"]].mean()

station,duration_min,distance_km
Atocha,19.25,3.825
Cibeles,20.666667,4.133333
Retiro,35.666667,6.633333
Sol,17.6,3.56


---

## 3. Grouping by multiple columns

In [11]:
# By station AND user type
trips.groupby(["station", "user_type"])["duration_min"].mean()

station  user_type 
Atocha   casual        22.5
         subscriber    16.0
Cibeles  casual        15.0
         subscriber    23.5
Retiro   casual        31.0
         subscriber    45.0
Sol      casual        10.0
         subscriber    19.5
Name: duration_min, dtype: float64

In [12]:
# Result as a flat DataFrame (easier to read)
trips.groupby(["station", "user_type"])["duration_min"].mean().reset_index()

,station,user_type,duration_min
0,Atocha,casual,22.5
1,Atocha,subscriber,16.0
2,Cibeles,casual,15.0
3,Cibeles,subscriber,23.5
4,Retiro,casual,31.0
5,Retiro,subscriber,45.0
6,Sol,casual,10.0
7,Sol,subscriber,19.5


In [13]:
# Or pass as_index=False
trips.groupby(["station", "user_type"], as_index=False)["duration_min"].mean()

,station,user_type,duration_min
0,Atocha,casual,22.5
1,Atocha,subscriber,16.0
2,Cibeles,casual,15.0
3,Cibeles,subscriber,23.5
4,Retiro,casual,31.0
5,Retiro,subscriber,45.0
6,Sol,casual,10.0
7,Sol,subscriber,19.5


---

## 4. Multiple aggregations with `agg()`

In [14]:
# One column, multiple functions
trips.groupby("station")["duration_min"].agg(["mean", "min", "max", "count"])

station,mean,min,max,count
Atocha,19.25,14,25,4
Cibeles,20.666667,12,35,3
Retiro,35.666667,22,45,3
Sol,17.6,8,30,5


In [15]:
# With custom column names
trips.groupby("station")["duration_min"].agg(
    avg_duration="mean",
    min_duration="min",
    max_duration="max",
    total_trips="count"
)

station,avg_duration,min_duration,max_duration,total_trips
Atocha,19.25,14,25,4
Cibeles,20.666667,12,35,3
Retiro,35.666667,22,45,3
Sol,17.6,8,30,5


### Different functions per column

In [16]:
# Dictionary of aggregations (produces two-level column names)
trips.groupby("station").agg({
    "duration_min": ["mean", "std"],
    "distance_km": ["sum", "mean"],
    "trip_id": "count"
})

,duration_min,duration_min,distance_km,distance_km,trip_id
station,mean,std,sum,mean,count
Atocha,19.25,4.573474,15.3,3.825,4
Cibeles,20.666667,12.503333,12.4,4.133333,3
Retiro,35.666667,12.096832,19.9,6.633333,3
Sol,17.6,10.526158,17.8,3.56,5


In [17]:
# Named aggregations (cleaner: flat, explicit column names)
trips.groupby("station", as_index=False).agg(
    avg_duration=("duration_min", "mean"),
    total_distance=("distance_km", "sum"),
    num_trips=("trip_id", "count")
)

,station,avg_duration,total_distance,num_trips
0,Atocha,19.25,15.3,4
1,Cibeles,20.666667,12.4,3
2,Retiro,35.666667,19.9,3
3,Sol,17.6,17.8,5


---

## 5. Custom aggregation functions

In [18]:
# Custom function: range (max - min)
def range_func(x):
    return x.max() - x.min()

trips.groupby("station")["duration_min"].agg(range_func)

station
Atocha     11
Cibeles    23
Retiro     23
Sol        22
Name: duration_min, dtype: int64

In [19]:
# With a lambda
trips.groupby("station")["duration_min"].agg(lambda x: x.max() - x.min())

station
Atocha     11
Cibeles    23
Retiro     23
Sol        22
Name: duration_min, dtype: int64

In [20]:
# 90th percentile
trips.groupby("station")["duration_min"].agg(lambda x: x.quantile(0.9))

station
Atocha     23.5
Cibeles    31.0
Retiro     44.0
Sol        29.2
Name: duration_min, dtype: float64

---

## 6. `transform()`: keeping the original dimensions

Unlike `agg()`, `transform()` returns a result with the **same length** as the original, aligned with the original index.
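One way to picture this is that `transform("mean")` computes the per-group mean and then repeats it on every row of its group. The sketch below, on a hypothetical `df` (an assumption, not the notebook's `trips`), checks that against mapping each row's key to the aggregated result.

```python
import pandas as pd

# Hypothetical mini dataset
df = pd.DataFrame({
    "station": ["Sol", "Atocha", "Sol", "Atocha"],
    "duration_min": [10, 20, 30, 40],
})

group_means = df.groupby("station")["duration_min"].mean()           # one value per group
broadcast = df.groupby("station")["duration_min"].transform("mean")  # one value per row

# Looking up each row's station in the aggregated result gives the same values
mapped = df["station"].map(group_means)

print(len(group_means), len(broadcast))  # 2 4
print(broadcast.tolist())                # [20.0, 30.0, 20.0, 30.0]
print(mapped.tolist())                   # [20.0, 30.0, 20.0, 30.0]
```

Because the result shares the original index, it can be assigned directly as a new column.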

In [21]:
# Mean per station - regular aggregation (one row per group)
trips.groupby("station")["duration_min"].mean()

station
Atocha     19.250000
Cibeles    20.666667
Retiro     35.666667
Sol        17.600000
Name: duration_min, dtype: float64

In [22]:
# Mean per station - with transform (same length as the original)
trips.groupby("station")["duration_min"].transform("mean")

0     17.600000
1     19.250000
2     17.600000
3     35.666667
4     20.666667
5     17.600000
6     19.250000
7     35.666667
8     20.666667
9     17.600000
10    19.250000
11    17.600000
12    35.666667
13    20.666667
14    19.250000
Name: duration_min, dtype: float64

### Typical use: normalizing within groups

In [23]:
# Add a column with the group mean
trips["station_avg"] = trips.groupby("station")["duration_min"].transform("mean")

# Difference from the group mean
trips["diff_from_avg"] = trips["duration_min"] - trips["station_avg"]

trips[["station", "duration_min", "station_avg", "diff_from_avg"]]

,station,duration_min,station_avg,diff_from_avg
0,Sol,12,17.6,-5.6
1,Atocha,25,19.25,5.75
2,Sol,8,17.6,-9.6
3,Retiro,45,35.666667,9.333333
4,Cibeles,15,20.666667,-5.666667
5,Sol,30,17.6,12.4
6,Atocha,18,19.25,-1.25
7,Retiro,22,35.666667,-13.666667
8,Cibeles,35,20.666667,14.333333
9,Sol,10,17.6,-7.6


In [24]:
# Remove the helper columns
trips = trips.drop(columns=["station_avg", "diff_from_avg"])

---

## 7. Filtering groups with `filter()`

In [25]:
# Only stations with more than 3 trips
trips.groupby("station").filter(lambda x: len(x) > 3)

,trip_id,station,user_type,duration_min,distance_km
0,1,Sol,subscriber,12,2.5
1,2,Atocha,casual,25,5.0
2,3,Sol,subscriber,8,1.8
5,6,Sol,subscriber,30,6.0
6,7,Atocha,subscriber,18,3.5
9,10,Sol,casual,10,2.0
10,11,Atocha,subscriber,14,2.8
11,12,Sol,subscriber,28,5.5
14,15,Atocha,casual,20,4.0


In [26]:
# Only stations with mean duration > 20
trips.groupby("station").filter(lambda x: x["duration_min"].mean() > 20)

,trip_id,station,user_type,duration_min,distance_km
3,4,Retiro,subscriber,45,8.2
4,5,Cibeles,casual,15,3.1
7,8,Retiro,casual,22,4.2
8,9,Cibeles,subscriber,35,7.0
12,13,Retiro,casual,40,7.5
13,14,Cibeles,subscriber,12,2.3


---

## 8. Iterating over groups

In [27]:
for name, group in trips.groupby("station"):
    print(f"\n--- {name} ({len(group)} trips) ---")
    print(f"Mean duration: {group['duration_min'].mean():.1f} min")


--- Atocha (4 trips) ---
Mean duration: 19.2 min

--- Cibeles (3 trips) ---
Mean duration: 20.7 min

--- Retiro (3 trips) ---
Mean duration: 35.7 min

--- Sol (5 trips) ---
Mean duration: 17.6 min


---

## Summary

| Method | Description | Result |
|--------|-------------|-----------|
| `.agg()` | Aggregates each group | One row per group |
| `.transform()` | Preserves the original length | Same rows as the original |
| `.filter()` | Keeps or drops whole groups | Subset of the original rows |
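As a quick check of the table, this sketch runs the three methods on one hypothetical frame (`df` is an assumption, not the notebook's `trips`) and compares the lengths of the results.

```python
import pandas as pd

# Hypothetical data: Sol has 3 rows, Atocha has 2
df = pd.DataFrame({
    "station": ["Sol", "Sol", "Sol", "Atocha", "Atocha"],
    "duration_min": [10, 20, 30, 40, 50],
})
g = df.groupby("station")

aggregated = g["duration_min"].agg("mean")         # one row per group
transformed = g["duration_min"].transform("mean")  # one row per original row
filtered = g.filter(lambda x: len(x) > 2)          # only rows of groups with > 2 rows

print(len(aggregated), len(transformed), len(filtered))  # 2 5 3
```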

**Common aggregation functions:**
- `mean`, `sum`, `count`, `size`
- `min`, `max`, `std`, `var`
- `first`, `last`, `nunique`
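The notebook does not demonstrate `first`, `last`, or `nunique`, so here is a minimal sketch using named aggregations. The `df` frame and the output column names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data, in the same spirit as `trips`
df = pd.DataFrame({
    "station": ["Sol", "Sol", "Sol", "Atocha", "Atocha"],
    "user_type": ["subscriber", "casual", "subscriber", "casual", "casual"],
    "duration_min": [12, 10, 28, 25, 20],
})

summary = df.groupby("station").agg(
    first_duration=("duration_min", "first"),  # first value in each group
    last_duration=("duration_min", "last"),    # last value in each group
    user_types=("user_type", "nunique"),       # number of distinct user types
)
print(summary)
```

`first` and `last` follow the row order within each group, so sort the frame beforehand if that order matters.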

---

**Previous:** [04.03 - Selection and Filtering](../04_pandas_basics/04_03_selection_filtering.ipynb)  
**Next:** [05.02 - Merge and Join](05_02_merge_join.ipynb)