# 06.01 - Missing Values

**Author:** Miguel Angel Vazquez Varela  
**Level:** Intermediate  
**Estimated time:** 25 min

---

## What will we learn?

- Detecting missing values (NaN, None)
- Dropping rows/columns with `dropna()`
- Filling values with `fillna()`
- Imputation strategies

In [1]:
import pandas as pd
import numpy as np

---

## Sample data

In [2]:
trips = pd.DataFrame({
    "trip_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "duration_min": [12, np.nan, 8, 45, np.nan, 30, 18, np.nan],
    "distance_km": [2.5, 5.0, np.nan, 8.2, 3.1, np.nan, 3.5, 4.2],
    "station": ["Sol", "Atocha", "Sol", None, "Cibeles", "Sol", None, "Retiro"],
    "user_type": ["subscriber", "casual", "subscriber", "subscriber", None, "casual", "subscriber", "casual"]
})

trips

Unnamed: 0,trip_id,duration_min,distance_km,station,user_type
0,1,12.0,2.5,Sol,subscriber
1,2,,5.0,Atocha,casual
2,3,8.0,,Sol,subscriber
3,4,45.0,8.2,,subscriber
4,5,,3.1,Cibeles,
5,6,30.0,,Sol,casual
6,7,18.0,3.5,,subscriber
7,8,,4.2,Retiro,casual


---

## 1. Detecting missing values

In [3]:
# isna() returns True where a value is missing (NaN or None)
trips.isna()

Unnamed: 0,trip_id,duration_min,distance_km,station,user_type
0,False,False,False,False,False
1,False,True,False,False,False
2,False,False,True,False,False
3,False,False,False,True,False
4,False,True,False,False,True
5,False,False,True,False,False
6,False,False,False,True,False
7,False,True,False,False,False


In [4]:
# Count NaN per column
trips.isna().sum()

trip_id         0
duration_min    3
distance_km     2
station         2
user_type       1
dtype: int64

In [5]:
# Percentage of NaN per column
(trips.isna().sum() / len(trips) * 100).round(1)

trip_id          0.0
duration_min    37.5
distance_km     25.0
station         25.0
user_type       12.5
dtype: float64

In [6]:
# Total NaN count and number of rows with at least one NaN
print(f"Total NaN: {trips.isna().sum().sum()}")
print(f"Rows with any NaN: {trips.isna().any(axis=1).sum()}")

Total NaN: 8
Rows with any NaN: 7


In [7]:
# Show rows with missing values
trips[trips.isna().any(axis=1)]

Unnamed: 0,trip_id,duration_min,distance_km,station,user_type
1,2,,5.0,Atocha,casual
2,3,8.0,,Sol,subscriber
3,4,45.0,8.2,,subscriber
4,5,,3.1,Cibeles,
5,6,30.0,,Sol,casual
6,7,18.0,3.5,,subscriber
7,8,,4.2,Retiro,casual
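
The count and percentage checks above can be combined into a single summary table. The following is a minimal sketch on a small made-up DataFrame. The name `missing_summary` and the column labels `n_missing` and `pct_missing` are illustrative choices, not part of pandas.

```python
import numpy as np
import pandas as pd

# Small illustrative dataset (values are made up for this sketch)
trips = pd.DataFrame({
    "trip_id": [1, 2, 3, 4],
    "duration_min": [12, np.nan, 8, np.nan],
    "station": ["Sol", None, "Sol", "Retiro"],
})

# One row per column: absolute count and percentage of missing values.
# isna().mean() is the fraction of True values, i.e. the share of NaN.
missing_summary = pd.DataFrame({
    "n_missing": trips.isna().sum(),
    "pct_missing": (trips.isna().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing_summary)
```

Sorting by `n_missing` puts the most affected columns first, which helps when deciding which ones need attention.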


---

## 2. Dropping with `dropna()`

In [8]:
# Drop rows with ANY NaN
trips.dropna()

Unnamed: 0,trip_id,duration_min,distance_km,station,user_type
0,1,12.0,2.5,Sol,subscriber
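
Note that `dropna()` returns a new DataFrame and leaves the original unchanged by default, so the result has to be assigned to be kept. A minimal sketch on a small made-up DataFrame, where `trips_complete` is just an illustrative name:

```python
import numpy as np
import pandas as pd

# Small illustrative dataset (values are made up for this sketch)
trips = pd.DataFrame({
    "trip_id": [1, 2, 3],
    "duration_min": [12, np.nan, 8],
})

# Calling dropna() without assigning the result changes nothing
trips.dropna()
print(len(trips))  # the original still has all 3 rows

# Assign the result to keep it; reset_index renumbers the surviving rows
trips_complete = trips.dropna().reset_index(drop=True)
print(len(trips_complete))
```

Whether to call `reset_index(drop=True)` depends on whether the original row labels are still needed later.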


In [9]:
# Drop rows where ALL columns are NaN (no such row here, so nothing is dropped)
trips.dropna(how="all")

Unnamed: 0,trip_id,duration_min,distance_km,station,user_type
0,1,12.0,2.5,Sol,subscriber
1,2,,5.0,Atocha,casual
2,3,8.0,,Sol,subscriber
3,4,45.0,8.2,,subscriber
4,5,,3.1,Cibeles,
5,6,30.0,,Sol,casual
6,7,18.0,3.5,,subscriber
7,8,,4.2,Retiro,casual


In [10]:
# Drop rows with NaN in specific columns
trips.dropna(subset=["duration_min", "station"])

Unnamed: 0,trip_id,duration_min,distance_km,station,user_type
0,1,12.0,2.5,Sol,subscriber
2,3,8.0,,Sol,subscriber
5,6,30.0,,Sol,casual


In [11]:
# Drop columns that contain any NaN (axis=1)
trips.dropna(axis=1)

Unnamed: 0,trip_id
0,1
1,2
2,3
3,4
4,5
5,6
6,7
7,8


In [12]:
# thresh: minimum number of non-null values a row needs to be kept
trips.dropna(thresh=4)  # At least 4 non-null values

Unnamed: 0,trip_id,duration_min,distance_km,station,user_type
0,1,12.0,2.5,Sol,subscriber
1,2,,5.0,Atocha,casual
2,3,8.0,,Sol,subscriber
3,4,45.0,8.2,,subscriber
5,6,30.0,,Sol,casual
6,7,18.0,3.5,,subscriber
7,8,,4.2,Retiro,casual
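
`thresh` also works with `axis=1`, which gives a softer alternative to dropping every column that has any NaN. The sketch below keeps only columns that are at least 50% complete. The 50% cutoff and the `promo_code` column are assumptions made for illustration.

```python
import pandas as pd
import numpy as np

# Small illustrative dataset; promo_code is a hypothetical, mostly empty column
df = pd.DataFrame({
    "trip_id": [1, 2, 3, 4],
    "duration_min": [12, np.nan, 8, 45],       # 1 of 4 missing
    "promo_code": [None, None, None, "X1"],    # 3 of 4 missing
})

# Keep columns with at least 50% non-null values (cutoff chosen arbitrarily)
min_non_null = int(len(df) * 0.5)
df_reduced = df.dropna(axis=1, thresh=min_non_null)

print(list(df_reduced.columns))
```

Here `duration_min` survives because it has three non-null values, while `promo_code` is dropped because it has only one.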


---

## 3. Filling with `fillna()`

In [13]:
# Fill with a fixed value
trips["duration_min"].fillna(0)

0    12.0
1     0.0
2     8.0
3    45.0
4     0.0
5    30.0
6    18.0
7     0.0
Name: duration_min, dtype: float64

In [14]:
# Fill with the mean
mean_duration = trips["duration_min"].mean()
print(f"Mean: {mean_duration:.1f}")

trips["duration_min"].fillna(mean_duration)

Mean: 22.6


0    12.0
1    22.6
2     8.0
3    45.0
4    22.6
5    30.0
6    18.0
7    22.6
Name: duration_min, dtype: float64

In [15]:
# Fill with the median (more robust to outliers)
median_duration = trips["duration_min"].median()
print(f"Median: {median_duration}")

trips["duration_min"].fillna(median_duration)

Median: 18.0


0    12.0
1    18.0
2     8.0
3    45.0
4    18.0
5    30.0
6    18.0
7    18.0
Name: duration_min, dtype: float64
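
To see why the median is described as more robust to outliers, the following sketch compares both statistics on a made-up Series in which one trip has an implausibly long duration. The values are invented for illustration.

```python
import pandas as pd

# Four ordinary trips plus one hypothetical outlier (600 minutes)
durations = pd.Series([10, 12, 14, 16, 600])

mean_value = durations.mean()      # pulled far upward by the outlier
median_value = durations.median()  # stays at the middle observation

print(f"Mean: {mean_value:.1f}")
print(f"Median: {median_value:.1f}")
```

Filling gaps with the mean here would insert a value larger than four of the five observed trips, while the median stays close to a typical trip.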

In [16]:
# Fill text with a default value
trips["station"].fillna("Unknown")

0        Sol
1     Atocha
2        Sol
3    Unknown
4    Cibeles
5        Sol
6    Unknown
7     Retiro
Name: station, dtype: str

In [17]:
# Fill with the most frequent value (mode)
# mode() returns a Series because ties are possible; [0] takes the first one
mode_station = trips["station"].mode()[0]
print(f"Mode: {mode_station}")

trips["station"].fillna(mode_station)

Mode: Sol


0        Sol
1     Atocha
2        Sol
3        Sol
4    Cibeles
5        Sol
6        Sol
7     Retiro
Name: station, dtype: str

### Filling with the previous/next value

In [18]:
# Forward fill: uses the previous value
trips["duration_min"].ffill()

0    12.0
1    12.0
2     8.0
3    45.0
4    45.0
5    30.0
6    18.0
7    18.0
Name: duration_min, dtype: float64

In [19]:
# Backward fill: uses the next value (a trailing NaN stays, since nothing follows it)
trips["duration_min"].bfill()

0    12.0
1     8.0
2     8.0
3    45.0
4    30.0
5    30.0
6    18.0
7     NaN
Name: duration_min, dtype: float64
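
Forward fill cannot fill a leading NaN, and backward fill cannot fill a trailing one. A minimal sketch, on a made-up Series, of two common adjustments: chaining both fills to cover the edges, and using the `limit` parameter to cap how many consecutive gaps are filled.

```python
import numpy as np
import pandas as pd

# Made-up Series with a leading gap, a two-value gap, and a trailing gap
duration = pd.Series([np.nan, 12, np.nan, np.nan, 30, np.nan])

# ffill leaves the leading NaN; the following bfill then fills it
filled = duration.ffill().bfill()

# limit=1 fills at most one consecutive NaN after each valid value
limited = duration.ffill(limit=1)

print(filled.tolist())
print(limited.tolist())
```

Capping with `limit` is a way to avoid propagating a single observation across a long run of missing values.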

---

## 4. Different strategies per column

In [20]:
# Copy so the original is not modified
trips_clean = trips.copy()

# Numeric columns: fill with the median
trips_clean["duration_min"] = trips_clean["duration_min"].fillna(
    trips_clean["duration_min"].median()
)
trips_clean["distance_km"] = trips_clean["distance_km"].fillna(
    trips_clean["distance_km"].median()
)

# Categorical columns: fill with the mode or "Unknown"
trips_clean["station"] = trips_clean["station"].fillna("Unknown")
trips_clean["user_type"] = trips_clean["user_type"].fillna(
    trips_clean["user_type"].mode()[0]
)

trips_clean

Unnamed: 0,trip_id,duration_min,distance_km,station,user_type
0,1,12.0,2.5,Sol,subscriber
1,2,18.0,5.0,Atocha,casual
2,3,8.0,3.85,Sol,subscriber
3,4,45.0,8.2,Unknown,subscriber
4,5,18.0,3.1,Cibeles,subscriber
5,6,30.0,3.85,Sol,casual
6,7,18.0,3.5,Unknown,subscriber
7,8,18.0,4.2,Retiro,casual


In [21]:
# Verify that no NaN remain
trips_clean.isna().sum()

trip_id         0
duration_min    0
distance_km     0
station         0
user_type       0
dtype: int64
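
The column-by-column assignments above can also be written as a single call, because `DataFrame.fillna()` accepts a dict that maps column names to fill values. A minimal sketch on a small made-up DataFrame, where `fill_values` is an illustrative name:

```python
import numpy as np
import pandas as pd

# Small illustrative dataset (values are made up for this sketch)
trips = pd.DataFrame({
    "duration_min": [12, np.nan, 8, 45],
    "distance_km": [2.5, 5.0, np.nan, 8.2],
    "station": ["Sol", None, "Sol", "Atocha"],
})

# One fill value per column; columns not listed are left untouched
fill_values = {
    "duration_min": trips["duration_min"].median(),
    "distance_km": trips["distance_km"].median(),
    "station": "Unknown",
}

trips_clean = trips.fillna(value=fill_values)
print(trips_clean)
```

Keeping the strategy in one dict makes it easy to review which rule applies to which column.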

---

## 5. Imputation by group

In [22]:
# Data with a per-group pattern
trips2 = pd.DataFrame({
    "station": ["Sol", "Sol", "Sol", "Atocha", "Atocha", "Atocha"],
    "duration": [10, np.nan, 12, 25, 30, np.nan]
})

trips2

Unnamed: 0,station,duration
0,Sol,10.0
1,Sol,
2,Sol,12.0
3,Atocha,25.0
4,Atocha,30.0
5,Atocha,


In [23]:
# Fill with the mean of each row's GROUP
trips2["duration_filled"] = trips2.groupby("station")["duration"].transform(
    lambda x: x.fillna(x.mean())
)

trips2

Unnamed: 0,station,duration,duration_filled
0,Sol,10.0,10.0
1,Sol,,11.0
2,Sol,12.0,12.0
3,Atocha,25.0,25.0
4,Atocha,30.0,30.0
5,Atocha,,27.5
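
Group-based imputation leaves NaN in place when a group has no observed values at all, because the group mean is itself NaN. One possible way to handle this, sketched below on made-up data, is to fall back to the overall mean. The fallback choice is an assumption for illustration, not the only reasonable option.

```python
import numpy as np
import pandas as pd

# Made-up data: the "Retiro" group has no observed duration at all
trips2 = pd.DataFrame({
    "station": ["Sol", "Sol", "Atocha", "Atocha", "Retiro"],
    "duration": [10, np.nan, 25, np.nan, np.nan],
})

# Mean of each row's group, aligned to the original index (NaN for Retiro)
group_mean = trips2.groupby("station")["duration"].transform("mean")

# Fallback for groups with no data: mean of all observed values
overall_mean = trips2["duration"].mean()

trips2["duration_filled"] = (
    trips2["duration"].fillna(group_mean).fillna(overall_mean)
)
print(trips2)
```

Passing the string `"mean"` to `transform` is equivalent here to the lambda version and avoids calling a Python function per group.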


---

## 6. Interpolation

In [24]:
# Time series with gaps
ts = pd.Series([10, np.nan, np.nan, 40, 50, np.nan, 70])
print("Original:")
print(ts.values)

Original:
[10. nan nan 40. 50. nan 70.]


In [25]:
# Linear interpolation (the default method; treats the points as equally spaced)
print("Interpolated:")
print(ts.interpolate().values)

Interpolated:
[10. 20. 30. 40. 50. 60. 70.]
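
The default linear method ignores the index, which matters when observations are unevenly spaced in time. With a `DatetimeIndex`, `method="time"` weights the interpolation by the actual time elapsed. The sketch below uses made-up dates and values to contrast the two.

```python
import numpy as np
import pandas as pd

# Made-up readings: the gap falls 1 day after the first point
# and 3 days before the last one
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"])
ts = pd.Series([10.0, np.nan, 50.0], index=idx)

# Default: the missing point is treated as halfway between its neighbours
linear = ts.interpolate()

# Time-aware: 1 of the 4 elapsed days has passed, so 1/4 of the way up
by_time = ts.interpolate(method="time")

print(linear.iloc[1])
print(by_time.iloc[1])
```

With equally spaced timestamps both methods give the same result, so the distinction only matters for irregular series.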


---

## Summary

| Method | Use |
|--------|-----|
| `isna()` | Detect NaN |
| `dropna()` | Drop rows/columns |
| `fillna(value)` | Fill with a fixed value |
| `fillna(mean)` | Fill with a statistic |
| `ffill()` / `bfill()` | Propagate the previous/next value |
| `interpolate()` | Interpolate values |

**Common strategies:**
- Numeric: mean, median, interpolation
- Categorical: mode, "Unknown", default value

---

**Previous:** [05.03 - Pivot and Reshape](../05_pandas_intermediate/05_03_pivot_reshape.ipynb)  
**Next:** [06.02 - Duplicates and Outliers](06_02_duplicates_outliers.ipynb)