# 06.02 - Duplicates and Outliers

**Author:** Miguel Angel Vazquez Varela  
**Level:** Intermediate  
**Estimated time:** 25 min

---

## What will we learn?

- Detect and remove duplicates
- Identify outliers (IQR, Z-score)
- Handle outliers (remove, cap, transform)

In [1]:
import pandas as pd
import numpy as np

---

## Part 1: Duplicates

In [2]:
trips = pd.DataFrame({
    "trip_id": [1, 2, 3, 2, 4, 5, 5],
    "station": ["Sol", "Atocha", "Sol", "Atocha", "Retiro", "Cibeles", "Cibeles"],
    "duration": [12, 25, 8, 25, 45, 15, 15],
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-01", 
             "2024-01-02", "2024-01-03", "2024-01-03"]
})

trips

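| | trip_id | station | duration | date |
|---|---|---|---|---|
| 0 | 1 | Sol | 12 | 2024-01-01 |
| 1 | 2 | Atocha | 25 | 2024-01-01 |
| 2 | 3 | Sol | 8 | 2024-01-02 |
| 3 | 2 | Atocha | 25 | 2024-01-01 |
| 4 | 4 | Retiro | 45 | 2024-01-02 |
| 5 | 5 | Cibeles | 15 | 2024-01-03 |
| 6 | 5 | Cibeles | 15 | 2024-01-03 |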


### Detect duplicates

In [3]:
# duplicated() marks repeated rows (every occurrence except the first)
trips.duplicated()

0    False
1    False
2    False
3     True
4    False
5    False
6     True
dtype: bool

In [4]:
# Show the duplicated rows
trips[trips.duplicated()]

| | trip_id | station | duration | date |
|---|---|---|---|---|
| 3 | 2 | Atocha | 25 | 2024-01-01 |
| 6 | 5 | Cibeles | 15 | 2024-01-03 |


In [5]:
# Show ALL rows involved (including the first occurrence)
trips[trips.duplicated(keep=False)]

| | trip_id | station | duration | date |
|---|---|---|---|---|
| 1 | 2 | Atocha | 25 | 2024-01-01 |
| 3 | 2 | Atocha | 25 | 2024-01-01 |
| 5 | 5 | Cibeles | 15 | 2024-01-03 |
| 6 | 5 | Cibeles | 15 | 2024-01-03 |


In [6]:
# Duplicates judged on selected columns only
trips[trips.duplicated(subset=["trip_id"], keep=False)]

| | trip_id | station | duration | date |
|---|---|---|---|---|
| 1 | 2 | Atocha | 25 | 2024-01-01 |
| 3 | 2 | Atocha | 25 | 2024-01-01 |
| 5 | 5 | Cibeles | 15 | 2024-01-03 |
| 6 | 5 | Cibeles | 15 | 2024-01-03 |
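
Deduplicating on a key such as `trip_id` is only safe when the rows sharing that key agree in every other column. The following is a minimal sketch of that check. It uses a hypothetical `trips_demo` frame (not the `trips` data above) in which one key is an exact repeat and another has conflicting stations.

```python
import pandas as pd

# Hypothetical sample: trip_id 2 is an exact repeat,
# trip_id 3 appears twice with different stations.
trips_demo = pd.DataFrame({
    "trip_id": [1, 2, 2, 3, 3],
    "station": ["Sol", "Atocha", "Atocha", "Retiro", "Cibeles"],
    "duration": [12, 25, 25, 45, 45],
})

# Rows whose trip_id appears more than once
same_id = trips_demo.duplicated(subset=["trip_id"], keep=False)
# Rows that have an identical copy across all columns
exact_copy = trips_demo.duplicated(keep=False)

# Rows that share a trip_id but differ in at least one other column
conflicts = trips_demo[same_id & ~exact_copy]
print(conflicts)
```

If `conflicts` is not empty, dropping by `subset=["trip_id"]` would silently discard one of the conflicting versions, so those rows probably deserve a manual look first.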


In [7]:
# Count duplicates
print(f"Duplicated rows: {trips.duplicated().sum()}")
print(f"Unique rows: {len(trips) - trips.duplicated().sum()}")

Duplicated rows: 2
Unique rows: 5


### Remove duplicates

In [8]:
# Remove exact duplicates (keeps the first occurrence)
trips.drop_duplicates()

| | trip_id | station | duration | date |
|---|---|---|---|---|
| 0 | 1 | Sol | 12 | 2024-01-01 |
| 1 | 2 | Atocha | 25 | 2024-01-01 |
| 2 | 3 | Sol | 8 | 2024-01-02 |
| 4 | 4 | Retiro | 45 | 2024-01-02 |
| 5 | 5 | Cibeles | 15 | 2024-01-03 |


In [9]:
# Keep the last occurrence instead of the first
trips.drop_duplicates(keep="last")

| | trip_id | station | duration | date |
|---|---|---|---|---|
| 0 | 1 | Sol | 12 | 2024-01-01 |
| 2 | 3 | Sol | 8 | 2024-01-02 |
| 3 | 2 | Atocha | 25 | 2024-01-01 |
| 4 | 4 | Retiro | 45 | 2024-01-02 |
| 6 | 5 | Cibeles | 15 | 2024-01-03 |


In [10]:
# Deduplicate on specific columns only
trips.drop_duplicates(subset=["trip_id"])

| | trip_id | station | duration | date |
|---|---|---|---|---|
| 0 | 1 | Sol | 12 | 2024-01-01 |
| 1 | 2 | Atocha | 25 | 2024-01-01 |
| 2 | 3 | Sol | 8 | 2024-01-02 |
| 4 | 4 | Retiro | 45 | 2024-01-02 |
| 5 | 5 | Cibeles | 15 | 2024-01-03 |


In [11]:
# Drop ALL duplicated rows (keep none of the copies)
trips.drop_duplicates(keep=False)

| | trip_id | station | duration | date |
|---|---|---|---|---|
| 0 | 1 | Sol | 12 | 2024-01-01 |
| 2 | 3 | Sol | 8 | 2024-01-02 |
| 4 | 4 | Retiro | 45 | 2024-01-02 |
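
Because `keep="first"` and `keep="last"` depend on row order, sorting beforehand decides which copy survives. The sketch below assumes a hypothetical `trips_demo` frame where the same `trip_id` was recorded twice with different values. It keeps the most recent record per trip and uses `ignore_index=True` to renumber the result.

```python
import pandas as pd

# Hypothetical sample: trip_id 2 was recorded twice with different values
trips_demo = pd.DataFrame({
    "trip_id": [1, 2, 2, 3],
    "duration": [12, 25, 30, 8],
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
})

latest = (
    trips_demo
    # ISO-formatted date strings sort chronologically; a stable sort keeps ties in order
    .sort_values("date", kind="mergesort")
    # After sorting, the last row per trip_id is the most recent one
    .drop_duplicates(subset=["trip_id"], keep="last", ignore_index=True)
)
print(latest)
```

Without `ignore_index=True` the surviving rows keep their original labels, which leaves gaps in the index like the ones visible in the outputs above.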


---

## Part 2: Outliers

In [12]:
# Data with outliers
np.random.seed(42)
durations = pd.DataFrame({
    "trip_id": range(1, 101),
    "duration": np.concatenate([
        np.random.normal(20, 5, 95),  # Normal: mean 20, std 5
        [80, 90, 100, 2, 1]           # Injected outliers
    ])
})

durations["duration"] = durations["duration"].round(1)
durations.describe()

| | trip_id | duration |
|---|---|---|
| count | 100.0 | 100.0 |
| mean | 50.5 | 21.271 |
| std | 29.011492 | 13.269714 |
| min | 1.0 | 1.0 |
| 25% | 25.75 | 16.95 |
| 50% | 50.5 | 19.35 |
| 75% | 75.25 | 22.8 |
| max | 100.0 | 100.0 |


### IQR method (interquartile range)

In [13]:
Q1 = durations["duration"].quantile(0.25)
Q3 = durations["duration"].quantile(0.75)
IQR = Q3 - Q1

print(f"Q1: {Q1:.1f}")
print(f"Q3: {Q3:.1f}")
print(f"IQR: {IQR:.1f}")

Q1: 16.9
Q3: 22.8
IQR: 5.9


In [14]:
# Bounds for outliers
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

print(f"Lower bound: {lower:.1f}")
print(f"Upper bound: {upper:.1f}")

Lower bound: 8.2
Upper bound: 31.6
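
As a check on the printed bounds, the same arithmetic can be written out with the unrounded quartiles reported by `describe()` (Q1 = 16.95, Q3 = 22.8):

$$
\begin{aligned}
\mathrm{IQR} &= Q_3 - Q_1 = 22.8 - 16.95 = 5.85 \\
\text{lower} &= Q_1 - 1.5\,\mathrm{IQR} = 16.95 - 8.775 = 8.175 \\
\text{upper} &= Q_3 + 1.5\,\mathrm{IQR} = 22.8 + 8.775 = 31.575
\end{aligned}
$$

Rounded to one decimal place, these are the 8.2 and 31.6 printed by the cell.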


In [15]:
# Identify outliers
outliers_mask = (durations["duration"] < lower) | (durations["duration"] > upper)
print(f"Outliers found: {outliers_mask.sum()}")

durations[outliers_mask]

Outliers found: 6


| | trip_id | duration |
|---|---|---|
| 74 | 75 | 6.9 |
| 95 | 96 | 80.0 |
| 96 | 97 | 90.0 |
| 97 | 98 | 100.0 |
| 98 | 99 | 2.0 |
| 99 | 100 | 1.0 |


### Z-score method

In [16]:
# Z-score: number of standard deviations away from the mean
mean = durations["duration"].mean()
std = durations["duration"].std()

durations["z_score"] = (durations["duration"] - mean) / std
durations[["duration", "z_score"]].head(10)

| | duration | z_score |
|---|---|---|
| 0 | 22.5 | 0.092617 |
| 1 | 19.3 | -0.148534 |
| 2 | 23.2 | 0.145369 |
| 3 | 27.6 | 0.476951 |
| 4 | 18.8 | -0.186214 |
| 5 | 18.8 | -0.186214 |
| 6 | 27.9 | 0.499559 |
| 7 | 23.8 | 0.190584 |
| 8 | 17.7 | -0.269109 |
| 9 | 22.7 | 0.107689 |


In [17]:
# Outliers: |z| > 3 (more than 3 standard deviations from the mean)
outliers_z = durations[durations["z_score"].abs() > 3]
print(f"Outliers (|z| > 3): {len(outliers_z)}")
outliers_z

Outliers (|z| > 3): 3


| | trip_id | duration | z_score |
|---|---|---|---|
| 95 | 96 | 80.0 | 4.425792 |
| 96 | 97 | 90.0 | 5.179388 |
| 97 | 98 | 100.0 | 5.932984 |
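
Note that the |z| > 3 rule flags only the three long trips and misses the very short ones (2.0 and 1.0) that the IQR method caught: the extreme values inflate the mean and standard deviation that the score is built on. One common alternative, not used elsewhere in this notebook, is the modified z-score based on the median and the median absolute deviation (MAD). The sketch below is illustrative only. `modified_z_score`, the sample data, and the 3.5 cutoff are assumptions, not part of the original material.

```python
import pandas as pd

def modified_z_score(series):
    """Modified z-score using the median and the MAD (hypothetical helper)."""
    median = series.median()
    mad = (series - median).abs().median()  # median absolute deviation
    # 0.6745 rescales the MAD so it is comparable to a standard deviation
    # for normally distributed data. Assumes mad > 0.
    return 0.6745 * (series - median) / mad

# Hypothetical small sample with one very high and one very low value
sample = pd.Series([18, 19, 20, 20, 21, 22, 23, 100, 1], dtype=float)

classic_z = (sample - sample.mean()) / sample.std()
robust_z = modified_z_score(sample)

classic_flags = classic_z.abs() > 3
robust_flags = robust_z.abs() > 3.5  # commonly used cutoff for the modified score

print(f"Classic z-score flags: {classic_flags.sum()}")
print(f"Modified z-score flags: {robust_flags.sum()}")
```

On this tiny sample the classic score flags nothing, while the median-based score flags both extremes. With only nine points a classic z-score can never exceed 3, which is one more reason to treat the threshold as a rule of thumb rather than a fixed law.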


In [18]:
# Remove the helper column
durations = durations.drop(columns=["z_score"])

### Handle outliers

In [19]:
# Option 1: Remove
durations_clean = durations[~outliers_mask].copy()
print(f"Original rows: {len(durations)}")
print(f"Rows without outliers: {len(durations_clean)}")

Original rows: 100
Rows without outliers: 94


In [20]:
# Option 2: Cap values at the bounds (capping/clipping)
durations_capped = durations.copy()
durations_capped["duration"] = durations_capped["duration"].clip(lower=lower, upper=upper)

print(f"Min original: {durations['duration'].min():.1f}")
print(f"Min capped: {durations_capped['duration'].min():.1f}")
print(f"Max original: {durations['duration'].max():.1f}")
print(f"Max capped: {durations_capped['duration'].max():.1f}")

Min original: 1.0
Min capped: 8.2
Max original: 100.0
Max capped: 31.6


In [21]:
# Option 3: Replace with NaN (then impute)
durations_nan = durations.copy()
durations_nan.loc[outliers_mask, "duration"] = np.nan

print(f"NaNs created: {durations_nan['duration'].isna().sum()}")

NaNs created: 6
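
Option 3 leaves the imputation step open. A minimal sketch of one possible follow-up is shown here, filling the gaps with the median of the remaining values. The `values` series is hypothetical, and the median is just one of several reasonable choices.

```python
import pandas as pd

# Hypothetical sample with a single extreme value
values = pd.Series([18.0, 20.0, 22.0, 19.0, 21.0, 100.0])

# Flag outliers with the IQR rule
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# mask() replaces flagged positions with NaN
with_nan = values.mask(is_outlier)

# median() skips NaN, so the fill value comes from the non-outlier rows only
imputed = with_nan.fillna(with_nan.median())
print(imputed)
```

Computing the median after masking matters: a statistic taken from the raw column would still be influenced by the values being replaced.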


In [22]:
# Option 4: Log transform (reduces the impact of extreme values)
durations_log = durations.copy()
durations_log["duration_log"] = np.log1p(durations_log["duration"])  # log(1+x)

print("Original statistics:")
print(durations_log["duration"].describe())
print("\nLog statistics:")
print(durations_log["duration_log"].describe())

Original statistics:
count    100.000000
mean      21.271000
std       13.269714
min        1.000000
25%       16.950000
50%       19.350000
75%       22.800000
max      100.000000
Name: duration, dtype: float64

Log statistics:
count    100.000000
mean       2.996957
std        0.465369
min        0.693147
25%        2.887578
50%        3.013078
75%        3.169659
max        4.615121
Name: duration_log, dtype: float64
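
A log-transformed column is no longer in the original units, so results usually need to be mapped back before reporting. The sketch below, on a hypothetical `duration` series, shows that `np.expm1` undoes `np.log1p`.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in minutes, including one extreme value
duration = pd.Series([1.0, 15.0, 20.0, 100.0])

duration_log = np.log1p(duration)   # log(1 + x), defined for x > -1
restored = np.expm1(duration_log)   # exp(y) - 1, the inverse of log1p

# Spread before and after the transform
print(duration.max() - duration.min())
print(duration_log.max() - duration_log.min())
```

The transform keeps the ordering of the values and only compresses the distance between them, so the extreme rows are still present and still the largest.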


---

## Reusable function

In [23]:
def detect_outliers_iqr(series, k=1.5):
    """
    Detect outliers using the IQR method.
    
    Parameters
    ----------
    series : pd.Series
        Numeric series
    k : float
        IQR multiplier (default 1.5)
    
    Returns
    -------
    pd.Series
        Boolean mask (True = outlier)
    """
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - k * IQR
    upper = Q3 + k * IQR
    return (series < lower) | (series > upper)

In [24]:
# Use the function
outliers = detect_outliers_iqr(durations["duration"])
print(f"Outliers: {outliers.sum()}")

Outliers: 6
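
Since the function takes a single Series, it can be applied column by column with `DataFrame.apply`. The sketch below repeats the function so the block runs on its own and uses a hypothetical two-column frame `df`.

```python
import pandas as pd

def detect_outliers_iqr(series, k=1.5):
    """Boolean mask (True = outlier) using the IQR method."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical frame: one extreme value in each column, on different rows
df = pd.DataFrame({
    "duration": [10, 11, 12, 13, 14, 100],
    "distance": [-50, 2, 3, 4, 5, 6],
})

# One boolean column per input column
masks = df.apply(detect_outliers_iqr)

# Outlier count per column, and rows flagged in at least one column
print(masks.sum())
flagged_rows = df[masks.any(axis=1)]
print(flagged_rows)
```

Whether a row should be treated as an outlier when only one of its columns is flagged depends on the analysis. `any(axis=1)` is the strictest choice and `all(axis=1)` the most lenient.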


---

## Summary

**Duplicates:**

| Method | Use |
|--------|-----|
| `duplicated()` | Detect duplicates |
| `drop_duplicates()` | Remove duplicates |

**Outliers:**

| Method | Criterion |
|--------|----------|
| IQR | Outside `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]` |
| Z-score | \|z\| > 3 |

**Treatment:**
- Remove rows
- Cap values (clip)
- Replace with NaN
- Transform (log)

---

**Previous:** [06.01 - Missing Values](06_01_missing_values.ipynb)  
**Next:** [06.03 - Data Types](06_03_data_types.ipynb)