# 04.03 - Selection and Filtering

**Author:** Miguel Angel Vazquez Varela  
**Level:** Fundamentals  
**Estimated time:** 30 min

---

## What will we learn?

- Select rows and columns with `loc` and `iloc`
- Filter data with boolean conditions
- Combine multiple conditions
- Useful methods: `query()`, `isin()`, `between()`

In [1]:
import pandas as pd
import numpy as np

---

## Create sample data

In [2]:
trips = pd.DataFrame({
    "trip_id": range(1, 11),
    "duration_min": [12, 25, 8, 45, 15, 30, 18, 22, 35, 10],
    "distance_km": [2.5, 5.0, 1.8, 8.2, 3.1, 6.0, 3.5, 4.2, 7.0, 2.0],
    "station_start": ["Sol", "Atocha", "Sol", "Retiro", "Cibeles", 
                      "Sol", "Atocha", "Retiro", "Cibeles", "Sol"],
    "user_type": ["subscriber", "casual", "subscriber", "subscriber", "casual",
                  "subscriber", "subscriber", "casual", "subscriber", "casual"]
})

trips

Unnamed: 0,trip_id,duration_min,distance_km,station_start,user_type
0,1,12,2.5,Sol,subscriber
1,2,25,5.0,Atocha,casual
2,3,8,1.8,Sol,subscriber
3,4,45,8.2,Retiro,subscriber
4,5,15,3.1,Cibeles,casual
5,6,30,6.0,Sol,subscriber
6,7,18,3.5,Atocha,subscriber
7,8,22,4.2,Retiro,casual
8,9,35,7.0,Cibeles,subscriber
9,10,10,2.0,Sol,casual


---

## 1. Selection with `loc` (by label)

`loc` selects by index and column **labels**.

### Select a single row

In [3]:
# Row with index label 0
trips.loc[0]

trip_id                   1
duration_min             12
distance_km             2.5
station_start           Sol
user_type        subscriber
Name: 0, dtype: object

### Select multiple rows

In [4]:
# Rows with labels 0, 2, 4
trips.loc[[0, 2, 4]]

Unnamed: 0,trip_id,duration_min,distance_km,station_start,user_type
0,1,12,2.5,Sol,subscriber
2,3,8,1.8,Sol,subscriber
4,5,15,3.1,Cibeles,casual


### Select rows and columns

In [5]:
# Rows 0 through 3 (inclusive), specific columns
trips.loc[0:3, ["trip_id", "duration_min", "station_start"]]

Unnamed: 0,trip_id,duration_min,station_start
0,1,12,Sol
1,2,25,Atocha
2,3,8,Sol
3,4,45,Retiro


**Note:** With `loc`, the slice `0:3` includes 3, unlike regular Python slicing.
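
The inclusive slice is easier to justify when the index holds labels that are not integers. A minimal sketch, assuming a hypothetical `stations` frame indexed by station name (the `docks` values are invented for illustration):

```python
import pandas as pd

# Hypothetical mini-frame indexed by station name instead of 0, 1, 2, ...
stations = pd.DataFrame(
    {"docks": [30, 24, 18, 20]},  # invented values, only for illustration
    index=["Sol", "Atocha", "Retiro", "Cibeles"],
)

# Label slice: both endpoints ("Atocha" and "Retiro") are included
subset = stations.loc["Atocha":"Retiro"]
print(subset)
```

With labels there is no natural "one past the end" value, which is one reason a label slice keeps its last element.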

### Select all rows, some columns

In [6]:
trips.loc[:, ["duration_min", "distance_km"]]

Unnamed: 0,duration_min,distance_km
0,12,2.5
1,25,5.0
2,8,1.8
3,45,8.2
4,15,3.1
5,30,6.0
6,18,3.5
7,22,4.2
8,35,7.0
9,10,2.0


---

## 2. Selection with `iloc` (by position)

`iloc` selects by **integer position** (like NumPy arrays).

In [7]:
# First row
trips.iloc[0]

trip_id                   1
duration_min             12
distance_km             2.5
station_start           Sol
user_type        subscriber
Name: 0, dtype: object

In [8]:
# First 3 rows, first 2 columns
trips.iloc[:3, :2]

Unnamed: 0,trip_id,duration_min
0,1,12
1,2,25
2,3,8


In [9]:
# Last row
trips.iloc[-1]

trip_id              10
duration_min         10
distance_km         2.0
station_start       Sol
user_type        casual
Name: 9, dtype: object

In [10]:
# Rows at positions 0, 2, 4; columns at positions 1 and 3
trips.iloc[[0, 2, 4], [1, 3]]

Unnamed: 0,duration_min,station_start
0,12,Sol
2,8,Sol
4,15,Cibeles


**Note:** With `iloc`, the slice `0:3` does NOT include 3 (like regular Python slicing).
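
The two accessors only diverge once index labels stop matching row positions. A small sketch, using the durations of the sample trips labelled 1, 3, 5, 7 and 8 with their original labels kept (the names `long_trips`, `by_position` and `by_label` are illustrative only):

```python
import pandas as pd

# Five rows from the sample data, keeping their original index labels
long_trips = pd.DataFrame(
    {"duration_min": [25, 45, 30, 22, 35]},
    index=[1, 3, 5, 7, 8],
)

by_position = long_trips.iloc[0:3]  # first three rows, whatever their labels
by_label = long_trips.loc[0:3]      # rows whose label lies between 0 and 3, inclusive

print(list(by_position.index))
print(list(by_label.index))
```

The same slice `0:3` returns three rows with `iloc` and two rows with `loc`, because one counts positions and the other compares labels.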

---

## 3. Boolean filtering

The most common way to filter data in pandas.

### Step 1: Create a boolean mask

In [11]:
# Long trips (> 20 min)
mask = trips["duration_min"] > 20
mask

0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8     True
9    False
Name: duration_min, dtype: bool

### Step 2: Apply the mask

In [12]:
trips[mask]

Unnamed: 0,trip_id,duration_min,distance_km,station_start,user_type
1,2,25,5.0,Atocha,casual
3,4,45,8.2,Retiro,subscriber
5,6,30,6.0,Sol,subscriber
7,8,22,4.2,Retiro,casual
8,9,35,7.0,Cibeles,subscriber


### All in one line

In [13]:
trips[trips["duration_min"] > 20]

Unnamed: 0,trip_id,duration_min,distance_km,station_start,user_type
1,2,25,5.0,Atocha,casual
3,4,45,8.2,Retiro,subscriber
5,6,30,6.0,Sol,subscriber
7,8,22,4.2,Retiro,casual
8,9,35,7.0,Cibeles,subscriber
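
A boolean mask is also useful on its own, because `True` counts as 1 and `False` as 0. A short sketch with the same durations as the sample data (the variable names `n_long` and `share_long` are illustrative):

```python
import pandas as pd

durations = pd.Series(
    [12, 25, 8, 45, 15, 30, 18, 22, 35, 10], name="duration_min"
)
mask = durations > 20

n_long = int(mask.sum())         # number of rows that satisfy the condition
share_long = float(mask.mean())  # fraction of rows that satisfy the condition

print(n_long, share_long)
```

This gives a quick sanity check on how many rows a filter will keep before applying it.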


### Other operators

In [14]:
# Equal to
trips[trips["station_start"] == "Sol"]

Unnamed: 0,trip_id,duration_min,distance_km,station_start,user_type
0,1,12,2.5,Sol,subscriber
2,3,8,1.8,Sol,subscriber
5,6,30,6.0,Sol,subscriber
9,10,10,2.0,Sol,casual


In [15]:
# Not equal to
trips[trips["user_type"] != "casual"]

Unnamed: 0,trip_id,duration_min,distance_km,station_start,user_type
0,1,12,2.5,Sol,subscriber
2,3,8,1.8,Sol,subscriber
3,4,45,8.2,Retiro,subscriber
5,6,30,6.0,Sol,subscriber
6,7,18,3.5,Atocha,subscriber
8,9,35,7.0,Cibeles,subscriber


---

## 4. Combining conditions

Use `&` (AND), `|` (OR), `~` (NOT).  
**Important:** Each condition must be wrapped in parentheses.
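
The parentheses matter because `&` binds more tightly than comparison operators in Python. The sketch below, on a small Series taken from the first five sample durations, shows one way the unparenthesized form can fail; the exact error text may vary between pandas versions:

```python
import pandas as pd

durations = pd.Series([12, 25, 8, 45, 15], name="duration_min")

# With parentheses: two boolean Series combined element by element
ok = (durations > 10) & (durations < 30)

# Without parentheses Python reads the expression as the chained comparison
# durations > (10 & durations) < 30, which needs a single True/False value
# for a whole Series and therefore raises ValueError.
try:
    durations > 10 & durations < 30
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(ok.tolist())
print(error_message)
```

For the same reason, the Python keywords `and`, `or` and `not` cannot replace `&`, `|` and `~` here: they also ask for a single truth value for the whole Series.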

In [16]:
# Long trips AND starting from Sol
trips[(trips["duration_min"] > 20) & (trips["station_start"] == "Sol")]

Unnamed: 0,trip_id,duration_min,distance_km,station_start,user_type
5,6,30,6.0,Sol,subscriber


In [17]:
# Short trips OR casual users
trips[(trips["duration_min"] < 15) | (trips["user_type"] == "casual")]

Unnamed: 0,trip_id,duration_min,distance_km,station_start,user_type
0,1,12,2.5,Sol,subscriber
1,2,25,5.0,Atocha,casual
2,3,8,1.8,Sol,subscriber
4,5,15,3.1,Cibeles,casual
7,8,22,4.2,Retiro,casual
9,10,10,2.0,Sol,casual


In [18]:
# Trips that do NOT start from Sol
trips[~(trips["station_start"] == "Sol")]

Unnamed: 0,trip_id,duration_min,distance_km,station_start,user_type
1,2,25,5.0,Atocha,casual
3,4,45,8.2,Retiro,subscriber
4,5,15,3.1,Cibeles,casual
6,7,18,3.5,Atocha,subscriber
7,8,22,4.2,Retiro,casual
8,9,35,7.0,Cibeles,subscriber


---

## 5. Useful filtering methods

### `isin()` - values in a list

In [19]:
# Trips from Sol or Atocha
trips[trips["station_start"].isin(["Sol", "Atocha"])]

Unnamed: 0,trip_id,duration_min,distance_km,station_start,user_type
0,1,12,2.5,Sol,subscriber
1,2,25,5.0,Atocha,casual
2,3,8,1.8,Sol,subscriber
5,6,30,6.0,Sol,subscriber
6,7,18,3.5,Atocha,subscriber
9,10,10,2.0,Sol,casual
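
`isin()` combines naturally with `~` to express "not in this list". A minimal sketch on the `station_start` values of the sample data (the name `other_stations` is illustrative):

```python
import pandas as pd

stations = pd.Series(
    ["Sol", "Atocha", "Sol", "Retiro", "Cibeles",
     "Sol", "Atocha", "Retiro", "Cibeles", "Sol"],
    name="station_start",
)

# Keep only the rows whose station is NOT Sol or Atocha
other_stations = stations[~stations.isin(["Sol", "Atocha"])]
print(other_stations)
```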


### `between()` - values in a range

In [20]:
# Trips between 15 and 30 minutes (both ends inclusive)
trips[trips["duration_min"].between(15, 30)]

Unnamed: 0,trip_id,duration_min,distance_km,station_start,user_type
1,2,25,5.0,Atocha,casual
4,5,15,3.1,Cibeles,casual
5,6,30,6.0,Sol,subscriber
6,7,18,3.5,Atocha,subscriber
7,8,22,4.2,Retiro,casual
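
By default `between()` includes both endpoints. The sketch below shows how the `inclusive` argument can change that; it assumes pandas 1.3 or later, where `inclusive` accepts the strings `"both"`, `"neither"`, `"left"` and `"right"`:

```python
import pandas as pd

durations = pd.Series(
    [12, 25, 8, 45, 15, 30, 18, 22, 35, 10], name="duration_min"
)

# 15 <= duration < 30: keeps 15, drops 30
left_closed = durations[durations.between(15, 30, inclusive="left")]

# 15 < duration < 30: drops both endpoints
open_range = durations[durations.between(15, 30, inclusive="neither")]

print(list(left_closed.index))
print(list(open_range.index))
```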


### `str.contains()` - search text

In [21]:
# Stations whose name contains "o"
trips[trips["station_start"].str.contains("o")]

Unnamed: 0,trip_id,duration_min,distance_km,station_start,user_type
0,1,12,2.5,Sol,subscriber
1,2,25,5.0,Atocha,casual
2,3,8,1.8,Sol,subscriber
3,4,45,8.2,Retiro,subscriber
5,6,30,6.0,Sol,subscriber
6,7,18,3.5,Atocha,subscriber
7,8,22,4.2,Retiro,casual
9,10,10,2.0,Sol,casual
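
Two options of `str.contains()` tend to matter on real data: `case=False` for case-insensitive matching, and `na=False` so that missing values count as "no match" instead of producing a mask that cannot be used for filtering. A small sketch, assuming a Series with one missing station name added for illustration:

```python
import pandas as pd

# Sample station names plus one missing value (added for illustration)
stations = pd.Series(
    ["Sol", "Atocha", None, "Retiro", "Cibeles"], name="station_start"
)

# Case-insensitive search; missing values are treated as False
has_sol = stations.str.contains("sol", case=False, na=False)

print(stations[has_sol])
```

The pattern is interpreted as a regular expression by default; passing `regex=False` searches for the literal text instead.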


### `isna()` / `notna()` - missing values

In [22]:
# Create data with missing values
trips_with_nulls = trips.copy()
trips_with_nulls.loc[2, "distance_km"] = np.nan
trips_with_nulls.loc[5, "distance_km"] = np.nan

# Rows with a missing value in distance_km
trips_with_nulls[trips_with_nulls["distance_km"].isna()]

Unnamed: 0,trip_id,duration_min,distance_km,station_start,user_type
2,3,8,,Sol,subscriber
5,6,30,,Sol,subscriber


In [23]:
# Rows WITHOUT a missing value in distance_km
trips_with_nulls[trips_with_nulls["distance_km"].notna()]

Unnamed: 0,trip_id,duration_min,distance_km,station_start,user_type
0,1,12,2.5,Sol,subscriber
1,2,25,5.0,Atocha,casual
3,4,45,8.2,Retiro,subscriber
4,5,15,3.1,Cibeles,casual
6,7,18,3.5,Atocha,subscriber
7,8,22,4.2,Retiro,casual
8,9,35,7.0,Cibeles,subscriber
9,10,10,2.0,Sol,casual
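
Filtering with `notna()` on one column gives the same rows as `dropna()` restricted to that column. A short sketch on the first six sample trips, with the same two missing distances (the names `with_filter` and `with_dropna` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "trip_id": [1, 2, 3, 4, 5, 6],
    "distance_km": [2.5, 5.0, np.nan, 8.2, 3.1, np.nan],
})

with_filter = df[df["distance_km"].notna()]
with_dropna = df.dropna(subset=["distance_km"])  # only checks distance_km

print(with_filter.equals(with_dropna))
```

The boolean form is convenient when the null check has to be combined with other conditions using `&` or `|`.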


---

## 6. `query()` - more readable syntax

In [24]:
# Traditional syntax
trips[(trips["duration_min"] > 20) & (trips["distance_km"] > 5)]

Unnamed: 0,trip_id,duration_min,distance_km,station_start,user_type
3,4,45,8.2,Retiro,subscriber
5,6,30,6.0,Sol,subscriber
8,9,35,7.0,Cibeles,subscriber


In [25]:
# With query() - more readable
trips.query("duration_min > 20 and distance_km > 5")

Unnamed: 0,trip_id,duration_min,distance_km,station_start,user_type
3,4,45,8.2,Retiro,subscriber
5,6,30,6.0,Sol,subscriber
8,9,35,7.0,Cibeles,subscriber


In [26]:
# External variables with @
min_duration = 15
trips.query("duration_min >= @min_duration")

Unnamed: 0,trip_id,duration_min,distance_km,station_start,user_type
1,2,25,5.0,Atocha,casual
3,4,45,8.2,Retiro,subscriber
4,5,15,3.1,Cibeles,casual
5,6,30,6.0,Sol,subscriber
6,7,18,3.5,Atocha,subscriber
7,8,22,4.2,Retiro,casual
8,9,35,7.0,Cibeles,subscriber


In [27]:
# String literals go in quotes
trips.query('station_start == "Sol"')

Unnamed: 0,trip_id,duration_min,distance_km,station_start,user_type
0,1,12,2.5,Sol,subscriber
2,3,8,1.8,Sol,subscriber
5,6,30,6.0,Sol,subscriber
9,10,10,2.0,Sol,casual
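
`query()` also understands `in` and `not in`, which can stand in for `isin()` inside the expression string. A sketch on three columns of the sample data, assuming a hypothetical list named `central`:

```python
import pandas as pd

trips = pd.DataFrame({
    "trip_id": range(1, 11),
    "duration_min": [12, 25, 8, 45, 15, 30, 18, 22, 35, 10],
    "station_start": ["Sol", "Atocha", "Sol", "Retiro", "Cibeles",
                      "Sol", "Atocha", "Retiro", "Cibeles", "Sol"],
})

central = ["Sol", "Atocha"]  # hypothetical grouping, for illustration only

# Same rows as trips[trips["station_start"].isin(central)]
from_central = trips.query("station_start in @central")

# Membership tests can be mixed with other conditions
others_long = trips.query("station_start not in @central and duration_min > 20")

print(list(from_central.index))
print(list(others_long.index))
```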


---

## 7. Filter and select columns

In [28]:
# Filter rows + select columns with loc
trips.loc[
    trips["duration_min"] > 20,
    ["trip_id", "duration_min", "station_start"]
]

Unnamed: 0,trip_id,duration_min,station_start
1,2,25,Atocha
3,4,45,Retiro
5,6,30,Sol
7,8,22,Retiro
8,9,35,Cibeles


In [29]:
# Equivalent with chained indexing (fine for reading; prefer loc when assigning)
trips[trips["duration_min"] > 20][["trip_id", "duration_min", "station_start"]]

Unnamed: 0,trip_id,duration_min,station_start
1,2,25,Atocha
3,4,45,Retiro
5,6,30,Sol
7,8,22,Retiro
8,9,35,Cibeles


---

## 8. Modifying filtered values

In [30]:
trips_copy = trips.copy()

# Change user_type to "premium" for trips > 30 min
trips_copy.loc[trips_copy["duration_min"] > 30, "user_type"] = "premium"

trips_copy[trips_copy["duration_min"] > 30]

Unnamed: 0,trip_id,duration_min,distance_km,station_start,user_type
3,4,45,8.2,Retiro,premium
8,9,35,7.0,Cibeles,premium
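
The same pattern works with combined conditions: build the mask first, then pass it to `loc` together with the column to change. A minimal sketch on the first five sample rows (the names `updated` and `mask` are illustrative). It uses a single `loc` call because the chained form `updated[mask]["user_type"] = ...` may assign to a temporary copy and leave `updated` unchanged:

```python
import pandas as pd

trips = pd.DataFrame({
    "duration_min": [12, 25, 8, 45, 15],
    "user_type": ["subscriber", "casual", "subscriber", "subscriber", "casual"],
})

updated = trips.copy()  # work on a copy so the original stays intact

# Subscribers with trips longer than 20 minutes
mask = (updated["duration_min"] > 20) & (updated["user_type"] == "subscriber")
updated.loc[mask, "user_type"] = "premium"

print(updated)
```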


---

## Summary

| Method | Use | Example |
|--------|-----|---------|
| `loc` | By label | `df.loc[0:3, ["col1"]]` |
| `iloc` | By position | `df.iloc[:3, :2]` |
| `[]` | Boolean filter | `df[df["col"] > 5]` |
| `query()` | Readable filter | `df.query("col > 5")` |

**Operators for combining conditions:**
- `&` = AND
- `|` = OR
- `~` = NOT

**Useful methods:**
- `.isin([list])` - value in a list
- `.between(a, b)` - value in a range
- `.str.contains("text")` - search text
- `.isna()` / `.notna()` - missing values

---

**Previous:** [04.02 - Reading Data](04_02_reading_data.ipynb)  
**Next:** [05.01 - Groupby and Aggregations](../05_pandas_intermediate/05_01_groupby.ipynb)