# Reading 26: DataFrame - Manipulation and Selection II

## `cast`

This function casts the columns of a DataFrame to a specific data type. It takes a dictionary that maps the name of each column to be cast to its target data type.

In [1]:
import polars as pl
from datetime import date

df = pl.DataFrame(
    {
        "num": [1, 2, 3],
        "dec": [6.0, 7.0, 8.0],
        "date": [date(2024, 1, 2), date(2024, 3, 4), date(2023, 5, 6)],
        "date1": [date(2024, 5, 23), date(2024, 3, 14), date(2023, 5, 26)]
    }
)

df

num,dec,date,date1
i64,f64,date,date
1,6.0,2024-01-02,2024-05-23
2,7.0,2024-03-04,2024-03-14
3,8.0,2023-05-06,2023-05-26


In [2]:
df.cast({'num': pl.Float32, 'dec': pl.UInt8})

num,dec,date,date1
f32,u8,date,date
1.0,6,2024-01-02,2024-05-23
2.0,7,2024-03-04,2024-03-14
3.0,8,2023-05-06,2023-05-26


We can cast every column of a given data type to another data type by using selectors.

In [3]:
import polars.selectors as cs

df.cast({cs.date(): pl.Datetime})

num,dec,date,date1
i64,f64,datetime[μs],datetime[μs]
1,6.0,2024-01-02 00:00:00,2024-05-23 00:00:00
2,7.0,2024-03-04 00:00:00,2024-03-14 00:00:00
3,8.0,2023-05-06 00:00:00,2023-05-26 00:00:00


## `clone`

This function creates a copy of a DataFrame. It is a cheap operation because it does not copy the underlying data.

In [4]:
vuelos = pl.read_parquet('./data/vuelos/', use_pyarrow=True)

vuelos_copy = vuelos.clone()

vuelos_copy

YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
i32,i32,i32,i32,str,i32,str,str,str,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,str,i32,i32,i32,i32,i32
2015,1,1,4,"""AS""",98,"""N407AS""","""ANC""","""SEA""",5,2354,-11,21,15,205,194,169,1448,404,4,430,408,-22,0,0,,,,,,
2015,1,1,4,"""AA""",2336,"""N3KUAA""","""LAX""","""PBI""",10,2,-8,12,14,280,279,263,2330,737,4,750,741,-9,0,0,,,,,,
2015,1,1,4,"""US""",840,"""N171US""","""SFO""","""CLT""",20,18,-2,16,34,286,293,266,2296,800,11,806,811,5,0,0,,,,,,
2015,1,1,4,"""AA""",258,"""N3HYAA""","""LAX""","""MIA""",20,15,-5,15,30,285,281,258,2342,748,8,805,756,-9,0,0,,,,,,
2015,1,1,4,"""AS""",135,"""N527AS""","""SEA""","""ANC""",25,24,-1,11,35,235,215,199,1448,254,5,320,259,-21,0,0,,,,,,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2015,12,31,4,"""B6""",688,"""N657JB""","""LAX""","""BOS""",2359,2355,-4,22,17,320,298,272,2611,749,4,819,753,-26,0,0,,,,,,
2015,12,31,4,"""B6""",745,"""N828JB""","""JFK""","""PSE""",2359,2355,-4,17,12,227,215,195,1617,427,3,446,430,-16,0,0,,,,,,
2015,12,31,4,"""B6""",1503,"""N913JB""","""JFK""","""SJU""",2359,2350,-9,17,7,221,222,197,1598,424,8,440,432,-8,0,0,,,,,,
2015,12,31,4,"""B6""",333,"""N527JB""","""MCO""","""SJU""",2359,2353,-6,10,3,161,157,144,1189,327,3,340,330,-10,0,0,,,,,,


## `explode`

This function explodes the DataFrame into a longer format: each element of the lists in the given column(s) becomes its own row, and the values of the other columns are repeated. To see how `explode` works, let's create a new DataFrame.

In [5]:
df_compacto = pl.DataFrame(
    {
        'letras': ['x', 'x', 'z', 'y'],
        'num': [[1], [2,3], [4,5], [6,7,8]]
    }
)

df_compacto

letras,num
str,list[i64]
"""x""",[1]
"""x""","[2, 3]"
"""z""","[4, 5]"
"""y""","[6, 7, 8]"


In [6]:
df_explode = df_compacto.explode('num')

df_explode

letras,num
str,i64
"""x""",1
"""x""",2
"""x""",3
"""z""",4
"""z""",5
"""y""",6
"""y""",7
"""y""",8


## `hstack`

This function returns a new DataFrame that grows an existing DataFrame horizontally by appending multiple Series to it. Let's recall what the DataFrame `df_compacto` looks like and use `hstack` to grow it horizontally.

In [7]:
df_compacto

letras,num
str,list[i64]
"""x""",[1]
"""x""","[2, 3]"
"""z""","[4, 5]"
"""y""","[6, 7, 8]"


In [8]:
colores = pl.Series('colores', ['rojo', 'verde', 'azul', 'verde'])

decimal = pl.Series('decimal', [1.2, 3.5, 5.3, 9.0])

In [9]:
df_extendido = df_compacto.hstack([colores, decimal])

df_extendido

letras,num,colores,decimal
str,list[i64],str,f64
"""x""",[1],"""rojo""",1.2
"""x""","[2, 3]","""verde""",3.5
"""z""","[4, 5]","""azul""",5.3
"""y""","[6, 7, 8]","""verde""",9.0


## `vstack` y `extend`

### `vstack`

This function grows the DataFrame vertically by stacking another DataFrame onto it. To see how it works, we will use the `vuelos` DataFrame and the `vuelos_copy` DataFrame that we created earlier.

In [10]:
vuelos.vstack(vuelos_copy)

YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
i32,i32,i32,i32,str,i32,str,str,str,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,str,i32,i32,i32,i32,i32
2015,1,1,4,"""AS""",98,"""N407AS""","""ANC""","""SEA""",5,2354,-11,21,15,205,194,169,1448,404,4,430,408,-22,0,0,,,,,,
2015,1,1,4,"""AA""",2336,"""N3KUAA""","""LAX""","""PBI""",10,2,-8,12,14,280,279,263,2330,737,4,750,741,-9,0,0,,,,,,
2015,1,1,4,"""US""",840,"""N171US""","""SFO""","""CLT""",20,18,-2,16,34,286,293,266,2296,800,11,806,811,5,0,0,,,,,,
2015,1,1,4,"""AA""",258,"""N3HYAA""","""LAX""","""MIA""",20,15,-5,15,30,285,281,258,2342,748,8,805,756,-9,0,0,,,,,,
2015,1,1,4,"""AS""",135,"""N527AS""","""SEA""","""ANC""",25,24,-1,11,35,235,215,199,1448,254,5,320,259,-21,0,0,,,,,,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2015,12,31,4,"""B6""",688,"""N657JB""","""LAX""","""BOS""",2359,2355,-4,22,17,320,298,272,2611,749,4,819,753,-26,0,0,,,,,,
2015,12,31,4,"""B6""",745,"""N828JB""","""JFK""","""PSE""",2359,2355,-4,17,12,227,215,195,1617,427,3,446,430,-16,0,0,,,,,,
2015,12,31,4,"""B6""",1503,"""N913JB""","""JFK""","""SJU""",2359,2350,-9,17,7,221,222,197,1598,424,8,440,432,-8,0,0,,,,,,
2015,12,31,4,"""B6""",333,"""N527JB""","""MCO""","""SJU""",2359,2353,-6,10,3,161,157,144,1189,327,3,340,330,-10,0,0,,,,,,


This function returns a new DataFrame unless the parameter `in_place=True` is specified.

In [11]:
# Check that the vuelos DataFrame has not been modified

vuelos.shape

(5819079, 31)

In [12]:
vuelos.vstack(vuelos_copy, in_place=True)

YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
i32,i32,i32,i32,str,i32,str,str,str,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,str,i32,i32,i32,i32,i32
2015,1,1,4,"""AS""",98,"""N407AS""","""ANC""","""SEA""",5,2354,-11,21,15,205,194,169,1448,404,4,430,408,-22,0,0,,,,,,
2015,1,1,4,"""AA""",2336,"""N3KUAA""","""LAX""","""PBI""",10,2,-8,12,14,280,279,263,2330,737,4,750,741,-9,0,0,,,,,,
2015,1,1,4,"""US""",840,"""N171US""","""SFO""","""CLT""",20,18,-2,16,34,286,293,266,2296,800,11,806,811,5,0,0,,,,,,
2015,1,1,4,"""AA""",258,"""N3HYAA""","""LAX""","""MIA""",20,15,-5,15,30,285,281,258,2342,748,8,805,756,-9,0,0,,,,,,
2015,1,1,4,"""AS""",135,"""N527AS""","""SEA""","""ANC""",25,24,-1,11,35,235,215,199,1448,254,5,320,259,-21,0,0,,,,,,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2015,12,31,4,"""B6""",688,"""N657JB""","""LAX""","""BOS""",2359,2355,-4,22,17,320,298,272,2611,749,4,819,753,-26,0,0,,,,,,
2015,12,31,4,"""B6""",745,"""N828JB""","""JFK""","""PSE""",2359,2355,-4,17,12,227,215,195,1617,427,3,446,430,-16,0,0,,,,,,
2015,12,31,4,"""B6""",1503,"""N913JB""","""JFK""","""SJU""",2359,2350,-9,17,7,221,222,197,1598,424,8,440,432,-8,0,0,,,,,,
2015,12,31,4,"""B6""",333,"""N527JB""","""MCO""","""SJU""",2359,2353,-6,10,3,161,157,144,1189,327,3,340,330,-10,0,0,,,,,,


In [13]:
# Check the vuelos DataFrame again to see that it has now been modified

vuelos.shape

(11638158, 31)

### `extend`

This function extends the memory backed by the DataFrame it is called on with the values of the DataFrame passed as a parameter.

Unlike `vstack`, which appends the chunks of the DataFrame passed as a parameter to the chunks of this DataFrame, `extend` appends the data of the DataFrame passed as a parameter to the underlying memory locations and can therefore cause a reallocation.


Prefer `extend` over `vstack` when you want to run a query after a single append, for example during online operations in which you add n rows and rerun a query.

Prefer `vstack` over `extend` when you want to append many times before running a query, for example when you read several files and want to store them in a single DataFrame. In that case, finish the sequence of `vstack` operations with a `rechunk`.

This method modifies the DataFrame in place. The DataFrame is returned only for convenience.

To show how `extend` works, we will read some partitions of the flights DataFrame, which are located in the `vuelos_particionado` folder, and join them with `extend`. Because `extend` works in place, `vuelos_AA` itself also ends up holding the rows of all three partitions.

In [14]:
vuelos_AA = pl.read_parquet('./data/vuelos_particionado/AIRLINE=AA/', use_pyarrow=True)

vuelos_AS = pl.read_parquet('./data/vuelos_particionado/AIRLINE=AS/', use_pyarrow=True)

vuelos_B6 = pl.read_parquet('./data/vuelos_particionado/AIRLINE=B6/', use_pyarrow=True)

In [15]:
vuelos = vuelos_AA.extend(vuelos_AS).extend(vuelos_B6)

In [16]:
vuelos

YEAR,MONTH,DAY,DAY_OF_WEEK,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
i32,i32,i32,i32,i32,str,str,str,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,str,i32,i32,i32,i32,i32
2015,1,1,4,2336,"""N3KUAA""","""LAX""","""PBI""",10,2,-8,12,14,280,279,263,2330,737,4,750,741,-9,0,0,,,,,,
2015,10,25,7,159,"""N3FLAA""","""11298""","""14771""",815,810,-5,14,824,222,216,199,1464,943,3,957,946,-11,0,0,,,,,,
2015,1,1,4,258,"""N3HYAA""","""LAX""","""MIA""",20,15,-5,15,30,285,281,258,2342,748,8,805,756,-9,0,0,,,,,,
2015,10,25,7,937,"""N3GPAA""","""12478""","""14843""",815,808,-7,18,826,243,214,192,1598,1138,4,1218,1142,-36,0,0,,,,,,
2015,1,1,4,1112,"""N3LAAA""","""SFO""","""DFW""",30,19,-11,17,36,195,193,173,1464,529,3,545,532,-13,0,0,,,,,,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2015,10,25,7,389,"""N236JB""","""10721""","""11278""",800,757,-3,14,811,102,91,74,399,925,3,942,928,-14,0,0,,,,,,
2015,10,25,7,683,"""N952JB""","""12478""","""13204""",803,759,-4,33,832,174,176,132,944,1044,11,1057,1055,-2,0,0,,,,,,
2015,10,25,7,453,"""N579JB""","""12478""","""14027""",805,802,-3,32,834,174,170,134,1028,1048,4,1059,1052,-7,0,0,,,,,,
2015,10,25,7,151,"""N267JB""","""10721""","""13204""",805,802,-3,15,817,198,179,155,1121,1052,9,1123,1101,-22,0,0,,,,,,


## `partition_by`

This function groups by the given columns and returns the groups as separate DataFrames in a list.

Let's take the `vuelos` DataFrame we just created and partition it by the `MONTH` column.

In [18]:
vuelos_por_mes = vuelos.partition_by('MONTH')

vuelos_por_mes

[shape: (78_939, 30)
 ┌──────┬───────┬─────┬─────────────┬───┬──────────────┬──────────────┬──────────────┬──────────────┐
 │ YEAR ┆ MONTH ┆ DAY ┆ DAY_OF_WEEK ┆ … ┆ SECURITY_DEL ┆ AIRLINE_DELA ┆ LATE_AIRCRAF ┆ WEATHER_DELA │
 │ ---  ┆ ---   ┆ --- ┆ ---         ┆   ┆ AY           ┆ Y            ┆ T_DELAY      ┆ Y            │
 │ i32  ┆ i32   ┆ i32 ┆ i32         ┆   ┆ ---          ┆ ---          ┆ ---          ┆ ---          │
 │      ┆       ┆     ┆             ┆   ┆ i32          ┆ i32          ┆ i32          ┆ i32          │
 ╞══════╪═══════╪═════╪═════════════╪═══╪══════════════╪══════════════╪══════════════╪══════════════╡
 │ 2015 ┆ 1     ┆ 1   ┆ 4           ┆ … ┆ null         ┆ null         ┆ null         ┆ null         │
 │ 2015 ┆ 1     ┆ 1   ┆ 4           ┆ … ┆ null         ┆ null         ┆ null         ┆ null         │
 │ 2015 ┆ 1     ┆ 1   ┆ 4           ┆ … ┆ null         ┆ null         ┆ null         ┆ null         │
 │ 2015 ┆ 1     ┆ 1   ┆ 4           ┆ … ┆ null         ┆ null

In [19]:
vuelos_por_mes[1]

YEAR,MONTH,DAY,DAY_OF_WEEK,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
i32,i32,i32,i32,i32,str,str,str,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,str,i32,i32,i32,i32,i32
2015,10,25,7,159,"""N3FLAA""","""11298""","""14771""",815,810,-5,14,824,222,216,199,1464,943,3,957,946,-11,0,0,,,,,,
2015,10,25,7,937,"""N3GPAA""","""12478""","""14843""",815,808,-7,18,826,243,214,192,1598,1138,4,1218,1142,-36,0,0,,,,,,
2015,10,25,7,1213,"""N3FNAA""","""11298""","""14683""",815,814,-1,12,826,65,62,45,247,911,5,920,916,-4,0,0,,,,,,
2015,10,25,7,1410,"""N3KGAA""","""12953""","""13303""",815,810,-5,16,826,191,171,143,1096,1049,12,1126,1101,-25,0,0,,,,,,
2015,10,25,7,2269,"""N471AA""","""15376""","""11298""",815,811,-4,13,824,137,128,105,813,1209,10,1232,1219,-13,0,0,,,,,,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2015,10,25,7,389,"""N236JB""","""10721""","""11278""",800,757,-3,14,811,102,91,74,399,925,3,942,928,-14,0,0,,,,,,
2015,10,25,7,683,"""N952JB""","""12478""","""13204""",803,759,-4,33,832,174,176,132,944,1044,11,1057,1055,-2,0,0,,,,,,
2015,10,25,7,453,"""N579JB""","""12478""","""14027""",805,802,-3,32,834,174,170,134,1028,1048,4,1059,1052,-7,0,0,,,,,,
2015,10,25,7,151,"""N267JB""","""10721""","""13204""",805,802,-3,15,817,198,179,155,1121,1052,9,1123,1101,-22,0,0,,,,,,


If we want the DataFrames returned in a dictionary instead, we can use the parameter `as_dict=True`.

In [20]:
vuelos_por_mes_dict = vuelos.partition_by('MONTH', as_dict=True)

vuelos_por_mes_dict


{1: shape: (78_939, 30)
 ┌──────┬───────┬─────┬─────────────┬───┬──────────────┬──────────────┬──────────────┬──────────────┐
 │ YEAR ┆ MONTH ┆ DAY ┆ DAY_OF_WEEK ┆ … ┆ SECURITY_DEL ┆ AIRLINE_DELA ┆ LATE_AIRCRAF ┆ WEATHER_DELA │
 │ ---  ┆ ---   ┆ --- ┆ ---         ┆   ┆ AY           ┆ Y            ┆ T_DELAY      ┆ Y            │
 │ i32  ┆ i32   ┆ i32 ┆ i32         ┆   ┆ ---          ┆ ---          ┆ ---          ┆ ---          │
 │      ┆       ┆     ┆             ┆   ┆ i32          ┆ i32          ┆ i32          ┆ i32          │
 ╞══════╪═══════╪═════╪═════════════╪═══╪══════════════╪══════════════╪══════════════╪══════════════╡
 │ 2015 ┆ 1     ┆ 1   ┆ 4           ┆ … ┆ null         ┆ null         ┆ null         ┆ null         │
 │ 2015 ┆ 1     ┆ 1   ┆ 4           ┆ … ┆ null         ┆ null         ┆ null         ┆ null         │
 │ 2015 ┆ 1     ┆ 1   ┆ 4           ┆ … ┆ null         ┆ null         ┆ null         ┆ null         │
 │ 2015 ┆ 1     ┆ 1   ┆ 4           ┆ … ┆ null         ┆ n

In [21]:
vuelos_por_mes_dict.get(3)

YEAR,MONTH,DAY,DAY_OF_WEEK,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
i32,i32,i32,i32,i32,str,str,str,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,str,i32,i32,i32,i32,i32
2015,3,1,7,2400,"""N5DEAA""","""LAX""","""DFW""",5,,,,,168,,,1235,,,453,,,0,1,"""B""",,,,,
2015,3,1,7,258,"""N3HYAA""","""LAX""","""MIA""",20,16,-4,19,35,284,274,250,2342,745,5,804,750,-14,0,0,,,,,,
2015,3,1,7,1234,"""N3FPAA""","""LAS""","""ORD""",27,125,58,14,139,203,180,158,1514,617,8,550,625,35,0,0,,0,0,35,0,0
2015,3,1,7,1112,"""N3KGAA""","""SFO""","""DFW""",30,,,,,189,,,1464,,,539,,,0,1,"""B""",,,,,
2015,3,1,7,1674,"""N3GEAA""","""LAS""","""MIA""",45,53,8,12,105,264,251,232,2174,757,7,809,804,-5,0,0,,,,,,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2015,3,18,3,361,"""N634JB""","""BOS""","""SJU""",2354,112,78,18,130,230,221,200,1674,450,3,344,453,69,0,0,,0,0,27,42,0
2015,3,18,3,98,"""N635JB""","""DEN""","""JFK""",2359,2,3,17,19,213,196,174,1626,513,5,532,518,-14,0,0,,,,,,
2015,3,18,3,839,"""N588JB""","""JFK""","""BQN""",2359,2357,-2,17,14,221,203,182,1576,316,4,340,320,-20,0,0,,,,,,
2015,3,18,3,745,"""N653JB""","""JFK""","""PSE""",2359,2,3,22,24,230,213,187,1617,331,4,349,335,-14,0,0,,,,,,


## `rename`

This function renames the columns of the DataFrame.

In [22]:
vuelos.rename({'MONTH': 'mes', 'DAY': 'dia'})

YEAR,mes,dia,DAY_OF_WEEK,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
i32,i32,i32,i32,i32,str,str,str,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,str,i32,i32,i32,i32,i32
2015,1,1,4,2336,"""N3KUAA""","""LAX""","""PBI""",10,2,-8,12,14,280,279,263,2330,737,4,750,741,-9,0,0,,,,,,
2015,10,25,7,159,"""N3FLAA""","""11298""","""14771""",815,810,-5,14,824,222,216,199,1464,943,3,957,946,-11,0,0,,,,,,
2015,1,1,4,258,"""N3HYAA""","""LAX""","""MIA""",20,15,-5,15,30,285,281,258,2342,748,8,805,756,-9,0,0,,,,,,
2015,10,25,7,937,"""N3GPAA""","""12478""","""14843""",815,808,-7,18,826,243,214,192,1598,1138,4,1218,1142,-36,0,0,,,,,,
2015,1,1,4,1112,"""N3LAAA""","""SFO""","""DFW""",30,19,-11,17,36,195,193,173,1464,529,3,545,532,-13,0,0,,,,,,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2015,10,25,7,389,"""N236JB""","""10721""","""11278""",800,757,-3,14,811,102,91,74,399,925,3,942,928,-14,0,0,,,,,,
2015,10,25,7,683,"""N952JB""","""12478""","""13204""",803,759,-4,33,832,174,176,132,944,1044,11,1057,1055,-2,0,0,,,,,,
2015,10,25,7,453,"""N579JB""","""12478""","""14027""",805,802,-3,32,834,174,170,134,1028,1048,4,1059,1052,-7,0,0,,,,,,
2015,10,25,7,151,"""N267JB""","""10721""","""13204""",805,802,-3,15,817,198,179,155,1121,1052,9,1123,1101,-22,0,0,,,,,,


## `with_columns`

This function adds columns to the DataFrame. If the name of an added column matches the name of an existing column, the existing column is replaced by the new one.

In [23]:
from polars import col

vuelos.with_columns((col('DAY_OF_WEEK') * 10).alias('day_of_week_10'))

YEAR,MONTH,DAY,DAY_OF_WEEK,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,day_of_week_10
i32,i32,i32,i32,i32,str,str,str,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,str,i32,i32,i32,i32,i32,i32
2015,1,1,4,2336,"""N3KUAA""","""LAX""","""PBI""",10,2,-8,12,14,280,279,263,2330,737,4,750,741,-9,0,0,,,,,,,40
2015,10,25,7,159,"""N3FLAA""","""11298""","""14771""",815,810,-5,14,824,222,216,199,1464,943,3,957,946,-11,0,0,,,,,,,70
2015,1,1,4,258,"""N3HYAA""","""LAX""","""MIA""",20,15,-5,15,30,285,281,258,2342,748,8,805,756,-9,0,0,,,,,,,40
2015,10,25,7,937,"""N3GPAA""","""12478""","""14843""",815,808,-7,18,826,243,214,192,1598,1138,4,1218,1142,-36,0,0,,,,,,,70
2015,1,1,4,1112,"""N3LAAA""","""SFO""","""DFW""",30,19,-11,17,36,195,193,173,1464,529,3,545,532,-13,0,0,,,,,,,40
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2015,10,25,7,389,"""N236JB""","""10721""","""11278""",800,757,-3,14,811,102,91,74,399,925,3,942,928,-14,0,0,,,,,,,70
2015,10,25,7,683,"""N952JB""","""12478""","""13204""",803,759,-4,33,832,174,176,132,944,1044,11,1057,1055,-2,0,0,,,,,,,70
2015,10,25,7,453,"""N579JB""","""12478""","""14027""",805,802,-3,32,834,174,170,134,1028,1048,4,1059,1052,-7,0,0,,,,,,,70
2015,10,25,7,151,"""N267JB""","""10721""","""13204""",805,802,-3,15,817,198,179,155,1121,1052,9,1123,1101,-22,0,0,,,,,,,70


If we do not give the result a new name with `alias`, it keeps the name of the source column and therefore overwrites that column in the returned DataFrame.

In [24]:
vuelos.with_columns((col('DAY_OF_WEEK') * 10))

YEAR,MONTH,DAY,DAY_OF_WEEK,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
i32,i32,i32,i32,i32,str,str,str,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,str,i32,i32,i32,i32,i32
2015,1,1,40,2336,"""N3KUAA""","""LAX""","""PBI""",10,2,-8,12,14,280,279,263,2330,737,4,750,741,-9,0,0,,,,,,
2015,10,25,70,159,"""N3FLAA""","""11298""","""14771""",815,810,-5,14,824,222,216,199,1464,943,3,957,946,-11,0,0,,,,,,
2015,1,1,40,258,"""N3HYAA""","""LAX""","""MIA""",20,15,-5,15,30,285,281,258,2342,748,8,805,756,-9,0,0,,,,,,
2015,10,25,70,937,"""N3GPAA""","""12478""","""14843""",815,808,-7,18,826,243,214,192,1598,1138,4,1218,1142,-36,0,0,,,,,,
2015,1,1,40,1112,"""N3LAAA""","""SFO""","""DFW""",30,19,-11,17,36,195,193,173,1464,529,3,545,532,-13,0,0,,,,,,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2015,10,25,70,389,"""N236JB""","""10721""","""11278""",800,757,-3,14,811,102,91,74,399,925,3,942,928,-14,0,0,,,,,,
2015,10,25,70,683,"""N952JB""","""12478""","""13204""",803,759,-4,33,832,174,176,132,944,1044,11,1057,1055,-2,0,0,,,,,,
2015,10,25,70,453,"""N579JB""","""12478""","""14027""",805,802,-3,32,834,174,170,134,1028,1048,4,1059,1052,-7,0,0,,,,,,
2015,10,25,70,151,"""N267JB""","""10721""","""13204""",805,802,-3,15,817,198,179,155,1121,1052,9,1123,1101,-22,0,0,,,,,,


We can also add several columns in a single call. To do so, we can pass the new columns in a list, as shown below.

In [25]:
vuelos.with_columns(
    [
        (col('YEAR') + 1).alias('year_plus_1'),
        (col('AIR_TIME') / 60).alias('air_time_hrs'),
        col('TAIL_NUMBER').str.replace('N3','JO')
    ]
)

YEAR,MONTH,DAY,DAY_OF_WEEK,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,year_plus_1,air_time_hrs
i32,i32,i32,i32,i32,str,str,str,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,str,i32,i32,i32,i32,i32,i32,f64
2015,1,1,4,2336,"""JOKUAA""","""LAX""","""PBI""",10,2,-8,12,14,280,279,263,2330,737,4,750,741,-9,0,0,,,,,,,2016,4.383333
2015,10,25,7,159,"""JOFLAA""","""11298""","""14771""",815,810,-5,14,824,222,216,199,1464,943,3,957,946,-11,0,0,,,,,,,2016,3.316667
2015,1,1,4,258,"""JOHYAA""","""LAX""","""MIA""",20,15,-5,15,30,285,281,258,2342,748,8,805,756,-9,0,0,,,,,,,2016,4.3
2015,10,25,7,937,"""JOGPAA""","""12478""","""14843""",815,808,-7,18,826,243,214,192,1598,1138,4,1218,1142,-36,0,0,,,,,,,2016,3.2
2015,1,1,4,1112,"""JOLAAA""","""SFO""","""DFW""",30,19,-11,17,36,195,193,173,1464,529,3,545,532,-13,0,0,,,,,,,2016,2.883333
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2015,10,25,7,389,"""N236JB""","""10721""","""11278""",800,757,-3,14,811,102,91,74,399,925,3,942,928,-14,0,0,,,,,,,2016,1.233333
2015,10,25,7,683,"""N952JB""","""12478""","""13204""",803,759,-4,33,832,174,176,132,944,1044,11,1057,1055,-2,0,0,,,,,,,2016,2.2
2015,10,25,7,453,"""N579JB""","""12478""","""14027""",805,802,-3,32,834,174,170,134,1028,1048,4,1059,1052,-7,0,0,,,,,,,2016,2.233333
2015,10,25,7,151,"""N267JB""","""10721""","""13204""",805,802,-3,15,817,198,179,155,1121,1052,9,1123,1101,-22,0,0,,,,,,,2016,2.583333


## `unique`

This function removes duplicate rows from the DataFrame. If no parameter is given, it uses all columns to identify and remove duplicate rows. To restrict the check to specific column(s), pass the parameter `subset=[col1, col2, ..., colN]`.

To show how it works, let's create a new DataFrame.

In [26]:
df = pl.DataFrame(
    {
        'id': [1,2,3,1],
        'col_a': ['a', 'a', 'a', 'a'],
        'col_b': ['b', 'b', 'b', 'b']
    }
)

In [27]:
df.unique()

id,col_a,col_b
i64,str,str
3,"""a""","""b"""
2,"""a""","""b"""
1,"""a""","""b"""


We can preserve the order of the original DataFrame with the parameter `maintain_order=True`. This operation is more expensive to compute.

In [28]:
df.unique(maintain_order=True)

id,col_a,col_b
i64,str,str
1,"""a""","""b"""
2,"""a""","""b"""
3,"""a""","""b"""


We can tell it which column(s) to consider when identifying duplicate rows.

In [29]:
df.unique(subset=['col_a', 'col_b'])

id,col_a,col_b
i64,str,str
1,"""a""","""b"""
