# 7. Data Manipulation VI - Interoperation and IO

So far in this course, we've seen how to load data from a CSV file, load data from a Parquet file, and create a DataFrame from a dictionary of columns. However, there are many more ways of storing tabular data, both inside and outside Python: Excel files, NumPy arrays, pandas dataframes, lists of dictionaries, dictionaries of lists, and more. The goal of this module is to learn how to move data seamlessly between pandas, NumPy, and Polars, and to read and write data in a range of formats. We'll do this by working through a few common interoperation and IO use cases.

But first we import `polars`...

In [2]:
import polars as pl

%run setup.py

/data/datasets/data/yellow_tripdata_2024-03.parquet already exists
/data/datasets/data/taxi_zone_lookup.csv already exists

local_parquet = /data/datasets/data/yellow_tripdata_2024-03.parquet
local_csv = /data/datasets/data/yellow_tripdata_2024-03.csv
taxi_zone_lookup_local = /data/datasets/data/taxi_zone_lookup.csv
taxi_zone_lookup_parquet = /data/datasets/data/taxi_zone_lookup.parquet


## 7.1. Creating Dataframes from a dict of lists and a list of dicts

We've seen by now a few instances of creating a dataframe from a dictionary of columns:

In [3]:
df = pl.DataFrame(
    {
        "first_name": ["dan", "stan", "ron", "dawn"],
        "last_name": ["flanson", "cranson", "bronson", "johnson"],
    }
)
df.head()

first_name,last_name
str,str
"""dan""","""flanson"""
"""stan""","""cranson"""
"""ron""","""bronson"""
"""dawn""","""johnson"""


But we can also create the same dataframe from a list of rows as dictionaries with `pl.from_dicts`:

In [4]:
df = pl.from_dicts(
    [
        {"first_name": "dan", "last_name": "flanson"},
        {"first_name": "stan", "last_name": "cranson"},
        {"first_name": "ron", "last_name": "bronson"},
        {"first_name": "dawn", "last_name": "johnson"},
    ]
)
df.head()

first_name,last_name
str,str
"""dan""","""flanson"""
"""stan""","""cranson"""
"""ron""","""bronson"""
"""dawn""","""johnson"""


And if passing the column names in the data dictionary itself becomes too verbose, we can also create a dataframe with `pl.from_records`, specifying the schema in a separate argument:

In [5]:
df = pl.from_records(
    [
        ["dan", "flanson"],
        ["stan", "cranson"],
        ["ron", "bronson"],
        ["dawn", "johnson"],
    ],
    schema=["first_name", "last_name"],
    orient="row",
)
df

first_name,last_name
str,str
"""dan""","""flanson"""
"""stan""","""cranson"""
"""ron""","""bronson"""
"""dawn""","""johnson"""


And the same can be done from a columns orientation:

In [6]:
df = pl.from_records(
    [
        ["dan", "stan", "ron", "dawn"],
        ["flanson", "cranson", "bronson", "johnson"],
    ],
    schema=["first_name", "last_name"],
    orient="col",
)
df

first_name,last_name
str,str
"""dan""","""flanson"""
"""stan""","""cranson"""
"""ron""","""bronson"""
"""dawn""","""johnson"""


Finally, we can easily go back to a dictionary of columns with `pl.DataFrame.to_dict()`, or to a list of row dictionaries with `pl.DataFrame.to_dicts()`:

In [7]:
df

first_name,last_name
str,str
"""dan""","""flanson"""
"""stan""","""cranson"""
"""ron""","""bronson"""
"""dawn""","""johnson"""


In [8]:
display(df.to_dict(as_series=False))

{'first_name': ['dan', 'stan', 'ron', 'dawn'],
 'last_name': ['flanson', 'cranson', 'bronson', 'johnson']}

In [9]:
display(df.to_dicts())

[{'first_name': 'dan', 'last_name': 'flanson'},
 {'first_name': 'stan', 'last_name': 'cranson'},
 {'first_name': 'ron', 'last_name': 'bronson'},
 {'first_name': 'dawn', 'last_name': 'johnson'}]

## 7.2. Interoperating with `pl.Series`

You may have noticed in the call to `df.to_dict()` just now that we passed the argument `as_series=False`. But what if we had left the default there?

In [10]:
display(df.to_dict())

{'first_name': shape: (4,)
 Series: 'first_name' [str]
 [
 	"dan"
 	"stan"
 	"ron"
 	"dawn"
 ],
 'last_name': shape: (4,)
 Series: 'last_name' [str]
 [
 	"flanson"
 	"cranson"
 	"bronson"
 	"johnson"
 ]}

That's right! Though we haven't discussed it until now, `polars` offers a `pl.Series` class alongside `pl.DataFrame`: a `pl.Series` is a single named, typed column. You can create a `pl.Series` in much the same way as we created a `pl.DataFrame` in the prior section:

In [11]:
pl.Series(name="first_name", values=["dan", "stan", "ron", "dawn"])

first_name
str
"""dan"""
"""stan"""
"""ron"""
"""dawn"""


Instead of creating it again, we could have also extracted this `pl.Series` from the dataframe above using bracket notation:

In [12]:
df["first_name"]

first_name
str
"""dan"""
"""stan"""
"""ron"""
"""dawn"""


In fact, we can create a `pl.DataFrame` from a list of `pl.Series` objects as well:

In [13]:
df_from_series = pl.DataFrame(
    [
        pl.Series(name="first_name", values=["dan", "stan", "ron", "dawn"]),
        pl.Series(
            name="last_name",
            values=["flanson", "cranson", "bronson", "johnson"],
        ),
    ]
)
display(df_from_series)

first_name,last_name
str,str
"""dan""","""flanson"""
"""stan""","""cranson"""
"""ron""","""bronson"""
"""dawn""","""johnson"""


And if we ever want to pull one of the columns out as a `pl.Series` object, we can call `.to_series()` with the index of that column:

In [14]:
series = df_from_series.to_series(0)
display(series)

first_name
str
"""dan"""
"""stan"""
"""ron"""
"""dawn"""


And the reverse is also possible:

In [15]:
series.to_frame()

first_name
str
"""dan"""
"""stan"""
"""ron"""
"""dawn"""


(Note the way that the `shape` changes from `(4,)` to `(4, 1)`.)

## 7.3. Interoperating With Pandas dataframes

`polars` enables us to seamlessly switch between `polars` and `pandas` dataframes. Again using the same toy dataframe:

In [16]:
df = pl.DataFrame(
    {
        "first_name": ["dan", "stan", "ron", "dawn"],
        "last_name": ["flanson", "cranson", "bronson", "johnson"],
    }
)
df.head()

first_name,last_name
str,str
"""dan""","""flanson"""
"""stan""","""cranson"""
"""ron""","""bronson"""
"""dawn""","""johnson"""


It's just a simple call to `.to_pandas()`.

In [17]:
pandas_df = df.to_pandas()
display(type(pandas_df))
display(pandas_df)

pandas.core.frame.DataFrame

,first_name,last_name
0,dan,flanson
1,stan,cranson
2,ron,bronson
3,dawn,johnson


And we can go back with `pl.from_pandas()`:

In [18]:
polars_df = pl.from_pandas(pandas_df)
display(type(polars_df))
display(polars_df)

polars.dataframe.frame.DataFrame

first_name,last_name
str,str
"""dan""","""flanson"""
"""stan""","""cranson"""
"""ron""","""bronson"""
"""dawn""","""johnson"""


## 7.4. Interoperating With Numpy Arrays

The same goes for `numpy`: a call to `.to_numpy()` converts the dataframe to an array.

In [19]:
numpy_array = df.to_numpy()
display(numpy_array)

array([['dan', 'flanson'],
       ['stan', 'cranson'],
       ['ron', 'bronson'],
       ['dawn', 'johnson']], dtype=object)

And we can just as easily switch back to `polars`:

In [20]:
df_from_numpy_array = pl.from_numpy(numpy_array)
display(df_from_numpy_array)

column_0,column_1
str,str
"""dan""","""flanson"""
"""stan""","""cranson"""
"""ron""","""bronson"""
"""dawn""","""johnson"""


However, since `numpy` isn't really a library for working with dataframes (that is, columnar data with named columns), the column names are lost along the way. If we want to bring them back, we have to pass them explicitly with the `schema` argument:

In [21]:
df_from_numpy_array = pl.from_numpy(
    numpy_array, schema=["first_name", "last_name"]
)
display(df_from_numpy_array)

first_name,last_name
str,str
"""dan""","""flanson"""
"""stan""","""cranson"""
"""ron""","""bronson"""
"""dawn""","""johnson"""


## 7.5. Dataframe Writing Rapid-Fire

Let's look at some more of the different ways that `polars` can write out our toy dataframe to disk:

In [22]:
df

first_name,last_name
str,str
"""dan""","""flanson"""
"""stan""","""cranson"""
"""ron""","""bronson"""
"""dawn""","""johnson"""


#### Working with Excel files

We can write data to an Excel file...

In [23]:
df.write_excel("../data/toy_df.xlsx")

<xlsxwriter.workbook.Workbook at 0xe9b00bc1ca00>

...and load it back in from that file.

In [24]:
pl.read_excel("../data/toy_df.xlsx")

first_name,last_name
str,str
"""dan""","""flanson"""
"""stan""","""cranson"""
"""ron""","""bronson"""
"""dawn""","""johnson"""


#### Working with JSON files

We can write data to a JSON file...

In [25]:
df.write_json("../data/toy_df.json")

... and load it back in from that file.

In [26]:
pl.read_json("../data/toy_df.json")

first_name,last_name
str,str
"""dan""","""flanson"""
"""stan""","""cranson"""
"""ron""","""bronson"""
"""dawn""","""johnson"""


#### Working with NDJSON files

We can write data to a newline-delimited JSON (NDJSON) file...

In [27]:
df.write_ndjson("../data/toy_df.njson")

... and load it back in from that file.

In [28]:
pl.read_ndjson("../data/toy_df.njson")

first_name,last_name
str,str
"""dan""","""flanson"""
"""stan""","""cranson"""
"""ron""","""bronson"""
"""dawn""","""johnson"""


#### Working with partitioned pyarrow datasets

We can also write data to disk in partitions by leaning on `pyarrow`'s partitioning functionality. To show this, let's load in the taxi rides data, adding a `"tpep_pickup_date"` column to partition on:

In [29]:
yellow_rides_column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}
march_yellow_rides_df = (
    pl.read_parquet(local_parquet)
    .rename(yellow_rides_column_rename_mapping)
    .with_columns(
        pl.col("tpep_pickup_datetime").dt.date().alias("tpep_pickup_date")
    )
)
display(march_yellow_rides_df.shape)
display(march_yellow_rides_df.head())

(3582628, 20)

vendor_id,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecode_id,store_and_fwd_flag,pu_location_id,do_location_id,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,tpep_pickup_date
i32,datetime[ns],datetime[ns],i64,f64,i64,str,i32,i32,i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,date
1,2024-03-01 00:18:51,2024-03-01 00:23:45,0,1.3,1,"""N""",142,239,1,8.6,3.5,0.5,2.7,0.0,1.0,16.3,2.5,0.0,2024-03-01
1,2024-03-01 00:26:00,2024-03-01 00:29:06,0,1.1,1,"""N""",238,24,1,7.2,3.5,0.5,3.0,0.0,1.0,15.2,2.5,0.0,2024-03-01
2,2024-03-01 00:09:22,2024-03-01 00:15:24,1,0.86,1,"""N""",263,75,2,7.9,1.0,0.5,0.0,0.0,1.0,10.4,0.0,0.0,2024-03-01
2,2024-03-01 00:33:45,2024-03-01 00:39:34,1,0.82,1,"""N""",164,162,1,7.9,1.0,0.5,1.29,0.0,1.0,14.19,2.5,0.0,2024-03-01
1,2024-03-01 00:05:43,2024-03-01 00:26:22,0,4.9,1,"""N""",263,7,2,25.4,3.5,0.5,0.0,0.0,1.0,30.4,2.5,0.0,2024-03-01


We can then save this dataframe, not to a single Parquet file like the one we read from, but to a **directory** of Parquet files, with one subdirectory per value of the partition column (for example `tpep_pickup_date=2024-03-06/`).

In [30]:
march_yellow_rides_df.write_parquet(
    "../data/partitioned_march_rides_2/",
    use_pyarrow=True,
    pyarrow_options={"partition_cols": ["tpep_pickup_date"]},
)

And we can load this back in with a regular `pl.read_parquet()` call pointed at the directory, again using `pyarrow`:

In [31]:
loaded_df_all_partition = pl.read_parquet(
    "../data/partitioned_march_rides_2/", use_pyarrow=True
)
display(loaded_df_all_partition.shape)
display(loaded_df_all_partition.head())

(3582628, 20)

vendor_id,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecode_id,store_and_fwd_flag,pu_location_id,do_location_id,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,tpep_pickup_date
i32,datetime[ns],datetime[ns],i64,f64,i64,str,i32,i32,i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,cat
2,2002-12-31 22:17:10,2002-12-31 22:42:24,1,1.4,1,"""N""",50,162,1,10.0,1.0,0.5,3.0,0.0,1.0,18.0,2.5,0.0,"""2002-12-31"""
2,2002-12-31 23:08:30,2003-01-01 14:58:35,1,5.34,1,"""N""",132,124,2,21.9,0.0,0.5,0.0,0.0,1.0,25.15,0.0,1.75,"""2002-12-31"""
2,2024-02-29 23:52:39,2024-02-29 23:57:31,2,0.69,1,"""N""",234,113,1,6.5,1.0,0.5,2.3,0.0,1.0,13.8,2.5,0.0,"""2024-02-29"""
2,2024-02-29 23:59:33,2024-03-01 00:18:39,2,3.43,1,"""N""",68,148,1,19.8,1.0,0.5,3.0,0.0,1.0,27.8,2.5,0.0,"""2024-02-29"""
2,2024-02-29 23:59:13,2024-03-01 00:13:55,1,8.92,1,"""N""",132,39,1,34.5,1.0,0.5,0.0,0.0,1.0,38.75,0.0,1.75,"""2024-02-29"""


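And all the data is there: the shape matches the original `(3582628, 20)`. Note, though, that the rows come back in a different order, and that `tpep_pickup_date` is now a categorical (`cat`) column rather than a `date`, since its values are recovered from the directory names. If we'd wanted to, we could have loaded data from just one partition by pointing at its subdirectory, in which case the partition column itself is not part of the result: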

In [32]:
loaded_df_one_partition = pl.read_parquet(
    "../data/partitioned_march_rides_2/tpep_pickup_date=2024-03-06/",
    use_pyarrow=True,
)
display(loaded_df_one_partition.shape)
display(loaded_df_one_partition.head())

(132881, 19)

vendor_id,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecode_id,store_and_fwd_flag,pu_location_id,do_location_id,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
i32,datetime[ns],datetime[ns],i64,f64,i64,str,i32,i32,i64,f64,f64,f64,f64,f64,f64,f64,f64,f64
2,2024-03-06 00:08:24,2024-03-06 00:14:45,,2.59,,,140,224,0,11.87,0.0,0.5,3.17,0.0,1.0,19.04,,
1,2024-03-06 00:17:43,2024-03-06 00:32:26,,4.1,,,234,262,0,20.09,0.0,0.5,0.0,0.0,1.0,24.09,,
2,2024-03-06 00:40:33,2024-03-06 00:54:56,,6.84,,,137,40,0,24.13,0.0,0.5,4.43,0.0,1.0,32.56,,
2,2024-03-06 00:16:10,2024-03-06 00:24:43,,4.52,,,137,88,0,16.54,0.0,0.5,3.06,0.0,1.0,23.6,,
2,2024-03-06 00:49:08,2024-03-06 00:57:56,,1.91,,,230,246,0,11.56,0.0,0.5,0.0,0.0,1.0,15.56,,


#### Notes on other IO options

`polars` can read from and write to more file formats and locations than we've covered here, including, but not limited to, Avro files, arbitrary database connections, Delta Lake, and Apache Iceberg. Some of these sources can only be read eagerly (`pl.read_*`), some can only be scanned lazily (`pl.scan_*`), and most support both.

# Conclusion

In this module, we've learned how to interoperate between `polars`, `numpy`, and `pandas`, as well as with built-in Python objects like lists and dicts. We've also broadened our options for reading and writing data.