---
file_format: mystnb
---
```{currentmodule} pandera
```
(data-format-conversion)=
# Data Format Conversion
*new in 0.9.0*
The class-based API provides configuration options for converting data to/from
supported serialization formats in the context of
{py:func}`~pandera.decorators.check_types` -decorated functions.
:::{note}
Currently, {py:class}`pandera.typing.pandas.DataFrame` is the only data
type that supports this feature.
:::
Consider this simple example:
```{code-cell} python
import pandera as pa
from pandera.typing import DataFrame, Series
class InSchema(pa.DataFrameModel):
    str_col: Series[str] = pa.Field(unique=True, isin=[*"abcd"])
    int_col: Series[int]


class OutSchema(InSchema):
    float_col: pa.typing.Series[float]


@pa.check_types
def transform(df: DataFrame[InSchema]) -> DataFrame[OutSchema]:
    return df.assign(float_col=1.1)
```
With the schema type annotations and
{py:func}`~pandera.decorators.check_types` decorator, the `transform`
function validates DataFrame inputs and outputs according to the `InSchema`
and `OutSchema` definitions.
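For example, calling `transform` with an in-memory dataframe that satisfies `InSchema` returns the validated output (a quick sketch; the example dataframe below is made up for illustration):
```{code-cell} python
import pandas as pd

valid_df = pd.DataFrame({"str_col": ["a", "b", "c"], "int_col": [1, 2, 3]})
print(transform(valid_df))
```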
But what if your input data is serialized in parquet format, and you want to
read it into memory, validate the DataFrame, and then pass it to a downstream
function for further analysis? Similarly, what if you want the output of
`transform` to be a list of dictionary records instead of a pandas DataFrame?
## The `to/from_format` Configuration Options
To easily fulfill the use cases described above, you can implement the
read/write logic by hand, or you can configure schemas to do so. We can first
define a subclass of `InSchema` with additional configuration so that our
`transform` function can read data directly from parquet files or buffers:
```{code-cell} python
class InSchemaParquet(InSchema):
    class Config:
        from_format = "parquet"
```
Then, we define a subclass of `OutSchema` to specify that `transform`
should output a list of dictionaries representing the rows of the output
dataframe.
```{code-cell} python
class OutSchemaDict(OutSchema):
    class Config:
        to_format = "dict"
        to_format_kwargs = {"orient": "records"}
```
Note that the `{to/from}_format_kwargs` configuration option should be
supplied with a dictionary of keyword arguments to be passed into the
respective pandas `{to/from}_format` method.
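For instance, a parquet-reading schema could forward a `columns` argument to `pandas.read_parquet` (a sketch; `InSchemaParquetSubset` is an illustrative name, not part of the pandera API):
```{code-cell} python
class InSchemaParquetSubset(InSchema):
    class Config:
        from_format = "parquet"
        # Forwarded as keyword arguments to ``pandas.read_parquet``.
        from_format_kwargs = {"columns": ["str_col", "int_col"]}
```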
Finally, we redefine our `transform` function:
```{code-cell} python
@pa.check_types
def transform(df: DataFrame[InSchemaParquet]) -> DataFrame[OutSchemaDict]:
    return df.assign(float_col=1.1)
```
We can test this out using a buffer to store the parquet file.
:::{note}
A string or path-like object representing the filepath to a parquet file
would also be a valid input to `transform`.
:::
```{code-cell} python
import io
import json

import pandas as pd

buffer = io.BytesIO()
data = pd.DataFrame({"str_col": [*"abc"], "int_col": range(3)})
data.to_parquet(buffer)
buffer.seek(0)

dict_output = transform(buffer)
print(json.dumps(dict_output, indent=2))
```
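Per the note above, the same function also accepts a filepath (a minimal sketch using a temporary directory):
```{code-cell} python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "data.parquet")
    data.to_parquet(path)
    print(json.dumps(transform(path), indent=2))
```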
## Custom Converters with Callables
In addition to specifying a literal string argument for `from_format`, you can
pass a generic callable that returns a pandas dataframe, for example
`pd.read_excel`, `pd.read_sql`, or `pd.read_gbq`. Depending on the function
passed, some of the keyword arguments in `from_format_kwargs` may be required
rather than optional (`pd.read_sql`, for instance, requires a connection object).
A callable can also be passed to the `to_format` parameter, together with the
additional, optional `to_format_buffer` parameter. Some pandas dataframe-writing
methods, such as `DataFrame.to_pickle`, have a required path argument, which must
be either a string filepath or a writable buffer object. An example of writing
data to a pickle file:
```{code-cell} python
import tempfile


def custom_to_pickle(data, *args, **kwargs):
    return data.to_pickle(*args, **kwargs)


def custom_to_pickle_buffer():
    """Create a named temporary file handle to write the pickle file."""
    return tempfile.NamedTemporaryFile()


class OutSchemaPickleCallable(OutSchema):
    class Config:
        to_format = custom_to_pickle
        # If provided, the output of this function will be supplied as
        # the first positional argument to the ``to_format`` function.
        to_format_buffer = custom_to_pickle_buffer
```
In this example, we use a `custom_to_pickle_buffer` function as the
`to_format_buffer` property, which returns a {func}`tempfile.NamedTemporaryFile`.
This will be supplied as a positional argument to the `custom_to_pickle`
function.
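To exercise this schema, a decorated function can declare it as the output type (a minimal sketch; `transform_to_pickle` is an illustrative name, and what the call returns depends on how pandera hands back the buffer produced by `to_format_buffer`):
```{code-cell} python
@pa.check_types
def transform_to_pickle(df: DataFrame[InSchema]) -> DataFrame[OutSchemaPickleCallable]:
    return df.assign(float_col=1.1)


# The validated output dataframe is written by ``custom_to_pickle`` into the
# temporary file created by ``custom_to_pickle_buffer``.
pickle_output = transform_to_pickle(data)
print(type(pickle_output))
```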
The full set of configuration options is:
```{eval-rst}
.. list-table:: Data format conversion options
   :widths: 50 60
   :header-rows: 1

   * - Format
     - Argument
   * - dict
     - "dict"
   * - csv
     - "csv"
   * - json
     - "json"
   * - feather
     - "feather"
   * - parquet
     - "parquet"
   * - pickle
     - "pickle"
   * - Callable
     - Callable
```
## Takeaway
Data format conversion using the `{to/from}_format` configuration options can
modify the behavior of {py:func}`~pandera.decorators.check_types`-decorated
functions, converting input data from a particular serialization format into
a dataframe and, conversely, converting the output dataframe into another format.
This dovetails well with the {ref}`FastAPI Integration <fastapi-integration>`
for validating the inputs and outputs of app endpoints.