# Data Encoding, Decoding and Flow

## Apache Parquet, ORC and Arrow

Parquet and ORC are columnar file formats, while Arrow is primarily a columnar in-memory format. The `pyarrow` library reads (decodes) a Parquet or ORC file into a `pyarrow.Table` object, a columnar data structure that can be converted to a pandas DataFrame. It also writes (encodes) a `pyarrow.Table` back to a Parquet or ORC file. Because both formats pass through the same `Table`, data read from one format can be written to the other.

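Parquet has the following physical types:

- boolean: 1-bit boolean
- int32: 32-bit signed integers
- int64: 64-bit signed integers
- int96: 96-bit values (deprecated, historically used for timestamps)
- float: IEEE 32-bit floating point values
- double: IEEE 64-bit floating point values
- byte_array: arbitrarily long byte arrays
- fixed_len_byte_array: fixed-length byte arrays

Logical types annotate a physical type to say how its values should be interpreted, for example:

- string: UTF-8 encoded strings (stored as byte_array)
- enum: enumeration of strings (stored as byte_array)
- temporal: date, time and timestamp types (stored as int32 or int64)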

ORC has the following types:

- boolean: 1-bit boolean
- tinyint: 8-bit signed integers
- smallint: 16-bit signed integers
- int: 32-bit signed integers
- bigint: 64-bit signed integers
- float: IEEE 32-bit floating point values
- double: IEEE 64-bit floating point values
- string: UTF-8 encoded strings
- char: fixed-length strings
- varchar: variable-length strings with a maximum length
- binary: byte arrays
- timestamp: a date and time
- date: a calendar date without a time of day
- decimal: decimals with a declared precision and scale
- list: an ordered collection of elements
- map: a collection of key-value pairs
- struct: an ordered collection of named fields
- union: a value that can be one of several declared types

![overview-diagram](../assets/diagram-2.png)

### Reading (Decoding) and Writing (Encoding) a Parquet File

Let's look at how to decode and encode a Parquet file containing mock customer data.

In [None]:
import pyarrow as pa
import pyarrow.parquet as pq

In [None]:
table = pq.read_table('../data/userdata1.parquet')

In [None]:
table

In [None]:
table.schema

In [None]:
metadata = pq.read_metadata('../data/userdata1.parquet')

metadata

In [None]:
metadata.schema

In [None]:
metadata.row_group(0).column(10)

Select the first 3 rows of the table:

In [None]:
table.take([0,1,2])

Convert a Table to a DataFrame:

In [None]:
df = table.to_pandas()

In [None]:
df

You can convert the DataFrame back to a Table. Note that `from_pandas` is a method of `pa.Table`, where `pa` is the `pyarrow` alias imported above:

In [None]:
new_table = pa.Table.from_pandas(df)

new_table

You can write the table back to a Parquet file:

In [None]:
pq.write_table(new_table, "../data/userdata2.parquet")

> 1. How many males and females are there?
>
> 2. What is the average salary for customers from China?
>
> 3. Create a new column `full_name` in the DataFrame that combines `first_name` and `last_name` with a space in between. Then convert the DataFrame to a new Table and write it to a Parquet file.

### Reading (Decoding) and Writing (Encoding) an ORC File

Let's look at how to decode and encode an ORC file with mock data.

In [None]:
import pyarrow as pa
from pyarrow import orc

In [None]:
table2 = orc.read_table('../data/userdata1.1.orc')

In [None]:
table2

In [None]:
df2 = table2.to_pandas()

df2

You can write the table back to an ORC file:

In [None]:
orc.write_table(table2, "../data/file2.orc")