## Reading Tabular Data

The Planetary Computer provides tabular data in the [Apache Parquet](https://parquet.apache.org/) file format. Small datasets can be read directly with [pandas](https://pandas.pydata.org/). For example, we can read the boundary table from the [Forest Inventory and Analysis](https://aka.ms/ai4edata-fia) dataset, which has about 190,000 rows of information about forest health and location in the US. The `columns` argument restricts the read to the columns we need.

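```python
import pandas as pd

# Read a single Parquet part file from Azure Blob Storage.
# The az:// protocol is handled by fsspec and requires the adlfs package.
df = pd.read_parquet(
    "az://cpdata/raw/fia/boundary.parquet/part.0.parquet",
    storage_options={"account_name": "cpdataeuwest"},
    columns=["CN", "AZMLEFT", "AZMCORN"],  # load only these columns
)
df
```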

| `__null_dask_index__` | CN | AZMLEFT | AZMCORN |
|---|---|---|---|
| 0 | 204719190010854 | 259 | 0 |
| 1 | 204719188010854 | 33 | 0 |
| 2 | 204719189010854 | 52 | 0 |
| 3 | 204719192010854 | 322 | 0 |
| 4 | 204719191010854 | 325 | 0 |
| ... | ... | ... | ... |
| 190395 | 310422187489998 | 330 | 50 |
| 190396 | 310422188489998 | 186 | 225 |
| 190397 | 310422189489998 | 291 | 356 |
| 190398 | 310422994489998 | 22 | 0 |


Larger datasets can be read using [Dask](https://dask.org/). For example, the `cpdata/raw/fia/tree.parquet` folder contains about 160 individual Parquet files, totalling about 22 million rows. In this case, pass the directory path to `dask.dataframe.read_parquet`, which treats the files as partitions of a single DataFrame.

```python
import dask.dataframe as dd

df = dd.read_parquet(
    "az://cpdata/raw/fia/tree.parquet", storage_options={"account_name": "cpdataeuwest"}
)
df[["SPCD", "CARBON_BG", "CARBON_AG"]]
```

| npartitions=160 | SPCD | CARBON_BG | CARBON_AG |
|---|---|---|---|
| | int64 | float64 | float64 |
| | ... | ... | ... |
| ... | ... | ... | ... |
| | ... | ... | ... |
| | ... | ... | ... |


This call returns a Dask DataFrame without loading the data. We can operate on the DataFrame with pandas-like methods, and call `.compute()` to get a concrete result. In this case, we'll compute the average amount of carbon sequestered above and below ground per tree, grouped by species code (`SPCD`). To cut down on execution time we'll select just the first partition, so the averages reflect only that partition's rows rather than the full dataset.

```python
result = (
    df[["SPCD", "CARBON_BG", "CARBON_AG"]]
    .get_partition(0)
    .groupby("SPCD")  # group by species
    .mean()
    .compute()
)
result
```

| SPCD | CARBON_BG | CARBON_AG |
|---|---|---|
| 43 | 37.864937 | 165.753430 |
| 67 | 3.549734 | 14.679764 |
| 68 | 9.071253 | 39.108406 |
| 107 | 19.321549 | 84.096184 |
| 110 | 29.964395 | 130.956288 |
| ... | ... | ... |
| 973 | 4.632913 | 22.658887 |
| 975 | 38.988846 | 202.220124 |
| 976 | 25.385733 | 130.583668 |
| 993 | 12.570365 | 64.712301 |


### Learn more

See the [pandas documentation](https://pandas.pydata.org/docs/user_guide/io.html#parquet) for an introduction to Parquet, and the `read_parquet` reference documentation for [pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html?highlight=read_parquet) and [Dask](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_parquet).

For more about the dataset used in these examples, see [Forest Inventory and Analysis](https://aka.ms/ai4edata-fia).

For more information on scaling workflows to large datasets, see [Scaling with Dask](scale-with-dask.ipynb).