## Reading Tabular Data

The Planetary Computer provides tabular data in the [Apache Parquet](https://parquet.apache.org/) file format, a standardized, high-performance columnar storage format.

When working from Python, there are several options for reading Parquet datasets. The right choice depends on the size and kind of data you're reading. For geospatial data, where one or more columns contain vector geometries, we recommend [geopandas](https://geopandas.org/) for small datasets and [dask-geopandas](https://github.com/geopandas/dask-geopandas) for large datasets. For non-geospatial tabular data, we recommend [pandas](https://pandas.pydata.org/) for small datasets and [Dask](https://dask.org/) for large datasets.
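
The recommendation above can be summarized as a small lookup. This is only an illustrative sketch: `recommend_reader` is a hypothetical helper, not part of any library, and the `fits_in_memory` flag is an assumption standing in for whatever "small" means in your environment.

```python
def recommend_reader(has_geometry: bool, fits_in_memory: bool) -> str:
    """Return the name of the reader function suggested for a Parquet dataset.

    Hypothetical helper that only encodes the guidance given in the text.
    """
    if has_geometry:
        # Vector geometries: geopandas for small data, dask-geopandas for large.
        if fits_in_memory:
            return "geopandas.read_parquet"
        return "dask_geopandas.read_parquet"
    # Plain tabular data: pandas for small data, Dask for large.
    if fits_in_memory:
        return "pandas.read_parquet"
    return "dask.dataframe.read_parquet"


# Print the suggestion for every combination of the two flags.
for has_geometry in (False, True):
    for fits_in_memory in (True, False):
        print(has_geometry, fits_in_memory, recommend_reader(has_geometry, fits_in_memory))
```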

Regardless of which library you're using to read the data, we recommend using [STAC](https://stacspec.org/) to discover which datasets are available, and which options should be provided when reading the data.

In this example we'll work with data from the US Forest Service's [Forest Inventory and Analysis](https://planetarycomputer.microsoft.com/dataset/fia) dataset. This includes a collection of tables providing information about forest health and location in the United States.

In [1]:
import pystac_client
import planetary_computer

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
fia = catalog.get_collection("fia")
fia

* ID: fia
* Title: Forest Inventory and Analysis
* Description: Status and trends on U.S. forest location, health, growth, mortality, and production, from the U.S. Forest Service's [Forest Inventory and Analysis](https://www.fia.fs.fed.us/) (FIA) program. The Forest Inventory and Analysis (FIA) dataset is a nationwide survey of the forest assets of the United States. The FIA research program has been in existence since 1928. FIA's primary objective is to determine the extent, condition, volume, growth, and use of trees on the nation's forest land. Domain: continental U.S., 1928-2018. Resolution: plot-level (irregular polygon). This dataset was curated and brought to Azure by [CarbonPlan](https://carbonplan.org/).
* Providers: Forest Inventory & Analysis (producer, licensor), CarbonPlan (processor), Microsoft (host)
* Keywords: Forest, Species, Carbon, Biomass, USDA, Forest Service
* License: https://www.fs.usda.gov/rds/archive/datauseinfo/open
* Described by: https://planetarycomputer.microsoft.com/dataset/fia
* STAC extensions: https://stac-extensions.github.io/item-assets/v1.0.0/schema.json, https://stac-extensions.github.io/table/v1.2.0/schema.json

Collection assets:

* guide: Database Description and User Guide (application/pdf, roles: metadata), https://www.fia.fs.fed.us/library/database-documentation/current/ver80/FIADB%20User%20Guide%20P2_8-0.pdf
* thumbnail: Forest Inventory and Analysis (image/gif), https://ai4edatasetspublicassets.blob.core.windows.net/assets/pc_thumbnails/fia.png
* geoparquet-items: GeoParquet STAC items (application/x-parquet, roles: stac-items), abfs://items/fia.parquet. Snapshot of the collection's STAC items exported to GeoParquet format. Not partitioned; `table:storage_options` holds the account name `pcstacitems` and a SAS credential.

Item assets:

* data: Dataset root (application/x-parquet, roles: data); `table:storage_options` holds the account name `cpdataeuwest`.

Example item: tree_woodland_stems

* Bounding box: [-179.14734, -14.53, 179.77847, 71.352561]
* Datetime: 2020-06-01T00:00:00Z
* Data asset: Dataset root (application/x-parquet, roles: data), abfs://cpdata/raw/fia/tree_woodland_stems.parquet; `table:storage_options` holds the account name `cpdataeuwest` and a SAS credential.
* table:columns: CN (int64, Sequence number); PLT_CN (int64, Plot sequence number); INVYR (int64, Inventory year); STATECD (int64, State code); UNITCD (int64, Survey unit code); COUNTYCD (int64, County code); PLOT (int64, Plot number); SUBP (int64, Subplot number); TREE (int64, Woodland tree number); TRE_CN (int64, Tree sequence number); DIA (double, Woodland stem diameter); STATUSCD (int64, Woodland stem status code); STEM_NBR (int64, Woodland stem number); CYCLE (int64, Inventory cycle number); SUBCYCLE (int64, Inventory subcycle number); CREATED_BY (double, Created by); CREATED_DATE (byte_array, Created date); CREATED_IN_INSTANCE (int64, Created in instance); MODIFIED_BY (double, Modified by); MODIFIED_DATE (byte_array, Modified date); MODIFIED_IN_INSTANCE (double, Modified in instance)

The FIA Collection has a number of items, each of which represents a different table stored in Parquet format.

In [2]:
list(fia.get_all_items())

[<Item id=tree_woodland_stems>,
 <Item id=tree_regional_biomass>,
 <Item id=tree_grm_midpt>,
 <Item id=tree_grm_estn>,
 <Item id=tree_grm_component>,
 <Item id=tree_grm_begin>,
 <Item id=tree>,
 <Item id=survey>,
 <Item id=subplot_regen>,
 <Item id=subplot>,
 <Item id=subp_cond_chng_mtrx>,
 <Item id=subp_cond>,
 <Item id=sitetree>,
 <Item id=seedling_regen>,
 <Item id=seedling>,
 <Item id=pop_stratum>,
 <Item id=pop_plot_stratum_assgn>,
 <Item id=pop_eval_typ>,
 <Item id=pop_eval_grp>,
 <Item id=pop_eval_attribute>,
 <Item id=pop_eval>,
 <Item id=pop_estn_unit>,
 <Item id=plotsnap>,
 <Item id=plot_regen>,
 <Item id=plotgeom>,
 <Item id=plot>,
 <Item id=p2veg_subp_structure>,
 <Item id=p2veg_subplot_spp>,
 <Item id=invasive_subplot_spp>,
 <Item id=dwm_visit>,
 <Item id=dwm_transect_segment>,
 <Item id=dwm_residual_pile>,
 <Item id=dwm_microplot_fuel>,
 <Item id=dwm_fine_woody_debris>,
 <Item id=dwm_duff_litter_fuel>,
 <Item id=dwm_coarse_woody_debris>,
 <Item id=county>,
 <Item id=cond_
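
Each item also describes its table's schema through the `table:columns` property from the STAC table extension, so you can decide which columns to request before reading any data. The following is a minimal sketch that uses a hand-copied excerpt of the `tree_woodland_stems` item's metadata rather than a live request; `columns_of_type` is a hypothetical helper. With a real item, the same list should be available as `item.properties["table:columns"]`.

```python
# Excerpt of the "table:columns" property copied from the tree_woodland_stems
# item shown earlier (only a few of its columns are included here).
properties = {
    "table:columns": [
        {"name": "CN", "type": "int64", "description": "Sequence number"},
        {"name": "INVYR", "type": "int64", "description": "Inventory year"},
        {"name": "DIA", "type": "double", "description": "Woodland stem diameter"},
        {"name": "CREATED_DATE", "type": "byte_array", "description": "Created date"},
    ]
}


def columns_of_type(props: dict, type_name: str) -> list:
    """Return the names of the columns whose declared type equals type_name."""
    return [c["name"] for c in props.get("table:columns", []) if c["type"] == type_name]


# Print a compact schema listing: name, type, description.
for column in properties["table:columns"]:
    print(f'{column["name"]:<14}{column["type"]:<12}{column["description"]}')

# Select only the integer columns, e.g. to pass as `columns=` when reading.
int_columns = columns_of_type(properties, "int64")
print(int_columns)
```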

To load a single table, get its item and extract the `href` from the `data` asset. The "boundary" table, which provides information about subplots, is relatively small and doesn't contain a geospatial geometry column, so it can be read with pandas.

In [3]:
import pandas as pd

boundary = fia.get_item(id="boundary")
asset = boundary.assets["data"]

df = pd.read_parquet(
    asset.href,
    storage_options=asset.extra_fields["table:storage_options"],
    columns=["CN", "AZMLEFT", "AZMCORN"],
)
df.head()

| `__null_dask_index__` | CN | AZMLEFT | AZMCORN |
|---|---|---|---|
| 0 | 204719190010854 | 259 | 0 |
| 1 | 204719188010854 | 33 | 0 |
| 2 | 204719189010854 | 52 | 0 |
| 3 | 204719192010854 | 322 | 0 |
| 4 | 204719191010854 | 325 | 0 |


There are a few important pieces to highlight:

1. As usual with the Planetary Computer, we signed the STAC item so that we could access the data. Here that happened automatically through the `modifier=planetary_computer.sign_inplace` argument passed when opening the catalog. See [Using tokens for data access](https://planetarycomputer.microsoft.com/docs/concepts/sas/) for more.
2. We relied on the asset to provide all the information necessary to load the data, such as the `href` and the `storage_options`. All we needed to know was the ID of the Collection and the Item.
3. We used pandas' and Parquet's ability to select subsets of the data with the `columns` keyword.
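
The second point above can be made concrete with a small helper that turns asset metadata into keyword arguments for `pandas.read_parquet`. This is a sketch under assumptions: `parquet_read_kwargs` is a hypothetical name, and the asset is represented by a plain dictionary holding the same `href` and `table:storage_options` fields (with a placeholder credential) so that the snippet runs without network access.

```python
from typing import Optional


def parquet_read_kwargs(asset: dict, columns: Optional[list] = None) -> dict:
    """Build keyword arguments for pandas.read_parquet from asset metadata."""
    kwargs = {
        "path": asset["href"],
        # Account name plus the SAS credential that signing adds to the asset.
        "storage_options": asset.get("table:storage_options", {}),
    }
    if columns is not None:
        # Requesting specific columns lets the reader skip the others.
        kwargs["columns"] = list(columns)
    return kwargs


# Hand-written stand-in for a signed asset; the credential is a placeholder.
asset = {
    "href": "abfs://cpdata/raw/fia/tree_woodland_stems.parquet",
    "table:storage_options": {
        "account_name": "cpdataeuwest",
        "credential": "<sas-token>",
    },
}

kwargs = parquet_read_kwargs(asset, columns=["CN", "DIA"])
print(kwargs)
# With pandas and adlfs installed and a real signed asset:
#     df = pd.read_parquet(**kwargs)
```

Keeping the path and storage options together means the calling code never hard-codes a storage account or container name; everything comes from the STAC metadata.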

Larger datasets can be read using [Dask](https://dask.org/). For example, the `cpdata/raw/fia/tree.parquet` folder contains about 160 individual Parquet files, totalling about 22 million rows. In this case, pass the path to the directory to `dask.dataframe.read_parquet`.

In [4]:
import dask.dataframe as dd

tree = fia.get_item(id="tree")
asset = tree.assets["data"]

df = dd.read_parquet(
    asset.href,
    storage_options=asset.extra_fields["table:storage_options"],
    columns=["SPCD", "CARBON_BG", "CARBON_AG"],
    engine="pyarrow",
)
df

Dask DataFrame structure:

| npartitions=160 | SPCD | CARBON_BG | CARBON_AG |
|---|---|---|---|
| | int64 | float64 | float64 |
| | ... | ... | ... |
| ... | ... | ... | ... |
| | ... | ... | ... |
| | ... | ... | ... |


That lazily loads the data into a Dask DataFrame. We can operate on the DataFrame with pandas-like methods, and call `.compute()` to get the result. In this case, we'll compute the average amount of carbon sequestered above and below ground for each tree, grouped by species type. To cut down on execution time we'll select just the first partition.

In [5]:
result = df.get_partition(0).groupby("SPCD").mean().compute()  # group by species
result

| SPCD | CARBON_BG | CARBON_AG |
|---|---|---|
| 43 | 37.864937 | 165.753430 |
| 67 | 3.549734 | 14.679764 |
| 68 | 9.071253 | 39.108406 |
| 107 | 19.321549 | 84.096184 |
| 110 | 29.964395 | 130.956288 |
| ... | ... | ... |
| 973 | 4.632913 | 22.658887 |
| 975 | 38.988846 | 202.220124 |
| 976 | 25.385733 | 130.583668 |
| 993 | 12.570365 | 64.712301 |
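
Because only the first partition was used, the means above describe that partition rather than the full table. The pure-Python sketch below illustrates the idea with made-up numbers: each partition contributes per-group sums and counts, and a mean over all partitions comes from combining them. The data and the `partial_sums` and `combine_means` helpers are hypothetical and are not Dask's implementation.

```python
from collections import defaultdict

# Two made-up partitions of (species code, above-ground carbon) rows.
partitions = [
    [(43, 100.0), (43, 200.0), (67, 10.0)],
    [(43, 300.0), (67, 30.0), (67, 50.0)],
]


def partial_sums(rows):
    """Return per-group [sum, count] pairs for one partition."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return acc


def combine_means(partials):
    """Merge per-partition [sum, count] pairs and return the group means."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (group_sum, group_count) in partial.items():
            total[key][0] += group_sum
            total[key][1] += group_count
    return {key: s / n for key, (s, n) in total.items()}


first_only = combine_means([partial_sums(partitions[0])])
overall = combine_means([partial_sums(p) for p in partitions])
print(first_only)  # means from the first partition only
print(overall)     # means across both partitions
```

In the notebook, dropping `.get_partition(0)` from the expression would compute the means over every partition, at the cost of reading all of the files.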


### Geospatial parquet datasets

The `us-census` collection has some items that include a `geometry` column, so they can be loaded with `geopandas`. All Parquet datasets hosted by the Planetary Computer with one or more geospatial columns use the [GeoParquet](https://github.com/opengeospatial/geoparquet) standard for encoding the geospatial metadata.
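
GeoParquet stores this metadata as a JSON document under the `geo` key of the Parquet file metadata, which is how readers such as geopandas can tell which column holds geometries. The sketch below parses a hand-written, abbreviated example of such a document; the sample values and the `geometry_columns` helper are illustrative assumptions, not metadata read from the `us-census` files.

```python
import json

# Abbreviated, hand-written example of a GeoParquet "geo" metadata document.
geo_metadata = json.dumps(
    {
        "version": "1.0.0",
        "primary_column": "geometry",
        "columns": {
            "geometry": {"encoding": "WKB", "geometry_types": ["MultiPolygon"]},
        },
    }
)


def geometry_columns(raw: str) -> tuple:
    """Return (primary geometry column, sorted list of all geometry columns)."""
    meta = json.loads(raw)
    return meta["primary_column"], sorted(meta["columns"])


primary, all_columns = geometry_columns(geo_metadata)
print(primary, all_columns)
```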

In [6]:
import geopandas

item = catalog.get_collection("us-census").get_item("2020-cb_2020_us_state_500k")

asset = item.assets["data"]
df = geopandas.read_parquet(
    asset.href, storage_options=asset.extra_fields["table:storage_options"]
)
df.head()

| | STATEFP | STATENS | AFFGEOID | GEOID | STUSPS | NAME | LSAD | ALAND | AWATER | geometry |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 66 | 1802705 | 0400000US66 | 66 | GU | Guam | 0 | 543555847 | 934337453 | MULTIPOLYGON (((144.64538 13.23627, 144.64716 ... |
| 1 | 48 | 1779801 | 0400000US48 | 48 | TX | Texas | 0 | 676680588914 | 18979352230 | MULTIPOLYGON (((-94.71830 29.72885, -94.71721 ... |
| 2 | 55 | 1779806 | 0400000US55 | 55 | WI | Wisconsin | 0 | 140292246684 | 29343721650 | MULTIPOLYGON (((-86.95617 45.35549, -86.95463 ... |
| 3 | 44 | 1219835 | 0400000US44 | 44 | RI | Rhode Island | 0 | 2677759219 | 1323691129 | MULTIPOLYGON (((-71.28802 41.64558, -71.28647 ... |
| 4 | 36 | 1779796 | 0400000US36 | 36 | NY | New York | 0 | 122049520861 | 19256750161 | MULTIPOLYGON (((-72.03683 41.24984, -72.03496 ... |


With this, we can visualize the boundaries for the contiguous United States by filtering out Alaska, Hawaii, and the territories.

In [7]:
import contextily

# Postal codes for Alaska, Hawaii, and the territories, which are excluded from the plot
drop = ["GU", "AK", "MP", "HI", "VI", "PR", "AS"]
ax = df[~df.STUSPS.isin(drop)].plot()
contextily.add_basemap(ax, crs=df.crs.to_string())

### Learn more

This quickstart briefly introduced how to access tabular data on the Planetary Computer. For more, see:

* The [pandas documentation](https://pandas.pydata.org/docs/user_guide/io.html#parquet) for an introduction to Parquet
* [Scale with Dask](https://planetarycomputer.microsoft.com/docs/quickstarts/scale-with-dask/) for more on using Dask to work with large datasets in parallel
* The [Forest Inventory and Analysis](https://planetarycomputer.microsoft.com/dataset/fia) catalog page