## Warming up: what will you need going forward with Planetary Computer Hub?

The Planetary Computer Hub development environment relies on open-source tools to work with the data. To welcome anyone who is less familiar with using Python tools for academic research, we present an overview of those tools along with some reference guides.

### Tools landscape

### Retrieving collections

#### PySTAC

[PySTAC](https://pystac.readthedocs.io/en/latest/) is a library for working with [SpatioTemporal Asset Catalogs](https://stacspec.org/) (STAC) in Python 3.
On Planetary Computer, the main purpose of PySTAC is to serve as an efficient crawler of STAC catalogs.
It lets us manipulate the metadata locally before loading the assets.

A comprehensive example of how it is used on Planetary Computer can be found in the [Reading Data from the STAC API](../quickstarts/reading-stac.ipynb) quickstart.

For getting the specifics on the PySTAC library, refer to the [official documentation](https://pystac.readthedocs.io/en/latest/).

#### PySTAC-Client

> [PySTAC-Client](https://pystac-client.readthedocs.io/en/latest/index.html#) builds upon PySTAC library to add support for STAC APIs in addition to static STAC catalogs.

On Planetary Computer, [PySTAC-Client](https://pystac-client.readthedocs.io/en/latest/index.html#) serves as a key integration point with Planetary Computer's STAC API by abstracting the queries and filters used when working with the API.
This way, we can search the API by metadata and locate only the collections we are interested in.

An explanation of consuming the Planetary Computer API and how it translates to using PySTAC-Client can be found in the [Using the Planetary Computer’s Data API](../quickstarts/using-the-data-api.ipynb) quickstart.

For getting the specifics on the PySTAC-Client library, refer to the [official documentation](https://pystac-client.readthedocs.io/en/latest/index.html).

### Data manipulation

#### Pandas

[Pandas](https://pandas.pydata.org/docs/index.html) is the go-to library for handling [tabular data](https://en.wikipedia.org/wiki/Table_(information)) in a neat, consistent and programmatic way.
It's a great library loaded with data analysis and manipulation tools.

An example of reading tabular data using Pandas can be found in the [Reading Tabular Data](https://planetarycomputer.microsoft.com/docs/quickstarts/reading-tabular-data/) quickstart.
Another quick example of how to read any STAC asset, stored in JSON format, using Pandas can be found in the [Bulk STAC item queries with GeoParquet](../quickstarts/stac-geoparquet.ipynb#expanding-nested-fields) quickstart.

[Project Pythia's introduction to Pandas](https://foundations.projectpythia.org/core/pandas/pandas.html) covers the basics of what you will need, such as slicing and performing an exploratory analysis on _DataFrames_ and _Series_.
For everything beyond that, the [Pandas official documentation](https://pandas.pydata.org/docs/index.html) is thorough.
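
To give a flavor of slicing and exploratory analysis, here is a small sketch on a made-up table. The column names and values are invented for illustration only.

```python
import pandas as pd

# Hypothetical station table; all values are illustrative.
df = pd.DataFrame(
    {
        "station": ["a", "b", "c", "d"],
        "elevation_m": [10.0, 250.0, 1200.0, 35.0],
        "temperature_c": [15.2, 11.8, 3.4, 14.9],
    }
)

# Selecting one column yields a Series.
temps = df["temperature_c"]

# Boolean slicing keeps only the rows that satisfy a condition.
high = df[df["elevation_m"] > 100]

# Quick exploratory summary of the numeric columns.
summary = df.describe()
mean_temp = temps.mean()

print(high)
print(summary)
```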

#### GeoPandas

> [GeoPandas](https://geopandas.org/en/stable/index.html), as the name suggests, extends the popular data science library pandas by adding support for geospatial data.

We use GeoPandas because it lets us handle points, lines, curves and polygons (a.k.a. [vectors](https://en.wikipedia.org/wiki/Vector_graphics)).
We can then reference these shapes, along with their associated data, to a [Coordinate Reference System](https://en.wikipedia.org/wiki/Spatial_reference_system) (CRS).
CRSs are often identified by [EPSG](https://en.wikipedia.org/wiki/EPSG_Geodetic_Parameter_Dataset) codes, which represent different projections of our planet.
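
As an illustrative sketch, the snippet below attaches a CRS to two points and reprojects them to another EPSG code. The place names and coordinates are made up for the example; EPSG:4326 (longitude/latitude) and EPSG:3857 (Web Mercator) are two commonly used codes.

```python
import geopandas as gpd
from shapely.geometry import Point

# Two hypothetical points, given as (longitude, latitude) in EPSG:4326.
gdf = gpd.GeoDataFrame(
    {"name": ["null_island", "seattle"]},
    geometry=[Point(0.0, 0.0), Point(-122.33, 47.61)],
    crs="EPSG:4326",
)

# Reproject the same shapes to Web Mercator; the attribute columns are kept.
projected = gdf.to_crs(epsg=3857)

print(gdf.crs.to_epsg(), projected.crs.to_epsg())
print(projected)
```

The geometries describe the same locations before and after; only the coordinate values change with the projection.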

GeoPandas is often used for mapping areas of interest, such as regions, cities, forests or lakes.
One such example can be found in the [Bulk STAC item queries with GeoParquet](../quickstarts/stac-geoparquet.ipynb) quickstart.
Points or smaller areas can also be used to track objects or natural phenomena over time.
An example of this kind of usage is the [Visualizing Hurricane Florence](../tutorials/hurricane-florence-animation.ipynb) tutorial.

For getting the specifics on the GeoPandas library, refer to the [official documentation](https://geopandas.org/en/stable/index.html).



#### Xarray

[Xarray](https://docs.xarray.dev/en/stable/index.html) is our tool of choice for dealing with images and other [raster](https://en.wikipedia.org/wiki/Raster_graphics) data sources.

> **[Why Xarray?](https://docs.xarray.dev/en/stable/getting-started-guide/why-xarray.html)**
> 
> Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience.

Xarray borrows heavily from Pandas, which makes us more productive by offering a familiar, widely used interface.
It also integrates tightly with Dask, which in turn boosts our computing power through parallel processing.
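
The sketch below shows what labeled dimensions and coordinates buy us on a small synthetic array. The band names, coordinate values and `units` attribute are invented for illustration.

```python
import numpy as np
import xarray as xr

# Synthetic raster: 2 bands over a 3 x 4 grid, filled with 0..23.
data = np.arange(24, dtype=float).reshape(2, 3, 4)

da = xr.DataArray(
    data,
    dims=("band", "y", "x"),
    coords={
        "band": ["red", "nir"],
        "y": [10.0, 20.0, 30.0],
        "x": [100.0, 110.0, 120.0, 130.0],
    },
    attrs={"units": "reflectance"},  # hypothetical attribute
)

# Select by label instead of by integer position.
red = da.sel(band="red")
pixel = da.sel(band="nir", y=20.0, x=110.0)

# Reduce over a named dimension instead of an axis number.
band_mean = da.mean(dim="band")

print(red.shape, float(pixel), band_mean.dims)
```

Selecting `band="nir"` rather than index `1` keeps the code readable and avoids mistakes when the order of the dimensions changes.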

A quick example of using Xarray can be found in the [Reading Zarr Data](../quickstarts/reading-zarr-data.ipynb) quickstart.

For those looking for an introduction to _DataArrays_ and _Datasets_ with Xarray, we recommend heading to [Project Pythia's introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) given its bias towards science.
Nonetheless, it's always good to keep [Xarray's official documentation](https://docs.xarray.dev/en/stable/index.html) in mind for browsing its full spectrum.

#### StackSTAC

[StackSTAC](https://stackstac.readthedocs.io/en/latest/index.html) is a library that can be leveraged to easily transform a PySTAC item's assets into an Xarray DataArray with some extra magic along the way:

- Figure out the geospatial parameters from the PySTAC metadata
- Transfer PySTAC metadata into Xarray coordinates for easy indexing
- Bootstrap the DataArray to be lazy and harness the computing boost provided by Dask

A quick example of how it is used on Planetary Computer can be found in the [Cloudless Mosaic](../tutorials/cloudless-mosaic-sentinel2.ipynb) tutorial.

For getting the specifics on the StackSTAC library, refer to the [official documentation](https://stackstac.readthedocs.io/en/latest/index.html).


### Computing power

#### Dask

[Dask](https://www.dask.org/get-started) helps us process big chunks of data by breaking them apart into smaller pieces.
A fundamental characteristic of Dask is that it is lazily evaluated, which has two key consequences:

- The results won't be computed until we call `.compute()`
- We can still preview the structure and type of the data before anything is computed
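
A minimal sketch of lazy evaluation with a Dask array follows. The array shape and chunk size are arbitrary values picked for the example.

```python
import dask.array as da

# A 2000 x 2000 array of ones, split into 500 x 500 chunks (16 pieces).
x = da.ones((2000, 2000), chunks=(500, 500))

# Building the expression is instant: nothing has been computed yet.
total = (x + 1).sum()

# The preview shows shape, dtype and chunk layout without any values.
print(x)
print(x.numblocks, x.dtype)

# Only this call runs the work, chunk by chunk and potentially in parallel.
result = total.compute()
print(result)
```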

> **[Briefly, what problem does Dask solve for us?](https://docs.dask.org/en/stable/faq.html#briefly-what-problem-does-dask-solve-for-us)**
> 
> Dask is a general purpose parallel programming solution. As such it is used in many different ways.
>
> However, the most common problem that Dask solves is connecting Python analysts to distributed hardware, particularly for data science and machine learning workloads. [...]

See [Scale With Dask](../quickstarts/scale-with-dask.ipynb) for more on using Dask.
For getting the specifics on the Dask library, refer to the [official documentation](https://docs.dask.org/en/stable/).



