# Reading data from the cloud

**Author:** Sandro Campos | **[GitHub Issue](https://github.com/astronomy-commons/lsdb/issues/103)**

This notebook explores the behaviour of LSDB when running a [basic science workflow](https://github.com/swyatt7/ADASS_LSDB_tutorial/blob/main/nb/section3.ipynb) that requires pulling data from cloud buckets.

### What is the main goal of the workflow?

To load a cone of Gaia with a 5-degree radius, centered at (RA, Dec) = (30, 30) degrees, and crossmatch it with ZTF using a 0.1-degree radius, keeping only the nearest neighbor:

```python
gaia.cone_search(
    ra=30,
    dec=30,
    radius_arcsec=5*3600,
).crossmatch(
    ztf,
    n_neighbors=1,
    radius_arcsec=0.1*3600,
    require_right_margin=False,
)
```

### Which resources were used in the experiments?

We conducted experiments with:

- A local MacBook laptop.

- An Azure computing node running Linux (@ [LinccHub](https://lsst.dirac.dev)).

We accessed ZTF DR14 and Gaia DR3 on Epyc and on an ABFS bucket:

- **Epyc** is a self-hosted remote machine at UW. Catalogs are publicly available @ `https://epyc.astro.washington.edu/~lincc-frameworks`.

- **ABFS** (Azure Blob File System) hosts the same catalogs but requires authentication with an `account_name` and its corresponding `account_key`.

### Can we read catalogs from the cloud?

The answer is **yes**. We can read them by specifying either:

- a full catalog path with a valid protocol, e.g. HTTPS or ABFS; or

- the location of an Almanac, as demonstrated [here](../04_04/almanac.ipynb).
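As a rough illustration of the first option, the sketch below checks that a catalog path carries one of the two protocols tested here and bundles the credentials an ABFS bucket requires. The helper names, the `gaia_dr3` sub-path, and the container name are hypothetical and are not taken from the original notebook; how the credentials are then handed to LSDB depends on the installed version and is not shown.

```python
from urllib.parse import urlparse

# Protocols exercised in this notebook (assumption: others may work as well).
TESTED_PROTOCOLS = {"https", "abfs"}


def check_catalog_path(path: str) -> str:
    """Return the protocol of a full catalog path, or raise if it was not tested here."""
    protocol = urlparse(path).scheme
    if protocol not in TESTED_PROTOCOLS:
        raise ValueError(f"Unsupported or missing protocol in {path!r}")
    return protocol


def abfs_credentials(account_name: str, account_key: str) -> dict:
    """Bundle the two credentials that the ABFS bucket requires."""
    return {"account_name": account_name, "account_key": account_key}


# Hypothetical catalog locations: the sub-path and container are illustrative only.
epyc_path = "https://epyc.astro.washington.edu/~lincc-frameworks/gaia_dr3"
abfs_path = "abfs://hypothetical-container/gaia_dr3"

print(check_catalog_path(epyc_path))
print(check_catalog_path(abfs_path))
```

A path without a protocol, such as a plain local directory, is rejected by `check_catalog_path`, which makes a forgotten `https://` or `abfs://` prefix easy to spot before any data is requested.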

### Which metrics were collected?

The experiments gathered performance metrics for the following connections:

1. Local machine -> Epyc
2. Local machine -> ABFS
3. LinccHub -> Epyc
4. LinccHub -> ABFS

To keep the comparison as fair as possible, we configured the Dask Client identically for every run:

- `3 workers`, each with `1 thread` and `2GB memory`
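A minimal sketch of such a configuration with `dask.distributed` might look as follows. The keyword arguments are standard `Client`/`LocalCluster` options, but whether the original runs passed exactly these arguments is an assumption; `start_client` is a hypothetical helper.

```python
# Settings taken from the text: 3 workers, 1 thread each, 2GB of memory each.
CLIENT_SETTINGS = {
    "n_workers": 3,
    "threads_per_worker": 1,
    "memory_limit": "2GB",  # applied per worker
}


def start_client():
    """Start a local Dask cluster with the settings above (requires dask.distributed)."""
    # Imported lazily so the settings can be inspected without Dask installed.
    from dask.distributed import Client

    return Client(**CLIENT_SETTINGS)
```

Calling `client = start_client()` before the workflow and `client.close()` afterwards keeps the worker count, thread count, and memory limit the same across all four connections.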

### How does accessing data on ABFS compare to Epyc?

```python
from IPython.display import display, HTML
```

### Reading data from the local machine

```python
# Report for the local machine -> ABFS run
report_local_abfs = "results/local_abfs/compute_1.html"
display(HTML(report_local_abfs))
```

```python
# Report for the local machine -> Epyc run
report_local_epyc = "results/local_epyc/compute_1.html"
display(HTML(report_local_epyc))
```


### Reading data from LinccHub (Azure)

```python
# Report for the LinccHub -> ABFS run
report_azure_abfs = "results/azure_abfs/compute_1.html"
display(HTML(report_azure_abfs))
```

```python
# Report for the LinccHub -> Epyc run
report_azure_epyc = "results/azure_epyc/compute_1.html"
display(HTML(report_azure_epyc))
```

### Summary and brief conclusions

The following table presents a summary of the metrics collected:

| Machine  | Bucket | Duration (s) | Compute (s) | Read (s) | X-match (s) | Read (%) |
|----------|--------|--------------|-------------|----------|-------------|----------|
| Local    | Epyc   | 337.80       | 998         | 994      | 2.39        | 99.7     |
| Local    | ABFS   | 108.11       | 308.66      | 304.86   | 2.54        | 99.1     |
| LinccHub | Epyc   | 18.55        | 34.91       | 28.23    | 5.21        | 82.4     |
| LinccHub | ABFS   | 15.22        | 25.24       | 18       | 5.12        | 75.3     |

We can conclude that:

* We are able to load and process data hosted in the cloud (both the HTTPS and ABFS protocols have been successfully tested)!

* The **fastest connection is LinccHub -> ABFS** (15.22 s in total), which was expected, as both are hosted on Azure.

* The **slowest connections start at our local machine**. Reading data from Epyc (located on the west coast) carries the largest overhead: 337.80 s in total, against 18.55 s when the same data is read from LinccHub.

Nevertheless, it's important to remember that performance depends heavily on factors such as network latency, distance to the data servers, and their current workload.
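To put the summary table in relative terms, the sketch below expresses each connection's total duration as a multiple of the fastest one. The durations are copied from the table; the function name `slowdown_vs_fastest` is hypothetical.

```python
# Total durations in seconds, copied from the summary table.
durations = {
    ("Local", "Epyc"): 337.80,
    ("Local", "ABFS"): 108.11,
    ("LinccHub", "Epyc"): 18.55,
    ("LinccHub", "ABFS"): 15.22,
}


def slowdown_vs_fastest(durations: dict) -> dict:
    """Express every duration as a multiple of the shortest one, rounded to one decimal."""
    fastest = min(durations.values())
    return {connection: round(d / fastest, 1) for connection, d in durations.items()}


slowdowns = slowdown_vs_fastest(durations)

# Print the connections from fastest to slowest.
for (machine, bucket), factor in sorted(slowdowns.items(), key=lambda item: item[1]):
    print(f"{machine} -> {bucket}: {factor}x")
```

With these numbers, Local -> Epyc comes out roughly 22 times slower than LinccHub -> ABFS, and Local -> ABFS roughly 7 times slower, for this single set of runs.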