# Data storage and formats

In this notebook, we will learn about cloud storage and the Parquet format for storing tabular data.

---

Large datasets are typically stored on cloud object storage, which is designed to:
* store massive files,
* retain them for long periods of time, and
* support parallel I/O operations.

Some of the largest providers of object storage are Amazon S3, Google Cloud Storage, and Azure Data Lake. In this tutorial, we're accessing data stored on Google Cloud Storage.

## Data has gravity

It's almost always better to move your computations to the data than the other way around. Data transfer is typically the biggest bottleneck, so downloading data to a local machine and then computing on it can be very slow, even for relatively small datasets.

If your data is stored locally (for example, on hard drives), you should consider a local / on-premise cluster setup.

If your data is stored on the cloud, you can spin up a cluster on the same cloud. Note that moving data between cloud providers can also get challenging.
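
To get a feel for why transfers dominate, here is a rough back-of-the-envelope estimate. It is only a sketch: the dataset size and both bandwidth figures are hypothetical numbers chosen for illustration, and real transfers also pay for latency and protocol overhead.

```python
def transfer_time_seconds(size_gb: float, bandwidth_mbit_per_s: float) -> float:
    """Estimate how long it takes to move `size_gb` gigabytes over a link."""
    size_megabits = size_gb * 1000 * 8  # 1 GB = 1000 MB, 1 byte = 8 bits
    return size_megabits / bandwidth_mbit_per_s


# Hypothetical numbers, chosen only for illustration.
dataset_gb = 50
home_link = transfer_time_seconds(dataset_gb, bandwidth_mbit_per_s=100)
in_cloud_link = transfer_time_seconds(dataset_gb, bandwidth_mbit_per_s=10_000)

print(f"Home connection (100 Mbit/s): {home_link / 60:.1f} min")
print(f"Same-cloud link (10 Gbit/s):  {in_cloud_link:.0f} s")
```

Under these assumptions, the download alone takes about an hour on the slower link, before any computation starts.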

## Cloud storage as file systems

Libraries like `s3fs` and `gcsfs` allow you to access the data with a Python interface. In this tutorial, [we're using `gcsfs`](https://gcsfs.readthedocs.io/en/latest/):

In [None]:
import gcsfs

In [None]:
fs = gcsfs.GCSFileSystem()

We're accessing public datasets, but you can also pass tokens for private buckets: `GCSFileSystem(token=your_token)`. 

You can now take a look at the storage bucket through a file-system-like interface:

In [None]:
fs.ls("quansight-datasets/airline-ontime-performance")

### 💻 Your turn: Open the above folders to view the contents

In [None]:
# Your code here. When ready, click on the three dots below for the solution.

In [None]:
fs.ls("quansight-datasets/airline-ontime-performance/csv/")

### 💻 Your turn: Read a line from one of the CSV files

Hint: How would you do it if it was a local file?

In [None]:
# Your code here. When ready, click on the three dots below for the solution.

In [None]:
with fs.open("quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2003.csv", "r") as f:
    print(f.readline())
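
The object returned by `fs.open` behaves like a regular Python file object, so the usual file idioms carry over. The sketch below illustrates this offline: `io.StringIO` stands in for the remote file, and the data rows are made up (only the column names `YEAR`, `MONTH`, and `OP_UNIQUE_CARRIER` appear in this notebook).

```python
import csv
import io

# Stand-in for the object returned by `fs.open(..., "r")`.
# The rows below are invented for illustration.
fake_remote_file = io.StringIO(
    "YEAR,MONTH,OP_UNIQUE_CARRIER\n"
    "2003,4,AA\n"
    "2003,4,DL\n"
)

with fake_remote_file as f:
    header = f.readline().rstrip("\n").split(",")  # first line: column names
    first_row = next(csv.reader(f))                # next line, parsed as CSV

print(header)
print(first_row)
```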

## Start a Dask Gateway cluster

This time, let's specify the various options explicitly:

In [None]:
import dask_gateway

gateway = dask_gateway.Gateway()

options = gateway.cluster_options(use_local_defaults=False)
options.profile = "Medium Worker"
options.conda_environment = "global/global-data-of-unusual-size"

cluster = gateway.new_cluster(options)

cluster.adapt(minimum=5, maximum=10)

client = cluster.get_client()

client

Make sure to open the following plots: Cluster map, task stream, progress, and workers memory!

## CSV data format

We'll load the CSV files again, as we did in the previous notebook. Note how long the various operations take:

In [None]:
import json

with open('prep/dtypes.json', 'r') as f:
    dtypes = json.load(f)

In [None]:
import dask.dataframe as dd

In [None]:
%%time

ddf = dd.read_csv("gcs://quansight-datasets/airline-ontime-performance/csv/*", dtype=dtypes) # Wall time: 20.2 s

In [None]:
%%time 

ddf.head() # Wall time: 8.62 s

### 💻 Your turn: Compute the number of unique flights taken each year

Make sure to time it, and watch the dashboard plots!

In [None]:
# Your code here. When ready, click on the three dots below for the solution.

In [None]:
%%time

ddf.groupby('YEAR').OP_UNIQUE_CARRIER.count().compute() # Wall time: 4min 17s

## Parquet data format

[Apache Parquet](https://parquet.apache.org/) is a columnar data format widely used for storing large tabular datasets.

### Parquet I/O

Parquet is very efficient to store and access thanks to its compression and encoding schemes, and it stores metadata such as data types, column names, and value ranges per file/partition.

In the following cells, we read the full Parquet dataset. Notice that it's faster and that we did not need to specify data types explicitly.

In [None]:
%%time

ddf_pq = dd.read_parquet("gcs://quansight-datasets/airline-ontime-performance/full_dataset.parquet") # Wall time: 1.05 s

In [None]:
%%time

ddf_pq.head() # Wall time: 1.16 s

### 💻 Your turn: Perform the same computation as earlier to compute the number of unique flights taken each year

Time it again and compare against the previous value!

In [None]:
# Your code here. When ready, click on the three dots below for the solution.

In [None]:
%%time

ddf_pq.groupby('YEAR').OP_UNIQUE_CARRIER.count().compute() # Wall time: 14.1 s

### Read specific columns

Because Parquet is a column-oriented format, you can read only the columns you need, further improving efficiency:

In [None]:
%%time

ddf_pq_subset = dd.read_parquet("gcs://quansight-datasets/airline-ontime-performance/full_dataset.parquet",
                                  columns= ['YEAR', 'OP_UNIQUE_CARRIER']) # Wall time: 400 ms

In [None]:
%%time

ddf_pq_subset.groupby('YEAR').OP_UNIQUE_CARRIER.count().compute() # Wall time: 12.9 s
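
To see why selecting columns helps, here is a toy comparison of a row-oriented and a column-oriented layout in plain Python. It is a simplified sketch, not how Parquet is implemented: the records are made up, and "values touched" is only a stand-in for the amount of data a reader has to scan.

```python
# Three made-up flight records; only the column names come from this notebook.
rows = [
    {"YEAR": 2019, "MONTH": 1, "OP_UNIQUE_CARRIER": "AA"},
    {"YEAR": 2019, "MONTH": 2, "OP_UNIQUE_CARRIER": "DL"},
    {"YEAR": 2020, "MONTH": 1, "OP_UNIQUE_CARRIER": "AA"},
]

# Row-oriented layout (CSV-like): values of one record are stored together.
row_layout = [list(r.values()) for r in rows]

# Column-oriented layout (Parquet-like): values of one column are stored together.
column_layout = {name: [r[name] for r in rows] for name in rows[0]}


def values_touched_row_layout(wanted):
    """Every record is scanned in full, whatever columns are wanted."""
    return sum(len(record) for record in row_layout)


def values_touched_column_layout(wanted):
    """Only the requested columns are scanned."""
    return sum(len(column_layout[name]) for name in wanted)


wanted = ["YEAR", "OP_UNIQUE_CARRIER"]
print(values_touched_row_layout(wanted))     # 9 values scanned
print(values_touched_column_layout(wanted))  # 6 values scanned
```

The gap grows with the number of columns you leave out, which is why it matters most on wide tables.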

### Partitioned storage

Parquet files can be stored with a partitioning schema that works best for your computation.

It's useful to take the time to partition your dataset based on your workflows (partition structure, as well as number of partitions). Dask can partition your DataFrame accordingly when you read the data.

Here, we've partitioned the dataset by `YEAR`:

In [None]:
fs.ls("quansight-datasets/airline-ontime-performance/parquet_by_year")

In [None]:
fs.ls("quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2022")
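
Directory names such as `YEAR=2022` encode the partition column and its value (a convention often called Hive-style partitioning), so a reader can tell which rows a directory holds without opening any file. The sketch below shows the idea with a small, hypothetical helper; `parse_partition_path` and the file name `part.0.parquet` are illustrative assumptions, not part of Dask or `gcsfs`.

```python
def parse_partition_path(path: str) -> dict:
    """Extract `key=value` directory components from a dataset path."""
    partitions = {}
    for part in path.strip("/").split("/"):
        if "=" in part:
            key, value = part.split("=", 1)
            partitions[key] = value
    return partitions


# The directory name is taken from the listing above; the file name
# `part.0.parquet` is a hypothetical example.
path = "quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2022/part.0.parquet"
print(parse_partition_path(path))
```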

### Row-wise filtering

Parquet also stores the ranges of values present in each file and partition, so you can efficiently filter Parquet datasets row-wise while reading the data.

For example, suppose we want to exclude 2020 because of its unique impact on the airline industry:

In [None]:
ddf_pq_subset = dd.read_parquet("gcs://quansight-datasets/airline-ontime-performance/full_dataset.parquet",
                                  columns= ['YEAR', 'OP_UNIQUE_CARRIER'],
                                  filters = [[('YEAR', '!=', 2020)]])

In [None]:
ddf_pq_subset.groupby('YEAR').OP_UNIQUE_CARRIER.count().compute()
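
The following toy sketch illustrates how stored value ranges let a reader skip data for a filter like `YEAR != 2020`. The chunk names and statistics are hypothetical and do not describe how this dataset is actually laid out; the point is only the decision rule.

```python
# Hypothetical per-chunk statistics, similar in spirit to the min/max
# ranges that Parquet keeps in its metadata. The numbers are made up.
chunk_stats = [
    {"name": "chunk-0", "min_year": 2018, "max_year": 2019},
    {"name": "chunk-1", "min_year": 2020, "max_year": 2020},
    {"name": "chunk-2", "min_year": 2020, "max_year": 2021},
    {"name": "chunk-3", "min_year": 2022, "max_year": 2022},
]


def can_skip_for_not_equal(stats, excluded_year):
    """A chunk can be skipped for `YEAR != excluded_year` only if every
    row in it has YEAR == excluded_year, i.e. min == max == excluded_year."""
    return stats["min_year"] == stats["max_year"] == excluded_year


to_read = [s["name"] for s in chunk_stats if not can_skip_for_not_equal(s, 2020)]
print(to_read)
```

Here `chunk-1` is skipped entirely, while `chunk-2` still has to be read because it mixes 2020 with 2021 rows. This is one reason why partitioning by the column you filter on makes such filters more effective.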

#### 💻 Your turn: Group by month instead of year, and only read and count unique flights for Q4 of each year

In [None]:
# Your code here. When ready, click on the three dots below for the solution.

In [None]:
ddf_pq_q4 = dd.read_parquet("gcs://quansight-datasets/airline-ontime-performance/full_dataset.parquet",
                                  columns= ['MONTH', 'OP_UNIQUE_CARRIER'],
                                  filters = [[('MONTH', '>=', 10)]])  # Q4 = October to December

ddf_pq_q4.groupby('MONTH').OP_UNIQUE_CARRIER.count().compute()

## Convert from CSV to Parquet

You can convert CSV files to Parquet in two main ways:

- Dask (and pandas) provide `to_parquet()`. You can also partition the data while converting.
- You can use powerful Parquet engines like [`pyarrow`](https://arrow.apache.org/docs/python/csv.html) or `fastparquet` directly (Dask and pandas use these engines internally).

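```python

# You do not need to execute this code

import dask.dataframe as dd

# Lazily read the CSV files, then write them out as a Parquet dataset
ddf = dd.read_csv("path_to_csv_files_on_cloud_storage")
# `partition_on` takes the column(s) to partition by; "YEAR" is an example
ddf.to_parquet("path_to_cloud_storage_location", partition_on=["YEAR"])
```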

We created the Parquet dataset using Dask, and our code is available in `scripts/csv_to_parquet.ipynb`.

## Notable mentions

* If you're working with multidimensional arrays, [Zarr](https://zarr.readthedocs.io/en/stable/index.html) is an excellent format to store chunked array data (similar to partitioning, but along multiple dimensions).
* If you expect your workflows to have SQL-like query operations, storing your data in [Snowflake](https://www.snowflake.com/en/) can be a good option.
* [Creating Disk Partitioned Lakes with Dask using partition_on](https://www.coiled.io/blog/dask-disk-partition-on), a blog post by Coiled, has some valuable best practices.

---
## Next →

[Big data analysis with Dask](./05-big-data-analysis-with-dask.ipynb)!