# Data storage and formats

Large datasets are typically stored in cloud object storage, which is designed to:
* store massive files,
* retain them for long periods of time, and
* support parallel I/O operations.

Some of the largest providers of object storage are Amazon S3, Google Cloud Storage, and Azure Data Lake. In this tutorial, we access data stored on Google Cloud Storage.

## Data has gravity

It's almost always better to move your computations to the data than the other way around. Data transfer is typically the biggest bottleneck, so downloading data to a local machine before computing on it can be very slow, even for relatively small amounts of data.

If your data is stored locally (for example, on hard drives), you should consider a local / on-prem cluster setup.

If your data is stored on the cloud, you can spin up a cluster on the same cloud. Note that moving data between cloud providers can also get challenging.
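
As a rough, back-of-the-envelope illustration of why transfers tend to dominate, the sketch below estimates how long it takes to move a dataset over a network link. The dataset size and both bandwidth figures are hypothetical, chosen only to show the order of magnitude, and `transfer_seconds` is a helper made up for this example.

```python
def transfer_seconds(size_gb: float, bandwidth_mbps: float) -> float:
    """Estimate the time needed to move `size_gb` gigabytes over a `bandwidth_mbps` link."""
    size_megabits = size_gb * 1000 * 8  # 1 GB = 1000 MB, 1 byte = 8 bits
    return size_megabits / bandwidth_mbps


# Hypothetical numbers, for illustration only
for label, mbps in [("home connection", 100), ("same-cloud network", 10_000)]:
    minutes = transfer_seconds(50, mbps) / 60
    print(f"50 GB over a {mbps} Mbit/s {label}: about {minutes:.1f} minutes")
```

Real throughput depends on many more factors (latency, parallelism, file sizes), so treat the result as a lower bound rather than a prediction.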

## Cloud storage as file systems

Libraries like `s3fs` and `gcsfs` let you access the data through a Python interface. In this tutorial, we're using [`gcsfs`](https://gcsfs.readthedocs.io/en/latest/):

In [1]:
import gcsfs

In [2]:
fs = gcsfs.GCSFileSystem()

We're accessing public datasets, but you can also pass tokens for private buckets: `GCSFileSystem(token=your_token)`. 

You can now browse the storage bucket through a file-system-like interface:

In [3]:
fs.ls("quansight-datasets/airline-ontime-performance")

['quansight-datasets/airline-ontime-performance/csv',
 'quansight-datasets/airline-ontime-performance/full_dataset.parquet',
 'quansight-datasets/airline-ontime-performance/parquet_by_year',
 'quansight-datasets/airline-ontime-performance/sorted']

### Your turn: Open the above folders to view the contents

In [None]:
# Your code here

In [None]:
fs.ls("quansight-datasets/airline-ontime-performance/csv/")

### Your turn: Read a line from one of the CSV files

In [None]:
# Your code here

In [None]:
with fs.open("quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2003.csv", "r") as f:
    print(f.readline())

## Start a Dask Gateway cluster

Start a cluster as we learned in the previous notebook:

In [23]:
import dask_gateway

gateway = dask_gateway.Gateway()

options = gateway.cluster_options(use_local_defaults=False)
options.profile = "Medium Worker"
options.conda_environment = "global/global-pycon2023"

cluster = gateway.new_cluster(options)

cluster.adapt(minimum=1, maximum=20)

client = cluster.get_client()

client

Connection method: Cluster object
Cluster type: dask_gateway.GatewayCluster
Dashboard: https://nebari.quansight.dev/gateway/clusters/dev.ed3d755e9f574c51b7d48be97901eb37/status


Make sure to open the following dashboard plots: cluster map, task stream, progress bar, and workers memory.

## CSV data format

We'll read the CSV files again, as we did in the previous notebook. Note the time the various operations take:

In [24]:
import json

with open('prep/dtypes.json', 'r') as f:
    dtypes = json.load(f)

In [25]:
import dask.dataframe as dd

In [26]:
%%time

ddf = dd.read_csv("gcs://quansight-datasets/airline-ontime-performance/csv/*", dtype=dtypes)

CPU times: user 1.73 s, sys: 184 ms, total: 1.92 s
Wall time: 20.7 s


In [27]:
%%time

ddf.head()

CPU times: user 111 ms, sys: 7.38 ms, total: 118 ms
Wall time: 5.41 s


Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,...,DIV4_WHEELS_OFF,DIV4_TAIL_NUM,DIV5_AIRPORT,DIV5_AIRPORT_ID,DIV5_AIRPORT_SEQ_ID,DIV5_WHEELS_ON,DIV5_TOTAL_GTIME,DIV5_LONGEST_GTIME,DIV5_WHEELS_OFF,DIV5_TAIL_NUM
0,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,
1,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,
2,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,
3,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,
4,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,


### Your turn: Compute the number of unique flights taken each year

Make sure to time it, and watch the dashboard plots!

In [None]:
# Your code here

In [30]:
%%time

ddf.groupby('YEAR').OP_UNIQUE_CARRIER.count().compute()

CPU times: user 253 ms, sys: 21.3 ms, total: 275 ms
Wall time: 2min 40s


YEAR
2003    6456575
2004    7129270
2005    7140596
2006    7141922
2007    7455458
2008    7009726
2009    6450285
2010    6450117
2011    5848283
2012    6096762
2013    6316788
2014    5819811
2015    4985448
2016    5573520
2017    5674621
2018    5976879
2019    7422037
2020    4605296
2021    5954897
2022    6172030
Name: OP_UNIQUE_CARRIER, dtype: int64

## Parquet data format

[Apache Parquet](https://parquet.apache.org/) is a columnar data format widely used for storing large tabular datasets.

### Parquet I/O

Parquet is efficient to store and access thanks to its compression and encoding, and it stores metadata such as data types, column names, and value ranges per file/partition.

In the following cells, we read the full Parquet dataset. Notice that it's faster and that we don't need to specify the data types explicitly.

In [31]:
%%time

ddf_pq = dd.read_parquet("gcs://quansight-datasets/airline-ontime-performance/full_dataset.parquet")

CPU times: user 266 ms, sys: 16.5 ms, total: 282 ms
Wall time: 1.62 s


In [32]:
%%time

ddf_pq.head()

CPU times: user 75.9 ms, sys: 5.28 ms, total: 81.2 ms
Wall time: 1.51 s


Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,...,DIV4_WHEELS_OFF,DIV4_TAIL_NUM,DIV5_AIRPORT,DIV5_AIRPORT_ID,DIV5_AIRPORT_SEQ_ID,DIV5_WHEELS_ON,DIV5_TOTAL_GTIME,DIV5_LONGEST_GTIME,DIV5_WHEELS_OFF,DIV5_TAIL_NUM
0,2003,2,4,1,2,2003-04-01,AA,19805,AA,,...,,,,,,,,,,
1,2003,2,4,1,2,2003-04-01,AA,19805,AA,,...,,,,,,,,,,
2,2003,2,4,1,2,2003-04-01,AA,19805,AA,,...,,,,,,,,,,
3,2003,2,4,1,2,2003-04-01,AA,19805,AA,,...,,,,,,,,,,
4,2003,2,4,1,2,2003-04-01,AA,19805,AA,,...,,,,,,,,,,


### Your turn: Perform the same computation as earlier to compute the number of unique flights taken each year

Time it again and compare against the previous value!

In [None]:
# Your code here

In [None]:
%%time

ddf_pq.groupby('YEAR').OP_UNIQUE_CARRIER.count().compute()

### Read specific columns

Because Parquet is a column-oriented format, you can read only the columns you need, further improving efficiency:

In [38]:
%%time

ddf_pq_five_cols = dd.read_parquet(
    "gcs://quansight-datasets/airline-ontime-performance/full_dataset.parquet",
    columns=['YEAR', 'OP_UNIQUE_CARRIER'],
)

CPU times: user 300 ms, sys: 4.5 ms, total: 305 ms
Wall time: 524 ms


In [39]:
%%time

ddf_pq_five_cols.groupby('YEAR').OP_UNIQUE_CARRIER.count().compute()

CPU times: user 92 ms, sys: 7.38 ms, total: 99.4 ms
Wall time: 12.5 s


YEAR
2003    6456575
2004    7129270
2005    7140596
2006    7141922
2007    7455458
2008    7009726
2009    6450285
2010    6450117
2011    5848283
2012    6096762
2013    6316788
2014    5819811
2015    4985448
2016    5573520
2017    5674621
2018    5976879
2019    7422037
2020    4605296
2021    5954897
2022    6172030
Name: OP_UNIQUE_CARRIER, dtype: int64

### Partitioned storage

Parquet files can be stored with a partitioning schema that works best for your computation.

It's useful to take the time to partition your dataset based on your workflows (partition structure, as well as number of partitions). Dask can partition your DataFrame accordingly when you read the data.

Here, we've partitioned the dataset by `YEAR`:

In [40]:
fs.ls("quansight-datasets/airline-ontime-performance/parquet_by_year")

['quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2003',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2004',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2005',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2006',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2007',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2008',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2009',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2010',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2011',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2012',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2013',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2014',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2015',
 'quansight-

In [41]:
fs.ls("quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2022")

['quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2022/part.200.parquet',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2022/part.250.parquet',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2022/part.251.parquet',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2022/part.252.parquet',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2022/part.307.parquet',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2022/part.308.parquet',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2022/part.309.parquet',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2022/part.359.parquet',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2022/part.360.parquet',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2022/part.361.parquet',
 'quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2
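
The `YEAR=2022` directory names follow the `key=value` (Hive-style) layout, in which the partition column's value is encoded in the path rather than inside each file. As a minimal sketch of how a reader could recover that value without opening any file, consider the helper below; `partition_values` is a hypothetical function written for this example, not part of Dask or `gcsfs`.

```python
from pathlib import PurePosixPath


def partition_values(file_path: str) -> dict:
    """Collect the key=value directory components of a partitioned file path."""
    values = {}
    for part in PurePosixPath(file_path).parts[:-1]:  # skip the file name itself
        key, sep, value = part.partition("=")
        if sep:  # only directories of the form key=value
            values[key] = value
    return values


path = "quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2022/part.200.parquet"
print(partition_values(path))  # {'YEAR': '2022'}
```

This is why a computation that touches a single year only needs to list and read one directory.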

### Row-wise filtering

Parquet also stores the ranges of values present in each file and partition, so you can efficiently filter Parquet datasets row-wise while reading the data.

For example, suppose we want to exclude 2020 because of its unique impact on the airline industry:

In [42]:
ddf_pq_five_cols = dd.read_parquet(
    "gcs://quansight-datasets/airline-ontime-performance/full_dataset.parquet",
    columns=['YEAR', 'OP_UNIQUE_CARRIER'],
    filters=[[('YEAR', '!=', 2020)]],
)

In [44]:
ddf_pq_five_cols.groupby('YEAR').OP_UNIQUE_CARRIER.count().compute()

YEAR
2003    6456575
2004    7129270
2005    7140596
2006    7141922
2007    7455458
2008    7009726
2009    6450285
2010    6450117
2011    5848283
2012    6096762
2013    6316788
2014    5819811
2015    4985448
2016    5573520
2017    5674621
2018    5976879
2019    7422037
2021    5954897
2022    6172030
Name: OP_UNIQUE_CARRIER, dtype: int64
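
The skipping that stored value ranges make possible can be sketched in plain Python. This is a simplified illustration under stated assumptions, not how any Parquet engine is actually implemented: `RowGroupStats` and `may_contain` are hypothetical names, and the ranges are made up.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for the min/max range stored for one chunk of rows."""
    name: str
    min_year: int
    max_year: int


def may_contain(stats: RowGroupStats, op: str, value: int) -> bool:
    """Return False only when the stored range proves that no row can match."""
    if op == "==":
        return stats.min_year <= value <= stats.max_year
    if op == "!=":
        # Safe to skip only if every row holds exactly the excluded value
        return not (stats.min_year == stats.max_year == value)
    raise ValueError(f"unsupported operator: {op}")


# Made-up ranges, for illustration only
chunks = [
    RowGroupStats("chunk-a", 2019, 2019),
    RowGroupStats("chunk-b", 2020, 2020),
    RowGroupStats("chunk-c", 2020, 2021),
]

to_read = [c.name for c in chunks if may_contain(c, "!=", 2020)]
print(to_read)  # ['chunk-a', 'chunk-c']
```

Here `chunk-b` is never read, while `chunk-c` still has to be read because its range only shows that it might contain other years; any 2020 rows inside it must be dropped after reading.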

#### Your turn: Group by month instead of year, and read and count unique flights only for Q4 of each year

In [45]:
# Your code here

## Convert from CSV to Parquet

You can convert CSV files to Parquet in two main ways:

- Dask (and pandas) have a `to_parquet()` function, which can also partition the data while converting.
- You can use powerful Parquet engines like [`pyarrow`](https://arrow.apache.org/docs/python/csv.html) or `fastparquet` directly (Dask and pandas use these engines internally).



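```python
import dask.dataframe as dd

# Read the CSV files lazily into a Dask DataFrame
ddf = dd.read_csv("path_to_csv_files_on_cloud_storage")

# Write Parquet files, optionally partitioned by one or more columns
# (for example, the YEAR column used in this tutorial)
ddf.to_parquet("path_to_cloud_storage_location", partition_on=["YEAR"])
```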

We created the Parquet dataset using Dask, and our code is available in `scripts/csv_to_parquet.ipynb`.

## Notable mentions

* If you're working with multi-dimensional arrays, [Zarr](https://zarr.readthedocs.io/en/stable/index.html) is an excellent format for storing chunked array data (similar to partitioning, but along multiple dimensions).
* If you expect your workflows to have SQL-like query operations, storing your data in [Snowflake](https://www.snowflake.com/en/) can be a good option.
* [Creating Disk Partitioned Lakes with Dask using partition_on](https://www.coiled.io/blog/dask-disk-partition-on), a blog post by Coiled, has some valuable best practices.

---
## Next

Big data analysis with Dask!