# Data storage and formats

Large datasets are typically stored in cloud object storage, which is designed to hold massive files for long periods of time.

The largest providers are Amazon S3, Google Cloud Storage, and Azure Data Lake. 

In the previous notebook, we read our data from Google Cloud Storage. 

In [11]:
import warnings
warnings.filterwarnings("ignore")

## Cloud storage as file systems

- You can access cloud object storage through a file-system-like interface
- We use `gcsfs`: https://gcsfs.readthedocs.io/en/latest/
- It lets you read and write data from Python
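`gcsfs` builds on the `fsspec` interface, so the same `ls` and `open` calls work on other `fsspec` backends. The sketch below uses the in-memory backend so it runs without credentials; the bucket name, file name, and contents are hypothetical and only stand in for real cloud objects.

```python
import fsspec

# In-memory stand-in for a bucket. With gcsfs you would create
# gcsfs.GCSFileSystem() instead and keep the same method calls.
fs = fsspec.filesystem("memory")

# Hypothetical path and contents, for illustration only.
path = "demo-bucket/csv/flights_april_2003.csv"
fs.makedirs("demo-bucket/csv", exist_ok=True)
fs.pipe(path, b"YEAR,MONTH,OP_UNIQUE_CARRIER\n2003,4,AA\n")

# List a "folder" and read the first line of a file, as on a local disk.
listing = fs.ls("demo-bucket/csv", detail=False)

with fs.open(path, "r") as f:
    header = f.readline()

print(listing)
print(header)
```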

In [2]:
import json
import gcsfs

In [4]:
fs = gcsfs.GCSFileSystem()

fs.ls("quansight-datasets/airline-ontime-performance")

### Your turn: Open the above folders to view the contents

In [5]:
# Your code here

In [6]:
fs.ls("quansight-datasets/airline-ontime-performance/csv/")

['quansight-datasets/airline-ontime-performance/csv',
 'quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2003.csv',
 'quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2004.csv',
 'quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2005.csv',
 'quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2006.csv',
 'quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2007.csv',
 'quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2008.csv',
 'quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2009.csv',
 'quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2010.csv',
 'quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2011.csv',
 'quansight-datasets/airline-ontime-performanc

### Your turn: Read a line from one of the CSV files

In [None]:
# Your code here

In [7]:
with fs.open("quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2003.csv", "r") as f:
    print(f.readline())

YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_FIPS,ORIGIN_STATE_NM,ORIGIN_WAC,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_FIPS,DEST_STATE_NM,DEST_WAC,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DELAY_NEW,DEP_DEL15,DEP_DELAY_GROUP,DEP_TIME_BLK,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DELAY_NEW,ARR_DEL15,ARR_DELAY_GROUP,ARR_TIME_BLK,CANCELLED,CANCELLATION_CODE,DIVERTED,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,FIRST_DEP_TIME,TOTAL_ADD_GTIME,LONGEST_ADD_GTIME,DIV_AIRPORT_LANDINGS,DIV_REACHED_DEST,DIV_ACTUAL_ELAPSED_TIME,DIV_ARR_DELAY,DIV_DISTANCE,DIV1_AIRPORT,DIV1_AIRPORT_ID,DIV1_AIRPORT_SEQ_ID,DIV1_WHEE

## Start a Dask Gateway cluster

In [2]:
import dask_gateway

gateway = dask_gateway.Gateway()

In [3]:
options = gateway.cluster_options(use_local_defaults=False)
options


In [4]:
cluster = gateway.new_cluster(options)
cluster


In [5]:
client = cluster.get_client()
client

Connection method: Cluster object
Cluster type: dask_gateway.GatewayCluster
Dashboard: https://nebari.quansight.dev/gateway/clusters/dev.0d931c1585be4d1db905975400ab38e7/status


## Read CSV data from Google Cloud Storage

pandas and Dask can both read directly from cloud storage. Their readers accept a `storage_options` argument, whose contents are passed on to `GCSFileSystem`.
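A minimal sketch of passing `storage_options` might look like the following. The helper name `read_airline_csv` is hypothetical, and `{"token": "anon"}` (anonymous access in `gcsfs`) is only an example value; whether it works depends on how the bucket is configured, which this notebook does not state.

```python
import json


def read_airline_csv(dtypes_path="prep/dtypes.json", storage_options=None):
    """Lazily read the airline CSV files with Dask (hypothetical helper)."""
    # Imported here so that defining the helper does not require Dask.
    import dask.dataframe as dd

    with open(dtypes_path, "r") as f:
        dtypes = json.load(f)

    # Everything in `storage_options` is forwarded to GCSFileSystem.
    return dd.read_csv(
        "gcs://quansight-datasets/airline-ontime-performance/csv/*",
        dtype=dtypes,
        storage_options=storage_options,
    )


# Example call (assumes the bucket allows anonymous reads):
# ddf = read_airline_csv(storage_options={"token": "anon"})
```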

In [8]:
import json

with open('prep/dtypes.json', 'r') as f:
    dtypes = json.load(f)

In [6]:
import dask.dataframe as dd

In [None]:
%%time

ddf = dd.read_csv("gcs://quansight-datasets/airline-ontime-performance/csv/*", dtype=dtypes)

In [22]:
%%time

ddf.head()

CPU times: user 104 ms, sys: 17 ms, total: 121 ms
Wall time: 4.09 s


Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,...,DIV4_WHEELS_OFF,DIV4_TAIL_NUM,DIV5_AIRPORT,DIV5_AIRPORT_ID,DIV5_AIRPORT_SEQ_ID,DIV5_WHEELS_ON,DIV5_TOTAL_GTIME,DIV5_LONGEST_GTIME,DIV5_WHEELS_OFF,DIV5_TAIL_NUM
0,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,
1,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,
2,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,
3,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,
4,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,
5,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,
6,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,
7,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,
8,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,
9,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,


In [33]:
ddf.groupby('MONTH').OP_UNIQUE_CARRIER.count().compute()

MONTH
4     10266703
8     11060873
12    10098575
2      9655805
1     10326940
7     11258172
6     10148619
3     11086120
5     10378138
11    10215500
10    10787660
9     10397216
Name: OP_UNIQUE_CARRIER, dtype: int64

## Understanding Parquet

- Columnar format
- Stores metadata (like datatypes, column names, and ranges per partition)
- Can do parallel read/write, so we can leverage distributed computing
- Can be laid out efficiently on disk (partitioned by any column or columns)
- And more: filter rows and set the index or dtypes while reading data!
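The points above can be illustrated with a toy, pure-Python model. This is not how Parquet is implemented; it only sketches the idea of storing values column by column and keeping min/max ranges per partition so that a reader can skip partitions that cannot match a filter. All names and data are made up for the example.

```python
# Toy illustration only; real Parquet files are binary and far more compact.
rows = [
    {"YEAR": 2003, "MONTH": 4, "OP_UNIQUE_CARRIER": "AA"},
    {"YEAR": 2003, "MONTH": 4, "OP_UNIQUE_CARRIER": "DL"},
    {"YEAR": 2004, "MONTH": 5, "OP_UNIQUE_CARRIER": "AA"},
    {"YEAR": 2004, "MONTH": 6, "OP_UNIQUE_CARRIER": "UA"},
]


def to_columnar(records):
    """Turn a list of row dicts into one list of values per column."""
    return {name: [r[name] for r in records] for name in records[0]}


# Two "partitions", each stored column by column with min/max statistics.
partitions = []
for chunk in (rows[:2], rows[2:]):
    columns = to_columnar(chunk)
    stats = {"MONTH": (min(columns["MONTH"]), max(columns["MONTH"]))}
    partitions.append({"columns": columns, "stats": stats})


def count_month(partitions, month):
    """Count rows for one month, touching only the MONTH column."""
    total, scanned = 0, 0
    for part in partitions:
        low, high = part["stats"]["MONTH"]
        if not (low <= month <= high):
            continue  # the metadata says this partition cannot match
        scanned += 1
        total += part["columns"]["MONTH"].count(month)
    return total, scanned


total, scanned = count_month(partitions, 4)
print(total, scanned)
```

Only one of the two partitions is scanned, and the other columns are never read, which is the intuition behind column selection and filtering at read time.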

## Read Parquet data from Google Cloud Storage

In [10]:
%%time

# Note: no need to specify dtypes; Parquet stores them in its metadata

ddf_pq = dd.read_parquet("gcs://quansight-datasets/airline-ontime-performance/full_dataset.parquet")

CPU times: user 184 ms, sys: 27.6 ms, total: 211 ms
Wall time: 3.41 s


In [27]:
%%time

ddf_pq.head(10)

CPU times: user 79.4 ms, sys: 1.95 ms, total: 81.4 ms
Wall time: 1.01 s


Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,...,DIV4_WHEELS_OFF,DIV4_TAIL_NUM,DIV5_AIRPORT,DIV5_AIRPORT_ID,DIV5_AIRPORT_SEQ_ID,DIV5_WHEELS_ON,DIV5_TOTAL_GTIME,DIV5_LONGEST_GTIME,DIV5_WHEELS_OFF,DIV5_TAIL_NUM
0,2003,2,4,1,2,2003-04-01,AA,19805,AA,,...,,,,,,,,,,
1,2003,2,4,1,2,2003-04-01,AA,19805,AA,,...,,,,,,,,,,
2,2003,2,4,1,2,2003-04-01,AA,19805,AA,,...,,,,,,,,,,
3,2003,2,4,1,2,2003-04-01,AA,19805,AA,,...,,,,,,,,,,
4,2003,2,4,1,2,2003-04-01,AA,19805,AA,,...,,,,,,,,,,
5,2003,2,4,1,2,2003-04-01,AA,19805,AA,,...,,,,,,,,,,
6,2003,2,4,1,2,2003-04-01,AA,19805,AA,,...,,,,,,,,,,
7,2003,2,4,1,2,2003-04-01,AA,19805,AA,,...,,,,,,,,,,
8,2003,2,4,1,2,2003-04-01,AA,19805,AA,,...,,,,,,,,,,
9,2003,2,4,1,2,2003-04-01,AA,19805,AA,,...,,,,,,,,,,


In [34]:
%%time

ddf_pq.groupby('MONTH').OP_UNIQUE_CARRIER.count().compute()

CPU times: user 258 ms, sys: 43.4 ms, total: 301 ms
Wall time: 3min 8s


MONTH
4     10266703
8     11060873
12    10098575
2      9655805
1     10326940
7     11258172
6     10148619
3     11086120
5     10378138
11    10215500
10    10787660
9     10397216
Name: OP_UNIQUE_CARRIER, dtype: int64

### Read specific columns

In [29]:
%%time

ddf_pq_five_cols = dd.read_parquet("gcs://quansight-datasets/airline-ontime-performance/full_dataset.parquet",
                                  columns= ['MONTH', 'DAY_OF_MONTH', 'OP_UNIQUE_CARRIER'])

CPU times: user 252 ms, sys: 1.16 ms, total: 253 ms
Wall time: 447 ms


In [32]:
ddf_pq_five_cols.groupby('MONTH').OP_UNIQUE_CARRIER.count().compute()

MONTH
4     10266703
8     11060873
12    10098575
2      9655805
1     10326940
7     11258172
6     10148619
3     11086120
5     10378138
11    10215500
10    10787660
9     10397216
Name: OP_UNIQUE_CARRIER, dtype: int64

## Convert from CSV to Parquet

* Use `to_parquet()` from pandas or Dask
* Use `pyarrow`, one of the engines for Parquet workflows: https://arrow.apache.org/docs/python/csv.html

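```python
import dask.dataframe as dd

# Lazily read all CSV files from cloud storage
ddf = dd.read_csv("path_to_csv_files_on_cloud_storage")

# Write Parquet; `partition_on` takes a list of column names to partition by
ddf.to_parquet("path_to_cloud_storage_location", partition_on=["column_name"])
```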

Ref: `scripts/csv_to_parquet.ipynb`

## Sidenotes

* Zarr for multidimensional array workflows
* Snowflake for SQL-like operations

## Best practices

### Data has gravity

- Always move compute to the data
- Data transfer is often the biggest bottleneck
- Moreover, moving data between clouds can get tricky
- Downloading data locally and then computing will be slow, even for small amounts of data

### Blob storage when possible

- Use blob/object storage when possible
- It is optimized for long-term storage
- It is optimized for parallel reads and writes

### Format based on workflow

- Data should be partitioned and structured around your workflows
- Formats such as Parquet and Zarr can partition or chunk data, respectively, to match those workflows
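To make the partitioning idea concrete, here is a toy sketch of a directory-per-value layout (`MONTH=4/`, `MONTH=5/`), similar in spirit to what Dask's `partition_on` produces for Parquet. It writes small CSV files with the standard library purely to show the layout; the helper names and the data are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

rows = [
    {"MONTH": "4", "OP_UNIQUE_CARRIER": "AA"},
    {"MONTH": "4", "OP_UNIQUE_CARRIER": "DL"},
    {"MONTH": "5", "OP_UNIQUE_CARRIER": "AA"},
]


def write_partitioned(rows, root, key):
    """Write rows into one `key=value` directory per distinct value."""
    for row in rows:
        part_dir = Path(root) / f"{key}={row[key]}"
        part_dir.mkdir(parents=True, exist_ok=True)
        part_file = part_dir / "part-0.csv"
        is_new = not part_file.exists()
        with part_file.open("a", newline="") as f:
            # The partition column lives in the directory name, not the file.
            fields = [name for name in row if name != key]
            writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
            if is_new:
                writer.writeheader()
            writer.writerow(row)


def read_partition(root, key, value):
    """Read only the directory that matches `key=value`."""
    part_file = Path(root) / f"{key}={value}" / "part-0.csv"
    with part_file.open(newline="") as f:
        return list(csv.DictReader(f))


with tempfile.TemporaryDirectory() as root:
    write_partitioned(rows, root, "MONTH")
    directories = sorted(p.name for p in Path(root).iterdir())
    april = read_partition(root, "MONTH", "4")

print(directories)
print(april)
```

A workflow that always filters by month only ever opens one directory here, which is why the partition column should follow the queries you run most often.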

TODO: Link to Coiled's partitioning/filtering blog post

In [None]:
client.shutdown()

---

## Next

Introduction to interactive visualization!