# Data storage and formats

Large datasets are typically stored in cloud object storage, which is designed to hold massive files for long periods of time. The most widely used services are Amazon S3, Google Cloud Storage, and Azure Data Lake Storage.

In the previous notebook, we read our data from Google Cloud Storage. 

In [1]:
import warnings
warnings.filterwarnings("ignore")

## Cloud storage as file systems

- You can access cloud object storage through a file-system-like interface
- We use `gcsfs`: https://gcsfs.readthedocs.io/en/latest/
- It lets you read and write data from Python

In [2]:
import json
import gcsfs

In [5]:
# Load the service-account credentials; the context manager closes the file
with open("prep/credentials.json") as f:
    token = json.load(f)
fs = gcsfs.GCSFileSystem(token=token)

In [6]:
fs.ls("quansight-datasets/airline-ontime-performance")

['quansight-datasets/airline-ontime-performance/csv',
 'quansight-datasets/airline-ontime-performance/full_dataset.parquet',
 'quansight-datasets/airline-ontime-performance/parquet_by_year']

### Your turn: List the contents of the folders above

In [5]:
# Your code here

In [6]:
fs.ls("quansight-datasets/airline-ontime-performance/csv/")

['quansight-datasets/airline-ontime-performance/csv',
 'quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2003.csv',
 'quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2004.csv',
 'quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2005.csv',
 'quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2006.csv',
 'quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2007.csv',
 'quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2008.csv',
 'quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2009.csv',
 'quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2010.csv',
 'quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2011.csv',
 'quansight-datasets/airline-ontime-performanc

### Your turn: Read a line from one of the CSV files

In [None]:
# Your code here

In [7]:
with fs.open("quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2003.csv", "r") as f:
    print(f.readline())

YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_FIPS,ORIGIN_STATE_NM,ORIGIN_WAC,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_FIPS,DEST_STATE_NM,DEST_WAC,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DELAY_NEW,DEP_DEL15,DEP_DELAY_GROUP,DEP_TIME_BLK,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DELAY_NEW,ARR_DEL15,ARR_DELAY_GROUP,ARR_TIME_BLK,CANCELLED,CANCELLATION_CODE,DIVERTED,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,FIRST_DEP_TIME,TOTAL_ADD_GTIME,LONGEST_ADD_GTIME,DIV_AIRPORT_LANDINGS,DIV_REACHED_DEST,DIV_ACTUAL_ELAPSED_TIME,DIV_ARR_DELAY,DIV_DISTANCE,DIV1_AIRPORT,DIV1_AIRPORT_ID,DIV1_AIRPORT_SEQ_ID,DIV1_WHEE

## Download CSV data (subset) from Google Cloud Storage

In [7]:
import pandas as pd

In [8]:
with fs.open("quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2003.csv", "r") as f:
    df = pd.read_csv(f)

In [9]:
df.head()

Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,...,DIV4_WHEELS_OFF,DIV4_TAIL_NUM,DIV5_AIRPORT,DIV5_AIRPORT_ID,DIV5_AIRPORT_SEQ_ID,DIV5_WHEELS_ON,DIV5_TOTAL_GTIME,DIV5_LONGEST_GTIME,DIV5_WHEELS_OFF,DIV5_TAIL_NUM
0,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,
1,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,
2,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,
3,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,
4,2003,2,4,1,2,4/1/2003 12:00:00 AM,AA,19805,AA,,...,,,,,,,,,,


You can also read from cloud storage directly in pandas (and Dask) by passing a `gcs://` URL. The `storage_options` argument takes arguments that are passed on to `GCSFileSystem`.

In [10]:
storage_options = {"token": token}

In [11]:
files = [f"gcs://{f}" for f in fs.glob("quansight-datasets/airline-ontime-performance/csv/*2022.csv")]

In [12]:
%%time

df_list = []

for file in files:
    df_temp = pd.read_csv(file, storage_options=storage_options)
    df_list.append(df_temp)

CPU times: user 58.5 s, sys: 18.7 s, total: 1min 17s
Wall time: 8min 10s


In [13]:
df_csv = pd.concat(df_list)
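
The read-in-a-loop-then-concatenate pattern above can be tried without cloud access. The sketch below is only an illustration: it uses two tiny in-memory "files" with made-up values in place of the monthly CSVs, and shows that `pd.concat` keeps each file's own row labels unless you pass `ignore_index=True`.

```python
import io

import pandas as pd

# Two tiny in-memory "files" standing in for the monthly CSVs (made-up values)
csv_files = [
    io.StringIO("MONTH,OP_UNIQUE_CARRIER\n4,9E\n4,AA\n"),
    io.StringIO("MONTH,OP_UNIQUE_CARRIER\n5,9E\n5,DL\n"),
]

# Same pattern as the loop above: one DataFrame per file
df_list = [pd.read_csv(f) for f in csv_files]

# Default: each file keeps its own 0..n-1 row labels, so labels repeat
stacked = pd.concat(df_list)

# ignore_index=True assigns a fresh 0..N-1 index across all rows
renumbered = pd.concat(df_list, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Repeated row labels are harmless for `len()` or `head()`, but they matter if you later select rows by label with `.loc`.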

In [14]:
df_csv.head()

Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,...,DIV4_WHEELS_OFF,DIV4_TAIL_NUM,DIV5_AIRPORT,DIV5_AIRPORT_ID,DIV5_AIRPORT_SEQ_ID,DIV5_WHEELS_ON,DIV5_TOTAL_GTIME,DIV5_LONGEST_GTIME,DIV5_WHEELS_OFF,DIV5_TAIL_NUM
0,2022,2,4,1,5,4/1/2022 12:00:00 AM,9E,20363,9E,N132EV,...,,,,,,,,,,
1,2022,2,4,1,5,4/1/2022 12:00:00 AM,9E,20363,9E,N133EV,...,,,,,,,,,,
2,2022,2,4,1,5,4/1/2022 12:00:00 AM,9E,20363,9E,N133EV,...,,,,,,,,,,
3,2022,2,4,1,5,4/1/2022 12:00:00 AM,9E,20363,9E,N133EV,...,,,,,,,,,,
4,2022,2,4,1,5,4/1/2022 12:00:00 AM,9E,20363,9E,N133EV,...,,,,,,,,,,


In [15]:
len(df_csv)

6172030

## Understanding Parquet

- Columnar format
- Stores metadata (such as data types, column names, and value ranges per partition)
- Supports parallel reads and writes, so we can leverage distributed computing
- Can be stored in an efficient layout (partitioned by any columns)
- And more: you can filter rows and set the index or dtypes while reading the data!
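
To make the first two points concrete, here is a toy model in plain Python. It is a sketch of the idea only, not the real Parquet file layout: the `RowGroup` class, the helper functions, and the flight records are all invented for illustration. Values are stored column by column in chunks (Parquet calls such chunks row groups), and each chunk records the minimum and maximum of every column so that a reader can skip chunks that cannot match a filter.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A chunk of rows stored column by column, plus min/max statistics."""
    columns: dict  # column name -> list of values
    stats: dict    # column name -> (min, max)


def make_row_groups(rows, group_size):
    """Split row-oriented records into column-oriented row groups."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append(RowGroup(columns, stats))
    return groups


def read_column(groups, column, where, low, high):
    """Return `column` values for rows with low <= row[where] <= high.

    Row groups whose [min, max] range for `where` cannot overlap
    [low, high] are skipped without scanning their values.
    """
    values, skipped = [], 0
    for group in groups:
        gmin, gmax = group.stats[where]
        if gmax < low or gmin > high:
            skipped += 1
            continue
        for key, value in zip(group.columns[where], group.columns[column]):
            if low <= key <= high:
                values.append(value)
    return values, skipped


# Hypothetical flight records, sorted by MONTH
rows = [
    {"MONTH": 1, "OP_UNIQUE_CARRIER": "AA", "DEP_DELAY": 5},
    {"MONTH": 1, "OP_UNIQUE_CARRIER": "9E", "DEP_DELAY": -2},
    {"MONTH": 4, "OP_UNIQUE_CARRIER": "AA", "DEP_DELAY": 12},
    {"MONTH": 4, "OP_UNIQUE_CARRIER": "DL", "DEP_DELAY": 0},
    {"MONTH": 7, "OP_UNIQUE_CARRIER": "AA", "DEP_DELAY": 30},
    {"MONTH": 7, "OP_UNIQUE_CARRIER": "9E", "DEP_DELAY": 7},
]
groups = make_row_groups(rows, group_size=2)
delays, skipped = read_column(groups, "DEP_DELAY", where="MONTH", low=4, high=4)
print(delays, skipped)
```

Only the `MONTH` and `DEP_DELAY` lists are ever touched, and two of the three chunks are ruled out from their statistics alone. A row-oriented text format such as CSV offers neither shortcut, because every line has to be read and parsed in full.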

## Download Parquet data (subset) from Google Cloud Storage

In [16]:
files = [f"gcs://{f}" for f in fs.glob("quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2003/*")]

In [17]:
%%time

df_list = []

for file in files:
    df_temp = pd.read_parquet(file, storage_options=storage_options)
    df_list.append(df_temp)

CPU times: user 23.1 s, sys: 8.28 s, total: 31.4 s
Wall time: 2min 21s


In [18]:
df_parq = pd.concat(df_list)

In [19]:
df_parq.head()

Unnamed: 0,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,...,DIV4_WHEELS_OFF,DIV4_TAIL_NUM,DIV5_AIRPORT,DIV5_AIRPORT_ID,DIV5_AIRPORT_SEQ_ID,DIV5_WHEELS_ON,DIV5_TOTAL_GTIME,DIV5_LONGEST_GTIME,DIV5_WHEELS_OFF,DIV5_TAIL_NUM
0,3,7,1,2,2003-07-01,AA,19805,AA,,1094,...,,,,,,,,,,
1,3,7,1,2,2003-07-01,AA,19805,AA,,1415,...,,,,,,,,,,
2,3,7,1,2,2003-07-01,AA,19805,AA,,1548,...,,,,,,,,,,
3,3,7,1,2,2003-07-01,AA,19805,AA,,1599,...,,,,,,,,,,
4,3,7,1,2,2003-07-01,AA,19805,AA,,1843,...,,,,,,,,,,


In [20]:
len(df_parq)

4321462

## Comparing performance

- With Parquet, you can read only the columns you need
- **More in the Dask notebook!**

In [22]:
%%time

df_list = []

for file in files:
    df_temp = pd.read_parquet(file, 
                              columns=['MONTH', 'DAY_OF_MONTH', 'OP_UNIQUE_CARRIER'],
                              storage_options=storage_options)
    df_list.append(df_temp)

CPU times: user 3.44 s, sys: 5.69 s, total: 9.12 s
Wall time: 1min 56s


## Convert from CSV to Parquet

* Use pandas or Dask: `to_parquet()`
* Use `pyarrow`, one of the engines for Parquet workflows: https://arrow.apache.org/docs/python/csv.html

```python
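import dask.dataframe as dd

# Lazily read all the CSV files (placeholder path)
ddf = dd.read_csv("path_to_csv_files_on_cloud_storage")

# Write Parquet, partitioned on the columns you choose (YEAR is an example)
ddf.to_parquet("path_to_cloud_storage_location", partition_on=["YEAR"])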

```

Ref: `scripts/csv_to_parquet.ipynb`

## Sidenotes

* Zarr for multidimensional array workflows
* Snowflake for SQL-like operations

## Best practices

### Data has gravity

- Always move compute to the data
- Data transfer is often the biggest bottleneck
- Moreover, moving data between clouds can get tricky
- Downloading data locally and then computing on it can be slow, even for modest amounts of data
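
A back-of-the-envelope estimate shows why transfer dominates. The sketch below is an idealised calculation with hypothetical numbers (a 50 GB dataset and two assumed link speeds); it ignores latency, protocol overhead, and throttling, so real transfers will take longer.

```python
def transfer_seconds(size_gb, bandwidth_mbps):
    """Idealised time to move `size_gb` gigabytes over a `bandwidth_mbps` link.

    Uses decimal units (1 GB = 8000 megabits) and ignores latency,
    protocol overhead, and throttling.
    """
    return size_gb * 8000 / bandwidth_mbps


# Hypothetical numbers: the same 50 GB dataset over two different links
home_link = transfer_seconds(50, bandwidth_mbps=100)         # assumed home broadband
in_cloud_link = transfer_seconds(50, bandwidth_mbps=10_000)  # assumed same-region VM

print(f"100 Mbit/s: {home_link / 60:.1f} min")
print(f"10 Gbit/s:  {in_cloud_link:.0f} s")
```

Under these assumptions the download takes over an hour on the slower link and well under a minute next to the data, before any computation has started.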

### Blob storage when possible

- Use blob/object storage when possible
- It is optimized for long-term storage
- It is optimized for parallel reads and writes

### Format based on workflow

- Data should be partitioned/structured based on your workflows
- Formats such as Parquet and Zarr support partitioning and chunking, respectively, so you can lay data out to match your specific workflows
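
The `parquet_by_year/YEAR=2003/` folders used earlier in this notebook are an example of a layout that matches a workflow: the partition value is encoded in the directory name, so a reader that only needs one year can pick the right files from their paths without opening any of them. The sketch below illustrates that idea in plain Python. The helper functions and the file names are hypothetical; libraries such as Dask and `pyarrow` do this pruning for you.

```python
def parse_partitions(path):
    """Extract key=value directory segments from a path into a dict."""
    parts = {}
    for segment in path.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts


def prune(paths, **wanted):
    """Keep only paths whose partition values match every requested key."""
    return [
        p for p in paths
        if all(parse_partitions(p).get(k) == str(v) for k, v in wanted.items())
    ]


# Hypothetical listing; the file names are invented for illustration
paths = [
    "parquet_by_year/YEAR=2003/part.0.parquet",
    "parquet_by_year/YEAR=2003/part.1.parquet",
    "parquet_by_year/YEAR=2004/part.0.parquet",
    "parquet_by_year/YEAR=2022/part.0.parquet",
]
selected = prune(paths, YEAR=2003)
print(selected)
```

If most of your queries filter by carrier rather than by year, partitioning on the carrier column would make the same trick work for them instead.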

TODO: Link to Coiled's partitioning/filtering blog post

---

## Next

Introduction to interactive visualization!