## Project to Upload Files to GCS using Pandas

As part of this series of lectures, we will see how to upload files to GCS using Python and Pandas. We will use `glob`, `os`, `json`, and `pandas` to build the application logic.

Here are the design details.
* First, we need to get the list of file names to upload from the local file system.
* To give each data set the right column names, we extract them from the **schemas.json** file in `data/retail_db`.
* Once we have the file names, we can use `pd.read_csv` with `names` to create a DataFrame and then write it to the target GCS location in the `parquet` file format.
* We will use a metadata-driven (data-driven) development approach to upload all the retail files to GCS.
* Blobs (files) in Parquet format will be named after the source file names.

In [None]:
!gsutil rm -r gs://airetail/retail_db_parquet

In [None]:
!gsutil ls gs://airetail/

In [None]:
import glob
import os

In [None]:
def get_file_names(src_base_dir):
    # Walk src_base_dir recursively and keep only the part-00000 data files.
    items = glob.glob(f'{src_base_dir}/**', recursive=True)
    return [item for item in items if os.path.isfile(item) and item.endswith('part-00000')]


In [None]:
src_base_dir = '../../data/retail_db'

In [None]:
get_file_names(src_base_dir)
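
To see what the filter keeps and what it skips, here is a minimal sketch that runs the same logic against a throwaway directory tree. The layout (two data set folders plus a `schemas.json` file) is a hypothetical stand-in for `data/retail_db`, and the results are sorted only to make the output predictable.

```python
import glob
import os
import tempfile


def get_file_names(src_base_dir):
    """Return every regular file named part-00000 under src_base_dir."""
    items = glob.glob(f'{src_base_dir}/**', recursive=True)
    return sorted(
        item for item in items
        if os.path.isfile(item) and item.endswith('part-00000')
    )


# Build a temporary tree (hypothetical layout, for illustration only).
with tempfile.TemporaryDirectory() as base_dir:
    for ds in ('orders', 'customers'):
        os.makedirs(os.path.join(base_dir, ds))
        with open(os.path.join(base_dir, ds, 'part-00000'), 'w') as f:
            f.write('1,sample\n')
    # This file does not end with part-00000, so the filter should skip it.
    with open(os.path.join(base_dir, 'schemas.json'), 'w') as f:
        f.write('{}')

    found = get_file_names(base_dir)
    # Report paths relative to the base directory for readability.
    relative = [
        os.path.relpath(path, base_dir).replace(os.sep, '/')
        for path in found
    ]

print(relative)
```

Directories and the `schemas.json` file are dropped, so only the two data files remain.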

In [None]:
import json

In [None]:
# Use a context manager so the file handle is closed after loading.
with open('../../data/retail_db/schemas.json') as schemas_fp:
    schemas = json.load(schemas_fp)
schemas

In [None]:
ds_schema = sorted(schemas['orders'], key=lambda col: col['column_position'])
ds_schema

In [None]:
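def get_column_names(schemas_file, ds_name):
    # Load the schema metadata, closing the file once it has been read.
    with open(schemas_file) as schemas_fp:
        schemas = json.load(schemas_fp)
    # Order the columns by their declared position before extracting names.
    ds_schema = sorted(schemas[ds_name], key=lambda col: col['column_position'])
    columns = [col['column_name'] for col in ds_schema]
    return columns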

In [None]:
get_column_names('../../data/retail_db/schemas.json', 'orders')
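
The sort on `column_position` matters because the entries in the schema file are not guaranteed to be stored in column order. The sketch below applies the same logic to an in-memory dictionary. The schema fragment is a hypothetical example that uses only the two keys referenced above (the real `schemas.json` may carry more keys per column), and `column_names_from` is an illustrative helper name, not part of the notebook.

```python
# Hypothetical schema fragment, deliberately listed out of order.
schemas = {
    'orders': [
        {'column_name': 'order_date', 'column_position': 2},
        {'column_name': 'order_id', 'column_position': 1},
        {'column_name': 'order_status', 'column_position': 4},
        {'column_name': 'order_customer_id', 'column_position': 3},
    ]
}


def column_names_from(schemas, ds_name):
    """Return the column names of ds_name ordered by column_position."""
    ds_schema = sorted(schemas[ds_name], key=lambda col: col['column_position'])
    return [col['column_name'] for col in ds_schema]


columns = column_names_from(schemas, 'orders')
print(columns)
```

Without the sort, `pd.read_csv(..., names=columns)` would attach the names to the wrong columns.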

In [None]:
for ds in [
    'departments', 'categories', 'products',
    'customers', 'orders', 'order_items'
]:
    column_names = get_column_names('../../data/retail_db/schemas.json', ds)
    print(f'''columns for {ds} are {','.join(column_names)}''')

In [None]:
tgt_base_dir = 'retail_db_parquet'

In [None]:
import pandas as pd

In [None]:
schemas_file = '../../data/retail_db/schemas.json'
bucket = 'airetail'
files = get_file_names(src_base_dir)
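for file in files:
    print(f'Uploading file {file}')
    # Drop the 4 leading path components ('..', '..', 'data', 'retail_db');
    # this slice is tied to the depth of src_base_dir.
    blob_suffix = '/'.join(file.split('/')[4:])
    # The parent folder name is the data set name (for example, orders).
    ds_name = file.split('/')[-2]
    blob_name = f'gs://{bucket}/{tgt_base_dir}/{blob_suffix}.snappy.parquet'
    columns = get_column_names(schemas_file, ds_name)
    # The source files have no header row, so column names come from the schema.
    df = pd.read_csv(file, names=columns)
    # Writing to a gs:// path requires the gcsfs package; snappy is the default compression.
    df.to_parquet(blob_name, index=False)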
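
The `[4:]` slice in the loop above only works while `src_base_dir` has exactly four path components. One possible alternative, sketched here under that observation, is to derive the suffix with `os.path.relpath` so the mapping does not depend on directory depth. `to_blob_name` is a hypothetical helper name introduced for illustration.

```python
import os


def to_blob_name(file_path, src_base_dir, bucket, tgt_base_dir):
    """Map a local file path to a gs:// URL, keeping the path relative to src_base_dir."""
    # relpath strips the src_base_dir prefix no matter how deep it is.
    # Normalise separators so blob names always use forward slashes.
    blob_suffix = os.path.relpath(file_path, src_base_dir).replace(os.sep, '/')
    return f'gs://{bucket}/{tgt_base_dir}/{blob_suffix}.snappy.parquet'


blob_name = to_blob_name(
    '../../data/retail_db/orders/part-00000',
    '../../data/retail_db',
    'airetail',
    'retail_db_parquet',
)
print(blob_name)
```

With the values used in this notebook, this produces the same blob name as the slice-based version.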

In [None]:
!gsutil ls -r gs://airetail/retail_db_parquet

In [None]:
pd.read_parquet('gs://airetail/retail_db_parquet/orders/part-00000.snappy.parquet')

In [None]:
for ds in [
    'departments', 'categories', 'products',
    'customers', 'orders', 'order_items'
]:
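    # header=None keeps the first record as data; without it, the row count
    # would be one lower than in the Parquet copy.
    df = pd.read_csv(f'../../data/retail_db/{ds}/part-00000', header=None)
    print(f'''Shape of {ds} in local file system is {df.shape}''')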

In [None]:
for ds in [
    'departments', 'categories', 'products',
    'customers', 'orders', 'order_items'
]:
    df = pd.read_parquet(f'gs://{bucket}/{tgt_base_dir}/{ds}/part-00000.snappy.parquet')
    print(f'''Shape of {ds} in GCS is {df.shape}''')