## Project to Upload Files to GCS using Pandas

In this series of lectures, we will see how to upload files to GCS using Python Pandas. We will use `glob`, `os`, `json`, and `pandas` to build the application logic.

Here are the design details.
* First, we need to get the list of file names to upload from the local file system.
* To give each data set the right column names, we extract them from the **schemas.json** file in `data/retail_db`.
* Once we have the file names, we can use `pd.read_csv` with `names` to create a DataFrame and then write it to the target GCS location in the `parquet` file format.
* We will use a metadata-driven (data-driven) development approach to upload all the retail files to GCS.
* Blobs (files) written in Parquet format will be named after the source file names.
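
The design above can be sketched as a pure planning step that runs before any upload happens. This is only an illustrative sketch, not the code built later in this section: `plan_uploads` is a hypothetical helper, and the bucket name, paths, and inline schema are made-up sample values.

```python
def plan_uploads(files, schemas, bucket, tgt_base_dir):
    """Return one upload plan per source file (no I/O is performed)."""
    plans = []
    for file in files:
        # The parent folder name doubles as the data set name, e.g. 'orders'.
        ds_name, file_name = file.split('/')[-2:]
        # Column names must follow column_position, not the order in the JSON.
        ds_schema = sorted(schemas[ds_name], key=lambda col: col['column_position'])
        plans.append({
            'source': file,
            'columns': [col['column_name'] for col in ds_schema],
            'target': f'gs://{bucket}/{tgt_base_dir}/{ds_name}/{file_name}.snappy.parquet',
        })
    return plans


# Made-up sample metadata, deliberately listed out of order.
sample_schemas = {
    'departments': [
        {'column_name': 'department_name', 'data_type': 'string', 'column_position': 2},
        {'column_name': 'department_id', 'data_type': 'integer', 'column_position': 1},
    ]
}
sample_plans = plan_uploads(
    ['data/retail_db/departments/part-00000'],
    sample_schemas, 'my-bucket', 'retail_db_parquet'
)
print(sample_plans[0]['columns'])
print(sample_plans[0]['target'])
```

Keeping the plan separate from the upload makes it possible to inspect column names and target blob names without touching GCS.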

In [1]:
!gsutil rm -r gs://airetail/retail_db_parquet



CommandException: No URLs matched: gs://airetail/retail_db_parquet


In [2]:
!gsutil ls gs://airetail/

gs://airetail/pythondemo/
gs://airetail/retail_db/


In [3]:
import glob
import os

In [4]:
def get_file_names(src_base_dir):
    # Walk the directory tree and keep only the part-00000 data files.
    items = glob.glob(f'{src_base_dir}/**', recursive=True)
    return [item for item in items if os.path.isfile(item) and item.endswith('part-00000')]


In [5]:
src_base_dir = '../../data/retail_db'

In [6]:
get_file_names(src_base_dir)

['../../data/retail_db/customers/part-00000',
 '../../data/retail_db/products/part-00000',
 '../../data/retail_db/departments/part-00000',
 '../../data/retail_db/order_items/part-00000',
 '../../data/retail_db/orders/part-00000',
 '../../data/retail_db/categories/part-00000']
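
The listing above is not in alphabetical order, because `glob.glob` returns matches in whatever order the file system yields them. If a stable order matters, one option is to sort the result. The sketch below is a hypothetical variant (`get_sorted_file_names` is not part of the original notebook) and assumes the layout `<src_base_dir>/<data set>/part-*`; it builds a throwaway directory so it can run anywhere.

```python
import glob
import os
import tempfile


def get_sorted_file_names(src_base_dir):
    """List part files one level below src_base_dir in a stable order."""
    # Assumption: the layout is <src_base_dir>/<data set>/part-*.
    items = glob.glob(os.path.join(src_base_dir, '*', 'part-*'))
    return sorted(item for item in items if os.path.isfile(item))


# Build a throwaway directory tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as base_dir:
    for ds in ['orders', 'customers']:
        os.makedirs(os.path.join(base_dir, ds))
        with open(os.path.join(base_dir, ds, 'part-00000'), 'w') as f:
            f.write('1,sample\n')
    # A file that must not be picked up.
    with open(os.path.join(base_dir, 'schemas.json'), 'w') as f:
        f.write('{}')
    found = [os.path.relpath(p, base_dir) for p in get_sorted_file_names(base_dir)]

print(found)
```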

In [8]:
import json

In [9]:
with open('../../data/retail_db/schemas.json') as f:
    schemas = json.load(f)
schemas

{'departments': [{'column_name': 'department_id',
   'data_type': 'integer',
   'column_position': 1},
  {'column_name': 'department_name',
   'data_type': 'string',
   'column_position': 2}],
 'categories': [{'column_name': 'category_id',
   'data_type': 'integer',
   'column_position': 1},
  {'column_name': 'category_department_id',
   'data_type': 'integer',
   'column_position': 2},
  {'column_name': 'category_name',
   'data_type': 'string',
   'column_position': 3}],
 'orders': [{'column_name': 'order_id',
   'data_type': 'integer',
   'column_position': 1},
  {'column_name': 'order_date', 'data_type': 'string', 'column_position': 2},
  {'column_name': 'order_customer_id',
   'data_type': 'timestamp',
   'column_position': 3},
  {'column_name': 'order_status',
   'data_type': 'string',
   'column_position': 4}],
 'products': [{'column_name': 'product_id',
   'data_type': 'integer',
   'column_position': 1},
  {'column_name': 'product_cateogry_id',
   'data_type': 'integer',
   'c

In [10]:
schemas['orders']

[{'column_name': 'order_id', 'data_type': 'integer', 'column_position': 1},
 {'column_name': 'order_date', 'data_type': 'string', 'column_position': 2},
 {'column_name': 'order_customer_id',
  'data_type': 'timestamp',
  'column_position': 3},
 {'column_name': 'order_status', 'data_type': 'string', 'column_position': 4}]

In [12]:
sorted(schemas['orders'], key=lambda col: col['column_position'])

[{'column_name': 'order_id', 'data_type': 'integer', 'column_position': 1},
 {'column_name': 'order_date', 'data_type': 'string', 'column_position': 2},
 {'column_name': 'order_customer_id',
  'data_type': 'timestamp',
  'column_position': 3},
 {'column_name': 'order_status', 'data_type': 'string', 'column_position': 4}]

In [13]:
ds_schema = sorted(schemas['orders'], key=lambda col: col['column_position'])
ds_schema

[{'column_name': 'order_id', 'data_type': 'integer', 'column_position': 1},
 {'column_name': 'order_date', 'data_type': 'string', 'column_position': 2},
 {'column_name': 'order_customer_id',
  'data_type': 'timestamp',
  'column_position': 3},
 {'column_name': 'order_status', 'data_type': 'string', 'column_position': 4}]

In [14]:
[col['column_name'] for col in ds_schema]

['order_id', 'order_date', 'order_customer_id', 'order_status']

In [15]:
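def get_column_names(schemas_file, ds_name):
    # Close the file handle as soon as the JSON is parsed.
    with open(schemas_file) as f:
        schemas = json.load(f)
    # Order the columns by their declared position in the data set.
    ds_schema = sorted(schemas[ds_name], key=lambda col: col['column_position'])
    return [col['column_name'] for col in ds_schema]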

In [16]:
get_column_names('../../data/retail_db/schemas.json', 'orders')

['order_id', 'order_date', 'order_customer_id', 'order_status']
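
`get_column_names` reads and parses the schemas file on every call. When column names are needed for many data sets, it may be simpler to read the file once and keep a mapping from data set name to ordered columns. The following is a sketch under that assumption; `load_all_column_names` is a hypothetical helper, and the metadata written to the temporary file is made up for the demonstration.

```python
import json
import os
import tempfile


def load_all_column_names(schemas_file):
    """Read the schemas file once and map each data set to its ordered columns."""
    with open(schemas_file) as f:
        schemas = json.load(f)
    return {
        ds_name: [
            col['column_name']
            for col in sorted(ds_schema, key=lambda col: col['column_position'])
        ]
        for ds_name, ds_schema in schemas.items()
    }


# Made-up metadata, listed out of order on purpose.
sample = {
    'departments': [
        {'column_name': 'department_name', 'data_type': 'string', 'column_position': 2},
        {'column_name': 'department_id', 'data_type': 'integer', 'column_position': 1},
    ]
}
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_file = os.path.join(tmp_dir, 'schemas.json')
    with open(sample_file, 'w') as f:
        json.dump(sample, f)
    column_names_by_ds = load_all_column_names(sample_file)

print(column_names_by_ds)
```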

In [17]:
for ds in [
    'departments', 'categories', 'products',
    'customers', 'orders', 'order_items'
]:
    column_names = get_column_names('../../data/retail_db/schemas.json', ds)
    print(f'''columns for {ds} are {','.join(column_names)}''')

columns for departments are department_id,department_name
columns for categories are category_id,category_department_id,category_name
columns for products are product_id,product_cateogry_id,product_name,product_description,product_price,product_image
columns for customers are customer_id,customer_fname,customer_lname,customer_email,customer_password,customer_street,customer_city,customer_state,customer_zipcode
columns for orders are order_id,order_date,order_customer_id,order_status
columns for order_items are order_item_id,order_item_order_id,order_item_product_id,order_item_quantity,order_item_subtotal,order_item_product_price


In [19]:
import pandas as pd

In [23]:
src_base_dir = '../../data/retail_db'
schemas_file = '../../data/retail_db/schemas.json'
bucket = 'airetail'
files = get_file_names(src_base_dir)

In [25]:
file = files[0]
file

'../../data/retail_db/customers/part-00000'

In [29]:
'/'.join(file.split('/')[-2:])

'customers/part-00000'

In [31]:
ds_name = file.split('/')[-2]
ds_name

'customers'
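
Splitting on `'/'` works for the forward-slash paths used here. As an alternative sketch, `pathlib` can pull out the same two pieces by name rather than by index. `split_dataset_path` is a hypothetical helper; `PurePosixPath` is used because the paths in this notebook use forward slashes, and `Path` could be swapped in to follow the separator of the current operating system.

```python
from pathlib import PurePosixPath


def split_dataset_path(file):
    """Return (data set name, file name) from a path like .../customers/part-00000."""
    path = PurePosixPath(file)
    return path.parent.name, path.name


ds_name, file_name = split_dataset_path('../../data/retail_db/customers/part-00000')
blob_suffix = f'{ds_name}/{file_name}'
print(ds_name)      # customers
print(blob_suffix)  # customers/part-00000
```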

In [33]:
columns = get_column_names(schemas_file, ds_name)
columns

['customer_id',
 'customer_fname',
 'customer_lname',
 'customer_email',
 'customer_password',
 'customer_street',
 'customer_city',
 'customer_state',
 'customer_zipcode']

In [34]:
src_base_dir = '../../data/retail_db'
tgt_base_dir = 'retail_db_parquet'
schemas_file = '../../data/retail_db/schemas.json'
bucket = 'airetail'
files = get_file_names(src_base_dir)
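for file in files:
    print(f'Uploading file {file}')
    # Keep '<data set>/part-00000' so the GCS layout mirrors the local one.
    blob_suffix = '/'.join(file.split('/')[-2:])
    ds_name = file.split('/')[-2]
    blob_name = f'gs://{bucket}/{tgt_base_dir}/{blob_suffix}.snappy.parquet'
    columns = get_column_names(schemas_file, ds_name)
    # The source files have no header row, so pass the column names explicitly.
    df = pd.read_csv(file, names=columns)
    # Writing to a gs:// URL requires gcsfs and a Parquet engine such as pyarrow.
    df.to_parquet(blob_name, index=False)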

Uploading file ../../data/retail_db/customers/part-00000
Uploading file ../../data/retail_db/products/part-00000
Uploading file ../../data/retail_db/departments/part-00000
Uploading file ../../data/retail_db/order_items/part-00000
Uploading file ../../data/retail_db/orders/part-00000
Uploading file ../../data/retail_db/categories/part-00000
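
The key step in the loop is `pd.read_csv` with `names`. A minimal sketch of what that argument does, using an in-memory buffer instead of a real file (the two rows follow the header-less layout of the orders data set):

```python
import io

import pandas as pd

# Two sample rows in the same header-less layout as the orders file.
raw = io.StringIO(
    '1,2013-07-25 00:00:00.0,11599,CLOSED\n'
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n'
)
columns = ['order_id', 'order_date', 'order_customer_id', 'order_status']

# names= labels the columns and, with no header= given, keeps the first row as data.
df = pd.read_csv(raw, names=columns)
print(list(df.columns))
print(df.shape)
```

Without `names` (and without `header=None`), the first data row would be consumed as the header and lost from the data.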


In [35]:
!gsutil ls -r gs://airetail/retail_db_parquet

gs://airetail/retail_db_parquet/:

gs://airetail/retail_db_parquet/categories/:
gs://airetail/retail_db_parquet/categories/part-00000.snappy.parquet

gs://airetail/retail_db_parquet/customers/:
gs://airetail/retail_db_parquet/customers/part-00000.snappy.parquet

gs://airetail/retail_db_parquet/departments/:
gs://airetail/retail_db_parquet/departments/part-00000.snappy.parquet

gs://airetail/retail_db_parquet/order_items/:
gs://airetail/retail_db_parquet/order_items/part-00000.snappy.parquet

gs://airetail/retail_db_parquet/orders/:
gs://airetail/retail_db_parquet/orders/part-00000.snappy.parquet

gs://airetail/retail_db_parquet/products/:
gs://airetail/retail_db_parquet/products/part-00000.snappy.parquet


In [39]:
pd.read_csv('../../data/retail_db/orders/part-00000', header=None)

|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED |
| 1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT |
| 2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE |
| 3 | 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED |
| 4 | 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE |
| ... | ... | ... | ... | ... |
| 68878 | 68879 | 2014-07-09 00:00:00.0 | 778 | COMPLETE |
| 68879 | 68880 | 2014-07-13 00:00:00.0 | 1117 | COMPLETE |
| 68880 | 68881 | 2014-07-19 00:00:00.0 | 2518 | PENDING_PAYMENT |
| 68881 | 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD |


In [36]:
pd.read_parquet('gs://airetail/retail_db_parquet/orders/part-00000.snappy.parquet')

|  | order_id | order_date | order_customer_id | order_status |
|---|---|---|---|---|
| 0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED |
| 1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT |
| 2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE |
| 3 | 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED |
| 4 | 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE |
| ... | ... | ... | ... | ... |
| 68878 | 68879 | 2014-07-09 00:00:00.0 | 778 | COMPLETE |
| 68879 | 68880 | 2014-07-13 00:00:00.0 | 1117 | COMPLETE |
| 68880 | 68881 | 2014-07-19 00:00:00.0 | 2518 | PENDING_PAYMENT |
| 68881 | 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD |


In [41]:
for ds in [
    'departments', 'categories', 'products',
    'customers', 'orders', 'order_items'
]:
    df = pd.read_csv(f'../../data/retail_db/{ds}/part-00000', header=None)
    print(f'''Shape of {ds} in local file system is {df.shape}''')

Shape of departments in local file system is (6, 2)
Shape of categories in local file system is (58, 3)
Shape of products in local file system is (1345, 6)
Shape of customers in local file system is (12435, 9)
Shape of orders in local file system is (68883, 4)
Shape of order_items in local file system is (172198, 6)


In [42]:
for ds in [
    'departments', 'categories', 'products',
    'customers', 'orders', 'order_items'
]:
    df = pd.read_parquet(f'gs://{bucket}/{tgt_base_dir}/{ds}/part-00000.snappy.parquet')
    print(f'''Shape of {ds} in gcs is {df.shape}''')

Shape of departments in gcs is (6, 2)
Shape of categories in gcs is (58, 3)
Shape of products in gcs is (1345, 6)
Shape of customers in gcs is (12435, 9)
Shape of orders in gcs is (68883, 4)
Shape of order_items in gcs is (172198, 6)
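
Comparing the two listings by eye works for six data sets. As a sketch of how the check could be automated, the hypothetical helper below (`find_shape_mismatches` is not part of the original notebook) takes two dictionaries of shapes and reports any data set whose shape differs or is missing on one side. The shapes are copied from the two outputs above.

```python
def find_shape_mismatches(local_shapes, gcs_shapes):
    """Return data sets whose (rows, columns) differ or are missing on either side."""
    all_names = sorted(set(local_shapes) | set(gcs_shapes))
    return {
        ds: (local_shapes.get(ds), gcs_shapes.get(ds))
        for ds in all_names
        if local_shapes.get(ds) != gcs_shapes.get(ds)
    }


# Shapes reported for the local files.
local_shapes = {
    'departments': (6, 2), 'categories': (58, 3), 'products': (1345, 6),
    'customers': (12435, 9), 'orders': (68883, 4), 'order_items': (172198, 6),
}
# Shapes reported for the Parquet files in GCS.
gcs_shapes = {
    'departments': (6, 2), 'categories': (58, 3), 'products': (1345, 6),
    'customers': (12435, 9), 'orders': (68883, 4), 'order_items': (172198, 6),
}

mismatches = find_shape_mismatches(local_shapes, gcs_shapes)
print(mismatches)  # {} means every data set has the same shape on both sides
```

Matching shapes only confirm row and column counts; they do not prove the values are identical.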
