## Google Cloud Storage using Python and Pandas

#### Running gsutil inside the notebook with the `!` shell escape

In [5]:
!gsutil list gs://gcs-data-lake-1

gs://gcs-data-lake-1/data/


In [6]:
!gsutil ls -R gs://gcs-data-lake-1

gs://gcs-data-lake-1/data/:

gs://gcs-data-lake-1/data/retail_db/:
gs://gcs-data-lake-1/data/retail_db/create_db_tables_pg.sql
gs://gcs-data-lake-1/data/retail_db/load_db_tables_pg.sql
gs://gcs-data-lake-1/data/retail_db/schemas.json

gs://gcs-data-lake-1/data/retail_db/categories/:
gs://gcs-data-lake-1/data/retail_db/categories/part-00000

gs://gcs-data-lake-1/data/retail_db/customers/:
gs://gcs-data-lake-1/data/retail_db/customers/part-00000

gs://gcs-data-lake-1/data/retail_db/departments/:
gs://gcs-data-lake-1/data/retail_db/departments/part-00000

gs://gcs-data-lake-1/data/retail_db/order_items/:
gs://gcs-data-lake-1/data/retail_db/order_items/part-00000

gs://gcs-data-lake-1/data/retail_db/orders/:
gs://gcs-data-lake-1/data/retail_db/orders/part-00000

gs://gcs-data-lake-1/data/retail_db/products/:
gs://gcs-data-lake-1/data/retail_db/products/part-00000


# Using Python to interact with Google Storage

In [None]:
# Set up Application Default Credentials so the GCS Python library can authenticate to Google Cloud

!gcloud auth application-default login

#### Create a client to Google Storage

In [30]:
from google.cloud import storage
import json
import pandas as pd


print("")

gsclient = storage.Client()
gsclient




<google.cloud.storage.client.Client at 0x7f2373bdfbb0>

#### List all the buckets in the account

In [36]:
print(f"Buckets in the account are {list(gsclient.list_buckets())}")

bucket0 = list(gsclient.list_buckets())[0]
bucket0.name

Buckets in the account are [<Bucket: gcs-data-lake-1>]


'gcs-data-lake-1'

#### List all blobs in current bucket and show contents of one of the blobs

In [33]:
for i in bucket0.list_blobs():
    print(i)

<Blob: gcs-data-lake-1, data/retail_db/categories/part-00000, 1676193065875141>
<Blob: gcs-data-lake-1, data/retail_db/create_db_tables_pg.sql, 1676193065836272>
<Blob: gcs-data-lake-1, data/retail_db/customers/part-00000, 1676193065868069>
<Blob: gcs-data-lake-1, data/retail_db/departments/part-00000, 1676193065828750>
<Blob: gcs-data-lake-1, data/retail_db/load_db_tables_pg.sql, 1676193065865869>
<Blob: gcs-data-lake-1, data/retail_db/order_items/part-00000, 1676193065874727>
<Blob: gcs-data-lake-1, data/retail_db/orders/part-00000, 1676193065904849>
<Blob: gcs-data-lake-1, data/retail_db/products/part-00000, 1676193065917054>
<Blob: gcs-data-lake-1, data/retail_db/schemas.json, 1676193065864417>


In [34]:
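# Read the last object listed above, "schemas.json" (the loop variable `i` still refers to it).
# download_as_bytes() replaces the deprecated download_as_string(); json.loads accepts bytes.

json.loads(i.download_as_bytes())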

{'departments': [{'column_name': 'department_id',
   'data_type': 'integer',
   'column_position': 1},
  {'column_name': 'department_name',
   'data_type': 'string',
   'column_position': 2}],
 'categories': [{'column_name': 'category_id',
   'data_type': 'integer',
   'column_position': 1},
  {'column_name': 'category_department_id',
   'data_type': 'integer',
   'column_position': 2},
  {'column_name': 'category_name',
   'data_type': 'string',
   'column_position': 3}],
 'orders': [{'column_name': 'order_id',
   'data_type': 'integer',
   'column_position': 1},
  {'column_name': 'order_date', 'data_type': 'string', 'column_position': 2},
  {'column_name': 'order_customer_id',
   'data_type': 'timestamp',
   'column_position': 3},
  {'column_name': 'order_status',
   'data_type': 'string',
   'column_position': 4}],
 'products': [{'column_name': 'product_id',
   'data_type': 'integer',
   'column_position': 1},
  {'column_name': 'product_cateogry_id',
   'data_type': 'integer',
   'c
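The schema above lists each table's columns along with a `column_position`. As a minimal sketch, assuming the structure shown in the output, the column names of a table could be extracted in positional order like this. The helper `get_column_names` and the inline sample are hypothetical; the sample reproduces only the `departments` entry, deliberately out of order.

```python
import json

# Hypothetical sample that mirrors the structure of schemas.json shown above.
# Only the "departments" entry is reproduced, in shuffled order on purpose.
schemas_text = """
{
  "departments": [
    {"column_name": "department_name", "data_type": "string", "column_position": 2},
    {"column_name": "department_id", "data_type": "integer", "column_position": 1}
  ]
}
"""


def get_column_names(schemas, table_name):
    """Return the column names of a table, ordered by column_position."""
    columns = sorted(schemas[table_name], key=lambda col: col["column_position"])
    return [col["column_name"] for col in columns]


schemas = json.loads(schemas_text)
department_columns = get_column_names(schemas, "departments")
print(department_columns)
```

A list built this way could be passed as the `names` argument of `pd.read_csv` so that the headerless `part-00000` files get meaningful column labels.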

#### Selecting a specific bucket, and working with it

In [65]:
[i.name for i in gsclient.get_bucket(bucket_or_name="gcs-data-lake-1").list_blobs()]

['data/retail_db/categories/part-00000',
 'data/retail_db/create_db_tables_pg.sql',
 'data/retail_db/customers/part-00000',
 'data/retail_db/departments/part-00000',
 'data/retail_db/load_db_tables_pg.sql',
 'data/retail_db/order_items/part-00000',
 'data/retail_db/orders/part-00000',
 'data/retail_db/products/part-00000',
 'data/retail_db/schemas.json']

##### Creating new blobs of data

In [53]:
# Create a new blob, check that it exists, then delete it
bucket = gsclient.get_bucket(bucket_or_name="gcs-data-lake-1")
new_blob = bucket.blob("new_blob")
new_blob.upload_from_string("This is a new blob")
new_blob.exists()  # the result is discarded here; only the last expression of a cell is displayed
new_blob.delete()

##### Selecting a blob, downloading its data and displaying it as a Pandas DataFrame

In [95]:
blob_name =  "data/retail_db/orders/part-00000"
blob = gsclient.get_bucket("gcs-data-lake-1").get_blob(blob_name)
blob

<Blob: gcs-data-lake-1, data/retail_db/orders/part-00000, 1676193065904849>

In [114]:
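from io import BytesIO

# Download the blob into an in-memory buffer and read it as a headerless CSV.
# The buffer is not named `io`, to avoid shadowing the standard library module.
buffer = BytesIO(blob.download_as_bytes())
pd.read_csv(buffer, header=None).head(3)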

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED |
| 1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT |
| 2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE |


##### Downloading with URI

In [116]:
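# Download straight into an in-memory buffer using the gs:// URI, then rewind before reading
buffer = BytesIO()
gsclient.download_blob_to_file("gs://gcs-data-lake-1/data/retail_db/orders/part-00000", buffer)
buffer.seek(0)
pd.read_csv(buffer, header=None).head(3)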

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED |
| 1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT |
| 2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE |
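A `gs://` URI is the bucket name followed by the object name. When the two parts are needed separately (for example, to call `get_bucket` and `get_blob`), they can be split with plain string handling. The helper `split_gcs_uri` below is a hypothetical illustration and not part of the google-cloud-storage library.

```python
def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/object' into (bucket_name, blob_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not a gs:// URI: {uri}")
    # Everything up to the first "/" is the bucket; the rest is the object name.
    bucket_name, _, blob_name = uri[len(prefix):].partition("/")
    if not bucket_name:
        raise ValueError(f"Missing bucket name in URI: {uri}")
    return bucket_name, blob_name


bucket_name, blob_name = split_gcs_uri(
    "gs://gcs-data-lake-1/data/retail_db/orders/part-00000"
)
print(bucket_name)
print(blob_name)
```

String partitioning is used here rather than a URL parser because object names may contain characters such as `?` or `#`, which a URL parser would treat as a query string or fragment.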


# Project to upload multiple files to GCS using Python modules

In this section, we use the google-cloud-storage, os and glob modules to transfer files from the local file system to object storage through the Python API.
 
Design requirements:
1. Get the list of file names from the local file system.
2. Build a blob for each file to upload.
3. Use the upload method to upload the file's data.
4. Use metadata to help upload the right information.
5. Name the blobs using the local file names as a reference.
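Steps 1 and 5 of the list above can be tried locally without touching GCS. The following is a minimal sketch that uses `pathlib` (an alternative to glob and string slicing) to walk a directory and derive a blob name from each file's path relative to the source directory. The helpers `iter_files` and `to_blob_name`, and the temporary directory layout, are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def iter_files(src_dir):
    """Yield every regular file under src_dir, recursively, in sorted order."""
    for item in sorted(Path(src_dir).rglob("*")):
        if item.is_file():  # skip directories
            yield item


def to_blob_name(file_path, src_dir):
    """Blob name = path relative to src_dir, always with forward slashes."""
    return Path(file_path).relative_to(src_dir).as_posix()


# Build a small throwaway directory tree to demonstrate the helpers.
with tempfile.TemporaryDirectory() as src_dir:
    nested = Path(src_dir) / "cards" / "smalldecks"
    nested.mkdir(parents=True)
    (nested / "deckofcards.txt").write_text("sample")
    (Path(src_dir) / "cards" / "deckofcards.txt").write_text("sample")

    blob_names = [to_blob_name(f, src_dir) for f in iter_files(src_dir)]

print(blob_names)
```

Deriving the name from the relative path keeps the blob names independent of how the source directory itself is written (for example, with or without a trailing slash).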


In [117]:
!gsutil mb -b on -c standard --autoclass --placement us-east1,us-east4 gs://test-data-lake-1

Creating gs://test-data-lake-1/...


In [118]:
list(gsclient.list_buckets())

[<Bucket: gcs-data-lake-1>, <Bucket: test-data-lake-1>]

At this point we have two buckets, including the one we just created.

In [164]:
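## Helpers to list the local files to upload and to derive their blob names

from glob import glob
from os import path

def get_file_list(src_dir):
    # Walk src_dir recursively and yield regular files only (directories are skipped)
    items = glob(f"{src_dir}/**", recursive=True)
    yield from filter(path.isfile, items)

# Drop everything up to and including the second "/", so that
# "../data/cards/deckofcards.txt" becomes "cards/deckofcards.txt"
create_blob_name = lambda x: x[x.index("/", x.index("/")+1)+1:]
[create_blob_name(i) for i in get_file_list("../data/")][:2]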

['cards/smalldecks/deckofcards.txt', 'cards/smalldecks/deckofcards.tar.gz']

In [166]:
### Upload these files to the bucket we created

bucket1 = gsclient.get_bucket(bucket_or_name="test-data-lake-1")
for file_to_upload in get_file_list("../data/"):
    # Derive the blob name from the local path and create a blob handle for it
    blob_name = create_blob_name(file_to_upload)
    blob = bucket1.blob(blob_name=blob_name)
    # Open the file in binary mode so its bytes are uploaded unchanged
    with open(file_to_upload, "rb") as f:
        print(f"Upload of file in progress - {blob_name}")
        blob.upload_from_file(f)

Upload of file in progress - cards/smalldecks/deckofcards.txt
Upload of file in progress - cards/smalldecks/deckofcards.tar.gz
Upload of file in progress - cards/deckofcards.txt
Upload of file in progress - cards/zippeddecks/zippeddeck.txt.gz
Upload of file in progress - cards/zippeddecks/zippeddeck.tar
Upload of file in progress - cards/largedeck.txt.gz
Upload of file in progress - electionresults/ls2014.tsv
Upload of file in progress - nyse/companylist_noheader.csv
Upload of file in progress - nyse/nyse_data.tar.gz
Upload of file in progress - nyse_all/nyse_data/NYSE_2004.txt.gz
Upload of file in progress - nyse_all/nyse_data/NYSE_2013.txt.gz
Upload of file in progress - nyse_all/nyse_data/NYSE_2011.txt.gz
Upload of file in progress - nyse_all/nyse_data/NYSE_2009.txt.gz
Upload of file in progress - nyse_all/nyse_data/NYSE_1998.txt.gz
Upload of file in progress - nyse_all/nyse_data/NYSE_2003.txt.gz
Upload of file in progress - nyse_all/nyse_data/NYSE_2014.txt.gz
Upload of file in prog
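The upload loop above does not set any metadata on the blobs. One possible way to address the metadata requirement, sketched here as an assumption and not as the notebook's own method, is to guess a content type from the file name with the standard `mimetypes` module. The helper `guess_upload_metadata` is hypothetical.

```python
import mimetypes


def guess_upload_metadata(file_name):
    """Guess content type and encoding from the file name alone."""
    content_type, content_encoding = mimetypes.guess_type(file_name)
    return {
        # Fall back to a generic binary type when the extension is unknown.
        "content_type": content_type or "application/octet-stream",
        "content_encoding": content_encoding,
    }


metadata = guess_upload_metadata("cards/deckofcards.txt")
print(metadata)
```

The guessed value could then be passed through the `content_type` parameter of `blob.upload_from_file`. The guess depends only on the file extension, so files without one (such as `part-00000`) fall back to the generic type.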

In [167]:
len(list(bucket1.list_blobs()))

82

In [169]:
# Another way of listing the files of interest

list(gsclient.list_blobs(bucket_or_name="test-data-lake-1", prefix="retail_db/orders"))

[<Blob: test-data-lake-1, retail_db/orders/part-00000, 1676239097434223>]
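The `prefix` argument matches the beginning of the object name as a plain string; it is not a directory lookup. The following local illustration mimics that behaviour with `str.startswith` and makes no call to the service. The function `filter_by_prefix` and the sample names other than `retail_db/orders/part-00000` are assumptions.

```python
# Illustrative object names; only the "orders" entry appears in the output above.
sample_blob_names = [
    "retail_db/order_items/part-00000",
    "retail_db/orders/part-00000",
    "retail_db/products/part-00000",
]


def filter_by_prefix(names, prefix):
    """Keep the names that start with the given prefix (plain string match)."""
    return [name for name in names if name.startswith(prefix)]


matches = filter_by_prefix(sample_blob_names, "retail_db/orders")
print(matches)

# A shorter prefix also picks up "order_items", since it shares the same start.
broader_matches = filter_by_prefix(sample_blob_names, "retail_db/order")
print(broader_matches)
```

Ending the prefix with a `/` (for example `retail_db/orders/`) is a simple way to limit the match to one "folder".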

# Overview of using pandas to process the files in GCS