<a href="https://colab.research.google.com/github/matthewpecsok/data_engineering/blob/main/tutorials/de_object_storage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This tutorial requires a GCP project, a storage bucket, and a service account key.

1.    Log into console.cloud.google.com
2.    Create a storage bucket
3.    Create an IAM service account with the Storage Object User role
4.    Add a JSON key to the service account and download it
5.    Upload the key to your Colab VM

In [None]:
from google.colab import userdata
import os
import sqlite3
import pandas as pd

In [None]:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/content/fleet-space-407416-a177ff0a8af7.json"
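
A missing or mistyped key path only shows up later, when the first storage call fails. As a minimal sketch, the check below confirms the file exists before setting the variable. `set_credentials` is a hypothetical helper written for this tutorial, not part of any Google library.

```python
import os


def set_credentials(key_path):
    """Point GOOGLE_APPLICATION_CREDENTIALS at key_path, failing early if the file is missing."""
    if not os.path.isfile(key_path):
        raise FileNotFoundError(f"service account key not found: {key_path}")
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = key_path
    return key_path


# Example (replace with the path of the key you uploaded):
# set_credentials("/content/your-key.json")
```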

download the northwind sqlite db

In [None]:
!wget -O northwind.db https://github.com/matthewpecsok/data_engineering/raw/main/data/northwind.db

create a connection object

In [None]:
conn = sqlite3.connect("northwind.db")

using pandas and the connection object, retrieve the list of table names from the database

In [None]:
pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", conn)
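
The query works because SQLite records every table in its built-in `sqlite_master` catalog. A minimal sketch on a throwaway in-memory database (the two table names are made up for illustration) shows the same lookup without pandas:

```python
import sqlite3

# In-memory database with two illustrative tables (not the Northwind schema).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE customers (CustomerID TEXT)")
demo.execute("CREATE TABLE orders (OrderID INTEGER)")

# sqlite_master holds one row per schema object; keep only the tables.
rows = demo.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
).fetchall()
table_names = [name for (name,) in rows]
print(table_names)
demo.close()
```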



1.   read the customers table into a pandas dataframe
2.   write the dataframe to a csv file (exclude the index)
3.   using the shell command tail show the last 10 rows of the csv file.



In [None]:
customers = pd.read_sql("SELECT * FROM customers", conn)
customers.to_csv("customers.csv",index=False)
!tail customers.csv

1.   read the orders table into a pandas dataframe
2.   write the dataframe to a csv file (exclude the index)
3.   using the shell command tail show the last 10 rows of the csv file.


In [None]:
orders = pd.read_sql("SELECT * FROM orders", conn)
orders.to_csv("orders.csv",index=False)
!tail orders.csv

1.   read the Order Details table into a pandas dataframe
2.   write the dataframe to a csv file (exclude the index)
3.   using the shell command tail show the last 10 rows of the csv file.


In [None]:
order_detail = pd.read_sql('SELECT * FROM "Order Details"', conn)  # double quotes mark an identifier that contains a space
order_detail.to_csv("order_detail.csv",index=False)
!tail order_detail.csv
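
The three export cells above follow one pattern, so they could be folded into a loop. The sketch below assumes a hypothetical helper `export_tables` and a mapping from table name to CSV file name. It is demonstrated on an in-memory database with made-up rows rather than on `northwind.db`.

```python
import os
import sqlite3
import tempfile

import pandas as pd


def export_tables(conn, table_to_csv, out_dir="."):
    """Write each table to a CSV file (no index) and return row counts per table."""
    counts = {}
    for table, csv_name in table_to_csv.items():
        # Double quotes let SQLite accept table names containing spaces.
        df = pd.read_sql(f'SELECT * FROM "{table}"', conn)
        df.to_csv(os.path.join(out_dir, csv_name), index=False)
        counts[table] = len(df)
    return counts


# Demo on a throwaway database with made-up rows.
demo = sqlite3.connect(":memory:")
demo.execute('CREATE TABLE "Order Details" (OrderID INTEGER, Quantity INTEGER)')
demo.executemany('INSERT INTO "Order Details" VALUES (?, ?)', [(1, 5), (1, 2), (2, 9)])
out_dir = tempfile.mkdtemp()
counts = export_tables(demo, {"Order Details": "order_detail.csv"}, out_dir)
print(counts)
```

In this notebook the call would look like `export_tables(conn, {"customers": "customers.csv", "orders": "orders.csv", "Order Details": "order_detail.csv"})`.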

In [None]:


# create a new function for uploading to GCP cloud storage
def upload_to_storage(bucket_name, source_file_name, destination_blob_name):
    from google.cloud import storage

    storage_client = storage.Client() # create the Client.
    bucket = storage_client.bucket(bucket_name) # get the bucket instance
    blob = bucket.blob(destination_blob_name) # create a new blob

    blob.upload_from_filename(source_file_name) # upload the file

    print(f"File {source_file_name} uploaded to {destination_blob_name}.") # print the success message

# create a new function for downloading from GCP cloud storage
def download_from_storage(bucket_name, source_blob_name, destination_file_name):
    from google.cloud import storage

    storage_client = storage.Client() # create the Client.
    bucket = storage_client.bucket(bucket_name) # get the bucket instance
    blob = bucket.blob(source_blob_name) # reference the existing blob

    blob.download_to_filename(destination_file_name) # download the blob to a local file

    print(f"File {source_blob_name} downloaded to {destination_file_name}.") # print the success message


upload the 3 local csv files into our GCP cloud storage bucket.

In [None]:
bucket_name = "6850test1" # Replace with your bucket name
source_file_name = "customers.csv" # Replace with the path to your local file
destination_blob_name = "customers.csv" # Replace with the destination object name in the bucket

upload_to_storage(bucket_name, source_file_name, destination_blob_name)

In [None]:
bucket_name = "6850test1" # Replace with your bucket name
source_file_name = "orders.csv" # Replace with the path to your local file
destination_blob_name = "orders.csv" # Replace with the destination object name in the bucket

upload_to_storage(bucket_name, source_file_name, destination_blob_name)

In [None]:
bucket_name = "6850test1" # Replace with your bucket name
source_file_name = "order_detail.csv" # Replace with the path to your local file
destination_blob_name = "order_detail.csv" # Replace with the destination object name in the bucket

upload_to_storage(bucket_name, source_file_name, destination_blob_name)
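
The three upload cells differ only in the file name, so a loop over the names is one way to shorten them. The sketch below assumes a hypothetical helper `upload_files` that takes a bucket object (as returned by `storage.Client().bucket(...)`) and reuses each local file name as the object name.

```python
def upload_files(bucket, file_names):
    """Upload each local file to the bucket under the same object name."""
    uploaded = []
    for name in file_names:
        blob = bucket.blob(name)  # object handle inside the bucket
        blob.upload_from_filename(name)  # send the local file's contents
        uploaded.append(name)
    return uploaded


# Usage in the notebook (needs valid credentials and an existing bucket):
# from google.cloud import storage
# bucket = storage.Client().bucket("6850test1")
# upload_files(bucket, ["customers.csv", "orders.csv", "order_detail.csv"])
```

Passing the bucket in, rather than creating a client inside the function, means one client is reused for every file.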

Now download them back from storage. This time, don't save them to the filesystem; read them directly into pandas dataframes in memory.

In [None]:
import gcsfs

gcs = gcsfs.GCSFileSystem(project='fleet-space-407416')

In [None]:
bucket_name = '6850test1'
file_path = 'orders.csv'

# Use the gcsfs file system object to open the CSV file
with gcs.open(f'{bucket_name}/{file_path}') as file:
    orders_df = pd.read_csv(file)
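
`pd.read_csv` accepts any file-like object, which is why the handle returned by `gcs.open` can be passed straight in. The sketch below uses an in-memory `io.BytesIO` buffer as a stand-in for the remote file (the two rows are made up), so it runs without a bucket:

```python
import io

import pandas as pd

# Stand-in for the object returned by gcs.open(...): a binary file-like buffer.
fake_remote_file = io.BytesIO(b"OrderID,CustomerID\n10248,VINET\n10249,TOMSP\n")

with fake_remote_file as file:
    demo_df = pd.read_csv(file)

print(demo_df.shape)
```

With gcsfs installed, pandas can also take a `gs://bucket/path` URL directly in place of an open file handle.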


In [None]:
orders_df.shape

In [None]:
orders_df.head()

In [None]:
# Use the gcsfs file system object to open the CSV file
file_path = 'customers.csv'

with gcs.open(f'{bucket_name}/{file_path}') as file:
    customers_df = pd.read_csv(file)

customers_df.shape

In [None]:
customers_df.head()

join the customers and orders pandas dataframes

In [None]:
customers_orders = orders_df.merge(customers_df, on='CustomerID', how='left')
customers_orders.columns
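
A left join keeps every order, including any whose `CustomerID` has no match in the customers table. As a sketch on made-up rows (only the column names mirror the notebook), the `indicator` and `validate` options of `merge` make such cases visible:

```python
import pandas as pd

# Made-up rows; only the column names mirror the notebook.
orders_demo = pd.DataFrame({"OrderID": [1, 2, 3], "CustomerID": ["A", "B", "Z"]})
customers_demo = pd.DataFrame({"CustomerID": ["A", "B"], "CompanyName": ["Acme", "Bolt"]})

joined = orders_demo.merge(
    customers_demo,
    on="CustomerID",
    how="left",
    validate="many_to_one",  # raise if a CustomerID repeats in customers
    indicator=True,  # adds a _merge column: "both" or "left_only"
)

# Orders whose customer was not found keep NaN in the customer columns.
unmatched = joined[joined["_merge"] == "left_only"]
print(len(joined), len(unmatched))
```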

In [None]:
customers_orders.head()

In [None]:
# Use the gcsfs file system object to open the CSV file
file_path = 'order_detail.csv'

with gcs.open(f'{bucket_name}/{file_path}') as file:
    order_detail_df = pd.read_csv(file)

order_detail_df.shape

In [None]:
order_detail_df.columns

join the customers_orders and order_detail_df pandas dataframes

In [None]:
customers_order_detail = customers_orders.merge(order_detail_df, on='OrderID', how='left')
customers_order_detail.columns
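
Because one order can have several detail lines, this join returns one row per detail line rather than one per order. A small sketch with made-up rows illustrates the row count growing:

```python
import pandas as pd

# Made-up rows: order 1 has two detail lines, order 2 has one.
orders_demo = pd.DataFrame({"OrderID": [1, 2], "CustomerID": ["A", "B"]})
details_demo = pd.DataFrame({"OrderID": [1, 1, 2], "ProductID": [11, 42, 72]})

# validate="one_to_many" raises if OrderID repeats on the orders side.
joined = orders_demo.merge(details_demo, on="OrderID", how="left", validate="one_to_many")
print(len(orders_demo), len(joined))
```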

write the new dataframe to a parquet file

In [None]:
customers_order_detail.to_parquet('customers_order_detail.parquet')

upload the parquet file to GCP cloud storage.

In [None]:
bucket_name = "6850test1" # Replace with your bucket name
source_file_name = "customers_order_detail.parquet" # Replace with the path to your local file
destination_blob_name = "customers_order_detail.parquet" # Replace with the destination object name in the bucket

upload_to_storage(bucket_name, source_file_name, destination_blob_name)