<a href="https://colab.research.google.com/github/matthewpecsok/data_engineering/blob/main/tutorials/de_object_storage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This tutorial requires a GCP project with a Cloud Storage bucket and a service account key. To set these up:

1.    Log in to console.cloud.google.com
2.    Create a storage bucket
3.    Create an IAM service account with the Storage Object User role
4.    Add a key (JSON) to the service account
5.    Upload the key to your Colab VM
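
Before pointing `GOOGLE_APPLICATION_CREDENTIALS` at the uploaded key, it can help to confirm the file is really there and looks like a service account key. The sketch below is a hypothetical helper (`check_service_account_key` is not part of any Google library). It only checks a few fields that service account JSON keys normally contain and does not contact GCP.

```python
import json
import os


def check_service_account_key(path):
    """Return the service account email if the key file looks usable.

    Hypothetical helper for this tutorial: it inspects the JSON only and
    does not verify the key against GCP.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Key file not found: {path}")

    with open(path) as f:
        key = json.load(f)

    # Fields assumed to be present in a service account JSON key.
    required = {"type", "project_id", "private_key", "client_email"}
    missing = required - key.keys()
    if missing:
        raise ValueError(f"Key file is missing fields: {sorted(missing)}")

    if key["type"] != "service_account":
        raise ValueError(f"Unexpected key type: {key['type']}")

    return key["client_email"]


# Example call (replace with the path to your own key file):
# check_service_account_key("/content/your-key-file.json")
```

If the call raises, the upload in step 5 most likely did not land where the notebook expects it.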

In [1]:
from google.colab import userdata
import os
import sqlite3
import pandas as pd

In [None]:
# Replace with the path to the key file you uploaded to the Colab VM
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/content/fleet-space-407416-a177ff0a8af7.json"

download the Northwind SQLite database

In [2]:
!wget -O northwind.db https://github.com/matthewpecsok/data_engineering/raw/main/data/northwind.db

--2024-08-01 17:11:28--  https://github.com/matthewpecsok/data_engineering/raw/main/data/northwind.db
Resolving github.com (github.com)... 140.82.116.3
Connecting to github.com (github.com)|140.82.116.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/matthewpecsok/data_engineering/main/data/northwind.db [following]
--2024-08-01 17:11:28--  https://raw.githubusercontent.com/matthewpecsok/data_engineering/main/data/northwind.db
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 602112 (588K) [application/octet-stream]
Saving to: ‘northwind.db’


2024-08-01 17:11:28 (16.4 MB/s) - ‘northwind.db’ saved [602112/602112]



create a connection object

In [3]:
conn = sqlite3.connect("northwind.db")

using pandas and the connection object, retrieve a list of table names from the database

In [4]:
pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", conn)

Unnamed: 0,name
0,Categories
1,sqlite_sequence
2,CustomerCustomerDemo
3,CustomerDemographics
4,Customers
5,Employees
6,EmployeeTerritories
7,Order Details
8,Orders
9,Products
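
The same listing can be done with a plain `sqlite3` cursor, which also makes it easy to count the rows in each table. This is a minimal sketch and `table_row_counts` is a hypothetical helper name. Table names are wrapped in double quotes so that names containing a space, such as `Order Details`, are handled.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every table in a SQLite database."""
    cur = conn.cursor()
    cur.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
    names = [row[0] for row in cur.fetchall()]

    counts = {}
    for name in names:
        # Double quotes mark an identifier in SQL; embedded quotes are doubled.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = cur.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts


# Example call with the connection created earlier:
# table_row_counts(conn)
```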




1.   read the customers table into a pandas dataframe
2.   write the dataframe to a csv file (exclude the index)
3.   using the shell command tail show the last 10 rows of the csv file.



In [5]:
customers = pd.read_sql("SELECT * FROM customers", conn)
customers.to_csv("customers.csv",index=False)
!tail customers.csv

Val2 ,IT,Val2,IT,,,,,,,
VALON,IT,Valon Hoti,IT,,,,,,,
VICTE,Victuailles en stock,Mary Saveley,Sales Agent,"2, rue du Commerce",Lyon,Western Europe,69004,France,78.32.54.86,78.32.54.87
VINET,Vins et alcools Chevalier,Paul Henriot,Accounting Manager,59 rue de l'Abbaye,Reims,Western Europe,51100,France,26.47.15.10,26.47.15.11
WANDK,Die Wandernde Kuh,Rita Müller,Sales Representative,Adenauerallee 900,Stuttgart,Western Europe,70563,Germany,0711-020361,0711-035428
WARTH,Wartian Herkku,Pirkko Koskitalo,Accounting Manager,Torikatu 38,Oulu,Scandinavia,90110,Finland,981-443655,981-443655
WELLI,Wellington Importadora,Paula Parente,Sales Manager,"Rua do Mercado, 12",Resende,South America,08737-363,Brazil,(14) 555-8122,
WHITC,White Clover Markets,Karl Jablonski,Owner,305 - 14th Ave. S. Suite 3B,Seattle,North America,98128,USA,(206) 555-4112,(206) 555-4115
WILMK,Wilman Kala,Matti Karttunen,Owner/Marketing Assistant,Keskuskatu 45,Helsinki,Scandinavia,21240,Finland,90-224 8858,90-224 8858
WOLZA,Wolski

In [6]:
customers.shape

(93, 11)

In [7]:
!ls -ltrh customers.csv

-rw-r--r-- 1 root root 13K Aug  1 17:12 customers.csv


In [8]:
customers.to_parquet("customers.parquet",index=False)

In [9]:
!ls -ltrh customers*

-rw-r--r-- 1 root root 13K Aug  1 17:12 customers.csv
-rw-r--r-- 1 root root 18K Aug  1 17:12 customers.parquet


#  Parquet vs CSV

Based on these file sizes, is CSV or Parquet the better format for the customer data?
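
File size is only one consideration. Another is whether column types survive a round trip: CSV stores everything as text and pandas has to guess the types when reading, while Parquet stores the column types in the file. The sketch below shows only the CSV side of that, using a made-up two-row frame (the values are illustrative and not read from northwind.db).

```python
import io

import pandas as pd

# Hypothetical frame; the IDs and postal codes are made up for illustration.
df = pd.DataFrame(
    {"CustomerID": ["AAAAA", "BBBBB"], "PostalCode": ["08737", "98128"]}
)


def csv_round_trip(frame, **read_kwargs):
    """Write a frame to an in-memory CSV and read it straight back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=False)
    buffer.seek(0)
    return pd.read_csv(buffer, **read_kwargs)


# pandas infers the postal codes as integers, so the leading zero is lost.
inferred = csv_round_trip(df)

# Stating the type when reading keeps the codes as text.
explicit = csv_round_trip(df, dtype={"PostalCode": str})

print(inferred["PostalCode"].tolist())
print(explicit["PostalCode"].tolist())
```

With CSV the reader has to know the intended types; with Parquet they travel with the file.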

1.   read the orders table into a pandas dataframe
2.   write the dataframe to a csv file (exclude the index)
3.   using the shell command tail show the last 10 rows of the csv file.


In [None]:
orders = pd.read_sql("SELECT * FROM orders", conn)
orders.to_csv("orders.csv",index=False)
!tail orders.csv

1.   read the Order Details table into a pandas dataframe
2.   write the dataframe to a csv file (exclude the index)
3.   using the shell command tail show the last 10 rows of the csv file.


In [None]:
# double quotes mark an identifier in SQL, which the space in the table name requires
order_detail = pd.read_sql('SELECT * FROM "Order Details"', conn)
order_detail.to_csv("order_detail.csv",index=False)
!tail order_detail.csv

In [None]:


# create a new function for uploading to GCP cloud storage
def upload_to_storage(bucket_name, source_file_name, destination_blob_name):
    from google.cloud import storage

    storage_client = storage.Client() # create the Client.
    bucket = storage_client.bucket(bucket_name) # get the bucket instance
    blob = bucket.blob(destination_blob_name) # create a new blob

    blob.upload_from_filename(source_file_name) # upload the file

    print(f"File {source_file_name} uploaded to {destination_blob_name}.") # print the success message

# create a new function for downloading from GCP cloud storage
def download_from_storage(bucket_name, source_blob_name, destination_file_name):
    from google.cloud import storage

    storage_client = storage.Client() # create the Client.
    bucket = storage_client.bucket(bucket_name) # get the bucket instance
    blob = bucket.blob(source_blob_name) # reference the existing blob

    blob.download_to_filename(destination_file_name) # download the blob to a local file

    print(f"File {source_blob_name} downloaded to {destination_file_name}.") # print the success message


upload the three local csv files to our GCP cloud storage bucket.

In [None]:
bucket_name = "6850test1" # Replace with your bucket name
source_file_name = "customers.csv" # Replace with the path to your local file
destination_blob_name = "customers.csv" # Replace with the destination object name in the bucket

upload_to_storage(bucket_name, source_file_name, destination_blob_name)

In [None]:
bucket_name = "6850test1" # Replace with your bucket name
source_file_name = "orders.csv" # Replace with the path to your local file
destination_blob_name = "orders.csv" # Replace with the destination object name in the bucket

upload_to_storage(bucket_name, source_file_name, destination_blob_name)

In [None]:
bucket_name = "6850test1" # Replace with your bucket name
source_file_name = "order_detail.csv" # Replace with the path to your local file
destination_blob_name = "order_detail.csv" # Replace with the destination object name in the bucket

upload_to_storage(bucket_name, source_file_name, destination_blob_name)
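
The three cells above differ only in the file name, so the same work can be written as a loop. The sketch below is one possible shape for that. `upload_files` is a hypothetical helper, and it takes the upload function as an argument so that it can be tried without touching GCP. It assumes each object keeps the name of its local file.

```python
def upload_files(bucket_name, file_names, upload_fn):
    """Upload each local file to the bucket under the same object name.

    upload_fn is expected to take (bucket_name, source_file_name,
    destination_blob_name), like upload_to_storage in this notebook.
    """
    uploaded = []
    for name in file_names:
        upload_fn(bucket_name, name, name)
        uploaded.append(name)
    return uploaded
```

In this notebook the call would be `upload_files("6850test1", ["customers.csv", "orders.csv", "order_detail.csv"], upload_to_storage)`.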

Now read them back from storage. This time, don't save them to the filesystem; read them directly into pandas dataframes in memory.

In [None]:
import gcsfs

gcs = gcsfs.GCSFileSystem(project='fleet-space-407416')

In [None]:
bucket_name = '6850test1'
file_path = 'orders.csv'

# Use the gcsfs file system object to open the CSV file
with gcs.open(f'{bucket_name}/{file_path}') as file:
    orders_df = pd.read_csv(file)
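
As an alternative to opening the file through the `gcsfs` object, `pd.read_csv` also accepts `gs://` URLs when `gcsfs` is installed. The sketch below only builds such a URL. `gcs_uri` is a hypothetical helper, and the commented call assumes the same bucket, file, and credentials used above.

```python
def gcs_uri(bucket_name, file_path):
    """Build a gs:// URL for an object in a Cloud Storage bucket."""
    return f"gs://{bucket_name}/{file_path.lstrip('/')}"


# Assumed usage in this notebook (needs gcsfs and valid credentials):
# orders_df = pd.read_csv(gcs_uri("6850test1", "orders.csv"))
```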


In [None]:
orders_df.shape

In [None]:
orders_df.head()

In [None]:
# Use the gcsfs file system object to open the CSV file
file_path = 'customers.csv'

with gcs.open(f'{bucket_name}/{file_path}') as file:
    customers_df = pd.read_csv(file)

customers_df.shape

In [None]:
customers_df.head()

join the customers and orders pandas dataframes

In [None]:
customers_orders = orders_df.merge(customers_df, left_on='CustomerID', right_on='CustomerID', how='left')
customers_orders.columns

In [None]:
customers_orders.head()
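
A left join keeps every order even when its customer is missing, so it is worth checking how many orders found no match. The sketch below uses two options of `DataFrame.merge`: `indicator=True` adds a `_merge` column that records where each row came from, and `validate="many_to_one"` raises an error if a `CustomerID` appears more than once in the customers frame. `find_unmatched_orders` is a hypothetical helper, and it assumes the `OrderID` and `CustomerID` columns used in this notebook.

```python
import pandas as pd


def find_unmatched_orders(orders, customers):
    """Return the OrderIDs whose CustomerID has no row in customers."""
    merged = orders.merge(
        customers,
        on="CustomerID",
        how="left",
        validate="many_to_one",  # each customer may appear only once
        indicator=True,          # adds the _merge column
    )
    # "left_only" marks orders with no matching customer row.
    return merged.loc[merged["_merge"] == "left_only", "OrderID"].tolist()


# Example call with the dataframes loaded above:
# find_unmatched_orders(orders_df, customers_df)
```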

In [None]:
# Use the gcsfs file system object to open the CSV file
file_path = 'order_detail.csv'

with gcs.open(f'{bucket_name}/{file_path}') as file:
    order_detail_df = pd.read_csv(file)

order_detail_df.shape

In [None]:
order_detail_df.columns

join the customers_orders and order_detail_df pandas dataframes

In [None]:
customers_order_detail = customers_orders.merge(order_detail_df, left_on='OrderID', right_on='OrderID', how='left')
customers_order_detail.columns

write the new dataframe to a parquet file

In [None]:
customers_order_detail.to_parquet('customers_order_detail.parquet')

upload the parquet file to GCP cloud storage.

In [None]:
bucket_name = "6850test1" # Replace with your bucket name
source_file_name = "customers_order_detail.parquet" # Replace with the path to your local file
destination_blob_name = "customers_order_detail.parquet" # Replace with the destination object name in the bucket

upload_to_storage(bucket_name, source_file_name, destination_blob_name)