<a href="https://colab.research.google.com/github/guilhermelaviola/BusinessIntelligenceAndBigDataArchitectureWithAppliedDataScience/blob/main/Class02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Warehouse Architecture & Granularity**
Understanding modern data management involves recognizing the roles of Data Warehouses, Data Lakes, and supporting architectures in handling large-scale data. Data Warehouses store structured, historical data optimized for analysis using models such as star, snowflake, and galaxy, with data granularity carefully balanced to meet analytical and performance needs. Data Lakes complement this by storing vast amounts of raw, structured and unstructured data in a flexible format, enabling exploratory analysis. To support real-time and large-scale processing, architectures like Lambda and Kappa, along with microservices, enhance scalability, responsiveness, and maintainability. Leveraging cloud platforms such as Google Cloud further enables secure and scalable data storage and management, allowing organizations to design data architectures tailored to their specific business requirements.
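
To make the idea of granularity concrete, here is a minimal sketch that rolls the same fact rows up to two coarser grains. The `sales` rows and the `roll_up` helper are hypothetical and invented for illustration; only the standard library is used.

```python
from collections import defaultdict

# Hypothetical fact rows at the finest grain: one row per individual sale.
sales = [
    {"date": "2024-01-05", "store": "A", "amount": 120.0},
    {"date": "2024-01-05", "store": "A", "amount": 80.0},
    {"date": "2024-01-17", "store": "B", "amount": 50.0},
    {"date": "2024-02-02", "store": "A", "amount": 200.0},
]

def roll_up(rows, key_fn):
    """Sum 'amount' per group, where key_fn defines the (coarser) grain."""
    totals = defaultdict(float)
    for row in rows:
        totals[key_fn(row)] += row["amount"]
    return dict(totals)

# Daily grain per store: more rows, more detail preserved.
daily = roll_up(sales, lambda r: (r["date"], r["store"]))

# Monthly grain: fewer rows, but per-day and per-store detail is lost.
monthly = roll_up(sales, lambda r: r["date"][:7])

print(len(sales), len(daily), len(monthly))
print(monthly)
```

The coarser the grain, the smaller and faster the table, but detail discarded during aggregation cannot be recovered later. That is the trade-off to weigh when choosing the grain of a fact table.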

## **Example: Simulating a Data Lake using Colab’s local filesystem**
If the goal is learning, testing, or demonstrating the logic, local folders can stand in for Google Cloud Storage (GCS). The simulation runs entirely offline, needs no extra installs, and mirrors the bucket/blob workflow closely enough for practice.

In [2]:
import os
import shutil

BASE_PATH = '/content/data_lake'

def create_bucket(bucket_name):
    path = os.path.join(BASE_PATH, bucket_name)
    os.makedirs(path, exist_ok=True)
    print(f'{bucket_name} bucket created successfully (local).')

def upload_to_bucket(blob_name, file_path, bucket_name):
    bucket_path = os.path.join(BASE_PATH, bucket_name)
    os.makedirs(bucket_path, exist_ok=True)
    destination = os.path.join(bucket_path, blob_name)
    shutil.copy(file_path, destination)
    print(f'{file_path} file sent to {bucket_name}/{blob_name}.')
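
A simulated bucket is just a folder, so its contents can be inspected with a small listing helper. The sketch below is an assumption and not part of the original notebook: `list_blobs`, `DEMO_BASE_PATH`, and the `demo_bucket` folder are hypothetical names. A temporary directory stands in for `/content/data_lake` so that the snippet also runs outside Colab.

```python
import os
import tempfile

# Assumption: a temporary directory replaces '/content/data_lake'
# so this sketch does not depend on the Colab filesystem.
DEMO_BASE_PATH = tempfile.mkdtemp(prefix='data_lake_')

def list_blobs(bucket_name, base_path):
    """Return the sorted file names in a simulated bucket (hypothetical helper)."""
    bucket_path = os.path.join(base_path, bucket_name)
    if not os.path.isdir(bucket_path):
        return []  # A missing bucket is treated as empty.
    return sorted(
        name for name in os.listdir(bucket_path)
        if os.path.isfile(os.path.join(bucket_path, name))
    )

# Create a bucket folder with one file in it, then list the contents.
os.makedirs(os.path.join(DEMO_BASE_PATH, 'demo_bucket'), exist_ok=True)
with open(os.path.join(DEMO_BASE_PATH, 'demo_bucket', 'example.csv'), 'w') as f:
    f.write('col1;col2\n1;a\n')

print(list_blobs('demo_bucket', DEMO_BASE_PATH))
print(list_blobs('missing_bucket', DEMO_BASE_PATH))
```

Passing the base path as an argument keeps the helper independent of the notebook's global `BASE_PATH`.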

## **Example: Dependency-free Data Lake**
The following example shows how to run dependency-free by simulating Google Cloud Storage with Colab’s local filesystem while keeping the same learning intent (download and read a CSV).

In [6]:
# Creating a dummy CSV file to simulate 'student-mat.csv':
dummy_csv_content = """col1;col2;col3
1;a;x
2;b;y
3;c;z
"""
dummy_file_name = 'student-mat.csv'
with open(dummy_file_name, 'w') as f:
    f.write(dummy_csv_content)
print(f'Dummy file {dummy_file_name} created for simulation.')

# Creating the 'datalakedescomplica' bucket:
create_bucket(bucket_name='datalakedescomplica')

# Uploading the dummy file to the simulated data lake bucket:
upload_to_bucket(
    blob_name='student-mat.csv',
    file_path=dummy_file_name,
    bucket_name='datalakedescomplica'
)

Dummy file student-mat.csv created for simulation.
datalakedescomplica bucket created successfully (local).
student-mat.csv file sent to datalakedescomplica/student-mat.csv.


In [7]:
import os
import pandas as pd
import shutil

BASE_PATH = '/content/data_lake'

# Re-defining the helper functions from the earlier cell so that this
# cell runs on its own and the data lake setup is available.
def create_bucket(bucket_name):
    path = os.path.join(BASE_PATH, bucket_name)
    os.makedirs(path, exist_ok=True)
    print(f'{bucket_name} bucket created successfully (local).')

def upload_to_bucket(blob_name, file_path, bucket_name):
    bucket_path = os.path.join(BASE_PATH, bucket_name)
    os.makedirs(bucket_path, exist_ok=True) # Ensure bucket path exists before copying
    destination = os.path.join(bucket_path, blob_name)
    shutil.copy(file_path, destination)
    print(f'{file_path} file sent to {bucket_name}/{blob_name}.')

def download_blob(bucket_name, source_blob_name, destination_file_name):
    bucket_path = os.path.join(BASE_PATH, bucket_name)
    source_path = os.path.join(bucket_path, source_blob_name)

    shutil.copy(source_path, destination_file_name)
    print(f'Blob {source_blob_name} downloaded to {destination_file_name}.')

download_blob(
    bucket_name='datalakedescomplica',
    source_blob_name='student-mat.csv',
    destination_file_name='student-mat.csv'
)

def read_csv_blob(bucket_name, blob_name, **kwargs):
    file_path = os.path.join(BASE_PATH, bucket_name, blob_name)
    df = pd.read_csv(file_path, **kwargs)
    return df

df = read_csv_blob(
    'datalakedescomplica',
    'student-mat.csv',
    sep=';'
)

df.head()

Blob student-mat.csv downloaded to student-mat.csv.


|   | col1 | col2 | col3 |
|---|------|------|------|
| 0 | 1 | a | x |
| 1 | 2 | b | y |
| 2 | 3 | c | z |
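
The `sep=';'` argument matters because the dummy file separates its fields with semicolons, not commas. As a rough illustration, the sketch below parses the same content with the standard-library `csv` module, first with the default delimiter and then with `;`. It uses `csv` instead of pandas only to stay dependency-free. The variable names are hypothetical.

```python
import csv
import io

# Same content as the dummy 'student-mat.csv' created earlier.
content = "col1;col2;col3\n1;a;x\n2;b;y\n3;c;z\n"

# Default delimiter (comma): each line is read as one single field.
rows_comma = list(csv.reader(io.StringIO(content)))

# Semicolon delimiter: each line splits into three fields.
rows_semicolon = list(csv.reader(io.StringIO(content), delimiter=';'))

print(rows_comma[0])      # ['col1;col2;col3']
print(rows_semicolon[0])  # ['col1', 'col2', 'col3']
```

With the wrong delimiter the data still loads without an error, but every row collapses into one column, so checking the column count after reading is a cheap sanity test.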
