# Lab 02.1 - Introduction to object storage with S3 & MinIO

## What is object storage?

An object storage system stores data as objects in key-value stores called "buckets" or "containers". An object is made of:

- **Data**: The actual content of the file (text, image, video, etc.).
- **Metadata**: Information about the object, such as file type, creation date, size, permissions, or custom tags you define.

Each object is assigned a unique identifier (its key) within the bucket, so it can be accessed using the `bucket + key` combination.
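The model above can be illustrated with a toy, in-memory bucket. This is only a sketch of the concept, not how S3 or MinIO are implemented; the names `StoredObject` and `InMemoryBucket` and the sample key are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """An object: the raw bytes plus descriptive metadata."""
    data: bytes
    metadata: dict = field(default_factory=dict)


class InMemoryBucket:
    """Toy bucket: a flat mapping from key to object (no real storage involved)."""

    def __init__(self, name: str):
        self.name = name
        self._objects = {}

    def put(self, key: str, data: bytes, **metadata) -> None:
        # Writing to an existing key replaces the previous object
        self._objects[key] = StoredObject(data, {"size": len(data), **metadata})

    def get(self, key: str) -> StoredObject:
        return self._objects[key]


bucket = InMemoryBucket("test")
# The slashes are simply part of the key; the namespace is flat
bucket.put("reports/2024/summary.txt", b"hello", content_type="text/plain")
obj = bucket.get("reports/2024/summary.txt")
print(obj.data, obj.metadata)
```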

These stores are then replicated across multiple availability zones and/or regions to ensure high availability and fast access. Users can access these systems over HTTP/HTTPS connections, typically via dedicated APIs. These characteristics make this type of storage ideal and widely used for building data lakes for business intelligence solutions.

Among the most popular object stores is [Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html), which adds many features on top of this basic definition and offers a wide range of cheap and reliable cloud storage options. S3 has set a de facto standard for how these systems work, and other storage services have adopted the S3 API conventions for their own solutions.

However, S3 is a proprietary, pay-per-use service, which is a barrier for organizations with limited resources. In this course we will therefore use [MinIO](https://docs.min.io/community/minio-object-store/index.html), an open-source, self-hosted alternative to S3.


## 1. Connecting to a bucket

As mentioned in the previous section, object storage is accessed via APIs over an HTTP/HTTPS connection. To connect to a bucket we therefore need:

- `Endpoint`: The URL of the object storage system that manages the bucket
- `Bucket`: The bucket identifier
- `Key`: An access key that identifies the user (think of it as the user name). Not to be confused with the key of an object inside a bucket
- `Secret`: A secret known only to the user identified by `Key` (think of it as the password)
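One way to keep these four values together is a small settings object filled from environment variables. The sketch below assumes the same variable names used in the rest of this notebook (`AWS_ENDPOINT`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`); the class `S3ConnectionSettings` and the function `load_settings` are hypothetical helpers, not part of any library.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class S3ConnectionSettings:
    endpoint: str
    bucket: str
    key: str
    secret: str


def load_settings(bucket: str, env=None) -> S3ConnectionSettings:
    """Build the settings from a mapping (defaults to os.environ) and fail early if one is missing."""
    env = os.environ if env is None else env
    names = {
        "endpoint": "AWS_ENDPOINT",
        "key": "AWS_ACCESS_KEY_ID",
        "secret": "AWS_SECRET_ACCESS_KEY",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return S3ConnectionSettings(bucket=bucket, **{field: env[var] for field, var in names.items()})


# Example with an explicit mapping; the values are placeholders, not real credentials
example_env = {
    "AWS_ENDPOINT": "http://localhost:9000",
    "AWS_ACCESS_KEY_ID": "example-key",
    "AWS_SECRET_ACCESS_KEY": "example-secret",
}
settings = load_settings("test", example_env)
print(settings.endpoint, settings.bucket)
```

Failing early on a missing variable tends to give a clearer error than letting a client receive `None` for a credential.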

To abstract away the intricacies of the API, you will typically use a client to manage the connection. Amazon provides an SDK (Software Development Kit) for Python called [boto3](https://pypi.org/project/boto3/), and there are other clients such as [s3fs](https://s3fs.readthedocs.io/en/latest/), which pandas uses.


In [None]:
import os

import boto3

# Load the connection settings from environment variables. Never store your credentials in your code files!
s3 = boto3.client(
    "s3",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    region_name=os.getenv("AWS_REGION", "us-east-1"),
    endpoint_url=os.getenv("AWS_ENDPOINT")
)

## 2. Upload a file to a bucket




In [None]:
local_file_path = "data/HousePrices.csv"
s3_bucket_name = "test"
file_prefix = "notebooks/introduction_to_s3"

### 2.1 Upload using boto3


In [None]:
s3_boto_file_key = f"{file_prefix}/HousePricesBoto.csv"
s3.upload_file(Filename=local_file_path, Bucket=s3_bucket_name, Key=s3_boto_file_key)

### 2.2 Upload using pandas

In [None]:
import pandas as pd
import os

local_pandas_df = pd.read_csv(local_file_path)

s3_pandas_file_key = f"{file_prefix}/HousePricesPandas.csv"

# All s3 file urls must start with s3:// or with s3a://
s3_pandas_file_url = f"s3a://{s3_bucket_name}/{s3_pandas_file_key}"

local_pandas_df.to_csv(
    s3_pandas_file_url,
    index=False,
    storage_options={ 
        "key" : os.getenv("AWS_ACCESS_KEY_ID"),
        "secret" : os.getenv("AWS_SECRET_ACCESS_KEY"),
        "client_kwargs" : {
            "endpoint_url": os.getenv("AWS_ENDPOINT")
        },
    }
)
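The same `storage_options` dictionary is needed every time pandas reads from or writes to the bucket, so it can be built in one place. A minimal sketch, assuming the environment variables used above; `s3_storage_options` is a hypothetical helper name.

```python
import os


def s3_storage_options(env=None) -> dict:
    """Return the storage_options dictionary used by pandas in this notebook."""
    env = os.environ if env is None else env
    return {
        "key": env.get("AWS_ACCESS_KEY_ID"),
        "secret": env.get("AWS_SECRET_ACCESS_KEY"),
        "client_kwargs": {"endpoint_url": env.get("AWS_ENDPOINT")},
    }


# Placeholder values for illustration only
options = s3_storage_options({
    "AWS_ACCESS_KEY_ID": "example-key",
    "AWS_SECRET_ACCESS_KEY": "example-secret",
    "AWS_ENDPOINT": "http://localhost:9000",
})
print(options["client_kwargs"])
```

With this helper, a call could look like `local_pandas_df.to_csv(s3_pandas_file_url, index=False, storage_options=s3_storage_options())`.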

### 2.3 Upload using PySpark

In [None]:
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkWithS3Files") \
    .master("local[*]") \
    .getOrCreate()

def load_config(spark_context: SparkContext) -> None:
    """Copy the S3 connection settings from the environment into Hadoop's s3a configuration."""
    hadoop_conf = spark_context._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.access.key", os.getenv("AWS_ACCESS_KEY_ID"))
    hadoop_conf.set("fs.s3a.secret.key", os.getenv("AWS_SECRET_ACCESS_KEY"))
    hadoop_conf.set("fs.s3a.endpoint", os.getenv("AWS_ENDPOINT"))
    # The region comes from AWS_REGION, not from the endpoint URL
    hadoop_conf.set("fs.s3a.endpoint.region", os.getenv("AWS_REGION", "us-east-1"))
    hadoop_conf.set("fs.s3a.connection.ssl.enabled", "true")
    # Path-style access (endpoint/bucket/key) is what self-hosted stores such as MinIO usually expect
    hadoop_conf.set("fs.s3a.path.style.access", "true")
    # Fail fast instead of retrying for a long time when the endpoint is unreachable
    hadoop_conf.set("fs.s3a.attempts.maximum", "1")
    hadoop_conf.set("fs.s3a.connection.establish.timeout", "5000")
    hadoop_conf.set("fs.s3a.connection.timeout", "10000")

load_config(spark.sparkContext)

In [None]:
local_spark_df = spark.read.csv(
    local_file_path,   
    header=True,         
    inferSchema=True
)

# By default Spark writes a folder and places the result inside it as several part files (one per partition)
s3_spark_file_key = f"{file_prefix}/HousePricesSpark"

# With Spark (hadoop-aws connector) S3 file URLs must use the s3a:// scheme
s3_spark_file_url = f"s3a://{s3_bucket_name}/{s3_spark_file_key}"

local_spark_df.write\
    .format('csv')\
    .option("header", "true")\
    .mode("overwrite")\
    .save(s3_spark_file_url)
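Because Spark writes a folder rather than a single file, listing the keys under that prefix returns several entries: typically a `_SUCCESS` marker plus one `part-...` file per partition. The sketch below separates the data files from the rest, assuming you already have a list of keys (for example from a bucket listing). The helper name `spark_part_files` and the sample keys are made up for illustration.

```python
import posixpath


def spark_part_files(keys, folder_key):
    """Return the keys under folder_key whose file name starts with 'part-', sorted."""
    prefix = folder_key.rstrip("/") + "/"
    return sorted(
        key for key in keys
        if key.startswith(prefix) and posixpath.basename(key).startswith("part-")
    )


# Hypothetical listing; real part file names contain a generated identifier
sample_keys = [
    "notebooks/introduction_to_s3/HousePricesSpark/_SUCCESS",
    "notebooks/introduction_to_s3/HousePricesSpark/part-00001.csv",
    "notebooks/introduction_to_s3/HousePricesSpark/part-00000.csv",
    "notebooks/introduction_to_s3/HousePricesPandas.csv",
]
part_files = spark_part_files(sample_keys, "notebooks/introduction_to_s3/HousePricesSpark")
print(part_files)
```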

## 3. Read a file from a bucket

### 3.1 Read using boto3

In [None]:
import hashlib

response = s3.get_object(Bucket=s3_bucket_name, Key=s3_boto_file_key)


with open(local_file_path, "rb") as f:
    local_file_data = f.read()

s3_file_data = response["Body"].read()

# check that the hash of the local file and the s3 file matches
assert hashlib.sha256(local_file_data).hexdigest() == hashlib.sha256(s3_file_data).hexdigest()
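The comparison above loads both copies fully into memory, which is fine for a small CSV. For larger objects the hash can be computed in chunks instead. A minimal sketch, assuming the source is any file-like object with a `read(size)` method (an open file, or the `Body` of a `get_object` response); `sha256_of_stream` is a hypothetical helper and the payload is sample data.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=1024 * 1024):
    """Hash a file-like object chunk by chunk and return the hex digest."""
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object means the stream is exhausted
            break
        digest.update(chunk)
    return digest.hexdigest()


# In-memory stand-in for a file or a response body
payload = b"price,rooms\n100000,3\n250000,5\n"
streamed_hash = sha256_of_stream(io.BytesIO(payload), chunk_size=8)
print(streamed_hash == hashlib.sha256(payload).hexdigest())
```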


### 3.2 Read using pandas

In [None]:
s3_pandas_df = pd.read_csv(
    s3_pandas_file_url, 
    storage_options={ 
        "key" : os.getenv("AWS_ACCESS_KEY_ID"),
        "secret" : os.getenv("AWS_SECRET_ACCESS_KEY"),
        #"region" : os.getenv("AWS_REGION"),
        "client_kwargs" : {'endpoint_url': os.getenv("AWS_ENDPOINT")},
    }
)

# Check that the dataframe was saved correctly
assert len(local_pandas_df.columns) == len(s3_pandas_df.columns)
assert len(local_pandas_df) == len(s3_pandas_df)

### 3.3 Read using PySpark

In [None]:
s3_spark_df = spark.read.csv(
    s3_spark_file_url,
    header=True,      # Use first row as column names
    inferSchema=True  # Automatically detect data types
)

assert local_spark_df.count() == s3_spark_df.count()