# MinIO

MinIO is an S3-compatible object storage system. In our architecture it is deployed in non-distributed mode, so features that provide high availability and redundant storage are not available.

MinIO can be used through a CLI (not available in our architecture) and through a [web UI](http://localhost:9090/login).

### Credentials
- User: admin
- Password: password


### Creating buckets

For PySpark to store managed tables, a bucket must exist to hold the table data.
The spark-defaults.conf file, located in the $SPARK_HOME/conf/ directory, contains some default settings.

Among them is
> spark.sql.warehouse.dir s3a://warehouse/

This bucket is not created by default, so create it with the following steps:

- Open the [web UI](http://localhost:9090/login) (user: _admin_ | password: _password_)
- In the left sidebar, click the _Buckets_ button
- On the right side, click _Create Bucket +_
- Set the bucket name to _warehouse_ and then click _Create Bucket_

Repeat these steps to create as many buckets as you like.
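
To double-check which bucket Spark expects for managed tables, the `spark.sql.warehouse.dir` entry can be read directly from `spark-defaults.conf`. The sketch below is a minimal parser under stated assumptions: it only handles the whitespace-separated `key value` form quoted above plus `#` comments, and the helper names (`parse_spark_defaults`, `warehouse_bucket`) and the sample text are hypothetical.

```python
from urllib.parse import urlsplit


def parse_spark_defaults(text):
    """Parse 'key value' lines, skipping blank lines and # comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        conf[key] = value
    return conf


def warehouse_bucket(conf):
    """Return the bucket named in spark.sql.warehouse.dir, or None if unset."""
    uri = conf.get("spark.sql.warehouse.dir")
    if uri is None:
        return None
    return urlsplit(uri).netloc


# Hypothetical file content, modeled on the entry quoted above
sample = """
# Example:
# spark.master    spark://master:7077
spark.sql.warehouse.dir s3a://warehouse/
"""
conf = parse_spark_defaults(sample)
print(warehouse_bucket(conf))  # warehouse
```

To run it against the real file, pass the contents of `$SPARK_HOME/conf/spark-defaults.conf` to `parse_spark_defaults` instead of `sample`.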

### Exploring buckets

- Click _Object Browser_ to access the buckets that have been created and their prefixes/objects.
- Once inside a bucket, you can also create new prefixes/subfolders by clicking _Create new path_
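
The prefixes/subfolders shown in the Object Browser are not real directories: an S3-compatible store keeps a flat set of object keys, and a "folder" is just the part of a key before a `/`. The sketch below illustrates that grouping in plain Python. The key list and the `list_children` helper are made up for illustration and do not call MinIO.

```python
def list_children(keys, prefix=""):
    """Return the immediate sub-prefixes and objects under `prefix`.

    Mimics how an object browser groups a flat key listing by '/'.
    """
    folders, objects = set(), set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if not rest:
            continue  # the prefix marker itself, nothing to show
        head, sep, _ = rest.partition("/")
        if sep:
            folders.add(prefix + head + "/")
        else:
            objects.add(key)
    return sorted(folders), sorted(objects)


# Hypothetical keys inside a single bucket
keys = [
    "input/salaries/data.csv",
    "output/part-0000.parquet",
    "readme.txt",
]

print(list_children(keys))
print(list_children(keys, "input/"))
```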

# Experiment :: Reading from and writing to MinIO with PySpark

- We will use the Kaggle dataset [Data Science Salaries 2023](https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023). Download it, unzip it, and set aside ds_salaries.csv

- Create a bucket in MinIO named _my-bucket_ with the sub-paths _input_, _input/salaries_, and _output_

> s3a://my-bucket/

> s3a://my-bucket/input/

> s3a://my-bucket/input/salaries

> s3a://my-bucket/output

Use the Object Browser to open the path _input/salaries_, click Upload File, and select _ds_salaries.csv_
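
Before uploading, it may help to confirm locally that the CSV header lines up with the column names used in the schema later in this notebook, since by default Spark applies an explicit schema to CSV columns by position rather than by name. The following is only a sketch: `compare_header` is a hypothetical helper, and the sample data is a stand-in rather than the real file.

```python
import csv
import io

# Column names taken from the schema defined later in this notebook
EXPECTED_COLUMNS = [
    "work_year", "experience_level", "employment_type", "job_title",
    "salary", "salary_currency", "salary_in_usd", "employee_residence",
    "remote_ratio", "company_location",
]


def compare_header(csv_file, expected):
    """Read the first CSV row and compare it with the expected names.

    Returns (leading_columns_match, missing, extra).
    """
    header = next(csv.reader(csv_file), [])
    leading = header[:len(expected)]
    missing = [c for c in expected if c not in header]
    extra = [c for c in header if c not in expected]
    return leading == expected, missing, extra


# Stand-in data; for the real file use open("ds_salaries.csv", newline="")
sample = io.StringIO(
    "work_year,experience_level,employment_type,job_title,salary,"
    "salary_currency,salary_in_usd,employee_residence,remote_ratio,"
    "company_location,extra_column\n"
    "2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,x\n"
)
print(compare_header(sample, EXPECTED_COLUMNS))
```

A result whose first element is `True` means the leading columns are in the order the schema assumes; any names listed as extra are columns the schema does not declare.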

## Creating the Spark Session

The settings are available in $SPARK_HOME/conf/spark-defaults.conf

In [18]:
# Run to view spark-defaults.conf
#!cat $SPARK_HOME/conf/spark-defaults.conf

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# Example:
# spark.master                     spark://master:7077
# spark.eventLog.enabled 
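
For `s3a://` paths to resolve to MinIO rather than AWS, the Hadoop S3A connector needs an endpoint and credentials. The output above is cut off before any such entries, so the following is only a sketch of what they typically look like. The property names are standard Hadoop S3A options, while the endpoint `http://minio:9000`, the reuse of the web UI credentials as access and secret keys, and the helper `minio_s3a_conf` are assumptions rather than values taken from this environment.

```python
def minio_s3a_conf(endpoint, access_key, secret_key):
    """Build Spark properties that point the S3A connector at MinIO."""
    use_ssl = endpoint.startswith("https://")
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO is usually addressed as host/bucket, not bucket.host
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
    }


# Assumed endpoint and credentials, for illustration only
conf = minio_s3a_conf("http://minio:9000", "admin", "password")

# Render the properties as spark-defaults.conf lines
for key, value in conf.items():
    print(f"{key} {value}")
```

The same key/value pairs could instead be passed one by one to `SparkSession.builder.config(key, value)` when the session is created.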

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, IntegerType, DoubleType
from pyspark.sql.functions import col

spark = SparkSession\
    .builder\
    .appName("Teste de Leitura e Escrita com Minio")\
    .getOrCreate()

spark

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/08/16 22:59:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/08/16 22:59:08 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties



With the session created, we can access the Spark UIs for the Spark Session (driver), the Spark Master, and the Workers
### [Spark Session](http://127.0.0.1:4040)  
> ### [Spark Master](http://127.0.0.1:5050)  
>> |-- [Spark Worker A](http://127.0.0.1:5051)  
>> |-- [Spark Worker B](http://127.0.0.1:5052)

With the sample dataset file in the MinIO path, we can use Spark to build a DataFrame from the data

In [2]:
path = "s3a://my-bucket/input/salaries/ds_salaries.csv"

# Define the schema
schema = StructType() \
    .add('work_year', IntegerType(), True)\
    .add('experience_level', StringType(), True)\
    .add('employment_type', StringType(), True)\
    .add('job_title', StringType(), True)\
    .add('salary', DoubleType(), True)\
    .add('salary_currency', StringType(), True)\
    .add('salary_in_usd', DoubleType(), True)\
    .add('employee_residence', StringType(), True)\
    .add('remote_ratio', DoubleType(), True)\
    .add('company_location', StringType(), True)

# Read the data
df = spark.read.csv(path = path, schema = schema, header = True)

# DataFrame schema
df.printSchema()

# Show the first 5 rows
df.show(5)

23/08/16 22:59:37 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: s3a://my-bucket/input/salaries/ds_salaries.csv.
org.apache.hadoop.fs.s3a.UnknownStoreException: `s3a://my-bucket/input/salaries/ds_salaries.csv': getFileStatus on s3a://my-bucket/input/salaries/ds_salaries.csv: com.amazonaws.services.s3.model.AmazonS3Exception: The specified bucket does not exist (Service: Amazon S3; Status Code: 404; Error Code: NoSuchBucket; Request ID: 177BFF49E5DBD6A4; S3 Extended Request ID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8; Proxy: null), S3 Extended Request ID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8:NoSuchBucket: The specified bucket does not exist (Service: Amazon S3; Status Code: 404; Error Code: NoSuchBucket; Request ID: 177BFF49E5DBD6A4; S3 Extended Request ID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8; Proxy: null)
	at org.apache.hadoop.fs.s3a.S3AUtils.tra

Py4JJavaError: An error occurred while calling o58.csv.
: org.apache.hadoop.fs.s3a.UnknownStoreException: `s3a://my-bucket/input/salaries/ds_salaries.csv': getFileStatus on s3a://my-bucket/input/salaries/ds_salaries.csv: com.amazonaws.services.s3.model.AmazonS3Exception: The specified bucket does not exist (Service: Amazon S3; Status Code: 404; Error Code: NoSuchBucket; Request ID: 177BFF49ECCBE9A4; S3 Extended Request ID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8; Proxy: null), S3 Extended Request ID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8:NoSuchBucket: The specified bucket does not exist (Service: Amazon S3; Status Code: 404; Error Code: NoSuchBucket; Request ID: 177BFF49ECCBE9A4; S3 Extended Request ID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8; Proxy: null)
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:263)
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:175)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3858)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$exists$34(S3AFileSystem.java:4703)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2356)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:4701)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4(DataSource.scala:784)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4$adapted(DataSource.scala:782)
	at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372)
	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
	at scala.util.Success.$anonfun$map$1(Try.scala:255)
	at scala.util.Success.map(Try.scala:213)
	at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: The specified bucket does not exist (Service: Amazon S3; Status Code: 404; Error Code: NoSuchBucket; Request ID: 177BFF49ECCBE9A4; S3 Extended Request ID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8; Proxy: null), S3 Extended Request ID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1879)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1418)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1387)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5403)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5397)
	at com.amazonaws.services.s3.AmazonS3Client.listObjectsV2(AmazonS3Client.java:971)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$listObjects$11(S3AFileSystem.java:2595)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:414)
	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:377)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.listObjects(S3AFileSystem.java:2586)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3832)
	... 23 more
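
The `NoSuchBucket` error in the trace above indicates that the bucket named in the path did not exist when the cell ran, so `my-bucket` has to be created (and the file uploaded) before re-running it. As a quick way to see which part of an `s3a://` path is the bucket and which is the object key, here is a small sketch using only the standard library; `split_s3a` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def split_s3a(uri):
    """Split an s3a:// URI into (bucket, key)."""
    parts = urlsplit(uri)
    if parts.scheme != "s3a" or not parts.netloc:
        raise ValueError(f"not an s3a URI: {uri!r}")
    return parts.netloc, parts.path.lstrip("/")


bucket, key = split_s3a("s3a://my-bucket/input/salaries/ds_salaries.csv")
print(bucket)  # my-bucket
print(key)     # input/salaries/ds_salaries.csv
```

The bucket name printed here is the one that must appear under _Buckets_ in the MinIO web UI.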


With the DataFrame loaded, let's apply some changes, such as filtering for people who are paid in USD

In [12]:
df.where(col('salary_currency') == 'USD').show(5)

+---------+----------------+---------------+-----------------+--------+---------------+-------------+------------------+------------+----------------+
|work_year|experience_level|employment_type|        job_title|  salary|salary_currency|salary_in_usd|employee_residence|remote_ratio|company_location|
+---------+----------------+---------------+-----------------+--------+---------------+-------------+------------------+------------+----------------+
|     2023|              MI|             CT|      ML Engineer| 30000.0|            USD|      30000.0|                US|       100.0|              US|
|     2023|              MI|             CT|      ML Engineer| 25500.0|            USD|      25500.0|                US|       100.0|              US|
|     2023|              SE|             FT|   Data Scientist|175000.0|            USD|     175000.0|                CA|       100.0|              CA|
|     2023|              SE|             FT|   Data Scientist|120000.0|            USD|     12

Now let's write the data back to ~s3~ MinIO, partitioned and overwriting any data already at that path.

In [13]:
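# Write only the USD rows, matching the salaries_usd output path
df.where(col('salary_currency') == 'USD').write\
    .option("compression", "snappy")\
    .mode('overwrite')\
    .partitionBy('employment_type')\
    .format('parquet')\
    .save(path = "s3a://my-bucket/output/salaries_usd/")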

                                                                                

Open the MinIO [web UI](http://localhost:9090/login) and check that the partitions were written correctly.
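
When checking the result in the Object Browser, the prefixes to look for follow Spark's `column=value` layout for `partitionBy`: one sub-prefix per distinct `employment_type`, each holding Parquet part files. The sketch below only builds those expected prefixes. `CT` and `FT` are the values visible in the sample output earlier, other values in the dataset would add further prefixes, and `expected_partition_prefixes` is a hypothetical helper.

```python
def expected_partition_prefixes(base_path, column, values):
    """Return the column=value prefixes expected under base_path."""
    base = base_path.rstrip("/")
    return [f"{base}/{column}={value}/" for value in sorted(set(values))]


prefixes = expected_partition_prefixes(
    "s3a://my-bucket/output/salaries_usd/", "employment_type", ["FT", "CT", "FT"]
)
for prefix in prefixes:
    print(prefix)
# s3a://my-bucket/output/salaries_usd/employment_type=CT/
# s3a://my-bucket/output/salaries_usd/employment_type=FT/
```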

In [28]:
spark.stop()