# Nemesys 101 - Nemesys basic tutorial

In this tutorial we use the Nemesys Data Platform to download a sales file in xlsx format and then use Spark to build a [Medallion](https://www.databricks.com/glossary/medallion-architecture) architecture (with bronze, silver, and gold layers).

<img src='https://cms.databricks.com/sites/default/files/inline-images/building-data-pipelines-with-delta-lake-120823.png' width='800px' height='600px'>

## Getting the data
The data is downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/352/online+retail) using ```curl``` and then unpacked with ```unzip```.

In [1]:
%%bash
# Remove any previous bronze table to save disk space.
rm -Rf lakehouse/bronze/vendas

In [2]:
%%bash

# Remove any previous download so the cell can be re-run safely.
rm -f lakehouse/landing/online_retail.zip "lakehouse/landing/Online Retail.xlsx"

# Download the archive and extract it into the landing area.
curl https://archive.ics.uci.edu/static/public/352/online+retail.zip -o lakehouse/landing/online_retail.zip
unzip lakehouse/landing/online_retail.zip -d lakehouse/landing

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 22.6M    0 22.6M    0     0  7933k      0 --:--:--  0:00:02 --:--:-- 7931k


Archive:  lakehouse/landing/online_retail.zip
 extracting: lakehouse/landing/Online Retail.xlsx  
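
The same download-and-extract step can also be done from Python. The following is a minimal sketch using only the standard library; the function names ```download``` and ```extract``` are hypothetical helpers introduced here, while the URL and paths are the ones used in the bash cell above.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL and landing directory taken from the bash cell above.
DATA_URL = "https://archive.ics.uci.edu/static/public/352/online+retail.zip"
LANDING_DIR = Path("lakehouse/landing")


def download(url: str, target: Path) -> Path:
    """Download `url` to `target`, overwriting any previous copy."""
    target.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, target)
    return target


def extract(archive: Path, dest: Path) -> list[str]:
    """Extract every member of `archive` into `dest` and return their names."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return zf.namelist()


# Usage (performs a network request, so it is left commented out here):
# archive = download(DATA_URL, LANDING_DIR / "online_retail.zip")
# print(extract(archive, LANDING_DIR))
```

Because ```download``` overwrites its target, no explicit cleanup of the previous archive is needed in this variant.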


## Creating and configuring the Spark instance

In [3]:
import pyspark.sql.functions
from pyspark.sql import SparkSession
from delta import *

# Configure a Spark session with the Delta Lake extension and catalog enabled.
builder = (SparkSession.builder
         .appName("Nemesys-101")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         # Use Arrow to speed up pandas <-> Spark conversions.
         .config("spark.sql.execution.arrow.pyspark.enabled", "true")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

## Creating the Bronze layer
First, the spreadsheet is loaded into a Polars DataFrame, reading ```InvoiceNo``` as a string.

In [4]:
%%time
import polars as pl
# Read InvoiceNo as text: some codes carry a letter prefix (cancellations).
pdf = pl.read_excel("lakehouse/landing/Online Retail.xlsx", sheet_name='Online Retail', engine='calamine', schema_overrides={"InvoiceNo": pl.String})

CPU times: user 6.85 s, sys: 606 ms, total: 7.46 s
Wall time: 7.29 s


After loading, the Polars ```DataFrame``` is converted to pandas and then to a ```PySpark``` ```DataFrame```. The cells below print the schema of the resulting ```DataFrame``` and then the number of rows processed.

In [5]:
%%time
df = spark.createDataFrame(pdf.to_pandas())
df.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: long (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)

CPU times: user 832 ms, sys: 267 ms, total: 1.1 s
Wall time: 4.64 s


Where:
- __InvoiceNo__: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. _If this code starts with letter 'c', it indicates a cancellation_.
- __StockCode__: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- __Description__: Product (item) name. Nominal.
- __Quantity__: The quantities of each product (item) per transaction. Numeric.	
- __InvoiceDate__: Invoice Date and time. Numeric, the day and time when each transaction was generated.
- __UnitPrice__: Unit price. Numeric, Product price per unit in sterling.
- __CustomerID__: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- __Country__: Country name. Nominal, the name of the country where each customer resides. 
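
The cancellation rule for ```InvoiceNo``` is also the reason the column is kept as a string: a letter prefix cannot be stored in an integer column. As a minimal sketch, the rule can be written as a plain Python predicate. The helper name ```is_cancellation``` is hypothetical, and matching both 'c' and 'C' is an assumption of this sketch, since the description above only mentions the letter 'c'.

```python
def is_cancellation(invoice_no: str) -> bool:
    """Return True when the invoice code marks a cancellation.

    Case-insensitive on purpose: the column description says 'c';
    accepting 'C' as well is an assumption of this sketch.
    """
    return invoice_no.strip().upper().startswith("C")


# "536365" appears in the sample rows; "C536365" is a made-up
# cancellation code used only for illustration.
for code in ["536365", "C536365"]:
    print(code, is_cancellation(code))
```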

In [6]:
%%time
df.count()

CPU times: user 1.93 ms, sys: 7.72 ms, total: 9.65 ms
Wall time: 4.67 s


541909

In [7]:
df.show()

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|    22752|SET 7 BABUSHKA NE...|       2|2010-12-01 08:26:00|     7.65|   17850.0|United Kingdom|
|   536365|    21730|GLASS S

### Persisting the bronze layer
After loading, the sales table in the bronze layer is written as a ```delta table```, which enables a range of high-performance operations.
Before writing, however, the previous data is deleted (the ```rm -Rf lakehouse/bronze/vendas``` in the first cell), purely to save disk space.
In normal operation, _unused files_ are deleted __automatically__ after a certain period of time.

In [8]:
%%time
df.write.format("delta").mode("overwrite").save("lakehouse/bronze/vendas")

CPU times: user 11.9 ms, sys: 899 µs, total: 12.8 ms
Wall time: 15.4 s
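
To make the idea of _unused files_ concrete: a delta table records every write in JSON commit files under ```_delta_log```, where each line is an action such as ```add``` or ```remove```. The sketch below replays those commits to list which data files are still referenced and which have been superseded. It is a simplified illustration, not part of the tutorial's pipeline: the function name ```replay_delta_log``` is hypothetical, and checkpoint files are ignored, so it assumes the full history is still available as JSON commits.

```python
import json
from pathlib import Path


def replay_delta_log(table_path: str) -> tuple[set[str], set[str]]:
    """Replay JSON commits and return (active_files, removed_files).

    Simplification: checkpoint (.parquet) files are ignored, so this only
    works while the full history is still present as JSON commits.
    """
    active: set[str] = set()
    removed: set[str] = set()
    log_dir = Path(table_path) / "_delta_log"
    # Commit files are named with zero-padded version numbers, so sorting
    # by name replays them in version order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                path = action["add"]["path"]
                active.add(path)
                removed.discard(path)
            elif "remove" in action:
                path = action["remove"]["path"]
                active.discard(path)
                removed.add(path)
    return active, removed


# Usage, assuming the bronze table written above exists:
# active, removed = replay_delta_log("lakehouse/bronze/vendas")
# print(len(active), "active files;", len(removed), "superseded files")
```

After a single write to an empty location, the second set should be empty; it only grows once the table is overwritten or rewritten in place.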


## Size comparison
Comparison between the size of the Excel file (.xlsx) and the same data written in ```delta table``` format.

In [9]:
%%bash

du -h lakehouse/landing/*.xlsx
du -h lakehouse/bronze/vendas

23M	lakehouse/landing/Online Retail.xlsx
4.0K	lakehouse/bronze/vendas/_delta_log/_commits
60K	lakehouse/bronze/vendas/_delta_log
5.7M	lakehouse/bronze/vendas
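
The same comparison can be expressed as a ratio. The sketch below is a rough Python counterpart to ```du```, with hypothetical helper names ```dir_size_bytes``` and ```size_ratio```. Note that it sums file sizes, whereas ```du``` reports disk usage, so the numbers may differ slightly.

```python
import os


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under `path` (or of `path` itself)."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def size_ratio(original: str, converted: str) -> float:
    """Return how many times larger `original` is than `converted`."""
    return dir_size_bytes(original) / dir_size_bytes(converted)


# Usage, assuming the files produced by the cells above exist:
# print(size_ratio("lakehouse/landing/Online Retail.xlsx", "lakehouse/bronze/vendas"))
```

With the figures reported above (23M for the spreadsheet and 5.7M for the table, transaction log included), the ratio is roughly 4, so the delta table takes about a quarter of the space.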
