## Setup

Development environment setup for the ETL (Extract, Transform, Load) of all the indices we prepared.

### Loading libraries

Import all the required libraries and create the Spark session.

In [1]:
from pyspark.sql import SparkSession
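# Explicit imports avoid shadowing Python built-ins such as sum, min, and max
from pyspark.sql.types import DateType, FloatType, StringType, StructField, StructType
from pyspark.sql.functions import col, regexp_replace, to_date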
import pandas as pd
import numpy as np

# Create a Spark session
spark = SparkSession.builder.master("local").appName("PySpark Tutorial").getOrCreate()

# Verify Spark version
print("Spark version: ", spark.version)

Spark version:  3.5.4


### Defining the data structure

Map the fields that will be received and assign a primitive type to each one, normalize all column names by removing accents and special characters, and define the partitioning fields.

In [2]:

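# Target type for each source column. This schema is not passed to spark.read.csv,
# so the columns are read as strings and cast in the Process step.
schema_investing_fields = StructType([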
    StructField("Data", DateType(), True),
    StructField("Último", FloatType(), True),
    StructField("Abertura", FloatType(), True),
    StructField("Máxima", FloatType(), True),
    StructField("Mínima", FloatType(), True),
    StructField("Vol.", StringType(), True),
    StructField("Var%", StringType(), True),
])

columns_to_float = ['ultimo', 'abertura', 'maxima', 'minima']


rename_fields = {
    "Data": "data",
    "Último": "ultimo",
    "Abertura": "abertura",
    "Máxima": "maxima",
    "Mínima": "minima",
    "Vol.": "volume",
    "Var%": "variacao"
}

partitions = ['category', 'item']
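
The `rename_fields` mapping above is written by hand. As a minimal sketch of the normalization it describes (accents and special characters removed), the same idea can be expressed with the standard library. The helper `normalize_column_name` is hypothetical and not part of the notebook. Note that it only strips characters, so it yields `vol` and `var` where the hand-written mapping expands the names to `volume` and `variacao`.

```python
import re
import unicodedata


def normalize_column_name(name: str) -> str:
    """Lowercase a header, strip accents, and drop non-alphanumeric characters."""
    # NFKD splits an accented letter into the base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Collapse every run of other characters into "_" and trim the edges
    return re.sub(r"[^0-9a-z]+", "_", without_accents.lower()).strip("_")


source_columns = ["Data", "Último", "Abertura", "Máxima", "Mínima", "Vol.", "Var%"]
auto_rename = {name: normalize_column_name(name) for name in source_columns}
print(auto_rename)
```

A hand-written dictionary remains the better choice when the target names should be more descriptive than the source headers, as is the case here.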

## Read

Map the paths used for data processing and read the files using Spark partitioning.

In [3]:
INPUT_PATH = '/home/lucas-nunes/workspace/Postech/challenges/2_ibov/data/bronze/source_investing/'
INPUT_PATH_SAMPLE = '/home/lucas-nunes/workspace/Postech/challenges/2_ibov/input/data/source_investing/category=commodities/item=cobre/Dados Históricos - Cobre Futuros.csv'

SILVER_PATH = '/home/lucas-nunes/workspace/Postech/challenges/2_ibov/data/silver'
BRONZE_PATH = '/home/lucas-nunes/workspace/Postech/challenges/2_ibov/data/bronze'

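# No schema is passed, so every CSV column is read as a string.
# The category=/item= directory names under INPUT_PATH become columns.
df = spark.read.csv(INPUT_PATH, header=True)

# Apply the normalized column names
df = df.withColumnsRenamed(rename_fields)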
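
Spark's partition discovery turns `key=value` directory names into columns, which is presumably where the `category` and `item` columns in the final output come from. The plain-Python sketch below only illustrates that mapping; it is not Spark's implementation. The helper `partition_values` is hypothetical, and `sample` is a shortened, relative form of `INPUT_PATH_SAMPLE` used for illustration.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict:
    """Collect key=value directory names from a Hive-style partitioned path."""
    values = {}
    # Skip the last component, which is the file name
    for part in PurePosixPath(path).parts[:-1]:
        key, sep, value = part.partition("=")
        if sep and key and value:
            values[key] = value
    return values


# Shortened version of INPUT_PATH_SAMPLE (assumption: same directory layout)
sample = "source_investing/category=commodities/item=cobre/Dados Históricos - Cobre Futuros.csv"
print(partition_values(sample))  # {'category': 'commodities', 'item': 'cobre'}
```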

                                                                                

## Process

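Clean the values and columns: convert Brazilian number formatting to floats, strip the percent sign, and strip the thousands abbreviation "K" from the volume. The millions abbreviation "M" is not handled by the code below, so those volumes end up null.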

In [4]:
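for column in columns_to_float:
    # Brazilian format "1.234,56": drop the thousands dots, then turn the decimal comma into a dot
    df = df.withColumn(column, regexp_replace(regexp_replace(column, r'\.', ''), ',', '.').cast('float'))

# Strip the percent sign from the daily change
df = df.withColumn('variacao', regexp_replace(regexp_replace('variacao', '%', ''), ',', '.').cast('float'))
# Only the "K" suffix is stripped; a value with any other suffix (such as "M") fails the cast and becomes null
df = df.withColumn('volume', regexp_replace(regexp_replace('volume', 'K', ''), ',', '.').cast('float'))
df = df.withColumn('data', to_date(col('data'), 'dd.MM.yyyy'))
# Keep one row per date and item
df = df.drop_duplicates(subset=['data', 'item'])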
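
The cleaning rules in this step can be sketched in plain Python, including one possible way to scale the volume by its suffix instead of only stripping it. This is a sketch under assumptions, not the notebook's method: the helper names are hypothetical, the "B" (billions) multiplier is an assumption since the text names only "K" and "M", and the sample strings are made up for illustration.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumed multipliers; "B" is an assumption, the text names only K and M
SUFFIX_MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_br_number(text: Optional[str]) -> Optional[float]:
    """Convert '1.234,56' (dot for thousands, comma for decimals) to 1234.56."""
    if text is None or not text.strip():
        return None
    cleaned = text.strip().replace(".", "").replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


def parse_percent(text: Optional[str]) -> Optional[float]:
    """Convert '-0,91%' to -0.91."""
    return None if text is None else parse_br_number(text.replace("%", ""))


def parse_volume(text: Optional[str]) -> Optional[float]:
    """Convert '12,5K' to 12500.0 by scaling with the suffix multiplier."""
    if text is None:
        return None
    match = re.fullmatch(r"\s*([\d.,]+)\s*([KMB]?)\s*", text)
    if match is None:
        return None
    number = parse_br_number(match.group(1))
    if number is None:
        return None
    return number * SUFFIX_MULTIPLIERS.get(match.group(2), 1.0)


def parse_date(text: str) -> date:
    """Parse the 'dd.MM.yyyy' format used by the source files."""
    return datetime.strptime(text, "%d.%m.%Y").date()


print(parse_br_number("1.234,56"))  # 1234.56
print(parse_percent("-0,91%"))      # -0.91
print(parse_volume("12,5K"))        # 12500.0
print(parse_volume("1,5M"))         # 1500000.0
print(parse_date("20.03.2025"))     # 2025-03-20
```

Scaling by the suffix keeps all volumes in the same unit, whereas stripping the letter alone would mix thousands and millions in one column.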


## Write

### Writing the processed bronze-tier data, concatenated and cleaned, to the silver tier

In [5]:
# df.toPandas().to_csv(f'{SILVER_PATH}/silver.csv')
df.toPandas().to_parquet(f'{SILVER_PATH}/silver.parquet')


In [6]:
df = pd.read_parquet(f'{SILVER_PATH}/silver.parquet')

In [7]:
df

| | data | ultimo | abertura | maxima | minima | volume | variacao | category | item |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1980-01-02 | 30.049999 | 30.049999 | 30.049999 | 30.049999 | | 3.44 | commodities | prata |
| 1 | 1980-01-03 | 31.049999 | 31.049999 | 31.049999 | 31.049999 | | 3.33 | commodities | prata |
| 2 | 1980-01-04 | 32.049999 | 32.049999 | 32.049999 | 32.049999 | | 3.22 | commodities | prata |
| 3 | 1980-01-07 | 33.049999 | 33.049999 | 33.049999 | 33.049999 | | 3.12 | commodities | prata |
| 4 | 1980-01-08 | 32.750000 | 33.974998 | 34.049999 | 32.500000 | | -0.91 | commodities | prata |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 205266 | 2025-03-17 | 130834.000000 | 128959.000000 | 131313.000000 | 128957.000000 | | 1.46 | index | ibov |
| 205267 | 2025-03-18 | 131475.000000 | 130832.000000 | 131834.000000 | 130722.000000 | | 0.49 | index | ibov |
| 205268 | 2025-03-19 | 132508.000000 | 131476.000000 | 132984.000000 | 131451.000000 | | 0.79 | index | ibov |
| 205269 | 2025-03-20 | 131955.000000 | 132505.000000 | 132713.000000 | 131813.000000 | | -0.42 | index | ibov |
