# MVP – Bronze layer (ingestion of raw Synthea data)

This step implements the Bronze layer of the MVP pipeline, which is responsible for:
- reading the raw Synthea CSV files stored in the `staging.synthea_raw` volume,
- applying the minimal typing needed for technical consistency (e.g., timestamps),
- persisting the tables as Delta tables in the `bronze` schema,
- registering metadata (column comments),
- running basic quality checks (nulls, keys, and temporal consistency).


In [0]:
%sql
-- Select the MVP catalog and the Bronze schema
USE CATALOG mvp_engenharia_de_dados;
USE SCHEMA bronze;

In [0]:
# Standard imports used in this notebook
from pyspark.sql import functions as F
from pyspark.sql.functions import try_to_timestamp


In [0]:
# Base path of the raw Synthea files
base_path = "/Volumes/mvp_engenharia_de_dados/staging/synthea_raw"


In [0]:

# Read patients.csv
df_patients = (
    spark.read
         .format("csv")
         .option("header", "true")
         .option("inferSchema", "true")
         .load(f"{base_path}/patients.csv")
)

# Preview the first rows and the inferred schema
display(df_patients.limit(10))
df_patients.printSchema()



Id,BIRTHDATE,DEATHDATE,SSN,DRIVERS,PASSPORT,PREFIX,FIRST,MIDDLE,LAST,SUFFIX,MAIDEN,MARITAL,RACE,ETHNICITY,GENDER,BIRTHPLACE,ADDRESS,CITY,STATE,COUNTY,FIPS,ZIP,LAT,LON,HEALTHCARE_EXPENSES,HEALTHCARE_COVERAGE,INCOME
cfb6239f-9bc0-8b4f-58cb-0789a9da2f7f,2016-10-24,,999-90-8784,,,,Margareta320,Aracelis412,Yundt842,,,,white,nonhispanic,F,Amherst Massachusetts US,694 White Wynd,Charlton,Massachusetts,Worcester County,,0,42.15291672803789,-71.93639278709503,11793.44,9108.57,141848
9c43b243-6ce0-bb9a-52f1-ae426dc840a1,2002-12-15,,999-65-8958,S99992259,X88257069X,Ms.,Marlyn309,Trudi580,Reichel38,,,,white,nonhispanic,F,Lawrence Massachusetts US,997 Mante Wall Apt 36,Plainville,Massachusetts,Norfolk County,,0,41.99993136128743,-71.34072771832133,113194.6,95703.64,123743
f044c345-b58f-4304-8120-f16ee25c3552,2004-06-15,,999-25-4282,S99952238,X20988121X,Ms.,Keith571,,Stokes453,,,,white,nonhispanic,F,Chelsea Massachusetts US,536 Corkery Wynd,Chicopee,Massachusetts,Hampden County,25013.0,1013,42.17341424796212,-72.58680299375206,125715.35,174703.21,45223
04a6d0fd-dd6d-94a9-984a-36907ffd5dc9,1973-10-08,,999-93-2159,S99973296,X41513019X,Mrs.,Ronda430,Ka422,Boyer713,,Bosco882,M,black,nonhispanic,F,Monson Center Massachusetts US,152 Cremin Well Apt 5,Woburn,Massachusetts,Middlesex County,25017.0,1890,42.52811910518509,-71.11147783658413,609653.19,131054.58,102242
3e9dd1f5-c9fc-4015-648e-0c79efb02594,1972-02-26,,999-39-6039,S99945641,X43098897X,Mrs.,Angelic427,Renata373,Vandervort697,,Hickle134,M,white,nonhispanic,F,Russell Massachusetts US,128 Kuvalis Terrace Suite 52,Oak Bluffs,Massachusetts,Dukes County,,0,41.42922569053695,-70.5329668542055,932534.22,299132.62,103003
6276e83f-636c-70e8-aaae-27a74d204ee3,1988-09-17,,999-36-4670,S99991265,X50821604X,Mr.,Derrick232,Terrance440,Witting912,,,S,white,nonhispanic,M,Pinehurst Massachusetts US,592 Ortiz Route Apt 17,New Bedford,Massachusetts,Bristol County,25005.0,2745,41.71755587259173,-70.85194405294612,145387.33,485.96,81651
d1a76952-bd8a-1304-6a25-5a68107ddef5,2015-12-08,,999-80-9745,,,,Dennise990,Veronika907,Crist667,,,,hawaiian,hispanic,F,Springfield Massachusetts US,799 Funk Well,Groveland,Massachusetts,Essex County,,0,42.78087990609578,-71.0017664845296,28457.77,0.0,114520
9f139e0d-3ee7-dcc8-4c15-b19f9b6077c7,2020-03-13,,999-17-3983,,,,Harrison106,,Goodwin327,,,,white,nonhispanic,M,Worcester Massachusetts US,1061 Cartwright Row Apt 94,Brookline,Massachusetts,Norfolk County,25021.0,2446,42.334815815370845,-71.14667329222318,4761.9,13002.5,189808
fcc5cb15-c638-1ed3-664c-c69aedd94e50,2004-12-19,,999-35-4664,S99997944,X40905897X,Mr.,Emory494,Angel97,Bogisich202,,,,white,nonhispanic,M,Fall River Massachusetts US,725 Gleichner Parade Suite 95,Mansfield,Massachusetts,Bristol County,,0,42.01503024218797,-71.18379267568328,56297.25,10387.34,51036
89442147-a7d9-9a15-6264-f038fd11ac81,2014-06-16,,999-32-3369,,,,Danyelle408,Joanne343,Durgan499,,,,asian,nonhispanic,F,Shanghai Shanghai Municipality CN,706 Koepp Corner,Wellesley,Massachusetts,Norfolk County,25021.0,2457,42.303261336134206,-71.27559312143254,11754.95,7788.86,531025


root
 |-- Id: string (nullable = true)
 |-- BIRTHDATE: date (nullable = true)
 |-- DEATHDATE: date (nullable = true)
 |-- SSN: string (nullable = true)
 |-- DRIVERS: string (nullable = true)
 |-- PASSPORT: string (nullable = true)
 |-- PREFIX: string (nullable = true)
 |-- FIRST: string (nullable = true)
 |-- MIDDLE: string (nullable = true)
 |-- LAST: string (nullable = true)
 |-- SUFFIX: string (nullable = true)
 |-- MAIDEN: string (nullable = true)
 |-- MARITAL: string (nullable = true)
 |-- RACE: string (nullable = true)
 |-- ETHNICITY: string (nullable = true)
 |-- GENDER: string (nullable = true)
 |-- BIRTHPLACE: string (nullable = true)
 |-- ADDRESS: string (nullable = true)
 |-- CITY: string (nullable = true)
 |-- STATE: string (nullable = true)
 |-- COUNTY: string (nullable = true)
 |-- FIPS: integer (nullable = true)
 |-- ZIP: integer (nullable = true)
 |-- LAT: double (nullable = true)
 |-- LON: double (nullable = true)
 |-- HEALTHCARE_EXPENSES: double (nullable = true)
 |--

In [0]:
# Read encounters.csv
df_encounters = (
    spark.read
        .format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(f"{base_path}/encounters.csv")
)

# Convert START/STOP to timestamps; values that cannot be parsed become NULL instead of raising
df_encounters = (
    df_encounters
        .withColumn("START", try_to_timestamp("START"))
        .withColumn("STOP",  try_to_timestamp("STOP"))
)

display(df_encounters.limit(10))
df_encounters.printSchema()


Id,START,STOP,PATIENT,ORGANIZATION,PROVIDER,PAYER,ENCOUNTERCLASS,CODE,DESCRIPTION,BASE_ENCOUNTER_COST,TOTAL_CLAIM_COST,PAYER_COVERAGE,REASONCODE,REASONDESCRIPTION
cfb6239f-9bc0-8b4f-610f-e84dd785f7d8,2016-10-24T06:53:58.000Z,2016-10-24T07:08:58.000Z,cfb6239f-9bc0-8b4f-58cb-0789a9da2f7f,4ff8b164-4cf5-3ab4-b0f2-7ce8fda920e5,75742a69-e63f-39e0-bbe8-3634bd82b239,d31fccc3-1767-390d-966a-22a5156f4219,wellness,410620009,Well child visit (procedure),136.8,1034.05,0.0,,
cfb6239f-9bc0-8b4f-beed-fc61be2d7916,2016-11-28T06:53:58.000Z,2016-11-28T07:08:58.000Z,cfb6239f-9bc0-8b4f-58cb-0789a9da2f7f,4ff8b164-4cf5-3ab4-b0f2-7ce8fda920e5,75742a69-e63f-39e0-bbe8-3634bd82b239,d31fccc3-1767-390d-966a-22a5156f4219,wellness,410620009,Well child visit (procedure),136.8,272.8,0.0,,
cfb6239f-9bc0-8b4f-297a-c280fd930088,2017-01-30T06:53:58.000Z,2017-01-30T07:08:58.000Z,cfb6239f-9bc0-8b4f-58cb-0789a9da2f7f,4ff8b164-4cf5-3ab4-b0f2-7ce8fda920e5,75742a69-e63f-39e0-bbe8-3634bd82b239,d31fccc3-1767-390d-966a-22a5156f4219,wellness,410620009,Well child visit (procedure),136.8,1096.58,756.34,,
cfb6239f-9bc0-8b4f-9220-8a9883c75e99,2017-04-03T06:53:58.000Z,2017-04-03T07:08:58.000Z,cfb6239f-9bc0-8b4f-58cb-0789a9da2f7f,4ff8b164-4cf5-3ab4-b0f2-7ce8fda920e5,75742a69-e63f-39e0-bbe8-3634bd82b239,d31fccc3-1767-390d-966a-22a5156f4219,wellness,410620009,Well child visit (procedure),136.8,816.8,653.44,,
cfb6239f-9bc0-8b4f-8be5-74d7ac2182fa,2017-07-03T06:53:58.000Z,2017-07-03T07:08:58.000Z,cfb6239f-9bc0-8b4f-58cb-0789a9da2f7f,4ff8b164-4cf5-3ab4-b0f2-7ce8fda920e5,75742a69-e63f-39e0-bbe8-3634bd82b239,d31fccc3-1767-390d-966a-22a5156f4219,wellness,410620009,Well child visit (procedure),136.8,1150.65,920.52,,
cfb6239f-9bc0-8b4f-9fae-e1b8fc82b8f7,2017-09-02T06:53:58.000Z,2017-09-02T07:53:58.000Z,cfb6239f-9bc0-8b4f-58cb-0789a9da2f7f,0d7fd824-1c7b-3f8b-a662-f6c2005e82e1,1f8a41e7-9f81-3247-a651-d28a50019e39,d31fccc3-1767-390d-966a-22a5156f4219,emergency,50849002,Emergency room admission (procedure),146.18,146.18,116.94,384709000.0,Sprain (morphologic abnormality)
cfb6239f-9bc0-8b4f-7f57-abd63e149bbe,2017-10-02T06:53:58.000Z,2017-10-02T07:08:58.000Z,cfb6239f-9bc0-8b4f-58cb-0789a9da2f7f,4ff8b164-4cf5-3ab4-b0f2-7ce8fda920e5,75742a69-e63f-39e0-bbe8-3634bd82b239,d31fccc3-1767-390d-966a-22a5156f4219,wellness,410620009,Well child visit (procedure),136.8,136.8,109.44,,
cfb6239f-9bc0-8b4f-ca10-c77e6ff2e218,2018-01-01T06:53:58.000Z,2018-01-01T07:08:58.000Z,cfb6239f-9bc0-8b4f-58cb-0789a9da2f7f,4ff8b164-4cf5-3ab4-b0f2-7ce8fda920e5,75742a69-e63f-39e0-bbe8-3634bd82b239,d31fccc3-1767-390d-966a-22a5156f4219,wellness,410620009,Well child visit (procedure),136.8,1475.1,13.68,,
cfb6239f-9bc0-8b4f-19a9-3aad0f9c06b6,2018-03-18T06:53:58.000Z,2018-03-18T07:08:58.000Z,cfb6239f-9bc0-8b4f-58cb-0789a9da2f7f,a2a41caf-758f-3962-8e5f-e6e603f93d17,40af32a1-6fe8-35be-b92a-e0e9a5e48ff6,d31fccc3-1767-390d-966a-22a5156f4219,outpatient,185345009,Encounter for symptom (procedure),85.55,85.55,68.44,65363002.0,Otitis media (disorder)
cfb6239f-9bc0-8b4f-d498-fd314f5b0c07,2018-04-02T06:53:58.000Z,2018-04-02T07:08:58.000Z,cfb6239f-9bc0-8b4f-58cb-0789a9da2f7f,4ff8b164-4cf5-3ab4-b0f2-7ce8fda920e5,75742a69-e63f-39e0-bbe8-3634bd82b239,d31fccc3-1767-390d-966a-22a5156f4219,wellness,410620009,Well child visit (procedure),136.8,546.26,437.01,,


root
 |-- Id: string (nullable = true)
 |-- START: timestamp (nullable = true)
 |-- STOP: timestamp (nullable = true)
 |-- PATIENT: string (nullable = true)
 |-- ORGANIZATION: string (nullable = true)
 |-- PROVIDER: string (nullable = true)
 |-- PAYER: string (nullable = true)
 |-- ENCOUNTERCLASS: string (nullable = true)
 |-- CODE: string (nullable = true)
 |-- DESCRIPTION: string (nullable = true)
 |-- BASE_ENCOUNTER_COST: string (nullable = true)
 |-- TOTAL_CLAIM_COST: string (nullable = true)
 |-- PAYER_COVERAGE: string (nullable = true)
 |-- REASONCODE: string (nullable = true)
 |-- REASONDESCRIPTION: string (nullable = true)
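The schema above shows the three cost columns (`BASE_ENCOUNTER_COST`, `TOTAL_CLAIM_COST`, `PAYER_COVERAGE`) inferred as `string`, although the sample rows hold decimal values. One way to make column types deterministic is to pass an explicit schema to the reader instead of relying on `inferSchema`. The sketch below only builds a DDL-formatted schema string, which `DataFrameReader.schema` accepts. The type choices in `ENCOUNTERS_TYPES` are assumptions based on the sample rows, not a schema taken from this notebook, and `to_ddl` is a hypothetical helper.

```python
# Hypothetical column-to-type mapping for encounters.csv. The types are an
# assumption based on the sample rows, not the notebook's actual schema.
ENCOUNTERS_TYPES = {
    "Id": "STRING",
    "START": "STRING",  # kept as text here; parsed later with try_to_timestamp
    "STOP": "STRING",
    "PATIENT": "STRING",
    "ORGANIZATION": "STRING",
    "PROVIDER": "STRING",
    "PAYER": "STRING",
    "ENCOUNTERCLASS": "STRING",
    "CODE": "STRING",
    "DESCRIPTION": "STRING",
    "BASE_ENCOUNTER_COST": "DOUBLE",
    "TOTAL_CLAIM_COST": "DOUBLE",
    "PAYER_COVERAGE": "DOUBLE",
    "REASONCODE": "STRING",
    "REASONDESCRIPTION": "STRING",
}


def to_ddl(types: dict[str, str]) -> str:
    """Render a column -> type mapping as a Spark DDL schema string."""
    return ", ".join(f"`{name}` {dtype}" for name, dtype in types.items())


ddl = to_ddl(ENCOUNTERS_TYPES)
print(ddl)
# In the notebook, the string could replace the inferSchema option, e.g.:
# spark.read.format("csv").option("header", "true").schema(ddl).load(...)
```

With an explicit schema, how unparseable values are handled depends on the reader's `mode` option, so the null counts in the profiling section would be worth re-checking after such a change.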



In [0]:
df_encounters.filter("START IS NULL OR STOP IS NULL").count()

1
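The count above reports one row whose `START` or `STOP` is null after conversion. `try_to_timestamp` returns NULL for input it cannot parse instead of failing the query, so a null here can mean either a missing value in the CSV or a malformed one. The pure-Python sketch below mirrors that behavior on the timestamp layout seen in the sample rows. `try_parse_synthea_ts` and the sample values are hypothetical and serve only as illustration; they are not part of the pipeline.

```python
from datetime import datetime, timezone
from typing import Optional

# Layout seen in the sample rows, e.g. 2016-10-24T06:53:58.000Z
SYNTHEA_TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def try_parse_synthea_ts(value: Optional[str]) -> Optional[datetime]:
    """Return a UTC datetime, or None when the value is missing or malformed."""
    if value is None or value.strip() == "":
        return None
    try:
        parsed = datetime.strptime(value.strip(), SYNTHEA_TS_FORMAT)
    except ValueError:
        return None
    return parsed.astimezone(timezone.utc)


# Illustrative inputs: one valid value, one empty, one malformed, one missing.
samples = ["2016-10-24T06:53:58.000Z", "", "24/10/2016 06:53", None]
parsed = [try_parse_synthea_ts(s) for s in samples]
unparsed = sum(1 for p in parsed if p is None)
print(f"{unparsed} of {len(samples)} values could not be parsed")
```

To tell the two cases apart in Spark, the raw string column would have to be kept next to the parsed one before the conversion overwrites it.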

## Persisting the data in the Bronze layer
The resulting DataFrames are persisted in the Bronze layer as Delta tables (`patients_bronze` and `encounters_bronze`). These tables are the structured, reliable version of the raw data and serve as the base for the Silver-layer transformations.
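Both write cells below follow the same naming rule: the CSV file name without its extension, plus a `_bronze` suffix. A minimal sketch of that rule, which could help if more Synthea files are ingested later; `bronze_table_name` is a hypothetical helper, not something defined in this notebook.

```python
from pathlib import PurePosixPath


def bronze_table_name(csv_path: str) -> str:
    """Derive the Bronze table name from a raw CSV path (patients.csv -> patients_bronze)."""
    stem = PurePosixPath(csv_path).stem.lower()
    return f"{stem}_bronze"


base_path = "/Volumes/mvp_engenharia_de_dados/staging/synthea_raw"
for file_name in ["patients.csv", "encounters.csv"]:
    print(file_name, "->", bronze_table_name(f"{base_path}/{file_name}"))
```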

In [0]:
# Save encounters as a Delta table in the Bronze layer
df_encounters.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("encounters_bronze")


In [0]:
# Save patients as a Delta table in the Bronze layer
df_patients.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("patients_bronze")


## Metadata (catalog): column comments

This section registers column comments on the Bronze tables using `COMMENT ON COLUMN`.
The comments make the catalog easier to explore and document, and provide minimal governance in the context of the MVP.
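The cells below interpolate each comment directly into a SQL string, which works as long as no comment contains a single quote. A minimal sketch of a safer statement builder follows. `escape_sql_literal` and `comment_on_column_sql` are hypothetical helpers, and the escaping assumes default Spark SQL string-literal parsing, where the backslash is the escape character.

```python
def escape_sql_literal(text: str) -> str:
    """Escape backslashes and single quotes for a single-quoted SQL string literal."""
    return text.replace("\\", "\\\\").replace("'", "\\'")


def comment_on_column_sql(table: str, column: str, comment: str) -> str:
    """Build a COMMENT ON COLUMN statement with the column quoted and the comment escaped."""
    return f"COMMENT ON COLUMN {table}.`{column}` IS '{escape_sql_literal(comment)}'"


# Illustrative comment containing a single quote
stmt = comment_on_column_sql("patients_bronze", "MAIDEN", "Patient's maiden name")
print(stmt)
# In the notebook, each statement would be executed with spark.sql(stmt).
```

Quoting the column name with backticks also avoids surprises if a column name ever collides with a SQL keyword.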


In [0]:
columns_comments_encounters = {
    "Id": "Unique identifier of the encounter",
    "START": "Date and time the encounter started",
    "STOP": "Date and time the encounter ended",
    "PATIENT": "Patient identifier",
    "ORGANIZATION": "Organization responsible for the encounter",
    "PROVIDER": "Service provider for the encounter",
    "PAYER": "Payer identifier",
    "ENCOUNTERCLASS": "Encounter class (e.g., ambulatory, emergency)",
    "CODE": "Encounter type code",
    "DESCRIPTION": "Encounter type description",
    "BASE_ENCOUNTER_COST": "Base cost of the encounter",
    "TOTAL_CLAIM_COST": "Total cost of the encounter",
    "PAYER_COVERAGE": "Payer coverage",
    "REASONCODE": "Encounter reason code",
    "REASONDESCRIPTION": "Encounter reason description"
}

for col, comment in columns_comments_encounters.items():
    spark.sql(f"COMMENT ON COLUMN encounters_bronze.{col} IS '{comment}'")

In [0]:
columns_comments = {
    "Id": "Unique identifier of the patient",
    "BIRTHDATE": "Patient date of birth",
    "DEATHDATE": "Patient date of death",
    "SSN": "Social Security number",
    "DRIVERS": "Driver license number",
    "PASSPORT": "Passport number",
    "PREFIX": "Name prefix",
    "FIRST": "First name",
    "MIDDLE": "Middle name",
    "LAST": "Last name",
    "SUFFIX": "Name suffix",
    "MAIDEN": "Maiden name",
    "MARITAL": "Marital status",
    "RACE": "Race",
    "ETHNICITY": "Ethnicity",
    "GENDER": "Gender",
    "BIRTHPLACE": "Place of birth",
    "ADDRESS": "Home address",
    "CITY": "City",
    "STATE": "State",
    "COUNTY": "County",
    "FIPS": "FIPS code of the location",
    "ZIP": "ZIP code",
    "LAT": "Latitude of the address",
    "LON": "Longitude of the address",
    "HEALTHCARE_EXPENSES": "Healthcare expenses",
    "HEALTHCARE_COVERAGE": "Healthcare coverage",
    "INCOME": "Annual income"
}

for col, comment in columns_comments.items():
    spark.sql(f"COMMENT ON COLUMN patients_bronze.{col} IS '{comment}'")

In [0]:
%sql
-- Metadata for the encounters_bronze table
DESCRIBE TABLE EXTENDED bronze.encounters_bronze;


col_name,data_type,comment
Id,string,Unique identifier of the encounter
START,timestamp,Date and time the encounter started
STOP,timestamp,Date and time the encounter ended
PATIENT,string,Patient identifier
ORGANIZATION,string,Organization responsible for the encounter
PROVIDER,string,Service provider for the encounter
PAYER,string,Payer identifier
ENCOUNTERCLASS,string,"Encounter class (e.g., ambulatory, emergency)"
CODE,string,Encounter type code
DESCRIPTION,string,Encounter type description


In [0]:
%sql
-- Metadata for the patients_bronze table
DESCRIBE TABLE EXTENDED bronze.patients_bronze;


col_name,data_type,comment
Id,string,Unique identifier of the patient
BIRTHDATE,date,Patient date of birth
DEATHDATE,date,Patient date of death
SSN,string,Social Security number
DRIVERS,string,Driver license number
PASSPORT,string,Passport number
PREFIX,string,Name prefix
FIRST,string,First name
MIDDLE,string,Middle name
LAST,string,Last name


## Ingestion profiling – Bronze layer

This section runs basic quality checks for diagnostic purposes only,
without applying filters or corrections.
The goal is to document the state of the data at ingestion time.
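The checks in this section all reduce to counting rows that match a SQL predicate. The sketch below groups those predicates in one dictionary so that every check is named and run the same way. `build_checks` and `run_checks` are hypothetical helpers, and `run_checks` only assumes an object exposing `filter(expr)` and `count()`, as a Spark DataFrame does.

```python
def build_checks(key_cols, start_col=None, stop_col=None):
    """Map a readable check name to a SQL predicate selecting the offending rows."""
    checks = {f"{c} is null": f"{c} IS NULL" for c in key_cols}
    if start_col and stop_col:
        checks[f"{start_col} is null"] = f"{start_col} IS NULL"
        checks[f"{stop_col} is null"] = f"{stop_col} IS NULL"
        checks[f"{stop_col} before {start_col}"] = f"{stop_col} < {start_col}"
    return checks


def run_checks(df, checks):
    """Count offending rows per check; nothing is filtered out or corrected."""
    return {name: df.filter(expr).count() for name, expr in checks.items()}


encounter_checks = build_checks(["Id", "PATIENT"], start_col="START", stop_col="STOP")
for name, expr in encounter_checks.items():
    print(f"{name}: {expr}")
# In the notebook: run_checks(spark.table("encounters_bronze"), encounter_checks)
```

Keeping the predicates as data makes it straightforward to store the resulting counts alongside the run, rather than only printing them.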


In [0]:
enc = spark.table("encounters_bronze")

# ==========
# ENCOUNTERS
# ==========

print("=== ENCOUNTERS_BRONZE ===")
print("Total rows:", enc.count())

print("\nSchema:")
enc.printSchema()

# Dates
print("\nRows with null START:", enc.filter("START IS NULL").count())
print("Rows with null STOP:", enc.filter("STOP IS NULL").count())
print("Rows with STOP < START:", enc.filter("STOP < START").count())

# Keys
print("\nRows with null Id:", enc.filter("Id IS NULL").count())
print("Rows with null PATIENT:", enc.filter("PATIENT IS NULL").count())

print("\nDuplicate Id values in encounters:")
enc.groupBy("Id").count().filter("count > 1").show(10, truncate=False)

# ENCOUNTERCLASS
print("\nDistinct ENCOUNTERCLASS values:")
enc.select("ENCOUNTERCLASS").distinct().show()

print("\nCount per ENCOUNTERCLASS:")
enc.groupBy("ENCOUNTERCLASS").count().show()

# Encounter reason (primary diagnosis)
print("\nRows with null REASONDESCRIPTION:", enc.filter("REASONDESCRIPTION IS NULL").count())
print("\nMost frequent REASONDESCRIPTION values:")
enc.groupBy("REASONDESCRIPTION").count().orderBy("count", ascending=False).show(10, truncate=False)


=== ENCOUNTERS_BRONZE ===
Total rows: 510380

Schema:
root
 |-- Id: string (nullable = true)
 |-- START: timestamp (nullable = true)
 |-- STOP: timestamp (nullable = true)
 |-- PATIENT: string (nullable = true)
 |-- ORGANIZATION: string (nullable = true)
 |-- PROVIDER: string (nullable = true)
 |-- PAYER: string (nullable = true)
 |-- ENCOUNTERCLASS: string (nullable = true)
 |-- CODE: string (nullable = true)
 |-- DESCRIPTION: string (nullable = true)
 |-- BASE_ENCOUNTER_COST: string (nullable = true)
 |-- TOTAL_CLAIM_COST: string (nullable = true)
 |-- PAYER_COVERAGE: string (nullable = true)
 |-- REASONCODE: string (nullable = true)
 |-- REASONDESCRIPTION: string (nullable = true)


Rows with null START: 0
Rows with null STOP: 1
Rows with STOP < START: 0

Rows with null Id: 0
Rows with null PATIENT: 0

Duplicate Id values in encounters:
+---+-----+
|Id |count|
+---+-----+
+---+-----+


Distinct ENCOUNTERCLASS values:
+--------------------+
|      ENCOUNTERCLASS|
+-------

### Observations (encounters_bronze)

Any inconsistencies found by these checks (e.g., nulls in `REASONDESCRIPTION`, timestamps that failed to parse, or records without a matching patient) are recorded as quality evidence. In this run, for example, one row has a null `STOP`.
These exceptions are corrected and handled in the Silver and Gold layers, where the filters, integrity rules, and transformations required by the DW are applied.
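The profiling cell above checks for null `PATIENT` values but does not verify that each `PATIENT` exists in `patients_bronze`. A minimal sketch of that referential check as a Spark SQL anti-join follows. `orphan_count_sql` is a hypothetical helper, and the query is meant to be run with `spark.sql` in the notebook.

```python
def orphan_count_sql(child: str, child_key: str, parent: str, parent_key: str) -> str:
    """Build a query counting child rows whose key has no match in the parent table."""
    return (
        f"SELECT COUNT(*) AS orphan_rows "
        f"FROM {child} AS c "
        f"LEFT ANTI JOIN {parent} AS p "
        f"ON c.{child_key} = p.{parent_key}"
    )


query = orphan_count_sql("encounters_bronze", "PATIENT", "patients_bronze", "Id")
print(query)
# In the notebook: spark.sql(query).show()
```

In keeping with the diagnostic purpose of this layer, the query only counts orphan rows and leaves both tables untouched.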


In [0]:
# ==========
# PATIENTS
# ==========

pat = spark.table("patients_bronze")

print("\n\n=== PATIENTS_BRONZE ===")
print("Total rows:", pat.count())

print("\nSchema:")
pat.printSchema()

# Patient key
print("\nRows with null Id:", pat.filter("Id IS NULL").count())
print("Duplicate Id values in patients:")
pat.groupBy("Id").count().filter("count > 1").show(10, truncate=False)

# Dates
print("\nRows with null BIRTHDATE:", pat.filter("BIRTHDATE IS NULL").count())
print("Rows with DEATHDATE before BIRTHDATE:")
pat.filter("DEATHDATE IS NOT NULL AND BIRTHDATE IS NOT NULL AND DEATHDATE < BIRTHDATE").show(10, truncate=False)

# Basic demographics
print("\nGENDER distribution:")
pat.groupBy("GENDER").count().show()

print("\nDistinct RACE values:")
pat.select("RACE").distinct().show()

print("\nDistinct ETHNICITY values:")
pat.select("ETHNICITY").distinct().show()



=== PATIENTS_BRONZE ===
Total rows: 9037

Schema:
root
 |-- Id: string (nullable = true)
 |-- BIRTHDATE: date (nullable = true)
 |-- DEATHDATE: date (nullable = true)
 |-- SSN: string (nullable = true)
 |-- DRIVERS: string (nullable = true)
 |-- PASSPORT: string (nullable = true)
 |-- PREFIX: string (nullable = true)
 |-- FIRST: string (nullable = true)
 |-- MIDDLE: string (nullable = true)
 |-- LAST: string (nullable = true)
 |-- SUFFIX: string (nullable = true)
 |-- MAIDEN: string (nullable = true)
 |-- MARITAL: string (nullable = true)
 |-- RACE: string (nullable = true)
 |-- ETHNICITY: string (nullable = true)
 |-- GENDER: string (nullable = true)
 |-- BIRTHPLACE: string (nullable = true)
 |-- ADDRESS: string (nullable = true)
 |-- CITY: string (nullable = true)
 |-- STATE: string (nullable = true)
 |-- COUNTY: string (nullable = true)
 |-- FIPS: integer (nullable = true)
 |-- ZIP: integer (nullable = true)
 |-- LAT: double (nullable = true)
 |-- LON: double (nullable = true

## Bronze layer conclusion

The `patients_bronze` and `encounters_bronze` tables were ingested from the raw Synthea CSVs and persisted as Delta tables in the `bronze` schema.
Minimal typing was applied (timestamps in `START`/`STOP`), column comments were registered, and basic quality checks were run.

The next step (Silver layer) will:
- filter and standardize the records relevant to the DW (e.g., `ENCOUNTERCLASS = 'inpatient'`),
- prepare the tables for dimensional modeling and for building the Gold layer.
