# 📒 Notebook: 03_silver_transform

---

## 📝 Objective

- Read the data from the Bronze layer (after ingestion and initial cleaning).
- Apply the Silver-level enrichments and transformations:
  - Standardization of business variables (age, gender…)
  - Extraction of analytical features (diagnoses, timestamps…)
  - Preparation of the data for star-schema modeling (fact and dimension tables).
- Save the cleaned and enriched datasets to the Silver layer (Delta).

---

## 📥 Inputs

| Dataset         | Format | Location                                       | Description                                   |
|-----------------|--------|------------------------------------------------|-----------------------------------------------|
| icd_code        | Delta  | /health_lakehouse/Files/bronze/icd_code        | Disease category data after cleaning          |
| patient         | Delta  | /health_lakehouse/Files/bronze/patient         | Patient data after cleaning                   |
| maladie         | Delta  | /health_lakehouse/Files/bronze/maladie         | Disease data after cleaning                   |
| motif_admission | Delta  | /health_lakehouse/Files/bronze/motif_admission | Admission reason data after cleaning          |
| medicament      | Delta  | /health_lakehouse/Files/bronze/medicament      | Medication data after cleaning                |
| meds_code       | Delta  | /health_lakehouse/Files/bronze/meds_code       | Medication category data after cleaning       |
| motifs_code     | Delta  | /health_lakehouse/Files/bronze/motifs_code     | Admission reason category data after cleaning |

---

## 📤 Outputs

| Dataset         | Format | Location                                       | Description                             |
|-----------------|--------|------------------------------------------------|-----------------------------------------|
| icd_code        | Delta  | /health_lakehouse/Files/silver/icd_code        | Enriched disease category data          |
| patient         | Delta  | /health_lakehouse/Files/silver/patient         | Enriched patient data                   |
| maladie         | Delta  | /health_lakehouse/Files/silver/maladie         | Disease data as lists                   |
| motif_admission | Delta  | /health_lakehouse/Files/silver/motif_admission | Admission reason data as lists          |
| medicament      | Delta  | /health_lakehouse/Files/silver/medicament      | Medication data as lists                |
| meds_code       | Delta  | /health_lakehouse/Files/silver/meds_code       | Enriched medication category data       |
| motifs_code     | Delta  | /health_lakehouse/Files/silver/motifs_code     | Enriched admission reason category data |


---

## 👤 Author(s) / Contact

- SEKARI Inès - [ines.sekari@efrei.net]
- NKUIDA Malaïka - [malaika.nkuida@efrei.net]

---

## 🗓️ Versioning & Updates

| Version | Date       | Changes                                              |
|---------|------------|------------------------------------------------------|
| 1.0     | 2025-04-28 | Initial creation and structuring of the Silver layer |
| 1.1     | 2025-05-05 | Added hospital feature enrichment                    |

---


In [31]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Read the Bronze table
bronze_df_patient = spark.read.format("delta").load("Files/bronze/patient")

display(bronze_df_patient.head(5))

In [32]:
# Clean gender and age: values outside the expected domain are set to null
silver_df_patient = (
    bronze_df_patient
    .withColumn(
        "gender", 
        F.when(F.col("gender").isin("MALE", "FEMALE"), F.col("gender")).otherwise(F.lit(None))
    )
    .withColumn(
        "age", 
        F.when((F.col("age") >= 0) & (F.col("age") <= 120), F.col("age")).otherwise(F.lit(None))
    )
)

display(silver_df_patient.head(5))
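
The rule in the cell above keeps a value only when it falls inside the expected domain and nulls it otherwise. A minimal plain-Python sketch of the same rule can help check boundary cases without a Spark session. The helper name `clean_patient_fields` is hypothetical and not part of the notebook.

```python
from typing import Optional, Tuple

VALID_GENDERS = {"MALE", "FEMALE"}  # same whitelist as the Spark cell


def clean_patient_fields(
    gender: Optional[str], age: Optional[int]
) -> Tuple[Optional[str], Optional[int]]:
    """Hypothetical helper mirroring the F.when(...).otherwise(None) logic."""
    # The comparison is case-sensitive, so "male" is rejected
    clean_gender = gender if gender in VALID_GENDERS else None
    # Bounds are inclusive on both sides, as in the Spark expression
    clean_age = age if age is not None and 0 <= age <= 120 else None
    return clean_gender, clean_age


print(clean_patient_fields("MALE", 45))
print(clean_patient_fields("male", 130))
```

Because the whitelist is case-sensitive, any lowercase or abbreviated gender value that survived Bronze cleaning would be nulled here.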

In [33]:
# Map English month and weekday names to numbers
month_map = {
    "January": 1, "February": 2, "March": 3, "April": 4, "May": 5, "June": 6,
    "July": 7, "August": 8, "September": 9, "October": 10, "November": 11, "December": 12
}
month_map_expr = F.create_map([F.lit(x) for x in sum(month_map.items(), ())])

dow_map = {
    "Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4,
    "Friday": 5, "Saturday": 6, "Sunday": 7
}
dow_map_expr = F.create_map([F.lit(x) for x in sum(dow_map.items(), ())])

silver_df_patient = (
    silver_df_patient
    .withColumn("month_num", month_map_expr[F.col("arrivalmonth")])
    .withColumn("weekday_num", dow_map_expr[F.col("arrivalday")])
)

display(silver_df_patient.head(5))
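
`F.create_map` takes keys and values as one flat, alternating sequence, which is why the dictionaries are flattened with `sum(..., ())` before being wrapped in `F.lit`. The sketch below shows that flattening step on its own, in plain Python, on a shortened dictionary. The `itertools.chain` variant is an alternative offered for comparison, not something the notebook uses.

```python
from itertools import chain

month_map = {"January": 1, "February": 2, "March": 3}  # shortened for the example

# sum(..., ()) concatenates the (key, value) tuples one after another
flat_sum = sum(month_map.items(), ())

# chain.from_iterable yields the same sequence without repeated tuple copies
flat_chain = tuple(chain.from_iterable(month_map.items()))

print(flat_sum)    # ('January', 1, 'February', 2, 'March', 3)
print(flat_chain)  # same content
```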

In [34]:
# Split 'arrivalhour_bin' on "-" into integer start and end hours
silver_df_patient = (
    silver_df_patient
    .withColumn("hour_bin_start", F.split("arrivalhour_bin", "-").getItem(0).cast("int"))
    .withColumn("hour_bin_end", F.split("arrivalhour_bin", "-").getItem(1).cast("int"))
)

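# Add a unique sequential id with row_number(). The window has no partitionBy,
# so Spark moves all rows to a single partition, and the ids follow the current
# row order rather than a business key.
w = Window.orderBy(F.monotonically_increasing_id())
silver_df_patient = silver_df_patient.withColumn("consultation_id", F.row_number().over(w))

# Generate a fictitious year (2014 to 2017) plus a flag marking it as synthetic (for teaching purposes).
# F.rand() is unseeded here, so the values differ from one run to the next.
silver_df_patient = (
    silver_df_patient
    .withColumn("year_fictive", (F.floor(F.rand()*4)+2014).cast("int"))
    .withColumn("is_year_fictive", F.lit(True))
)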

display(silver_df_patient.head(5))

print(silver_df_patient.columns)

['patient_id', 'profile_id', 'dep_name', 'age', 'gender', 'ethnicity', 'race', 'lang', 'religion', 'maritalstatus', 'employstatus', 'insurance_status', 'disposition', 'arrivalmode', 'arrivalmonth', 'arrivalday', 'arrivalhour_bin', 'previousdispo', 'n_surgeries', 'n_edvisits', 'n_admissions', 'month_num', 'weekday_num', 'hour_bin_start', 'hour_bin_end', 'consultation_id', 'year_fictive', 'is_year_fictive']
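
Two pieces of arithmetic in the cell above are easy to check in isolation: the split of the hour bin and the range of the fictitious year. The sketch below reproduces both in plain Python. The sample value `"07-10"` is an assumed format for `arrivalhour_bin`, and both helper names are hypothetical.

```python
import math
import random
from typing import Optional, Tuple


def split_hour_bin(hour_bin: str) -> Tuple[Optional[int], Optional[int]]:
    """Hypothetical helper: split 'start-end' into two integers (None if missing)."""
    parts = hour_bin.split("-")
    start = int(parts[0]) if len(parts) > 0 and parts[0].isdigit() else None
    end = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else None
    return start, end


def fictive_year(r: float) -> int:
    """Same arithmetic as floor(rand() * 4) + 2014, for r in [0, 1)."""
    return math.floor(r * 4) + 2014


print(split_hour_bin("07-10"))
# Every drawn year falls between 2014 and 2017 inclusive
print(sorted({fictive_year(random.random()) for _ in range(1000)}))
```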


In [35]:
# Read the Bronze table
bronze_df_maladie = spark.read.format("delta").load("Files/bronze/maladie")

display(bronze_df_maladie.head(5))

# Build the list of detected pathologies from the binary flag columns named in pathology_cols
pathology_cols = [
    '2ndarymalig', 'anemia', 'burns', 'carditis', 'abdomhernia', 'abdomnlpain', 'abortcompl',
    'acqfootdef', 'acrenlfail', 'acutecvd', 'abdomhernia', 'abdomnlpain', 'abortcompl', 'acqfootdef', 'acrenlfail', 'acutecvd', 'acutemi', 'acutphanm', 'adjustmentdisorders', 'adltrespfl', 'alcoholrelateddisorders', 'allergy', 'amniosdx', 'analrectal', 'anemia', 'aneurysm', 'anxietydisorders', 'appendicitis', 'artembolism', 'asppneumon', 'asthma', 'attentiondeficitconductdisruptivebeha', 'backproblem', 'biliarydx', 'birthasphyx', 'birthtrauma', 'bladdercncr', 'blindness', 'bnignutneo', 'bonectcncr', 'bph', 'brainnscan', 'breastcancr', 'breastdx', 'brnchlngca', 'bronchitis', 'burns', 'cardiaarrst', 'cardiacanom', 'carditis', 'cataract', 'cervixcancr', 'chestpain', 'chfnonhp', 'chrkidneydisease', 'coaghemrdx', 'coloncancer', 'comabrndmg', 'complicdevi', 'complicproc', 'conduction', 'contraceptiv', 'copd', 'coronathero', 'crushinjury', 'cysticfibro', 'deliriumdementiaamnesticothercognitiv', 'developmentaldisorders', 'diabmelnoc', 'diabmelwcm', 'disordersusuallydiagnosedininfancych', 'diverticulos', 'dizziness', 'dminpreg', 'dysrhythmia', 'earlylabor', 'ecodesadverseeffectsofmedicalcare', 'ecodesadverseeffectsofmedicaldrugs', 'ecodescutpierce', 'ecodesdrowningsubmersion', 'ecodesfall', 'ecodesfirearm', 'ecodesfireburn', 'ecodesmachinery', 'ecodesmotorvehicletrafficmvt', 'ecodesnaturalenvironment', 'ecodesotherspecifiedandclassifiable', 'ecodesotherspecifiednec', 'ecodespedalcyclistnotmvt', 'ecodesplaceofoccurrence', 'ecodespoisoning', 'ecodesstruckbyagainst', 'ecodessuffocation', 'ecodestransportnotmvt', 'ecodesunspecified', 'ectopicpreg', 'encephalitis', 'endometrios', 'epilepsycnv', 'esophcancer', 'esophgealdx', 'exameval', 'eyeinfectn', 'fatigue', 'femgenitca', 'feminfertil', 'fetaldistrs', 'fluidelcdx', 'fuo', 'fxarm', 'fxhip', 'fxleg', 'fxskullfac', 'gangrene', 'gasduoulcer', 'gastritis', 'gastroent', 'giconganom', 'gihemorrhag', 'giperitcan', 'glaucoma', 'goutotcrys', 'guconganom', 'hdnckcancr', 'headachemig', 'hemmorhoids', 'hemorrpreg', 'hepatitis', 
'hivinfectn', 'hodgkinsds', 'hrtvalvedx', 'htn', 'htncomplicn', 'htninpreg', 'hyperlipidem', 'immunitydx', 'immunizscrn', 'impulsecontroldisordersnec', 'inducabortn', 'infectarth', 'influenza', 'infmalegen', 'intestinfct', 'intobstruct', 'intracrninj', 'jointinjury', 'kidnyrnlca', 'lateeffcvd', 'leukemias', 'liveborn', 'liveribdca', 'longpregncy', 'lowbirthwt', 'lungexternl', 'lymphenlarg', 'maintchemr', 'malgenitca', 'maligneopls', 'malposition', 'meningitis', 'menopausldx', 'menstrualdx', 'miscellaneousmentalhealthdisorders', 'mooddisorders', 'mouthdx', 'ms', 'multmyeloma', 'mycoses', 'nauseavomit', 'neoplsmunsp', 'nephritis', 'nervcongan', 'nonepithca', 'nonhodglym', 'nutritdefic', 'obrelatedperintrauma', 'opnwndextr', 'opnwndhead', 'osteoarthros', 'osteoporosis', 'otacqdefor', 'otaftercare', 'otbnignneo', 'otbonedx', 'otcirculdx', 'otcomplbir', 'otconganom', 'otconntiss', 'otdxbladdr', 'otdxkidney', 'otdxstomch', 'otendodsor', 'otfemalgen', 'othbactinf', 'othcnsinfx', 'othematldx', 'othercvd', 'othereardx', 'otheredcns', 'othereyedx', 'othergidx', 'othergudx', 'otherinjury', 'otherpregnancyanddeliveryincludingnormal', 'otherscreen', 'othfracture', 'othheartdx', 'othinfectns', 'othliverdx', 'othlowresp', 'othmalegen', 'othnervdx', 'othskindx', 'othveindx', 'otinflskin', 'otitismedia', 'otjointdx', 'otnutritdx', 'otperintdx', 'otpregcomp', 'otprimryca', 'otrespirca', 'otupprresp', 'otuprspin', 'ovariancyst', 'ovarycancer', 'pancreascan', 'pancreasdx', 'paralysis', 'parkinsons', 'pathologfx', 'pelvicobstr', 'perintjaund', 'peripathero', 'peritonitis', 'personalitydisorders', 'phlebitis', 'pid', 'pleurisy', 'pneumonia', 'poisnnonmed', 'poisnotmed', 'poisonpsych', 'precereoccl', 'prevcsectn', 'prolapse', 'prostatecan', 'pulmhartdx', 'rctmanusca', 'rehab', 'respdistres', 'retinaldx', 'rheumarth', 'schizophreniaandotherpsychoticdisorde', 'screeningandhistoryofmentalhealthan', 'septicemia', 'septicemiaexceptinlabor', 'sexualinfxs', 'shock', 'sicklecell', 'skininfectn', 
'skinmelanom', 'sle', 'socialadmin', 'spincorinj', 'spontabortn', 'sprain', 'stomchcancr', 'substancerelateddisorders', 'suicideandintentionalselfinflictedin', 'superficinj', 'syncope', 'teethdx', 'testiscancr', 'thyroidcncr', 'thyroiddsor', 'tia', 'tonsillitis', 'tuberculosis', 'ulceratcol', 'ulcerskin', 'umbilcord', 'unclassified', 'urinstone', 'urinyorgca', 'uteruscancr', 'uti', 'varicosevn', 'viralinfect', 'whtblooddx'
]


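# The list above repeats some names (e.g. 'anemia', 'burns'); drop the duplicates,
# keeping first-seen order, so each diagnosis appears at most once per patient.
pathology_cols = list(dict.fromkeys(pathology_cols))

# Keep the name of every column whose value is 1, then filter out the nulls
diagnosis_expr = "filter(array({}), x -> x is not null)".format(
    ','.join([f"if({c}=1, '{c}', null)" for c in pathology_cols])
)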

silver_df_maladie = bronze_df_maladie.withColumn("diagnosis_list", F.expr(diagnosis_expr))

display(silver_df_maladie.head(5))

silver_df_maladie = silver_df_maladie.select('patient_id','diagnosis_list')

display(silver_df_maladie.head(5))
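
The expression built in this cell turns a wide set of 0/1 flag columns into one array of the column names that equal 1. The sketch below builds a string of the same shape for a tiny, made-up column list and emulates the result for one row in plain Python. It also shows what a repeated column name does to the output. Both function names are hypothetical.

```python
from typing import Dict, List


def build_list_expr(cols: List[str]) -> str:
    """Build a Spark SQL string of the same shape as the one in the cell above."""
    return "filter(array({}), x -> x is not null)".format(
        ",".join(f"if({c}=1, '{c}', null)" for c in cols)
    )


def emulate(row: Dict[str, int], cols: List[str]) -> List[str]:
    """Plain-Python stand-in for what the expression returns for one row."""
    return [c for c in cols if row.get(c) == 1]


cols = ["anemia", "asthma", "anemia"]  # made-up list with one repeated name
row = {"anemia": 1, "asthma": 0}

print(build_list_expr(["anemia", "asthma"]))
print(emulate(row, cols))                       # ['anemia', 'anemia']
print(emulate(row, list(dict.fromkeys(cols))))  # ['anemia']
```

`dict.fromkeys` removes repeats while preserving the original order, which keeps the generated lists stable from run to run.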

In [36]:
# Read the Bronze table
bronze_df_medicament = spark.read.format("delta").load("Files/bronze/medicament")

# Get the column names
old_columns = bronze_df_medicament.columns

# Build the new names by stripping the leading "meds_" prefix when present
# (slicing removes only the prefix, whereas str.replace would remove every occurrence)
new_columns = [c[len("meds_"):] if c.startswith("meds_") else c for c in old_columns]

# Apply the renaming
df_medicament = bronze_df_medicament.toDF(*new_columns)

# Check the result
print(df_medicament.columns)


['patient_id', 'analgesicandantihistaminecombination', 'analgesics', 'anesthetics', 'anti_obesitydrugs', 'antiallergy', 'antiarthritics', 'antiasthmatics', 'antibiotics', 'anticoagulants', 'antidotes', 'antifungals', 'antihistamineanddecongestantcombination', 'antihistamines', 'antihyperglycemics', 'antiinfectives', 'antiinfectives_miscellaneous', 'antineoplastics', 'antiparkinsondrugs', 'antiplateletdrugs', 'antivirals', 'autonomicdrugs', 'biologicals', 'blood', 'cardiacdrugs', 'cardiovascular', 'cnsdrugs', 'colonystimulatingfactors', 'contraceptives', 'cough_coldpreparations', 'diagnostic', 'diuretics', 'eentpreps', 'elect_caloric_h2o', 'gastrointestinal', 'herbals', 'hormones', 'immunosuppressants', 'investigational', 'miscellaneousmedicalsupplies_devices_non_drug', 'musclerelaxants', 'pre_natalvitamins', 'psychotherapeuticdrugs', 'sedative_hypnotics', 'skinpreps', 'smokingdeterrents', 'thyroidpreps', 'unclassifieddrugproducts', 'vitamins']
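
Stripping a prefix and replacing a substring are not the same operation: `str.replace` removes every occurrence, while slicing after a `startswith` check touches only the leading one. The short sketch below shows the difference on a made-up column name. None of the real columns listed above appears to contain the prefix twice, so this is a precaution rather than an observed bug.

```python
def strip_prefix(name: str, prefix: str) -> str:
    """Remove `prefix` only when it starts the name (hypothetical helper)."""
    return name[len(prefix):] if name.startswith(prefix) else name


name = "meds_multi_meds_pack"  # made-up column name, for illustration only

print(name.replace("meds_", ""))         # multi_pack
print(strip_prefix(name, "meds_"))       # multi_meds_pack
print(strip_prefix("patient_id", "meds_"))  # patient_id (unchanged)
```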


In [37]:
# Build the list of administered medications
meds_cols = [
    'analgesicandantihistaminecombination', 'analgesics', 'anesthetics', 'anti_obesitydrugs', 'antiallergy', 'antiarthritics', 'antiasthmatics', 'antibiotics', 'anticoagulants', 'antidotes', 'antifungals', 'antihistamineanddecongestantcombination', 'antihistamines', 'antihyperglycemics', 'antiinfectives', 'antiinfectives_miscellaneous', 'antineoplastics', 'antiparkinsondrugs', 'antiplateletdrugs', 'antivirals', 'autonomicdrugs', 'biologicals', 'blood', 'cardiacdrugs', 'cardiovascular', 'cnsdrugs', 'colonystimulatingfactors', 'contraceptives', 'cough_coldpreparations', 'diagnostic', 'diuretics', 'eentpreps', 'elect_caloric_h2o', 'gastrointestinal', 'herbals', 'hormones', 'immunosuppressants', 'investigational', 'miscellaneousmedicalsupplies_devices_non_drug', 'musclerelaxants', 'pre_natalvitamins', 'psychotherapeuticdrugs', 'sedative_hypnotics', 'skinpreps', 'smokingdeterrents', 'thyroidpreps', 'unclassifieddrugproducts', 'vitamins']


meds_expr = "filter(array({}), x -> x is not null)".format(
    ','.join([f"if({c}=1, '{c}', null)" for c in meds_cols])
)

silver_df_medicament = df_medicament.withColumn("meds_list", F.expr(meds_expr))


display(silver_df_medicament.head(5))

silver_df_medicament = silver_df_medicament.select('patient_id','meds_list')

display(silver_df_medicament.head(5))

In [38]:
# Read the Bronze table
bronze_df_motif_admission = spark.read.format("delta").load("Files/bronze/motif_admission")

# Get the column names
old_columns = bronze_df_motif_admission.columns

# Build the new names by stripping the leading "cc_" prefix when present
# (slicing removes only the prefix, whereas str.replace would remove every occurrence)
new_columns = [c[len("cc_"):] if c.startswith("cc_") else c for c in old_columns]

# Apply the renaming
df_motif_admission = bronze_df_motif_admission.toDF(*new_columns)

# Check the result
print(df_motif_admission.columns)

['patient_id', 'abdominalcramping', 'abdominaldistention', 'abdominalpain', 'abdominalpainpregnant', 'abnormallab', 'abscess', 'addictionproblem', 'agitation', 'alcoholintoxication', 'alcoholproblem', 'allergicreaction', 'alteredmentalstatus', 'animalbite', 'ankleinjury', 'anklepain', 'anxiety', 'arminjury', 'armpain', 'armswelling', 'assaultvictim', 'asthma', 'backpain', 'bleeding_bruising', 'blurredvision', 'bodyfluidexposure', 'breastpain', 'breathingdifficulty', 'breathingproblem', 'burn', 'cardiacarrest', 'cellulitis', 'chestpain', 'chesttightness', 'chills', 'coldlikesymptoms', 'confusion', 'conjunctivitis', 'constipation', 'cough', 'cyst', 'decreasedbloodsugar_symptomatic', 'dehydration', 'dentalpain', 'depression', 'detoxevaluation', 'diarrhea', 'dizziness', 'drug_alcoholassessment', 'drugproblem', 'dyspnea', 'dysuria', 'earpain', 'earproblem', 'edema', 'elbowpain', 'elevatedbloodsugar_nosymptoms', 'elevatedbloodsugar_symptomatic', 'emesis', 'epigastricpain', 'epistaxis', 'expo

In [39]:
# Build the list of admission reasons
motif_cols = [
    'abdominalcramping', 'abdominaldistention', 'abdominalpain', 'abdominalpainpregnant', 'abnormallab', 'abscess', 'addictionproblem', 'agitation', 'alcoholintoxication', 'alcoholproblem', 'allergicreaction', 'alteredmentalstatus', 'animalbite', 'ankleinjury', 'anklepain', 'anxiety', 'arminjury', 'armpain', 'armswelling', 'assaultvictim', 'asthma', 'backpain', 'bleeding_bruising', 'blurredvision', 'bodyfluidexposure', 'breastpain', 'breathingdifficulty', 'breathingproblem', 'burn', 'cardiacarrest', 'cellulitis', 'chestpain', 'chesttightness', 'chills', 'coldlikesymptoms', 'confusion', 'conjunctivitis', 'constipation', 'cough', 'cyst', 'decreasedbloodsugar_symptomatic', 'dehydration', 'dentalpain', 'depression', 'detoxevaluation', 'diarrhea', 'dizziness', 'drug_alcoholassessment', 'drugproblem', 'dyspnea', 'dysuria', 'earpain', 'earproblem', 'edema', 'elbowpain', 'elevatedbloodsugar_nosymptoms', 'elevatedbloodsugar_symptomatic', 'emesis', 'epigastricpain', 'epistaxis', 'exposuretostd', 'extremitylaceration', 'extremityweakness', 'eyeinjury', 'eyepain', 'eyeproblem', 'eyeredness', 'facialinjury', 'faciallaceration', 'facialpain', 'facialswelling', 'fall', 'fall_65', 'fatigue', 'femaleguproblem', 'fever', 'fever_75yearsorolder', 'fever_9weeksto74years', 'feverimmunocompromised', 'fingerinjury', 'fingerpain', 'fingerswelling', 'flankpain', 'follow_upcellulitis', 'footinjury', 'footpain', 'footswelling', 'foreignbodyineye', 'fulltrauma', 'generalizedbodyaches', 'gibleeding', 'giproblem', 'groinpain', 'hallucinations', 'handinjury', 'handpain', 'headache', 'headache_newonsetornewsymptoms', 'headache_recurrentorknowndxmigraines', 'headachere_evaluation', 'headinjury', 'headlaceration', 'hematuria', 'hemoptysis', 'hippain', 'homicidal', 'hyperglycemia', 'hypertension', 'hypotension', 'influenza', 'ingestion', 'insectbite', 'irregularheartbeat', 'jawpain', 'jointswelling', 'kneeinjury', 'kneepain', 'laceration', 'leginjury', 'legpain', 'legswelling', 'lethargy', 
'lossofconsciousness', 'maleguproblem', 'mass', 'medicalproblem', 'medicalscreening', 'medicationproblem', 'medicationrefill', 'migraine', 'modifiedtrauma', 'motorcyclecrash', 'motorvehiclecrash', 'multiplefalls', 'nasalcongestion', 'nausea', 'nearsyncope', 'neckpain', 'neurologicproblem', 'numbness', 'oralswelling', 'otalgia', 'other', 'overdose_accidental', 'overdose_intentional', 'pain', 'palpitations', 'panicattack', 'pelvicpain', 'poisoning', 'post_opproblem', 'psychiatricevaluation', 'psychoticsymptoms', 'rapidheartrate', 'rash', 'rectalbleeding', 'rectalpain', 'respiratorydistress', 'ribinjury', 'ribpain', 'seizure_newonset', 'seizure_priorhxof', 'seizures', 'shortnessofbreath', 'shoulderinjury', 'shoulderpain', 'sicklecellpain', 'sinusproblem', 'skinirritation', 'skinproblem', 'sorethroat', 'stdcheck', 'strokealert', 'suicidal', 'suture_stapleremoval', 'swallowedforeignbody', 'syncope', 'tachycardia', 'testiclepain', 'thumbinjury', 'tickremoval', 'toeinjury', 'toepain', 'trauma', 'unresponsive', 'uri', 'urinaryfrequency', 'urinaryretention', 'urinarytractinfection', 'vaginalbleeding', 'vaginaldischarge', 'vaginalpain', 'weakness', 'wheezing', 'withdrawal_alcohol', 'woundcheck', 'woundinfection', 'woundre_evaluation', 'wristinjury', 'wristpain']


motif_expr = "filter(array({}), x -> x is not null)".format(
    ','.join([f"if({c}=1, '{c}', null)" for c in motif_cols])
)

silver_df_motif_admission = df_motif_admission.withColumn("motif_list", F.expr(motif_expr))


display(silver_df_motif_admission.head(5))

silver_df_motif_admission = silver_df_motif_admission.select('patient_id','motif_list')

display(silver_df_motif_admission.head(5))


In [40]:
# Join on patient_id (left joins so that no patient row is lost)
silver_df_hopital = silver_df_patient \
    .join(silver_df_maladie, on="patient_id", how="left") \
    .join(silver_df_motif_admission, on="patient_id", how="left") \
    .join(silver_df_medicament, on="patient_id", how="left")

# Check the result
display(silver_df_hopital.head(10))
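
A left join keeps every row of the left table and fills the right-hand columns with null when no match exists. The plain-Python sketch below illustrates that behaviour on two made-up patients. It assumes at most one right-hand row per `patient_id`; in Spark, duplicate keys on the right side would multiply the matching left rows. The function name is hypothetical.

```python
from typing import Dict, List

patients = [{"patient_id": 1}, {"patient_id": 2}]  # made-up rows
diagnoses = {1: ["asthma"]}                        # patient 2 has no diagnosis row


def left_join(left: List[dict], right: Dict[int, List[str]], column: str) -> List[dict]:
    """Keep every left row; unmatched keys get None, like a Spark left join."""
    return [{**row, column: right.get(row["patient_id"])} for row in left]


joined = left_join(patients, diagnoses, "diagnosis_list")
print(joined)
# [{'patient_id': 1, 'diagnosis_list': ['asthma']}, {'patient_id': 2, 'diagnosis_list': None}]
```

One way to confirm the one-row-per-patient assumption on the real data would be to compare the row count of `silver_df_patient` with that of `silver_df_hopital` after the joins.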


In [41]:
# Read the Bronze table
bronze_df_icd_code = spark.read.format("delta").load("Files/bronze/icd_code")

display(bronze_df_icd_code.head(5))

import re

# The first row holds the real column names
header = bronze_df_icd_code.first()

# Build an RDD without that first row
df_icd_code = bronze_df_icd_code.rdd.zipWithIndex().filter(lambda x: x[1] > 0).map(lambda x: x[0])

# Create a new DataFrame that uses the header values as column names
silver_df_icd_code = spark.createDataFrame(df_icd_code, [str(cell) for cell in header])

def clean_column_name(col_name):
    # Replace spaces and special characters with an underscore
    return re.sub(r"[ ,;{}\(\)\n\t=]", "_", col_name)

# Apply to every column
new_columns = [clean_column_name(c) for c in silver_df_icd_code.columns]
silver_df_icd_code = silver_df_icd_code.toDF(*new_columns)

# Show the result
display(silver_df_icd_code.head(5))
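
The cell above does two things: it promotes the first data row to column names, then sanitizes those names with a regular expression. The sketch below reproduces both steps in plain Python on made-up rows, so the effect of the pattern is visible without a Spark session. `promote_header` is a hypothetical helper; `clean_column_name` uses the same pattern as the notebook.

```python
import re
from typing import List, Tuple


def clean_column_name(col_name: str) -> str:
    # Same pattern as the notebook: each listed character becomes "_"
    return re.sub(r"[ ,;{}\(\)\n\t=]", "_", col_name)


def promote_header(
    rows: List[Tuple[str, ...]]
) -> Tuple[List[str], List[Tuple[str, ...]]]:
    """Hypothetical helper: first row -> cleaned column names, the rest -> data."""
    header, *data = rows
    return [clean_column_name(str(cell)) for cell in header], data


rows = [("ICD code", "Category (level 1)"), ("A00", "Infectious")]  # made-up sample
columns, data = promote_header(rows)
print(columns)  # ['ICD_code', 'Category__level_1_']
print(data)     # [('A00', 'Infectious')]
```

Each matched character is replaced individually, so consecutive special characters produce consecutive underscores.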


In [42]:
# Read the Bronze table
bronze_df_meds_code = spark.read.format("delta").load("Files/bronze/meds_code")

display(bronze_df_meds_code.head(5))

import re

# The first row holds the real column names
header = bronze_df_meds_code.first()

# Build an RDD without that first row
df_meds_code = bronze_df_meds_code.rdd.zipWithIndex().filter(lambda x: x[1] > 0).map(lambda x: x[0])

# Create a new DataFrame that uses the header values as column names
silver_df_meds_code = spark.createDataFrame(df_meds_code, [str(cell) for cell in header])

def clean_column_name(col_name):
    # Replace spaces and special characters with an underscore
    return re.sub(r"[ ,;{}\(\)\n\t=]", "_", col_name)

# Apply to every column
new_columns = [clean_column_name(c) for c in silver_df_meds_code.columns]
silver_df_meds_code = silver_df_meds_code.toDF(*new_columns)

# Show the result
display(silver_df_meds_code.head(5))

In [43]:
# Read the Bronze table
bronze_df_motifs_code = spark.read.format("delta").load("Files/bronze/motifs_code")

display(bronze_df_motifs_code.head(5))

import re

# The first row holds the real column names
header = bronze_df_motifs_code.first()

# Build an RDD without that first row
df_motifs_code = bronze_df_motifs_code.rdd.zipWithIndex().filter(lambda x: x[1] > 0).map(lambda x: x[0])

# Create a new DataFrame that uses the header values as column names
silver_df_motifs_code = spark.createDataFrame(df_motifs_code, [str(cell) for cell in header])

def clean_column_name(col_name):
    # Replace spaces and special characters with an underscore
    return re.sub(r"[ ,;{}\(\)\n\t=]", "_", col_name)

# Apply to every column
new_columns = [clean_column_name(c) for c in silver_df_motifs_code.columns]
silver_df_motifs_code = silver_df_motifs_code.toDF(*new_columns)

# Show the result
display(silver_df_motifs_code.head(5))

In [46]:
# Save the Silver layer
silver_df_hopital.write.mode("overwrite").format("delta").save("Files/silver/hopital")
silver_df_icd_code.write.mode("overwrite").format("delta").save("Files/silver/icd_code")
silver_df_meds_code.write.mode("overwrite").format("delta").save("Files/silver/meds_code")
silver_df_motifs_code.write.mode("overwrite").format("delta").save("Files/silver/motifs_code")

print("✅ Silver layer enriched, ready for Gold!")

✅ Silver layer enriched, ready for Gold!


In [47]:
# Create the staging folder for GitHub

silver_df_hopital.write.mode("overwrite").parquet("Files/staging/silver_df_hopital.parquet")
silver_df_icd_code.write.mode("overwrite").parquet("Files/staging/silver_df_icd_code.parquet")
silver_df_meds_code.write.mode("overwrite").parquet("Files/staging/silver_df_meds_code.parquet")
silver_df_motifs_code.write.mode("overwrite").parquet("Files/staging/silver_df_motifs_code.parquet")
