# 📒 Notebook: 01_bronze_ingest_clean

---

## 📝 Objective

- Read the raw source files from the ingestion area (`Files/raw/`).
- Apply initial cleaning, normalization, and quality checks.
- Split the hospital triage dataset into 4 separate datasets (patient, maladie, motif_admission, and medicament) for more efficient downstream processing.
- Save the cleaned data to the **bronze layer** (Delta files).

---

## 📥 Inputs

| Dataset         | Format | Location                                         | Description                        |
|-----------------|--------|--------------------------------------------------|------------------------------------|
| icd_code        | CSV    | /health_lakehouse/Files/raw/icd_code.csv         | Raw disease category data          |
| hospital-triage | CSV    | /health_lakehouse/Files/raw/hospital-triage.csv  | Raw hospital data                  |
| meds_cat        | CSV    | /health_lakehouse/Files/raw/meds_cat.csv         | Raw medication category data       |
| motif_categorie | CSV    | /health_lakehouse/Files/raw/motif_categorie.csv  | Raw admission reason category data |

---

## 📤 Outputs

| Dataset         | Format | Location                                        | Description                                   |
|-----------------|--------|-------------------------------------------------|-----------------------------------------------|
| icd_code        | Delta  | /health_lakehouse/Files/bronze/icd_code         | Disease category data after cleaning          |
| patient         | Delta  | /health_lakehouse/Files/bronze/patient          | Patient data after cleaning                   |
| maladie         | Delta  | /health_lakehouse/Files/bronze/maladie          | Disease data after cleaning                   |
| motif_admission | Delta  | /health_lakehouse/Files/bronze/motif_admission  | Admission reason data after cleaning          |
| medicament      | Delta  | /health_lakehouse/Files/bronze/medicament       | Medication data after cleaning                |
| meds_code       | Delta  | /health_lakehouse/Files/bronze/meds_code        | Medication category data after cleaning       |
| motifs_code     | Delta  | /health_lakehouse/Files/bronze/motifs_code      | Admission reason category data after cleaning |

---

## 👤 Author(s) / Contact

- SEKARI Inès - [ines.sekari@efrei.net]
- NKUIDA Malaïka - [malaika.nkuida@efrei.net]

---

## 🗓️ Versioning & Updates

| Version | Date        | Changes                                   |
|---------|-------------|-------------------------------------------|
| 1.0     | 2025-04-26  | Creation                                  |
| 1.1     | 2025-04-27  | Added datasets, cleaning, and Delta save  |

---


In [1]:
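# Read the source data
df_hospital = spark.read.csv(
    'Files/raw/hospital-triage.csv',
    header=True,       # first line holds the column names
    inferSchema=True,  # let Spark infer the column types
    sep=',',
    quote='"',
    escape='"'         # a quote inside a quoted field is written as ""
)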

display(df_hospital.head(5))
print(df_hospital.columns)



['dep_name', 'esi', 'age', 'gender', 'ethnicity', 'race', 'lang', 'religion', 'maritalstatus', 'employstatus', 'insurance_status', 'disposition', 'arrivalmode', 'arrivalmonth', 'arrivalday', 'arrivalhour_bin', 'previousdispo', '2ndarymalig', 'abdomhernia', 'abdomnlpain', 'abortcompl', 'acqfootdef', 'acrenlfail', 'acutecvd', 'acutemi', 'acutphanm', 'adjustmentdisorders', 'adltrespfl', 'alcoholrelateddisorders', 'allergy', 'amniosdx', 'analrectal', 'anemia', 'aneurysm', 'anxietydisorders', 'appendicitis', 'artembolism', 'asppneumon', 'asthma', 'attentiondeficitconductdisruptivebeha', 'backproblem', 'biliarydx', 'birthasphyx', 'birthtrauma', 'bladdercncr', 'blindness', 'bnignutneo', 'bonectcncr', 'bph', 'brainnscan', 'breastcancr', 'breastdx', 'brnchlngca', 'bronchitis', 'burns', 'cardiaarrst', 'cardiacanom', 'carditis', 'cataract', 'cervixcancr', 'chestpain', 'chfnonhp', 'chrkidneydisease', 'coaghemrdx', 'coloncancer', 'comabrndmg', 'complicdevi', 'complicproc', 'conduction', 'contracept
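
The `quote='"'` and `escape='"'` options in the read above mean that a double quote inside a quoted field is written as two double quotes. The sketch below illustrates that convention with Python's standard `csv` module, used here only as a stand-in for Spark's CSV reader; the sample line is made up.

```python
import csv
import io

# Made-up sample line: the second field contains a comma and an embedded quote,
# which is written as "" inside the quoted field.
sample = 'triage,"said ""ok"", then left",3\n'

reader = csv.reader(io.StringIO(sample), delimiter=',', quotechar='"', doublequote=True)
fields = next(reader)
print(fields)
```

The comma inside the quoted field does not split it, and the doubled quote collapses to a single one.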

In [2]:
from pyspark.sql import functions as F

# Build a profile identifier by concatenating the profile attributes
df_hospital = df_hospital.withColumn(
    "profile_id",
    F.concat_ws("_", F.col("age"), F.col("gender"), F.col("ethnicity"), F.col("lang"), F.col("maritalstatus"))
)

# Add a row identifier (unique and increasing, but not consecutive)
df_hospital = df_hospital.withColumn("patient_id", F.monotonically_increasing_id())

display(df_hospital.head(5))
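
One detail of `concat_ws` is worth keeping in mind when reading `profile_id`: Spark skips null inputs instead of emitting an empty slot. The plain-Python sketch below mimics that behavior; `concat_ws_like` and the two rows are hypothetical and only serve as an illustration.

```python
def concat_ws_like(sep, *values):
    """Join values with sep, skipping None, the way Spark's concat_ws ignores nulls."""
    return sep.join(str(v) for v in values if v is not None)

# Hypothetical rows: (age, gender, ethnicity, lang, maritalstatus)
full_row = (40, "Male", "Hispanic or Latino", "English", "Single")
missing_ethnicity = (40, "Male", None, "English", "Single")

print(concat_ws_like("_", *full_row))
print(concat_ws_like("_", *missing_ethnicity))
```

With a missing attribute the remaining parts shift left, so the position of a value inside `profile_id` is not guaranteed. If that matters downstream, one option is to replace nulls with a placeholder before concatenating.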


In [3]:
import re

def categorize_column(colname):
    # Rules are evaluated in order; the first matching rule wins
    col = colname.lower()

    # 1. Medications
    if col.startswith('meds_'):
        return 'traitement/médicament'

    # 2. Lab tests (summary statistics, identified by their suffix)
    bio_patterns = [
        r'_last$', r'_min$', r'_max$', r'_median$'
    ]
    if any(re.search(pat, col) for pat in bio_patterns):
        return 'examen biologique/mesure'

    # 3. Urine tests/cultures
    if ('ua_' in col) or ('culture' in col):
        return 'examen urinaire/culture'

    # 4. Vital signs
    vital_keywords = [
        'triage_vital', 'pulse', 'resp', 'spo2', 'temp', 'sbp', 'dbp', 'o2_device'
    ]
    if any(k in col for k in vital_keywords):
        return 'signe vital'

    # 5. Imaging exams or procedures
    img_keywords = [
        'cxr', 'ekg', 'echo', 'ct', 'xr', 'mri', 'us', 'img'
    ]
    if any(col.endswith('_count') and k in col for k in img_keywords):
        return "examen imagerie/acte médical"

    # 6. Patient pathway measures
    parcours_keywords = [
        'n_edvisits', 'n_admissions', 'n_surgeries'
    ]
    if col in parcours_keywords:
        return 'mesure parcours patient'

    # 7. Miscellaneous urine tests/cultures (number of positives / total count)
    if any(key in col for key in ['_npos', '_count']):
        # Columns already classified by an earlier rule never reach this point
        if ('ua_' in col) or ('culture' in col):
            return 'examen urinaire/culture'
        else:
            return 'autre comptage'

    # Default
    return 'autre'

# Example usage:
cols = [
    'n_edvisits', 'triage_vital_hr', 'pulse_last', 'hemoglobin_max', 'cxr_count',
    'meds_antibiotics', 'bloodua_npos', 'urineculture,routine_last', 'sbp_median', 'poctroponini._median'
]

for name in cols:
    print(f"{name:35s} => {categorize_column(name)}")


n_edvisits                          => mesure parcours patient
triage_vital_hr                     => signe vital
pulse_last                          => examen biologique/mesure
hemoglobin_max                      => examen biologique/mesure
cxr_count                           => examen imagerie/acte médical
meds_antibiotics                    => traitement/médicament
bloodua_npos                        => examen urinaire/culture
urineculture,routine_last           => examen biologique/mesure
sbp_median                          => examen biologique/mesure
poctroponini._median                => examen biologique/mesure
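
In the output above, `pulse_last`, `sbp_median`, and `urineculture,routine_last` are all labelled `examen biologique/mesure`, because the suffix rule runs before the vital-sign and urine/culture rules. If the intent is for the more specific rules to win, one option is to test them first. The function below is a reduced, hypothetical variant (`categorize_specific_first` is not part of the notebook) that covers only four of the rules:

```python
import re

VITAL_KEYWORDS = ('triage_vital', 'pulse', 'resp', 'spo2', 'temp', 'sbp', 'dbp', 'o2_device')
STAT_SUFFIX = re.compile(r'_(last|min|max|median)$')

def categorize_specific_first(colname):
    """Same labels as categorize_column, but specific rules run before the suffix rule."""
    name = colname.lower()
    if name.startswith('meds_'):
        return 'traitement/médicament'
    if 'ua_' in name or 'culture' in name:
        return 'examen urinaire/culture'
    if any(k in name for k in VITAL_KEYWORDS):
        return 'signe vital'
    if STAT_SUFFIX.search(name):
        return 'examen biologique/mesure'
    return 'autre'

for name in ['pulse_last', 'sbp_median', 'urineculture,routine_last', 'hemoglobin_max']:
    print(f"{name:35s} => {categorize_specific_first(name)}")
```

Keyword matching is still substring-based, so any new keyword should be checked against the full column list before relying on the result.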


In [4]:
df_patient = df_hospital.select('patient_id', 'profile_id','dep_name', 'age', 'gender', 'ethnicity', 'race', 'lang', 'religion', 'maritalstatus', 'employstatus', 'insurance_status', 'disposition', 'arrivalmode', 'arrivalmonth', 'arrivalday', 'arrivalhour_bin', 'previousdispo','n_surgeries','n_edvisits', 'n_admissions')
df_maladie = df_hospital.select('patient_id','2ndarymalig', 'abdomhernia', 'abdomnlpain', 'abortcompl', 'acqfootdef', 'acrenlfail', 'acutecvd', 'acutemi', 'acutphanm', 'adjustmentdisorders', 'adltrespfl', 'alcoholrelateddisorders', 'allergy', 'amniosdx', 'analrectal', 'anemia', 'aneurysm', 'anxietydisorders', 'appendicitis', 'artembolism', 'asppneumon', 'asthma', 'attentiondeficitconductdisruptivebeha', 'backproblem', 'biliarydx', 'birthasphyx', 'birthtrauma', 'bladdercncr', 'blindness', 'bnignutneo', 'bonectcncr', 'bph', 'brainnscan', 'breastcancr', 'breastdx', 'brnchlngca', 'bronchitis', 'burns', 'cardiaarrst', 'cardiacanom', 'carditis', 'cataract', 'cervixcancr', 'chestpain', 'chfnonhp', 'chrkidneydisease', 'coaghemrdx', 'coloncancer', 'comabrndmg', 'complicdevi', 'complicproc', 'conduction', 'contraceptiv', 'copd', 'coronathero', 'crushinjury', 'cysticfibro', 'deliriumdementiaamnesticothercognitiv', 'developmentaldisorders', 'diabmelnoc', 'diabmelwcm', 'disordersusuallydiagnosedininfancych', 'diverticulos', 'dizziness', 'dminpreg', 'dysrhythmia', 'earlylabor', 'ecodesadverseeffectsofmedicalcare', 'ecodesadverseeffectsofmedicaldrugs', 'ecodescutpierce', 'ecodesdrowningsubmersion', 'ecodesfall', 'ecodesfirearm', 'ecodesfireburn', 'ecodesmachinery', 'ecodesmotorvehicletrafficmvt', 'ecodesnaturalenvironment', 'ecodesotherspecifiedandclassifiable', 'ecodesotherspecifiednec', 'ecodespedalcyclistnotmvt', 'ecodesplaceofoccurrence', 'ecodespoisoning', 'ecodesstruckbyagainst', 'ecodessuffocation', 'ecodestransportnotmvt', 'ecodesunspecified', 'ectopicpreg', 'encephalitis', 'endometrios', 'epilepsycnv', 'esophcancer', 'esophgealdx', 'exameval', 'eyeinfectn', 'fatigue', 'femgenitca', 'feminfertil', 'fetaldistrs', 'fluidelcdx', 'fuo', 'fxarm', 'fxhip', 'fxleg', 'fxskullfac', 'gangrene', 'gasduoulcer', 'gastritis', 'gastroent', 'giconganom', 'gihemorrhag', 'giperitcan', 'glaucoma', 'goutotcrys', 'guconganom', 'hdnckcancr', 'headachemig', 'hemmorhoids', 'hemorrpreg', 
'hepatitis', 'hivinfectn', 'hodgkinsds', 'hrtvalvedx', 'htn', 'htncomplicn', 'htninpreg', 'hyperlipidem', 'immunitydx', 'immunizscrn', 'impulsecontroldisordersnec', 'inducabortn', 'infectarth', 'influenza', 'infmalegen', 'intestinfct', 'intobstruct', 'intracrninj', 'jointinjury', 'kidnyrnlca', 'lateeffcvd', 'leukemias', 'liveborn', 'liveribdca', 'longpregncy', 'lowbirthwt', 'lungexternl', 'lymphenlarg', 'maintchemr', 'malgenitca', 'maligneopls', 'malposition', 'meningitis', 'menopausldx', 'menstrualdx', 'miscellaneousmentalhealthdisorders', 'mooddisorders', 'mouthdx', 'ms', 'multmyeloma', 'mycoses', 'nauseavomit', 'neoplsmunsp', 'nephritis', 'nervcongan', 'nonepithca', 'nonhodglym', 'nutritdefic', 'obrelatedperintrauma', 'opnwndextr', 'opnwndhead', 'osteoarthros', 'osteoporosis', 'otacqdefor', 'otaftercare', 'otbnignneo', 'otbonedx', 'otcirculdx', 'otcomplbir', 'otconganom', 'otconntiss', 'otdxbladdr', 'otdxkidney', 'otdxstomch', 'otendodsor', 'otfemalgen', 'othbactinf', 'othcnsinfx', 'othematldx', 'othercvd', 'othereardx', 'otheredcns', 'othereyedx', 'othergidx', 'othergudx', 'otherinjury', 'otherpregnancyanddeliveryincludingnormal', 'otherscreen', 'othfracture', 'othheartdx', 'othinfectns', 'othliverdx', 'othlowresp', 'othmalegen', 'othnervdx', 'othskindx', 'othveindx', 'otinflskin', 'otitismedia', 'otjointdx', 'otnutritdx', 'otperintdx', 'otpregcomp', 'otprimryca', 'otrespirca', 'otupprresp', 'otuprspin', 'ovariancyst', 'ovarycancer', 'pancreascan', 'pancreasdx', 'paralysis', 'parkinsons', 'pathologfx', 'pelvicobstr', 'perintjaund', 'peripathero', 'peritonitis', 'personalitydisorders', 'phlebitis', 'pid', 'pleurisy', 'pneumonia', 'poisnnonmed', 'poisnotmed', 'poisonpsych', 'precereoccl', 'prevcsectn', 'prolapse', 'prostatecan', 'pulmhartdx', 'rctmanusca', 'rehab', 'respdistres', 'retinaldx', 'rheumarth', 'schizophreniaandotherpsychoticdisorde', 'screeningandhistoryofmentalhealthan', 'septicemia', 'septicemiaexceptinlabor', 'sexualinfxs', 'shock', 'sicklecell', 
'skininfectn', 'skinmelanom', 'sle', 'socialadmin', 'spincorinj', 'spontabortn', 'sprain', 'stomchcancr', 'substancerelateddisorders', 'suicideandintentionalselfinflictedin', 'superficinj', 'syncope', 'teethdx', 'testiscancr', 'thyroidcncr', 'thyroiddsor', 'tia', 'tonsillitis', 'tuberculosis', 'ulceratcol', 'ulcerskin', 'umbilcord', 'unclassified', 'urinstone', 'urinyorgca', 'uteruscancr', 'uti', 'varicosevn', 'viralinfect', 'whtblooddx',)
df_motif_admission = df_hospital.select('patient_id','cc_abdominalcramping', 'cc_abdominaldistention', 'cc_abdominalpain', 'cc_abdominalpainpregnant', 'cc_abnormallab', 'cc_abscess', 'cc_addictionproblem', 'cc_agitation', 'cc_alcoholintoxication', 'cc_alcoholproblem', 'cc_allergicreaction', 'cc_alteredmentalstatus', 'cc_animalbite', 'cc_ankleinjury', 'cc_anklepain', 'cc_anxiety', 'cc_arminjury', 'cc_armpain', 'cc_armswelling', 'cc_assaultvictim', 'cc_asthma', 'cc_backpain', 'cc_bleeding/bruising', 'cc_blurredvision', 'cc_bodyfluidexposure', 'cc_breastpain', 'cc_breathingdifficulty', 'cc_breathingproblem', 'cc_burn', 'cc_cardiacarrest', 'cc_cellulitis', 'cc_chestpain', 'cc_chesttightness', 'cc_chills', 'cc_coldlikesymptoms', 'cc_confusion', 'cc_conjunctivitis', 'cc_constipation', 'cc_cough', 'cc_cyst', 'cc_decreasedbloodsugar-symptomatic', 'cc_dehydration', 'cc_dentalpain', 'cc_depression', 'cc_detoxevaluation', 'cc_diarrhea', 'cc_dizziness', 'cc_drug/alcoholassessment', 'cc_drugproblem', 'cc_dyspnea', 'cc_dysuria', 'cc_earpain', 'cc_earproblem', 'cc_edema', 'cc_elbowpain', 'cc_elevatedbloodsugar-nosymptoms', 'cc_elevatedbloodsugar-symptomatic', 'cc_emesis', 'cc_epigastricpain', 'cc_epistaxis', 'cc_exposuretostd', 'cc_extremitylaceration', 'cc_extremityweakness', 'cc_eyeinjury', 'cc_eyepain', 'cc_eyeproblem', 'cc_eyeredness', 'cc_facialinjury', 'cc_faciallaceration', 'cc_facialpain', 'cc_facialswelling', 'cc_fall', 'cc_fall>65', 'cc_fatigue', 'cc_femaleguproblem', 'cc_fever', 'cc_fever-75yearsorolder', 'cc_fever-9weeksto74years', 'cc_feverimmunocompromised', 'cc_fingerinjury', 'cc_fingerpain', 'cc_fingerswelling', 'cc_flankpain', 'cc_follow-upcellulitis', 'cc_footinjury', 'cc_footpain', 'cc_footswelling', 'cc_foreignbodyineye', 'cc_fulltrauma', 'cc_generalizedbodyaches', 'cc_gibleeding', 'cc_giproblem', 'cc_groinpain', 'cc_hallucinations', 'cc_handinjury', 'cc_handpain', 'cc_headache', 'cc_headache-newonsetornewsymptoms', 'cc_headache-recurrentorknowndxmigraines', 
'cc_headachere-evaluation', 'cc_headinjury', 'cc_headlaceration', 'cc_hematuria', 'cc_hemoptysis', 'cc_hippain', 'cc_homicidal', 'cc_hyperglycemia', 'cc_hypertension', 'cc_hypotension', 'cc_influenza', 'cc_ingestion', 'cc_insectbite', 'cc_irregularheartbeat', 'cc_jawpain', 'cc_jointswelling', 'cc_kneeinjury', 'cc_kneepain', 'cc_laceration', 'cc_leginjury', 'cc_legpain', 'cc_legswelling', 'cc_lethargy', 'cc_lossofconsciousness', 'cc_maleguproblem', 'cc_mass', 'cc_medicalproblem', 'cc_medicalscreening', 'cc_medicationproblem', 'cc_medicationrefill', 'cc_migraine', 'cc_modifiedtrauma', 'cc_motorcyclecrash', 'cc_motorvehiclecrash', 'cc_multiplefalls', 'cc_nasalcongestion', 'cc_nausea', 'cc_nearsyncope', 'cc_neckpain', 'cc_neurologicproblem', 'cc_numbness', 'cc_oralswelling', 'cc_otalgia', 'cc_other', 'cc_overdose-accidental', 'cc_overdose-intentional', 'cc_pain', 'cc_palpitations', 'cc_panicattack', 'cc_pelvicpain', 'cc_poisoning', 'cc_post-opproblem', 'cc_psychiatricevaluation', 'cc_psychoticsymptoms', 'cc_rapidheartrate', 'cc_rash', 'cc_rectalbleeding', 'cc_rectalpain', 'cc_respiratorydistress', 'cc_ribinjury', 'cc_ribpain', 'cc_seizure-newonset', 'cc_seizure-priorhxof', 'cc_seizures', 'cc_shortnessofbreath', 'cc_shoulderinjury', 'cc_shoulderpain', 'cc_sicklecellpain', 'cc_sinusproblem', 'cc_skinirritation', 'cc_skinproblem', 'cc_sorethroat', 'cc_stdcheck', 'cc_strokealert', 'cc_suicidal', 'cc_suture/stapleremoval', 'cc_swallowedforeignbody', 'cc_syncope', 'cc_tachycardia', 'cc_testiclepain', 'cc_thumbinjury', 'cc_tickremoval', 'cc_toeinjury', 'cc_toepain', 'cc_trauma', 'cc_unresponsive', 'cc_uri', 'cc_urinaryfrequency', 'cc_urinaryretention', 'cc_urinarytractinfection', 'cc_vaginalbleeding', 'cc_vaginaldischarge', 'cc_vaginalpain', 'cc_weakness', 'cc_wheezing', 'cc_withdrawal-alcohol', 'cc_woundcheck', 'cc_woundinfection', 'cc_woundre-evaluation', 'cc_wristinjury', 'cc_wristpain')
df_medicament = df_hospital.select('patient_id','meds_analgesicandantihistaminecombination', 'meds_analgesics', 'meds_anesthetics', 'meds_anti-obesitydrugs', 'meds_antiallergy', 'meds_antiarthritics', 'meds_antiasthmatics', 'meds_antibiotics', 'meds_anticoagulants', 'meds_antidotes', 'meds_antifungals', 'meds_antihistamineanddecongestantcombination', 'meds_antihistamines', 'meds_antihyperglycemics', 'meds_antiinfectives', 'meds_antiinfectives/miscellaneous', 'meds_antineoplastics', 'meds_antiparkinsondrugs', 'meds_antiplateletdrugs', 'meds_antivirals', 'meds_autonomicdrugs', 'meds_biologicals', 'meds_blood', 'meds_cardiacdrugs', 'meds_cardiovascular', 'meds_cnsdrugs', 'meds_colonystimulatingfactors', 'meds_contraceptives', 'meds_cough/coldpreparations', 'meds_diagnostic', 'meds_diuretics', 'meds_eentpreps', 'meds_elect/caloric/h2o', 'meds_gastrointestinal', 'meds_herbals', 'meds_hormones', 'meds_immunosuppressants', 'meds_investigational', 'meds_miscellaneousmedicalsupplies,devices,non-drug', 'meds_musclerelaxants', 'meds_pre-natalvitamins', 'meds_psychotherapeuticdrugs', 'meds_sedative/hypnotics', 'meds_skinpreps', 'meds_smokingdeterrents', 'meds_thyroidpreps', 'meds_unclassifieddrugproducts', 'meds_vitamins')
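
Every admission-reason column selected above starts with `cc_` and every medication column starts with `meds_`, so those two lists could be derived from the column names instead of being typed out. A minimal sketch, assuming the prefixes stay reliable; `split_by_prefix` is a hypothetical helper and the column list is a small made-up subset:

```python
def split_by_prefix(columns, key="patient_id"):
    """Group column names by prefix; every group starts with the key column."""
    groups = {
        "motif_admission": [c for c in columns if c.startswith("cc_")],
        "medicament": [c for c in columns if c.startswith("meds_")],
    }
    return {name: [key] + cols for name, cols in groups.items()}

# Small made-up subset of the real column list
columns = ["patient_id", "age", "cc_chestpain", "cc_fever", "meds_antibiotics", "asthma"]
groups = split_by_prefix(columns)
print(groups["motif_admission"])
print(groups["medicament"])
```

In the notebook the same lists could be passed to Spark, for example `df_hospital.select(*groups["medicament"])` with `columns = df_hospital.columns`. The disease columns have no common prefix, so they would still need an explicit list.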



In [5]:
#df_mesure_bio = df_hospital.select('patient_id','absolutelymphocytecount_last', 'acetonebld_last', 'alanineaminotransferase(alt)_last', 'albumin_last', 'alkphos_last', 'anc(absneutrophilcount)_last', 'aniongap_last', 'aspartateaminotransferase(ast)_last', 'b-typenatriureticpeptide,pro(probnp)_last', 'baseexcess(poc)_last', 'baseexcess,venous(poc)_last', 'basos_last', 'basosabs_last', 'benzodiazepinesscreen,urine,noconf._last', 'bilirubindirect_last', 'bilirubintotal_last', 'bun_last', 'bun/creatratio_last', 'calcium_last', 'calculatedco2(poc)_last', 'calculatedhco3(poc)i_last', 'calculatedo2saturation(poc)_last', 'chloride_last', 'cktotal_last', 'co2_last', 'co2calculated,venous(poc)_last', 'co2,poc_last', 'creatinine_last', 'd-dimer_last', 'egfr_last', 'egfr(nonafricanamerican)_last', 'egfr(aframer)_last', 'eos_last', 'eosinoabs_last', 'epithelialcells_last', 'globulin_last', 'glucose_last', 'glucose,meter_last', 'hco3calculated,venous(poc)_last', 'hematocrit_last', 'hemoglobin_last', 'immaturegrans(abs)_last', 'immaturegranulocytes_last', 'inr_last', 'lactate,poc_last', 'lipase_last', 'lymphs_last', 'magnesium_last', 'mch_last', 'mchc_last', 'mcv_last', 'monocytes_last', 'monosabs_last', 'mpv_last', 'neutrophils_last', 'nrbc_last', 'nrbcabsolute_last', 'o2satcalculated,venous(poc)_last', 'pco2(poc)_last', 'pco2,venous(poc)_last', 'ph,venous(poc)_last', 'phencyclidine(pcp)screen,urine,noconf._last', 'phosphorus_last', 'platelets_last', 'po2(poc)_last', 'po2,venous(poc)_last', 'pocbun_last', 'poccreatinine_last', 'pocglucose_last', 'pochematocrit_last', 'pocionizedcalcium_last', 'pocph_last', 'pocpotassium_last', 'pocsodium_last', 'poctroponini._last', 'potassium_last', 'proteintotal_last', 'prothrombintime_last', 'ptt_last', 'rbc_last', 'rbc/hpf_last', 'rdw_last', 'sodium_last', 'troponini(poc)_last', 'troponint_last', 'tsh_last', 'wbc_last', 'wbc/hpf_last', 'absolutelymphocytecount_min', 'acetonebld_min', 'alanineaminotransferase(alt)_min', 'albumin_min', 
'alkphos_min', 'anc(absneutrophilcount)_min', 'aniongap_min', 'aspartateaminotransferase(ast)_min', 'b-typenatriureticpeptide,pro(probnp)_min', 'baseexcess(poc)_min', 'baseexcess,venous(poc)_min', 'basos_min', 'basosabs_min', 'benzodiazepinesscreen,urine,noconf._min', 'bilirubindirect_min', 'bilirubintotal_min', 'bun_min', 'bun/creatratio_min', 'calcium_min', 'calculatedco2(poc)_min', 'calculatedhco3(poc)i_min', 'calculatedo2saturation(poc)_min', 'chloride_min', 'cktotal_min', 'co2_min', 'co2calculated,venous(poc)_min', 'co2,poc_min', 'creatinine_min', 'd-dimer_min', 'egfr_min', 'egfr(nonafricanamerican)_min', 'egfr(aframer)_min', 'eos_min', 'eosinoabs_min', 'epithelialcells_min', 'globulin_min', 'glucose_min', 'glucose,meter_min', 'hco3calculated,venous(poc)_min', 'hematocrit_min', 'hemoglobin_min', 'immaturegrans(abs)_min', 'immaturegranulocytes_min', 'inr_min', 'lactate,poc_min', 'lipase_min', 'lymphs_min', 'magnesium_min', 'mch_min', 'mchc_min', 'mcv_min', 'monocytes_min', 'monosabs_min', 'mpv_min', 'neutrophils_min', 'nrbc_min', 'nrbcabsolute_min', 'o2satcalculated,venous(poc)_min', 'pco2(poc)_min', 'pco2,venous(poc)_min', 'ph,venous(poc)_min', 'phencyclidine(pcp)screen,urine,noconf._min', 'phosphorus_min', 'platelets_min', 'po2(poc)_min', 'po2,venous(poc)_min', 'pocbun_min', 'poccreatinine_min', 'pocglucose_min', 'pochematocrit_min', 'pocionizedcalcium_min', 'pocph_min', 'pocpotassium_min', 'pocsodium_min', 'poctroponini._min', 'potassium_min', 'proteintotal_min', 'prothrombintime_min', 'ptt_min', 'rbc_min', 'rbc/hpf_min', 'rdw_min', 'sodium_min', 'troponini(poc)_min', 'troponint_min', 'tsh_min', 'wbc_min', 'wbc/hpf_min', 'absolutelymphocytecount_max', 'acetonebld_max', 'alanineaminotransferase(alt)_max', 'albumin_max', 'alkphos_max', 'anc(absneutrophilcount)_max', 'aniongap_max', 'aspartateaminotransferase(ast)_max', 'b-typenatriureticpeptide,pro(probnp)_max', 'baseexcess(poc)_max', 'baseexcess,venous(poc)_max', 'basos_max', 'basosabs_max', 
'benzodiazepinesscreen,urine,noconf._max', 'bilirubindirect_max', 'bilirubintotal_max', 'bun_max', 'bun/creatratio_max', 'calcium_max', 'calculatedco2(poc)_max', 'calculatedhco3(poc)i_max', 'calculatedo2saturation(poc)_max', 'chloride_max', 'cktotal_max', 'co2_max', 'co2calculated,venous(poc)_max', 'co2,poc_max', 'creatinine_max', 'd-dimer_max', 'egfr_max', 'egfr(nonafricanamerican)_max', 'egfr(aframer)_max', 'eos_max', 'eosinoabs_max', 'epithelialcells_max', 'globulin_max', 'glucose_max', 'glucose,meter_max', 'hco3calculated,venous(poc)_max', 'hematocrit_max', 'hemoglobin_max', 'immaturegrans(abs)_max', 'immaturegranulocytes_max', 'inr_max', 'lactate,poc_max', 'lipase_max', 'lymphs_max', 'magnesium_max', 'mch_max', 'mchc_max', 'mcv_max', 'monocytes_max', 'monosabs_max', 'mpv_max', 'neutrophils_max', 'nrbc_max', 'nrbcabsolute_max', 'o2satcalculated,venous(poc)_max', 'pco2(poc)_max', 'pco2,venous(poc)_max', 'ph,venous(poc)_max', 'phencyclidine(pcp)screen,urine,noconf._max', 'phosphorus_max', 'platelets_max', 'po2(poc)_max', 'po2,venous(poc)_max', 'pocbun_max', 'poccreatinine_max', 'pocglucose_max', 'pochematocrit_max', 'pocionizedcalcium_max', 'pocph_max', 'pocpotassium_max', 'pocsodium_max', 'poctroponini._max', 'potassium_max', 'proteintotal_max', 'prothrombintime_max', 'ptt_max', 'rbc_max', 'rbc/hpf_max', 'rdw_max', 'sodium_max', 'troponini(poc)_max', 'troponint_max', 'tsh_max', 'wbc_max', 'wbc/hpf_max', 'absolutelymphocytecount_median', 'acetonebld_median', 'alanineaminotransferase(alt)_median', 'albumin_median', 'alkphos_median', 'anc(absneutrophilcount)_median', 'aniongap_median', 'aspartateaminotransferase(ast)_median', 'b-typenatriureticpeptide,pro(probnp)_median', 'baseexcess(poc)_median', 'baseexcess,venous(poc)_median', 'basos_median', 'basosabs_median', 'benzodiazepinesscreen,urine,noconf._median', 'bilirubindirect_median', 'bilirubintotal_median', 'bun_median', 'bun/creatratio_median', 'calcium_median', 'calculatedco2(poc)_median', 
'calculatedhco3(poc)i_median', 'calculatedo2saturation(poc)_median', 'chloride_median', 'cktotal_median', 'co2_median', 'co2calculated,venous(poc)_median', 'co2,poc_median', 'creatinine_median', 'd-dimer_median', 'egfr_median', 'egfr(nonafricanamerican)_median', 'egfr(aframer)_median', 'eos_median', 'eosinoabs_median', 'epithelialcells_median', 'globulin_median', 'glucose_median', 'glucose,meter_median', 'hco3calculated,venous(poc)_median', 'hematocrit_median', 'hemoglobin_median', 'immaturegrans(abs)_median', 'immaturegranulocytes_median', 'inr_median', 'lactate,poc_median', 'lipase_median', 'lymphs_median', 'magnesium_median', 'mch_median', 'mchc_median', 'mcv_median', 'monocytes_median', 'monosabs_median', 'mpv_median', 'neutrophils_median', 'nrbc_median', 'nrbcabsolute_median', 'o2satcalculated,venous(poc)_median', 'pco2(poc)_median', 'pco2,venous(poc)_median', 'ph,venous(poc)_median', 'phencyclidine(pcp)screen,urine,noconf._median', 'phosphorus_median', 'platelets_median', 'po2(poc)_median', 'po2,venous(poc)_median', 'pocbun_median', 'poccreatinine_median', 'pocglucose_median', 'pochematocrit_median', 'pocionizedcalcium_median', 'pocph_median', 'pocpotassium_median', 'pocsodium_median', 'poctroponini._median', 'potassium_median', 'proteintotal_median', 'prothrombintime_median', 'ptt_median', 'rbc_median', 'rbc/hpf_median', 'rdw_median', 'sodium_median', 'troponini(poc)_median', 'troponint_median', 'tsh_median', 'wbc_median', 'wbc/hpf_median', 'bloodua_last', 'glucoseua_last', 'ketonesua_last', 'leukocytesua_last', 'nitriteua_last', 'pregtestur_last', 'proteinua_last', 'bloodculture,routine_last', 'urineculture,routine_last','bloodua_npos', 'glucoseua_npos', 'ketonesua_npos', 'leukocytesua_npos', 'nitriteua_npos', 'pregtestur_npos', 'proteinua_npos', 'bloodculture,routine_npos', 'urineculture,routine_npos', 'bloodua_count', 'glucoseua_count', 'ketonesua_count', 'leukocytesua_count', 'nitriteua_count', 'pregtestur_count', 'proteinua_count', 
'bloodculture,routine_count', 'urineculture,routine_count', 'triage_vital_hr', 'triage_vital_sbp', 'triage_vital_dbp', 'triage_vital_rr', 'triage_vital_o2', 'triage_vital_o2_device', 'triage_vital_temp', 'pulse_last', 'resp_last', 'spo2_last', 'temp_last', 'sbp_last', 'dbp_last', 'o2_device_last', 'pulse_min', 'resp_min', 'spo2_min', 'temp_min', 'sbp_min', 'dbp_min', 'o2_device_min', 'pulse_max', 'resp_max', 'spo2_max', 'temp_max', 'sbp_max', 'dbp_max', 'o2_device_max', 'pulse_median', 'resp_median', 'spo2_median', 'temp_median', 'sbp_median', 'dbp_median', 'o2_device_median', 'cxr_count', 'echo_count', 'ekg_count', 'headct_count', 'mri_count', 'otherct_count', 'otherimg_count', 'otherus_count', 'otherxr_count')



In [6]:
import re

def make_pretty(colname):
    """Normalize a column name to lowercase letters, digits, and single underscores."""
    newname = re.sub(r'[^a-zA-Z0-9]', '_', colname)  # replace every non-alphanumeric character
    newname = re.sub(r'__+', '_', newname)           # collapse runs of underscores
    newname = newname.strip('_')                     # trim leading and trailing underscores
    return newname.lower()

pretty_cols = [make_pretty(c) for c in df_patient.columns]
pretty_cols2 = [make_pretty(c) for c in df_maladie.columns]
pretty_cols3 = [make_pretty(c) for c in df_motif_admission.columns]
pretty_cols4 = [make_pretty(c) for c in df_medicament.columns]

# Rename all columns in one step (.toDF method)
df_patient = df_patient.toDF(*pretty_cols)
df_maladie = df_maladie.toDF(*pretty_cols2)
df_motif_admission = df_motif_admission.toDF(*pretty_cols3)
df_medicament = df_medicament.toDF(*pretty_cols4)

display(df_patient.head(5))
display(df_maladie.head(5))
display(df_motif_admission.head(5))
display(df_medicament.head(5))
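
Because `make_pretty` maps every non-alphanumeric character to `_`, two distinct source names can end up with the same normalized name, which would leave a DataFrame with duplicate column names after `.toDF`. The sketch below shows one way to check for that before renaming; `find_collisions` is a hypothetical helper, and `cc_fall-65` is a made-up name added to force a clash.

```python
import re
from collections import defaultdict

def make_pretty(colname):
    # Same normalization as in the cell above
    newname = re.sub(r'[^a-zA-Z0-9]', '_', colname)
    newname = re.sub(r'__+', '_', newname)
    return newname.strip('_').lower()

def find_collisions(columns):
    """Return {normalized_name: [original names]} for names that clash after normalization."""
    seen = defaultdict(list)
    for c in columns:
        seen[make_pretty(c)].append(c)
    return {new: olds for new, olds in seen.items() if len(olds) > 1}

print(make_pretty('cc_fall>65'))
print(find_collisions(['cc_fall>65', 'cc_fall-65', 'cc_fever']))
```

An empty dictionary from `find_collisions(df.columns)` means the rename is safe for that DataFrame.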


In [7]:
from pyspark.sql.functions import col, upper, sum as spark_sum


def drop_sparse_rows(df, min_ratio=0.2):
    """Drop rows with fewer non-null values than min_ratio of the column count."""
    thresh = int(min_ratio * len(df.columns))
    return df.dropna(thresh=thresh)


def drop_empty_columns(df):
    """Drop columns that contain only nulls."""
    # Count the non-null values per column in Spark,
    # then convert ONLY the one-row summary to Pandas
    summary = df.select([
        spark_sum(col(c).isNotNull().cast("int")).alias(c)
        for c in df.columns
    ]).toPandas()
    empty_cols = [c for c in summary.columns if summary.iloc[0][c] == 0]
    return df.drop(*empty_cols) if empty_cols else df


# 1. Remove duplicate rows
df_patient = df_patient.dropDuplicates()
df_maladie = df_maladie.dropDuplicates()
df_motif_admission = df_motif_admission.dropDuplicates()
df_medicament = df_medicament.dropDuplicates()

# 2. Remove rows that are mostly empty (threshold computed per DataFrame)
df_patient = drop_sparse_rows(df_patient)
df_maladie = drop_sparse_rows(df_maladie)
df_motif_admission = drop_sparse_rows(df_motif_admission)
df_medicament = drop_sparse_rows(df_medicament)

# 3. Drop rows where a key column is missing
df_patient = df_patient.dropna(subset=['age', 'gender'])

# 4. Generic imputation (a no-op after step 3, kept as a safeguard)
df_patient = df_patient.na.fill({'gender': 'UNKNOWN', 'age': -1})

# 5. Uppercase the gender
df_patient = df_patient.withColumn('gender', upper(col('gender')))

# 6. Cast the main columns
df_patient = df_patient.withColumn("age", col("age").cast("int"))

# 7. Remove empty columns, evaluated separately for each DataFrame
df_patient = drop_empty_columns(df_patient)
df_maladie = drop_empty_columns(df_maladie)
df_motif_admission = drop_empty_columns(df_motif_admission)
df_medicament = drop_empty_columns(df_medicament)

# Display a preview
display(df_patient.head(5))
display(df_maladie.head(5))
display(df_motif_admission.head(5))
display(df_medicament.head(5))
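
`dropna(thresh=n)` keeps a row only if it has at least `n` non-null values, so a threshold derived from a column count is only meaningful for the DataFrame it was computed from. The plain-Python sketch below mirrors that rule; the helper names are illustrative and the two rows are made up.

```python
def min_non_null(n_columns, ratio=0.2):
    """Threshold as computed in the cell above: int(ratio * number of columns)."""
    return int(ratio * n_columns)

def keep_row(row, thresh):
    """Mirror dropna(thresh=...): keep the row if it has at least `thresh` non-null values."""
    return sum(v is not None for v in row) >= thresh

# df_patient is selected with 21 columns, so its threshold is int(0.2 * 21) = 4
thresh_patient = min_non_null(21)
print(thresh_patient)

sparse_row = [1, None, None] + [None] * 18       # 1 non-null value out of 21
usable_row = [2, "x", 34, "Male"] + [None] * 17  # 4 non-null values out of 21
print(keep_row(sparse_row, thresh_patient), keep_row(usable_row, thresh_patient))
```

A wider DataFrame gets a higher threshold from the same ratio, which is why the value should be recomputed for each one rather than shared.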


In [8]:
# Write the bronze data (overwrites any existing Delta files and their schema)
df_patient.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("Files/bronze/patient")
df_maladie.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("Files/bronze/maladie")
df_motif_admission.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("Files/bronze/motif_admission")
df_medicament.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("Files/bronze/medicament")


In [9]:
# Read the disease category reference file (semicolon-separated).
# No header option is set, so Spark names the columns _c0, _c1, ...
df_icd = spark.read.format("csv").load("Files/raw/icd_code.csv/", sep=";")


display(df_icd)

df_icd.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("Files/bronze/icd_code")


In [10]:
# Read the medication category reference file (semicolon-separated).
# No header option is set, so Spark names the columns _c0, _c1, ...
meds_code = spark.read.format("csv").load("Files/raw/meds_cat.csv/", sep=";")


display(meds_code)

meds_code.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("Files/bronze/meds_code")


In [11]:
# Read the admission reason category reference file (semicolon-separated).
# No header option is set, so Spark names the columns _c0, _c1, ...
motifs_code = spark.read.format("csv").load("Files/raw/motif_categorie.csv/", sep=";")


display(motifs_code)

motifs_code.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("Files/bronze/motifs_code")
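
The three reference files are ingested with the same read and write calls, so the steps could be driven by a small mapping instead of three near-identical cells. This is only a sketch: `REFERENCE_TABLES`, `ingestion_plan`, and `ingest_reference_tables` are hypothetical names, and the paths are written without the trailing slash used in the cells above.

```python
# Mapping taken from the three cells above: raw CSV name -> bronze folder name
REFERENCE_TABLES = {
    "icd_code.csv": "icd_code",
    "meds_cat.csv": "meds_code",
    "motif_categorie.csv": "motifs_code",
}

def ingestion_plan(tables, raw_dir="Files/raw", bronze_dir="Files/bronze"):
    """Return (source_path, target_path) pairs for each reference table."""
    return [(f"{raw_dir}/{src}", f"{bronze_dir}/{dst}") for src, dst in tables.items()]

def ingest_reference_tables(spark, tables=REFERENCE_TABLES):
    """Read each semicolon-separated CSV and overwrite the matching Delta folder."""
    for source, target in ingestion_plan(tables):
        df = spark.read.format("csv").load(source, sep=";")
        (df.write.format("delta")
           .mode("overwrite")
           .option("overwriteSchema", "true")
           .save(target))

# Print the plan without touching Spark
for source, target in ingestion_plan(REFERENCE_TABLES):
    print(source, "->", target)
```

In the notebook, `ingest_reference_tables(spark)` would then replace the three cells; it is defined here but not called, since it needs a live Spark session.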
