# 02: Clean Patient Data

Rather than running everything in one big notebook, it's best to split each process into its own notebook. That way, if something goes wrong in our workflow, we can see which step failed and correct it easily.

This notebook is therefore the first item that we will include in our data pipeline.

We take the cleaning code from the exploration notebook and add it here as a step in the pipeline.


The first step is to run the config notebook, `00_config`. You can see this notebook in the list of files.

Before running it, we need to copy the ABFS path again, open the config notebook, paste the path in, and come back here.


In [None]:
%run 00_config

In [None]:
config

**Using this file means that if we ever change any values, such as the lakehouse name, we only need to change one file, rather than all files.**
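
For orientation, here is a minimal sketch of what `00_config` could contain. The two keys are the ones this notebook reads (`lakehouse_name` and `lakehouse_path`); the values are hypothetical placeholders, so substitute your own lakehouse name and the ABFS path you copied.

```python
# Hypothetical contents of 00_config; both values are placeholders.
config = {
    # Name used to qualify tables in SQL, e.g. <lakehouse_name>.personal_df
    "lakehouse_name": "my_lakehouse",
    # ABFS path that tables are saved under (paste your copied path here)
    "lakehouse_path": "abfss://<paste-your-ABFS-path-here>",
}
```

Because `%run 00_config` executes that notebook in the current session, the `config` variable is available to every cell that follows.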

We now just run the script that we created previously and write the result back to the lakehouse.

Note the use of `f"""` in our script so that we can insert variables via `{...}`. 
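
As a small standalone illustration of that interpolation, the sketch below builds a query string without touching Spark. The config value is a hypothetical placeholder. It also shows why the phone-number pattern in our script uses four backslashes: in a regular (non-raw) Python string, four backslashes become two in the resulting text, and with Spark's default settings the SQL parser should then unescape those once more before the regex is applied.

```python
# Hypothetical config value, for illustration only
config = {"lakehouse_name": "my_lakehouse"}

# {...} is replaced by the value of the expression inside the braces
query = f"""
SELECT patient_id
FROM {config['lakehouse_name']}.personal_df
"""
print(query)

# Four backslashes in the source become two in the resulting string
pattern = f"""'\\\\+44'"""
print(pattern)  # prints '\\+44'
```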

In [None]:
# Read the data from the lakehouse
df = spark.sql(f"""
SELECT
    patient_id,
    NULLIF(UPPER(REGEXP_EXTRACT(name, '^(Mrs|Mr|Ms|Dr|Prof|Rev|Sir|Madam)', 1)), '') AS title,
    TRIM(UPPER(REGEXP_REPLACE(name, '^(Mrs|Mr|Ms|Dr|Prof|Rev|Sir|Madam)', ''))) AS name,
    TO_DATE(date_of_birth) AS date_of_birth,
    NULLIF(UPPER(REGEXP_REPLACE(TRIM(CONCAT_WS(',', SLICE(SPLIT(address, ','), 1, SIZE(SPLIT(address, ',')) - 1))), '[^a-zA-Z0-9]', ' ')), '') AS address,
    NULLIF(TRIM(REGEXP_REPLACE(ELEMENT_AT(SPLIT(address, ','), -1), ' ', '')), '') AS postcode,
    NULLIF(REGEXP_REPLACE(phone_number, '(\\\\+44\\\\(0\\\\)|\\\\+44|\\\\)|\\\\()| ', ''), 'N/A') AS phone_number,
    is_public_patient
FROM 
    {config['lakehouse_name']}.personal_df
WHERE
    date_of_birth <= current_date()
""")

NOTE: We don't need to specify the path here because it's in our config, and we don't need to convert the result to a Spark DataFrame because `spark.sql` already returns one.
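
To make the string expressions in the query above easier to follow, here is a pure-Python approximation of the title, address, postcode and phone number logic. It is only a sketch: `clean_record` is a hypothetical helper, the sample values are made up, and Python's `re` module stands in for Spark's regex functions, which may differ in edge cases.

```python
import re

TITLES = r"^(Mrs|Mr|Ms|Dr|Prof|Rev|Sir|Madam)"

def clean_record(name, address, phone_number):
    """Approximate the SQL cleaning expressions for a single record."""
    # Title: the leading honorific in upper case, or None if there is none
    match = re.search(TITLES, name)
    title = match.group(1).upper() if match else None

    # Address: everything before the last comma, with non-alphanumerics
    # replaced by spaces; empty results become None (NULL in SQL)
    parts = address.split(",")
    street = ",".join(parts[:-1]).strip()
    street = re.sub(r"[^a-zA-Z0-9]", " ", street).upper() or None

    # Postcode: the last comma-separated element with spaces removed
    postcode = parts[-1].replace(" ", "").strip() or None

    # Phone: drop "+44(0)", "+44", brackets and spaces; "N/A" becomes None
    phone = re.sub(r"(\+44\(0\)|\+44|\)|\()| ", "", phone_number)
    phone = None if phone == "N/A" else phone

    return {
        "title": title,
        "address": street,
        "postcode": postcode,
        "phone_number": phone,
    }

# Made-up sample record
sample = clean_record("Mr John Smith", "12 High Street, LS1 4AB", "+44(0)7700 900123")
print(sample)
# {'title': 'MR', 'address': '12 HIGH STREET', 'postcode': 'LS14AB', 'phone_number': '7700900123'}
```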

In [None]:
# Save the data to the lakehouse
(
df
    .write
    .mode('overwrite')
    .format('delta')
    .option('overwriteSchema', 'true')
    .save(f"{config['lakehouse_path']}/personal_df_clean")
)
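
If you want to confirm the write succeeded, one option is to read the table back from the same location. The sketch below assumes the `spark` session and `config` dictionary from this notebook; `table_path` and `read_back` are hypothetical helper names introduced only for this illustration.

```python
def table_path(config, table_name):
    """Build the storage path for a table from the config (hypothetical helper)."""
    return f"{config['lakehouse_path'].rstrip('/')}/{table_name}"

def read_back(spark, config, table_name):
    """Load a saved Delta table as a Spark DataFrame (hypothetical helper)."""
    return spark.read.format("delta").load(table_path(config, table_name))

# Usage inside the notebook, where `spark` and `config` already exist:
# read_back(spark, config, "personal_df_clean").show(5)
```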