# 01 (02): Explore Data

In this notebook, we will explore the data to see which columns require cleaning and where new columns could usefully be created.

First, we import some packages and display the full datasets. To load a table, add the lakehouse in the left pane, then drag and drop the table onto the notebook.

In [None]:
import pandas as pd
import numpy as np
import re

In [None]:
# Read the data from the lakehouse
df = spark.sql("SELECT * FROM coffee_lakehouse.personal_df")
display(df)

Viewing the data, some initial issues stand out:

* The `name` column contains titles.
* The `phone_number` column contains +44 prefixes and some area codes.

We will tackle these first.

In [None]:
# Read the data from the lakehouse
df_app = spark.sql("SELECT * FROM coffee_lakehouse.appointment_df")
display(df_app)

The `doctor_seen` column also contains titles, which we can extract.

## Patient Data Exploration

We will start by exploring and cleaning the patient data.

There are 3 primary ways that we can work with our data inside notebooks. These are:

* Pandas
    * Best for: Small data, quick analysis.
    * Pros: Easy, fast for small datasets and uses standard Python.
    * Cons: Doesn't scale well and not good for big data.
* PySpark
    * Best for: Big data, distributed computing.
    * Pros: Scales massively, fault-tolerant.
    * Cons: More complex setup and usage. 
* Spark SQL
    * Best for: SQL-style queries on big data.
    * Pros: Allows usage of SQL code while using the power of Spark.
    * Cons: Less flexible than PySpark.

In addition, as we'll see below, the three can be combined to achieve great things!


### Pandas


To start us off, we will explore and transform the data using Pandas.

First, because the dataframe by default is a Spark dataframe, we must convert it to Pandas.

In [None]:
df_pandas = df.toPandas()

We start with the two issues noted above: `name` contains titles, and `phone_number` contains +44 and some area codes.

For the former, we extract the title into a new column which we can use later. For the latter, we simply remove them.

In [None]:
# Titles we recognise; \b stops "Dr" from matching the start of a name such as "Drew"
title_pattern = r'^(Mrs|Mr|Ms|Dr|Prof|Rev|Sir|Madam)\b'

# Extract titles (expand=False returns a Series rather than a one-column DataFrame)
df_pandas['title'] = df_pandas['name'].str.extract(title_pattern, expand=False)

# Remove the title, plus any trailing full stop and whitespace, from the original name
df_pandas['name'] = df_pandas['name'].str.replace(title_pattern + r'\.?\s*', '', regex=True)

# Remove +44 and brackets from phone numbers
df_pandas['phone_number'] = df_pandas['phone_number'].str.replace(r'(\+44\(0\)|\+44|\)|\()', '', regex=True)

# Remove spaces from numbers
df_pandas['phone_number'] = df_pandas['phone_number'].str.replace(' ', '', regex=False)

We then do a quick check to make sure these changes took effect.

In [None]:
df_pandas[~df_pandas['title'].isna()]
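
The two cleaning rules can also be tried outside the dataframe. The following is a minimal sketch using only the standard `re` module; the helper names `split_title` and `clean_phone` and the sample values are hypothetical and are not taken from the lakehouse table. The title pattern here includes a word boundary so that names such as "Drew" are left alone.

```python
import re

# Title alternation from the notebook, followed by a word boundary,
# an optional full stop and any whitespace after the title.
TITLE_PATTERN = re.compile(r'^(Mrs|Mr|Ms|Dr|Prof|Rev|Sir|Madam)\b\.?\s*')

# Phone pattern from the notebook, with the space removal folded in.
PHONE_PATTERN = re.compile(r'\+44\(0\)|\+44|\)|\(| ')


def split_title(name):
    """Return (title, remaining_name); title is None when no title is found."""
    match = TITLE_PATTERN.match(name)
    if match is None:
        return None, name
    return match.group(1), name[match.end():]


def clean_phone(phone):
    """Strip the +44 prefix, brackets and spaces from a phone number."""
    return PHONE_PATTERN.sub('', phone)


# Hypothetical sample values for illustration only.
print(split_title('Dr. Jane Smith'))      # ('Dr', 'Jane Smith')
print(split_title('Drew Jones'))          # (None, 'Drew Jones')
print(clean_phone('+44(0)20 7946 0958'))  # 2079460958
```

Checking a handful of hand-written cases like this is a quick way to confirm a pattern before applying it to the whole column.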

We now loop through all of the columns, counting the occurrences of each value to identify other anomalies.

Ordering by count, we see:

* Address contains some blank values ('')
* Phone number contains some 'N/A' values.

Ordering by the column itself, we also see:

* There are some DOBs that are in the future

In [None]:
# Iterate over the Pandas dataframe's own columns, including the new `title` column
for col_name in df_pandas.columns:
    print(col_name)
    display(df_pandas[col_name].value_counts().reset_index())

To correct all blank and NULL values, we use the following:

In [None]:
import numpy as np

# Replace placeholder values with NaN
placeholders = ['N/A', '', 'None', 'none', 'null', 'NULL']
df_pandas.replace(placeholders, np.nan, inplace=True)

In [None]:
df_pandas[df_pandas['address'].isna()]

Where the date of birth is in the future, we simply remove these cases using the below:

In [None]:
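# Parse dates of birth so they can be compared with a timestamp (unparseable values become NaT)
df_pandas['date_of_birth'] = pd.to_datetime(df_pandas['date_of_birth'], errors='coerce')

# Remove unrealistic dates of birth (e.g., future dates); rows with a missing DOB are dropped too
today = pd.Timestamp.today()
df_pandas = df_pandas[df_pandas['date_of_birth'] <= today].copy()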

In [None]:
# Should return no rows
df_pandas[df_pandas['date_of_birth'] > today]

Finally, for cosmetic purposes, we will:

* Extract the postcode from the address field, remove it from the address and strip spaces from the postcode
* Uppercase all string fields
* Remove special characters from all string fields

In [None]:
# Extract postcode
postcode_pattern = r'(\b[A-Z]{1,2}[0-9][0-9A-Z]?\s?[0-9][A-Z]{2}\b)'

df_pandas['postcode'] = df_pandas['address'].str.extract(postcode_pattern, flags=re.IGNORECASE, expand=False)

# Remove the postcode from the address, using the same case-insensitive match
df_pandas['address'] = df_pandas['address'].str.replace(postcode_pattern, '', flags=re.IGNORECASE, regex=True)

In [None]:
df_pandas['postcode'] = df_pandas['postcode'].str.replace(' ', '')
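
To see what the postcode pattern does and does not match, here is a small sketch using the standard `re` module. The helper `split_postcode` and the sample addresses are hypothetical. Stripping the trailing comma from the remaining address is an extra step added here for readability and is not part of the notebook cells above.

```python
import re

# Postcode pattern from the notebook, compiled case-insensitively.
POSTCODE_PATTERN = re.compile(
    r'\b[A-Z]{1,2}[0-9][0-9A-Z]?\s?[0-9][A-Z]{2}\b', re.IGNORECASE
)


def split_postcode(address):
    """Return (address_without_postcode, postcode_without_spaces or None)."""
    match = POSTCODE_PATTERN.search(address)
    if match is None:
        return address, None
    postcode = match.group(0).replace(' ', '')
    # Extra tidy-up (an assumption, not in the notebook): drop leftover commas and spaces.
    remainder = (address[:match.start()] + address[match.end():]).strip(' ,')
    return remainder, postcode


# Hypothetical sample addresses for illustration only.
print(split_postcode('10 Example Road, Leeds, AB1 2CD'))  # ('10 Example Road, Leeds', 'AB12CD')
print(split_postcode('Flat 2, Example House'))            # ('Flat 2, Example House', None)
```

Addresses without a recognisable postcode come back with `None`, which mirrors the missing values that `str.extract` produces for non-matching rows.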

In [None]:
# Apply uppercase and remove special characters from all string columns
for col in df_pandas.select_dtypes(include='object').columns:
    df_pandas[col] = df_pandas[col].apply(
        lambda x: re.sub(r'[^A-Z0-9 ]', '', x.upper()) if isinstance(x, str) else x
    )
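
The order of operations in the lambda matters: the text is uppercased first, and only then filtered to `A-Z`, `0-9` and spaces. A minimal standalone sketch of the same rule follows; the function name `normalise_text` and the sample values are hypothetical.

```python
import re


def normalise_text(value):
    """Uppercase strings and keep only A-Z, 0-9 and spaces; pass other values through."""
    if isinstance(value, str):
        # Uppercase first: filtering before uppercasing would delete lowercase letters.
        return re.sub(r'[^A-Z0-9 ]', '', value.upper())
    return value


# Hypothetical sample values for illustration only.
print(normalise_text("12 St. Mary's Road, Leeds"))  # 12 ST MARYS ROAD LEEDS
print(normalise_text(None))                         # None
print(normalise_text('1990-01-01'))                 # 19900101
```

The last case is worth noting: the rule also removes hyphens and slashes, so a date held as a string would lose its separators. Whether any column in the table is affected depends on its dtype, which is not shown here.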

In [None]:
df_pandas

### PySpark

Doing the same operations in PySpark requires knowledge of its syntax, which can be tricky to get to grips with.

As discussed above, the payoff for this initial learning curve is that the work runs as Spark operations, which split jobs across multiple compute nodes. On large datasets this is typically much quicker than using Pandas.

The following code repeats the operations above using PySpark.

In [None]:
from pyspark.sql.functions import (
    col, regexp_extract, regexp_replace, upper, trim, when,
    to_date, current_date, lit
)

# Start with the base DataFrame
df = spark.table("coffee_lakehouse.personal_df")

# Extract title from name (\b stops "Dr" from matching the start of a name such as "Drew")
df = df.withColumn(
    "title",
    regexp_extract(col("name"), r"^(Mrs|Mr|Ms|Dr|Prof|Rev|Sir|Madam)\b", 1)
)

# Remove title, plus any trailing full stop and whitespace, from name
df = df.withColumn(
    "name",
    regexp_replace(col("name"), r"^(Mrs|Mr|Ms|Dr|Prof|Rev|Sir|Madam)\b\.?\s*", "")
)

# Clean phone number: remove +44, brackets, and spaces
df = df.withColumn(
    "phone_number",
    regexp_replace(col("phone_number"), r"(\+44\(0\)|\+44|\)|\(| )", "")
)

# Replace placeholder values with nulls
for placeholder in ['N/A', '', 'None', 'none', 'null', 'NULL']:
    df = df.replace(placeholder, None)

# Remove future dates of birth
df = df.filter(col("date_of_birth") <= current_date())

# Extract postcode from address ((?i) makes the match case-insensitive)
postcode_pattern = r"(?i)(\b[A-Z]{1,2}[0-9][0-9A-Z]?\s?[0-9][A-Z]{2}\b)"
df = df.withColumn(
    "postcode",
    regexp_extract(col("address"), postcode_pattern, 1)
)

# Remove postcode from address
df = df.withColumn(
    "address",
    regexp_replace(col("address"), postcode_pattern, "")
)

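# Remove spaces from postcode; regexp_extract returns '' when nothing matched, so map that to null
df = df.withColumn(
    "postcode",
    when(col("postcode") == "", lit(None)).otherwise(regexp_replace(col("postcode"), " ", ""))
)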

# Apply uppercase and remove special characters from all string columns
string_cols = [field.name for field in df.schema.fields if field.dataType.simpleString() == 'string']
for col_name in string_cols:
    df = df.withColumn(
        col_name,
        when(
            col(col_name).isNotNull(),
            regexp_replace(upper(col(col_name)), r"[^A-Z0-9 ]", "")
        ).otherwise(None)
    )

display(df)

### Spark SQL

Alternatively, we can use Spark SQL to perform these transformations. This way, we get to use SQL code, while retaining the power of Spark. 

To explore the data further, we can combine Pandas and Spark SQL logic here, using Python loops to run multiple SQL queries at once to explore all column counts.
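
As a rough sketch of that loop, the helper below builds one value-count query per column. The function name `build_count_queries` and the `value` / `row_count` aliases are hypothetical; the table and column names are the ones used in this notebook. The `spark.sql` call is left as a comment because `spark` and `display` only exist inside the notebook session.

```python
def build_count_queries(table, columns):
    """Return {column: SQL text} with one value-count query per column."""
    return {
        column: (
            f"SELECT `{column}` AS value, COUNT(*) AS row_count "
            f"FROM {table} GROUP BY `{column}` ORDER BY row_count DESC"
        )
        for column in columns
    }


# Column names as they appear in the patient table queries in this notebook.
columns = ["name", "date_of_birth", "address", "phone_number", "is_public_patient"]
queries = build_count_queries("coffee_lakehouse.personal_df", columns)

for column, query in queries.items():
    print(column, "->", query)
    # Inside the notebook, where `spark` and `display` are defined:
    # display(spark.sql(query))
```

Keeping the query text in a dictionary makes it easy to inspect or rerun a single column's counts without repeating the whole loop.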

In [None]:
# Read the data from the lakehouse
df_spark = spark.sql("""
SELECT
    patient_id,
    NULLIF(UPPER(REGEXP_EXTRACT(name, '^(Mrs|Mr|Ms|Dr|Prof|Rev|Sir|Madam)\\\\b', 1)), '') AS title,
    UPPER(TRIM(REGEXP_REPLACE(name, '^(Mrs|Mr|Ms|Dr|Prof|Rev|Sir|Madam)\\\\b\\\\.?', ''))) AS name,
    TO_DATE(date_of_birth) AS date_of_birth,
    NULLIF(UPPER(TRIM(CONCAT_WS(',', SLICE(SPLIT(address, ','), 1, SIZE(SPLIT(address, ',')) - 1)))), '') AS address,
    NULLIF(TRIM(REGEXP_REPLACE(ELEMENT_AT(SPLIT(address, ','), -1), ' ', '')), '') AS postcode,
    NULLIF(REGEXP_REPLACE(phone_number, '(\\\\+44\\\\(0\\\\)|\\\\+44|\\\\)|\\\\()| ', ''), 'N/A') AS phone_number,
    is_public_patient
FROM 
    coffee_lakehouse.personal_df
WHERE
    date_of_birth <= current_date()
""")
display(df_spark)

## Appointments Data Exploration

Turning our attention to the appointments data, we only need to clean the `doctor_seen` column as discussed above.

In [None]:
app_df = spark.sql("SELECT * FROM coffee_lakehouse.appointment_df")
display(app_df)

Again, we can do this via Spark SQL like this:

In [None]:
# Read the data from the lakehouse
df_app = spark.sql("""
SELECT
    patient_id,
    appointment_date,
    NULLIF(UPPER(REGEXP_EXTRACT(doctor_seen, '^(Mrs|Mr|Ms|Dr|Prof|Rev|Sir|Madam)\\\\b', 1)), '') AS doctor_title,
    UPPER(TRIM(REGEXP_REPLACE(doctor_seen, '^(Mrs|Mr|Ms|Dr|Prof|Rev|Sir|Madam)\\\\b\\\\.?', ''))) AS doctor_seen
FROM 
    coffee_lakehouse.appointment_df
""")
display(df_app)

## Write to Lakehouse

Finally, we can write our results back to the lakehouse to save them. 

**First, we must paste in the ABFS path that we used before.**

In [None]:
# Specify lakehouse path
abfs_path = 'abfss://490a35a8-ffa1-4c26-8ad2-f394ba2aaefd@onelake.dfs.fabric.microsoft.com/ec5cde9a-5530-4099-ae29-d318b1970f64/Tables'

Then we use the following to write the datasets:

In [None]:
(
df_spark
    .write
    .mode('overwrite')
    .format('delta')
    .option('overwriteSchema', 'true')
    .save(f"{abfs_path}/personal_df_clean")
)

In [None]:
(
df_app
    .write
    .mode('overwrite')
    .format('delta')
    .option('overwriteSchema', 'true')
    .save(f"{abfs_path}/appointment_df_clean")
)