<div id="colab_button">
  <h1>Data exploration of diabetes hospital admissions: Part I </h1>
  <a target="_blank" href="https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.7/docs/docs/how-to-guides/diabetes_p1.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>
______________________________________________________

Despite major technological breakthroughs in cybersecurity and privacy in recent years, secure off-premises data science collaboration has remained out of reach. This is a major problem for the health sector, which has so much to gain from the power of data but also so much at risk when it comes to patients' highly sensitive medical records.

We are on a mission to make remote data science collaboration safe for the health sector. Using BastionLab, data owners can set strict access policies on datasets for collaborators, allowing them to run privacy-friendly queries and train and deploy ML models on datasets whilst blocking access to raw data.

In this how-to guide, we will explore a dataset of diabetic patients admitted to hospital in the US over a ten-year period. Diabetes is a disease that affects over 10% of the US population and can lead to serious health complications. The dataset contains 50 columns of data, including readmission to hospital, changes to medication and primary, secondary and tertiary patient diagnoses.

In part I of this two-part data exploration, we will see how the data owner can upload a dataset to BastionLab and how a data scientist can then connect to BastionLab and **clean the dataset**.

But before we can do that, we first need to get everything set up!

## Pre-requisites
___________________________________________

### Installation and dataset

In order to run this notebook, we need to:
- Ensure we have [Python 3.7](https://www.python.org/downloads/) (or greater) and [Python Pip](https://pypi.org/project/pip/) installed
- Install [BastionLab](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/) and the [BastionLab server](https://pypi.org/project/bastionlab-server/0.3.7/) pip packages
- [Download the dataset](https://drive.google.com/file/d/1NPQoKKG3CdvXTNkHVNYhRQZ8GGiPNlvI/view?usp=share_link) we will be using in this notebook.

You can install the BastionLab pip packages and download the dataset by running the following code block.

>To find out about other ways you can install and run BastionLab, see our [Installation page](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/).

In [None]:
# installing BastionLab client & server packages
!pip install bastionlab
!pip install bastionlab_server

# downloading the dataset using gdown, a Google Drive download tool
!pip install gdown
!pip install --upgrade --no-cache-dir gdown
!gdown --id "1NPQoKKG3CdvXTNkHVNYhRQZ8GGiPNlvI"
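Once the install cell has finished, a quick check like the one below can confirm that the environment is ready. This is only a minimal sketch using the standard library: `missing_packages` is a hypothetical helper written for this guide, not part of BastionLab, and the module names are assumed to match the imports used in the cells that follow.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# the guide requires Python 3.7 or greater
python_ok = sys.version_info >= (3, 7)

# top-level module names imported by the cells in this guide
missing = missing_packages(["bastionlab", "bastionlab_server", "polars"])

print("Python version OK:", python_ok)
print("Missing packages:", missing or "none")
```

If any package is reported as missing, re-running the install cell above is the first thing to try.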

The dataset we are using for this how-to guide is based on the Diabetes 130-US hospitals for years 1999-2008 dataset. It contains 10 years of data on diabetes admissions from 130 US hospitals. It includes over 50 features representing patient and hospital outcomes.

>For more detailed information on the dataset, you can check out the description and full dataset by following this [link](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008).

However, this dataset had already been pre-processed before publication which stopped us from showing you some key data cleaning steps. We therefore made a few modifications to replace some pre-grouped data columns with randomly populated data. You can check out exactly how we did this using Polars [here](https://colab.research.google.com/drive/174EJvK8u8mGGWb6ypLH9SKaeRnX-pEou?usp=share_link). 

## Data owner's POV
___________________________________________

### Launching the server

Let's start by putting ourselves in the shoes of the data owner.

But before we can do anything more, the BastionLab server must be running.

In production we recommend this is done using our Docker image, but for testing purposes you can use our `bastionlab_server` package, which removes the need for user authentication.

In [2]:
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()

BastionLab server (version 0.3.7) already installed
Libtorch (version 1.13.1) already installed
TLS certificates already generated
Bastionlab server is now running on port 50056


>*For more details on how you can set up the server using our Docker image, check out our [Installation Tutorial](../getting-started/installation.md).*

### Connecting to the server
Next, we will connect to the server in order to be able to upload the dataset.

In [3]:
# connecting to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client

[2023-02-17T16:18:15Z INFO  bastionlab] Authentication is disabled.
[2023-02-17T16:18:15Z INFO  bastionlab] Telemetry is enabled.
[2023-02-17T16:18:15Z INFO  bastionlab] BastionLab server listening on 0.0.0.0:50056.
[2023-02-17T16:18:15Z INFO  bastionlab] Server ready to take requests


### Creating a custom privacy policy

We can now create a [custom access policy](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/defining_policy_privacy/) for the dataset which determines how much access collaborators will get to the dataset. 

In this example, we create a policy with the following configuration:

->  `Aggregation(min_agg_size=10):` Any data extracted from the dataset should be the result of an aggregation of at least ten rows.

->  `unsafe_handling=Reject()`: Any attempted query which breaches this policy will be rejected by the server.

->  `savable=True`: The data scientist can save changes made to the dataset in BastionLab (this will create a new dataset; it will not overwrite the original dataset).


In [4]:
from bastionlab.polars.policy import Policy, Aggregation, Reject

# defining the dataset's privacy policy
policy = Policy(Aggregation(min_agg_size=10), unsafe_handling=Reject(), savable=True)

### Uploading the dataset

Now that the policy has been created, we can upload the dataset to the BastionLab server instance.

Firstly, we need to convert our CSV file into a Polars DataFrame by using the Polars `read_csv` function, supplying the path to the CSV file as a string argument.

Next, we use BastionLab's `client.polars.send_df` to upload the dataframe with our custom policy.

Finally, we save the FetchableLazyFrame using the `save` method with no arguments. We can make a note of the FetchableLazyFrame's identifier to be shared with data scientists to help them to remotely access the FetchableLazyFrame!

>Note we need to save FetchableLazyFrames to avoid them being lost when the server is stopped and restarted or crashes.

In [None]:
import polars as pl

# converting the dataset into a Polars dataframe
df = pl.read_csv("updated_diabetes_data.csv")

# uploading the dataframe and the custom privacy policy
# to BastionLab's server
rdf = client.polars.send_df(df, policy=policy)

# saving the FetchableLazyFrame
rdf.save()
# get and print out a copy of the RDF identifier string
ID = rdf.identifier
print(ID)

`send_df()` will return a FetchableLazyFrame instance, which we will work with directly from now on. 

>Note that we talk about two types of LazyFrames in BastionLab: `RemoteLazyFrames` and `FetchableLazyFrames`. 

A `RemoteLazyFrame` just means we have called some functions and not yet `collected` the results, which means the operations have not yet been run on the server-side. When we call `collect()` these operations are run server-side and the result of this is our `FetchableLazyFrame`!
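The difference between recording operations and running them can be illustrated with a toy model in plain Python. This is a sketch of the general lazy-evaluation idea only: `ToyLazyFrame` and `ToyResult` are hypothetical classes that run locally, whereas BastionLab executes the recorded operations on the server.

```python
class ToyLazyFrame:
    """Records operations instead of running them (plays the role of a RemoteLazyFrame)."""

    def __init__(self, rows, pending=()):
        self._rows = rows
        self._pending = tuple(pending)

    def filter(self, predicate):
        # nothing is computed here: the operation is only appended to the plan
        return ToyLazyFrame(self._rows, self._pending + (("filter", predicate),))

    def select(self, column):
        return ToyLazyFrame(self._rows, self._pending + (("select", column),))

    def collect(self):
        # only now is the recorded plan executed
        rows = self._rows
        for kind, arg in self._pending:
            if kind == "filter":
                rows = [row for row in rows if arg(row)]
            else:  # "select"
                rows = [{arg: row[arg]} for row in rows]
        return ToyResult(rows)


class ToyResult:
    """Holds computed rows (plays the role of a FetchableLazyFrame)."""

    def __init__(self, rows):
        self._rows = rows

    def fetch(self):
        return self._rows


# made-up rows, not taken from the real dataset
rows = [{"age": 34, "gender": "F"}, {"age": 71, "gender": "M"}]
lazy = ToyLazyFrame(rows).filter(lambda row: row["age"] > 50).select("gender")

print(len(lazy._pending))      # 2 operations recorded, none executed yet
print(lazy.collect().fetch())  # [{'gender': 'M'}]
```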

Let's finish off by testing what happens if we breach our security policy by trying to display an entire column from our dataset with the `collect().fetch()` methods. 

>*You can learn more about how to use both of those methods in [our quick tour](https://bastionlab.readthedocs.io/en/latest/docs/quick-tour/quick-tour/#running-queries).*

In [None]:
rdf.select("age").collect().fetch()

The query has been rejected by the data owner.


Instead of getting back the results of our query, we see an error message: `The query has been rejected by the data owner.`

We cannot view the output of the query because it does not aggregate at least 10 rows of data as specified in our privacy policy. It tries to print out individual rows instead!
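To make the rule concrete, here is a toy check in plain Python. It is a simplified model of the idea behind `min_agg_size`, not BastionLab's actual enforcement logic; `is_safe_output` and the row counts are made up for illustration.

```python
MIN_AGG_SIZE = 10  # same threshold as in the policy defined earlier


def is_safe_output(rows_per_output_row, min_agg_size=MIN_AGG_SIZE):
    """True if every output row summarises at least `min_agg_size` input rows."""
    return all(count >= min_agg_size for count in rows_per_output_row)


# selecting a raw column: each output row comes from exactly one input row
raw_select = [1, 1, 1, 1]

# a mean over a whole column: one output row built from many input rows
column_mean = [1000]

# a group-by where one of the groups is too small
group_by = [250, 740, 4]

print(is_safe_output(raw_select))   # False
print(is_safe_output(column_mean))  # True
print(is_safe_output(group_by))     # False
```

Under this model, the `select("age")` query above corresponds to the first case, which is why it is rejected.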

Now that the dataset has been uploaded, it's time for our data scientists to get working... 

The data owner can now close their connection to the server.

In [None]:
connection.close()

## Data scientist #1's POV
__________________________________________

### Connecting to the dataset

We'll now jump into the role of the data scientist responsible for cleaning the dataset for this data analysis project.

We first need to connect to the BastionLab server and get a FetchableLazyFrame instance of the dataset. To do this, we'll use the `get_df()` method and supply it with the identifier shared with us by the data owner.

We store our FetchableLazyFrame in the `rdf` variable which we'll be working with from here on.

In [None]:
connection = Connection("localhost")
client = connection.client

# selecting the FetchableLazyFrame(s) we'll be working with
rdf = client.polars.get_df(ID)
rdf

FetchableLazyFrame(identifier=0c7f2bcc-5afc-4a0a-b10f-24d796195045)

Let's display the dataset's columns to confirm we are connected to the correct one.

In [None]:
print(rdf.columns)

['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'time_in_hospital', 'payer_code', 'medical_specialty', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1', 'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted']


Everything is as expected! We can now start our data exploration. 

## Data cleaning
__________________________________________


### Dropping columns
You may have noticed that this dataset contains a lot of columns! This is great as it gives us a wide choice of correlations to explore. However, we will not have time to explore all of them in this analysis! We can therefore drop the columns that we won't be using, either because they are irrelevant or because they didn't lead us to the most interesting correlations for this analysis!

We can do this by using the `drop` method, providing it with a list of the names of columns to be dropped. This is a RemoteLazyFrame method which corresponds directly to the [Polars drop() function](https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/api/polars.LazyFrame.drop.html#polars.LazyFrame.drop).

In [None]:
# list of column names we wish to remove from our RemoteLazyFrame
to_drop = [
    "encounter_id",
    "patient_nbr",
    "weight",
    "discharge_disposition_id",
    "admission_source_id",
    "time_in_hospital",
    "payer_code",
    "medical_specialty",
    "num_lab_procedures",
    "num_procedures",
    "num_medications",
    "number_outpatient",
    "number_inpatient",
    "number_diagnoses",
    "diabetesMed",
]

# replace rdf with our updated RemoteLazyFrame with to_drop columns deleted
rdf = rdf.drop(to_drop)

There are now 35 columns to work with instead of 50, which will make the RemoteLazyFrame a little easier to work with!


### Checking for null values

We now want to assess how many null values we have in each column. This will help us to know if we have enough data to draw meaningful conclusions from each column and gives us the chance to fill or delete null values if relevant.

However, based on the description of the dataset shared with us by the data owner, we know that some column cells have been filled with '?' instead of being left blank.

Before we can get an accurate picture of null values, we first need to replace all these '?' values with null values. We will do this by using [Polars `.when().then().otherwise()` functions](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.when.html). 

One final hurdle is that we can only search for and replace '?' strings in columns containing strings, which have the 'Utf8' datatype. Applying the operation to any other column type produces an error. We must therefore only apply our search and replace operation to string columns!

In [None]:
# step one: getting a list of all Utf8/string columns
selects = []
for x in rdf.columns:
    if rdf.select(x).dtypes == [pl.datatypes.Utf8]:
        selects.append(x)

# step two: we replace all '?' cells in these columns with null values
rdf = rdf.with_columns(
    [
        pl.when(pl.col(x) == "?").then(None).otherwise(pl.col(x)).keep_name()
        for x in selects
    ]
)

In step two, we use the Polars `with_columns` function to add our new columns with null values instead of question marks to our RemoteLazyFrame. By using the `keep_name` function, these columns keep their original column name and therefore replace the original columns in the dataset. We save the result as `rdf`, storing the updated version of the dataset in our `rdf` variable.
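The same two steps can be tried out locally on plain Python data. The sketch below is an illustration of the logic only, not the Polars or BastionLab API: `replace_question_marks` is a hypothetical helper, and a column is assumed to be a string column when all of its non-null values are strings.

```python
def replace_question_marks(columns):
    """Return a copy of `columns` with "?" replaced by None in string columns only."""
    cleaned = {}
    for name, values in columns.items():
        # treat a column as a string column if all its non-null values are strings
        is_string_column = all(isinstance(v, str) for v in values if v is not None)
        if is_string_column:
            cleaned[name] = [None if v == "?" else v for v in values]
        else:
            # non-string columns are copied unchanged
            cleaned[name] = list(values)
    return cleaned


# made-up sample values, not rows from the real dataset
sample = {
    "race": ["Caucasian", "?", "Asian"],
    "admission_type_id": [1, 3, 2],
}
print(replace_question_marks(sample))
```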

Now that this is done, we can go ahead and calculate how many null values each column contains.

We do this by selecting all the columns and, for each one, dividing the `sum` of the values for which `is_null` returns `True` by the total `count` of values, expressed as a percentage.

In [None]:
# getting each column's percentage of null values in the RemoteLazyFrame
percent_missing = rdf.select(
    [
        pl.all().is_null().sum() * 100 / pl.all().count(),
    ]
)

We can then view the percentage of null values for each column as a two-column table by using the Polars `melt` function, which flips the query results from a single row with one column per dataset column into a table with two columns and one row per dataset column. We use the `sort` function to show the columns in order from the column with the highest percentage of null values to the lowest.

Finally, we remove any columns with no null values from our output since they are not of interest to us here.

In [None]:
# melt into a two-column table ('column name', 'null values (%)')
# and sort in descending order of null percentage
percent_missing = percent_missing.melt(
    variable_name="column name",
    value_name="null values (%)",
).sort(pl.col("null values (%)"), reverse=True)

# filter out columns with no null values and display
percent_missing.filter(pl.col("null values (%)") > 0).collect().fetch()

| column name | null values (%) |
| --- | --- |
| str | f64 |
| "max_glu_serum" | 94.746772 |
| "A1Cresult" | 83.277322 |
| "readmitted" | 53.911916 |
| "race" | 2.233555 |
| "diag_3" | 1.398306 |
| "diag_2" | 0.351787 |
| "diag_1" | 0.020636 |


There are several strategies for dealing with null values, such as deleting these rows from the dataset with the `drop_nulls` method or filling null values with the `fill_null` method. But in our case, we are just happy to have visibility over which columns include null values and to what extent, so that we can handle and analyse these columns with this in mind.
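For readers who want to see the percentage calculation and reshaping step by step, the logic can be sketched in plain Python. This is an illustration on made-up values, not the Polars implementation; `null_percentages` is a hypothetical helper and assumes every column has at least one row.

```python
def null_percentages(columns):
    """Return (column name, % of null values) pairs, highest first, zeros removed."""
    # "wide" result: one percentage per column, like the single-row select() output
    wide = {
        name: 100 * sum(v is None for v in values) / len(values)
        for name, values in columns.items()
    }
    # "melt": turn the wide mapping into (name, value) rows
    long = list(wide.items())
    # sort in descending order of percentage
    long.sort(key=lambda pair: pair[1], reverse=True)
    # keep only the columns that actually contain null values
    return [(name, pct) for name, pct in long if pct > 0]


# made-up sample values, not rows from the real dataset
sample = {
    "race": ["Caucasian", None, "Asian", "Hispanic"],
    "gender": ["Female", "Male", "Male", "Female"],
    "A1Cresult": [None, None, "7.2", None],
}
print(null_percentages(sample))  # [('A1Cresult', 75.0), ('race', 25.0)]
```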

### Grouping data: ICD-9 medical codes
Grouping data is going to be the largest and most crucial task in this data cleaning job. This is a dataset with a lot of wide-ranging numerical values which need to be grouped so that our data analysts can gain meaningful insights.

Let's start with our diagnoses columns: `diag_1`, `diag_2` and `diag_3`.

These columns contain the primary, secondary and tertiary diagnoses given to patients. These diagnoses are given using [ICD-9 medical codes](https://en.wikipedia.org/wiki/List_of_ICD-9_codes), which are three-digit codes ranging from 001 to 999, as well as E800–E999 codes and V01–V82 codes.

By grabbing all the unique values in the `diag_1` column and counting them, we can see that we have over 700 different values in this column!

In [None]:
tmp = rdf.select("diag_1").unique()
tmp.select(pl.col("diag_1").count()).collect().fetch()

diag_1
u32
717


Standard groupings of these codes have already been designed. What we want to do is replace the hundreds of unique codes we have in our diagnoses columns with these groupings!

To do this, we will again use Polars `when().then().otherwise()` functions to perform a find and replace operation. We will use `when()` to check if the codes in each cell are either E or V codes or fall within a certain numerical range.

However, these diagnoses columns are currently string columns, since the E and V codes are not entirely numerical. This is problematic since we cannot perform numerical comparisons on these cells and we cannot convert the column type to a numerical one because of these 'E' and 'V' values!

We will solve this problem in three steps:

1) We will find and replace all E codes with a "-1" value and V codes with a "-2" value.

2) We will `select()` our columns and `cast()` all values in these columns to float values.

3) We will perform the find and replace operation to group all ICD-9 codes into their associated group. There are 17 numerical groups, plus one for E codes and one for V codes.

In [None]:
# iterate over the three diagnoses columns
for col in ["diag_1", "diag_2", "diag_3"]:
    # step one: replace troublesome E and V codes with temporary -1 and -2 codes
    rdf = rdf.with_columns(
        [
            pl.when(
                pl.col(col).str.starts_with("E")
            )  # use Polars str.starts_with method to identify E codes
            .then("-1")
            .when(pl.col(col).str.starts_with("V"))
            .then("-2")
            .otherwise(pl.col(col))
            .keep_name()
        ]
    )

    # step two: cast all values in column to float values
    rdf = rdf.with_columns([pl.col(col).cast(pl.Float64)])

    # step three: replace all codes with their corresponding group
    rdf = rdf.with_columns(
        [
            pl.when(pl.col(col) >= 800)
            .then("injury and poisoning")
            .when(pl.col(col) >= 780)
            .then("symptoms, signs & ill-defined")
            .when(pl.col(col) >= 760)
            .then("perinatal")
            .when(pl.col(col) >= 740)
            .then("congenital anomalies")
            .when(pl.col(col) >= 710)
            .then("musculoskeletal & connective tissue")
            .when(pl.col(col) >= 680)
            .then("skin")
            .when(pl.col(col) >= 630)
            .then("pregnancy, childbirth and puerperium")
            .when(pl.col(col) >= 580)
            .then("genitourinary")
            .when(pl.col(col) >= 520)
            .then("digestive")
            .when(pl.col(col) >= 460)
            .then("respiratory")
            .when(pl.col(col) >= 390)
            .then("circulatory")
            .when(pl.col(col) >= 320)
            .then("nervous system and sense organs")
            .when(pl.col(col) >= 290)
            .then("mental disorders")
            .when(pl.col(col) >= 280)
            .then("blood and blood-forming organs")
            .when(pl.col(col) >= 240)
            .then("endocrine, nutritional, metabolic and immunity")
            .when(pl.col(col) >= 140)
            .then("neoplasms")
            .when(pl.col(col) >= 1)
            .then("infectious and parasitic")
            .when(pl.col(col) == -1)
            .then("E code (injury)")
            .when(pl.col(col) == -2)
            .then("V code (other)")
            .otherwise(
                pl.col(col)
            )  # otherwise (null values) keep original value from the column
            .alias(
                col
            )  # give resulting column same name as previously- therefore replacing old columns
        ]
    )
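The grouping logic can also be checked locally on individual codes. The sketch below is a plain Python illustration, not part of BastionLab or Polars: `icd9_group` is a hypothetical helper, and the lower bounds follow the standard ICD-9 chapter ranges (for example 140–239 for neoplasms and 240–279 for endocrine, nutritional and metabolic diseases and immunity disorders).

```python
from bisect import bisect_right

# lower bound of each ICD-9 chapter and its label, in ascending order
CHAPTERS = [
    (1, "infectious and parasitic"),
    (140, "neoplasms"),
    (240, "endocrine, nutritional, metabolic and immunity"),
    (280, "blood and blood-forming organs"),
    (290, "mental disorders"),
    (320, "nervous system and sense organs"),
    (390, "circulatory"),
    (460, "respiratory"),
    (520, "digestive"),
    (580, "genitourinary"),
    (630, "pregnancy, childbirth and puerperium"),
    (680, "skin"),
    (710, "musculoskeletal & connective tissue"),
    (740, "congenital anomalies"),
    (760, "perinatal"),
    (780, "symptoms, signs & ill-defined"),
    (800, "injury and poisoning"),
]
LOWER_BOUNDS = [lower for lower, _ in CHAPTERS]


def icd9_group(code):
    """Map one raw ICD-9 code string (or None) to its chapter label."""
    if code is None:
        return None
    if code.startswith("E"):
        return "E code (injury)"
    if code.startswith("V"):
        return "V code (other)"
    value = float(code)
    # index of the last chapter whose lower bound is <= value
    index = bisect_right(LOWER_BOUNDS, value) - 1
    if index < 0:
        return None  # below the first chapter: leave ungrouped
    return CHAPTERS[index][1]


print(icd9_group("250.83"))  # a diabetes code
print(icd9_group("428"))
print(icd9_group("V57"))
```

A lookup table with `bisect` is used here instead of a long chain of comparisons simply because it is easier to read and verify against the published chapter ranges.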

By performing the same query as previously to count `diag_1`'s unique values, we see there are now a much more manageable 19 labels in our data column! This will be similar for the `diag_2` and `diag_3` columns.

In [None]:
tmp = rdf.select("diag_1").unique()
tmp.select(pl.col("diag_1").count()).collect().fetch()

diag_1
u32
19


### Grouping data: A1C, max glucose levels and readmittance

We want to group together data in three more columns using the same `.when().then().otherwise()` methods.

The first two are `A1Cresult`, which contains patients' HbA1c level, and `max_glu_serum`, which contains their blood glucose level. We want to group these into `very high`, `high` and `normal` groups based on levels defined in our project brief.

These columns are both currently string columns, so we will also need to convert them to float values in order to perform numerical comparisons on them.

In [None]:
# cast `max_glu_serum` and `A1Cresult` columns to float values
rdf = rdf.with_columns(
    [pl.col("max_glu_serum").cast(pl.Float64), pl.col("A1Cresult").cast(pl.Float64)]
)

# group values in A1Cresult column
rdf = rdf.with_columns(
    [
        pl.when(pl.col("A1Cresult") >= 8)
        .then("very high")
        .when(pl.col("A1Cresult") >= 7)
        .then("high")
        .when(pl.col("A1Cresult") >= 0)
        .then("normal")
        .otherwise(pl.col("A1Cresult"))
        .keep_name()
    ]
)

# group values in max_glu_serum column
rdf = rdf.with_columns(
    [
        pl.when(pl.col("max_glu_serum") >= 300)
        .then("very high")
        .when(pl.col("max_glu_serum") >= 200)
        .then("high")
        .when(pl.col("max_glu_serum") >= 0)
        .then("normal")
        .otherwise(pl.col("max_glu_serum"))
        .keep_name()
    ]
)

The final column we want to group is the `readmitted` column which records the number of days before any further re-hospitalization linked to the patients' diabetic condition.

We will group this column into `short-term`, `long-term` and `n/a` (not applicable) groups.

As in previous examples, we must first convert values in this column from strings to integer values.

In [None]:
# cast readmitted column to integer values
rdf = rdf.with_columns([pl.col("readmitted").cast(pl.Int64)])

# group values
rdf = rdf.with_columns(
    [
        pl.when(pl.col("readmitted") < 31)
        .then("short-term")
        .when(pl.col("readmitted") >= 31)
        .then("long-term")
        .otherwise("n/a")
        .keep_name()
    ]
)
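It is worth spelling out why null values end up in the `n/a` group: a comparison against a null value is neither true for `< 31` nor for `>= 31`, so the row falls through to `otherwise`. The plain Python sketch below mirrors that behaviour for a single value, with `None` standing in for null; `group_readmitted` is a hypothetical helper used only for illustration.

```python
def group_readmitted(days):
    """Mirror the when/then/otherwise chain for a single value (None stands for null)."""
    if days is not None and days < 31:
        return "short-term"
    if days is not None and days >= 31:
        return "long-term"
    # null values satisfy neither comparison, so they reach the fallback
    return "n/a"


# made-up values, not taken from the real dataset
print([group_readmitted(d) for d in [5, 30, 31, 400, None]])
# ['short-term', 'short-term', 'long-term', 'long-term', 'n/a']
```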

### Grouping data: binning ages
The next grouping task we will perform is to group ages into intervals of 10 years. We do this both to increase data privacy and to more easily draw correlations linked to broader age groups.

We won't need to perform a `when().then().otherwise()` query here since BastionLab has its own `ApplyBins` tool.

`ApplyBins` is a PyTorch module and the grouping of numbers takes place in its `forward` function. We can pass PyTorch modules to BastionLab's `apply_udf` function which will apply the `forward` function to any specified columns.

All in all, we need just three steps to bin our age column data:

1) We import `ApplyBins` from `bastionlab.polars.utils`.
2) We instantiate our `ApplyBins` PyTorch module class with our bins interval given as the only argument.
3) We use `apply_udf`, providing a list of the columns we want to modify and the PyTorch module, `ApplyBins`, that we wish to apply to these columns.

In [None]:
from bastionlab.polars.utils import ApplyBins

# get an instance of ApplyBins module which will bin data into groups of 10
model = ApplyBins(10)

# apply bins to "age" column
rdf = rdf.apply_udf(["age"], model)
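As a rough mental model of what binning into intervals of 10 means, here is a plain Python sketch. It assumes each value is mapped to the lower edge of its interval; this is an assumption made for illustration, and `ApplyBins` itself may assign values to bins differently (for example by rounding to the nearest multiple). `bin_value` is a hypothetical helper, not a BastionLab function.

```python
def bin_value(value, bin_size):
    """Map `value` to the lower edge of its bin, e.g. 37 -> 30 for bins of 10."""
    return (value // bin_size) * bin_size


# made-up ages, not taken from the real dataset
ages = [3, 37, 40, 89]
print([bin_value(age, 10) for age in ages])  # [0, 30, 40, 80]
```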

> Note, you can create your own custom PyTorch modules and apply them to columns using `apply_udf`. This is BastionLab's way of allowing you to apply custom functions on datasets, whilst restricting what you can do for security reasons. Functionality like `lambda`, `map` and `apply` are blocked by BastionLab as they are too permissive and could be misused.

### Adding columns

Up until this point we have been using the `.when().then().otherwise()` and `with_columns` methods to make changes to existing columns, but by providing a new column name to the `alias` method, we can create a new column.

In the following example, we will create an `is_readmitted` column which will store `False` for all the "n/a" values in our original `readmitted` column and `True` for any other values. This will allow us to quickly query whether certain groups of data have been readmitted or not!

In [None]:
rdf = rdf.with_columns(
    [
        pl.when(pl.col("readmitted") == "n/a")
        .then(False)
        .otherwise(True)
        .alias(
            "is_readmitted"
        )  # ending the .when().then().otherwise() pattern with .alias() allows us to provide a new column name
    ]
)

### Converting column types

We have already seen examples where we have explicitly converted the datatype of our columns using the `cast` method. Here we will implicitly convert the datatype by replacing the "Yes" and "No" values in our `change` column, which represent whether a patient's medication has been changed, with boolean `True` or `False` values. 

The datatype of this column will be changed automatically by this operation as we can see below.

In [None]:
# print out initial datatype of "change" column

rdf.select("change").dtypes

[polars.datatypes.Utf8]

In [None]:
# replaces Yes/No values with True/False
rdf = rdf.with_columns(
    [pl.when(pl.col("change") == "No").then(False).otherwise(True).keep_name()]
)

# print out datatype of column post find and replace operation
rdf.select("change").dtypes

[polars.datatypes.Boolean]

### Saving our RemoteLazyFrame and disconnecting

Our dataframe is all clean and ready for the next step: data analysis and visualization. Data scientist #1 is going to be reassigned to another task. They will save their cleaned RemoteLazyFrame and make a note of the identifier to share with data scientist #2.

We need to perform `collect()` before saving or getting an identifier for our RemoteLazyFrame since the `save` method and `identifier` attribute are only available for FetchableLazyFrames.

>Note, the data owner must have set the `savable` option to `True` when uploading the dataframe for this operation to be possible!

In [None]:
# collect once so that the frame we save is the same one whose identifier we share
fdf = rdf.collect()
fdf.save()
saved_identifier = fdf.identifier
saved_identifier

'49b66d7a-6c80-45fb-8278-9992c91f8666'

They can now close their connection to the BastionLab server.

In [None]:
connection.close()