# Building a machine learning model with a Covid-19 dataset

AI models can play a vital role in diagnosing and better understanding medical issues, and they have the potential to save lives. However, data owners with sensitive medical data, such as hospitals holding patient records, are often unable to share this data with other parties because of privacy concerns. BastionLab offers the tools needed to share datasets and to train and share AI models while offering security guarantees. Data scientists can handle datasets remotely and train ML models without ever having access to the full data in the clear.

In this tutorial, we are going to walk you through some examples of how you can clean datasets and run queries without needing access to the data in the clear. We will take a real-world Covid-19 dataset as our example and handle it remotely via a RemoteLazyFrame instance.

## Pre-requisites

### Technical Requirements

To start this tutorial, ensure the following are already installed on your system:
- Python 3.7 or greater (get the latest version of Python at https://www.python.org/downloads/ or with your operating system’s package manager)
- [Python Pip](https://pypi.org/project/pip/), the package manager
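
If you are unsure which interpreter your notebook is using, a quick check along the following lines can confirm that it meets the 3.7 minimum. This is only a convenience sketch; the helper name `meets_minimum` is ours and is not part of BastionLab.

```python
import sys

# Minimum version stated in the requirements above.
MINIMUM = (3, 7)


def meets_minimum(version_info=None, minimum=MINIMUM):
    """Return True if the (major, minor, ...) tuple is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= minimum


print("Python", sys.version.split()[0], "OK" if meets_minimum() else "too old")
```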

## Pip packages and dataset

In order to run this notebook, you will also need to install Polars and BastionLab, which you can do by running the code block below.

In [167]:
#! pip install bastionlab

Finally, you will need to download the dataset by running the code block below. This dataset collects data on the outcomes of Covid cases based on various pre-conditions such as asthma and diabetes. It is a version of a huge dataset provided by the Mexican government and is based on the population of Mexico, so insights gained from it may not be valid for other geographical areas.

In [168]:
#!wget https://raw.githubusercontent.com/rinbaruah/COVID_preconditions_Kaggle/master/Data/covid.csv

## Getting set-up


Next, we'll install and launch the server. For testing purposes, the BastionLab server has been packaged as a pip wheel. In this tutorial, we will use this package to quickly set up a test server.

In [169]:
#!pip install bastionlab_server

The server exposes port 50056 for gRPC communication with clients and uses a default configuration (no authentication, default settings). For the purpose of this example, these settings are sufficient and we won't change them.

To run the server, we use the utility function provided by the bastionlab_server package.

In [170]:
import bastionlab_server

srv = bastionlab_server.start()

BastionLab server (version 0.3.5) already installed
Libtorch (version 1.12.1) already installed
TLS certificates already generated
Bastionlab server is now running on port 50056
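
If you want to confirm that something is listening on the port before connecting, a plain socket probe is one option. This is a generic sketch using only Python's standard library, not a BastionLab feature, and `is_listening` is a hypothetical helper name. The same check can tell you whether a server is already running on the port, since starting a second one there fails with an "Address already in use" error.

```python
import socket


def is_listening(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return probe.connect_ex((host, port)) == 0


print("Port 50056 in use:", is_listening(50056))
```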


Now that's all done, we can finally connect to our server, send over our CSV file and start analysing our data!

Firstly, we read in the dataset using Polars' read_csv() function, which returns a Polars DataFrame instance containing the dataset.
Secondly, we connect to the server by creating a BastionLab Connection.

In [171]:
from bastionlab import Connection
import polars as pl

df = pl.read_csv("covid.csv")

connection = Connection("localhost", 50056)

Finally, we send the Polars DataFrame instance to the server using BastionLab's send_df() method, which returns a RemoteLazyFrame instance: a reference to the uploaded DataFrame that we will work with throughout this tutorial.

For the sake of this tutorial, we create a custom policy that disables all data request/query checks, so the data owner never has to approve a request. We do this by setting the `safe_zone` parameter to `TrueRule()`, which allows all requests. Since no request is then considered unsafe, the `unsafe_handling` parameter can be anything; we set it to `Log()` in the following example. Note that this is purely so that we can focus on demonstrating data cleaning in BastionLab; this policy is not suited for production.

In [172]:
from bastionlab.polars.policy import Policy, TrueRule, Log

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log())
rdf = connection.client.polars.send_df(df, policy=policy)

[2022-12-15T11:14:35Z INFO  bastionlab] Authentication is disabled.
[2022-12-15T11:14:35Z INFO  bastionlab] Telemetry is enabled.
[2022-12-15T11:14:35Z INFO  bastionlab] BastionLab server listening on 0.0.0.0:50056.
[2022-12-15T11:14:35Z INFO  bastionlab] Server ready to take requests
Error: transport error

Caused by:
    0: error creating server listener: Address already in use (os error 98)
    1: Address already in use (os error 98)


{"safe_zone":"True","unsafe_handling":"Log"}


[2022-12-15T11:14:40Z INFO  bastionlab_polars] Succesfully sent dataframe 60f8e17d-ad60-4e4c-b069-8c2b59530ac7 to server


We can now see a schema with all the RemoteLazyFrame's column names and types by displaying rdf's schema attribute.

In [173]:
rdf.schema

{'id': polars.datatypes.Utf8,
 'sex': polars.datatypes.Int64,
 'patient_type': polars.datatypes.Int64,
 'entry_date': polars.datatypes.Utf8,
 'date_symptoms': polars.datatypes.Utf8,
 'date_died': polars.datatypes.Utf8,
 'intubed': polars.datatypes.Int64,
 'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Int64,
 'pregnancy': polars.datatypes.Int64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'contact_other_covid': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64,
 'icu': polars.datatypes.Int64}

# Data cleaning: Column dtypes

There are a few column dtypes that we might want to change here. 

Firstly, all the date columns ('entry_date', 'date_symptoms' and 'date_died') hold Utf8 values. We can change their dtype to the Polars Date type, which allows us to access the day, month and year values individually and to sort dates more easily if needed.

To do this, we use the with_columns method to replace our current date columns with new versions in which the dates are converted to pl.Date. We assign the output of this function back to rdf and then check that the schema attribute has updated as expected.

In [174]:
rdf = rdf.with_columns(
    [
        pl.col("entry_date").str.strptime(pl.Date, fmt="%d-%m-%Y", strict=False),
        pl.col("date_symptoms").str.strptime(pl.Date, fmt="%d-%m-%Y", strict=False),
        pl.col("date_died").str.strptime(pl.Date, fmt="%d-%m-%Y", strict=False),
    ]
)
rdf.schema

{'id': polars.datatypes.Utf8,
 'sex': polars.datatypes.Int64,
 'patient_type': polars.datatypes.Int64,
 'entry_date': polars.datatypes.Date,
 'date_symptoms': polars.datatypes.Date,
 'date_died': polars.datatypes.Date,
 'intubed': polars.datatypes.Int64,
 'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Int64,
 'pregnancy': polars.datatypes.Int64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'contact_other_covid': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64,
 'icu': polars.datatypes.Int64}
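
To make the effect of `strict=False` concrete, here is a rough plain-Python analogue that does not use Polars or BastionLab: values that match the `%d-%m-%Y` format become dates, and anything that fails to parse becomes `None` instead of raising an error. The function name `parse_date_lenient` is made up for this sketch, and Python's `datetime.strptime` only stands in for Polars' own parser.

```python
from datetime import date, datetime


def parse_date_lenient(text, fmt="%d-%m-%Y"):
    """Parse `text` with `fmt`; return None when it does not match."""
    try:
        return datetime.strptime(text, fmt).date()
    except (TypeError, ValueError):
        # Unparseable or missing input becomes a missing value.
        return None


samples = ["04-05-2020", "9999-99-99"]
print([parse_date_lenient(s) for s in samples])
```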

## Replacing values

The next thing I'd like to look at is the 'sex' column. This column uses the integer 1 for females and 2 for males. We can change this to 'female' for females, 'male' for males and 'unknown' for any other values.

We do this by using the with_column method with a when.then.otherwise clause which acts as a find and replace tool. We set the .alias to "sex" which will then replace our current 'sex' column with our new updated one.

When we print out the schema, we can see this has changed the dtype of the column to Utf8.

In [175]:
rdf = rdf.with_column(
    pl.when(pl.col("sex") == 2)
    .then("male")
    .when(pl.col("sex") == 1)
    .then("female")
    .otherwise("unknown")
    .alias("sex")
)
rdf.schema

{'id': polars.datatypes.Utf8,
 'sex': polars.datatypes.Utf8,
 'patient_type': polars.datatypes.Int64,
 'entry_date': polars.datatypes.Date,
 'date_symptoms': polars.datatypes.Date,
 'date_died': polars.datatypes.Date,
 'intubed': polars.datatypes.Int64,
 'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Int64,
 'pregnancy': polars.datatypes.Int64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'contact_other_covid': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64,
 'icu': polars.datatypes.Int64}
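
The `when`/`then`/`otherwise` chain is evaluated for each row, and the first condition that matches decides the result. As a plain-Python sketch of the same logic for a single value (the `recode_sex` helper is hypothetical and not part of Polars or BastionLab):

```python
def recode_sex(code):
    """Mirror the when/then/otherwise chain for a single value."""
    if code == 2:
        return "male"
    elif code == 1:
        return "female"
    else:
        # Anything else, including a missing value, gets the default.
        return "unknown"


print([recode_sex(c) for c in [1, 2, 99, None]])
```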

Similarly, the 'patient_type' column uses the integer 1 for outpatients and 2 for inpatients. We can change these to 'outpatient' and 'inpatient' using the same method as before.

In [176]:
rdf = rdf.with_column(
    pl.when(pl.col("patient_type") == 2)
    .then("inpatient")
    .when(pl.col("patient_type") == 1)
    .then("outpatient")
    .otherwise("unknown")
    .alias("patient_type")
)
rdf.schema

{'id': polars.datatypes.Utf8,
 'sex': polars.datatypes.Utf8,
 'patient_type': polars.datatypes.Utf8,
 'entry_date': polars.datatypes.Date,
 'date_symptoms': polars.datatypes.Date,
 'date_died': polars.datatypes.Date,
 'intubed': polars.datatypes.Int64,
 'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Int64,
 'pregnancy': polars.datatypes.Int64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'contact_other_covid': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64,
 'icu': polars.datatypes.Int64}

Finally, for all our columns referring to pre-conditions, 2 is used to mean False and 1 is used to mean True. We can replace 2 with 0, which is the more standard numerical equivalent of False in computer science. We use the same logic as in the previous query, with a for loop to iterate over all the columns we want to apply it to.

In [177]:
for x in [
    "intubed",
    "pneumonia",
    "pregnancy",
    "diabetes",
    "copd",
    "asthma",
    "inmsupr",
    "hypertension",
    "other_disease",
    "cardiovascular",
    "obesity",
    "renal_chronic",
    "covid_res",
    "icu",
]:
    rdf = rdf.with_columns(
        pl.when(pl.col(x) == 2).then(0).otherwise(pl.col(x)).alias(x)
    )
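
Note that the `otherwise(pl.col(x))` branch leaves every value other than 2 unchanged, so 1 stays 1 and any other code in the column passes through untouched. A plain-Python sketch of that rule, with `recode_flag` as an illustrative name:

```python
def recode_flag(value):
    """Map 2 (False) to 0 and return every other value unchanged."""
    return 0 if value == 2 else value


# 97 stands in for any other code that may be present in the data.
column = [1, 2, 2, 97, 1]
print([recode_flag(v) for v in column])
```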

# Data cleaning: Null values

Firstly, we want to see where missing values might cause problems in our dataset. We calculate the percentage of missing values in each column by summing the is_null() values, multiplying the result by 100 and dividing it by the total number of rows in that column.

In [178]:
percent_missing = rdf.select(
    pl.col(x).is_null().sum() * 100 / pl.col(x).count() for x in rdf.columns
)
percent_missing.collect().fetch()

[2022-12-15T11:14:41Z INFO  bastionlab_polars] Succesfully ran query on 45dbced6-c0dd-4eaf-8b4a-5a37b7006c8f


id,sex,patient_type,entry_date,date_symptoms,date_died,intubed,pneumonia,age,pregnancy,diabetes,copd,asthma,inmsupr,hypertension,other_disease,cardiovascular,obesity,renal_chronic,tobacco,contact_other_covid,covid_res,icu
f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
0.0,0.0,0.0,0.0,0.0,93.615271,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
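
The arithmetic behind each cell of this result is the number of nulls times 100, divided by the number of rows. A plain-Python sketch on a toy column (not the real dataset), with `percent_missing_of` as a hypothetical helper:

```python
def percent_missing_of(values):
    """Return the percentage of None entries in `values`."""
    if not values:
        return 0.0
    nulls = sum(1 for v in values if v is None)
    return nulls * 100 / len(values)


toy_column = [None, "2020-05-04", None, None]
print(percent_missing_of(toy_column))
```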


We discover that there are actually no Null values apart from those in the 'date_died' column. If we take a look at our .csv file, we see that this is because missing values in our integer columns have been given the value 97, 98 or 99. We can replace these with 'n/a' for clarity.

In [180]:
for x in [
    "intubed",
    "pneumonia",
    "pregnancy",
    "diabetes",
    "copd",
    "asthma",
    "inmsupr",
    "hypertension",
    "other_disease",
    "cardiovascular",
    "obesity",
    "renal_chronic",
    "covid_res",
    "icu",
]:
    rdf = rdf.with_columns(
        pl.when(pl.col(x) == 97)
        .then("n/a")
        .when(pl.col(x) == 98)
        .then("n/a")
        .when(pl.col(x) == 99)
        .then("n/a")
        .otherwise(pl.col(x))
        .alias(x)
    )
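
One thing to be aware of with this approach is that mixing a string such as 'n/a' into an integer column is likely to change the kind of values the column holds. An alternative design, shown here as a plain-Python sketch rather than as BastionLab code, is to map the placeholder codes to a missing value so that the remaining entries stay numeric. The names `SENTINELS` and `sentinel_to_none` are illustrative only.

```python
# Placeholder codes described above as standing for missing values.
SENTINELS = {97, 98, 99}


def sentinel_to_none(value):
    """Replace a placeholder code with None; keep real values as they are."""
    return None if value in SENTINELS else value


raw = [1, 0, 97, 1, 99]
cleaned = [sentinel_to_none(v) for v in raw]

# With 0/1 coding, the mean of the known values is the share of positives.
known = [v for v in cleaned if v is not None]
share_positive = sum(known) / len(known)
print(cleaned, share_positive)
```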

The Null values in the date_died column correspond to entries that were originally stored as 9999-99-99 to indicate that the patient did not die. These could not be converted to the pl.Date format because they do not pass the parsing rules. One option would be to substitute a placeholder date for these cases, but we will leave them as Null for now to clearly indicate that these patients do not have a date of death.

## Dropping columns

The very last thing we will do is drop the 'id' column, since we have no use for it when training our model.

In [182]:
rdf = rdf.drop("id")

So that brings our demo to an end. All that's left to do is close our connection!

In [183]:
connection.close()