# Building a machine learning model with a Covid-19 dataset

***#Rework that part (note to self)***

AI models can play a vital role in diagnosing and better understanding medical issues and have the potential to save lives.  However, data owners with sensitive medical data, such as hospitals with datasets of patients' data, have often been left unable to share this data with other parties because of data privacy concerns. BastionLab offers the tools needed to share datasets and train and share AI models whilst offering security guaranties. Data scientists are able to handle datasets remotely and train ML models without ever having access to the full data in clear.

In this notebook, we'll use a real-world Covid-19 dataset to show you how you could clean datasets, run queries and visualize data - without needing access to the data in clear.

## Pre-requisites
___________________________________________

### Installation and dataset

In order to run this notebook, we need to:
- Have [Python3.7](https://www.python.org/downloads/) (or greater) and [Python Pip](https://pypi.org/project/pip/) installed
- Install [BastionLab](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/)
- Download [the dataset](https://raw.githubusercontent.com/rinbaruah/COVID_preconditions_Kaggle/master/Data/covid.csv) we will be using in this tutorial.

We'll do so by running the code block below. 

>You can see our [Installation page](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/) to find the installation method that best suits your needs.

In [None]:
# pip packages
#!pip install bastionlab
#!pip install bastionlab_server

# download the dataset
#!wget 'https://raw.githubusercontent.com/rinbaruah/COVID_preconditions_Kaggle/master/Data/covid.csv'

This dataset collects data on outcomes of Covid cases based on various pre-conditions such as asthma and diabetes. This is a version of a huge dataset provided by the Mexican government based on the population of Mexico, so the insights gained from it may not be valid for other geographical areas in the world.

### Launch and connect to the server

In [None]:
# launch bastionlab_server test package
# import bastionlab_server

# srv = bastionlab_server.start()

>*Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our [Installation Tutorial](../getting-started/installation.md) for more details.*

In [3]:
# connect to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client

### Upload the dataframe to the server

We'll quickly upload the dataset to the server with an open safety policy, since setting up BastionLab is not the focus of this tutorial. It will allows us to demonstrate features without having to approve any data access requests. You can check out how to define a safe privacy policy [here](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/defining_policy_privacy/).

In [4]:
import polars as pl
from bastionlab.polars.policy import Policy, TrueRule, Log

df = pl.read_csv("covid.csv")

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=True)
rdf = client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])

rdf

FetchableLazyFrame(identifier=c69a4f5c-13d7-4e2e-abb0-6a2643e1b7f8)

<div class="warning">
<b>This policy is not suitable for production.</b> Please note that we <i>only</i> use it for demonstration purposes, to avoid having to approve any data access requests in the tutorial. <br></div> <br>

We'll check that we're properly connected and that we have the authorizations by running a simple query:

In [5]:
rdf.schema

{'id': polars.datatypes.Utf8,
 'sex': polars.datatypes.Int64,
 'patient_type': polars.datatypes.Int64,
 'entry_date': polars.datatypes.Utf8,
 'date_symptoms': polars.datatypes.Utf8,
 'date_died': polars.datatypes.Utf8,
 'intubed': polars.datatypes.Int64,
 'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Int64,
 'pregnancy': polars.datatypes.Int64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'contact_other_covid': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64,
 'icu': polars.datatypes.Int64}

## Data cleaning
_______________________________________________________________

### Column dtypes

There are a few column dtypes that we might want to change here. 

Firstly, all the data columns ('entry_date', 'date_symptoms' and 'date_died') are Utf8 values. We can change the dtype of these to Polars Date type which allows us to access the day, month and year values individually and sort dates more easily if needed.

To do this, I use the with_columns method to replace our current date columns with a new version where the dates are converted to a pl.Date. I set rdf to equal the output of this function and then check the schema attribute has updated as expected.

In [6]:
rdf = rdf.with_columns(
    [
        pl.col("entry_date").str.strptime(pl.Date, fmt="%d-%m-%Y", strict=False),
        pl.col("date_symptoms").str.strptime(pl.Date, fmt="%d-%m-%Y", strict=False),
        pl.col("date_died").str.strptime(pl.Date, fmt="%d-%m-%Y", strict=False),
    ]
)
rdf.schema

{'id': polars.datatypes.Utf8,
 'sex': polars.datatypes.Int64,
 'patient_type': polars.datatypes.Int64,
 'entry_date': polars.datatypes.Date,
 'date_symptoms': polars.datatypes.Date,
 'date_died': polars.datatypes.Date,
 'intubed': polars.datatypes.Int64,
 'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Int64,
 'pregnancy': polars.datatypes.Int64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'contact_other_covid': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64,
 'icu': polars.datatypes.Int64}

### Replacing values

The next thing I'd like to look at is the 'sex' column. This column uses the integer 1 for females and 2 for males. We can change this to 'female' for females, 'male' for males and 'unknown' for any other values.

We do this by using the with_column method with a when.then.otherwise clause which acts as a find and replace tool. We set the .alias to "sex" which will then replace our current 'sex' column with our new updated one.

When we print out the schema, we can see this has changed the dtype of the column to Utf8.

In [7]:
rdf = rdf.with_column(
    pl.when(pl.col("sex") == 2)
    .then("male")
    .when(pl.col("sex") == 1)
    .then("female")
    .otherwise("unknown")
    .alias("sex")
)
rdf.schema

{'id': polars.datatypes.Utf8,
 'sex': polars.datatypes.Utf8,
 'patient_type': polars.datatypes.Int64,
 'entry_date': polars.datatypes.Date,
 'date_symptoms': polars.datatypes.Date,
 'date_died': polars.datatypes.Date,
 'intubed': polars.datatypes.Int64,
 'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Int64,
 'pregnancy': polars.datatypes.Int64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'contact_other_covid': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64,
 'icu': polars.datatypes.Int64}

Similarly, the 'patient_type' column use the integer 1 for outpatients and 2 for inpatients. We can change this to 'outpatient' and 'inpatient' using the same method as previous.

In [8]:
rdf = rdf.with_column(
    pl.when(pl.col("patient_type") == 2)
    .then("inpatiente")
    .when(pl.col("patient_type") == 1)
    .then("outpatient")
    .otherwise("unknown")
    .alias("patient_type")
)
rdf.schema

{'id': polars.datatypes.Utf8,
 'sex': polars.datatypes.Utf8,
 'patient_type': polars.datatypes.Utf8,
 'entry_date': polars.datatypes.Date,
 'date_symptoms': polars.datatypes.Date,
 'date_died': polars.datatypes.Date,
 'intubed': polars.datatypes.Int64,
 'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Int64,
 'pregnancy': polars.datatypes.Int64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'contact_other_covid': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64,
 'icu': polars.datatypes.Int64}

Finally for all our columns refering to pre-conditions, 2 is used to mean False and 1 is used to mean True. We can replace 2 with 0 which is the more standard numerical equivalent to False in computer science. We will do this using the same logic as in the previous query but I use a for loop to iterate over all the columns I want to apply this query to.

In [9]:
for x in [
    "intubed",
    "pneumonia",
    "pregnancy",
    "diabetes",
    "copd",
    "asthma",
    "inmsupr",
    "hypertension",
    "other_disease",
    "cardiovascular",
    "obesity",
    "renal_chronic",
    "covid_res",
    "icu",
]:
    rdf = rdf.with_columns(pl.when(pl.col(x) == 2).then(0).otherwise(1).alias(x))

rdf.schema

{'id': polars.datatypes.Utf8,
 'sex': polars.datatypes.Utf8,
 'patient_type': polars.datatypes.Utf8,
 'entry_date': polars.datatypes.Date,
 'date_symptoms': polars.datatypes.Date,
 'date_died': polars.datatypes.Date,
 'intubed': polars.datatypes.Int64,
 'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Int64,
 'pregnancy': polars.datatypes.Int64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'contact_other_covid': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64,
 'icu': polars.datatypes.Int64}

### Null values

Firstly, we want to see where we might have problems with missing values in our dataset, so we will calculate what percentage of values are missing from each column by taking the sum of is_null() values in each column, multiplying them by 100 and dividing them by the total rows of that same column.

In [10]:
percent_missing = rdf.select(
    pl.col(x).is_null().sum() * 100 / pl.col(x).count() for x in rdf.column_names
)
percent_missing.collect().fetch()

id,sex,patient_type,entry_date,date_symptoms,date_died,intubed,pneumonia,age,pregnancy,diabetes,copd,asthma,inmsupr,hypertension,other_disease,cardiovascular,obesity,renal_chronic,tobacco,contact_other_covid,covid_res,icu
f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
0.0,0.0,0.0,0.0,0.0,93.615271,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We discover that there are acutally no Null values apart from in the 'date_died' column. If we take a look at our .csv file, we see that this is because Null values in our integer columns have been given the value '97', '98' or '99'. We can replace these with 'unknown' for clarity.

In [11]:
for x in [
    "intubed",
    "pneumonia",
    "pregnancy",
    "diabetes",
    "copd",
    "asthma",
    "inmsupr",
    "hypertension",
    "other_disease",
    "cardiovascular",
    "obesity",
    "renal_chronic",
    "covid_res",
    "icu",
]:
    rdf = rdf.with_columns(
        pl.when(pl.col(x) == 97)
        .then(-1)
        .when(pl.col(x) == 98)
        .then(-1)
        .when(pl.col(x) == 99)
        .then(-1)
        .otherwise(pl.col(x))
        .alias(x)
    )

The Null values in the date_died column correspond to dates that were originally stored as 9999-99-99 to indicate where patients did not die and which could not be converted to the pl.Date format since they do not pass parsing rules. One option would be to add a new false date to indicate these cases, but we will leave these as Null for now to clearly indicate that these patients do not have a date of death.

### Dropping columns

And the very last thing we will do is drop the 'id' column since we have no use for our training model.

In [12]:
rdf = rdf.drop("id")

Also, unconcerned columns (`date_died`, `entry_date`, `date_symptoms`) are removed from the RemoteDataFrame using `drop`.

In [13]:
rdf = rdf.drop(
    ["entry_date", "date_symptoms", "date_died", "patient_type", "sex", "id"]
).collect()

## Data conversion: from DataFrame to RemoteArray

In this section of the this tutorial, you will convert the Covid dataset to a trainable dataset on BastionLab.

First, we convert all the columns into `RemoteSeries`. RemoteSeries is a pointer to a column in the RemoteDataFrame. This step is useful because the columns can then be individually converted into **RemoteArray**.

- RemoteArray is akin to ndarrays in pandas, but they are a lazy representation of the ndarray on the server. Once `Model.fit` is called, we then transform the `RemoteDataFrame` into the actual `ndarray` representation.

Deep learning models only accept tensors.

In [14]:
from bastionlab.polars import train_test_split

# The RemoteDataFrame is shuffled and split into training and testing dataset.
train_rdf, test_rdf = train_test_split(
    rdf,
    test_size=0.2,
    shuffle=True,
)

In [15]:
import torch

# Results column (covdi_res, 1-covid present, 0-covid absent)
label_name = "covid_res"

# Get all columns of the RemoteDataFrame
cols = list(filter(lambda a: a != label_name, rdf.column_names))

# Transform RemoteDataFrame -> List[RemoteSeries] -> RemoteArray
train_X = train_rdf.columns(cols).to_array()
test_X = test_rdf.columns(cols).to_array()

# Transform RemoteDataFrame -> RemoteSeries -> RemoteArray
train_y = train_rdf.column(label_name).to_array()
test_y = test_rdf.column(label_name).to_array()

So that brings our demo to an end. All that's left to do is close our connection!

In [16]:
from bastionlab.linfa.trainers import LinearRegression

lr = LinearRegression()
lr.fit(train_X, train_y)

FittedModel(identifier=892689f7-c7a9-45dc-9c94-e079d9752ca9)
  └── LinearRegression(identifier='892689f7-c7a9-45dc-9c94-e079d9752ca9', fit_intercept=True)

In [17]:
from bastionlab.linfa import cross_validate

rdf = cross_validate(lr, train_X, train_y, 5)

rdf.collect().fetch()

cross_validation
f64
0.074803


In [18]:
connection.close()