# Building a machine learning model with a Covid-19 dataset

***#Rework that part (note to self)***

AI models can play a vital role in diagnosing and better understanding medical issues and have the potential to save lives.  However, data owners with sensitive medical data, such as hospitals with datasets of patients' data, have often been left unable to share this data with other parties because of data privacy concerns. BastionLab offers the tools needed to share datasets and train and share AI models while offering security guaranties. Data scientists are able to handle datasets remotely and train ML models without ever having access to the full data in clear.

In this notebook, we'll use a real-world Covid-19 dataset to show you how you could clean datasets, run queries and visualize data - without needing access to the data in clear.

## Pre-requisites
___________________________________________

### Installation and dataset

In order to run this notebook, we need to:
- Have [Python3.7](https://www.python.org/downloads/) (or greater) and [Python Pip](https://pypi.org/project/pip/) installed
- Install [BastionLab](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/)
- Download [the dataset](https://raw.githubusercontent.com/rinbaruah/COVID_preconditions_Kaggle/master/Data/covid.csv) we will be using in this tutorial.

We'll do so by running the code block below. 

>You can see our [Installation page](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/) to find the installation method that best suits your needs.

In [None]:
# pip packages
!pip install bastionlab
!pip install bastionlab_server

# download the dataset
!wget 'https://raw.githubusercontent.com/rinbaruah/COVID_preconditions_Kaggle/master/Data/covid.csv'

: 

This dataset collects data on outcomes of Covid cases based on various pre-conditions such as asthma and diabetes. This is a version of a huge dataset provided by the Mexican government based on the population of Mexico, so the insights gained from it may not be valid for other geographical areas in the world.

### Launch and connect to the server

In [None]:
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()

>*Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our [Installation Tutorial](../getting-started/installation.md) for more details.*

In [3]:
# connect to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client

### Upload the dataframe to the server

We'll quickly upload the dataset to the server with an open safety policy, since setting up BastionLab is not the focus of this tutorial. It will allows us to demonstrate features without having to approve any data access requests. You can check out how to define a safe privacy policy [here](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/defining_policy_privacy/).

In [6]:
import polars as pl
from bastionlab.polars.policy import Policy, TrueRule, Log

df = pl.read_csv("covid.csv")

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=True)
rdf = client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])

rdf

<div class="warning">
<b>This policy is not suitable for production.</b> Please note that we <i>only</i> use it for demonstration purposes, to avoid having to approve any data access requests in the tutorial. <br></div> <br>

We'll check that we're properly connected and that we have the authorizations by running a simple query:

In [7]:
rdf.schema

{'id': polars.datatypes.Utf8,
 'sex': polars.datatypes.Int64,
 'patient_type': polars.datatypes.Int64,
 'entry_date': polars.datatypes.Utf8,
 'date_symptoms': polars.datatypes.Utf8,
 'date_died': polars.datatypes.Utf8,
 'intubed': polars.datatypes.Int64,
 'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Int64,
 'pregnancy': polars.datatypes.Int64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'contact_other_covid': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64,
 'icu': polars.datatypes.Int64}

## Data cleaning
_______________________________________________________________

Let's start by preparing our dataset.

### Column dtypes

There are a few column dtypes that we'll want to change here. 

First, all the data columns (`'entry_date'`, `'date_symptoms'` and `'date_died'`) are `Utf8` (unicode) values. We can change the dtype of these to Polars `Date` type which allows us to access the day, month and year values individually and sort dates more easily.

To do this, we'll use the `with_columns` method to replace our current date columns with a new version where the dates are converted to a `pl.Date`. We'll set `rdf` to equal the output of this function and then check that the schema attribute has updated as expected.

In [8]:
# Converting our Utf8 type by using Pythong strptime() to create a datetime object
rdf = rdf.with_columns(
    [
        pl.col("entry_date").str.strptime(pl.Date, fmt="%d-%m-%Y", strict=False),
        pl.col("date_symptoms").str.strptime(pl.Date, fmt="%d-%m-%Y", strict=False),
        pl.col("date_died").str.strptime(pl.Date, fmt="%d-%m-%Y", strict=False),
    ]
)

# Testing that we get the expected result
rdf.schema

{'id': polars.datatypes.Utf8,
 'sex': polars.datatypes.Int64,
 'patient_type': polars.datatypes.Int64,
 'entry_date': polars.datatypes.Date,
 'date_symptoms': polars.datatypes.Date,
 'date_died': polars.datatypes.Date,
 'intubed': polars.datatypes.Int64,
 'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Int64,
 'pregnancy': polars.datatypes.Int64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'contact_other_covid': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64,
 'icu': polars.datatypes.Int64}

### Replacing values

The next thing we'd like clarify is the `'sex'` column. 

It uses the integer `1` for females and `2` for males. We want to change this to `'female'` for females, `'male'` for males and `'unknown'` for any other values.

We can do this by using the `with_column` method with a `when.then.otherwise` clause which acts as a 'find and replace' tool. Then we set the `.alias` to `"sex"`, which will replace our current `'sex'` column with the updated one.

When we print out the `schema`, we can see this has changed the dtype of the column to `Utf8`.

In [9]:
# Replacing '2' and '1' values by 'male' and 'female' and write 
# 'unknown' everywhere else. .alias replaces the current column with
# the modified one. 
rdf = rdf.with_column(
    pl.when(pl.col("sex") == 2)
    .then("male")
    .when(pl.col("sex") == 1)
    .then("female")
    .otherwise("unknown")
    .alias("sex")
)

# Testing that we get the expected result
rdf.schema

{'id': polars.datatypes.Utf8,
 'sex': polars.datatypes.Utf8,
 'patient_type': polars.datatypes.Int64,
 'entry_date': polars.datatypes.Date,
 'date_symptoms': polars.datatypes.Date,
 'date_died': polars.datatypes.Date,
 'intubed': polars.datatypes.Int64,
 'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Int64,
 'pregnancy': polars.datatypes.Int64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'contact_other_covid': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64,
 'icu': polars.datatypes.Int64}

We have a similar issue with the `'patient_type'` column. It use the integer `1` for outpatients and `2` for inpatients. We can change this to `'out_patient'` and `'in_patient'` using the same method as previously.

In [10]:
# Replacing patient type column integers by strings for clarity
rdf = rdf.with_column(
    pl.when(pl.col("patient_type") == 2)
    .then("in_patient")
    .when(pl.col("patient_type") == 1)
    .then("out_patient")
    .otherwise("unknown")
    .alias("patient_type")
)

# Testing that we get the expected result
rdf.schema

{'id': polars.datatypes.Utf8,
 'sex': polars.datatypes.Utf8,
 'patient_type': polars.datatypes.Utf8,
 'entry_date': polars.datatypes.Date,
 'date_symptoms': polars.datatypes.Date,
 'date_died': polars.datatypes.Date,
 'intubed': polars.datatypes.Int64,
 'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Int64,
 'pregnancy': polars.datatypes.Int64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'contact_other_covid': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64,
 'icu': polars.datatypes.Int64}

In all the columns refering to pre-conditions, `2` is used for `False` and `1` is used for `True`. We can replace `2` with `0` which is the more standard numerical equivalent to `False` in computer science.

We'll do this by using the same logic as previously - but we'll use a `for` loop to iterate over all the columns we want to apply this query to.

In [11]:
# Using a for loop to iterate over all the columns we want to change
for x in [
    "intubed",
    "pneumonia",
    "pregnancy",
    "diabetes",
    "copd",
    "asthma",
    "inmsupr",
    "hypertension",
    "other_disease",
    "cardiovascular",
    "obesity",
    "renal_chronic",
    "covid_res",
    "icu",
]:
    rdf = rdf.with_columns(pl.when(pl.col(x) == 2).then(0).otherwise(1).alias(x))

# Testing the result is as expected
rdf.schema

{'id': polars.datatypes.Utf8,
 'sex': polars.datatypes.Utf8,
 'patient_type': polars.datatypes.Utf8,
 'entry_date': polars.datatypes.Date,
 'date_symptoms': polars.datatypes.Date,
 'date_died': polars.datatypes.Date,
 'intubed': polars.datatypes.Int64,
 'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Int64,
 'pregnancy': polars.datatypes.Int64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'contact_other_covid': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64,
 'icu': polars.datatypes.Int64}

### Null values

Now, let's see where we might have problems with missing values in our dataset. 

We'll calculate what percentage of values are missing from each column by taking the sum of `is_null()` values in each column, multiplying them by 100 and dividing them by the total rows of that same column.

In [12]:
# Calculating the percentage of missing values from each column
percent_missing = rdf.select(
    pl.col(x).is_null().sum() * 100 / pl.col(x).count() for x in rdf.column_names
)

# Fetching the resulting RemoteDataFrame
percent_missing.collect().fetch()

id,sex,patient_type,entry_date,date_symptoms,date_died,intubed,pneumonia,age,pregnancy,diabetes,copd,asthma,inmsupr,hypertension,other_disease,cardiovascular,obesity,renal_chronic,tobacco,contact_other_covid,covid_res,icu
f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
0.0,0.0,0.0,0.0,0.0,89.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


There are acutally no `null` values, except in the `'date_died'` column! 

This is weird. Taking a look at our `.csv` file, we see that this is because `null` values in the integer columns have been given the value `97`, `98` and `99`. We can replace these with `-1` for more clarity.

In [13]:
# Using a for loop to select all the column we want to change
for x in [
    "intubed",
    "pneumonia",
    "pregnancy",
    "diabetes",
    "copd",
    "asthma",
    "inmsupr",
    "hypertension",
    "other_disease",
    "cardiovascular",
    "obesity",
    "renal_chronic",
    "covid_res",
    "icu",
]:
    rdf = rdf.with_columns(
        pl.when(pl.col(x) == 97)
        .then(-1)
        .when(pl.col(x) == 98)
        .then(-1)
        .when(pl.col(x) == 99)
        .then(-1)
        .otherwise(pl.col(x))
        .alias(x)
    )

The `null` values in the `date_died` column correspond to dates that were originally stored as `9999-99-99` to indicate when patients did not die. They couldn't be converted to the `pl.Date` format since they do not pass parsing rules. 

One option would be to add a new fake date to point out these cases, but we will leave them as `null` for now, to clearly indicate that these patients didn't die.

### Dropping columns

This is our last cleaning step: we will drop the `'id'` column since we have no use for it in our training model.

In [14]:
# Deleting the 'id' column
rdf = rdf.drop("id")

Also, unconcerned columns (`date_died`, `entry_date`, `date_symptoms`) are removed from the RemoteDataFrame using `drop`.

***# ????***

In [15]:
rdf = rdf.drop(
    ["entry_date", "date_symptoms", "date_died", "patient_type", "sex", "id"]
).collect()

## Data conversion: from DataFrame to Dataset

Now that our data is clean, let's convert the Covid dataset to a trainable dataset on BastionLab. 

Our first step will be to transform all the columns so they can be individually converted into tensors. We'll make them into `RemoteSeries`, a pointer to a column in the `RemoteDataFrame`. 

***# You talk about this, but where to we do it?***

In [16]:
from bastionlab.polars import train_test_split

# The RemoteDataFrame is shuffled and split into training and testing dataset.
train_rdf, test_rdf = train_test_split(
    rdf,
    test_size=0.2,
    shuffle=True,
)

In [17]:
import torch
from bastionlab.torch.remote_torch import RemoteDataset

# Get all columns of the RemoteDataFrame
cols = rdf.column_names

# Transform RemoteDataFrame to RemoteTensor
train_inputs = train_rdf.columns(cols).to_tensor()
test_inputs = test_rdf.columns(cols).to_tensor()

# Cast RemoteTensor from `int64` to `float32`
train_inputs = train_inputs.to(torch.float32)
test_inputs = train_inputs.to(torch.float32)

label = "covid_res"

# The label tensor is selected from the remote_tensors storage
train_label = train_rdf.column(label).to_tensor()
test_label = test_rdf.column(label).to_tensor()

train_dataset = RemoteDataset(inputs=[train_inputs], label=train_label)
test_dataset = RemoteDataset(inputs=[test_inputs], label=test_label)

In [18]:
import torch
from torch.nn import Module, Linear

# A simple LinearRegression model is created to train on the Covid Dataset
# (which has 17 features features and 2 output features
# [1-`has_covid`, 0-`does_not_have_covid`])
class LinearRegression(Module):
    def __init__(self) -> None:
        super().__init__()
        self.layer1 = Linear(17, 2)

    def forward(self, tensor):
        return self.layer1(tensor)


# An instance of our Covid LinearRegression model is created
model = LinearRegression()

In [19]:
from bastionlab.torch.optimizer_config import Adam

# The model is uploaded to the BastionLab Torch service.
remote_learner = connection.client.torch.RemoteLearner(
    model,
    train_dataset,
    max_batch_size=2,
    loss="cross_entropy",
    optimizer=Adam(lr=5e-5),
    model_name="LinearRegression",
)

Sending LinearRegression: 100%|████████████████████| 3.28k/3.28k [00:00<00:00, 1.49MB/s]


In [20]:
# The linear regression model is trained on the dataset here.
remote_learner.fit(nb_epochs=1)

Epoch 1/1 - train: 100%|████████████████████| 25/25 [00:00<00:00, 121.06batch/s, cross_entropy=10.0000 (+/- 563.7264)]  


In [21]:
# The linear regression model is validated with the `test_dataset`
remote_learner.test(test_dataset)

Epoch 1/1 - test: 100%|████████████████████| 25/25 [00:00<00:00, 121.06batch/s, cross_entropy=0.0000 (+/- 563.7264)]  


In [22]:
# The trained is fetched from the BastionLab Torch service.
remote_learner.get_model()

LinearRegression(
  (layer1): Linear(in_features=17, out_features=2, bias=True)
)

Our deep learning model has been trained with privacy garantees! All that's left to do is close our connection:

In [23]:
connection.close()