# Building a deep learning model with a Covid-19 dataset

AI models are very good at preventing accidents or diagnosing medical conditions. They can analyse pixels systematically in ways our eyes can't and avoid many human errors. But to be trained at all, they need massive amounts of data, which is often very sensitive because it holds a lot of private information (*name, age, sex, address, previous medical conditions...*).

Take the Covid-19 crisis for example: many collaborations couldn't happen because of privacy concerns despite the general urgency.

This is where BastionLab comes in. Our framework offers tools to share datasets and train AI models with security guarantees. It lets data scientists handle datasets remotely and train ML models without ever having access to the full data in the clear.

In this notebook, we'll use a **real-world Covid-19 dataset** to show you how you could **clean datasets**, **run queries** and **train a model** with BastionLab and Torch.

## Pre-requisites
___________________________________________

### Installation and dataset

In order to run this notebook, we need to:
- Have [Python 3.7](https://www.python.org/downloads/) (or greater) and [Python Pip](https://pypi.org/project/pip/) installed
- Install [BastionLab](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/)
- Download [the dataset](https://raw.githubusercontent.com/rinbaruah/COVID_preconditions_Kaggle/master/Data/covid.csv) we will be using in this tutorial.

We'll do so by running the code block below. 

>You can see our [Installation page](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/) to find the installation method that best suits your needs.

In [None]:
# pip packages
!pip install bastionlab
!pip install bastionlab_server

# download the dataset
!wget 'https://raw.githubusercontent.com/rinbaruah/COVID_preconditions_Kaggle/master/Data/covid.csv'

This dataset collects data on outcomes of Covid cases based on various pre-conditions such as asthma and diabetes. This is a version of a huge dataset provided by the Mexican government based on the population of Mexico, so the insights gained from it may not be valid for other geographical areas in the world.

### Launch and connect to the server

In [269]:
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()

BastionLab server (version 0.3.5) already installed
Libtorch (version 1.12.1) already installed
TLS certificates already generated
Bastionlab server is now running on port 50056


>*Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our [Installation Tutorial](../getting-started/installation.md) for more details.*

In [270]:
# connect to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client

### Upload the dataframe to the server

We'll quickly upload the dataset to the server with an open safety policy, since setting up BastionLab is not the focus of this tutorial. It will allow us to demonstrate features without having to approve any data access requests. You can check out how to define a safe privacy policy [here](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/defining_policy_privacy/).

In [None]:
import polars as pl
from bastionlab.polars.policy import Policy, TrueRule, Log

df = pl.read_csv("covid.csv")

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=True)
rdf = client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])

rdf

<div class="warning">
<b>This policy is not suitable for production.</b> Please note that we <i>only</i> use it for demonstration purposes, to avoid having to approve any data access requests in the tutorial. <br></div> <br>

We'll check that we're properly connected and that we have the authorizations by running a simple query:

In [272]:
rdf.schema

{'id': polars.datatypes.Utf8,
 'sex': polars.datatypes.Int64,
 'patient_type': polars.datatypes.Int64,
 'entry_date': polars.datatypes.Utf8,
 'date_symptoms': polars.datatypes.Utf8,
 'date_died': polars.datatypes.Utf8,
 'intubed': polars.datatypes.Int64,
 'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Int64,
 'pregnancy': polars.datatypes.Int64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'contact_other_covid': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64,
 'icu': polars.datatypes.Int64}

## Data cleaning
_______________________________________________________________

Let's start by preparing our dataset.

### Dropping columns

First, let's use the `drop` method to remove the columns we don't need to train our model: `entry_date`, `date_symptoms`, `date_died`, `patient_type`, `sex` and `id`.

In [273]:
# Drop the columns that won't be used as features for our model
rdf = rdf.drop(
    ["entry_date", "date_symptoms", "date_died", "patient_type", "sex", "id"]
)

### Checking dtypes

We want to make sure that all categorical columns, like `diabetes`, have an integer dtype. These columns contain `1` when the condition is present (the patient had diabetes), `2` when it is absent (the patient didn't have diabetes), or `97`, `98` or `99` when the value is unknown. Any continuous value, such as `age`, should be represented by a float.

By printing out the schema of our RemoteLazyFrame, we see that `age` is an integer value.

In [274]:
rdf.schema

{'intubed': polars.datatypes.Int64,
 'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Int64,
 'pregnancy': polars.datatypes.Int64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'contact_other_covid': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64,
 'icu': polars.datatypes.Int64}

We will convert the `age` column to a float using Polars' `cast` method.

In [275]:
rdf = rdf.with_column(pl.col("age").cast(pl.Float64, strict=False))

### Handling null/unknown values

This dataset doesn't contain null values as such, but it uses `97`, `98` and `99` to signify "unknown" in its categorical columns.

To decide on the best strategy for handling these values, let's first get a sense of how many there are.

First, we store the names of all the categorical columns in a list. Then we count the values in each column that are `97`, `98` or `99` using Polars' `is_between` method, and express that count as a percentage of the total number of values in the column.

In [276]:
# Get the list of all Int64 columns
int_cols = []
for x in rdf.columns:
    if rdf.schema[x] == pl.datatypes.Int64:
        int_cols.append(x)

# Get the percentage of values in each column that lie between 96 and 100
percent_missing = rdf.select(
    pl.col(x).is_between(96, 100).sum().alias(x) * 100 / pl.col(x).count()
    for x in int_cols
)
percent_missing.collect().fetch()

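| intubed | pneumonia | pregnancy | diabetes | copd | asthma | inmsupr | hypertension | other_disease | cardiovascular | obesity | renal_chronic | tobacco | contact_other_covid | covid_res | icu |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| 78.505371 | 0.001941 | 50.952697 | 0.349628 | 0.308682 | 0.309212 | 0.349452 | 0.321919 | 0.458523 | 0.321566 | 0.31433 | 0.316271 | 0.336568 | 30.891349 | 0.0 | 78.505547 |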
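To make the percentage computation above concrete, here is a minimal plain-Python sketch of the same idea, run locally on a tiny made-up sample rather than on the remote dataset. The `sample` values and the `percent_unknown` helper are hypothetical and only serve as an illustration; they are not part of BastionLab or Polars.

```python
# Hypothetical sample: each key is a column, each list holds that column's values.
sample = {
    "diabetes": [1, 2, 2, 98, 1, 2, 2, 1],
    "icu": [97, 97, 2, 97, 1, 97, 99, 97],
}

# Codes the dataset uses to signify "unknown".
UNKNOWN_CODES = {97, 98, 99}


def percent_unknown(values):
    """Return the percentage of entries that carry one of the "unknown" codes."""
    unknown = sum(1 for v in values if v in UNKNOWN_CODES)
    return unknown * 100 / len(values)


# One percentage per column, mirroring the one-row result of the remote query.
percentages = {name: percent_unknown(values) for name, values in sample.items()}
print(percentages)
```

In this made-up sample, `icu` would be a candidate for dropping while `diabetes` would not, which is the same kind of decision we take on the real columns.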


Since the `intubed`, `pregnancy`, `icu` and `contact_other_covid` columns contain a large share of "unknown" values (between roughly 31% and 79%), we will `drop` them from our dataset.

In [277]:
rdf = rdf.drop(
    ["intubed", "pregnancy", "contact_other_covid", "icu"]
)

Next, we delete any row that holds a value other than `1` or `2` in one of our categorical columns. This removes the remaining `97`, `98` and `99` "unknown" values and ensures there are no other unexpected values.

In [288]:
int_cols = []
for x in rdf.columns:
    if rdf.schema[x] == pl.datatypes.Int64:
        int_cols.append(x)

rdf = rdf.filter(pl.col([x for x in int_cols]).is_between(0, 3))
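The row filter can be pictured with the following plain-Python sketch, assuming a row is kept only when every categorical column holds `1` or `2`. The `rows` data and the `is_clean` helper are hypothetical and run locally; they only illustrate the logic, not how BastionLab executes the query.

```python
# Hypothetical rows: one dict per patient, restricted to two categorical columns.
rows = [
    {"diabetes": 1, "obesity": 2},
    {"diabetes": 98, "obesity": 1},  # unknown diabetes status
    {"diabetes": 2, "obesity": 2},
    {"diabetes": 1, "obesity": 99},  # unknown obesity status
]
categorical = ["diabetes", "obesity"]


def is_clean(row):
    """Keep a row only if all of its categorical values are 1 or 2."""
    return all(row[col] in (1, 2) for col in categorical)


clean_rows = [row for row in rows if is_clean(row)]
print(len(clean_rows), "rows kept out of", len(rows))
```

A single unknown value is enough to discard the whole row, which is why it is worth checking the dataset size again after this step.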

### View dataset size and columns

Now that we have finished cleaning our dataset, let's check its schema and row count to confirm it is still large enough to train our model.

In [299]:
rdf.schema

{'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Float64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64}

In [302]:
# Count the values remaining in each column
size = rdf.select(
        pl.col(x).count().alias(x) for x in rdf.columns
)
size.collect().fetch()

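| pneumonia | age | diabetes | copd | asthma | inmsupr | hypertension | other_disease | cardiovascular | obesity | renal_chronic | tobacco | covid_res |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
| 496291 | 496291 | 496291 | 496291 | 496291 | 496291 | 496291 | 496291 | 496291 | 496291 | 496291 | 496291 | 496291 |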


## Data conversion: from DataFrame to Dataset

Now that our data is clean, let's convert the Covid dataset to a trainable dataset on BastionLab. 

Our goal is to transform the columns so they can be converted into tensors. Each column is handled as a `RemoteSeries`, a pointer to a column in the `RemoteDataFrame`.

Before doing so, we shuffle the `RemoteDataFrame` and split it into a training set and a testing set with `train_test_split`, keeping 20% of the rows for testing.

In [None]:
from bastionlab.polars import train_test_split

# The RemoteDataFrame is shuffled and split into training and testing dataset.
train_rdf, test_rdf = train_test_split(
    rdf,
    test_size=0.2,
    shuffle=True,
)
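As a rough local illustration of what a shuffled 80/20 split does, here is a plain-Python sketch. The `split_rows` helper, its `seed` parameter and its rounding rule are assumptions made for this example; BastionLab's `train_test_split` runs on the server and may shuffle and round differently.

```python
import random


def split_rows(rows, test_size=0.2, seed=0):
    """Shuffle a copy of `rows`, then hold out a `test_size` fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    # The first `n_test` shuffled rows become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for 10 patient rows
train_rows, test_rows = split_rows(rows)
print(len(train_rows), "training rows,", len(test_rows), "testing rows")
```

Every row ends up in exactly one of the two sets, so the model is later evaluated on rows it never saw during training.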

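Next, we convert both splits into tensors. `to_tensor` turns the selected columns into a `RemoteTensor`, which we cast from `int64` to `float32`. We then select the `covid_res` column as the label and wrap the inputs and labels in a `RemoteDataset`.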

In [None]:
import torch
from bastionlab.torch.remote_torch import RemoteDataset

# Get all columns of the RemoteDataFrame
cols = rdf.column_names

# Transform RemoteDataFrame to RemoteTensor
train_inputs = train_rdf.columns(cols).to_tensor()
test_inputs = test_rdf.columns(cols).to_tensor()

# Cast RemoteTensor from `int64` to `float32`
train_inputs = train_inputs.to(torch.float32)
test_inputs = test_inputs.to(torch.float32)

label = "covid_res"

# The label tensor is selected from the remote_tensors storage
train_label = train_rdf.column(label).to_tensor()
test_label = test_rdf.column(label).to_tensor()

train_dataset = RemoteDataset(inputs=[train_inputs], label=train_label)
test_dataset = RemoteDataset(inputs=[test_inputs], label=test_label)

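We then define the model we want to train: a simple PyTorch `Module` made of a single `Linear` layer, which maps the input features to two outputs, one per class.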

In [None]:
import torch
from torch.nn import Module, Linear

# A simple LinearRegression model is created to train on the Covid dataset
# (which has 17 input features and 2 output features:
# [1-`has_covid`, 0-`does_not_have_covid`])
class LinearRegression(Module):
    def __init__(self) -> None:
        super().__init__()
        self.layer1 = Linear(17, 2)

    def forward(self, tensor):
        return self.layer1(tensor)


# An instance of our Covid LinearRegression model is created
model = LinearRegression()
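For intuition, a `Linear` layer computes one score per output as a weighted sum of the input features plus a bias. The plain-Python sketch below shows that computation on made-up numbers; the `linear_forward` helper and its parameters are hypothetical and use 3 input features instead of the model's 17 to stay readable.

```python
def linear_forward(weights, bias, features):
    """Compute one score per output: dot(weight_row, features) + bias."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


# Hypothetical parameters: 3 input features, 2 outputs (one weight row per output).
weights = [[0.5, -1.0, 0.0], [1.0, 1.0, 1.0]]
bias = [0.0, -1.0]
features = [2.0, 1.0, 3.0]

scores = linear_forward(weights, bias, features)
print(scores)
```

During training, the weights and biases are the values the optimizer adjusts so that the scores match the labels as closely as possible.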

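To train the model remotely, we first upload it to the BastionLab Torch service by creating a `RemoteLearner`. We give it the model, the training dataset, the maximum batch size, the loss function (`cross_entropy`), the optimizer configuration (`Adam` with a learning rate of `5e-5`) and a name for the model.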

In [None]:
from bastionlab.torch.optimizer_config import Adam

# The model is uploaded to the BastionLab Torch service.
remote_learner = connection.client.torch.RemoteLearner(
    model,
    train_dataset,
    max_batch_size=2,
    loss="cross_entropy",
    optimizer=Adam(lr=5e-5),
    model_name="LinearRegression",
)

Sending LinearRegression: 100%|████████████████████| 3.28k/3.28k [00:00<00:00, 1.49MB/s]
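To give a sense of what a cross-entropy loss measures, here is a plain-Python sketch of the standard softmax cross-entropy for a single sample. It assumes the `"cross_entropy"` option corresponds to this usual definition; the `cross_entropy` helper below is written for illustration and is not BastionLab's implementation.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample: -log(softmax(logits)[target])."""
    max_logit = max(logits)  # subtracting the max keeps exp() numerically stable
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target]


# Equal scores: the model is undecided between the two classes.
loss_uncertain = cross_entropy([0.0, 0.0], target=1)
# A much higher score for the correct class gives a smaller loss.
loss_confident = cross_entropy([0.0, 5.0], target=1)
# The same scores with the other class as target give a large loss.
loss_wrong = cross_entropy([0.0, 5.0], target=0)

print(loss_uncertain, loss_confident, loss_wrong)
```

The loss shrinks as the score of the correct class grows relative to the other one, which is what the optimizer tries to achieve during training.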


We can now train the model remotely by calling `fit` on the `RemoteLearner`, here for a single epoch.

In [None]:
# The linear regression model is trained on the training dataset for one epoch.
remote_learner.fit(nb_epochs=1)

Epoch 1/1 - train: 100%|████████████████████| 25/25 [00:00<00:00, 121.06batch/s, cross_entropy=10.0000 (+/- 563.7264)]  


Once training is done, we evaluate the model on the held-out `test_dataset` with the `test` method.

In [None]:
# The linear regression model is validated with the `test_dataset`
remote_learner.test(test_dataset)

Epoch 1/1 - test: 100%|████████████████████| 25/25 [00:00<00:00, 121.06batch/s, cross_entropy=0.0000 (+/- 563.7264)]  


Finally, we fetch the trained model from the BastionLab Torch service with `get_model`.

In [None]:
# The trained model is fetched from the BastionLab Torch service.
remote_learner.get_model()

LinearRegression(
  (layer1): Linear(in_features=17, out_features=2, bias=True)
)

Our deep learning model has been trained with privacy guarantees! All that's left to do is close our connection:

In [None]:
connection.close()