# Building a machine learning model with a Covid-19 dataset

AI models can be very effective at tasks such as preventing accidents or diagnosing medical conditions. They can analyse pixels systematically in ways our eyes can't and avoid many human errors. But to be trained at all, they need massive amounts of data, which is often very sensitive because it holds a lot of private information (*name, age, sex, address, previous medical conditions...*).

Take the Covid-19 crisis for example: many collaborations couldn't happen because of privacy concerns despite the general urgency.

This is where BastionLab comes in. Our framework offers tools to share datasets and train AI models with security guarantees. It lets data scientists handle datasets remotely and train ML models without ever having access to the full data in the clear.

In this notebook, we'll use a **real-world Covid-19 dataset** to show you how you could **clean a dataset**, **run queries** and **train a machine learning model** with BastionLab.

## Pre-requisites
___________________________________________

### Installation and dataset

In order to run this notebook, we need to:
- Have [Python 3.7](https://www.python.org/downloads/) (or greater) and [Python Pip](https://pypi.org/project/pip/) installed
- Install [BastionLab](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/)
- Download [the dataset](https://raw.githubusercontent.com/rinbaruah/COVID_preconditions_Kaggle/master/Data/covid.csv) we will be using in this tutorial.

We'll do so by running the code block below. 

>You can see our [Installation page](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/) to find the installation method that best suits your needs.

In [1]:
# # pip packages
# !pip install bastionlab
# !pip install bastionlab_server

# # download the dataset
# !wget 'https://raw.githubusercontent.com/rinbaruah/COVID_preconditions_Kaggle/master/Data/covid.csv'

This dataset collects data on outcomes of Covid cases based on various pre-conditions such as asthma and diabetes. This is a version of a huge dataset provided by the Mexican government based on the population of Mexico, so the insights gained from it may not be valid for other geographical areas in the world.

### Launch and connect to the server

In [2]:
# # launch bastionlab_server test package
# import bastionlab_server

# srv = bastionlab_server.start()

>*Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our [Installation Tutorial](../getting-started/installation.md) for more details.*

In [3]:
# connect to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client


### Upload the dataframe to the server

We'll quickly upload the dataset to the server with an open safety policy, since setting up BastionLab is not the focus of this tutorial. This will allow us to demonstrate features without having to approve any data access requests. You can check out how to define a safe privacy policy [here](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/defining_policy_privacy/).

In [4]:
import polars as pl
from bastionlab.polars.policy import Policy, TrueRule, Log

df = pl.read_csv("covid.csv")

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=True)
rdf = client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])

rdf

FetchableLazyFrame(identifier=01a46f98-8e57-4030-a7d8-f03e8da486f3)

<div class="warning">
<b>This policy is not suitable for production.</b> Please note that we <i>only</i> use it for demonstration purposes, to avoid having to approve any data access requests in the tutorial. <br></div> <br>

We'll check that we're properly connected and that we have the authorizations by running a simple query:

In [5]:
rdf.schema

{'id': polars.datatypes.Utf8,
 'sex': polars.datatypes.Int64,
 'patient_type': polars.datatypes.Int64,
 'entry_date': polars.datatypes.Utf8,
 'date_symptoms': polars.datatypes.Utf8,
 'date_died': polars.datatypes.Utf8,
 'intubed': polars.datatypes.Int64,
 'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Int64,
 'pregnancy': polars.datatypes.Int64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'contact_other_covid': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64,
 'icu': polars.datatypes.Int64}

## Data cleaning
_______________________________________________________________

Let's start by preparing our dataset.

### Dropping columns

First, let's use the `drop` method to remove the columns we don't need for training our model: `entry_date`, `date_symptoms`, `date_died`, `patient_type`, `sex` and `id`.

In [6]:
# Dropping columns that don't influence our model
rdf = rdf.drop(
    ["entry_date", "date_symptoms", "date_died", "patient_type", "sex", "id"]
)

### Checking dtypes

We want to make sure that all categorical columns like `diabetes` have an integer dtype. These columns contain either `2` to represent true (the patient did have diabetes), `1` to represent false (the patient didn't have diabetes), or `97`, `98` or `99`, which are used to represent `unknown`. Any continuous value such as `age` should be represented by a float.

By printing out the schema of our RemoteLazyFrame, we see that `age` is an integer value.

In [7]:
rdf.schema

{'intubed': polars.datatypes.Int64,
 'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Int64,
 'pregnancy': polars.datatypes.Int64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'contact_other_covid': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64,
 'icu': polars.datatypes.Int64}

### Handling null/unknown values

In the case of this dataset, we don't have null values as such, but we do have the use of `97`, `98` and `99` to signify "unknown" in our categorical columns.

To decide on the best strategy for handling these values, let's first get a sense of how common they are.

For every column, we count the values equal to 97, 98 or 99 using Polars' `is_between` expression, then divide that count by the total number of values in the column to get a percentage. Note that `age` is included in this pass, so its figure reflects real ages in that range rather than unknown codes.

In [8]:
# Get percentage of values in column between 96 and 100
percent_missing = rdf.select(
    pl.col(x).is_between(96, 100).sum().alias(x) * 100 / pl.col(x).count()
    for x in rdf.column_names
)
percent_missing.collect().fetch()

intubed,pneumonia,age,pregnancy,diabetes,copd,asthma,inmsupr,hypertension,other_disease,cardiovascular,obesity,renal_chronic,tobacco,contact_other_covid,covid_res,icu
f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
78.505371,0.001941,0.036534,50.952697,0.349628,0.308682,0.309212,0.349452,0.321919,0.458523,0.321566,0.31433,0.316271,0.336568,30.891349,0.0,78.505547
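
To make the computation above concrete, here is a minimal plain-Python sketch of the same idea on a tiny made-up table. It does not use BastionLab or Polars; the names `toy_columns`, `UNKNOWN_CODES` and `percent_unknown` are hypothetical and only serve the illustration.

```python
# Toy stand-in for the remote dataset: column name -> list of coded values.
# These values are invented for illustration only.
toy_columns = {
    "diabetes": [1, 2, 2, 98, 1, 2, 2, 1],
    "icu": [97, 97, 2, 97, 1, 97, 99, 97],
}

# Codes the dataset uses to mean "unknown".
UNKNOWN_CODES = {97, 98, 99}


def percent_unknown(values):
    """Return the percentage of entries that use an 'unknown' code."""
    unknown = sum(1 for v in values if v in UNKNOWN_CODES)
    return 100 * unknown / len(values)


percent_missing_toy = {
    name: percent_unknown(values) for name, values in toy_columns.items()
}
print(percent_missing_toy)
```

A column where most entries are unknown, like the toy `icu` column here, is a candidate for dropping entirely rather than filtering row by row.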


Since the `intubed`, `pregnancy`, `icu` and `contact_other_covid` columns contain significant amounts of "unknown" values, we will `drop` them from our model.

In [9]:
rdf = rdf.drop(["intubed", "pregnancy", "contact_other_covid", "icu"])

Next, we will delete any row that has a value other than `1` or `2` in one of our categorical columns. This removes all of the unknown 97, 98 and 99 values, whilst also ensuring there are no other unexpected values.

In [10]:
# Keep rows whose categorical values are 1 or 2; `age` is excluded from the check.
# The bounds 0 and 3 are assumed to be exclusive here.
rdf = rdf.filter(
    pl.col([x for x in rdf.column_names if x != "age"]).is_between(0, 3)
)
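
The row filter described above can be sketched in plain Python as follows. This is only an illustration of the rule (keep a row when every categorical value is `1` or `2`, ignoring `age`), not how BastionLab or Polars evaluates it; `toy_rows`, `VALID_CODES` and `keep_row` are hypothetical names.

```python
# Invented rows for illustration: each dict is one patient record.
toy_rows = [
    {"age": 34, "diabetes": 1, "asthma": 2},
    {"age": 97, "diabetes": 2, "asthma": 2},   # 97 is a real age here, not a code
    {"age": 51, "diabetes": 98, "asthma": 1},  # unknown diabetes status
    {"age": 60, "diabetes": 2, "asthma": 3},   # unexpected code
]

# The only values we accept in categorical columns.
VALID_CODES = {1, 2}


def keep_row(row, skip=("age",)):
    """Return True when every non-skipped column holds a valid code."""
    return all(value in VALID_CODES for name, value in row.items() if name not in skip)


clean_rows = [row for row in toy_rows if keep_row(row)]
print(clean_rows)
```

Excluding `age` from the check matters: a 97-year-old patient would otherwise be discarded as if their age were an "unknown" code.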

### View dataset size and columns

Now that we have finished cleaning our dataset, let's take another look at its schema and size to confirm that it is still sufficiently large for training our model.

In [11]:
rdf.schema

{'pneumonia': polars.datatypes.Int64,
 'age': polars.datatypes.Int64,
 'diabetes': polars.datatypes.Int64,
 'copd': polars.datatypes.Int64,
 'asthma': polars.datatypes.Int64,
 'inmsupr': polars.datatypes.Int64,
 'hypertension': polars.datatypes.Int64,
 'other_disease': polars.datatypes.Int64,
 'cardiovascular': polars.datatypes.Int64,
 'obesity': polars.datatypes.Int64,
 'renal_chronic': polars.datatypes.Int64,
 'tobacco': polars.datatypes.Int64,
 'covid_res': polars.datatypes.Int64}

In [12]:
# count the remaining rows in each column
size = rdf.select(pl.col(x).count().alias(x) for x in rdf.column_names)
size.collect().fetch()

pneumonia,age,diabetes,copd,asthma,inmsupr,hypertension,other_disease,cardiovascular,obesity,renal_chronic,tobacco,covid_res
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
496291,496291,496291,496291,496291,496291,496291,496291,496291,496291,496291,496291,496291


### Transform label column to binary data

In the original dataset, the label column used (2) for covid present and (1) for covid absent. The code below transforms this to a binary space (1) for covid present and (0) for covid absent.

In [13]:
label = "covid_res"
rdf = rdf.with_column(pl.when(pl.col(label) == 2).then(1).otherwise(0).alias(label))
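
As a plain-Python illustration of the same when/then/otherwise remapping, following the label encoding stated above, the sketch below maps the positive code to `1` and everything else to `0`. The helper `to_binary_label` and the sample list are hypothetical.

```python
def to_binary_label(code, positive_code=2):
    """Map the original label code to 1 (positive) or 0 (anything else)."""
    return 1 if code == positive_code else 0


# Invented label values using the original encoding.
toy_labels = [2, 1, 2, 1, 1]
binary_labels = [to_binary_label(code) for code in toy_labels]
print(binary_labels)
```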

Here, `collect` is called to run all the data pre-processing steps applied to the RemoteDataFrame so far.

In [14]:
rdf = rdf.collect()

## Data conversion: from DataFrame to Dataset

Now that our data is clean, let's convert the Covid dataset into a dataset that can be used for training on BastionLab.

Here, the Covid dataset is split into a train set and a test set using the `train_test_split` method.
The split is 80% for the train set and 20% for the test set.

In [15]:
from bastionlab.polars import train_test_split

# The RemoteDataFrame is shuffled and split into training and testing dataset.
train_rdf, test_rdf = train_test_split(
    rdf,
    test_size=0.2,
    shuffle=True,
)
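
For intuition, a shuffled 80/20 split can be sketched in plain Python as below. This is a generic illustration under the assumption of a simple shuffle-then-slice strategy; it is not BastionLab's implementation, and `split_rows` and its `seed` parameter are hypothetical.

```python
import random


def split_rows(rows, test_size=0.2, shuffle=True, seed=0):
    """Return (train, test) lists, optionally shuffling with a fixed seed."""
    rows = list(rows)
    if shuffle:
        # A seeded generator keeps the split reproducible between runs.
        random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_size)
    return rows[n_test:], rows[:n_test]


train_rows, test_rows = split_rows(range(10), test_size=0.2)
print(len(train_rows), len(test_rows))
```

Shuffling before slicing avoids a test set made only of the last records in the file, which could be biased if the data is ordered.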

### RemoteDataFrame to RemoteArray conversion

Below, both the train and test input RemoteDataFrames are converted to `RemoteArray`s. This is done because the machine learning models only accept arrays (`ndarray`s).

In [16]:
label = "covid_res"

# Get all columns of the RemoteDataFrame and filter out the label column
cols = list(filter(lambda a: a != label, rdf.column_names))

# Transform RemoteDataFrame to RemoteArray
train_X = train_rdf.select(cols).to_array()
test_X = test_rdf.select(cols).to_array()

The train and test label RemoteDataFrames are also converted to RemoteArrays.

In [17]:
train_y = train_rdf.select(label).to_array()
test_y = test_rdf.select(label).to_array()
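
Conceptually, this conversion turns a column-oriented table into a row-oriented feature matrix plus a separate label vector. The plain-Python sketch below shows that reshaping on an invented table; `toy_frame` and `feature_cols` are hypothetical names, and real `RemoteArray`s stay on the server rather than in local lists.

```python
# Invented column-oriented table: column name -> values.
toy_frame = {
    "age": [34, 60, 45],
    "asthma": [2, 1, 2],
    "covid_res": [1, 0, 1],
}
label = "covid_res"

# Every column except the label is a feature.
feature_cols = [name for name in toy_frame if name != label]

# One inner list per row, one entry per feature column.
X = [list(row) for row in zip(*(toy_frame[name] for name in feature_cols))]
y = list(toy_frame[label])

print(X)
print(y)
```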

## Training a machine learning model with BastionLab

In this section, a simple linear regression model is fitted to the remote Covid dataset.

The input data has 12 features (age, asthma, etc.) and the target is a single binary column (1 for Covid present, 0 for Covid absent).

In [18]:
from bastionlab.linfa.trainers import LinearRegression


lr = LinearRegression()

lr.fit(train_set=train_X, target_set=train_y)

FittedModel(identifier=627d7930-3914-44fd-a97f-9819cdb1d21f)
  └── LinearRegression(fit_intercept=True)
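
To illustrate what fitting a linear regression with an intercept means, here is a minimal single-feature sketch in plain Python, assuming ordinary least squares. It is not the remote trainer's code, and `fit_line` is a hypothetical helper; the real model above works on all feature columns at once.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```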

In order to validate our linear regression model, we call `cross_validate` with `mean_absolute_error` as the scoring metric, which measures the average absolute difference between the predicted outputs and the targets in the original dataset.

In [19]:
from bastionlab.linfa import cross_validate

cross_validate(
    lr, lr.predict(train_X), train_y, cv=5, scoring="mean_absolute_error"
).fetch()

mean_absolute_error
f64
0.465553
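
The two ideas behind this step, mean absolute error and k-fold splitting, can be sketched in plain Python as below. How `cross_validate` partitions the data internally is not described in this tutorial, so the contiguous-fold strategy here is an assumption; `mean_absolute_error` and `k_fold_indices` are hypothetical local helpers.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


mae = mean_absolute_error([1, 0, 1, 1], [0.5, 0.5, 1.0, 0.0])
folds = list(k_fold_indices(10, 5))
print(mae, len(folds))
```

With a binary target, a mean absolute error close to 0.5 means the predictions are, on average, about half a unit away from the true label.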


In [20]:
connection.close()