# Data Cleaning with BastionLab

In this tutorial, we are going to see how to remove unwanted columns, clean null values and duplicates, and replace values in our dataset.

## Pre-requisites

### Technical Requirements

To start this tutorial, ensure the following are already installed on your system:
- Python 3.7 or greater (get the latest version of Python at https://www.python.org/downloads/ or with your operating system’s package manager)
- [pip](https://pypi.org/project/pip/), the Python package installer
- [Docker](https://www.docker.com/)

*Here's a [Docker tutorial](https://docker-curriculum.com/) to help you set it up on your computer.*

## Pip packages and dataset

In order to run this notebook, you will also need to install Polars and BastionLab and download the dataset we will be using in this tutorial. You can do all of this by running the following code block. Alternatively, you can download the dataset from Kaggle by following this link: https://www.kaggle.com/competitions/titanic/data and creating a free user account.

This dataset is based on the Titanic dataset, one of the most popular datasets for learning machine learning, which contains information about the passengers aboard the Titanic. However, it has been modified by data scientist XX to contain some values that need cleaning up before we can start running queries!

In [17]:
!pip install polars
!pip install bastionlab
!wget 'https://raw.githubusercontent.com/chingjunetao/medium-article/master/simple-guide-to-data-cleaning/modified_titanic_data.csv'

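Before moving on, it can be worth confirming that the download produced the CSV we expect rather than an error page. The sketch below uses only the standard library. The helper name `has_expected_columns` and the columns chosen for the check are assumptions made for illustration; they are not part of BastionLab or Polars.

```python
import csv


def has_expected_columns(path, expected):
    """Return True if the first row of the CSV at `path` contains every name in `expected`."""
    with open(path, newline="", encoding="utf-8") as f:
        # An empty file yields an empty header instead of raising StopIteration.
        header = next(csv.reader(f), [])
    return set(expected).issubset(header)


# Hypothetical usage once the file is in the working directory:
# has_expected_columns("modified_titanic_data.csv", ["PassengerId", "Age", "Sex"])
```

If the check returns False, the file is probably not the dataset, and it is better to find out now than when a later query fails.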

## Getting set up
First, we need to create a key pair, which will be used for authentication. This code block generates a private and a public key and stores each in its own directory: privkey and pubkey.

In [18]:
from bastionlab import SigningKey

!mkdir -p pubkey
!mkdir -p privkey

data_owner_key = SigningKey.from_pem_or_generate("./privkey/data_owner.key.pem")
data_owner_pubkey = data_owner_key.pubkey.save_pem("./pubkey/data_owner.pem")

The next step is to launch the server using Docker. We will run the Docker image in interactive mode so that we can accept any query requests that do not automatically pass our default privacy policy.

## !Important:
Please run the following command in a separate terminal so that you can respond to requests interactively whilst running the rest of the code in the notebook:

In [19]:
#!Important: Run in separate terminal:
# docker run -it -p 50056:50056 --mount type=bind,source=$(pwd)/pubkey,target=/app/bin/keys mithrilsecuritysas/bastionlab:latest

The final stage of getting set up (I promise!) is to connect to our server using our private key and send our dataset to the server. We will do this in three key steps:
1 - Reading in the dataset using Polars' read_csv() function, which returns a Polars DataFrame instance containing the dataset.
2 - Connecting to the server by creating a BastionLab Connection instance.
3 - Sending the Polars DataFrame instance to the server using BastionLab's send_df() method, which returns a RemoteLazyFrame instance: a reference to the DataFrame uploaded.

In [20]:
from bastionlab import Connection
import polars as pl

df = pl.read_csv("modified_titanic_data.csv")

connection = Connection("localhost", 50056, signing_key=data_owner_key)
client = connection.client

rdf = client.send_df(df, blacklist="Name")

rdf

FetchableLazyFrame(identifier=64fc37cb-456b-4564-a312-a92ea805b825)

We can now see the list of column names in the dataset by using RemoteLazyFrame's columns attribute.

In [21]:
rdf.columns

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked',
 'OnboardTime']

## Dropping columns

Let's imagine that we don't need the column "Fare". We can simply drop it by using RemoteLazyFrame's drop method, which takes the name of a column or a list of column names as a parameter and returns a RemoteLazyFrame that no longer includes that column or those columns.

In [22]:
rdf = rdf.drop("Fare")
rdf.columns

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Cabin',
 'Embarked',
 'OnboardTime']

As you can see, "Fare" is no longer in our rdf RemoteLazyFrame instance.

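As a rough mental model of the "one name or a list of names" parameter, here is a plain-Python sketch that works on a list of column names. The function `drop_columns` is hypothetical and written only for illustration; it does not show how BastionLab implements drop.

```python
def drop_columns(columns, to_drop):
    """Return the column names that remain after dropping one name or a list of names."""
    # Accept either a single name or an iterable of names.
    names = {to_drop} if isinstance(to_drop, str) else set(to_drop)
    return [name for name in columns if name not in names]


columns = ["PassengerId", "Survived", "Fare", "Cabin"]
print(drop_columns(columns, "Fare"))             # prints ['PassengerId', 'Survived', 'Cabin']
print(drop_columns(columns, ["Fare", "Cabin"]))  # prints ['PassengerId', 'Survived']
```

The input list is left unchanged and a new one is returned, which mirrors why the tutorial reassigns the result of drop back to rdf.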
## Cleaning null values

The next problem we want to address is null values in the dataset. First of all, let's see how many null values we have in the "Age" column. We can do this by selecting the 'Age' column and using the is_null() method. This outputs a RemoteLazyFrame in which each cell of the 'Age' column that was null is replaced by true (counted as 1) and each cell that was not null is replaced by false (counted as 0). We can therefore simply use sum() to add up these values, which gives us the total number of null values in the 'Age' column.

Note that to access and show the data in this RemoteLazyFrame, we always need to use the collect() and fetch() methods. In this case, this will trigger a request for the data owner's approval. Please respond by entering 'y' in the terminal running the Docker image to accept it.

In [23]:
total_nulls = rdf.select(pl.col("Age").is_null().sum())

total_nulls.collect().fetch()

Reason: Only 1 subrules matched but at least 2 are required.
Failed sub rules are:
Rule #1: Cannot fetch a DataFrame that does not aggregate at least 10 rows of the initial dataframe uploaded by the data owner.

A notification has been sent to the data owner. The request will be pending until the data owner accepts or denies it or until timeout seconds elapse.
The query has been accepted by the data owner.


Age
u32
178

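To make the is_null() and sum() combination concrete, here is a plain-Python sketch of the same idea on a toy list. The sample values are invented for illustration, and None stands in for null.

```python
# A toy stand-in for the "Age" column; None plays the role of null.
ages = ["22", None, "38", None, "35"]

# is_null(): one boolean per cell, True where the cell is null.
null_mask = [age is None for age in ages]

# sum(): True counts as 1 and False as 0, so the sum is the number of nulls.
total_nulls = sum(null_mask)
print(total_nulls)  # prints 2
```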

## Replacing null values

Now that we know there are 178 null values in the 'Age' column, the next question is what we want to do with them. One way to deal with null values is to replace them with another value. Here, I will use the fill_null method to replace all null Age cells with the string value "100".

We can now verify that this has worked by checking how many null values we have in our new RemoteLazyFrame instance called swap.

In [24]:
swap = rdf.fill_null("100")
total_nulls = swap.select(pl.col("Age").is_null().sum())

total_nulls.collect().fetch()

Reason: Only 1 subrules matched but at least 2 are required.
Failed sub rules are:
Rule #1: Cannot fetch a DataFrame that does not aggregate at least 10 rows of the initial dataframe uploaded by the data owner.

A notification has been sent to the data owner. The request will be pending until the data owner accepts or denies it or until timeout seconds elapse.
The query has been accepted by the data owner.


Age
u32
0


Let's also check how many cells contain "100"! We can do this by filtering the values in 'Age' down to those equal to "100" and then counting all the cells in that column. The output is, of course, 178!

In [25]:
total_100s = swap.filter(pl.col("Age") == "100").select(pl.col("Age").count())

total_100s.collect().fetch()

Reason: Only 1 subrules matched but at least 2 are required.
Failed sub rules are:
Rule #1: Cannot fetch a DataFrame that does not aggregate at least 10 rows of the initial dataframe uploaded by the data owner.

A notification has been sent to the data owner. The request will be pending until the data owner accepts or denies it or until timeout seconds elapse.
The query has been accepted by the data owner.


Age
u32
178

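One caveat with sentinel values: the count of "100" cells only equals the number of replaced nulls if "100" did not already appear in the column. The plain-Python sketch below uses invented sample values, with None standing in for null, to show both the replacement and that caveat.

```python
# Toy "Age" column that already contains one genuine "100".
ages = ["22", None, "100", None]

# fill_null("100"): replace every null with the sentinel, keep other cells as they are.
filled = ["100" if age is None else age for age in ages]

remaining_nulls = sum(age is None for age in filled)
total_100s = sum(age == "100" for age in filled)

print(remaining_nulls)  # prints 0
print(total_100s)       # prints 3: two replaced nulls plus one genuine "100"
```

Choosing a sentinel that cannot occur naturally, or counting the sentinel before filling, avoids this ambiguity.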

## Converting column types

As you may have noticed, our "Age" column contains strings, not integers. If we wanted to change that, we could use the .cast() method with strict set to False to convert our string values to numerical ones!

In [26]:
swap = swap.with_column(pl.col("Age").cast(pl.Int64, strict=False))

total_num_100s = swap.filter(pl.col("Age") == 100).select(pl.col("Age").count())

total_num_100s.collect().fetch()

Reason: Only 1 subrules matched but at least 2 are required.
Failed sub rules are:
Rule #1: Cannot fetch a DataFrame that does not aggregate at least 10 rows of the initial dataframe uploaded by the data owner.

A notification has been sent to the data owner. The request will be pending until the data owner accepts or denies it or until timeout seconds elapse.
The query has been accepted by the data owner.


Age
u32
178

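In Polars, a cast with strict set to False turns values that cannot be converted into null instead of raising an error. The plain-Python sketch below approximates that behavior. The helper `lenient_int_cast` is hypothetical, and Python's int() parsing rules are not guaranteed to match Polars in every edge case.

```python
def lenient_int_cast(values):
    """Convert each value to int; values that cannot be parsed become None instead of raising."""
    result = []
    for value in values:
        try:
            result.append(int(value))
        except (TypeError, ValueError):
            # Mirrors strict=False: a failed conversion yields null, not an error.
            result.append(None)
    return result


print(lenient_int_cast(["100", "22", "n/a", None]))  # prints [100, 22, None, None]
```

A lenient cast can therefore introduce new nulls, so it is worth counting nulls again afterwards.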

## Deleting null values

Another method for handling null values is simply to delete the rows that contain them. We can do this by using RemoteLazyFrame's drop_nulls method.

As you can see, the drop instance derived from the original rdf RemoteLazyFrame now also has zero null values.

In [27]:
drop = rdf.drop_nulls()

total_nulls = drop.select(pl.col("Age").is_null().sum())

total_nulls.collect().fetch()

Reason: Only 1 subrules matched but at least 2 are required.
Failed sub rules are:
Rule #1: Cannot fetch a DataFrame that does not aggregate at least 10 rows of the initial dataframe uploaded by the data owner.

A notification has been sent to the data owner. The request will be pending until the data owner accepts or denies it or until timeout seconds elapse.
The query has been accepted by the data owner.


Age
u32
0

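It helps to keep in mind that in Polars, drop_nulls() called without arguments removes every row that has a null in any column, not only in 'Age'. Assuming RemoteLazyFrame mirrors that behavior, the plain-Python sketch below shows the effect on a few invented rows.

```python
# Toy rows; None plays the role of null.
rows = [
    {"Age": "22", "Cabin": None},
    {"Age": None, "Cabin": "C85"},
    {"Age": "35", "Cabin": "C123"},
]

# Keep a row only if none of its cells is null.
complete_rows = [row for row in rows if all(value is not None for value in row.values())]

print(len(complete_rows))  # prints 1
```

Only one of the three rows survives even though 'Age' is null in just one of them, so this approach can discard far more data than expected.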

## Cleaning duplicates

The next area of data cleaning with BastionLab we will look at is duplicates. We can easily filter down a column or whole dataset to contain only unique cells by using the unique() method.

Let's start by checking how many values there currently are in rdf's 'Sex' column.

In [28]:
count = rdf.select(pl.col("Sex").count()).collect().fetch()

Reason: Only 1 subrules matched but at least 2 are required.
Failed sub rules are:
Rule #1: Cannot fetch a DataFrame that does not aggregate at least 10 rows of the initial dataframe uploaded by the data owner.

A notification has been sent to the data owner. The request will be pending until the data owner accepts or denies it or until timeout seconds elapse.
The query has been accepted by the data owner.


Now, let's use unique() to remove duplicate values. The output now contains unique values only.

In [29]:
# Keep only the 'Sex' column, then remove duplicate values.
df = rdf.select(pl.col("Sex")).unique()
df.collect().fetch()

Reason: Only 1 subrules matched but at least 2 are required.
Failed sub rules are:
Rule #1: Cannot fetch a DataFrame that does not aggregate at least 10 rows of the initial dataframe uploaded by the data owner.

A notification has been sent to the data owner. The request will be pending until the data owner accepts or denies it or until timeout seconds elapse.
The query has been accepted by the data owner.


Sex
str
"""male"""
"""female"""
"""m"""
"""m """
"""M"""
"""F"""
"""f"""

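Deduplication compares strings exactly, which is why "m" and "m " (with a trailing space) both appear in the output above. Here is a plain-Python sketch of order-preserving deduplication on invented sample values; it illustrates the idea and is not BastionLab code.

```python
# Toy "Sex" column with repeated and inconsistently written values.
sex = ["male", "female", "m", "m ", "male", "M", "m"]

# dict.fromkeys keeps the first occurrence of each value, in order.
unique_sex = list(dict.fromkeys(sex))

print(unique_sex)  # prints ['male', 'female', 'm', 'm ', 'M']
```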

This leads us to our final cleaning topic: how to map all alternative forms of one value, e.g. "m", "M" and "male" for male, to a single value.

One way to achieve this is to use a Polars "when-then-otherwise" expression to replace the alternative forms of "male" and "female" with one chosen form each.

As you can see, we now only have "male" or "female" in our output.

In [30]:
# Map every alternative spelling to "male" or "female"; other values pass through unchanged.
new_rdf = (
    df.select(
        pl.when(pl.col("Sex") == "M")
        .then("male")
        .when(pl.col("Sex") == "m")
        .then("male")
        .when(pl.col("Sex") == "m ")
        .then("male")
        .when(pl.col("Sex") == "F")
        .then("female")
        .when(pl.col("Sex") == "f")
        .then("female")
        .otherwise(pl.col("Sex"))
    )
    .collect()
    .fetch()
)
new_rdf

Reason: Only 1 subrules matched but at least 2 are required.
Failed sub rules are:
Rule #1: Cannot fetch a DataFrame that does not aggregate at least 10 rows of the initial dataframe uploaded by the data owner.

A notification has been sent to the data owner. The request will be pending until the data owner accepts or denies it or until timeout seconds elapse.
The query has been accepted by the data owner.


literal
str
"""male"""
"""female"""
"""male"""
"""male"""
"""male"""
"""female"""
"""female"""

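Listing every spelling by hand gets tedious as the number of variants grows. A common alternative is to normalize whitespace and case first and then look the result up in a small table. The plain-Python sketch below shows that idea; `SEX_ALIASES` and `normalize_sex` are hypothetical names introduced here for illustration, and this is not code that runs on the BastionLab server.

```python
# Lookup table from normalized spelling to the canonical value.
SEX_ALIASES = {"m": "male", "male": "male", "f": "female", "female": "female"}


def normalize_sex(value):
    """Map variants such as "M" or "m " to a canonical value; leave unknown values unchanged."""
    if value is None:
        return None
    key = value.strip().lower()
    return SEX_ALIASES.get(key, value)


raw = ["male", "female", "m", "m ", "M", "F", "f"]
print([normalize_sex(v) for v in raw])
# prints ['male', 'female', 'male', 'male', 'male', 'female', 'female']
```

Unknown values are passed through untouched, so anything unexpected stays visible for a later inspection instead of being silently rewritten.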

So that concludes our data cleaning tutorial. All that's left to do now is to close your connection to the server!

In [31]:
connection.close()