# Data Cleaning with BastionLab

In this tutorial, we are going to see how to remove unwanted columns, clean null values and duplicates, and replace values in our dataset.

## Pre-requisites

### Technical Requirements

To start this tutorial, ensure the following are already installed on your system:
- Python 3.7 or greater (get the latest version of Python at https://www.python.org/downloads/ or with your operating system’s package manager)
- [Python Pip](https://pypi.org/project/pip/), the Python package manager
- [Docker](https://www.docker.com/) 

*Here's a [Docker tutorial](https://docker-curriculum.com/) to help you set it up on your computer.*

### Pip packages and dataset

In order to run this notebook, you will also need to install Polars and BastionLab, and download the dataset we will be using in this tutorial. You can do all of this by running the following code blocks. Alternatively, you can download the dataset from Kaggle by creating a free user account and following this link: https://www.kaggle.com/competitions/titanic/data

This dataset is based on the Titanic dataset, one of the most popular datasets for learning machine learning, which contains information about the passengers aboard the Titanic. However, it has been modified by data scientist XX to contain some values that need cleaning up before we can start running queries!

In [28]:
! pip install polars
! pip install bastionlab

In [29]:
!wget 'https://raw.githubusercontent.com/chingjunetao/medium-article/master/simple-guide-to-data-cleaning/modified_titanic_data.csv'

## Getting set up
We now need to launch the server using Docker.

In [31]:
!docker run -it -p 50056:50056 --env DISABLE_AUTHENTICATION=1 -d mithrilsecuritysas/bastionlab:latest

The final step is to connect to our server and send it our dataset.

Firstly, we read in the dataset using Polars' read_csv() function, which returns a Polars DataFrame instance containing the dataset.

Secondly, we connect to the server using BastionLab's Connection class.

In [32]:
from bastionlab import Connection
import polars as pl

df = pl.read_csv("modified_titanic_data.csv")

connection = Connection("localhost")
client = connection.client

Finally, we send the Polars DataFrame instance to the server using BastionLab's polars.send_df() method. This returns a RemoteLazyFrame instance, a reference to the uploaded DataFrame, which we will be working with throughout this tutorial.

For the sake of this tutorial, we specify an unsafe policy which disables all checks. We set the `safe_zone` parameter to `TrueRule()` so that every request is considered safe. In this case, the `unsafe_handling` parameter can be anything, as there are no unsafe requests. We set it to `Log()` in the following example.

Note that this is purely done so that we can focus on demonstrating data cleaning in BastionLab without having to worry about approving any data access requests. This policy is not suited for production.


In [33]:
from bastionlab.polars.policy import Policy, TrueRule, Log

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log())
rdf = client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])

rdf

FetchableLazyFrame(identifier=8a745515-1877-4182-ad6a-05be8931addc)

We can now see the list of column names in the dataset by using RemoteLazyFrame's columns attribute.

In [34]:
rdf.columns

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked',
 'OnboardTime']

## Dropping columns

Let's imagine that we don't need the column "Fare". We can simply drop it by using RemoteLazyFrame's drop method, which takes the name of a column or a list of column names as a parameter and returns a new RemoteLazyFrame that no longer includes that column or those columns.

In [35]:
rdf = rdf.drop("Fare")
rdf.columns

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Cabin',
 'Embarked',
 'OnboardTime']
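
To make the semantics of dropping columns concrete, here is a minimal plain-Python sketch that runs locally without a server. It is not BastionLab code: `drop_columns` is a hypothetical helper and the sample row is made up. It only illustrates that dropping accepts one name or a list of names and returns a new object, which is why the cell above reassigns `rdf`.

```python
def drop_columns(rows, columns):
    """Return new rows without `columns` (a single name or a list of names)."""
    names = {columns} if isinstance(columns, str) else set(columns)
    return [{k: v for k, v in row.items() if k not in names} for row in rows]


# Hypothetical one-row sample, for illustration only.
rows = [{"Name": "A", "Fare": 7.25, "Cabin": None}]

without_fare = drop_columns(rows, "Fare")
without_both = drop_columns(rows, ["Fare", "Cabin"])

print(list(without_fare[0]))  # ['Name', 'Cabin']
print(list(without_both[0]))  # ['Name']
print(list(rows[0]))          # ['Name', 'Fare', 'Cabin'] (original is unchanged)
```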

As you can see, the "Fare" column is no longer in our rdf RemoteLazyFrame instance.

## Cleaning null values

The next problem we want to address is null values in the dataset. First of all, let's see how many null values we have in the 'Age' column. We can do this by selecting the column 'Age' and using the is_null() method. This flags each cell in the 'Age' column as true if it was null and false if it was not. When summed, each true counts as 1 and each false as 0, so we can simply use sum() to get the total number of null values in the 'Age' column.

Note that to access and show the data in this RemoteLazyFrame, we always need to use the collect() and fetch() methods. If this triggers a request for the data owner's approval, respond by entering 'y' in the terminal running the Docker image to accept it.

In [36]:
total_nulls = rdf.select(pl.col("Age").is_null().sum())

total_nulls.collect().fetch()

Age
u32
178
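
The flag-then-sum idea behind `is_null().sum()` can be sketched in plain Python. This is only a local illustration on a made-up sample, not the BastionLab or Polars implementation.

```python
# Hypothetical sample of the "Age" column; None stands for a null cell.
ages = ["22", None, "38", None, "26"]

# Step 1: flag each cell, True where the value is missing.
null_flags = [age is None for age in ages]

# Step 2: sum the flags; True counts as 1 and False as 0.
total_nulls = sum(null_flags)

print(null_flags)   # [False, True, False, True, False]
print(total_nulls)  # 2
```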


## Replacing null values

Now that we know there are 178 null values in the 'Age' column, the next question is what we want to do with them. One way to deal with null values is to replace them with another value. Here, we will use the fill_null function to replace all null 'Age' cells with the string "100" (the 'Age' column currently holds strings).

We can then verify that this has worked by checking how many null values we have in our new RemoteLazyFrame instance called swap.

In [37]:
swap = rdf.fill_null("100")
total_nulls = swap.select(pl.col("Age").is_null().sum())

total_nulls.collect().fetch()

Age
u32
0


Let's also check how many cells contain 100! We can do this by filtering the values in 'Age' down to those equal to 100 and then counting all the cells in that column. The output is of course now 178!

In [38]:
total_100s = swap.filter(pl.col("Age") == "100").select(pl.col("Age").count())

total_100s.collect().fetch()

Age
u32
178
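
The two checks above (no nulls left, and as many "100" cells as there were nulls) can be reproduced on a small made-up sample in plain Python. This is a sketch under assumptions: `fill_null_in_column` is a hypothetical helper that fills a single column. The `fill_null` call in the tutorial is applied to the whole frame, so it may also fill nulls in other columns.

```python
def fill_null_in_column(rows, column, value):
    """Return new rows where missing cells in `column` are replaced by `value`."""
    return [
        {**row, column: value if row[column] is None else row[column]}
        for row in rows
    ]


# Hypothetical sample rows; None stands for a null cell.
rows = [
    {"Name": "A", "Age": "22", "Cabin": None},
    {"Name": "B", "Age": None, "Cabin": "C85"},
    {"Name": "C", "Age": None, "Cabin": None},
]

filled = fill_null_in_column(rows, "Age", "100")

remaining_nulls = sum(row["Age"] is None for row in filled)
total_100s = sum(row["Age"] == "100" for row in filled)

print(remaining_nulls)  # 0
print(total_100s)       # 2
```

In this sketch the 'Cabin' nulls are deliberately left alone, which is the behaviour to aim for if only 'Age' should be changed.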


## Converting column types

As you may have noticed, our "Age" column contains strings, not integers. If we want to change that, we can use the .cast() method with strict set to False to convert our string values to numerical ones!

In [39]:
swap = swap.with_column(pl.col("Age").cast(pl.Int64, strict=False))

total_num_100s = swap.filter(pl.col("Age") == 100).select(pl.col("Age").count())

total_num_100s.collect().fetch()

Age
u32
178
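
With `strict=False`, a value that cannot be converted becomes null instead of raising an error. The plain-Python sketch below mimics that behaviour with a hypothetical `cast_to_int` helper on made-up values. Python's `int()` parsing rules are not identical to Polars', so treat it as an illustration of the idea only.

```python
def cast_to_int(value, strict=True):
    """Convert a string to int; with strict=False, failures become None."""
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        if strict:
            raise
        return None


# Hypothetical sample of the "Age" column after filling nulls.
ages = ["22", "100", "unknown", None]

cast_ages = [cast_to_int(age, strict=False) for age in ages]
print(cast_ages)  # [22, 100, None, None]
```

A non-strict cast can therefore introduce new nulls, so it is worth re-counting nulls after converting a column.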


## Deleting null values

Another method for handling null values is simply to delete the rows that contain them. We can do this by using RemoteLazyFrame's drop_nulls method.

As you can see, our drop instance of the original rdf RemoteLazyFrame now also has zero null values.

In [40]:
drop = rdf.drop_nulls()

total_nulls = drop.select(pl.col("Age").is_null().sum())

total_nulls.collect().fetch()

Age
u32
0
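
Called without arguments, Polars' `drop_nulls` removes every row that has a null in any column, not only in 'Age'. Polars also accepts a `subset` argument to restrict the check to given columns. Whether RemoteLazyFrame forwards `subset` is an assumption not confirmed by this tutorial. The plain-Python sketch below, using a hypothetical `drop_nulls` helper and made-up rows, shows the difference.

```python
def drop_nulls(rows, subset=None):
    """Keep only rows with no missing value in `subset` (default: all columns)."""
    kept = []
    for row in rows:
        columns = subset if subset is not None else row.keys()
        if all(row[column] is not None for column in columns):
            kept.append(row)
    return kept


# Hypothetical sample rows; None stands for a null cell.
rows = [
    {"Name": "A", "Age": "22", "Cabin": "C85"},
    {"Name": "B", "Age": None, "Cabin": "C123"},
    {"Name": "C", "Age": "35", "Cabin": None},
]

print(len(drop_nulls(rows)))                  # 1 (only row A has no nulls at all)
print(len(drop_nulls(rows, subset=["Age"])))  # 2 (rows A and C have an age)
```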


## Cleaning duplicates

The next area of data cleaning with BastionLab we will look at is duplicates. We can easily filter down a column or whole dataset to contain only unique cells by using the unique() method.

Let's start by checking how many values there currently are in rdf's 'Sex' column.

In [41]:
count = rdf.select(pl.col("Sex").count()).collect().fetch()

Now, let's use unique() to remove all duplicate values in the dataset. We now see unique values only.

In [42]:
df = rdf.unique()
df.select(pl.col("Sex")).collect().fetch()

Sex
str
"""male"""
"""female"""
"""m"""
"""m """
"""M"""
"""F"""
"""f"""


This leads us to our final cleaning topic: how to map all alternative forms of one value, e.g. "m", "M" and "male" for male, to a single value.

One way to achieve this is to use a Polars "when-then-otherwise" expression to replace the alternative forms of "male" and "female" with one chosen form each.

As you can see, we now only have "male" or "female" in our output.

In [43]:
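new_rdf = (
    df.select(
        # Map every alternative spelling of "male" to "male"
        pl.when(pl.col("Sex") == "M")
        .then("male")
        .when(pl.col("Sex") == "m")
        .then("male")
        .when(pl.col("Sex") == "m ")
        .then("male")
        # Map every alternative spelling of "female" to "female"
        .when(pl.col("Sex") == "F")
        .then("female")
        .when(pl.col("Sex") == "f")
        .then("female")
        # Leave all other values untouched
        .otherwise(pl.col("Sex"))
    )
    .collect()
    .fetch()
)
new_rdf

literal
str
"""male"""
"""female"""
"""male"""
"""male"""
"""male"""
"""female"""
"""female"""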
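
Listing every variant by hand is easy to get wrong, because a single missed spelling stays in the data. As an alternative way to think about the mapping, the plain-Python sketch below normalises whitespace and case first and then looks the result up in a small table. `CANONICAL` and `normalise_sex` are hypothetical names, and this is a local illustration rather than BastionLab code.

```python
# Hypothetical lookup table from normalised spellings to the chosen form.
CANONICAL = {"m": "male", "male": "male", "f": "female", "female": "female"}


def normalise_sex(value):
    """Map a raw 'Sex' value to its canonical form; unknown values pass through."""
    if value is None:
        return None
    key = value.strip().lower()  # "m " -> "m", "M" -> "m"
    return CANONICAL.get(key, value)


raw = ["male", "female", "m", "m ", "M", "F", "f"]
print([normalise_sex(value) for value in raw])
# ['male', 'female', 'male', 'male', 'male', 'female', 'female']
```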


So that concludes our data cleaning tutorial. All that's left to do now is to close your connection to the server!

In [44]:
connection.close()
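
If a cell fails before `connection.close()` is reached, the connection stays open. One generic way to guard against that is the standard library's `contextlib.closing`, which calls `close()` on any object when the block exits. The sketch below uses `FakeConnection`, a hypothetical stand-in class, so it runs without a server. It assumes only that the real object exposes `close()`, as used in the cell above.

```python
from contextlib import closing


class FakeConnection:
    """Hypothetical stand-in for a server connection; only `close()` matters here."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


conn = FakeConnection()
try:
    with closing(conn):
        raise RuntimeError("query failed")  # simulate an error mid-session
except RuntimeError:
    pass

print(conn.closed)  # True
```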