# Data Cleaning
___________________________________________

*To learn more about why data cleaning matters, you can check Kaggle's course: https://www.kaggle.com/learn/data-cleaning.*

In this tutorial, we are going to see how to **remove unwanted columns**, **clean null values and duplicates**, and **replace values** in our dataset.

## Pre-requisites
___________________________________________

### Technical Requirements

To start this tutorial, ensure the following are already installed on your system:
- Python 3.7 or greater *(get the latest version of Python at https://www.python.org/downloads/ or with your operating system’s package manager)*
- [pip](https://pypi.org/project/pip/), the Python package installer
- [Docker](https://www.docker.com/) 

*Here's a [Docker tutorial](https://docker-curriculum.com/) to help you set it up on your computer.*

### Pip packages and dataset

To run this notebook, you will also need to install Polars and BastionLab, and download the dataset we will be using in this tutorial. You can do all of this by running the following code blocks.

This dataset is based on the [Titanic dataset](https://www.kaggle.com/competitions/titanic/data), one of the most popular datasets used for understanding machine learning, which contains information about the passengers aboard the Titanic. However, it has been modified by data scientist XX to contain some values that need cleaning up before we can start running queries...

In [28]:
! pip install polars
! pip install bastionlab

In [29]:
!wget 'https://raw.githubusercontent.com/chingjunetao/medium-article/master/simple-guide-to-data-cleaning/modified_titanic_data.csv'

>*You can also get the original dataset with a free user account at https://www.kaggle.com/competitions/titanic/data.*

### Getting set up
We now need to launch the server using Docker.

In [31]:
!docker run -it -p 50056:50056 --env DISABLE_AUTHENTICATION=1 -d mithrilsecuritysas/bastionlab:latest

The final step is to connect to our server and send it our dataset.

First, we read in the dataset using Polars' `read_csv()` function, which returns a Polars DataFrame instance containing the dataset. Then, we connect to the server by creating a BastionLab `Connection` instance.

In [32]:
from bastionlab import Connection
import polars as pl

df = pl.read_csv("modified_titanic_data.csv")

connection = Connection("localhost")
client = connection.client

Finally, we send the Polars DataFrame instance to the server using BastionLab's `polars.send_df()` method, which returns a `RemoteLazyFrame` instance, a reference to the uploaded DataFrame. We will be working with it throughout this tutorial.

For the sake of this tutorial, we specify an **unsafe policy which disables all checks**. We set the `safe_zone` parameter to `TrueRule()` so that every request is considered safe. In this case, the `unsafe_handling` parameter can be anything (as there are no unsafe requests), so we set it to `Log()` in the following example.

>**Important note** - This unsafe policy is used so that we can focus on demonstrating data cleaning in BastionLab, without having to worry about approving any data access requests. However this policy is **not** suited for production.


In [33]:
from bastionlab.polars.policy import Policy, TrueRule, Log

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log())
rdf = client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])

rdf

FetchableLazyFrame(identifier=8a745515-1877-4182-ad6a-05be8931addc)

We can now see the list of column names in the dataset by using RemoteLazyFrame's `columns` attribute.

In [34]:
rdf.columns

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked',
 'OnboardTime']

## Dropping columns
__________________________________________________________________________________

Let's imagine that we don't need the column `"Fare"`. You can drop it by using RemoteLazyFrame's `drop()` method, which takes the name of a column or a list of column names as a parameter and returns a RemoteLazyFrame that no longer includes those columns.

In [35]:
rdf = rdf.drop("Fare")
rdf.columns

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Cabin',
 'Embarked',
 'OnboardTime']

As you can see, `'Fare'` is no longer in our `rdf` RemoteLazyFrame instance.

## Cleaning null values

__________________________________________________________________________________


The next problem we want to address is null values in the dataset.

First of all, let's see how many null values we have in `'Age'`. You can do this by selecting the column `'Age'` and using the `is_null()` method. This will output a RemoteLazyFrame in which each cell of the `'Age'` column that was `null` is replaced by `true`, and each cell that was not `null` is replaced by `false`. From there, we can get the total number of null values in the `'Age'` column by using `sum()`, which counts each `true` as `1`.

>Note - To access and show the data in this RemoteLazyFrame, you always need to use the `collect()` and `fetch()` methods. In this case, this will trigger a request for the data owner's approval: `"please respond to this request by inputting 'y' in your terminal running the docker image to accept this request"`.

In [36]:
total_nulls = rdf.select(pl.col("Age").is_null().sum())

total_nulls.collect().fetch()

Age
u32
178


## Replacing null values
__________________________________________________________________________________


Now that we know there are 178 `null` values in the `'Age'` column, the next question is: what do we want to do with them? One way to deal with null values is to replace them with another value. Here, we'll use the `fill_null()` method to replace the `null` cells in `'Age'` with the value `"100"`.

We can now verify that this has worked by checking how many `null` values we have in our new RemoteLazyFrame instance, called `swap`.

In [37]:
swap = rdf.fill_null("100")
total_nulls = swap.select(pl.col("Age").is_null().sum())

total_nulls.collect().fetch()

Age
u32
0


Let's also check how many cells contain `"100"`. We can do this by filtering the rows down to those where `'Age'` is equal to `"100"` and then counting the cells in that column. The output is, of course, 178!

In [38]:
total_100s = swap.filter(pl.col("Age") == "100").select(pl.col("Age").count())

total_100s.collect().fetch()

Age
u32
178


## Converting column types
__________________________________________________________________________________


As you may have noticed, our `"Age"` column contains strings, not integers. If you want to change that, you can use the `cast()` method with `strict` set to `False` to convert the string values to numerical ones. With `strict=False`, any value that cannot be converted becomes `null` instead of raising an error.

In [39]:
swap = swap.with_column(pl.col("Age").cast(pl.Int64, strict=False))

total_num_100s = swap.filter(pl.col("Age") == 100).select(pl.col("Age").count())

total_num_100s.collect().fetch()

Age
u32
178


## Deleting null values
__________________________________________________________________________________


Another method for handling `null` values is simply to delete the rows that contain them. You can do this by using RemoteLazyFrame's `drop_nulls()` method.

As you can see, our `drop` instance of the original `rdf` RemoteLazyFrame now also has zero null values.

In [40]:
drop = rdf.drop_nulls()

total_nulls = drop.select(pl.col("Age").is_null().sum())

total_nulls.collect().fetch()

Age
u32
0


## Cleaning duplicates
__________________________________________________________________________________


After all those operations, we might still be left with duplicates. You can filter a column or a whole dataset down to its unique values by using the `unique()` method.

Let's start by checking how many values there currently are in `rdf`'s `'Sex'` column.

In [41]:
count = rdf.select(pl.col("Sex").count()).collect().fetch()

Now, let's use `unique()` on the `'Sex'` column so that it is left with only unique values.

In [42]:
df = rdf.select(pl.col("Sex").unique())
df.collect().fetch()

Sex
str
"""male"""
"""female"""
"""m"""
"""m """
"""M"""
"""F"""
"""f"""


This leads us to the final cleaning step: how can you map all the alternative forms of a value to a single form? For example, `"m"`, `"M"` and `"male"` all mean male.

One way to achieve this is to use a Polars `when-then-otherwise` expression to replace the alternative forms of "male" with one chosen form.

In [43]:
new_rdf = (
    df.select(
        pl.when(pl.col("Sex") == "M")
        .then("male")
        .when(pl.col("Sex") == "m")
        .then("male")
        .when(pl.col("Sex") == "m ")
        .then("male")
        .when(pl.col("Sex") == "F")
        .then("female")
        .when(pl.col("Sex") == "f")
        .then("female")
        .otherwise(pl.col("Sex"))
    )
    .collect()
    .fetch()
)
new_rdf

literal
str
"""male"""
"""female"""
"""male"""
"""male"""
"""male"""
"""female"""
"""female"""


As you can see, we now only have "male" or "female" in our output!

You now know **how to clean your dataset with BastionLab** and you can close your connection to the server.

In [44]:
connection.close()