<div id="colab_button">
  <h1>RemoteDataFrame String Manipulation</h1>
  <a target="_blank" href="https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.7/docs/docs/tutorials/saving_dataframes.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

----------------------------------------

By the end of this tutorial, you will have seen how to apply string methods (`split`, `replace`, `match`, etc.) to the columns of a `RemoteDataFrame`.

Let's dive in!

## Pre-requisites
___________________________________________

### Installation and dataset

In order to run this notebook, we need to:
- Have [Python 3.7](https://www.python.org/downloads/) (or greater) and [Python Pip](https://pypi.org/project/pip/) installed
- Install [BastionLab](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/)
- Install [Polars](https://pola-rs.github.io/polars-book/user-guide/quickstart/intro.html)
- Download [the dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip) we will be using in this tutorial.

We'll do so by running the code block below. 

>If you are running this notebook on your machine instead of [Google Colab](https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.6/docs/docs/tutorials/data_cleaning.ipynb), you can see our [Installation page](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/) to find the installation method that best suits your needs.

In [None]:
# !pip install bastionlab
# !pip install polars

# !wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
# !unzip smsspamcollection.zip

### Launch and connect to the server

In [None]:
# launch bastionlab_server test package
# (`srv` is needed again at the end of the tutorial to stop the server)
import bastionlab_server

srv = bastionlab_server.start()

>*Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our [Installation Tutorial](../getting-started/installation.md) for more details.*

In [None]:
# connect to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client

### Upload the dataframe to the server

Before we upload the dataset to the server, we'll create a custom privacy policy. Its safe zone is `TrueRule()`, so every query is treated as safe. With `unsafe_handling=Log()`, any query falling outside the safe zone would be logged rather than rejected. *You can check out how to define a privacy policy [here](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/defining_policy_privacy/).*

In [None]:
import polars as pl
from bastionlab.polars.policy import Policy, TrueRule, Log

# Read the tab-separated file with Polars and name the columns `label` and `text`
df = pl.read_csv(
    "SMSSpamCollection", has_header=False, sep="\t", new_columns=["label", "text"]
)

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=False)

rdf = client.polars.send_df(df, policy=policy)

rdf

The server returns a `RemoteLazyFrame` which we will be working with throughout the rest of this tutorial!

## Applying String Operations 
--------------------------

### split

With `split`, we split the string columns of the `RemoteDataFrame` on a separator. In this example, the separator is a single space.

Below, we show the columns in our `RemoteDataFrame`:

In [None]:
cols = rdf.columns

print(cols)

In [None]:
# Here, we split the string columns on a single space
rdf.split(" ").collect().fetch()
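
To get a feel for what a split produces before running it remotely, the same idea can be sketched locally in plain Python. This is only an analogy using `str.split` and the standard `re` module on an invented message. It does not call BastionLab, and whether the remote `split` accepts a multi-token pattern like the one shown here is an assumption this tutorial does not confirm.

```python
import re

# Local illustration only: plain Python, not the BastionLab API.
# The sample message is invented for this sketch.
message = "Free entry, are you in?"

# Splitting on a single space keeps punctuation attached to each token.
space_tokens = message.split(" ")

# Splitting on runs of spaces, commas, or question marks strips that
# punctuation; empty strings left at the edges are filtered out.
multi_tokens = [t for t in re.split(r"[ ,?]+", message) if t]

print(space_tokens)
print(multi_tokens)
```

Comparing the two lists shows why the choice of separator matters when the split tokens are later used for matching or counting.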

### to_lowercase

Here, we will convert the text in every column to lowercase.

In [None]:
rdf.to_lowercase().collect().fetch()

### to_uppercase

Here, we will convert the text in every column to uppercase.

In [None]:
rdf.to_uppercase().collect().fetch()

### replace

Here, we will replace the word `ham` with `jam`.

In [None]:
rdf.replace(pattern="ham", to="jam").collect().fetch()

### replace_all

Here, we will apply `replace_all` to all the columns of the `RemoteDataFrame`.

> Note that `replace` only changes the first occurrence of the pattern, whereas `replace_all` changes every occurrence of the pattern in the sentence.

> Also note that both `replace` and `replace_all` are case sensitive. You can add a regex flag to the pattern to make the matching case insensitive.

In [None]:
rdf.replace_all(pattern="Go", to="leave").collect().fetch()

_Below, we add the case-insensitivity flag `(?i)` to the pattern._

In [None]:
rdf.replace_all(pattern="(?i)go", to="leave").collect().fetch()
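
The difference between first-occurrence replacement, replace-all, and the `(?i)` flag can be previewed locally with Python's standard `re` module. This is a sketch of the semantics only, on an invented sentence. It does not use BastionLab, and the remote regex engine may differ from Python's in other respects.

```python
import re

# Local illustration only: Python's `re`, not the BastionLab API.
# The sample sentence is invented for this sketch.
sentence = "Go until jurong point, go home and Go on"

# Like `replace`: only the first case-sensitive match is changed.
first_only = re.sub("Go", "leave", sentence, count=1)

# Like `replace_all`: every case-sensitive match is changed.
every_match = re.sub("Go", "leave", sentence)

# With the (?i) flag, matches are found regardless of case.
any_case = re.sub("(?i)go", "leave", sentence)

print(first_only)
print(every_match)
print(any_case)
```

Note that the lowercase `go` in the middle of the sentence is only replaced in the last call, once the pattern ignores case.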

### fuzzy_match

Here, we will try fuzzy matching on `RemoteDataFrame`. We will fuzzy match "`am`" on the `text` column.

In [None]:
rdf.fuzzy_match(pattern="am", cols=["text"]).collect().fetch()

### findall

`findall` searches the `RemoteDataFrame` for every match of a pattern.

Below, we look for the pattern `free` in a case-insensitive manner, using the `(?i)` regex flag.

In [None]:
rdf.findall(pattern="(?i)free").collect().fetch()

### contains

`contains` acts like `findall`, but returns a boolean indicating whether a match was found.

Here, we will look for the string `free`, this time in a case-sensitive manner.

In [None]:
rdf.contains("free").collect().fetch()
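
The contrast between collecting matches and testing for a match can be sketched locally with Python's `re` module. The messages below are invented, and the code is an analogy for the semantics described above rather than a call into BastionLab. The shape of the remote results may differ.

```python
import re

# Local illustration only: Python's `re`, not the BastionLab API.
# The sample messages are invented for this sketch.
messages = ["FREE entry today", "free for you, Free for all", "see you later"]

# findall-style: collect every case-insensitive match per message.
found = [re.findall("(?i)free", m) for m in messages]

# contains-style: a boolean per message, here with a case-sensitive pattern.
has_free = [re.search("free", m) is not None for m in messages]

print(found)
print(has_free)
```

The first message contains only `FREE`, so it yields a match in the case-insensitive search but `False` in the case-sensitive check.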

More string manipulation methods may be supported in the future. For the moment, these are the only ones available:

- `split`
- `to_lowercase`
- `to_uppercase`
- `replace`
- `replace_all`
- `findall`
- `contains`
- `match`
- `fuzzy_match`
- `extract`
Let's now close the connection and shut down the server.

In [None]:
connection.close()
bastionlab_server.stop(srv)