<div id="colab_button">
  <h1>String Manipulation</h1>
  <a target="_blank" href="https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.7/docs/docs/tutorials/saving_dataframes.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

----------------------------------------

By the end of this tutorial, you will have seen how to apply string methods (`split`, `replace`, `fuzzy_match`, etc.) to a `RemoteLazyFrame`.

Let's dive in!

## Pre-requisites
___________________________________________

### Installation and dataset

In order to run this notebook, we need to:
- Have [Python3.7](https://www.python.org/downloads/) (or greater) and [Python Pip](https://pypi.org/project/pip/) installed
- Install [BastionLab](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/)
- Download [the dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip) we will be using in this tutorial.

We'll do so by running the code block below. 

>If you are running this notebook on your machine instead of [Google Colab](https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.7/docs/docs/tutorials/saving_dataframes.ipynb), you can see our [Installation page](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/) to find the installation method that best suits your needs.

In [4]:
!pip install bastionlab

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip smsspamcollection.zip

Archive:  smsspamcollection.zip
  inflating: SMSSpamCollection       
  inflating: readme                  


Our dataset is the SMS Spam Collection from the UCI Machine Learning Repository, a set of SMS messages that are each labeled as `ham` (legitimate) or `spam`.

### Launch and connect to the server

In [None]:
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()

>*Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our [Installation Tutorial](../getting-started/installation.md) for more details.*

In [2]:
# connect to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client



### Upload the dataframe to the server

Before we upload the dataset to the server, we'll create a custom privacy policy. To keep this tutorial focused on string operations, the policy uses `TrueRule()` as its safe zone, so every query is treated as safe and accepted. *You can check out how to define a privacy policy [here](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/defining_policy_privacy/).*

In [5]:
import polars as pl
from bastionlab.polars.policy import Policy, TrueRule, Log

# Read the tab-separated file using Polars and name the columns `label` and `text`
df = pl.read_csv(
    "SMSSpamCollection", has_header=False, sep="\t", new_columns=["label", "text"]
)

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=False)

rdf = client.polars.send_df(df, policy=policy)

rdf

FetchableLazyFrame(identifier=7ec01bbf-0872-4a90-a472-68f40c562d3a)

The server returns a `FetchableLazyFrame`, a kind of `RemoteLazyFrame`, which we will be working with throughout the rest of this tutorial!

## Applying String Operations 
--------------------------

> Note: These string methods closely mirror the string operations available in Python.

When no columns are passed to a method, the operation is applied to all the columns in the `RemoteLazyFrame`.

You can choose the columns to apply the operation to by passing them in the `cols` argument of each method.

For example, 

```python
rdf.split(" ", cols=['name'])
```

or 

```python
rdf.to_lowercase(cols=['name', 'address'])
```
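The column-selection rule can be pictured locally with plain Python. The sketch below is only an analogy and not BastionLab's implementation: `apply_to_columns` is a hypothetical helper, and the table is an ordinary dict of lists holding made-up values.

```python
from typing import Callable, Dict, List, Optional

Table = Dict[str, List[str]]


def apply_to_columns(
    table: Table,
    op: Callable[[str], str],
    cols: Optional[List[str]] = None,
) -> Table:
    """Apply `op` to every value of the selected columns (hypothetical helper).

    When `cols` is None, every column is selected, mirroring the default
    behaviour described above.
    """
    selected = set(table) if cols is None else set(cols)
    return {
        name: [op(value) for value in values] if name in selected else list(values)
        for name, values in table.items()
    }


# Made-up sample data, used only for illustration.
people = {"name": ["Ada Lovelace"], "address": ["12 Main Street"]}

all_lower = apply_to_columns(people, str.lower)                 # no cols: all columns
only_name = apply_to_columns(people, str.lower, cols=["name"])  # only `name`

print(all_lower)  # {'name': ['ada lovelace'], 'address': ['12 main street']}
print(only_name)  # {'name': ['ada lovelace'], 'address': ['12 Main Street']}
```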

### split

With `split`, we split the strings in the columns of the `RemoteLazyFrame` on a separator token such as a space, a comma, or a question mark.

Below, we show the columns in our `RemoteLazyFrame`:

In [6]:
cols = rdf.columns

print(cols)

['label', 'text']


Here, we split the rows in the `RemoteLazyFrame` on a single space.

Since we did not specify which columns to split, the operation is applied to all columns in the `RemoteLazyFrame`.
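Python's built-in `str.split` gives a local picture of what to expect. The sample strings below are made up for illustration. A value that does not contain the separator becomes a one-element list, which is what happens to the `label` column, whose values contain no spaces.

```python
# Local analogue of splitting on a single space (sample strings are made up).
label = "ham"
text = "Ok see you at lunch"

label_tokens = label.split(" ")
text_tokens = text.split(" ")

print(label_tokens)  # ['ham']
print(text_tokens)   # ['Ok', 'see', 'you', 'at', 'lunch']
```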


In [8]:
rdf.split(" ").collect().fetch().limit(5)

label,text
list[str],list[str]
"[""ham""]","[""Go"", ""until"", ... ""wat...""]"
"[""ham""]","[""Ok"", ""lar..."", ... ""oni...""]"
"[""spam""]","[""Free"", ""entry"", ... ""08452810075over18's""]"
"[""ham""]","[""U"", ""dun"", ... ""say...""]"
"[""ham""]","[""Nah"", ""I"", ... ""though""]"


### to_lowercase

Here, we convert the text in all columns to lowercase.

In [9]:
rdf.to_lowercase().collect().fetch().limit(5)

label,text
str,str
"""ham""","""go until juron..."
"""ham""","""ok lar... joki..."
"""spam""","""free entry in ..."
"""ham""","""u dun say so e..."
"""ham""","""nah i don't th..."


### to_uppercase

Here, we convert the text in all columns to uppercase.

In [10]:
rdf.to_uppercase().collect().fetch().limit(5)

label,text
str,str
"""HAM""","""GO UNTIL JURON..."
"""HAM""","""OK LAR... JOKI..."
"""SPAM""","""FREE ENTRY IN ..."
"""HAM""","""U DUN SAY SO E..."
"""HAM""","""NAH I DON'T TH..."


### replace

Here, we will replace the word `ham` with `jam`.

In [11]:
rdf.replace(pattern="ham", to="jam").collect().fetch().limit(5)

label,text
str,str
"""jam""","""Go until juron..."
"""jam""","""Ok lar... Joki..."
"""spam""","""Free entry in ..."
"""jam""","""U dun say so e..."
"""jam""","""Nah I don't th..."


We see from the output above that `ham` has been replaced with `jam`. The `spam` label is unchanged because it does not contain the substring `ham`.
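Keep in mind that `replace` only changes the first occurrence of the pattern in each string. As a local analogy (not BastionLab code), Python's `re.sub` shows the same distinction through its `count` argument. The sample sentence is made up.

```python
import re

sentence = "ham and ham again"  # made-up sample

# count=1 stops after the first match, similar in spirit to `replace`.
first_only = re.sub("ham", "jam", sentence, count=1)

# Without count, every match is substituted, similar in spirit to `replace_all`.
every_match = re.sub("ham", "jam", sentence)

print(first_only)   # jam and ham again
print(every_match)  # jam and jam again
```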

### replace_all

Here, we will apply `replace_all` to all the columns of the `RemoteLazyFrame`.

> Note that the difference between `replace` and `replace_all` is that `replace` only changes the first occurrence of the pattern, whereas `replace_all` replaces every occurrence of the pattern in the string.

In [13]:
rdf.replace_all(pattern="go", to="leave").collect().fetch().limit(5)

label,text
str,str
"""ham""","""Go until juron..."
"""ham""","""Ok lar... Joki..."
"""spam""","""Free entry in ..."
"""ham""","""U dun say so e..."
"""ham""","""Nah I don't th..."


From the output above, we see that `Go` hasn't been replaced with `leave`, because both `replace` and `replace_all` are case-sensitive. You can pass a regex to make the pattern matching case-insensitive.

_We add the case insensitivity flag `(?i)` to the pattern._
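Python's `re` module accepts the same inline flag at the start of a pattern, which gives a quick way to try a pattern locally before sending it to the server. This is only an analogy: regex dialects can differ in details between engines, and the sample sentence is made up.

```python
import re

sentence = "Go now, then go home"  # made-up sample

# Without the flag, only the lowercase "go" matches.
case_sensitive = re.sub("go", "leave", sentence)

# With (?i) at the start of the pattern, both spellings match.
case_insensitive = re.sub("(?i)go", "leave", sentence)

print(case_sensitive)    # Go now, then leave home
print(case_insensitive)  # leave now, then leave home
```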

In [16]:
rdf.replace_all(pattern="(?i)go", to="leave").collect().fetch().limit(5)

label,text
str,str
"""ham""","""leave until ju..."
"""ham""","""Ok lar... Joki..."
"""spam""","""Free entry in ..."
"""ham""","""U dun say so e..."
"""ham""","""Nah I don't th..."


Alternatively, we can search for the capitalized form `Go` directly:

In [14]:
rdf.replace_all(pattern="Go", to="leave").collect().fetch().limit(5)

label,text
str,str
"""ham""","""leave until ju..."
"""ham""","""Ok lar... Joki..."
"""spam""","""Free entry in ..."
"""ham""","""U dun say so e..."
"""ham""","""Nah I don't th..."


### fuzzy_match

Here, we will try fuzzy matching on the `RemoteLazyFrame`, matching the pattern `am` against the `text` column.

In [18]:
rdf.fuzzy_match(pattern="am", cols=["text"]).collect().fetch().limit(5)

label,text
str,str
"""ham""","""Go until juron..."
"""ham""",
"""spam""","""Free entry in ..."
"""ham""",
"""ham""",


We see that some fields are `null`. This is because the fuzzy matcher found no match in those rows.

### findall

`findall` searches the `RemoteLazyFrame` for every match of a pattern.

Below, we look for the pattern `free` in a case-insensitive manner by using the regex flag `(?i)`.

In [19]:
rdf.findall(pattern="(?i)free").collect().fetch().limit(5)

label,text
list[str],list[str]
[],[]
[],[]
[],"[""Free""]"
[],[]
[],[]


When a row contains matches, they are returned as a list of strings. When no match is found, an empty list is returned for that row.
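The list-per-row shape of the result has a close local counterpart in Python's `re.findall`. The sketch below is an analogy only, run on made-up messages rather than on the remote data.

```python
import re

# Made-up sample messages, one per "row".
messages = ["Free entry, totally FREE", "See you at lunch"]

# One list of matches per message; an empty list when nothing matches.
matches = [re.findall("(?i)free", message) for message in messages]

print(matches)  # [['Free', 'FREE'], []]
```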

### contains

`contains` works like `findall`, but returns a boolean indicating whether a match was found.

Here, we will look for the string `free` in a case-sensitive manner.

In [21]:
rdf.contains("free").collect().fetch().limit(5)

label,text
bool,bool
False,False
False,False
False,False
False,False
False,False


Here, every value is `False`, even though we know that at least the third row of the `text` column contains the word `Free`. As we saw in the `replace_all` section, string operations are case-sensitive, so we either use a case-insensitive regex or search for the capitalized form of the pattern.

In [23]:
print(rdf.contains("(?i)free").collect().fetch().limit(5))

print(rdf.contains("Free").collect().fetch().limit(5))

shape: (5, 2)
┌───────┬───────┐
│ label ┆ text  │
│ ---   ┆ ---   │
│ bool  ┆ bool  │
╞═══════╪═══════╡
│ false ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false ┆ true  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false ┆ false │
└───────┴───────┘
shape: (5, 2)
┌───────┬───────┐
│ label ┆ text  │
│ ---   ┆ ---   │
│ bool  ┆ bool  │
╞═══════╪═══════╡
│ false ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false ┆ true  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false ┆ false │
└───────┴───────┘


And as expected, we get `true` for row 3 of the `text` column in both cases.

Let's now close the connection and shut down the server.
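The three searches used in this section can also be tried locally with Python's `re.search`. This is a minimal sketch on a made-up message: `contains_pattern` is a hypothetical helper and not part of BastionLab.

```python
import re


def contains_pattern(pattern: str, text: str) -> bool:
    """Return True when `pattern` matches anywhere in `text` (hypothetical helper)."""
    return re.search(pattern, text) is not None


message = "Free entry in a weekly competition"  # made-up sample

lowercase_hit = contains_pattern("free", message)        # case-sensitive, lowercase
insensitive_hit = contains_pattern("(?i)free", message)  # case-insensitive flag
capitalized_hit = contains_pattern("Free", message)      # exact capitalized form

print(lowercase_hit, insensitive_hit, capitalized_hit)  # False True True
```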

In [None]:
connection.close()
bastionlab_server.stop(srv)