<div id="colab_button">
  <h1>Text Data Preprocessing</h1>
  <a target="_blank" href="https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.7/docs/docs/tutorials/string_preprocessing.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

----------------------------------------

In this tutorial, we will be processing the [SMS Spam Collection](https://archive.ics.uci.edu/ml/machine-learning-databases/00228/) dataset.

We will see how to change case, remove punctuation, fuzzy-match and count words, and split strings into tokens, all on our `RemoteDataFrame`. If you want to know more about `RemoteDataFrame`s, check out our [Remote Data Science](https://github.com/mithril-security/bastionlab/blob/v0.3.7/docs/docs/concepts-guides/remote_data_science.md) concept guide.

We will preprocess our text as if it were going to be used in an NLP task.

Let's dive in!

## Pre-requisites
___________________________________________

### Installation and dataset

In order to run this notebook, we need to:
- Have [Python3.7](https://www.python.org/downloads/) (or greater) and [Python Pip](https://pypi.org/project/pip/) installed
- Install [BastionLab](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/)
- Download [the dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip) we will be using in this tutorial.
- Install [Polars](https://pypi.org/project/polars/)

We'll do so by running the code block below. 

>If you are running this notebook on your machine instead of [Google Colab](https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.7/docs/docs/tutorials/saving_dataframes.ipynb), you can see our [Installation page](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/) to find the installation method that best suits your needs.

In [1]:
!pip install bastionlab
!pip install bastionlab_server
!pip install polars

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip smsspamcollection.zip


### Dataset Description
The SMS Spam Dataset contains approximately 5,000 messages, each tagged as legitimate (ham) or spam.

It includes a collection of 425 SMS spam messages that was manually extracted from the Grumbletext website. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received.

It also includes a subset of 3,375 randomly chosen ham (good) messages from the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore.

You can read more about the dataset [here](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).



### Launch and connect to the server

In [2]:
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()


>*Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our [Installation Tutorial](../getting-started/installation.md) for more details.*

In [5]:
# connect to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client


### Loading Dataset


In [8]:
!head -n 2 SMSSpamCollection

ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham	Ok lar... Joking wif u oni...


Using the `head` command on the SMSSpamCollection file, we can see that the file is tab-separated and has no header row, so we have to supply the column names ourselves when we load it with Polars.

In [9]:
import polars as pl

# Read the tab-separated file using Polars and name the columns `label` and `text`
df = pl.read_csv(
    "SMSSpamCollection", has_header=False, sep="\t", new_columns=["label", "text"]
)

We load the CSV file by setting the `has_header` flag to `False` and passing our headers `label` and `text` to the `read_csv` method.
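The same header-less, tab-separated layout can be checked without Polars. The sketch below is a standard-library illustration only and is not part of the BastionLab workflow: it parses the two sample lines printed by `head` from an in-memory string and attaches the `label` and `text` names by hand. The `sample` and `rows` names are made up for this example.

```python
import csv
import io

# The two lines shown by `head`: label<TAB>text, with no header row.
sample = (
    "ham\tGo until jurong point, crazy.. Available only in bugis n great "
    "world la e buffet... Cine there got amore wat...\n"
    "ham\tOk lar... Joking wif u oni...\n"
)

# Passing `fieldnames` tells DictReader that the first line is data, not a header.
reader = csv.DictReader(
    io.StringIO(sample),
    fieldnames=["label", "text"],
    delimiter="\t",
    quoting=csv.QUOTE_NONE,  # treat quote characters inside messages as plain text
)
rows = list(reader)

print(len(rows))
print(rows[1]["label"], "|", rows[1]["text"])
```

Disabling quoting here is a design choice for free-form SMS text, where a stray `"` inside a message should not be interpreted as the start of a quoted field.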

### Upload the dataframe to the server

Before we upload the dataset to the server, we'll create a custom privacy policy that essentially allows all queries.

> Note that we only use the `TrueRule` for testing. In a production environment, you might want to use the `Aggregate` policy instead. Read more about policies [here](https://github.com/mithril-security/bastionlab/blob/v0.3.7/docs/docs/tutorials/defining_policy_privacy.ipynb).

In [10]:
from bastionlab.polars.policy import Policy, TrueRule, Log

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=False)

rdf = client.polars.send_df(df, policy=policy)

rdf


FetchableLazyFrame(identifier=a7b39af4-2171-470f-9ce1-5e459eae40c2)

The server returns a `RemoteLazyFrame` (displayed above as a `FetchableLazyFrame`), which we will be working with throughout the rest of this tutorial!

## Preprocessing our text collection 
--------------------------

> Note: These string methods behave much like the string operations available in Python.

### Lowercasing our texts
We will use the `to_lowercase` method to ensure that all the strings in our `RemoteDataFrame` are in lowercase.

_You could also call `to_uppercase` on the `RemoteDataFrame` to convert all the strings to uppercase._

In [11]:
rdf = rdf.to_lowercase()
rdf.collect().fetch()


label,text
str,str
"""ham""","""go until juron..."
"""ham""","""ok lar... joki..."
"""spam""","""free entry in ..."
"""ham""","""u dun say so e..."
"""ham""","""nah i don't th..."
"""spam""","""freemsg hey th..."
"""ham""","""even my brothe..."
"""ham""","""as per your re..."
"""spam""","""winner!! as a ..."
"""spam""","""had your mobil..."


In [12]:
rdf.to_uppercase().collect().fetch()

label,text
str,str
"""HAM""","""GO UNTIL JURON..."
"""HAM""","""OK LAR... JOKI..."
"""SPAM""","""FREE ENTRY IN ..."
"""HAM""","""U DUN SAY SO E..."
"""HAM""","""NAH I DON'T TH..."
"""SPAM""","""FREEMSG HEY TH..."
"""HAM""","""EVEN MY BROTHE..."
"""HAM""","""AS PER YOUR RE..."
"""SPAM""","""WINNER!! AS A ..."
"""SPAM""","""HAD YOUR MOBIL..."


### Removing punctuation

Now, we want to remove all punctuation from our strings so that only words remain.

We will use this regular expression (`[^a-z0-9\\\s]+`) to match every run of characters that is not a lowercase letter, a digit, or whitespace.

In order to remove all occurrences and not just the first one, we will use the method `replace_all` and not `replace`.

> Note that all methods are case-sensitive, so either add the regex flag `(?i)` or pass a pattern in the right case.

In [13]:
rdf = rdf.replace_all(pattern="[^a-z0-9\\\s]+", to="").collect()
rdf.fetch()


label,text
str,str
"""ham""","""go until juron..."
"""ham""","""ok lar joking ..."
"""spam""","""free entry in ..."
"""ham""","""u dun say so e..."
"""ham""","""nah i dont thi..."
"""spam""","""freemsg hey th..."
"""ham""","""even my brothe..."
"""ham""","""as per your re..."
"""spam""","""winner as a va..."
"""spam""","""had your mobil..."
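To get a feel for what this pattern does, it can be tried locally with Python's built-in `re` module. This is only a sketch: it assumes the server's regex engine treats this simple character class the same way, and it uses `count=1` to imitate a "first occurrence only" replacement. The sample text is the second message of the dataset after lowercasing.

```python
import re

# Keep lowercase letters, digits and whitespace; match runs of anything else.
PUNCTUATION = re.compile(r"[^a-z0-9\s]+")

text = "ok lar... joking wif u oni..."

# Replace every match (analogous to `replace_all`).
all_removed = PUNCTUATION.sub("", text)

# Replace only the first match (analogous to `replace`).
first_removed = PUNCTUATION.sub("", text, count=1)

print(all_removed)    # ok lar joking wif u oni
print(first_removed)  # ok lar joking wif u oni...

# The character class is case-sensitive: on text that was not lowercased,
# uppercase letters are stripped too unless the (?i) flag is added.
case_sensitive = re.sub(r"[^a-z0-9\s]+", "", "Ok lar...")
case_insensitive = re.sub(r"(?i)[^a-z0-9\s]+", "", "Ok lar...")

print(case_sensitive)    # k lar
print(case_insensitive)  # Ok lar
```

This is why the lowercasing step comes first: with a lowercase-only character class, any remaining capital letter would be removed along with the punctuation.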


### Matching words in a fuzzy manner

Before using our dataset for a machine learning task, let's look for a few words that are commonly associated with spam, to see how we can apply the fuzzy matching method on our `RemoteDataFrame`.


In [14]:
rdf.fuzzy_match(pattern="free", cols=["text"]).collect().fetch()


label,text
str,str
"""ham""","""go until juron..."
"""ham""",
"""spam""","""free entry in ..."
"""ham""",
"""ham""","""nah i dont thi..."
"""spam""","""freemsg hey th..."
"""ham""",
"""ham""","""as per your re..."
"""spam""",
"""spam""","""had your mobil..."


In [15]:
rdf.fuzzy_match(pattern="urgent", cols=["text"]).collect().fetch()


label,text
str,str
"""ham""","""go until juron..."
"""ham""",
"""spam""",
"""ham""",
"""ham""",
"""spam""","""freemsg hey th..."
"""ham""",
"""ham""","""as per your re..."
"""spam""",
"""spam""",


We see that some fields are `null` (shown as empty above). This is because the fuzzy matcher found no match in those messages.
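The general idea of "keep the text when something close to the pattern is found, otherwise return null" can be sketched in plain Python. Note the assumptions: this tutorial does not describe the algorithm behind `fuzzy_match`, so the example below uses `difflib` from the standard library as a stand-in, and its results will not necessarily agree with the server's. The `fuzzy_keep` helper, its `cutoff` parameter, and the toy messages are all hypothetical.

```python
import difflib
from typing import Optional


def fuzzy_keep(text: str, pattern: str, cutoff: float = 0.6) -> Optional[str]:
    """Return `text` if any token is similar enough to `pattern`, else None."""
    tokens = text.split()
    # get_close_matches returns the tokens whose similarity ratio is >= cutoff.
    close = difflib.get_close_matches(pattern, tokens, n=1, cutoff=cutoff)
    return text if close else None


# Toy messages, not taken from the dataset.
kept = fuzzy_keep("freemsg hey", "free")
dropped = fuzzy_keep("ok lar joking wif u oni", "free")

print(kept)     # freemsg hey
print(dropped)  # None
```

Here `freemsg` is kept because it shares most of its characters with `free`, while no token of the second message comes close, so the helper returns `None`, mirroring the `null` cells above.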

### Finding the frequency of words with findall

With `findall`, we can be very specific with our search: it returns every match it finds as a list.

Let's use `findall` to count how often the word `free` appears in our `RemoteDataFrame`.

In [16]:
df = rdf.findall(pattern="(?i)free").collect().fetch()

df.select(pl.col("text").arr.lengths().sum())

text
u32
327


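We can see from the results above that the pattern `free` matches `327` times in our `RemoteDataFrame`. Because the pattern is not anchored to word boundaries, this also counts occurrences inside longer words such as `freemsg`.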
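The "one list of matches per row, then sum the list lengths" logic can be reproduced locally with `re.findall`. This is a plain-Python sketch on made-up messages, not the remote implementation.

```python
import re

# Toy messages, not taken from the dataset.
messages = [
    "free entry in a free prize draw",
    "ok lar joking wif u oni",
    "FreeMsg hey",
]

# One list of matches per message, like the list column returned by `findall`.
matches_per_message = [re.findall(r"(?i)free", m) for m in messages]

# Summing the list lengths gives the total number of occurrences.
total = sum(len(found) for found in matches_per_message)

print(matches_per_message)  # [['free', 'free'], [], ['Free']]
print(total)                # 3
```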

### Word frequency but with booleans

`contains` acts like `findall` but returns a boolean indicating whether a match was found.

We can also use the method `contains` to perform the word frequency operation we just did above.

Let's look at how we could do that.

In [17]:
df = rdf.contains("(?i)free").collect().fetch()

df.select(
    pl.when(pl.col("text") == True).then(1).otherwise(0).sum(),
)

literal
i64
234


We observe that the results of `contains` and `findall` differ. `contains` returns a single `True` per message as soon as it finds a match, so it counts the messages that contain `free` (`234`), whereas `findall` counts every occurrence (`327`).
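A small local example makes the gap between the two counts concrete. It is a sketch in plain Python on invented messages: `search` plays the role of a boolean "contains" check, and one message deliberately repeats the word.

```python
import re

pattern = re.compile(r"(?i)free")

# Toy messages, not taken from the dataset; the first one repeats the word.
messages = [
    "free ringtone text free to 12345",
    "see you at lunch",
    "claim your FREE voucher",
]

# One boolean per message: does the pattern occur at least once?
contains_flags = [pattern.search(m) is not None for m in messages]
messages_with_match = sum(contains_flags)

# Every occurrence across all messages.
total_occurrences = sum(len(pattern.findall(m)) for m in messages)

print(contains_flags)       # [True, False, True]
print(messages_with_match)  # 2
print(total_occurrences)    # 3
```

The boolean count answers "how many messages mention the word?", while the occurrence count answers "how many times is the word used?", so which one to use depends on the feature you want to build.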

### Tokenizing our sentences

Our final step would be to tokenize our sentences into words so they could be easily transformed for our NLP task.

We will utilize the `split` method for that.

In [18]:
rdf = rdf.split(sep=" ", cols=["text"]).collect()
rdf.fetch()

label,text
str,list[str]
"""ham""","[""go"", ""until"", ... ""wat""]"
"""ham""","[""ok"", ""lar"", ... ""oni""]"
"""spam""","[""free"", ""entry"", ... ""08452810075over18s""]"
"""ham""","[""u"", ""dun"", ... ""say""]"
"""ham""","[""nah"", ""i"", ... ""though""]"
"""spam""","[""freemsg"", ""hey"", ... ""rcv""]"
"""ham""","[""even"", ""my"", ... ""patent""]"
"""ham""","[""as"", ""per"", ... ""callertune""]"
"""spam""","[""winner"", ""as"", ... ""only""]"
"""spam""","[""had"", ""your"", ... ""08002986030""]"


Now, we have our texts tokenized on `<space>`.
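One thing to watch for when splitting on a single space: if removing punctuation left two spaces in a row, a literal split on `" "` yields an empty token. The sketch below shows this in plain Python with an invented string. Whether the remote `split` behaves identically is an assumption worth checking on your own data.

```python
# Toy string: what "price - 100 only" looks like once the dash is removed.
cleaned = "price  100 only"

# Splitting on a single space keeps the empty string between the two spaces.
tokens_single_space = cleaned.split(" ")

# Splitting on any run of whitespace does not produce empty tokens.
tokens_whitespace = cleaned.split()

# Alternatively, drop the empty tokens after splitting on a single space.
tokens_filtered = [token for token in tokens_single_space if token]

print(tokens_single_space)  # ['price', '', '100', 'only']
print(tokens_whitespace)    # ['price', '100', 'only']
print(tokens_filtered)      # ['price', '100', 'only']
```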

Let's now close the connection and shut down the server.

In [None]:
# close the connection and stop the test server started earlier
connection.close()
bastionlab_server.stop(srv)