<div id="colab_button">
  <h1>Text Data Preprocessing</h1>
  <a target="_blank" href="https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.7/docs/docs/tutorials/string_preprocessing.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

----------------------------------------

Data cleaning is an essential preparation step for both data analysis and machine learning tasks, and this includes dealing with textual data! But this type of data comes with its own preprocessing challenges, such as handling inconsistent casing or misspelt words.

It is the data scientist's job to clean the text well enough that equivalent values are grouped together correctly and that queries and predictions come out accurate. For this, you need the right tools...

In this tutorial, we'll look at BastionLab's text preprocessing tools and how to use them, by playing the role of a data scientist preparing text for a Natural Language Processing task. We will go through some essential string manipulation functions such as changing case, removing punctuation and splitting strings, while manipulating remote dataframes with safety guarantees.

Let's dive in!

## Pre-requisites
___________________________________________

### Installation and dataset

In order to run this notebook, we need to:
- Have [Python 3.7](https://www.python.org/downloads/) (or greater) and [pip](https://pypi.org/project/pip/) installed
- Install [BastionLab](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/)
- Download [the dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip) we will be using in this tutorial.

We'll do so by running the code block below. 

>If you are running this notebook on your machine instead of [Google Colab](https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.7/docs/docs/tutorials/string_preprocessing.ipynb), you can see our [Installation page](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/) to find the installation method that best suits your needs.

In [1]:
!pip install bastionlab
!pip install bastionlab_server

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip smsspamcollection.zip

The [SMS Spam Dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) contains 5,574 SMS messages tagged as legitimate ("ham") or spam: 4,827 ham messages and 747 spam messages.

You can read more about the dataset [here](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).

### Launch and connect to the server

In [2]:
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()

>*Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our [Installation Tutorial](../getting-started/installation.md) for more details.*

In [3]:
# connect to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client

### Getting a Polars dataframe from the CSV file


In most cases, you can pass your CSV file to Polars' `read_csv()` function to create a dataframe and it just works, unless the dataset you're using doesn't have a header! In that case, Polars would use the first line of the dataset as column names and the resulting dataframe would be wrong.

We can check if a dataset has a header by inspecting the top two lines of the dataset. We'll use Linux's `head` command with the `-n 2` option and... we see that the CSV file doesn't have a header.

In [4]:
# inspect first two rows of dataset
!head -n 2 SMSSpamCollection

ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham	Ok lar... Joking wif u oni...


Our next step is to let `read_csv` know that the CSV file has no header. We can do this using the `has_header=False` option, then supply our desired column names with the `new_columns` keyword.
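To see why the header setting matters, here is a small sketch that uses only Python's standard `csv` module rather than Polars. It parses the two rows previewed above twice: once letting the parser assume a header, and once supplying the column names explicitly. The variable names are illustrative, and `csv.DictReader` is only a stand-in for the idea behind `has_header=False` and `new_columns`.

```python
import csv
import io

# The two rows from the file preview above (tab-separated, no header line).
raw = (
    "ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...\n"
    "ham\tOk lar... Joking wif u oni...\n"
)

# Without column names, DictReader treats the first data row as the header,
# so that message is lost and the "columns" are named after its content.
wrong = list(csv.DictReader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE))

# Supplying the names ourselves means every line is read as data.
right = list(
    csv.DictReader(
        io.StringIO(raw),
        fieldnames=["label", "text"],
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
)

print(len(wrong))        # 1 row: the first message was consumed as a header
print(len(right))        # 2 rows: both messages are kept
print(right[1]["text"])  # Ok lar... Joking wif u oni...
```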

In [5]:
import polars as pl

# read CSV file using Polars and name columns `label` and `text`
df = pl.read_csv(
    "SMSSpamCollection", has_header=False, sep="\t", new_columns=["label", "text"]
)
# display the first row of the dataset and column names
df.head(1)

label,text
str,str
"""ham""","""Go until juron..."


Now that we have our Polars DataFrame, it's time to upload it to the BastionLab server.

### Uploading the dataframe to the server

Before we upload the dataset to the server, we'll create a custom privacy policy which will allow us to run whatever query we want! To do this we will create a `Policy` with the `safe_zone` option set to `TrueRule()`, which means all queries will be considered safe. We would normally use a much stricter policy, but this policy is useful for the purposes of this demo as it allows us to print out sections of the raw data!

> Read more about policies [here](https://github.com/mithril-security/bastionlab/blob/v0.3.7/docs/docs/tutorials/defining_policy_privacy.ipynb).

In [6]:
from bastionlab.polars.policy import Policy, TrueRule, Log

# create our open policy
policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=False)

Now that we've created our custom policy, we will use the `send_df` function to upload our dataset to the BastionLab server.

In [7]:
# upload DataFrame to BastionLab server
rdf = client.polars.send_df(df, policy=policy)

rdf

FetchableLazyFrame(identifier=1fb34c9b-d680-4301-b847-cf949b4d0bde)

The server returns a `FetchableLazyFrame`, a kind of `RemoteLazyFrame`, which allows us to work with the dataset remotely. We will be working with this remote dataframe for the rest of this tutorial!

## Preprocessing our text data 
--------------------------

> _Note: The following string methods are very similar to Python's built-in string operations._

### Changing the case of the text
We can use the `to_lowercase()` and `to_uppercase()` methods to set all the strings in our `RemoteLazyFrame` to lowercase or uppercase. This is handy when you want to group data by string value without worrying about variations in casing.
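For intuition, the plain-Python counterparts are `str.lower()` and `str.upper()`. The sketch below runs locally on a hypothetical list of labels (not taken from the dataset) and shows why normalizing case before grouping matters. It does not call BastionLab.

```python
from collections import Counter

# Hypothetical labels with inconsistent casing (not taken from the dataset).
labels = ["ham", "Ham", "HAM", "spam", "Spam"]

# Grouping the raw values treats every casing variant as its own group.
raw_counts = Counter(labels)

# Lowercasing first collapses the variants into the two real categories.
normalized_counts = Counter(label.lower() for label in labels)

print(len(raw_counts))          # 5 distinct groups
print(dict(normalized_counts))  # {'ham': 3, 'spam': 2}
```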

In [8]:
# set rdf to all lowercase
rdf = rdf.to_lowercase()

# display results
rdf.collect().fetch()

label,text
str,str
"""ham""","""go until juron..."
"""ham""","""ok lar... joki..."
"""spam""","""free entry in ..."
"""ham""","""u dun say so e..."
"""ham""","""nah i don't th..."
"""spam""","""freemsg hey th..."
"""ham""","""even my brothe..."
"""ham""","""as per your re..."
"""spam""","""winner!! as a ..."
"""spam""","""had your mobil..."


In [9]:
# display the dataset with all uppercase strings
rdf.to_uppercase().collect().fetch()

label,text
str,str
"""HAM""","""GO UNTIL JURON..."
"""HAM""","""OK LAR... JOKI..."
"""SPAM""","""FREE ENTRY IN ..."
"""HAM""","""U DUN SAY SO E..."
"""HAM""","""NAH I DON'T TH..."
"""SPAM""","""FREEMSG HEY TH..."
"""HAM""","""EVEN MY BROTHE..."
"""HAM""","""AS PER YOUR RE..."
"""SPAM""","""WINNER!! AS A ..."
"""SPAM""","""HAD YOUR MOBIL..."


### Removing punctuation

Now, we want to remove all punctuation from our strings.

To do this, we will use the `replace_all()` method to replace every match of our regular expression pattern with `""`, effectively deleting it. Our pattern is a negated character class: it matches any run of characters that are not lowercase letters (`a-z`, since the dataset is now all lowercase), digits (`0-9`) or whitespace (`\s`). Everything it matches is treated as punctuation and removed.

> You can find out more about regular expressions [here](https://learn.microsoft.com/en-us/previous-versions/visualstudio/visual-studio-2010/ae5bf541(v=vs.100)?redirectedfrom=MSDN).
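The same pattern can be tried locally with Python's `re` module before sending it to the server. This is only a sketch: the helper name `strip_punctuation` is hypothetical, only the first sample message comes from the dataset, and the server's regex engine may differ from Python's in edge cases.

```python
import re

# Matches any run of characters that are NOT a lowercase letter, a digit or whitespace.
NOT_ALLOWED = re.compile(r"[^a-z0-9\s]+")


def strip_punctuation(text: str) -> str:
    """Delete every run of characters outside a-z, 0-9 and whitespace."""
    return NOT_ALLOWED.sub("", text)


# The first message is the lowercased second row of the dataset; the others are hypothetical.
messages = ["ok lar... joking wif u oni...", "winner!! call 0800 now", "i don't know"]
cleaned = [strip_punctuation(m) for m in messages]

print(cleaned)
# ['ok lar joking wif u oni', 'winner call 0800 now', 'i dont know']
```

Because the character class only allows lowercase letters, uppercase letters would be deleted too, which is why the lowercasing step comes first.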

In [10]:
# raw string, so that \s reaches the regex engine as the whitespace class
rdf = rdf.replace_all(pattern=r"[^a-z0-9\s]+", to="").collect()
rdf.fetch()

label,text
str,str
"""ham""","""go until juron..."
"""ham""","""ok lar joking ..."
"""spam""","""free entry in ..."
"""ham""","""u dun say so e..."
"""ham""","""nah i dont thi..."
"""spam""","""freemsg hey th..."
"""ham""","""even my brothe..."
"""ham""","""as per your re..."
"""spam""","""winner as a va..."
"""spam""","""had your mobil..."


### Matching words in a fuzzy manner

Sometimes, patterns and comparisons aren't so easy to make. That's when fuzzy string matching comes into play. It is the process of finding strings that approximately match a pattern, scoring each comparison with a value between 0 and 1. Where a boolean is either true or false, fuzzy logic reasons that a value can be somewhat true, somewhat false, completely true, etc.

To explore this method, you can use our `fuzzy_match()` function. It looks for anything similar to the `pattern` provided in the selected columns (`cols`) and keeps only the text where it finds a near-match.

The cells containing words that are near-matches to the pattern are returned in full. Cells that do not contain fuzzy matches of the pattern will be shown as `null`.
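To get a feel for similarity scores between 0 and 1, here is a local sketch built on Python's standard `difflib.SequenceMatcher`. This tutorial does not say which similarity measure `fuzzy_match()` uses, so this is a stand-in and not BastionLab's algorithm. The helper names, the `0.8` threshold and the sample messages are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def best_word_score(pattern: str, text: str) -> float:
    """Highest similarity (0.0 to 1.0) between the pattern and any word of the text."""
    return max(
        (SequenceMatcher(None, pattern, word).ratio() for word in text.split()),
        default=0.0,
    )


def keep_fuzzy_matches(
    texts: List[str], pattern: str, threshold: float = 0.8
) -> List[Optional[str]]:
    """Return each text unchanged if some word is close enough to the pattern, else None."""
    return [t if best_word_score(pattern, t) >= threshold else None for t in texts]


# Hypothetical messages, not taken from the dataset.
texts = ["free entry today", "fre gift inside", "ok lar"]

print(keep_fuzzy_matches(texts, "free"))
# ['free entry today', 'fre gift inside', None]
```

The misspelt `fre` still clears the threshold, while a message with no similar word is replaced by `None`, mirroring the `null` cells in the output of the remote call.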

In [15]:
rdf.fuzzy_match(pattern="free", cols=["text"]).collect().fetch()

label,text
str,str
"""ham""","""go until juron..."
"""ham""",
"""spam""","""free entry in ..."
"""ham""",
"""ham""","""nah i dont thi..."
"""spam""","""freemsg hey th..."
"""ham""",
"""ham""","""as per your re..."
"""spam""",
"""spam""","""had your mobil..."


### Finding the frequency of words


We can search for all occurrences matching a specific pattern with `findall()`. 

Here we'll search for `free`, using the `(?i)` regex flag to match it regardless of casing.

We store this in a DataFrame where the `text` column now contains lists with each found match of `free`. A sentence that does not contain `free` becomes an empty list `[]`, and an entry containing `free` twice becomes `["free", "free"]`.

We can then get the total number of matches by using the `lengths()` and `sum()` methods. It will convert our lists to a number representing the length of the list and then sum up all these lengths to give us our total frequency.
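The same three steps (find all matches, take the length of each list, sum the lengths) can be sketched locally with Python's `re` module. The messages below are hypothetical and chosen to include a repeated match and a match inside a longer word.

```python
import re

# Hypothetical messages, not taken from the dataset.
messages = ["Free entry, FREE prize", "ok lar", "freemsg hey there"]

# One list of matches per message; (?i) makes the pattern case-insensitive.
matches = [re.findall(r"(?i)free", m) for m in messages]
print(matches)  # [['Free', 'FREE'], [], ['free']]

# The length of each list is the number of matches in that message.
lengths = [len(found) for found in matches]
print(lengths)  # [2, 0, 1]

# Summing the lengths gives the total number of occurrences.
total = sum(lengths)
print(total)  # 3
```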

In [28]:
# save results of findall in DataFrame
df = rdf.findall(pattern="(?i)free").collect().fetch()

# Get number of matches for each sentence by using the arr.lengths(), then use sum to get the total matches
df.select(pl.col("text").arr.lengths()).sum()

text
u32
327


We can see from the results above that the pattern `free` occurs `327` times in our `RemoteLazyFrame`. Note that this counts every match of the pattern, including matches inside longer words such as `freemsg`.

### Frequency of rows containing a word

Another way of counting occurrences is `contains()`. Combined with `sum()`, it gives the total number of rows that contain a word at least once.

`contains()` will fill our `text` column with true (`1`) where a match for our word is found and false (`0`) where no matches are found. We can use `sum` directly on this data to get the total number of rows which contained matches.
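The difference between counting rows and counting occurrences is easy to check locally. In this sketch, which uses Python's `re` module and hypothetical messages, a message that contains the pattern twice still counts as a single row.

```python
import re

# Hypothetical messages, not taken from the dataset.
messages = ["Free entry, FREE prize", "ok lar", "freemsg hey there"]

FREE = re.compile(r"(?i)free")

# One boolean per message: does the pattern appear at least once?
contains_free = [bool(FREE.search(m)) for m in messages]
print(contains_free)  # [True, False, True]

# True counts as 1 and False as 0, so sum() gives the number of matching rows.
rows_with_match = sum(contains_free)
print(rows_with_match)  # 2

# For comparison, the total number of occurrences across all messages.
total_occurrences = sum(len(FREE.findall(m)) for m in messages)
print(total_occurrences)  # 3
```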

In [30]:
rdf.contains("(?i)free").sum().collect().fetch()

label,text
u32,u32
0,249


### Tokenizing our sentences

Finally, we can tokenize our sentences into words, so they can easily be transformed for AI training tasks (here, NLP with the spam dataset).

We will use the `split()` method to do this, which will transform our messages into lists of words. The separator (here, a single space) is dropped from the results.
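One detail worth checking when tokenizing on a single space is what happens with consecutive spaces, which can appear once punctuation has been deleted. The sketch below shows how Python's own `str.split` behaves on a hypothetical message. It is a local illustration only: this tutorial does not state how the remote `split()` treats repeated separators.

```python
# Hypothetical cleaned message with a double space left behind by a removed character.
message = "ok lar  joking wif u oni"

# Splitting on a single space keeps an empty token where two spaces were adjacent.
tokens_single_space = message.split(" ")
print(tokens_single_space)  # ['ok', 'lar', '', 'joking', 'wif', 'u', 'oni']

# Splitting with no separator collapses any run of whitespace.
tokens_any_whitespace = message.split()
print(tokens_any_whitespace)  # ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
```

If empty tokens show up in your own results, filtering them out before training is a simple fix.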

In [None]:
rdf = rdf.split(sep=" ", cols=["text"]).collect()
rdf.fetch()

label,text
str,list[str]
"""ham""","[""go"", ""until"", ... ""wat""]"
"""ham""","[""ok"", ""lar"", ... ""oni""]"
"""spam""","[""free"", ""entry"", ... ""08452810075over18s""]"
"""ham""","[""u"", ""dun"", ... ""say""]"
"""ham""","[""nah"", ""i"", ... ""though""]"
"""spam""","[""freemsg"", ""hey"", ... ""rcv""]"
"""ham""","[""even"", ""my"", ... ""patent""]"
"""ham""","[""as"", ""per"", ... ""callertune""]"
"""spam""","[""winner"", ""as"", ... ""only""]"
"""spam""","[""had"", ""your"", ... ""08002986030""]"


We're good to go! Let's now close the connection and shut down the server.

In [None]:
connection.close()
bastionlab_server.stop(srv)