## SQL queries
__________________________________________________

SQL is still the most important query language in data science, with around 70% of data scientists using it in their work. That is why we wanted BastionLab to accept SQL queries: it lets us combine the familiarity of SQL with the security guarantees of BastionLab.

In the following notebook tutorial, we will show you how to **run basic SQL queries on RemoteLazyFrames**. 

But before we go any further, let's get everything set up! If you already know how to do this from a previous tutorial or our [quick tour](../quick-tour/quick-tour.ipynb), feel free to skip ahead to the SQL queries section!

## Downloading the dataset

For this tutorial, we will use the Titanic dataset, a dataset on passengers of the Titanic which is often used in Machine Learning demos and tests.

You can download the dataset by running the following code block:

In [None]:
!wget 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'

*Alternatively, you can get it from the source with a free user account at https://www.kaggle.com/competitions/titanic/data.*

## Pre-requisites
___________________________________________

### Technical requirements

To start this tutorial, ensure the following are already installed on your system:
- Python 3.7 or greater (get the latest version of Python at https://www.python.org/downloads/ or with your operating system’s package manager)
- [pip](https://pypi.org/project/pip/), the Python package manager

### Pip packages

To run this notebook, you will also need to install the `polars`, `bastionlab` and `bastionlab_server` packages by running the following code block.

In [30]:
!pip install polars bastionlab bastionlab_server

>*Note that the bastionlab_server package we install here was created for testing purposes. You can alternatively install the BastionLab server using our Docker image or from source. Check out our [Installation Tutorial](../getting-started/installation.md) for more details.*

### Launching the server

The next step is to launch the server. The server exposes port `50056` for gRPC communication with clients and uses a default configuration (no authentication, default settings). These settings are sufficient for this tutorial, so we won't change them. To launch the server, we call the `start()` function of the `bastionlab_server` package.

In [31]:
import bastionlab_server

srv = bastionlab_server.start()

### Importing our Titanic csv file

To upload our Titanic dataset to BastionLab, we first need to load it into a Polars DataFrame using Polars' `read_csv()` function.

In [32]:
import polars as pl

df = pl.read_csv("titanic.csv")

We also need to establish a connection with the BastionLab server by creating an instance of `Connection()` and supplying the constructor with the host and port of our server instance.

>*In a typical workflow, the data owner would send a set of keys to the server, so that authorization can be required for all users at the point of connection. **BastionLab offers the authorization feature**, but we will not use it in this tutorial as we want to dive into SQL queries as quickly as possible. To learn how to set up authentication, you can refer to the [authentication tutorial](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/authentication/).*

In [33]:
from bastionlab import Connection

connection = Connection("localhost", 50056)

### Creating our privacy policy

Before we upload our Polars DataFrame, we can create a custom privacy policy that defines what counts as safe or unsafe and how unsafe actions should be handled.

Here we create a policy that logs any query that does not aggregate at least 10 rows.

In [34]:
from bastionlab.polars.policy import Policy, Aggregation, Log

policy = Policy(
    safe_zone=Aggregation(min_agg_size=10), unsafe_handling=Log(), savable=False
)

Finally, we can upload our DataFrame to BastionLab using the `send_df()` method of the client's `polars` interface, passing it our custom policy. `send_df()` can also take a list of columns to be sanitized (*meaning set to null*) if retrieved by the data scientist, but we do not use that option here.

In [35]:
rdf = connection.client.polars.send_df(df, policy=policy)

The server returns a `RemoteLazyFrame` which we will be working with throughout the rest of this tutorial!

## SQL queries
______________________________________________

SQL queries in BastionLab are run with the `sql()` method of `RemoteLazyFrame`.

`sql()` takes two arguments:
- `query`: a string containing your query,
- `rdfs`: your RemoteLazyFrame(s), provided as `*args`.

### Selects

Let's start by looking at an example of how to select columns.

We first create our query string. Instead of naming a table after the `from` keyword, we leave a placeholder `{}`. We then pass the RemoteLazyFrame that should take the place of this placeholder as the next argument to the `sql` function.

Note that you can use upper or lower case for the keywords in your SQL queries.
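
The way placeholders line up with arguments can be sketched in plain Python. This is only an illustration of the positional idea, not BastionLab's actual implementation: `fill_placeholders` and the table name `titanic` are hypothetical, and the real `sql()` receives RemoteLazyFrames rather than strings.

```python
def fill_placeholders(query: str, *table_names: str) -> str:
    """Fill each `{}` in `query` with the name at the same position.

    Hypothetical helper for illustration only; it is not part of BastionLab.
    """
    expected = query.count("{}")
    if expected != len(table_names):
        raise ValueError(
            f"query has {expected} placeholder(s) "
            f"but {len(table_names)} name(s) were given"
        )
    return query.format(*table_names)


# One placeholder, one name: the first `{}` receives the first argument.
single = fill_placeholders("select Name, Sex, Age from {} limit 3", "titanic")
print(single)
```

The check on the placeholder count is a design choice of this sketch: a mismatch between placeholders and arguments is reported up front instead of surfacing later as a malformed query.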

In [36]:
# select the Name, Sex and Age columns, limit output to 3 rows
from bastionlab.polars import RemoteLazyFrame

q = "select Name, Sex, Age from {} limit 3"
RemoteLazyFrame.sql(q, rdf).collect().fetch()

Reason: Cannot fetch a result DataFrame that does not aggregate at least 10 rows of DataFrame 80dcbafe-6864-4958-9293-5af6dd493b1c.

This incident will be reported to the data owner.


Name,Sex,Age
str,str,f64
"""Braund, Mr. Ow...","""male""",22.0
"""Cumings, Mrs. ...","""female""",38.0
"""Heikkinen, Mis...","""female""",26.0


### Select with where 

We can also add a `WHERE` clause to our query. Note how we can use `IS NOT NULL` to filter out rows where a column is null.

In [37]:
q = "SELECT * FROM {} WHERE Age BETWEEN 10 AND 18 AND Embarked = 'S' AND Cabin IS NOT NULL"
RemoteLazyFrame.sql(q, rdf).collect().fetch()

Reason: Cannot fetch a result DataFrame that does not aggregate at least 10 rows of DataFrame 80dcbafe-6864-4958-9293-5af6dd493b1c.

This incident will be reported to the data owner.


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
436,1,1,"""Carter, Miss. ...","""female""",14.0,1,2,"""113760""",120.0,"""B96 B98""","""S"""
505,1,1,"""Maioni, Miss. ...","""female""",16.0,0,0,"""110152""",86.5,"""B79""","""S"""
690,1,1,"""Madill, Miss. ...","""female""",15.0,0,1,"""24160""",211.3375,"""B5""","""S"""
782,1,1,"""Dick, Mrs. Alb...","""female""",17.0,1,0,"""17474""",57.0,"""B20""","""S"""
803,1,1,"""Carter, Master...","""male""",11.0,1,2,"""113760""",120.0,"""B96 B98""","""S"""
854,1,1,"""Lines, Miss. M...","""female""",16.0,0,1,"""PC 17592""",39.4,"""D28""","""S"""
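
The reason for writing `IS NOT NULL` rather than a comparison such as `= NULL` comes from standard SQL semantics: any comparison with NULL is unknown, so the row is filtered out. The sketch below shows this with Python's built-in `sqlite3` module on a few made-up rows; it does not use BastionLab, and it assumes BastionLab's SQL engine follows the same standard behaviour.

```python
import sqlite3

# In-memory database with made-up rows (not the real Titanic data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passengers (Name TEXT, Age REAL, Cabin TEXT)")
conn.executemany(
    "INSERT INTO passengers VALUES (?, ?, ?)",
    [
        ("A", 14.0, "B96"),  # matches: age in range, cabin known
        ("B", 16.0, None),   # age in range, but cabin is NULL
        ("C", 40.0, "C85"),  # cabin known, but age out of range
        ("D", None, "D28"),  # NULL age never satisfies BETWEEN
    ],
)

with_cabin = conn.execute(
    "SELECT Name FROM passengers "
    "WHERE Age BETWEEN 10 AND 18 AND Cabin IS NOT NULL ORDER BY Name"
).fetchall()

# Comparing with NULL using `=` is never true, so no row is returned.
equals_null = conn.execute(
    "SELECT Name FROM passengers WHERE Cabin = NULL"
).fetchall()

print(with_cabin, equals_null)
```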


### Aggregated queries

Let's now look at a few examples. First, we get the ages of the oldest and youngest passengers on the Titanic by using `Max` and `Min` in our query. We also get the average age of passengers using `Avg`.

In [38]:
q = "SELECT Max(Age) AS Oldest, Min(Age) AS Youngest, Avg(Age) AS Average FROM {}"
RemoteLazyFrame.sql(q, rdf).collect().fetch()

Oldest,Youngest,Average
f64,f64,f64
80.0,0.42,29.699118


Note that although 0.42 may seem a strange result to get for an age query, there is indeed a 0.42 entry in the Titanic dataset if we check!
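
The `Age` column of the Titanic dataset also contains missing values. In standard SQL, aggregate functions such as `MAX`, `MIN` and `AVG` skip NULLs, so the average is taken over the known ages only. Here is a small sketch of that behaviour using Python's built-in `sqlite3` module and made-up ages; it is independent of BastionLab and assumes the engine behind `sql()` follows the same convention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passengers (Age REAL)")
# Four made-up rows, one of which has no recorded age.
conn.executemany(
    "INSERT INTO passengers VALUES (?)",
    [(80.0,), (0.42,), (None,), (30.0,)],
)

oldest, youngest, average, n_rows, n_ages = conn.execute(
    "SELECT MAX(Age), MIN(Age), AVG(Age), COUNT(*), COUNT(Age) FROM passengers"
).fetchone()

# COUNT(*) counts every row, COUNT(Age) only the rows with a known age,
# and AVG(Age) divides by the latter.
print(oldest, youngest, average, n_rows, n_ages)
```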

### Group by
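Next up, we use `GROUP BY` to count the passengers in each class (1, 2 and 3). Because `count("Survived")` counts the rows that have a `Survived` value rather than adding those values up, the result is the number of passengers per class, not the number of survivors.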

In [39]:
q = 'SELECT Pclass, count("Survived") FROM {} GROUP BY Pclass ORDER BY Pclass'
RemoteLazyFrame.sql(q, rdf).collect().fetch()

Pclass,Survived
i64,u32
1,216
2,184
3,491
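
To make the difference between counting and summing concrete, here is a sketch using Python's built-in `sqlite3` module on made-up rows. It is independent of BastionLab; whether `SUM` is accepted by BastionLab's `sql()` is not confirmed in this tutorial, so treat it as an assumption to check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passengers (Pclass INTEGER, Survived INTEGER)")
# Made-up rows: Survived is 1 for a survivor and 0 otherwise.
conn.executemany(
    "INSERT INTO passengers VALUES (?, ?)",
    [(1, 1), (1, 0), (1, 1), (2, 0), (2, 1), (3, 0), (3, 0)],
)

# COUNT(Survived) counts rows with a non-null value (passengers per class);
# SUM(Survived) adds up the 0/1 flags (survivors per class).
per_class = conn.execute(
    "SELECT Pclass, COUNT(Survived), SUM(Survived) "
    "FROM passengers GROUP BY Pclass ORDER BY Pclass"
).fetchall()

print(per_class)
```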


### Joins

Finally, let's take a look at an example of a join. We will do an inner join to combine `rdf1`, which has `Element` and `Melting Point (K)` columns, with `rdf2`, which has `Element` and `Id` columns.

Notice that each RemoteLazyFrame appears twice in the arguments to `sql()`: the query contains four placeholders, and each one needs its own argument, in order.
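
The ordering of the arguments can be sketched outside BastionLab with plain string formatting and Python's built-in `sqlite3` module. The table names `melting` and `ids` are hypothetical stand-ins for the two RemoteLazyFrames, and `str.format` only mimics the positional substitution; it is not how BastionLab is implemented.

```python
import sqlite3

query = "SELECT * FROM {} INNER JOIN {} ON {}.Element = {}.Element"
# Four placeholders, so four names: each table is passed twice, in order.
filled = query.format("melting", "ids", "melting", "ids")

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE melting (Element TEXT, "Melting Point (K)" REAL)')
conn.execute("CREATE TABLE ids (Element TEXT, Id INTEGER)")
conn.executemany(
    "INSERT INTO melting VALUES (?, ?)",
    [("Silver", 1234.93), ("Gold", 1337.33), ("Iron", 1811.0)],
)
conn.executemany("INSERT INTO ids VALUES (?, ?)", [("Silver", 1), ("Gold", 2)])

# Sort in Python so the printed order does not depend on the engine.
joined = sorted(conn.execute(filled).fetchall())

print(filled)
print(joined)
```

Iron has no match in `ids`, so the inner join drops it. Note that SQLite keeps both `Element` columns for `SELECT *`, so its rows have four values rather than the three columns BastionLab returns.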

In [40]:
df1 = pl.DataFrame(
    {
        "Element": ["Silver", "Gold"],
        "Melting Point (K)": [1234.93, 1337.33],
    }
)

df2 = pl.DataFrame({"Element": ["Silver", "Gold"], "Id": [1, 2]})

rdf1 = connection.client.polars.send_df(df1, policy=policy)
rdf2 = connection.client.polars.send_df(df2, policy=policy)

test = RemoteLazyFrame.sql(
    "SELECT * FROM {} inner join {} ON {}.Element = {}.Element", rdf1, rdf2, rdf1, rdf2
)
test.collect().fetch()

Reason: Cannot fetch a result DataFrame that does not aggregate at least 10 rows of DataFrame de896370-91d4-4f10-aeae-f651627d7f80.

This incident will be reported to the data owner.


Element,Melting Point (K),Id
str,f64,i64
"""Silver""",1234.93,1
"""Gold""",1337.33,2


That brings our tutorial on SQL queries to an end. We have learnt how to select data, how to filter it using `WHERE`, how to use `GROUP BY` and aggregate functions, and how to do joins using BastionLab's `sql()` functionality.

However, please be aware that not all SQL functionality works with BastionLab. Sometimes this is for security reasons, and other times it is because a feature is not yet implemented in the underlying polars-sql package. For example, `DELETE`, `UPDATE`, `CASE`, `INSERT`, `FULL OUTER JOIN` and `RIGHT JOIN` are not currently possible.

Let's now close the connection and shut down the server.

In [None]:
connection.close()

In [None]:
bastionlab_server.stop(srv)