<div id="colab_button">
  <h1>Introduction to Machine Learning Algorithms with BastionLab</h1>
  <a target="_blank" href="https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.7/docs/docs/how-to-guides/introduction_to_ml_algorithms_with_bastionlab.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>
______________________________________________________

Machine learning is a subfield of artificial intelligence that focuses on the development of algorithms that can make predictions or decisions based on data. ML allows us to harness the growing quantity of data around us to create smart and powerful AI applications, such as improving the accuracy of the diagnosis of illnesses or detecting cyber attacks.

Working with machine learning algorithms is a key part of BastionLab's offering, allowing users to train and deploy models using popular machine learning algorithms. What sets BastionLab apart from other software that facilitates ML training and deployment is its privacy features, which allow data owners to keep their datasets safe throughout the ML pipeline.

In this Jupyter notebook, we will explore four popular machine learning algorithms, their applications, and how to implement them using BastionLab.

The four algorithms we will cover are:

1. Linear Regression - for predicting continuous values
2. Gaussian Naive Bayes - for classification problems with continuous features
3. Logistic Regression - for binary and multi-class classification problems
4. Decision Trees - for classification and regression problems

We will be using various datasets to illustrate the applications of these algorithms.

So without further ado, let's start exploring these machine learning algorithms!

## Table of Contents
----------
1. [Pre-requisites](#pre-requisites)
2. [Setting up and connecting to server](#Setting-up-and-connecting-to-server)
3. [Linear Regression](#linear-regression)
4. [Gaussian Naive Bayes](#gaussian-naive-bayes)
5. [Logistic Regression](#logistic-regression)
    - [Binomial Logistic Regression](#binomial-logistic-regression)
    - [Multinomial Logistic Regression](#multinomial-logistic-regression)
6. [Decision Trees](#decision-trees)


## Pre-requisites

________________________________________________
### Installation

In order to run this notebook, we need to:
- Have [Python 3.7](https://www.python.org/downloads/) (or greater) and [Python Pip](https://pypi.org/project/pip/) installed
- Install [BastionLab](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/)
- Install [NumPy](https://pypi.org/project/numpy)
- Install [scikit-learn](https://pypi.org/project/scikit-learn)

We can install these packages, together with the `bastionlab_server` test package, by running the code block below.

>You can see our [Installation page](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/) to find the installation method that best suits your needs.

In [None]:
!pip install bastionlab
!pip install bastionlab_server
!pip install numpy
!pip install scikit-learn

## Setting up and connecting to server
---------------------------------------------------------------

### Launching the server

First things first: we need to launch the BastionLab server. 

In production we recommend this is done using our Docker image, but for testing purposes you can use our `bastionlab_server` package, which removes the need for user authentication.

In [None]:
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()

>*For more details on how you can set up the server using our Docker image, check out our [Installation Tutorial](../getting-started/installation.md).*

### Connecting to the server
Next, we will connect to the server using the following snippet of code.

In [11]:
from bastionlab import Connection

# connect to server instance
connection = Connection("localhost")

<div class="admonition warning">
    Please note that the connection created above is insecure. You can refer to the <a href="https://github.com/mithril-security/bastionlab/blob/master/docs/docs/tutorials/authentication.ipynb">Authentication tutorial</a> to learn about setting up secure connections to the BastionLab server.
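</div>

We also define the privacy policy that we will attach to the datasets we upload in the later examples.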

In [6]:
from bastionlab.polars.policy import Policy, TrueRule, Log

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=False)

> To keep this overview short and simple, we will use weak but reasonable guarantees. If you're interested in setting up stricter policies, you are encouraged to have a look at our [Privacy policy tutorial](https://github.com/mithril-security/bastionlab/blob/master/docs/docs/tutorials/policy.ipynb).

## Linear Regression
------------------------------------------------------------------

### Overview
Linear Regression is a supervised learning algorithm used for predicting a continuous outcome variable (also known as the response variable) based on one or more predictor variables. In other words, it is a technique for modeling the relationship between a dependent variable (Y) and one or more independent variables (X).

The goal of Linear Regression is to find the best-fit line or hyperplane that describes the relationship between the independent and dependent variables. This line or hyperplane is defined by a set of coefficients that determine the slope and intercept of the line or hyperplane.
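
As a rough illustration of what "best-fit line" means, the sketch below computes the least-squares slope and intercept for a single predictor in plain Python. The function name `fit_line` and the toy data are made up for this example; it only illustrates the idea and is not BastionLab's implementation.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data generated from y = 2x + 1, so the fit should recover these values.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With several predictors, the same idea yields one coefficient per feature plus an intercept, which together define the hyperplane.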

### Applications

Linear Regression is widely used across various fields such as finance, economics, healthcare, and social sciences. Some of its applications include:

* Predicting stock prices
* Forecasting sales revenue
* Estimating the impact of a marketing campaign on sales
* Predicting the price of a house based on its features such as size, location, etc.

### Walk-through

Let's now use Linear Regression in BastionLab to predict housing prices.


**Step 1: Importing Libraries**

The first step is to import the required libraries. We will be using Polars for data manipulation, the `train_test_split` function from BastionLab Polars for splitting the data, the BastionLab Linfa models submodule for building the Linear Regression model and the BastionLab Linfa metrics submodule for validation.

In [12]:
import polars as pl
from bastionlab.linfa.models import LinearRegression
from bastionlab.linfa.metrics import mean_squared_error
from bastionlab.polars import train_test_split

**Step 2: Loading Data**

The scikit-learn dependency we previously installed contains built-in datasets that we can load using the sklearn datasets submodule. We will load the California Housing dataset, a popular dataset used in machine learning and statistics. It contains data derived from the 1990 U.S. census, with one row per census block group in California. The dataset contains 20,640 observations and 8 numeric features, such as the median income, the median house age and the average house occupancy. The target is the median house value of each block group.

In [None]:
from sklearn import datasets
import pandas as pd

# load dataset from sklearn
data = datasets.fetch_california_housing(as_frame=True)

The `fetch_california_housing` function returns a dictionary-like object with both the feature matrix `X` (the input data) and target vector `Y` (the output data/labels), as well as some additional metadata about the dataset. We set the option `as_frame` to True to get our data as Pandas objects. 

In order to send this data to BastionLab, we need the feature matrix and target data as Polars DataFrames. We can convert the data by passing our Pandas objects to the Polars DataFrame constructor. Our feature matrix is a Pandas DataFrame and is converted to a Polars DataFrame without any further intervention. Our target vector, however, is a Pandas Series, which is not accepted by the Polars DataFrame constructor. We must therefore first use the Pandas `to_numpy()` method to convert the Series into a NumPy array.

In [60]:
# Convert inputs from Pandas DataFrame into Polars DataFrame
inputs = pl.DataFrame(data["data"])

# Convert target into Polars DataFrame from Numpy Ndarray
target = pl.DataFrame(data["target"].to_numpy())

We are now ready to send our data to BastionLab. We do this by using BastionLab's `send_df` method and supplying it with our inputs and targets Polars dataframes.

In [61]:
# upload our inputs and target data
remote_inputs = connection.client.polars.send_df(inputs)
remote_target = connection.client.polars.send_df(target)

The server returns a `FetchableLazyFrame` for both the inputs and target dataframes. These are references to the remote DataFrames, which can be used as if they were locally available.

**Step 3: Data Preparation**

Before building the model, we need to clean and prepare the data. 

The data cleaning and preparation involved will depend on the dataset in question, but can include splitting the data into training and testing sets, scaling the data, and handling missing values.

We will start by converting our remote datasets into remote arrays using BastionLab's `to_array` method.

In [16]:
remote_X = remote_inputs.to_array()
remote_Y = remote_target.to_array()

<div class="admonition important">
    To learn more about this method, check out our <a href="#">data conversion tutorial</a>.
</div>

Now that we have converted our remote dataframes into remote arrays, we can split our data into training and testing sets using BastionLab's `train_test_split` method, which is similar to sklearn's method of the same name.

In [50]:
# split data into training and testing sets in BastionLab
X_train, X_test, y_train, y_test = train_test_split(
    remote_X,
    remote_Y,
    test_size=0.3,
    shuffle=True,
)

By setting the `test_size` to 0.3, we allocate 30% of our dataset for validation, to check that our model fits our data correctly.

By setting `shuffle` to True, we split the data randomly. Shuffling prevents a non-random assignment of rows to the training and testing sets, for example when the rows of the dataset are stored in a meaningful order.

The `train_test_split` method returns our training and testing X and Y data (input and target data) as remote arrays.
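
To make the effect of `test_size` and `shuffle` more concrete, here is a minimal local sketch in plain Python. The helper `simple_train_test_split` and its `seed` parameter are hypothetical names used only for illustration, and the rounding rule for the test set size is an assumption. This is not how BastionLab implements its remote `train_test_split`.

```python
import random


def simple_train_test_split(X, y, test_size=0.3, shuffle=True, seed=None):
    """Split paired rows of X and y into train and test parts (hypothetical helper)."""
    indices = list(range(len(X)))
    if shuffle:
        # Shuffling the row order makes the assignment to each set random.
        random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_size)  # rounding rule is an assumption
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Ten toy rows: each label is 100 times the first feature of its row.
X = [[i, i * 10] for i in range(10)]
y = [i * 100 for i in range(10)]
X_train, X_test, y_train, y_test = simple_train_test_split(X, y, test_size=0.3, seed=0)
print(len(X_train), len(X_test))  # 7 3
```

Because the same shuffled indices are applied to both `X` and `y`, every input row stays paired with its own label after the split.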

**Step 4: Training and testing the Model**

We can now build and run the Linear Regression model in BastionLab.

In [51]:
# Creating the model object
lr = LinearRegression()

# Training the model
lr.fit(X_train, y_train)

# Predicting the target values for the test set
y_pred = lr.predict(X_test)

We use the `LinearRegression()` constructor to get an instance of the linear regression model.

We then use the `fit` method with our training datasets to train the model.

Finally, we use the `predict` method to predict the target values for our test input data. Target values are returned to us as a `FetchableLazyFrame`.

**Step 5: Model Evaluation**

We can evaluate our model using various different metrics, with the following methods available in BastionLab: SimpleValidationRequest, R2Score, MeanAbsoluteError, MeanSquaredError, MeanSquaredLogError, MedianAbsoluteError, MaxError, ExplainedVariance, Accuracy, F1Score, Mcc, ClassificationMetric and RegressionMetric.

In this example, we will use the Mean Squared Error (MSE) metric to evaluate our model. MSE is one of the most popular evaluation metrics for linear regression models.

To evaluate our test data using MSE, we use BastionLab's `mean_squared_error`, supplying it with the original target data and the values predicted by our model. We need to convert our `y_pred` `FetchableLazyFrame` into a remote array with `to_array()` for this to work.

In [52]:
mse = mean_squared_error(y_test, y_pred.to_array())
mse

FetchableLazyFrame(identifier=e94f6468-2263-41ea-be60-9d57eef12ee3)

The function returns our MSE score as a `FetchableLazyFrame`. To view the results, we need to run `fetch` on this FetchableLazyFrame.

In [53]:
mse.fetch()

shape: (1, 1)
┌────────────────────┐
│ mean_squared_error │
│ ---                │
│ f64                │
╞════════════════════╡
│ 0.539386           │
└────────────────────┘


We get an MSE score of around 0.5. With MSE scores, the closer to 0 the better the model. A perfect model would return an MSE of 0.
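
For reference, MSE is the average of the squared differences between the true and predicted values. The following local sketch shows the calculation on a few made-up numbers. The helper name `local_mse` is hypothetical and is not part of BastionLab.

```python
def local_mse(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Squared errors are 1, 0 and 4, so the mean is 5 / 3.
example_mse = local_mse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(example_mse)
```

Since the errors are squared, large mistakes weigh more heavily than small ones, and the score is expressed in squared units of the target.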

### Conclusion

We have seen how to upload training and testing data to BastionLab, prepare the data and use it to train and evaluate a Linear Regression model.

Let's now move on to the next ML algorithm we are going to explore today, Gaussian Naive Bayes.

## Gaussian Naive Bayes
------------------------------------------------------------------------

### Overview

Gaussian Naive Bayes is a probabilistic algorithm used in machine learning for classification tasks. It is based on Bayes' theorem and assumes that the features are independent of each other given the class, which means that, within a class, the value of one feature does not affect the probability of the value of another. The "Gaussian" part refers to the further assumption that each continuous feature follows a normal distribution within each class.

Gaussian Naive Bayes is particularly useful when the number of features is large and the number of training examples is small. It is a simple yet effective algorithm that has been successfully used in many applications, including text classification, spam filtering, and image recognition.
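
The following plain-Python sketch illustrates the idea on a tiny made-up dataset: it estimates a prior, a mean and a variance per class and feature, then picks the class with the highest log-posterior. The function names and the variance smoothing constant are assumptions chosen for this illustration, and this is not the implementation used by BastionLab.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means and feature variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant avoids division by zero for constant features.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior for one sample."""
    def log_posterior(label):
        prior, means, variances = model[label]
        # Independence assumption: per-feature log-likelihoods are summed.
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(row, means, variances)
        )
        return math.log(prior) + log_likelihood
    return max(model, key=log_posterior)


# Two well-separated toy classes with two features each.
X = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [5.0, 6.1], [5.2, 5.9], [4.8, 6.0]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 2.0]), predict_gaussian_nb(model, [5.1, 6.0]))
```

Working with log-probabilities rather than raw probabilities avoids numerical underflow when many features are multiplied together.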

### Walk-through

Let's now take a look at how we can create an ML model to correctly identify types of iris plants using the Gaussian Naive Bayes ML algorithm.



**Step 1: Importing Libraries**

Just as in the previous example, we first need to import all the required libraries. We will again be using NumPy and Polars for data manipulation, and the `train_test_split` function from `bastionlab.polars`. We will also need the `GaussianNB` model and the `metrics` submodule from BastionLab Linfa.


In [None]:
import numpy as np
import polars as pl
from bastionlab.linfa.models import GaussianNB
from bastionlab.linfa import metrics
from bastionlab.polars import train_test_split

**Step 2: Data Loading**

For this example, we will use the Iris dataset, which is a popular dataset used for classification tasks. It contains 150 samples of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. The dataset is included in the scikit-learn library, which can be loaded using the `load_iris()` method:

In [None]:
from sklearn.datasets import load_iris
import numpy as np

# load iris dataset
iris = load_iris()

data = iris.data
target = iris.target

The `load_iris()` method returns an object with `data` and `target` attributes, which contain the features matrix and the target vector as NumPy ndarrays. The returned object also contains some additional metadata attributes.

**Step 3: Uploading to BastionLab**

Now we are ready to upload the input and target datasets to BastionLab. But first, we need to convert these NumPy ndarrays into Polars DataFrames. We do this by passing our NumPy ndarray objects to the Polars DataFrame constructor and saving the returned Polars DataFrames.

In [None]:
# Convert input data to a Polars DataFrame
data = pl.DataFrame(data)

# Convert target data to a Polars DataFrame
target = pl.DataFrame(target)

Now, we are ready to upload the data and target dataframes to BastionLab using BastionLab's `send_df` method.

In [None]:
# upload data dataset
remote_data = connection.client.polars.send_df(data, policy)

# upload target dataset
remote_target = connection.client.polars.send_df(target, policy)

We now have a `FetchableLazyFrame` for both the input and target data, which we can work with in BastionLab.

**Step 4: Preprocessing**

In order to use further pre-processing and training functions in BastionLab, we must first convert our `FetchableLazyFrames` into `RemoteArrays` using the `to_array()` method.

In [None]:
# convert data FetchableLazyFrame to remote array
remote_data = remote_data.to_array()

# convert target FetchableLazyFrame to remote array
remote_target = remote_target.to_array()

Before training our model, we need to split our data into training and testing sets. We will use the `train_test_split()` function from BastionLab Polars to do this.

In [None]:
# get testing and training X and Y arrays
X_train, X_test, y_train, y_test = train_test_split(
    remote_data, remote_target, test_size=0.2, random_state=42
)

By setting the `test_size` to 0.2, we allocate 20% of our datasets for validation to check our model's performance.

By setting `random_state` to an integer (42 here, but any integer would do), we make the function produce the same training and testing sets across different executions. The split only changes if we change the integer value.

The `train_test_split` method returns our training and testing X and Y data (input and target data) as remote arrays.
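
The role of a fixed seed can be illustrated locally with Python's standard `random` module. This is only an analogy for how a seeded shuffle behaves. The helper `shuffled_indices` is a hypothetical name, and the sketch does not reproduce the exact split that BastionLab computes for `random_state=42`.

```python
import random


def shuffled_indices(n, seed):
    """Return the indices 0..n-1 in an order fully determined by the seed."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return indices


first = shuffled_indices(10, seed=42)
second = shuffled_indices(10, seed=42)

# The same seed gives the same order on every run.
print(first == second)  # True
```

Reproducible splits make it easier to compare models, since every run is evaluated on the same held-out rows.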

**Step 5: Training and Predicting**

We can now train our Gaussian Naive Bayes model using the GaussianNB class from BastionLab Linfa. We will fit the model to the training data and use it to predict the labels of the testing data:

In [None]:
# Creating the model object
gnb = GaussianNB()

# Training the model
gnb.fit(X_train, y_train)

# Predicting the target values for the test set
y_pred = gnb.predict(X_test)

**Step 6: Evaluating the model**

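We will now evaluate the performance of our model using the accuracy score metric, which gives us the fraction of correctly classified instances. As in the previous example, we first convert the predictions into a remote array using `to_array()` before passing them to the metric function.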

In [None]:
accuracy = metrics.accuracy_score(y_test, y_pred.to_array())
print("Accuracy:", accuracy)

Accuracy: FetchableLazyFrame(identifier=5be09565-36aa-46de-9e0e-9dfca3c1d134)


> Note that the results of metrics are returned as a `FetchableLazyFrame`, so you have to call `fetch` to see the results as a plain Polars DataFrame.

In [None]:
print(accuracy.fetch())

shape: (1, 1)
┌──────────┐
│ accuracy │
│ ---      │
│ f32      │
╞══════════╡
│ 0.933333 │
└──────────┘
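
As a point of reference, accuracy is simply the number of correct predictions divided by the total number of predictions. The local sketch below uses made-up labels and a hypothetical helper named `local_accuracy`. For instance, 28 correct predictions out of 30 test samples would give the 0.933333 seen above.

```python
def local_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels: 4 of the 5 predictions match.
example_accuracy = local_accuracy([0, 1, 2, 1, 0], [0, 1, 2, 2, 0])
print(example_accuracy)  # 0.8
```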


### Conclusion

In this tutorial, we learned how to use Gaussian Naive Bayes for classification tasks using the Iris dataset. We loaded the data, sent it to the BastionLab server, preprocessed it, trained the model, and evaluated its performance using the accuracy score. Gaussian Naive Bayes is a simple yet effective algorithm that can be used for a variety of classification tasks.

## Logistic Regression
------------------------------------------------------------------

Logistic Regression is a widely used statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. It is commonly used in various fields such as medicine, social sciences, and marketing for binary and multi-class classification problems. In this section, we will explore binomial and multinomial logistic regression, their applications, and how to implement them in BastionLab.

### Types of Logistic Regression

- [Binomial Logistic Regression](#binomial-logistic-regression)
- [Multinomial Logistic Regression](#multinomial-logistic-regression)
  
We will start with binomial logistic regression and then move on to multinomial logistic regression.

We will use the following Python libraries:

* Polars - for loading and manipulating the dataset
* BastionLab Linfa - for building and evaluating the logistic regression models
* Scikit-learn - for fetching the dataset.


We will be using the famous Iris dataset directly from scikit-learn, which contains measurements of iris flowers from three species. The goal of the classification task is to predict the species of the flower based on its features.

Let's get started!

### Binomial Logistic Regression

#### Overview

Binomial logistic regression is a statistical method used to model the relationship between a binary target variable and one or more independent variables. It is commonly used in classification tasks where the target variable has only two possible outcomes.
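
To give an intuition for how such a model turns features into a binary decision, the sketch below applies the logistic (sigmoid) function to a weighted sum of the inputs. The weights, bias and threshold are made-up values for illustration only. In practice they are learned during training, and this is not BastionLab's implementation.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_binary(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted label) for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Hypothetical, hand-picked parameters: z = 1.5 * 2.0 - 0.5 * 1.0 - 1.0 = 1.5
weights = [1.5, -0.5]
bias = -1.0
probability, label = predict_binary([2.0, 1.0], weights, bias)
print(round(probability, 3), label)
```

The threshold of 0.5 is a common default; moving it trades off false positives against false negatives.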

#### Applications
Binomial logistic regression is widely used in various fields such as medicine, social sciences, and marketing. It can be used to predict the likelihood of a patient having a disease based on their symptoms, the likelihood of a customer purchasing a product based on their demographic information, and more.

#### Implementation
In this section, we will demonstrate how to implement binomial logistic regression using BastionLab.

**Step 1: Importing Libraries**

The first step is to import the required libraries. We will be using Polars for data manipulation, BastionLab Polars for creating training and testing sub-datasets, and BastionLab Linfa for building the Binomial Logistic Regression model and using the metrics submodule for validation.

In [None]:
import polars as pl
from bastionlab.linfa.models import LogisticRegression
from bastionlab.linfa import metrics
from bastionlab.polars import train_test_split
from sklearn.datasets import load_iris

**Step 2: Loading the Iris Dataset**

The next step is to load the data into a Polars DataFrame. We will be using the Iris dataset, which contains information about various flowers.

In [None]:
iris = load_iris()
inputs = pl.DataFrame(iris.data)
target = pl.DataFrame(iris.target)

**Step 3: Uploading Data to BastionLab**

The Iris dataset loaded from scikit-learn first has to be uploaded to BastionLab. From this point on, we only work with references to the remote data when applying the Logistic Regression algorithm. We also attach the [policy](#Setting-up-and-connecting-to-server) we set up earlier.


In [None]:
remote_inputs = connection.client.polars.send_df(inputs, policy)
remote_target = connection.client.polars.send_df(target, policy)

**Step 4: Preprocessing**

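Before building the model, we need to prepare the data. This involves converting the target variable to a binary form and splitting the data into training and testing sets. The Iris target has three classes (0, 1 and 2), so we keep class 0 as 0 and map the other two classes to 1.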

In [None]:
remote_target = remote_target.select(
    pl.when(pl.all() == 0).then(0).otherwise(1)
).collect()
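
The same binarization can be pictured locally with a plain Python list. In scikit-learn's Iris dataset, class 0 is Iris setosa, so the resulting task is "setosa or not". The label values below are made up for the illustration.

```python
# A few made-up Iris labels: 0 = setosa, 1 = versicolor, 2 = virginica.
labels = [0, 0, 1, 2, 1, 0, 2]

# Keep class 0 as 0 and collapse every other class into 1.
binary_labels = [0 if label == 0 else 1 for label in labels]
print(binary_labels)  # [0, 0, 1, 1, 1, 0, 1]
```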

Before using our `FetchableLazyFrame`s for training, we have to convert them into `RemoteArray`s.

We use the `to_array` method to convert them.

In [None]:
remote_inputs = remote_inputs.to_array()
remote_target = remote_target.to_array()

Now, we split our dataset into training and testing subsets with the snippet below.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    remote_inputs, remote_target, test_size=0.3, random_state=42, shuffle=True
)

**Step 5: Building the model**

In [None]:
# Creating the model object
lr = LogisticRegression()

# Training the model
lr.fit(X_train, y_train)

# Predicting the target values for the test set
y_pred = lr.predict(X_test)

**Step 6: Model Evaluation**

Finally, we can evaluate the model using metrics such as accuracy score.

> Note that the results of metrics are returned as a `FetchableLazyFrame`, so you have to call `fetch` to see the results as a plain Polars DataFrame.

In [None]:
y_pred = y_pred.to_array()

In [None]:
accuracy = metrics.accuracy_score(y_test, y_pred)
print(accuracy)

FetchableLazyFrame(identifier=ccad625d-f369-4a8a-94b4-ed81f13987f1)


In [None]:
accuracy = accuracy.fetch()
print(accuracy)

shape: (1, 1)
┌──────────┐
│ accuracy │
│ ---      │
│ f32      │
╞══════════╡
│ 0.521739 │
└──────────┘


#### Conclusion

In this section, we have seen how to apply binomial logistic regression to a binary classification problem in BastionLab. We converted the Iris target into a binary variable, uploaded the data, split it into training and testing sets, trained the model, and evaluated it with the accuracy score.

### Multinomial Logistic Regression

#### Overview
Logistic Regression is a popular algorithm for binary classification problems, and it can be extended to multi-class classification problems with Multinomial Logistic Regression. Instead of producing a single probability, the model computes one score per class and converts these scores into class probabilities with the softmax function, so that the probabilities sum to 1. The class with the highest probability is then selected as the predicted class.
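
Multinomial logistic regression is commonly formulated with the softmax function, which turns one score per class into probabilities that sum to 1. The sketch below shows that step in plain Python with made-up scores. It is an illustration of the general technique under that assumption, not a description of BastionLab's internals.

```python
import math


def softmax(scores):
    """Convert a list of per-class scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three classes.
scores = [2.0, 1.0, 0.1]
probabilities = softmax(scores)

# The predicted class is the index with the highest probability.
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
print(predicted_class)  # 0
```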

#### Implementation

**Step 1: Importing Libraries**

The first step is to import the required libraries. We will be using Polars for data manipulation, BastionLab Polars for creating training and testing sub-datasets, and BastionLab Linfa for building the Multinomial Logistic Regression model and using the metrics submodule for validation.

In [None]:
import polars as pl
from bastionlab.linfa.models import LogisticRegression
from bastionlab.linfa import metrics
from bastionlab.polars import train_test_split
from sklearn.datasets import load_iris

**Step 2: Loading the Iris Dataset**

The next step is to load the data into a Polars DataFrame. We will be using the Iris dataset, which contains information about various flowers.

In [None]:
iris = load_iris()
inputs = pl.DataFrame(iris.data)
target = pl.DataFrame(iris.target)

**Step 3: Uploading Data to BastionLab**

The Iris dataset loaded from scikit-learn first has to be uploaded to BastionLab. From this point on, we only work with references to the remote data when applying the Logistic Regression algorithm. We also attach the [policy](#Setting-up-and-connecting-to-server) we set up earlier.

In [None]:
remote_inputs = connection.client.polars.send_df(inputs, policy)
remote_target = connection.client.polars.send_df(target, policy)

**Step 4: Preprocessing**

Before using our `FetchableLazyFrame`s for training, we have to convert them into `RemoteArray`s.
We use the `to_array` method to convert them.

In [None]:
remote_inputs = remote_inputs.to_array()
remote_target = remote_target.to_array()

Now, we split our dataset into training and testing subsets with the snippet below.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    remote_inputs, remote_target, test_size=0.2, random_state=42, shuffle=True
)

**Step 5: Building the model**

In [None]:
# Creating the model object
lr = LogisticRegression(multi_class="multinomial", max_iter=1000)

# Training the model
lr.fit(X_train, y_train)

# Predicting the target values for the test set
y_pred = lr.predict(X_test)

**Step 6: Model Evaluation**

Finally, we can evaluate the model using metrics such as accuracy score.

> Note that the results of metrics are returned as a `FetchableLazyFrame`, so you have to call `fetch` to see the results as a plain Polars DataFrame.

In [None]:
y_pred = y_pred.to_array()

In [None]:
accuracy = metrics.accuracy_score(y_test, y_pred)
print(accuracy)

FetchableLazyFrame(identifier=569c80a3-6fb5-461b-8b92-81b87f172d8f)


In [None]:
accuracy = accuracy.fetch()
print(accuracy)

shape: (1, 1)
┌──────────┐
│ accuracy │
│ ---      │
│ f32      │
╞══════════╡
│ 1.0      │
└──────────┘


#### Conclusion

In this section, we have seen how to apply multinomial logistic regression to a multi-class classification problem in BastionLab. We uploaded the Iris data with its three classes, split it into training and testing sets, trained the model, and evaluated it with the accuracy score.

## Decision Trees
------------------------------------------------------------------

### Overview
Decision Trees is a supervised machine learning algorithm used for both classification and regression tasks. It works by recursively splitting the data into subsets based on the values of the features and predicting the target variable based on the subset it belongs to.
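
A single split of a classification tree can be sketched in plain Python as follows. The example scores candidate thresholds on one feature with the Gini impurity, which is one common split criterion. Whether BastionLab uses the same criterion is not stated here, and the function names and toy values are assumptions for illustration.

```python
def gini(labels):
    """Gini impurity of a group of labels: 0.0 means the group is pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Find the threshold on one feature with the lowest weighted impurity."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds are the midpoints between neighbouring values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


# Illustrative petal lengths for two species.
petal_lengths = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(petal_lengths, species)
print(threshold, impurity)  # 3.0 0.0
```

A full tree repeats this search over every feature and then recurses into the left and right subsets until a stopping condition is met.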

### Applications
Decision Trees is widely used in various fields such as finance, medicine, and marketing.

### Implementation
In this section, we will demonstrate how to implement Decision Trees using BastionLab.

**Step 1: Importing Libraries**

The first step is to import the required libraries. We will be using Polars for data manipulation, and BastionLab for building the Decision Trees model.

In [None]:
import polars as pl
from bastionlab.linfa.models import DecisionTreeClassifier
from bastionlab.linfa import metrics
from bastionlab.polars import train_test_split
from sklearn.datasets import load_iris

**Step 2: Loading Data**

The next step is to load the data into a Polars DataFrame. We will be using the Iris dataset.

The Iris dataset is a well-known and commonly used dataset in the field of machine learning. Each of its 150 samples of iris flowers has four characteristics: sepal length, sepal width, petal length, and petal width. When using this dataset for classification tasks, the goal is to identify the species of iris flower based on these four characteristics.

Iris setosa, Iris versicolor, and Iris virginica are the three classes of iris flowers represented in the dataset. There are 50 examples in each class, making 150 samples overall. Sepal length and width, as well as petal length and width, are all measured in centimeters.

In [None]:
iris = load_iris()
inputs = pl.DataFrame(iris.data)
target = pl.DataFrame(iris.target)

**Step 3: Uploading Data to BastionLab**

The Iris dataset loaded from scikit-learn first has to be uploaded to BastionLab. From this point on, we only work with references to the remote data when applying the Decision Tree algorithm. We also attach the [policy](#Setting-up-and-connecting-to-server) we set up earlier.

In [None]:
remote_inputs = connection.client.polars.send_df(inputs, policy)
remote_target = connection.client.polars.send_df(target, policy)

**Step 4: Preprocessing**

Before using our `FetchableLazyFrame`s for training, we have to convert them into `RemoteArray`s.
We use the `to_array` method to convert them.

In [None]:
remote_inputs = remote_inputs.to_array()
remote_target = remote_target.to_array()

Now, we split our dataset into training and testing subsets with the snippet below.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    remote_inputs, remote_target, test_size=0.2, random_state=42, shuffle=True
)

**Step 5: Building the model**

In [None]:
# Creating the model object
dtc = DecisionTreeClassifier()

# Training the model
dtc.fit(X_train, y_train)

# Predicting the target values for the test set
y_pred = dtc.predict(X_test)

**Step 6: Model Evaluation**

Finally, we can evaluate the model using metrics such as accuracy score.

> Note that the results of metrics are returned as a `FetchableLazyFrame`, so you have to call `fetch` to see the results as a plain Polars DataFrame.

In [None]:
y_pred = y_pred.to_array()

In [None]:
accuracy = metrics.accuracy_score(y_test, y_pred)
print(accuracy)

FetchableLazyFrame(identifier=8f18c983-c028-4f9b-bb55-5c708f5f11df)


In [None]:
accuracy = accuracy.fetch()
print(accuracy)

shape: (1, 1)
┌──────────┐
│ accuracy │
│ ---      │
│ f32      │
╞══════════╡
│ 1.0      │
└──────────┘
