<div id="colab_button">
  <h1>Introduction to Machine Learning Algorithms with BastionLab</h1>
  <a target="_blank" href="https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.7/docs/docs/how-to-guides/introduction_to_ml_algorithms_with_bastionlab.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>
______________________________________________________

Machine learning is a subfield of artificial intelligence that focuses on the development of algorithms that can make predictions or decisions based on data. ML allows us to harness the growing quantity of data around us to create smart and powerful AI applications, such as improving the accuracy of the diagnosis of illnesses or detecting cyber attacks.

Working with machine learning algorithms is a key part of BastionLab's offer, allowing users to train and deploy models using popular machine learning algorithms. What sets BastionLab apart from other software that facilitates ML training and deployment is its unique privacy features, which allow data owners to keep their datasets safe throughout the ML pipeline.

In this Jupyter notebook, we will explore four popular machine learning algorithms, their applications, and how to implement them using BastionLab.

The four algorithms we will cover are:

1. Linear Regression - for predicting continuous values
2. Gaussian Naive Bayes - for classification problems with continuous features
3. Logistic Regression - for binary and multi-class classification problems
4. Decision Trees - for classification and regression problems

So without further ado, let's start exploring these machine learning algorithms!

## Table of Contents
----------
1. [Pre-requisites](#pre-requisites)
2. [Setting up and connecting to server](#Setting-up-and-connecting-to-server)
3. [Uploading and preparing the datasets](#Uploading-and-preparing-the-datasets)
4. [Linear Regression](#linear-regression)
5. [Gaussian Naive Bayes](#gaussian-naive-bayes)
6. [Logistic Regression](#logistic-regression)
    - [Binomial Logistic Regression](#binomial-logistic-regression)
    - [Multinomial Logistic Regression](#multinomial-logistic-regression)
7. [Decision Trees](#decision-trees)


## Pre-requisites

________________________________________________
### Installation

In order to run this notebook, we need to:
- Have [Python3.7](https://www.python.org/downloads/) (or greater) and [Python Pip](https://pypi.org/project/pip/) installed
- Install [BastionLab](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/)
- Install [Numpy](https://pypi.org/project/numpy)
- Install [scikit-learn](https://pypi.org/project/scikit-learn)

We can install BastionLab, Numpy and scikit-learn, along with the `bastionlab_server` test package used later in this tutorial, by running the code block below.

>You can see our [Installation page](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/) to find the installation method that best suits your needs.

In [None]:
!pip install bastionlab
!pip install bastionlab_server
!pip install numpy
!pip install scikit-learn

## Setting up and connecting to server
---------------------------------------------------------------

### Launching the server

First things first: we need to launch the BastionLab server. 

In production we recommend this is done using our Docker image, but for testing purposes you can use our `bastionlab_server` package, which removes the need for user authentication.

In [67]:
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()

>*For more details on how you can set up the server using our Docker image, check out our [Installation Tutorial](../getting-started/installation.md).*

### Connecting to the server
Next, we will connect to the server using the following snippet of code.

In [68]:
from bastionlab import Connection

# connect to server instance
connection = Connection("localhost")

<div class="admonition warning">
    Please note that the connection created above is insecure. You can refer to the <a href="https://github.com/mithril-security/bastionlab/blob/master/docs/docs/tutorials/authentication.ipynb">Authentication tutorial</a> to learn about setting up secure connections to the BastionLab server.
</div>

## Uploading and preparing the datasets
------------------------------------------------------------------

**Step 1: Importing Libraries**

The first thing we need to do is import all the required libraries for this tutorial. We will be using sklearn's `datasets` submodule and Polars for loading and uploading our datasets, BastionLab's `polars` submodule for dataset preprocessing, and BastionLab Linfa's `models` and `metrics` submodules to build, train and evaluate our models.

In [69]:
# imports for data loading and uploading to BastionLab
from sklearn import datasets
import polars as pl

# imports for dataset preprocessing
from bastionlab.polars import train_test_split

# imports for training and evaluating model
from bastionlab.linfa import models, metrics

**Step 2: Loading Datasets**

We will be using two datasets to demonstrate the different machine learning algorithms.

The first dataset is the California Housing dataset, a popular dataset used in machine learning and statistics. It contains data collected in the 1990 California census, including information on housing prices, demographics, and geography for each block group in California. The dataset contains 20,640 observations and 8 predictive attributes, such as median income, house age and average occupancy, plus the target value: the median house value of each block group.

The second dataset is the Iris dataset, a popular dataset used for classification tasks. It contains 150 samples, each belonging to one of three iris flower species (Iris setosa, Iris versicolor and Iris virginica). The dataset contains information about four features: sepal length, sepal width, petal length, and petal width.

Both of these datasets can be loaded using built-in methods in sklearn's `datasets` submodule. We can load the California Housing dataset using the `fetch_california_housing()` method and the Iris dataset using the `load_iris()` method.

In [70]:
# load datasets from sklearn
california = datasets.fetch_california_housing()
iris = datasets.load_iris()

These methods return objects with `data` and `target` attributes, which hold the feature matrix and the target vector as Numpy ndarrays. The returned objects also contain some additional metadata attributes.

In [71]:
# collect feature matrix `X` (the input data) and target vector `Y` (the output data/labels)
iris_data = iris.data
iris_target = iris.target

california_data = california.data
california_target = california.target

**Step 3: Uploading datasets to BastionLab**

We are now ready to send our data to BastionLab. But first we need to convert our datasets into Polars DataFrames, which is the format required by BastionLab's `send_df` method used to upload datasets.

To convert our Iris and California Housing `X` and `Y` data, we use the Polars DataFrame constructor and supply it with our Numpy ndarray format of the datasets.

In [72]:
# Convert input and target data to Polars DataFrames
california_data = pl.DataFrame(california_data)
california_target = pl.DataFrame(california_target)

iris_data = pl.DataFrame(iris_data)
iris_target = pl.DataFrame(iris_target)

We can now upload the datasets to BastionLab.

In [73]:
# upload our inputs and target datasets
remote_cal_data = connection.client.polars.send_df(california_data)
remote_cal_target = connection.client.polars.send_df(california_target)

remote_iris_data = connection.client.polars.send_df(iris_data)
remote_iris_target = connection.client.polars.send_df(iris_target)

The server returns a `FetchableLazyFrame` for each uploaded DataFrame. This is a reference to the remote DataFrame which can be used as if the data were locally available.

**Step 4: Data Preparation**

In one of the upcoming sections, we will use the Iris dataset to illustrate a binary classification problem. However, the Iris dataset is a multi-class classification problem, with each sample belonging to one of three Iris species (represented by 0, 1 and 2). We therefore need to create a modified version of the target data with just two possible outcomes, indicating whether a sample belongs to the Iris setosa group (0) or not (1).

To do this, we take our `FetchableLazyFrame` instance of the Iris target data, `remote_iris_target`, and use Polars' `.when().then().otherwise()` methods to replace all non-zero values with 1.

We will store this in a `bi_iris_target` variable.

In [74]:
# change training and testing output data to have two possible outcomes
bi_iris_target = remote_iris_target.select(
    pl.when(pl.all() == 0).then(0).otherwise(1)  # all non 0 data becomes 1
).collect()
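
The relabelling rule itself is easy to check locally. The following is a minimal sketch in plain Python, not BastionLab code: `to_binary` is a hypothetical helper name and the sample labels are made up for illustration.

```python
def to_binary(labels):
    """Map class 0 (Iris setosa) to 0 and every other class to 1."""
    return [0 if label == 0 else 1 for label in labels]


# Made-up sample of multi-class labels (0, 1 and 2)
sample_labels = [0, 1, 2, 0, 2, 1]
binary_labels = to_binary(sample_labels)
print(binary_labels)  # [0, 1, 1, 0, 1, 1]
```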

Now that we have all the datasets we need for this tutorial, we need to prepare the data and split our `X` and `Y` datasets into testing and training datasets.

But first, we need to convert our remote datasets into remote arrays using BastionLab's `to_array` method. To avoid having duplicate remote arrays and datasets, the FetchableLazyFrames are automatically deleted by the server when we use the `to_array()` method, leaving us with only the remote array copy of our datasets.

In [75]:
# convert all required datasets to arrays
cal_remote_X = remote_cal_data.to_array()
cal_remote_Y = remote_cal_target.to_array()

iris_remote_X = remote_iris_data.to_array()
iris_remote_Y = remote_iris_target.to_array()

iris_bi_Y = bi_iris_target.to_array()

<div class="admonition important">
    To learn more about this method, check out our <a href="#">data conversion tutorial</a>.
</div>

Now that we have converted our remote dataframes into remote arrays, we can split our data into training and testing sets using BastionLab's `train_test_split` method, which is similar to sklearn's method of the same name.

Note how we create three training/testing sets: one for the California Housing dataset, one for the default Iris dataset and a final one for the Iris dataset which uses our modified binary classification output data.

In [76]:
# split data into training and testing sets
cal_x_train, cal_x_test, cal_y_train, cal_y_test = train_test_split(
    cal_remote_X,
    cal_remote_Y,
    test_size=0.3,
    shuffle=True,
)

iris_x_train, iris_x_test, iris_y_train, iris_y_test = train_test_split(
    iris_remote_X,
    iris_remote_Y,
    test_size=0.3,
    shuffle=True,
)

bi_iris_x_train, bi_iris_x_test, bi_iris_y_train, bi_iris_y_test = train_test_split(
    iris_remote_X,
    iris_bi_Y,
    test_size=0.3,
    shuffle=True,
)

By setting the `test_size` to 0.3, we allocate 30% of our datasets for validation to check our model's performance.

By setting `shuffle` to `True`, we ensure that the data is assigned to the testing and training datasets randomly.

The `train_test_split` method returns our training and testing X and Y data (input and target data) as remote arrays.
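
To make the roles of `test_size` and `shuffle` concrete, here is a minimal local sketch of a shuffled split in plain Python. It is an illustration only, not BastionLab's implementation: `split_train_test` and its `seed` parameter are hypothetical, and the way BastionLab rounds the number of test rows may differ.

```python
import random


def split_train_test(X, Y, test_size=0.3, shuffle=True, seed=None):
    """Split paired rows of X and Y into training and testing parts."""
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same number of rows")
    indices = list(range(len(X)))
    if shuffle:
        # Shuffle the row indices once so that X and Y stay aligned
        random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [X[i] for i in train_idx]
    x_test = [X[i] for i in test_idx]
    y_train = [Y[i] for i in train_idx]
    y_test = [Y[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Made-up data: 10 rows with one feature, target is twice the feature
X = [[float(i)] for i in range(10)]
Y = [i * 2 for i in range(10)]
x_train, x_test, y_train, y_test = split_train_test(X, Y, test_size=0.3, seed=0)
print(len(x_train), len(x_test))  # 7 3
```

Shuffling indices rather than the rows themselves keeps each input row paired with its target after the split.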

Now that we have our testing and training sets for the Iris and California Housing datasets set up in BastionLab, we are ready to explore how we can use them to test and train models using the different machine learning algorithms covered in this tutorial. Let's start with Linear Regression!

## Linear Regression
------------------------------------------------------------------

### Overview
Linear Regression is a supervised learning algorithm used for predicting a continuous outcome variable (also known as the response variable) based on one or more predictor variables. In other words, it is a technique for modeling the relationship between a dependent variable (Y) and one or more independent variables (X).

The goal of Linear Regression is to find the best-fit line or hyperplane that describes the relationship between the independent and dependent variables. This line or hyperplane is defined by a set of coefficients that determine the slope and intercept of the line or hyperplane.
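
For a single predictor, the best-fit line has a closed-form solution. The sketch below is a plain Python illustration of that idea on made-up points; `fit_line` is a hypothetical helper and does not reflect how BastionLab fits its model.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```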

### Applications

Linear Regression is widely used across various fields such as finance, economics, healthcare, and social sciences. Some of its applications include:

* Predicting stock prices
* Forecasting sales revenue
* Estimating the impact of a marketing campaign on sales
* Predicting the price of a house based on its features such as size, location, etc.

### Walk-through

We will now see how we can use the California Housing dataset we uploaded to BastionLab to build and run the Linear Regression model to predict housing prices.

**Training and testing a model using Linear Regression**

We will start by creating a Linear Regression model object, training it and then running it on our test input data.

In [77]:
# Creating the model object
lr = models.LinearRegression()

# Training the model
lr.fit(cal_x_train, cal_y_train)

# Predicting the target values for the test set
y_pred = lr.predict(cal_x_test)

We use the `LinearRegression()` constructor to get an instance of the linear regression model.

We use the `fit` method with our training datasets to train the model.

Finally, we use the `predict` method to predict the target values for our test input data. Predicted values are returned to us as a `FetchableLazyFrame`.

**Model Evaluation**

We can evaluate our model using various different metrics, with the following methods available in BastionLab: SimpleValidationRequest, R2Score, MeanAbsoluteError, MeanSquaredError, MeanSquaredLogError, MedianAbsoluteError, MaxError, ExplainedVariance, Accuracy, F1Score, Mcc, ClassificationMetric and RegressionMetric.

In this example, we will use the Mean Squared Error (MSE) metric to evaluate our model. MSE is one of the most popular evaluation metrics for linear regression models.

To evaluate our model using MSE, we use BastionLab's `mean_squared_error` function, supplying it with the original target data and the values predicted by our model. Note that we need to convert our `y_pred` `FetchableLazyFrame` into a Remote Array.

In [78]:
mse = metrics.mean_squared_error(cal_y_test, y_pred.to_array())
mse

FetchableLazyFrame(identifier=33cbbb25-f815-48c6-b252-e68966d9cec1)

The function returns our MSE score as a `FetchableLazyFrame`. To view the results, we need to run `fetch` on this FetchableLazyFrame.

In [79]:
mse.fetch()

mean_squared_error
f64
0.539783


We get an MSE score of around 0.54. With MSE scores, the closer to 0, the better the model. A perfect model would return an MSE of 0.
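
As a reminder of what the metric computes, here is a minimal plain Python sketch of the MSE formula on made-up numbers. `mse_score` is a hypothetical local helper, not BastionLab's `mean_squared_error`.

```python
def mse_score(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions
y_true = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 2.0, 2.5, 5.0]
print(mse_score(y_true, y_hat))   # 0.375
print(mse_score(y_true, y_true))  # 0.0
```

Because the errors are squared, a single large miss (here the last prediction) contributes more to the score than several small ones.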

### Conclusion

We have seen how to train and evaluate a Linear Regression model to predict housing prices.

Let's now move onto the next ML algorithm we are going to explore today, Gaussian Naive Bayes.

## Gaussian Naive Bayes
------------------------------------------------------------------------

### Overview

Gaussian Naive Bayes is a probabilistic algorithm used in machine learning for classification tasks. It is based on Bayes' theorem and assumes that the features are independent of each other given the class, which means that, within a class, the value of one feature does not affect the probability of the value of another. The "Gaussian" part refers to the additional assumption that each continuous feature follows a normal distribution within each class.

Gaussian Naive Bayes is particularly useful when the number of features is large and the number of training examples is small. It is a simple yet effective algorithm that has been successfully used in many applications, including text classification, spam filtering, and image recognition.
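
The idea can be sketched in a few lines of plain Python: estimate a prior and a per-feature mean and variance for each class, then pick the class with the highest log posterior. This is an illustrative sketch on made-up data, not BastionLab's implementation; `fit_gnb` and `predict_gnb` are hypothetical names.

```python
import math
from collections import defaultdict


def fit_gnb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small constant keeps every variance strictly positive
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gnb(model, row):
    """Pick the class with the highest log posterior for one sample."""
    def log_posterior(params):
        prior, means, variances = params
        # Independence assumption: per-feature log densities are summed
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(row, means, variances)
        )
        return math.log(prior) + log_likelihood
    return max(model, key=lambda label: log_posterior(model[label]))


# Made-up toy data: two well-separated classes with two features each
X = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0], [5.0, 5.2], [5.1, 4.9], [4.9, 5.0]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gnb(X, y)
print(predict_gnb(model, [1.0, 1.0]), predict_gnb(model, [5.0, 5.0]))  # 0 1
```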

### Walk-through

Let's take a look at how we can create an ML model to correctly identify species of Iris flowers using the Gaussian Naive Bayes ML algorithm. For this example, we will be using the Iris testing and training datasets we uploaded to BastionLab at the start of this tutorial.

**Training and Predicting**

We first get an instance of the Gaussian Naive Bayes model using the GaussianNB class from BastionLab Linfa. We then fit the model to the training data and use it to predict the labels of the testing data.

In [80]:
# Creating the model object
gnb = models.GaussianNB()

# Training the model
gnb.fit(iris_x_train, iris_y_train)

# Predicting the target values for the test set
y_pred = gnb.predict(iris_x_test)

We save the `FetchableLazyFrame` we get back containing our model's predicted outputs as `y_pred`.

**Evaluating the model**

We will now evaluate the performance of our model using the accuracy score metric, which gives us the percentage of correctly classified instances.

To do this, we use the `accuracy_score` method, providing it with our test Y dataset and our predicted values converted to a Remote Array.

In [81]:
accuracy = metrics.accuracy_score(iris_y_test, y_pred.to_array())
print("Accuracy:", accuracy)

Accuracy: FetchableLazyFrame(identifier=2904860c-a42e-4d38-a2a7-c87212d730ec)


> Note that results of metrics are returned as a `FetchableLazyFrame`, so we have to call `fetch` to see the results as a plain Polars DataFrame.

In [82]:
print(accuracy.fetch())

shape: (1, 1)
┌────────────────────┐
│ accuracy(accuracy) │
│ ---                │
│ f64                │
╞════════════════════╡
│ 0.934783           │
└────────────────────┘


We see that our model predicted species of flowers with around a 93% level of accuracy.
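
The accuracy metric is simply the fraction of predictions that match the true labels. A minimal plain Python sketch on made-up labels follows; `accuracy_fraction` is a hypothetical local helper, not BastionLab's `accuracy_score`.

```python
def accuracy_fraction(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels: 4 of the 5 predictions are correct
true_labels = [0, 1, 2, 2, 1]
predicted_labels = [0, 1, 2, 1, 1]
print(accuracy_fraction(true_labels, predicted_labels))  # 0.8
```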

### Conclusion

We have seen how we can use Gaussian Naive Bayes for classification tasks using the Iris dataset. We were able to train a model which identifies species of Iris flowers and evaluated its performance using the accuracy score.

Next up, let's take a look at how we can create a Logistic Regression model in BastionLab.

## Logistic Regression
------------------------------------------------------------------

Logistic Regression is a widely used statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. It is used for binary and multi-class classification problems. In this section, we will explore binomial and multinomial logistic regression, their applications, and how to implement them in BastionLab.
  
We will start with binomial logistic regression and then move on to multinomial logistic regression.

We will use the Iris dataset for this section which we have already uploaded to the BastionLab server.

Let's get started!

### Binomial Logistic Regression

#### Overview

Binomial logistic regression is a statistical method used to model the relationship between a binary target variable and one or more independent variables. It is used in classification tasks where the target variable has only two possible outcomes.
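
To illustrate the mechanics, the sketch below fits a one-feature logistic model with batch gradient descent in plain Python and predicts a label by thresholding the sigmoid output at 0.5. It is a simplified illustration on made-up data: the function names, learning rate and number of epochs are assumptions, and BastionLab's solver may work differently.

```python
import math


def sigmoid(z):
    """Squash a real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit a weight and a bias for one feature with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to w and b
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict_label(w, b, x, threshold=0.5):
    """Return 1 when the predicted probability reaches the threshold."""
    return 1 if sigmoid(w * x + b) >= threshold else 0


# Made-up, linearly separable data with a single feature
xs = [0.5, 1.0, 1.5, 3.5, 4.0, 4.5]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print([predict_label(w, b, x) for x in xs])  # [0, 0, 0, 1, 1, 1]
```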

#### Applications
Binomial logistic regression is widely used across various fields. It could be used to predict the likelihood of a patient having a disease based on their symptoms or the likelihood of a customer purchasing a product based on their demographic information.

#### Walk-through

For this walk-through, we will use our Iris dataset specially modified to contain two possible output groups, indicating if a flower belongs to the Iris setosa group (0) or not (1). Our aim is to create a model using Logistic Regression to predict whether our input data relates to an Iris setosa flower or not.

**Training and running the model**

We start by creating an instance of the Logistic Regression model, then train it and run it on our test input data.

In [83]:
# Creating the model object
lr = models.LogisticRegression()

# Training the model
lr.fit(bi_iris_x_train, bi_iris_y_train)

# Predicting the target values for the test set
y_pred = lr.predict(bi_iris_x_test)

The `predict` method gives us back a `FetchableLazyFrame` containing our predicted labels.

**Model Evaluation**

We can then evaluate the model using an accuracy score, which gives us a decimal value representing the proportion of predictions that were correct. We must provide the `accuracy_score` method with both our test target data and our predicted data as a Remote Array.

In [84]:
accuracy = metrics.accuracy_score(bi_iris_y_test, y_pred.to_array())
print(accuracy)

FetchableLazyFrame(identifier=076938e7-5cea-4fb0-b009-b92da860a78d)


Our accuracy score is returned to us as a `FetchableLazyFrame`. We must use the `fetch()` method to be able to view the results.

In [85]:
print(accuracy.fetch())

shape: (1, 1)
┌────────────────────┐
│ accuracy(accuracy) │
│ ---                │
│ f64                │
╞════════════════════╡
│ 0.478261           │
└────────────────────┘


We see that our model was about 48% accurate in predicting whether input data related to an Iris setosa flower or not.

### Multinomial Logistic Regression

#### Overview
In the previous example, we saw how Logistic Regression can be used for binary classification problems. It can also be extended to handle multi-class classification problems using the Multinomial Logistic Regression algorithm. The idea behind this algorithm is to learn one set of coefficients per class within a single model, compute a score for each class, and convert these scores into probabilities with the softmax function. The class with the highest probability is then selected as the predicted class. (Training a separate binary logistic regression model for each class is a different strategy, known as one-vs-rest.)
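
In the multinomial formulation as it is commonly defined, the per-class scores are turned into probabilities with the softmax function before the most probable class is picked. The sketch below shows that prediction step in plain Python. The weights and biases are hand-picked, hypothetical values rather than trained parameters, and `predict_class` is an illustrative helper, not a BastionLab function.

```python
import math


def softmax(scores):
    """Turn a list of class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(weights, biases, row):
    """Score each class with its own linear function, then pick the most probable."""
    scores = [
        sum(w * v for w, v in zip(class_weights, row)) + b
        for class_weights, b in zip(weights, biases)
    ]
    probabilities = softmax(scores)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best, probabilities


# Hypothetical, hand-picked parameters for 3 classes and 2 features
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]
label, probs = predict_class(weights, biases, [2.0, 0.5])
print(label, round(sum(probs), 6))  # 0 1.0
```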

**Building the model**

We will again be using the Iris dataset for this example. This time, however, we will use the original three-class target data rather than the modified two-class version from the binomial example.

When we create our Logistic Regression class instance, we need to specify that we want a model for multi-class classification by setting the `multi_class` option to `multinomial`.

We then set the `max_iter` option to 1000, which sets the maximum number of iterations for our model to 1000.

In [86]:
# Creating the model object
lr = models.LogisticRegression(multi_class="multinomial", max_iter=1000)

Now that we have our model object, we can train our model using the standard Iris training data and then get our predicted labels for our test input data.

In [87]:
# Training the model
lr.fit(iris_x_train, iris_y_train)

# Predicting the target values for the test set
y_pred = lr.predict(iris_x_test)

**Model Evaluation**

Finally, we can evaluate the model using an accuracy score metric, which tells us what percentage of predictions were correct. 

We must provide the `accuracy_score` method with both our test target data and our predicted data as Remote Arrays.

In [88]:
accuracy = metrics.accuracy_score(iris_y_test, y_pred.to_array())
print(accuracy)

FetchableLazyFrame(identifier=06f9a77c-99b0-4566-b880-4f5315cc5c11)


> Note that results of metrics are returned as a `FetchableLazyFrame`, so we need to call `fetch` to see the results as a plain Polars DataFrame.

In [89]:
accuracy = accuracy.fetch()
print(accuracy)

shape: (1, 1)
┌────────────────────┐
│ accuracy(accuracy) │
│ ---                │
│ f64                │
╞════════════════════╡
│ 0.956522           │
└────────────────────┘


We see that our model was about 96% accurate in predicting what type of Iris flower our input data related to.

#### Conclusion

In this section of the tutorial, we have seen how we can use binomial logistic regression with binary classification problems and multinomial logistic regression with multi-class classification problems.

## Decision Trees

### Overview
The final machine learning algorithm we are going to explore is decision trees, a supervised machine learning algorithm used for both classification and regression tasks. It works by recursively splitting the data into subsets based on the values of the features and predicting the target variable based on the subset it belongs to.
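
A single split can be sketched as follows: try candidate thresholds on one feature and keep the one that leaves the purest subsets. This plain Python sketch uses Gini impurity as the purity measure, which is an assumption for illustration (the criterion used by BastionLab is not stated here). The helper names and the data are made up.

```python
def gini(labels):
    """Gini impurity: 0 for a pure group, larger for mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Find the threshold on one feature with the lowest weighted impurity."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds sit halfway between neighbouring distinct values
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up petal lengths (cm) for two classes
petal_lengths = [1.4, 1.5, 1.3, 4.7, 4.5, 5.1]
species = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(petal_lengths, species)
print(threshold, impurity)  # 3.0 0.0
```

A full tree repeats this search recursively on each resulting subset, across all features, until a stopping rule is met.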

### Applications
The decision trees algorithm is widely used in various fields such as finance, medicine, and marketing.

### Walk-through

In this section, we will demonstrate how to create a ML model using the Decision Trees algorithm to predict the type of Iris flower our input data corresponds to. We will again be using the Iris dataset for this example.

As in the previous examples, we start by initializing an instance of our model's class. We then use the `fit` and `predict` methods to train our model and then use the model to get predictions for our test data.

**Training and running the model**

In [94]:
# Creating the model object
dtc = models.DecisionTreeClassifier()

# Training the model
dtc.fit(iris_x_train, iris_y_train)

# Predicting the target values for the test set
y_pred = dtc.predict(iris_x_test)

**Model Evaluation**

Now we can evaluate the model using the accuracy score metric, which indicates what percentage of the model's predictions were correct based on our test data.

We must send our test target data and our predicted data as Remote Arrays. To convert our predicted data from a `FetchableLazyFrame` to a Remote Array, we use BastionLab's `to_array()` method.

In [95]:
# get accuracy score of our model
accuracy = metrics.accuracy_score(iris_y_test, y_pred.to_array())
print(accuracy)

FetchableLazyFrame(identifier=d9a13e6b-bf33-4f28-8654-c9cfaedcadac)


> Note that results of metrics are returned as a `FetchableLazyFrame`, so we have to call `fetch` to see the results as a plain Polars DataFrame.

In [96]:
# display accuracy results
print(accuracy.fetch())

shape: (1, 1)
┌────────────────────┐
│ accuracy(accuracy) │
│ ---                │
│ f64                │
╞════════════════════╡
│ 0.891304           │
└────────────────────┘


We see that our Decision Trees model was 89% accurate in identifying the type of Iris flower our input data corresponded to.

### Conclusion

We have now seen how we can select and use the Decision Trees Classifier model in BastionLab. We were able to train a model which was around 89% accurate in identifying species of Iris flowers.

## Conclusions
------------------------------------------------------------------

In this tutorial, we have zoomed in on four machine learning algorithms available in BastionLab and seen how we can implement them for various classification and prediction tasks. There are many more machine learning algorithms and evaluation metrics to explore, so feel free to experiment with the models and metrics available in the BastionLab Linfa submodule.