<div id="colab_button">
  <h1>Introduction to Machine Learning Algorithms with BastionLab</h1>
  <a target="_blank" href="https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.7/docs/docs/how-to-guides/introduction_to_ml_algorithms_with_bastionlab.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>
______________________________________________________


Machine learning is a subfield of artificial intelligence that focuses on the development of algorithms that can learn from data and make predictions or decisions based on it. Machine learning algorithms have become increasingly important in fields such as medicine, finance, and marketing. In this Jupyter notebook, we will explore five popular machine learning algorithms, their applications, and how to implement them using Python and BastionLab, with scikit-learn supplying the example datasets.

The five algorithms we will cover are:

1. Linear Regression - for predicting continuous values
2. Gaussian Naive Bayes - for classification problems with continuous features
3. KMeans - for clustering data points into groups
4. Decision Trees - for classification and regression problems
5. Logistic Regression - for binary and multi-class classification problems

We will be using various datasets to illustrate the applications of these algorithms.

Let's dive in and explore these machine learning algorithms!

## Table of Contents
----------
1. [Linear Regression](#linear-regression)
2. [Gaussian Naive Bayes](#gaussian-naive-bayes)
3. [KMeans](#kmeans)
4. [Decision Trees](#decision-trees)
5. [Logistic Regression](#logistic-regression)
    - [Binomial Logistic Regression](#binomial-logistic-regression)
    - [Multinomial Logistic Regression](#multinomial-logistic-regression)


## Pre-requisites

________________________________________________
### Installation

This tutorial is intended for readers with a basic understanding of the Python programming language and some familiarity with the libraries listed below.

In order to run this notebook, we need to:
- Have [Python 3.7](https://www.python.org/downloads/) (or greater) and [pip](https://pypi.org/project/pip/) installed
- Install [BastionLab](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/)
- Install [Numpy](https://pypi.org/project/numpy)
- Install [scikit-learn](https://pypi.org/project/scikit-learn)

We'll do so by running the code block below. 

>If you are running this notebook on your machine instead of [Google Colab](https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.7/docs/docs/how-to-guides/introduction_to_ml_algorithms_with_bastionlab.ipynb), you can see our [Installation page](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/) to find the installation method that best suits your needs.

In [None]:
!pip install bastionlab
!pip install bastionlab_server
!pip install numpy
!pip install scikit-learn

## Bootstrapping



### Creating a connection

Before jumping into the tutorial, we will have to set up a connection to the BastionLab server. You will use the snippet below to create a connection.

In [1]:
from bastionlab import Connection

connection = Connection("localhost")

<div class="admonition warning">
    Please note that the above connection created is an insecure connection. You can refer to the <a href="https://github.com/mithril-security/bastionlab/blob/master/docs/docs/tutorials/authentication.ipynb">Authentication tutorial</a> to learn about setting up secure connections to the BastionLab server.
</div>

### Setting up the privacy policy

Using the BastionLab client, we'll upload our data to the server in a secure and private way. To do so, we need to define a custom privacy policy, which takes two parameters:

- A `safe_zone` which is a condition any request must meet to be considered privacy-preserving.
- An `unsafe_handling` which is the action taken in case a request violates the `safe_zone`.

For the purpose of this tutorial, we'll use the following:
- Any request that aggregates at least 10 rows of the original DataFrame is safe,
- We decide to log any offending request on the server side, so the Data Owner can see it.

To send the DataFrame with the privacy policy to the server, we'll use the `send_df()` method of the `polars` interface of the client. We'll pass it our custom policy. The method can also take a list of columns to be sanitized (*meaning set to null*) if retrieved by the data scientist, which this notebook does not use.


In [2]:
from bastionlab.polars.policy import Policy, Aggregation, Log

policy = Policy(
    safe_zone=Aggregation(min_agg_size=10), unsafe_handling=Log(), savable=True
)



> To keep this overview short and simple, we will use weak but reasonable guarantees. If you're interested in setting up stricter policies, you are encouraged to have a look at our [Privacy policy tutorial](https://github.com/mithril-security/bastionlab/blob/master/docs/docs/tutorials/policy.ipynb).
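
To build intuition for the `safe_zone` rule, here is a purely conceptual sketch in plain Python. It is not BastionLab's implementation: the helper names `smallest_group` and `check_request` are hypothetical, and we assume, for illustration only, that a grouped aggregation counts as safe when every group covers at least `min_agg_size` rows.

```python
from collections import Counter


def smallest_group(keys):
    """Return the size of the smallest group produced by grouping on `keys`."""
    return min(Counter(keys).values())


def check_request(keys, min_agg_size=10):
    """Toy safe-zone check (hypothetical, not BastionLab code).

    A request is "safe" if every group aggregates at least `min_agg_size`
    rows; otherwise it is "logged", mirroring the Log() handling above.
    """
    if smallest_group(keys) >= min_agg_size:
        return "safe"
    return "logged"


# 30 rows in 3 groups of 10: every aggregate covers 10 rows
balanced = ["a"] * 10 + ["b"] * 10 + ["c"] * 10
# 30 rows where one group contains a single row
skewed = ["a"] * 20 + ["b"] * 9 + ["c"] * 1

print(check_request(balanced))
print(check_request(skewed))
```

The second request is flagged because an aggregate over a single row would reveal that row's value.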

## Linear Regression

### Overview
Linear Regression is a supervised learning algorithm used for predicting a continuous outcome variable (also known as the response variable) based on one or more predictor variables. In other words, it is a technique for modeling the relationship between a dependent variable (Y) and one or more independent variables (X).

The goal of Linear Regression is to find the best-fit line or hyperplane that describes the relationship between the independent and dependent variables. This line or hyperplane is defined by a set of coefficients that determine the slope and intercept of the line or hyperplane.
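
As a minimal sketch of what "best-fit line" means, the snippet below computes the slope and intercept for a single feature with the ordinary least squares formulas. It uses plain Python on made-up points and is independent of BastionLab; the function name `fit_line` is ours, chosen for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    # The fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With several features, the same idea generalizes to one coefficient per feature plus an intercept, which is the hyperplane case mentioned above.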

Linear Regression is a widely used algorithm in machine learning and is used in various applications such as stock price prediction, sales forecasting, and many others. In this notebook, we will explore the basics of Linear Regression and its implementation using BastionLab.

### Applications

Linear Regression is widely used in various fields such as finance, economics, healthcare, and social sciences. Some of its applications include:

* Predicting stock prices
* Forecasting sales revenue
* Estimating the impact of a marketing campaign on sales
* Predicting the price of a house based on its features such as size, location, etc.

### Implementation

In this section, we will demonstrate how to implement Linear Regression using BastionLab.

**Step 1: Importing Libraries**

The first step is to import the required libraries. We will be using Polars for data manipulation, BastionLab Polars for splitting the data into training and testing sets, and BastionLab Linfa for building the Linear Regression model, with its metrics submodule for validation.

In [13]:
import polars as pl
from bastionlab.linfa import LinearRegression
from bastionlab.linfa.metrics import mean_squared_error
from bastionlab.polars import train_test_split

**Step 2: Loading Data**

The California Housing dataset is a popular dataset used in machine learning and statistics. It is a real-world dataset that contains information collected from the 1990 California census. The dataset includes information on housing prices, demographics, and geography for each block group in California.

Each row in the dataset represents a block group, which is the smallest geographic unit for which the US Census Bureau provides data. The dataset contains 20,640 observations and 8 feature attributes, such as median income, house age, and average occupancy, plus the median house value, which is the prediction target.

The goal of this tutorial is to use the California Housing dataset to predict housing prices using linear regression.


In [14]:
from sklearn import datasets

data = datasets.fetch_california_housing(as_frame=True)

In the above code, we use the `fetch_california_housing` function to load the California Housing dataset. When `as_frame=True` is set, the returned object holds the features (`data`, our X) as a pandas `DataFrame` and the targets (`target`, our Y) as a pandas `Series`, together with metadata about the dataset.

We will convert the pandas DataFrames into Polars DataFrames.

In [15]:
# Convert inputs from Pandas DataFrame into Polars DataFrame
inputs = pl.DataFrame(data["data"])

# Convert target into Polars DataFrame from Numpy Ndarray
target = pl.DataFrame(data["target"].to_numpy())

Now that we have our data in the right form (Polars DataFrames, which BastionLab expects), we can go ahead and perform remote data operations.

### Sending the California Housing dataset to BastionLab

In this tutorial, we assume that the data owner has a private dataset they want to explore. As they don't have the expertise, they would like to hire a data scientist and give them restricted access to their private data.

In this part, we'll see how to **upload a data frame** to the server and use the [policy](#setting-up-the-privacy-policy) set up above. It is key to know that BastionLab ensures that **the original dataset cannot be downloaded by the data scientist**.

In [16]:
remote_inputs = connection.client.polars.send_df(inputs, policy)
remote_target = connection.client.polars.send_df(target, policy)

The server returns a `FetchableLazyFrame`, which is a reference to the remote DataFrame. It can be used as if it were locally available. We'll see how to use it in the steps below.

**Step 3: Data Preparation**

Before building the model, we need to prepare the data. Here, this involves converting the remote DataFrames into remote arrays and splitting them into training and testing sets.

In [17]:
remote_X = remote_inputs.to_array()
remote_Y = remote_target.to_array()

<div class="admonition important">
    `to_array` is BastionLab's method for converting a `RemoteDataFrame` into an Ndarray on the server.
    Once the method is called on the RemoteDataFrame, it is eagerly converted into a RemoteArray, which refers to an Ndarray on the server.
    If you want to know more about it, check out our <a href="#">data conversion tutorial</a>.
</div>

In [18]:
X_train, X_test, y_train, y_test = train_test_split(
    remote_X,
    remote_Y,
    test_size=0.3,
    shuffle=True,
)

`train_test_split`, just like its counterpart in `sklearn`, splits the RemoteArrays into train and test parts.
In this example, the test size is set to 30% of the whole dataset. This held-out part is used to validate that the model was fit properly.

We also enable shuffling, so that the split does not depend on the order in which the rows are stored.
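
To make the splitting step concrete, here is a local, plain-Python sketch of a shuffled split. It only illustrates the idea and is not how BastionLab or `sklearn` implement it: `shuffled_split` and its `seed` parameter are hypothetical names, and rounding of the test size may differ from the real functions.

```python
import random


def shuffled_split(rows, test_size=0.3, seed=0):
    """Shuffle row indices, then cut them into a train part and a test part."""
    indices = list(range(len(rows)))
    # A seeded generator makes the shuffle reproducible
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test


rows = list(range(10))
train, test = shuffled_split(rows, test_size=0.3)
print(len(train), len(test))
```

Every row ends up in exactly one of the two parts, so the model is never evaluated on rows it was trained on.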

**Step 4: Building the Model**

We can now build the Linear Regression model using BastionLab Linfa.

In [19]:
# Creating the model object
lr = LinearRegression()

# Training the model
lr.fit(X_train, y_train)

# Predicting the target values for the test set
y_pred = lr.predict(X_test)

**Step 5: Model Evaluation**

Finally, we can evaluate the model using metrics such as Mean Squared Error (MSE) and R-Squared (R^2). Here, we compute the MSE.

In [21]:
mse = mean_squared_error(y_test, y_pred.to_array())
print(mse)

FetchableLazyFrame(identifier=12dda597-0cba-48cb-aad5-337774263108)


> Note that metric results are returned as a `RemoteDataFrame` (printed above as a `FetchableLazyFrame`), so you have to call `fetch` to see them as a plain Polars DataFrame.

In [22]:
print(mse.fetch())

shape: (1, 1)
┌────────────────────┐
│ mean_squared_error │
│ ---                │
│ f64                │
╞════════════════════╡
│ 0.53789            │
└────────────────────┘
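
For reference, the two metrics named in this step can be written out in a few lines of plain Python. This is a local sketch on made-up numbers, not the BastionLab implementation; `local_mse` and `local_r2` are hypothetical helper names.

```python
def local_mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def local_r2(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up targets and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

print(local_mse(y_true, y_pred))
print(local_r2(y_true, y_pred))
```

A lower MSE is better, while R^2 approaches 1 as the predictions explain more of the variance in the targets.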


### Conclusion

This section introduced the Linear Regression algorithm for regression tasks and showed how to apply it to the California Housing dataset using BastionLab's machine learning module `linfa`.

It covered data loading, splitting data into training and testing sets, creating a Linear Regression model instance, fitting the model to the training data, making predictions on the testing data, and evaluating the model's performance using the mean squared error metric.

By the end of the section, you have built a simple Linear Regression model, trained it on real-world data, and evaluated its performance using a standard evaluation metric.

## Gaussian Naive Bayes
### Overview

Gaussian Naive Bayes is a probabilistic algorithm used in machine learning for classification tasks. It is based on Bayes' theorem and assumes that the features are independent of each other given the class, which means that, within a class, the value of one feature does not affect the probability of the value of another. The "Gaussian" part means that each continuous feature is modeled with a normal distribution per class.

Gaussian Naive Bayes is particularly useful when the number of features is large and the number of training examples is small. It is a simple yet effective algorithm that has been successfully used in many applications, including text classification, spam filtering, and image recognition.
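
The sketch below shows the idea in plain Python on a tiny made-up dataset: estimate a prior, mean, and variance per class, then pick the class with the highest log-posterior. It is a simplified illustration, not the BastionLab Linfa implementation, and the function names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior for one sample."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        # Independence assumption: per-feature log-densities simply add up
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up data: two well-separated classes with two features each
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [5.0, 8.1], [5.2, 7.9], [4.8, 8.0]]
y = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 2.0]))
print(predict_gaussian_nb(model, [5.1, 8.0]))
```

Working with log-probabilities avoids numerical underflow when many per-feature densities are multiplied together.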

### Implementation

**Step 1: Importing Libraries**

The first step is to import the required libraries. We will be using NumPy and Polars for data manipulation, BastionLab Polars for creating training and testing sub-datasets, and BastionLab Linfa for building the Gaussian Naive Bayes classification model, with its metrics submodule for validation.


In [3]:
import numpy as np
import polars as pl
from bastionlab.linfa import GaussianNB, metrics
from bastionlab.polars import train_test_split

**Step 2: Data Loading**

In this tutorial, we will use the Iris dataset, which is a popular dataset used for classification tasks. It contains 150 samples of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. The dataset is included in the scikit-learn library, which can be loaded using the following code:

In [4]:
from sklearn.datasets import load_iris

iris = load_iris()
data = iris.data
target = iris.target

**Step 2 (continued): Uploading to BastionLab**

The Iris dataset loaded from scikit-learn first has to be uploaded to BastionLab. From then on, we only work with the remote data when applying the Naive Bayes algorithm, and we again use the [policy](#setting-up-the-privacy-policy) set up above.

The loaded data is converted into Polars DataFrames, which is BastionLab's input type. The snippet below takes the NumPy ndarrays and converts them into Polars DataFrames.

In [5]:
data = pl.DataFrame(data)
target = pl.DataFrame(target)

Now, we upload the data and target dataframes to have their remote representations.

In [6]:
remote_data = connection.client.polars.send_df(data, policy)
remote_target = connection.client.polars.send_df(target, policy)

The remote data is then converted into `RemoteArray`s, because our training function and the preprocessing functions accept `RemoteArray`s.

In [7]:
remote_data = remote_data.to_array()
remote_target = remote_target.to_array()

**Step 3: Preprocessing**

Before training our model, we need to preprocess the remote data by splitting it into training and testing sets. We will use the `train_test_split()` function from BastionLab Polars to split the data into an 80% training set and a 20% testing set, passing a fixed `random_state` so the split is reproducible:

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    remote_data, remote_target, test_size=0.2, random_state=42
)

**Step 4: Training and Predicting**

We can now train our Gaussian Naive Bayes model using the GaussianNB class from BastionLab Linfa. We will fit the model to the training data and use it to predict the labels of the testing data:

In [9]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)

**Step 5: Evaluating the model**

To evaluate the performance of our model, we can use various metrics such as accuracy, precision, recall, and F1-score. In this tutorial, we will use the accuracy score, which measures the percentage of correctly classified instances.

Before passing our `y_pred` to the `accuracy_score` function, we need to convert it into a `RemoteArray`, because the result of `predict` is a `RemoteDataFrame`.

In [10]:
y_pred = y_pred.to_array()

In [11]:
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: FetchableLazyFrame(identifier=9c5504ca-ced8-474d-878b-03a010a97d87)


> Note that metric results are returned as a `RemoteDataFrame` (printed above as a `FetchableLazyFrame`), so you have to call `fetch` to see them as a plain Polars DataFrame.

In [12]:
print(accuracy.fetch())

shape: (1, 1)
┌──────────┐
│ accuracy │
│ ---      │
│ f32      │
╞══════════╡
│ 0.933333 │
└──────────┘
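
As a local illustration of what the accuracy score measures, the plain-Python sketch below counts matching labels on made-up data. It is not the BastionLab implementation, and `local_accuracy` is a hypothetical helper name.

```python
def local_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels: 5 of the 6 predictions are correct
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

acc = local_accuracy(y_true, y_pred)
print(acc)
```

Assuming the 20% test split holds 30 of the 150 Iris samples, the accuracy of 0.933333 reported above corresponds to 28 correct predictions out of 30.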


### Conclusion

In this tutorial, we learned how to use Gaussian Naive Bayes for classification tasks using the Iris dataset. We loaded the data, sent it to the BastionLab server, preprocessed it, trained the model, and evaluated its performance using the accuracy score. Gaussian Naive Bayes is a simple yet effective algorithm that can be used for a variety of classification tasks.