<a href="https://colab.research.google.com/github/krishna-shah-07/Machine_Learning/blob/main/algorithm_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Algorithm Tuning

Algorithm tuning tests several models on the same dataset and identifies the one that achieves the best value of a user-chosen performance metric on that dataset.
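
As a rough illustration of the idea (not TurboML's implementation), the sketch below scores a few candidate predictors on the same labelled data and ranks them by a metric. The names `accuracy`, `rank_models`, and the toy predictors are hypothetical and exist only for this example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Predictor = Callable[[float], int]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def rank_models(
    models: Dict[str, Predictor],
    xs: Sequence[float],
    ys: Sequence[int],
    metric: Callable[[Sequence[int], Sequence[int]], float] = accuracy,
) -> List[Tuple[str, float]]:
    """Score every candidate with `metric` and return (name, score) pairs, best first."""
    scores = [
        (name, metric(ys, [predict(x) for x in xs]))
        for name, predict in models.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy data: amounts above 100 are labelled 1, everything else 0 (made up for this sketch).
xs = [5.0, 20.0, 150.0, 300.0, 80.0, 120.0]
ys = [0, 0, 1, 1, 0, 1]

# Hypothetical candidates standing in for real models.
candidates = {
    "always_zero": lambda x: 0,
    "threshold_100": lambda x: int(x > 100),
    "threshold_250": lambda x: int(x > 250),
}

ranking = rank_models(candidates, xs, ys)
best_name, best_score = ranking[0]
print(best_name, best_score)
```

The first entry of `ranking` is the winner; the remaining entries show how far behind the other candidates are.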

Clone the repository that contains the notebooks and their data.

In [1]:
!git clone https://github.com/TurboML-Inc/colab-notebooks.git

fatal: destination path 'colab-notebooks' already exists and is not an empty directory.


Set up the environment and install TurboML's SDK.

In [2]:
!pip install -q condacolab
import condacolab
condacolab.install()
!bash colab-notebooks/install_turboml.sh

✨🍰✨ Everything looks OK!
Error while loading conda entry point: conda-libmamba-solver (module 'libmambapy' has no attribute 'QueryFormat')

CondaValueError: You have chosen a non-default solver backend (libmamba) but it was not recognized. Choose one of: classic

Collecting pandas==2.2.2
  Using cached pandas-2.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Using cached pandas-2.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 2.2.3
    Uninstalling pandas-2.2.3:
      Successfully uninstalled pandas-2.2.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
turboml-sdk 0.1.4 requires pandas<3.0.0,>=2.2.3, but you have pandas 2.2.2 which is incompatible.
Successfully installed pandas-2

The kernel should now be restarted with TurboML's SDK installed.

In [3]:
cd colab-notebooks

/content/colab-notebooks


Log in to your TurboML instance.

In [4]:
import pandas as pd
import turboml as tb
from sklearn import metrics

tb.init(backend_url="https://archetypal-loon.api.turboml.online", api_key="tb_VCTm5Cv8DH1uF526HJIhg0lTuSffp81B_0bcaa119")
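
Hard-coding an API key in a notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched below, reads the connection settings from environment variables. The variable names `TURBOML_BACKEND_URL` and `TURBOML_API_KEY` and the helper `load_turboml_credentials` are assumptions made for this example, not part of TurboML's SDK.

```python
import os
from typing import Mapping, Tuple

# Hypothetical variable names chosen for this sketch.
REQUIRED_VARS = ("TURBOML_BACKEND_URL", "TURBOML_API_KEY")


def load_turboml_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (backend_url, api_key) from `env`, failing early if either is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["TURBOML_BACKEND_URL"], env["TURBOML_API_KEY"]


# Usage in the notebook (requires both variables to be set beforehand):
# backend_url, api_key = load_turboml_credentials()
# tb.init(backend_url=backend_url, api_key=api_key)
```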

Read the transactions and labels datasets. Calling `reset_index()` adds an `index` column, which serves as the primary key below.

In [5]:
transactions_df = pd.read_csv("data/transactions.csv").reset_index()
labels_df = pd.read_csv("data/labels.csv").reset_index()

## Dataset

We wrap each DataFrame in a `PandasDataset` to use it for tuning, and pass `key_field` to indicate the column that holds the primary key.

For this example, we use the first 100k rows.

In [6]:
transactions_100k = tb.PandasDataset(
    dataframe=transactions_df[:100000], key_field="index", streaming=False
)
labels_100k = tb.PandasDataset(
    dataframe=labels_df[:100000], key_field="index", streaming=False
)

In [7]:
numerical_fields = [
    "transactionAmount",
]
categorical_fields = ["digitalItemCount", "physicalItemCount", "isProxyIP"]
inputs = transactions_100k.get_input_fields(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
label = labels_100k.get_label_field(label_field="is_fraud")

## Training/Tuning

We will compare `NeuralNetwork` and `HoeffdingTreeClassifier`, and the metric to optimize is `accuracy`.

Configure the neural network for this dataset by appending a layer with `output_size=2`, matching the two classes.

In [8]:
new_layer = tb.NNLayer(output_size=2)

nn = tb.NeuralNetwork()
nn.layers.append(new_layer)

The `algorithm_tuning` function takes the candidate models as a list, along with the metric to optimize. As used below, its result is a list of (model, score) pairs, and the first pair holds the model with the highest score for that metric.

In [9]:
model_score_list = tb.algorithm_tuning(
    models_to_test=[
        tb.HoeffdingTreeClassifier(n_classes=2),
        nn,
    ],
    metric_to_optimize="accuracy",
    input=inputs,
    labels=label,
)
best_model, best_score = model_score_list[0]
best_model

INFO:turboml.common.internal:Starting to upload data... Total rows: 100000
Progress: 100%|██████████| 98.0/98.0 [00:02<00:00, 42.5chunk/s]
INFO:turboml.common.internal:Completed data upload.
INFO:turboml.common.internal:Starting to upload data... Total rows: 100000
Progress: 100%|██████████| 98.0/98.0 [00:00<00:00, 337chunk/s]
INFO:turboml.common.internal:Completed data upload.



Model: HoeffdingTreeClassifier
Parameters:
  - model_id: yoefuqykxd
  - version: wheckexdbt
  - delta: 1e-07
  - tau: 0.05
  - grace_period: 200
  - n_classes: 2
  - leaf_pred_method: mc
  - split_method: gini
Accuracy Score: 0.99386



INFO:turboml.common.internal:Starting to upload data... Total rows: 100000
Progress: 100%|██████████| 98.0/98.0 [00:02<00:00, 41.7chunk/s]
INFO:turboml.common.internal:Completed data upload.
INFO:turboml.common.internal:Starting to upload data... Total rows: 100000
Progress: 100%|██████████| 98.0/98.0 [00:00<00:00, 178chunk/s]
INFO:turboml.common.internal:Completed data upload.



Model: NeuralNetwork
Parameters:
  - model_id: wuckmcyjjr
  - version: smjpswkthk
  - dropout: 0
  - layers: [NNLayer(output_size=64, activation='relu', dropout=0.3, residual_connections=[], use_bias=True), NNLayer(output_size=64, activation='relu', dropout=0.3, residual_connections=[], use_bias=True), NNLayer(output_size=1, activation='sigmoid', dropout=0.3, residual_connections=[], use_bias=True), NNLayer(output_size=2, activation='relu', dropout=0.3, residual_connections=[], use_bias=True)]
  - loss_function: mse
  - learning_rate: 0.01
  - optimizer: sgd
  - batch_size: 64
Accuracy Score: 0.04396



HoeffdingTreeClassifier(model_id='yoefuqykxd', version='wheckexdbt', delta=1e-07, tau=0.05, grace_period=200, n_classes=2, leaf_pred_method='mc', split_method='gini')

## Testing

Once the best-performing model is identified, we can use it as usual to run inference on the entire dataset and evaluate it on additional performance metrics.

In [10]:
transactions_full = tb.PandasDataset(
    dataframe=transactions_df, key_field="index", streaming=False
)
features = transactions_full.get_input_fields(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)

outputs = best_model.predict(features)

INFO:turboml.common.internal:Starting to upload data... Total rows: 201406
Progress: 100%|██████████| 197/197 [00:02<00:00, 67.2chunk/s]
INFO:turboml.common.internal:Completed data upload.


In [11]:
print(
    "Accuracy: ",
    metrics.accuracy_score(labels_df["is_fraud"], outputs["predicted_class"]),
)
print("F1: ", metrics.f1_score(labels_df["is_fraud"], outputs["predicted_class"]))

Accuracy:  0.9943497214581493
F1:  0.9367707523058117
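
The gap between the two numbers is easier to read with the definitions in hand. The sketch below computes accuracy and F1 for the positive class from scratch on made-up labels; `accuracy_and_f1` is a hypothetical helper, and the toy data is not taken from the transactions dataset. When positives are rare, a model can miss some of them and still keep accuracy high, which is why F1 is worth checking alongside it.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Compute accuracy and binary F1 (positive class = 1) from confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); set to 0.0 here when that denominator is zero.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Made-up example: 100 samples, 4 positives; the model finds 2 of them
# and raises 1 false alarm on a negative sample.
y_true = [1] * 4 + [0] * 96
y_pred = [1, 1, 0, 0] + [1] + [0] * 95

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```

In this toy case accuracy stays at 0.97 even though half of the positives are missed, while F1 drops to about 0.57.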
