# Arize integration guide
Arize and Neptune are MLOps tools that improve connected but different parts of your ML pipeline and workflow.

Arize helps you:
- Visualize your production model performance
- Understand drift and data quality issues

Neptune logs, stores, displays, and compares your model-building metadata, which supports experiment tracking and a model registry.

Together, Arize and Neptune help you:
- Train the best model
- Validate your model pre-launch
- Compare the production performance of those models

## Before you start

This notebook example lets you try out Neptune as an anonymous user, with zero setup.

To log the example to your own Neptune workspace instead:

  1. Create a Neptune account. [Register &rarr;](https://neptune.ai/register)
  1. Create a Neptune project that you will use for tracking metadata. For instructions, see [Creating a project](https://docs.neptune.ai/setup/creating_project) in the Neptune docs.

To log records to Arize, have Arize installed and set the `ARIZE_SPACE_KEY` and `ARIZE_API_KEY` environment variables, which the Arize client in this notebook reads.
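
The Arize client cell later in this notebook reads `ARIZE_SPACE_KEY` and `ARIZE_API_KEY` from the environment, so a missing variable only surfaces as a `KeyError` at that point. The following is a minimal sketch of an early check; `missing_env_vars` is a hypothetical helper written for this example and is not part of Neptune or Arize.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names from `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Variable names read by the Arize client cell later in this notebook.
ARIZE_VARS = ["ARIZE_SPACE_KEY", "ARIZE_API_KEY"]

missing = missing_env_vars(ARIZE_VARS)
if missing:
    print("Set these environment variables before logging to Arize:", ", ".join(missing))
```

Running the check before training makes a configuration problem visible before any time is spent fitting the model.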

## Install Neptune and dependencies

In [None]:
!pip install neptune neptune-sklearn arize pandas scikit-learn numpy

In [None]:
import neptune
import neptune.integrations.sklearn as npt_utils
from neptune.utils import stringify_unsupported

import os

from arize.api import Client
from arize.utils.types import ModelTypes, Environments

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [None]:
model_id = "neptune_cancer_prediction_model"
model_version = "v1"
model_type = ModelTypes.BINARY_CLASSIFICATION


def process_data(X, y):
    X = np.array(X).reshape((len(X), 30))
    y = np.array(y)
    return X, y


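# Load the breast cancer dataset once and reuse it for features, labels, and column names
data = datasets.load_breast_cancer()

# Keep the features in a DataFrame so the column names travel with the data
X = pd.DataFrame(data["data"].astype(np.float32), columns=data["feature_names"])
y = pd.Series(data["target"])

# Hold out a test set, then carve a validation set out of the remaining training data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, random_state=42)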


# Define model
model = LogisticRegression(random_state=42, max_iter=1000)
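
The cell above calls `train_test_split` twice without a `test_size`, so it can help to know how many rows end up in each split. The sketch below reproduces the arithmetic in plain Python, assuming scikit-learn's default held-out fraction of 0.25 with the held-out count rounded up; `split_sizes` is a hypothetical helper, and the 569-row count is the size of the breast cancer dataset.

```python
import math


def split_sizes(n_samples, test_fraction=0.25):
    """Return (n_train, n_test) when the held-out part is rounded up."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test


n_total = 569  # rows in the breast cancer dataset
n_train_full, n_test = split_sizes(n_total)  # first split: train vs. test
n_train, n_val = split_sizes(n_train_full)   # second split: train vs. validation

print(f"train={n_train}, val={n_val}, test={n_test}")
# prints: train=319, val=107, test=143
```

To confirm the numbers against the real split, compare them with `len(X_train)`, `len(X_val)`, and `len(X_test)`.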

## Log training and validation metadata to Neptune

In [None]:
# (Neptune) Initialize a run
run = neptune.init_run(project="common/showroom", api_token=neptune.ANONYMOUS_API_TOKEN)

# Model training
model.fit(X_train, y_train)

# (Neptune) Log a classifier summary (metrics and charts computed on the test split)
run["classifier_summary"] = npt_utils.create_classifier_summary(
    model, X_train, X_test, y_train, y_test
)

# (Neptune) Log model parameters
run["estimator/params"] = stringify_unsupported(npt_utils.get_estimator_params(model))

# (Neptune) Save model
run["estimator/pickled-model"] = npt_utils.get_pickled_model(model)

# (Neptune) Log the model ID so this run can be matched to the same model in Arize
run["model_id"] = model_id

### Stop logging

Once you are done logging, stop tracking the run.

In [None]:
run.stop()
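
If a logging cell raises an exception, `run.stop()` in a later cell may never execute. One common way to guard against that in a script is a `try`/`finally` block. The sketch below illustrates the pattern with `FakeRun`, a hypothetical stand-in that only records calls, so it runs without a Neptune connection; `log_with_cleanup` is likewise a name invented for this example.

```python
class FakeRun:
    """Hypothetical stand-in for a Neptune run; it only records calls."""

    def __init__(self):
        self.fields = {}
        self.stopped = False

    def __setitem__(self, key, value):
        self.fields[key] = value

    def stop(self):
        self.stopped = True


def log_with_cleanup(run, metadata):
    """Assign each metadata field to the run, then stop it even on error."""
    try:
        for key, value in metadata.items():
            run[key] = value
    finally:
        run.stop()


run_stub = FakeRun()
log_with_cleanup(run_stub, {"model_id": "neptune_cancer_prediction_model"})
print(run_stub.stopped)  # prints: True
```

The same `try`/`finally` shape should carry over to a real run object, since it relies only on item assignment and `stop()`, both of which this notebook already uses.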

## Log training and validation records to Arize

Arize logs training and validation records to an Evaluation Store for model pre-launch validation, such as visualizing performance across different feature slices (for example, model accuracy for lower-income versus higher-income individuals).
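
To make the idea of a feature slice concrete, the sketch below computes accuracy separately for each slice in plain Python. It is only an illustration of the concept, not how Arize computes its views: `accuracy_by_slice` is a hypothetical helper, the records are toy values invented for this example, and the radius threshold of 15 is arbitrary.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_of):
    """Group (features, predicted, actual) records by slice and score each group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for features, predicted, actual in records:
        key = slice_of(features)
        totals[key] += 1
        hits[key] += int(predicted == actual)
    return {key: hits[key] / totals[key] for key in totals}


# Toy records, invented for illustration only: (features, predicted, actual).
toy_records = [
    ({"mean radius": 11.0}, 1, 1),
    ({"mean radius": 12.5}, 1, 1),
    ({"mean radius": 19.0}, 0, 0),
    ({"mean radius": 21.0}, 1, 0),
]

# Hypothetical slicing rule: split on an arbitrary radius threshold.
scores = accuracy_by_slice(
    toy_records, lambda f: "large" if f["mean radius"] >= 15 else "small"
)
print(scores)  # prints: {'small': 1.0, 'large': 0.5}
```

A gap between slices like the one in this toy output is the kind of signal that pre-launch validation is meant to surface.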

In [None]:
# (Arize) Initialize Arize client
arize = Client(space_key=os.environ["ARIZE_SPACE_KEY"], api_key=os.environ["ARIZE_API_KEY"])

# Generate model predictions
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

# (Arize) Log training records
train_prediction_labels = pd.Series(y_train_pred, name="predicted_labels")
train_actual_labels = pd.Series(y_train, name="actual_labels")
# X_train is already a DataFrame with named columns; convert it to {column name: list of values}
train_feature_df = X_train.to_dict("list")

train_responses = arize.log(
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    prediction_label=train_prediction_labels,
    actual_label=train_actual_labels,
    environment=Environments.TRAINING,
    features=train_feature_df,
)

# (Arize) Log validation records
val_prediction_labels = pd.Series(y_val_pred)
val_actual_labels = pd.Series(y_val)
# X_val is already a DataFrame with named columns; convert it to {column name: list of values}
val_features_df = X_val.to_dict("list")

val_responses = arize.log(
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    batch_id="batch0",
    prediction_label=val_prediction_labels,
    actual_label=val_actual_labels,
    environment=Environments.VALIDATION,
    features=val_features_df,
)
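
The notebook stores the return values of `arize.log` in `train_responses` and `val_responses` but never inspects them. The sketch below shows one way to check such a result, under the assumption that each call returns a `concurrent.futures.Future` resolving to an HTTP response with `status_code` and `text` attributes. That behavior depends on the installed Arize SDK version, so verify it before relying on this. `check_response` is a hypothetical helper, and the example uses a stand-in future so that it runs without an Arize account.

```python
from concurrent.futures import Future
from types import SimpleNamespace


def check_response(future, timeout=10):
    """Wait for one logging call and raise if the HTTP status is not 200."""
    response = future.result(timeout=timeout)
    if response.status_code != 200:
        raise RuntimeError(
            f"logging failed with status {response.status_code}: {response.text}"
        )
    return response.status_code


# Stand-in future (assumption: mimics the shape of an Arize logging result).
fake = Future()
fake.set_result(SimpleNamespace(status_code=200, text="ok"))
print(check_response(fake))  # prints: 200
```

If the assumption holds for your SDK version, `check_response(train_responses)` and `check_response(val_responses)` would be the corresponding calls.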