[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/unboxai/examples-gallery/blob/main/tabular-classification/sklearn/churn-classifier/churn-classifier-sklearn.ipynb)


# Churn classification using sklearn

This notebook illustrates how sklearn models can be uploaded to the Unbox platform.

In [None]:
%%bash

if [ ! -e "requirements.txt" ]; then
    curl "https://raw.githubusercontent.com/unboxai/examples-gallery/main/tabular-classification/sklearn/churn-classifier/requirements.txt" --output "requirements.txt"
fi

In [None]:
!pip install -r requirements.txt

## Importing the modules and loading the dataset

In [None]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

The dataset is hosted in the S3 bucket below. If reading the CSV directly from the URL fails, paste the URL into your browser and download the file. The dataset is also available from [this Kaggle competition](https://www.kaggle.com/competitions/churn-modelling/overview).

In [None]:
DATASET_URL = "https://unbox-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/tabular-classification/Churn_Modelling.csv"

In [None]:
data = pd.read_csv(DATASET_URL)
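
If the direct read fails, one option is to fall back to a copy of the file downloaded by hand. The following is a minimal sketch, not part of the original notebook: the helper name `load_csv_with_fallback` and the default local path `Churn_Modelling.csv` are assumptions made for illustration.

```python
import pandas as pd


def load_csv_with_fallback(url, local_path="Churn_Modelling.csv"):
    """Try to read a CSV from `url`; on an I/O error, read `local_path` instead.

    Hypothetical helper: the name and the default local path are assumptions.
    """
    try:
        return pd.read_csv(url)
    except OSError:
        # Network and missing-file errors are subclasses of OSError,
        # so fall back to the manually downloaded copy.
        return pd.read_csv(local_path)
```

With the file saved next to the notebook, `data = load_csv_with_fallback(DATASET_URL)` would behave like the cell above when the URL is reachable.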

In [None]:
X = data.iloc[:, 3:-1]
y = data.iloc[:, -1]
X

## Pre-processing the data

In [None]:
def data_encode_one_hot(df, encoders):
    """ Encodes categorical features using one-hot encoding. """
    df = df.copy(True)
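    # Reset the index so the join below aligns row by row (otherwise NaNs appear)
    df.reset_index(drop=True, inplace=True)
    for feature, enc in encoders.items():
        # get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out when available
        get_names = getattr(enc, "get_feature_names_out", None) or enc.get_feature_names
        enc_df = pd.DataFrame(enc.transform(df[[feature]]).toarray(), columns=get_names([feature]))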
        df = df.join(enc_df)
        df = df.drop(columns=feature)
    return df

In [None]:
def create_encoder_dict(df, categorical_feature_names):
    """ Creates encoders for each of the categorical features. 
        The predict function will need these encoders. 
    """
    from sklearn.preprocessing import OneHotEncoder
    encoders = {}
    for feature in categorical_feature_names:
        enc = OneHotEncoder(handle_unknown='ignore')
        enc.fit(df[[feature]])
        encoders[feature] = enc
    return encoders

In [None]:
encoders = create_encoder_dict(X, ['Geography', 'Gender'])

X_enc_one_hot = data_encode_one_hot(X, encoders)
X_enc_one_hot
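
To see what the encoding step does in isolation, here is a small sketch on a made-up three-row frame (the values are illustrative and not taken from the churn dataset). It builds the column names from the fitted `categories_` attribute, so it does not depend on a particular scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame with one categorical and one numeric column (made-up values).
toy = pd.DataFrame({"Geography": ["France", "Spain", "France"], "Age": [42, 35, 51]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(toy[["Geography"]])

# One output column per category seen during fitting.
columns = [f"Geography_{category}" for category in enc.categories_[0]]
encoded = pd.DataFrame(enc.transform(toy[["Geography"]]).toarray(), columns=columns)

# Replace the original categorical column with its one-hot columns.
toy_one_hot = toy.drop(columns="Geography").join(encoded)

# With handle_unknown="ignore", a category not seen during fitting
# is encoded as a row of zeros instead of raising an error.
unseen = enc.transform(pd.DataFrame({"Geography": ["Germany"]})).toarray()
```

The `handle_unknown="ignore"` setting matters later on, because rows sent to the model may contain categories that were absent when the encoders were fitted.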

## Splitting the data into training and validation sets

In [None]:
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
x_train_one_hot = data_encode_one_hot(x_train, encoders)
x_val_one_hot = data_encode_one_hot(x_val, encoders)

## Training and evaluating the model's performance

In [None]:
sklearn_model = LogisticRegression(random_state=1300)
sklearn_model.fit(x_train_one_hot, y_train)

In [None]:
print(classification_report(y_val, sklearn_model.predict(x_val_one_hot)))

## Unbox part!

### pip installing unboxapi

In [None]:
!pip install unboxapi

### Instantiating the client

In [None]:
import unboxapi

client = unboxapi.UnboxClient("YOUR_API_KEY_HERE")

### Creating a project on the platform

In [None]:
from unboxapi.tasks import TaskType

project = client.create_or_load_project(name="Churn Prediction",
                                        task_type=TaskType.TabularClassification,
                                        description="Evaluation of ML approaches to predict churn")

### Uploading the validation set

In [None]:
# Some variables that will be used for the dataset upload
class_names = ["Retained", "Exited"]
feature_names = list(x_val.columns)

# Add the ground truths to the original dataset for Unbox
validation_set = x_val.copy()
validation_set['churn'] = y_val.values
training_set = x_train.copy()
training_set['churn'] = y_train.values

In [None]:
dataset = project.add_dataframe(
    df=validation_set,
    class_names=class_names,
    label_column_name='churn',
    commit_message='first commit!',
    feature_names=feature_names,
    categorical_feature_names=["Gender", "Geography"],
)

### Uploading the model

First, it is important to create a `predict_proba` function, which is how Unbox interacts with your model.

In [None]:
def predict_proba(model, input_features: np.ndarray, col_names, one_hot_encoder, encoders):
    """Convert the raw input_features into one-hot encoded features
    using our one-hot encoder and each feature's encoder, then return
    the model's predicted class probabilities. """
    df = pd.DataFrame(input_features, columns=col_names)
    encoded_df = one_hot_encoder(df, encoders)
    return model.predict_proba(encoded_df.to_numpy())

Let's test the `predict_proba` function to make sure the input-output format is consistent with what Unbox expects:

In [None]:
predict_proba(sklearn_model, x_val[:3][feature_names].to_numpy(), feature_names, data_encode_one_hot, encoders)
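
The notebook does not spell out the exact format Unbox expects. A reasonable sanity check, assuming the output should contain one row per input and one probability per class, could look like the sketch below. The helper `check_proba_output` is hypothetical and is not part of `unboxapi`.

```python
import numpy as np


def check_proba_output(probabilities, n_rows, n_classes):
    """Sanity-check a predict_proba result (hypothetical helper).

    Assumes the expected output is array-like with shape (n_rows, n_classes)
    and that each row is a probability distribution.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    # One row per input, one column per class.
    assert probabilities.shape == (n_rows, n_classes), probabilities.shape
    # Every entry is a valid probability.
    assert np.all((probabilities >= 0) & (probabilities <= 1))
    # Each row sums to one.
    assert np.allclose(probabilities.sum(axis=1), 1.0)
    return True
```

For the call above, this would be used as `check_proba_output(result, n_rows=3, n_classes=len(class_names))`, where `result` is the array returned by `predict_proba`.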

Now, we can upload the model:

In [None]:
from unboxapi.models import ModelType

model = project.add_model(
    function=predict_proba, 
    model=sklearn_model,
    model_type=ModelType.sklearn,
    class_names=class_names,
    name='Churn Classifier',
    commit_message='this is my churn classification model',
    feature_names=feature_names,
    train_sample_df=training_set[:3000],
    train_sample_label_column_name='churn',
    categorical_feature_names=["Gender", "Geography"],
    col_names=feature_names,
    requirements_txt_file='requirements.txt',
    one_hot_encoder=data_encode_one_hot,
    encoders=encoders,
)