[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/unboxai/examples-gallery/blob/main/tabular-classification/documentation-tutorial/tabular-tutorial-part-2.ipynb)



# Welcome to the Unbox tabular tutorial - Part 2

Use this notebook together with the final part of the [**tabular tutorial**](https://docs.unbox.ai/docs/uploading-your-first-model-and-dataset) from our documentation. This is where we fix the issue identified in the first version of our model.

In [None]:
%%bash

if [ ! -e "requirements.txt" ]; then
    curl "https://raw.githubusercontent.com/unboxai/examples-gallery/main/tabular-classification/documentation-tutorial/requirements.txt" --output "requirements.txt"
fi

In [None]:
!pip install -r requirements.txt

## 1. Loading the original training set

First, let's import the libraries we need and load the churn training and validation sets. We will then confirm the issue identified during the tutorial.

In [None]:
import numpy as np
import pandas as pd

from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [None]:
TRAINING_SET_URL = "https://unbox-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/tabular-classification/Churn+prediction/churn_train.csv"
VALIDATION_SET_URL = "https://unbox-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/tabular-classification/Churn+prediction/churn_val.csv"

In [None]:
training_set = pd.read_csv(TRAINING_SET_URL)
val_set = pd.read_csv(VALIDATION_SET_URL)

During the tutorial, we discovered that our model had trouble with samples from female users, and more specifically with retained female users. As a result, one error class was 5x larger than the other.

We hypothesized that this was a symptom of that group being underrepresented in the training set. This is indeed the case!

In [None]:
training_set.groupby(["Gender", "Exited"])["Exited"].count()

Notice how there are only 100 samples from female users. Out of those, only 20 are from retained female users. On the other hand, male users make up ~97% of the dataset. It is clear that we need more data for female users to remove our model's bias!
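
To turn group counts like the ones above into proportions, a minimal sketch follows. It uses a small made-up DataFrame (the `toy` values are illustrative, not the churn data) with the same `Gender` and `Exited` column names; in the notebook you would apply the same calls to `training_set`.

```python
import pandas as pd

# Toy stand-in for the training set (illustrative values only).
toy = pd.DataFrame({
    "Gender": ["Male"] * 6 + ["Female"] * 2,
    "Exited": [0, 0, 0, 0, 1, 1, 1, 0],
})

# Number of rows in each (Gender, Exited) group.
counts = toy.groupby(["Gender", "Exited"]).size()

# Share of the whole dataset that each group represents.
shares = counts / len(toy)

# Share of each gender, ignoring the label.
gender_share = toy["Gender"].value_counts(normalize=True)

print(shares)
print(gender_share)
```

A group whose share is very small compared to the others is a candidate for the kind of underrepresentation discussed here.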

## 2. Augmenting the training set

To augment our training set, we have obtained almost 5,000 new labeled samples from production.

In [None]:
NEW_PROD_DATA_URL = "https://unbox-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/tabular-classification/Churn+prediction/churn_new_prod_data.csv"

In [None]:
new_prod_data = pd.read_csv(NEW_PROD_DATA_URL)
new_prod_data.head()

We are going to augment our training set with 3,000 new samples from female users. This should make our training set much more balanced.

In [None]:
female_user_data = new_prod_data[new_prod_data["Gender"] == "Female"]

In [None]:
training_set_new = pd.concat([training_set, female_user_data.iloc[:3000, :]])

## 3. Pre-process data and re-train the model

Now it's time to pre-process the data again, to encode the categorical features and re-train our gradient boosting classifier.

In [None]:
X_train = training_set_new.iloc[:, 3:-1]
y_train = training_set_new.iloc[:, -1]

X_val = val_set.iloc[:, 3:-1]
y_val = val_set.iloc[:, -1]

In [None]:
def data_encode_one_hot(df, encoders):
    """ Encodes categorical features using one-hot encoding. """
    df = df.copy(True)
    df.reset_index(drop=True, inplace=True) # Causes NaNs otherwise
    for feature, enc in encoders.items():
        print(f"encoding {feature}")
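        # Note: get_feature_names was removed in scikit-learn 1.2;
        # on newer versions, use enc.get_feature_names_out([feature]) instead.
        enc_df = pd.DataFrame(enc.transform(df[[feature]]).toarray(), columns=enc.get_feature_names([feature]))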
        df = df.join(enc_df)
        df = df.drop(columns=feature)
    return df
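
The index reset inside `data_encode_one_hot` matters because `DataFrame.join` aligns rows by index label, not by position. The sketch below illustrates this with made-up data (`original`, `prod`, and the hand-built `encoded` frame are hypothetical stand-ins, not the tutorial's datasets): after concatenating a filtered DataFrame, the index contains repeated and out-of-range labels, so joining a freshly built frame produces NaNs and misaligned values unless the index is reset first.

```python
import pandas as pd

# Toy stand-ins: an original training set and new production rows.
original = pd.DataFrame({"Gender": ["Male", "Male"]})
prod = pd.DataFrame(
    {"Gender": ["Male", "Female", "Male", "Female", "Female", "Male", "Female"]}
)

# Filtering keeps the original index labels (1, 3, 4, 6).
extra = prod[prod["Gender"] == "Female"]

# Concatenating without resetting gives the index 0, 1, 1, 3, 4, 6.
combined = pd.concat([original, extra])

# A frame built from a plain array gets a fresh 0..5 index,
# like the encoded columns built inside data_encode_one_hot.
encoded = pd.DataFrame(
    {"Gender_Female": (combined["Gender"] == "Female").astype(float).to_numpy()}
)

# join aligns on index labels: label 6 has no match, label 1 is matched twice.
misaligned = combined.join(encoded)

# Resetting the index first makes the labels line up with the positions.
aligned = combined.reset_index(drop=True).join(encoded)

print(misaligned["Gender_Female"].isna().sum())  # rows left without a match
print(aligned["Gender_Female"].isna().sum())
```

An alternative with the same effect would be to pass `ignore_index=True` to `pd.concat` when building the augmented training set.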

In [None]:
def create_encoder_dict(df, categorical_feature_names):
    """ Creates encoders for each of the categorical features. 
        The predict function will need these encoders. 
    """
    encoders = {}
    for feature in categorical_feature_names:
        enc = OneHotEncoder(handle_unknown='ignore')
        enc.fit(df[[feature]])
        encoders[feature] = enc
    return encoders

In [None]:
# creating the encoder dict for the categorical features (gender and geography)
encoders = create_encoder_dict(X_train, ['Geography', 'Gender'])

In [None]:
# encoding the categorical features in our training and validation sets
X_train_one_hot = data_encode_one_hot(X_train, encoders)
X_val_one_hot = data_encode_one_hot(X_val, encoders)
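
As a small illustration of what the encoders do, the sketch below fits a `OneHotEncoder` with `handle_unknown='ignore'` on a made-up `Geography` column and then transforms rows that include a category not seen during fitting. The category values are assumptions chosen for the example; the point is that an unseen category is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training column (illustrative categories).
train = pd.DataFrame({"Geography": ["France", "Spain", "Germany"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["Geography"]])

# "Italy" was not present when the encoder was fit.
new_rows = pd.DataFrame({"Geography": ["Spain", "Italy"]})
encoded = enc.transform(new_rows[["Geography"]]).toarray()

# Columns follow the sorted categories learned during fit.
print(list(enc.categories_[0]))
print(encoded)
```

This is why the same fitted encoders are reused for the training set, the validation set, and later for predictions: the column layout stays fixed even if new data contains unexpected values.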

In [None]:
sklearn_model = GradientBoostingClassifier(random_state=42) 
sklearn_model.fit(X_train_one_hot, y_train)

In [None]:
print(classification_report(y_val, sklearn_model.predict(X_val_one_hot)))
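
The overall report does not show whether the retrained model improved specifically for female users. One way to check is to compute accuracy per gender, sketched below on made-up labels and predictions (the `results` values are illustrative). In the notebook, the three columns would presumably come from `val_set["Gender"]`, `y_val`, and `sklearn_model.predict(X_val_one_hot)`.

```python
import pandas as pd

# Toy labels and predictions (illustrative values only).
results = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male", "Male", "Male"],
    "y_true": [0, 0, 1, 0, 1, 0, 1],
    "y_pred": [1, 0, 1, 0, 1, 0, 0],
})

# Mark each prediction as correct or not, then average within each gender.
results["correct"] = results["y_true"] == results["y_pred"]
per_gender_accuracy = results.groupby("Gender")["correct"].mean()

print(per_gender_accuracy)
```

A large gap between the two values would suggest the model still treats the groups differently.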

## 4. Unbox part -- have fun creating the next few cells!

Now it's up to you! We will just define a few quantities and functions you will need.

Head back to the tutorial to see how you need to fill out the next few cells.

In [None]:
feature_names = X_train.columns.values.tolist()
categorical_feature_names = ["Gender", "Geography"]
class_names = ["Retained", "Exited"]

In [None]:
def predict_proba(model, input_features: np.ndarray, col_names, one_hot_encoder, encoders):
    df = pd.DataFrame(input_features, columns=col_names)
    encoded_df = one_hot_encoder(df, encoders)
    return model.predict_proba(encoded_df.to_numpy())
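
Before uploading, it can help to call `predict_proba` locally and check the shape of its output. The sketch below repeats the function so that it runs on its own and swaps in hypothetical stand-ins: `StubModel` and `passthrough_encoder` are made up for this example, and the column names are placeholders. In the notebook you would pass `sklearn_model`, `feature_names`, `data_encode_one_hot`, and `encoders` instead.

```python
import numpy as np
import pandas as pd

def predict_proba(model, input_features, col_names, one_hot_encoder, encoders):
    df = pd.DataFrame(input_features, columns=col_names)
    encoded_df = one_hot_encoder(df, encoders)
    return model.predict_proba(encoded_df.to_numpy())

class StubModel:
    """Hypothetical model that returns a fixed probability for every row."""
    def predict_proba(self, X):
        p = np.full(len(X), 0.3)
        return np.column_stack([1 - p, p])

def passthrough_encoder(df, encoders):
    """Hypothetical encoder that leaves the DataFrame unchanged."""
    return df

# Two rows with two placeholder numeric features.
inputs = np.array([[600, 40], [700, 35]])
probs = predict_proba(
    StubModel(), inputs, ["CreditScore", "Age"], passthrough_encoder, {}
)

# One row per input, one column per class, each row summing to 1.
print(probs.shape)
print(probs.sum(axis=1))
```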

In [None]:
# instantiating the client and loading the project
import unboxapi

client = unboxapi.UnboxClient('YOUR_API_KEY_HERE')
project = client.load_project(name='Churn prediction')

In [None]:
# uploading the model to the project
from unboxapi.models import ModelType

model = project.add_model(
    name='Churn Classifier',
    commit_message='Retrain on augmented training set with female users',
    function=predict_proba, 
    model=sklearn_model,
    model_type=ModelType.sklearn,
    class_names=class_names,
    feature_names=feature_names,
    train_sample_df=training_set_new[:3000],
    train_sample_label_column_name='Exited',
    requirements_txt_file='requirements.txt',
    categorical_feature_names=categorical_feature_names,
    col_names=feature_names,
    one_hot_encoder=data_encode_one_hot,
    encoders=encoders,
)