[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/unboxai/examples-gallery/blob/main/tabular-classification/documentation-tutorial/tabular-tutorial-part-1.ipynb)



# Welcome to the Openlayer tabular tutorial - Part 1

You should use this notebook together with the [**tabular tutorial**](https://docs.openlayer.com/docs/uploading-your-first-model-and-dataset) from our documentation.

In [1]:
%%bash

if [ ! -e "requirements.txt" ]; then
    curl "https://raw.githubusercontent.com/unboxai/examples-gallery/main/tabular-classification/documentation-tutorial/requirements.txt" --output "requirements.txt"
fi

In [2]:
!pip install -r requirements.txt



## 1. Loading the dataset

First, let's import the libraries we need and load the churn training and validation datasets.

In [3]:
import numpy as np
import pandas as pd

from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

We have stored the dataset in the following S3 bucket. If you get an error reading the CSV files directly from it, copy and paste the URLs into your browser and download the files instead. The churn dataset we use is derived from the one in [this Kaggle competition](https://www.kaggle.com/competitions/churn-modelling/overview).

In [4]:
TRAINING_SET_URL = "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/tabular-classification/Churn+prediction/churn_train.csv"
VALIDATION_SET_URL = "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/tabular-classification/Churn+prediction/churn_val.csv"

In [5]:
# loading both sets and having a look at the training set
training_set = pd.read_csv(TRAINING_SET_URL)
val_set = pd.read_csv(VALIDATION_SET_URL)

training_set.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,6,15574012,Chu,645,Spain,Male,44,8,113755.78,2,1,0,149756.71,1
1,7,15592531,Bartlett,822,France,Male,50,7,0.0,2,1,1,10062.8,0
2,9,15792365,He,501,France,Male,44,4,142051.07,2,0,1,74940.5,0
3,10,15592389,H?,684,France,Male,27,2,134603.88,1,1,1,71725.73,0
4,11,15767821,Bearce,528,France,Male,31,6,102016.72,2,0,0,80181.12,0


The label we want to learn to predict is in the column `Exited`: retained users have a value of 0, while users who exited have a value of 1. Additionally, we **don't** want to use the `RowNumber`, `CustomerId`, and `Surname` columns in our model, so we exclude them from our dataset.

In [6]:
X_train = training_set.iloc[:, 3:-1]
y_train = training_set.iloc[:, -1]

X_val = val_set.iloc[:, 3:-1]
y_val = val_set.iloc[:, -1]
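
The positional slicing above depends on the column order of the CSV. As a minimal sketch of an alternative, the same split can be written by column name. The toy frame and the `toy`, `X_toy`, `y_toy`, `LABEL`, and `ID_COLUMNS` names below are made up for illustration and are not part of the tutorial.

```python
import pandas as pd

# Toy frame mimicking a few churn columns (values are made up for illustration)
toy = pd.DataFrame({
    "RowNumber": [1, 2],
    "CustomerId": [101, 102],
    "Surname": ["A", "B"],
    "CreditScore": [600, 700],
    "Geography": ["Spain", "France"],
    "Exited": [1, 0],
})

LABEL = "Exited"
ID_COLUMNS = ["RowNumber", "CustomerId", "Surname"]

# Drop the identifier columns and the label by name instead of by position
X_toy = toy.drop(columns=ID_COLUMNS + [LABEL])
y_toy = toy[LABEL]

print(X_toy.columns.tolist())
```

Selecting by name keeps working if columns are reordered, at the cost of failing loudly when a listed column is missing.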

## 2. Pre-processing the data

Notice from the output of a previous cell that the users' genders and geographies are **categorical features**. Therefore, before feeding the data into the model, we need to encode them. Let's apply **one-hot encoding**, which is a common choice when dealing with categorical features.

In [7]:
def data_encode_one_hot(df, encoders):
    """ Encodes categorical features using one-hot encoding. """
    df = df.copy(True)
    df.reset_index(drop=True, inplace=True) # Causes NaNs otherwise
    for feature, enc in encoders.items():
        print(f"encoding {feature}")
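        # get_feature_names was removed in scikit-learn 1.2; fall back to it only on older versions
        get_names = getattr(enc, "get_feature_names_out", None) or enc.get_feature_names
        enc_df = pd.DataFrame(enc.transform(df[[feature]]).toarray(), columns=get_names([feature]))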
        df = df.join(enc_df)
        df = df.drop(columns=feature)
    return df

In [8]:
def create_encoder_dict(df, categorical_feature_names):
    """ Creates encoders for each of the categorical features. 
        The predict function will need these encoders. 
    """
    encoders = {}
    for feature in categorical_feature_names:
        enc = OneHotEncoder(handle_unknown='ignore')
        enc.fit(df[[feature]])
        encoders[feature] = enc
    return encoders

In [9]:
# creating the encoder dict for the categorical features (gender and geography)
encoders = create_encoder_dict(X_train, ['Geography', 'Gender'])

In [10]:
# encoding the categorical features in our training and validation sets
X_train_one_hot = data_encode_one_hot(X_train, encoders)
X_val_one_hot = data_encode_one_hot(X_val, encoders)

encoding Geography
encoding Gender
encoding Geography
encoding Gender
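
To make the effect of `handle_unknown='ignore'` concrete, here is a small sketch on made-up data (the `toy_*` names are hypothetical and independent of the churn dataset). A category that was not seen during `fit` is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_train = pd.DataFrame({"Geography": ["France", "Spain", "Germany", "France"]})
toy_new = pd.DataFrame({"Geography": ["Spain", "Italy"]})  # "Italy" was never seen during fit

toy_enc = OneHotEncoder(handle_unknown="ignore")
toy_enc.fit(toy_train[["Geography"]])

# One column per category learned during fit, in sorted order
toy_encoded = toy_enc.transform(toy_new[["Geography"]]).toarray()

print(list(toy_enc.categories_[0]))
print(toy_encoded)
```

The first row has a single 1 in the `Spain` column, while the `Italy` row is all zeros.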


## 3. Training and evaluating our model

We are going to train a gradient boosting classifier on the training data. Let's then check the model's performance on the validation set.

In [11]:
sklearn_model = GradientBoostingClassifier(random_state=42) 
sklearn_model.fit(X_train_one_hot, y_train)

GradientBoostingClassifier(random_state=42)

In [12]:
print(classification_report(y_val, sklearn_model.predict(X_val_one_hot)))

              precision    recall  f1-score   support

           0       0.91      0.68      0.78       795
           1       0.38      0.75      0.50       205

    accuracy                           0.69      1000
   macro avg       0.65      0.72      0.64      1000
weighted avg       0.80      0.69      0.72      1000
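
When reading a report like the one above, it can help to see how precision and recall for the positive class follow from the confusion matrix. The sketch below uses made-up labels and predictions (`toy_true` and `toy_pred` are illustrative, not the tutorial's data).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

toy_true = [0, 0, 0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

precision = tp / (tp + fp)  # of the rows predicted as 1, how many truly are 1
recall = tp / (tp + fn)     # of the rows that truly are 1, how many were found

print(tn, fp, fn, tp)
print(precision, recall)
```

The hand-computed values agree with `precision_score(toy_true, toy_pred)` and `recall_score(toy_true, toy_pred)`.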



## 4. Openlayer part!

Now it's up to you! We will just compute a few important variables and concatenate the features (`X`) and labels (`y`), because Openlayer expects a single dataframe with both features and labels for the upload.

Head back to the tutorial for an explanation of the next few cells.

In [13]:
feature_names = X_train.columns.values.tolist()
categorical_feature_names = ["Gender", "Geography"]
class_names = ["Retained", "Exited"]

In [14]:
training_set = pd.concat([X_train, y_train], axis=1)
validation_set = pd.concat([X_val, y_val], axis=1)
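
One thing worth knowing about `pd.concat(..., axis=1)` is that it aligns rows by index rather than by position. In this notebook the features and labels are sliced from the same dataframe, so their indexes match. The sketch below, on made-up data with hypothetical `toy_*` names, shows what happens when they do not.

```python
import pandas as pd

toy_X = pd.DataFrame({"Age": [44, 50, 27]}, index=[0, 1, 2])
toy_y_aligned = pd.Series([1, 0, 0], index=[0, 1, 2], name="Exited")
toy_y_shifted = pd.Series([1, 0, 0], index=[5, 6, 7], name="Exited")

# Matching indexes: one row per example, no missing values
aligned = pd.concat([toy_X, toy_y_aligned], axis=1)

# Mismatched indexes: rows are unioned and the gaps are filled with NaN
misaligned = pd.concat([toy_X, toy_y_shifted], axis=1)

print(aligned.shape, int(aligned.isna().sum().sum()))
print(misaligned.shape, int(misaligned.isna().sum().sum()))
```

If the indexes ever differ, calling `reset_index(drop=True)` on both pieces before concatenating is one way to restore positional alignment.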

In [15]:
# installing the Openlayer Python API
!pip install openlayer

Collecting openlayer
  Using cached openlayer-0.3.0-py3-none-any.whl (25 kB)
Collecting protobuf==3.20.0
  Using cached protobuf-3.20.0-cp38-cp38-macosx_10_9_x86_64.whl (962 kB)
Collecting urllib3<=1.25.11
  Using cached urllib3-1.25.11-py2.py3-none-any.whl (127 kB)




Installing collected packages: urllib3, protobuf, openlayer
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.26.12
    Uninstalling urllib3-1.26.12:
      Successfully uninstalled urllib3-1.26.12
  Attempting uninstall: protobuf
    Found existing installation: protobuf 4.21.5
    Uninstalling protobuf-4.21.5:
      Successfully uninstalled protobuf-4.21.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
twine 4.0.0 requires urllib3>=1.26.0, but you have urllib3 1.25.11 which is incompatible.
scaleapi 2.9.0 requires urllib3>=1.26.0, but you have urllib3 1.25.11 which is incompatible.
Successfully installed protobuf-3.20.0 openlayer-0.3.0 urllib3-1.25.11


In [17]:
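# instantiating the client
import openlayer

# pointing the client at a local, on-premise deployment (as used for this run)
openlayer.api.OPENLAYER_ENDPOINT = "http://localhost:8080/v1"
openlayer.api.STORAGE = openlayer.api.StorageType.ONPREM

# replace the placeholder with your own API key; avoid committing real keys to a notebook
client = openlayer.OpenlayerClient('YOUR_API_KEY_HERE')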
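
Rather than pasting an API key into a notebook cell, one option is to read it from an environment variable. This is only a sketch: the `load_api_key` helper and the `OPENLAYER_API_KEY` variable name are assumptions made for illustration, not something the Openlayer client defines.

```python
import os


def load_api_key(var_name="OPENLAYER_API_KEY"):
    """Return the API key stored in the given environment variable.

    Both the helper and the default variable name are hypothetical.
    """
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before creating the client.")
    return key


# Possible usage with the client from the cell above:
# client = openlayer.OpenlayerClient(load_api_key())
```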

In [18]:
# creating the project
from openlayer.tasks import TaskType

project = client.create_project(name="Churn prediction",
                               task_type=TaskType.TabularClassification,
                               description="Evaluation of ML approaches to predict churn")

Created your project. Navigate to http://localhost:8000/projects/1 to see it.


In [19]:
# uploading the dataset to the project
dataset = project.add_dataframe(
  df=validation_set,  
  commit_message='churn validation set for October',
  class_names=class_names,  
  label_column_name='Exited',
  feature_names=feature_names,  
  categorical_feature_names=categorical_feature_names,  
)

Adding your dataset to Openlayer! Check out the project page to have a look.


In [20]:
# defining the model's predict probability function
def predict_proba(model, input_features: np.ndarray, col_names: list, one_hot_encoder, encoders):
    # Pre-processing the categorical features
    df = pd.DataFrame(input_features, columns=col_names)
    encoded_df = one_hot_encoder(df, encoders)
    
    # Getting the model's predictions
    preds = model.predict_proba(encoded_df.to_numpy())
    
    return preds
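
Before uploading, it can be worth checking locally that a function of this shape takes raw, unencoded rows and returns one probability per class. The sketch below is self-contained and uses made-up data; `toy_encode` and `toy_predict_proba` are compact stand-ins written for this example, not the notebook's own functions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder


def toy_encode(df, encoders):
    """Compact stand-in for data_encode_one_hot (no printing)."""
    df = df.reset_index(drop=True)
    for feature, enc in encoders.items():
        cols = [f"{feature}_{c}" for c in enc.categories_[0]]
        enc_df = pd.DataFrame(enc.transform(df[[feature]]).toarray(), columns=cols)
        df = df.join(enc_df).drop(columns=feature)
    return df


def toy_predict_proba(model, input_features, col_names, one_hot_encoder, encoders):
    """Same signature as the notebook's predict_proba: raw rows in, probabilities out."""
    df = pd.DataFrame(input_features, columns=col_names)
    return model.predict_proba(one_hot_encoder(df, encoders).to_numpy())


# Made-up data: one numeric and one categorical feature
toy_X = pd.DataFrame({
    "Age": [25, 40, 33, 58, 47, 29],
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
})
toy_y = [0, 1, 0, 1, 1, 0]

toy_encoders = {"Gender": OneHotEncoder(handle_unknown="ignore").fit(toy_X[["Gender"]])}
toy_model = GradientBoostingClassifier(random_state=42)
toy_model.fit(toy_encode(toy_X, toy_encoders).to_numpy(), toy_y)

# Raw rows as a NumPy array, as the function signature expects
raw_rows = toy_X.to_numpy()
probs = toy_predict_proba(toy_model, raw_rows, ["Age", "Gender"], toy_encode, toy_encoders)

print(probs.shape)                          # one row per input, one column per class
print(np.allclose(probs.sum(axis=1), 1.0))  # each row of probabilities sums to 1
```

The same two checks (output shape and rows summing to 1) can be applied to the notebook's `predict_proba` with `sklearn_model`, `feature_names`, `data_encode_one_hot`, and `encoders`.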

In [21]:
# uploading the model to the project
from openlayer.models import ModelType

model = project.add_model(
    function=predict_proba, 
    model=sklearn_model,
    model_type=ModelType.sklearn,
    class_names=class_names,
    name='Churn Classifier',
    commit_message='this is my churn classification model',
    feature_names=feature_names,
    train_sample_df=training_set[:3000],
    train_sample_label_column_name='Exited',
    categorical_feature_names=categorical_feature_names,
    requirements_txt_file='requirements.txt',
    col_names=feature_names,
    one_hot_encoder=data_encode_one_hot,
    encoders=encoders,
)

Bundling model and artifacts...
Adding your model to Openlayer! Check out the project page to have a look.
