[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/unboxai/examples-gallery/blob/main/tabular-classification/documentation-tutorial/tabular_tutorial.ipynb)


# Welcome to the Unbox tabular tutorial!

We did our best to make it as simple as possible. Use this notebook together with the [**tabular tutorial**](https://docs.unbox.ai/docs/uploading-your-first-model-and-dataset) from our documentation.

In [None]:
!pip install -r requirements.txt

## 1. Loading the dataset

First, let's import the libraries we need and load the churn dataset.

In [3]:
import numpy as np
import pandas as pd

from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

We have stored the dataset in the following S3 bucket. If, for some reason, you get an error reading the CSV directly from it, copy and paste the URL into your browser and download the CSV file. Alternatively, you can find the dataset in [this Kaggle competition](https://www.kaggle.com/competitions/churn-modelling/overview).

In [4]:
DATASET_URL = "https://unbox-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/tabular-classification/Churn_Modelling.csv"
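
If reading from the URL fails, the fallback described above (download the file and read it locally) could be wrapped in a small helper. This is only a sketch: `read_csv_with_fallback` and the local file name are hypothetical and not part of the tutorial.

```python
from pathlib import Path

import pandas as pd


def read_csv_with_fallback(url, local_path):
    """Try the remote URL first; fall back to a local copy if that fails."""
    try:
        return pd.read_csv(url)
    except OSError as error:  # URLError and HTTPError are OSError subclasses
        if not Path(local_path).exists():
            raise  # no local copy either, so surface the original error
        print(f"Could not read {url} ({error}); using {local_path} instead")
        return pd.read_csv(local_path)
```

A call might look like `churn_dataset = read_csv_with_fallback(DATASET_URL, "Churn_Modelling.csv")`, assuming you saved the downloaded file under that name next to the notebook.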

In [5]:
# loading and having a look at the full churn dataset
churn_dataset = pd.read_csv(DATASET_URL)

churn_dataset.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


The label we want to learn to predict is in the column `Exited`: retained users have a value of 0 while users that exited have a value of 1. Additionally, we **don't** want to use the `RowNumber`, `CustomerId`, and `Surname` columns in our model, so we exclude them from our dataset.


In [6]:
churn_dataset = churn_dataset.iloc[:, 3:]
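
The cell above drops the columns by position, which relies on them being the first three columns. As a sketch of an alternative, the same columns can be dropped by name. The toy frame below is built from two rows of the output shown earlier and is only for illustration.

```python
import pandas as pd

# Toy frame with a subset of the churn columns (values taken from the head() output).
toy = pd.DataFrame({
    "RowNumber": [1, 2],
    "CustomerId": [15634602, 15647311],
    "Surname": ["Hargrave", "Hill"],
    "CreditScore": [619, 608],
    "Exited": [1, 0],
})

identifier_columns = ["RowNumber", "CustomerId", "Surname"]
trimmed = toy.drop(columns=identifier_columns)  # drop by name, not position

print(trimmed.columns.tolist())
```

Dropping by name fails loudly with a `KeyError` if a column is missing, whereas positional slicing would silently keep or drop the wrong columns if the file layout changed.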

## 2. Pre-processing the data

Notice from the output of one of the previous cells that the users' genders and geographies are **categorical features**. Therefore, before feeding the data into the model, we need to encode them. Let's apply **one-hot encoding**, which is a common choice when dealing with categorical features.
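
To make the idea concrete, here is a minimal plain-Python sketch of what one-hot encoding produces. The `one_hot` helper is hypothetical and only illustrates the technique; the notebook itself uses scikit-learn's `OneHotEncoder`. A value that was not among the known categories becomes a row of zeros, which mirrors the `handle_unknown='ignore'` setting used in this notebook.

```python
def one_hot(values, categories):
    """Return one indicator row per value; unknown values become all zeros."""
    return [
        [1.0 if value == category else 0.0 for category in categories]
        for value in values
    ]


feature = "Geography"
categories = sorted({"France", "Spain"})               # known categories
columns = [f"{feature}_{category}" for category in categories]
rows = one_hot(["France", "Spain", "Germany"], categories)

print(columns)  # ['Geography_France', 'Geography_Spain']
for row in rows:
    print(row)  # the last row is all zeros because 'Germany' is unknown here
```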

In [7]:
# computing some important information about our dataset -- which we'll need in the future 
categorical_map = {feature: list(churn_dataset[feature].unique()) for feature in ["Gender", "Geography"]}
class_names = ["Retained", "Exited"]
feature_names = churn_dataset.columns.values.tolist()[:-1]

In [8]:
def data_encode_one_hot(df, encoders):
    """ Encodes categorical features using one-hot encoding. """
    df = df.copy(True)
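    # join() aligns on the index, so reset it to 0..n-1 to match enc_df (NaNs otherwise)
    df.reset_index(drop=True, inplace=True)
    for feature, enc in encoders.items():
        print(f"encoding {feature}")
        # get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out when available
        get_names = getattr(enc, "get_feature_names_out", None) or enc.get_feature_names
        enc_df = pd.DataFrame(enc.transform(df[[feature]]).toarray(), columns=get_names([feature]))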
        df = df.join(enc_df)
        df = df.drop(columns=feature)
    return df
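
The `reset_index` call in the function above matters because `DataFrame.join` matches rows by index label, not by position. The following sketch on toy data (not the churn dataset) shows what happens when a shuffled index meets a freshly built frame.

```python
import pandas as pd

# Rows as they might look after a shuffled split: the index is no longer 0, 1, 2.
shuffled = pd.DataFrame({"Gender": ["Female", "Male", "Female"]}, index=[7, 2, 5])

# Encoded columns built from a plain array get a fresh 0..n-1 index.
encoded = pd.DataFrame({
    "Gender_Female": [1.0, 0.0, 1.0],
    "Gender_Male": [0.0, 1.0, 0.0],
})

misaligned = shuffled.join(encoded)                      # joins on index labels
aligned = shuffled.reset_index(drop=True).join(encoded)  # labels now match

print(misaligned["Gender_Female"].isna().sum())  # 2 rows found no match
print(aligned["Gender_Female"].tolist())         # [1.0, 0.0, 1.0]
```

Note that the one row of `misaligned` that did find a match (index label 2) received the encoding of a different row, so the problem is not only the NaNs.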

In [9]:
def create_encoder_dict(df, categorical_feature_names):
    """ Creates encoders for each of the categorical features. 
        The predict function will need these encoders. 
    """
    encoders = {}
    for feature in categorical_feature_names:
        enc = OneHotEncoder(handle_unknown='ignore')
        enc.fit(df[[feature]])
        encoders[feature] = enc
    return encoders

In [10]:
# creating the encoder dict for the categorical features (gender and geography)
encoders = create_encoder_dict(churn_dataset, ['Geography', 'Gender'])

## 3. Splitting the data into training and validation sets

Now that we are ready to encode our categorical features, let's split the data into training and validation sets.

In [11]:
x_train, x_val, y_train, y_val = train_test_split(churn_dataset.iloc[:, :-1], churn_dataset.iloc[:, -1], test_size=0.2, random_state=42)
x_train_one_hot = data_encode_one_hot(x_train, encoders)
x_val_one_hot = data_encode_one_hot(x_val, encoders)

encoding Geography
encoding Gender
encoding Geography
encoding Gender
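
In the split above, `test_size=0.2` holds out 20% of the rows for validation and `random_state=42` makes the shuffle reproducible. The idea can be sketched in plain Python. This is an illustration only, not scikit-learn's implementation; `split_indices` is a hypothetical helper, and rounding the validation share up is a choice made for this sketch.

```python
import math
import random


def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and slice off a validation share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed, same order every run
    n_val = math.ceil(n_rows * test_size)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(10)
print(len(train_idx), len(val_idx))  # 8 2
```

Because features and labels are split with the same shuffled indices, each row of `x_train` still lines up with its label in `y_train`.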


## 4. Training and evaluating our model

We are going to train a gradient boosting classifier on the training data. Let's then check the model's performance on the validation set.

In [12]:
sklearn_model = GradientBoostingClassifier(random_state=42) 
sklearn_model.fit(x_train_one_hot, y_train)

GradientBoostingClassifier(random_state=42)

In [13]:
print("The model's accuracy on the validation set is equal to: " + 
      str(100 * accuracy_score(y_val, sklearn_model.predict(x_val_one_hot))) + "%")

The model's accuracy on the validation set is equal to: 86.4%
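
Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels (not the churn data) to show what the metric measures; the `accuracy` helper is hypothetical, and the notebook itself relies on `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    matches = sum(true == pred for true, pred in zip(y_true, y_pred))
    return matches / len(y_true)


y_true = [0, 0, 1, 0, 1]  # toy labels: 0 = Retained, 1 = Exited
y_pred = [0, 0, 0, 0, 1]  # one exited user was predicted as retained

print(f"{100 * accuracy(y_true, y_pred):.1f}%")  # 80.0%
```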


## 5. Unbox part -- have fun creating the next few cells!

Now it's up to you! We will just concatenate the features (x) and the labels (y), because Unbox expects a single dataframe with both for the upload.

Head back to the tutorial to see how you need to fill out the next few cells.

In [14]:
training_set = pd.concat([x_train, y_train], axis=1)
validation_set = pd.concat([x_val, y_val], axis=1)
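
`pd.concat(..., axis=1)` places the inputs side by side and matches rows by index label, so the features and labels line up as long as they share the same index, as they do after the split. A sketch on toy data (values are illustrative):

```python
import pandas as pd

# Features and labels that share the same (shuffled) index, as after a split.
x = pd.DataFrame({"CreditScore": [619, 608, 502], "Age": [42, 41, 42]}, index=[7, 2, 5])
y = pd.Series([1, 0, 1], index=[7, 2, 5], name="Exited")

combined = pd.concat([x, y], axis=1)  # the label becomes the last column

print(combined.columns.tolist())  # ['CreditScore', 'Age', 'Exited']
```

Note that the cell above concatenates `x_train` and `x_val`, not their one-hot encoded versions, so the resulting sets still contain the raw categorical features. That is why the predict function needs the encoders.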

In [13]:
# instantiating the client and creating the project

In [14]:
# defining the predict function

In [15]:
# uploading the model

In [16]:
# uploading the dataset