## Installing the Python API and downloading the dataset

In [None]:
!pip install unboxapi

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting unboxapi
  Downloading unboxapi-0.1.2-py3-none-any.whl (15 kB)
Collecting bentoml==0.13.1
  Downloading BentoML-0.13.1-py3-none-any.whl (4.0 MB)
Collecting simple-di
  Downloading simple_di-0.1.5-py3-none-any.whl (9.8 kB)
Collecting configparser
  Downloading configparser-5.2.0-py3-none-any.whl (19 kB)
Collecting deepmerge
  Downloading deepmerge-1.0.1-py3-none-any.whl (8.0 kB)
Collecting sqlalchemy<1.4.0,>=1.3.0
  Downloading SQLAlchemy-1.3.24-cp37-cp37m-manylinux2010_x86_64.whl (1.3 MB)
Collecting boto3
  Downloading boto3-1.24.29-py3-none-any.whl (132 kB)
Collecting docker
  Downloading docker-5.0.3-py2.py3-none-any.whl (146 kB)

In [None]:
!wget "https://raw.githubusercontent.com/unboxai/artifacts/master/training.csv"

--2022-06-29 19:54:04--  https://raw.githubusercontent.com/unboxai/artifacts/master/training.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 663037 (647K) [text/plain]
Saving to: ‘training.csv’


2022-06-29 19:54:04 (74.9 MB/s) - ‘training.csv’ saved [663037/663037]



# Welcome to the Unbox NLP tutorial!

We did our best to make it as simple as possible. Use this notebook together with the tutorial in our documentation.

## 1. Loading the dataset

First, let's import the libraries we need and load the banking dataset.

In [None]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

In [None]:
# loading and having a look at the full banking dataset
banking_dataset = pd.read_csv("training.csv")

banking_dataset.head()

| Unnamed: 0 | text | category |
|---|---|---|
| 0 | I am still waiting on my card? | card_arrival |
| 1 | What can I do if my card still hasn't arrived ... | card_arrival |
| 2 | I have been waiting over a week. Is the card s... | card_arrival |
| 3 | Can I track my card while it is in the process... | card_arrival |
| 4 | How do I know if I will get my card, or if it ... | card_arrival |


The label we want to predict is in the `category` column. Before training, we need to encode it so that each category is mapped to an integer code. pandas makes this easy.

In [None]:
banking_dataset['category'] = banking_dataset['category'].astype('category')
banking_dataset['label_code'] = banking_dataset['category'].cat.codes
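
To make the encoding concrete, here is a minimal sketch on a toy frame. The three rows are made up for illustration and are not taken from the banking dataset. pandas stores the distinct categories in sorted order and uses each category's position in that order as its code, so the codes follow the sort order of the names rather than their order of appearance.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "text": ["Where is my card?", "How do I change my PIN?", "Is there an age limit?"],
    "category": ["card_arrival", "change_pin", "age_limit"],
})

toy["category"] = toy["category"].astype("category")
toy["label_code"] = toy["category"].cat.codes

# Categories are stored in sorted order; each code is a position in that list
categories = list(toy["category"].cat.categories)
codes = toy["label_code"].tolist()
print(categories)
print(codes)
```

String sorting is case-sensitive, which is why a name starting with an uppercase letter ends up before all lowercase names.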

## 2. Getting the label list

The label list is simply a list with the names of all the categories, ordered so that the name at index `i` corresponds to label code `i`. We pass this list as an argument when we upload the model and the dataset to Unbox so that the classes are displayed with their names.

In [None]:
label_dict = dict(zip(banking_dataset['category'].cat.codes, banking_dataset['category']))

label_list = [None] * len(label_dict)
for index, label in label_dict.items():
    label_list[index] = label
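
As a cross-check, the same list can probably be obtained more directly from `cat.categories`, since pandas already keeps the category names in code order. The sketch below repeats the construction from the cell above on a toy Series (invented for illustration) and compares the two results.

```python
import pandas as pd

# Toy categories, invented for illustration only
category = pd.Series(
    ["card_arrival", "change_pin", "age_limit", "card_arrival"]
).astype("category")

# Same construction as in the cell above: map each code to its name
label_dict = dict(zip(category.cat.codes, category))
label_list = [None] * len(label_dict)
for index, label in label_dict.items():
    label_list[index] = label

# Shorter equivalent: the categories are already stored in code order
label_list_short = list(category.cat.categories)
print(label_list == label_list_short)
```

The two agree as long as every category occurs at least once in the data, which is the case when the categories are inferred with `astype('category')`.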

## 3. Splitting the data into training and validation sets

Now, let's split the banking dataset into training and validation sets. To do so, we will shuffle the data and use the first 7000 rows as a training set and the remaining ones as a validation set.

In [None]:
# shuffling the data
banking_dataset = banking_dataset.sample(frac=1, random_state=42)  

training_set = banking_dataset[:7000]
validation_set = banking_dataset[7000:]
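
The shuffle-then-slice pattern can be checked on a small scale. The following is a minimal sketch on a made-up 10-row frame with a 7/3 split, chosen only for illustration. It shows that the slices are taken by position in the shuffled order and that the two sets do not share any row.

```python
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({"text": [f"message {i}" for i in range(10)]})

# frac=1 returns every row in random order; random_state makes it repeatable
shuffled = toy.sample(frac=1, random_state=42)

# Plain slicing on a DataFrame is positional: the first 7 shuffled rows
train = shuffled[:7]
valid = shuffled[7:]

# The original index labels travel with the rows, so an empty intersection
# confirms that the two sets do not overlap
overlap = set(train.index) & set(valid.index)
print(len(train), len(valid), overlap)
```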

## 4. Training and evaluating our model

We are going to train a logistic regression model on the training set, using counts of unigrams and bigrams as features. Then, let's check the model's performance on the validation set.

In [None]:
sklearn_model = Pipeline([('count_vect', CountVectorizer(ngram_range=(1,2), stop_words='english')), 
                          ('lr', LogisticRegression(random_state=42))])
sklearn_model.fit(training_set['text'], training_set['label_code'])

Pipeline(steps=[('count_vect',
                 CountVectorizer(ngram_range=(1, 2), stop_words='english')),
                ('lr', LogisticRegression(random_state=42))])

In [None]:
print("The model's accuracy on the validation set is equal to: " + 
      str(100 * accuracy_score(validation_set['label_code'], sklearn_model.predict(validation_set['text']))) + "%")

The model's accuracy on the validation set is equal to: 84.63073852295409%
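
For reference, accuracy is just the fraction of predictions that match the true labels. A minimal sketch with made-up label arrays (not the notebook's predictions) shows that `accuracy_score` agrees with a manual computation, and how an f-string could round the printed value:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels, for illustration only
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1])

acc = accuracy_score(y_true, y_pred)
manual = (y_true == y_pred).mean()  # 4 of the 5 predictions match

# Round to two decimals when printing
message = f"Accuracy: {100 * acc:.2f}%"
print(message)
```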


## 5. Unbox part -- have fun creating the next few cells!

Now it's up to you!

Head back to the tutorial to see how you need to fill out the next few cells.

In [None]:
# instantiating the client
import unboxapi

# replace the placeholder with your own API key; avoid committing real keys
client = unboxapi.UnboxClient('YOUR_API_KEY_HERE')

In [None]:
# defining the predict function
def predict_function(model, text_list):
    return model.predict_proba(text_list)
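
To see what this function returns, here is a minimal sketch with a tiny stand-in pipeline. The texts, labels, and `toy_model` are invented for illustration. `predict_proba` returns one row per input text and one column per class. The column order follows `model.classes_`, which is sorted in ascending order, so with integer label codes it should line up with `label_list`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up training set with two classes, for illustration only
texts = ["my card has not arrived", "where is my card",
         "i forgot my pin", "how do i change my pin"]
labels = [0, 0, 1, 1]

toy_model = Pipeline([("count_vect", CountVectorizer()),
                      ("lr", LogisticRegression(random_state=42))])
toy_model.fit(texts, labels)

def predict_function(model, text_list):
    return model.predict_proba(text_list)

probs = predict_function(toy_model, ["card still not here", "reset pin"])
print(probs.shape)        # (number of texts, number of classes)
print(probs.sum(axis=1))  # each row sums to 1, up to floating-point error
```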

In [None]:
# uploading the model
from unboxapi.tasks import TaskType
from unboxapi.models import ModelType

unbox_model = client.add_model(
    function=predict_function, 
    model=sklearn_model,
    model_type=ModelType.sklearn,
    task_type=TaskType.TextClassification,
    class_names=label_list,
    name="Banking Classifier",
    description="this is my sklearn banking model"
)

unbox_model.to_dict()

Bundling model and artifacts...
Uploading model to Unbox...


100%|██████████| 4.88M/4.88M [00:00<00:00, 24.5MB/s]


{'classNames': ['Refund_not_showing_up',
  'age_limit',
  'atm_support',
  'automatic_top_up',
  'balance_not_updated_after_cheque_or_cash_deposit',
  'beneficiary_not_allowed',
  'cancel_transfer',
  'card_acceptance',
  'card_arrival',
  'card_delivery_estimate',
  'card_linking',
  'card_not_working',
  'card_payment_fee_charged',
  'card_payment_not_recognised',
  'card_payment_wrong_exchange_rate',
  'card_swallowed',
  'change_pin',
  'compromised_card',
  'contactless_not_working',
  'declined_card_payment',
  'declined_cash_withdrawal',
  'declined_transfer',
  'direct_debit_payment_not_recognised',
  'disposable_card_limits',
  'edit_personal_details',
  'exchange_rate',
  'exchange_via_app',
  'extra_charge_on_statement',
  'failed_transfer',
  'fiat_currency_support',
  'get_physical_card',
  'getting_spare_card',
  'getting_virtual_card',
  'lost_or_stolen_card',
  'lost_or_stolen_phone',
  'passcode_forgotten',
  'pending_card_payment',
  'pending_cash_withdrawal',
  'pend

In [None]:
# uploading the dataset
from unboxapi.tasks import TaskType

dataset = client.add_dataframe(
    df=validation_set,
    class_names=label_list,
    label_column_name="label_code",
    text_column_name="text",
    task_type=TaskType.TextClassification,
    name="Banking Validation",
    description="my banking validation dataset"
)

dataset.to_dict()

100%|██████████| 85.1k/85.1k [00:00<00:00, 908kB/s]


{'classNameCounts': None,
 'classNames': ['Refund_not_showing_up',
  'age_limit',
  'atm_support',
  'automatic_top_up',
  'balance_not_updated_after_cheque_or_cash_deposit',
  'beneficiary_not_allowed',
  'cancel_transfer',
  'card_acceptance',
  'card_arrival',
  'card_delivery_estimate',
  'card_linking',
  'card_not_working',
  'card_payment_fee_charged',
  'card_payment_not_recognised',
  'card_payment_wrong_exchange_rate',
  'card_swallowed',
  'change_pin',
  'compromised_card',
  'contactless_not_working',
  'declined_card_payment',
  'declined_cash_withdrawal',
  'declined_transfer',
  'direct_debit_payment_not_recognised',
  'disposable_card_limits',
  'edit_personal_details',
  'exchange_rate',
  'exchange_via_app',
  'extra_charge_on_statement',
  'failed_transfer',
  'fiat_currency_support',
  'get_physical_card',
  'getting_spare_card',
  'getting_virtual_card',
  'lost_or_stolen_card',
  'lost_or_stolen_phone',
  'passcode_forgotten',
  'pending_card_payment',
  'pending