[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/unboxai/examples-gallery/blob/main/text-classification/documentation-tutorial/nlp_tutorial.ipynb)

# Welcome to the Unbox NLP tutorial!

We did our best to keep it as simple as possible. Use this notebook alongside the tutorial in our documentation.

In [None]:
!curl "https://raw.githubusercontent.com/unboxai/examples-gallery/main/text-classification/documentation-tutorial/requirements.txt" --output "requirements.txt"

In [None]:
!pip install -r requirements.txt

## 1. Loading the dataset

First, let's import the libraries we need and load the banking dataset.

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

The dataset is stored in the following S3 bucket. If you get an error reading the CSV directly from it, copy and paste the URL into your browser and download the CSV file. Alternatively, you can find the dataset on [Hugging Face](https://huggingface.co/datasets/banking77).

In [2]:
DATASET_URL = "https://unbox-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/text-classification/banking.csv"

In [3]:
# loading and having a look at the full banking dataset
banking_dataset = pd.read_csv(DATASET_URL)

banking_dataset.head()

| Unnamed: 0 | text | category |
|---|---|---|
| 0 | I am still waiting on my card? | card_arrival |
| 1 | What can I do if my card still hasn't arrived ... | card_arrival |
| 2 | I have been waiting over a week. Is the card s... | card_arrival |
| 3 | Can I track my card while it is in the process... | card_arrival |
| 4 | How do I know if I will get my card, or if it ... | card_arrival |
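
If reading directly from the URL fails, the download-and-read-locally fallback mentioned earlier could look roughly like the sketch below. The helper name `load_csv_with_fallback` and the `local_path` argument are hypothetical and exist only for this illustration; they are not part of the tutorial.

```python
from pathlib import Path

import pandas as pd


def load_csv_with_fallback(url, local_path):
    """Try to read a CSV from `url`; on failure, read a local copy instead.

    Hypothetical helper for illustration only.
    """
    try:
        return pd.read_csv(url)
    except Exception as error:  # network, HTTP, or missing-file problems
        if not Path(local_path).exists():
            raise FileNotFoundError(
                f"Could not read {url!r} and found no local copy at {local_path!r}"
            ) from error
        # reading the copy that was downloaded manually through the browser
        return pd.read_csv(local_path)
```

Assuming you saved the file as `banking.csv` next to the notebook, you would call it as `load_csv_with_fallback(DATASET_URL, "banking.csv")`.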


The label we want to predict is in the `category` column. First, we need to encode it so that each category is assigned an integer code. We can easily do that with pandas.

In [4]:
banking_dataset['category'] = banking_dataset['category'].astype('category')
banking_dataset['label_code'] = banking_dataset['category'].cat.codes
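
To see what this encoding does, here is a small sketch on a toy DataFrame (the rows are made up for illustration). pandas sorts the distinct category names and assigns each one an integer code according to its position in that sorted order.

```python
import pandas as pd

# toy data, made up for this illustration
toy = pd.DataFrame(
    {"category": ["card_arrival", "atm_support", "card_arrival", "top_up_failed"]}
)

toy["category"] = toy["category"].astype("category")
toy["label_code"] = toy["category"].cat.codes

categories = toy["category"].cat.categories.tolist()
codes = toy["label_code"].tolist()

print(categories)  # ['atm_support', 'card_arrival', 'top_up_failed']
print(codes)       # [1, 0, 1, 2]
```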

## 2. Getting the label list

The label list is simply a list with the names of all the categories. This list will be passed as an argument when we upload our model and dataset to Unbox so that we can display them nicely.

In [5]:
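# mapping each integer code to its category name
label_dict = dict(zip(banking_dataset['category'].cat.codes, banking_dataset['category']))

# placing each category name at the position given by its code
label_list = [None] * len(label_dict)
for index, label in label_dict.items():
    label_list[index] = label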
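
As a side note, the loop above ends up with the category names ordered by their codes, which is the order pandas already stores in `.cat.categories`. The sketch below checks this on toy data (made up for illustration); it is an alternative you may prefer, not something the tutorial requires.

```python
import pandas as pd

# toy data, made up for this illustration
categories = pd.Series(
    ["card_arrival", "atm_support", "card_arrival"]
).astype("category")

# loop-based construction, mirroring the cell above
label_dict = dict(zip(categories.cat.codes, categories))
label_list = [None] * len(label_dict)
for index, label in label_dict.items():
    label_list[index] = label

# one-line alternative: the categories are already ordered by code
shortcut = categories.cat.categories.tolist()

print(label_list == shortcut)  # True
```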

## 3. Splitting the data into training and validation sets

Now, let's split the banking dataset into training and validation sets. To do so, we will shuffle the data and use the first 7000 rows as a training set and the remaining ones as a validation set.

In [6]:
# shuffling the data
banking_dataset = banking_dataset.sample(frac=1, random_state=42)  

training_set = banking_dataset[:7000]
validation_set = banking_dataset[7000:]
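
With many categories, a plain shuffle can leave some of them under-represented in one of the two sets. A stratified split is one way to keep class proportions similar in both. The sketch below uses scikit-learn's `train_test_split` on toy data (made up for illustration); it is an alternative to the shuffle-and-slice approach above, not what this notebook does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy data, made up for this illustration: 10 rows, 2 balanced classes
toy = pd.DataFrame(
    {
        "text": [f"message {i}" for i in range(10)],
        "label_code": [0] * 5 + [1] * 5,
    }
)

# stratify keeps the class proportions the same in both subsets
train, val = train_test_split(
    toy, train_size=6, random_state=42, stratify=toy["label_code"]
)

train_counts = train["label_code"].value_counts().to_dict()
val_counts = val["label_code"].value_counts().to_dict()
print(train_counts, val_counts)
```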

## 4. Training and evaluating our model

We are going to train a logistic regression model on the training data and then check its performance on the validation set.

In [7]:
sklearn_model = Pipeline([('count_vect', CountVectorizer(ngram_range=(1,2), stop_words='english')), 
                          ('lr', LogisticRegression(random_state=42))])
sklearn_model.fit(training_set['text'], training_set['label_code'])

Pipeline(steps=[('count_vect',
                 CountVectorizer(ngram_range=(1, 2), stop_words='english')),
                ('lr', LogisticRegression(random_state=42))])
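
To get a feel for what the `CountVectorizer` step feeds into the logistic regression, the sketch below fits a vectorizer with the same settings on a single sentence and lists the resulting vocabulary. English stop words are removed first, and the bigrams are then built from the remaining tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
counts = vectorizer.fit_transform(["I am still waiting on my card"])

# vocabulary_ maps each unigram or bigram to its column index
features = sorted(vectorizer.vocabulary_)
print(features)
print(counts.shape)  # (number of documents, number of features)
```

Because stop words such as "on" and "my" are dropped before n-grams are formed, the bigram `waiting card` appears even though the two words are not adjacent in the original sentence.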

In [8]:
validation_accuracy = accuracy_score(validation_set['label_code'], sklearn_model.predict(validation_set['text']))
print(f"The model's accuracy on the validation set is equal to: {100 * validation_accuracy}%")

The model's accuracy on the validation set is equal to: 84.63073852295409%
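
Accuracy here is simply the fraction of validation rows whose predicted label matches the true one. The sketch below reproduces `accuracy_score` by hand on toy labels (made up for illustration).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# toy labels, made up for this illustration
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

# fraction of positions where the prediction equals the true label
manual_accuracy = np.mean(y_true == y_pred)
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(f"{100 * manual_accuracy:.2f}%")  # 83.33%
```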


## 5. Unbox part -- have fun creating the next few cells!

Now it's up to you!

Head back to the tutorial to see how you need to fill out the next few cells.

In [9]:
# instantiating the client and creating the project

In [10]:
# defining the predict function

In [11]:
# uploading the model

In [12]:
# uploading the dataset