# Text & Categorical-set

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/yggdrasil-decision-forests/blob/main/documentation/public/docs/tutorial/categorical_set_feature.ipynb)

## Setup


In [None]:
pip install ydf datasets tiktoken -U

In [None]:
from itertools import islice

from datasets import load_dataset
import tiktoken
import ydf

## About this tutorial

This tutorial shows how to train a text classifier on the AG News dataset using categorical-set features, first with a whitespace tokenizer and then with a GPT tokenizer.

## What are text & categorical-set features?

A categorical-set feature is a type of input feature where each value is a set (i.e., an unordered collection) of categorical values. It differs from a classical categorical feature, where each value is a single categorical value.

| Examples of categorical values | Examples of categorical-set values |
|--------------------------------|------------------------------------|
| RED                            | {RED}                              |
| BLUE                           | {RED, BLUE}                        |
| BLUE                           | {}                                 |
| *missing*                      | *missing*                          |
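
In Python terms, a categorical-set value behaves like a set of strings. The sketch below uses only built-in types and is an illustration, not a YDF API. It shows that order and duplicates are ignored, and that an empty set is a valid value distinct from a missing one. Using `None` for "missing" is an assumption made for this illustration only.

```python
# Illustration only: plain Python sets, not a YDF API.
a = {"RED", "BLUE"}
b = set(["BLUE", "RED", "RED"])  # Different order, one duplicate.

same_value = a == b  # Order and duplicates do not matter.

empty_value = set()   # A valid categorical-set value with no items.
missing_value = None  # Assumption: None stands in for "missing" here.

print(same_value)                    # True
print(empty_value == missing_value)  # False
```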

## Why use categorical-set features?

Categorical-set features are useful for tag-like or tokenizable features such as text and URLs, especially with small datasets.
For example, the text feature value "I eat an applepie!" becomes the categorical-set feature value {"I", "eat", "an", "applepie!"} using whitespace-splitting unigram tokenization. Bigram or trigram tokenization can also be used and sometimes gives good results. For example, the same text becomes {"I_eat", "eat_an", "an_applepie!"} with bigrams.
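
The n-gram tokenization described above can be sketched in a few lines. `tokenize_ngrams` is a hypothetical helper written for this illustration. It splits on single spaces and joins consecutive words with "_", which reproduces the bigram example.

```python
def tokenize_ngrams(text, n=2):
  """Splits on single spaces, then joins each run of n consecutive words with "_"."""
  words = text.split(" ")
  return ["_".join(words[i : i + n]) for i in range(len(words) - n + 1)]


bigrams = tokenize_ngrams("I eat an applepie!", n=2)
print(bigrams)  # ['I_eat', 'eat_an', 'an_applepie!']
```

With `n=1` the helper returns the unigrams, and a text with fewer than `n` words yields an empty list.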

While simple, whitespace splitting does not handle punctuation well and does not work for languages that do not separate words with spaces.
If possible, prefer modern tokenizers such as [Google's Sentencepiece](https://github.com/google/sentencepiece), [OpenAI's Tiktoken](https://github.com/openai/tiktoken), or [Transformer tokenizers](https://huggingface.co/transformers/v3.0.2/model_doc/gpt2.html#gpt2tokenizer).
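
To make the punctuation issue concrete, the sketch below compares whitespace splitting with a small regex-based splitter. `tokenize_regex` is a hypothetical, standard-library-only helper used for illustration. It is not one of the tokenizers recommended above and is not used in the rest of this tutorial.

```python
import re


def tokenize_regex(text):
  # Matches runs of word characters, or single characters that are
  # neither word characters nor whitespace (i.e., punctuation).
  return re.findall(r"\w+|[^\w\s]", text)


whitespace_tokens = "I eat an applepie!".split(" ")
regex_tokens = tokenize_regex("I eat an applepie!")
print(whitespace_tokens)  # ['I', 'eat', 'an', 'applepie!']
print(regex_tokens)       # ['I', 'eat', 'an', 'applepie', '!']
```

With whitespace splitting, "applepie!" and "applepie" are two unrelated tokens, so the model cannot share what it learns between them.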

In this tutorial, we use tiktoken with the "r50k_base" encoding used by GPT-2 and some GPT-3 models.
With it, the text above becomes ["I", "_eat", "_an", "_apple", "pie", "!"], where "_" stands for a leading space.

## Loading the dataset

We are working with the AG News dataset, a classic text classification dataset. The task is to predict the category of a news article from its text.

Let's load the dataset.

In [None]:
def ag_news_dataset(split: str):
  class_mapping = {
      0: "World",
      1: "Sports",
      2: "Business",
      3: "Sci/Tech",
  }
  # Load only the requested split instead of the whole dataset dictionary.
  for example in load_dataset("ag_news", split=split):
    yield {
        "text": example["text"],
        "label": class_mapping[example["label"]],
    }


# Print the first 3 training examples
for example_idx, example in enumerate(islice(ag_news_dataset("train"), 3)):
  print(f"==========\nExample #{example_idx}\n----------")
  print(example)

Example #0
----------
{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.", 'label': 'Business'}
Example #1
----------
{'text': 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.', 'label': 'Business'}
Example #2
----------
{'text': "Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.", 'label': 'Business'}


Next, we define a tokenizer function that converts a text into a categorical-set value.

A whitespace tokenizer can be implemented as follows:

In [None]:
def tokenize_white_space(text):
  return text.split(" ")


tokenize_white_space("I eat an applepie!")

['I', 'eat', 'an', 'applepie!']

Alternatively, the Tiktoken tokenizer can be implemented as follows:

In [None]:
gpt2_tokenizer = tiktoken.get_encoding("r50k_base")

def tokenize_gpt2(text):
  # Encode the text to token ids, then map each id back to its byte string.
  return gpt2_tokenizer.decode_tokens_bytes(gpt2_tokenizer.encode(text))

tokenize_gpt2("I eat an applepie!")

[b'I', b' eat', b' an', b' apple', b'pie', b'!']

We will use the Tiktoken tokenizer in the rest of this tutorial.

In [None]:
tokenize = tokenize_gpt2

Let's load more data and apply the selected tokenizer. For this example to run quickly, we use at most 10k examples from each split.

YDF supports different dataset formats (see `ydf.help.loading_data()`). We will use Python dictionaries.

In [None]:
def create_dataset(split):
  labels = []
  tokens = []
  for example in islice(ag_news_dataset(split), 10_000):
    labels.append(example["label"])
    tokens.append(tokenize(example["text"]))
  return {"label": labels, "token": tokens}


train_dataset = create_dataset("train")
test_dataset = create_dataset("test")

Let's look at the first training example:

In [None]:
print("label:", train_dataset["label"][0])
print("tokens:", train_dataset["token"][0])

label: Business
tokens: [b'Wall', b' St', b'.', b' Bears', b' Claw', b' Back', b' Into', b' the', b' Black', b' (', b'Reuters', b')', b' Reuters', b' -', b' Short', b'-', b'sell', b'ers', b',', b' Wall', b' Street', b"'s", b' dwindling', b'\\', b'band', b' of', b' ultra', b'-', b'cy', b'n', b'ics', b',', b' are', b' seeing', b' green', b' again', b'.']
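
The tokens are `bytes` objects because `decode_tokens_bytes` returns the raw bytes of each token, and a single token is not guaranteed to be valid UTF-8 on its own. For reading or logging, the tokens can be converted to strings. The sketch below is a display convenience only. It uses a few tokens copied from the output above and does not change the dataset used for training.

```python
# A few byte tokens copied from the output above.
byte_tokens = [b"Wall", b" St", b".", b" Bears", b" Claw"]

# errors="replace" guards against tokens that are not valid UTF-8 on their own.
text_tokens = [t.decode("utf-8", errors="replace") for t in byte_tokens]
print(text_tokens)  # ['Wall', ' St', '.', ' Bears', ' Claw']
```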


## Train model

We can now train our model.

In [None]:
learner = ydf.RandomForestLearner(label="label", features=[("token", ydf.Semantic.CATEGORICAL_SET)])
model = learner.train(train_dataset)

Train model on 10000 examples
Model trained in 0:00:29.219372


It is always a good idea to check the model's description.

In [None]:
model.describe()

Next, you can inspect the conditions learned by the model. For example, the condition `"token" is in {prices, largest, maker, crude, sell, Hugo, workers, (Bloomberg), analysts}` is true when the article contains at least one of those tokens.

In [None]:
model.plot_tree(max_depth=2)
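
The "is in" condition described above can be sketched in plain Python. This only illustrates the semantics as stated in this tutorial. `contains_condition` and the example tokens are hypothetical and are not part of YDF.

```python
def contains_condition(example_tokens, mask):
  """True if at least one token of the example is in the condition's mask."""
  return not set(example_tokens).isdisjoint(mask)


mask = {"prices", "crude", "analysts"}
print(contains_condition(["crude", "oil", "rises"], mask))  # True
print(contains_condition(["team", "wins", "final"], mask))  # False
print(contains_condition([], mask))                         # False
```

An example with an empty token set never satisfies such a condition, since it shares no token with the mask.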

Finally, we can evaluate the model.

In [None]:
model.evaluate(test_dataset)

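| Label \ Pred | Sci/Tech | World | Business | Sports |
|--------------|----------|-------|----------|--------|
| Sci/Tech     | 1652     | 177   | 450      | 136    |
| World        | 84       | 1585  | 101      | 90     |
| Business     | 124      | 67    | 1310     | 30     |
| Sports       | 40       | 71    | 39       | 1644   |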
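
The overall accuracy can be read off the confusion matrix above: it is the sum of the diagonal divided by the total number of examples. The sketch below redoes this arithmetic with the numbers copied from the output. It does not call YDF, and the result does not depend on which axis holds the labels.

```python
# Confusion matrix copied from the evaluation output above.
confusion = [
    [1652, 177, 450, 136],
    [84, 1585, 101, 90],
    [124, 67, 1310, 30],
    [40, 71, 39, 1644],
]

# Correct predictions are on the diagonal.
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total
print(correct, total, round(accuracy, 4))  # 6191 7600 0.8146
```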
