# Pretrained Embedding

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/yggdrasil-decision-forests/blob/main/documentation/public/docs/tutorial/pretrained_embedding.ipynb)

## Setup

In [None]:
pip install ydf tensorflow_hub tensorflow_datasets -U

In [1]:
import ydf  # To train the model
import tensorflow_datasets  # To download the movie review dataset
import tensorflow_hub  # To download the pre-trained embedding

## What is a pre-trained embedding?

**Pre-trained embeddings** are models trained on a large corpus of data that can be used to improve the quality of your model when you do not have a lot of training data. Unlike a model that is trained for a specific task and outputs predictions for that task, a pre-trained embedding model outputs "embeddings": fixed-size numerical vectors that can be used as input features for a second model (e.g. a YDF model) to solve a variety of tasks. Pre-trained embeddings are also useful for applying a model to complex or unstructured data. For example, with an image, text, audio, or video pre-trained embedding, you can apply a YDF model to image, text, audio, or video data, respectively.
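
To make the "fixed-size vector" idea concrete, here is a minimal sketch. The `toy_embed` function and `EMBEDDING_DIM` are hypothetical names introduced only for illustration: the function is a hashed bag-of-words, not a pre-trained model, and it carries none of the semantic knowledge a real embedding has. It only shows the contract that matters here: texts of any length map to vectors of the same size, which a second model can then consume as features.

```python
import zlib

import numpy as np

EMBEDDING_DIM = 8  # Hypothetical size; real embeddings are much larger.


def toy_embed(text: str, dim: int = EMBEDDING_DIM) -> np.ndarray:
    """Maps a text of any length to a fixed-size vector (toy stand-in)."""
    vector = np.zeros(dim, dtype=np.float32)
    for word in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vector[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = np.linalg.norm(vector)
    return vector / norm if norm > 0 else vector


short_vec = toy_embed("Great movie")
long_vec = toy_embed("This is the kind of film for a snowy Sunday afternoon")
print(short_vec.shape, long_vec.shape)  # (8,) (8,)
```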

In this notebook, we will classify movie reviews as either "positive" or "negative". For instance, the review beginning with "This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful performances from Cher and Nicolas ..." is a positive review. Our dataset contains 25000 reviews, but because 25000 reviews are NOT enough to train a good text model, and because configuring a text model is complicated, we will simply use the [Universal Sentence Encoder](https://arxiv.org/abs/1803.11175) pre-trained embedding.


## Downloading dataset

We download the dataset from the [TensorFlow Datasets]() repository.

In [2]:
raw_train_ds = tensorflow_datasets.load(name="imdb_reviews", split="train")
raw_test_ds = tensorflow_datasets.load(name="imdb_reviews", split="test")

Let's look at the first 200 characters of the first 3 examples:

In [3]:
for example in raw_train_ds.take(3):
  print(f"""\
text: {example['text'].numpy()[:200]}
label: {example['label']}
=========================""")

text: b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting "
label: 0
text: b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However '
label: 0
text: b'Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do. <br /><br />But come on Hollywood - a Moun'
label: 0


## Downloading embedding

In [4]:
embed = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

We can test the embedding on any text. It returns a vector of numbers. While those values do not have inherent meaning to us, YDF is very good at consuming them.

In [5]:
embeddings = embed([
    "The little blue dog eats a piece of ham.",
    "It is raining today."]).numpy()
print(embeddings)

[[-0.01440776  0.04751815  0.05348268 ...  0.018609    0.03508667
  -0.03631601]
 [-0.0552231  -0.02168638  0.05072879 ...  0.03051734 -0.00266217
   0.01246582]]
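
Although the individual values are hard to interpret, one common way to inspect embeddings is to compare two vectors with cosine similarity. The sketch below is an illustration rather than part of the original tutorial: `cosine_similarity` is a helper defined here, and the short vectors are made-up stand-ins for real embedding rows.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Returns the cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up 4-dimensional vectors standing in for real embedding rows.
v1 = np.array([1.0, 0.0, 1.0, 0.0])
v2 = np.array([1.0, 0.0, 0.0, 0.0])
v3 = np.array([0.0, 1.0, 0.0, 0.0])

print(round(cosine_similarity(v1, v2), 3))  # 0.707
print(cosine_similarity(v1, v3))  # 0.0
```

With the array computed in the cell above, the same helper can be called as `cosine_similarity(embeddings[0], embeddings[1])`.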


## Apply embedding on dataset

We can apply the embedding to our dataset. Since the dataset and the embedding are both created with TensorFlow, we will prepare a TensorFlow Dataset and feed it directly into YDF. YDF natively consumes TensorFlow Datasets.

In [6]:
def apply_embedding(batch):
    batch["text"] = embed(batch["text"])
    return batch

# The batch size (256) has no impact on the YDF model. However,
# reading a TensorFlow dataset with a small (<50) batch size might
# be slow. Using a large batch size increases memory usage.
train_ds = raw_train_ds.batch(256).map(apply_embedding)
test_ds = raw_test_ds.batch(256).map(apply_embedding)
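
The following sketch illustrates, outside TensorFlow, why the batch size is only a performance knob. It assumes each example is embedded independently of the others in its batch. `fake_embed` and `embed_in_batches` are hypothetical helpers written for this illustration, and `fake_embed` is a trivial stand-in, not the Universal Sentence Encoder.

```python
import numpy as np


def fake_embed(texts):
    """Hypothetical stand-in for `embed`: one 2-dimensional row per text."""
    return np.array([[len(t), t.count(" ")] for t in texts], dtype=np.float32)


def embed_in_batches(texts, batch_size):
    """Splits `texts` into batches, embeds each batch, and stacks the rows."""
    rows = [
        fake_embed(texts[start:start + batch_size])
        for start in range(0, len(texts), batch_size)
    ]
    return np.concatenate(rows, axis=0)


texts = ["a b", "hello", "one two three", "x", "good movie"]
small_batches = embed_in_batches(texts, batch_size=2)
one_batch = embed_in_batches(texts, batch_size=5)
print(np.array_equal(small_batches, one_batch))  # True
```

The stacked rows are identical for both batch sizes, so a model trained on them sees the same features either way.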

Let's show the first 10 dimensions of the embedding for the first 3 examples of the first batch.

In [7]:
for example in train_ds.take(1):
  print(f"""\
text: {example['text'].numpy()[:3, :10]}
label: {example['label'].numpy()[:3]}
=========================""")

text: [[ 0.04070191  0.00420414 -0.01570062  0.06623042  0.06024029  0.00345815
  -0.00204514 -0.02974475 -0.06150667  0.02128238]
 [ 0.02308333  0.03450448  0.03191734  0.01053793 -0.004009    0.00847429
  -0.03853702 -0.02518811  0.03465953  0.08872268]
 [ 0.0223658  -0.00636589  0.04310491 -0.05726858  0.05173567 -0.02762526
  -0.0447326  -0.00299736 -0.0420398  -0.01994686]]
label: [0 0 0]


## Training a model on the pre-trained embedding

In [8]:
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)

Train model on 25000 examples
Model trained in 0:00:27.819662


We can observe the 512 dimensions of the embedding. In the "variable importance" tab, we see that not all dimensions of the embedding are equally useful. For example, the feature `text.111_of_512` is very useful for the model.

In [9]:
model.describe()

## Evaluating model

We evaluate the model on the test dataset.

In [10]:
model.evaluate(test_ds)

| Label \ Pred | 0 | 1 |
|---|---|---|
| 0 | 10724 | 1941 |
| 1 | 1776 | 10559 |


The model accuracy is ~85%. Not too bad for a model trained in a few seconds with default hyper-parameters :)
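
As a quick check of that figure, the accuracy can be recomputed by hand from the confusion matrix reported by the evaluation: it is the sum of the diagonal (correct predictions) divided by the total number of test examples. This is a plain-Python sketch that uses only the four counts shown above.

```python
# Confusion matrix counts copied from the evaluation output.
confusion = [
    [10724, 1941],
    [1776, 10559],
]

# Diagonal entries are the predictions that match the label.
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total
print(f"{accuracy:.4f}")  # 0.8513
```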