[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openlayer-ai/examples-gallery/blob/main/text-classification/tensorflow/tensorflow.ipynb)


# <a id="top"> Text classification using TensorFlow</a>

This notebook illustrates how TensorFlow models can be uploaded to the Openlayer platform.

## <a id="toc">Table of contents</a>

1. [**Getting the data and training the model**](#1)
    - [Downloading the dataset](#download)
    - [Preparing the data](#prepare)
    - [Training the model](#train)
    

2. [**Using Openlayer's Python API**](#2)
    - [Instantiating the client](#client)
    - [Creating a project](#project)
    - [Uploading datasets](#dataset)
    - [Uploading models](#model)
        - [Shell models](#shell)
        - [Full models](#full-model)
    - [Committing and pushing to the platform](#commit)

In [None]:
%%bash

if [ ! -e "requirements.txt" ]; then
    curl "https://raw.githubusercontent.com/openlayer-ai/examples-gallery/main/text-classification/tensorflow/requirements.txt" --output "requirements.txt"
fi

In [None]:
!pip install -r requirements.txt

## <a id="1"> 1. Getting the data and training the model </a>

[Back to top](#top)

In this first part, we will get the dataset, pre-process it, split it into training and validation sets, and train a model. Feel free to skim through this section if you are already comfortable with how these steps look for a TensorFlow model.

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf

from tensorflow import keras

### <a id="download">Downloading the dataset </a>


In [None]:
# Constants we'll use for the dataset
MAX_WORDS = 10000
REVIEW_CLASSES = ['negative', 'positive']

# download dataset from keras.
(_X_train, _y_train), (_X_test, _y_test) = keras.datasets.imdb.load_data(num_words=MAX_WORDS)

### <a id="prepare">Preparing the data</a>

The original dataset stores each review as a list of word indices. To make the reviews human-readable, we need the word index dict, which maps words to indices, and its inverse, which maps indices back to words.

In [None]:
# Word index dict for the IMDB dataset
tf.keras.datasets.imdb.get_word_index()

In [None]:
# The default word index maps words to ints. Shift it by 3 to make room for
# the special tokens, then build the reverse mapping (ints to words).
word_index = tf.keras.datasets.imdb.get_word_index()

word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  
word_index["<UNUSED>"] = 3

# word_index: <str> to <int>
# reverse_word_index: <int> to <str>
reverse_word_index = {value: key for key, value in word_index.items()}

In [None]:
def decode_review(text):
    """Function that makes the samples human-readable"""
    return ' '.join([reverse_word_index.get(i, '#') for i in text])

In [None]:
def encode_review(text):
    """Function that converts a human-readable sentence to the list of indices format"""
    words = text.split(' ')
    ids = [word_index["<START>"]]
    for w in words:
        v = word_index.get(w, word_index["<UNK>"])
        # load_data(num_words=MAX_WORDS) keeps only indices below MAX_WORDS,
        # so anything at or above it is mapped to <UNUSED>
        if v >= MAX_WORDS:
            v = word_index["<UNUSED>"]
        ids.append(v)
    return ids    
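
To see how the index shift and the two helper functions fit together, here is a minimal, self-contained sketch on a tiny made-up vocabulary. `toy_index`, `TOY_MAX_WORDS`, `toy_encode`, and `toy_decode` are hypothetical stand-ins for the real IMDB word index, `MAX_WORDS`, and the helpers above. The sketch assumes a `>=` cutoff, which matches how `load_data(num_words=...)` keeps only indices below `num_words`.

```python
# Toy vocabulary standing in for keras.datasets.imdb.get_word_index()
# (hypothetical data: the real index is far larger).
toy_index = {"the": 1, "movie": 2, "was": 3, "great": 4}
TOY_MAX_WORDS = 7  # hypothetical stand-in for MAX_WORDS

# Shift every index by 3 to make room for the special tokens.
toy_word_index = {word: idx + 3 for word, idx in toy_index.items()}
toy_word_index.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})

# Reverse mapping: int -> str
toy_reverse_index = {idx: word for word, idx in toy_word_index.items()}


def toy_encode(text):
    """Sentence -> list of indices, starting with <START>."""
    ids = [toy_word_index["<START>"]]
    for word in text.split(" "):
        idx = toy_word_index.get(word, toy_word_index["<UNK>"])
        if idx >= TOY_MAX_WORDS:  # outside the kept vocabulary
            idx = toy_word_index["<UNUSED>"]
        ids.append(idx)
    return ids


def toy_decode(ids):
    """List of indices -> sentence; indices with no entry become '#'."""
    return " ".join(toy_reverse_index.get(i, "#") for i in ids)


encoded = toy_encode("the movie was great")
decoded = toy_decode(encoded)
print(encoded)
print(decoded)
```

Here "great" gets the shifted index 7, which falls outside the toy vocabulary size, so it is encoded as `<UNUSED>`; a word missing from the index altogether is encoded as `<UNK>`.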

In [None]:
decode_review(_X_train[0])

In [None]:
decode_review(_X_train[1])

In [None]:
X_train = keras.preprocessing.sequence.pad_sequences(
    _X_train,
    dtype='int32',
    value=word_index["<PAD>"],
    padding='post',
    maxlen=256
)

X_test = keras.preprocessing.sequence.pad_sequences(
    _X_test,
    dtype='int32',
    value=word_index["<PAD>"],
    padding='post',
    maxlen=256
)


# One-hot encode the labels so that y has shape (num_samples, 2)
y_train = tf.one_hot(_y_train, depth=2)
y_test = tf.one_hot(_y_test, depth=2)
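
The two preprocessing steps above can be illustrated without TensorFlow. The sketch below uses hypothetical helpers, `pad_post` and `one_hot`, that mimic `pad_sequences(padding='post')` and `tf.one_hot` on plain Python lists. It assumes the `pad_sequences` default of `truncating='pre'`, which means reviews longer than `maxlen` lose their beginning, not their end.

```python
def pad_post(seq, maxlen, pad_value=0):
    """Mimic pad_sequences(padding='post') with the default truncating='pre'."""
    seq = list(seq)[-maxlen:]  # 'pre' truncation keeps the LAST maxlen items
    return seq + [pad_value] * (maxlen - len(seq))


def one_hot(label, depth):
    """Mimic tf.one_hot for a single integer label."""
    return [1.0 if i == label else 0.0 for i in range(depth)]


short_seq = pad_post([1, 14, 22], maxlen=5)
long_seq = pad_post([1, 14, 22, 16, 43, 530, 973], maxlen=5)
negative, positive = one_hot(0, depth=2), one_hot(1, depth=2)

print(short_seq)  # padded at the end
print(long_seq)   # truncated at the start
print(negative, positive)
```

Note that the truncated sequence no longer starts with the `<START>` index (1), which is what happens to long reviews in the cell above as well.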

### <a id="train">Training the model</a>

In [None]:
# Model definition: embedding -> average pooling -> small dense head
tf_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(MAX_WORDS, 8),
    tf.keras.layers.GlobalAvgPool1D(),
    tf.keras.layers.Dense(6, activation="relu"),
    # Softmax so that the two outputs form a probability distribution over the classes
    tf.keras.layers.Dense(2, activation="softmax"),
])


tf_model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

In [None]:
tf_model.fit(X_train, y_train, epochs=30, batch_size=512)

## <a id="2"> 2. Using Openlayer's Python API</a>

[Back to top](#top)

Now it's time to upload the datasets and model to the Openlayer platform.

In [None]:
!pip install openlayer



### <a id="client">Instantiating the client</a>

In [None]:
import openlayer

client = openlayer.OpenlayerClient("YOUR_API_KEY_HERE")

### <a id="project">Creating a project on the platform</a>

In [None]:
from openlayer.tasks import TaskType


project = client.create_or_load_project(
    name="Text classification with Tensorflow",
    task_type=TaskType.TextClassification,
    description="Evaluating NN for text classification"
)

### <a id="dataset">Uploading datasets</a>

Before adding the datasets to a project, we need to do two things:
1. Add the extra columns the platform needs: one for the labels and one for the model's predictions (if you're uploading a model as well).
2. Prepare a `dataset_config.yaml` file. This file contains all the information the Openlayer platform needs to use the dataset, such as the column names and the class names. For details on the fields of the `dataset_config.yaml` file, see the [API reference](https://reference.openlayer.com/reference/api/openlayer.OpenlayerClient.add_dataset.html#openlayer.OpenlayerClient.add_dataset).

Let's start by enhancing the datasets with the extra columns:

In [None]:
from typing import List

def make_pandas_df(X: np.ndarray, y: tf.Tensor) -> pd.DataFrame:
    """Receives X (with word indices) and y (one-hot labels) and builds a
    pandas DataFrame, with the text in the column `text`, the zero-indexed
    labels in the column `labels`, and the model's predicted probabilities
    in the column `predictions`.
    """
    special_tokens = ["<PAD>", "<START>", "<UNK>", "<UNUSED>"]
    text_data = []

    # Get the model's predictions (class probabilities)
    predictions = get_model_predictions(X)

    # Make the text human-readable (decode from word indices to words)
    for indices in X:
        text = decode_review(indices)
        for token in special_tokens:
            text = text.replace(token, "")
        text_data.append(text.strip())

    # Get the labels (zero-indexed) from the one-hot encoded tensor
    labels = y.numpy().argmax(axis=1).tolist()

    # Shuffle, keep 1,000 rows, and truncate each text to 700 characters
    data_dict = {"text": text_data, "labels": labels, "predictions": predictions}
    df = pd.DataFrame.from_dict(data_dict).sample(frac=1, random_state=1)[:1000]
    df["text"] = df["text"].str[:700]

    return df

def get_model_predictions(text_indices) -> List[List[float]]:
    """Gets the model's prediction probabilities. Returns one list
    per sample, each of length equal to the number of classes, where
    each item corresponds to the model's predicted probability
    for a given class.
    """
    X = keras.preprocessing.sequence.pad_sequences(
        text_indices,
        dtype="int32",
        value=word_index["<PAD>"],
        padding='post',
        maxlen=256
    )
    y = tf_model(X)

    return y.numpy().tolist()

In [None]:
training_set = make_pandas_df(_X_train, y_train)
validation_set = make_pandas_df(_X_test, y_test)

In [None]:
training_set.head()
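
Each entry in the `predictions` column should be a list with one value per class. A minimal sketch of a sanity check follows; `check_predictions` is a hypothetical helper (not part of the Openlayer API) and the rows are made up for illustration.

```python
def check_predictions(rows, n_classes):
    """Return the positions of rows that are not valid per-class scores.

    A row is considered valid if it has exactly n_classes values,
    all of them within [0, 1].
    """
    bad = []
    for position, row in enumerate(rows):
        if len(row) != n_classes or any(not 0.0 <= p <= 1.0 for p in row):
            bad.append(position)
    return bad


# Made-up rows: the third has the wrong length, the fourth is out of range.
example_rows = [[0.9, 0.1], [0.2, 0.8], [0.5], [1.3, -0.3]]
bad_rows = check_predictions(example_rows, n_classes=2)
print(bad_rows)
```

In this notebook, the same check could be run as `check_predictions(training_set["predictions"].tolist(), n_classes=2)`.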

Now, we can prepare the `dataset_config.yaml` files for the training and validation sets.

In [None]:
class_names = ['negative', 'positive']
column_names = list(training_set.columns)
label_column_name = "labels"
predictions_column_name = "predictions"
text_column_name = "text"

In [None]:
import yaml 

# Note the camelCase for the dict's keys
training_dataset_config = {
    "classNames": class_names,
    "columnNames": column_names,
    "textColumnName": text_column_name,
    "label": "training",
    "labelColumnName": label_column_name,
    "predictionsColumnName": predictions_column_name,
}

with open("training_dataset_config.yaml", "w") as dataset_config_file:
    yaml.dump(training_dataset_config, dataset_config_file, default_flow_style=False)

In [None]:
import copy

validation_dataset_config = copy.deepcopy(training_dataset_config)

# In our case, the only field that changes is the `label`, from "training" -> "validation"
validation_dataset_config["label"] = "validation"

with open("validation_dataset_config.yaml", "w") as dataset_config_file:
    yaml.dump(validation_dataset_config, dataset_config_file, default_flow_style=False)
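
Because the config refers to DataFrame columns by name, a typo in a column name is easy to miss. Below is a minimal sketch of a local check, assuming the camelCase keys used in the configs above; `find_missing_columns` is a hypothetical helper, not an Openlayer function, and the example config contains a deliberate typo.

```python
def find_missing_columns(config, df_columns):
    """Return the config keys whose referenced column is absent from df_columns."""
    column_keys = ["textColumnName", "labelColumnName", "predictionsColumnName"]
    available = set(df_columns)
    return [
        key for key in column_keys
        if key in config and config[key] not in available
    ]


example_config = {
    "classNames": ["negative", "positive"],
    "columnNames": ["text", "labels", "predictions"],
    "textColumnName": "text",
    "label": "training",
    "labelColumnName": "label",  # deliberate typo: the column is "labels"
    "predictionsColumnName": "predictions",
}

missing = find_missing_columns(example_config, example_config["columnNames"])
print(missing)
```

In this notebook, the call would be `find_missing_columns(training_dataset_config, training_set.columns)`, and an empty list means every referenced column exists.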

In [None]:
# Training set
project.add_dataframe(
    dataset_df=training_set,
    dataset_config_file_path="training_dataset_config.yaml",
)

In [None]:
# Validation set
project.add_dataframe(
    dataset_df=validation_set,
    dataset_config_file_path="validation_dataset_config.yaml",
)

We can check that both datasets are now staged using the `project.status()` method. 

In [None]:
project.status()

### <a id="model">Uploading models</a>

When it comes to uploading models to the Openlayer platform, there are two options:

- The first one is to upload a **shell model**. Shell models are the most straightforward way to get started. They consist of metadata only, and all of the analysis is done via the model's predictions (which are [uploaded with the datasets](#dataset)).
- The second one is to upload a **full model**, with artifacts. When a full model is uploaded, it becomes available in the platform, which makes it possible to perform what-if analysis, use all the explainability techniques available, and run a series of robustness assessments with it.

#### <a id="shell">Shell models</a>

To upload a shell model, we only need to define its name, the architecture type, and add some metadata that will be rendered in the platform to help us identify it. This information should be saved to a `model_config.yaml` file.

Let's create a `model_config.yaml` file for our model:

In [None]:
import yaml

model_config = {
    "name": "Sentiment analysis NN",
    "architectureType": "tensorflow",
    "metadata": {  # Can add anything here, as long as it is a dict
        "model_type": "Neural network - feed forward",
        "epochs": 30,
    },
    "classNames": class_names,
}

with open("model_config.yaml", "w") as model_config_file:
    yaml.dump(model_config, model_config_file, default_flow_style=False)

In [None]:
project.add_model(
    model_config_file_path="model_config.yaml",
)

We can check that both datasets and model are staged using the `project.status()` method.

In [None]:
project.status()

Since we're interested in uploading a full model in this example, let's unstage the shell model:

In [None]:
project.restore("model")

#### <a id="full-model"> Full models </a>

To upload a full model to Openlayer, you will need to create a model package, which is nothing more than a folder with all the necessary information to run inference with the model. The package should include the following:
1. A `requirements.txt` file listing the dependencies for the model.
2. Serialized model files, such as model weights, encoders, etc., in a format specific to the framework used for training (e.g. `.pkl` for sklearn, `.pb` for TensorFlow, and so on.)
3. A `prediction_interface.py` file that acts as a wrapper for the model and implements the `predict_proba` function. 

In addition to the model package, a `model_config.yaml` file is needed. It gives the Openlayer platform information about the model, such as its name, the architecture type, and the class names.

Let's prepare the model package one piece at a time:

In [None]:
# Creating the model package folder (we'll call it `model_package`)
# The -p flag avoids an error if the folder already exists
!mkdir -p model_package

**1. Adding the `requirements.txt` to the model package**

In [None]:
!cp requirements.txt model_package

**2. Serializing the model and other objects needed**

In [None]:
# Saving the model
tf_model.save("model_package/my_model")

In [None]:
import pickle 

# Saving the word index
with open('model_package/word_index.pkl', 'wb') as handle:
    pickle.dump(word_index, handle, protocol=pickle.HIGHEST_PROTOCOL)

**3. Writing the `prediction_interface.py` file**

In [None]:
%%writefile model_package/prediction_interface.py

import pickle
from pathlib import Path

import pandas as pd
import tensorflow as tf

PACKAGE_PATH = Path(__file__).parent


class TFModel:
    def __init__(self):
        """This is where the serialized objects needed should
        be loaded as class attributes."""
        self.model = tf.keras.models.load_model(str(PACKAGE_PATH / "my_model"))

        with open(PACKAGE_PATH / "word_index.pkl", "rb") as word_index_file:
            self.word_index = pickle.load(word_index_file)

    def _encode_review(self, text: str):
        """Function that converts a human-readable sentence to the list of
        indices format"""
        words = text.split(' ')
        ids = [self.word_index["<START>"]]
        for w in words:
            v = self.word_index.get(w, self.word_index["<UNK>"])
            # Must match MAX_WORDS (10000) used at training time:
            # indices at or above it are mapped to <UNUSED>
            if v >= 10000:
                v = self.word_index["<UNUSED>"]
            ids.append(v)
        return ids

    def predict_proba(self, input_data_df: pd.DataFrame):
        """Makes predictions with the model. Returns the class probabilities."""
        text_column = input_data_df.columns[0]
        texts = input_data_df[text_column].values

        X = [self._encode_review(t) for t in texts]
        X = tf.keras.preprocessing.sequence.pad_sequences(
              X,
              dtype="int32",
              value=self.word_index["<PAD>"],
              padding='post',
              maxlen=256
            )
        y = self.model(X)

        return y.numpy()


def load_model():
    """Function that returns the wrapped model object."""
    return TFModel()

**Creating the `model_config.yaml`**

In [None]:
import yaml

model_config = {
    "name": "Sentiment analysis NN",
    "architectureType": "tensorflow",
    "metadata": {  # Can add anything here, as long as it is a dict
        "model_type": "Neural network - feed forward",
        "epochs": 30,
    },
    "classNames": class_names,
}

with open("model_config.yaml", "w") as model_config_file:
    yaml.dump(model_config, model_config_file, default_flow_style=False)
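
As a quick local complement to the platform's own validation, the sketch below checks that a package folder holds the entries created in the steps above. `missing_package_entries` is a hypothetical helper, not part of the Openlayer API, and the list of expected names is an assumption taken from this notebook's cells. The demonstration runs on a throwaway directory so it works anywhere.

```python
import tempfile
from pathlib import Path

# Entries this notebook places in the package (names taken from the cells above).
EXPECTED_ENTRIES = [
    "requirements.txt",
    "my_model",
    "word_index.pkl",
    "prediction_interface.py",
]


def missing_package_entries(package_dir, expected=EXPECTED_ENTRIES):
    """Return the expected files or folders not found in package_dir."""
    package_dir = Path(package_dir)
    return [name for name in expected if not (package_dir / name).exists()]


# Demonstration on a temporary, deliberately incomplete package.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "requirements.txt").write_text("tensorflow\n")
    (Path(tmp) / "prediction_interface.py").write_text("# placeholder\n")
    demo_missing = missing_package_entries(tmp)

print(demo_missing)
```

In this notebook, `missing_package_entries("model_package")` should return an empty list once all three steps have been run.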

Let's check that the model package contains everything needed:

In [None]:
from openlayer.validators import ModelValidator

model_validator = ModelValidator(
    model_package_dir="model_package",
    sample_data=validation_set[["text"]].iloc[:10],
    model_config_file_path="model_config.yaml",
)
model_validator.validate()

Now, we are ready to add the model:

In [None]:
project.add_model(
    model_package_dir="model_package",
    model_config_file_path="model_config.yaml",
    sample_data=validation_set[["text"]].iloc[:10]
)

We can check that both datasets and model are staged using the `project.status()` method.

In [None]:
project.status()

### <a id="commit"> Committing and pushing to the platform </a>

Finally, we can commit the first project version and push it to the platform.

In [None]:
project.commit("Initial commit!")

In [None]:
project.status()

In [None]:
project.push()