[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openlayer-ai/examples-gallery/blob/main/text-classification/documentation-tutorial/nlp-tutorial-part-3.ipynb)



# <a id="top">Openlayer text classification tutorial - Part 3</a>

Welcome! This is the third notebook from the text classification tutorial. Here, we solve the **data consistency** issues and commit the new datasets and model versions to the platform. You should use this notebook together with the **text classification tutorial from our documentation**.


## <a id="toc">Table of contents</a>

1. [**Fixing the data consistency issues and re-training the model**](#1)
    

2. [**Using Openlayer's Python API**](#2)

In [None]:
%%bash

if [ ! -e "requirements.txt" ]; then
    curl "https://raw.githubusercontent.com/openlayer-ai/examples-gallery/main/text-classification/documentation-tutorial/requirements.txt" --output "requirements.txt"
fi

In [None]:
!pip install -r requirements.txt

## <a id="1"> 1. Fixing the data consistency issues and re-training the model </a>

[Back to top](#top)

In this first part, we will download the data with the consistency issues fixed. This includes dropping the leaked rows from the training set.
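
For intuition, here is a minimal sketch of what dropping leaked rows can look like. It assumes that a row counts as leaked when its exact text also appears in the validation set. The toy data frames and their contents are made up for illustration, and this is not necessarily how the fixed files downloaded below were produced.

```python
import pandas as pd

# Toy frames standing in for the tutorial's training and validation sets (hypothetical data)
toy_train = pd.DataFrame({
    "text": ["great save by the goalie", "home run in the ninth", "power play goal"],
    "label": ["hockey", "baseball", "hockey"],
})
toy_val = pd.DataFrame({
    "text": ["home run in the ninth", "pitcher threw a shutout"],
    "label": ["baseball", "baseball"],
})

# Assumption: a training row is "leaked" if its exact text also appears in the validation set
leaked_mask = toy_train["text"].isin(toy_val["text"])
toy_train_clean = toy_train.loc[~leaked_mask].reset_index(drop=True)

print(f"Dropped {leaked_mask.sum()} leaked row(s); {len(toy_train_clean)} remain")
# prints: Dropped 1 leaked row(s); 2 remain
```

Exact matching is the simplest criterion. Near-duplicates (different whitespace or casing, for example) would need normalization before the comparison.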

In [None]:
import numpy as np
import pandas as pd

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

### <a id="download">Downloading the dataset </a>

In [None]:
%%bash

if [ ! -e "20_news_train_consistency_fix.csv" ]; then
    curl "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/text-classification/documentation/20_news_train_consistency_fix.csv" --output "20_news_train_consistency_fix.csv"
fi

if [ ! -e "20_news_val_consistency_fix.csv" ]; then
    curl "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/text-classification/documentation/20_news_val_consistency_fix.csv" --output "20_news_val_consistency_fix.csv"
fi

In [None]:
# Loading both datasets and having a look at the training set
training_set = pd.read_csv("./20_news_train_consistency_fix.csv")
validation_set = pd.read_csv("./20_news_val_consistency_fix.csv")

training_set.head()

### <a id="train">Training the model</a>

In [None]:
# The second step is a gradient boosting classifier, so it is named 'gbc'
sklearn_model = Pipeline([('count_vect', CountVectorizer(ngram_range=(1,2), stop_words='english')),
                          ('gbc', GradientBoostingClassifier(random_state=42))])
sklearn_model.fit(training_set['text'], training_set['label'])

In [None]:
print(classification_report(validation_set['label'], sklearn_model.predict(validation_set['text'])))

## <a id="2"> 2. Using Openlayer's Python API</a>

[Back to top](#top)

Now it's time to upload the datasets and model to the Openlayer platform.

In [None]:
!pip install openlayer

### <a id="client">Instantiating the client</a>

In [None]:
import openlayer

client = openlayer.OpenlayerClient("YOUR_API_KEY_HERE")

### <a id="project">Loading a project from the platform</a>

In [None]:
from openlayer.tasks import TaskType

project = client.create_or_load_project(
    name="Hockey x Baseball - 20 newsgroups",
    task_type=TaskType.TextClassification,
    description="Evaluation of ML approaches for text classification"
)

### <a id="dataset">Uploading datasets</a>

In terms of configuration, the datasets haven't changed from the previous version. For completeness, we will write new `dataset_config.yaml` files.

First, let's add the predictions column to both datasets:

In [None]:
# Adding the column with the predictions (since we'll also upload a model later)
training_set["predictions"] = sklearn_model.predict_proba(training_set["text"]).tolist()
validation_set["predictions"] = sklearn_model.predict_proba(validation_set["text"]).tolist()

In [None]:
string_to_int_map = {"baseball": 0, "hockey": 1}
training_set["label"] = training_set["label"].map(string_to_int_map)
validation_set["label"] = validation_set["label"].map(string_to_int_map)
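
The probability lists and the integer labels only line up if the columns returned by `predict_proba` follow the same order as the label mapping and the class names. scikit-learn keeps the learned classes in sorted order in the `classes_` attribute, and `predict_proba` columns follow that order. The sketch below checks the alignment on a tiny stand-in model. The toy sentences and the `LogisticRegression` classifier are assumptions made to keep the example fast, not the tutorial's model.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny stand-in model (hypothetical data) trained on string labels, as in this tutorial
toy_model = Pipeline([("count_vect", CountVectorizer()), ("clf", LogisticRegression())])
toy_model.fit(
    ["slap shot", "home run", "power play goal", "double play"],
    ["hockey", "baseball", "hockey", "baseball"],
)

# scikit-learn stores class labels in sorted order; predict_proba columns follow it
learned_classes = list(toy_model.classes_)

string_to_int_map = {"baseball": 0, "hockey": 1}
class_names = ["Baseball", "Hockey"]

# Column i of predict_proba must correspond to integer label i and class_names[i]
for column_index, learned_label in enumerate(learned_classes):
    assert string_to_int_map[learned_label] == column_index
    assert class_names[column_index].lower() == learned_label

# Each row of probabilities should sum to 1
probs = toy_model.predict_proba(["home run"])
assert np.allclose(probs.sum(axis=1), 1.0)
```

The same loop can be run against `sklearn_model.classes_` to confirm the alignment for the model trained above.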

In [None]:
# Some variables that will go into the `dataset_config.yaml` file
class_names = ["Baseball", "Hockey"]
column_names = list(training_set.columns)
text_column_name = "text"
label_column_name = "label"
predictions_column_name = "predictions"

In [None]:
import yaml 

# Note the camelCase for the dict's keys
training_dataset_config = {
    "classNames": class_names,
    "columnNames": column_names,
    "textColumnName": text_column_name,
    "label": "training",
    "labelColumnName": label_column_name,
    "predictionsColumnName": predictions_column_name,
}

with open("training_dataset_config.yaml", "w") as dataset_config_file:
    yaml.dump(training_dataset_config, dataset_config_file, default_flow_style=False)

In [None]:
import copy

validation_dataset_config = copy.deepcopy(training_dataset_config)

# In our case, the only field that changes is the `label`, from "training" -> "validation"
validation_dataset_config["label"] = "validation"

with open("validation_dataset_config.yaml", "w") as dataset_config_file:
    yaml.dump(validation_dataset_config, dataset_config_file, default_flow_style=False)
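
A quick way to confirm that a config file was written as intended is to read it back and compare it with the original dict. The sketch below does this with a toy config in a temporary folder. The file name `toy_dataset_config.yaml` and the column order are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import yaml

# Toy config with the same keys as the dataset configs above (values are illustrative)
toy_config = {
    "classNames": ["Baseball", "Hockey"],
    "columnNames": ["text", "label", "predictions"],
    "textColumnName": "text",
    "label": "training",
    "labelColumnName": "label",
    "predictionsColumnName": "predictions",
}

# Write to a throwaway location so nothing in the working directory is touched
config_path = Path(tempfile.mkdtemp()) / "toy_dataset_config.yaml"
with open(config_path, "w") as config_file:
    yaml.dump(toy_config, config_file, default_flow_style=False)

# Reading the file back should give the same dict
with open(config_path) as config_file:
    loaded_config = yaml.safe_load(config_file)

assert loaded_config == toy_config
print(config_path.read_text())
```

By default, `yaml.dump` sorts keys alphabetically, so the order in the file may differ from the order in the dict. The loaded content is the same.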

In [None]:
# Training set
project.add_dataframe(
    dataset_df=training_set,
    dataset_config_file_path="training_dataset_config.yaml",
)

In [None]:
# Validation set
project.add_dataframe(
    dataset_df=validation_set,
    dataset_config_file_path="validation_dataset_config.yaml",
)

We can check that both datasets are now staged using the `project.status()` method. 

In [None]:
project.status()

### <a id="model">Uploading models</a>

Once we're done with the consistency goals, we'll move on to performance goals, which have to do with the model itself. Therefore, we will now upload a **full model** instead of a shell model. This lets us explain the model's predictions on the platform using explainability techniques such as LIME.

#### <a id="full-model"> Full models </a>

To upload a full model to Openlayer, you will need to create a **model package**, which is nothing more than a folder with all the necessary information to run inference with the model. The package should include the following:
1. A `requirements.txt` file listing the dependencies for the model.
2. Serialized model files, such as model weights, encoders, etc., in a format specific to the framework used for training (e.g. `.pkl` for sklearn, `.pb` for TensorFlow, and so on.)
3. A `prediction_interface.py` file that acts as a wrapper for the model and implements the `predict_proba` function. 

In addition to the model package, a `model_config.yaml` file is needed. It gives the Openlayer platform information about the model, such as the framework used, feature names, and categorical feature names.
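
As a rough illustration of the package layout, the sketch below lists which of the three items are missing from a folder. The helper `missing_package_files` is hypothetical and not part of the `openlayer` library, and the file names follow the ones used in this tutorial.

```python
import tempfile
from pathlib import Path

# The three items listed above; file names follow this tutorial's package
REQUIRED_FILES = ["requirements.txt", "model.pkl", "prediction_interface.py"]


def missing_package_files(package_dir):
    """Returns the required file names not found in `package_dir`."""
    package_path = Path(package_dir)
    return [name for name in REQUIRED_FILES if not (package_path / name).is_file()]


# Demo on a throwaway folder that only has a requirements file so far
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "requirements.txt").write_text("scikit-learn\n")

missing = missing_package_files(demo_dir)
print(missing)
# prints: ['model.pkl', 'prediction_interface.py']
```

A check like this only looks at file names. It says nothing about whether the model loads or predicts correctly.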

Let's prepare the model package one piece at a time.

In [None]:
# Creating the model package folder (we'll call it `model_package`)
!mkdir -p model_package

**1. Adding the `requirements.txt` to the model package**

In [None]:
!cp requirements.txt model_package

**2. Serializing the model**

In [None]:
import pickle 

# Trained model
with open("model_package/model.pkl", "wb") as handle:
    pickle.dump(sklearn_model, handle, protocol=pickle.HIGHEST_PROTOCOL)

**3. Writing the `prediction_interface.py` file**

In [None]:
%%writefile model_package/prediction_interface.py

import pickle
from pathlib import Path

import pandas as pd

PACKAGE_PATH = Path(__file__).parent


class SklearnModel:
    def __init__(self):
        """This is where the serialized objects needed should
        be loaded as class attributes."""

        with open(PACKAGE_PATH / "model.pkl", "rb") as model_file:
            self.model = pickle.load(model_file)

    def predict_proba(self, input_data_df: pd.DataFrame):
        """Makes predictions with the model. Returns the class probabilities."""
        text_column = input_data_df.columns[0]
        return self.model.predict_proba(input_data_df[text_column])


def load_model():
    """Function that returns the wrapped model object."""
    return SklearnModel()
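
Before uploading, it can help to smoke-test a wrapper like this locally. The sketch below is one possible way to do so, under stated assumptions: it builds a tiny stand-in model and a temporary folder (`toy_model_package`, both hypothetical), writes an interface file with the same shape as the one above, imports it by path, and calls `predict_proba`. It does not claim to mirror how the Openlayer platform loads the package.

```python
import importlib.util
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the tutorial's `model_package` folder
package_dir = Path(tempfile.mkdtemp()) / "toy_model_package"
package_dir.mkdir()

# Tiny model (toy data) so the sketch runs on its own
toy_model = Pipeline([("count_vect", CountVectorizer()), ("clf", LogisticRegression())])
toy_model.fit(
    ["home run", "slap shot", "double play", "power play goal"],
    ["baseball", "hockey", "baseball", "hockey"],
)
with open(package_dir / "model.pkl", "wb") as handle:
    pickle.dump(toy_model, handle)

# Interface file with the same shape as the one written above
interface_source = '''
import pickle
from pathlib import Path

import pandas as pd

PACKAGE_PATH = Path(__file__).parent


class SklearnModel:
    def __init__(self):
        with open(PACKAGE_PATH / "model.pkl", "rb") as model_file:
            self.model = pickle.load(model_file)

    def predict_proba(self, input_data_df: pd.DataFrame):
        text_column = input_data_df.columns[0]
        return self.model.predict_proba(input_data_df[text_column])


def load_model():
    return SklearnModel()
'''
interface_path = package_dir / "prediction_interface.py"
interface_path.write_text(interface_source)

# Import the interface by file path and call it like any other module
spec = importlib.util.spec_from_file_location("prediction_interface", interface_path)
interface_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(interface_module)

wrapped_model = interface_module.load_model()
sample_df = pd.DataFrame({"text": ["ninth inning home run", "hat trick"]})
probs = wrapped_model.predict_proba(sample_df)
print(probs.shape)  # one row per input text, one column per class
```

Note that the wrapper reads the text from the first column of the data frame, so the column order of the input matters.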

**Creating the `model_config.yaml`**

In [None]:
import yaml

model_config = {
    "name": "News classifier",
    "architectureType": "sklearn",
    "metadata": {  # Can add anything here, as long as it is a dict
        "model_type": "Gradient Boosting Classifier",
        "regularization": "None",
        "vectorizer": "Count Vectorizer"
    },
    "classNames": class_names,
}

with open("model_config.yaml", "w") as model_config_file:
    yaml.dump(model_config, model_config_file, default_flow_style=False)

Let's check that the model package contains everything needed:

In [None]:
from openlayer.validators import ModelValidator

model_validator = ModelValidator(
    model_package_dir="model_package",
    model_config_file_path="model_config.yaml",
    sample_data=validation_set.iloc[:10, :],
)
model_validator.validate()

All validations are passing, so we are ready to add the full model!

In [None]:
project.add_model(
    model_package_dir="model_package",
    model_config_file_path="model_config.yaml",
    sample_data=validation_set.iloc[:10, :],
)

We can check that the datasets and the model are all staged using the `project.status()` method.

In [None]:
project.status()

### <a id="commit"> Committing and pushing to the platform </a>

Finally, we can commit the new project version to the platform. 

In [None]:
project.commit("Fix data consistency goals")

In [None]:
project.status()

In [None]:
project.push()