[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openlayer-ai/examples-gallery/blob/main/text-classification/documentation-tutorial/nlp-tutorial-part-2.ipynb)



# <a id="top">Openlayer text classification tutorial - Part 2</a>

Welcome! This is the second notebook of the text classification tutorial. Here, we fix the **data integrity** issues and commit the new dataset and model versions to the platform. You should use this notebook together with the **text classification tutorial from our documentation**.


## <a id="toc">Table of contents</a>

1. [**Fixing the data integrity issues and re-training the model**](#1)
2. [**Using Openlayer's Python API**](#2)

In [None]:
%%bash

if [ ! -e "requirements.txt" ]; then
    curl "https://raw.githubusercontent.com/openlayer-ai/examples-gallery/main/text-classification/documentation-tutorial/requirements.txt" --output "requirements.txt"
fi

In [None]:
!pip install -r requirements.txt

## <a id="1"> 1. Fixing the data integrity issues and re-training the model </a>

[Back to top](#top)

In this first part, we download versions of the training and validation sets in which the integrity issues have already been fixed. The fixes include dropping duplicate rows, resolving conflicting labels, dropping correlated features, and the other changes pointed out in the tutorial.
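
The downloaded files already contain the fixes, so nothing below is required to follow the tutorial. For readers curious about what such a cleanup might look like, here is a minimal sketch on a toy DataFrame. The column names `text` and `label` match the tutorial's datasets, but the toy rows and the helper `fix_integrity` are hypothetical, and this is not necessarily the procedure that produced the CSV files. Conflicting labels are simply dropped here, which is only one of several possible ways to resolve them.

```python
import pandas as pd


def fix_integrity(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows and texts that carry conflicting labels."""
    # Remove rows that repeat the same text with the same label
    deduped = df.drop_duplicates(subset=["text", "label"])
    # Count how many distinct labels each text has
    labels_per_text = deduped.groupby("text")["label"].transform("nunique")
    # Keep only the texts with a single, unambiguous label
    return deduped[labels_per_text == 1].reset_index(drop=True)


# Hypothetical toy data: one duplicated row and one text with two labels
toy = pd.DataFrame(
    {
        "text": [
            "great save by the goalie",
            "great save by the goalie",
            "home run in the ninth",
            "what a game",
            "what a game",
        ],
        "label": ["hockey", "hockey", "baseball", "hockey", "baseball"],
    }
)

cleaned = fix_integrity(toy)
print(cleaned)
```

Dropping ambiguous rows is the most conservative choice; relabeling them by hand keeps more data but takes more effort.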

In [None]:
import numpy as np
import pandas as pd

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

### <a id="download">Downloading the dataset </a>

In [None]:
%%bash

if [ ! -e "20_news_train_integrity_fix.csv" ]; then
    curl "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/text-classification/documentation/20_news_train_integrity_fix.csv" --output "20_news_train_integrity_fix.csv"
fi

if [ ! -e "20_news_val_integrity_fix.csv" ]; then
    curl "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/text-classification/documentation/20_news_val_integrity_fix.csv" --output "20_news_val_integrity_fix.csv"
fi

In [None]:
# Load the training and validation sets, then preview the training set
training_set = pd.read_csv("./20_news_train_integrity_fix.csv")
validation_set = pd.read_csv("./20_news_val_integrity_fix.csv")

training_set.head()

### <a id="train">Training the model</a>

In [None]:
# Bag-of-words features (unigrams and bigrams) followed by a gradient boosting classifier
sklearn_model = Pipeline([('count_vect', CountVectorizer(ngram_range=(1,2), stop_words='english')), 
                          ('gbc', GradientBoostingClassifier(random_state=42))])
sklearn_model.fit(training_set['text'], training_set['label'])

In [None]:
print(classification_report(validation_set['label'], sklearn_model.predict(validation_set['text'])))

## <a id="2"> 2. Using Openlayer's Python API</a>

[Back to top](#top)

Now it's time to upload the datasets and model to the Openlayer platform.

In [None]:
!pip install openlayer

### <a id="client">Instantiating the client</a>

In [None]:
import openlayer

client = openlayer.OpenlayerClient("YOUR_API_KEY_HERE")
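
If you would rather not paste the API key into the notebook, one option is to read it from an environment variable. The sketch below assumes a variable named `OPENLAYER_API_KEY`; both that name and the helper `read_api_key` are hypothetical choices for this example and are not something the Openlayer client looks for on its own.

```python
import os


def read_api_key(env_var="OPENLAYER_API_KEY", environ=None):
    """Return the API key stored in `env_var`, or raise if it is unset or empty."""
    # Allow passing a plain dict instead of os.environ, which makes testing easier
    environ = os.environ if environ is None else environ
    api_key = environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return api_key


# Demonstration with an explicit mapping rather than the real environment
print(read_api_key(environ={"OPENLAYER_API_KEY": "demo-key"}))
```

With the variable set, the client could then be created with `openlayer.OpenlayerClient(read_api_key())`.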

### <a id="project">Loading a project from the platform</a>

In [None]:
from openlayer.tasks import TaskType

project = client.create_or_load_project(
    name="Hockey x Baseball - 20 newsgroups",
    task_type=TaskType.TextClassification,
    description="Evaluation of ML approaches for text classification"
)

### <a id="dataset">Uploading datasets</a>

In terms of configuration, the datasets haven't changed from the previous version. For completeness, we will re-create the configs.

First, let's add the extra column with the model's prediction scores to both datasets:

In [None]:
# Adding the column with the predictions (since we'll also upload a model later)
training_set["predictions"] = sklearn_model.predict_proba(training_set["text"]).tolist()
validation_set["predictions"] = sklearn_model.predict_proba(validation_set["text"]).tolist()

In [None]:
# Convert the string labels to integers that index into the list of class names
string_to_int_map = {"baseball": 0, "hockey": 1}
training_set["label"] = training_set["label"].map(string_to_int_map)
validation_set["label"] = validation_set["label"].map(string_to_int_map)
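
The prediction scores are plain lists, so their meaning depends on position: the integer assigned to each label should match the column that holds that label's score. In scikit-learn, the columns of `predict_proba` follow the order of the fitted model's `classes_` attribute. The following sketch checks that the two orders agree; the helper `check_label_alignment` is hypothetical, and the hard-coded `model_classes` list stands in for `list(sklearn_model.classes_)`.

```python
def check_label_alignment(model_classes, string_to_int_map, class_names):
    """Return a list of problems; an empty list means the orders agree."""
    problems = []
    if len(model_classes) != len(class_names):
        problems.append(
            f"model has {len(model_classes)} classes, config has {len(class_names)}"
        )
    for column, raw_label in enumerate(model_classes):
        mapped = string_to_int_map.get(raw_label)
        if mapped is None:
            problems.append(f"{raw_label!r} is missing from the label map")
        elif mapped != column:
            problems.append(
                f"{raw_label!r} maps to {mapped} but its scores are in column {column}"
            )
    return problems


# Stand-in for list(sklearn_model.classes_); assumed values for illustration
model_classes = ["baseball", "hockey"]
string_to_int_map = {"baseball": 0, "hockey": 1}
class_names = ["Baseball", "Hockey"]

print(check_label_alignment(model_classes, string_to_int_map, class_names))
```

An empty list means each integer label points at the column that holds its score.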

In [None]:
# Some variables that will go into the `dataset_config`
class_names = ["Baseball", "Hockey"]
text_column_name = "text"
label_column_name = "label"
prediction_scores_column_name = "predictions"

In [None]:
# Note the camelCase for the dict's keys
training_dataset_config = {
    "classNames": class_names,
    "textColumnName": text_column_name,
    "label": "training",
    "labelColumnName": label_column_name,
    "predictionScoresColumnName": prediction_scores_column_name,
}

In [None]:
import copy

validation_dataset_config = copy.deepcopy(training_dataset_config)

# In our case, the only field that changes is the `label`, from "training" -> "validation"
validation_dataset_config["label"] = "validation"
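
Before staging the datasets, it can be useful to confirm that the rows agree with the config. The sketch below is a hypothetical helper, `find_config_problems`, that works on a list of row dictionaries (for a DataFrame, `df.to_dict("records")` produces one). The checks reflect how the config is used in this notebook and are an assumption of this example, not Openlayer's own validation rules.

```python
import math


def find_config_problems(rows, config):
    """Return human-readable problems found when comparing rows to the config."""
    class_count = len(config["classNames"])
    required = (
        config["textColumnName"],
        config["labelColumnName"],
        config["predictionScoresColumnName"],
    )
    problems = []
    for index, row in enumerate(rows):
        missing = [key for key in required if key not in row]
        if missing:
            problems.append(f"row {index}: missing columns {missing}")
            continue
        label = row[config["labelColumnName"]]
        scores = row[config["predictionScoresColumnName"]]
        # Labels are expected to be integer indices into classNames
        if label not in range(class_count):
            problems.append(f"row {index}: label {label!r} is not a class index")
        # One score per class, and the scores should form a probability distribution
        if len(scores) != class_count:
            problems.append(f"row {index}: expected {class_count} scores, got {len(scores)}")
        elif not math.isclose(sum(scores), 1.0, abs_tol=1e-6):
            problems.append(f"row {index}: scores sum to {sum(scores):.4f}, not 1")
    return problems


# Hypothetical config and rows, mirroring the keys used in this notebook
demo_config = {
    "classNames": ["Baseball", "Hockey"],
    "textColumnName": "text",
    "labelColumnName": "label",
    "predictionScoresColumnName": "predictions",
}
demo_rows = [
    {"text": "nice slap shot", "label": 1, "predictions": [0.2, 0.8]},
    {"text": "bases loaded", "label": 2, "predictions": [0.5, 0.6]},
]

for problem in find_config_problems(demo_rows, demo_config):
    print(problem)
```

In this notebook, the equivalent call would be `find_config_problems(training_set.to_dict("records"), training_dataset_config)`.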

In [None]:
# Training set
project.add_dataframe(
    dataset_df=training_set,
    dataset_config=training_dataset_config
)

In [None]:
# Validation set
project.add_dataframe(
    dataset_df=validation_set,
    dataset_config=validation_dataset_config
)

We can check that both datasets are now staged using the `project.status()` method. 

In [None]:
project.status()

### <a id="model">Uploading models</a>

We will also upload a shell model here, since we're still focusing on the data on the platform.

In [None]:
model_config = {
    "metadata": {  # Can add anything here, as long as it is a dict
        "model_type": "Gradient Boosting Classifier",
        "regularization": "None",
        "vectorizer": "Count Vectorizer"
    },
    "classNames": class_names,
}

In [None]:
project.add_model(
    model_config=model_config,
)

We can check that both datasets and model are staged using the `project.status()` method.

In [None]:
project.status()

### <a id="commit"> Committing and pushing to the platform </a>

Finally, we can commit the new project version to the platform. 

In [None]:
project.commit("Fix data integrity issues")

In [None]:
project.status()

In [None]:
project.push()