[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openlayer-ai/examples-gallery/blob/main/text-classification/pilots/pilots-urgent-event.ipynb)


# <a id="top">Urgent event classification using sklearn</a>

This notebook illustrates how sklearn models can be uploaded to the Openlayer platform.


## <a id="toc">Table of contents</a>

1. [**Getting the data and training the model**](#1)
    - [Downloading the dataset](#download)
    - [Training the model](#train)
    

2. [**Using Openlayer's Python API**](#2)
    - [Instantiating the client](#client)
    - [Creating a project](#project)
    - [Uploading datasets](#dataset)
    - [Uploading models](#model)
        - [Shell models](#shell)
    - [Committing and pushing to the platform](#commit)

In [None]:
%%bash

if [ ! -e "requirements.txt" ]; then
    curl "https://raw.githubusercontent.com/openlayer-ai/examples-gallery/main/text-classification/documentation-tutorial/requirements.txt" --output "requirements.txt"
fi

In [None]:
!pip install -r requirements.txt

## <a id="1"> 1. Getting the data and training the model </a>

[Back to top](#top)

In this first part, we will get the dataset, pre-process it, split it into training and validation sets, and train a model. Feel free to skim this section if you are already comfortable with these steps for an sklearn model.

In [1]:
import numpy as np
import pandas as pd

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

### <a id="download">Downloading the dataset </a>

The dataset is stored in an S3 bucket, and the cell below downloads it. If the download fails, paste the URL into your browser and download the CSV file manually.

In [2]:
%%bash

if [ ! -e "urgent_train.csv" ]; then
    curl "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/text-classification/pilots/urgent_train.csv" --output "urgent_train.csv"
fi

if [ ! -e "urgent_val.csv" ]; then
    curl "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/text-classification/pilots/urgent_val.csv" --output "urgent_val.csv"
fi
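
If `curl` is not available in your environment, the same "download only if missing" logic can be sketched in plain Python. This is a minimal sketch using only the standard library. `download_if_missing` and `download_datasets` are hypothetical helper names introduced here for illustration, not part of the original notebook or of Openlayer.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL taken from the curl commands above
BASE_URL = (
    "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/"
    "examples-datasets/text-classification/pilots"
)


def download_if_missing(url: str, destination: str) -> bool:
    """Download `url` to `destination` unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    """
    path = Path(destination)
    if path.exists():
        return False
    urlretrieve(url, str(path))
    return True


def download_datasets() -> None:
    """Fetch both CSV files used in this notebook (hypothetical helper)."""
    for name in ("urgent_train.csv", "urgent_val.csv"):
        download_if_missing(f"{BASE_URL}/{name}", name)
```

Calling `download_datasets()` should leave `urgent_train.csv` and `urgent_val.csv` in the working directory, mirroring the bash cell.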

In [3]:
validation_set = pd.read_csv("./urgent_val.csv")

validation_set.head()

Unnamed: 0,text,label
0,i have to do what i have to do i feel like a l...,0
1,i feel like i am being punished for the choice...,0
2,i highly recommend it if you want to feel tota...,0
3,if i download images from internet for two hou...,0
4,i have to mention that i feel slightly unhappy...,0


In [4]:
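# Emulating an LLM dataset
validation_set["input_var"] = 7  # constant dummy input variable
validation_set["model_output"] = validation_set["text"]  # pretend the model echoes the input text
validation_set["label"] = validation_set["text"]  # replace the integer class label with text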

In [5]:
validation_set.head()

Unnamed: 0,text,label,input_var,model_output
0,i have to do what i have to do i feel like a l...,i have to do what i have to do i feel like a l...,7,i have to do what i have to do i feel like a l...
1,i feel like i am being punished for the choice...,i feel like i am being punished for the choice...,7,i feel like i am being punished for the choice...
2,i highly recommend it if you want to feel tota...,i highly recommend it if you want to feel tota...,7,i highly recommend it if you want to feel tota...
3,if i download images from internet for two hou...,if i download images from internet for two hou...,7,if i download images from internet for two hou...
4,i have to mention that i feel slightly unhappy...,i have to mention that i feel slightly unhappy...,7,i have to mention that i feel slightly unhappy...


## <a id="2"> 2. Using Openlayer's Python API</a>

[Back to top](#top)

Now it's time to upload the datasets and model to the Openlayer platform.

### <a id="client">Instantiating the client</a>

In [6]:
import openlayer

openlayer.api.STORAGE = openlayer.api.StorageType.ONPREM
openlayer.api.OPENLAYER_ENDPOINT = "http://localhost:8080/v1"

client = openlayer.OpenlayerClient("YOUR_API_KEY_HERE")
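
The client above receives the API key as a literal string. To keep keys out of a shared notebook, one option is to read the key from an environment variable instead. The sketch below assumes a variable named `OPENLAYER_API_KEY`. That name and the `get_api_key` helper are assumptions made for illustration, not something Openlayer is known to require.

```python
import os


def get_api_key(env_var: str = "OPENLAYER_API_KEY") -> str:
    """Return the API key stored in the environment variable `env_var`.

    The variable name is an assumption; use whatever name suits your setup.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before creating the client."
        )
    return api_key


# Usage (not executed here, since it needs the `openlayer` package and a server):
# client = openlayer.OpenlayerClient(get_api_key())
```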

### <a id="project">Creating a project</a>

In [7]:
from openlayer.tasks import TaskType

project = client.create_or_load_project(
    name="Urgent event classification - LLM",
    task_type=TaskType.TextClassification,
    description="Evaluation of ML approaches to classify messages"
)

Found your project. Navigate to http://localhost:8000/w-1db4b41/5873426e-6de8-40a4-9530-2071c6d8bc3e to see it.


### <a id="dataset">Uploading datasets</a>

In [8]:
# Some variables that will go into the `validation_dataset_config.yaml` file
column_names = list(validation_set.columns)
input_variable_names = ["text", "input_var"]
output_column_name = "model_output"
ground_truth_column_name = "label"  # not written to the config below

In [9]:
import yaml

validation_dataset_config = {
    "columnNames": column_names,
    "inputVariableNames": input_variable_names,
    "label": "validation",
    "outputColumnName": output_column_name,
}

with open("validation_dataset_config.yaml", "w") as dataset_config_file:
    yaml.dump(validation_dataset_config, dataset_config_file, default_flow_style=False)
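
Before staging the dataset, it can help to confirm that every column the config refers to actually exists in `columnNames`. The sketch below is a local consistency check only, not Openlayer's own validation. `find_config_problems` is a hypothetical helper, and the example config is written out by hand from the values used in this notebook.

```python
def find_config_problems(config: dict) -> list:
    """Return human-readable problems found in a dataset config dict."""
    problems = []
    columns = set(config.get("columnNames", []))

    # Every input variable must be one of the dataset columns
    for name in config.get("inputVariableNames", []):
        if name not in columns:
            problems.append(f"input variable '{name}' is not in columnNames")

    # The output column, if given, must also be one of the dataset columns
    output_column = config.get("outputColumnName")
    if output_column is not None and output_column not in columns:
        problems.append(f"output column '{output_column}' is not in columnNames")

    return problems


example_config = {
    "columnNames": ["Unnamed: 0", "text", "label", "input_var", "model_output"],
    "inputVariableNames": ["text", "input_var"],
    "label": "validation",
    "outputColumnName": "model_output",
}
print(find_config_problems(example_config))  # prints []
```

An empty list means the column references are consistent. It says nothing about whether the platform will accept the config.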

In [10]:
project.add_dataframe(
    dataset_df=validation_set,
    dataset_config_file_path="validation_dataset_config.yaml",
)

[2023-07-21 11:01:35,299] - INFO - ----------------------------------------------------------------------------
[2023-07-21 11:01:35,300] - INFO -                           Dataset validations                          
[2023-07-21 11:01:35,300] - INFO - ----------------------------------------------------------------------------

[2023-07-21 11:01:35,314] - INFO - ✓ All validation dataset validations passed!



<openlayer.validators.dataset_validators.LLMDatasetValidator object at 0x7fe6c850fe20>
Found an existing `validation` resource staged.
Do you want to overwrite it? [y/n] 
Keeping the existing `validation` resource staged.


In [11]:
project.status()

The following resources are staged, waiting to be committed:
	 - model
	 - validation
Use the `commit` method to add a commit message to your changes.


### <a id="model">Uploading models</a>

#### <a id="shell">Shell models</a>

To upload a shell model, we only need to define its name, the architecture type, and add some metadata that will be rendered in the platform to help us identify it. This information should be saved to a YAML config file.

Let's create that file, here named `llm_config.yaml`, for our model:

In [14]:
import yaml

llm_config = {
    "modelType": "shell",
    "name": "Urgent event classifier",
    "architectureType": "llm",
    "metadata": {  # Can add anything here, as long as it is a dict
        "blabla": "123",
    },
}

with open("llm_config.yaml", "w") as model_config_file:
    yaml.dump(llm_config, model_config_file, default_flow_style=False)

In [15]:
project.add_model(
    model_config_file_path="llm_config.yaml",
)

[2023-07-21 11:01:43,699] - INFO - ----------------------------------------------------------------------------
[2023-07-21 11:01:43,702] - INFO -                           Model validations                          
[2023-07-21 11:01:43,708] - INFO - ----------------------------------------------------------------------------

[2023-07-21 11:01:43,718] - INFO - ✓ All model validations passed!



<openlayer.validators.model_validators.LLMValidator object at 0x7fe70b2246d0>
{'name': 'Urgent event classifier', 'metadata': {'blabla': '123'}, 'architectureType': 'llm'}
None
Found an existing `model` resource staged.
Do you want to overwrite it? [y/n] y
Overwriting previously staged `model` resource...
Staged the `model` resource!


We can check that the validation dataset and the model are staged using the `project.status()` method.

In [16]:
project.status()

The following resources are staged, waiting to be committed:
	 - model
	 - validation
Use the `commit` method to add a commit message to your changes.


### <a id="commit"> Committing and pushing to the platform </a>

Finally, we can commit the first project version to the platform. 

In [17]:
project.commit("Initial commit!")

[2023-07-21 11:01:46,298] - INFO - ----------------------------------------------------------------------------
[2023-07-21 11:01:46,310] - INFO -                           Commit message validations                          
[2023-07-21 11:01:46,311] - INFO - ----------------------------------------------------------------------------

[2023-07-21 11:01:46,312] - INFO - ✓ All commit message validations passed!



Committed!


In [18]:
project.status()

The following resources are committed, waiting to be pushed:
	 - model
	 - validation
Commit message from Fri Jul 21 11:01:46 2023:
	 Initial commit!
Use the `push` method to push your changes to the platform.


In [19]:
project.push()

[2023-07-21 11:01:47,042] - INFO - ----------------------------------------------------------------------------
[2023-07-21 11:01:47,043] - INFO -                           Commit bundle validations                          
[2023-07-21 11:01:47,043] - INFO - ----------------------------------------------------------------------------



entrou
model_output


ValueError: Invalid model type: None. The model type must be one of 'shell', 'full' or 'baseline'.
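
The push fails because the model type resolves to `None`. The config written in the shell model section sets `"modelType": "shell"`, but the dictionary echoed in the `add_model` output does not show that key, which may be where the value is lost. The sketch below reproduces the check locally so the two dictionaries can be compared. `check_model_type` is a hypothetical helper, not part of the `openlayer` package, and the allowed values are copied from the error message.

```python
ALLOWED_MODEL_TYPES = ("shell", "full", "baseline")  # taken from the error message above


def check_model_type(model_config: dict) -> str:
    """Return the config's `modelType`, raising ValueError if missing or unknown."""
    model_type = model_config.get("modelType")
    if model_type not in ALLOWED_MODEL_TYPES:
        raise ValueError(
            f"Invalid model type: {model_type}. "
            f"Expected one of {ALLOWED_MODEL_TYPES}."
        )
    return model_type


# The config as written to `llm_config.yaml` earlier in this notebook
written_config = {
    "modelType": "shell",
    "name": "Urgent event classifier",
    "architectureType": "llm",
    "metadata": {"blabla": "123"},
}
print(check_model_type(written_config))  # prints shell

# The config as echoed in the `add_model` output, which has no `modelType`
echoed_config = {
    "name": "Urgent event classifier",
    "metadata": {"blabla": "123"},
    "architectureType": "llm",
}
try:
    check_model_type(echoed_config)
except ValueError as error:
    print(error)
```

If the second check fails while the first passes, the staged copy of the config is the place to look, rather than the file written by the notebook.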