[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/unboxai/examples-gallery/blob/main/text-classification/documentation-tutorial/nlp-tutorial-part-1.ipynb)



# Welcome to the Openlayer NLP tutorial - Part 1

You should use this notebook together with the [**NLP tutorial**](https://docs.openlayer.com/docs/uploading-your-first-model-and-dataset-1) from our documentation.

In [13]:
%%bash

if [ ! -e "requirements.txt" ]; then
    curl "https://raw.githubusercontent.com/unboxai/examples-gallery/main/text-classification/documentation-tutorial/requirements.txt" --output "requirements.txt"
fi

In [None]:
!pip install -r requirements.txt

## 1. Loading the dataset

First, let's import the libraries we need. We will then load the training and validation datasets.

In [15]:
import numpy as np
import pandas as pd

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

The dataset is stored in the S3 bucket below. If you get an error reading the CSV files directly from it, copy and paste each URL into your browser, download the files, and load them from disk instead.

In [16]:
TRAINING_SET_URL = "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/text-classification/Urgent+events/urgent_events_train.csv"
VALIDATION_SET_URL = "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/text-classification/Urgent+events/urgent_events_val.csv"
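If reading from the bucket fails, one option is to fall back to a copy downloaded by hand. The helper below is a minimal sketch of that idea; `read_csv_with_fallback` and its `local_path` argument are hypothetical names introduced here, not part of the tutorial or of pandas.

```python
import pandas as pd


def read_csv_with_fallback(url, local_path, **read_csv_kwargs):
    """Try to read a CSV from `url`; if that fails, read `local_path` instead."""
    try:
        return pd.read_csv(url, **read_csv_kwargs)
    except OSError as error:  # covers network errors and missing files
        print(f"Could not read {url!r} ({error}); falling back to {local_path!r}")
        return pd.read_csv(local_path, **read_csv_kwargs)
```

For example, assuming the browser saved the file as `urgent_events_train.csv` in the working directory, `read_csv_with_fallback(TRAINING_SET_URL, "urgent_events_train.csv", index_col=0)` would return the same DataFrame as the direct read.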

In [17]:
# loading the training and validation sets, then previewing the training set
training_set = pd.read_csv(TRAINING_SET_URL, index_col=0)
val_set = pd.read_csv(VALIDATION_SET_URL, index_col=0)

training_set.head()

|   | text | label |
|---|------|-------|
| 0 | how do i install a second hard drive onto my w... | 0 |
| 1 | how to add a movie to a powerpoint presentation | 0 |
| 2 | i feel so bad that im posting this blog so late | 0 |
| 3 | he sleeps under the rain,send a tent to him he... | 1 |
| 4 | im now and still addicted to the way living a ... | 0 |


The label we want to predict is in the `label` column: not urgent messages have a value of 0, while urgent messages have a value of 1.
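Before training, it can help to look at how the two label values are distributed. The sketch below uses only the five labels visible in the preview above, so the numbers are illustrative and say nothing about the full dataset.

```python
import pandas as pd

# Labels of the five rows shown in the preview above (text omitted)
preview_labels = pd.Series([0, 0, 0, 1, 0], name="label")

# Absolute counts and relative shares per label value
label_counts = preview_labels.value_counts()
label_shares = preview_labels.value_counts(normalize=True)

print(label_counts.to_dict())
print(label_shares.to_dict())
```

On the real data, the same calls on `training_set["label"]` show how imbalanced the classes are.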

## 2. Training and evaluating our model

We are going to train a gradient boosting classifier on the training data and then check its performance on the validation set.

In [18]:
# the final step is a gradient boosting classifier, so it is named 'gbc'
sklearn_model = Pipeline([
    ('count_vect', CountVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('gbc', GradientBoostingClassifier(random_state=42)),
])
sklearn_model.fit(training_set['text'], training_set['label'])

Pipeline(steps=[('count_vect',
                 CountVectorizer(ngram_range=(1, 2), stop_words='english')),
                ('gbc', GradientBoostingClassifier(random_state=42))])

In [19]:
print(classification_report(val_set['label'], sklearn_model.predict(val_set['text'])))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98      1818
           1       0.94      0.72      0.81       182

    accuracy                           0.97      2000
   macro avg       0.95      0.86      0.90      2000
weighted avg       0.97      0.97      0.97      2000
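To see how the summary rows relate to the per-class rows, the sketch below recomputes a few of them from the rounded values in the report. Because the inputs are already rounded to two decimals, the results can differ from the report in the last digit.

```python
# Rounded per-class values copied from the classification report above
report = {
    0: {"precision": 0.97, "recall": 1.00, "support": 1818},
    1: {"precision": 0.94, "recall": 0.72, "support": 182},
}


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_by_class = {c: f1_from(v["precision"], v["recall"]) for c, v in report.items()}
total_support = sum(v["support"] for v in report.values())

# Macro average: plain mean over classes, ignoring class sizes
macro_recall = sum(v["recall"] for v in report.values()) / len(report)
# Weighted average: mean over classes, weighted by support
weighted_recall = sum(v["recall"] * v["support"] for v in report.values()) / total_support

for c, f1 in f1_by_class.items():
    print(f"class {c}: f1 = {f1:.3f}")
print(f"macro recall = {macro_recall:.3f}, weighted recall = {weighted_recall:.3f}")
```

The gap between the macro and weighted recall reflects the class imbalance: class 1 has only 182 of the 2000 validation rows, so its lower recall barely moves the weighted average.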



## 3. Openlayer part!

Now it's up to you! 

Head back to the tutorial for an explanation of the next few cells.

In [None]:
# installing the Openlayer Python API
!pip install openlayer

In [20]:
# instantiating the client
import openlayer

client = openlayer.OpenlayerClient('YOUR_API_KEY_HERE')
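Rather than pasting the key into the notebook, one option is to read it from an environment variable. This is a sketch only: the variable name `OPENLAYER_API_KEY` and the helper `get_api_key` are assumptions made for illustration, not something the Openlayer client is known to look for.

```python
import os


def get_api_key(env_var="OPENLAYER_API_KEY"):
    """Read the API key from an environment variable (the name is an assumption)."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return api_key
```

The client could then be created with `openlayer.OpenlayerClient(get_api_key())`, which keeps the key out of the saved notebook.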

In [None]:
# creating the project
from openlayer.tasks import TaskType

project = client.create_or_load_project(name="Urgent event classification",
                                        task_type=TaskType.TextClassification,
                                        description="Evaluation of ML approaches to classify messages")

In [None]:
# uploading the dataset to the project
dataset = project.add_dataframe(
    df=val_set,
    class_names=["Not urgent", "Urgent"],
    label_column_name="label",
    text_column_name="text",
    commit_message="First commit!"  
)
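A few local checks can catch mistakes before an upload. The sketch below assumes that each integer label is meant to index into `class_names` (0 for "Not urgent", 1 for "Urgent"), which matches the data in this notebook. `check_dataset` is a hypothetical helper, not part of the Openlayer API.

```python
import pandas as pd


def check_dataset(df, class_names, label_column_name, text_column_name):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    for column in (label_column_name, text_column_name):
        if column not in df.columns:
            problems.append(f"missing column: {column}")
    if problems:
        return problems
    if df[text_column_name].isna().any():
        problems.append("text column contains missing values")
    # Assumption: labels are integers that index into class_names
    valid_labels = set(range(len(class_names)))
    unexpected = set(df[label_column_name].dropna().tolist()) - valid_labels
    if unexpected:
        problems.append(f"labels outside 0..{len(class_names) - 1}: {sorted(unexpected)}")
    return problems


# Made-up rows, for illustration only
toy_df = pd.DataFrame({"text": ["send help", "nice blog"], "label": [1, 0]})
print(check_dataset(toy_df, ["Not urgent", "Urgent"], "label", "text"))
```

Calling it as `check_dataset(val_set, ["Not urgent", "Urgent"], "label", "text")` would run the same checks on the validation set used above.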

In [23]:
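# defining the model's predict probability function
def predict_proba(model, text_list):
    """Return class probabilities, one row per text in `text_list`."""
    # The sklearn pipeline vectorizes the raw text itself and returns an array
    # of shape (n_samples, n_classes), with columns ordered as in model.classes_
    return model.predict_proba(text_list)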
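It is worth confirming what the function returns before uploading it. The sketch below trains a small stand-in pipeline on a made-up corpus (not the tutorial's data) and checks that the output has one row per text and one column per class, with each row summing to 1.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, for illustration only (not taken from the dataset)
toy_texts = [
    "people trapped need rescue now",
    "flood water rising send help",
    "injured people need medical help",
    "how to install a printer driver",
    "favorite movie of the summer",
    "posting a new blog entry today",
]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_model = Pipeline([
    ("count_vect", CountVectorizer()),
    ("gbc", GradientBoostingClassifier(n_estimators=10, random_state=42)),
])
toy_model.fit(toy_texts, toy_labels)


def predict_proba(model, text_list):
    """Same wrapper as in the notebook: delegate to the pipeline."""
    return model.predict_proba(text_list)


probs = predict_proba(toy_model, ["send help now", "new movie blog"])

# One row per input text, one column per class, rows sum to 1
assert probs.shape == (2, 2)
assert np.allclose(probs.sum(axis=1), 1.0)
print(toy_model.classes_)
```

The column order follows `classes_`, so with labels 0 and 1 the first column lines up with the first entry of `class_names`.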

In [None]:
# uploading the model to the project
from openlayer.models import ModelType

model = project.add_model(
    function=predict_proba, 
    model=sklearn_model,
    model_type=ModelType.sklearn,
    class_names=["Not urgent", "Urgent"],
    name='Gradient boosting classifier',
    commit_message='First commit!',
    requirements_txt_file='requirements.txt'
)