[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/unboxai/examples-gallery/blob/main/text-classification/documentation-tutorial/nlp-tutorial-part-2.ipynb)



# Welcome to the Openlayer NLP tutorial - Part 2

Use this notebook alongside the final part of the [**NLP tutorial**](https://docs.openlayer.com/docs/uploading-your-first-model-and-dataset-1) from our documentation. Here, we address the location issue identified in the first version of our model.

In [13]:
%%bash

if [ ! -e "requirements.txt" ]; then
    curl "https://raw.githubusercontent.com/unboxai/examples-gallery/main/text-classification/documentation-tutorial/requirements.txt" --output "requirements.txt"
fi

In [None]:
!pip install -r requirements.txt

## 1. Loading the training set

First, let's import the libraries we need and load the new training set as well as the original validation set.

In [1]:
import numpy as np
import pandas as pd

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

We have stored the datasets in the following S3 bucket. If you get an error reading a CSV file directly from it, copy and paste its URL into your browser, download the file, and read it from your local disk instead.

In [2]:
NEW_TRAINING_SET_URL = "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/text-classification/Urgent+events/augmented_urgent_events_train.csv"
VALIDATION_SET_URL = "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/text-classification/Urgent+events/urgent_events_val.csv"
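If reading directly from S3 fails, a downloaded copy can serve as a fallback. The sketch below is one way to do that; `read_csv_with_fallback` is a hypothetical helper written for illustration and is not part of the tutorial or of pandas.

```python
import pandas as pd


def read_csv_with_fallback(url, local_path, **kwargs):
    """Try to read a CSV from `url`; on an I/O error, read `local_path` instead.

    Hypothetical helper for illustration only.
    """
    try:
        return pd.read_csv(url, **kwargs)
    except OSError:
        # URLError, HTTPError and FileNotFoundError are all subclasses of OSError
        return pd.read_csv(local_path, **kwargs)
```

For example, `read_csv_with_fallback(NEW_TRAINING_SET_URL, "augmented_urgent_events_train.csv", index_col=0)` would use the S3 copy when it is reachable and the local file otherwise. The local file name here is an assumption: it is wherever you saved the download.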

In [3]:
# loading the training and validation sets, then having a look at the training set
training_set = pd.read_csv(NEW_TRAINING_SET_URL, index_col=0)
val_set = pd.read_csv(VALIDATION_SET_URL, index_col=0)

training_set.reset_index(inplace=True, drop=True)
training_set.head()

Unnamed: 0,text,label
0,".. . i am homeless, my family sleeps under the...",1
1,what does someone from titus magloire have to ...,1
2,i and my family (16 persons) in 3e comunal sec...,1
3,is anyone else sitting at work obsesivley answ...,0
4,how many babies do they have to squeeze to get...,0


This is the new training set we will use to try to mitigate the location issue we identified during the tutorial. Compared to the original dataset, the new training set was augmented by adding new rows with different locations for every `Urgent` sample that mentions a location. Here are a few augmented samples:

In [4]:
training_set.loc[[33, 225, 3510, 5823, 6511]]

Unnamed: 0,text,label
33,jacmel was badly damaged we in needs of tents ...,1
225,carrefour feuilles was badly damaged we in nee...,1
3510,cayes was badly damaged we in needs of tents asap,1
5823,saint hillaire was badly damaged we in needs o...,1
6511,sousmalta was badly damaged we in needs of ten...,1
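The tutorial does not show how the augmented rows were generated. As a rough sketch of the idea, assuming a known list of location names and simple substring matching (both are assumptions, as are the helper name and the toy data), location-swapping augmentation could look like this:

```python
import pandas as pd

# Hypothetical location list; the actual list used to build the dataset is not given.
LOCATIONS = ["jacmel", "cayes", "carrefour feuilles"]


def augment_locations(df, locations, urgent_label=1):
    """Add copies of each urgent row with its location swapped for every other location."""
    new_rows = []
    for _, row in df[df["label"] == urgent_label].iterrows():
        # Naive substring match; a real pipeline would need proper entity matching
        mentioned = [loc for loc in locations if loc in row["text"]]
        for loc in mentioned:
            for other in locations:
                if other != loc:
                    new_rows.append(
                        {"text": row["text"].replace(loc, other), "label": row["label"]}
                    )
    if not new_rows:
        return df.copy()
    return pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)


# Toy data for illustration only
toy = pd.DataFrame(
    {
        "text": [
            "jacmel was badly damaged we in needs of tents asap",
            "how many babies do they have",
        ],
        "label": [1, 0],
    }
)
augmented = augment_locations(toy, LOCATIONS)
print(augmented)
```

Only the urgent row that mentions a location is duplicated; the non-urgent row is left as-is, which mirrors the description above.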


## 2. Training and evaluating our model

We are going to train a gradient boosting classifier on the new training set and then check its performance on the validation set.

In [5]:
sklearn_model = Pipeline([('count_vect', CountVectorizer(ngram_range=(1,2), stop_words='english')),
                          ('gbc', GradientBoostingClassifier(random_state=42))])
sklearn_model.fit(training_set['text'], training_set['label'])

Pipeline(steps=[('count_vect',
                 CountVectorizer(ngram_range=(1, 2), stop_words='english')),
                ('gbc', GradientBoostingClassifier(random_state=42))])
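To see what the classifier actually receives, it can help to inspect the features that the vectorizer extracts. The snippet below is a small illustration on a single made-up sentence, using the same `CountVectorizer` settings as the pipeline:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same settings as the pipeline: unigrams and bigrams, English stop words removed
vect = CountVectorizer(ngram_range=(1, 2), stop_words="english")
vect.fit(["jacmel was badly damaged"])

features = sorted(vect.vocabulary_)
print(features)
# ['badly', 'badly damaged', 'damaged', 'jacmel', 'jacmel badly']
```

Note that the location name becomes a feature of its own (and part of a bigram), so a model can learn to associate specific place names with a class. Varying the locations in the training data is one way to weaken that association.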

In [6]:
print(classification_report(val_set['label'], sklearn_model.predict(val_set['text'])))

              precision    recall  f1-score   support

           0       0.99      0.98      0.98      1818
           1       0.83      0.86      0.85       182

    accuracy                           0.97      2000
   macro avg       0.91      0.92      0.91      2000
weighted avg       0.97      0.97      0.97      2000
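When comparing model versions, it can be handy to read the per-class metrics programmatically instead of from the printed report. A minimal sketch, using toy labels made up for illustration rather than the tutorial's data:

```python
from sklearn.metrics import classification_report

# Toy labels for illustration only (1 = Urgent, 0 = Not urgent)
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1]

# output_dict=True returns a nested dict keyed by the class label as a string
report = classification_report(y_true, y_pred, output_dict=True)
urgent = report["1"]
print(f"Urgent precision: {urgent['precision']:.2f}, recall: {urgent['recall']:.2f}")
```

With the tutorial's objects, the same call would be `classification_report(val_set['label'], sklearn_model.predict(val_set['text']), output_dict=True)`.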



## 3. Openlayer part!

Now it's up to you! 

Head back to the tutorial for an explanation of the next few cells.

In [None]:
# installing the Openlayer Python API
!pip install openlayer

In [None]:
# instantiating the client
import openlayer

client = openlayer.OpenlayerClient('YOUR_API_KEY_HERE')

In [None]:
# loading the existing project
project = client.load_project(name="Urgent event classification")

In [None]:
# defining the model's predict probability function
def predict_proba(model, text_list):    
    # Getting the model's predictions
    preds = model.predict_proba(text_list)
    
    return preds
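Before uploading, it may be worth checking what this function returns. The sketch below trains a tiny stand-in pipeline on made-up sentences (not the tutorial's data) and verifies that the output has one row per text and one column per class, with columns ordered as in `model.classes_`:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


def predict_proba(model, text_list):
    # Same function as in the tutorial: delegate to the sklearn pipeline
    return model.predict_proba(text_list)


# Toy stand-in model and data, for illustration only
toy_model = Pipeline([
    ("count_vect", CountVectorizer()),
    ("gbc", GradientBoostingClassifier(n_estimators=10, random_state=42)),
])
toy_model.fit(
    ["we need tents asap", "need water and food urgently",
     "watching a movie tonight", "nice weather today"],
    [1, 1, 0, 0],
)

probs = predict_proba(toy_model, ["we need tents", "nice movie"])
print(probs.shape)               # one row per input text, one column per class
print(toy_model.classes_)        # column order of the probabilities
print(np.allclose(probs.sum(axis=1), 1.0))  # each row sums to 1
```

Since the columns follow `classes_` (here `0` then `1`), the `class_names` list passed at upload time should be given in that same order.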

In [None]:
# uploading the model to the project
from openlayer.models import ModelType

model = project.add_model(
    function=predict_proba, 
    model=sklearn_model,
    model_type=ModelType.sklearn,
    class_names=["Not urgent", "Urgent"],
    name='Gradient boosting classifier',
    commit_message='Attempt to fix location issue for Urgent class',
    requirements_txt_file='requirements.txt'
)