<img align="right" width="400" src="https://www.fhnw.ch/de/++theme++web16theme/assets/media/img/fachhochschule-nordwestschweiz-fhnw-logo.svg" alt="FHNW Logo">


# Export Trained Classifier

by Fabian Märki

## Summary
The aim of this notebook is to export a trained classifier together with its preprocessing pipeline. The exported classifier can then be imported and integrated into a production software stack (e.g. as a REST service that makes the classifier accessible to remote clients).

### REST Server
The code that exposes the exported classifier as a service can be found [here](https://github.com/markif/2021_HS_CAS_NLP_LAB_Notebooks/tree/master/rest-server), together with a README containing further instructions.
There is an instance running on *86.119.38.109* which can be accessed by any REST client (e.g. with curl):
- `curl -X POST http://86.119.38.109:5000/api/v1/sentiment -H 'Content-Type: application/json' -d '["Dies ist ein super Arzt. <br>", "Ein schlechter <p> Arzt."]'`
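
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; it mirrors the curl call above. The helper names `build_sentiment_request` and `predict_sentiment` are hypothetical, and the assumption that the service answers with a JSON body is not confirmed by this notebook.

```python
import json
import urllib.request


def build_sentiment_request(texts, base_url="http://86.119.38.109:5000"):
    """Build the POST request for a list of raw review texts (hypothetical helper)."""
    payload = json.dumps(texts).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/sentiment",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_sentiment(texts, timeout=10):
    """Send the request and decode the answer (assumes the server returns JSON)."""
    request = build_sentiment_request(texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Only build the request here; call predict_sentiment(...) to actually contact the server.
request = build_sentiment_request(
    ["Dies ist ein super Arzt. <br>", "Ein schlechter <p> Arzt."]
)
print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes the payload easy to inspect without network access.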

### INCEpTION
[INCEpTION](https://inception-project.github.io/) allows recommenders to be integrated via REST (see [here](https://inception-project.github.io/example-projects/recommender/), [here](https://inception-project.github.io/example-projects/external-recommender/), [here](https://github.com/inception-project/external-recommender-spacy) and [here](https://github.com/inception-project/inception/blob/main/notebooks/external_recommender.ipynb)). Such recommenders support annotators by making suggestions for the task at hand and can thus speed up the annotation process.

This notebook does not contain assignments: <font color='red'>Enjoy.</font>

<a href="https://colab.research.google.com/github/markif/2021_HS_CAS_NLP_Notebooks/blob/master/04_b_Export_Trained_Classifier.ipynb">
  <img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [1]:
%%capture

!pip install 'fhnw-nlp-utils>=0.2.13,<0.3.0'

from fhnw.nlp.utils.storage import download
from fhnw.nlp.utils.storage import load_dataframe

import pandas as pd
import numpy as np

In [2]:
from fhnw.nlp.utils.system import system_info
print(system_info())

OS name: posix
Platform name: Linux
Platform release: 5.11.0-40-generic
Python version: 3.6.9
Tensorflow version: 2.5.1
GPU is available


In [3]:
%%time
download("https://drive.google.com/uc?id=19AFeVnOfX8WXU4_3rM7OFoNTWWog_sb_", "data/german_doctor_reviews_tokenized.parq")
data = load_dataframe("data/german_doctor_reviews_tokenized.parq")
data.shape

CPU times: user 7.71 s, sys: 1.48 s, total: 9.18 s
Wall time: 5.38 s


(350087, 10)

In [4]:
data.head(3)

| | text_original | rating | text | label | sentiment | token_clean | text_clean | token_lemma | token_stem | token_clean_stopwords |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ich bin franzose und bin seit ein paar Wochen ... | 2.0 | Ich bin franzose und bin seit ein paar Wochen ... | positive | 1 | [ich, bin, franzose, und, bin, seit, ein, paar... | ich bin franzose und bin seit ein paar wochen ... | [franzose, seit, paar, wochen, muenchen, zahn,... | [franzos, seit, paar, woch, muench, ., zahn, s... | [franzose, seit, paar, wochen, muenchen, ., za... |
| 1 | Dieser Arzt ist das unmöglichste was mir in me... | 6.0 | Dieser Arzt ist das unmöglichste was mir in me... | negative | -1 | [dieser, arzt, ist, das, unmöglichste, was, mi... | dieser arzt ist das unmöglichste was mir in me... | [arzt, unmöglichste, leben, je, begegnen, unfr... | [arzt, unmog, leb, je, begegnet, unfreund, ,, ... | [arzt, unmöglichste, leben, je, begegnet, unfr... |
| 2 | Hatte akute Beschwerden am Rücken. Herr Magura... | 1.0 | Hatte akute Beschwerden am Rücken. Herr Magura... | positive | 1 | [hatte, akute, beschwerden, am, rücken, ., her... | hatte akute beschwerden am rücken . herr magur... | [akut, beschwerden, rücken, magura, erste, arz... | [akut, beschwerd, ruck, ., magura, erst, arzt,... | [akute, beschwerden, rücken, ., magura, erste,... |


In [5]:
# remove all neutral sentiments
data = data.loc[(data["label"] != "neutral")]
data.shape

(331187, 10)

This time we use all data for training.

In [6]:
X_train, y_train = data["token_stem"], data["label"]

### Train Classifier

Train the classifier on the cleaned data using the best hyperparameters found during hyperparameter tuning.

In [7]:
%%time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
# serialization in python does not work with lambdas (therefore this function)
from fhnw.nlp.utils.processing import identity

pipe = Pipeline([
         ("vec", CountVectorizer(tokenizer=identity, preprocessor=identity, stop_words=None)),
         ("tfidf", TfidfTransformer()),
         ("clf", SGDClassifier())
        ])

CPU times: user 274 ms, sys: 44.4 ms, total: 318 ms
Wall time: 312 ms
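
The comment in the cell above notes that serialization does not work with lambdas, which is why a named `identity` function is imported. A small sketch of that behaviour with the standard `pickle` module follows; the local `identity` and the helper `can_pickle` are illustrative stand-ins, not the library's code.

```python
import pickle


def identity(x):
    """Module-level function: pickle stores it by reference (module + name)."""
    return x


def can_pickle(obj):
    """Return True if obj can be serialized with pickle (hypothetical helper)."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


# A named function is looked up by its importable name when pickled,
# so this is expected to succeed when the code is run as a script.
print("named function:", can_pickle(identity))

# A lambda has no importable name ("<lambda>"), so the lookup fails.
print("lambda:", can_pickle(lambda x: x))
```

Because functions are pickled by reference, the module that defines `identity` must also be importable wherever the classifier is loaded later.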


In [8]:
best_params = {
    "clf__alpha": 5.3e-06, 
    "tfidf__norm": "l2", 
    "tfidf__sublinear_tf": True, 
    "tfidf__use_idf": True, 
    "vec__ngram_range": (1, 2),
    "vec__max_df": 0.5, 
    "vec__min_df": 0.0001,
}

In [9]:
%%time

pipe.set_params(**best_params)

pipe.fit(X_train, y_train)

CPU times: user 27 s, sys: 595 ms, total: 27.6 s
Wall time: 27.6 s


Pipeline(steps=[('vec',
                 CountVectorizer(max_df=0.5, min_df=0.0001, ngram_range=(1, 2),
                                 preprocessor=<function identity at 0x7f5fd4104730>,
                                 tokenizer=<function identity at 0x7f5fd4104730>)),
                ('tfidf', TfidfTransformer(sublinear_tf=True)),
                ('clf', SGDClassifier(alpha=5.3e-06))])

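Double-check the performance. Note that the predictions are made on the training data itself, so the scores below are optimistic and do not measure generalization.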

In [10]:
%%time

y_train_pred = pipe.predict(X_train)

CPU times: user 22.6 s, sys: 113 ms, total: 22.7 s
Wall time: 22.7 s


In [11]:
from sklearn.metrics import classification_report

report = classification_report(y_train, y_train_pred)
print(report)

              precision    recall  f1-score   support

    negative       0.94      0.91      0.93     33022
    positive       0.99      0.99      0.99    298165

    accuracy                           0.99    331187
   macro avg       0.97      0.95      0.96    331187
weighted avg       0.99      0.99      0.99    331187



Set up the preprocessing steps that were originally applied to the data before training.

In [12]:
%%capture

#!pip install 'spacy>=3.0.5'

#import spacy
#!python3 -m spacy download de_core_news_md

#nlp = spacy.load("de_core_news_md")

!pip install nltk

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
import nltk

nltk.download('punkt')
nltk.download('stopwords')

stopwords = set(stopwords.words("german"))

stemmer = SnowballStemmer("german")

Import the preprocessing functions. In a real setting these would already be available, since you used them to preprocess the training data.

In [13]:
from fhnw.nlp.utils.text import clean_text
from fhnw.nlp.utils.normalize import normalize

Create Preprocessors which can be integrated into a Pipeline.

In [14]:
from fhnw.nlp.utils.processing import Preprocessor
from fhnw.nlp.utils.processing import provide_computed_df
from fhnw.nlp.utils.processing import provide_computed_series_as_list

In [15]:
text_cleaner = Preprocessor(clean_text, n_jobs=1, field_read="text_original", field_write="text", finalizer_func=provide_computed_df, keep_punctuation=True)
text_normalizer = Preprocessor(normalize, n_jobs=1, field_read="text", field_write="token_stem", finalizer_func=provide_computed_series_as_list, stopwords=stopwords, stemmer=stemmer, lemmanizer=None, lemma_with_ner=False)
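
To illustrate what makes a preprocessing step usable inside a pipeline, here is a simplified sketch in plain Python. It is not the implementation of `Preprocessor`; the class `FunctionStep` and the functions `strip_tags` and `tokenize` are hypothetical and only show the `fit`/`transform` contract that pipeline steps follow.

```python
import re


class FunctionStep:
    """Hypothetical stand-in: wraps a plain function as a fit/transform step."""

    def __init__(self, func, **kwargs):
        self.func = func
        self.kwargs = kwargs

    def fit(self, X, y=None):
        # Stateless step: nothing is learned from the data.
        return self

    def transform(self, X):
        # Apply the wrapped function to every document.
        return [self.func(x, **self.kwargs) for x in X]


def strip_tags(text):
    """Replace HTML-like tags with a space and trim the result."""
    return re.sub(r"<[^>]+>", " ", text).strip()


def tokenize(text, lowercase=True):
    """Split into word and punctuation tokens."""
    if lowercase:
        text = text.lower()
    return re.findall(r"\w+|[^\w\s]", text)


steps = [FunctionStep(strip_tags), FunctionStep(tokenize, lowercase=True)]
docs = ["Dies ist ein super Arzt. <br>", "Ein schlechter <p> Arzt."]
for step in steps:
    # Each step consumes the output of the previous one.
    docs = step.fit(docs).transform(docs)
print(docs)
```

Unlike the real normalizer, this toy version neither removes stopwords nor applies stemming, so its tokens differ from those produced by `text_normalizer`.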

Test if the Preprocessors work as expected.

In [16]:
data = pd.DataFrame(["Dies ist ein super Arzt. <br>", "Ein schlechter <p> Arzt."], columns =["text_original"])
data.head()

| | text_original |
|---|---|
| 0 | `Dies ist ein super Arzt. <br>` |
| 1 | `Ein schlechter <p> Arzt.` |


In [17]:
data = text_cleaner.transform(data)
data.head()

| | text |
|---|---|
| 0 | Dies ist ein super Arzt. |
| 1 | Ein schlechter Arzt. |


In [18]:
data = text_normalizer.transform(data)
data

[['sup', 'arzt', '.'], ['schlecht', 'arzt', '.']]

Create the pipeline with the pretrained classifier.

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

classifier = Pipeline([
    ("clean", text_cleaner),
    ("normalize", text_normalizer),
    ("vec", pipe["vec"]),
    ("tfidf", pipe["tfidf"]),
    ("clf", pipe["clf"])
])

Test if the classifier works.

In [20]:
data = pd.DataFrame(["Dies ist ein super Arzt. <br>", "Ein schlechter <p> Arzt."], columns =["text_original"])
data.head()

| | text_original |
|---|---|
| 0 | `Dies ist ein super Arzt. <br>` |
| 1 | `Ein schlechter <p> Arzt.` |


In [21]:
classifier.predict(data)

array(['positive', 'negative'], dtype='<U8')

Export the complete classifier by serializing it to a file.

In [22]:
from fhnw.nlp.utils.storage import save_pickle
from fhnw.nlp.utils.storage import load_pickle

In [23]:
%%time

# There are several compression formats available 
# some result in smaller files (less to download)
# while others load/decompress faster (less startup time if loaded from local directory)

# gzip compressed file - small files and fast to load - good compromise 
save_pickle(classifier, "classifiers/sentiment_classifier.pgz")

# uncompressed file - fastest to load
#save_pickle(classifier, "classifiers/sentiment_classifier.pkl")

# bz2 compressed file - very small files but takes some time to load/decompress (and very long for compression)
#save_pickle(classifier, "classifiers/sentiment_classifier.pbz2")

CPU times: user 14.2 s, sys: 187 ms, total: 14.3 s
Wall time: 14.4 s
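
The trade-off between file size and loading time described in the comments can be reproduced with the standard library alone. The sketch below is an assumption about how such formats can be handled, not the implementation of `save_pickle`; the helpers `dump_bytes` and `load_bytes` and the toy object `model_like` are hypothetical.

```python
import bz2
import gzip
import pickle


def dump_bytes(obj, fmt="pkl"):
    """Serialize obj and optionally compress it (hypothetical helper)."""
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    if fmt == "pgz":
        return gzip.compress(raw)
    if fmt == "pbz2":
        return bz2.compress(raw)
    return raw  # "pkl": uncompressed


def load_bytes(blob, fmt="pkl"):
    """Reverse dump_bytes: decompress if needed, then unpickle."""
    if fmt == "pgz":
        blob = gzip.decompress(blob)
    elif fmt == "pbz2":
        blob = bz2.decompress(blob)
    return pickle.loads(blob)


# Toy object standing in for a fitted vectorizer vocabulary.
model_like = {"vocabulary": {f"token_{i}": i for i in range(50_000)}}

for fmt in ("pkl", "pgz", "pbz2"):
    blob = dump_bytes(model_like, fmt)
    assert load_bytes(blob, fmt) == model_like  # round trip is lossless
    print(fmt, len(blob), "bytes")
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.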


Import the classifier by deserializing it from a file.

In [24]:
%%time

loaded_classifier = load_pickle("classifiers/sentiment_classifier.pgz")
#loaded_classifier = load_pickle("classifiers/sentiment_classifier.pkl")
#loaded_classifier = load_pickle("classifiers/sentiment_classifier.pbz2") 

CPU times: user 749 ms, sys: 79.9 ms, total: 829 ms
Wall time: 828 ms


Test if the loaded classifier still works.

In [25]:
loaded_classifier.predict(data)

array(['positive', 'negative'], dtype='<U8')

Restart the kernel and check whether it still works...

In [26]:
from fhnw.nlp.utils.storage import load_pickle

loaded_classifier2 = load_pickle("classifiers/sentiment_classifier.pgz")

In [27]:
import pandas as pd

data = pd.DataFrame(["Dies ist ein super Arzt. <br>", "Ein schlechter <p> Arzt."], columns =["text_original"])

loaded_classifier2.predict(data)

array(['positive', 'negative'], dtype='<U8')