# Converting a scikit-learn model into ONNX format

We start by training a scikit-learn model and then show how to convert it into ONNX format. In this example, we train an MLP classifier on the 20 newsgroups dataset, after featurising the text with CountVectorizer().

### 20 newsgroups dataset
The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training and one for testing. To simplify the experiment and reduce the overall runtime, we pick 2 of the 20 categories.

In [10]:
categories = [
 'alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc'
]

### Example post for label: 'sci.space'

'From: gsh7w@fermi.clas.Virginia.EDU (Greg Hennessy)\nSubject: Re: Why not concentrate on child molesters?\nOrganization: University of Virginia\nLines: 17\n\nIn article <15218@optilink.COM> cramer@optilink.COM (Clayton Cramer) writes:\n#Yet, when a law was proposed for Virginia that extended this \n#philosophy to cigarette smokers (so that people who smoked away\n#from the work couldn\'t be discriminated against by employers),\n#the liberal Gov. Wilder vetoed it.  Which shows that liberals don\'t\n#give a damn about "best person for the job," it\'s just a power\n#play.\n\nOf course Clayton ignores the fact that employers pay health\ninsurance, and insurance for smokers is more expensive than for\nnon-smokers. \n\n--\n-Greg Hennessy, University of Virginia\n USPS Mail:     Astronomy Department, Charlottesville, VA 22903-2475 USA\n Internet:      gsh7w@virginia.edu  \n UUCP:\t\t...!uunet!virginia!gsh7w\n'

First we install the required Python packages from requirements.txt.

In [None]:
!pip install -r requirements.txt

### Import the necessary packages

In [2]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from skl2onnx.convert import convert_sklearn
from skl2onnx.common.data_types import StringTensorType
from onnxruntime import InferenceSession
from onnxmltools.utils import save_model

Fetch the training and test data using scikit-learn's fetch_20newsgroups().

In [4]:
# Choose 2 categories from the list above, assign them to cats, and uncomment the line below
# cats = []
training_data = fetch_20newsgroups(subset='train', categories=cats)
test_data = fetch_20newsgroups(subset='test', categories=cats)

Assign the data and labels to separate X and y variables for both the training and test sets.

In [5]:
X_train, y_train = np.array(training_data.data), training_data.target
X_test, y_test = np.array(test_data.data), test_data.target

### Scikit Pipeline
We create a scikit-learn pipeline that featurises the text with CountVectorizer() and then trains an MLPClassifier() on the resulting features.

In [3]:
# Create a Scikit Pipeline with CountVectorizer() and MLPClassifier() and assign it to model
# Call fit() on model with training data and labels to train your model
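
As a rough illustration of the two steps described in the comments above, here is a minimal sketch on a tiny in-memory corpus. The documents, labels, step names (`vectorizer`, `classifier`), variable names and the `MLPClassifier` settings are assumptions made for this example only; in the notebook you would fit on `X_train` and `y_train` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Toy corpus for illustration only (not taken from the 20 newsgroups data).
toy_docs = [
    "the shuttle launch reached orbit",
    "nasa plans a new orbit mission",
    "the car engine needs new tyres",
    "my car has a noisy engine",
]
toy_labels = [0, 0, 1, 1]  # hypothetical labels: 0 = space, 1 = autos

# Step 1 turns raw text into token counts; step 2 trains an MLP on those counts.
toy_model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0)),
])

# fit() runs the vectorizer and the classifier in sequence.
toy_model.fit(toy_docs, toy_labels)

# predict() accepts raw strings because the vectorizer is part of the pipeline.
toy_predictions = toy_model.predict(["a mission to orbit", "a car with old tyres"])
print(toy_predictions)
```

Keeping the vectorizer inside the pipeline means the whole text-to-label path is a single object, which is what gets passed to the converter later on.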

### Model accuracy
Calculate the accuracy of our model on the test set.

In [4]:
# Call predict() on your trained model and test dataset.
# Calculate accuracy by comparing with the actual test labels. 
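
One possible way to compute the accuracy is sketched below with plain NumPy. The helper name `accuracy` and the stand-in label arrays are assumptions for illustration; in the notebook the inputs would be `y_test` and the output of `model.predict(X_test)`.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Return the fraction of predictions that equal the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Stand-in labels for illustration only.
example_true = np.array([0, 1, 1, 0])
example_pred = np.array([0, 1, 0, 0])

print(accuracy(example_true, example_pred))  # 3 of 4 labels match
```

scikit-learn's `sklearn.metrics.accuracy_score` computes the same quantity and may be preferable in practice.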

### Conversion to ONNX
Convert the scikit-learn model into ONNX format using convert_sklearn(), then save the ONNX model.

In [9]:
model_onnx = convert_sklearn(model, 'newsgroup', [('input', StringTensorType([None]))])
# save_model(model_onnx, '<model name>.onnx')

Visualise the ONNX model using Netron: https://lutzroeder.github.io/netron/

### Load the ONNX model
For inference, we first load the model as shown below.

In [10]:
# sess = InferenceSession('<model name>.onnx')

### Prediction using onnxruntime
To run prediction on the test set, we call run() and pass the test data as the input feed:

In [11]:
res = sess.run(None, input_feed={'input': X_test})

The call above returns two outputs: the predicted labels (output 0) and the class probability scores (output 1).

## Comparing results of the ONNX and scikit-learn models
Here, we compare the labels returned by onnxruntime with the labels predicted by scikit-learn.

In [11]:
# Compare predicted labels for scikit and onnx model.
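
A minimal sketch of the label comparison, assuming both sets of labels are available as array-likes. The helper names and the stand-in arrays are hypothetical; in the notebook the inputs would be `model.predict(X_test)` and `res[0]`.

```python
import numpy as np

def labels_match(sklearn_labels, onnx_labels):
    """Return True when both label arrays have the same shape and values."""
    a = np.asarray(sklearn_labels)
    b = np.asarray(onnx_labels)
    return a.shape == b.shape and bool(np.all(a == b))

def mismatch_indices(sklearn_labels, onnx_labels):
    """Return the positions where the two equally sized label arrays differ."""
    a = np.asarray(sklearn_labels)
    b = np.asarray(onnx_labels)
    return np.flatnonzero(a != b)

# Stand-in labels for illustration only.
example_sklearn_labels = np.array([0, 1, 1, 0])
example_onnx_labels = np.array([0, 1, 1, 0], dtype=np.int64)

print(labels_match(example_sklearn_labels, example_onnx_labels))
print(mismatch_indices(example_sklearn_labels, [0, 1, 0, 0]))
```

Listing the mismatching indices is useful when the check fails, because it points at the specific posts on which the two models disagree.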

We can also match the predicted probability scores of the two models.

In [12]:
# Compare predicted class probability scores for scikit and onnx models.
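
The probability comparison could look roughly like the sketch below. It assumes that onnxruntime returns the scores as a list of `{class_label: probability}` dictionaries, one per sample, which is the usual default for converted scikit-learn classifiers; if your model returns a plain array, the helper passes it through unchanged. The helper names, the tolerance and the stand-in data are assumptions for illustration. In the notebook the inputs would be `model.predict_proba(X_test)`, `res[1]` and `model.classes_`.

```python
import numpy as np

def probabilities_to_array(onnx_probs, classes):
    """Convert a list of per-sample {label: probability} dicts to a 2-D array."""
    if isinstance(onnx_probs, np.ndarray):
        return onnx_probs
    return np.array([[row[c] for c in classes] for row in onnx_probs], dtype=np.float64)

def probabilities_close(sklearn_probs, onnx_probs, classes, atol=1e-5):
    """Compare scores with a tolerance, since ONNX typically computes in float32."""
    onnx_array = probabilities_to_array(onnx_probs, classes)
    return bool(np.allclose(np.asarray(sklearn_probs), onnx_array, atol=atol))

# Stand-in scores for illustration only.
example_classes = [0, 1]
example_sklearn_probs = np.array([[0.9, 0.1], [0.2, 0.8]])
example_onnx_probs = [
    {0: 0.9000001, 1: 0.0999999},
    {0: 0.2000002, 1: 0.7999998},
]

print(probabilities_close(example_sklearn_probs, example_onnx_probs, example_classes))
```

An exact equality check would be too strict here, because small floating-point differences between the two runtimes are expected.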