<a href="https://colab.research.google.com/github/njsuriya/ML_clf_algorithms/blob/main/sklearn_onnx_conversion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Installing skl2onnx and onnxruntime

In [1]:
!pip install skl2onnx

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting skl2onnx
  Downloading skl2onnx-1.13-py2.py3-none-any.whl (288 kB)
Collecting onnxconverter-common>=1.7.0
  Downloading onnxconverter_common-1.13.0-py2.py3-none-any.whl (83 kB)
Collecting onnx>=1.2.1
  Downloading onnx-1.12.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)
Installing collected packages: onnx, onnxconverter-common, skl2onnx
Successfully installed onnx-1.12.0 onnxconverter-common-1.13.0 skl2onnx-1.13


In [2]:
!pip install onnxruntime

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting onnxruntime
  Downloading onnxruntime-1.13.1-cp37-cp37m-manylinux_2_27_x86_64.whl (4.5 MB)
Collecting coloredlogs
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)
Collecting humanfriendly>=9.1
  Downloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB)
Installing collected packages: humanfriendly, coloredlogs, onnxruntime
Successfully installed coloredlogs-15.0.1 humanfriendly-10.0 onnxruntime-1.13.1


##### Downloading the spaCy English model

In [3]:
!python -m spacy download en_core_web_sm

2022-11-03 13:32:42.436509: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


##### Importing the packages

In [4]:
import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

from skl2onnx import convert_sklearn, to_onnx
from skl2onnx.common.data_types import FloatTensorType, StringTensorType

import onnxruntime as rt

from sklearn.feature_extraction.text import TfidfVectorizer
from string import punctuation

import spacy

from sklearn.preprocessing import LabelEncoder

from sklearn.metrics import accuracy_score

import pickle
import zlib

import numpy

from sklearn.pipeline import Pipeline

In [5]:
nlp = spacy.load('en_core_web_sm')

In [6]:
df = pd.read_csv('/content/amazon_4class_dataset.csv')

In [7]:
input_feature = df['text']
target = df['labels']

In [8]:
le = LabelEncoder()

In [9]:
target_ = le.fit_transform(target)
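
The encoder replaces each class name with an integer, so model predictions also come back as integers. A minimal sketch of the round trip follows; the label names are made up for illustration, since the actual values in `df['labels']` are not shown in this notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label names, for illustration only.
example_labels = ["books", "electronics", "books", "toys"]

encoder = LabelEncoder()
# Classes are sorted, so "books" -> 0, "electronics" -> 1, "toys" -> 2.
encoded = encoder.fit_transform(example_labels)

# Map integer predictions back to the original class names.
decoded = encoder.inverse_transform(encoded)
```

In this notebook the same call would be `le.inverse_transform(...)` applied to any array of predicted integers.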

##### Split the Dataset

In [10]:
train_x,test_x,train_y,test_y = train_test_split(input_feature,target_)

In [11]:
def text_preprocess(text):
  text = text.translate(str.maketrans('', '', punctuation))
  text = text.lower()
  text = [word.lemma_ for word in nlp(text) if not word.is_stop]
  return " ".join(text)

In [12]:
df['cleaned_text'] = df['text'].apply(text_preprocess)

###### Initialize the TFIDF Vectorizer with custom preprocessor

In [13]:
tfidf = TfidfVectorizer(preprocessor=text_preprocess)

###### Instantiating a plain TFIDF Vectorizer

In [14]:
tfidf_= TfidfVectorizer()

###### Custom Preprocessed Vectorizer

In [15]:
feature = tfidf.fit_transform(input_feature)

###### Tfidf Vectorizer

In [16]:
feature_ = tfidf_.fit_transform(df['cleaned_text'])

###### Splitting the Data after vectorization of Text

In [17]:
x_train,x_test,y_train,y_test = train_test_split(feature,target_)

In [18]:
svc_clf = SVC()

###### Model with Custom Preprocess

In [19]:
pipe = Pipeline([('tfidf', TfidfVectorizer(preprocessor=text_preprocess)), ('svc', SVC())])

In [20]:
pipe.fit(train_x,train_y)

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(preprocessor=<function text_preprocess at 0x7f9c6a9000e0>)),
                ('svc', SVC())])

###### Model without Custom Preprocess

In [21]:
pipe_ = Pipeline([('tfidf_',TfidfVectorizer()),('svc',SVC())])

In [22]:
pipe_.fit(train_x,train_y)

Pipeline(steps=[('tfidf_', TfidfVectorizer()), ('svc', SVC())])

In [23]:
svc_clf.fit(x_train,y_train)

SVC()

In [24]:
y_pred = svc_clf.predict(x_test)

In [25]:
# Store the result under a new name so the imported accuracy_score function is not overwritten.
svc_accuracy = accuracy_score(y_test, y_pred)

In [26]:
pipe.score(test_x,test_y)

0.8924731182795699

In [37]:
pipe_.score(test_x,test_y)

0.8709677419354839

In [27]:
# Use a context manager so the file handle is closed after writing.
with open("/content/sk_model", "wb") as f:
    pickle.dump(svc_clf, f)

In [28]:
model_pickel = pickle.dumps(svc_clf)

In [29]:
model_bytes = zlib.compress(model_pickel)
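
To get the model back from the compressed bytes, the two steps are undone in reverse order: decompress first, then unpickle. The sketch below uses a plain dictionary as a stand-in for the fitted classifier (an assumption made so the example runs without training data); any picklable object behaves the same way.

```python
import pickle
import zlib

# Stand-in for a fitted model; hypothetical content, for illustration only.
stand_in_model = {"kernel": "rbf", "support": list(range(1000))}

# Same two steps as in the notebook: pickle, then compress.
compressed = zlib.compress(pickle.dumps(stand_in_model))

# Reverse the steps in the opposite order to recover the object.
restored = pickle.loads(zlib.decompress(compressed))
```

As with any pickle, only load bytes that come from a trusted source.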

In [30]:
# # # initial_type = [('float_input', FloatTensorType([None, 4]))]
# onx = to_onnx(svc_clf, x_test[:1].astype(numpy.float32))
# with open("/content/onnx_model.onnx", "wb") as f:
#     f.write(onx.SerializeToString())

#### **NOTE:** A custom Python preprocessor (such as `text_preprocess`) cannot be converted into ONNX, so the conversion below uses `pipe_`, the pipeline without it.
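
One possible workaround, sketched here under the assumption that the same cleaning code can also run wherever the model is served, is to clean the text in plain Python first and keep only standard scikit-learn steps inside the pipeline. `simple_clean` is a hypothetical, simplified stand-in for `text_preprocess` (no spaCy lemmatization or stop-word removal), and the texts and labels are made up.

```python
from string import punctuation

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def simple_clean(text):
    """Hypothetical stand-in for text_preprocess: strip punctuation, lowercase."""
    return text.translate(str.maketrans("", "", punctuation)).lower()


# Made-up training data, for illustration only.
raw_texts = ["Great speaker, loud and clear!", "Terrible battery; died fast.",
             "Great sound, clear bass.", "Battery died, terrible."]
labels = [1, 0, 1, 0]

# Cleaning happens before the pipeline, so the pipeline itself contains
# no custom Python callable.
cleaned_texts = [simple_clean(t) for t in raw_texts]
plain_pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svc", SVC())])
plain_pipe.fit(cleaned_texts, labels)

# New text must go through the same cleaning before prediction.
prediction = plain_pipe.predict([simple_clean("Clear, great sound!")])
```

The trade-off is that the cleaning step has to be reimplemented or shipped separately in the serving environment, since it is not part of the exported graph.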

In [31]:
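# Convert the plain pipeline; the input is declared as a tensor of raw strings.
onnx_model = convert_sklearn(pipe_, name='text_classifier',
  initial_types=[('input', StringTensorType([1, 1]))])

# Serialize the ONNX graph to disk.
with open("onnx_sklearn_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())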

In [32]:
# sess = rt.InferenceSession("/content/onnx_sklearn_model.onnx")

In [33]:
# input_name = sess.get_inputs()[0].name
# input_name

In [34]:
# label_name = sess.get_outputs()[0].name

In [35]:
# test_text = "The amazon bluetooth was very good"
# pipe_.predict([test_text])

In [36]:
# pred_onx = sess.run([label_name], {input_name: numpy.array([[test_text]])})[0]

In [38]:
from sklearn.naive_bayes import MultinomialNB

In [39]:
nb_clf = MultinomialNB()

In [40]:
pipe_nb_ = Pipeline([('tfidf',TfidfVectorizer()),('MultinomialNB',MultinomialNB())])

In [41]:
pipe_nb_.fit(train_x,train_y)

Pipeline(steps=[('tfidf', TfidfVectorizer()),
                ('MultinomialNB', MultinomialNB())])

In [42]:
pipe_nb_.score(test_x,test_y)

0.8817204301075269