<p align="center">
<img src ="https://raw.githubusercontent.com/microsoft/azuredatastudio/master/src/sql/media/microsoft_logo_gray.svg?sanitize=true" width="250" align="center">
</p>

# **Train, Convert, and Deploy with ONNX Runtime**

SQL Server supports native PREDICT using ONNX models. Currently, only models with **numeric data types** are supported: [int and bigint](https://docs.microsoft.com/en-us/sql/t-sql/data-types/int-bigint-smallint-and-tinyint-transact-sql?view=sql-server-ver15), and [real and float](https://docs.microsoft.com/en-us/sql/t-sql/data-types/float-and-real-transact-sql?view=sql-server-ver15). Other numeric types can be converted to a supported type by using [CAST and CONVERT](https://docs.microsoft.com/en-us/sql/t-sql/functions/cast-and-convert-transact-sql?view=sql-server-ver15). The model inputs should be structured so that each input to the model corresponds to a single SQL Server column. For example, if you are using a pandas dataframe to train a model, each input to the model should be a separate column of the dataframe.
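As a rough illustration of the type rule above, the sketch below checks a SQL Server column type against the four supported types and, for anything else, builds a `CAST` expression. The helper name `cast_expression` and the default target type of `float` are assumptions made for this example; they are not part of SQL Server or ONNX Runtime.

```python
# Types the text above lists as supported for native PREDICT.
SUPPORTED_TYPES = {"int", "bigint", "real", "float"}

def cast_expression(column, sql_type, target="float"):
    """Return a T-SQL expression that yields a supported numeric type.

    `target` is an assumption for this sketch; choose whichever supported
    type matches the input type the model was trained with.
    """
    if sql_type.lower() in SUPPORTED_TYPES:
        return f"[{column}]"
    return f"CAST([{column}] AS {target}) AS [{column}]"

print(cast_expression("RM", "float"))      # already a supported type
print(cast_expression("RAD", "smallint"))  # needs a CAST
```

The resulting expressions could be placed in the `SELECT` list of the query that feeds PREDICT.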
 
ONNXMLTools enables you to convert models from different machine learning toolkits into ONNX. Currently, for numeric data types and single column inputs, the following toolkits are supported in SQL Server:
 
* [scikit-learn](https://github.com/onnx/sklearn-onnx)
* [TensorFlow](https://github.com/onnx/tensorflow-onnx)
* [Keras](https://github.com/onnx/keras-onnx)
* [CoreML](https://github.com/onnx/onnxmltools)
* [Spark ML (experimental)](https://github.com/onnx/onnxmltools/tree/master/onnxmltools/convert/sparkml)
* [LightGBM](https://github.com/onnx/onnxmltools)
* [libsvm](https://github.com/onnx/onnxmltools)
* [XGBoost](https://github.com/onnx/onnxmltools)
 
The tutorial below is based on scikit-learn and uses the Boston Housing [dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html). You can follow the same steps to load other datasets. Note that `load_boston` was removed in scikit-learn 1.2, so the code below requires an earlier version.

### **Before you begin**

* Install [Azure Data Studio](https://docs.microsoft.com/sql/azure-data-studio/download) 

* Open Azure Data Studio and follow these steps to install the packages needed for this quickstart:

    1. Open a [New Notebook](https://docs.microsoft.com/sql/azure-data-studio/sql-notebooks) connected to the Python 3 kernel.
    1. Click **Manage Packages**, then under **Add New** search for **sklearn** and install the scikit-learn package.
    1. Also install the **onnxmltools**, **onnxruntime**, **skl2onnx**, **pyodbc**, and **sqlalchemy** packages.

### **Steps**:
1. Create a pipeline to train a LinearRegression model.
2. Convert the model to the ONNX format. 
3. Test the ONNX model.
4. Insert the ONNX model into SQL Server.

### **Next Steps**:
[Run native PREDICT in SQL Server using the ONNX model](Native%20PREDICT%20on%20Azure%20SQL%20Database%20Edge.ipynb). 

## **1. Train a Pipeline**
Split the dataset into features and a target so that the features can be used to predict the median value of a house. Create a pipeline to train the LinearRegression model, then calculate the R2 score and mean squared error.

In [1]:
import numpy as np
import onnxmltools
import onnxruntime as rt
import pandas as pd
import skl2onnx
import sklearn
import sklearn.datasets

from sklearn.datasets import load_boston
boston = load_boston()

df = pd.DataFrame(data=np.c_[boston['data'], boston['target']], columns=boston['feature_names'].tolist() + ['MEDV'])

# Split the data frame into features and target.
# x_train contains all predictors (features)
x_train = df.drop(['MEDV'], axis = 1)

# y_train is what we are trying to predict - the median value
y_train = df.iloc[:,-1]

In [2]:
print("\n*** Training data set x\n")
print(x_train.head())

print("\n*** Training data set y\n")
print(y_train.head())


*** Training data set x

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   

   PTRATIO       B  LSTAT  
0     15.3  396.90   4.98  
1     17.8  396.90   9.14  
2     17.8  392.83   4.03  
3     18.7  394.63   2.94  
4     18.7  396.90   5.33  

*** Training data set y

0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: MEDV, dtype: float64


In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

continuous_transformer = Pipeline(steps=[('scaler', RobustScaler())])

# All columns are numeric - normalize them
preprocessor = ColumnTransformer(
    transformers=[
        ('continuous', continuous_transformer, [i for i in range(len(x_train.columns))])])

model = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('regressor', LinearRegression())])

# Train the model
model.fit(x_train, y_train)

Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('continuous',
                                                  Pipeline(memory=None,
                                                           steps=[('scaler',
                                                                   RobustScaler(copy=True,
                                                                                quantile_range=(25.0,
                                                                                                75.0),
                                                                                with_centering=True,
                                                                                with_scaling=True))],
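For intuition about the `RobustScaler` step: with the settings shown in the output above (`quantile_range=(25.0, 75.0)`, centering and scaling enabled), each column is centered on its median and divided by its interquartile range. The pure-Python sketch below applies that idea to a single column. `robust_scale` is a hypothetical helper written for this illustration, not scikit-learn's implementation, and it does not handle a zero interquartile range.

```python
import statistics

def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    median = statistics.median(values)
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]

print(robust_scale([1.0, 2.0, 3.0, 4.0, 5.0]))
```

Because the median and quartiles are barely affected by extreme values, a few outliers in a column do not distort the scaling of the remaining rows.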
                                                 

In [5]:
y_pred = model.predict(x_train)

In [6]:
# Score the model
from sklearn.metrics import r2_score, mean_squared_error
sklearn_r2_score = r2_score(y_train, y_pred)
sklearn_mse = mean_squared_error(y_train, y_pred)
print('*** Scikit-learn r2 score: {}'.format(sklearn_r2_score))
print('*** Scikit-learn MSE: {}'.format(sklearn_mse))

*** Scikit-learn r2 score: 0.7406426641094094
*** Scikit-learn MSE: 21.894831181729206
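For reference, both metrics are simple to compute by hand. The sketch below uses the first five target values printed earlier together with made-up predictions (the `y_hat` values are invented for illustration and are not model output). `mse` and `r2` are hypothetical helper names.

```python
def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [24.0, 21.6, 34.7, 33.4, 36.2]  # first five MEDV values
y_hat = [25.0, 22.6, 33.7, 32.4, 35.2]   # invented predictions, each off by 1
print(mse(y_true, y_hat), r2(y_true, y_hat))
```

An R2 of 1.0 would mean the predictions match the targets exactly; lower values mean more of the variance in the target is left unexplained.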


## **2. Convert the model to ONNX**
Using skl2onnx, we will convert our LinearRegression model to the ONNX format and save it locally.

In [7]:
from skl2onnx.common.data_types import FloatTensorType, Int64TensorType, DoubleTensorType

def convert_dataframe_schema(df, drop=None, batch_axis=False):
    inputs = []
    nrows = None if batch_axis else 1
    for k, v in zip(df.columns, df.dtypes):
        if drop is not None and k in drop:
            continue
        if v == 'int64':
            t = Int64TensorType([nrows, 1])
        elif v == 'float32':
            t = FloatTensorType([nrows, 1])
        elif v == 'float64':
            t = DoubleTensorType([nrows, 1])
        else:
            raise Exception("Bad type")
        inputs.append((k, t))
    return inputs
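To make the result of `convert_dataframe_schema` concrete: every dataframe column becomes its own named model input with shape `[1, 1]` (or `[None, 1]` when `batch_axis=True`). The dependency-free sketch below mirrors that mapping with the tensor type names written as plain strings. `sketch_schema` is a hypothetical stand-in for illustration and is not part of skl2onnx.

```python
# Same dtype-to-tensor-type mapping as the function above, with names as strings.
TENSOR_TYPES = {
    "int64": "Int64TensorType",
    "float32": "FloatTensorType",
    "float64": "DoubleTensorType",
}

def sketch_schema(dtypes, batch_axis=False):
    """dtypes: dict mapping column name -> dtype name."""
    nrows = None if batch_axis else 1
    return [(name, TENSOR_TYPES[dtype], [nrows, 1]) for name, dtype in dtypes.items()]

for entry in sketch_schema({"CRIM": "float64", "ZN": "float64"}):
    print(entry)
```

One input per column is what lets each model input line up with a single SQL Server column later on.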

In [8]:
# Convert the scikit model to onnx format
onnx_model = skl2onnx.convert_sklearn(model, 'Boston Data', convert_dataframe_schema(x_train))
# Save the onnx model locally
onnx_model_path = 'boston1.model.onnx'
onnxmltools.utils.save_model(onnx_model, onnx_model_path)

## **3. Test the ONNX model**
After converting the model to ONNX format, we score the model to show little to no degradation in performance.

*ONNX Runtime uses floats instead of doubles, which explains the small potential discrepancies.*
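The effect of single precision can be seen without ONNX Runtime at all. This sketch round-trips Python floats (which are doubles) through a 32-bit float using the standard `struct` module. `to_float32` is a hypothetical helper written for this illustration.

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

for value in (24.0, 0.00632):
    rounded = to_float32(value)
    print(value, rounded, abs(value - rounded))
```

A value such as 24.0 is exactly representable and survives unchanged, while 0.00632 picks up a tiny rounding error. Errors of this size are the kind that can accumulate into small score differences.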

In [9]:
import onnxruntime as rt
sess = rt.InferenceSession(onnx_model_path)

y_pred = np.full(shape=(len(x_train)), fill_value=np.nan)

for i in range(len(x_train)):
    inputs = {}
    for j in range(len(x_train.columns)):
        inputs[x_train.columns[j]] = np.full(shape=(1,1), fill_value=x_train.iloc[i,j])

    sess_pred = sess.run(None, inputs)
    y_pred[i] = sess_pred[0][0][0]

onnx_r2_score = r2_score(y_train, y_pred)
onnx_mse = mean_squared_error(y_train, y_pred)

print()
print('*** Onnx r2 score: {}'.format(onnx_r2_score))
print('*** Onnx MSE: {}\n'.format(onnx_mse))
print('R2 Scores are equal' if sklearn_r2_score == onnx_r2_score else 'Difference in R2 scores: {}'.format(abs(sklearn_r2_score - onnx_r2_score)))
print('MSE are equal' if sklearn_mse == onnx_mse else 'Difference in MSE scores: {}'.format(abs(sklearn_mse - onnx_mse)))
print()



*** Onnx r2 score: 0.7406426691136831
*** Onnx MSE: 21.894830759270633

Difference in R2 scores: 5.00427377314594e-09
Difference in MSE scores: 4.224585730128183e-07
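Because an exact `==` comparison of floating-point scores will rarely succeed, a tolerance-based check is often more informative. The sketch below applies `math.isclose` to the scores printed above. The `rel_tol=1e-6` threshold is an arbitrary assumption chosen for illustration.

```python
import math

# Scores copied from the outputs above.
sklearn_r2, onnx_r2 = 0.7406426641094094, 0.7406426691136831
sklearn_mse_value, onnx_mse_value = 21.894831181729206, 21.894830759270633

r2_match = math.isclose(sklearn_r2, onnx_r2, rel_tol=1e-6)
mse_match = math.isclose(sklearn_mse_value, onnx_mse_value, rel_tol=1e-6)
print("R2 close:", r2_match, "| MSE close:", mse_match)
```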



## **4. Insert the ONNX model into SQL Server**
Now we will store the model in SQL Server. We will create a database ```onnx``` with a ```models``` table to store the ONNX model. Note that the cell below drops any existing ```onnx``` database before creating it. You will be prompted to enter your **server address, username, and password**. This step also requires the *pyodbc* package.
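The cell below builds its connection strings by plain concatenation, which can break if the password contains characters such as `;` or `}`. ODBC connection strings generally allow a value to be wrapped in braces, with any closing brace inside it doubled. The sketch below applies that rule; `odbc_value` and `build_connection_string` are hypothetical helpers, and the sample credentials are placeholders.

```python
def odbc_value(value):
    """Wrap a value in braces, doubling any closing brace inside it."""
    return "{" + value.replace("}", "}}") + "}"

def build_connection_string(server, database, username, password):
    """Assemble an ODBC connection string with a brace-quoted password."""
    return (
        "Driver={ODBC Driver 17 for SQL Server};"
        f"Server={server};Database={database};"
        f"UID={username};PWD={odbc_value(password)};"
    )

print(build_connection_string("localhost", "master", "sa", "p;ss}word"))
```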

In [5]:
import pyodbc
import getpass

server = input("Enter the server address:")
username = input("Enter username:")
password = getpass.getpass(prompt="Enter password:")

# Connect to the master DB to create the new onnx database
connection_string = "Driver={ODBC Driver 17 for SQL Server};Server=" + server + ";Database=master;UID=" + username + ";PWD=" + password + ";"

conn = pyodbc.connect(connection_string, autocommit=True)
cursor = conn.cursor()

database = 'onnx'
query = 'DROP DATABASE IF EXISTS ' + database
cursor.execute(query)
conn.commit()

# Create onnx database
query = 'CREATE DATABASE ' + database
cursor.execute(query)
conn.commit()

# Connect to onnx database

db_connection_string = "Driver={ODBC Driver 17 for SQL Server};Server=" + server + ";Database=" + database + ";UID=" + username + ";PWD=" + password + ";"

conn = pyodbc.connect(db_connection_string, autocommit=True)
cursor = conn.cursor()

table_name = 'models'

# Drop the table if it exists
query = f'drop table if exists {table_name}'
cursor.execute(query)
conn.commit()

# Create the model table
query = f'create table {table_name} ( ' \
    f'[id] [int] IDENTITY(1,1) NOT NULL, ' \
    f'[data] [varbinary](max) NULL, ' \
    f'[description] varchar(1000))'
cursor.execute(query)
conn.commit()

# Insert the ONNX model into the models table
query = f"insert into {table_name} ([description], [data]) values ('Onnx Model',?)"

model_bits = onnx_model.SerializeToString()

# Pass the parameters as a one-element tuple (note the trailing comma)
insert_params = (pyodbc.Binary(model_bits),)
cursor.execute(query, insert_params)
conn.commit()
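A quick way to gain confidence in the insert is to read the bytes back and compare them with the original. The sketch below shows that round trip using Python's built-in `sqlite3` as a stand-in for SQL Server, so it runs without a server. The table layout mirrors `models`, but the placeholder bytes, the `demo_` names, and the `blob` column type are assumptions of this example. Against SQL Server, the same pattern would use the `pyodbc` cursor from the cell above.

```python
import sqlite3

# Stand-in for onnx_model.SerializeToString()
demo_bits = b"\x08\x07placeholder-onnx-bytes"

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "create table models (id integer primary key, data blob, description text)"
)
# Parameterized insert: the binary payload is passed as a parameter,
# never concatenated into the SQL text.
demo_conn.execute(
    "insert into models (description, data) values (?, ?)",
    ("Onnx Model", demo_bits),
)

stored = demo_conn.execute(
    "select data from models where description = ?", ("Onnx Model",)
).fetchone()[0]
print("round trip ok:", stored == demo_bits)
demo_conn.close()
```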


### **Load the data into SQL Server**

We will create two tables, **features** and **target**, to store subsets of the Boston Housing dataset.
* Features will contain all data being used to predict the target, median value. 
* Target contains the median value for each record in the dataset. 

You will need to import the *sqlalchemy* package.

In [5]:
    import sqlalchemy
    from sqlalchemy import create_engine
    import urllib

    db_connection_string = "Driver={ODBC Driver 17 for SQL Server};Server=" + server + ";Database=" + database + ";UID=" + username + ";PWD=" + password + ";"

    conn = pyodbc.connect(db_connection_string)
    cursor = conn.cursor()

    features_table_name = 'features'

    # Drop the table if it exists
    query = f'drop table if exists {features_table_name}'
    cursor.execute(query)
    conn.commit()

    # Create the features table
    query = \
        f'create table {features_table_name} ( ' \
        f'    [CRIM] float, ' \
        f'    [ZN] float, ' \
        f'    [INDUS] float, ' \
        f'    [CHAS] float, ' \
        f'    [NOX] float, ' \
        f'    [RM] float, ' \
        f'    [AGE] float, ' \
        f'    [DIS] float, ' \
        f'    [RAD] float, ' \
        f'    [TAX] float, ' \
        f'    [PTRATIO] float, ' \
        f'    [B] float, ' \
        f'    [LSTAT] float, ' \
        f'    [id] int)'

    cursor.execute(query)
    conn.commit()

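    target_table_name = 'target'

    # Drop the target table if it exists
    query = f'drop table if exists {target_table_name}'
    cursor.execute(query)
    conn.commit()

    # Create the target table
    query = \
        f'create table {target_table_name} ( ' \
        f'    [MEDV] float, ' \
        f'    [id] int)'

    cursor.execute(query)
    conn.commit()

    # y_train is a pandas Series; convert it to a DataFrame so an id column can be added
    y_train = y_train.to_frame()

    x_train['id'] = range(1, len(x_train)+1)
    y_train['id'] = range(1, len(y_train)+1)

    print(x_train.head())
    print(y_train.head())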

Finally, using sqlalchemy, we insert the `x_train` and `y_train` pandas dataframes into tables `features` and `target`, respectively. 

In [22]:
db_connection_string = 'mssql+pyodbc://' + username + ':' + password + '@' + server + '/' + database + '?driver=ODBC+Driver+17+for+SQL+Server'
sql_engine = sqlalchemy.create_engine(db_connection_string)
x_train.to_sql(features_table_name, sql_engine, if_exists='append', index=False)
y_train.to_sql(target_table_name, sql_engine, if_exists='append', index=False)
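The URL above is assembled by concatenation, so a username or password containing characters such as `@`, `/`, or `:` would be misparsed. A common remedy is to percent-encode the credentials with `urllib.parse.quote_plus` first. The sketch below shows this; `sqlalchemy_url` is a hypothetical helper and the sample credentials are placeholders.

```python
from urllib.parse import quote_plus

def sqlalchemy_url(server, database, username, password):
    """Build an mssql+pyodbc URL with the credentials percent-encoded."""
    return (
        "mssql+pyodbc://"
        f"{quote_plus(username)}:{quote_plus(password)}@{server}/{database}"
        "?driver=ODBC+Driver+17+for+SQL+Server"
    )

print(sqlalchemy_url("localhost", "onnx", "sa", "p@ss/w:rd"))
```

The returned string can be passed to `sqlalchemy.create_engine` in place of the concatenated one.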

You will now be able to view the data in SQL Server.