@@ -0,0 +1,90 @@
# Build a predictive model with SQL Server Python

This sample shows how to create a predictive model in Python and operationalize it with SQL Server vNext.

### Contents

[About this sample](#about-this-sample)<br/>
[Before you begin](#before-you-begin)<br/>
[Sample details](#sample-details)<br/>
[Related links](#related-links)<br/>


<a name=about-this-sample></a>

## About this sample

Predictive modeling is a powerful way to add intelligence to your application. It enables applications to predict outcomes against new data.
The act of incorporating predictive analytics into your applications involves two major phases:
model training and model operationalization.

In this sample, you will learn how to create a predictive model in Python and operationalize it with SQL Server vNext.
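
The two phases can be sketched in a few lines of plain Python. This is a minimal illustration, not the sample's code: the `train_model` and `score` helpers and the toy data are hypothetical, and a hand-rolled one-variable least-squares fit stands in for the scikit-learn model the sample uses. Training produces a serialized model (bytes); operationalization loads those bytes and predicts on new data.

```python
import pickle


def train_model(xs, ys):
    """Training phase: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Serialize the fitted parameters so they can be stored and reused later.
    return pickle.dumps({"slope": slope, "intercept": intercept})


def score(model_bytes, new_xs):
    """Operationalization phase: load the stored model and predict on new data."""
    model = pickle.loads(model_bytes)
    return [model["slope"] * x + model["intercept"] for x in new_xs]


# Toy data (hypothetical): y = 2x + 1 exactly.
model_bytes = train_model([1, 2, 3, 4], [3, 5, 7, 9])
predictions = score(model_bytes, [5, 6])
print(predictions)
```

Keeping the two phases behind separate functions mirrors the sample's split into one stored procedure for training and one for prediction.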


<!-- Delete the ones that don't apply -->
- **Applies to:** SQL Server vNext
- **Key features:** SQL Server Machine Learning Services
- **Workload:** SQL Server Machine Learning Services
- **Programming Language:** T-SQL, Python
- **Authors:** Nellie Gustafsson
- **Update history:** Getting started tutorial for SQL Server ML Services - Python

<a name=before-you-begin></a>

## Before you begin

To run this sample, you need the following prerequisites.<br/>
Download the database backup file and restore it using setup.sql. [Download DB](https://deve2e.azureedge.net/sqlchoice/static/TutorialDB.bak)

**Software prerequisites:**

<!-- Examples -->
1. SQL Server vNext CTP2.0 (or higher) with Machine Learning Services (Python) installed
2. SQL Server Management Studio
3. Python Tools for Visual Studio

## Run this sample
1. From SQL Server Management Studio or SQL Server Data Tools, connect to your SQL Server vNext instance and execute setup.sql to restore the sample database you downloaded.
2. From SQL Server Management Studio or SQL Server Data Tools, open the rental_prediction.sql script. This script:
   - Sets up the necessary tables
   - Creates a stored procedure to train a model
   - Creates a stored procedure to predict using that model
   - Saves the predicted results to a database table
3. You can also try the Python script on its own. Point your Python environment to "C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES" if you run the in-database Python server, or to "C:\Program Files\Microsoft SQL Server\140\PYTHON_SERVER" if you have the standalone Machine Learning Server installed.

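Before configuring an editor, it can help to check which of the two installations is actually present. The snippet below is a hedged sketch: the `find_python_home` helper is hypothetical, and the two directories are the default paths quoted in step 3, which will differ for named instances or custom install locations.

```python
from pathlib import Path

# Default locations quoted in this README; adjust for named instances
# or custom install directories.
CANDIDATE_HOMES = [
    r"C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES",
    r"C:\Program Files\Microsoft SQL Server\140\PYTHON_SERVER",
]


def find_python_home(candidates=CANDIDATE_HOMES):
    """Return the first candidate directory that exists, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir():
            return path
    return None


home = find_python_home()
print(home if home else "No SQL Server Python installation found at the default paths")
```

The directory it returns is the one to select as the Python environment in your IDE.
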
<a name=sample-details></a>

## Sample details

This sample shows how to create a predictive model with Python, generate predictions with it, and deploy it in SQL Server using SQL Server Machine Learning Services.

### rental_prediction.py
The Python script that generates a predictive model and uses it to predict rental counts.
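
The evaluation step in the script boils down to holding out part of the data and computing the mean squared error on it. Below is a minimal sketch using only the standard library; `split_rows` and `mean_squared_error` are hypothetical stand-ins for pandas `DataFrame.sample` and scikit-learn's `mean_squared_error`, and all numbers are made up.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=1):
    """Shuffle row indices with a fixed seed and split them into train/test lists."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(rows) * train_fraction)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


def mean_squared_error(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


rows = list(range(10))              # made-up stand-in for the rental rows
train, test = split_rows(rows)
print(len(train), len(test))        # 8 2

actual = [20, 35, 50]               # made-up rental counts
predicted = [22, 33, 50]            # made-up model output
mse = mean_squared_error(actual, predicted)
print(mse)
```

Fixing the seed makes the split reproducible, which is the same reason the script passes `random_state=1` when sampling the training set.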

### rental_prediction.sql
Takes the Python code in rental_prediction.py and deploys it inside SQL Server by creating a table for storing models and stored procedures for training and prediction.
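
The model table pattern used by the SQL script (a name plus the pickled model as a binary value) can be tried without a SQL Server instance. The sketch below is a stand-in built on assumptions: it uses the standard-library `sqlite3` module and a `BLOB` column in place of SQL Server and `VARBINARY(MAX)`, and a plain dictionary in place of a trained scikit-learn model. The table and column names mirror `rental_py_models` from the script.

```python
import pickle
import sqlite3

# In-memory SQLite database as a stand-in for TutorialDB (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rental_py_models ("
    " model_name TEXT NOT NULL PRIMARY KEY,"
    " model BLOB NOT NULL)"
)

# Placeholder for a trained model; any picklable object works the same way.
trained_model = {"coefficients": [1.5, -0.25], "intercept": 10.0}

# Save: serialize to bytes and store under a name.
conn.execute(
    "INSERT INTO rental_py_models (model_name, model) VALUES (?, ?)",
    ("linear_model", pickle.dumps(trained_model)),
)

# Load: fetch the bytes by name and deserialize before scoring.
(model_bytes,) = conn.execute(
    "SELECT model FROM rental_py_models WHERE model_name = ?",
    ("linear_model",),
).fetchone()
restored_model = pickle.loads(model_bytes)
print(restored_model == trained_model)  # True
conn.close()
```

Only unpickle bytes that come from a source you trust, because `pickle.loads` can execute arbitrary code.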



The sample uses scikit-learn to train the model and the `sp_execute_external_script` stored procedure to run Python inside SQL Server.

<a name=disclaimers></a>

## Disclaimers
The code included in this sample is not intended to demonstrate general guidance or architectural patterns for predictive modeling.
It contains the minimal code required to train a model and generate predictions with it.
You can easily modify this code to fit the architecture of your application.


<a name=related-links></a>

## Related links
<!-- Links to more articles. Remember to delete "en-us" from the link path. -->

For additional content, see these articles:

[SQL Server R Services - Upgrade and Installation FAQ](https://msdn.microsoft.com/en-us/library/mt653951.aspx)<br/>
[Other SQL Server R Services Tutorials](https://msdn.microsoft.com/en-us/library/mt591993.aspx)<br/>
[Watch a presentation about predictive modeling in SQL Server that also goes through this sample](https://www.youtube.com/watch?v=YCyj9cdi4Nk&feature=youtu.be)<br/>
@@ -0,0 +1,69 @@
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

from revoscalepy.computecontext.RxInSqlServer import RxInSqlServer
from revoscalepy.computecontext.RxInSqlServer import RxSqlServerData
from revoscalepy.etl.RxImport import rx_import_datasource


def get_rental_predictions():
    conn_str = 'Driver=SQL Server;Server=MYSQLSERVER;Database=TutorialDB;Trusted_Connection=True;'
    column_info = {
        "Year" : { "type" : "integer" },
        "Month" : { "type" : "integer" },
        "Day" : { "type" : "integer" },
        "RentalCount" : { "type" : "integer" },
        "WeekDay" : {
            "type" : "factor",
            "levels" : ["1", "2", "3", "4", "5", "6", "7"]
        },
        "Holiday" : {
            "type" : "factor",
            "levels" : ["1", "0"]
        },
        "Snow" : {
            "type" : "factor",
            "levels" : ["1", "0"]
        }
    }

    data_source = RxSqlServerData(table="dbo.rental_data",
                                  connectionString=conn_str, colInfo=column_info)
    # Compute context describing the SQL Server instance. It is defined here but
    # not activated, so the rest of this script runs in the local Python process.
    computeContext = RxInSqlServer(
        connectionString = conn_str,
        numTasks = 1,
        autoCleanup = False
    )

    # Import the data source and convert it to a pandas dataframe.
    df = pd.DataFrame(rx_import_datasource(data_source))
    print("Data frame:", df)
    # Get all the columns from the dataframe.
    columns = df.columns.tolist()
    # Filter out the columns we don't want as features: "Year" and the target
    # itself (training on the target would leak the answer into the model).
    columns = [c for c in columns if c not in ["Year", "RentalCount"]]
    # Store the variable we'll be predicting on.
    target = "RentalCount"
    # Generate the training set. Set random_state to be able to replicate results.
    train = df.sample(frac=0.8, random_state=1)
    # Select anything not in the training set and put it in the testing set.
    test = df.loc[~df.index.isin(train.index)]
    # Print the shapes of both sets.
    print("Training set shape:", train.shape)
    print("Testing set shape:", test.shape)
    # Initialize the model class.
    lin_model = LinearRegression()
    # Fit the model to the training data.
    lin_model.fit(train[columns], train[target])
    # Generate our predictions for the test set.
    lin_predictions = lin_model.predict(test[columns])
    print("Predictions:", lin_predictions)
    # Compute error between the actual values and our test predictions.
    lin_mse = mean_squared_error(test[target], lin_predictions)
    print("Computed error:", lin_mse)


get_rental_predictions()
@@ -0,0 +1,147 @@

USE TutorialDB;

-- Table containing ski rental data
SELECT * FROM [dbo].[rental_data];



-------------------------- STEP 1 - Setup model table ----------------------------------------
DROP TABLE IF EXISTS rental_py_models;
GO
CREATE TABLE rental_py_models (
model_name VARCHAR(30) NOT NULL DEFAULT('default model') PRIMARY KEY,
model VARBINARY(MAX) NOT NULL
);
GO


-------------------------- STEP 2 - Train model ----------------------------------------
-- Stored procedure that trains a Python linear regression model (scikit-learn) on rental_data
DROP PROCEDURE IF EXISTS generate_rental_py_model;
GO
CREATE PROCEDURE generate_rental_py_model (@trained_model varbinary(max) OUTPUT)
AS
BEGIN
EXECUTE sp_execute_external_script
@language = N'Python'
, @script = N'

df = rental_train_data

# Store the name of the column to predict.
target = "RentalCount"

# Use every column except the target as a feature.
columns = [c for c in df.columns.tolist() if c != target]

from sklearn.linear_model import LinearRegression

# Initialize the model class.
lin_model = LinearRegression()
# Fit the model to the training data.
lin_model.fit(df[columns], df[target])

import pickle
# Before saving the model to the DB table, we need to convert it to a binary object
trained_model = pickle.dumps(lin_model)
'

, @input_data_1 = N'select "RentalCount", "Year", "Month", "Day", "WeekDay", "Snow", "Holiday" from dbo.rental_data where Year < 2015'
, @input_data_1_name = N'rental_train_data'
, @params = N'@trained_model varbinary(max) OUTPUT'
, @trained_model = @trained_model OUTPUT;
END;
GO

------------------- STEP 3 - Save model to table -------------------------------------
TRUNCATE TABLE rental_py_models;

DECLARE @model VARBINARY(MAX);
EXEC generate_rental_py_model @model OUTPUT;

INSERT INTO rental_py_models (model_name, model) VALUES('linear_model', @model);

SELECT * FROM rental_py_models;



------------------ STEP 4 - Use the model to predict number of rentals --------------------------
DROP PROCEDURE IF EXISTS py_predict_rentalcount;
GO
CREATE PROCEDURE py_predict_rentalcount (@model varchar(100))
AS
BEGIN
DECLARE @py_model varbinary(max) = (select model from rental_py_models where model_name = @model);

EXEC sp_execute_external_script
@language = N'Python'
, @script = N'

import pickle
# Restore the model object from its binary representation.
rental_model = pickle.loads(py_model)

df = rental_score_data

# Store the name of the column to predict.
target = "RentalCount"

# Use the same feature columns as in training: every column except the target.
columns = [c for c in df.columns.tolist() if c != target]

# Generate our predictions for the test set.
lin_predictions = rental_model.predict(df[columns])
print(lin_predictions)

# Import the scikit-learn function to compute error.
from sklearn.metrics import mean_squared_error
# Compute error between the actual values and our test predictions.
lin_mse = mean_squared_error(df[target], lin_predictions)
#print(lin_mse)

import pandas as pd
predictions_df = pd.DataFrame(lin_predictions)
OutputDataSet = pd.concat([predictions_df, df["RentalCount"], df["Month"], df["Day"], df["WeekDay"], df["Snow"], df["Holiday"], df["Year"]], axis=1)
'
, @input_data_1 = N'Select "RentalCount", "Year" ,"Month", "Day", "WeekDay", "Snow", "Holiday" from rental_data where Year = 2015'
, @input_data_1_name = N'rental_score_data'
, @params = N'@py_model varbinary(max)'
, @py_model = @py_model
with result sets (("RentalCount_Predicted" float, "RentalCount" float, "Month" float,"Day" float,"WeekDay" float,"Snow" float,"Holiday" float, "Year" float));

END;
GO


---------------- STEP 5 - Create DB table to store predictions -----------------------
DROP TABLE IF EXISTS [dbo].[py_rental_predictions];
GO
--Create a table to store the predictions in
CREATE TABLE [dbo].[py_rental_predictions](
[RentalCount_Predicted] [int] NULL,
[RentalCount_Actual] [int] NULL,
[Month] [int] NULL,
[Day] [int] NULL,
[WeekDay] [int] NULL,
[Snow] [int] NULL,
[Holiday] [int] NULL,
[Year] [int] NULL
) ON [PRIMARY]
GO


---------------- STEP 6 - Save the predictions in a DB table -----------------------
TRUNCATE TABLE py_rental_predictions;
--Insert the results of the predictions for test set into a table
INSERT INTO py_rental_predictions
EXEC py_predict_rentalcount 'linear_model';

-- Select contents of the table
SELECT * FROM py_rental_predictions;

6 changes: 2 additions & 4 deletions samples/features/r-services/README.md
@@ -1,12 +1,10 @@
# Samples for SQL Server R Services
# Samples for SQL Server Machine Learning Services

Go to the getting started tutorials to learn more about:

[Predictive Modeling with R Services](https://www.microsoft.com/en-us/sql-server/developer-get-started/rprediction)
Go to the getting started tutorials to learn more about:

[Customer Clustering with R Services](https://www.microsoft.com/en-us/sql-server/developer-get-started/rclustering)


[Telco Customer Churn](Telco Customer Churn)

Telco Customer Churn sample using SQL Server R Services.
@@ -0,0 +1,66 @@

##################### STEP1 - Connect to DB and read data ####################

#Connection string to connect to SQL Server named instance
connStr <- paste("Driver=SQL Server; Server=", "MYSQLSERVER",
";Database=", "Tutorialdb", ";Trusted_Connection=true;", sep = "");

#Get the data from SQL Server Table
SQL_rentaldata <- RxSqlServerData(table = "dbo.rental_data",
connectionString = connStr, returnDataFrame = TRUE);

#Import the data into a data frame
rentaldata <- rxImport(SQL_rentaldata);

#Let's see the structure of the data and the top rows
# Ski rental data, giving the number of ski rentals on a given date
head(rentaldata);


##################### STEP2 - Clean and prepare the data ####################

#Changing the three factor columns to factor types
#This helps when building the model because we are explicitly saying that these values are categorical
rentaldata$Holiday <- factor(rentaldata$Holiday);
rentaldata$Snow <- factor(rentaldata$Snow);
rentaldata$WeekDay <- factor(rentaldata$WeekDay);

#Visualize the dataset after the change
str(rentaldata);

##################### STEP3 - train model ####################

#Now let's split the dataset into 2 different sets
#One set for training the model and the other for validating it
train_data <- rentaldata[rentaldata$Year < 2015,];
test_data <- rentaldata[rentaldata$Year == 2015,];

#Use this column to check the quality of the prediction against actual values
actual_counts <- test_data$RentalCount;

#Model 1: Use rxLinMod to create a linear regression model, trained on the training data set
model_linmod <- rxLinMod(RentalCount ~ Month + Day + WeekDay + Snow + Holiday, data = train_data);

#Model 2: Use rxDTree to create a decision tree model, trained on the training data set
model_dtree <- rxDTree(RentalCount ~ Month + Day + WeekDay + Snow + Holiday, data = train_data);


#################### STEP4 - Predict using the models ########################

#Use the models we just created to predict using the test data set.
#This lets us compare the RentalCount predicted by each model with the actual values in the test data set
predict_linmod <- rxPredict(model_linmod, test_data, writeModelVars = TRUE, extraVarsToWrite = c("Year"));

predict_dtree <- rxPredict(model_dtree, test_data, writeModelVars = TRUE, extraVarsToWrite = c("Year"));

#Look at the top rows of the two prediction data sets.
head(predict_linmod);
head(predict_dtree);

#################### STEP5 - Compare models ########################
#Now we will use the plotting functionality in R to visualize the results from the predictions
#We are plotting the difference between actual and predicted values for both models to compare accuracy
par(mfrow = c(2, 1));
plot(predict_linmod$RentalCount_Pred - predict_linmod$RentalCount, main = "Difference between actual and predicted. rxLinMod");
plot(predict_dtree$RentalCount_Pred - predict_dtree$RentalCount, main = "Difference between actual and predicted. rxDTree");
