![](../graphics/solutions-microsoft-logo-small.png)

# Data Science Projects with SQL Server Machine Learning Services

## 03 Data Acquisition and Understanding

<p style="border-bottom: 1px solid lightgrey;"></p> 
<dl>
  <dt>Course Outline</dt>
  <dt>1 Overview and Course Setup</dt>
  <dt>2 Business Understanding</dt>
  <dt>3 Data Acquisition and Understanding <i>(This section)</i></dt>
        <dd>3.1 Loading Data into the Solution</dd>
        <dd>3.2 Data Exploration and Profiling</dd>
  <dt>4 Modeling</dt>
  <dt>5 Deployment</dt>
  <dt>6 Customer Acceptance and Model Retraining</dt>
</dl>
<p style="border-bottom: 1px solid lightgrey;"></p> 


From Business Intelligence, you're familiar with Extract, Transform, and Load (ETL), which prepares data for historical, pre-aggregated storage that serves ad-hoc queries. For Machine Learning, it's more common to extract the data, load it into a source, and then transform it as late as possible in the process (ELT). This preserves the most fidelity within the process.
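To make the ELT idea concrete, here is a minimal sketch in base R. It keeps a loaded table untouched and derives a transformed copy only at the point of analysis. The data frame, its column names, and the `analysis_rentals` name are assumptions made for illustration; they are not part of the course database.

```r
# Hypothetical raw data, loaded as-is (the "E" and "L" steps)
raw_rentals <- data.frame(
  RentalDate  = c("2014-01-20", "2014-02-13", "2015-02-11"),
  RentalCount = c(445, 40, 42),
  Snow        = c(0, 0, 0),
  stringsAsFactors = FALSE
)

# Transform late (the "T" step): build an analysis copy and
# leave the raw data exactly as it was loaded
analysis_rentals <- transform(
  raw_rentals,
  RentalDate = as.Date(RentalDate),
  Snow       = factor(Snow, levels = c(0, 1), labels = c("no", "yes"))
)
analysis_rentals$Month <- as.integer(format(analysis_rentals$RentalDate, "%m"))

# The raw table still holds the original types; only the copy changed
str(raw_rentals)
str(analysis_rentals)
```

Because the raw copy is never modified, a different transformation can be tried later without reloading the data.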

There are multiple ways to ingest data, depending on the intended location. For SQL Server, data is often generated within base tables by applications, and other data can be loaded via the bcp program or SQL Server Integration Services. 

<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/cortanalogo.png"><b>3.1 Loading Data into the Solution</b></p>

In the Data Acquisition and Understanding phase of your process, you ingest or access data from various locations to answer the questions the organization has asked. In most cases, this data will be in multiple locations. Once the data is ingested into the system, you’ll need to examine it to see what it holds. All data needs cleaning, so after the inspection phase, you’ll replace missing values and add or change columns. You’ll cover more extensive data wrangling tasks in other courses. In this section, you’ll use a single database dataset to train your model.


### Goals for Data Acquisition and Understanding

- Produce a clean, high-quality data set whose relationship to the target variables is understood. Locate the data set in the appropriate analytics environment so you are ready to model.
- Develop a solution architecture of the data pipeline that refreshes and scores the data regularly.

### How to do it

- Ingest the data into the target analytic environment.
- Explore the data to determine if the data quality is adequate to answer the question.
- Set up a data pipeline to score new or regularly refreshed data.

<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/aml-logo.png"><b>Activity: Restore the Database</b></p>

- Run SSMS or Visual Studio, connect to your SQL Server instance, and open a new query window. The dataset used in this course is hosted in a SQL Server table that contains rental data from previous years. The backup (.bak) file, **TutorialDB.bak**, is in the `./data` directory. Save it to a location that SQL Server can access, for example the folder where SQL Server is installed.

Example path: *C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Backup*

- Once you have the file saved, open SSMS and a new query window to run the following commands to restore the DB. Make sure to modify the file paths and server name in the script:

<pre>
USE master;
GO
RESTORE DATABASE TutorialDB
 FROM DISK = 'C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\Backup\TutorialDB.bak'
 WITH
 MOVE 'TutorialDB' TO 'C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\DATA\TutorialDB.mdf'
,MOVE 'TutorialDB_log' TO 'C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\DATA\TutorialDB.ldf';
GO
</pre>

A table named rental_data containing the dataset should exist in the restored SQL Server database. You can verify this by querying the table in SSMS:

<pre>
USE tutorialdb;
SELECT * FROM [dbo].[rental_data];
</pre>

You should see rows of rental data returned.

<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/cortanalogo.png"><b>3.2 Data Exploration and Profiling</b></p>

With the data located and loaded, you can now begin the exploration. You need to know the "shape" of the data, some basic statistics, and very importantly, any missing values.

You can use standard Transact-SQL statements for a majority of the exploration. The SQL language has a rich, declarative structure that will provide most of the information you need.

There are other options for exploring your data, such as R or Python. R is a data-first language, and most Data Scientists are familiar with using it to explore data.
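As a rough illustration of what "shape" and basic statistics look like in R, the sketch below uses only base R so it runs without a SQL Server connection. The six rows are copied from the sample rows of `TutorialDB.csv` shown in this module; the real table has many more rows, and the variable names `data_shape` and `rows_per_year` are introduced here for illustration.

```r
# Six sample rows in the layout of dbo.rental_data
# (a small excerpt, not the full table)
rentaldata <- data.frame(
  Year        = c(2014, 2014, 2013, 2014, 2014, 2015),
  Month       = c(1, 2, 3, 3, 4, 2),
  Day         = c(20, 13, 10, 31, 24, 11),
  RentalCount = c(445, 40, 456, 38, 23, 42),
  WeekDay     = c(2, 5, 1, 2, 5, 4),
  Holiday     = c(1, 0, 0, 0, 0, 0),
  Snow        = c(0, 0, 0, 0, 0, 0)
)

# Shape: number of rows and columns
data_shape <- dim(rentaldata)
print(data_shape)

# Column names and types
str(rentaldata)

# Basic statistics for the target column
print(summary(rentaldata$RentalCount))

# How many rows fall in each year
rows_per_year <- table(rentaldata$Year)
print(rows_per_year)
```

The same calls work unchanged on the full `rentaldata` data frame once it has been imported from SQL Server.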

You can use SQL Server Stored Procedures to hold the R code and run it within SQL Server ML Services, as you saw in the previous module. You can also use a series of R library calls to query the data held in SQL Server and work with it locally on the Data Scientist's workstation in a traditional fashion.

In the graphic below, the Data Scientist works with R locally and, once they determine a good model, deploys it to SQL Server. Clients use the model by calling a standard SQL Server Stored Procedure; no R client is needed on their machine or device:

<p>
<img src="../graphics/MLServerArchitecture.png" width="500">
</p>

You'll explore the data with this process next. 

<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/aml-logo.png"><b>Activity: Explore SQL Server Data using R</b></p>

Step 2.2 Access the data from SQL Server using R

- Open a new R Interactive Window in Visual Studio and run the following R Code. Replace "MYSQLSERVER" with the name of your database instance:

<pre>
# Connection string for the SQL Server instance.
# Run this code on your local system unless your Jupyter Notebooks
# environment can reach a SQL Server instance. For a Notebooks-only
# version that reads from a file instead, see the last cell.

connStr <- paste("Driver=SQL Server; Server=", "MYSQLSERVER", 
                ";Database=", "Tutorialdb", ";Trusted_Connection=true;", sep = "");

# Get the data from a SQL Server Table
SQL_rentaldata <- RxSqlServerData(table = "dbo.rental_data",
                              connectionString = connStr, returnDataFrame = TRUE);

# Import the data into a data frame
rentaldata <- rxImport(SQL_rentaldata);

# Let's see the structure of the data and the top rows
# Ski rental data, giving the number of ski rentals on a given date
head(rentaldata);
str(rentaldata);
</pre>

- What other explorations can you do? How can you leverage graphical outputs to further show the layout of the data? 

- Can you show the missing data?
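One possible starting point for the two questions above, sketched in base R so it runs without a SQL Server connection. The data frame is a hypothetical sample in the layout of `dbo.rental_data`, with two values deliberately set to `NA` so there is something to find. The names `missing_per_column`, `incomplete_rows`, and `plot_file` are introduced here for illustration.

```r
# Hypothetical sample with two values deliberately set to NA
rentaldata <- data.frame(
  Year        = c(2014, 2014, 2013, 2014, 2014, 2015),
  Month       = c(1, 2, 3, 3, 4, 2),
  RentalCount = c(445, NA, 456, 38, 23, 42),
  Snow        = c(0, 0, NA, 0, 0, 0)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(rentaldata))
print(missing_per_column)

# Rows that contain at least one missing value
incomplete_rows <- rentaldata[!complete.cases(rentaldata), ]
print(incomplete_rows)

# Histogram of rental counts, written to a PDF file in the temp directory.
# hist() ignores NA values, so the missing RentalCount is not plotted.
plot_file <- file.path(tempdir(), "rentalcount_hist.pdf")
pdf(plot_file)
hist(rentaldata$RentalCount,
     main = "Distribution of RentalCount",
     xlab = "RentalCount")
invisible(dev.off())
```

In an interactive R session you can drop the `pdf()` and `dev.off()` calls to display the histogram directly.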

<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/thinking.jpg"><b>For Further Study</b></p>

- Data Acquisition and Understanding Reference: https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/lifecycle-data 

Next, continue to *04 - Environments and Deployment*

<pre>
## Activity: Explore SQL Server Data using R

# Connection string to connect to SQL Server instance - Replace WIN2K16DEV with your 
# SQL Server Instance Name
#connStr <- paste("Driver=SQL Server; Server=", "WIN2K16DEV",
                #";Database=", "Tutorialdb", ";Trusted_Connection=true;", sep = "");
# Get the data from a SQL Server Table
#SQL_rentaldata <- RxSqlServerData(table = "dbo.rental_data",
                              #connectionString = connStr, returnDataFrame = TRUE);

# For Jupyter Notebooks, you can just use a file source.
# fileEncoding = "UTF-8-BOM" strips the byte-order mark, so the first
# column is named "Year" rather than "ï..Year"
SQL_rentaldata <- read.csv("../data/TutorialDB.csv", header = TRUE, fileEncoding = "UTF-8-BOM")
head(SQL_rentaldata)

# read.csv already returns a data frame, so no rxImport call is needed here
rentaldata <- SQL_rentaldata;

# Let's see the structure of the data and the top rows
# Ski rental data, giving the number of ski rentals on a given date
head(rentaldata);
str(rentaldata);

## Activity: Set three Features to Categorical Data using R
# Changing the three factor columns to factor types
# This helps when building the model because we are explicitly saying that these values are categorical
rentaldata$Holiday <- factor(rentaldata$Holiday);
rentaldata$Snow <- factor(rentaldata$Snow);
rentaldata$WeekDay <- factor(rentaldata$WeekDay);

# Visualize the dataset after the change
str(rentaldata);

## Activity: Create an Experiment with two Algorithms
# Split the dataset into 2 different sets:
# One set for training the model and the other for validating it
train_data <- rentaldata[rentaldata$Year < 2015,];
test_data <- rentaldata[rentaldata$Year == 2015,];
head(train_data)
head(test_data)

# Use this column to check the quality of the prediction against actual values
actual_counts <- test_data$RentalCount;

# rxLinMod, rxDTree and rxPredict come from the RevoScaleR package
library(RevoScaleR)

# Model 1: Use rxLinMod to create a linear regression model, trained on the training data set
model_linmod <- rxLinMod(RentalCount ~ Month + Day + WeekDay + Snow + Holiday, data = train_data);

# Model 2: Use rxDTree to create a decision tree model, trained on the training data set
model_dtree <- rxDTree(RentalCount ~ Month + Day + WeekDay + Snow + Holiday, data = train_data);

# Use the models you just created to predict on the test data set, so you can
# compare each model's predicted RentalCount with the actual values in the test data set:
predict_linmod <- rxPredict(model_linmod, test_data, writeModelVars = TRUE, extraVarsToWrite = c("Year"));

predict_dtree <- rxPredict(model_dtree, test_data, writeModelVars = TRUE, extraVarsToWrite = c("Year"));

# Look at the top rows of the two prediction data sets:
head(predict_linmod);
head(predict_dtree);

# Plot the difference between actual and predicted values for both models to compare accuracy:
par(mfrow = c(2, 1));
plot(predict_linmod$RentalCount_Pred - predict_linmod$RentalCount, main = "Difference between actual and predicted. rxLinmod");
plot(predict_dtree$RentalCount_Pred - predict_dtree$RentalCount, main = "Difference between actual and predicted. rxDTree");
</pre>
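The difference plots compare the two models visually; a single error number per model can make the comparison easier to state. The sketch below computes root mean squared error (RMSE) and mean absolute error (MAE) on a held-out year. It is only an illustration under two assumptions: the data is synthetic (generated in the code, with column names that mirror `dbo.rental_data`), and base R `lm()` stands in for `rxLinMod()` so the block runs without RevoScaleR.

```r
# Synthetic data (assumption): column names mirror dbo.rental_data,
# but every value is generated here for illustration only
set.seed(42)
n <- 120
rentals <- data.frame(
  Year  = rep(c(2013, 2014, 2015), each = n / 3),
  Month = sample(1:12, n, replace = TRUE),
  Snow  = factor(sample(0:1, n, replace = TRUE))
)
rentals$RentalCount <- round(50 + 300 * (rentals$Snow == "1") + rnorm(n, sd = 20))

# Same split as the course code: earlier years train, 2015 tests
train_data <- rentals[rentals$Year < 2015, ]
test_data  <- rentals[rentals$Year == 2015, ]

# Base R lm() stands in for rxLinMod() in this sketch
model_lm  <- lm(RentalCount ~ Month + Snow, data = train_data)
predicted <- predict(model_lm, newdata = test_data)

# Error metrics: predicted minus actual, summarized two ways
errors <- predicted - test_data$RentalCount
rmse <- sqrt(mean(errors^2))
mae  <- mean(abs(errors))
cat("RMSE:", round(rmse, 2), " MAE:", round(mae, 2), "\n")
```

With the course models, the same two summary lines can be applied to `predict_linmod$RentalCount_Pred - predict_linmod$RentalCount` and to the `predict_dtree` equivalent; the model with the lower values fits the test year more closely.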


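Sample rows from `../data/TutorialDB.csv`:

| Year | Month | Day | RentalCount | WeekDay | Holiday | Snow | FHoliday | FSnow | FWeekDay |
|------|-------|-----|-------------|---------|---------|------|----------|-------|----------|
| 2014 | 1 | 20 | 445 | 2 | 1 | 0 | 1 | 0 | 2 |
| 2014 | 2 | 13 | 40 | 5 | 0 | 0 | 0 | 0 | 5 |
| 2013 | 3 | 10 | 456 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2014 | 3 | 31 | 38 | 2 | 0 | 0 | 0 | 0 | 2 |
| 2014 | 4 | 24 | 23 | 5 | 0 | 0 | 0 | 0 | 5 |
| 2015 | 2 | 11 | 42 | 4 | 0 | 0 | 0 | 0 | 4 |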


