Who doesn't know this situation: You get loads of newsletters that you sometimes might have a look at - but let's face it: most end up in trash or the spam folder. From the other point of view, that is the scenario every marketing person wants to avoid. As a marketing guru, you want to send out more personalised newsletters that will be read and ideally clicked on as well.<iframe src="https://channel9.msdn.com/Series/Building-Recommendation-Systems-in-Azure/02--Targeted-Marketing-in-Azure-Machine-Learning/player" width="960" height="540" allowFullScreen frameBorder="0"></iframe>
In this scenario, we make use of the AdventureWorks database: We have a wide customer base for our bike shop and want to classify our customers as likely bike buyers or not, and thus run a more targeted marketing campaign.
In this guide, I have gone through the steps on setting up the AdventureWorks Warehouse database in an Azure SQL Database. This is a prerequisite for running the following ML experiment.
- AdventureWorks Warehouse set up in an Azure SQL Database (see here for a step-by-step guide)
- SQL Server Management Studio
- Live ID, e.g. @outlook.com, @yahoo.com, etc.
- Azure subscription or free trial to set up the Azure SQL Database
The database of interest to us is [dbo].[vTargetMail] contained in the AdventureWorks database. Running
SELECT TOP 1000 * FROM [dbo].[vTargetMail] gives us some information on vTargetMail:
We are particularly interested in the last column BikeBuyer indicating if a customer ended up buying a bike or not (hence binary classification).
Moving on to the ML Studio and having already created a new experiment, we now drag the Reader module into the canvas of ML studio to read in the data directly from the view [dbo].[vTargetMail] contained in the Azure SQL Database AdventureWorksDW2014 (see here for how to set it up).
On the right hand side (in the Properties pane) we specify the connection string as well as the credentials to read from the Azure SQL Database called AdventureWorksDW2014. In the database query we specify which columns are of interest to us when building the ML model, thus dropping columns, such as Title, FirstName etc.:
SELECT [CustomerKey] ,[GeographyKey] ,[CustomerAlternateKey] ,[MaritalStatus] ,[Gender] ,cast ([YearlyIncome] as int) as SalaryYear ,[TotalChildren] ,[NumberChildrenAtHome] ,[EnglishEducation] ,[EnglishOccupation] ,[HouseOwnerFlag] ,[NumberCarsOwned] ,[CommuteDistance] ,[Region] ,[Age] ,[BikeBuyer] FROM AdventureWorksDW2014.[dbo].[vTargetMail]
Now the part comes that usually is the most time consuming one: cleaning the data. It often takes up 80% of the time of the data scientists: figuring out what of your data is actually relevant. For the sake of focussing on merely building the ML model, cleaning the data here is kept very trivial: we simply drop some more columns within the ML studio, which is the use of the Project Columns module:
By clicking on Launch column selector in the Properties pane (on the right hand side), you can specify which columns you wish to drop or select (blacklist vs. whitelist). In this case, we only want to exclude two columns; thus, we start with all columns and exclude the particular two columns (that conveniently come up as a dropdown menu): CustomerAlternateKey and GeographyKey.
Note that there are plenty more modules for cleaning data within AzureML, all listed in the catalogue pane on the left hand side under Data Transformation.
Once the data has been cleaned (although in this case admittedly a very trivial step), the model can be built based on the given data. Thinking a few steps ahead, how can we tell if a model is well performing or not? Hence, we split the data in 80-20: we use 80% to train a machine learning model but reserve the remaining 20% for testing the model:
Since we are dealing with a binary classification problem, the machine learning algorithms of interest to us are the "Two-Class ..." algorithms marked in red. Here, we choose the Two-Class Boosted Decision Tree (in green):
Consider these algorithms as some empty templates that are useless without any data to be trained on. Hence, the module Train Model explicitly specifies which ML algorithm to use (1st input) and which data to train the algorithm on (2nd input):
Notice the red exclamation mark in the "Train Model" module: We need to specify which column the model is supposed to "predict". Click on Launch Column Selector in the Properties pane:
Similar to the Column Selector in the Project Columns module, we can choose a column name from a dropdown menu. In our case, it is BikeBuyer (remember, we want to classify customers based on their demographic information if he/she is likely to buy a bike or not):
Now that the model has been trained on specific data (i.e. [dbo].[vTargetMail]), we can test how well the model is doing on test data, i.e. the remaining 20% that we have put aside when splitting the data. For this purpose, you use the module called Score Model - another way of saying "Apply trained model on data...".
Now we come to one of the killer features of AzureML: You can easily compare different machine learning algorithms with each other. So we just copy paste the modules Train Model and Score Model within the ML studio: Select the modules "Train Model" and "Score Model"...
Finally, the module Evaluate Model allows one to quickly compare the two trained models using various evaluation metrics, such as the ROC-curve, the ration between precision and recall, and the lift-curve - more in the next section:
The four values in the blue box represent the confusion matrix. It is a 2x2-matrix since it is a binary classification problem; it contains the absolute number of true positives (i.e. customer has been correctly predicted to be a bike buyer), true negatives (i.e. customer has been correctly predicted to be a no-buyer), false positives (i.e. customer has falsely been predicted to be a bike buyer) and false negatives. The values in the green box a typical evaluation measure: accuracy, precision, recall and F1.
Just by looking at the ROC-curve (as well as looking at the other two diagrams), the first model outperforms the second one. Let's assume we have found our best performing machine learning model for our given problem.
Let's see what the better performing model has actually predicted. Click on the button of the left hand side Score Model and then click on Visualize: (Note you can also save the output as a dataset)
The calculated probability indicates the the likelihood that a given customer is a bike buyer. For instance, the very first customer is predicted to be a bike buyer with a probability of 78%, whereas the third customer has a chance of being a bike buyer of merly 2,4% - making him a "no buyer". The threshold for the probability is set at 50% by default but can also be adjusted.
Comparing the columns BikeBuyer (so-called The Truth" and Scored Labels (the predictions), one can see how well a model has performed.
Save the current experiment as a new experiment, since we are now undergoing preparations for a web service that will be called as we gather information about new customers who we want to classify as a likely buyer or not:
In the web service experiment, there is no need for the "loser model" anymore - we will only require the model we have saved as a trained model. Thus, delete the modules in the canvas associated with the "loser model":
Since we have saved the trained Two-Class Boosted Decision Tree, we can replace the two modules "Two-Class Boosted Decision Tree" and "Train Model" with the trained model Targeted Marketing Model. It is listed in the catalogue pane (on the left hand side) under Trained Models:
Here, we also delete the Split module since we are now only applying the trained module - no need to train and test it in the web service.
Let's take a step back: We want to have a web service that takes in all the demographic information that we can get on a new customer. What the web service is then supposed to give out is a prediction if we are dealing with a bike buyer or not. Hence, the BikeBuyer column needs to be excluded in the input of the web service. We accomplish this by clicking on the Project Columns module, launching the column selector and excluding the column BikeBuyer:
The experiment is almost finished. Only things to specify are the input and output of the web service. Expand Web Service in the catalogue pane (on the left hand side) and drag in Input and Output as follows:
The result is a web service with its associated API key. We will go through the three (or rather four) options of using the web service: 1) Manual test, 2) downloading the Excel workbook, and 3) request/response help page.
######7.2. Excel workbook The advantage of the Excel workbook is, you can simply copy paste multiple data rows from the AdventureWorks database (no need for manual typing). When opening the workbook, first enable the content:
Back in the SQL Server Management Studio with the AdventureWorks database connected, run the same query as when reading the data in ML studio. The only difference is: exclude the columns [CustomerAlternateKey], [GeographyKey] and [BikeBuyer]. Also I selected the first 10 customers with a customer key higher than some random set number:
Here the script:
SELECT Top 10 [CustomerKey] ,[MaritalStatus] ,[Gender] ,cast ([YearlyIncome] as int) as SalaryYear ,[TotalChildren] ,[NumberChildrenAtHome] ,[EnglishEducation] ,[EnglishOccupation] ,[HouseOwnerFlag] ,[NumberCarsOwned] ,[CommuteDistance] ,[Region] ,[Age] ,[BikeBuyer] FROM AdventureWorksDW2014.[dbo].[vTargetMail] WHERE CustomerKey > 11110
Copy all entries except for the last column (this is the column we aim to predict with the web service) into the Excel workbook. Give it a few minutes while the last two columns (Predicted Values in green) are being calculated by calling the web service:
It turns out that our web service has incorrectly predicted bike buyers in two cases (marked in red). In fact, these have been identified as bike buyers with a probability of less than 90% - maybe the threshold for classifying a customer as a buyer needs to be raised from 50% to 90%?
And the final option (or more precisely the final two options) are documentation pages for request/response or batch execution manners. Hence, when you want to integrate your machine learning models in productive code, e.g. apps, dashboards, etc., these are the pages to refer to.