# Introduction

Modeling data is often heterogeneous. Multimodal clustering in DataRobot helps you visualize the distinct segments in the data. The example use case in this series is analyzing and modeling customer churn in a retail (CPG) company. In the first part of the series, you will learn how to use DataRobot clustering to segment the customer base and explore behavioural data for each segment. The code-first approach shown in this notebook helps you get from data to value even faster.

## Summary

This notebook outlines how to:

1. Get data from your data source (the two datasets used in this notebook are provided in the DataRobot public AWS S3 bucket)
2. Run clustering
3. Retrieve insights from your clustering models
4. Deploy a chosen clustering model and test the stability of the clusters on a new period

## Requirements

If you are running this notebook in DataRobot Notebooks with the provided AWS files, choose the latest available environment (Python 3.9.16 as of August 1, 2023). DataRobot Notebooks automatically provides the latest version of the DataRobot Python client (3.2 as of August 1, 2023).
Python client version 3.x is recommended so that you can take advantage of the newer convenience methods.

## References

* [DataRobot video: Overview of Multimodal Clustering feature release](https://www.youtube.com/watch?v=kz3Zt7LoN4s)
* [DataRobot platform documentation: Unsupervised Clustering](https://docs.datarobot.com/en/docs/modeling/special-workflows/unsupervised/clustering.html)
* [DataRobot python client documentation: API methods for Clustering](https://datarobot-public-api-client.readthedocs-hosted.com/en/latest-release/reference/modeling/spec/unsupervised_clustering.html?highlight=clustering#unsupervised-projects-clustering)


## Setup: import libraries and get data

In [1]:
import datarobot as dr
import pandas as pd
import time
from datarobot.enums import UnsupervisedTypeEnum
from datarobot import ClusteringModel
print ("DataRobot client version: ", dr.__version__)


DataRobot client version:  3.2.0b1


In [None]:
from datetime import date
today = date.today()
d1 = today.strftime("%Y%m%d")

After importing the necessary libraries, read your data into a dataframe in the notebook. In this example, the data is ingested from a public AWS S3 bucket that hosts DataRobot demo datasets. Note that DataRobot supports a variety of other methods and connectors to ingest data. See more details here:

* [DataRobot UI documentation: Data Connection](https://docs.datarobot.com/en/docs/data/connect-data/index.html)

In [3]:
df = pd.read_csv('https://s3.amazonaws.com/datarobot_public_datasets/ai_accelerators/Retail_Clustering_Training_data.csv')
df.head()
#df.describe()

Unnamed: 0,ID,churn,Customer_Start_Date,ZipCode,Residency_Category,Service_District,Gender,Customer_Age_Group,Brand_aware_Cat,Camp_AvgClick_LastYear,...,LastResponse_Flag,Loyalty_YM,LoyaltyBonus_Redeem_LastYear,LoyaltyPurch_Cnt,MostFreq_Purch_Cat,NPS_Cat_Last,NPS_Score_Avg,NPS_Score_Last,PromoCode_Cat,Sales_Channel_Cat
0,243624,1,05.09.2018,2DAA,Cat3_100_500K,AC,M,01:35,1-New,,...,,01.09.2014,,3.0,1,0.0,-1.0,1.0,Extra_10,Distribution2
1,242193,1,18.09.2018,2AA6,Cat3_100_500K,AC,F,3: 45-55,4-Know (no prompt),0.5,...,0.0,01.09.2014,1.0,2.0,1,,-1.0,,Extra_10,Distribution2
2,197222,1,26.09.2018,2AA8,Cat3_100_500K,AC,M,,4-Know (no prompt),,...,1.0,01.09.2014,,,1,0.0,-1.0,1.0,Extra_10,Distribution2
3,900083,1,11.10.2018,263A,Cat3_100_500K,AC,M,,1-New,,...,0.0,01.10.2014,,,1,,-1.0,,Extra_10,Distribution2
4,59631,0,05.10.2018,2AA6,Cat3_100_500K,AC,M,01:35,4-Know (no prompt),,...,1.0,01.10.2014,,1.0,1,1.0,7.72,8.0,Extra_10,Distribution2


You can start modeling directly from your dataset. However, it's usually best to register the dataset in the DataRobot AI Catalog, where you can manage data, keep track of versions, profile data, and manage feature lists used across modeling projects. See more details about AI Catalog here:

* [DataRobot UI documentation: AI Catalog](https://docs.datarobot.com/en/docs/data/ai-catalog/index.html)

In [None]:
dr_dataset = dr.Dataset.create_from_in_memory_data(df)
dr_dataset.modify(name = 'DR_AI_Accelerator_Retail_Clustering_training.csv')
print("Dataset name: ", dr_dataset.name)
print("Dataset URL: ", dr_dataset.get_uri())

Dataset name:  DR_AI_Accelerator_Retail_Clustering_training.csv
Dataset URL:  https://app.eu.datarobot.com/ai-catalog/64c79a8115dfd2afa7006800


Below is an example of getting data from a previously created dataset in the DataRobot AI Catalog. You can use this for any repeated runs of the notebook on the same dataset.

In [4]:
dataset_id = '64c77ae13d57ce3f5fb9ae25' # specify your dataset ID, which is listed after ai-catalog in the URL

dr_dataset = dr.Dataset.get(dataset_id)
print("Dataset name: ", dr_dataset.name)
print("Dataset URL: ", dr_dataset.get_uri())
df = dr_dataset.get_as_dataframe()

Dataset name:  Retail_Clustering_Training_data.csv
Dataset URL:  https://app.eu.datarobot.com/ai-catalog/64c77ae13d57ce3f5fb9ae25


## Run DataRobot clustering to segment your data
You could start the clustering project immediately and let DataRobot build informative feature lists and run several versions of each model to detect the best number of segments. However, in most cases you will want to take advantage of exploratory data analysis and narrow down your feature list before starting a new segmentation project.
(You could even select a target temporarily, explore which features are predictive, and then switch back to your segmentation project armed with this information. This step is out of scope for this notebook.)

Start by creating a new Clustering project in Manual mode.

In [5]:

project = dr.Project.create_from_dataset(dataset_id = dr_dataset.id, project_name="DR_AI_Accelerator_Retail_Clustering_part1_{}".format(d1))
project

Project(DR_AI_Accelerator_Retail_Clustering_part1_20230801)

In [6]:
# reference: https://datarobot-public-api-client.readthedocs-hosted.com/en/latest-release/reference/modeling/spec/unsupervised_clustering.html?highlight=clustering#unsupervised-projects-clustering

project.analyze_and_model(
    unsupervised_mode=True,
    unsupervised_type=UnsupervisedTypeEnum.CLUSTERING,
    mode=dr.AUTOPILOT_MODE.MANUAL,
    max_wait=2400,
)

Project(DR_AI_Accelerator_Retail_Clustering_part1_20230801)

As a result of the manual run, DataRobot split your dataset into training and validation partitions and created the Informative Features feature list. You can inspect it and create your own feature lists as needed. The best practice is to start from a smaller feature list, narrowing it down to promising features according to your domain expertise and the correlations shown in Feature Associations.

See more details here:

* [DataRobot Tutorial: Working with Feature Lists](https://docs.datarobot.com/en/docs/get-started/gs-get-help/tutorials/prep-learning-data/work-with-feature-lists.html#work-with-feature-lists)
* [DataRobot Tutorial: Analyze Feature Associations](https://docs.datarobot.com/en/docs/get-started/gs-get-help/tutorials/prep-learning-data/analyze-feature-associations.html#analyze-feature-associations)

In [7]:
proj_feature_lists = project.get_modeling_featurelists()
for fl in proj_feature_lists:
  print(fl.name)
feature_list_1 = [fl for fl in proj_feature_lists if 'Informative' in fl.name][0].features
feature_list_1

Raw Features
Informative Features
['ID',
 'churn',
 'Customer_Start_Date (Year)',
 'ZipCode',
 'Residency_Category',
 'Service_District',
 'Gender',
 'Customer_Age_Group',
 'Brand_aware_Cat',
 'Camp_AvgClick_LastYear',
 'Camp_AvgResp_LastYear',
 'Contact_First_Date (Year)',
 'Contact_Last_Date (Year)',
 'CustComm_Accept_Flag',
 'CustComm_Count_LastYear',
 'Customer_Tenure',
 'CustReq_Count',
 'CustReq_Payment_Flag',
 'CustReq_Prod_flag',
 'CustReq_Prod_Support_Flag',
 'CustReq_Product_Closed_Flag',
 'CustReq_Resp_Flag',
 'Date Last Purchase (Year)',
 'GEO_CAT',
 'LastResponse_Flag',
 'Loyalty_YM (Year)',
 'LoyaltyBonus_Redeem_LastYear',
 'LoyaltyPurch_Cnt',
 'MostFreq_Purch_Cat',
 'NPS_Cat_Last',
 'NPS_Score_Avg',
 'NPS_Score_Last',
 'PromoCode_Cat',
 'Sales_Channel_Cat',
 'Contact_First_Date (Day of Month)',
 'Contact_First_Date (Day of Week)',
 'Contact_First_Date (Month)',
 'Contact_Last_Date (Day of Month)',
 'Contact_Last_Date (Day of Week)',
 'Contact_Last_Date (Month)',
 'Custom

In [None]:
features_to_remove= ['ID', 'ZipCode']

features_to_keep= ['churn',
 'Residency_Category',
 'Service_District',
 'Gender',
 'Customer_Age_Group',
 'Brand_aware_Cat',
 'Camp_AvgClick_LastYear',
 'Camp_AvgResp_LastYear',
 'CustComm_Count_LastYear',
 'Customer_Tenure',
 'CustReq_Count',
 'GEO_CAT',
 'LastResponse_Flag',
 'LoyaltyBonus_Redeem_LastYear',
 'LoyaltyPurch_Cnt',
 'MostFreq_Purch_Cat',
 'NPS_Cat_Last',
 'NPS_Score_Avg',
 'PromoCode_Cat',
 'Sales_Channel_Cat']

Use the example code below to generate a new feature list that includes only your chosen features and excludes the features marked for removal.

In [9]:
feature_list_2 = [fle for fle in feature_list_1 if fle not in features_to_remove and fle in features_to_keep]
feature_list_2

['churn',
 'Residency_Category',
 'Service_District',
 'Gender',
 'Customer_Age_Group',
 'Brand_aware_Cat',
 'Camp_AvgClick_LastYear',
 'Camp_AvgResp_LastYear',
 'CustComm_Count_LastYear',
 'Customer_Tenure',
 'CustReq_Count',
 'GEO_CAT',
 'LastResponse_Flag',
 'LoyaltyBonus_Redeem_LastYear',
 'LoyaltyPurch_Cnt',
 'MostFreq_Purch_Cat',
 'NPS_Cat_Last',
 'NPS_Score_Avg',
 'PromoCode_Cat',
 'Sales_Channel_Cat']
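
The filter above silently drops any name in `features_to_keep` that does not appear in the Informative Features list, for example because of a typo. The sketch below is a stricter variant that also reports the names it could not match. `select_features` is a hypothetical helper written for this illustration, not part of the DataRobot client, and the toy feature names are placeholders.

```python
def select_features(available, keep, remove=()):
    """Return (selected, missing).

    selected: features from `available` that are in `keep` and not in `remove`,
              in their original order.
    missing:  requested names that do not exist in `available`.
    """
    remove_set = set(remove)
    keep_set = set(keep) - remove_set
    available_set = set(available)
    selected = [f for f in available if f in keep_set]
    missing = sorted(f for f in keep_set if f not in available_set)
    return selected, missing


# Toy data for illustration; in the notebook you would pass
# feature_list_1, features_to_keep, and features_to_remove.
available = ["ID", "churn", "ZipCode", "Gender", "GEO_CAT"]
selected, missing = select_features(
    available,
    keep=["churn", "Gender", "GEO_CAT", "Gendr"],  # "Gendr" is a deliberate typo
    remove=["ID", "ZipCode"],
)
print(selected)
print(missing)
```

An empty `missing` list is a quick confirmation that every feature you intended to keep actually made it into the new feature list.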

In [None]:
feature_list_v2 = project.create_featurelist('clustering_fl_v1', features=list(feature_list_2))

After checking the feature list, re-run the project in Comprehensive mode. Note that this is the only mode available for Unsupervised Clustering projects, so in the UI you will only see the option 'Re-run modeling'.

See more details on supervised Autopilot modes here:

* [DataRobot Tutorial: Set the modeling mode](https://docs.datarobot.com/en/docs/get-started/gs-get-help/tutorials/creating-ai-models/tut-model-mode.html#autopilot)


In [None]:
project.start_autopilot(
    featurelist_id=feature_list_v2.id,  # specify a custom feature list
    mode='comprehensive', blend_best_models=False, scoring_code_only=False,
    prepare_model_for_deployment=False, consider_blenders_in_recommendation=False,
    run_leakage_removed_feature_list=False, autopilot_cluster_list=[3, 5, 7, 9],
)


In [12]:
project.wait_for_autopilot()

In progress: 0, queued: 0 (waited: 0s)
In progress: 0, queued: 0 (waited: 0s)
In progress: 6, queued: 0 (waited: 1s)
In progress: 6, queued: 0 (waited: 2s)
In progress: 6, queued: 0 (waited: 3s)
In progress: 6, queued: 0 (waited: 5s)
In progress: 6, queued: 0 (waited: 8s)
In progress: 6, queued: 0 (waited: 15s)
In progress: 6, queued: 0 (waited: 28s)
In progress: 6, queued: 0 (waited: 49s)
In progress: 6, queued: 0 (waited: 69s)
In progress: 2, queued: 0 (waited: 89s)
In progress: 0, queued: 0 (waited: 110s)


## Inspect resulting segmentation models
After this project run is finished, you typically want to select a few top-performing models. The models are ranked by Silhouette score. This metric evaluates two key aspects of a clustering model: the quality of separation between clusters and similarity inside the identified clusters.

If you prefer a specific modeling algorithm for your segmentation project, you can restrict the analysis to only this model type.
K-Means is usually a good algorithm to start with if you don't have any specific requirements, as it tends to result in bigger clusters and it's easily interpretable.

The code below shows how to select the top three K-Means models from the segmentation project and then pick the single best model from this selection.

See more details about the Silhouette score here:

* [DataRobot documentation: Silhouette score](https://docs.datarobot.com/en/docs/modeling/reference/model-detail/opt-metric.html#silhouette-score)
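
For intuition, the textbook silhouette of a single point is `(b - a) / max(a, b)`, where `a` is its mean distance to the other points in its own cluster and `b` is its mean distance to the points of the nearest other cluster. The sketch below averages that value over a tiny one-dimensional dataset in plain Python. It illustrates the standard definition only; how DataRobot computes the metric internally (for example, any sampling or preprocessing) is not covered here, and `silhouette_score_1d` is a hypothetical name.

```python
def silhouette_score_1d(points, labels):
    """Mean silhouette over all points, using absolute difference as distance.

    Assumes at least two distinct cluster labels.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, so it does not affect the sum)
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(p - q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [0.0, 1.0, 10.0, 11.0]
good = silhouette_score_1d(points, [0, 0, 1, 1])  # tight, well-separated clusters
bad = silhouette_score_1d(points, [0, 1, 0, 1])   # clusters that interleave
print(round(good, 3), round(bad, 3))
```

Values close to 1 indicate compact, well-separated clusters, while values near or below 0 indicate overlapping or misassigned points.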

In [None]:
# get the best model
models = project.get_models()
met = project.metric
opt_group = 'validation'  # clustering projects only have 'validation' and 'holdout'

# sort by the project metric in descending order (select best models by Silhouette score)
# and optionally restrict to specific model types
top_models = sorted(
    [m for m in models if m.metrics[met][opt_group] and 'K-Means' in m.model_type],
    key=lambda m: m.metrics[met][opt_group],
    reverse=True,
)[0:3]
chosen_model = top_models[0]  # top_models is already restricted to K-Means and sorted
model_to_explore = chosen_model
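
The selection logic above can be wrapped in a reusable function. The following is a minimal sketch, assuming only that each model object exposes `model_type` and a nested `metrics` dictionary as shown in this notebook. `top_models_by_metric` and the stand-in objects are hypothetical. Unlike a plain truthiness check, it keeps a score of exactly 0.0 and skips only missing scores.

```python
from types import SimpleNamespace


def top_models_by_metric(models, metric, partition="validation",
                         name_contains=None, n=3):
    """Return up to `n` models with the highest score for `metric`.

    Models whose score is None are skipped; a score of 0.0 is kept.
    """
    candidates = [
        m for m in models
        if m.metrics.get(metric, {}).get(partition) is not None
        and (name_contains is None or name_contains in m.model_type)
    ]
    return sorted(
        candidates,
        key=lambda m: m.metrics[metric][partition],
        reverse=True,
    )[:n]


# Stand-in objects that mimic the two attributes used above (illustration only).
def fake_model(model_type, score):
    return SimpleNamespace(
        model_type=model_type,
        metrics={"Silhouette Score": {"validation": score}},
    )


models_demo = [
    fake_model("K-Means Clustering", 0.23),
    fake_model("K-Means Clustering", None),
    fake_model("Some Other Clustering", 0.30),  # placeholder model type
    fake_model("K-Means Clustering", 0.18),
]
best = top_models_by_metric(
    models_demo, "Silhouette Score", name_contains="K-Means", n=2
)
best_scores = [m.metrics["Silhouette Score"]["validation"] for m in best]
print(best_scores)
```

In the notebook, the equivalent call would be `top_models_by_metric(project.get_models(), project.metric, name_contains='K-Means')`.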

At this point, you could also work in mixed mode: switch to the DataRobot GUI to explore the project outputs and insights, and then return to the notebook to specify your chosen model by ID.

In [14]:
print("Link to the project in the GUI: ", project.get_uri())

Link to the project in the GUI:  https://app.eu.datarobot.com/projects/64c944582e634bd37173f704/models


In [15]:
modelID = '64c9478a875736eaede9f64c'
model_to_explore_GUI = ClusteringModel.get(project.id, modelID)
model_to_explore_GUI.__dict__

{'id': '64c9478a875736eaede9f64c',
 'processes': ['One-Hot Encoding',
  'Truncated Singular Value Decomposition',
  'Missing Values Imputed',
  'Standardize',
  'K-Means Clustering'],
 'featurelist_name': 'clustering_fl_v1',
 'featurelist_id': '64c946e512cae5ad990d173c',
 'project_id': '64c944582e634bd37173f704',
 'sample_pct': 89.99822,
 'training_row_count': 45261,
 'training_duration': None,
 'training_start_date': None,
 'training_end_date': None,
 'model_type': 'K-Means Clustering',
 'model_category': 'model',
 'is_frozen': False,
 'is_n_clusters_dynamically_determined': False,
 'blueprint_id': 'c9176801ea1f284280db8a5367f936bb',
 'metrics': {'Silhouette Score': {'validation': 0.2297700047492981,
   'crossValidation': None,
   'holdout': None,
   'training': 0.2266400009393692,
   'backtestingScores': None,
   'backtesting': None}},
 'monotonic_increasing_featurelist_id': None,
 'monotonic_decreasing_featurelist_id': None,
 'n_clusters': 3,
 'has_empty_clusters': False,
 'supports

Next, proceed to trigger the calculation of Clustering Insights. These outputs are the most valuable part of any segmentation project. Check out the most impactful features in Feature Impact and then head over to Insights to explore the visualizations.

In [None]:
max_wait = 600  # seconds to wait for the insights job
model = ClusteringModel.get(model_to_explore.project_id, model_to_explore.id)
try:
    # compute_insights waits for the job to finish and returns the insights
    insights = model.compute_insights(max_wait=max_wait)
except Exception:
    # fall back to insights that already exist for this model
    insights = model.insights


[TBD, August 01] It's a good practice to use function definitions for more complex data manipulations such as the one shown above. This function definition needs to be adjusted to handle try and except clause correctly.

In [None]:
# helper function
# def request_clustering_insights(model, max_wait = None):
#     model = ClusteringModel.get(model.project_id, model.id)
#     try:
#         model.compute_insights(max_wait = max_wait)
#     except:
#         insights=model.insights
#     jobs_list = project.get_all_jobs()  # gives all jobs queued or inprogress
#     for job in jobs_list:
#         if job.job_type =='clusterInsights':
#             insights = job.get_result_when_complete(max_wait=max_wait)

#     return insights

# model_insights = request_clustering_insights(model_to_explore)
# model_insights
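
One way the helper could be written is sketched below. It assumes the model object behaves like `ClusteringModel` as used elsewhere in this notebook, that is, it exposes `compute_insights(max_wait=...)` and an `insights` attribute. Catching every exception and falling back to stored insights is a simplifying assumption; the stub classes exist only so that the example runs without a DataRobot connection.

```python
def request_clustering_insights(model, max_wait=600):
    """Compute cluster insights, falling back to already stored insights.

    `model` is expected to expose compute_insights(max_wait=...) and an
    `insights` attribute, as ClusteringModel does in this notebook.
    """
    try:
        return model.compute_insights(max_wait=max_wait)
    except Exception as exc:  # assumption: any failure means "try stored insights"
        print("compute_insights failed ({}); reading stored insights".format(exc))
        return model.insights


# Stubs for illustration only; they stand in for real ClusteringModel objects.
class FreshModel:
    insights = []

    def compute_insights(self, max_wait=600):
        return ["computed"]


class AlreadyComputedModel:
    insights = ["stored"]

    def compute_insights(self, max_wait=600):
        raise RuntimeError("insights job could not be started")


print(request_clustering_insights(FreshModel()))
print(request_clustering_insights(AlreadyComputedModel()))
```

Because the function returns from both branches, the caller always receives a value and no variable is left undefined when the `try` block fails.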

In [19]:
for model in top_models:
    print(model.model_type, ', Silhouette: ', model.metrics[met][opt_group])
    model = ClusteringModel.get(model.project_id, model.id)
    try:
        insights = model.compute_insights(max_wait=600)
    except Exception:
        # insights may already exist for this model; retrieve them instead
        insights = model.insights


K-Means Clustering , Silhouette:  0.2297700047492981
K-Means Clustering , Silhouette:  0.20029999315738678
K-Means Clustering , Silhouette:  0.18297000229358673


In [21]:
# helper function
def print_summary(name, percent):
    if not percent:
        percent = "?"
    print("'{}' holds {} % of data".format(name, percent))

for model in top_models:
  print(model.model_type, ', Silhouette: ', model.metrics[met][opt_group])
  # request_clustering_insights(model)
  model = ClusteringModel.get(model.project_id, model.id)
  for cluster in model.clusters:
      print_summary(cluster.name, cluster.percent)

K-Means Clustering , Silhouette:  0.2297700047492981
'Cluster 1' holds 37.62179359713661 % of data
'Cluster 2' holds 35.158304058681864 % of data
'Cluster 3' holds 27.219902344181524 % of data
K-Means Clustering , Silhouette:  0.20029999315738678
'Cluster 1' holds 20.841342436092884 % of data
'Cluster 2' holds 20.039327456308964 % of data
'Cluster 3' holds 31.656392921057865 % of data
'Cluster 4' holds 18.09283930978105 % of data
'Cluster 5' holds 9.37009787675924 % of data
K-Means Clustering , Silhouette:  0.18297000229358673
'Cluster 1' holds 5.393164092706745 % of data
'Cluster 2' holds 19.851527805395374 % of data
'Cluster 3' holds 9.071827842955305 % of data
'Cluster 4' holds 10.048386027705972 % of data
'Cluster 5' holds 14.772099600097214 % of data
'Cluster 6' holds 1.8912529550827424 % of data
'Cluster 7' holds 14.997459181193522 % of data
'Cluster 8' holds 14.571043503236782 % of data
'Cluster 9' holds 9.403238991626345 % of data


## Complete the segmentation work
As described above, you can either head to the DataRobot web interface (GUI) and inspect your clusters or work via API. Your typical workflow with any segmentation project can include testing several feature lists, enriching the model with additional data or further feature reduction, selecting models that result in bigger clusters, and other iterations.
In the final segmentation model, you can also rename your clusters and assign meaningful labels for future use.
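
When comparing candidate models by cluster size, it can help to flag very small clusters automatically. The sketch below does this from a plain name-to-percent mapping; `flag_small_clusters` is a hypothetical helper and the 5% threshold is an arbitrary assumption, not a DataRobot recommendation. The sample values are rounded from the nine-cluster output shown earlier.

```python
def flag_small_clusters(cluster_percents, min_percent=5.0):
    """Return names of clusters whose share of the data is below `min_percent`.

    `cluster_percents` maps cluster name -> percent of rows (None if unknown).
    """
    return [
        name for name, pct in cluster_percents.items()
        if pct is not None and pct < min_percent
    ]


# Shares rounded from the nine-cluster K-Means output above.
nine_clusters = {
    "Cluster 1": 5.39, "Cluster 2": 19.85, "Cluster 3": 9.07,
    "Cluster 4": 10.05, "Cluster 5": 14.77, "Cluster 6": 1.89,
    "Cluster 7": 15.00, "Cluster 8": 14.57, "Cluster 9": 9.40,
}
small = flag_small_clusters(nine_clusters)
print(small)
```

For a real model, the mapping can be built with `{c.name: c.percent for c in model.clusters}`, using the same attributes as the `print_summary` loop above.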

In [None]:

# choose a specific model by ID
model_id = '64c9478a875736eaede9f64c'
model_to_rename = ClusteringModel.get(project.id, model_id)

# after exploring insights, update multiple cluster labels
cluster_name_mappings = [
    ("Cluster 1", "High NPS & Low Campaign Response"),
    ("Cluster 2", "Low NPS & High Churn Rate"),
    ("Cluster 3", "Med/High NPS & High Campaign Response")
]

clusters = model_to_rename.update_cluster_names(cluster_name_mappings)
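
Cluster numbers do not necessarily carry the same meaning from one model to another (the two renaming cells in this notebook assign different labels to Cluster 1 and Cluster 2), so it is worth checking a mapping before applying it. The sketch below is a hypothetical pre-check, not part of the DataRobot client: it verifies that every cluster being renamed exists and that no new label is used twice.

```python
def validate_cluster_mappings(current_names, mappings):
    """Return a list of problems found in (old_name, new_name) pairs."""
    problems = []
    current = set(current_names)
    old_names = [old for old, _ in mappings]
    new_names = [new for _, new in mappings]
    for old in old_names:
        if old not in current:
            problems.append("unknown cluster: {}".format(old))
    for name in sorted(set(new_names)):
        if new_names.count(name) > 1:
            problems.append("duplicate label: {}".format(name))
    return problems


current_demo = ["Cluster 1", "Cluster 2", "Cluster 3"]
ok = validate_cluster_mappings(
    current_demo,
    [("Cluster 1", "High NPS & Low Campaign Response"),
     ("Cluster 2", "Low NPS & High Churn Rate")],
)
bad_mapping = validate_cluster_mappings(
    current_demo,
    [("Cluster 4", "Segment A"), ("Cluster 1", "Segment A")],
)
print(ok)
print(bad_mapping)
```

In the notebook, `current_names` could be taken from `[c.name for c in model_to_rename.clusters]` before calling `update_cluster_names`.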


## Deploy your chosen model for future use

After you have explored your data and finalized your segmentation model, you can deploy it in DataRobot.


In [23]:
# choose a specific model by ID
model_id = '64c9478a875736eaede9f64c'

model_to_deploy = ClusteringModel.get(project.id, model_id)
model_to_deploy.__dict__

{'id': '64c9478a875736eaede9f64c',
 'processes': ['One-Hot Encoding',
  'Truncated Singular Value Decomposition',
  'Missing Values Imputed',
  'Standardize',
  'K-Means Clustering'],
 'featurelist_name': 'clustering_fl_v1',
 'featurelist_id': '64c946e512cae5ad990d173c',
 'project_id': '64c944582e634bd37173f704',
 'sample_pct': 89.99822,
 'training_row_count': 45261,
 'training_duration': None,
 'training_start_date': None,
 'training_end_date': None,
 'model_type': 'K-Means Clustering',
 'model_category': 'model',
 'is_frozen': False,
 'is_n_clusters_dynamically_determined': False,
 'blueprint_id': 'c9176801ea1f284280db8a5367f936bb',
 'metrics': {'Silhouette Score': {'validation': 0.2297700047492981,
   'crossValidation': None,
   'holdout': None,
   'training': 0.2266400009393692,
   'backtestingScores': None,
   'backtesting': None}},
 'monotonic_increasing_featurelist_id': None,
 'monotonic_decreasing_featurelist_id': None,
 'n_clusters': 3,
 'has_empty_clusters': False,
 'supports

In [None]:
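new_model = model_to_deploy  # fallback if the model is already trained on all the data
if model_to_deploy.sample_pct < 100:
    job = model_to_deploy.train(sample_pct=99)  # returns the ID of the queued model job
    new_model_job = [j for j in project.get_all_jobs() if j.id == int(job)][0]
    new_model = new_model_job.get_result_when_complete()
new_model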

Model('K-Means Clustering')

In [None]:
project.start_prepare_model_for_deployment(model_to_deploy.id)

In [None]:
while project.get_all_jobs():
    time.sleep(10)

[]
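
The polling loop above runs until the job queue is empty, with no upper bound on the waiting time. A minimal sketch of a bounded variant follows. `wait_for_jobs` is a hypothetical helper; it takes any zero-argument callable that returns the list of pending jobs, so it can be exercised here with a fake queue instead of a live project.

```python
import time


def wait_for_jobs(get_jobs, poll_seconds=10, timeout=1800, sleep=time.sleep):
    """Poll `get_jobs()` until it returns an empty list or `timeout` elapses.

    Returns True if the queue drained, False if the timeout was reached.
    """
    waited = 0
    while get_jobs():
        if waited >= timeout:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
    return True


# Illustration with a fake queue that empties after two polls.
remaining = [["job-1", "job-2"], ["job-2"], []]
drained = wait_for_jobs(
    lambda: remaining.pop(0), poll_seconds=1, sleep=lambda s: None
)
# A queue that never empties hits the timeout instead of looping forever.
stuck = wait_for_jobs(
    lambda: ["stuck-job"], poll_seconds=5, timeout=10, sleep=lambda s: None
)
print(drained, stuck)
```

In the notebook, the call would be `wait_for_jobs(project.get_all_jobs)`, optionally followed by a check of the returned flag before continuing.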

In [None]:
final_model_to_deploy = [m for m in project.get_models() if m.sample_pct == 100
                         #and m.blueprint_id == model_to_deploy.blueprint_id
                        ][0]
final_model_to_deploy.get_uri()

'https://app.eu.datarobot.com/projects/64c944582e634bd37173f704/models/64c9517fa4ef5ff19c817501'

In [None]:
max_wait = 600  # seconds to wait for the insights job
model = ClusteringModel.get(project.id, final_model_to_deploy.id)

try:
    # compute_insights waits for the job to finish and returns the insights
    insights = model.compute_insights(max_wait=max_wait)
except Exception:
    # fall back to insights that already exist for this model
    insights = model.insights


In [None]:
model_to_rename = ClusteringModel.get(project.id, final_model_to_deploy.id)

# after exploring insights, update multiple cluster labels
cluster_name_mappings = [
    ("Cluster 1", "Low NPS & High Churn Rate"),
    ("Cluster 2", "High NPS & Low Campaign Response"),
    ("Cluster 3", "Med/High NPS & High Campaign Response")
]

clusters = model_to_rename.update_cluster_names(cluster_name_mappings)


In [None]:
deployment = dr.Deployment.create_from_learning_model(
    model_id=final_model_to_deploy.id,
    label="DR_AI_Accelerator_Retail_Clustering_part1_{}".format(d1),
    default_prediction_server_id=dr.PredictionServer.list()[-1].id,
)

In [None]:
all_deployments = dr.Deployment.list()
deployment = [d for d in all_deployments if 'AI_Accelerator_Retail_Clustering' in d.label 
              and d.model['deployed_at']>'2023-07-30'
             ][0]
deployment.__dict__

{'id': '64c7f9591111bf00935a6fe1',
 'label': 'DR_AI_Accelerator_Retail_Clustering_part1_20230731',
 'status': 'active',
 'description': None,
 'default_prediction_server': {'id': '5c77bc2100f096002619ceac',
  'url': 'https://cfds.orm.eu.datarobot.com',
  'datarobot-key': '35dde409-9091-697e-31f3-a9fd3b67842e'},
 'model': {'id': '64c7e983a073e408e6c4f41a',
  'type': 'K-Means Clustering',
  'project_id': '64c7c878d4853a7cc2006861',
  'target_type': 'Multiclass',
  'project_name': 'DR_AI_Accelerator_Retail_Clustering_part1_20230731',
  'unsupervised_mode': True,
  'unstructured_model_kind': False,
  'build_environment_type': 'DataRobot',
  'deployed_at': '2023-07-31T18:11:37.492000Z',
  'is_deprecated': False},
 '_capabilities': {'supports_model_replacement': False,
  'supports_target_drift_tracking': False,
  'supports_feature_drift_tracking': True,
  'supports_prediction_intervals': False,
  'supports_humility_rules': False,
  'supports_humility_rules_default_calculations': True,
  'sup

In [None]:
deployment.update_drift_tracking_settings(feature_drift_enabled = True, max_wait = 600)
deployment.get_drift_tracking_settings()

{'target_drift': {'enabled': False}, 'feature_drift': {'enabled': True}}

In [None]:
features_to_track = ['Customer_Age_Group', 'GEO_CAT']
deployment.update_segment_analysis_settings(segment_analysis_enabled = True, segment_analysis_attributes = features_to_track, max_wait = 600) 
deployment.get_segment_analysis_settings()

{'enabled': True,
 'attributes': ['Customer_Age_Group', 'GEO_CAT'],
 'custom_attributes': []}

## Score new data and inspect cluster stability

In [None]:
scoring_df = pd.read_csv('https://s3.amazonaws.com/datarobot_public_datasets/ai_accelerators/Retail_Clustering_Scoring_data.csv')
scoring_df.head()


Unnamed: 0,ID,PREDICT_BATCH,Customer_Start_Date,ZipCode,Residency_Category,Service_District,Gender,Customer_Age_Group,Brand_aware_Cat,Brand_CompetitorAware_Flag,...,GEO_CAT,LastResponse_Flag,Loyalty_YM,LoyaltyBonus_Redeem_LastYear,LoyaltyPurch_Cnt,MostFreq_Purch_Cat,NPS_Cat_Last,NPS_Score_Last,PromoCode_Cat,Sales_Channel_Cat
0,624958,1,20.11.2018,8DDA,Cat2_500K_1M,NB,M,,3-Know,1.0,...,GEO2,1.0,01.11.2014,,,1,1.0,7.0,DISC_20,Distribution2
1,351574,1,27.12.2019,2875A,Cat1_1M+,DM,F,,3-Know,0.0,...,GEO1,0.0,01.12.2015,,,1,0.0,3.0,DISC_20,Distribution1 VIP
2,238135,1,30.10.2019,YY2AY,Cat4_50_100K,1Z,M,,1-New,0.0,...,OTHER,0.0,01.10.2015,1.0,,1,,,DISC_20,Distribution1
3,94479,1,29.12.2019,DYAAD,Cat4_50_100K,AS,F,3: 45-55,4-Know (no prompt),0.0,...,GEO4,1.0,01.09.2015,,1.0,2,0.0,3.0,-,Web3
4,427142,1,05.08.2019,39AA6,Cat5_50K-,CT,F,01:35,1-New,1.0,...,OTHER,1.0,01.08.2015,,,1,0.0,4.0,DISC_20,Distribution1 VIP


In [None]:
scoring_df.columns

Index(['ID', 'PREDICT_BATCH', 'Customer_Start_Date', 'ZipCode',
       'Residency_Category', 'Service_District', 'Gender',
       'Customer_Age_Group', 'Brand_aware_Cat', 'Brand_CompetitorAware_Flag',
       'Camp_AvgClick_LastYear', 'Camp_AvgResp_LastYear', 'Contact_First_Date',
       'Contact_Last_Date', 'CustComm_Accept_Flag', 'CustComm_Count_LastYear',
       'Customer_Tenure', 'CustReq_Count', 'CustReq_Payment_Flag',
       'CustReq_Prod_flag', 'CustReq_Prod_Support_Flag',
       'CustReq_Product_Closed_Flag', 'CustReq_Resp_Flag',
       'Date Last Purchase', 'GEO_CAT', 'LastResponse_Flag', 'Loyalty_YM',
       'LoyaltyBonus_Redeem_LastYear', 'LoyaltyPurch_Cnt',
       'MostFreq_Purch_Cat', 'NPS_Cat_Last', 'NPS_Score_Last', 'PromoCode_Cat',
       'Sales_Channel_Cat'],
      dtype='object')

In [None]:
df.columns

Index(['ID', 'churn', 'Customer_Start_Date', 'ZipCode', 'Residency_Category',
       'Service_District', 'Gender', 'Customer_Age_Group', 'Brand_aware_Cat',
       'Camp_AvgClick_LastYear', 'Camp_AvgResp_LastYear', 'Contact_First_Date',
       'Contact_Last_Date', 'CustComm_Accept_Flag', 'CustComm_Count_LastYear',
       'Customer_Tenure', 'CustReq_Count', 'CustReq_Payment_Flag',
       'CustReq_Prod_flag', 'CustReq_Prod_Support_Flag',
       'CustReq_Product_Closed_Flag', 'CustReq_Resp_Flag',
       'Date Last Purchase', 'GEO_CAT', 'LastResponse_Flag', 'Loyalty_YM',
       'LoyaltyBonus_Redeem_LastYear', 'LoyaltyPurch_Cnt',
       'MostFreq_Purch_Cat', 'NPS_Cat_Last', 'NPS_Score_Avg', 'NPS_Score_Last',
       'PromoCode_Cat', 'Sales_Channel_Cat'],
      dtype='object')

[TBD, August 01] The scoring dataset is missing two features that are present in the training data ('churn' and 'NPS_Score_Avg', per the column listings above). This needs to be corrected before the scoring dataset can be used in the scoring workflow. Below is an example of running predictions, using the training dataset again.
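
A quick way to spot such mismatches before scoring is to compare the two column lists directly. The sketch below works on any iterables of column names; `compare_columns` is a hypothetical helper, and the short lists are an abbreviated subset of the columns printed above.

```python
def compare_columns(training_columns, scoring_columns):
    """Return (missing_in_scoring, extra_in_scoring), preserving column order."""
    training = list(training_columns)
    scoring = list(scoring_columns)
    training_set, scoring_set = set(training), set(scoring)
    missing = [c for c in training if c not in scoring_set]
    extra = [c for c in scoring if c not in training_set]
    return missing, extra


# Abbreviated subset of the columns shown above, for illustration.
training_cols = ["ID", "churn", "Gender", "NPS_Score_Avg", "NPS_Score_Last"]
scoring_cols = ["ID", "PREDICT_BATCH", "Gender",
                "Brand_CompetitorAware_Flag", "NPS_Score_Last"]
missing_cols, extra_cols = compare_columns(training_cols, scoring_cols)
print("Missing in scoring data:", missing_cols)
print("Only in scoring data:", extra_cols)
```

In the notebook, `compare_columns(df.columns, scoring_df.columns)` performs the same check on the full dataframes.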

In [None]:
# new_predictions = deployment.predict_batch(scoring_df)
new_predictions = deployment.predict_batch(df)



In [None]:
new_predictions.head()

Unnamed: 0,ID,churn,Customer_Start_Date,ZipCode,Residency_Category,Service_District,Gender,Customer_Age_Group,Brand_aware_Cat,Camp_AvgClick_LastYear,...,NPS_Cat_Last,NPS_Score_Avg,NPS_Score_Last,PromoCode_Cat,Sales_Channel_Cat,Low NPS & High Churn Rate_PREDICTION,High NPS & Low Campaign Response_PREDICTION,Med/High NPS & High Campaign Response_PREDICTION,PREDICTION,DEPLOYMENT_APPROVAL_STATUS
0,243624,1,05.09.2018,2DAA,Cat3_100_500K,AC,M,01:35,1-New,,...,0.0,-1.0,1.0,Extra_10,Distribution2,0.44987,0.349225,0.200906,Low NPS & High Churn Rate,APPROVED
1,242193,1,18.09.2018,2AA6,Cat3_100_500K,AC,F,3: 45-55,4-Know (no prompt),0.5,...,,-1.0,,Extra_10,Distribution2,0.356736,0.257998,0.385266,Med/High NPS & High Campaign Response,APPROVED
2,197222,1,26.09.2018,2AA8,Cat3_100_500K,AC,M,,4-Know (no prompt),,...,0.0,-1.0,1.0,Extra_10,Distribution2,0.621288,0.263381,0.11533,Low NPS & High Churn Rate,APPROVED
3,900083,1,11.10.2018,263A,Cat3_100_500K,AC,M,,1-New,,...,,-1.0,,Extra_10,Distribution2,0.694539,0.192745,0.112715,Low NPS & High Churn Rate,APPROVED
4,59631,0,05.10.2018,2AA6,Cat3_100_500K,AC,M,01:35,4-Know (no prompt),,...,1.0,7.72,8.0,Extra_10,Distribution2,0.113081,0.724056,0.162862,High NPS & Low Campaign Response,APPROVED


In [None]:
deployment.get_uri()

'https://app.eu.datarobot.com/deployments/64c7f9591111bf00935a6fe1/overview'

Lastly, for each prediction period you can use DataRobot drift monitoring to understand differences in customer profiles that may cause customers to migrate from one segment to another.
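
As a complementary check inside the notebook, you can compare the share of customers assigned to each cluster in two periods. The sketch below uses the Population Stability Index (PSI), a common distribution-shift measure, on two lists of predicted cluster labels. It is an illustration chosen for this example and is not presented as the calculation behind DataRobot drift monitoring; the function names and toy labels are hypothetical.

```python
import math
from collections import Counter


def cluster_shares(labels):
    """Return the fraction of rows assigned to each cluster label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def population_stability_index(baseline_labels, new_labels, eps=1e-6):
    """PSI between two sets of cluster assignments; 0 means identical shares."""
    base = cluster_shares(baseline_labels)
    new = cluster_shares(new_labels)
    psi = 0.0
    for name in set(base) | set(new):
        e = max(base.get(name, 0.0), eps)  # eps avoids log(0) for unseen clusters
        a = max(new.get(name, 0.0), eps)
        psi += (a - e) * math.log(a / e)
    return psi


# Toy assignments: the second period shifts customers from segment B to A.
period_1 = ["A"] * 50 + ["B"] * 50
period_2 = ["A"] * 80 + ["B"] * 20
psi_same = population_stability_index(period_1, period_1)
psi_shift = population_stability_index(period_1, period_2)
print(psi_same, round(psi_shift, 3))
```

With batch predictions like those shown above, the inputs would be the `PREDICTION` column from each scoring period; larger values indicate a bigger change in segment sizes.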