## Bulk PTB Model Runner

This notebook is capable of training and testing any PTB model as part of the PlaybookIQ framework.  Today, this includes any product with purchase history captured in DIM_OPPORTUNITY.  This notebook is the heart of PlaybookIQ and is where all the data collection and feature engineering up to this point come together with the model fitting.

This notebook also takes care of the heavy lifting of running PTB models in parallel (i.e. "in bulk").  There is a PTB model for every possible cross-sell target, so there are potentially thousands of them (examples are "Sales Cloud - Unlimited Edition", "ISVForce", "Service Cloud - Professional Edition", etc).  Since PlaybookIQ is a general framework, when we make changes we want to see the impact across all PTB models.  Therefore, being able to run a large number of PTB models quickly (in parallel) is important.

Each PTB model requires 20-30GB of memory, so if you run 20 at once you would need 400-600GB of memory.  Petronas (rserver1) has no problems running even 20 models simultaneously.
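
As a rough sizing aid, the arithmetic above can be written down as a small helper. This is only a sketch: `max_parallel_models` and its `headroom_gb` parameter are hypothetical names, and the 30GB default simply takes the upper end of the 20-30GB range quoted above.

```python
def max_parallel_models(total_mem_gb, per_model_gb=30, headroom_gb=0):
    """Return how many models fit in memory at the worst-case per-model footprint.

    per_model_gb defaults to the upper end of the 20-30GB range (an assumption);
    headroom_gb is memory reserved for everything else on the server.
    """
    usable = total_mem_gb - headroom_gb
    if usable <= 0:
        return 0
    return int(usable // per_model_gb)


# Memory needed for 20 simultaneous models at 20GB and 30GB each
low_gb, high_gb = 20 * 20, 20 * 30

# How many worst-case models fit on a machine with 600GB available
fits = max_parallel_models(600)
```

Sizing by the worst case keeps a batch from exhausting memory when several large models happen to run together.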


In [1]:
%pylab inline

import pandas as pd

pd.set_option('max_rows',3000)
pd.set_option('max_columns',3000)
pd.set_option('colwidth',3000)

MODEL_DIR = 'models'

Populating the interactive namespace from numpy and matplotlib


### Get Behavioral Features

Every PTB model requires behavioral data, i.e. data about what things accounts are using in the SFDC stack. There is a core set of these behaviors captured in the dataset "temporal_behaviors.tsv".  For efficiency, we read this file in once then use the loaded dataset in every parallel run of PTB models.  

In [2]:
! date
behav = pd.read_csv('temporal_behaviors.tsv',sep='\t')
! date

Tue Jun 21 22:44:41 UTC 2016
Tue Jun 21 22:47:13 UTC 2016




The next cell adds a DATE_KEY column, which allows us to merge the behavioral dataset with the purchase history.  We want to use the behaviors of the account at the time that they purchased the cross-sell target, so we merge by DATE_KEY, which is just the year-month of the observed behavior.

In [3]:
! date
behav['QRY_DATE'] = np.where(behav.QRY_DATE == '2016-03-31 00:00:00','2016-04-01 00:00:00',behav.QRY_DATE)
behav['DATE_KEY'] = behav['QRY_DATE'].map(lambda x: x[0:7])
! date

Tue Jun 21 22:47:14 UTC 2016
Tue Jun 21 22:47:33 UTC 2016
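
To make the year-month join concrete, here is a minimal sketch on toy data. The frames `behav_toy` and `purch_toy` and all their values are made up for illustration; only the column names ACCOUNT_ID, QRY_DATE and DATE_KEY come from the notebook. It uses `.str[0:7]`, which should give the same result as the `map(lambda x: x[0:7])` call above.

```python
import pandas as pd

# Toy behavioral snapshots (hypothetical values): one row per account per month
behav_toy = pd.DataFrame({
    'ACCOUNT_ID': ['A1', 'A1', 'A2'],
    'QRY_DATE': ['2016-02-01 00:00:00', '2016-03-01 00:00:00', '2016-03-01 00:00:00'],
    'NUM_LEADS': [10, 25, 3],
})

# DATE_KEY is the first 7 characters of the timestamp string, i.e. 'YYYY-MM'
behav_toy['DATE_KEY'] = behav_toy['QRY_DATE'].str[0:7]

# Toy purchase side: the month in which each account bought the target
purch_toy = pd.DataFrame({
    'ACCOUNT_ID': ['A1', 'A2'],
    'DATE_KEY': ['2016-02', '2016-03'],
})

# Inner join: each account picks up the behaviors observed in its purchase month
merged = purch_toy.merge(behav_toy, on=['ACCOUNT_ID', 'DATE_KEY'])
```

Account A1 gets its February snapshot rather than the later March one, which is the point of keying on the purchase month.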


### Utility functions for modeling

These functions just convert the purchase history encoded in strings to a usable python dictionary

In [4]:
def get_purchases(purch_str):
    purchases = purch_str.split(';')
    lookup = {}
    
    for purchase in purchases:
        parts = purchase.split('`')
    
        if len(parts) == 1:
            continue
        
        lookup[parts[0]] = parts[1]
    return lookup
        

def get_date_for_prod(purch_str,prod):
    lookup = get_purchases(purch_str)
    if prod in lookup:
        return lookup[prod]
    return "NA"
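
A short usage sketch follows. The functions are repeated in compact form so the block runs on its own, and the `sample` string is hypothetical: its layout (entries separated by `;`, product and date separated by a backtick) is inferred from the parsing code above, not from the actual data files.

```python
def get_purchases(purch_str):
    """Parse 'product`date;product`date;...' into a {product: date} dictionary."""
    lookup = {}
    for purchase in purch_str.split(';'):
        parts = purchase.split('`')
        if len(parts) == 1:
            # Skip empty or malformed entries, such as the one after a trailing ';'
            continue
        lookup[parts[0]] = parts[1]
    return lookup


def get_date_for_prod(purch_str, prod):
    """Return the purchase date of prod, or the string 'NA' if it was never bought."""
    return get_purchases(purch_str).get(prod, "NA")


# Hypothetical purchase history for one account
sample = "Sandbox`2015-06-01;Mobile`2016-01-15;"

parsed = get_purchases(sample)
mobile_date = get_date_for_prod(sample, "Mobile")
pardot_date = get_date_for_prod(sample, "Pardot")
```

Returning the string "NA" instead of None matters downstream, because run_model compares these values against the literal 'NA'.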


### Run Model
Here is where all the model training and cross validation happens. The "run_model" function is run for each PTB model, so it takes the name of the cross-sell TARGET as input. You can track progress in a file called output.log.

What happens in this function is pretty simple:

* It creates the training set needed to associate the cross-sell TARGET with the same account's purchase history and temporal behaviors
* It trains a classifier that predicts purchase of the TARGET based on past purchase history and account behaviors (the code below currently uses GradientBoostingClassifier; the LogisticRegression line is commented out)
* It tests the model against a holdout dataset
* It reports on AUC and other metrics in a file called bulk_model_results.tsv
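
The first step in the list above, turning purchase dates into binary features, can be sketched on toy data as follows. The frame `toy` and its column names are hypothetical; the logic mirrors what run_model does: an account that never bought the target gets the far-future date '2099-01-01', and a lead-in counts only if it was bought strictly before the target. Comparing the dates as strings works here because they are in ISO 'YYYY-MM-DD' form.

```python
import numpy as np
import pandas as pd

# Hypothetical accounts: target purchase date and one lead-in purchase date
toy = pd.DataFrame({
    'TARGET_DATE': ['2016-02-10', 'NA', '2015-01-05'],
    'LEAD_IN_DATE': ['2015-11-01', '2014-03-20', '2015-06-30'],
})

# Non-purchasers get a far-future date so every lead-in sorts before it
toy['TARGET_FILLED'] = np.where(toy['TARGET_DATE'] == 'NA', '2099-01-01', toy['TARGET_DATE'])

# 1 if the lead-in was bought strictly before the target, else 0
toy['lead_in_before_target'] = np.where(
    toy['LEAD_IN_DATE'] == 'NA',
    0,
    np.where(toy['LEAD_IN_DATE'] < toy['TARGET_FILLED'], 1, 0),
)

# Binary label: 0 for the far-future placeholder, 1 for a real purchase
toy['label'] = np.where(toy['TARGET_FILLED'] == '2099-01-01', 0, 1)
```

The third account bought the lead-in after the target, so its feature is 0 even though both products were purchased.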

In [5]:
import time
from os import environ
from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib


def run_model(TARGET):
   
    ! echo `date`,"{TARGET} - Start Model" >> output.log

    strt = time.time()
    
    ###########################################################
    #
    # Feature Engineering
    #
    ###########################################################    
    
    # Take the 10 products most frequently purchased before the TARGET product (lead ins)
    num_lead_ins = 10

    # The lead ins come from the Sequental_Frequent_Itemsets* notebooks which produced this rules file
    top_lead = pd.read_csv('purchase_rules_enh.tsv',sep='\t')

    # The account purchase history come from these 2 files.  The test file contains a sample of accounts
    # held out from the training set.
    train0 = pd.read_csv('acct_purchases_flat.tsv',sep='\t',header=None)
    test0 = pd.read_csv('acct_purchases_flat_test.tsv',sep='\t',header=None)

    train0.columns = ['ACCOUNT_ID','PURCH']
    test0.columns = ['ACCOUNT_ID','PURCH']

    # Grab 10 lead in for product, also exclude any lead ins that do not have sufficient lift
    top_features = top_lead[(top_lead.TARGET == TARGET) & (top_lead.lift > 1.2)].\
        sort_values(by='rank',ascending=False).head(num_lead_ins)
    purch_features = top_features['LEAD_IN'].tolist()
     
    # Get the dates when each account bought the target product, if at all
    train0[TARGET] = train0.PURCH.map(lambda x: get_date_for_prod(x,TARGET))
    test0[TARGET] = test0.PURCH.map(lambda x: get_date_for_prod(x,TARGET))
    
    # Get the dates when each account bought the lead in products, if at all
    for f in purch_features:
        train0[f] = train0.PURCH.map(lambda x: get_date_for_prod(x,f))
        test0[f] = test0.PURCH.map(lambda x: get_date_for_prod(x,f))

    # Generate the DATE_KEY when the target product was purchased, so we can merge
    # the account behaviors on this date.  Key assumption: we use the latest account
    # behaviors we have (March 2016) if the account never purchased the target product
    train0['DATE_KEY'] = np.where(train0[TARGET] == 'NA','2016-03',train0[TARGET].map(lambda x: x[0:7]))
    test0['DATE_KEY'] = np.where(test0[TARGET] == 'NA','2016-03',test0[TARGET].map(lambda x: x[0:7]))

    # Here is where the merge of the TARGET, purchase history, and temporal behaviors happens.
    # Note that we do this for every model dynamically because the TARGET dictates which temporal behaviors
    # to merge in, and every model has a different TARGET.  By merging dynamically we avoid creating 2000+
    # training sets stored on disk.
    #
    # This is also the slowest step so we track performance in log
    ! echo `date`,"{TARGET} - Start Merging " >> output.log
    train00 = train0.merge(behav,on=['ACCOUNT_ID','DATE_KEY'])
    test00 = test0.merge(behav,on=['ACCOUNT_ID','DATE_KEY'])
    ! echo `date`,"{TARGET} - Finished Merging " >> output.log
    
    train1 = pd.DataFrame(train00[TARGET])
    test1 = pd.DataFrame(test00[TARGET])

    message = "%s - train: %s, test: %s" % (TARGET,len(train1),len(test1))
    ! echo `date`,"{message}" >> output.log
    
    # To simplify calculations, we set the date of purchase to a far future date if the account
    # did not buy the product.  We later set these to a value of 0 for "no purchase".  
    train1[TARGET] = np.where(train1[TARGET] == 'NA','2099-01-01',train1[TARGET])
    test1[TARGET] = np.where(test1[TARGET] == 'NA','2099-01-01',test1[TARGET])

    train1['ACCT_ID'] = train00.ACCOUNT_ID
    test1['ACCT_ID'] = test00.ACCOUNT_ID

    # Here is where we convert the purchase history dates to binary features
    for c in purch_features:
        train1[c] = np.where(train00[c] == 'NA',0,np.where(train00[c] < train1[TARGET],1,0))
        test1[c] = np.where(test00[c] == 'NA',0,np.where(test00[c] < test1[TARGET],1,0))
   
    behav_features = ['NUM_ACCOUNTS',
 'NUM_CAMPAIGNS',
 'NUM_CASE_QUEUES',
 'NUM_CASE_RECORD_TYPES',
 'NUM_CASES',
 'NUM_CONTRACTS',
 'NUM_CUSTOM_OBJECTS',
 'NUM_DASHBOARDS',
 'NUM_ESCALATION_RULES',
 'NUM_FORECASTS',
 'NUM_LEADS',
 'NUM_MASS_EMAILS']
    
    # Temporarily only consider one behavioral feature for now
    #behav_features = ['NUM_ACCOUNTS']

    # Log transformation of behavioral variables (currently disabled: raw values are copied as-is)
    for c in behav_features:
        #train00[c] = np.where(train00[c] <= 0,0.01,train00[c])
        #test00[c] = np.where(test00[c] <= 0,0.01,test00[c])
        #train1[c] = np.where(train00[c] > 0,log10(train00[c].map(float)),0)
        #test1[c] = np.where(test00[c] > 0,log10(test00[c].map(float)),0)
        train1[c] = train00[c]
        test1[c] = test00[c]


    # If we ever need to do any additional filtering on train/test rows, do it here
    local_train = pd.DataFrame(train1)
    local_test = pd.DataFrame(test1)

    local_train.fillna(value=0,inplace=True)
    local_test.fillna(value=0,inplace=True)
    
    # Convert target to binary features
    local_train[TARGET] = np.where(local_train[TARGET] == '2099-01-01',0,1)
    local_test[TARGET] = np.where(local_test[TARGET] == '2099-01-01',0,1)

    ###########################################################
    #
    # Model Fitting
    #
    ###########################################################    

    
    #cls = LogisticRegression(class_weight='balanced')
    from sklearn.ensemble import GradientBoostingClassifier
    cls = GradientBoostingClassifier()

    # List of both purchase history and behavioral features
    features = purch_features + behav_features

    # Fit the model
    result = cls.fit(local_train[features], local_train[TARGET])
    
    file = MODEL_DIR + '/' + TARGET + '.pkl'
    joblib.dump(cls, file) 

    importances = [y for (x,y) in sorted(zip(result.feature_importances_,features),reverse=True)]
    importances_str = ','.join(importances)
    
    # A test of serialization
    cls = joblib.load(file)
    
    # Make the predictions
    local_test['proba'] = cls.predict_proba(local_test[features])[:,1]
    local_test['pred'] = cls.predict(local_test[features])

    # CALCULATE AUC
    from sklearn import metrics
    y = local_test[TARGET]
    fpr, tpr, thresholds = metrics.roc_curve(y, local_test['proba']) 
    auc=metrics.auc(fpr, tpr)


    # Report on performance of top 10% of predictions

    cuts = [0,.90,1]
    rank2 = pd.DataFrame(local_test.sort_values(by='proba'))
    rank2['num_rank'] = range(len(rank2))
    rank2['bucket'] = pd.qcut(rank2.num_rank,cuts,labels=[str(i) for i in range(1,len(cuts))])
    res = rank2.groupby('bucket')[TARGET].agg([mean,len]).reset_index()

    bottom, top = res['mean'].tolist()

    baseline = rank2[TARGET].mean()

    dur = str(time.time() - strt)
    
    pos_train = local_train[TARGET].sum()
    pos_test = local_test[TARGET].sum()    
    
    ! echo "{TARGET}\t{auc}\t{baseline}\t{top}\t{bottom}\t{pos_train}\t{pos_test}\t{importances_str}" >> bulk_model_results.tsv
    
    

    ! echo `date`,"{TARGET} - Finish Model" >> output.log
    
    return (local_train,local_test,train00,purch_features,behav_features,result,cls)
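
The top-10% report at the end of run_model can be illustrated with a simplified variant. This sketch does not reproduce the `pd.qcut` bucketing used above; it just takes the highest-scoring tenth of rows directly. The function name `top_decile_lift` and the toy data are hypothetical.

```python
import pandas as pd


def top_decile_lift(df, label_col, proba_col):
    """Return (top-decile purchase rate, overall rate, lift) for scored rows."""
    ranked = df.sort_values(by=proba_col).reset_index(drop=True)
    # Number of rows in the top 10%, at least one
    n_top = max(1, int(round(len(ranked) * 0.10)))
    top_rate = ranked[label_col].iloc[-n_top:].mean()
    baseline = ranked[label_col].mean()
    return top_rate, baseline, top_rate / baseline


# Ten hypothetical accounts: two purchasers, one of them scored highest
toy_scores = pd.DataFrame({
    'y':     [0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    'proba': [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.40, 0.50, 0.90],
})

top_rate, baseline, lift = top_decile_lift(toy_scores, 'y', 'proba')
```

A lift of 5 here means the top-scored tenth of accounts buys at five times the overall rate, which is the quantity the LIFT column in the results section measures.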

This model wrapper informs us if there was an error.  To troubleshoot an error, run run_model(TARGET) in a notebook cell.

In [6]:
import traceback,sys

def run_model_wrapper(TARGET):
    try:
        run_model(TARGET)
    except:
        e = sys.exc_info()[0]
        message = "{%s} - Error %s" % (TARGET,e)
        ! echo '{message}' >> output.log
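
The wrapper above logs only the exception type. If more detail is wanted in the log, one possible variant writes the full stack trace using `traceback.format_exc()`. This is a sketch, not the notebook's wrapper: `run_with_traceback` and its `log_path` argument are hypothetical, and it writes the file from Python instead of shelling out to `echo`.

```python
import traceback


def run_with_traceback(func, target, log_path):
    """Call func(target); on failure append the full traceback to log_path.

    Returns True on success and False if func raised an exception.
    """
    try:
        func(target)
        return True
    except Exception:
        with open(log_path, 'a') as log:
            # Same "{TARGET} - Error" prefix as the wrapper above, then the stack trace
            log.write("{%s} - Error\n%s\n" % (target, traceback.format_exc()))
        return False
```

It could be called as `run_with_traceback(run_model, TARGET, 'output.log')`, which keeps the failing line number next to the product name in the log.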

### Parallel Execution

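Start by ranking all products by number of purchases, most purchased first, and create a model for each one.  The cell below keeps the full list; slice TARGETS (for example TARGETS[0:50]) to limit the run to the top 50.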

In [7]:
purchases = pd.read_csv('acct_purchases_long.tsv',sep='\t')
prods = purchases.groupby('PROD_NM').size().reset_index()
TARGETS = prods.sort_values(by=0,ascending=False).PROD_NM.tolist()

This next cell runs each PTB model in a separate process.  The key tunable is the batch variable, which dictates how many models run in parallel at a time.  In the code below batch is 30, so up to 30 models run as a batch (29 in the background plus one in the foreground), and when the foreground model is done the next batch starts.

The batch size is controlled by running every nth model synchronously (in the foreground). The last model in TARGETS is also run in the foreground, so you can be reasonably sure that when the cell finishes executing, all models have run.  This scheme generally works fine as long as all models run in roughly the same amount of time (which is true).
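
The foreground/background rule can be previewed without starting any processes. The helper `plan_runs` below is hypothetical; it only applies the same test the next cell uses (every nth model, plus the last one, runs in the foreground).

```python
def plan_runs(targets, batch):
    """Return (target, mode) pairs showing how each model would be launched."""
    plan = []
    for cnt, target in enumerate(targets, start=1):
        # Every batch-th model and the final model block the loop until they finish
        foreground = (cnt % batch == 0) or (cnt == len(targets))
        plan.append((target, 'foreground' if foreground else 'background'))
    return plan


# Five placeholder targets with a batch size of 2
preview = plan_runs(['a', 'b', 'c', 'd', 'e'], 2)
```

Note that the foreground model only throttles the loop. Nothing waits for the background processes of the previous batch, which is why the scheme relies on all models taking about the same time.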

In [8]:
#import thread
import multiprocessing

! echo "PROD_NM\tAUC\tBASELINE\tTOP_10perc\tREST\tPOS_TRAIN\tPOS_TEST\tFEAT_IMP" > bulk_model_results.tsv
! echo "Start All Models" > output.log


batch = 30

cnt = 1
jobs = []

prods = TARGETS
#prods = TARGETS[0:2]
! date
try:
    for p in prods:
        if cnt % batch == 0 or cnt == len(prods):
            print p,"========= Foreground run"
            run_model_wrapper(p)
        else:
            print p,"Background run"
            p = multiprocessing.Process(target=run_model_wrapper,args=(p,  ))
            jobs.append(p)
            p.start()
        cnt += 1

except:
    print "Error: unable to start thread"

! date

Tue Jun 21 22:47:38 UTC 2016
Sales Cloud - Professional Edition Background run
Sales Cloud - Group Edition Background run
Sales Cloud - Enterprise Edition Background run
Sales Cloud - Contact Manager Edition Background run
Foundation Enterprise Edition Power of 10 Donation Background run
Force.com Platform Embedded Edition Background run
ISVForce Background run
Premier+ Success Plan (Support & Admin) - Sales Cloud Background run
Force.com - Enterprise Edition Background run
Data.com Corporate Prospector Background run
Service Cloud - Professional Edition Background run
Mobile Background run
Service Cloud - Enterprise Edition Background run
Sandbox Background run
Premier Success Plan (Support) - Sales Cloud Background run
Premier+ Success Plan (Support & Admin) Background run
Sales Cloud - Unlimited Edition Background run
Chatter Plus Background run
Desk.com - Pro Background run
Pardot Background run
Foundation Enterprise Edition Background run
Premier Success Plan (Support) Background 

Wait until all background jobs are done.

In [9]:
import time 
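# exitcode stays None while a process is running; a crashed job has a nonzero exitcode,
# so wait on "still running" rather than "exited with 0" to avoid looping forever
while any(j.exitcode is None for j in jobs):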
    print "Waiting 10 secs for all jobs to complete"
    print [j.exitcode for j in jobs]
    time.sleep(10)  
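
When reading the printed list of exit codes, it helps to remember that `multiprocessing.Process.exitcode` is None while the process is still running, 0 after a clean exit, and negative when the process was killed by a signal. The hypothetical helper below, a small sketch, tallies a list of exit codes accordingly.

```python
def summarize_exitcodes(exitcodes):
    """Count running, succeeded and failed jobs from a list of Process.exitcode values."""
    summary = {'running': 0, 'succeeded': 0, 'failed': 0}
    for code in exitcodes:
        if code is None:
            summary['running'] += 1      # process has not terminated yet
        elif code == 0:
            summary['succeeded'] += 1    # clean exit
        else:
            summary['failed'] += 1       # nonzero exit or killed by a signal
    return summary


# Example: two finished jobs, one still running, one killed by signal 9
status = summarize_exitcodes([0, 0, None, -9])
```

In the notebook it could be called as `summarize_exitcodes([j.exitcode for j in jobs])`. A nonzero failed count points at products that need a rerun, for example after a process was killed for using too much memory.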


### Analyze Results

In [10]:
result = pd.read_csv('bulk_model_results.tsv',sep='\t')
result['LIFT'] = result.TOP_10perc/result.BASELINE
result.sort_values(by='LIFT',ascending=False).head(20)

Unnamed: 0,PROD_NM,AUC,BASELINE,TOP_10perc,REST,POS_TRAIN,POS_TEST,FEAT_IMP,LIFT
52,Analytics - 5 Additional Dynamic Dashboards,0.962269,0.00173,0.016609,7.7e-05,286,75,"NUM_ACCOUNTS,NUM_DASHBOARDS,NUM_CASES,NUM_LEADS,NUM_CONTRACTS,NUM_CASE_QUEUES,NUM_CAMPAIGNS,NUM_CUSTOM_OBJECTS,NUM_FORECASTS,NUM_MASS_EMAILS,NUM_CASE_RECORD_TYPES,ISVForce,Sandbox,Service Cloud - Enterprise Edition,Sales Cloud - Enterprise Edition,Marketing Cloud,Force.com - Unlimited Edition,Knowledge,Sales Cloud - Unlimited Edition,Service Cloud - Unlimited Edition,NUM_ESCALATION_RULES,Force.com Platform Embedded Edition",9.599779
38,Force.com Platform Unlimited Edition,0.975189,0.003022,0.028835,0.000154,607,131,"NUM_LEADS,NUM_ACCOUNTS,NUM_CUSTOM_OBJECTS,NUM_DASHBOARDS,NUM_CASE_QUEUES,NUM_CAMPAIGNS,Force.com - Unlimited Edition,Knowledge,NUM_CASE_RECORD_TYPES,NUM_CASES,NUM_CONTRACTS,Premier Success Plan (Support) - Force.com Edition,NUM_MASS_EMAILS,Premier Success Plan (Support) - CP Enterprise Admin (Named User),NUM_ESCALATION_RULES,NUM_FORECASTS,Force.com Unlimited Edition,Additional API Calls - 1,000 per day,Sales Cloud - Unlimited Edition,Premier Success Plan (Support) - Enterprise Ed. (Chatter Plus),Service Cloud - Unlimited Edition,Premier Success Plan (Support) - Enterprise Edition (Restricted)",9.541985
64,Analytics Cloud - Sales Wave Analytics App,0.955273,0.000369,0.003227,5.1e-05,82,16,"NUM_MASS_EMAILS,NUM_CASE_QUEUES,NUM_CUSTOM_OBJECTS,NUM_DASHBOARDS,NUM_CAMPAIGNS,NUM_CASES,NUM_LEADS,Mobile,NUM_CONTRACTS,Service Cloud - Enterprise Edition,NUM_FORECASTS,Force.com Platform Embedded Edition,ISVForce,NUM_ACCOUNTS,Premier+ Success Plan (Support & Admin) - Sales Cloud,NUM_CASE_RECORD_TYPES,Sandbox,Sales Cloud - Unlimited Edition,Pardot,Sales Cloud - Enterprise Edition,NUM_ESCALATION_RULES,Data.com Corporate Prospector",8.749193
4,Sales Cloud - Contact Manager Edition,0.977941,0.090821,0.792009,0.012902,15512,3864,"NUM_DASHBOARDS,NUM_LEADS,NUM_CUSTOM_OBJECTS,NUM_ACCOUNTS,NUM_CASES,NUM_MASS_EMAILS,NUM_CAMPAIGNS,NUM_CONTRACTS,NUM_CASE_QUEUES,NUM_FORECASTS,NUM_ESCALATION_RULES,NUM_CASE_RECORD_TYPES",8.720507
42,Analytics Cloud,0.935036,0.002927,0.025121,0.000461,469,127,"NUM_DASHBOARDS,NUM_CUSTOM_OBJECTS,NUM_CASES,NUM_CASE_RECORD_TYPES,NUM_CONTRACTS,NUM_CAMPAIGNS,NUM_MASS_EMAILS,NUM_CASE_QUEUES,NUM_LEADS,NUM_ACCOUNTS,NUM_FORECASTS,Sandbox,Sales Cloud - Enterprise Edition,Knowledge,Service Cloud - Unlimited Edition,NUM_ESCALATION_RULES,Chatter Plus,Force.com Platform Embedded Edition,Sales Cloud - Unlimited Edition,ISVForce,Customer Community,Service Cloud - Enterprise Edition",8.581886
9,Force.com Platform Enterprise Edition,0.94762,0.013167,0.10837,0.002588,2403,571,"NUM_CUSTOM_OBJECTS,NUM_CASES,NUM_DASHBOARDS,NUM_CASE_RECORD_TYPES,NUM_LEADS,NUM_ACCOUNTS,NUM_CAMPAIGNS,NUM_CASE_QUEUES,NUM_MASS_EMAILS,Premier+ Success Plan (Support & Admin) - Cust Cmty (2K Logins/mo),NUM_CONTRACTS,Premier Success Plan (Support) - CP Ent. Admin (1 Login/month),Premier Success Plan (Support) - CP Enterprise Admin (Named User),NUM_ESCALATION_RULES,Premier Success Plan (Support) - Enterprise Edition (Restricted),NUM_FORECASTS",8.230414
67,Data.com Premium Records Additional,0.868724,0.000576,0.00438,0.000154,120,25,"NUM_FORECASTS,NUM_ACCOUNTS,Data.com Premium Clean,NUM_LEADS,NUM_CASES,NUM_CONTRACTS,NUM_CAMPAIGNS,Data.com Premium Prospector,NUM_CASE_QUEUES,NUM_CASE_RECORD_TYPES,NUM_DASHBOARDS,ISVForce,Service Cloud - Enterprise Edition,Data.com Corporate Prospector,NUM_MASS_EMAILS,NUM_CUSTOM_OBJECTS,Sales Cloud - Unlimited Edition,Sales Cloud - Professional Edition,Sales Cloud - Enterprise Edition,Premier+ Success Plan (Support & Admin) - Sales Cloud,NUM_ESCALATION_RULES,Force.com Platform Embedded Edition",7.6
63,Premier Success Plan (Support) - Service Cloud - Knowledge Pack,0.790983,0.000277,0.002075,7.7e-05,41,12,"NUM_CASES,NUM_ACCOUNTS,NUM_LEADS,NUM_DASHBOARDS,NUM_CONTRACTS,NUM_CUSTOM_OBJECTS,NUM_CASE_RECORD_TYPES,Marketing Cloud,NUM_MASS_EMAILS,Service Cloud - Unlimited Edition,NUM_ESCALATION_RULES,NUM_CASE_QUEUES,NUM_CAMPAIGNS,Service Cloud - Enterprise Edition,Knowledge,Sandbox,Sales Cloud - Unlimited Edition,Premier Success Plan (Support) - Service Cloud,Premier Success Plan (Support) - Sales Cloud,NUM_FORECASTS,Customer Community,Chatter Plus",7.500173
74,Additional 10 Objects for Partner Community (Member),0.807868,0.000277,0.002074,7.7e-05,46,12,"NUM_DASHBOARDS,NUM_CONTRACTS,NUM_CUSTOM_OBJECTS,NUM_CASE_QUEUES,NUM_ACCOUNTS,NUM_LEADS,NUM_CASES,NUM_CASE_RECORD_TYPES,Partner Community,Force.com - Enterprise Edition,ISVForce,Sandbox,Service Cloud - Unlimited Edition,NUM_CAMPAIGNS,NUM_FORECASTS,Service Cloud - Enterprise Edition,Sales Cloud - Unlimited Edition,Sales Cloud - Enterprise Edition,NUM_MASS_EMAILS,NUM_ESCALATION_RULES,Force.com Platform Embedded Edition,Force.com - Unlimited Edition",7.499654
28,Foundation Enterprise Edition,0.90196,0.010914,0.081327,0.00309,1848,467,"NUM_ACCOUNTS,NUM_CAMPAIGNS,Foundation Enterprise Edition Power of 10 Donation,NUM_CUSTOM_OBJECTS,NUM_DASHBOARDS,NUM_LEADS,NUM_CASES,NUM_CASE_RECORD_TYPES,NUM_CASE_QUEUES,NUM_CONTRACTS,Premier Success Plan (Support) - Sales Cloud for Nonprofits,Force.com Edition for Nonprofits,Force.com (Light Applications),NUM_MASS_EMAILS,Premier Success Plan (Support) - Cust Cmty (100 Members),Premier Success Plan (Support) - Cust Cmty (2K Logins/mo),Premier Success Plan (Support) for Nonprofits - Fee,Premier Success Plan (Support) - Knowledge-only,Premier+ Success Plan (Support & Admin) - Sales Cloud for Nonprofits,NUM_FORECASTS,NUM_ESCALATION_RULES",7.45182


In [11]:
# Save interesting models to HTML
cols = ['PROD_NM','AUC','BASELINE','TOP_10perc','LIFT','POS_TRAIN','POS_TEST','FEAT_IMP']
#res = result[result.POS_TEST > 90]
result[cols].sort_values('LIFT',ascending=False).to_html('Model_Results.html')
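
Lift computed from a handful of positive test cases is noisy, which is presumably what the commented-out `POS_TEST > 90` filter above is for. A minimal sketch of that filter on made-up rows follows; `toy_result`, the product names and the numbers are all hypothetical, and the threshold of 90 is taken from the commented-out line.

```python
import pandas as pd

# Hypothetical rows in the same shape as bulk_model_results.tsv
toy_result = pd.DataFrame({
    'PROD_NM': ['Product A', 'Product B', 'Product C'],
    'BASELINE': [0.010, 0.0004, 0.003],
    'TOP_10perc': [0.080, 0.0036, 0.015],
    'POS_TEST': [450, 12, 120],
})
toy_result['LIFT'] = toy_result.TOP_10perc / toy_result.BASELINE

# Keep only models evaluated on enough positive test cases, then rank by lift
min_pos_test = 90
reliable = toy_result[toy_result.POS_TEST > min_pos_test].sort_values(by='LIFT', ascending=False)
```

Product B has the highest raw lift in this toy table but only 12 positive test cases, so it is dropped before ranking.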

In [12]:
import json
with open('dominostats.json', 'wb') as f:
    f.write(json.dumps({"Avg AUC (%s models)" % len(result): '%0.4f' % result.AUC.mean()}))

In [13]:
result[result.PROD_NM == 'Marketing Cloud'].head()

Unnamed: 0,PROD_NM,AUC,BASELINE,TOP_10perc,REST,POS_TRAIN,POS_TEST,FEAT_IMP,LIFT
20,Marketing Cloud,0.828307,0.002476,0.01342,0.00126,416,107,"NUM_ACCOUNTS,NUM_DASHBOARDS,NUM_CASES,NUM_CAMPAIGNS,NUM_CUSTOM_OBJECTS,NUM_CONTRACTS,NUM_MASS_EMAILS,NUM_CASE_QUEUES,NUM_LEADS,NUM_CASE_RECORD_TYPES,Service Cloud - Performance Edition,NUM_FORECASTS,Force.com - Unlimited Edition,Sandbox,Sales Cloud - Unlimited Edition,NUM_ESCALATION_RULES,Data.com Premium Prospector,Live Agent,Data.com Premium Clean,Additional API Calls - 1,000 per day,Knowledge,Service Cloud - Unlimited Edition",5.419683


In [14]:
result[result.PROD_NM.str.contains('Shield')].head()

Unnamed: 0,PROD_NM,AUC,BASELINE,TOP_10perc,REST,POS_TRAIN,POS_TEST,FEAT_IMP,LIFT
69,Salesforce Shield,0.906885,0.000761,0.005531,0.00023,147,33,"NUM_ACCOUNTS,NUM_CASES,NUM_DASHBOARDS,NUM_CASE_RECORD_TYPES,NUM_CASE_QUEUES,NUM_CUSTOM_OBJECTS,NUM_CONTRACTS,NUM_CAMPAIGNS,NUM_LEADS,Service Cloud - Unlimited Edition,NUM_MASS_EMAILS,Service Cloud - Enterprise Edition,NUM_ESCALATION_RULES,Chatter Plus,Sales Cloud - Unlimited Edition,Sandbox,Premier+ Success Plan (Support & Admin) - Sales Cloud,Knowledge,Sales Cloud - Enterprise Edition,NUM_FORECASTS,ISVForce,Force.com Platform Embedded Edition",7.272057


### Debug Model Example

* Configure the Parallel Execution cell to contain the product you want to run.
* Run "(local_train,local_test,train00,purch_features,behav_features,result,cls) = run_model(TARGET)"
* Tail output.log in a separate window after you do this
* Fit models 

In [15]:
#TARGET = 'Mobile'
#TARGET = 'Analytics Cloud - Wave Base Capacity'
#! date
#(local_train,local_test,train00,purch_features,behav_features,result,cls) = run_model(TARGET)
#! date

In [16]:
'''
1
features0 = features`a

features = ['Sales Cloud - Enterprise Edition',
 'Service Cloud - Enterprise Edition',
 'Premier Success Plan (Support) - Sales Cloud',
 'Premier Success Plan (Support) - Service Cloud',
 'Dreamforce Pass',
 'Force.com - Enterprise Edition',
 'Mobile',
 'Premier Success Plan (Support) - Force.com Edition',
 'Chatter Plus',
 'Sales Cloud - Unlimited Edition','NUM_ACCOUNTS']

features = ['Sales Cloud - Enterprise Edition']

cls = LogisticRegression(class_weight='balanced')

#import statsmodels.api as sm
#cls = sm.Logit(local_train[TARGET],local_train[features])
#result = cls.fit()


cls.fit(local_train[features],local_train[TARGET])
local_test['proba'] = cls.predict_proba(local_test[features])[:,1]
local_test['pred'] = cls.predict(local_test[features])

# CALCULATE AUC
from sklearn import metrics
y = local_test[TARGET]
fpr, tpr, thresholds = metrics.roc_curve(y, local_test['proba']) 
auc=metrics.auc(fpr, tpr)

auc
'''

In [17]:
'''
(local_train,local_test,train00,purch_features,behav_features,result,cls) = run_model(TARGET)

features3 = ['NUM_MASS_EMAILS','NUM_CUSTOM_OBJECTS','NUM_CUSTOM_OBJECT_RECORDS']

from sklearn.ensemble import RandomForestClassifier

cls = RandomForestClassifier(class_weight='balanced')

# Fit the model
! date
result = cls.fit(local_train[features3], local_train[TARGET])
! date



# Make the predictions
local_test['proba'] = cls.predict_proba(local_test[features3])[:,1]
local_test['pred'] = cls.predict(local_test[features3])

# CALCULATE AUC
from sklearn import metrics
y = local_test[TARGET]
fpr, tpr, thresholds = metrics.roc_curve(y, local_test['proba']) 
auc=metrics.auc(fpr, tpr)

auc

pd.DataFrame(zip(features3,result.feature_importances_)).sort_values(by=1)
'''