# Training a Random Forest with the SWAT API in Python

This pipeline chains model training (400 trees, maximum depth of 13) with generation of the astore scoring file, so that the model can be deployed directly to production on the SAS 9.4M5 platform. It was run on a dataset of about 500,000 rows and 200 variables whose distribution is equivalent to that of the shared dataset.

A training run with autotuning was also launched to look for a better combination of hyperparameters. In about 900 s, this autotuning produced the following hyperparameters for the random forest:

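|   | Parameter                          | Name       | Value |
|---|------------------------------------|------------|-------|
| 0 | Evaluation                         | Evaluation | 31    |
| 1 | Number of Trees                    | NTREE      | 111   |
| 2 | Number of Variables to Try         | M          | 46    |
| 3 | Bootstrap                          | BOOTSTRAP  | 0.1   |
| 4 | Maximum Tree Levels                | MAXLEVEL   | 29    |
| 5 | Misclassification Error Percentage | Objective  | 12.40 |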
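
As a minimal sketch, the tuned values above could be turned into keyword arguments for a later `foresttrain` call. The mapping from tuner names (`NTREE`, `M`, ...) to keyword names (`ntree`, `m`, ...) is an assumption made for illustration and should be checked against the action's help output.

```python
# Best configuration as reported in the table above (Name -> Value).
best_config = {
    "NTREE": 111,
    "M": 46,
    "BOOTSTRAP": 0.1,
    "MAXLEVEL": 29,
}

# Assumed mapping from tuner parameter names to forestTrain keyword arguments.
NAME_TO_KWARG = {
    "NTREE": "ntree",
    "M": "m",
    "BOOTSTRAP": "bootstrap",
    "MAXLEVEL": "maxlevel",
}


def tuned_forest_kwargs(config):
    """Translate a tuner configuration into keyword arguments."""
    return {NAME_TO_KWARG[name]: value for name, value in config.items()}


tuned_kwargs = tuned_forest_kwargs(best_config)
print(tuned_kwargs)
```

The resulting dictionary could then be unpacked with `**tuned_kwargs` alongside the other training arguments.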

The output of every step has been kept in the notebook so that processing times and the sizes of the generated files can be checked.

For example, training the forest on 500,000 rows and 200 variables takes 78 seconds on an AWS platform equivalent to 16 physical cores and 128 GB of RAM.

### General documentation on programming for Viya in SAS/Python/R is available here:
http://go.documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.4&docsetId=pgmsashome&docsetTarget=home.htm&locale=en

### Documentation for the Python SWAT API for Viya is available here:
https://sassoftware.github.io/python-swat/



In [3]:
import os
import pandas as pd
import swat
import sys
from sasctl import Session

In [21]:
target          = "CIBLE"
class_inputs    = ["var1","var21","var37","var38","var89","var90","var91","var103","var163","var165","var166","var167","var168","var169","var170","var171","var172","var173","var174","var175","var176","var177","var180","var182","var183","var184","var185","var186","var187","var188","var189","var190","var191","var192","var193","var194","var195","var196"]
class_vars      = [target] + class_inputs
interval_inputs = ["var3","var4","var5","var6","var7","var8","var9","var10","var11","var12","var13","var14","var15","var16","var17","var18","var19","var20","var22","var23","var24","var25","var26","var27","var28","var29","var30","var31","var32","var33","var34","var35","var36","var39","var40","var41","var42","var43","var44","var45","var46","var47","var48","var49","var50","var51","var52","var53","var54","var55","var56","var57","var58","var59","var60","var61","var62","var63","var64","var65","var66","var67","var68","var69","var70","var71","var72","var73","var74","var75","var76","var77","var78","var79","var80","var81","var82","var83","var84","var85","var86","var87","var88","var92","var93","var94","var95","var96","var97","var98","var99","var100","var101","var102","var104","var105","var106","var107","var108","var109","var110","var111","var112","var113","var114","var115","var116","var117","var118","var119","var120","var121","var122","var123","var124","var125","var126","var127","var128","var129","var130","var131","var132","var133","var134","var135","var136","var137","var138","var139","var140","var141","var142","var143","var144","var145","var146","var147","var148","var149","var150","var151","var152","var153","var154","var155","var156","var157","var158","var159","var160","var161","var162","var164","var178","var179","var181"]
all_inputs      = interval_inputs + class_inputs

indata = 'train_sample_retail_banking'
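
Because the variable lists are long and maintained by hand, a quick consistency check can be useful before training. The helper below is a hypothetical sketch (its name and the toy lists are not part of the notebook): it rejects duplicated variables and a target that also appears among the inputs.

```python
from collections import Counter


def check_variable_roles(target, class_inputs, interval_inputs):
    """Return the number of inputs after checking that the lists are consistent."""
    all_inputs = interval_inputs + class_inputs
    duplicates = sorted(v for v, n in Counter(all_inputs).items() if n > 1)
    if duplicates:
        raise ValueError(f"variables listed more than once: {duplicates}")
    if target in all_inputs:
        raise ValueError(f"target {target!r} must not be an input")
    return len(all_inputs)


# Toy lists for illustration only; in the notebook, pass the real lists instead.
n_inputs = check_variable_roles("CIBLE", ["var1", "var21"], ["var3", "var4", "var5"])
print(n_inputs)
```

With the real lists, the returned count can be compared with the "Number of Variables" value reported in the training output (195).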


In [9]:
cashost='frasepviya34.aws.sas.com'
casport=5570
sess = swat.CAS(cashost, casport)

# Load the needed action sets for this example:
sess.loadactionset('datapreprocess')
sess.loadactionset('sampling')
sess.loadactionset('decisiontree')
sess.loadactionset('astore')
sess.loadactionset('autotune')

NOTE: Added action set 'datapreprocess'.
NOTE: Added action set 'sampling'.
NOTE: Added action set 'decisiontree'.
NOTE: Added action set 'astore'.
NOTE: Added action set 'autotune'.
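
The host and port are hard-coded in the cell above. A small sketch of reading them from the environment instead is shown below; the variable names `MY_CAS_HOST` and `MY_CAS_PORT` are hypothetical, and the defaults simply repeat the values used in this notebook.

```python
import os


def cas_connection_settings(env=os.environ):
    """Return (host, port), falling back to the notebook's values."""
    host = env.get("MY_CAS_HOST", "frasepviya34.aws.sas.com")
    port = int(env.get("MY_CAS_PORT", "5570"))
    return host, port


# Demonstration with an explicit mapping instead of the real environment.
host, port = cas_connection_settings(
    {"MY_CAS_HOST": "cas.example.com", "MY_CAS_PORT": "5571"}
)
print(host, port)
# With SWAT installed, the connection would then be: sess = swat.CAS(host, port)
```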


In [4]:
# Retrieve the online help for the actions (grouped by action set) available through the API, including the action sets loaded above

sess.help()

NOTE: Available Action Sets and Actions:
NOTE:    accessControl
NOTE:       accessPersonalCaslibs - Provides administrative access to all personal caslibs (CASUSER and CASUSERHDFS)
NOTE:       assumeRole - Assumes a role
NOTE:       dropRole - Relinquishes a role
NOTE:       showRolesIn - Shows the currently active role
NOTE:       showRolesAllowed - Shows the roles that a user is a member of
NOTE:       isInRole - Shows whether a role is assumed
NOTE:       isAuthorized - Shows whether access is authorized
NOTE:       isAuthorizedActions - Shows whether access is authorized to actions
NOTE:       isAuthorizedTables - Shows whether access is authorized to tables
NOTE:       isAuthorizedColumns - Shows whether access is authorized to columns
NOTE:       listAllPrincipals - Lists all principals that have explicit access controls
NOTE:       whatIsEffective - Lists effective access and explanations (Origins)
NOTE:       whatCheckoutsExist - Lists checkouts held on an object, its parents, 

Unnamed: 0,name,description
0,accessPersonalCaslibs,Provides administrative access to all personal...
1,assumeRole,Assumes a role
2,dropRole,Relinquishes a role
3,showRolesIn,Shows the currently active role
4,showRolesAllowed,Shows the roles that a user is a member of
5,isInRole,Shows whether a role is assumed
6,isAuthorized,Shows whether access is authorized
7,isAuthorizedActions,Shows whether access is authorized to actions
8,isAuthorizedTables,Shows whether access is authorized to tables
9,isAuthorizedColumns,Shows whether access is authorized to columns

Unnamed: 0,name,description
0,download,Downloads a remote store to a local store
1,upload,Uploads a local store to a remote store
2,describe,Describes some of the contents of the analytic...
3,score,Uses an analytic store to score an input table

Unnamed: 0,name,description
0,tuneSvm,Automatically adjusts support vector machine p...
1,tuneForest,Automatically adjusts forest parameters to tun...
2,tuneDecisionTree,Automatically adjusts decision tree parameters...
3,tuneNeuralNet,Automatically adjusts neural network parameter...
4,tuneGradientBoostTree,Automatically adjusts gradient boosting tree p...
5,tuneFactMac,Automatically adjusts factorization machine pa...
6,tuneBnet,Automatically adjusts Bayesian network classif...

Unnamed: 0,name,description
0,addNode,Adds a machine to the server
1,removeNode,Remove one or more machines from the server
2,help,Shows the parameters for an action or lists al...
3,listNodes,Shows the host names used by the server
4,loadActionSet,Loads an action set for use in this session
5,installActionSet,Loads an action set in new sessions automatically
6,log,Shows and modifies logging levels
7,queryActionSet,Shows whether an action set is loaded
8,queryName,Checks whether a name is an action or action s...
9,reflect,Shows detailed parameter information for an ac...

Unnamed: 0,name,description
0,setServOpt,sets a server option
1,getServOpt,displays the value of a server option
2,listServOpts,Displays the server options and server values

Unnamed: 0,name,description
0,rustats,"Computes robust univariate statistics, central..."
1,impute,Performs data matrix (variable) imputation
2,outlier,Performs outlier detection and treatment
3,binning,Performs unsupervised variable discretization
4,discretize,Performs supervised and unsupervised variable ...
5,catTrans,Groups and encodes categorical variables using...
6,histogram,Generates histogram bins and simple bin-based ...
7,transform,"Performs pipelined variable imputation, outlie..."
8,kde,Computes kernel density estimation
9,highCardinality,Performs randomized cardinality estimation

Unnamed: 0,name,description
0,runCodeTable,Runs DATA step code stored in a CAS table
1,runCode,Runs DATA step code

Unnamed: 0,name,description
0,dtreeTrain,Trains a decision tree
1,dtreeScore,Scores a table using a decision tree model
2,dtreeSplit,Splits decision tree nodes
3,dtreePrune,Prune a decision tree
4,dtreeMerge,Merges decision tree nodes
5,dtreeCode,Generates DATA step scoring code from a decisi...
6,forestTrain,Trains a forest
7,forestScore,Scores a table using a forest model
8,forestCode,Generates DATA step scoring code from a forest...
9,gbtreeTrain,Trains a gradient boosting tree

Unnamed: 0,name,description
0,percentile,Calculate quantiles and percentiles
1,boxPlot,"Calculate quantiles, high and low whiskers, an..."
2,assess,Assess and compare models

Unnamed: 0,name,description
0,srs,Samples a proportion of data from the input t...
1,stratified,Samples a proportion of data or partitions the...
2,oversample,Samples a user-specified proportion of data fr...
3,kfold,K-fold partitioning.

Unnamed: 0,name,description
0,searchIndex,Searches for a query against an index and retr...
1,searchAggregate,Aggregates certain fields in a table that is u...
2,valueCount,value count for multiple fields
3,buildIndex,Creates an empty index using a schema (the fir...
4,getSchema,Gets the schema of an index
5,appendIndex,Loads data to an index after the buildIndex ac...
6,deleteDocuments,Deletes a portion of documents from an index

Unnamed: 0,name,description
0,listSessions,Displays a list of the sessions on the server
1,addNodeStatus,Lists details about machines currently being a...
2,timeout,Changes the time-out for a session
3,actionstatus,Get action status for a session
4,endSession,Ends the current session
5,sessionId,Displays the name and UUID of the current session
6,sessionName,Changes the name of the current session
7,sessionStatus,Displays the status of the current session
8,listresults,Lists the saved results for a session
9,batchresults,Change current action to batch results

Unnamed: 0,name,description
0,setSessOpt,Sets a session option
1,getSessOpt,Displays the value of a session option
2,listSessOpts,Displays the session options and session values
3,addFmtLib,Adds a format library
4,listFmtLibs,Lists the format libraries that are associated...
5,setFmtSearch,Sets the format libraries to search
6,listFmtSearch,Shows the format library search order
7,dropFmtLib,Drops a format library from global scope for a...
8,deleteFormat,Deletes a format from a format library
9,addFormat,Adds a format to a format library

Unnamed: 0,name,description
0,mdSummary,Calculates multidimensional summaries of numer...
1,numRows,Shows the number of rows in a Cloud Analytic S...
2,summary,Generates descriptive statistics of numeric va...
3,correlation,Computes Pearson product-moment correlations.
4,regression,Performs a linear regression up to 3rd-order p...
5,crossTab,Performs one-way or two-way tabulations
6,distinct,Computes the distinct number of values of the ...
7,topK,Returns the top-K and bottom-K distinct values...
8,groupBy,Builds BY groups in terms of the variable valu...
9,freq,Generates a frequency distribution for one or ...

Unnamed: 0,name,description
0,view,Creates a view from files or tables
1,attribute,Manages extended table attributes
2,upload,Transfers binary data to the server to create ...
3,loadTable,Loads a table from a caslib's data source
4,tableExists,Checks whether a table has been loaded
5,index,Create indexes on one or more table variables
6,columnInfo,Shows column information
7,fetch,Fetches rows from a table or view
8,save,Saves a table to a caslib's data source
9,addTable,Add a table by sending it from the client to t...


In [11]:
# If the table is not already loaded, a copy can also be loaded into the private session
out = sess.loadtable(indata+'.sas7bdat', caslib='MYDATA', casout=dict(name=indata, caslib='MYDATA'))

# Reference the CAS table that is already loaded in memory and accessible to Python/SAS/R users
casindata = sess.CASTable('TRAIN_SAMPLE_RETAIL_BANKING', caslib='MYDATA')

NOTE: Cloud Analytic Services made the file train_sample_retail_banking.sas7bdat available as table TRAIN_SAMPLE_RETAIL_BANKING in caslib MYDATA.
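
The "load only if needed" idea from the comment above can be sketched as a small helper. It assumes that the `tableexists` action returns a result with an `exists` field equal to 0 when the table is absent; the helper name and its arguments are hypothetical.

```python
def ensure_table_loaded(sess, table, caslib, source_file):
    """Load source_file into caslib as table unless it is already in memory.

    Returns True if a load was triggered, False if the table already existed.
    Assumption: sess.tableexists(...) returns a mapping whose 'exists' entry
    is 0 when the table is not loaded.
    """
    info = sess.tableexists(name=table, caslib=caslib)
    if info["exists"] == 0:
        sess.loadtable(source_file, caslib=caslib,
                       casout=dict(name=table, caslib=caslib))
        return True
    return False
```

In this notebook the call would look like `ensure_table_loaded(sess, indata, 'MYDATA', indata + '.sas7bdat')`.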


In [6]:
# Retrieve the help for the action that performs stratified sampling

sess.help(action="stratified")

NOTE: Information for action 'sampling.stratified':
NOTE: The following parameters are accepted.  Default values are shown.
NOTE:    list table={
NOTE:       specifies the input data table.
NOTE:       string name=NULL (required),
NOTE:       specifies the name of the table to use.
NOTE:       string caslib=NULL,
NOTE:       specifies the caslib that contains the table that you want to use with the action. By default, the active caslib is used. Specify a value only if you need to access a table from a different caslib.
NOTE:       string where=NULL,
NOTE:       specifies an expression for subsetting the input data.
NOTE:       array of groupBy={
NOTE:       specifies the names of the variables to use for grouping results.
NOTE:          string name=NULL (required),
NOTE:       specifies the name for the variable.
NOTE:          string label=NULL,
NOTE:       specifies the descriptive label for the variable.
NOTE:          int32 formattedLength=0,
NOTE:       specifies the length of for

In [12]:
# Partition the data on the CAS server and create the partitioned table in memory

casindata_part = sess.CASTable('train_sample_retail_banking_part', replace=True)

casindata.groupby(by='CIBLE').sampling.stratified(
  output=dict(casout='train_sample_retail_banking_part', copyvars='all'),
  samppct=75,
  partind=True
)

NOTE: Using SEED=1361439120 for sampling.


Unnamed: 0,ByGrpID,CIBLE,NObs,NSamp
0,0,0,517797,388348
1,1,1,76488,57366

Unnamed: 0,casLib,Name,Label,Rows,Columns,casTable
0,CASUSER(viyademo01),train_sample_retail_banking_part,,594285,214,"CASTable('train_sample_retail_banking_part', c..."
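
The figures in the output above can be checked with a few lines of arithmetic: each stratum should be sampled at the requested 75%, and the stratum sizes should add up to the row count of the partitioned table.

```python
# (NObs, NSamp) per value of CIBLE, copied from the output above.
strata = {0: (517797, 388348), 1: (76488, 57366)}

fractions = {}
for cible, (n_obs, n_samp) in strata.items():
    fractions[cible] = n_samp / n_obs
    print(f"CIBLE={cible}: sampled fraction = {fractions[cible]:.4f}")

# Total number of rows across both strata.
total_obs = sum(n_obs for n_obs, _ in strata.values())
print(f"total rows = {total_obs}")
```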


In [13]:
# Create a view on the training data
casindata_part_1 = casindata_part.query('_partind_ = 1')


In [14]:
# Train the random forest model with 400 trees and a maximum depth of 13.
# The other parameters are set to their default values

forest_model = sess.CASTable('retail_banking_forest_model', replace=True)

casindata_part_1.decisiontree.foresttrain(
    inputs=all_inputs,  
    nominals=class_vars,
    prune="FALSE",
    alpha=0,
    missing='USEINSEARCH',
    leafsize=5,
    oob="TRUE",
    minuseinsearch=1,
    maxBranch=2,
    target='CIBLE',  
    ntree=400,  
    maxlevel=13,  
    crit='GAINRATIO',
    vote='PROB',
    nbins=20,
    bootstrap=0.6,
    savestate=dict(name='forest_model_state', compress='true', replace='true') # save the model state for deployment
)

NOTE: Wrote 37282860 bytes to the savestate file forest_model_state.


Unnamed: 0,Descr,Value
0,Number of Trees,400.0
1,Number of Selected Variables (M),14.0
2,Random Number Seed,0.0
3,Bootstrap Percentage (%),60.0
4,Number of Bins,20.0
5,Number of Variables,195.0
6,Confidence Level for Pruning,0.25
7,Max Number of Tree Nodes,315.0
8,Min Number of Tree Nodes,33.0
9,Max Number of Branches,2.0
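
The training call does not set the number of variables tried at each split, yet the table reports M = 14 for 195 variables. As an observation rather than a statement about the server's actual default, this is what the common square-root heuristic gives:

```python
import math

n_variables = 195  # "Number of Variables" in the table above
m_sqrt_rule = math.ceil(math.sqrt(n_variables))

print(f"sqrt({n_variables}) = {math.sqrt(n_variables):.2f}, rounded up to {m_sqrt_rule}")
```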

Unnamed: 0,TreeID,Trees,NLeaves,MCR,LogLoss,ASE,RASE,MAXAE
0,0.0,1.0,104.0,0.127918,0.362312,0.105695,0.325107,1.000000
1,1.0,2.0,147.0,0.128169,0.361195,0.105731,0.325163,1.000000
2,2.0,3.0,211.0,0.128372,0.357686,0.105141,0.324255,1.000000
3,3.0,4.0,286.0,0.127886,0.353683,0.104307,0.322965,1.000000
4,4.0,5.0,364.0,0.128141,0.351819,0.104105,0.322653,1.000000
5,5.0,6.0,415.0,0.128328,0.351155,0.104098,0.322643,1.000000
6,6.0,7.0,455.0,0.128277,0.350309,0.103960,0.322428,0.981983
7,7.0,8.0,484.0,0.128436,0.350134,0.103994,0.322480,0.981983
8,8.0,9.0,585.0,0.128489,0.349813,0.103929,0.322381,1.000000
9,9.0,10.0,618.0,0.128465,0.350794,0.104096,0.322638,0.980395
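
To read the MCR column in context, it helps to compare it with the error of a naive model that always predicts the majority class. The sketch below uses the class counts printed by the sampling step and the MCR of the last row shown above; only the first 10 of the 400 trees are visible here, so the comparison is indicative only.

```python
# Class counts for CIBLE, taken from the stratified sampling output.
n_class0, n_class1 = 517797, 76488

# Error rate of always predicting the majority class.
baseline_mcr = min(n_class0, n_class1) / (n_class0 + n_class1)

# MCR reported in the last visible row (10 trees).
forest_mcr_10_trees = 0.128465

gain = baseline_mcr - forest_mcr_10_trees
print(f"baseline MCR = {baseline_mcr:.4f}")
print(f"forest MCR   = {forest_mcr_10_trees:.4f}")
print(f"difference   = {gain:.6f}")
```

Because the classes are imbalanced, the misclassification rate alone says little here; metrics from the `assess` action listed in the help output may be more informative.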


In [15]:
# Train the random forest model with autotuning (automatic hyperparameter optimization by SAS)

forest_model_autotune = sess.CASTable('retail_banking_forest_model_autotune', replace=True)

result = sess.autotune.tuneForest(
    trainOptions=dict(
        table=dict(name="train_sample_retail_banking_part", where="_partind_ = 1"),
        inputs=all_inputs,
        target="CIBLE",
        nominals=class_vars,
        casOut=dict(name="retail_banking_forest_model_autotune", replace="True"),
        savestate=dict(name='forest_model_state_autotune', compress='true', replace='true')
    ),
    tunerOptions=dict(seed=54321)
)


NOTE: Autotune is started for 'Forest' model.
NOTE: Autotune option SEARCHMETHOD='GA'.
NOTE: Autotune option MAXTIME=36000 (sec.).
NOTE: Autotune option SEED=54321.
NOTE: Autotune objective is 'Misclassification Error Percentage'.
NOTE: Autotune number of parallel evaluations is set to 4, each using 0 worker nodes.
         Iteration       Evals     Best Objective  Elapsed Time
                 0           1             12.783         44.68
                 1          19             12.495        620.32
                 2          35              12.48       1384.60
                 3          50             12.465       1955.00
                 4          65             12.465       2413.32
                 5          79             12.459       2927.83
NOTE: Data was partitioned during tuning, to tune based on validation score; the final model is trained and scored on all data.
NOTE: Wrote 16501732 bytes to the savestate file forest_model_state_autotune.
NOTE: Autotune time is 2973.0
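
The iteration log above can be summarized to see how much each iteration improved the objective relative to the time it cost. The rows are copied from the log; the per-iteration breakdown is only an illustration of how to read it.

```python
# (iteration, evaluations, best objective in %, elapsed seconds), from the log above.
history = [
    (0, 1, 12.783, 44.68),
    (1, 19, 12.495, 620.32),
    (2, 35, 12.480, 1384.60),
    (3, 50, 12.465, 1955.00),
    (4, 65, 12.465, 2413.32),
    (5, 79, 12.459, 2927.83),
]

previous = history[0]
for current in history[1:]:
    gain = previous[2] - current[2]        # reduction in error (percentage points)
    extra_time = current[3] - previous[3]  # seconds spent on this iteration
    print(f"iteration {current[0]}: -{gain:.3f} points in {extra_time:.0f} s")
    previous = current

total_gain = history[0][2] - history[-1][2]
print(f"total improvement: {total_gain:.3f} points")
```

Most of the improvement comes from the first iteration; the later ones add little for a comparable cost.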

In [16]:
# The following commands display the tables that are produced by this action call:

print(result.TunerInfo)
print(result.TunerResults)
print(result.IterationHistory)
print(result.EvaluationHistory)
print(result.BestConfiguration)
print(result.TunerSummary)
print(result.TunerTiming)
print(result.TunerCasOutputTables)

Tuner Information

                           Parameter                               Value
0                         Model Type                              Forest
1           Tuner Objective Function  Misclassification Error Percentage
2                      Search Method                                  GA
3                    Population Size                                  10
4                 Maximum Iterations                                   5
5     Maximum Tuning Time in Seconds                               36000
6                    Validation Type                    Single Partition
7      Validation Partition Fraction                                0.30
8                          Log Level                                   2
9                               Seed                               54321
10    Number of Parallel Evaluations                                   4
11  Number of Workers per Subsession                                   0
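
The tuner settings listed above (population size, maximum iterations, maximum tuning time) were left at the values shown, apart from the seed. A hedged sketch of building a `tunerOptions` dictionary with an explicit time budget follows; the option names `maxTime`, `maxIters`, `popSize` and `searchMethod` are assumed to match the settings in this table and should be verified with `sess.help(action="tuneForest")`.

```python
def tuner_options(seed, max_time_s=36000, max_iters=5, pop_size=10, search_method="GA"):
    """Build a tunerOptions dictionary; defaults mirror the table above."""
    if max_time_s <= 0:
        raise ValueError("max_time_s must be positive")
    return dict(
        seed=seed,
        maxTime=max_time_s,      # assumed option name
        maxIters=max_iters,      # assumed option name
        popSize=pop_size,        # assumed option name
        searchMethod=search_method,
    )


# Example: same seed as in the notebook, but with a 15-minute budget.
opts = tuner_options(54321, max_time_s=900)
print(opts)
```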
Tuner Results

    Evaluation  N

In [18]:
# Serialize the selected model to disk for deployment on the production batch server

rfstore = sess.astore.download(rstore='forest_model_state')

path = '/opt/shared/astore/'
file_name = os.path.join(path, indata+'.sasast')
with open(file_name, 'wb') as file:
    file.write(rfstore['blob'])
    
rfstoretuned = sess.astore.download(rstore='forest_model_state_autotune')
file_name = os.path.join(path, indata+'_tuned.sasast')
with open(file_name, 'wb') as file:
    file.write(rfstoretuned['blob'])
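
The two write steps above follow the same pattern, which could be factored into a helper that also creates the target directory and checks the size of the written file. The helper name is hypothetical and the demonstration writes a few dummy bytes to a temporary directory rather than a real astore.

```python
import os
import tempfile


def write_astore(blob, directory, file_name):
    """Write the binary blob to directory/file_name and return the full path."""
    os.makedirs(directory, exist_ok=True)
    target = os.path.join(directory, file_name)
    with open(target, "wb") as handle:
        handle.write(blob)
    written = os.path.getsize(target)
    if written != len(blob):
        raise IOError(f"expected {len(blob)} bytes, found {written}")
    return target


# Demonstration with dummy bytes in a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_path = write_astore(b"\x00\x01\x02", demo_dir, "demo.sasast")
print(demo_path, os.path.getsize(demo_path))
```

In the notebook, the call would be along the lines of `write_astore(rfstore['blob'], path, indata + '.sasast')`.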

In [19]:
# This is the same as sess.endsession(); sess.close();
sess.terminate()