This MLHub package contains a decision tree model from the Rattle package for R. It is used in the Rattle book to demonstrate a classification use case in financial auditing. A sample dataset of audit outcomes is used to train the model to predict the outcome of audits. A successful audit identifies missing or incorrectly reported financial data.
The classification (decision) tree model is built using a recursive partitioning algorithm. Decision trees are recognised as an easily understood representation of the discovered knowledge and are popular in situations where insight and explanation are important.
Visit the GitHub repository for more details: https://github.com/gjwgit/audit
-
To install mlhub:
$ pip3 install mlhub
-
To install and configure the pre-built model, then run the demonstration:
$ ml install audit
$ ml configure audit
$ ml demo audit
=========================
Audit Decision Tree Model
=========================
A common machine learning task is classification where we classify people,
for example, into two classes. A decision tree model can be trained to
predict whether a person belongs to one class or the other. In this MLHub
package a pre-built decision tree model is loaded to predict the likely
outcome of a financial audit of a tax payer, as an example.
Press Enter to continue:
=================================
Textual Presentation of the Model
=================================
The textual presentation of the model is the default output from the R package
for decision trees. It begins with a record of the number of observations
used to build the model (n=). The following two lines of text are a legend
to assist with the interpretation of the tree.
n= 2000
node), split, n, loss, yval, (yprob)
* denotes terminal node
 1) root 2000 463 no (0.76850000 0.23150000)
   2) married=no 1083 61 no (0.94367498 0.05632502) *
   3) married=yes 917 402 no (0.56161396 0.43838604)
     6) occupation=labourer 396 92 no (0.76767677 0.23232323)
      12) deductions< 1708 389 85 no (0.78149100 0.21850900) *
      13) deductions>=1708 7 0 yes (0.00000000 1.00000000) *
     7) occupation=office 521 211 yes (0.40499040 0.59500960)
      14) education=secondary 155 61 no (0.60645161 0.39354839)
        28) age< 37.5 50 10 no (0.80000000 0.20000000) *
        29) age>=37.5 105 51 no (0.51428571 0.48571429)
          58) hours< 37 22 3 no (0.86363636 0.13636364) *
          59) hours>=37 83 35 yes (0.42168675 0.57831325) *
      15) education=tertiary 366 117 yes (0.31967213 0.68032787)
        30) age< 31.5 53 23 no (0.56603774 0.43396226) *
        31) age>=31.5 313 87 yes (0.27795527 0.72204473) *
Press Enter to continue:
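The listing above can be read as a chain of nested tests. The following Python sketch hand-transcribes the splits and leaf probabilities from that listing to make the reading concrete. It is not part of the audit package, whose model is built in R. The function name predict_audit and the dictionary-based record are illustrative assumptions. The sketch also assumes that any occupation other than labourer follows the office branch and any education other than secondary follows the tertiary branch.

```python
def predict_audit(case):
    """Walk the printed tree and return (class, probability of 'yes').

    `case` is a dict with the keys married, occupation, education,
    age, hours and deductions (names assumed from the listing).
    """
    if case["married"] == "no":
        return "no", 0.056              # node 2
    if case["occupation"] == "labourer":
        if case["deductions"] < 1708:
            return "no", 0.219          # node 12
        return "yes", 1.000             # node 13
    # Remaining cases follow the occupation=office branch (node 7).
    if case["education"] == "secondary":
        if case["age"] < 37.5:
            return "no", 0.200          # node 28
        if case["hours"] < 37:
            return "no", 0.136          # node 58
        return "yes", 0.578             # node 59
    # Remaining cases follow the education=tertiary branch (node 15).
    if case["age"] < 31.5:
        return "no", 0.434              # node 30
    return "yes", 0.722                 # node 31


# A hypothetical taxpayer, for illustration only.
example = {"married": "yes", "occupation": "office", "education": "tertiary",
           "age": 45, "hours": 40, "deductions": 0}
print(predict_audit(example))  # ('yes', 0.722)
```

Each return statement corresponds to a terminal node (marked * in the listing), and the second value is the yes entry of that node's (yprob) pair, rounded to three places.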
=============
Decision Tree
=============
A visual representation of a model can often be more insightful than the
textual representation. For a decision tree model, representing the
discovered knowledge as a decision tree, we read the tree from top to
bottom, traversing the path corresponding to the answer to the question
presented at each node. The leaf node has the final decision together with
the class probabilities.
Press Enter to display the decision tree:
Close the graphic window using Ctrl-w.
Press Enter to continue:
===================
Variable Importance
===================
An understanding of the relative importance of each of the variables
adds further insight into the data. The actual numeric values mean little
but the relativities are significant.
Press Enter to display the plot:
Close the graphic window using Ctrl-w.
Press Enter to continue:
=====================
Predict Audit Outcome
=====================
We can use this model to predict the outcome of an audit. Below we show the
predictions after applying the pre-built decision tree model to a random
subset of a dataset of previously unseen audit case outcomes. This provides
an insight into the expected future performance of the model when we decide
to deploy the model into a production system.
   Actual Predicted Error
1     yes       yes
2      no        no
3      no        no
4     yes        no <----
5      no        no
6      no        no
7      no        no
8     yes        no <----
9     yes        no <----
10     no        no
11    yes        no <----
12    yes       yes
Press Enter to continue:
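The Error column in the sample above marks each row where the prediction differs from the actual outcome. As a small sketch, the following Python reproduces those flags from the twelve rows shown. The variable names are illustrative and the lists are copied by hand from the table.

```python
# Actual and predicted outcomes for the twelve sample rows shown above.
actual    = ["yes", "no", "no", "yes", "no", "no",
             "no", "yes", "yes", "no", "yes", "yes"]
predicted = ["yes", "no", "no", "no", "no", "no",
             "no", "no", "no", "no", "no", "yes"]

# Row numbers (1-based) where the prediction disagrees with the outcome.
errors = [row for row, (a, p) in enumerate(zip(actual, predicted), start=1)
          if a != p]
print(errors)  # [4, 8, 9, 11]

# Share of the sample that was misclassified.
sample_error = len(errors) / len(actual)
print(f"{sample_error:.0%} of the sample misclassified")
```

In this small sample every error is a missed yes case, which matches the higher error the model shows on the yes class overall.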
================
Confusion Matrix
================
A confusion matrix summarises the performance of the model on this evaluation
dataset. All figures in the table are percentages and are calculated across
the predictions made by the model for each observation and compared to the
actual or known values of the target variable. The first column reports the
true negative and false negative rates whilst the second column reports the
false positive and true positive rates.
The Error column calculates the error across each class. We also report the
overall error which is calculated as the number of errors over the number of
observations. The average of the class errors is also reported.
          Predicted
Actual      no   yes  Error
    no    72.8   5.8    7.4
    yes    8.5  12.8   39.8
Overall error: 14%
Average class error: 24%
Press Enter to continue:
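The error figures can be recomputed from the four cells of the table. The sketch below does so in Python, using the rounded percentages as printed. The names are illustrative. Because the inputs are already rounded, the yes class error comes out near 39.9 rather than the reported 39.8; presumably the demo computes from the underlying counts.

```python
# Cell values are percentages of all observations, copied from the table.
# Outer key: actual class. Inner key: predicted class.
matrix = {
    "no":  {"no": 72.8, "yes": 5.8},
    "yes": {"no": 8.5,  "yes": 12.8},
}

def class_error(actual):
    """Percentage of the `actual` class that was predicted as another class."""
    row = matrix[actual]
    wrong = sum(v for predicted, v in row.items() if predicted != actual)
    return 100 * wrong / sum(row.values())

class_errors = {label: class_error(label) for label in matrix}

# Overall error: all misclassified observations over all observations.
total = sum(v for row in matrix.values() for v in row.values())
wrong = sum(v for actual, row in matrix.items()
            for predicted, v in row.items() if predicted != actual)
overall_error = 100 * wrong / total

# Average class error: the unweighted mean of the per-class errors.
average_class_error = sum(class_errors.values()) / len(class_errors)

print({k: round(v, 1) for k, v in class_errors.items()})
print(round(overall_error), round(average_class_error))
```

The gap between the overall error and the average class error reflects the class imbalance: the no class dominates the data and is predicted well, so it pulls the overall figure down.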
==========
Risk Chart
==========
A risk chart presents a cumulative performance view of the model.
The x-axis is the percentage of caseload as we progress (left to right)
through cases from the highest probability of an adjustment being made to
the financial data to the lowest probability of an adjustment.
The y-axis is the expected performance of the model in selecting customers
to audit. It is the percentage of the known positive outcomes that are
predicted by the model for the given caseload (the recall).
To deploy the model the decision maker will trade recall against caseload
in accordance with available auditing resources and risk tolerance.
The more area under the curve (both adjusted and adjustment) the better
the model performance. A perfect model would follow the grey line (for
adjusted) and the pink line (for adjustment). The Precision line represents
the lift offered by the model, with the lift values on the right hand axis.
Press Enter to display the risk chart:
Close the graphic window using Ctrl-w.
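The recall curve described above can be sketched in a few lines of Python: rank the cases from highest to lowest predicted probability, then track what share of the known positive outcomes has been found at each caseload. This is an illustration of the idea only, not the package's plotting code. The function name and the sample scores and outcomes are hypothetical. It covers only the count of adjustments, not the second curve for the adjusted amounts.

```python
def recall_by_caseload(scores, outcomes):
    """Return (caseload %, recall %) pairs, highest-scoring cases first."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(1 for _, outcome in ranked if outcome == "yes")
    points, found = [], 0
    for n, (_, outcome) in enumerate(ranked, start=1):
        if outcome == "yes":
            found += 1
        points.append((100 * n / len(ranked), 100 * found / total_positives))
    return points


# Hypothetical predicted probabilities and known outcomes, for illustration.
scores   = [0.72, 0.58, 0.43, 0.22, 0.20, 0.14, 0.06, 0.06]
outcomes = ["yes", "yes", "no", "yes", "no", "no", "no", "no"]

points = recall_by_caseload(scores, outcomes)
for caseload, recall in points:
    print(f"{caseload:5.1f}% of cases -> {recall:5.1f}% of adjustments found")
```

With this made-up data, auditing the top half of the ranked cases already finds every adjustment, which is the kind of trade-off between caseload and recall that the chart is meant to expose.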
Thank you for exploring the 'audit' package.
A few further commands are in development and available from the GitHub repository.
-
To score your own dataset of cases:
$ ml score audit <data.csv>
-
To rebuild the model using your own dataset in a CSV file that has a column named TARGET:
$ ml train audit <data.csv>
The other columns of the file do not matter: the command will build a decision tree to predict TARGET.
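Since the only stated requirement is a column named TARGET, it can be worth checking the header before training. The following Python sketch is one way to do that. The function name is hypothetical, the sample data is made up, and it assumes the column name must match exactly, including case.

```python
import csv
import io

def has_target_column(stream, name="TARGET"):
    """Return True if the CSV header read from `stream` contains `name`."""
    header = next(csv.reader(stream), [])
    return name in header


# Hypothetical data for illustration. For a real file, pass an open handle:
#   with open("data.csv", newline="") as f:
#       has_target_column(f)
sample = io.StringIO("age,hours,TARGET\n38,40,no\n52,45,yes\n")
print(has_target_column(sample))  # True
```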
End-to-End Data Science: Audit Use Case