## Machine Learning Demo

This notebook demonstrates how to train the model and then load the saved model from the `model/` directory to run inference.

First we need to import the dependencies:

In [1]:
from keras.models import load_model
import csv
import numpy as np
from keras.utils import to_categorical

Using TensorFlow backend.


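Each violation category is assigned a distinct integer index, as shown below. The same cell also defines the per-feature offsets (`minval`) and divisors (`denom`) used later for normalization: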

In [2]:
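# map each violation category (as it appears in the CSV) to a class index
violation = {
    '采购策略问题': 0,  # procurement strategy problem
    '串标': 1,  # bid rigging
    '虚假业务': 2,  # fictitious business
    '收受回扣': 3,  # accepting kickbacks
    '流程违规': 4,  # process violation
    '成本偏高': 5,  # excessive cost
}

# per-feature offset and divisor used by read_data to normalize the 7 input features
minval = [1.0, 0, 3401001, -100000000, 101, 1000, -100000000]
denom = [1000052162559.0, 88717952, 1100000, 200000000, 300, 2780, 200000000]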
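
A label may list several categories separated by a full-width semicolon (`；`); `read_data`, defined further down, keeps only the first one. A minimal sketch of that lookup follows. The helper name `label_to_index` and the two-category sample label are illustrative assumptions, not part of the original code or data:

```python
# Same mapping as in the cell above (category name -> class index).
violation = {
    '采购策略问题': 0,
    '串标': 1,
    '虚假业务': 2,
    '收受回扣': 3,
    '流程违规': 4,
    '成本偏高': 5,
}

def label_to_index(label):
    """Return the class index of the first category in a '；'-separated label."""
    first = label.split('；')[0]
    return violation[first]

single = label_to_index('串标')                 # one category
combined = label_to_index('收受回扣；流程违规')   # made-up label; only the first part counts
print(single, combined)
```

A label that is not a key of `violation` raises `KeyError`, which makes unexpected categories in the CSV easy to spot.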

The helper function below reads CSV data into `numpy` arrays:

In [3]:
def read_data(fname, mode='binary'):
    # open the file and read its contents as a list of rows
    with open(fname, 'r') as f:
        raw_data = np.array(list(csv.reader(f, delimiter=',', quotechar='"')))

    # drop the header row and first column, then remove column 7 of what remains
    str_data = np.delete(raw_data[1:, 1:], [7], axis=1)

    # replace the label in column 5 with a numeric class index
    if mode == 'binary':
        for i in range(str_data.shape[0]):
            str_data[i, 5] = 0 if str_data[i, 5] == 'normal' else 1
    else:
        for i in range(str_data.shape[0]):
            # when several categories are listed, use the first one
            str_data[i, 5] = violation[str_data[i, 5].split('；')[0]]

    data = str_data.astype('float32')

    # y_data is the ground truth
    y_data = data[:, 5]
    data = np.delete(data, [5], axis=1)

    # x_data is the vector to feed the model
    x_data = np.empty(data.shape)

    # normalize each feature column: (value - minval) / denom
    for i in range(data.shape[1]):
        min_arr = np.full((data.shape[0],), minval[i])
        x_data[:, i] = (data[:, i] - min_arr) / denom[i]

    return x_data, y_data
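
The normalization loop in `read_data` computes `(x - minval) / denom` one column at a time. As a sketch, the same arithmetic can be written in one step with NumPy broadcasting. The helper name `normalize` and the two sample rows are assumptions made for illustration, and this version works in `float64` throughout rather than starting from `float32` data:

```python
import numpy as np

minval = np.array([1.0, 0, 3401001, -100000000, 101, 1000, -100000000])
denom = np.array([1000052162559.0, 88717952, 1100000, 200000000, 300, 2780, 200000000])

def normalize(data):
    """Column-wise (x - minval) / denom, broadcast over all rows at once."""
    return (np.asarray(data, dtype='float64') - minval) / denom

# Two illustrative rows: one equal to the offsets, one equal to offset + divisor.
sample = np.stack([minval, minval + denom])
normalized = normalize(sample)
print(normalized)  # first row is all zeros, second row is all ones
```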

#### Training

Remove any previously saved models and confirm that the `model/` directory is empty:

In [4]:
!rm -f model/*; ls model/

Train the binary classifier for 20 epochs and save the trained model:

In [5]:
!python mlp.py --binary --save --epochs=20

Using TensorFlow backend.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 512)               4096      
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 512)               262656    
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 1026      
Total params: 267,778.0
Trainable params: 267,778.0
Non-trainable params: 0.0
_________________________________________________________________
Epoch 1/20
2017-06-23 15:48:17.376110: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlo
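
The parameter counts in the summary follow from the layer sizes: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, and dropout layers add none. The short check below reproduces the numbers, assuming 7 input features (the length of `minval`); `dense_params` is a hypothetical helper written for this illustration:

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# (inputs, units) for dense_1, dense_2, dense_3 of the binary model
layers = [(7, 512), (512, 512), (512, 2)]
counts = [dense_params(n_in, n_out) for n_in, n_out in layers]
total = sum(counts)
print(counts, total)  # [4096, 262656, 1026] 267778
```

With 6 output units instead of 2, the last layer has 3,078 parameters, which matches the multiclass summary later in this notebook.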

#### Inference

Now load the saved binary classifier, `model/binary.h5`:

In [6]:
model_b = load_model('model/binary.h5')

Read the test CSV data, `data/test_b.csv`:

In [15]:
x_data_b, y_data_b = read_data('data/test_b.csv')

Select the first 100 rows, indexed by `col`, and run inference on them with the loaded model:

In [8]:
col = range(100)  # can change indices
y_predict_b = np.argmax(model_b.predict(x_data_b[col]), axis=1)
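
`model.predict` returns one row of class scores per input, and `np.argmax(..., axis=1)` picks the index of the highest score in each row. A small illustration, using made-up scores in place of real model output:

```python
import numpy as np

# Made-up scores for 3 samples and 2 classes (not real model output).
scores = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.4, 0.6],
])

# axis=1 takes the argmax across classes, giving one label per sample.
labels = np.argmax(scores, axis=1)
print(labels)  # [0 1 1]
```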

Most of the predictions match the ground truth:

In [9]:
print('Inferenced:\t', y_predict_b)
print('Ground truth:\t', y_data_b[col].astype('int'))

Inferenced:	 [0 0 0 1 1 0 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1 0 1 0 1 0 1 0 0 1 1 1 0 1 0 1 1
 1 1 1 0 0 0 1 1 1 0 1 1 0 1 1 1 1 0 0 1 1 0 1 1 0 0 1 1 1 1 0 1 0 1 1 0 1
 0 1 1 1 1 0 1 1 0 1 0 1 0 0 1 1 1 0 1 0 1 0 0 1 1 1]
Ground truth:	 [0 0 0 1 1 0 0 0 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 0 0 0 1 1 0 1 0 1 1
 1 1 1 0 0 0 1 1 1 0 1 0 0 0 1 1 1 0 0 1 1 0 1 1 0 0 0 0 1 1 0 1 0 1 0 0 1
 0 1 0 0 1 0 1 1 0 1 0 1 0 0 1 1 1 0 0 0 1 0 0 0 1 1]


In [10]:
# fraction of predictions that equal the ground truth
accuracy = np.mean(y_predict_b == y_data_b[col])

print('accuracy:', accuracy)

accuracy: 0.86
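
Accuracy alone does not show which kind of error the classifier makes. One way to break the errors down is a confusion matrix. The sketch below builds one with NumPy on a few made-up labels; the helper `confusion_matrix` is written here for illustration and is not part of the original code:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t, p] counts samples whose true class is t and predicted class is p."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[int(t), int(p)] += 1
    return cm

# Made-up labels, only to show the shape of the result.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_true, y_pred, n_classes=2)
print(cm)
print('accuracy:', np.trace(cm) / cm.sum())  # correct predictions lie on the diagonal
```

In this notebook, it could be called as `confusion_matrix(y_data_b[col], y_predict_b, 2)`.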


#### Multiclass

The multiclass classifier distinguishes between the different categories of violation. Training and inference follow the same steps as for the binary classifier. First, the training:

In [11]:
!python mlp.py --multiclass --save --epochs=20

Using TensorFlow backend.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 512)               4096      
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 512)               262656    
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 6)                 3078      
Total params: 269,830.0
Trainable params: 269,830.0
Non-trainable params: 0.0
_________________________________________________________________
Epoch 1/20
2017-06-23 15:48:32.751403: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlo

And below is the inference:

In [12]:
model_m = load_model('model/multi.h5')
x_data_m, y_data_m = read_data('data/test_m.csv', mode='multi')

col = range(100)  # can change indices
y_predict_m = np.argmax(model_m.predict(x_data_m[col]), axis=1)

Again, nearly all of the predictions match the ground truth:

In [13]:
print('Inferenced:\t', y_predict_m)
print('Ground truth:\t', y_data_m[col].astype('int'))

Inferenced:	 [3 1 3 0 3 3 3 1 3 3 0 3 1 1 0 3 0 1 1 1 1 3 3 1 0 0 3 1 1 3 1 3 1 0 1 1 1
 1 0 1 3 1 1 0 0 0 0 1 1 1 1 1 3 0 1 0 3 1 1 3 3 0 1 1 1 1 1 3 1 0 1 1 0 3
 1 0 1 1 1 1 1 1 3 0 1 1 1 0 3 1 1 3 0 3 1 1 3 3 1 1]
Ground truth:	 [3 1 3 0 3 3 3 1 3 3 0 3 1 1 3 3 0 1 1 1 1 3 3 1 0 0 3 1 1 3 1 3 1 0 1 1 1
 1 0 1 3 1 1 0 0 0 0 1 1 1 1 1 3 0 1 0 3 1 1 3 3 0 1 1 1 1 1 3 1 0 1 1 0 3
 1 0 1 1 1 1 1 1 3 0 1 1 1 0 3 1 1 3 0 3 1 1 3 3 1 1]


In [14]:
# fraction of predictions that equal the ground truth
accuracy = np.mean(y_predict_m == y_data_m[col])

print('accuracy:', accuracy)

accuracy: 0.99
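
To make the predicted indices readable, they can be mapped back to category names by inverting the `violation` dictionary. A minimal sketch, in which `index_to_violation` is a hypothetical name and the short prediction list is made up:

```python
# Same mapping as defined earlier in the notebook (category name -> class index).
violation = {
    '采购策略问题': 0,
    '串标': 1,
    '虚假业务': 2,
    '收受回扣': 3,
    '流程违规': 4,
    '成本偏高': 5,
}

# Invert the mapping: class index -> category name.
index_to_violation = {idx: name for name, idx in violation.items()}

predicted = [3, 1, 0]  # made-up class indices
names = [index_to_violation[i] for i in predicted]
print(names)
```

With real predictions, `predicted` would be replaced by `y_predict_m`.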
