# gcForest Algorithm
The gcForest algorithm was proposed by Zhou and Feng (2017); see [https://arxiv.org/abs/1702.08835](https://arxiv.org/abs/1702.08835) for technical details. I provide here a Python 3 implementation of this algorithm.<br/>I chose to adopt the scikit-learn syntax for ease of use, and below I show how it can be used.

In [1]:
from GCForest import gcForest
from sklearn.datasets import load_iris, load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# Before starting, a word about sizes.
*Note*: I recommend reading this section with the original paper at hand to follow the notation.
The main technical problem in the present gcForest implementation is memory usage when slicing the input data. A rough calculation gives an idea of the number and sizes of the objects the algorithm has to handle.
Starting with a dataset of $N$ samples of size $[l,L]$ and with $C$ classes, the initial "size" is:
$S_{D} = N.l.L$

## Slicing Step
If the window is of size $[w_l,w_L]$ and the chosen strides are $[s_l,s_L]$, then the number of slices per sample is:

$n_{slices} = \left(\frac{l-w_l}{s_l}+1\right)\left(\frac{L-w_L}{s_L}+1\right)$

The size of each slice is $w_l.w_L$, hence the total size of the sliced data set is:

$S_{sliced} = N.w_l.w_L.\left(\frac{l-w_l}{s_l}+1\right)\left(\frac{L-w_L}{s_L}+1\right)$
This is the point at which memory consumption peaks.
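
As a quick sanity check, the slicing formulas can be evaluated for a concrete case. The sketch below is illustrative only: the helpers `n_slices` and `sliced_size` are hypothetical names, not part of the gcForest code, and integer division is assumed to stand in for the fractions in the formulas.

```python
def n_slices(l, L, w_l, w_L, s_l=1, s_L=1):
    """Number of windows of size [w_l, w_L] that fit in one [l, L] sample."""
    return ((l - w_l) // s_l + 1) * ((L - w_L) // s_L + 1)


def sliced_size(N, l, L, w_l, w_L, s_l=1, s_L=1):
    """Total number of values held after slicing N samples."""
    return N * w_l * w_L * n_slices(l, L, w_l, w_L, s_l, s_L)


# Digits-like case: 1078 samples of 8x8 pixels, a 4x4 window, stride 1.
N, l, L = 1078, 8, 8
print(N * l * L)                     # initial size S_D
print(n_slices(l, L, 4, 4))          # slices per sample
print(sliced_size(N, l, L, 4, 4))    # S_sliced
```

With these numbers the sliced data hold 431200 values against 68992 in the input, i.e. 6.25 times more, which shows how quickly the slicing step inflates memory usage.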

## Class Vector after Multi-Grain Scanning
Now all slices are fed to the random forests to generate class vectors. The number of class vectors per random forest per window per sample is simply equal to the number of slices given to the random forest, $n_{cv}(w) = n_{slices}(w)$. Hence, if we have $N_{RF}$ random forests per window, the total size of the class vectors for window $w$ is (recall we have $N$ samples and $C$ classes):

$S_{cv}(w) = N.n_{cv}(w).N_{RF}.C$

And finally the total size of the Multi-Grain Scanning output will be:

$S_{mgs} = N.\sum_{w} N_{RF}.C.n_{cv}(w)$

This short calculation is only meant to give an idea of the amount of data handled during the Multi-Grain Scanning phase. The actual memory consumption depends on the data type used (e.g. float, int, double), which is worth checking carefully when dealing with large datasets.
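
To make the estimate concrete, here is a rough sketch that evaluates $S_{mgs}$ and converts it to bytes for a given element width. The helper `mgs_output_size` is a hypothetical name, and the value of two forests per window is an assumption: it is chosen because it reproduces the 320 features per sample printed by the `mg_scanning` example later in this notebook.

```python
def mgs_output_size(N, C, n_rf, slices_per_window):
    """S_mgs = N * sum_w(N_RF * C * n_cv(w)), counted in values."""
    return N * sum(n_rf * C * n_cv for n_cv in slices_per_window)


# 8x8 images, one 5x5 window with stride 1 -> (8 - 5 + 1) ** 2 = 16 slices.
N, C, n_rf = 1078, 10, 2            # n_rf = 2 is an assumption, see above
values = mgs_output_size(N, C, n_rf, [16])
print(values // N)                  # features per sample

# Rough memory footprint for two common element widths (bytes per value).
for name, width in [("float32", 4), ("float64", 8)]:
    print(name, values * width / 1e6, "MB")
```

Halving the element width halves the footprint, which is why the storage type matters once $N$ or the number of windows grows.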

# Iris example

The iris data set is actually not a very good example, as the gcForest algorithm is better suited to time series and images, where information can be found at different scales within one sample.
Nonetheless it is still an easy way to test the method.

In [2]:
# loading data
iris = load_iris()
X = iris.data
y = iris.target
print(X.shape,y.shape)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33)

(150, 4) (150,)


In [3]:
gcf = gcForest(shape_1X=4, window=2,tolerance=0.0)
gcf.fit(X_tr, y_tr)

Slicing Sequence...
Training MGS Random Forests...
Adding/Training Layer, n_layer=1
Layer validation accuracy = 0.9
Adding/Training Layer, n_layer=2
Layer validation accuracy = 0.9


Now checking the prediction for the test set:

In [4]:
pred_X = gcf.predict(X_te)
print(pred_X)

Slicing Sequence...
[2 1 2 1 0 1 1 0 0 0 0 1 0 1 1 0 1 0 0 0 2 0 0 1 2 1 2 2 2 1 0 0 2 2 2 1 1
 1 2 0 2 1 0 0 0 1 0 2 2 2]


In [5]:
# evaluating accuracy
accuracy = accuracy_score(y_true=y_te, y_pred=pred_X)
print('gcForest accuracy : {}'.format(accuracy))

gcForest accuracy : 1.0


# Digits Example
A much better example is the digits data set, which contains images of handwritten digits. The scikit-learn data set can be viewed as a mini-MNIST for training purposes.

In [6]:
# loading the data
digits = load_digits()
X = digits.data
y = digits.target
print(X.shape,y.shape)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4)
print(X_tr.shape,X_te.shape)

(1797, 64) (1797,)
(1078, 64) (719, 64)


... training gcForest ... (this can take some time) ...

In [7]:
gcf = gcForest(shape_1X=[8,8], window=[4,6], tolerance=0.0, min_samples_mgs=10, min_samples_cascade=7)
gcf.fit(X_tr, y_tr)

Slicing Images...
Training MGS Random Forests...
Slicing Images...
Training MGS Random Forests...
Adding/Training Layer, n_layer=1
Layer validation accuracy = 0.9861111111111112
Adding/Training Layer, n_layer=2
Layer validation accuracy = 0.9907407407407407
Adding/Training Layer, n_layer=3
Layer validation accuracy = 0.9907407407407407


... and predicting classes ...

In [8]:
pred_X = gcf.predict(X_te)
print(pred_X)

Slicing Images...
Slicing Images...
[5 9 6 5 9 0 6 2 4 9 9 7 1 4 0 4 5 6 6 6 5 4 1 3 9 3 0 3 6 6 2 8 9 8 4 6 2
 5 7 5 3 7 8 9 8 8 6 3 1 7 8 4 9 5 7 3 2 5 6 3 6 6 9 2 5 6 1 0 6 6 9 5 1 8
 1 8 9 8 9 0 0 0 0 7 1 4 9 9 6 1 4 4 2 1 2 0 8 8 7 6 9 4 7 8 6 0 8 5 5 0 9
 7 1 7 5 4 8 2 2 6 4 0 8 1 6 3 7 2 2 3 6 8 0 9 0 0 8 8 6 3 4 2 9 8 0 9 4 8
 4 5 1 1 9 9 7 1 2 0 6 0 2 9 9 0 0 1 3 4 9 5 4 6 7 8 1 6 5 8 7 6 5 7 1 1 2
 6 3 1 7 3 2 7 3 3 6 5 2 3 7 6 9 6 6 7 7 0 5 9 5 6 0 8 2 0 3 1 6 1 6 8 9 3
 3 8 6 3 4 8 8 7 7 2 0 1 6 3 6 1 1 0 0 3 4 7 2 8 6 5 9 0 4 1 6 7 1 3 2 2 3
 2 0 0 4 6 1 6 3 0 5 5 5 2 7 2 2 0 8 4 7 4 4 2 4 8 4 0 9 6 3 4 2 8 9 2 0 0
 0 0 0 1 6 8 5 2 3 1 2 4 9 0 1 6 5 8 3 4 4 8 5 2 6 6 3 5 1 4 7 3 7 9 0 7 4
 5 1 9 8 9 7 4 6 1 4 2 4 2 6 1 6 1 9 4 0 0 7 1 7 1 0 9 2 8 3 6 4 0 7 3 0 6
 8 9 7 0 9 6 0 1 1 7 6 9 2 5 8 4 9 3 4 4 5 7 0 5 1 0 5 5 1 6 6 8 7 4 3 4 2
 5 7 2 5 5 4 1 8 7 4 2 5 7 9 9 8 3 0 4 1 4 5 6 3 5 7 1 5 8 3 7 7 6 5 2 5 1
 7 7 7 7 7 9 4 1 6 7 7 6 1 5 7 9 9 4 9 1 0 0 6 4 7 8 6 3 1 5 1 3

In [9]:
# evaluating accuracy
accuracy = accuracy_score(y_true=y_te, y_pred=pred_X)
print('gcForest accuracy : {}'.format(accuracy))

gcForest accuracy : 0.980528511821975


# Saving Models to Disk

You probably don't want to re-train your classifier every day, especially if you're using it on large data sets. Fortunately there is a very easy way to save and load models to disk using `joblib`. Older scikit-learn releases exposed it as `sklearn.externals.joblib`, but that alias has since been removed, so the standalone package is imported here.

**Saving model:**

In [10]:
import joblib
joblib.dump(gcf, 'gcf_model.sav')

['gcf_model.sav']

**Loading model:**

In [11]:
gcf = joblib.load('gcf_model.sav')

# Using mg_scanning and cascade_forest Separately
As the Multi-Grain Scanning and the cascade forest modules are quite independent, it is possible to use them separately.<br>If a target y is given, the code automatically uses it for training; otherwise it recalls the last trained Random Forests to slice the data.

In [12]:
gcf = gcForest(shape_1X=[8,8],window=5,min_samples_mgs=10,min_samples_cascade=7)
X_tr_mgs = gcf.mg_scanning(X_tr, y_tr)
print(X_tr_mgs.shape)
print(X_tr.shape)

Slicing Images...
Training MGS Random Forests...
(1078, 320)
(1078, 64)


In [13]:
X_te_mgs = gcf.mg_scanning(X_te)

Slicing Images...


It is now possible to use the mg_scanning output as input for cascade forests using different parameters. Note that the cascade forest module does not directly return predictions but probability predictions from each Random Forest in the last layer of the cascade. Hence the need to first take the mean of the output and then find the max.
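
The averaging step can be illustrated on a small made-up array. The sketch below assumes the cascade output has shape (n_forests, n_samples, n_classes), which is what averaging over axis 0 and taking the argmax over axis 1 implies; the probabilities are invented for illustration.

```python
import numpy as np

# Two forests, three samples, three classes (invented probabilities).
pred_proba = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]],
    [[0.5, 0.4, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]],
])

mean_proba = np.mean(pred_proba, axis=0)   # average over forests -> (3, 3)
preds = np.argmax(mean_proba, axis=1)      # most probable class per sample
print(mean_proba.shape, preds)
```

Averaging before taking the argmax lets every forest of the last layer contribute its full probability estimate rather than a single hard vote.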

In [14]:
gcf = gcForest(tolerance=0.0, min_samples_mgs=10, min_samples_cascade=7)
_ = gcf.cascade_forest(X_tr_mgs,y_tr)

Adding/Training Layer, n_layer=1
Layer validation accuracy = 0.9814814814814815
Adding/Training Layer, n_layer=2
Layer validation accuracy = 0.9814814814814815


In [15]:
pred_proba = gcf.cascade_forest(X_te_mgs)
# average the class probabilities over the forests of the last layer
tmp = np.mean(pred_proba, axis=0)
# pick the most probable class for each sample
preds = np.argmax(tmp, axis=1)
accuracy_score(y_true=y_te, y_pred=preds)

0.97496522948539643

In [16]:
gcf = gcForest(tolerance=0.0, min_samples_mgs=20, min_samples_cascade=10)
_ = gcf.cascade_forest(X_tr_mgs, y_tr)

Adding/Training Layer, n_layer=1
Layer validation accuracy = 0.9907407407407407
Adding/Training Layer, n_layer=2
Layer validation accuracy = 0.9953703703703703
Adding/Training Layer, n_layer=3
Layer validation accuracy = 0.9953703703703703


In [17]:
pred_proba = gcf.cascade_forest(X_te_mgs)
tmp = np.mean(pred_proba, axis=0)
preds = np.argmax(tmp, axis=1)
accuracy_score(y_true=y_te, y_pred=preds)

0.98191933240611962

# Skipping mg_scanning

It is also possible to use the cascade forest directly and skip the Multi-Grain Scanning step.

In [18]:
gcf = gcForest(tolerance=0.0, min_samples_cascade=20)
_ = gcf.cascade_forest(X_tr, y_tr)

Adding/Training Layer, n_layer=1
Layer validation accuracy = 0.9675925925925926
Adding/Training Layer, n_layer=2
Layer validation accuracy = 0.9629629629629629


In [19]:
pred_proba = gcf.cascade_forest(X_te)
tmp = np.mean(pred_proba, axis=0)
preds = np.argmax(tmp, axis=1)
accuracy_score(y_true=y_te, y_pred=preds)

0.95132127955493739

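* **The choice of `window` determines the granularity: with `5`, only a window of size 5 slides over the data, whereas with `[4,5]`, windows of size 4 and 5 each slide over the data and their outputs are appended.**
* **See also [Stock index futures rise and fall](https://blog.csdn.net/woddle/article/details/71122698)**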
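
To illustrate how the window size sets the granularity, here is a minimal sketch of sliding-window slicing on a single sequence. It is not the slicing code used by gcForest: `slide` is a hypothetical helper, a stride of 1 is assumed, and the sketch only shows which slices each window size contributes (in gcForest, the class vectors derived from each window are what get appended).

```python
def slide(seq, window, stride=1):
    """Return every contiguous slice of length `window` taken from `seq`."""
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, stride)]


sample = [5.1, 3.5, 1.4, 0.2]    # one iris-like sample with 4 features

# Each window size is scanned separately over the same sample.
counts = {}
for w in [2, 3]:
    slices = slide(sample, w)
    counts[w] = len(slices)
    print(w, slices)
print(counts)
```

A smaller window yields more, shorter slices, and a larger one yields fewer, longer slices, so passing several window sizes exposes the forests to several scales of the same sample.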
## gcForest Parameters

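**shape_1X:**
Shape of a single sample element, [n_lines, n_cols]. Required when calling mg_scanning! For sequence data, a single int can be given.

**n_mgsRFtree:**
Number of trees in a Random Forest during Multi-Grain Scanning.

**window: int (default = None)**
List of window sizes to use during Multi-Grain Scanning. If None, no slicing is done.

**stride: int (default = 1)**
Step used when slicing the data.

**cascade_test_size: float or int (default = 0.2)**
Fraction or absolute number of samples used to split the cascade training set.

**n_cascadeRF: int (default = 2)**
Number of Random Forests in a cascade layer. For each pseudo Random Forest, a complete Random Forest is also created, hence the total number of Random Forests in a layer will be 2 * n_cascadeRF.

**n_cascadeRFtree: int (default = 101)**
Number of trees in a single Random Forest in a cascade layer.

**min_samples_mgs: float or int (default = 0.1)**
Minimum number of samples in a node required to perform a split while training the Multi-Grain Scanning Random Forests. If int, number_of_samples = int. If float, min_samples is the fraction of the initial n_samples to consider.

**min_samples_cascade: float or int (default = 0.1)**
Minimum number of samples in a node required to perform a split while training the cascade Random Forests. If int, number_of_samples = int. If float, min_samples is the fraction of the initial n_samples to consider.

**cascade_layer: int (default = np.inf)**
Maximum number of cascade layers allowed. Useful to limit the size of the cascade.

**tolerance: float (default = 0.0)**
Accuracy tolerance for cascade growth. The performance of the whole cascade is estimated on the validation set, and if there is no significant performance gain, training stops.

**n_jobs: int (default = 1)**
Number of jobs to run in parallel for any Random Forest fit and predict. If -1, the number of jobs is set to the number of cores.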
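
The role of `tolerance` and `cascade_layer` can be sketched as a simple stopping rule. This is an assumption about the logic, inferred from the training logs in this notebook (a further layer is added only while validation accuracy improves by more than `tolerance`), not a copy of the implementation; `layers_trained` is a hypothetical helper.

```python
import math


def layers_trained(layer_accuracies, tolerance=0.0, cascade_layer=math.inf):
    """Count how many layers get trained before cascade growth stops.

    One layer is always trained; another is added only if the latest
    layer beat the previous accuracy by more than `tolerance`.
    """
    n_layer, best = 0, -math.inf
    for acc in layer_accuracies:
        if n_layer >= cascade_layer:
            break              # hard cap on the number of layers
        n_layer += 1
        if acc <= best + tolerance:
            break              # no significant gain: stop growing
        best = acc
    return n_layer


# First three values: digits-run validation accuracies, rounded.
# The fourth value is invented to show that it is never reached.
accs = [0.9861, 0.9907, 0.9907, 0.9950]
print(layers_trained(accs))                    # stops after the third layer
print(layers_trained(accs, tolerance=0.01))    # a larger tolerance stops earlier
print(layers_trained(accs, cascade_layer=1))   # capped at a single layer
```

Under this reading, `tolerance=0.0` keeps adding layers as long as there is any gain at all, while a positive value trades a little accuracy for a shallower, cheaper cascade.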