This notebook considers:
1. __data.x2__ in data.py, which contains 69 features (initiator exclusive).
2. The results show no difference from the 23 manually selected features. We do not generate features in this notebook, but we do obtain rules for the 23 features.
3. In this notebook, the prediction order is (sphere, worm, vesicle, other).

In [1]:
import data1 as data
import random
from common import *
from rules import *
from realkd.patch import RuleFit
from sklearn.model_selection import cross_validate, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut
import numpy as np
import matplotlib.colors as mcolors

In [2]:
import warnings
warnings.filterwarnings("ignore")

## Full phase prediction


In [3]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.multioutput import ClassifierChain
from multilabel import BinaryRelevanceClassifier, ProbabilisticClassifierChain
from gam import LogisticGAM

STATE = np.random.RandomState(seed=1000)

lr = LogisticRegressionCV(penalty='l1', solver='saga', random_state=STATE)
lr_ind = BinaryRelevanceClassifier(lr)
lr_chain = ClassifierChain(lr, order=[0, 2, 1, 3])
lr_pcc = ProbabilisticClassifierChain(lr) 

# The GAM models are not fixed yet, so this part is disabled.
# gam_ind = BinaryRelevanceClassifier(LogisticGAM(lam=20.0, max_iter=250))
# gam_chain = ClassifierChain(LogisticGAM(lam=20.0, max_iter=250))
# gam_pcc = ProbabilisticClassifierChain(LogisticGAM(lam=20.0, max_iter=250)) 

rf = RandomForestClassifier(random_state=STATE, min_samples_leaf=1, n_estimators=100)
rf_ind = BinaryRelevanceClassifier(rf)
rf_chain = ClassifierChain(rf, order=[0, 2, 1, 3])
rf_pcc = ProbabilisticClassifierChain(rf)

# Rulefit
rufit_pcc = RuleFitWrapper(mode='chain')

full_estimators = [lr_ind, lr_pcc, lr_chain, rf_ind, rf_pcc, rf_chain, rufit_pcc]
full_names = ['LR_ind', 'LR_pcc', 'LR_chain', 'RanF_ind', 'RanF_pcc', 'Ranf_chain', 'Rufit_pcc']
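The `order=[0, 2, 1, 3]` argument passed to `ClassifierChain` lists positions of label columns: the i-th model in the chain predicts column `order[i]`. A minimal sketch of that mapping, assuming the label columns are (sphere, worm, vesicle, other) as stated at the top of this notebook. `chain_sequence` is a hypothetical helper for illustration, not part of scikit-learn.

```python
# Assumed column order of the label matrix (taken from the notebook header).
labels = ['sphere', 'worm', 'vesicle', 'other']
order = [0, 2, 1, 3]

def chain_sequence(labels, order):
    """Return the label names in the order the chain fits them (hypothetical helper)."""
    # The order must visit every label position exactly once.
    if sorted(order) != list(range(len(labels))):
        raise ValueError("order must be a permutation of the label positions")
    return [labels[i] for i in order]

print(chain_sequence(labels, order))  # ['sphere', 'vesicle', 'worm', 'other']
```

With this order the vesicle model is fit before the worm model, so the worm model receives the sphere and vesicle labels as additional input features.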

The experiment below takes about 5 hours to run on a 2.6 GHz 6-core Intel Core i7. To skip it, you can load the saved result with the following snippet and then re-run the rest of the notebook:

```
import pickle
with open('./interpolation_30folder.p', 'rb') as cur_save:  # file is closed automatically
    interpolation = pickle.load(cur_save)
```
After running these three lines, you can skip the __interpolation__ cell below and re-run the rest of the notebook.
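The load-or-recompute choice can also be wrapped in a small helper so that the long run happens only when no saved file exists. This is a minimal sketch under stated assumptions: `load_or_run` and its `compute` argument are hypothetical names, not part of this project's code.

```python
import os
import pickle

def load_or_run(path, compute):
    """Load a pickled result from `path` if it exists; otherwise compute and save it.

    `compute` is any zero-argument callable that returns a picklable object.
    """
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result
```

In this notebook, `compute` would be a function that builds the experiment and returns the value of its `.run()` call, and `path` would be the saved result file.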

Due to the individual prediction accuracy

In [10]:
from common import Experiment, LogLikelihoodEvaluator
from sklearn.model_selection import KFold


print("Current Prediction Order is:", data.y.columns.tolist())
print('Num of predictors:, ', data.x2.shape[1])

interpolation = Experiment(full_estimators, 
                    full_names,
                    KFold(30, shuffle=True, random_state=STATE),
                    data.x2, data.y.replace(-1.0, 0.0),
                    groups=data.comp_ids.array, 
                    evaluators=['accuracy', LogLikelihoodEvaluator(2, neg=True)],
                    verbose=True).run()

Current Prediction Order is: ['sphere', 'worm', 'vesicle', 'other']
Num of predictors:,  69
Running experiment with 30 repetitions
******************************


In [11]:
# import pickle
# with open('interpolation_full_swvo.pkl', 'wb') as f:   
#     pickle.dump(interpolation, f)