In this demo, we focus on CF ensemble learning methods that predict reliability first, instead of guesstimating labels first.

Specifically, given `T` (the rating matrix of the test set) whose class labels we wish to predict, we break the ensemble prediction down into the following subproblems:

- Predict the reliability of `T`; that is, predict `T`'s probability filter (or preference matrix), where 0s represent unreliable entries (e.g., FPs and FNs) and 1s represent reliable entries (e.g., TPs and TNs)

- Run a chosen collaborative filtering algorithm to re-estimate the rating matrix, combining the ratings from the training set (`R`) with those from the test set (`T`)
  - Why? Recall from **Demo Part 1 and 2** that the purpose of the probability filter is to select which entries enter the optimization objective (e.g., approximating ratings/probability scores or approximating labels) and which are left out. Reliable entries should enter the objective, while unreliable entries are typically left out (unless the loss function can account for them; see C-square loss for an example)

- Once we have `Th` (the re-estimated `T`), we combine its ratings to form the final class label predictions as usual (e.g., mean, majority vote, stacking)
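
The three subproblems above can be sketched on a toy matrix. This is only an illustration under stated assumptions: the probability filter `P_toy` is written by hand rather than predicted, and a masked column mean stands in for the collaborative filtering step. All names ending in `_toy` are hypothetical and not part of the project code.

```python
import numpy as np

# Toy test-set rating matrix: 3 classifiers (rows) x 4 instances (columns).
T_toy = np.array([[0.9, 0.2, 0.6, 0.1],
                  [0.8, 0.7, 0.4, 0.3],
                  [0.3, 0.1, 0.7, 0.2]])

# Step 1 (assumed given here): a probability filter for T_toy,
# where 1 marks a reliable entry and 0 marks an unreliable one.
P_toy = np.array([[1, 1, 1, 1],
                  [1, 0, 0, 1],
                  [0, 1, 1, 1]])

# Step 2 (stand-in for CF): replace each unreliable entry with the mean
# of the reliable entries in the same column.
n_reliable = np.maximum(P_toy.sum(axis=0), 1)          # avoid division by zero
col_mean = (T_toy * P_toy).sum(axis=0) / n_reliable
Th_toy = np.where(P_toy == 1, T_toy, col_mean)

# Step 3: combine the re-estimated ratings (mean aggregation) and
# threshold at 0.5 to obtain class labels.
scores_toy = Th_toy.mean(axis=0)
labels_toy = (scores_toy >= 0.5).astype(int)

print(scores_toy)
print(labels_toy)
```

In the actual demo, step 1 is handled by a polarity model and step 2 by the CF model; the masked mean is used here only because it makes the effect of the filter easy to inspect.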

In [1]:
#@title Import Basic Libraries
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import os, sys

# Colab 
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

# Plotting
import matplotlib.pyplot as plt
# %matplotlib inline

from matplotlib.pyplot import figure
import seaborn as sns
from IPython.display import display

# Progress
from tqdm import tqdm

################################################################
# Configure system environment
# - Please modify input_dir according to your local environment
#
################################################################

cur_dir = os.getcwd()
project_dir = 'machine_learning_examples/cf_ensemble'
if IN_COLAB: 
    # Run this demo on Google Colab
    from google.colab import drive
    drive.mount('/content/drive')
    
    # Parameters for data
    input_dir = f"/content/drive/MyDrive/Colab Notebooks/{project_dir}"
    # /content/drive/MyDrive/Colab Notebooks/machine_learning_examples/data/data-is-life

    sys.path.append(input_dir)
else: 
    input_dir = cur_dir
    
if input_dir != cur_dir: 
    sys.path.append(input_dir)
    print(f"> Adding {input_dir} to sys path ...")
    print(sys.path)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
> Adding /content/drive/MyDrive/Colab Notebooks/machine_learning_examples/cf_ensemble to sys path ...
['', '/content', '/env/python', '/usr/lib/python37.zip', '/usr/lib/python3.7', '/usr/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/dist-packages', '/usr/lib/python3/dist-packages', '/usr/local/lib/python3.7/dist-packages/IPython/extensions', '/root/.ipython', '/content/drive/MyDrive/Colab Notebooks/machine_learning_examples/cf_ensemble', '/content/drive/MyDrive/Colab Notebooks/machine_learning_examples/cf_ensemble']


In [2]:
#@title Import Tensorflow and CF-Related Libraries
import tensorflow as tf
print(tf.__version__)
# import tensorflow_probability as tfp
# tfd = tfp.distributions
from tensorflow import keras

# from tensorflow.keras import layers
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Flatten, Dense, Dropout, Lambda, Embedding
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.utils import plot_model
from tensorflow.keras import backend as K
#################################################################

# Scikit-learn 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
#################################################################

# CF-ensemble-specific libraries
import utils_stacking as ustk
import utils_classifier as uclf
import utils_sys as usys
import utils_cf as uc 
import polarity_models as pmodel
from polarity_models import Polarity
import scipy.sparse as sparse
from utils_sys import highlight
#################################################################

# Misc
import pprint
import tempfile
from typing import Dict, Text

np.set_printoptions(precision=3, edgeitems=5, suppress=True)

2.8.0


In [3]:
#@title Generate Training Data
import data_pipeline as dp

max_class_ratio=0.99

# get the dataset
X0, y0 = dp.generate_imbalanced_data(class_ratio=max_class_ratio, verbose=1)

> n_classes: 2
[0 1]

> counts:
Counter({0: 4465, 1: 535})



In [4]:
#@title Define and Choose Base Classifiers
base_learners = [
                 ('RF', RandomForestClassifier(n_estimators= 200, 
                                                   oob_score = True, 
                                                   class_weight = "balanced", 
                                                   random_state = 20, 
                                                   ccp_alpha = 0.1)), 
                 ('KNNC', KNeighborsClassifier(n_neighbors = len(np.unique(y0))
                                                     , weights = 'distance')),
                #  ('SVC', SVC(kernel = 'linear', probability=True,
                #                    class_weight = 'balanced'
                #                   , break_ties = True)), 

                 ('GNB', GaussianNB()), 
                 ('QDA',  QuadraticDiscriminantAnalysis()), 
                 ('MLPClassifier', MLPClassifier(alpha=1, max_iter=1000)), 
                 # ('DT', DecisionTreeClassifier(max_depth=5)),
                 # ('GPC', GaussianProcessClassifier(1.0 * RBF(1.0))),
                ]
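
For readers without access to the project modules, the way a list of base classifiers turns into a rating matrix can be sketched with plain scikit-learn. This is an assumption about the general idea, not a reproduction of `cm.demo_cf_stacking`: each classifier contributes one row of out-of-fold positive-class probabilities. The dataset, the two learners, and the names `X_demo`, `y_demo`, `demo_learners`, and `R_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

# Small imbalanced binary dataset (roughly 90% negatives).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     weights=[0.9, 0.1], random_state=0)

# Two lightweight base classifiers, in the same (name, estimator) format.
demo_learners = [('LR', LogisticRegression(max_iter=1000)),
                 ('GNB', GaussianNB())]

rows = []
for name, clf in demo_learners:
    # Out-of-fold probabilities, so no instance is scored by a model
    # that saw it during fitting.
    proba = cross_val_predict(clf, X_demo, y_demo, cv=5,
                              method='predict_proba')
    rows.append(proba[:, 1])  # keep P(y=1) only

# Rows index classifiers ("users"), columns index instances ("items").
R_demo = np.vstack(rows)
print(R_demo.shape)
```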

In [5]:
#@title Generate Rating Matrices
import cf_models as cm

tLoadPretrained = False
######################
fold_number = 0
n_iterations = 1
data_dir = os.path.join(input_dir, 'data')
######################

if not tLoadPretrained:  
    # Use the previously selected base predictors (`base_learners`) to generate the level-1 dataset
    R, T, U, L_train, L_test = cm.demo_cf_stacking(input_data=(X0, y0), 
                                                   input_dir=input_dir, n_iter=n_iterations, 
                                                   base_learners=base_learners, # <<< base classifiers selected
                                                   verbose=1)
else: 
    R, T, U, L_train, L_test = dp.load_pretrained_level1_data(fold_number=fold_number, verbose=1, data_dir=data_dir)

# Derived quantities
n_train = R.shape[1]
p_threshold = uc.estimateProbThresholds(R, L=L_train, pos_label=1, policy='fmax')
lh = uc.estimateLabels(T, p_th=p_threshold) # Using L_test here would be cheating, so we guesstimate the test labels from T
L = np.hstack((L_train, lh)) 
X = np.hstack((R, T))

assert len(U) == X.shape[0]
print(f"> shape(R):{R.shape} || shape(T): {T.shape} => shape(X): {X.shape}")

2.8.0


  0%|          | 0/1 [00:00<?, ?it/s]

(BaseCF) base est | name: RF, estimator: RandomForestClassifier(ccp_alpha=0.1, class_weight='balanced', n_estimators=200,
                       oob_score=True, random_state=20)
(BaseCF) base est | name: KNNC, estimator: KNeighborsClassifier(n_neighbors=2, weights='distance')
(BaseCF) base est | name: GNB, estimator: GaussianNB()
(BaseCF) base est | name: QDA, estimator: QuadraticDiscriminantAnalysis()
(BaseCF) base est | name: MLPClassifier, estimator: MLPClassifier(alpha=1, max_iter=1000)
(BaseCF) Base predictors:
[1]  RF: RandomForestClassifier(ccp_alpha=0.1, class_weight='balanced', n_estimators=200,
                       oob_score=True, random_state=20)
[2]  QDA: QuadraticDiscriminantAnalysis()
[3]  MLPClassifier: MLPClassifier(alpha=1, max_iter=1000)
[4]  KNNC: KNeighborsClassifier(n_neighbors=2, weights='distance')
[5]  GNB: GaussianNB()




[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   24.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[info] Saving X_meta (shape=(3750, 5)) at:
/content/drive/MyDrive/Colab Notebooks/machine_learning_examples/cf_ensemble/data/train-0.npz



[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   35.1s finished


[info] Saving X_meta (shape=(1250, 5)) at:
/content/drive/MyDrive/Colab Notebooks/machine_learning_examples/cf_ensemble/data/test-0.npz

[result] 0.09859154929577464
(cf_write) Adding new attribute y:
[0 0 0 0 0 ... 0 1 0 0 0]
...
(cf_write) Saving X_meta at:
/content/drive/MyDrive/Colab Notebooks/machine_learning_examples/cf_ensemble/data/test-0.npz



100%|██████████| 1/1 [01:13<00:00, 73.16s/it]

[info] list of base classifiers:
['RF' 'KNNC' 'GNB' 'QDA' 'MLPClassifier']

R: Rating/probability matrix for the TRAIN set
> shape(R):(5, 3750) || shape(T): (5, 1250) => shape(X): (5, 5000)
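
The derived quantities above depend on per-classifier probability thresholds and on label estimates for `T`. A minimal sketch of both, assuming that `policy='fmax'` means "pick the cutoff that maximizes F1 on the training labels" and that labels are estimated by a majority vote over thresholded predictions. The helper `fmax_threshold` and the toy arrays are hypothetical and may differ from what `uc.estimateProbThresholds` and `uc.estimateLabels` do internally.

```python
import numpy as np

def fmax_threshold(scores, labels):
    """Return the cutoff among the observed scores that maximizes F1."""
    best_th, best_f1 = 0.5, -1.0
    for th in np.unique(scores):
        pred = scores >= th
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0
        if f1 > best_f1:
            best_th, best_f1 = th, f1
    return float(best_th)

# Toy training rating matrix (3 classifiers x 4 instances) and labels.
R_toy = np.array([[0.1, 0.4, 0.35, 0.8],
                  [0.3, 0.2, 0.6, 0.9],
                  [0.2, 0.6, 0.7, 0.4]])
L_toy = np.array([0, 0, 1, 1])

# One threshold per classifier (row).
p_th_toy = np.array([fmax_threshold(row, L_toy) for row in R_toy])

# Toy test rating matrix (3 classifiers x 2 instances).
T_toy = np.array([[0.5, 0.2],
                  [0.7, 0.5],
                  [0.3, 0.45]])

# Threshold each row with its own cutoff, then take a majority vote.
votes = (T_toy >= p_th_toy[:, None]).astype(int)
lh_toy = (votes.sum(axis=0) > T_toy.shape[0] / 2).astype(int)

print(p_th_toy, lh_toy)
```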





In [6]:
#@title Confidence Matrix
# import utils_cf as uc
# import polarity_models as pmodel

n_factors = 100
alpha = 100.0 
conf_measure = 'brier' # Options: 'brier', 'uniform'
policy_threshold = 'fmax'

Pc, C0, Cw, Cn, *rest = \
    uc.evalConfidenceMatrices(R, L_train, alpha=alpha, 
                                    p_threshold=p_threshold, 
                                    conf_measure=conf_measure, policy_threshold=policy_threshold, 
                                    
                                    # Optional debug/test parameters 
                                    U=U, n_train=n_train, fold_number=fold_number, 
                                    is_cascade=True,
                                    verbose=0)
assert C0.shape == R.shape
y_colors = pmodel.verify_colors(Pc)  # [log] status: ok

(make_cn) Using UNWEIGHTED confidence matrix (with all C[i][j] having equal weights) to approximate ratings ...
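
To make the objects returned above more concrete, here is a rough sketch of how a color matrix, a probability filter, and a Brier-based confidence weight could be derived from a rating matrix. The integer color codes, the `1 - Brier` weighting, and all `_toy` names are assumptions made for illustration; `uc.evalConfidenceMatrices` may encode and weight these quantities differently.

```python
import numpy as np

R_toy = np.array([[0.9, 0.2, 0.6, 0.1],
                  [0.4, 0.7, 0.8, 0.3]])
L_toy = np.array([1, 0, 1, 0])
p_th_toy = np.array([0.5, 0.5])          # one threshold per classifier

pred = (R_toy >= p_th_toy[:, None]).astype(int)
truth = np.broadcast_to(L_toy, R_toy.shape)

# Hypothetical integer codes for the four "colors".
TP, TN, FP, FN = 2, 1, -1, -2
Pc_toy = np.select(
    [(pred == 1) & (truth == 1),
     (pred == 0) & (truth == 0),
     (pred == 1) & (truth == 0)],
    [TP, TN, FP],
    default=FN)

# Probability filter: {TP, TN} -> 1, {FP, FN} -> 0.
P_toy = (Pc_toy > 0).astype(int)

# Per-classifier Brier score (lower is better) turned into a weight,
# applied only to the reliable entries.
brier_toy = ((R_toy - truth) ** 2).mean(axis=1)
C_toy = P_toy * (1.0 - brier_toy)[:, None]

print(Pc_toy)
print(P_toy)
print(C_toy)
```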


In [7]:
#@title Training CFNet
import cf_models as cm

n_users, n_items = R.shape

fold_number = 0
test_size = 0.1

policy_threshold = 'fmax'
conf_measure = 'brier'
n_factors = 100
alpha = 100

lr = 0.001 
batch_size = 64
epochs = 200

loss_fn = tf.keras.losses.BinaryCrossentropy() # Options: tf.keras.losses.BinaryCrossentropy(), tf.keras.losses.MeanSquaredError(), ...
target_type = 'label' # if we use BCE, then the model approximates the label

cf_model = cm.get_cfnet_compiled(n_users, n_items, n_factors, loss=loss_fn, lr=lr)
cf_model = cm.training_pipeline(input_model=(cf_model, loss_fn), 
                                input_data=(R, T, U, L_train, L_test), 

                          # Should we combine R and T into a single matrix X? Set to True if so
                          is_cascade = False, # Set to False here because we attempt to re-estimate T using a polarity model

                          # lh = lh, # Estimated labels by default are the majority vote 
                          
                          # SGD optimization parameters
                          test_size = test_size,
                          epochs = epochs, 
                          batch_size=batch_size, 

                          # CF hyperparameters
                          # n_factors=n_factors, # this is factored into model definition
                          alpha=alpha, 
                          conf_measure=conf_measure, 
                          # conf_type='Cn', # default sparse confidence matrix (Cn)
                          target_type=target_type,
                          
                          policy_threshold=policy_threshold, 
                          fold_number=fold_number) 


(make_cn) Using UNWEIGHTED confidence matrix (with all C[i][j] having equal weights) to approximate ratings ...
[info] Confidence matrix type: Cn, target data type: label
Epoch 1/200
Epoch 2/200
...
Epoch 64/200
...

- Up until this point, the workflow remains the same as **Part 3**
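
As a reminder of what the filter does during training, the following sketch computes a binary cross-entropy in which only reliable entries contribute. It is written in NumPy for clarity and is an assumption about the general mechanism, not the loss implemented in `cf_models`; the function `masked_bce` and the toy arrays are hypothetical.

```python
import numpy as np

def masked_bce(y_true, y_pred, mask, eps=1e-7):
    """Mean binary cross-entropy over the entries where mask == 1."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_entry = -(y_true * np.log(y_pred)
                  + (1 - y_true) * np.log(1 - y_pred))
    return float((mask * per_entry).sum() / max(mask.sum(), 1))

y_true = np.array([[1, 0], [1, 0]])          # labels broadcast over classifiers
y_pred = np.array([[0.9, 0.2], [0.1, 0.8]])  # second classifier is wrong twice
mask = np.array([[1, 1], [0, 0]])            # so its entries are filtered out

loss_masked = masked_bce(y_true, y_pred, mask)
loss_full = masked_bce(y_true, y_pred, np.ones_like(mask))
print(loss_masked, loss_full)
```

With the unreliable row masked out, the loss reflects only the entries the model is supposed to fit, which is why the unmasked value is much larger in this example.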

### Using Seq2seq as the polarity model [todo]

In [8]:
assert Pc.shape == R.shape

P = uc.to_preference(Pc) # color matrix to probability filter (where {TP, TN} maps to 1 and {FP, FN} maps to 0)
Xs, Ys = pmodel.make_seq2seq_training_data(R, Po=P, L=L_train, include_label=False, verbose=1)

print(f"> shape(R): {R.shape}")
print(f"> shape(Xs): {Xs.shape}, shape(Ys): {Ys.shape}")

[info] shape(X): (3750, 5, 1), shape(Y): (3750, 5, 1)
> shape(R): (5, 3750)
> shape(Xs): (3750, 5, 1), shape(Ys): (3750, 5, 1)
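
The printed shapes suggest that each training instance becomes one sequence of length 5 (one step per base classifier) with a single feature per step. A minimal sketch of that reshaping, assuming the inputs are the ratings and the targets are the matching filter values; this is inferred from the shapes only and is not the body of `pmodel.make_seq2seq_training_data`. The `_toy` names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classifiers, n_instances = 5, 8

R_toy = rng.random((n_classifiers, n_instances))                  # ratings
P_toy = rng.integers(0, 2, size=(n_classifiers, n_instances))     # 0/1 filter

# Transpose to (instances, classifiers) and add a trailing feature axis,
# giving the (samples, timesteps, features) layout sequence models expect.
Xs_toy = R_toy.T[..., np.newaxis]
Ys_toy = P_toy.T[..., np.newaxis].astype(np.float32)

print(Xs_toy.shape, Ys_toy.shape)
```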
