We assume there are only two predictors, $X_0$ and $X_1$. Points in class $0$ fall on the line segment $X_1 = X_0 + 20$ with $X_0 \in [0,100]$. Points in class $1$ fall on the line segment $X_1 = X_0 - 20$ with $X_0 \in [20,120]$. In both cases $X_0$ is uniformly distributed, and within each class $X_0$ and $X_1$ are comonotonic. In this notebook, we compare CIBer with Naive Bayes.
Let's first work through an example. Suppose we would like to classify the point $(55,35)$. It belongs to class $1$, since it falls on the line segment $X_1 = X_0 - 20$. Further, suppose we discretize each predictor into equal-width bins of length $10$. Within each class, both predictors are uniform on an interval of length $100$, so each conditional marginal probability equals $0.1$.

$\textbf{Naive Bayes}$

$\mathbb{P}(X_0,X_1|Y=0)\cdot \mathbb{P}(Y=0)=0.1\cdot0.1\cdot0.5=0.005$

$\mathbb{P}(X_0,X_1|Y=1)\cdot \mathbb{P}(Y=1)=0.1\cdot0.1\cdot0.5=0.005$

$\textbf{CIBer}$

$\mathbb{P}(X_0,X_1|Y=0)\cdot \mathbb{P}(Y=0)= \mathrm{Leb}([0.5,0.6]\cap[0.1,0.2])\cdot0.5=0$

$\mathbb{P}(X_0,X_1|Y=1)\cdot \mathbb{P}(Y=1)= \mathrm{Leb}([0.3,0.4]\cap[0.3,0.4])\cdot0.5=0.05$

So Naive Bayes produces a tie and cannot decide between the two classes, whereas CIBer assigns zero probability to class $0$ and predicts class $1$.
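The arithmetic above can be checked with a small standalone sketch. It does not use the CIBer package: the helper names `cdf_interval` and `overlap` are hypothetical, and the bins are assumed to have width $10$ and to start at the lower end of each class-conditional support, as in the example. Exact fractions are used so that the Naive Bayes tie is not blurred by floating-point rounding.

```python
from fractions import Fraction

def cdf_interval(value, lo, hi, bin_width=10):
    """CDF interval of the bin containing `value` under Uniform(lo, hi)."""
    k = (value - lo) // bin_width          # index of the equal-width bin
    left = lo + k * bin_width
    right = left + bin_width
    return Fraction(left - lo, hi - lo), Fraction(right - lo, hi - lo)

def overlap(a, b):
    """Lebesgue measure (length) of the intersection of two intervals."""
    return max(Fraction(0), min(a[1], b[1]) - max(a[0], b[0]))

# Class-conditional supports of X0 and X1, taken from the setup above.
support = {0: {"X0": (0, 100), "X1": (20, 120)},
           1: {"X0": (20, 120), "X1": (0, 100)}}
prior = Fraction(1, 2)
x0, x1 = 55, 35

scores = {}
for y, s in support.items():
    i0 = cdf_interval(x0, *s["X0"])
    i1 = cdf_interval(x1, *s["X1"])
    nb = (i0[1] - i0[0]) * (i1[1] - i1[0]) * prior   # product of marginals
    ciber = overlap(i0, i1) * prior                  # comonotonic joint
    scores[y] = (nb, ciber)
    print(f"class {y}: NB = {float(nb)}, CIBer = {float(ciber)}")
```

Both Naive Bayes scores come out equal, while the comonotonic score is zero for class $0$ and positive for class $1$.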

In [1]:
import numpy as np
import pandas as pd
import sys
sys.path.insert(1, '/Users/chengpeng/Desktop/Research/STAT/CIBer')
import comonotonic as cm
import random
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import tool_box as tb
from sklearn import preprocessing
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, roc_auc_score
from multiprocessing import Pool

In [2]:
def experiment(bin_num):
    """Run one trial and return (CIBer accuracy, Naive Bayes accuracy)."""
    # Reseed from OS entropy so forked worker processes do not share one random stream.
    np.random.seed()
    n_sample_each_class = 500000
    # Class 0 lies on X1 = X0 + 20, class 1 on X1 = X0 - 20.
    class0_X0 = np.random.uniform(0,100,n_sample_each_class).reshape(-1,1)
    class0_X1 = class0_X0 + 20
    class1_X0 = np.random.uniform(20,120,n_sample_each_class).reshape(-1,1)
    class1_X1 = class1_X0 - 20
    class0 = np.zeros((n_sample_each_class, 1))
    class1 = np.ones((n_sample_each_class, 1))
    class0 = np.concatenate((class0_X0, class0_X1, class0), axis = 1)
    class1 = np.concatenate((class1_X0, class1_X1, class1), axis = 1)
    data = np.concatenate((class0, class1), axis = 0)
    X = data[:,:-1]
    Y = data[:,-1]
    cont_col = list(range(X.shape[1]))
    categorical = []
    discrete_feature_val = None
    allocation_book = {0: bin_num, 1: bin_num}
    X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,shuffle = True)
    ciber = cm.clustered_comonotonic(X_train,Y_train,discrete_feature_val,cont_col,categorical,
                                     0,None,corrtype = 'pearson',discrete_method = "custom",
                                     allocation_book = allocation_book)
    ciber.run()
    ciber_predict = ciber.predict(X_test)

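    # Baseline labelled NB in the plot: same call with the sixth positional argument set to 1 instead of 0.
    ciber_nb = cm.clustered_comonotonic(X_train,Y_train,discrete_feature_val,cont_col,categorical,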
                                        1,None,corrtype = 'pearson',discrete_method = "custom",
                                        allocation_book = allocation_book)
    ciber_nb.run()
    ciber_nb_predict = ciber_nb.predict(X_test)
    
    return accuracy_score(Y_test, ciber_predict), accuracy_score(Y_test, ciber_nb_predict)

In [3]:
ciber_acc, nb_acc = experiment(1000)

In [4]:
ciber_acc

1.0

In [5]:
nb_acc

0.706725
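A second accuracy near $0.7$ is roughly what the setup suggests for a histogram-style Naive Bayes. In each class, about $40\%$ of the points have one coordinate outside the other class's support, so the other class gets probability zero and the point is classified correctly. The remaining $60\%$ see (nearly) equal products of marginals, so the decision is driven by sampling noise and is right about half the time, giving roughly $0.4 + 0.6\cdot0.5 = 0.7$. The sketch below simulates this with the standard library only. It is an illustration under assumptions, not the CIBer package's code: the equal-width binning over $[0,120]$, the smaller sample sizes, and the names `draw` and `nb_predict` are all hypothetical.

```python
import random
from collections import Counter

random.seed(0)                      # fixed seed so the sketch is reproducible
N_TRAIN, N_TEST = 100_000, 20_000   # per class; smaller than the notebook's 500000
N_BINS = 1000
WIDTH = 120 / N_BINS                # assumed: equal-width bins over [0, 120]

def draw(y):
    """Draw one point of class y and return its (X0 bin, X1 bin)."""
    x0 = random.uniform(0, 100) if y == 0 else random.uniform(20, 120)
    x1 = x0 + 20 if y == 0 else x0 - 20
    return (min(int(x0 / WIDTH), N_BINS - 1), min(int(x1 / WIDTH), N_BINS - 1))

# Training: per-class marginal histograms of X0 and X1.
hist = {y: (Counter(), Counter()) for y in (0, 1)}
for y in (0, 1):
    for _ in range(N_TRAIN):
        b0, b1 = draw(y)
        hist[y][0][b0] += 1
        hist[y][1][b1] += 1

def nb_predict(b0, b1):
    """Histogram Naive Bayes; equal priors, so compare products of counts."""
    s0 = hist[0][0][b0] * hist[0][1][b1]
    s1 = hist[1][0][b0] * hist[1][1][b1]
    return 0 if s0 >= s1 else 1

# Evaluate on fresh draws from both classes.
correct = sum(nb_predict(*draw(y)) == y for y in (0, 1) for _ in range(N_TEST))
nb_accuracy = correct / (2 * N_TEST)
print(f"simulated NB accuracy: {nb_accuracy:.3f}")
```

The exact value depends on the seed and the binning, but it should land in the neighbourhood of $0.7$ rather than near $0.5$ or $1$.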

In [3]:
bin_num = 1000

In [None]:
params = [(bin_num,) for i in range(1000)]
# The context manager shuts the worker pool down once all runs have finished.
with Pool() as pool:
    results = pool.starmap(experiment, params)

# Each result is a pair (CIBer accuracy, NB accuracy).
ciber_result = [result[0] for result in results]
ciber_nb_result = [result[1] for result in results]
data_to_plot = [ciber_result, ciber_nb_result]

In [None]:
fig = plt.figure()
fig.suptitle('Performance comparison', fontsize=14, fontweight='bold')

ax = fig.add_subplot(111)
ax.boxplot(data_to_plot)

ax.set_xlabel('Methods')
ax.set_ylabel('Accuracy Score')
ax.set_xticklabels(['CIBer','NB'])

plt.show()