## Exploring the SIR Data Reduction Method for Visualization
In his 1991 paper, Prof. Ker-Chau Li (UCLA) introduced sliced inverse regression (SIR), a method for supervised dimensionality reduction that assumes the following model:

$$Y = f(\mathbf{X}\beta_1, \dots, \mathbf{X}\beta_K, \epsilon)$$

where $\mathbf{X} \in \mathbb{R}^{n \times p}$ and $f$ can be any function on $\mathbb{R}^{K+1}$. Without delving into the theory, the method considers the inverse regression curve $E(\mathbf{X}|Y)$ and estimates it by slicing the range of $Y$. Under some mild assumptions (and assuming $\mathbf{X}$ has been standardized), the inverse regression curve is contained in the subspace spanned by $\beta_1,\dots,\beta_K$. The method then applies a principal component analysis to the covariance matrix of the estimated inverse regression curve to recover the orientation of that subspace.
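
As a rough summary of the procedure described above, the slicing estimate can be written as follows. The notation is introduced here for illustration only: $H$ is the number of slices, $I_1,\dots,I_H$ are the slices of the range of $Y$, $\hat p_h$ is the proportion of the $y_i$ falling in $I_h$, and $\hat\Sigma$ is the sample covariance of the rows $x_i$ of $\mathbf{X}$.

$$
\begin{aligned}
z_i &= \hat\Sigma^{-1/2}(x_i - \bar{x}), \qquad i = 1, \dots, n, \\
\bar{z}_h &= \frac{1}{n \hat p_h} \sum_{i:\, y_i \in I_h} z_i, \qquad h = 1, \dots, H, \\
\hat V &= \sum_{h=1}^{H} \hat p_h \, \bar{z}_h \bar{z}_h^\top, \\
\hat\beta_k &= \hat\Sigma^{-1/2} \hat\eta_k, \qquad k = 1, \dots, K,
\end{aligned}
$$

where $\hat\eta_1,\dots,\hat\eta_K$ are the eigenvectors of $\hat V$ associated with its $K$ largest eigenvalues.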

I implemented the basic SIR method in Python with three methods:
- fit
- transform
- fit_transform
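
To make the three methods listed above concrete, here is a minimal sketch of what such an estimator could look like. It is not the contents of the `SIR` module imported below, which is not shown in this notebook. The class name `SimpleSIR`, its parameters, and the synthetic data are assumptions made for illustration. The sketch also assumes a full-rank covariance matrix and does not treat ties in `y` specially.

```python
import numpy as np


class SimpleSIR:
    """Illustrative SIR sketch (hypothetical; not the notebook's SIR module)."""

    def __init__(self, n_directions=2, n_slices=10):
        self.n_directions = n_directions
        self.n_slices = n_slices

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        n, p = X.shape
        self.mean_ = X.mean(axis=0)
        # Whitening matrix Sigma^{-1/2}, assuming the covariance is full rank.
        cov = np.cov(X - self.mean_, rowvar=False)
        evals, evecs = np.linalg.eigh(cov)
        inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
        Z = (X - self.mean_) @ inv_sqrt
        # Slice the observations into groups of roughly equal size by sorted y.
        order = np.argsort(y, kind="stable")
        V = np.zeros((p, p))
        for idx in np.array_split(order, self.n_slices):
            if idx.size == 0:
                continue
            slice_mean = Z[idx].mean(axis=0)
            # Weighted outer product of the slice mean (weight = slice proportion).
            V += (idx.size / n) * np.outer(slice_mean, slice_mean)
        # Leading eigenvectors of V, mapped back to the original scale.
        w, v = np.linalg.eigh(V)
        top = v[:, np.argsort(w)[::-1][: self.n_directions]]
        self.directions_ = inv_sqrt @ top
        return self

    def transform(self, X):
        # Project centered data onto the estimated directions.
        return (np.asarray(X, dtype=float) - self.mean_) @ self.directions_

    def fit_transform(self, X, y):
        return self.fit(X, y).transform(X)


# Synthetic check: y depends on X only through one linear combination.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
beta = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
y = (X @ beta) ** 3 + 0.1 * rng.normal(size=500)
proj = SimpleSIR(n_directions=1, n_slices=10).fit_transform(X, y)
```

On this synthetic data the single recovered component should be strongly correlated (up to sign) with `X @ beta`, since the link function is monotone.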

In [1]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
from SIR import SIR
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA
import seaborn as sns
from sklearn.datasets import load_boston
from sklearn.datasets import load_breast_cancer
from sklearn.datasets import load_diabetes
import matplotlib.pyplot as plt2
import matplotlib.pyplot as plt
from sklearn import preprocessing
%matplotlib inline 

from SufficientDR import sdr
from sliced import SlicedAverageVarianceEstimation
from sliced import SlicedInverseRegression

dim_k=2

In [2]:
boston = load_boston(return_X_y=False)
print(np.size(np.array(boston['target'])))
data = []
target = []
rr = np.random.permutation(506)
for x in rr:
    data.append(boston['data'][x])
    target.append(boston['target'][x])
# training data
X1 = np.array(data)
Y1 = np.array(target)
print(X1.shape)
print(Y1.shape)

506
(506, 13)
(506,)


In [3]:
breast = load_breast_cancer(return_X_y=False)
data = []
target = []
rr = np.random.permutation(569)
for x in rr:
    data.append(breast['data'][x])
    target.append(breast['target'][x])
# training data
X2 = np.array(data)
Y2 = np.array(target)
print(X2.shape)
print(Y2.shape)

(569, 30)
(569,)


In [4]:
diabetes = load_diabetes(return_X_y=False)
data = []
target = []
rr = np.random.permutation(442)
for x in rr:
    data.append(diabetes['data'][x])
    target.append(diabetes['target'][x])
# training data
X3 = np.array(data)
Y3 = np.array(target)
print(X3.shape)
print(Y3.shape)

(442, 10)
(442,)


In [5]:
heart = pd.read_csv('heart.csv')
X = heart.iloc[:,:-1].values
y = heart.iloc[:,13].values
print(len(y))
data = []
target = []
rr = np.random.permutation(303)
for x in rr:
    data.append(X[x])
    target.append(y[x])
# training data
X4 = np.array(data)
Y4 = np.array(target)
print(X4.shape)
print(Y4.shape)

303
(303, 13)
(303,)


In [6]:
from sklearn.datasets import load_wine

wine = load_wine(return_X_y=False)
data = []
target = []
rr = np.random.permutation(178)
for x in rr:
    data.append(wine['data'][x])
    target.append(wine['target'][x])
# training data
X5 = np.array(data)
Y5 = np.array(target)
print(X5.shape)
print(Y5.shape)

(178, 13)
(178,)


In [7]:
ionosphere = pd.read_csv('ionosphere.data', header=None)
X = ionosphere.iloc[:,:-1].values
y = ionosphere.iloc[:,34].values
print(len(y))
data = []
target = []
rr = np.random.permutation(351)
for x in rr:
    data.append(X[x].astype(float)+0.000001*np.random.rand(34))
    if y[x] == 'g':
        target.append(1)
    else:
        target.append(0)
# training data
X6 = np.array(data)
Y6 = np.array(target)
print(X6.shape)
print(Y6.shape)

351
(351, 34)
(351,)


In [8]:
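def euclidean_distance(pnt1, pnt2):
    # Squared Euclidean distance; the square root is omitted because it does not change the neighbor ordering.
    return sum((pnt1 - pnt2) ** 2)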

from collections import defaultdict

def find_majority(labels):
    counter = defaultdict(int)
    for label in labels:
        counter[label] += 1

    majority_count = max(counter.values())
    for key, value in counter.items():
        if value == majority_count:
            return key

def new_predict(k, train_pnts, train_labels, test_pnts):
    distances = [(euclidean_distance(test_pnts, pnt), label)
                    for (pnt, label) in zip(train_pnts, train_labels)]
    compare = lambda distance: distance[0]
    by_distances = sorted(distances, key=compare)
    k_labels = [label for (_, label) in by_distances[:k]]
    return find_majority(k_labels)

def new_predict_regr(k, train_pnts, train_labels, test_pnts):
    distances = [(euclidean_distance(test_pnts, pnt), label)
                    for (pnt, label) in zip(train_pnts, train_labels)]
    compare = lambda distance: distance[0]
    by_distances = sorted(distances, key=compare)
    k_labels = [label for (_, label) in by_distances[:k]]
    return np.mean(k_labels)

def score_knn(reduced_train_x, reduced_train_y, reduced_test_x, reduced_test_y):
    i = 0
    total_correct = 0
    for test_image in reduced_test_x:
        pred = new_predict(10, reduced_train_x, reduced_train_y, test_image)
        if pred == reduced_test_y[i]:
            total_correct += 1
        i += 1
    score = (total_correct / i) * 100
    return score

def score_knn_regr(reduced_train_x, reduced_train_y, reduced_test_x, reduced_test_y):
    test_y_prediction = []
    for test_image in reduced_test_x:
        pred = new_predict_regr(10, reduced_train_x, reduced_train_y, test_image)
        test_y_prediction.append(pred)
    score = 100*(1.0-np.mean((np.array(test_y_prediction)- reduced_test_y)**2)/np.var(reduced_test_y))
    return score
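
For regression targets, the score returned by `score_knn_regr` above has the form 100 * (1 - MSE / Var(y)), which is the coefficient of determination expressed as a percentage. A small sketch with made-up numbers shows why this value can be zero or negative, unlike a classification accuracy. The helper name `r2_percent` is hypothetical and not part of the notebook.

```python
import numpy as np


def r2_percent(y_true, y_pred):
    """Return 100 * (1 - MSE / Var(y_true)), the same form as score_knn_regr."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_pred - y_true) ** 2)
    return 100.0 * (1.0 - mse / np.var(y_true))


y_true = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r2_percent(y_true, y_true)                       # exact predictions
baseline = r2_percent(y_true, np.full(4, y_true.mean()))   # always predict the mean
worse = r2_percent(y_true, y_true[::-1])                   # predictions in reversed order
```

Exact predictions give 100, predicting the mean gives 0, and predictions worse than the mean give a negative value.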

In [9]:
def pca_sir_save_as(train_x, train_y, test_x, test_y, batch_size, if_regr = 0):
    if if_regr==1:
        score_knn_function = score_knn_regr
    else:
        score_knn_function = score_knn
    
    scaler = preprocessing.StandardScaler().fit(train_x)
    train_x = scaler.transform(train_x)
    test_x = scaler.transform(test_x)

    pca = PCA(n_components =dim_k)
    pca.fit(train_x)
    pca_train_x = pca.transform(train_x)
    pca_test_x = pca.transform(test_x)
    score_pca = score_knn_function(pca_train_x, train_y, pca_test_x, test_y)
    scores_pca.append(score_pca)
    print('Accuracy of PCA-KNN:', str(round(score_pca, 2))+'%')

    sir_1 = SlicedInverseRegression(n_directions=dim_k)
    sir_1.fit(train_x,train_y)
    sir_train_x = np.real(sir_1.transform(train_x))
    sir_test_x = np.real(sir_1.transform(test_x))
    score_sir = score_knn_function(sir_train_x, train_y, sir_test_x, test_y)
    scores_sir.append(score_sir)
    print('Accuracy of SIR-KNN:', str(round(score_sir, 2))+'%')

    save_1 = SlicedAverageVarianceEstimation(n_directions=dim_k)
    save_1.fit(train_x,train_y)
    save_train_x = save_1.transform(train_x)
    save_test_x = save_1.transform(test_x)
    score_save = score_knn_function(save_train_x, train_y, save_test_x, test_y)
    scores_save.append(score_save)
    print('Accuracy of SAVE-KNN:', str(round(score_save, 2))+'%')
    
    # Regression runs (if_regr == 1) use classify=False; all other runs use classify=True.
    O = sdr(train_x, train_y, k=dim_k, Lambda = 10.0, number_of_neurons = 50, BATCH_SIZE=batch_size, num_epochs = 30, classify=(if_regr != 1))
    as_train_x = np.matmul(train_x, O)
    as_test_x = np.matmul(test_x, O)
    score_as = score_knn_function(as_train_x, train_y, as_test_x, test_y)
    scores_as.append(score_as)
    print('Accuracy of AS-KNN:', str(round(score_as, 2))+'%')
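
The example cells that follow hold out consecutive blocks of the shuffled data by hand, with a fixed block size and fold count per dataset. A possible alternative, sketched here with a hypothetical helper `consecutive_folds` (not part of the notebook), derives the blocks from the sample count so that every sample is tested exactly once:

```python
import numpy as np


def consecutive_folds(n_samples, n_folds):
    """Yield (train_idx, test_idx) index arrays over consecutive blocks."""
    indices = np.arange(n_samples)
    # array_split allows blocks of unequal size, so no sample is left out.
    for test_idx in np.array_split(indices, n_folds):
        train_idx = np.setdiff1d(indices, test_idx)
        yield train_idx, test_idx


# Example with the Boston sample count used in this notebook.
folds = list(consecutive_folds(506, 10))
```

Each pair can then index the arrays directly, for example `train_x, test_x = X1[train_idx], X1[test_idx]`.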

### Example 1: Boston 

We have data $\mathbf{X1}\in \mathbb{R}^{506 \times 13}$ and:
$$Y1 = f(X1 +\epsilon)$$

with $\epsilon \sim N(0,1)$ 

Using Boston data $(\mathbf{X1},Y1)_{i=1}^{506}$:

In [10]:
scores_pca = []
scores_sir = []
scores_save = []
scores_as = []
for part in range(10):
    print("Current part is %d\n" % part)
    train_x = np.concatenate((X1[0:(part*50)], X1[(part+1)*50:506]), axis=0)
    train_y = np.concatenate((Y1[0:(part*50)], Y1[(part+1)*50:506]), axis=0)
    test_x = X1[part*50:(part+1)*50]
    test_y = Y1[part*50:(part+1)*50]
    pca_sir_save_as(train_x, train_y, test_x, test_y, batch_size=38, if_regr=1)
print ("Average PCA score on test %.4f\n" %(np.mean(np.array(scores_pca))))
print ("Average SIR score on test %.4f\n" %(np.mean(np.array(scores_sir))))
print ("Average SAVE score on test %.4f\n" %(np.mean(np.array(scores_save))))
print ("Average AS score on test %.4f\n" %(np.mean(np.array(scores_as))))

Current part is 0

Accuracy of PCA-KNN: 41.28%
Accuracy of SIR-KNN: 65.74%
Accuracy of SAVE-KNN: 51.26%
Accuracy of AS-KNN: 70.74%
Current part is 1

Accuracy of PCA-KNN: 68.76%
Accuracy of SIR-KNN: 84.07%
Accuracy of SAVE-KNN: 48.32%
Accuracy of AS-KNN: 81.02%
Current part is 2

Accuracy of PCA-KNN: 61.15%
Accuracy of SIR-KNN: 79.58%
Accuracy of SAVE-KNN: 14.79%
Accuracy of AS-KNN: 74.38%
Current part is 3

Accuracy of PCA-KNN: 70.95%
Accuracy of SIR-KNN: 74.51%
Accuracy of SAVE-KNN: 36.48%
Accuracy of AS-KNN: 77.6%
Current part is 4

Accuracy of PCA-KNN: 58.92%
Accuracy of SIR-KNN: 84.59%
Accuracy of SAVE-KNN: 60.24%
Accuracy of AS-KNN: 78.59%
Current part is 5

Accuracy of PCA-KNN: 48.95%
Accuracy of SIR-KNN: 82.26%
Accuracy of SAVE-KNN: 41.64%
Accuracy of AS-KNN: 83.99%
Current part is 6

Accuracy of PCA-KNN: 18.45%
Accuracy of SIR-KNN: 56.65%
Accuracy of SAVE-KNN: 29.49%
Accuracy of AS-KNN: 51.13%
Current part is 7

Accuracy of PCA-KNN: 41.15%
Accuracy of SIR-KNN: 60.2%
Accuracy o

### Example 2: Breast cancer 

We have data $\mathbf{X2}\in \mathbb{R}^{569 \times 30}$ and:
$$Y2 = f(X2 +\epsilon)^2$$

with $\epsilon \sim N(0,1)$ 

Using Breast cancer data $(\mathbf{X2},Y2)_{i=1}^{569}$:

In [11]:
scores_pca = []
scores_sir = []
scores_save = []
scores_as = []
for part in range(8):
    print("Current part is %d\n" % part)
    train_x = np.concatenate((X2[0:(part*65)], X2[(part+1)*65:569]), axis=0)
    train_y = np.concatenate((Y2[0:(part*65)], Y2[(part+1)*65:569]), axis=0)
    test_x = X2[part*65:(part+1)*65]
    test_y = Y2[part*65:(part+1)*65]
    pca_sir_save_as(train_x, train_y, test_x, test_y, batch_size=56, if_regr=0)
print ("Average PCA score on test %.4f\n" %(np.mean(np.array(scores_pca))))
print ("Average SIR score on test %.4f\n" %(np.mean(np.array(scores_sir))))
print ("Average SAVE score on test %.4f\n" %(np.mean(np.array(scores_save))))
print ("Average AS score on test %.4f\n" %(np.mean(np.array(scores_as))))

Current part is 0

Accuracy of PCA-KNN: 96.92%
Accuracy of SIR-KNN: 98.46%
Accuracy of SAVE-KNN: 89.23%
Accuracy of AS-KNN: 98.46%
Current part is 1

Accuracy of PCA-KNN: 90.77%
Accuracy of SIR-KNN: 100.0%
Accuracy of SAVE-KNN: 93.85%
Accuracy of AS-KNN: 100.0%
Current part is 2

Accuracy of PCA-KNN: 90.77%
Accuracy of SIR-KNN: 95.38%
Accuracy of SAVE-KNN: 76.92%
Accuracy of AS-KNN: 96.92%
Current part is 3

Accuracy of PCA-KNN: 95.38%
Accuracy of SIR-KNN: 95.38%
Accuracy of SAVE-KNN: 90.77%
Accuracy of AS-KNN: 95.38%
Current part is 4

Accuracy of PCA-KNN: 93.85%
Accuracy of SIR-KNN: 96.92%
Accuracy of SAVE-KNN: 87.69%
Accuracy of AS-KNN: 96.92%
Current part is 5

Accuracy of PCA-KNN: 92.31%
Accuracy of SIR-KNN: 96.92%
Accuracy of SAVE-KNN: 90.77%
Accuracy of AS-KNN: 98.46%
Current part is 6

Accuracy of PCA-KNN: 93.85%
Accuracy of SIR-KNN: 96.92%
Accuracy of SAVE-KNN: 86.15%
Accuracy of AS-KNN: 98.46%
Current part is 7

Accuracy of PCA-KNN: 93.85%
Accuracy of SIR-KNN: 98.46%
Accuracy

### The next cell contains the k-NN experiment: it failed, and I cannot find the error

### Example 3: Diabetes

We have data $\mathbf{X3}\in \mathbb{R}^{442 \times 10}$ and:
$$Y3 = f(X3 +\epsilon)$$

with $\epsilon \sim N(0,1)$ 

Using Diabetes data $(\mathbf{X3},Y3)_{i=1}^{442}$:

In [12]:
scores_pca = []
scores_sir = []
scores_save = []
scores_as = []
for part in range(10):
    print("Current part is %d\n" % part)
    train_x = np.concatenate((X3[0:(part*37)], X3[(part+1)*37:442]), axis=0)
    train_y = np.concatenate((Y3[0:(part*37)], Y3[(part+1)*37:442]), axis=0)
    test_x = X3[part*37:(part+1)*37]
    test_y = Y3[part*37:(part+1)*37]
    pca_sir_save_as(train_x, train_y, test_x, test_y, batch_size=45, if_regr=1)
print ("Average PCA score on test %.4f\n" %(np.mean(np.array(scores_pca))))
print ("Average SIR score on test %.4f\n" %(np.mean(np.array(scores_sir))))
print ("Average SAVE score on test %.4f\n" %(np.mean(np.array(scores_save))))
print ("Average AS score on test %.4f\n" %(np.mean(np.array(scores_as))))

Current part is 0

Accuracy of PCA-KNN: 26.26%
Accuracy of SIR-KNN: 27.51%
Accuracy of SAVE-KNN: -10.35%
Accuracy of AS-KNN: 37.43%
Current part is 1

Accuracy of PCA-KNN: 38.13%
Accuracy of SIR-KNN: 46.29%
Accuracy of SAVE-KNN: 11.87%
Accuracy of AS-KNN: 45.73%
Current part is 2

Accuracy of PCA-KNN: 22.42%
Accuracy of SIR-KNN: 56.03%
Accuracy of SAVE-KNN: -5.95%
Accuracy of AS-KNN: 47.24%
Current part is 3

Accuracy of PCA-KNN: 51.72%
Accuracy of SIR-KNN: 68.07%
Accuracy of SAVE-KNN: -6.29%
Accuracy of AS-KNN: 66.59%
Current part is 4

Accuracy of PCA-KNN: 18.14%
Accuracy of SIR-KNN: 49.2%
Accuracy of SAVE-KNN: 20.44%
Accuracy of AS-KNN: 41.0%
Current part is 5

Accuracy of PCA-KNN: -8.69%
Accuracy of SIR-KNN: 25.8%
Accuracy of SAVE-KNN: 0.56%
Accuracy of AS-KNN: 22.83%
Current part is 6

Accuracy of PCA-KNN: 22.36%
Accuracy of SIR-KNN: 22.23%
Accuracy of SAVE-KNN: -21.7%
Accuracy of AS-KNN: 36.06%
Current part is 7

Accuracy of PCA-KNN: 26.06%
Accuracy of SIR-KNN: 30.04%
Accuracy of

### Example 4: Heart Disease 

We have data $\mathbf{X4}\in \mathbb{R}^{303 \times 13}$ and:
$$Y4 = f(X4 +\epsilon)$$

with $\epsilon \sim N(0,1)$ 

Using Heart Disease data $(\mathbf{X4},Y4)_{i=1}^{303}$:

In [13]:
scores_pca = []
scores_sir = []
scores_save = []
scores_as = []
for part in range(9):
    print("Current part is %d\n" % part)
    train_x = np.concatenate((X4[0:(part*33)], X4[(part+1)*33:303]), axis=0)
    train_y = np.concatenate((Y4[0:(part*33)], Y4[(part+1)*33:303]), axis=0)
    test_x = X4[part*33:(part+1)*33]
    test_y = Y4[part*33:(part+1)*33]
    pca_sir_save_as(train_x, train_y, test_x, test_y, batch_size=30, if_regr=0)
print ("Average PCA score on test %.4f\n" %(np.mean(np.array(scores_pca))))
print ("Average SIR score on test %.4f\n" %(np.mean(np.array(scores_sir))))
print ("Average SAVE score on test %.4f\n" %(np.mean(np.array(scores_save))))
print ("Average AS score on test %.4f\n" %(np.mean(np.array(scores_as))))

Current part is 0

Accuracy of PCA-KNN: 87.88%
Accuracy of SIR-KNN: 90.91%
Accuracy of SAVE-KNN: 84.85%
Accuracy of AS-KNN: 93.94%
Current part is 1

Accuracy of PCA-KNN: 72.73%
Accuracy of SIR-KNN: 81.82%
Accuracy of SAVE-KNN: 78.79%
Accuracy of AS-KNN: 78.79%
Current part is 2

Accuracy of PCA-KNN: 78.79%
Accuracy of SIR-KNN: 84.85%
Accuracy of SAVE-KNN: 84.85%
Accuracy of AS-KNN: 69.7%
Current part is 3

Accuracy of PCA-KNN: 75.76%
Accuracy of SIR-KNN: 75.76%
Accuracy of SAVE-KNN: 72.73%
Accuracy of AS-KNN: 78.79%
Current part is 4

Accuracy of PCA-KNN: 78.79%
Accuracy of SIR-KNN: 66.67%
Accuracy of SAVE-KNN: 75.76%
Accuracy of AS-KNN: 75.76%
Current part is 5

Accuracy of PCA-KNN: 81.82%
Accuracy of SIR-KNN: 78.79%
Accuracy of SAVE-KNN: 75.76%
Accuracy of AS-KNN: 78.79%
Current part is 6

Accuracy of PCA-KNN: 81.82%
Accuracy of SIR-KNN: 81.82%
Accuracy of SAVE-KNN: 81.82%
Accuracy of AS-KNN: 84.85%
Current part is 7

Accuracy of PCA-KNN: 81.82%
Accuracy of SIR-KNN: 87.88%
Accuracy 

In [14]:
scores_pca = []
scores_sir = []
scores_save = []
scores_as = []
for part in range(8):
    print("Current part is %d\n" % part)
    train_x = np.concatenate((X5[0:(part*22)], X5[(part+1)*22:178]), axis=0)
    train_y = np.concatenate((Y5[0:(part*22)], Y5[(part+1)*22:178]), axis=0)
    test_x = X5[part*22:(part+1)*22]
    test_y = Y5[part*22:(part+1)*22]
    pca_sir_save_as(train_x, train_y, test_x, test_y, batch_size=26, if_regr=1)
print ("Average PCA score on test %.4f\n" %(np.mean(np.array(scores_pca))))
print ("Average SIR score on test %.4f\n" %(np.mean(np.array(scores_sir))))
print ("Average SAVE score on test %.4f\n" %(np.mean(np.array(scores_save))))
print ("Average AS score on test %.4f\n" %(np.mean(np.array(scores_as))))

Current part is 0

Accuracy of PCA-KNN: 95.6%
Accuracy of SIR-KNN: 100.0%
Accuracy of SAVE-KNN: 35.7%
Accuracy of AS-KNN: 98.07%
Current part is 1

Accuracy of PCA-KNN: 99.59%
Accuracy of SIR-KNN: 99.73%
Accuracy of SAVE-KNN: 43.25%
Accuracy of AS-KNN: 99.73%
Current part is 2

Accuracy of PCA-KNN: 89.29%
Accuracy of SIR-KNN: 100.0%
Accuracy of SAVE-KNN: 62.56%
Accuracy of AS-KNN: 90.64%
Current part is 3

Accuracy of PCA-KNN: 95.21%
Accuracy of SIR-KNN: 99.91%
Accuracy of SAVE-KNN: 76.4%
Accuracy of AS-KNN: 98.67%
Current part is 4

Accuracy of PCA-KNN: 91.74%
Accuracy of SIR-KNN: 98.76%
Accuracy of SAVE-KNN: 18.18%
Accuracy of AS-KNN: 95.52%
Current part is 5

Accuracy of PCA-KNN: 99.27%
Accuracy of SIR-KNN: 94.74%
Accuracy of SAVE-KNN: 26.56%
Accuracy of AS-KNN: 99.92%
Current part is 6

Accuracy of PCA-KNN: 86.63%
Accuracy of SIR-KNN: 96.62%
Accuracy of SAVE-KNN: 59.98%
Accuracy of AS-KNN: 97.88%
Current part is 7

Accuracy of PCA-KNN: 93.93%
Accuracy of SIR-KNN: 99.71%
Accuracy of

In [15]:
scores_pca = []
scores_sir = []
scores_save = []
scores_as = []
for part in range(10):
    print("Current part is %d\n" % part)
    train_x = np.concatenate((X6[0:(part*35)], X6[(part+1)*35:351]), axis=0)
    train_y = np.concatenate((Y6[0:(part*35)], Y6[(part+1)*35:351]), axis=0)
    test_x = X6[part*35:(part+1)*35]
    test_y = Y6[part*35:(part+1)*35]
    pca_sir_save_as(train_x, train_y, test_x, test_y, batch_size=79, if_regr=0)    
print ("Average PCA score on test %.4f\n" %(np.mean(np.array(scores_pca))))
print ("Average SIR score on test %.4f\n" %(np.mean(np.array(scores_sir))))
print ("Average SAVE score on test %.4f\n" %(np.mean(np.array(scores_save))))
print ("Average AS score on test %.4f\n" %(np.mean(np.array(scores_as))))

Current part is 0

Accuracy of PCA-KNN: 43.58%
Accuracy of SIR-KNN: 51.0%
Accuracy of SAVE-KNN: 77.32%
Accuracy of AS-KNN: 33.36%
Current part is 1

Accuracy of PCA-KNN: 39.13%
Accuracy of SIR-KNN: 70.71%
Accuracy of SAVE-KNN: 77.68%
Accuracy of AS-KNN: 60.18%
Current part is 2

Accuracy of PCA-KNN: 23.79%
Accuracy of SIR-KNN: 55.24%
Accuracy of SAVE-KNN: 60.56%
Accuracy of AS-KNN: 74.64%
Current part is 3

Accuracy of PCA-KNN: 33.78%
Accuracy of SIR-KNN: 64.44%
Accuracy of SAVE-KNN: 50.02%
Accuracy of AS-KNN: 49.74%
Current part is 4

Accuracy of PCA-KNN: 19.98%
Accuracy of SIR-KNN: 53.37%
Accuracy of SAVE-KNN: 75.82%
Accuracy of AS-KNN: 49.69%
Current part is 5

Accuracy of PCA-KNN: 40.73%
Accuracy of SIR-KNN: 63.37%
Accuracy of SAVE-KNN: 59.4%
Accuracy of AS-KNN: 49.72%
Current part is 6

Accuracy of PCA-KNN: 28.95%
Accuracy of SIR-KNN: 71.42%
Accuracy of SAVE-KNN: 73.05%
Accuracy of AS-KNN: 47.27%
Current part is 7

Accuracy of PCA-KNN: -8.3%
Accuracy of SIR-KNN: 30.37%
Accuracy of