# Understanding the Weighted Logistic Regression-Based Transfer Learning Algorithm (wLTL)

Reference paper : https://ieeexplore.ieee.org/abstract/document/8737742/

# A. Proposed Logistic Regression-Based Transfer Learning Algorithm (LTL)

\begin{equation*} P(y_{s}^{i}=1\vert \mathbf {x}_{s}^{i};\mathbf {w}_{s}) = \frac {1}{1+\exp \left(-\mathbf {w}_{s}^{T}\mathbf {x}_{s}^{i}\right)},\tag{1}\end{equation*}

Formula (1) describes the logistic regression model used to make probabilistic predictions for binary classification.

Note: The set of labeled EEG trials from session $s$ can be written as $ d_{s}=\{(\mathbf {x}_{s}^{i},y_{s}^{i})\}_{i=1}^{n_{s}} $, where

- $ \mathbf {x}_{s}^{i} $ = the feature vector of trial $i$
- $ y_{s}^{i} $ = the class label of trial $i$

Here $s$ denotes the session, and $ \mathbf {w}_{s}\in \mathbb {R}^{v\times 1} $ refers to the classification parameters used to predict the class labels of the trials $\mathbf {x}_{s}^{i}$.
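As a quick numerical illustration of (1), the sketch below evaluates the predicted probability for a single feature vector. The weights and features are made-up values chosen only for illustration, and the helper name `predict_prob` is hypothetical.

```python
import numpy as np

def predict_prob(w, x):
    """P(y=1 | x; w) from equation (1): the logistic sigmoid of w^T x."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

# Illustrative (made-up) parameters and feature vector, v = 3
w_s = np.array([0.5, -1.0, 2.0])
x_i = np.array([1.0, 0.5, 0.25])

p = predict_prob(w_s, x_i)   # here w^T x = 0.5 - 0.5 + 0.5 = 0.5
label = int(p >= 0.5)        # threshold at 0.5 for a hard class decision
```

A value of $\mathbf {w}_{s}^{T}\mathbf {x}_{s}^{i}$ above zero gives a probability above 0.5, so the sign of the linear score decides the predicted class.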

The proposed LTL algorithm consists of two main steps. 

In the first step, for every previously recorded session $ d_{s}\in \{d_{1},d_{2},\ldots ,d_{m}\} $, the classification parameters $\mathbf {w}_{s}$ are calculated using the following objective function:

\begin{equation*} \text L_{1}(\mathbf {w}_{s})= \min _{\mathbf {w}_{s}} \quad \left({\sum _{i=1}^{n_{s}} \text H(\mathbf {w}_{s};y_{s}^{i},\mathbf {x}_{s}^{i}) +\lambda _{s} \vert \vert \mathbf {w}_{s}\vert \vert _{2}^{2}}\right),\tag{2}\end{equation*}

where $\text H$ and $ \vert \vert \cdot \vert \vert _{2} $ denote the cross-entropy and 2-norm functions, respectively.

In $ \text L_{1}(\mathbf {w}_{s}) $, the cross-entropy term aims at finding the $\mathbf {w}_{s}$ that minimizes the error rate, while the 2-norm term penalizes large values of $\mathbf {w}_{s}$ to reduce the risk of over-fitting.

The subject-specific regularization parameter $\lambda _{s}$ controls the degree of penalization.

The cross-entropy function $\text H$ is also called the **negative log-likelihood**; minimizing it is equivalent to maximizing the log-likelihood. It is defined in (3):

\begin{align*} \text H(\mathbf {w}_{s}; y_{s}^{i}, \mathbf {x}_{s}^{i}) =&-y_{s}^{i} \log P(y_{s}^{i}=1\vert \mathbf {x}_{s}^{i};\mathbf {w}_{s}) \\&- (1 - y_{s}^{i}) \log \left(1 - P(y_{s}^{i}=1\vert \mathbf {x}_{s}^{i};\mathbf {w}_{s})\right),\tag{3}\end{align*}

where $ P(y_{s}^{i}=1\vert \mathbf {x}_{s}^{i};\mathbf {w}_{s}) $ is calculated using (1). The objective function $ \text L_{1}(\mathbf {w}_{s}) $ does not have a closed-form solution. However, it has a unique minimum that can be found using simple and effective iterative approaches such as gradient descent.
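As a reference for the iterative approach, the gradient of the objective in (2) can be written out explicitly. This is the standard result for 2-norm-regularized logistic regression, derived here from (1) to (3) rather than quoted from the paper; the step size $\eta$ is a symbol introduced only for this sketch.

\begin{align*} \nabla _{\mathbf {w}_{s}}\left({\sum _{i=1}^{n_{s}} \text H(\mathbf {w}_{s};y_{s}^{i},\mathbf {x}_{s}^{i}) +\lambda _{s} \vert \vert \mathbf {w}_{s}\vert \vert _{2}^{2}}\right) &= \sum _{i=1}^{n_{s}} \left(P(y_{s}^{i}=1\vert \mathbf {x}_{s}^{i};\mathbf {w}_{s}) - y_{s}^{i}\right)\mathbf {x}_{s}^{i} + 2\lambda _{s}\mathbf {w}_{s}, \\ \mathbf {w}_{s} &\leftarrow \mathbf {w}_{s} - \eta \, \nabla _{\mathbf {w}_{s}}(\cdot). \end{align*}

Each trial contributes its prediction error times its feature vector, and the penalty term shrinks $\mathbf {w}_{s}$ toward zero. Implementations often absorb the factor 2 into $\lambda _{s}$, which is equivalent to penalizing with $0.5\lambda _{s}\vert \vert \mathbf {w}_{s}\vert \vert _{2}^{2}$.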


Although it is sufficiently effective for sessions with large training sets, the objective function $ \text L_{1}(\mathbf {w}_{s}) $ may fail to estimate the classification parameters of a new subject, because the few available subject-specific trials typically cannot represent the feature distributions accurately.

In other words, in addition to being discriminative, the parameters should be similar to the classification parameters of the other sessions, under the assumption that there is some common information across sessions performing the same mental tasks (i.e., motor imagery).

Given this assumption, after the classification parameters of the previously recorded sessions have been calculated using (2), the second step calculates the classification parameters of the **new target subject, $\mathbf {w}_{t}$**, using the following objective function:

\begin{equation*} \text L_{2}(\mathbf {w}_{t})\!=\!\min _{\mathbf {w}_{t}} \quad \left({\sum _{i=1}^{n_{t}} \text H(\mathbf {w}_{t};y_{t}^{i},\mathbf {x}_{t}^{i}) + \lambda _{t} \text R_{TL}(\mathbf {w}_{t})}\right),\tag{4}\end{equation*}

where $\text R_{TL}$ is the regularization term penalizing dissimilarities between $\mathbf {w}_{t}$ and the previously calculated $\mathbf {w}_{s}$ (source subjects), $ \forall d_{s}\in \{d_{1},d_{2},\ldots ,d_{m}\} $.

The regularization parameter $\lambda _{t}$ sets the trade-off between minimizing the classification error and minimizing the dissimilarity between the classification parameters of the new target subject and those of the previous sessions.


The term $\text R_{TL}$ is calculated by taking into account the prior distribution of the existing classification parameters and comparing them with $\mathbf {w}_{t}$:

\begin{equation*} \text R_{TL}(\mathbf {w}_{t})= 0.5[(\mathbf {w}_{t}-\boldsymbol {\mu })^{T}\boldsymbol {\Sigma }_{TL}^{-1}(\mathbf {w}_{t}-\boldsymbol {\mu })+\log (\vert \boldsymbol {\Sigma }_{TL}\vert)],\tag{5}\end{equation*}

where $\boldsymbol {\mu }$ and $ \boldsymbol {\Sigma }_{TL} $ are respectively calculated as follows:

\begin{align*} &\qquad \qquad \quad \boldsymbol {\mu } = (1/m)\sum _{s=1}^{m} \mathbf {w}_{s},\tag{6}\\ &\boldsymbol {\Sigma }_{TL}=\frac {\text {diag}\left({\sum _{s=1}^{m} (\mathbf {w}_{s} - \boldsymbol {\mu })(\mathbf {w}_{s} - \boldsymbol {\mu })^{T}}\right)}{\text {trace} \left({\sum _{s=1}^{m}(\mathbf {w}_{s} - \boldsymbol {\mu })(\mathbf {w}_{s} -\boldsymbol { \mu })^{T}}\right)}.\tag{7}\end{align*}

As can be seen in (7), $\boldsymbol {\Sigma }_{TL} \!\in \!\mathbb {R}^{v\times v }$ only includes the normalized diagonal elements of the covariance matrix, where diag and trace give the diagonal elements and the sum of the diagonal elements of a matrix respectively.

In this study, only the diagonal elements are used, to reduce the optimization complexity.
Subsequently, in (5), $\boldsymbol {\Sigma }_{TL}$ is used to normalize the divergence of each parameter of $\mathbf {w}_{t}$ from the average of the corresponding parameters of the source classifiers.
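Equations (5) to (7) can be checked on a small example. The sketch below is a minimal NumPy illustration, not the reference implementation: the function names `prior_from_sources` and `r_tl` are hypothetical, and the source parameters are made-up numbers.

```python
import numpy as np

def prior_from_sources(W):
    """W has shape (m, v): one row of classifier parameters per source session."""
    mu = W.mean(axis=0)                                       # equation (6)
    centered = W - mu
    scatter = centered.T @ centered                           # sum_s (w_s - mu)(w_s - mu)^T
    sigma_tl = np.diag(np.diag(scatter) / np.trace(scatter))  # equation (7)
    return mu, sigma_tl

def r_tl(w_t, mu, sigma_tl):
    """Regularization term of equation (5)."""
    diff = w_t - mu
    quad = diff @ np.linalg.inv(sigma_tl) @ diff
    _, logdet = np.linalg.slogdet(sigma_tl)                   # log|Sigma_TL|, numerically stable
    return 0.5 * (quad + logdet)

# Made-up parameters for m = 3 source sessions with v = 2 parameters each
W = np.array([[1.0, 0.0],
              [3.0, 1.0],
              [2.0, 2.0]])
mu, sigma_tl = prior_from_sources(W)

penalty_at_mean = r_tl(mu, mu, sigma_tl)
penalty_far = r_tl(mu + np.array([1.0, 1.0]), mu, sigma_tl)
```

Because of the trace normalization, the diagonal of $\boldsymbol {\Sigma }_{TL}$ sums to one, and the penalty grows as $\mathbf {w}_{t}$ moves away from $\boldsymbol {\mu }$.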

In [None]:
# Example code from https://github.com/orvindemsy/EA-wLTL/tree/master (with thanks to the author)

from sklearn.base import BaseEstimator, ClassifierMixin
import numpy as np

class LogReg_TL(BaseEstimator):
    def __init__(self, learningRate=1e-5, num_iter=100, penalty=None, intercept = True,\
                 lambd=1, ETL=np.array([[0, 0],[0, 0]]), mu=0):
        
        self.learningRate = learningRate
        self.num_iter = num_iter
        self.penalty = penalty
        self.intercept = intercept
        self.ETL = ETL
        self.lambd = lambd
        self.mu = mu
        
    def __sigmoid(self, z):
        # Logistic sigmoid of equation (1), for the binary case
        # (a softmax would be needed for more than two classes)
        return 1/(1 + np.exp(-z))
    
    def __logLL(self, z, y):
        # Negative log-likelihood (cross-entropy) of equation (3), summed over trials
        return -1 * np.sum((y * np.log(self.__sigmoid(z))) + ((1 - y) * np.log(1 - self.__sigmoid(z))))

    def __reg_logLL1(self, z, y, weights):
        # Objective L_1 for source subjects, equation (2). The factor 0.5 matches
        # the gradient step in fit(), which adds lambd * weights.
        reg = 0.5 * self.lambd * np.dot(weights, weights)

        return self.__logLL(z, y) + reg
    
    def __reg_logLL2(self, z, y, weights):
        # Objective L_2 for the target subject, equations (4) and (5)
        diff = weights - self.mu
        _, ETL_logdet = np.linalg.slogdet(self.ETL)
        
        reg = 0.5 * self.lambd * (diff @ np.linalg.inv(self.ETL) @ diff + ETL_logdet)

        return self.__logLL(z, y) + reg
        
    def fit(self, X_train, y_train):
        self.weights = np.zeros(np.shape(X_train)[1] + 1) 
        
        if self.intercept:
            X_train = np.c_[np.ones([np.shape(X_train)[0], 1]), X_train]
        
        self.costs = []
        
        for i in range(self.num_iter):
            z = np.dot(X_train, self.weights)
            err = -y_train + self.__sigmoid(z)
            
            if self.penalty == 'L1':
                if i == 0:
                    print(self.penalty)
                    
                delta_w = np.dot(err, X_train)

                # weight update
                self.weights += -self.learningRate * delta_w
                self.weights[1:] += -self.learningRate * self.lambd * self.weights[1:]
                
                # costs
                self.costs.append(self.__reg_logLL1(z, y_train, self.weights))
                
            # 'L2' selects objective L_2 of equation (4), not a 2-norm penalty; it needs a prior mean mu
            elif (self.penalty == 'L2') and (np.any(self.mu)) :
                if i == 0:
                    print(self.penalty)
                
                delta_w = np.dot(err, X_train)
                
                # weight update
                self.weights += -self.learningRate * delta_w
                self.weights[1:] += -self.learningRate * self.lambd * ((self.weights - self.mu)@(np.linalg.inv(self.ETL)) )[1:]
            
                # cost
                self.costs.append(self.__reg_logLL2(z, y_train, self.weights))
                
            else:
                if i == 0:
                    print(self.penalty)
                    
                delta_w = -self.learningRate * np.dot(err, X_train)
                
                # weight update
                self.weights += delta_w    
                
                # cost
                self.costs.append(self.__logLL(z, y_train))
                
            if i % 30000 == 0:
                print('weights: ',np.round(self.weights, 3))           
            
        return self
    
    def predict_proba(self, X):
        return self.__sigmoid((X @ self.weights[1:]) + self.weights[0])
        
    def predict(self, X):
        return np.round(self.predict_proba(X))
    
    def score(self, X, y):
        y_pred = self.predict(X)
        scores = ((y==y_pred)*1).mean()
        
        return scores

In [None]:
def build_clf_params(data):
    # Where the training data is stored
    X = data['feat_train']
    y = data['y']
    
    # Use this model when training subject as source 
    model_L1 = LogReg_TL(learningRate=0.001, num_iter=30000, penalty='L1', lambd=1)
    
    # Fit model and store weight
    model_L1.fit(X, y)
    data['ws'] = model_L1.weights


In [None]:
# Iterate over all source subjects to compute weights
for subj in TL_data['src'].keys():
    print('Processing weight subject ', subj)
    build_clf_params(TL_data['src'][subj]) 

# B. Proposed Weighted Logistic Regression-Based Transfer Learning Algorithm (wLTL)

The proposed LTL algorithm attempts to improve the estimation of the classification parameters of a new subject by incorporating the data from previously recorded sessions.

However, it treats **different feature spaces from the previous sessions similarly**, whereas the distribution of EEG signals can be different from session to session and from subject to subject, leading to different subject-specific CSP feature spaces.

To address this issue, the proposed weighted logistic regression-based transfer learning (wLTL) algorithm allocates different weights **to the previously recorded sessions to represent similarities between these sessions and the new subject** in terms of the feature distributions.

Kullback-Leibler (KL) divergence is frequently used in the literature to calculate similarities between two sets of EEG features.

Since the features in MI-based BCIs are typically the normalized log-power of CSP-filtered EEG data, they are commonly assumed to be normally distributed.

Thus, in this paper, the KL divergence between two normal distributions is used to **measure the divergence between EEG features**.

Given two normal distributions presented as $\mathcal {N}_{0}(\boldsymbol {\mu }_{0},\boldsymbol {\Sigma }_{0})$ and $\mathcal {N}_{1}(\boldsymbol {\mu }_{1},\boldsymbol {\Sigma }_{1})$ , the KL divergence has the following closed form

\begin{align*} \text {KL}[\mathcal {N}_{0}\vert \vert \mathcal {N}_{1}]=&0.5\Biggl [(\boldsymbol {\mu }_{1} - \boldsymbol {\mu }_{0})^{T}\boldsymbol {\Sigma }_{1}^{-1}(\boldsymbol {\mu }_{1} - \boldsymbol {\mu }_{0}) \\&+\,\text {trace}(\boldsymbol {\Sigma }_{1}^{-1}\boldsymbol {\Sigma }_{0})-\ln \left({\frac {\det (\boldsymbol {\Sigma }_{0})}{\det (\boldsymbol {\Sigma }_{1})}}\right)-K \Biggr ],\tag{8}\end{align*}

where det, T and K denote the determinant function, transpose of the matrix, and the dimension of the data, respectively.

In the supervised case, the total divergence is calculated by averaging the KL divergences calculated for each class separately.
In the unsupervised case, on the other hand, the total divergence equals the KL divergence between the two sessions without considering the class labels.

Subsequently, the similarity weight $\alpha _{s}$ between the feature set of the target subject $d_{t}$ and the feature set of each of the previous sessions/subjects $d_{s}$ is calculated as:
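The difference between the supervised and unsupervised cases can be sketched as follows. This is a minimal illustration under stated assumptions: `gaussian_kl` and `total_divergence` are hypothetical names, and the data are synthetic Gaussian samples standing in for CSP features, not real EEG.

```python
import numpy as np

def gaussian_kl(P, Q):
    """KL[N_P || N_Q] from equation (8); rows are trials, columns are features."""
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    cov_p = np.cov(P, rowvar=False)
    cov_q = np.cov(Q, rowvar=False)
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    k = P.shape[1]
    return 0.5 * (diff @ cov_q_inv @ diff + np.trace(cov_q_inv @ cov_p)
                  - (logdet_p - logdet_q) - k)

def total_divergence(X_t, X_s, y_t=None, y_s=None):
    """Supervised: average of per-class KL. Unsupervised: KL ignoring labels."""
    if y_t is None or y_s is None:
        return gaussian_kl(X_t, X_s)
    classes = np.unique(y_s)
    return np.mean([gaussian_kl(X_t[y_t == c], X_s[y_s == c]) for c in classes])

# Synthetic two-feature data (assumption: stands in for CSP features)
rng = np.random.default_rng(0)
X_s = rng.normal(size=(200, 2))
y_s = np.repeat([0, 1], 100)
X_t = rng.normal(loc=0.5, size=(200, 2))
y_t = np.repeat([0, 1], 100)

kl_unsup = total_divergence(X_t, X_s)            # labels ignored
kl_sup = total_divergence(X_t, X_s, y_t, y_s)    # per-class average
kl_self = total_divergence(X_s, X_s)             # identical sets give ~0
```

The unsupervised form is the one that remains usable when the target trials have no labels yet.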


\begin{equation*} \alpha _{s} = \frac {1/\left(\bar {\text {KL}}[d_{t},d_{s}]+\epsilon \right)^{4}}{\sum \limits _{i=1}^{m} 1/\left(\bar {\text {KL}}[d_{t},d_{i}]+\epsilon \right)^{4}},\tag{9}\end{equation*}

$\epsilon =0.0001$ keeps the equation stable when $\bar {\text {KL}}[d_{t},d_{s}]$ equals zero, which happens when the two compared distributions are identical.

The power of 4 is applied to the inverse of the KL divergence between the distributions of two feature sets to give larger weights to more similar distributions and smaller weights to less similar ones. This increases the sparsity of the similarity weights $\alpha _{s}$.
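The effect of the exponent can be seen on a small example. The sketch below implements (9) with a configurable exponent so that the power of 4 can be compared with a plain inverse; the function name `similarity_weights`, the `power` argument, and the KL values are all made up for illustration.

```python
import numpy as np

def similarity_weights(kl, eps=1e-4, power=4):
    """Equation (9): normalized inverse of (KL + eps) raised to `power`."""
    inv = 1.0 / (np.asarray(kl, dtype=float) + eps) ** power
    return inv / inv.sum()

# Made-up average KL divergences from a target to three source sessions
kl = [0.5, 1.0, 2.0]

alpha_pow4 = similarity_weights(kl)            # exponent 4, as in (9)
alpha_pow1 = similarity_weights(kl, power=1)   # plain inverse, for comparison
```

With these values the closest session receives roughly 0.94 of the total weight under the power of 4, compared with roughly 0.57 under the plain inverse, which is the increased sparsity described above.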

The proposed weighted logistic regression-based transfer learning algorithm has the same steps as the proposed LTL. However, instead of equal weights, different weights are assigned to the previously recorded sessions using (9). Accordingly, the new weighted $\boldsymbol {\mu }$ is obtained as

\begin{equation*} \boldsymbol {\mu }_{w} = \sum _{s=1}^{m} \alpha _{s} \mathbf {w}_{s}.\tag{10}\end{equation*}

Likewise, the weighted $\boldsymbol {\Sigma }_{TL}$ is calculated as

\begin{equation*} \boldsymbol {\Sigma }_{TL_{w}}\!=\frac {\text {diag}\left({\sum _{s=1}^{m} (\alpha _{s} \mathbf {w}_{s} - \boldsymbol {\mu }_{w})(\alpha _{s} \mathbf {w}_{s} - \boldsymbol {\mu }_{w})^{T}}\right)}{\text {trace} \left({\sum _{s=1}^{m}(\alpha _{s} \mathbf {w}_{s} - \boldsymbol {\mu }_{w})(\alpha _{s} \mathbf {w}_{s} - \boldsymbol {\mu }_{w})^{T}}\right)}.\tag{11}\end{equation*}

Finally, $\text R_{TL}$ in (5) is calculated by replacing $\boldsymbol {\mu }$ and $\boldsymbol {\Sigma }_{TL}$ with $\boldsymbol {\mu }_{w}$ and $\boldsymbol {\Sigma }_{TL_{w}}$ respectively.
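Putting (4), (5), (10), and (11) together, the gradient that an iterative solver would follow for the target subject can be written out. It is derived here from the equations above rather than quoted from the paper, and it uses the fact that $\boldsymbol {\Sigma }_{TL_{w}}$ is symmetric and does not depend on $\mathbf {w}_{t}$, so the log-determinant term in (5) drops out.

\begin{equation*} \nabla _{\mathbf {w}_{t}}\left({\sum _{i=1}^{n_{t}} \text H(\mathbf {w}_{t};y_{t}^{i},\mathbf {x}_{t}^{i}) + \lambda _{t} \text R_{TL}(\mathbf {w}_{t})}\right) = \sum _{i=1}^{n_{t}} \left(P(y_{t}^{i}=1\vert \mathbf {x}_{t}^{i};\mathbf {w}_{t}) - y_{t}^{i}\right)\mathbf {x}_{t}^{i} + \lambda _{t}\boldsymbol {\Sigma }_{TL_{w}}^{-1}(\mathbf {w}_{t}-\boldsymbol {\mu }_{w}). \end{equation*}

Because $\boldsymbol {\Sigma }_{TL_{w}}$ is diagonal, the second term scales each component of $\mathbf {w}_{t}-\boldsymbol {\mu }_{w}$ by the inverse of its normalized variance: parameters that vary little across the source sessions are pulled toward $\boldsymbol {\mu }_{w}$ more strongly than parameters that vary a lot.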

Example code for computing the KL divergence between each target subject and all source subjects, with thanks again to https://github.com/orvindemsy/EA-wLTL/tree/master

In [None]:
# First define the kl divergence
def KL_div(P, Q):
    # First convert to np array
    P = np.array(P)
    Q = np.array(Q)
    
    # Then compute their means, data in shape of samples x feat
    mu_P = np.mean(P, axis=0)
    mu_Q = np.mean(Q, axis=0)    
    
    # Compute their covariance
    cov_P = np.cov(P, rowvar=False)
    cov_Q = np.cov(Q, rowvar=False)    
        
    cov_Q_inv = np.linalg.inv(cov_Q)
    
    # Compute KL divergence
    KL_div = np.log(np.linalg.det(cov_Q)/np.linalg.det(cov_P)) - mu_P.shape[0] + np.trace(cov_Q_inv@cov_P) + \
                (mu_P - mu_Q).T@cov_Q_inv@(mu_P - mu_Q)
    
    KL_div = 0.5 * KL_div
    
    return KL_div

In [None]:
# Compute kl divergence of target subject to each source subject
def compute_all_kl_div(data):
    '''
    Parameter:
    data, is the whole data containing target and source data
    '''
    
    # Iterate over all subject
    for tgt_subj in data['tgt'].keys():
        kl_div_score = []
        
        print('=== Current target subject %02d === ' %tgt_subj)
        # Separate into left and right class
        yP = data['tgt'][tgt_subj]['ytr']
        P_left = data['tgt'][tgt_subj]['feat_train'][yP==0]
        P_right = data['tgt'][tgt_subj]['feat_train'][yP==1]
        
        
        for src_subj in data['src'].keys():    
            print('KL div with respect to source subject', src_subj)
            
            # Separate into left and right class
            yQ = data['src'][src_subj]['y']
            Q_left = data['src'][src_subj]['feat_train'][yQ==0]
            Q_right = data['src'][src_subj]['feat_train'][yQ==1]
            
            # Compute kl div of each class, average, then append them
            kl_left = KL_div(P_left, Q_left)
            kl_right = KL_div(P_right, Q_right)
            kl_div = (kl_left + kl_right)/2
            kl_div_score.append(kl_div)
            
        # Zero out the entry of the source subject that is the current target
        # (assumes source subjects are keyed 1..N in order)
        kl_div_score[tgt_subj-1] = 0    
            
        # Store them back into current target subject
        data['tgt'][tgt_subj]['kl_div'] = kl_div_score

In [None]:
compute_all_kl_div(TL_data)

Similarity weights $\alpha_{s}$, Equation (9)

In [None]:
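def compute_similarity_weights(kl):
    # Similarity weights alpha_s of equation (9):
    # inverse of (KL + eps)^4, normalized so that the weights sum to one
    eps = 1e-4
    
    # Entries equal to zero mark the target subject itself and are skipped
    KL_inv = [1/(val + eps)**4 for val in kl if val != 0]
    total = sum(KL_inv)
    
    a_s = [inv_val/total for inv_val in KL_inv]
                
    return a_s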

In [None]:
# Compute similarity weights of all target subjects
for tgt_subj in TL_data['tgt'].keys():
    kl = TL_data['tgt'][tgt_subj]['kl_div']
    TL_data['tgt'][tgt_subj]['a_s'] = compute_similarity_weights(kl)

Weighted $\boldsymbol {\mu }_{w}$ and $\boldsymbol {\Sigma }_{TL_{w}}$, Equations (10) and (11)

In [None]:
# Define algorithm to compute ETL  

def compute_ETL_and_mu_ws(data):
    '''
    NOTE THAT THIS CODE IS FOR WEIGHTED LTL, THIS USE a_s, weights of similarity from kl_divergence
    Parameter:
    data: contains data of all target and source subjects
    '''
    print('===== Computing ETL and mu_ws =======')
    # Will compute the ETL of each target subject
    for tgt_subj in data['tgt'].keys():
        print('=== Target subject %d ====' %tgt_subj)
        src_subj = [i for i in data['src'].keys() if i != tgt_subj]
            
        # First compute the mean of all source weights 'mu_ws' over all subjects
        all_ws = []
        
        # similarity weights a_s of current target
        a_s = data['tgt'][tgt_subj]['a_s']
        
        for a, s_subj in zip(a_s, src_subj):
            print('Gathering weights from source subject', s_subj)
            ws = data['src'][s_subj]['ws']
            ws = a * ws
            all_ws.append(ws)

        # Equation (10) is a weighted sum over subjects: the a_s already sum to one,
        # so the a*ws terms are summed rather than averaged again
        mu_ws = np.sum(np.array(all_ws), axis=0)
        
        # This will add up all ws - mu_ws, to compute ETL
        temp_ws = 0
        for a, s_subj in zip(a_s, src_subj):
            ws = data['src'][s_subj]['ws']
            ws_min_mu = (a*ws - mu_ws)[:, None] @ (a*ws - mu_ws)[None, :]
            temp_ws += ws_min_mu

        # Compute ETL, equation (11): normalized diagonal elements, zeros elsewhere
        num = np.diag(temp_ws)
        den = np.trace(temp_ws)
        ETL = np.diag(num/den)

        data['tgt'][tgt_subj]['ETL'] = ETL
        data['tgt'][tgt_subj]['mu_ws'] = mu_ws

Example: training the wLTL model

In [None]:
from sklearn.model_selection import StratifiedKFold

def model_weighted_LTL(data):
    '''
    data: data containing one target subject, currently data = TL_data['tgt'][tgt_subj_no]
    
    '''
    # Create stratified KFold instance
    skf = StratifiedKFold(n_splits=5)

    # Grab the train data of current target subject
    Xtr = data['feat_train']
    ytr = data['ytr']
    Xte = data['feat_test']
    yte = data['yte']

    #print(Xtr.shape)
    #print(ytr.shape)

    # Then also defines the ETL and mu_ws
    ETL = data['ETL']
    mu_ws = data['mu_ws']

    score_tr = []
    best_score = 0

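    # Do k-fold cv using stratified kfold. skf.split yields (train, test) indices;
    # they are swapped here on purpose so each model trains on the single small fold (10 samples)
    for i, (idx_te, idx_tr) in enumerate(skf.split(Xtr, ytr)):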
        print('processing cv-', i+1)
        
        # Make sure there are 10 train index
        assert len(idx_tr) == 10
        assert (ytr[idx_tr] == 1).sum() == 5

        # Define new model each iteration, so that weights is newly initialized
        model_L2 = LogReg_TL(learningRate=0.001, num_iter=30000, penalty='L2', lambd=1, ETL=ETL, mu=mu_ws)

        # Fit model on the ten training samples
        _ = model_L2.fit(Xtr[idx_tr], ytr[idx_tr])

        # Evaluate on n-1 folds of training samples
        curr_score = model_L2.score(Xtr[idx_te], ytr[idx_te])
        score_tr.append(curr_score)
        
        print('best score:', best_score)
        print('curr score:', curr_score)
        
        if best_score < curr_score:
            print('found best')
            best_score = curr_score
            best_idx_tr = idx_tr

    # Fit model using the best train index, obtain test score
    model_L2 = LogReg_TL(learningRate=0.001, num_iter=30000, penalty='L2', lambd=1, ETL=ETL, mu=mu_ws)

    _ = model_L2.fit(Xtr[best_idx_tr], ytr[best_idx_tr])
    score_te = model_L2.score(Xte, yte)   

    return np.mean(np.array(score_tr)), np.std(np.array(score_tr)), score_te

In [None]:
# for subj in TL_data.keys():
for tgt_subj in TL_data['tgt'].keys():
    print('=== Processing target subject %02d ====' %tgt_subj)
    temp_tgt = TL_data['tgt'][tgt_subj]
    temp_tgt['score_tr_wLTL'], temp_tgt['score_std_wLTL'], temp_tgt['score_te_wLTL'] = model_weighted_LTL(TL_data['tgt'][tgt_subj])