# Positive-Unlabeled (PU) Learning

Most of us are more familiar with two other learning settings:

**Supervised Learning**, wherein we have a set of features $X$ and a set of labels $y$, and the goal is to learn a mapping $f: X \mapsto y$; and

**Unsupervised Learning**, wherein, given only a set of features $X$, we want to identify latent patterns in order to
1. reduce dimensionality,
2. cluster data, or
3. detect potential anomalies.

PU learning sits between the two: only some positive examples are labeled, and every other example is unlabeled.

## Motivation
- In the healthcare space, positive diagnoses for medical conditions such as Type 2 Diabetes Mellitus (T2D) and Depression/Anxiety are fairly common and are often captured in medical claims records.
- However, labeled information for individuals without the condition is much less common. For many medical conditions, a large share of cases remains undiagnosed.

Oftentimes, the problem of identifying potentially undiagnosed cases is framed as a supervised machine learning problem under some reasonable assumptions. One variation of this approach is the following:

- Given a defined time interval $T$ during which none of the samples in the dataset has a medical record for the target condition, we gather all available data from that interval and extract variables that we can use for modeling.
- These variables serve as our feature data $X$.
- We then assign label 1 (positive class) if the person has a medical record of the target condition in the target time period after $T$,
- and label 0 (negative class) if the person does not.
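
The labeling rule above can be sketched in pandas as follows. This is a minimal illustration under stated assumptions: the `claims` table, its column names, and the two window boundaries are all hypothetical, and the example assumes nobody has a record of the target condition during the observation window.

```python
import pandas as pd

# Hypothetical claims table: one row per claim, flagged if it carries the target diagnosis.
claims = pd.DataFrame({
    "person_id": [1, 1, 2, 3, 3],
    "claim_date": pd.to_datetime(
        ["2019-03-01", "2020-06-15", "2019-08-20", "2019-01-10", "2020-02-05"]
    ),
    "has_target_dx": [False, True, False, False, False],
})

# Assumed boundaries: features come from the observation window (interval T),
# labels come from the outcome window that follows it.
obs_end = pd.Timestamp("2019-12-31")
outcome_end = pd.Timestamp("2020-12-31")

in_outcome = (claims["claim_date"] > obs_end) & (claims["claim_date"] <= outcome_end)

# Label 1 if any claim in the outcome window carries the target diagnosis, else 0.
labels = (
    claims.assign(dx_in_outcome=claims["has_target_dx"] & in_outcome)
    .groupby("person_id")["dx_in_outcome"]
    .any()
    .astype(int)
    .rename("label")
)
print(labels)
```

Every person without a qualifying claim receives label 0 here, which is exactly the assumption that PU learning relaxes.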

One obvious limitation of this approach is that **the absence of a medical claims record for a medical condition does not imply the absence of the medical condition itself.**

In this notebook, we explore an alternative approach called Positive-Unlabeled (PU) Learning, which does not rely on that assumption.

In [1]:
# conda install -c conda-forge rise

## Research Questions
1. What are the methods used in PU Learning?
2. How does the performance of PU Learning compare with Supervised Learning in terms of detecting previously unseen data?

## Data

For this demo, we will use the Banknote Authentication Dataset which can be downloaded [here](http://archive.ics.uci.edu/ml/datasets/banknote+authentication).

### Load Data

In [2]:
import numpy as np
import pandas as pd

df = pd.read_csv('../data/data_banknote_authentication.txt', header=None)
df.columns = ['wavelet_variance',
              'wavelet_skewness',
              'wavelet_curtosis',
              'image_entropy',
              'is_fake']
df.head()

| | wavelet_variance | wavelet_skewness | wavelet_curtosis | image_entropy | is_fake |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1372 entries, 0 to 1371
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   wavelet_variance  1372 non-null   float64
 1   wavelet_skewness  1372 non-null   float64
 2   wavelet_curtosis  1372 non-null   float64
 3   image_entropy     1372 non-null   float64
 4   is_fake           1372 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 53.7 KB


This dataset has 1,372 rows and there are no null values.

In [4]:
# counts of positive and negative labels
df.is_fake.value_counts()

0    762
1    610
Name: is_fake, dtype: int64

### Split to Train and Test Sets
We set aside a test set so that we can compare PU learning against a supervised learning approach. Since our dataset is very small, we hold out 25% of the rows for testing, which leaves approximately 1,000 data points for building the model.

In [5]:
from sklearn.model_selection import train_test_split

x_data = df.iloc[:,:-1]
y_data = df.iloc[:,-1]

x_train, x_test, y_train, y_test = train_test_split(x_data,
                                                    y_data,
                                                    test_size=0.25,
                                                    random_state=42)

x_train.shape, y_test.shape

((1029, 4), (343,))

In [6]:
# class proportions of training set
y_train.value_counts() / len(y_train)

0    0.554908
1    0.445092
Name: is_fake, dtype: float64

In [7]:
# class proportions of test set
y_test.value_counts() / len(y_test)

0    0.556851
1    0.443149
Name: is_fake, dtype: float64

In [8]:
y_train.value_counts()

0    571
1    458
Name: is_fake, dtype: int64

### Setup Data for PU Learning
From the training set, suppose that only 250 (chosen arbitrarily) of the fake banknotes are labeled as positive; the remaining fake banknotes are treated as unlabeled. When we train the supervised benchmark, we use only these 250 labeled positives together with the 571 true negatives.

In [9]:
n = 250
seed = 42

# identify the indices of the n positives that keep their label
idx = y_train[y_train == 1].sample(n, random_state=seed).index

# create indicator for labeled vs unlabeled
is_labeled = pd.Series(np.zeros(len(y_train)), index=y_train.index).astype(int)
is_labeled[idx] = 1
is_labeled.value_counts()

# adjust training dataset for the benchmark: drop the hidden (unlabeled) positives
hidden_idx = y_train[(y_train == 1) & ~(y_train.index.isin(idx))].index

x_train_new = x_train.loc[~x_train.index.isin(hidden_idx)]
y_train_new = y_train.loc[~y_train.index.isin(hidden_idx)]

y_train_new.value_counts()

0    571
1    250
Name: is_fake, dtype: int64

Awesome! Now we can proceed with PU Learning!

## Supervised Learning Benchmark

Let's now make a benchmark model using an XGBoost classifier.

In [10]:
from xgboost import XGBClassifier
from sklearn.metrics import recall_score, precision_score, roc_auc_score, \
                            accuracy_score, f1_score

In [11]:
model_bench = XGBClassifier()
model_bench.fit(x_train_new, y_train_new)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [12]:
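def evaluate_results(y_test, y_predict):
    """Print and return F1, ROC AUC, recall, and precision for hard 0/1 predictions."""
    print('Classification results:')
    f1 = f1_score(y_test, y_predict)
    print("f1: %.2f%%" % (f1 * 100.0))
    # NOTE: y_predict holds hard labels rather than probabilities, so this
    # AUC equals the balanced accuracy at the model's decision threshold.
    roc = roc_auc_score(y_test, y_predict)
    print("roc: %.2f%%" % (roc * 100.0))
    rec = recall_score(y_test, y_predict, average='binary')
    print("recall: %.2f%%" % (rec * 100.0))
    prc = precision_score(y_test, y_predict, average='binary')
    print("precision: %.2f%%" % (prc * 100.0))

    return f1, roc, rec, prc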

In [13]:
results_bench = evaluate_results(y_test, model_bench.predict(x_test))

Classification results:
f1: 99.34%
roc: 99.34%
recall: 98.68%
precision: 100.00%
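
A note on the `roc` figure: `evaluate_results` passes hard 0/1 predictions to `roc_auc_score`, so the value reported is the mean of recall and specificity rather than a threshold-free AUC. The small check below illustrates this on made-up labels and scores (all values are hypothetical, not taken from the banknote data).

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Hypothetical ground truth, thresholded predictions, and underlying scores.
y_true = np.array([0, 0, 0, 1, 1, 1, 1])
y_hard = np.array([0, 0, 1, 1, 1, 0, 0])
y_score = np.array([0.1, 0.2, 0.6, 0.9, 0.8, 0.45, 0.4])

auc_from_hard = roc_auc_score(y_true, y_hard)        # what evaluate_results computes
bal_acc = balanced_accuracy_score(y_true, y_hard)    # (recall + specificity) / 2
auc_from_scores = roc_auc_score(y_true, y_score)     # threshold-free ranking quality

print(auc_from_hard, bal_acc, auc_from_scores)
```

To measure ranking quality, which is what PU methods are often used for, one option is to pass `predict_proba(x_test)[:, 1]` to `roc_auc_score` instead of the hard labels.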


## Methods/Frameworks in PU Learning

1. Naive Approach
2. Two-Step Approach
3. Weighted Positive and Unlabeled Samples

### 1. Naive Approach
The idea is that unlabeled data with the highest predicted probabilities are the most likely to be positive. The purpose of the model is therefore to rank unlabeled data by their likelihood of being positive.

#### 1.1. Vanilla Implementation
Train the model on positive (labeled 1) versus unlabeled (labeled 0) samples.

In [14]:
is_labeled.value_counts()

0    779
1    250
dtype: int64

In [15]:
x_train.shape, is_labeled.shape

((1029, 4), (1029,))

In [16]:
model_vanilla = XGBClassifier()
model_vanilla.fit(x_train, is_labeled)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [17]:
results_vanilla = evaluate_results(y_test, model_vanilla.predict(x_test))

Classification results:
f1: 67.25%
roc: 75.33%
recall: 50.66%
precision: 100.00%
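
Since the naive model is meant for ranking rather than hard classification, it can be more informative to look at which unlabeled rows score highest. The sketch below shows this on synthetic data; the two Gaussian clusters, the `LogisticRegression` stand-in for XGBoost, and the cut-off of 100 rows are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: positives centred at +2, negatives at -2.
x_pos = rng.normal(loc=2.0, scale=1.0, size=(200, 2))
x_neg = rng.normal(loc=-2.0, scale=1.0, size=(200, 2))
x = np.vstack([x_pos, x_neg])
y_true = np.r_[np.ones(200, dtype=int), np.zeros(200, dtype=int)]

# Only half of the positives are labeled; everything else is unlabeled (s = 0).
s = np.zeros(400, dtype=int)
s[:100] = 1

# Naive PU model: treat "labeled" as the target.
clf = LogisticRegression().fit(x, s)
scores = pd.Series(clf.predict_proba(x)[:, 1])

# Rank only the unlabeled rows, highest score first.
unlabeled_ranked = scores[s == 0].sort_values(ascending=False)
top_100 = unlabeled_ranked.index[:100].to_numpy()

# Share of true (hidden) positives among the 100 top-ranked unlabeled rows.
hit_rate = y_true[top_100].mean()
print(f"hidden positives in top 100: {hit_rate:.2%}")
```

On real data the true labels of unlabeled rows are unknown, so the hit rate can only be checked in a simulation like this one.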


### 2. Two-Step Approach
Step 1: Identify Reliable Negatives  
Step 2: Build a supervised classifier iteratively

#### 2.1. Method 1 
(Based on a [web article](https://roywrightme.wordpress.com/2017/11/16/positive-unlabeled-learning/) by Roy Wright, 2017.)
1. Train a naive classifier in which the label is 1 if positive and 0 if unlabeled.
2. Reassign labels using the predicted probabilities from step 1, where $p_{+}$ denotes the probabilities of the known positives. For unlabeled entry $i$, if $p_i \ge \max(p_{+})$, then $\text{label}_i = 1$; otherwise, if $p_i \lt \min(p_{+})$, then $\text{label}_i = 0$.
3. Retrain the classifier using the updated labels and repeat step 2.
4. Repeat steps 2-3 until a predefined stopping criterion is met.
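
One detail the steps leave open is how to tell a reliable negative from a row that is still unlabeled when both could be encoded as 0. The sketch below shows a single pass of step 2, assuming a hypothetical three-state encoding (`-1` unlabeled, `0` reliable negative, `1` positive); the function name and the probabilities are made up for illustration.

```python
import numpy as np

UNLABELED, NEGATIVE, POSITIVE = -1, 0, 1  # hypothetical encoding

def relabel_step(state, proba, known_pos):
    """One pass of step 2: promote or demote rows that are still unlabeled."""
    p_max = proba[known_pos].max()
    p_min = proba[known_pos].min()
    new_state = state.copy()
    still_unlabeled = state == UNLABELED
    # Unlabeled rows scoring at least as high as every known positive become positive.
    new_state[still_unlabeled & (proba >= p_max)] = POSITIVE
    # Unlabeled rows scoring below every known positive become reliable negatives.
    new_state[still_unlabeled & (proba < p_min)] = NEGATIVE
    return new_state

state = np.array([1, 1, -1, -1, -1, -1])
known_pos = state == POSITIVE
proba = np.array([0.70, 0.55, 0.80, 0.60, 0.30, 0.10])  # stand-in for P(s=1|x)

new_state = relabel_step(state, proba, known_pos)
print(new_state)
```

With this encoding, one possible choice for step 3 is to retrain only on rows whose state is no longer `UNLABELED`, so that reliable negatives actually influence the next model.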

In [18]:
new_labels = is_labeled.copy()

# for this demo, our stopping criterion is simply a fixed number of iterations (10)
for i in range(10):
    print(f'Iteration: {i}...')
    # STEP 1: we can reuse naive model from method 1. 
    # This becomes STEP 3 once new labels are updated
    model_2step = XGBClassifier()
    model_2step.fit(x_train, new_labels)

    # STEP 2: Reassign labels

    # get max and min probabilities of the known (labeled) positives
    p_ = pd.Series(model_2step.predict_proba(x_train)[:, 1],
                   index=is_labeled.index)
    p_max = p_[is_labeled == 1].max()
    p_min = p_[is_labeled == 1].min()

    # determine indices to be reassigned to positive
    new_pos = (new_labels[new_labels==0].index
               .intersection(new_labels[p_ >= p_max].index))
    if len(new_pos) > 0:
        new_labels.loc[new_pos] = 1

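    # determine indices to be reassigned to negative
    # NOTE: unlabeled and negative share the label 0 here, so this
    # assignment leaves new_labels unchanged
    new_neg = (new_labels[new_labels==0].index
               .intersection(new_labels[p_ < p_min].index))
    if len(new_neg) > 0:
        new_labels.loc[new_neg] = 0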

Iteration: 0...
Iteration: 1...
Iteration: 2...
Iteration: 3...
Iteration: 4...
Iteration: 5...
Iteration: 6...
Iteration: 7...
Iteration: 8...
Iteration: 9...


In [19]:
results_2step1 = evaluate_results(y_test, model_2step.predict(x_test))

Classification results:
f1: 67.25%
roc: 75.33%
recall: 50.66%
precision: 100.00%


#### 2.2. Method 2: Spy Method
The spy method is very similar to the first method, except that the threshold probabilities are based on the probabilities that the classifier assigns to the spies.

So what are these spies? 

Spies are **a subset of the positive labeled entries that are mixed into the unlabeled data** when performing step 1 in method 1 above. Then, decisions on the new label assignments are based on $\text{max}(p_{spy})$ and $\text{min}(p_{spy})$. Then proceed to steps 3 and 4 as in method 1.
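
The thresholding logic can be sketched with plain NumPy. This is only an illustration: the 20-row label vector, the choice of 3 spies, and the `proba` array (a stand-in for the classifier's $P(s=1|x)$) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

s = np.array([1] * 10 + [0] * 10)      # 1 = labeled positive, 0 = unlabeled
pos_idx = np.flatnonzero(s == 1)

# Hide a few labeled positives (the spies) among the unlabeled rows.
spy_idx = rng.choice(pos_idx, size=3, replace=False)
s_train = s.copy()
s_train[spy_idx] = 0

# Stand-in for predict_proba(...)[:, 1] of a model fitted on s_train.
proba = np.linspace(0.95, 0.0, num=20)

# Thresholds come from the spies, whose true class is known to be positive.
p_max = proba[spy_idx].max()
p_min = proba[spy_idx].min()

# Apply the thresholds to genuinely unlabeled rows only (spies excluded).
candidates = np.setdiff1d(np.flatnonzero(s_train == 0), spy_idx)
likely_pos = candidates[proba[candidates] >= p_max]
reliable_neg = candidates[proba[candidates] < p_min]
print(len(likely_pos), len(reliable_neg))
```

Excluding the spies from the candidate set is a design choice of this sketch; it keeps the spy with the highest score from being promoted by its own threshold.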

In [20]:
# create spies
idx_spies = is_labeled[is_labeled==1].sample(20, random_state=seed).index

new_labels = is_labeled.copy()
new_labels[idx_spies] = 0

# for this demo, our stopping criterion is simply a fixed number of iterations (10)
for i in range(10):
    print(f'Iteration: {i}...')
    # STEP 1: we can reuse naive model from method 1. 
    # This becomes STEP 3 once new labels are updated
    model_2step_spies = XGBClassifier()
    model_2step_spies.fit(x_train, new_labels)

    # STEP 2: Reassign labels

    # get max and min probabilities of the spies
    p_ = pd.Series(model_2step_spies.predict_proba(x_train)[:, 1],
                 index=is_labeled.index)
    p_max = p_[idx_spies].max()
    p_min = p_[idx_spies].min()

    # determine indices to be reassigned to positive
    new_pos = (new_labels[new_labels==0].index
               .intersection(new_labels[p_ >= p_max].index))
    if len(new_pos) > 0:
        new_labels.loc[new_pos] = 1

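    # determine indices to be reassigned to negative
    # NOTE: unlabeled and negative share the label 0 here, so this
    # assignment leaves new_labels unchanged
    new_neg = (new_labels[new_labels==0].index
               .intersection(new_labels[p_ < p_min].index))
    if len(new_neg) > 0:
        new_labels.loc[new_neg] = 0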

Iteration: 0...
Iteration: 1...
Iteration: 2...
Iteration: 3...
Iteration: 4...
Iteration: 5...
Iteration: 6...
Iteration: 7...
Iteration: 8...
Iteration: 9...


In [21]:
results_2step_spies = evaluate_results(y_test,
                                       model_2step_spies.predict(x_test))

Classification results:
f1: 65.49%
roc: 74.34%
recall: 48.68%
precision: 100.00%


### 3. Weighted Positive and Unlabeled Samples
(The algorithm here is from the [paper](https://cseweb.ucsd.edu/~elkan/posonly.pdf) by Elkan C. and Noto K., 2008, as described in a [web article](https://towardsdatascience.com/semi-supervised-classification-of-unlabeled-data-pu-learning-81f96e96f7cb) by Alon Agmon, 2020.)

1. Train a classifier $f$ to predict the probability that a sample $x$ is labeled ($s=1$ if labeled and $s=0$ otherwise). This gives $P(s=1|x)$.
2. Use classifier $f$ to estimate the probability that a positive sample is labeled, $P(s=1|y=1)$, as the mean predicted probability $P(s=1|x)$ over the known positive samples.
3. Use classifier $f$ to predict the probability that sample $i$ is labeled, $P(s=1|i)$.
4. Estimate the probability that sample $i$ is positive by calculating $P(s=1|i) / P(s=1|y=1)$.
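
The division in step 4 follows from two assumptions in Elkan and Noto's setting: only positives can be labeled, and the labeled positives are selected completely at random from all positives, so the labeling probability does not depend on $x$. Writing $c = P(s=1|y=1)$ and $L$ for the set of labeled positives used to estimate it (both symbols are introduced here for convenience), a short derivation is:

$$
\begin{aligned}
P(s=1 \mid x) &= P(y=1 \mid x)\,P(s=1 \mid y=1, x) + P(y=0 \mid x)\,P(s=1 \mid y=0, x) \\
&= P(y=1 \mid x)\,c + P(y=0 \mid x)\cdot 0 \\
\Rightarrow\quad P(y=1 \mid x) &= \frac{P(s=1 \mid x)}{c},
\qquad \hat{c} = \frac{1}{|L|}\sum_{x \in L} f(x)
\end{aligned}
$$

As far as we can tell, the paper estimates $c$ on a held-out validation set, whereas this notebook uses the training rows themselves, which may bias $\hat{c}$ if the classifier overfits.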

In [22]:
# STEP 1
model_wpu = XGBClassifier()
model_wpu.fit(x_train, is_labeled)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [23]:
# STEP 2
p_pos_labeled = model_wpu.predict_proba(x_train.loc[is_labeled==1])[:, 1].mean()
p_pos_labeled

0.62694466

In [24]:
# STEPS 3 and 4: predict P(s=1|x) on the test set, rescale, then threshold
p_threshold = .5
y_preds = ((model_wpu.predict_proba(x_test)[:, 1] / p_pos_labeled) >
          p_threshold).astype(int)
y_preds    

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0,

In [25]:
results_wpu = evaluate_results(y_test, y_preds)

Classification results:
f1: 94.08%
roc: 94.41%
recall: 88.82%
precision: 100.00%
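
One way to read this result: dividing by `p_pos_labeled` and thresholding at 0.5 is the same as thresholding the raw model output at `0.5 * p_pos_labeled`, which is lower than the 0.5 cut-off used by `predict`. The sketch below works through the arithmetic with the value printed above; the `raw_scores` array is hypothetical and not model output.

```python
import numpy as np

c_hat = 0.62694466      # p_pos_labeled from the STEP 2 cell
p_threshold = 0.5

# Equivalent threshold on the raw P(s=1|x) scores.
raw_threshold = p_threshold * c_hat
print(f"equivalent raw threshold: {raw_threshold:.4f}")

# Illustrative raw scores (hypothetical values).
raw_scores = np.array([0.05, 0.30, 0.32, 0.49, 0.51, 0.90])

vanilla_preds = (raw_scores > 0.5).astype(int)
weighted_preds = (raw_scores / c_hat > p_threshold).astype(int)
shifted_preds = (raw_scores > raw_threshold).astype(int)

# Rescaled scores can exceed 1, so clip them if they are reported as probabilities.
adjusted = np.clip(raw_scores / c_hat, 0.0, 1.0)
```

Rows with raw scores between the two thresholds are the ones that flip from 0 to 1, which is consistent with the higher recall of the weighted approach.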


## Results

In [26]:
cols = ['f1', 'roc', 'recall', 'precision']
index = ['supervised', 'vanilla', 'two_step', 'two_step_spies', 'weighted_PU']
df_results = pd.DataFrame([results_bench,
                           results_vanilla,
                           results_2step1,
                           results_2step_spies,
                           results_wpu],
                           columns=cols,
                           index=index)
df_results

| | f1 | roc | recall | precision |
|---|---|---|---|---|
| supervised | 0.993377 | 0.993421 | 0.986842 | 1.0 |
| vanilla | 0.672489 | 0.753289 | 0.506579 | 1.0 |
| two_step | 0.672489 | 0.753289 | 0.506579 | 1.0 |
| two_step_spies | 0.654867 | 0.743421 | 0.486842 | 1.0 |
| weighted_PU | 0.940767 | 0.944079 | 0.888158 | 1.0 |


## Discussion
- From the table above, we see that **among the PU algorithms, the weighted PU approach comes closest to the performance of the Supervised Learning approach**, which has information on the negative values.
- Interestingly, the weighted PU approach is essentially the vanilla PU method with the final predicted probability divided by the mean predicted probability of the known positives, so it requires very little additional computation.
- Since the final result is just a rescaling of the vanilla PU scores, the ranking of the unlabeled samples by probability of being positive remains the same.
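
The rank-preservation point can be checked directly: dividing every score by the same positive constant does not change their order. A minimal sketch with made-up scores:

```python
import numpy as np

raw = np.array([0.12, 0.48, 0.33, 0.61, 0.05])   # hypothetical vanilla PU scores
c_hat = 0.62694466                               # mean score of the known positives

adjusted = raw / c_hat

# argsort gives the ordering of the samples; it is identical before and after rescaling.
same_order = np.array_equal(np.argsort(raw), np.argsort(adjusted))
print(same_order)
```

The gain of the weighted approach therefore comes entirely from where the decision threshold falls, not from a better ordering of the samples.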

## Conclusion

Using a variant of PU Learning called **weighted PU may result in better predictive performance than current predictive setups**, which are effectively designed as vanilla PU Learning.

## TO DO
1. Study methods further to ensure that models are properly implemented.
2. Explore why all models have perfect precision.
3. Explore performance on other datasets.