# Activity recognition on the Capture24 dataset

## Semi-supervised learning

While digital data collection is becoming easier and cheaper, labeling such data
still requires expensive and time-consuming human labor.
For example, while it is possible to label accelerometer measurements for ~150
participants as in our Capture24 dataset, it is infeasible to do so for the
tens of thousands of *unlabeled* accelerometer measurements that are
currently available in the [UK
Biobank](https://www.ukbiobank.ac.uk/activity-monitor-3/) because *a)*
compliance with wearing body cameras is much lower than with wrist-worn
accelerometers and *b)* the human labor needed to go through all the camera
recordings would be very expensive. Semi-supervised learning is therefore of
great interest: the aim is to use the unlabeled data to improve model
performance.

###### Setup

In [1]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from tqdm.auto import tqdm
import utils

# For reproducibility
np.random.seed(42)

###### Load dataset and hold out some instances for testing

In [3]:
data = np.load('capture24.npz', allow_pickle=True)
print("Contents of capture24.npz:", data.files)
X, y, pid, time = data['X_feats'], data['y'], data['pid'], data['time']

# Hold out some participants for testing the model
test_pids = [2, 3]
test_mask = np.isin(pid, test_pids)
train_mask = ~test_mask
X_train, y_train, pid_train = X[train_mask], y[train_mask], pid[train_mask]
X_test, y_test, pid_test = X[test_mask], y[test_mask], pid[test_mask]
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)

Contents of capture24.npz: ['X_feats', 'y', 'pid', 'time', 'annotation']
Shape of X_train: (325619, 125)
Shape of X_test: (4991, 125)


## Self-training

One of the simplest semi-supervised methods is self-training with proxy labels. The idea is to evaluate a trained model on the unlabeled instances, add those with high-confidence predictions (together with their predicted labels) to the training set, and then re-train the model on the augmented set. This process is repeated until a stopping criterion is met, e.g. when no more instances are being added to the training set.
This simple technique works well when the initial model is already very
strong. If the initial model is weak, however, self-training may reinforce
the mistakes in its predictions.
In the following, we first train a random forest on the labeled training
set, then evaluate the model on the provided unlabeled dataset
`capture24_test.npz` for self-training.
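
The selection step at the heart of self-training can be sketched in isolation. The snippet below is a minimal illustration, not the notebook's pipeline: `select_confident` is a hypothetical helper, and the probability matrix is made up to stand in for the output of `predict_proba`.

```python
import numpy as np

def select_confident(prob, threshold=0.8):
    """Return a mask of confident rows and the proxy label of every row.

    `prob` is an (n_instances, n_classes) array of class probabilities.
    A row is confident when its largest probability exceeds `threshold`.
    """
    prob = np.asarray(prob)
    confident_mask = prob.max(axis=1) > threshold
    proxy_labels = prob.argmax(axis=1)  # column index of the most likely class
    return confident_mask, proxy_labels

# Made-up probabilities for four unlabeled instances and three classes
prob = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.85, 0.05],
    [0.50, 0.50, 0.00],
])
mask, labels = select_confident(prob, threshold=0.8)
print(mask)          # which instances would join the training set
print(labels[mask])  # their proxy labels (class indices)
```

The test `prob.max(axis=1) > threshold` selects the same rows as `np.any(prob > threshold, axis=1)`. The proxy labels here are column indices; with scikit-learn they map to class names through `classifier.classes_`.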

In [4]:
# initial model
classifier = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=4)
classifier.fit(X_train, y_train)

# Load unlabelled dataset for self-training
data_unl = np.load('capture24_test.npz')
print("\nContents of capture24_test.npz:", data_unl.files)
X_unl = data_unl['X_feats']
print("Shape of X_unl:", X_unl.shape)


Contents of capture24_test.npz: ['X_feats', 'pid', 'time']
Shape of X_unl: (38750, 125)


###### Self-training

*Note: this takes several minutes*

In [5]:
# initial predictions and self-training parameters
y_unl_pred = classifier.predict(X_unl)
y_unl_prob = classifier.predict_proba(X_unl)
y_unl_pred_old = None
max_iter = 5
prob_threshold = 0.8

for i in tqdm(range(max_iter)):

    if np.array_equal(y_unl_pred, y_unl_pred_old):
        tqdm.write("Iteration stopped: no more change found in self-training")
        break

    y_unl_pred_old = np.copy(y_unl_pred)
    confident_mask = np.any(y_unl_prob > prob_threshold, axis=1)
    tqdm.write(f"Using {np.sum(confident_mask)} instances from the unlabeled set")

    # re-train on augmented set
    classifier.fit(
        np.vstack((X_train, X_unl[confident_mask])),
        np.hstack((y_train, y_unl_pred_old[confident_mask]))
    )

    # updated predictions
    y_unl_pred = classifier.predict(X_unl)
    y_unl_prob = classifier.predict_proba(X_unl)

Using 18792 instances from the unlabeled set
Using 20171 instances from the unlabeled set
Using 20893 instances from the unlabeled set
Using 21378 instances from the unlabeled set
Using 21725 instances from the unlabeled set



###### Smooth the predictions via HMM and evaluate

In [6]:
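# Out-of-bag probabilities for the original labeled instances only
# (they are the first rows of the augmented training set)
Y_oob = classifier.oob_decision_function_[:y_train.shape[0]]
# Estimate the HMM parameters from the out-of-bag probabilities and true labels
prior, emission, transition = utils.train_hmm(Y_oob, y_train)
y_test_pred = classifier.predict(X_test)
# Smooth the test predictions with Viterbi decoding
y_test_hmm = utils.viterbi(y_test_pred, prior, transition, emission)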
print("\n--- Random forest performance with self-training and HMM smoothing ---")
utils.print_scores(utils.compute_scores(y_test, y_test_hmm))


--- Random forest performance with self-training and HMM smoothing ---
Accuracy score: 0.9004207573632539
Balanced accuracy score: 0.6391289988205523
Cohen kappa score: 0.8404686883093087

Per-class recall scores:
sleep      : 0.9766597510373444
sedentary  : 0.9626294461954075
tasks-light: 0.0
walking    : 0.36774193548387096
moderate   : 0.8886138613861386

Confusion matrix:
 [[1883   45    0    0    0]
 [  53 2138    0   18   12]
 [   0  111    0    9    8]
 [   0   91    0  114  105]
 [   0   41    0    4  359]]
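
The summary scores can be recomputed from the confusion matrix, which makes the gap between accuracy and balanced accuracy easier to see. The sketch below copies the matrix printed above and assumes that rows are true classes and columns are predicted classes, which is consistent with the per-class recalls in the output; it does not call `utils`.

```python
import numpy as np

labels = ["sleep", "sedentary", "tasks-light", "walking", "moderate"]
cm = np.array([
    [1883,   45, 0,   0,   0],
    [  53, 2138, 0,  18,  12],
    [   0,  111, 0,   9,   8],
    [   0,   91, 0, 114, 105],
    [   0,   41, 0,   4, 359],
])

# Accuracy: correctly classified instances (diagonal) over all instances
accuracy = np.trace(cm) / cm.sum()

# Recall per class: diagonal entry over the number of true instances (row sum)
recall = np.diag(cm) / cm.sum(axis=1)

# Balanced accuracy: unweighted mean of the per-class recalls
balanced_accuracy = recall.mean()

print(f"Accuracy: {accuracy:.4f}")
print(f"Balanced accuracy: {balanced_accuracy:.4f}")
for name, r in zip(labels, recall):
    print(f"{name:<11}: {r:.4f}")
```

Accuracy is dominated by sleep and sedentary, which make up most of the test set, while balanced accuracy weights all five classes equally, so the zero recall on tasks-light pulls it down to about 0.64.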


###### Ideas

- Tune the acceptance threshold for high-confidence predictions.
- Incorporate the HMM smoothing into the self-training loop.
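
As a starting point for the first idea, one can look at how many unlabeled instances each threshold would admit before committing to a full re-training run. The sketch below is only illustrative: `acceptance_counts` is a hypothetical helper, and the probabilities are random draws standing in for `classifier.predict_proba(X_unl)`.

```python
import numpy as np

def acceptance_counts(prob, thresholds):
    """Count rows whose top class probability exceeds each threshold."""
    top_prob = np.asarray(prob).max(axis=1)
    return {float(t): int(np.sum(top_prob > t)) for t in thresholds}

# Synthetic stand-in for predict_proba output: 1000 instances, 5 classes
rng = np.random.default_rng(0)
prob = rng.dirichlet(np.full(5, 0.3), size=1000)

counts = acceptance_counts(prob, thresholds=[0.5, 0.6, 0.7, 0.8, 0.9])
for threshold, n in counts.items():
    print(f"threshold {threshold:.1f}: {n} of {len(prob)} instances accepted")
```

A lower threshold admits more proxy labels but also more mistaken ones, so the count alone is not enough to choose a value; a held-out labeled set is still needed to compare the re-trained models.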

###### References

- [A nice summary of proxy-labels methods](https://ruder.io/semi-supervised/)