We're predicting the log ratio of tumor to immune cells in each image. This ratio is trivial to compute from the per-group cell counts, but the key point is that it is not necessarily obvious from the images themselves (PD-1 might be a better response, though...). All that's immediately available from an image is the number of pixels belonging to each cell type.

So, the pixel counts can be our baseline. If we can predict better than the pixel counts alone, then we have effectively trained a cell counter: the model would have to have learned features related to total cell count that the pixel counts don't capture (things like cell size).
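To make the gap between pixel counts and cell counts concrete, here is a toy sketch. The cell counts and per-cell areas are made-up numbers for illustration, not values from the TNBC data. It shows that the log ratio of pixel totals differs from the log ratio of cell counts by the log ratio of mean cell sizes.

```python
import numpy as np

# Hypothetical values for a single image; none of these come from the dataset
n_tumor, n_immune = 120, 40             # cell counts
area_tumor, area_immune = 90.0, 30.0    # assumed mean pixels per cell

# Response: log ratio of tumor to immune cell counts
log_count_ratio = np.log(n_tumor / n_immune)

# What pixel counts alone give: the log ratio of pixel totals
px_tumor = n_tumor * area_tumor
px_immune = n_immune * area_immune
log_pixel_ratio = np.log(px_tumor / px_immune)

# The two differ by the log ratio of mean cell sizes
size_offset = np.log(area_tumor / area_immune)
print(log_count_ratio, log_pixel_ratio, size_offset)
```

If mean cell sizes were the same in every image, this offset would be a constant. It is the image-to-image variation in cell size that leaves room to beat the baseline.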

In [None]:
import os
import json
import pandas as pd
import numpy as np
from pathlib import Path
import sklearn.linear_model as lm

We will work from the shared archive of preprocessed TNBC data in `stability_data_tnbc.tar.gz`. The block below extracts this archive and makes it available for the regression baseline.

In [None]:
%%capture
%cd ../../data/raw_data/
!rm -rf stability_data/
!tar -zxvf stability_data_tnbc.tar.gz
%cd ../../data_analysis/learning/

To build this baseline, we first need to extract the proportion of per-image pixels belonging to each category.

In [None]:
data_dir = Path("../../data/raw_data/stability_data")
splits = pd.read_csv(data_dir / "Xy.csv")
x = {"train": [], "dev": [], "test": []}
y = {"train": [], "dev": [], "test": []}

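# Averaging over the two spatial axes gives the per-category pixel proportion
for p in splits.to_dict(orient="records"):
    patch = np.load(data_dir / p["path"])
    cell_means = np.mean(patch, axis=(0, 1))
    x[p["split"]].append(cell_means)

# Rows of `splits` were visited in order, so x[k] and y[k] stay aligned
for k in x.keys():
    x[k] = np.stack(x[k])
    y[k] = splits["y"][splits["split"] == k]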
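As a quick sanity check on the reduction above, here is a sketch on a synthetic patch. It assumes each patch is stored as `(height, width, n_categories)` with one 0/1 mask per category; that layout is an assumption about the saved `.npy` files, and the patch itself is randomly generated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed layout: (height, width, n_categories), one 0/1 mask per category
labels = rng.integers(0, 3, size=(64, 64))  # hypothetical per-pixel category
patch = np.eye(3)[labels]                    # one-hot encode -> (64, 64, 3)

# Same reduction as in the loop above: average over the two spatial axes
cell_means = np.mean(patch, axis=(0, 1))

print(cell_means.shape)  # one entry per category
print(cell_means.sum())  # proportions of a one-hot patch sum to 1
```

Under this assumption, each entry of `cell_means` is exactly the fraction of pixels assigned to that category, which is the only feature the baseline sees.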

We'll fit a simple ridge regression model.

In [None]:
model = lm.Ridge()
model.fit(x["train"], y["train"])
y_hat = {
    "dev": model.predict(x["dev"]),
    "train": model.predict(x["train"]),
    "test": model.predict(x["test"])
}

We can now check the errors. It's also not hard to plot `y` vs. `y_hat` given the data that we've computed.

In [None]:
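err = {}
for k in y_hat.keys():
    # Mean squared error on each split
    err[k] = float(np.mean((y_hat[k] - y[k]) ** 2))

# Save the baseline errors next to the data, closing the file when done
with open(data_dir / "baseline.json", "w") as f:
    json.dump(err, f)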

In [None]:
err
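The plot of `y` vs. `y_hat` mentioned above could look something like the sketch below. `plot_fit` is a hypothetical helper, not part of the original notebook, and the data here are synthetic stand-ins so that the block runs on its own.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt


def plot_fit(y, y_hat):
    """Scatter predictions against observed values, one panel per split."""
    n = len(y_hat)
    fig, axes = plt.subplots(1, n, figsize=(4 * n, 4), squeeze=False)
    for ax, k in zip(axes[0], y_hat):
        ax.scatter(y[k], y_hat[k], s=8, alpha=0.6)
        lo = min(np.min(y[k]), np.min(y_hat[k]))
        hi = max(np.max(y[k]), np.max(y_hat[k]))
        ax.plot([lo, hi], [lo, hi], color="black", linewidth=1)  # y = y_hat line
        ax.set_xlabel("y")
        ax.set_ylabel("y_hat")
        ax.set_title(k)
    fig.tight_layout()
    return fig


# Synthetic stand-ins for the dictionaries computed in the notebook
rng = np.random.default_rng(0)
y_demo = {k: rng.normal(size=50) for k in ["train", "dev", "test"]}
y_hat_demo = {k: v + rng.normal(scale=0.3, size=50) for k, v in y_demo.items()}
fig = plot_fit(y_demo, y_hat_demo)
```

In the notebook itself, calling `plot_fit(y, y_hat)` on the dictionaries built earlier should work the same way. Points close to the diagonal are the ones the baseline predicts well.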