## TTA Prediction based on the instantaneous relative surface and the elapsed time since budding

Intuitively, predicting the time to anaphase (TTA) from both features should work better than the prediction based on the surface only, since the time elapsed since bud detection is strongly correlated with the TTA (but also with the surface, so the two features are partly redundant).
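The correlations mentioned above can be checked with a Pearson correlation matrix. The sketch below uses a synthetic single-cell trajectory (hypothetical values, sampled every 6 minutes like the real data), not the actual dataset; the column names mirror those of the CSV loaded later.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one budding event (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
time = np.arange(0, 42, 6)  # minutes since bud detection: 0, 6, ..., 36

toy = pd.DataFrame({
    "time": time,
    # assumed roughly linear bud growth plus a little measurement noise
    "relat_surf": 0.012 * time + rng.normal(0, 0.01, time.size),
    # for a single cell, TTA is just (anaphase time - elapsed time)
    "time_to_anaphase": 36 - time,
})

corr = toy.corr(method="pearson")
print(corr.round(2))
```

Within a single cell, `time` and `time_to_anaphase` are perfectly anti-correlated by construction; across cells the correlation is weaker, because cells do not all reach anaphase after the same delay.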

In [2]:
%matplotlib inline
import cv2
import imageio
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import scipy
import seaborn as sns
from skimage import measure # to get contours from masks
import sklearn

import napari

sns.set_theme()

## Loading Data

In [3]:
os.chdir(r"D:\Documents\STAGE\Anaphase")  # raw string so backslashes are not read as escapes
data = pd.read_csv("Analysis_BF_f0001-1-100.1.csv", sep=";", comment="#")
data = data[data["time_to_anaphase"] >= 0]  # keep only the time points up to anaphase

print(data.shape)
data.head()

(495, 14)


| | idx | frame | time | mom_x | mom_y | daugh_x | daugh_y | mom_surf | daugh_surf | relat_surf | anaphase | anaphase_int | time_to_anaphase | cum_relat_surf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 463.509804 | 225.369863 | -1.0 | -1.0 | 229.730504 | 0.0 | 0.0 | False | 0 | 36 | 0.0 |
| 1 | 0 | 1 | 6 | 463.481911 | 225.084507 | 460.542099 | 238.588235 | 227.730488 | 9.53426 | 0.041866 | False | 0 | 30 | 0.020842 |
| 2 | 0 | 2 | 12 | 460.34542 | 225.60274 | 456.110886 | 239.965517 | 229.730504 | 36.406857 | 0.158476 | False | 0 | 24 | 0.066853 |
| 3 | 0 | 3 | 18 | 460.578297 | 225.69863 | 455.691877 | 240.914286 | 224.730504 | 51.553944 | 0.229403 | False | 0 | 18 | 0.106912 |
| 4 | 0 | 4 | 24 | 461.524165 | 225.985915 | 456.054041 | 241.146341 | 228.446179 | 68.701031 | 0.300732 | False | 0 | 12 | 0.145739 |


## Plot the bivariate data

In [4]:
%matplotlib qt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib

features = ["relat_surf", "time"]

plane = 50 * data["relat_surf"] - data["time"] + 20  # linear reference

# standardize both features for plotting only (data itself is left unchanged here)
time_std = (data["time"] - data["time"].mean()) / data["time"].std()
surf_std = (data["relat_surf"] - data["relat_surf"].mean()) / data["relat_surf"].std()

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(time_std, surf_std, data["time_to_anaphase"], c="b")
# ax.scatter(data["time"], data["relat_surf"], plane, c="r")

# x is the elapsed time and y the relative surface, matching the scatter call above
ax.set_xlabel("time (standardized)")
ax.set_ylabel("relat_surf (standardized)")
ax.set_zlabel('TTA (min)')

Text(0.5, 0, 'TTA (min)')

## Random forest regression

Here we fit a random forest regressor to predict the TTA from the instantaneous relative surface and the elapsed time since budding at time $t$. This is a naive regressor: it learns to predict from single data points (i.e. it assumes that all the observations are i.i.d.) instead of taking the cell's history into account and predicting the TTA at each time step of a time series. As we will see, it does not perform badly on average (RMSE of 5 minutes), but it predicts anaphase too early for the points that are above the curve (latecomers) and too late for the points that are below the curve (earlycomers).
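The early/late pattern described above can be made visible by looking at the signed error per cell rather than only at the overall RMSE. The sketch below uses hypothetical predictions for two toy cells; `predicted_tta` and `signed_error` are illustrative column names that do not exist in the dataset, and `idx` is assumed to identify a cell.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions for two cells (toy values, not model output)
toy = pd.DataFrame({
    "idx": [0, 0, 0, 1, 1, 1],                  # assumed cell identifier
    "time_to_anaphase": [30, 24, 18, 18, 12, 6],  # ground truth (min)
    "predicted_tta": [24, 18, 12, 24, 18, 12],    # same prediction curve for both cells
})

# negative error: anaphase is predicted too early (latecomer)
# positive error: anaphase is predicted too late (earlycomer)
toy["signed_error"] = toy["predicted_tta"] - toy["time_to_anaphase"]

bias_per_cell = toy.groupby("idx")["signed_error"].mean()
rmse = np.sqrt((toy["signed_error"] ** 2).mean())

print(bias_per_cell)
print(f"overall mean error: {toy['signed_error'].mean():.1f} min, RMSE: {rmse:.1f} min")
```

In this toy case the mean error over all points is zero while each cell carries a systematic bias of 6 minutes, which is the situation a point-wise regressor cannot correct without the cell's history.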

### Training

In [5]:
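# prepare the training set
from sklearn.model_selection import train_test_split

# standardize each feature (zero mean, unit variance); this overwrites the raw columns in `data`
for f in features:
    data[f] = (data[f] - data[f].mean()) / data[f].std()

# hold out 10% of the time points as a test set (random split over individual points)
X_train, X_test, y_train, y_test = train_test_split(data[features].values, data["time_to_anaphase"].values, train_size=0.9)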
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(445, 2) (50, 2) (445,) (50,)
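Because the split above is drawn over individual time points, points from the same cell can end up in both the train and the test set. One possible alternative, sketched here on toy arrays, is a group-aware split with scikit-learn's `GroupShuffleSplit`. It assumes that the `idx` column identifies the cell; the `*_toy` names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-ins: 10 cells x 5 time points each (hypothetical, not the real data)
groups_toy = np.repeat(np.arange(10), 5)              # cell identifier per row (assumed to be `idx`)
X_toy = np.random.default_rng(0).normal(size=(50, 2))  # two features per time point
y_toy = np.tile(np.arange(24, -6, -6), 10)             # TTA: 24, 18, 12, 6, 0 for each cell

# test_size is a fraction of the groups (cells), not of the rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_rows, test_rows = next(splitter.split(X_toy, y_toy, groups=groups_toy))

X_train_toy, X_test_toy = X_toy[train_rows], X_toy[test_rows]
y_train_toy, y_test_toy = y_toy[train_rows], y_toy[test_rows]

# no cell contributes rows to both sets
shared_cells = set(groups_toy[train_rows]) & set(groups_toy[test_rows])
print(X_train_toy.shape, X_test_toy.shape, shared_cells)
```

With the real data, `data["idx"].values` would play the role of `groups_toy`, provided that column is indeed a per-cell identifier.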


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()

# grid search with 5-fold cross-validation to pick hyperparameters and limit overfitting
parameters = {'n_estimators':[50, 100, 500, 1000, 1250],'max_depth':[2, 4, 8]}
rf_best = GridSearchCV(rf, parameters, cv = 5)
rf_best.fit(X_train, y_train)  # will split into train and val (we keep test apart)

print(f"Best parameters : {rf_best.best_params_}")

# with the default scoring, score() returns the R^2 of the refitted best estimator
print(f"R^2 score on train set: {round(rf_best.score(X_train, y_train), 2)}")

In [None]:
predictions_on_train_set = rf_best.predict(X_train)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# column 1 is the elapsed time, column 0 the relative surface (both standardized)
ax.scatter(X_train[:, 1], X_train[:, 0], y_train, c="b", label="ground truth")
ax.scatter(X_train[:, 1], X_train[:, 0], predictions_on_train_set, c="r", label="predictions")

ax.set_xlabel('Time (standardized)')
ax.set_ylabel('Relative surface (standardized)')
ax.set_zlabel('TTA (min)')
ax.legend()

### Evaluation

In [None]:
predictions_on_test_set = rf_best.predict(X_test)

# evaluation with the classical RMSE
RMSE_train = np.sqrt(((y_train - predictions_on_train_set) ** 2).mean())
print(f"RMSE on the train set: {round(RMSE_train, 2)} min")

RMSE_test = np.sqrt(((y_test - predictions_on_test_set) ** 2).mean())
print(f"RMSE on the test set: {round(RMSE_test, 2)} min")

if RMSE_test > 1.3 * RMSE_train:
    print("##### OVERFITTING!")
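To put an RMSE value in context, it can help to compare it with a trivial baseline that always predicts the mean TTA of the training set. The sketch below does this with scikit-learn's `DummyRegressor` on toy arrays (hypothetical values; the `*_toy` names are illustrative and separate from the notebook's variables).

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Toy targets in minutes (hypothetical); features are ignored by the baseline
y_train_toy = np.array([0, 6, 12, 18, 24, 30, 36])
y_test_toy = np.array([6, 18, 30])
X_train_toy = np.zeros((len(y_train_toy), 2))
X_test_toy = np.zeros((len(y_test_toy), 2))

# always predicts the mean of the training targets (18 min here)
baseline = DummyRegressor(strategy="mean").fit(X_train_toy, y_train_toy)
baseline_predictions = baseline.predict(X_test_toy)

baseline_rmse = np.sqrt(((y_test_toy - baseline_predictions) ** 2).mean())
print(f"Baseline RMSE on the toy test set: {round(baseline_rmse, 2)} min")
```

A model is only informative to the extent that its test RMSE falls clearly below this baseline.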

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# column 1 is the elapsed time, column 0 the relative surface (both standardized)
ax.scatter(X_test[:, 1], X_test[:, 0], y_test, c="b", label="ground truth")
ax.scatter(X_test[:, 1], X_test[:, 0], predictions_on_test_set, c="r", label="predictions")

ax.set_xlabel('Time (standardized)')
ax.set_ylabel('Relative surface (standardized)')
ax.set_zlabel('TTA (min)')
ax.set_title("Predictions on the evaluation set")
ax.legend()