# ctDNA changes as a predictive marker for response (PR/CR/SD/PD) (UMCG/Paul and Hylke, validation at MUG)
To test whether genomic information helps predict clinical response, we built a model that separates responders (complete and partial response) from the complement: non-responders (stable or progressive disease) and non-evaluable participants. To this end, we first partitioned the dataset into a training-validation set ($n=...$ participants) and a hold-out test set (the remainder, $n=...$ participants).
Below, we report results from five-fold cross-validation on the training-validation set and from the test set.

To evaluate the model, we took the single nucleotide variants and the copy number alterations called by the Avenio platform, after the aforementioned filtering. The following transformation steps were then carried out on the variant calls:
1. Point mutations are pooled at gene granularity by summing over individual variations.

    a. For calculations in which single time points (i.e., $t_0$ or $t_1$ only) were considered, we directly sum their values, i.e., 
    $$
    \overline{c}_i^{(\alpha)}(t) = \sum_{j\in \text{variations in }i} c_{ij}^{(\alpha)}(t),
    $$ 
    where $c_{ij}^{(\alpha)}(t)$ denotes the mutant concentration (in units of molecules per mL) of variation $j$ (e.g., $j$ = c.973T>C) in gene $i$ (i.e., $i \in \{\text{EGFR}, \text{KRAS},\dots  \}$) at time point $t\in \{t_0, t_1\}$ for patient $\alpha$. Models using the variant allele frequency, denoted by $x_{ij}^{(\alpha)}(t)$, instead of the mutant concentration, are coarse grained in a similar way.
    
    b. For calculations using _both_ time points $t_0$ and $t_1$, each variation is transformed prior to coarse graining:
    $$
    \overline{f}^{(\alpha)}_i(t_0, t_1) = \sum_{j\in \text{variations in }i} f\left(c_{ij}^{(\alpha)}(t_0), c^{(\alpha)}_{ij}(t_1)\right),
    $$
    and similarly for the variant allele frequency.
2. CNV scores are calculated per gene, so no pooling is required. Two-time-point results can therefore be transformed directly using $f(u, v)$.
3. After pooling, each genomic feature is made dimensionless by maximum absolute value scaling. The dimensionless numbers lie in the $[-1, 1]$ interval, thereby facilitating a comparison with the clinical variables.
4. All genomic values are set to zero when, after filtering, no variants were observed.
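
The pooling, scaling and zero-filling steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used for the analysis: the tuple layout of `calls`, the helper names `pool_per_gene` and `max_abs_scale`, the toy values, and in particular the choice $f(u, v) = v - u$ are all assumptions made for the example.

```python
from collections import defaultdict


def pool_per_gene(calls, patients, genes, f=None):
    """Sum variant-level values per (patient, gene) pair.

    calls: iterable of (patient, gene, value_t0, value_t1), one per variation.
    f: optional two-time-point transformation f(u, v). When omitted, the t0
       value is summed directly (step 1a); otherwise f is applied to each
       variation before summing (step 1b).
    """
    pooled = defaultdict(float)
    for patient, gene, u, v in calls:
        pooled[(patient, gene)] += u if f is None else f(u, v)
    # Step 4: patient/gene pairs without any observed variant are set to zero.
    return {p: [pooled.get((p, g), 0.0) for g in genes] for p in patients}


def max_abs_scale(table):
    """Step 3: divide each gene column by its maximum absolute value."""
    columns = list(zip(*table.values()))
    # An all-zero column keeps scale 1.0 to avoid division by zero.
    scales = [max(abs(x) for x in column) or 1.0 for column in columns]
    return {p: [x / s for x, s in zip(row, scales)] for p, row in table.items()}


# Hypothetical toy data: mutant concentrations at t0 and t1 per variation.
calls = [
    ("A", "EGFR", 10.0, 4.0),
    ("A", "EGFR", 30.0, 6.0),
    ("A", "KRAS", 5.0, 5.0),
    ("B", "EGFR", 20.0, 40.0),
]
patients = ["A", "B", "C"]  # Patient C has no variants after filtering.
genes = ["EGFR", "KRAS"]

baseline = max_abs_scale(pool_per_gene(calls, patients, genes))
change = max_abs_scale(
    pool_per_gene(calls, patients, genes, f=lambda u, v: v - u)  # assumed f
)
print(baseline)
print(change)
```

Because CNV scores are already defined per gene (step 2), they would skip `pool_per_gene` and only pass through the transformation and scaling.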

The data produced by these steps are jointly referred to as _genomic features_. Apart from these genomic features, the clinical variables age, gender, therapy line, smoking status, histology and presence of lymph, brain, adrenal, liver and lung metastases were selected for modelling. We shall refer to this set of variables as the _clinical_ variables or features. Each clinical variable is first dichotomised and then converted into numeric values by one-hot encoding.
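
As an illustration of the clinical encoding, the sketch below dichotomises a continuous variable and then one-hot encodes the resulting two categories. The age threshold of 65 years, the category labels and the helper names are hypothetical; the cut-offs actually used are not stated here.

```python
def dichotomise_age(age, threshold=65):
    """Map an age in years onto two categories (the threshold is an assumption)."""
    return ">65" if age > threshold else "<=65"


def one_hot(values):
    """Return the sorted categories and one indicator row per value."""
    categories = sorted(set(values))
    rows = [[1 if value == c else 0 for c in categories] for value in values]
    return categories, rows


ages = [58, 71, 66]  # Hypothetical patients.
labels = [dichotomise_age(age) for age in ages]
categories, encoded = one_hot(labels)
print(categories)
print(encoded)
```

Each dichotomised variable yields two complementary indicator columns, so the encoded values lie on the same $[0, 1]$ scale as the magnitude of the scaled genomic features.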

A variety of machine learning models were compared to predict the treatment response (Fig. [?]). All models proved to be equivalent, within the variation observed in cross-validation, in terms of the area under the receiver operating characteristic curve (AUC ROC). We shall therefore focus on a logistic regression model with $L_2$ regularisation, in view of its easily interpretable coefficients and the relatively small standard deviation of the AUC ROC across cross-validation folds.
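
For orientation, the evaluation scheme can be sketched with scikit-learn on synthetic data. The data from `make_classification`, the stratified and shuffled folds, and the random seeds are assumptions for the example; only the logistic regression settings (`C=0.1`, balanced class weights, `newton-cg` solver) are taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the feature matrix and the binary response label.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The L2 penalty is the scikit-learn default; C is the inverse regularisation strength.
model = LogisticRegression(C=0.1, class_weight="balanced", solver="newton-cg")

# Five-fold cross-validation, scored by the area under the ROC curve.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"AUC ROC: {scores.mean():.2f} +/- {scores.std():.2f}")
```

The standard deviation over the five folds is the spread referred to above when comparing models.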

In [1]:
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression

from pipelines import pipeline_Freeman
from transform import combine_tsv_files, generate_data_pairs, generate_model_data_pairs
from views import compare_prognostic_value_genomic_information, view_linear_model_freeman

In [7]:
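# Parameters for logistic regression (the "estimator" step of the pipeline).
logistic_Freeman_parameters = {
    "estimator__C": 0.1,  # Inverse regularisation strength.
    "estimator__class_weight": "balanced",  # Reweight classes by inverse frequency.
    "estimator__solver": "newton-cg",
}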

## In how many patients did we observe consistent increase/decrease of ctDNA levels? How many patients showed a mixed molecular response?
See the figure by Paul.

## Can changing levels predict response?
Marginally:

In [8]:
# Baseline (t0) genomic variables: mutant concentration and CNV score.
X_train_t0, y_train = combine_tsv_files(
    "output/train__gene__t0__No. Mutant Molecules per mL.tsv",
    "output/train__gene__t0__CNV Score.tsv",
)

In [9]:
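# Select the grouped response column as the target.
y_train = y_train["response_grouped"]

response_labels = ['non responder (sd+pd)', 'responder (pr+cr)', 'non evaluable (ne)']
pos_label = 'responder (pr+cr)'
# Binarise: responders are positive; non-responders and non-evaluable are negative.
y_train = y_train == pos_label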

In [10]:
logistic_Freeman = pipeline_Freeman(LogisticRegression)
logistic_Freeman.set_params(**logistic_Freeman_parameters)

Pipeline(memory=None,
         steps=[('clinical_curation',
                 FunctionTransformer(accept_sparse=False, check_inverse=True,
                                     func=<function clinical_data_curation at 0x7feb878fd400>,
                                     inv_kw_args=None, inverse_func=None,
                                     kw_args=None, pass_y='deprecated',
                                     validate=False)),
                ('filter_clinical_variables',
                 FunctionTransformer(accept_sparse=False, check_inverse=True,
                                     func=<function d...
                                                   'adrenalmeta', 'livermeta',
                                                   'lungmeta', 'skeletonmeta',
                                                   'age'])],
                                   verbose=False)),
                ('estimator',
                 LogisticRegression(C=0.1, class_weight='balanced', dual=False,
    

In [None]:
figure_filenames = ("logistic_regression__clinical_freeman__t0", "logistic_regression__genetic_freeman__t0")
view_linear_model_freeman(
    X_train_t0, 
    y_train, 
    logistic_Freeman, 
    filenames=figure_filenames, 
)

In [None]:
mutant_data_pairs = generate_data_pairs(
    filename_prefix="output/train", snv_type="No. Mutant Molecules per mL"
)
vaf_data_pairs = generate_data_pairs(
    filename_prefix="output/train", snv_type="Allele Fraction"
)
model_mutant_data_pairs = generate_model_data_pairs(mutant_data_pairs, logistic_parameters)
model_vaf_data_pairs = generate_model_data_pairs(vaf_data_pairs, logistic_parameters)
compare_prognostic_value_genomic_information(model_mutant_data_pairs, plot_label="Mutant concentration")
compare_prognostic_value_genomic_information(model_vaf_data_pairs, plot_label='Allele fraction')
plt.savefig('figs/comparison_genomic_data.png', bbox_inches="tight")

## Do baseline levels correlate with response?
See figure:
![VAF/mol](figs/comparison_genomic_data.png "VAF molecules comparison")

## Define a cut-off for quantitative change or for baseline levels!
## Use mean of all variants
## Use a delta (T0-T1) of 30% as cut-off
## Use a delta (T0-T1) of 50% as cut-off
## Use a delta (T0-T1) of 80% as cut-off
## Consider only variants with VAF <0.5%
## Consider only variants with VAF <1%
## Use “highest” only
## Use cases with consistent vs mixed changes in levels
## Use all variants including synonymous
## Use only variants excluding synonymous

## Check whether VAF or mutant molecules/mL is a better predictor
See figure:
![VAF/mol](figs/comparison_genomic_data.png "VAF molecules comparison")