# Fitting mutation scales

## Approach

Find the maximum fungicide and host mutation scales that still allow us to fit the data when the initial distribution is a delta function (i.e. the narrowest possible). This corresponds to all of the breakdown being caused by mutation from a single initial strain.

We find this mutation scale using the first and last years, not the initial ones, since the exact shape of the decline depends on the shape of the initial distribution, which we don't think is actually a delta function.


**CHOICES**:
- Gaussian or exponential kernel (*Gaussian seems best*)
- mutation proportion
- how poor a fit is acceptable?

Then fix the maximum mutation scale.
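
To make the delta-function idea concrete, here is a minimal sketch of a point distribution spreading under a Gaussian mutation kernel. It is not the `polymodel` implementation: the grid, the helper names `gaussian_kernel` and `mutate`, and the parameter values are all assumptions chosen for illustration.

```python
import numpy as np


def gaussian_kernel(grid, scale):
    """Row-normalised Gaussian mutation kernel on a discrete trait grid.

    Row i holds the offspring trait distribution for a parent at grid[i].
    """
    diff = grid[None, :] - grid[:, None]
    kernel = np.exp(-0.5 * (diff / scale) ** 2)
    return kernel / kernel.sum(axis=1, keepdims=True)


def mutate(dist, kernel, mutation_prop):
    """Keep a fraction (1 - mutation_prop) unchanged and disperse the rest."""
    return (1 - mutation_prop) * dist + mutation_prop * (dist @ kernel)


grid = np.linspace(0, 1, 101)

# Delta-function (point) initial distribution: all mass on one trait value.
initial = np.zeros_like(grid)
initial[50] = 1.0

# Illustrative values only, not the fitted ones.
kernel = gaussian_kernel(grid, scale=0.05)
dist = initial.copy()
for _ in range(10):
    dist = mutate(dist, kernel, mutation_prop=0.1)

# Root-mean-square distance of the mass from the starting trait value.
spread = np.sqrt(np.sum(dist * (grid - grid[50]) ** 2))
print(f"total mass: {dist.sum():.3f}, spread after 10 steps: {spread:.3f}")
```

With a point start, any spread in the distribution comes from mutation alone, which is why this setup gives an upper bound on the mutation scale.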

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")

import plotly.graph_objects as go

import optuna
from optuna.visualization import (
    plot_optimization_history,
    plot_contour,
)
from optuna.samplers import TPESampler

from polymodel.fitting import HostMaxMutationObjective, score_for_this_df
from polymodel.config import Config
from polymodel.consts import MUTATION_PROP

from plots.fns import standard_layout

In [None]:
optuna.logging.set_verbosity(0)

# Host

## Find optimal value

The following code warns that `len(fung_frame) = len(host_frame) = 0`. This is fine: we do not need the fitted k, l values, since the mutation fitting uses a point distribution.

In [None]:
host_fit_config = Config(
    'single',
    cultivar='Mariboss',
    n_k=40,
    n_l=500,
    mutation_proportion=MUTATION_PROP,
    mutation_scale_fung=1,
    mutation_scale_host=1,
)

In [None]:
sampler = TPESampler(seed=0)
study_h = optuna.create_study(sampler=sampler)
obj_h = HostMaxMutationObjective(host_fit_config)

In [None]:
%%time
study_h.optimize(obj_h, n_trials=300)
int(study_h.best_value)

In [None]:
%%time
study_h.optimize(obj_h, n_trials=300)
int(study_h.best_value)

In [None]:
plot_contour(study_h)

In [None]:
plot_optimization_history(study_h)

## Replicate results

In [None]:
study_h.best_params

In [None]:
yh = (
    HostMaxMutationObjective(host_fit_config)
    
    .run_model(params = study_h.best_params)
    
    # .run_model(params = {
    #     'mean': 0.83,
    #     'mutation_scale': 0.16
    # })
)

yh

In [None]:
control_data_h = (
    obj_h.df
    .loc[:, ['data_control', 
             # 'n_data',
             'year']]
    .assign(year = lambda df: df.year - df.year.min())
)

control_data_h

In [None]:
score_for_this_df(control_data_h, yh)

In [None]:
f, ax = plt.subplots(figsize=(14,7))

sns.scatterplot(
    x='year',
    y='data_control',
    # size='n_data',
    data=control_data_h,
    ax=ax,
)

ax.plot(yh, lw=4, color='red')

ax.set_ylim([0,100])

Annoyingly, the fit is not great, but that seems acceptable since we are only after an upper bound on the mutation scale. The fungicide fit is also pretty good.

I think the issue is that too much mutation stops the population fixing at low values of control: the fittest strains constantly mutate back towards less fit offspring.
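
A toy sketch of this effect, under stated assumptions: a hypothetical linear fitness `1 + trait`, a point start, and a deliberately large mutation proportion so the effect is visible. None of these are the model's actual values, and `equilibrium_mean` is an illustrative helper rather than part of `polymodel`. It compares the long-run mean trait for a narrow and a wide mutation kernel.

```python
import numpy as np


def equilibrium_mean(scale, mutation_prop=0.5, n_generations=2000):
    """Mean trait after repeated selection and mutation from a point start."""
    grid = np.linspace(0, 1, 101)
    diff = grid[None, :] - grid[:, None]
    kernel = np.exp(-0.5 * (diff / scale) ** 2)
    kernel /= kernel.sum(axis=1, keepdims=True)

    fitness = 1 + grid      # assumption: higher trait value is fitter
    dist = np.zeros_like(grid)
    dist[50] = 1.0          # point (delta) initial distribution

    for _ in range(n_generations):
        # Selection: reweight by fitness, then renormalise.
        dist = dist * fitness
        dist /= dist.sum()
        # Mutation: a fraction of offspring is dispersed by the kernel.
        dist = (1 - mutation_prop) * dist + mutation_prop * (dist @ kernel)
    return float(np.sum(grid * dist))


narrow = equilibrium_mean(scale=0.05)
wide = equilibrium_mean(scale=0.5)
print(f"mean trait, narrow kernel: {narrow:.3f}; wide kernel: {wide:.3f}")
```

In this toy setting the wide kernel keeps the mean trait further from the fittest value, because each generation a larger share of offspring lands far from the fittest strain.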

The fit improves if the mutation proportion is higher and the mutation scale is lower, but the Alexey values suggest the mutation proportion is low.

- Alexey highest value: 379 after 350 iterations
- Alexey default value: 1147 after 350 iterations

Both used weighted scoring rather than the current unweighted scoring.

Suggestion: proceed with the highest Alexey value of the mutation proportion, then use the result as the upper bound for the mutation scale.

Otherwise both choices are arbitrary. This way we can report a best-fit value and put the bad-fit figure in the appendix.
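
For reference, the difference between weighted and unweighted scoring can be sketched as below. This assumes the score is a sum of squared errors between data and model output, which is a guess at what `score_for_this_df` computes rather than its actual code; `sse_score` is a hypothetical helper, the numbers are made up, and the weights stand in for something like the commented-out `n_data` column.

```python
import numpy as np


def sse_score(years, observed, model, weights=None):
    """Sum of squared differences between data and model output.

    `years` are zero-based offsets used to index `model`; `weights`
    (e.g. number of observations per point) are optional.
    """
    years = np.asarray(years, dtype=int)
    observed = np.asarray(observed, dtype=float)
    residuals = observed - np.asarray(model, dtype=float)[years]
    if weights is None:
        weights = np.ones_like(observed)
    return float(np.sum(np.asarray(weights, dtype=float) * residuals ** 2))


# Made-up numbers purely to show the call pattern.
years = [0, 0, 1, 2, 2]
observed = [90.0, 86.0, 70.0, 42.0, 38.0]
model = [88.0, 70.0, 40.0]
n_obs = [1, 3, 2, 1, 4]

unweighted = sse_score(years, observed, model)
weighted = sse_score(years, observed, model, weights=n_obs)
print(unweighted, weighted)
```

Scores from the two schemes are not directly comparable, so the 379 and 1147 above should only be compared with each other, not with the current unweighted scores.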

## Save mutation scale?

In [None]:
hdf = pd.DataFrame(dict(
    host_mutation_scale = [study_h.best_params['mutation_scale']],
    host_mean = [study_h.best_params['mean']],
))

SAVE_HOST_FIT = True  # set to False to skip overwriting the saved CSV

if SAVE_HOST_FIT:
    print('saving')
    hdf.to_csv('../data/03_model_inputs/host_mutation_scale.csv')
    
hdf

## Plot

In [None]:
COLZ = sns.color_palette('muted').as_hex()

In [None]:
def host_fig(df_in, y_in):
    
    col1 = COLZ[0]
    col2 = COLZ[1]
    
    data = [
        go.Scatter(
            x = df_in.year,
            y = df_in.data_control,
            mode = 'markers',
            name='Data (cultivar)',
            marker=dict(color=col1),
        ),
        go.Scatter(
            x = np.arange(df_in.year.min(), df_in.year.max()+1),
            y = y_in,
            mode = 'lines',
            name='Model (mutation only)',
            line=dict(color=col2),
        )
    ]
               
    fig = go.Figure(data=data, layout=standard_layout(True, height=400))
    
    fig.update_layout(legend=dict(x=0.05, y=0.1))
    
    fig.update_xaxes(title='Year')
    fig.update_yaxes(title='Control (%)', range=[0,100])
    return fig

In [None]:
data_use = obj_h.df.loc[:, ['year', 'data_control']]

In [None]:
f = host_fig(data_use, yh)

f.show()

In [None]:
f.write_image('../figures/paper_figs/fig_app_host_mutation.png')

# Bad fit with other mutation scale

In [None]:
# Ratio of the midpoints of two ranges:
# (28 to 130) x 10^6 over (2.3 to 10.5) x 10^12
MUTATION_PROP_BAD = (0.5 * (28 + 130) * 10**6 ) / (0.5 * (2.3 + 10.5) * 10**12)
MUTATION_PROP_BAD

In [None]:
host_fit_config_bad = Config(
    'single',
    cultivar='Mariboss',
    n_k=40,
    n_l=500,
    mutation_proportion=MUTATION_PROP_BAD,
    mutation_scale_fung=10,
    mutation_scale_host=10,
)

In [None]:
obj_bad = HostMaxMutationObjective(host_fit_config_bad)

In [None]:
ybad = (
    obj_bad
    # .run_model(params = study_h.best_params)
    .run_model(params = {
        'mean': study_h.best_params['mean'],
        'mutation_scale': 10
    })
)

ybad

In [None]:
f = host_fig(data_use, ybad)

f.show()