This notebook searches for optimal fixed policies (e.g. constant mortality) that maximize the objective (i.e. expected net reward). Because the dynamics are so heavily stochastic, standard optimization strategies do not work well. Here I try [scikit-optimize](https://scikit-optimize.github.io/stable/index.html) routines, which are designed for noisy functions, but the optimum they return does not look close to what a simple grid search suggests: somewhere around 0.05 to 0.1 appears to be optimal in this model.
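
The grid search mentioned above is not shown in this notebook. A minimal sketch of how such a search might look is given here. It assumes an objective with the same calling convention as the `f` defined later (a one-element list in, a noisy mean reward out). So that the block runs on its own, it uses a made-up stand-in, `toy_objective`; both that function and the helper name `grid_search` are hypothetical and not part of `rl4fisheries`.

```python
import random

def toy_objective(x, rng, n_reps=100):
    """Hypothetical noisy reward (illustration only): true peak at effort 0.08."""
    effort = x[0]
    signal = 1.0 - ((effort - 0.08) / 0.08) ** 2
    # Average noisy replicates, mirroring the replicate-averaging idea in f.
    draws = [signal + rng.gauss(0.0, 0.1) for _ in range(n_reps)]
    return sum(draws) / n_reps

def grid_search(objective, lower, upper, n_points):
    """Evaluate objective on an evenly spaced grid; return (best_x, [(x, score), ...])."""
    step = (upper - lower) / (n_points - 1)
    grid = [lower + i * step for i in range(n_points)]
    scores = [objective([x]) for x in grid]
    best_index = max(range(n_points), key=lambda i: scores[i])  # maximize the score
    return grid[best_index], list(zip(grid, scores))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
best_effort, table = grid_search(lambda x: toy_objective(x, rng), 0.0, 0.2, 21)
```

On a noisy objective, neighbouring grid points can swap rank from run to run, so it is usually more informative to look at the whole `table` than at `best_effort` alone.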

In [4]:
%pip install -e ..

Obtaining file:///home/rstudio/rl4fisheries
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Installing backend dependencies ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting typing (from rl4fisheries==1.0.0)
  Using cached typing-3.7.4.3-py3-none-any.whl
Building wheels for collected packages: rl4fisheries
  Building editable for rl4fisheries (pyproject.toml) ... done
  Created wheel for rl4fisheries: filename=rl4fisheries-1.0.0-0.editable-py3-none-any.whl size=2176 sha256=2eb5ea53c53391e317b5f66336d962fe530a2b8a6c66ba1ee1820d3d3e8803b9
  Stored in directory: /tmp/pip-ephem-wheel-cache-umwgxnwa/wheels/d3/ce/fe/d5af67bb4edf309f6a59d59140b2b78d5a336b2ad4b93a1fb4
Successfully built rl4fisheries
Installing collected packages: typing, rl4fisheries
Successfully installed rl4fisheries-1.0.0 typing-

In [1]:
from rl4fisheries.asm import AsmEnv
import numpy as np

In [2]:
env = AsmEnv()

In [3]:
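class fixed_effort:
    """Policy that applies the same effort at every step, ignoring the observation."""

    def __init__(self, effort):
        self.effort = effort[0]  # effort arrives as a one-element list, e.g. [0.1]

    def predict(self, observation, **kwargs):
        # Rescale effort from [0, 1] to [-1, 1], presumably the action range AsmEnv expects.
        action = self.effort * 2 - 1
        action = np.array([action], dtype=np.float32)
        return action, {}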

In [4]:
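def f(x):
    """Mean episode reward of a fixed-effort policy, where x = [effort]."""
    # Note: this returns a reward (higher is better), whereas skopt's
    # *_minimize routines look for the lowest value of the function they are given.
    results = []
    agent = fixed_effort(x)
    for rep in range(100):  # score is the average of 100 replicates; still a noisy measure
        episode_reward = 0.0
        observation, _ = env.reset()
        for t in range(env.Tmax):
            action, _ = agent.predict(observation, deterministic=True)
            observation, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward
            if terminated or truncated:
                break
        results.append(episode_reward)
    return np.mean(results)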
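
Since the 100-replicate average is still noisy, it can help to look at the standard error alongside the mean, to judge whether two candidate efforts are actually distinguishable. The sketch below is one way to do that. `summarize_rewards` is a hypothetical helper, and the reward values in the usage line are made-up numbers, not output from `AsmEnv`.

```python
import math
import statistics

def summarize_rewards(episode_rewards):
    """Return (mean, standard error of the mean) for a list of episode rewards."""
    n = len(episode_rewards)
    mean = statistics.fmean(episode_rewards)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    std_err = statistics.stdev(episode_rewards) / math.sqrt(n)
    return mean, std_err

# Made-up rewards, for illustration only.
mean, std_err = summarize_rewards([2.0, 4.0, 4.0, 6.0])
```

Inside `f`, this would amount to calling `summarize_rewards(results)` for inspection while still handing only the mean to the optimizer.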
    

In [None]:
f([0.1])

In [None]:
from skopt import gp_minimize
res = gp_minimize(f, [(0.0, 0.2)], n_calls = 1000)
res.x

In [72]:
from skopt import dummy_minimize
res1 = dummy_minimize(f, [(0.0, 0.2)], n_calls = 1000)
res1.x

[0.000302295496324212]
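
One thing worth checking here: the `skopt` `*_minimize` routines search for a minimum, whereas `f` returns a mean reward that should be maximized. A result pinned near the lower bound of the search interval may therefore reflect the routine finding the lowest-reward policy rather than the best one. The usual remedy is to minimize the negated objective. The sketch below illustrates the idea with a hypothetical pure-Python random search (`random_minimize`, a stand-in in the spirit of `dummy_minimize`) and a made-up deterministic reward curve, so that it runs without `skopt` or the environment.

```python
import math
import random

def toy_reward(x):
    """Made-up reward curve (illustration only): zero at effort 0, maximum at 0.08."""
    effort = x[0]
    return effort * math.exp(-effort / 0.08)

def negated(objective):
    """Wrap a to-be-maximized objective so that a minimizer can be applied to it."""
    return lambda x: -objective(x)

def random_minimize(objective, bounds, n_calls, seed=0):
    """Hypothetical stand-in for a minimizer: uniform random search on one interval."""
    rng = random.Random(seed)
    lower, upper = bounds
    candidates = [[rng.uniform(lower, upper)] for _ in range(n_calls)]
    return min(candidates, key=objective)  # candidate with the lowest objective value

# Minimizing the reward itself lands near zero effort (the worst policy here).
worst = random_minimize(toy_reward, (0.0, 0.2), n_calls=500)
# Minimizing the negated reward lands near the true maximum at 0.08.
best = random_minimize(negated(toy_reward), (0.0, 0.2), n_calls=500)
```

With `skopt`, the same idea would be `gp_minimize(lambda x: -f(x), [(0.0, 0.2)], n_calls=...)`, after which the best effort is `res.x` and the corresponding mean reward is `-res.fun`.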

In [None]:
from skopt import gbrt_minimize
res2 = gbrt_minimize(f, [(0.0, 0.2)], n_calls = 1000)
res2

