Moving a problem from SARSOP to POMCP #27
Looks like the main issue is how to use Particles. If you look at the file particles.pyx, there are two classes: Particles and WeightedParticles.
Thanks for pointing this out. That's a mistake in the Tiger documentation. In the code tiger_problem.py, you can find this line that converts a histogram belief into a particles belief: https://github.com/h2r/pomdp-py/blob/master/pomdp_problems/tiger/tiger_problem.py#L328 Indeed, it would be nice to add documentation about the different distributions in pomdp_py. This will be a todo.
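For reference, a minimal sketch of that conversion (the state labels here are made up for illustration; any hashable states, e.g. pomdp_py.State subclasses, would work the same way):

import pomdp_py

# Hypothetical sparse belief: only two of the four states carry probability mass.
states = ["s0", "s1", "s2", "s3"]
init_belief_hist = pomdp_py.Histogram(
    {s: (0.5 if s in ("s0", "s1") else 0.0) for s in states})

# The same conversion used in tiger_problem.py: sample particles from the histogram.
init_belief_particles = pomdp_py.Particles.from_histogram(init_belief_hist)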
Thank you very much for your response! I see, I must have misread the documentation. However, in my problem I am using a belief where only a handful of states are nonzero. Will I run into any problems by using a belief like this? Otherwise I will do it like in the tiger example. Still, I have the issue that, when printing the belief converted with get_histogram(), the values don't look like what I expect. Lastly, I'd be glad to help with fixing the documentation and/or contributing documentation about the distributions. I will have to modify other parts of the code down the line (like the value_function), so helping with the documentation and writing examples could be a good way to understand the codebase better.
Well, I can't guarantee that you won't run into any problems, but you should be able to use Particles to represent the belief state in your case.
This may not give you an accurate summary of your distribution, because your belief is sparse and many states are probably not added as a particle. This function only produces an equivalent distribution that is more compact. I'd say, to be accurate, you have to do:

for s in ALL_STATES:
    # belief is Particles or WeightedParticles
    print(s, belief[s])
Thanks. Your contribution is welcome. Please have a look at: #25
Thank you very much. I'll proceed with the tips you gave me. One last thing related to POMCP: is there a way of running the planning step offline? I want to plan once, after I get the relevant information from the brain data, and then run the model without running any more simulations. This is because the POMDP runs while new data is acquired and classified in real time to obtain real observations. If not, another option would be to plan after every state change (i.e., whenever an action other than 'wait' is taken), and instruct the experiment to stop for that interval of time (something like 0.5s).
You could definitely do planning only once and reuse the plan. The result of POMCP planning is a tree, accessible via agent.tree. I encourage you to check out the tree debugging tool, which helps you understand how to access what's in the tree: https://h2r.github.io/pomdp-py/html/api/pomdp_py.utils.debugging.html
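As a rough sketch, assuming pomcp and agent objects already exist (TreeDebugger is the tool from the linked debugging page; reading the preferred action back from the root's children and their values is an assumption about the node attributes, best double-checked with the debugger itself):

from pomdp_py.utils import TreeDebugger

# Plan once; POMCP builds its search tree and returns the greedy action at the root.
action = pomcp.plan(agent)

# The tree stays attached to the agent and can be inspected with the debugging tool.
dd = TreeDebugger(agent.tree)

# The root's children map actions to their estimated values, so the preferred action
# can be read back later without running any more simulations.
best_action = max(agent.tree.children, key=lambda a: agent.tree.children[a].value)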
I will, thank you. So far I have managed to make the model work, except for a small thing I am not sure I am doing right. As explained before, the initial belief at the beginning of a trial is sparse: it is uniform across all states that correspond to time_step 0 and zero for all other states:

# Uniform initial belief (for states at time_step 0)
init_belief_hist = pomdp_py.Histogram({state: 1 / n_init_states if state in all_init_states else 0 for
state in all_states})
init_belief_particles = pomdp_py.Particles.from_histogram(init_belief_hist)

For the first trial, the printed belief gives each of the initial states somewhere between roughly 7% and 10%.
I understand that these small differences are due to the particle representation, and that as the number of particles increases the belief would approach uniformity. Anyway, given my transition model, any action other than 'wait' (or reaching the last time step) sends the agent to a random state at time step 0:

def probability(self, next_state, state, action):
"""Returns the probability p(s'|s, a)"""
if state.t == self.max_t or "wait" not in action.name: # Last time step or decision
if next_state.t == 0:
return 1 / self.n_targets
else:
return 0
else: # Wait action on time steps other than the last one
if next_state.t == state.t + 1: # For the next time step
if next_state.id == state.id:
return 1.0 - 1e-9
else: # Other states in the next time step
return 1e-9
else: # Can't travel through time... yet
return 0
def sample(self, state, action):
"""Randomly samples next state according to transition model"""
if state.t == self.max_t or "wait" not in action.name: # Last time step or decision
all_init_states = self.get_all_states(t_step=0)
return random.choice(all_init_states)
else: # Wait action on time steps other than the last one
next_step = state.t + 1
return TDState(state.id, next_step)
def get_all_states(self, t_step=None):
"""Returns a list of all states"""
if t_step is not None: # Get list of states for a given time_step
all_states = [TDState(s, t_step) for s in range(self.n_targets)]
else: # All states, all time steps
all_states = [TDState(s, d) for s, d in itertools.product(range(self.n_targets), range(self.n_steps))]
return all_states

However, unless I manually re-initialize my belief at the beginning of each trial (i.e. by calling vep_problem.agent.set_belief(init_belief_particles)), the belief at the start of the next trial is no longer uniform.
Whereas before each initial state's belief was between ~7% and ~10%, now one state has close to 40% belief at the start of the trial. This, combined with the fact that the observation noise is high at the earlier time steps, means the correct state's belief can drop to 0 by the time it starts being observed in consecutive time steps. I suspect this may be due to the last update of each trial. In my POMDP, observations come from the brain data of a given participant, and at the end of each trial (which can happen because an action other than 'wait' was taken, or because the last time step was reached) a final observation is still drawn from the data and used to update the belief.

The way the observation model works is by training a decoder beforehand that classifies the user's brain data, and some extra data is used to generate a confusion matrix of the classifier. This is what the observation model uses, and the probability of the observation matching the state is higher as time progresses (more data available, thus better decoding). This is the code for the observation model:

class TDObservationModel(ObservationModel):
"""
Parameters
----------
features: 3-D np.array, shape (n_steps, n_states, n_observations)
Feature array for the observation matrix.
Attributes
----------
observation_matrix: 3-D np.array, shape (n_steps, n_states, n_observations)
Matrix representing the observation model, where each element is the probability of obtaining
the observation indexed by the third dimension, given that the agent is at the state indexed by
the second dimension and the current time step of the trial is the one indexed by the first dimension.
Example:
observation_matrix[3][2][5] = p(o=o_5|s=s_2, d=3)
"""
def __init__(self, features):
self.observation_matrix = features
self.n_steps, self.n_states, self.n_obs = self.observation_matrix.shape
def probability(self, observation, next_state, action):
# Probability of obtaining a new observation given the next state and the time step
obs_idx = observation.id
state_idx = next_state.id
state_step = next_state.t
return self.observation_matrix[state_step][state_idx][obs_idx]
def sample(self, next_state, action):
# Return a random observation according to the probabilities given by the confusion matrix
state_idx = next_state.id
state_step = next_state.t
obs_p = self.observation_matrix[state_step][state_idx]
return np.random.choice(self.get_all_observations(), p=obs_p)

Should I make a special observation that is always returned by sample() once a decision is made or the last time step is reached? For reference, this is the main loop of the simulation:

for trial_n in range(n_trials):
# Separate next trial and label
next_trial = X_test[trial_n, :, :]
next_label = int(y_test[trial_n])
# Set belief to uniform, in case last trial ended without a decision
# vep_problem.agent.set_belief(init_belief_particles)
print('')
print(f'TRIAL {trial_n} (true state S{next_label})')
print('-' * 20)
# For every time step...
for step_n in range(n_steps):
# Set the true state as the env state
true_state = TDState(next_label, step_n)
vep_problem.env.apply_transition(true_state)
print('')
print(f' STEP {step_n}')
print(' Current belief:')
cur_belief = vep_problem.agent.cur_belief
print_belief(cur_belief, all_states, indent=8)
# Get your action and execute it
action = pomcp.plan(vep_problem.agent)
reward = vep_problem.env.state_transition(action, execute=False)
print('')
print(f' Action: {action.name}')
print(f' Reward: {reward}')
# Add your reward
total_reward += reward[-1]
if reward[-1] == miss_cost:
false_positives += 1
# Predict and get observation
t_start = 0
t_end = time_steps[step_n]
# Get prediction from the corresponding model
sample_end = int(t_end * params['sfreq'])
pred = int(best_model[step_n].predict(next_trial[..., :sample_end])[0])
observation = BCIObservation(pred)
print(f' Observation: {observation.name}')
# Belief update
pomcp.update(vep_problem.agent, action, observation)
# Go to next trial if action is taken
if action.name != 'a_wait':
decision_time = t_end
total_time += decision_time
beliefs.append((action, cur_belief))
print('')
print(f'Action {action.name} selected. Trial ended.')
print(f'Decision took {decision_time}s')
break
# End of the trial. If the last action was 'wait', add a missed trial
if action.name == 'a_wait':
total_time += epoch_len
misses += 1
print(f'No decision was taken, trial stopped at {t_end}')
print("End of the trial") |
I will try to look more into this. Could you briefly answer a quick question: are trials independent, or is trial 0 supposed to influence trial 1?
Not at all, the true state of a trial has nothing to do with the state of the next trial.
But it looks like the line that resets the belief (vep_problem.agent.set_belief(init_belief_particles)) is commented out?
It is commented out at the beginning of the trial, yeah:

for trial_n in range(n_trials):
# Separate next trial and label
next_trial = X_test[trial_n, :, :]
next_label = int(y_test[trial_n])
# Set belief to uniform, in case last trial ended without a decision
# vep_problem.agent.set_belief(init_belief_particles)

The thing is that, even if I force the belief to uniform, the POMCP simulation does not take this belief update into account when planning or sampling. So I thought I should make this belief update happen when updating the model on the last step. If I understood correctly, when you call pomcp.update the new belief comes from the particles in the tree, so setting the belief manually afterwards may not affect planning? I am not sure; my understanding of how the tree works is limited, I'm afraid. To support my point of view, the Tiger problem does not manually re-initialize the belief after opening a door, and the problem I am working on is very similar to that one. The only difference is that in the Tiger problem you can go on forever until you open a door, while here you can either 'open a door' or reach the last time step. In both cases the world transitions into a new state that is independent from the previous one. Maybe I should update my observation model to yield observations with uniform probability whenever the trial ends?
Belief update should happen at every time step (during a trial).
I disagree. In Tiger, there is no state change no matter what action is taken.
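Roughly, a "no state change" transition model looks like this (an illustrative sketch, not the exact code from the pomdp_py Tiger example):

import pomdp_py

class StaticTransitionModel(pomdp_py.TransitionModel):
    """The state is unaffected by any action, as in the Tiger problem."""
    def probability(self, next_state, state, action):
        return 1.0 if next_state == state else 0.0

    def sample(self, state, action):
        return state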
Could you elaborate on this? Are you saying POMCP is still planning with the old belief even after you set it manually with agent.set_belief?
Suppose action open-left is taken in Tiger: the tiger stays behind the same door, so the state does not change as a result of the action.
I see, I was under the impression that the state could change after opening a door. I think I mixed two things; allow me to reformulate. In the Tiger problem, when you listen, you get an observation according to the noise parameter. In my problem, the listen equivalent is 'wait', and when you wait you get an observation based on a previously estimated confusion matrix. However, when you 'open' in the Tiger problem, you get an observation with uniform probability. In my problem, this is not implemented, and I think this is where the problem comes from. In more detail, once a decision is taken in my problem (any action other than wait), obtaining an observation from the data in that time step is a modelling error (I think). I believe in that case the observation should be sampled at random, similar to when you open a door in the Tiger problem. I changed the observation model to the following:

def probability(self, observation, next_state, action):
# Probability of obtaining a new observation given the next state and the time step
if next_state.t == self.n_steps or "wait" not in action.name: # Last time step or decision
return 1 / self.n_states
else: # Wait action on other time steps
obs_idx = observation.id
state_idx = next_state.id
state_step = next_state.t
return self.observation_matrix[state_step][state_idx][obs_idx]
def sample(self, next_state, action):
# Return a random observation according to the probabilities given by the confusion matrix
if next_state.t == self.n_steps or "wait" not in action.name: # Last time step or decision
return np.random.choice(self.get_all_observations())
else: # Wait action on other time steps
state_idx = next_state.id
state_step = next_state.t
obs_p = self.observation_matrix[state_step][state_idx]
return np.random.choice(self.get_all_observations(), p=obs_p)

This fixed the issue of the skewed beliefs at the beginning of each trial. Do you think I should still manually re-initialize the belief, given that the trials are independent from each other?
No, that is not what I meant. Again, I think the confusion comes from me mixing up the belief with how the observation sampling is modelled at the last time step. After reading your comments, I realized that I am missing a time step in my code. Right now, my code plans an action, gets an observation from the data, and updates the belief at every time step from 0 to N, and then ends the trial.
The problem with this is that, at time step N, the belief is updated with the maximum available data for that trial and then the trial is ended, whereas there should be an N+1 step that takes an action with that belief and then ends the trial no matter the action. After this action is taken, there is no data to get an observation from, and it doesn't matter, because the state is going to change at random. I believe I should randomly sample an observation and then update with the observation model I posted above.
Yes, this is what I was worried about. Since I was missing a last step in my model, the POMCP simulation assumed all observations come from the data and influence the next step. With the new observation model, the POMCP simulation knows that, upon reaching the last time_step, the probability of obtaining any observation is uniform. I think there was no particle deprivation in my case because I was running the model under the assumption that there is data available to get observations from at every time step. Now I am creating my model with an extra time step and everything seems to be working well. Still, would it make any difference to manually set the belief to uniform at the beginning of each trial?
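For what it's worth, here is a rough sketch of how the per-trial loop could look with the extra decision step. It reuses names from the snippets above (vep_problem, pomcp, best_model, time_steps, etc.) and only illustrates the observation handling; it is not the exact implementation:

for step_n in range(n_steps + 1):
    # The true state advances with the trial's time step.
    vep_problem.env.apply_transition(TDState(next_label, step_n))
    action = pomcp.plan(vep_problem.agent)
    if step_n < n_steps and action.name == 'a_wait':
        # Data is available: decode the brain signal up to this point in the trial.
        sample_end = int(time_steps[step_n] * params['sfreq'])
        pred = int(best_model[step_n].predict(next_trial[..., :sample_end])[0])
        observation = BCIObservation(pred)
    else:
        # A decision was taken or the extra final step was reached: there is no data
        # to decode, so the observation is sampled uniformly, matching the new model.
        observation = random.choice(vep_problem.agent.observation_model.get_all_observations())
    pomcp.update(vep_problem.agent, action, observation)
    if action.name != 'a_wait':
        break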
To ensure independence between trials, I would re-create a problem instance at the start of a trial.
When re-creating a problem instance at the end of each trial, is it necessary to also re-create the POMCP instance?
It should be fine to use the same instance, since you pass in the agent when you do pomcp.plan(agent), unless the rollout policy used to create the POMCP instance is agent-specific. It is not costly at all to create a POMCP instance, so if you want to be safe you can create a new one.
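In code, that could look roughly like the following, where make_vep_problem is a hypothetical helper standing in for however the problem instance (agent, environment, initial belief) is actually constructed, and pomcp is the planner created once beforehand:

for trial_n in range(n_trials):
    # A fresh problem instance gives a fresh agent, environment, and initial
    # (uniform) belief, so nothing carries over from the previous trial.
    vep_problem = make_vep_problem(init_belief_particles)

    # The same POMCP planner can be reused, since the agent is passed in at planning time.
    action = pomcp.plan(vep_problem.agent)
    # ... rest of the per-trial loop (transitions, observations, belief updates) as before ...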
Roger that, thank you very much! I will close the issue now.
Hello! I am currently trying to use pomdp_py to explore using POMDPs as a framework to help decision-making in Brain-Computer Interfaces (BCI). A while back I asked about the possibility of defining a model that included a time component in #21.

TL;DR - These are the problems/questions I have:

Particles is not constructed correctly when passed a list of tuples generated from the Histogram belief representation.

After finishing a first work exploring BCIs using a simple POMDP that relied on SARSOP for solving, I am now trying to move to POMCP to solve a problem similar to what we discussed in #21.

I ran into a problem when trying to call pomcp.plan() in POMCP, as I was initializing my belief like the following:

I then got an error saying that the belief needs to be represented as particles in order to use POMCP. I noticed this is not present in the Tiger example in the documentation. After looking around for a bit, I found the Particles module. However, I did not understand how to construct it. I tried to do it in the following way (starting from the previous init_belief):

But then I get the entire tuples (both the State and its probability) as values, and all weights are set to None. I ended up making it like the following:

That way it looks like it works. However, when I start the simulation I have the following piece of code where I print the current belief (right now as it was for the histogram representation of the belief):

And then I get a very long list where none of the values is 0. At any time step, only n_targets states should be different from 0, as the initial belief is specified as stated previously and the transition model (not shown) specifies that all states can only go to the same state with an increase in time_step. That said, I don't know if what I am getting is the belief or rather the list of particles (my understanding of how the particle representation works is limited, even after reading the POMCP paper, so it may be that). If that is the case, I wonder if it is correct to query the belief with something like problem.agent.cur_belief.get_histogram().

Thank you very much for your time.