An experimental area for reinforcement learning.
Temporal difference (TD) learning is a method that exploits the Markov property to estimate the value ("goodness") of being in a specific state of an environment. TD updates the value of a state towards its own estimate of the value of the next state. TD(0) specifically bootstraps from only the immediately following state when updating the current state's value. There is no model of the environment; value estimates are obtained by sampling transitions from the environment.
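As a concrete reference, here is a minimal sketch of the tabular TD(0) update; the environment interface (`env.reset()`, `env.step(action)`) and the `policy` callable are illustrative assumptions, not taken from this repository.

```python
import numpy as np

def td0_value_estimation(env, policy, num_episodes=500, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    V = np.zeros(env.num_states)              # value estimate per state
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)            # sample an action from the policy
            next_state, reward, done = env.step(action)
            # The TD target bootstraps from the current estimate of the next state
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```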
- Aggregate states that are close to each other under a single "feature"
- Each feature gets a weight, and the weighted features produce the "state value"
- Apply the semi-gradient TD(0) update to the weights during the random walk
- Carry the learned weights over to the next episode to keep learning (see the sketch after this list)
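A minimal sketch of semi-gradient TD(0) with state aggregation on a random walk; the chain length, group size, and environment interface are illustrative assumptions, not the repository's exact configuration.

```python
import numpy as np

NUM_STATES = 1000      # assumed length of the random-walk chain
GROUP_SIZE = 100       # states aggregated under one feature
NUM_FEATURES = NUM_STATES // GROUP_SIZE

def feature_of(state):
    """Map a state to the index of its aggregated feature."""
    return state // GROUP_SIZE

def value(state, weights):
    """With aggregation, the state value is simply the weight of its feature."""
    return weights[feature_of(state)]

def semi_gradient_td0(env, num_episodes=1000, alpha=2e-4, gamma=1.0):
    weights = np.zeros(NUM_FEATURES)          # weights persist across episodes
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step()   # random-walk transition
            target = reward + (0.0 if done else gamma * value(next_state, weights))
            # Semi-gradient update: the gradient of the value w.r.t. the weights
            # is a one-hot vector selecting the state's feature
            weights[feature_of(state)] += alpha * (target - value(state, weights))
            state = next_state
    return weights
```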
- Represent the states with one-hot encoding
- Build a neural network with as many inputs as there are states and a single output that estimates the state value
- Use the Adam optimizer together with the TD update to adjust the weights at each agent step
- Carry the learned weights over to the next episode to keep learning (see the sketch after this list)
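A minimal sketch of this setup, assuming a Keras model; the network size, learning rate, and helper names are illustrative and may differ from the actual implementation.

```python
import numpy as np
from tensorflow import keras

def build_value_network(num_states, learning_rate=1e-3):
    """One-hot state in, scalar state value out, trained with Adam + MSE."""
    model = keras.Sequential([
        keras.layers.Input(shape=(num_states,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1, activation="linear"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate), loss="mse")
    return model

def one_hot(state, num_states):
    x = np.zeros((1, num_states))
    x[0, state] = 1.0
    return x

def td_step(model, state, reward, next_state, done, num_states, gamma=0.99):
    """One TD(0) update of the network weights at a single agent step."""
    next_value = 0.0 if done else model.predict(one_hot(next_state, num_states), verbose=0)[0][0]
    target = np.array([[reward + gamma * next_value]])
    model.fit(one_hot(state, num_states), target, epochs=1, verbose=0)
```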
- Includes an implementation of the inverted pendulum, adapted from https://github.com/mpkuse/inverted_pendulum. Each action lasts 0.2 s; the solver for the pendulum dynamics runs at a resolution of 0.1 s.
- Currently, the pole starts upright with a small offset so that the agent has to learn to balance by moving. If it starts perfectly upright, the agent rather quickly learns to simply stand still to keep it balanced.
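A hedged sketch of how the 0.2 s action duration can relate to the 0.1 s solver resolution (two integration sub-steps per agent action); the function and parameter names are assumptions, not the actual interface of the linked pendulum code.

```python
ACTION_DURATION = 0.2   # seconds per agent action
SOLVER_DT = 0.1         # seconds per dynamics integration step

def step_pendulum(state, force, dynamics):
    """Advance the pendulum by one agent action using several solver sub-steps."""
    num_substeps = int(ACTION_DURATION / SOLVER_DT)   # -> 2 sub-steps per action
    for _ in range(num_substeps):
        # `dynamics` integrates the pendulum equations of motion over SOLVER_DT
        state = dynamics(state, force, SOLVER_DT)
    return state
```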
- SARSA bootstraps off the value of the action it is actually going to take next, which is sampled from its own policy, so it learns about the "goodness" of its own policy.
- A DNN is used to estimate the value of each state-action pair. The output of the DNN has the same size as the action space (one value per action).
- The crucial point is that the "target" is updated with the already decided (known) `next_action` and `next_state`, which makes it on-policy: `target[action] = reward + self.discount_factor * self.model.predict(next_state)[0][next_action]`
- For Deep Q-learning, we will see that the equation above changes so that `next_action` is no longer needed for bootstrapping, which makes it off-policy: `target[action] = reward + self.discount_factor * np.amax(self.model.predict(next_state)[0])` (a side-by-side sketch follows this list)
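A minimal sketch contrasting the two targets, assuming a Keras-style model whose output has one Q-value per action (as described above); the variable names mirror the snippets in the list, but the surrounding functions are illustrative.

```python
import numpy as np

def sarsa_target(model, reward, next_state, next_action, discount_factor, done):
    """On-policy: bootstrap from the Q-value of the action the policy already chose."""
    if done:
        return reward
    return reward + discount_factor * model.predict(next_state)[0][next_action]

def q_learning_target(model, reward, next_state, discount_factor, done):
    """Off-policy: bootstrap from the greedy (maximum) Q-value, no next_action needed."""
    if done:
        return reward
    return reward + discount_factor * np.amax(model.predict(next_state)[0])
```

In both cases the resulting value replaces the `target[action]` entry of the predicted Q-vector before the model is fit on the current state.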