Difference between value function and policy function #65

Open
rhuselid opened this issue May 24, 2019 · 1 comment
Labels
HW7 Discussions related to Homework #7

Comments

@rhuselid

I am working on the part of the homework from the textbook and am running into a bit of a conceptual issue. What exactly is the difference between the value function and the policy function?

With my current construction I get a shape mismatch, because my value matrix is N x N and my policy matrix is N x (T+1). I am not sure how the value matrix should be reduced in a way that is logically distinct from what the policy matrix already is. Relevant code below:

maximized = []
for i in range(N):
    maximized.append([])
for i in range(T + 1, -1, -1):
    Vt = []
    for cake_t in w:
        # each row is a wt
        row = []
        for cake_t1 in w:
            # each col is a wt+1
            val = cake_t1 - cake_t
            if val >= 0:
                row.append(u(val))
            else:
                row.append(-10000000)
        Vt.append(row)
    intermediary = []
    for wt in Vt:
        best_wt1 = max(wt)
        intermediary.append([best_wt1])
    maximized = np.hstack((maximized, intermediary))
@jmbejara
Owner

The value function gives the discounted value of acting optimally from the current period onward, given the state. The policy function gives the optimal action to choose, given the state. Both functions also depend on the time period, since this first problem has a finite number of time periods.
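
One way to write the distinction down, as a sketch rather than the textbook's exact notation: the discount factor \beta and the policy symbol \psi_t are assumed names, and consumption is taken to be today's cake minus tomorrow's cake. The value function is the maximized objective, and the policy function is the choice that attains it.

```latex
V_t(w) = \max_{0 \le w' \le w} \left[ u(w - w') + \beta \, V_{t+1}(w') \right],
\qquad V_{T+1}(w) = 0,

\psi_t(w) = \operatorname*{arg\,max}_{0 \le w' \le w} \left[ u(w - w') + \beta \, V_{t+1}(w') \right].
```

On a grid of N cake sizes, both objects have one row per grid point and one column per time period, so neither of them is N x N. The N x N array is only the intermediate table of payoffs over (w, w') pairs for a single period.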

There are T time periods, and there is no continuation value after time T. The question asks you to construct a matrix in which column T holds the value in the last time period and column T+1 holds the value at time T+1, which should be uniformly zero.
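
The matrix layout described above could be built by backward induction along the following lines. This is a minimal sketch, not the homework solution: the function name `solve_cake`, the discount factor `beta`, and the square-root utility are all assumptions, and it assumes periods are indexed 0 through T with an extra zero column at T+1 (if the homework counts periods from 1, the shapes shift by one column).

```python
import numpy as np

def solve_cake(w, T, beta=0.9, u=np.sqrt):
    """Backward induction on a grid w of cake sizes (hypothetical helper).

    Returns V with shape (N, T + 2) and policy with shape (N, T + 1).
    """
    w = np.asarray(w, dtype=float)
    N = len(w)

    # c[i, j] = consumption when today's cake is w[i] and tomorrow's is w[j]
    c = w[:, None] - w[None, :]
    feasible = c >= 0
    # Utility of each (today, tomorrow) pair; infeasible pairs get -inf
    util = np.where(feasible, u(np.where(feasible, c, 0.0)), -np.inf)

    V = np.zeros((N, T + 2))                  # column T + 1 stays at zero
    policy = np.zeros((N, T + 1), dtype=int)  # index of the best w[j]

    for t in range(T, -1, -1):
        # total[i, j] = u(w[i] - w[j]) + beta * V[j, t + 1]
        total = util + beta * V[None, :, t + 1]
        policy[:, t] = np.argmax(total, axis=1)  # which choice is best
        V[:, t] = np.max(total, axis=1)          # how good that choice is
    return V, policy
```

The N x N array `total` is rebuilt in every period and then reduced along its second axis: `np.max` fills one column of the value matrix and `np.argmax` fills the matching column of the policy matrix, so the two stay distinct even though they come from the same table.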

@jmbejara added the "HW7 Discussions related to Homework #7" label on May 25, 2019