#### Reinforcement Learning and Decision Making, Homework #1

# Planning in MDPs

## Description

You are given an $N$-sided die, along with a corresponding Boolean mask
vector `is_bad_side` (i.e., a vector of ones and zeros). You can assume
that $1<N\leq30$, and that the vector `is_bad_side` is also of size $N$
and $1$-indexed (since there is no $0$ side on the die). The game of
DieN is played as follows:

1.  You start with $0$ dollars.

2.  At any time you have the option to roll the die or to quit the game.

    1.  **ROLL**:

        1.  If you roll a number that is not marked as bad in
            `is_bad_side`, you receive that many dollars (e.g., if you
            roll the number $2$ and the second element of the vector
            `is_bad_side` is $0$, so that $2$ is not a bad side, then
            you receive $2$ dollars). Repeat step 2.

        2.  If you roll a number that is marked as bad in `is_bad_side`,
            then you lose all the money obtained in previous rolls and
            the game ends.

    2.  **QUIT**:

        1.  You keep all the money gained from previous rolls and the
            game ends.
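
The rules above can be sketched as a single transition function plus the expected bankroll after one more roll. This is only an illustration, not part of the provided template: the names `step` and `expected_after_roll` are hypothetical, and the die is assumed to be fair (each side comes up with probability $1/N$).

```python
def step(bankroll, rolled_side, is_bad_side):
    """Apply one ROLL; return (new_bankroll, game_over)."""
    if is_bad_side[rolled_side - 1]:      # sides are 1-indexed, the list is not
        return 0, True                    # bad side: lose everything, game ends
    return bankroll + rolled_side, False  # good side: win that many dollars


def expected_after_roll(bankroll, is_bad_side):
    """Expected bankroll right after one more roll (assumes a fair die)."""
    n = len(is_bad_side)
    outcomes = (step(bankroll, side, is_bad_side)[0] for side in range(1, n + 1))
    return sum(outcomes) / n


mask = [1, 1, 1, 0, 0, 0]
print(step(4, 5, mask))              # (9, False)
print(step(4, 2, mask))              # (0, True)
print(expected_after_roll(4, mask))  # 4.5
print(expected_after_roll(6, mask))  # 5.5
```

With this mask, rolling at a bankroll of 4 is worth 4.5 dollars on average, more than the 4 dollars kept by quitting, while rolling at a bankroll of 6 is worth only 5.5.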

## Procedure

-   You will implement your solution using the `solve()` method
    in the code below.
    
-   Your return value should be the number of dollars you expect to
    win for a specific value of `is_bad_side`, if you follow an
    optimal policy. That is, what is the value of the optimal
    state-value function for the initial state of the game (starting
    with $0$ dollars)? Your answer must be correct to $3$ decimal
    places, truncated (e.g., 3.14159265 becomes 3.141).

-   To solve this problem, you will need to determine an optimal policy
    for the game of DieN, given a particular configuration of the die.
    As you will see, the action that is optimal will depend on your
    current bankroll (i.e., how much money you've won so far).

-   You can try solving this problem by creating an MDP of the game
    (states, actions, transition function, reward function, and assume a
    discount rate of $\gamma=1$) and then calculating the optimal
    state-value function.
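
With $\gamma=1$, the optimal value of a bankroll $b$ can be written as $V^*(b)=\max\{b,\ \frac{1}{N}\sum_{s\in G}V^*(b+s)\}$, where $G$ is the set of good sides and a bad roll contributes $0$. The sketch below evaluates this recursion backwards from a cutoff bankroll. It rests on assumptions that the assignment does not state: the die is fair, and quitting is optimal once a single extra roll no longer beats the current bankroll in expectation, i.e. once $b\cdot|B|\geq\sum_{s\in G}s$ for the set of bad sides $B$. The function name `die_n_value` is hypothetical and this is not the template's `solve()` method.

```python
def die_n_value(is_bad_side):
    """Value of the start state (0 dollars) under the assumptions above."""
    n = len(is_bad_side)
    good = [side for side, bad in enumerate(is_bad_side, start=1) if not bad]
    n_bad = n - len(good)
    if n_bad == 0:
        raise ValueError("no bad side: the value is unbounded when gamma = 1")

    # Assumed cutoff: smallest bankroll b with b * n_bad >= sum(good),
    # i.e. where one more roll no longer beats quitting in expectation.
    cap = -(-sum(good) // n_bad)  # integer ceiling of sum(good) / n_bad
    top = cap + max(good, default=0)

    # Start from "quit everywhere" (V[b] = b), then improve below the cutoff.
    V = [float(b) for b in range(top + 1)]
    for b in range(cap - 1, -1, -1):
        roll = sum(V[b + g] for g in good) / n  # a bad roll contributes 0
        V[b] = max(float(b), roll)
    return V[0]


print(round(die_n_value([1, 1, 1, 0, 0, 0]), 3))  # 2.583
```

Because every good roll strictly increases the bankroll, one backward sweep is enough here; no iteration to convergence is needed under the stated cutoff assumption.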

## Resources

The concepts explored in this homework are covered by:

-   Lecture Lesson 1: Smoov & Curly's Bogus Journey

-   Chapter 3 (3.6 Optimal Policies and Optimal Value Functions) and
    Chapter 4 (4.3-4.4 Policy Iteration, Value Iteration) of
    http://incompleteideas.net/book/the-book-2nd.html

-   Chapters 1-2 of 'Algorithms for Sequential Decision Making', M.
    Littman, 1996

## Submission

-   The due date is indicated on the Canvas page for this assignment.
    Make sure you have your timezone in Canvas set to ensure the
    deadline is accurate.

-   Submit your finished notebook on Gradescope. Your grade is based on
    a set of hidden test cases. You will have unlimited submissions;
    only the last score is kept.

-   Use the template below to implement your code. We have also provided
    some test cases for you. If your code passes the given test cases,
    it will run (though possibly not pass all the tests) on Gradescope.

-   Gradescope is using *Python 3.6.x*.  For permitted libraries, please
    see the requirements.txt file. You can also use any core library
    (i.e., anything in the Python standard library).
    No other library can be used.  Also, make sure the name of your
    notebook matches the name of the provided notebook.  Gradescope times
    out after 10 minutes.

In [2]:
import pandas as pd
import numpy as np
import sys
sys.path.append("C:\\users\\mccar\\miniconda3\\lib\\site-packages")
# import mdptoolbox

In [3]:
# # obtain states, actions, rewards, and transitions
# is_bad_side = [1,1,1,0,0,0]
# sides = goodSides(is_bad_side)
# # print(sides)
# max_rolls = 10
# states = getStates(sides, max_rolls)
# # print(states)
# trans = getTransitions(states, sides, len(is_bad_side), max_rolls)
# # print(trans[0][0])
# rewards = getRewards(states)
# # print(rewards)
# print(trans[1])

In [4]:
# mdp = mdptoolbox.mdp.ValueIteration(trans, rewards, 1)
# mdp.run()
# print("V(0) = " + str(mdp.V[0]))

In [13]:
import numpy as np

class MDPAgent(object):
    def __init__(self):
        pass


    def goodSides(self, mask):
        """
        Creates a list of just the good sides
        Will be useful later for transition function
        """
        return [i+1 for (i, v) in enumerate(mask) if v==0]


    def getStates(self, sides, max_rolls):
        """
        Enumerate every reachable bankroll up to max_rolls times the largest good side.
        Two terminal states are appended: 'E' (quit and keep the money) and 'B' (bankrupt).
        """
        states = [0] + [x for x in sides]
        max_num = (max_rolls)*sides[-1]
        counts = 0
        while (True):
            counts = len(states)
            for i in range(len(states)):
                for j in range(len(sides)):
                    s = states[i] + sides[j]
                    if s not in states and s <=max_num:
                        states.append(s)
            if counts == len(states): break
        states = sorted(states)
        states.append('E')
        states.append('B')
        return states


    def getTransitions(self, states, sides, N, max_rolls):
        """
        Generate the transition matrices (one per action) given the states, the good sides, and the total number of sides.

        For example, with [1,1,1,0,0,0] a roll lands on each of 4, 5, and 6 with probability 1/6 and leads to bankruptcy with probability 1/2.
        """
        # initialize with floats so that assigning probabilities does not trigger pandas dtype warnings
        trans0 = pd.DataFrame(0.0, index=states, columns=states)
        trans1 = pd.DataFrame(0.0, index=states, columns=states)
        n = len(sides)                                     # number of good sides
        b_rate = 1 - n/N                                   # probability of rolling a bad side
        trans0.iloc[len(sides)+1:, -1] = 1                 # preset 'roll' to lead to the bankrupt state; rows within the roll limit are overwritten below
        for i in range(len(states)-2):                     # loop through all numerical states
            if states[i] <= sides[-1]*(max_rolls-1):       # check if the current row (state) is from less than max_rolls rolls
                for j in range(len(states)-2):
                    if states[j] - states[i] in sides:
                        trans0.iloc[i, j] = 1/N
                trans0.iloc[i, -1] = b_rate                # set probability of transition to bankrupt state 
        trans1.iloc[:-1, -2] = 1
        trans1.iloc[-1, -1] = 1
        return np.stack((trans0.to_numpy(), trans1.to_numpy()), axis=0)


    def getRewards(self, states):
        """
        get all the rewards from each state for both ending and rolling
        note that the reward for rolling and not going bankrupt is still 0, its only positive when they decide to end and keep it
        """
        rewards = pd.DataFrame(0, index=states, columns=["roll", "end"])
        for s in states:
            if s not in ['B', 'E']:
                rewards.loc[s, "end"] = s
        return rewards.to_numpy()


    def solve(self, is_bad_side):
        """
        Run value iteration to find the solution
        """
        gamma = 1
        epsilon = 1e-6
        max_rolls = 10
        # get the states, actions, rewards, and transitions
        good_sides = self.goodSides(is_bad_side)
        states = self.getStates(good_sides, max_rolls)
        actions = [0, 1]  # 0 = roll, 1 = end (quit)
        transitions = self.getTransitions(states, good_sides, len(is_bad_side), max_rolls)
        rewards = self.getRewards(states)
        
        # use a dict instead of a list because we skip states (i.e., dollar amounts) that can't be reached
        V = {state: 0 for state in states}
        policy = {state: 0 for state in states}

        while True:
            delta = 0
            for state_ind in range(len(states)):
                state = states[state_ind]
                v = V[states[state_ind]]

                action_values = {}
                for action in actions:
                    action_value = 0
                    for next_state_ind, prob in enumerate(transitions[action][state_ind]):
                        next_state = states[next_state_ind]
                        action_value += prob * (rewards[state_ind][action] + gamma * V[next_state])
                    action_values[action] = action_value
                best_action = max(action_values, key=action_values.get)
                V[state] = action_values[best_action]
                policy[state] = best_action
                delta = max(delta, abs(v - V[state]))
 
            if delta < epsilon:
                break
        # print(V)
        # return V, policy
        return V[0]


agent = MDPAgent()
is_bad_side = [1, 1, 1, 0, 0, 0]
# V, policy = agent.solve(is_bad_side)
# print(V[0])
V = agent.solve(is_bad_side)
print(V)

is_bad_side = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0]
# # V, policy = agent.solve(is_bad_side)
# print(V[0])
V = agent.solve(is_bad_side)
print(V)

is_bad_side = [1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
# # V, policy = agent.solve(is_bad_side)
# print(V[0])
V = agent.solve(is_bad_side)
print(V)

2.583333333333333
7.379980563654032
6.314049586776859


## Test cases

We have provided some test cases for you to help verify your implementation.
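
The expected values in these tests are the solver's results cut off after three decimals (for instance 7.379, where rounding would give 7.380). A minimal sketch of such a truncation is shown here; the helper name `truncate` is hypothetical and not part of the template. It goes through `decimal` to avoid binary floating-point surprises that can occur when scaling a float by 1000 and flooring.

```python
from decimal import Decimal, ROUND_DOWN


def truncate(value, places=3):
    """Cut `value` off after `places` decimals (toward zero, no rounding)."""
    quantum = Decimal(10) ** -places  # Decimal('0.001') for places=3
    # repr() gives the shortest string that round-trips the float exactly
    return float(Decimal(repr(value)).quantize(quantum, rounding=ROUND_DOWN))


print(truncate(3.14159265))         # 3.141
print(truncate(7.379980563654032))  # 7.379
```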

In [14]:
## DO NOT MODIFY THIS CODE.  This code will ensure that your submission
## will work properly with the autograder

import unittest

class TestDieNNotebook(unittest.TestCase):
    def test_case_1(self):
        agent = MDPAgent()
        np.testing.assert_almost_equal(
            agent.solve(is_bad_side=[1, 1, 1, 0, 0, 0]),
            2.583,
            decimal=3
        )
        
    def test_case_2(self):
        agent = MDPAgent()
        np.testing.assert_almost_equal(
            agent.solve(
                is_bad_side=[1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0]
            ),
            7.379,
            decimal=3
        )
        
    def test_case_3(self):
        agent = MDPAgent()

        np.testing.assert_almost_equal(
            agent.solve(
                is_bad_side=[1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
            ),
            6.314,
            decimal=3
        )

unittest.main(argv=[''], verbosity=2, exit=False)

ok
ok
ok

----------------------------------------------------------------------
Ran 3 tests in 0.798s

OK


<unittest.main.TestProgram at 0x21f3a2306d0>