#### Reinforcement Learning: Homework #1

# Planning in MDPs

## Description

You are given an $N$-sided die, along with a corresponding Boolean mask
vector, `is_bad_side` (i.e., a vector of ones and zeros). You can assume
that $1<N\leq30$, and the vector `is_bad_side` is also of size $N$ and
$1$-indexed (since there is no $0$ side on the die). The game of DieN is
played as follows:

1.  You start with $0$ dollars.

2.  At any time you have the option to roll the die or to quit the game.

    1.  **ROLL**:

        1.  If you roll a number that is not a bad side, you receive that
            many dollars (e.g., if you roll the number $2$ and $2$ is
            not a bad side, meaning the second element of the vector
            `is_bad_side` is $0$, then you receive $2$ dollars). Repeat
            step 2.

        2.  If you roll a number that is a bad side, then you lose all the
            money obtained in previous rolls and the game ends.

    2.  **QUIT**:

        1.  You keep all the money gained from previous rolls and the
            game ends.
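
To make the ROLL rule concrete, the sketch below computes the expected bankroll after exactly one more roll, assuming every side is equally likely. The function name `expected_bankroll_after_roll` is a hypothetical helper introduced here for illustration; it is not part of the assignment template.

```python
def expected_bankroll_after_roll(bankroll, is_bad_side):
    """Expected bankroll after one more roll (assumes a fair die)."""
    n = len(is_bad_side)
    total = 0.0
    # Faces are 1-indexed: the first element of is_bad_side describes face 1.
    for face, bad in enumerate(is_bad_side, start=1):
        if not bad:
            # Good side: keep the current bankroll and add the face value.
            total += bankroll + face
        # Bad side: the bankroll drops to 0, so it contributes nothing.
    return total / n


die = [1, 1, 1, 0, 0, 0]  # faces 4, 5, 6 are good
print(expected_bankroll_after_roll(0, die))  # 2.5
print(expected_bankroll_after_roll(4, die))  # 4.5
print(expected_bankroll_after_roll(6, die))  # 5.5
```

For this die, one more roll raises the expected bankroll from $4$ to $4.5$, but lowers it from $6$ to $5.5$, which hints at why the best action depends on how much money is at stake.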

## Procedure

-   You will implement your solution using the `solve()` method
    in the code below.
    
-   Your return value should be the number of dollars you expect to
    win for a specific value of `is_bad_side`, if you follow an
    optimal policy. That is, what is the value of the optimal
    state-value function for the initial state of the game (starting
    with $0$ dollars)? Your answer must be correct to $3$ decimal
    places, truncated (e.g., 3.14159265 becomes 3.141).

-   To solve this problem, you will need to determine an optimal policy
    for the game of DieN, given a particular configuration of the die.
    As you will see, the action that is optimal will depend on your
    current bankroll (i.e., how much money you've won so far).

-   You can try solving this problem by creating an MDP of the game
    (states, actions, transition function, reward function, and assume a
    discount rate of $\gamma=1$) and then calculating the optimal
    state-value function.
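
One possible way to carry out the steps above is to treat the bankroll $b$ as the state and compute $V(b)=\max\{\,b,\ \frac{1}{N}\sum_{i\ \text{good}} V(b+i)\,\}$ by backward induction with $\gamma=1$. The sketch below is only an illustration under stated assumptions: the die is fair, at least one side is bad, and quitting is optimal once a single extra roll no longer raises the expected bankroll. The names `optimal_start_value` and `truncate` are hypothetical helpers, not part of the template.

```python
import math


def optimal_start_value(is_bad_side):
    """Value of the start state (bankroll 0) under an optimal policy."""
    n = len(is_bad_side)
    good_faces = [f for f, bad in enumerate(is_bad_side, start=1) if not bad]
    n_bad = n - len(good_faces)
    if n_bad == 0:
        # Assumption: with no bad side the game never has to end.
        raise ValueError("expected at least one bad side")

    total_good = sum(good_faces)
    # One more roll stops helping once bankroll * n_bad >= total_good.
    threshold = -(-total_good // n_bad)  # ceiling division
    top = threshold + n                  # covers every index read below

    # Assumed boundary condition: V(b) = b (quit) for b >= threshold.
    values = [float(b) for b in range(top + 1)]

    # Sweep downward so V(b + face) is already known when V(b) is computed.
    for b in range(threshold - 1, -1, -1):
        roll = sum(values[b + f] for f in good_faces) / n
        values[b] = max(float(b), roll)
    return values[0]


def truncate(x, places=3):
    """Truncate (not round) a non-negative float to `places` decimals."""
    scale = 10 ** places
    return math.floor(x * scale) / scale


print(truncate(3.14159265))                                 # 3.141
print(truncate(optimal_start_value([1, 1, 1, 0, 0, 0])))    # 2.583
```

Because every good roll strictly increases the bankroll, the states can be processed in a single downward sweep instead of iterating to convergence.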

## Resources

The concepts explored in this homework are covered by:

-   Chapter 3 (3.6 Optimal Policies and Optimal Value Functions) and
    Chapter 4 (4.3-4.4 Policy Iteration, Value Iteration) of
    http://incompleteideas.net/book/the-book-2nd.html

-   Chapters 1-2 of 'Algorithms for Sequential Decision Making', M.
    Littman, 1996

## Submission

-   The due date is indicated on the Syllabus page for this assignment.

-   Use the template below to implement your code. We have also provided
    some test cases for you to run your code against.

-   We use *Python 3.6.x* and *numpy==1.18.0*, and you can
    use any core library (i.e., anything in the Python standard library).
    No other library can be used.  Also, make sure you have your name in
    the notebook.

In [34]:
#################
# DO NOT REMOVE
# Versions
# numpy==1.18.0
################
import numpy as np

# STUDENT ID
# 202264690069
# Please specify your name in the whoami variable.

whoami = '刘兴琰'

class MDPAgent(object):
    def __init__(self):
        pass

    def solve(self, is_bad_side):
        # Invert the mask: 1 now marks a good side, 0 a bad side
        is_good_side = [1 - x for x in is_bad_side]
        N = sum(is_good_side)  # number of good sides

        # Sum of the face values of the good sides
        E_total = 0
        for index, value in enumerate(is_good_side):
            E_total += (index + 1) * value
        
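        # Find the switch-over bankroll: the smallest x at which one more
        # roll no longer increases the expected bankroll (not used below)
        transition = 0
        for x in range(100):
            # Expected bankroll E after one more roll from bankroll x
            E = (E_total + x * N) / len(is_good_side)
            if E > x:
                continue  # rolling still pays off, keep searching
            else:
                transition = x  # otherwise x is the switch-over point
                break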
        
        # Decide whether a state (bankroll) is one where we quit
        def is_terminal_state(state):
            expect = (E_total + state * N) / len(is_good_side)
            return state > expect  # quit if the bankroll exceeds the expected bankroll after one more roll
        
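        # Recursively compute the value of a state (bankroll)
        def transition_func(state):
            # If we quit in this state, its value is the bankroll itself
            if is_terminal_state(state):
                return state

            # Otherwise roll: average the values of the successor states
            E = 0
            for index, value in enumerate(is_good_side):
                if value == 1:  # only good sides (value 1) lead to a successor state
                    next_state = transition_func(index + state + 1)
                    E += next_state / len(is_good_side)  # accumulate the expectation

            return E  # expected winnings from this state

        # Evaluate the start state (bankroll 0)
        final_result = transition_func(0)
        return final_result  # expected winnings under the policy above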

## Test cases

We have provided some test cases for you to help verify your implementation.

In [35]:
## DO NOT MODIFY THIS CODE.  This code will ensure that your submission
## will work properly with the autograder

import unittest

class TestDieNNotebook(unittest.TestCase):
    def test_case_1(self):
        agent = MDPAgent()
        np.testing.assert_almost_equal(
            agent.solve(is_bad_side=[1, 1, 1, 0, 0, 0]),
            2.583,
            decimal=3
        )
        
    def test_case_2(self):
        agent = MDPAgent()
        np.testing.assert_almost_equal(
            agent.solve(
                is_bad_side=[1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0]
            ),
            7.379,
            decimal=3
        )
        
    def test_case_3(self):
        agent = MDPAgent()

        np.testing.assert_almost_equal(
            agent.solve(
                is_bad_side=[1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
            ),
            6.314,
            decimal=3
        )

unittest.main(argv=[''], verbosity=2, exit=False)

test_case_1 (__main__.TestDieNNotebook.test_case_1) ... ok
test_case_2 (__main__.TestDieNNotebook.test_case_2) ... ok
test_case_3 (__main__.TestDieNNotebook.test_case_3) ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.004s

OK
