# **RL PROJECT** - **DICE GAME**

In [48]:
import numpy as np
import random
import matplotlib.pyplot as plt

## 1. The Game

In this project, we consider the following “Dice Game”. The objective of the game is to make money by
outscoring the dealer or by rolling doubles. Each round, the player starts with a score of zero. Since this
is an episodic task, discounting is not necessary.

Each turn, the player has to choose between rolling their dice, or betting money on the dealer’s roll. If
they decide to roll themselves, they can either roll one or two dice (simultaneously), which costs them
CHF 1 for one die, and CHF 2 for two dice. The number(s) shown by the dice are added to the player’s
score. If the player rolls a double (i.e., two dice showing the same number), they get an immediate
bonus payout of CHF 10, independent of the rest of the game or their score. If the player reaches a
score of 31 or more, they lose the round and have to pay CHF 10.

If the player chooses to bet on the dealer’s roll, they have to specify a bet-multiplicator of 1, 2, or 3. The
dealer then rolls their die and the player is paid/has to pay according to the formula
(“Player Score” − “Dealer Dice Result”) · “Bet-Multiplicator”,
and the round is over.

The player has two identical dice, showing the numbers {1,2,3,4,5,6}. The dealer has one die,
showing the numbers {25,26,27,28,29,30}. All three dice are weighted, such that their highest
number has twice the probability of each of the smaller numbers.

### FlowChart

### Dice

#### Player
---

Each of the player's two dice has the following setup:

    P(D = 1) = p
    P(D = 2) = p
    P(D = 3) = p
    P(D = 4) = p
    P(D = 5) = p
    P(D = 6) = 2p

Since the probabilities of all outcomes must sum to 1:

    p + p + p + p + p + 2p = 1 

    5p + 2p = 1

    7p = 1

    p = 1/7

Each player die therefore has the following probabilities:

    P(D = 1) = 1/7
    P(D = 2) = 1/7
    P(D = 3) = 1/7
    P(D = 4) = 1/7
    P(D = 5) = 1/7
    P(D = 6) = 2/7
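
The normalisation above can be checked with exact fractions. The snippet below is a minimal sketch; the names `faces`, `weights`, `probs` and `expected_roll` are illustrative and not part of the project code. It also computes the expected value of a single roll of one player die.

```python
from fractions import Fraction

# Relative weights of the faces 1..6; the highest face counts double
faces = [1, 2, 3, 4, 5, 6]
weights = [1, 1, 1, 1, 1, 2]

# Normalise the weights so that the probabilities sum to 1
total = sum(weights)
probs = [Fraction(w, total) for w in weights]

# Expected value of a single roll of one player die
expected_roll = sum(f * p for f, p in zip(faces, probs))

print(probs[0], probs[-1])   # 1/7 2/7
print(expected_roll)         # 27/7
```

Working with `Fraction` instead of floats keeps the result exact, which makes it easy to compare against the hand calculation.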

#### Dealer
---

The dealer's die has the following setup:

    P(D = 25) = p
    P(D = 26) = p
    P(D = 27) = p
    P(D = 28) = p
    P(D = 29) = p
    P(D = 30) = 2p

Since the probabilities of all outcomes must sum to 1:

    p + p + p + p + p + 2p = 1 

    5p + 2p = 1

    7p = 1

    p = 1/7

The dealer's die therefore has the following probabilities:

    P(D = 25) = 1/7
    P(D = 26) = 1/7
    P(D = 27) = 1/7
    P(D = 28) = 1/7
    P(D = 29) = 1/7
    P(D = 30) = 2/7
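
As a rough illustration of what the dealer's distribution means for betting, the sketch below computes the expected dealer roll and the expected payout of a bet for a given score. The helper `expected_bet_payout` is a hypothetical name introduced here for illustration only.

```python
from fractions import Fraction

# Dealer die: faces 25..30, the highest face is twice as likely
dealer_faces = [25, 26, 27, 28, 29, 30]
dealer_probs = [Fraction(1, 7)] * 5 + [Fraction(2, 7)]

# Expected value of the dealer's roll
expected_dealer = sum(f * p for f, p in zip(dealer_faces, dealer_probs))

def expected_bet_payout(score, multiplier):
    """Expected value of (score - dealer roll) * multiplier (hypothetical helper)."""
    return (score - expected_dealer) * multiplier

print(expected_dealer)              # 195/7
print(expected_bet_payout(30, 3))   # 45/7
```

Since 195/7 is roughly 27.86, a bet has a positive expected payout only for scores of 28 or more.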

---

## 2. The Task

Throughout, the state space should have (at most) one terminal state.

1. Considering the game as a Markov decision process, identify the state space S, the action
space A, and the reward set R.

---
ANSWER:

- The State Space (S) is defined by the Player Score after each **TURN**. Scores from 0 to 30 are non-terminal, and every score of 31 or more is treated as the single terminal state 31:


$$S = \{0, 1, ..., 31\} $$


- The Action Space (A) is the set of actions the player can choose at each **TURN**:
   
        1. Roll one die
        2. Roll two dice simultaneously
        3. Bet on the Dealer Roll with a Bet-Multiplicator of 1, 2 or 3 (each multiplicator is a separate action)

- The Reward Set (R) contains the payouts that an action can produce:

    For Rolling the Dice:
        
        1. -1 CHF for rolling one die (no payout, only the cost of the roll)
        2. -2 CHF for rolling two dice without doubles (only the cost of the roll)
        3. +8 CHF for rolling doubles (10 CHF bonus minus the 2 CHF cost of two dice)
        4. -10 CHF penalty if the Player Score reaches 31 or more
    
    For Betting on the Dealer Die:
    
        5. (Player Score - Dealer Dice Result) * Bet-Multiplicator CHF, 
        which is a gain or a loss depending on the Dealer Dice value
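
One possible way to write this MDP down explicitly is a function that lists, for a state and an action, every outcome with its probability, next state and reward. The sketch below is an illustration under stated assumptions, not the project's implementation: the function name `transitions` is hypothetical, actions are numbered 1 to 5 (roll one die, roll two dice, bet with multiplicator 1, 2, 3), the end of a round after a bet is mapped to the terminal state 31, and the 10 CHF penalty is applied on top of the roll cost and any doubles bonus.

```python
from fractions import Fraction

TERMINAL = 31                        # all scores >= 31, and the end of a round
STATES = list(range(TERMINAL + 1))
ACTIONS = [1, 2, 3, 4, 5]            # 1/2: roll one/two dice, 3-5: bet x1/x2/x3

# Weighted dice: the highest face is twice as likely as each other face
PLAYER = {f: Fraction(2 if f == 6 else 1, 7) for f in range(1, 7)}
DEALER = {f: Fraction(2 if f == 30 else 1, 7) for f in range(25, 31)}

def transitions(state, action):
    """Return a list of (probability, next_state, reward) triples."""
    out = []
    if action == 1:
        for f, p in PLAYER.items():
            nxt = min(state + f, TERMINAL)
            # Assumption: the bust penalty is added to the roll cost
            out.append((p, nxt, -1 - (10 if nxt == TERMINAL else 0)))
    elif action == 2:
        for f1, p1 in PLAYER.items():
            for f2, p2 in PLAYER.items():
                nxt = min(state + f1 + f2, TERMINAL)
                reward = -2 + (10 if f1 == f2 else 0)
                reward -= 10 if nxt == TERMINAL else 0
                out.append((p1 * p2, nxt, reward))
    else:
        multiplier = action - 2
        for d, p in DEALER.items():
            out.append((p, TERMINAL, (state - d) * multiplier))
    return out

# Collect every reward that can occur from a non-terminal state
rewards = sorted({r for s in STATES[:-1] for a in ACTIONS
                  for _, _, r in transitions(s, a)})
print(min(rewards), max(rewards))    # -90 15
```

Under these assumptions the rewards range from -90 CHF (betting with multiplicator 3 at score 0 against a dealer roll of 30) to +15 CHF (betting with multiplicator 3 at score 30 against a dealer roll of 25).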

---

2. Implement a Python class that represents the game as a reinforcement learning task. The class
should contain all the information about the game state, and should provide a “step” method that
takes an action as input and returns the reward and next state, as well as a “reset” method that
resets the game to its initial state.

### Advanced Dice Game Class

In [50]:
class DiceGame:
    
    def __init__(self):
        self.player_rounds = 1
        self.player_payout_rounds = 0
        self.player_payout = 0
        self.player_score = 0
        self.player_dice = range(1, 7)
        self.dealer_dice = range(25, 31)
        self.dice_weights = [1, 1, 1, 1, 1, 2]

    def roll_dice(self,number_of_dice):
        dice_values = []
        for _ in range(number_of_dice):
            dice_values.append(random.choices(self.player_dice, weights=self.dice_weights, k=1)[0])
        return dice_values  
    
    def results(self):
        print(f"""
----------------------------------------- 
              
PLAYER STATUS:
              
    CURRENT ROUND

        Your curent round score is {self.player_score} and with a payout of {self.player_payout} CHF

              
    TOTAL GAME

        Your current round is number {self.player_rounds} with your total cumulated payout of {self.player_payout_rounds} CHF

----------------------------------------- 

              """)
    
    def play_round(self, choice, bet_multiplier = 1, result = True):

        """
choice: (1) For 1 Dice Roll, (2) For 2 Dices roll, (3) For Dealer Bet (Bet-Multiplier = 1 by default)

bet_multiplier: either 1, 2 or 3 (1 by default)

result: Show a summary of the turn and round state of the player

        """

        if choice == 1:
            dice_values = self.roll_dice(1)
            self.player_score += dice_values[0]
            self.player_payout += -1
            print(f"""
                  
TURN CHOICE: 1 Dice Roll 
-----------------------------------------        
You rolled 1 Dice (cost of 1 CHF), and it gave you {dice_values[0]} !
                      """)

        if choice == 2:
            dice_values = self.roll_dice(2)
            dice_total = sum(dice_values)
            self.player_score += dice_total
            self.player_payout += -2
            print(f"""
                  
TURN CHOICE: 2 Dices Roll 
-----------------------------------------      
                  
You rolled 2 Dice (cost of 2 CHF), and it gave you {dice_values[0]} and {dice_values[1]} ! 
                      """)
            if dice_values[0] == dice_values[1]:
                self.player_payout += 10
                print(f"""
Congratulations! You rolled doubles and received a bonus payout of 10 CHF.
                      """)

        if choice == 3:
            dealer_result = random.choices(self.dealer_dice, weights=self.dice_weights, k=1)[0]
            dealer_payout = (self.player_score - dealer_result) * bet_multiplier
            self.player_payout += dealer_payout
            self.player_payout_rounds += self.player_payout
            print(f"""    
                    
TURN CHOICE: You bet on Dealer Roll, with a Bet-Multiplicator of {bet_multiplier}
                  
The Dealer rolled {dealer_result}, the formula is then ({self.player_score} - {dealer_result}) X {bet_multiplier} = {dealer_payout} CHF 

Let's add it to this round payout ! 

-----------------------------------------

ROUND FINISHED
                      """)
            self.player_score = 0
            self.player_payout = 0
            self.player_rounds += 1
        
        if self.player_score >= 31:
            print(f"""
Oops! You went over a Score of 31 (with {self.player_score}). You lose 10 CHF !
                  
-----------------------------------------

ROUND FINISHED
                  """)
            self.player_payout_rounds += self.player_payout - 10
            self.player_payout = 0
            self.player_score = 0
            self.player_rounds += 1

        if result:
            self.results()
        
    def reset(self):
        self.player_rounds = 1
        self.player_payout_rounds = 0
        self.player_payout = 0
        self.player_score = 0
        print("You have reset the Dice Game !")
                   


Initialize the Game Object

In [51]:
game = DiceGame()

Let's Play the Game !

        choice: (1) For 1 Dice Roll, (2) For 2 Dices roll, (3) For Dealer Bet

        bet_multiplier: either 1, 2 or 3 (1 by default)

        result: Show a summary of the turn and round state of the player

In [115]:
game.play_round(choice=3, bet_multiplier=3, result=True)

    
                    
TURN CHOICE: You bet on Dealer Roll, with a Bet-Multiplicator of 3
                  
The Dealer rolled 26, the formula is then (29 - 26) X 3 = 9 CHF 

Let's add it to this round payout ! 

-----------------------------------------

ROUND FINISHED
                      

----------------------------------------- 
              
PLAYER STATUS:
              
    CURRENT ROUND

        Your curent round score is 0 and with a payout of 0 CHF

              
    TOTAL GAME

        Your current round is number 2 with your total cumulated payout of 11 CHF

----------------------------------------- 

              


Reset The Game !

In [116]:
game.reset()

You have reset the Dice Game !


### Simplified Dice Game Class

In [2]:
class DiceGameSimplify: 

    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        print("\nDice Game Reset !", "Score:", self.state)

    def step(self, action):
        """
        Perform an action
        """
        if action == 1:
            print("Action (1) --------")
            reward = -1
            player_roll = random.choices([1,2,3,4,5,6], weights=[1,1,1,1,1,2], k=1)[0]
            self.state += player_roll
            if self.state >= 31:
                reward = -10
                state = self.state
                print("roll:", player_roll)
                print("\nRound Over, you reach a score of", state, "which is more or equal to 31", ", payout:", reward)
                self.reset()
                return state, reward
            print("roll:", player_roll,", score:", self.state, ", payout:", reward)
            return self.state, reward

        if action == 2:
            print("Action (2) --------")
            reward = -2
            player_roll1 = random.choices([1,2,3,4,5,6], weights=[1,1,1,1,1,2], k=1)[0]
            player_roll2 = random.choices([1,2,3,4,5,6], weights=[1,1,1,1,1,2], k=1)[0]
            self.state += player_roll1 + player_roll2
            if player_roll1 == player_roll2:
                reward = 10-2
                print("doubles !")
            if self.state >= 31:
                reward = -10
                state = self.state
                print("roll:", player_roll1, "and",player_roll2 )
                print("\nRound Over, you reach a score of", state, "which is more or equal to 31", ", payout:", reward)
                self.reset()
                return state, reward
            print("roll:",player_roll1,"and",player_roll2,", score:",self.state, ", payout:", reward)
            return self.state, reward

        if action == 3:
            print("Action (3) --------")
            state = self.state
            bet_multiplier = 1
            dealer_roll = random.choices([25,26,27,28,29,30], weights=[1,1,1,1,1,2], k=1)[0]
            reward = (self.state - dealer_roll)*bet_multiplier
            print("dealer roll:",dealer_roll,", player score:", state, ", payout", reward)
            self.reset()
            return state, reward
            
        if action == 4:
            print("Action (4) --------")
            state = self.state
            bet_multiplier = 2
            dealer_roll = random.choices([25,26,27,28,29,30], weights=[1,1,1,1,1,2], k=1)[0]
            reward = (self.state - dealer_roll)*bet_multiplier
            print("dealer roll:",dealer_roll,", player score:", state, ", payout", reward)
            self.reset()
            return state, reward

        if action == 5:
            print("Action (5) --------")
            state = self.state
            bet_multiplier = 3
            dealer_roll = random.choices([25,26,27,28,29,30], weights=[1,1,1,1,1,2], k=1)[0]
            reward = (self.state - dealer_roll)*bet_multiplier
            print("dealer roll:",dealer_roll,", player score:", state, ", payout", reward)
            self.reset()
            return state, reward

        else:
            raise Exception('Invalid action! Only (1), (2), (3), (4) and (5) as integer !')



Initialize the Game Object

In [3]:
game2 = DiceGameSimplify()

Let's do some steps

        (1) For 1 Dice Roll by the Player
        (2) For 2 Dices Roll by the Player
        (3) For Dealer Bet with Muliplicator of 1
        (4) For Dealer Bet with Muliplicator of 2
        (5) For Dealer Bet with Muliplicator of 3

Note: Use the corresponding number as integer

In [10]:
game2.step(2)

Action (2) --------
roll: 6 and 4

Round Over, you reach a score of 36 which is more or equal to 31 , payout: -10

Dice Game Reset ! Score: 0


(36, -10)

Reset the Game when you want 

In [11]:
game2.reset()


Dice Game Reset ! Score: 0


### Quickly Testing some Policy

    “R1”: The player always rolls a single dice

In [502]:
gameR1 = DiceGameSimplify()

Round_Count = 3

total_reward = 0

for r in range(Round_Count):

    round_reward = 0

    while True:
        state, reward = gameR1.step(1)
        round_reward += reward
        if state >= 31:
            break

    print(f"""

Round Reward: {round_reward}

""")

    total_reward += round_reward

print(f"""
---------------------------------------

Total Game Reward: {total_reward}

    """)

gameR1 = 0

Action (1) --------
roll: 1 , score: 1 , payout: -1
Action (1) --------
roll: 5 , score: 6 , payout: -1
Action (1) --------
roll: 6 , score: 12 , payout: -1
Action (1) --------
roll: 6 , score: 18 , payout: -1
Action (1) --------
roll: 1 , score: 19 , payout: -1
Action (1) --------
roll: 3 , score: 22 , payout: -1
Action (1) --------
roll: 3 , score: 25 , payout: -1
Action (1) --------
roll: 1 , score: 26 , payout: -1
Action (1) --------
roll: 1 , score: 27 , payout: -1
Action (1) --------
roll: 2 , score: 29 , payout: -1
Action (1) --------
roll: 6

Round Over, you reach a score of 35 which is more or equal to 31 , payout: -10

Dice Game Reset ! Score: 0


Round Reward: -20


Action (1) --------
roll: 4 , score: 4 , payout: -1
Action (1) --------
roll: 6 , score: 10 , payout: -1
Action (1) --------
roll: 1 , score: 11 , payout: -1
Action (1) --------
roll: 6 , score: 17 , payout: -1
Action (1) --------
roll: 2 , score: 19 , payout: -1
Action (1) --------
roll: 4 , score: 23 , payout: 

    “R2”: The player always rolls both dice.

In [503]:
gameR2 = DiceGameSimplify()

Round_Count = 3

total_reward = 0

for r in range(Round_Count):

    round_reward = 0

    while True:
        state, reward = gameR2.step(2)
        round_reward += reward
        if state >= 31:
            break

    print(f"""

Round Reward: {round_reward}

""")

    total_reward += round_reward

print(f"""
---------------------------------------

Total Game Reward: {total_reward}

    """)

gameR2 = 0

Action (2) --------
doubles !
roll: 3 and 3 , score: 6 , payout: 8
Action (2) --------
roll: 3 and 4 , score: 13 , payout: -2
Action (2) --------
roll: 6 and 2 , score: 21 , payout: -2
Action (2) --------
roll: 4 and 3 , score: 28 , payout: -2
Action (2) --------
roll: 2 and 4

Round Over, you reach a score of 34 which is more or equal to 31 , payout: -10

Dice Game Reset ! Score: 0


Round Reward: -8


Action (2) --------
roll: 6 and 1 , score: 7 , payout: -2
Action (2) --------
roll: 2 and 6 , score: 15 , payout: -2
Action (2) --------
roll: 1 and 3 , score: 19 , payout: -2
Action (2) --------
roll: 3 and 1 , score: 23 , payout: -2
Action (2) --------
roll: 4 and 2 , score: 29 , payout: -2
Action (2) --------
roll: 6 and 2

Round Over, you reach a score of 37 which is more or equal to 31 , payout: -10

Dice Game Reset ! Score: 0


Round Reward: -20


Action (2) --------
roll: 2 and 4 , score: 6 , payout: -2
Action (2) --------
roll: 4 and 1 , score: 11 , payout: -2
Action (2) -------

    “RR”: If the player’s score is strictly smaller than 20, they roll either one or two dice with equal
    probability. Otherwise, they choose one of the three bet-multiplicators uniformly at random

In [504]:
gameR3 = DiceGameSimplify()

Round_Count = 3

total_reward = 0

for r in range(Round_Count):

    round_reward = 0
    state = 0
    reward = 0

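    while True:
        if state < 20:
            # Below 20: roll one or two dice with equal probability
            action = random.choice([1,2])
            state, reward = gameR3.step(action)
            round_reward += reward
            # A score of 19 plus a double six reaches 31: the round is already over
            if state >= 31:
                break
        else:
            # From 20 on: pick one of the three bet-multiplicators at random
            action = random.choice([3,4,5])
            state, reward = gameR3.step(action)
            round_reward += reward
            break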

    print(f"""

Round Reward: {round_reward}

""")

    total_reward += round_reward

print(f"""
---------------------------------------

Total Game Reward: {total_reward}

    """)

gameR3 = 0

Action (1) --------
roll: 4 , score: 4 , payout: -1
Action (1) --------
roll: 1 , score: 5 , payout: -1
Action (1) --------
roll: 3 , score: 8 , payout: -1
Action (1) --------
roll: 6 , score: 14 , payout: -1
Action (1) --------
roll: 5 , score: 19 , payout: -1
Action (2) --------
doubles !
roll: 3 and 3 , score: 25 , payout: 8
Action (4) --------
dealer roll: 29 , player score: 25 , payout -8

Dice Game Reset ! Score: 0


Round Reward: -5


Action (2) --------
roll: 2 and 6 , score: 8 , payout: -2
Action (2) --------
roll: 6 and 1 , score: 15 , payout: -2
Action (1) --------
roll: 2 , score: 17 , payout: -1
Action (2) --------
roll: 4 and 2 , score: 23 , payout: -2
Action (4) --------
dealer roll: 30 , player score: 23 , payout -14

Dice Game Reset ! Score: 0


Round Reward: -21


Action (2) --------
roll: 1 and 5 , score: 6 , payout: -2
Action (2) --------
roll: 3 and 1 , score: 10 , payout: -2
Action (1) --------
roll: 6 , score: 16 , payout: -1
Action (2) --------
roll: 3 and 5 , s

3. Using dynamic programming, compute the value functions under the following policies. Explain the results and represent them graphically.

In [594]:
# Max Score (State)
MAX_STATE = 31

# Small number determining the accuracy of policy evaluation's estimation
#THETA = 1e-15
THETA = 1e-3

# Discount factor (can be 1, since this is an episodic task)
GAMMA = 1

# A list/array of all possible states
STATES = np.arange(MAX_STATE+1)

    “R1”: The player always rolls a single dice

The expected score gained per turn when rolling a single die (R1) is then 

$${Expected \ score (R1)} = \frac{1}{7} \cdot 1 + \frac{1}{7} \cdot 2 + \frac{1}{7} \cdot 3 + \frac{1}{7} \cdot 4 + \frac{1}{7} \cdot 5+ \frac{2}{7} \cdot 6$$

Simplified:

$${Expected \ score (R1)} = \frac{15}{7} + \frac{12}{7}$$

Further simplification:

$${Expected \ score (R1)} = \frac{27}{7} \approx 3.857$$


In [597]:
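# Policy R1: action 1 (roll a single die) in every non-terminal state
policy = [1] * (MAX_STATE + 1)
policy[-1] = 0  # terminal state: no action
values = np.zeros(MAX_STATE + 1)

# Faces of one weighted player die and their probabilities (6 is twice as likely)
DIE_FACES = [1, 2, 3, 4, 5, 6]
DIE_PROBS = [1/7, 1/7, 1/7, 1/7, 1/7, 2/7]

In [605]:
def evalAction1(state, action, currentValues):
    # The terminal state has value 0: no further reward can be collected
    if state == MAX_STATE:
        return 0.0

    # Expectation over the die faces of reward + value of the next state
    expected = 0.0
    for face, prob in zip(DIE_FACES, DIE_PROBS):
        # Every score of 31 or more is mapped to the single terminal state
        nextState = min(state + face, MAX_STATE)
        if nextState == MAX_STATE:
            eReward = -11  # roll cost (1 CHF) plus penalty (10 CHF)
        else:
            eReward = -1   # roll cost only
        expected += prob * (eReward + GAMMA * currentValues[nextState])

    return expected


In [606]:
while True:
    # Set delta to 0
    delta = 0
    
    # Update value function for each state
    for state in STATES:
        oldValue = values[state]

        action = policy[state]
        values[state] = evalAction1(state, action, values)

        delta = max(delta, abs(oldValue - values[state]))
    
    # Break if delta is small enough
    if delta < THETA:
        break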

    “R2”: The player always rolls both dice.

The expected score gained per turn when rolling two dice (R2) is then 

$${Expected \ score (R2)} = 2*(\frac{1}{7} \cdot 1 + \frac{1}{7} \cdot 2 + \frac{1}{7} \cdot 3 + \frac{1}{7} \cdot 4 + \frac{1}{7} \cdot 5+ \frac{2}{7} \cdot 6)$$

Simplified:

$${Expected \ score (R2)} = 2*(\frac{27}{7}) = \frac{54}{7} \approx 7.714$$ 

Because the dice are weighted, the probability of rolling doubles is not 1/6. It is the sum of the probabilities of each pair of equal faces:

$$\text{Probability} = 5 \cdot \left(\frac{1}{7}\right)^2 + \left(\frac{2}{7}\right)^2 = \frac{9}{49} \approx 0.1837$$

The expected immediate reward per turn with (R2), before any penalty for reaching 31, is the expected bonus minus the cost of the two dice:

$$\text{Expected \ reward} = \frac{9}{49} \cdot 10 - 2 = -\frac{8}{49} \approx -0.163 \ CHF$$

    “RR”: If the player’s score is strictly smaller than 20, they roll either one or two dice with equal
    probability. Otherwise, they choose one of the three bet-multiplicators uniformly at random
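
Because the score can only increase within a round, the value function of "RR" can be computed exactly in a single backward sweep from score 30 down to 0. The following is a minimal sketch under stated assumptions, not the project's solution: the helper `q_roll` and the dictionary `V` are illustrative names, the expected dealer roll 195/7 follows from the dealer distribution, and reaching 31 is assumed to cost 10 CHF on top of the roll cost and any doubles bonus.

```python
from fractions import Fraction

PLAYER = {f: Fraction(2 if f == 6 else 1, 7) for f in range(1, 7)}
E_DEALER = Fraction(195, 7)   # expected dealer roll
BUST = 31

def q_roll(score, n_dice, V):
    """Expected return of rolling n_dice (1 or 2) at `score`, then following V."""
    if n_dice == 1:
        outcomes = [(p, f, False) for f, p in PLAYER.items()]
    else:
        outcomes = [(p1 * p2, f1 + f2, f1 == f2)
                    for f1, p1 in PLAYER.items() for f2, p2 in PLAYER.items()]
    total = Fraction(0)
    for p, gain, double in outcomes:
        reward = -n_dice + (10 if double else 0)
        nxt = score + gain
        if nxt >= BUST:
            total += p * (reward - 10)      # assumed penalty, round ends
        else:
            total += p * (reward + V[nxt])  # continue from the new score
    return total

# Backward sweep: V[s] only depends on values of higher scores
V = {}
for s in range(30, -1, -1):
    if s < 20:
        # One or two dice with equal probability
        V[s] = (q_roll(s, 1, V) + q_roll(s, 2, V)) / 2
    else:
        # Uniform choice among multiplicators 1, 2, 3: the mean multiplicator is 2
        V[s] = (s - E_DEALER) * 2

print(V[20], V[30])   # -110/7 30/7
print(float(V[0]))    # value of the initial state under RR
```

The same sweep works for "R1" and "R2" by replacing the policy branch with a single call to `q_roll`.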

4. Find the optimal policy using dynamic programming. Represent the action-value function under
the optimal policy graphically. Explain the results and compare them to those of the previous task.

BONUS

Use the class you implemented in the first task for the following Monte Carlo simulation: estimate
the value of the initial state under each of the policies from the previous tasks (“R1”, “R2”, “RR”,
“Optimal”). Illustrate the results and compare them to the results of the previous tasks.

    “R1”: The player always rolls a single dice

In [30]:
gameR1Carlo = DiceGameSimplify()

In [31]:
def generateEpisode(game):
    game.reset()
    path = [0]
    rewards = [0]
    while True:
        state, reward = game.step(1)
        rewards.append(reward)
        if state >= 31:
            break
        path.append(state)

    return path, rewards

In [32]:
generateEpisode(gameR1Carlo)


Dice Game Reset ! Score: 0
Action (1) --------
roll: 6 , score: 6 , payout: -1
Action (1) --------
roll: 1 , score: 7 , payout: -1
Action (1) --------
roll: 1 , score: 8 , payout: -1
Action (1) --------
roll: 2 , score: 10 , payout: -1
Action (1) --------
roll: 3 , score: 13 , payout: -1
Action (1) --------
roll: 3 , score: 16 , payout: -1
Action (1) --------
roll: 1 , score: 17 , payout: -1
Action (1) --------
roll: 3 , score: 20 , payout: -1
Action (1) --------
roll: 1 , score: 21 , payout: -1
Action (1) --------
roll: 6 , score: 27 , payout: -1
Action (1) --------
roll: 6

Round Over, you reach a score of 33 which is more or equal to 31 , payout: -10

Dice Game Reset ! Score: 0


([0, 6, 7, 8, 10, 13, 16, 17, 20, 21, 27],
 [0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -10])

In [46]:
def monteCarloEpisode(game, values, alpha):
    path, rewards = generateEpisode(game)

    G = 0
    for t in reversed(range(len(path))):
        G += rewards[t+1]
        if path[t] not in path[:t]:
            values[path[t]] += alpha * (G - values[path[t]])
            
    return 

In [47]:
N_EPISODES = 100
ALPHA = 0.02
valuesR1Carlo = [0]*32

for n in range(N_EPISODES):
    monteCarloEpisode(gameR1Carlo, valuesR1Carlo, ALPHA)
    


Dice Game Reset ! Score: 0
Action (1) --------
roll: 3 , score: 3 , payout: -1
Action (1) --------
roll: 1 , score: 4 , payout: -1
Action (1) --------
roll: 1 , score: 5 , payout: -1
Action (1) --------
roll: 2 , score: 7 , payout: -1
Action (1) --------
roll: 2 , score: 9 , payout: -1
Action (1) --------
roll: 6 , score: 15 , payout: -1
Action (1) --------
roll: 2 , score: 17 , payout: -1
Action (1) --------
roll: 4 , score: 21 , payout: -1
Action (1) --------
roll: 6 , score: 27 , payout: -1
Action (1) --------
roll: 4

Round Over, you reach a score of 31 which is more or equal to 31 , payout: -10

Dice Game Reset ! Score: 0

Dice Game Reset ! Score: 0
Action (1) --------
roll: 4 , score: 4 , payout: -1
Action (1) --------
roll: 2 , score: 6 , payout: -1
Action (1) --------
roll: 6 , score: 12 , payout: -1
Action (1) --------
roll: 5 , score: 17 , payout: -1
Action (1) --------
roll: 1 , score: 18 , payout: -1
Action (1) --------
roll: 6 , score: 24 , payout: -1
Action (1) --------


32


    “R2”: The player always rolls both dice.

    “RR”: If the player’s score is strictly smaller than 20, they roll either one or two dice with equal
    probability. Otherwise, they choose one of the three bet-multiplicators uniformly at random
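
For "RR", a Monte Carlo estimate of the initial state value is simply the average total reward over many simulated rounds. Since `DiceGameSimplify.step` prints every turn, the sketch below uses a small print-free simulator instead. This is an assumption-laden illustration: `play_rr_round` and `mc_estimate` are hypothetical helpers, and the reward convention follows the game rules (roll cost, doubles bonus, and a 10 CHF penalty on reaching 31), which differs from the flat -10 that `DiceGameSimplify.step` returns in that case.

```python
import random

FACES = [1, 2, 3, 4, 5, 6]
DEALER_FACES = [25, 26, 27, 28, 29, 30]
WEIGHTS = [1, 1, 1, 1, 1, 2]   # the highest face is twice as likely

def play_rr_round(rng):
    """Simulate one round under policy RR and return its total reward."""
    score, total = 0, 0
    while score < 20:
        n_dice = rng.choice([1, 2])                 # one or two dice, 50/50
        rolls = rng.choices(FACES, weights=WEIGHTS, k=n_dice)
        score += sum(rolls)
        total -= n_dice                             # cost of the roll
        if n_dice == 2 and rolls[0] == rolls[1]:
            total += 10                             # doubles bonus
        if score >= 31:
            return total - 10                       # assumed penalty, round over
    multiplier = rng.choice([1, 2, 3])              # uniform bet-multiplicator
    dealer = rng.choices(DEALER_FACES, weights=WEIGHTS, k=1)[0]
    return total + (score - dealer) * multiplier

def mc_estimate(n_episodes, seed=0):
    """Average return of n_episodes simulated rounds (seeded for repeatability)."""
    rng = random.Random(seed)
    returns = [play_rr_round(rng) for _ in range(n_episodes)]
    return sum(returns) / n_episodes

print(mc_estimate(10_000))
```

With enough episodes the estimate should approach the value of state 0 obtained by dynamic programming under the same reward convention.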