<a href="https://colab.research.google.com/github/mcnica89/Markov-Chains-RL-W24/blob/main/Lab3_Draft.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np

# This is the final updated version!

# Q1 Optimal Betting Sizes for changing $p_{win}$

This problem is very similar to Lab 2 Q2, but instead of being told one value of $p_{win}$, the value of $p_{win}$ changes on each round and is generated by an unknown process. The player gets to see the value of $p_{win}$ before they choose their bet size. All the other details, like $n_{max}$, $t_{max}$, the savings account, etc., are exactly the same. Exception: the player is now allowed to bet 0 if they want to. (In Lab 2 they were forced to bet at least 1 in each round.)

The possible values of the winning probability $p_{win}$ will always be given as a fraction with denominator $p_{max}$:

$$p = \frac{x}{p_{max}}$$

(We will typically use $p_{max}=10$, so every probability is one of 0\%, 10\%, 20\%, ..., 90\%, 100\%.) This lets us store each probability as an integer that indexes an array axis of size $p_{max}+1$.
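
As a minimal sketch of this integer encoding (the name `p_int` and the values of `t_max` and `n_max` are illustrative assumptions, not prescribed by the lab):

```python
import numpy as np

p_max = 10              # denominator shared by all win probabilities
t_max, n_max = 5, 100   # example values only

# An integer p_int in {0, 1, ..., p_max} stands for the probability p_int / p_max
p_int = 7
p_win = p_int / p_max   # 0.7

# A policy table indexed by [t, x, p_int]; here filled with the always-bet-1 policy
bet_size_1 = np.ones((t_max + 1, n_max + 1, p_max + 1), dtype=int)
print(bet_size_1.shape)  # (6, 101, 11)
```

Storing the numerator rather than the float means the probability can be used directly as the last index of the policy table.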

- **Q1a** For a given fixed bet_size policy bet_size[t,x,p] (which is an array of shape $t_{max}+1,n_{max}+1,p_{max}+1$), use Monte Carlo or TD methods to compute the value function v[t,x] when following that policy and there are $t$ rounds left and a bankroll of $x$.
- *What you will submit for Q1a*:
 - A function that inputs the bet_size policy as an array, and outputs your estimate for the value function.
 - A single array prediction for the value function for the policy bet_size_1 which always bets 1.

- **Q1b** In terms of the value function $v[t,x]$, the $q$ function $q(t,x,p,b)$ for betting $b$ in state $(t,x,p)$ is given below, where $p$ is the win probability itself (the stored integer divided by $p_{max}$):
$$q(t,x,p,b) = p\cdot \left(\max(x+b-n_{max},0) + v[t-1,\min(x+b,n_{max})] \right) + (1-p)\cdot \left(v[t-1,x-b]\right)$$

- Develop a Q-learning-style TD algorithm to improve the bet_size policy. Use this $q$ function equation and the following epsilon-greedy policy in your code:

$$\pi(t,x,p) = \begin{cases} \text{argmax}_{1 \leq b \leq x} q(t,x,p,b) \text{ with } \mathbb{P}=1-\epsilon \\ \text{random 1 to }x\text{ with } \mathbb{P}=\epsilon \end{cases}$$

- *What you will submit for Q1b*:
 - A function that improves the bet_size policy.
 - A single array which is the best bet_size policy you have found.
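
The $q$ equation and the epsilon-greedy rule from Q1b can be sketched directly in code. This is only an illustration under stated assumptions, not a full solution: the helper names `q_value` and `epsilon_greedy_bet` are hypothetical, the value table `v` is assumed to have shape $(t_{max}+1, n_{max}+1)$, and the toy table below assumes that with 0 rounds left the value is simply the bankroll.

```python
import numpy as np

def q_value(v, t, x, p_int, b, n_max, p_max):
    # Direct translation of the q equation; requires t >= 1 and 0 <= b <= x
    p = p_int / p_max                     # stored integer -> probability
    overflow = max(x + b - n_max, 0)      # winnings above n_max go to savings
    win_term = overflow + v[t - 1, min(x + b, n_max)]
    lose_term = v[t - 1, x - b]
    return p * win_term + (1 - p) * lose_term

def epsilon_greedy_bet(v, t, x, p_int, n_max, p_max, epsilon, rng):
    if x < 1:
        return 0  # assumption: with an empty bankroll the only possible bet is 0
    bets = np.arange(1, x + 1)            # candidate bets 1..x, as in the policy formula
    if rng.random() < epsilon:
        return int(rng.choice(bets))      # explore: uniform random bet
    q = np.array([q_value(v, t, x, p_int, b, n_max, p_max) for b in bets])
    return int(bets[np.argmax(q)])        # exploit: greedy bet

# Toy example with a small table (values are illustrative only)
n_max, p_max, t_max = 10, 10, 2
v = np.zeros((t_max + 1, n_max + 1))
v[0] = np.arange(n_max + 1)               # assumption: v[0, x] = x
rng = np.random.default_rng(0)
print(q_value(v, t=1, x=5, p_int=7, b=2, n_max=n_max, p_max=p_max))  # approximately 5.8
print(epsilon_greedy_bet(v, 1, 5, 7, n_max, p_max, epsilon=0.0, rng=rng))
```

With one round left and a favourable $p$, the greedy choice in this toy table is to bet the whole bankroll; with an unfavourable $p$ it is the smallest bet in the range.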





## Code that runs the simulations/helper functions

In [None]:
#### CODE TO RUN THE SIMULATIONS
# You will copy/paste/modify this code to create your episodes

def generate_p_win():
  #returns the next value of p_win in units of tenths. (i.e. returning a 1 means p_win = 1/10 = 10%)
  # this is supposed to be an unknown function! No "cheating" by using what it does in your solution!
  return np.random.choice([1,2,3,4,5,6,7,7,7,7,8,9])

def bet_1_policy(t,x,p,t_max,n_max,p_max):
  #the always bet 1 policy
  return 1

#runs a simulation of the game: use this to setup your algorithms!
def simulate_game(x_init,t_max,n_max,p_max,policy):
  ## simulates the game and returns the total return (final bankroll + savings account)

  x = x_init #player bankroll
  savings_account = 0 #player savings account
  for t in range(t_max,0,-1): #count backwards through the number of rounds remaining!
    #### Sample p
    p_win_int = generate_p_win() #get the next p_win as an integer

    #### Choose bet size according to policy
    bet_size = policy(t,x,p_win_int,t_max,n_max,p_max) #get bet size from the policy
    bet_size = max(min(int(bet_size),x),0) #ensure it's an integer between 0 and x

    #### Sample outcome
    outcome = np.random.choice([1,-1],p=[p_win_int/p_max,1.0-p_win_int/p_max]) #get the win/lose result of the bet according to p_win

    ### Update bankroll!
    x += outcome*bet_size
    if x > n_max:
      savings_account += x - n_max
      x = n_max

  ## get the total return
  total_return = x+savings_account
  return total_return



In [None]:
print(simulate_game(x_init=10,t_max=5,n_max=100,p_max=10,policy=bet_1_policy))

9


# Q2 Easy 21 With A Finite Deck And Card Counting - Warm Up for the Final Project

Rules of the game:
- A "shoe" is a large stack of cards that consists of multiple decks of cards shuffled together. In Easy21 with a finite deck, we create a shoe of cards by combining together *10* ordinary decks. Cards are dealt from the shoe *without replacement* (i.e. once a card is dealt out, it is gone!)
- An "episode" now consists of a full play through one shoe and is made up of multiple hands played back to back, with the player's bankroll going up/down as hands are played.
 - The player starts with a bankroll of $100.
 - The player can choose their bet size before each hand, which is paid out at even odds (i.e. they either win this amount, or lose this amount depending on if they win/lose the hand.) This is the function "playerBetSizeChoice".
- In each hand, the dealer and player take turns, with the player using a strategy function "playerActionIsHit" (e.g. in Lab 2 Q1b, you computed the optimal strategy in the infinite deck case). The dealer always follows the hit-below-17 strategy.
 - As they play, cards are consumed from the shoe.
- If at any time the cards completely run out, no more hitting is allowed (i.e. the player or dealer is forced to "stick").
  - A new hand is only dealt out if there is at least one full deck of cards remaining in the shoe (30 cards). Otherwise the episode is over and no more hands are dealt.
- The player is allowed to do "card counting"
  - The player is allowed to write down one integer (sometimes called "the count") which they can remember and keep track of from hand to hand.
  - The count is updated using the "playerCardCounterUpdate" function, which inputs the old count, the next observed card, and the number of cards left in the deck, and returns the NEW count.
  - All of the player's choices are allowed to depend on the count (for example, the count could be used to signal when to bet big or bet small).

  In Lab 3, you will do the following:
  - The playerActionIsHit will be taken to be the value from Lab2 in the function Lab2Actions (for the final project you will be free to change this however you want)
  - The bet size will always be 1
  - **Q2a** Write a card counter update function so that the count always represents the average value of the cards in the shoe. Submit the counter update function.
  - **Q2b** Use a Monte Carlo or TD learning method to estimate the probability of winning a hand GIVEN the current average value of the shoe, before the cards are dealt out.
   - Implementation details:
     - To make the average value of the shoe (which is a float) fit into the table of values for the state, we will ROUND THE AVERAGE to the nearest tenth, and work in units of "tenths of a point", e.g. if the average is 0.183333 we will round it to 0.2, and keep this as the integer "2" for our table.
     - To deal with negative values, we will add +100 to everything e.g. if the average is -0.5, we will convert this to -5 (working in tenths), and then add +100 to it to put it into our table at index 95. (Note that the most negative possible average, an average of -10.0, gets mapped to index 0 by this formula)

  - Fun Fact: A solution to Q2b combined with a solution to Q1 gives you a method to create a good betting strategy for the final project! This is one possible strategy you can use for the final project.
  - Submit both the code that you used to do this, and an array which is your final answer.
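
The rounding-and-shifting convention from the implementation details can be sketched as follows. The helper name `average_to_table_index` is hypothetical, and the sketch assumes the average always lies between -10.0 and 10.0, so a table of length 201 would cover every index.

```python
def average_to_table_index(average_value: float) -> int:
    # Round to the nearest tenth, expressed in units of tenths of a point
    tenths = int(round(average_value * 10))
    # Shift by +100 so that the most negative average, -10.0, lands at index 0
    return tenths + 100

print(average_to_table_index(0.183333))  # 102
print(average_to_table_index(-0.5))      # 95
print(average_to_table_index(-10.0))     # 0
```

Note that Python's built-in `round` sends exact halves to the nearest even integer (so 2.5 becomes 2); if a different tie-breaking rule is wanted, it would need to be implemented explicitly.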



# All the functions/code for Easy 21 with a finite deck below.

In [None]:
def playerCardCounterUpdate( card_counting_signal_input : float, num_cards_left_in_shoe : int, observed_card : int) -> float:
  # updates the card Counting signal based on:
  #  card_counting_signal_input : the previous card counting signal (which is a float)
  #  observed_card : the card that was dealt out that we just observed
  #  num_cards_left_in_shoe : how many cards are currently left in the deck

  #if the current card_counting_signal is None, then we are just starting a new shoe and must initialize it to some value
  if card_counting_signal_input is None:
    playerCardCounterInitialValue = 0
    return playerCardCounterInitialValue

  output_card_counting_signal = 0.0
  return output_card_counting_signal

def playerBetSizeChoice(card_counting_signal_input:float, num_cards_left_in_shoe : int, current_bankroll:int) -> int:
  # How much the player wants to bet
  # Is allowed to depend on the card counting signal, the number of cards left in the deck, and the player's current bankroll
  bet_size = 1
  return bet_size

In [None]:
def playerActionIsHit( player_sum : int, dealer_sum : int, card_counting_signal : float) -> bool:
  #Given the Player Value, the Dealer Value, and the Card Counting signal, return whether or not the Player Action is to hit (otherwise the player chooses to stick)
  playerActionIsHit_answer = False
  return playerActionIsHit_answer

def Lab2Actions(player_sum, dealer_sum,card_counting_signal):
  actions = np.array(
      [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
       [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
       [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
       [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])
  return actions[dealer_sum,player_sum]



def simulateEasy21_finite_deck(playerStrategy, playerBetSizeChoice, playerCardCounterUpdate, Verbose = False) -> int:
  # simulates one run through of a shoe of easy 21 using the player and dealer strategy
  # the game is counted as over when there are <MIN_ONE_DECKS_LEFT_TO_END_GAME worth of cards left in the shoe

  ####GAME PARAMETERS#####
  # These parameters set up the "rules" of the game
  # They are constant
  NUM_ONE_DECKS_IN_SHOE = 10 #number of ONE_DECKs in the shoe at the start of the game
  MIN_ONE_DECKS_LEFT_TO_END_GAME = 1 #when the shoe is low, the next hand is not dealt
  PLAYER_STARTING_BANKROLL = 100 #starting bankroll for the player
  #####################

  CARDS_1_TO_10 = np.arange(1,10+1)
  ONE_DECK = np.concatenate( (CARDS_1_TO_10, CARDS_1_TO_10, -CARDS_1_TO_10) ) #the ONE_DECK is 30 cards: 1 to 10 twice and -1 to -10 once

  shoe = np.tile( ONE_DECK, NUM_ONE_DECKS_IN_SHOE) #the shoe is the full stack of cards used in the game
  np.random.shuffle(shoe) #shuffle the shoe!

  top_of_shoe_ix = 0 #index for how far we are in the shoe
  num_cards_left_in_shoe = len(shoe) - top_of_shoe_ix #number of cards remaining in the shoe

  player_bankroll = PLAYER_STARTING_BANKROLL   #your starting bankroll initialized to the start value
  player_cardcount_signal = float(playerCardCounterUpdate(None,None,None)) #get the starting cardcount signal by feeding "None" into the updater

  #### PLAY ANOTHER HAND LOOP
  # while we still have at least MIN_ONE_DECKS_LEFT_TO_END_GAME left in the shoe, the game continues
  # (and the player has a non-zero bankroll!)
  while len(shoe) - top_of_shoe_ix > MIN_ONE_DECKS_LEFT_TO_END_GAME*len(ONE_DECK) and player_bankroll > 0:

    ### PLAYER CHOOSES BETSIZE ####
    bet_size = playerBetSizeChoice(player_cardcount_signal, num_cards_left_in_shoe, player_bankroll) #choose the betsize!
    bet_size = max(min(int(bet_size),player_bankroll),0) #ensure the player bet size is an integer between 0 and player_bankroll

    #print out whats going on if requested
    if Verbose:
      print(f"{player_bankroll=}, {bet_size=}")

    ### DEAL INITIAL CARDS TO DEALER AND PLAYER ###

    #deal the next card to the player
    nextCard = shoe[top_of_shoe_ix]
    top_of_shoe_ix += 1
    num_cards_left_in_shoe = len(shoe) - top_of_shoe_ix

    #feed in the card to the cardcounter function
    player_cardcount_signal = float(playerCardCounterUpdate(player_cardcount_signal,num_cards_left_in_shoe,nextCard))

    #for the 1st card only, we take the absolute value (since we always start at a positive value)
    player_sum = abs(nextCard) #absolute value is taken for the starting card

    #deal the next card to the dealer and also let the player card counter see the next card
    nextCard = shoe[top_of_shoe_ix]
    top_of_shoe_ix += 1
    num_cards_left_in_shoe = len(shoe) - top_of_shoe_ix
    player_cardcount_signal = float(playerCardCounterUpdate(player_cardcount_signal,num_cards_left_in_shoe,nextCard))
    dealer_sum = abs(nextCard) #absolute value is taken for the dealer starting card

    if Verbose:
      print("==================")
      print(f"Player Starting Sum: {player_sum}")
      print(f"Dealer Starting Sum: {dealer_sum}")

    ### PLAYERS TURN ###
    if Verbose:
        print("--Player's Turn")

    player_is_active = playerStrategy(player_sum,dealer_sum,player_cardcount_signal) #Boolean flag for the Player still playing. True iff player wants to "hit" and keep going.
    player_is_active = player_is_active and ( top_of_shoe_ix < len(shoe) ) #Check and make sure there are still cards left in the shoe! Everything is automatically over if we are out of cards.
    player_is_busted = False #Boolean flag for the player being busted.



    while player_is_active: #while player wants to keep going
      #deal the next card and also let the player card counter see the next card
      nextCard = shoe[top_of_shoe_ix]
      top_of_shoe_ix += 1
      num_cards_left_in_shoe = len(shoe) - top_of_shoe_ix
      player_cardcount_signal = float(playerCardCounterUpdate(player_cardcount_signal,num_cards_left_in_shoe,nextCard))
      player_sum += nextCard

      if Verbose:
        print(f"{player_cardcount_signal=}")

      #check status for bustedness and what to do next
      player_is_busted = ( player_sum < 1 or player_sum > 21 ) #flag for busted
      player_is_active = ( not player_is_busted ) and playerStrategy(player_sum,dealer_sum,player_cardcount_signal) #check if player is still active
      player_is_active = player_is_active and ( top_of_shoe_ix < len(shoe) ) #if we are out of cards in the shoe everything is automatically over!

      if Verbose:
        print(f"{player_sum = }, {player_is_busted=}, {player_is_active=}")

    ### DEALER'S TURN ###
    if Verbose:
        print("--Dealer's Turn")

    #The dealer will always hit if <=16 (i.e. below 17) and player is not busted
    dealer_is_active = (not player_is_busted) #flag for the Dealer still playing. Note that if player busted, the dealer will just skip their turn.
    dealer_is_active = dealer_is_active and ( top_of_shoe_ix < len(shoe) ) #if we are out of cards everything is automatically over
    dealer_is_busted = False

    while dealer_is_active:
      #deal card to dealer
      nextCard = shoe[top_of_shoe_ix]
      top_of_shoe_ix += 1
      num_cards_left_in_shoe = len(shoe) - top_of_shoe_ix
      player_cardcount_signal = float(playerCardCounterUpdate(player_cardcount_signal,num_cards_left_in_shoe,nextCard))
      dealer_sum += nextCard

      #check for what to do
      dealer_is_busted = (dealer_sum < 1) or (dealer_sum > 21) #check if dealer busted
      dealer_is_active = (not dealer_is_busted) and ( dealer_sum <= 16 )#check if dealer chooses to hit and keep playing
      dealer_is_active = dealer_is_active and ( top_of_shoe_ix < len(shoe) ) #if we are out of cards everything is automatically over

      if Verbose:
        print(f"{dealer_sum = }, {dealer_is_busted=}, {dealer_is_active=}")

    player_wins = dealer_is_busted or ( (not player_is_busted) and player_sum > dealer_sum ) #Boolean variable if player wins!
    dealer_wins = player_is_busted or ( (not dealer_is_busted) and player_sum < dealer_sum ) #Boolean variable if dealer wins!
    if Verbose:
      print(f"{player_wins=}, {dealer_wins=}")

    player_bankroll += bet_size*player_wins - bet_size*dealer_wins #give player the money for winning or losing!

  return player_bankroll

In [None]:
### Simulate a bunch of games with the Lab 2 policy
n_sims = 1000
results = np.zeros(n_sims)
for i in range(n_sims):
  results[i] = simulateEasy21_finite_deck(Lab2Actions, playerBetSizeChoice, playerCardCounterUpdate, Verbose=False)
print(np.mean(results))

103.454


In [None]:
### Simulate a bunch of games with the "Always Stick" policy
def Always_Stick_Policy(player_sum : int, dealer_sum : int, card_counting_signal : float):
  #the policy to always stick and never hit!
  return False

n_sims = 1000
results = np.zeros(n_sims)
for i in range(n_sims):
  results[i] = simulateEasy21_finite_deck(Always_Stick_Policy, playerBetSizeChoice, playerCardCounterUpdate, Verbose=False)
print(np.mean(results))

102.223
