
# Solving Hi Ho Cherry O Game using Monte Carlo

## Game Description

In the game, each player has a bucket. The goal is to put 10 cherries in your own bucket to win the game. Each player starts with zero cherries. On each turn, a player spins a spinner. The spinner has 7 outcomes:

- Add 1 cherry
- Add 2 cherries
- Add 3 cherries
- Add 4 cherries
- Remove 2 cherries (bird eats them)
- Remove 2 cherries (dog eats them)
- Remove all cherries (spill the bucket)

Our goal is to figure out how many spins it takes on average to reach a full bucket of 10 cherries.


## Equations

We let $v_i$ represent the value of being in state $i$, that is, with $i$ cherries in the bucket. The value $v_i$ is the expected number of spins needed to reach 10 cherries when starting with $i$ cherries. So $v_0$ is the expected number of spins when starting with 0 cherries, which is the start state of the game and the answer that we seek.
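
As a sketch of how these values relate to each other, assume the seven spinner outcomes are equally likely (which matches the uniform `np.random.randint(1,8)` draw in the code below) and write $p$ for the number of cherries the bird or the dog removes. The symbol $p$ is introduced here only for illustration: the game description above uses $p = 2$, while the simulation code below uses $p = 3$. Every spin costs one step, so for $0 \le i \le 9$:

$$
v_i = 1 + \frac{1}{7}\left[\sum_{k=1}^{4} v_{\min(10,\,i+k)} + 2\,v_{\max(0,\,i-p)} + v_0\right],
\qquad v_{10} = 0.
$$

The sum covers the four "add cherries" outcomes, the factor of 2 covers the bird and the dog, and the final $v_0$ term is the spilled bucket.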




In [2]:
import numpy as np

In [38]:
# Note to reader: the code below assumes the dog and bird outcomes each remove 3 cherries
# (not 2, as in the game description), so the equations and results are slightly different,
# though the process is the same.

def update (s):
  '''
  Compute next state given state s
  s = number of cherries in bucket
  '''
  spin = np.random.randint(1,8)
  if spin <= 4:
    return min(10,s+spin)
  elif spin <= 6:
    return max(0,s-3)
  else:
    return 0

def trajectory ():
  '''
  Simulate one trajectory of experience
  Return list of states during trajectory
  '''
  traj = [0]
  s = 0
  while ( s < 10):
    s = update(s)
    traj.append(s)
  return traj

def init ():
  totals = np.zeros(11)
  count = np.zeros(11)
  return totals,count

def policy_evaluation(totals,count,n):
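  '''
  Do n trajectories of learning
  and update the totals/count arrays in place
  (every-visit Monte Carlo: each occurrence of a state is counted)
  '''
  for i in range(n):
    t = trajectory()
    m = len(t)
    for j in range(m):
      # cost = number of spins remaining from position j until the bucket is full
      cost = m-1-j
      s = t[j]
      count[s] += 1
      totals[s] += cost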


def compute_value (totals,count):
  v = totals / count
  return v



In [24]:
t = trajectory()
print(t)

[0, 4, 7, 10]
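
To make the bookkeeping in `policy_evaluation` concrete, the short sketch below applies the same `cost = m-1-j` rule to the sample trajectory printed above. The name `spins_remaining` is a hypothetical helper variable used only for this illustration.

```python
# Sample trajectory copied from the output above.
traj = [0, 4, 7, 10]
m = len(traj)

# Same rule as in policy_evaluation: the cost recorded for position j
# is the number of spins still needed to reach 10 cherries.
spins_remaining = [m - 1 - j for j in range(m)]

for state, cost in zip(traj, spins_remaining):
    print(f"state {state:2d}: {cost} spins remaining")
```

State 0 is credited with the full length of the game, and the terminal state 10 always receives a cost of 0, which is why its estimated value is exactly 0.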


In [37]:
totals,counts = init()


In [44]:
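# Each call adds 500 more trajectories to the running totals/counts.
# Every trajectory visits state 10 exactly once, so counts[10] = 1000 in the output
# below is consistent with this cell having been run twice (1000 trajectories in total).
policy_evaluation(totals,counts,500)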
print(totals)
print(counts)
v = compute_value(totals,counts)
print(v)

[122440.  26228.  25364.  27083.  26370.  14953.  11850.   7792.   5607.
   3190.      0.]
[6884. 1543. 1507. 1731. 1835. 1117. 1041.  813.  682.  540. 1000.]
[17.78617083 16.99805574 16.83078965 15.64586944 14.37057221 13.38675022
 11.3832853   9.58425584  8.22140762  5.90740741  0.        ]
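
As a cross-check on the Monte Carlo estimates, the expected spin counts can also be computed by solving a linear system directly. This is not the Monte Carlo method used above: it is a separate sketch that assumes the same rules as `update` (seven equally likely outcomes, with the bird and dog each removing 3 cherries). The function name `exact_values` and its parameters `penalty` and `goal` are hypothetical names chosen for this example.

```python
import numpy as np

def exact_values(penalty=3, goal=10):
    """Solve v = 1 + P v over the non-terminal states 0..goal-1, with v[goal] = 0."""
    n = goal
    P = np.zeros((n, n))  # transition probabilities among non-terminal states
    for s in range(n):
        next_states = [min(goal, s + k) for k in (1, 2, 3, 4)]  # add 1..4 cherries
        next_states += [max(0, s - penalty)] * 2                # bird and dog
        next_states.append(0)                                   # spilled bucket
        for t in next_states:
            if t < goal:  # transitions into the goal state contribute v[goal] = 0
                P[s, t] += 1 / 7
    # (I - P) v = 1: every spin costs one step.
    v = np.linalg.solve(np.eye(n) - P, np.ones(n))
    return np.append(v, 0.0)

v_exact = exact_values()
print(v_exact)
```

Comparing `v_exact` with the array `v` printed above gives a sense of how much sampling noise remains in the Monte Carlo estimates. Passing `penalty=2` would instead follow the rules as stated in the game description.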
