monaco - Monte Carlo Tree Search library

;; XXX Use adqc to write the functions?

;; XXX The discussion of MCTS in https://arxiv.org/pdf/1805.09218.pdf
;; says that there is a fixed set of actions. But sometimes actions
;; are illegal (in Go, for example, you can't put your stone on top
;; of another.) Would you represent this by saying that you lose if
;; you make an illegal move? Or do you just vary the number of
;; actions on each node (sketched below)? Perhaps the MCTS-T
;; algorithm is great for this situation, because it is well-designed
;; for "ending early" situations. [This paper has a link to a git
;; repo that implements it.]
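
;; A minimal sketch of the second option (varying the action set per
;; node), assuming the game supplies a `legal-moves` function; all of
;; the names here are hypothetical, not monaco's API.
(struct node (state children) #:mutable #:transparent)

(define (expand! v legal-moves make-child)
  ;; Only the moves legal in v's state become children, so each
  ;; node's branching factor differs and illegal moves never enter
  ;; the tree.
  (set-node-children!
   v (for/list ([m (in-list (legal-moves (node-state v)))])
       (make-child v m))))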

;; XXX Exploiting graph properties of game trees, by Plaat et al.,
;; 1996 talks about "transpositions" (the same state occurring in
;; different paths in the tree.)
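
;; A sketch of sharing a node across transpositions via a hash from
;; state to node, assuming states are equal?-comparable; `make-node`
;; is a hypothetical constructor.
(define transpositions (make-hash)) ; state -> node

(define (intern-node! st make-node)
  ;; Two paths that reach the same state get the same node, so its
  ;; Q/N statistics are pooled across both lines of play.
  (hash-ref! transpositions st (λ () (make-node st))))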

;; XXX The AnySpace paper says
;; - Left-Child Right-Sibling representation for the search tree
;;   (sketched below)
;; - This seems to imply that we need to generate all of the children
;;   at the same time.
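
;; A sketch of Left-Child Right-Sibling nodes: every node carries one
;; child pointer and one sibling pointer, so node size is constant no
;; matter the branching factor. Field names are hypothetical.
(struct lcrs (move q n child sibling) #:mutable #:transparent)

(define (children-of v)
  ;; Walk the sibling chain hanging off the first child (#f ends it).
  (let loop ([c (lcrs-child v)])
    (if c (cons c (loop (lcrs-sibling c))) '())))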

;; XXX MCTS steps
;; - Selection: via tree policy
;; --- - UCB1: choose the child v' maximizing
;;       Q(v')/N(v') + K \sqrt{ln N(v) / N(v')} (sketched after this list)
;; --- - K is often \sqrt{2} or \sqrt{2}/2
;; --- - This could be tracked easily with a priority queue per node.
;; --- - If v' is not in the tree, then it is "expanded"
;; - Simulation:
;; --- - Random playout of the game
;; - Back propagation:
;; --- - Update Q and N
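
;; A sketch of UCB1 selection implementing the formula above, reusing
;; the `lcrs` nodes and `children-of` from earlier; hypothetical, not
;; monaco's API.
(define K (sqrt 2))

(define (ucb1 parent-n q n)
  (if (zero? n)
      +inf.0 ; unvisited children get selected (i.e., expanded) first
      (+ (/ q n) (* K (sqrt (/ (log parent-n) n))))))

(define (select-child v)
  (argmax (λ (c) (ucb1 (lcrs-n v) (lcrs-q c) (lcrs-n c)))
          (children-of v)))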

;; XXX Single-Observer Information-Set MCTS

;; - The state is unknown, so we sample a concrete state from what is
;;   possible (based on the actions we've observed in the past)

;; - Still update the same tree
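
;; A sketch of the determinization loop; `sample-state` (drawing a
;; state consistent with the observed history) and `mcts-iteration!`
;; (one selection/simulation/backprop pass) are hypothetical.
(define (so-ismcts! root sample-state mcts-iteration! iters)
  (for ([_ (in-range iters)])
    ;; Re-sample a concrete state each iteration; every iteration
    ;; updates the one shared tree.
    (mcts-iteration! root (sample-state))))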

;; XXX Multi-Observer Information-Set MCTS

;; - One tree for each observer

;; - Choosing a move for player-i uses the i-tree

;; - Player i's tree has a branch for each action, but player j ≠ i's
;;   tree merges branches when j can't tell the difference between them

;; XXX AnyTime: MCTS is generic in the number of iterations it takes
;; before returning an answer.
;;
;; - Run MCTS while the user is idle and make a move immediately when
;; they move. (Disadvantage: doesn't take advantage of what the user
;; did.)
;;
;; - Run MCTS for as long as the player took making their
;; move. (Disadvantage: Doesn't make use of idle time.)
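
;; A sketch of the second option, spending exactly the time the
;; opponent took; `iterate!` (one MCTS pass over `root`) is
;; hypothetical.
(define (mcts-for-ms! root iterate! budget-ms)
  (define deadline (+ (current-inexact-milliseconds) budget-ms))
  (let loop ()
    (when (< (current-inexact-milliseconds) deadline)
      (iterate! root)
      (loop))))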

;; XXX AnySpace: Node recycling

;; - http://mcts.ai/edpowley/papers/powley_aiide17.pdf

;; - Predetermine how many nodes you want to store

;; - Draw from the free list when you want one

;; - If the free list is empty, remove the one for which the Q/N
;;   values were least-recently read.

;; - The criterion must be least-recently read, not written, because
;;   otherwise you may evict a sibling that is still useful.

;; - You maintain this by adjusting the position of elements on the
;;   way down the tree

;; - This means that all of the siblings of a node need to be touched
;;   on the way down, which is inefficient and doesn't make use of
;;   the heap effectively. Perhaps you could update them when
;;   readjusting the heap position of the sibling, but that would
;;   either be inefficient (look through all of them) or not
;;   exhaustive (only look at the ones on the path up the heap) and
;;   backwards (because the top of the queue would be the nodes at
;;   the top of the tree, rather than the bottom)

;; - The paper says: We maintain a queue Θ of nodes. Non-leaf nodes
;;   are removed from Θ when visited during playout, and pushed to
;;   the back of Θ when they are updated. Leaf nodes are never
;;   present in Θ. Thus the node at the front of Θ is the
;;   least-recently-updated non-leaf node. (A sketch follows at the
;;   end of this section.)

;; - If you remove a child, how do you get it back if you need it
;;   again? Generate all choices from the parent and add the ones that
;;   aren't there? How would you know? The paper doesn't say anything
;;   about it. One strategy would be to track only parents and delete
;;   ALL of the children.
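
;; A simple (but inefficient) sketch of the eviction policy using a
;; per-node counter instead of the paper's queue Θ: evicting the
;; minimum-stamp non-leaf node is the same as popping Θ's front.
;; `all-nodes`, `leaf?`, and `free!` are hypothetical, and a real
;; implementation would use an intrusive queue, not a scan.
(define stamps (make-hasheq)) ; node -> logical time of last touch
(define clock 0)

(define (touch! v) ; call whenever v's Q/N statistics are used
  (set! clock (add1 clock))
  (hash-set! stamps v clock))

(define (recycle! all-nodes leaf? free!)
  ;; Assumes at least one non-leaf node exists.
  (define victim
    (argmin (λ (v) (hash-ref stamps v 0))
            (filter (λ (v) (not (leaf? v))) all-nodes)))
  (free! victim))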

;; XXX AlphaZero
;; - How is the neural network used?
;; --- - It is used to rank the children.
;; --- - It is used during simulation to select the move (like a "heavy" playout)
;; - The typical number of legal moves is used to scale exploration (the K constant)
;; - Games are terminated early after a certain number of steps and scored.
;; - Network input: N x N x (MT + L)
;; --- - N is the size of the board
;; --- - M is the number of planes (1 for each piece type)
;; --- - T is the number of time steps into the past
;; --- - L is game-specific variables, like legality of castling
;; - Network output "policy" is basically what piece to "pick up" and where to put it
;; --- - Probabilities for illegal moves are zeroed and then the rest
;;       re-normalized (sketched after this list)
;; - MCTS # of iterations was 800
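
;; A sketch of the zero-and-renormalize step; `priors` maps move ->
;; probability and `legal?` is hypothetical. Assumes at least one
;; legal move received positive prior mass.
(define (mask-policy priors legal?)
  (define kept
    (for/hash ([(m p) (in-hash priors)] #:when (legal? m))
      (values m p)))
  (define total (for/sum ([p (in-hash-values kept)]) p))
  (for/hash ([(m p) (in-hash kept)])
    (values m (/ p total))))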

;; XXX Ultimate Tic-Tac-Toe: A 3x3 grid of boards; if you play in
;; cell (2,1) of a board, then the opponent must play on board (2,1)
;; next.

;; XXX Connect-4: A 7-column x 6-row grid where you try to get 4 in a
;; row. It is solved: the first player wins with perfect play by
;; starting in the middle column.

;; XXX Caylus: 35 spots to choose and 26 different buildings

;; XXX Hansa Teutonica (see other file): there are billions of options.

;; XXX Tigris & Euphrates: 11x16 spots, 4 different kinds of tiles,
;; plus 2 catastrophe tiles, and 4 leaders (with the ability to
;; remove one, also.) Consider this as 11x16+1 = 177 spots with 9
;; things to place (tiles, catastrophe, and leaders), i.e. 177 * 9 =
;; 1593 placements; you get two actions per turn, so that's 1593^2 ≈
;; 2.5 million possibilities.

;; XXX HT + T&E --- Abstract the actions into meta-actions like (HT)
;; "Open a new route", "Finish a route", and so on, or (T&E) "Cause a
;; revolt", "Cause a war", "Establish a kingdom", and so on. Then
;; select among the possible ways to do that thing. I found some
;; papers about these ideas, but haven't read them yet:
;; https://project.dke.maastrichtuniversity.nl/games/files/msc/Roelofs_thesis.pdf
;; and https://arxiv.org/pdf/1805.09613.pdf

;; XXX HT (based on above): Another way is to think about a
;; "strategic" and a "tactical" AI. The strategic one chooses which
;; cities/offices it will target and the tactical one takes that goal
;; and determines the best way to achieve it. The hard part is
;; incorporating reward into things... what does the money-bag give
;; you? It really gives you more tactical options, but how do you
;; quantify that? It is clear that the key and the medal give you
;; points, the Liber Sophiae gives you a merchant, and privilege
;; gives you options, but the money bag just gives you more moveable
;; pieces. Perhaps I could quantify things by keeping track of how
;; many turns it would take to claim a route/city. So, in the
;; beginning you have 2 units per turn, but if you get displaced it
;; is as if you have 3 units, because you can place one and then move
;; one.

;; Roelofs ---
;; https://project.dke.maastrichtuniversity.nl/games/files/msc/Roelofs_thesis.pdf
;; --- has a big idea on pages 24 and 26. There are multiple
;; dimensions of decisions (move, complete, and so on): you first
;; select a dimension, then an action from that dimension, and so on,
;; until you have built a complete action. Because of the asymmetric
;; exploration of the tree, you'll ignore unpromising
;; sub-actions/dimension orderings. This requires "partial move
;; completion" to turn an incomplete action into a complete one. (In
;; the experiment, the dimensions are the nodes to move soldiers
;; from.) A sketch follows.
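
;; A sketch of that idea under stated assumptions: a complete action
;; is built by descending one tree level per dimension, reusing the
;; `select-child` tree policy and `lcrs-move` from above; `complete?`
;; is a hypothetical test for a finished action.
(define (select-complete-action v complete?)
  (let loop ([v v] [partial '()])
    (if (complete? partial)
        (reverse partial) ; sub-actions in the order they were chosen
        (let ([c (select-child v)]) ; pick a dimension / sub-action
          (loop c (cons (lcrs-move c) partial))))))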
