Name : Pratik Warade

# Reinforcement Learning Solution to the Towers of Hanoi Puzzle

## Overview :
In this assignment, I implement reinforcement learning to solve the Towers of Hanoi puzzle with 3 disks and, as extra credit, with 4 disks.

### Background :


The Tower of Hanoi game consists of three pegs and a number of disks of different sizes which can slide onto the pegs. The puzzle starts with all disks stacked on the first peg in ascending order, with the largest at the bottom and the smallest on top. The objective of the game is to move all the disks to the third peg. The only legal moves are those which take the top-most disk from one peg to another, with the restriction that a disk may never be placed upon a smaller disk. Figure 1 shows the optimal solution for 4 disks.
<img src= "http://www.cs.colostate.edu/~pswarade/pratik_imp/Tower_of_Hanoi.gif">

### What is RL (Reinforcement Learning)? 
In simple words, reinforcement learning is about learning to make good decisions through trial and error.  It is framed as an interaction between an "agent" and an "environment".  
The following steps repeat until a termination condition is reached:  
  1) The agent observes the environment, which is in state s  
  2) Out of all possible actions, the agent decides which action to take (this choice is made by the "policy", a function that outputs an action given the current state)  
  3) The agent takes the action, and the environment receives it  
  4) Through a transition matrix model, the environment determines the next state and proceeds to it  
  5) Through a reward distribution model, the environment determines the reward given to the agent for taking action a in state s
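
The five steps above can be sketched as a short loop. This is only an illustration: `run_episode` and the toy environment passed to it (walk from 0 to 3 with a reward of -1 per step) are hypothetical names and values chosen for this example, not part of the assignment code.

```python
def run_episode(start_state, policy, transition, reward, is_terminal, max_steps=100):
    """Run one agent-environment episode; return a list of (state, action, reward)."""
    state = start_state
    history = []
    for _ in range(max_steps):
        if is_terminal(state):                   # termination condition
            break
        action = policy(state)                   # steps 1-2: observe the state, pick an action
        next_state = transition(state, action)   # steps 3-4: environment moves to the next state
        r = reward(state, action, next_state)    # step 5: environment emits a reward
        history.append((state, action, r))
        state = next_state
    return history

# Toy environment (hypothetical): always step +1 until state 3 is reached.
history = run_episode(
    start_state=0,
    policy=lambda s: 1,
    transition=lambda s, a: s + a,
    reward=lambda s, a, s2: -1,
    is_terminal=lambda s: s == 3,
)
print(history)  # [(0, 1, -1), (1, 1, -1), (2, 1, -1)]
```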
  
  
  
  
<img src= "http://www.cs.colostate.edu/~pswarade/pratik_imp/reinforcement.png">

## Requirements:
First, how should we represent the state of this puzzle?  We need to keep track of which disks are on which pegs. I name the disks 1, 2, and 3, with 1 being the smallest disk and 3 being the largest. The set of disks on a peg can be represented as a list of integers, ordered from top to bottom.  Then the state can be a list of three lists.

For example, the starting state with all disks being on the left peg would be `[[1, 2, 3], [], []]`.  After moving disk 1 to peg 2, we have `[[2, 3], [1], []]`.

To represent that move I just made, I can use a list of two peg numbers, like `[1, 2]`, representing a move of the top disk on peg 1 to peg 2.
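
As a quick illustration of this representation, the sketch below applies a move to a state. The helper name `apply_move` is hypothetical and it performs no legality check; the notebook's own `makeMove` is defined further down.

```python
state = [[1, 2, 3], [], []]   # all disks on peg 1, top disk first
move = [1, 2]                 # move the top disk of peg 1 to peg 2

def apply_move(state, move):
    """Hypothetical helper: return a new state with `move` applied (no legality check)."""
    new_state = [list(peg) for peg in state]   # copy so the input is not mutated
    src, dst = move[0] - 1, move[1] - 1         # pegs are numbered from 1
    disk = new_state[src].pop(0)                # the top disk is the first element
    new_state[dst].insert(0, disk)              # it becomes the top of the destination peg
    return new_state

after = apply_move(state, move)
print(after)   # [[2, 3], [1], []]
```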



Now on to some functions. Define at least the following functions. Examples showing required output appear below.

   - `printState(state)`: prints the state in the form shown below
   - `validMoves(state)`: returns list of moves that are valid from `state`
   - `makeMove(state, move)`: returns new (copy of) state after move has been applied.
   - `trainQ(nRepetitions, learningRate, epsilonDecayFactor, validMovesF, makeMoveF)`: train the Q function for number of repetitions, decaying epsilon at start of each repetition. Returns Q and list or array of number of steps to reach goal for each repetition.
   - `testQ(Q, maxSteps, validMovesF, makeMoveF)`: without updating Q, use Q to find greedy action each step until goal is found. Return path of states.

A function that you might choose to implement is

   - `stateMoveTuple(state, move)`: returns tuple of state and move.  
    
This is useful for converting state and move to a key to be used for the Q dictionary.
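
The reason a conversion is needed is that Python lists are not hashable, so a state made of lists cannot be used directly as a dictionary key. A minimal sketch of the difference (the variable names here are for illustration only):

```python
state = [[1, 2, 3], [], []]
move = [1, 2]

Q = {}
try:
    Q[(state, move)] = 0          # a tuple that contains lists is still unhashable
    list_key_ok = True
except TypeError:
    list_key_ok = False

# Converting every list to a tuple gives a hashable key.
key = (tuple(tuple(peg) for peg in state), tuple(move))
Q[key] = 0
print(list_key_ok)   # False
print(key)           # (((1, 2, 3), (), ()), (1, 2))
```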

In [1]:
# Required Imports 
import numpy as np
from random import choice
from copy import deepcopy
from copy import copy


In [2]:
# Let's start by printing Towers of Hanoi states. The code below calls getState, which returns the peg lists p, q, r for the input board.
# printState then pads each peg list with blanks according to its length and prints the board using .format.

def getState(state):
    # Return deep copies of the three pegs (p, q, r) so callers can modify them
    # without changing the original state
    p, q, r = deepcopy(state)
    return (p,q,r)


def printState(state):
    #creating board skeleton
    tableformat = '''
               {} {} {} 
               {} {} {}
               {} {} {}
               -----
             '''
    p,q,r = getState(state)
    # Pad peg p with blanks at the top so it always has three rows
    if (len(p)==0):
        p=[' ',' ',' ']
    elif (len(p)==1):
        p=[' ',' ',p[0]]
    elif (len(p)==2):
        p=[' ',p[0],p[1]]
    elif (len(p)==3):
        p=[p[0],p[1],p[2]]
        
    # Pad peg q with blanks at the top so it always has three rows
    if (len(q)==0):
        q=[' ',' ',' ']
    elif (len(q)==1):
        q=[' ',' ',q[0]]
    elif (len(q)==2):
        q=[' ',q[0],q[1]]
    elif (len(q)==3):
        q=[q[0],q[1],q[2]]
        
    # Pad peg r with blanks at the top so it always has three rows
    if (len(r)==0):
        r=[' ',' ',' ']
    elif (len(r)==1):
        r=[' ',' ',r[0]]
    elif (len(r)==2):
        r=[' ',r[0],r[1]]
    elif (len(r)==3):
        r=[r[0],r[1],r[2]]
        
    #Inserting required values on given positions of board and printing it
    tableformat_final = tableformat.format(p[0],q[0],r[0],p[1],q[1],r[1],p[2],q[2],r[2])
    print (tableformat_final)


In [3]:
state= [[1, 2, 3], [], []]
printState(state)


               1     
               2    
               3    
               -----
             


In [4]:
# Next is the function that returns the valid moves. It follows the game rules: only one disk moves at a time, and a larger disk can never be placed on a smaller one.
# Logic: check the lengths of p, q, and r; for each non-empty peg, a move to another peg is valid if that peg is empty or its top disk is larger than the disk being moved.
# Collect all valid moves in a list.
def validMoves(state):
    solution=[]
    p,q,r=getState(state)
    '''                     FOR LENGTH of 3 '''
    if(len(p)==3):
        return ([1,2],[1,3])
    if(len(q)==3):
        return ([2,1],[2,3])
    if(len(r)==3):
        return ([3,1],[3,2])
    '''                     FOR LENGTH of 2 '''
    #for P if len of 2
    if(len(p)==2):
        if( len(q))==0:
            solution.append( [1,2])
        else:           
            if(p[0]<q[0]):
                solution.append( [1,2])
        if( len(r))==0:
           # print("pratik")
            solution.append( [1,3])
        else:
            if(p[0]<r[0]):
                solution.append( [1,3])
    #for q if len of 2
    if(len(q)==2):
        if( len(p))==0:
            
            solution.append( [2,1])
        else:           
            if(q[0]<p[0]):
                solution.append( [2,1])
        if( len(r))==0:
            #print("pratik")
            solution.append( [2,3])
        else:
            if(q[0]<r[0]):
                solution.append( [2,3])
     #for r if len of 2
    if(len(r)==2):
        if( len(p))==0:
            solution.append( [3,1])
        else:           
            if(r[0]<p[0]):
                solution.append( [3,1])
        if( len(q))==0:
           # print("pratik")
            solution.append( [3,2])
        else:
            if(r[0]<q[0]):
                solution.append( [3,2])
                
    '''                     FOR LENGTH of 1 '''
    
    if(len(p)==1): 
        if(len(q)>0 ):
            if(p[0]<q[0]):
                solution.append( [1,2])
        if ( len(r)>0):
            if(p[0]<r[0]):
                solution.append( [1,3])
        
        if (len(q)==0):
                solution.append( [1,2])
        if(len(r)==0):
                solution.append( [1,3])
    
    #for q if len of 1
    if(len(q)==1): 
        if(len(p)>0 ):
            if(q[0]<p[0]):
                solution.append( [2,1])
        if ( len(r)>0):
            if(q[0]<r[0]):
                solution.append( [2,3])
        
        if (len(p)==0):
                solution.append( [2,1])
        if(len(r)==0):
                solution.append( [2,3])
    #for r if len of 1
    if(len(r)==1): 
        if(len(p)>0): 
            if(r[0]<p[0]):
                solution.append( [3,1])
        if (len(q)>0):
            if(r[0]<q[0]):
                solution.append( [3,2])
        
        if (len(p)==0):
                solution.append( [3,1])
        if(len(q)==0):
                solution.append( [3,2])
                
    return (solution)
        
        
    

In [5]:
validMoves(state)

([1, 2], [1, 3])

In [6]:
# Once we have a valid move, it is time to make it.
# makeMove first gets copies of the three pegs, then reads the source peg (move[0]) and the destination peg (move[1]). The tmp list holds the disk taken off the source peg until it is inserted on top of the destination peg.

def makeMove(states, move): 
    newstate=[]
    tmp=[]
    p,q,r=getState(states)
    #print(p)
    if(move[0]==1):  
        try:
    #    print(p[0])
            tmp.append(p[0]) 
            p.pop(0)
        except:
            print("INVALID")
    elif(move[0]==2):
        try:
            tmp.append(q[0])
            q.pop(0)
        except:
            print("INVALID")
    elif(move[0]==3):
        try:
            tmp.append(r[0])
            r.pop(0)
        except:
            print("INVALID")
            
    if(move[1]==1):  
        try:
            p.insert(0,tmp[0])
        except:
            print("INVALID")
    elif(move[1]==2):
        try:
            q.insert(0,tmp[0])
        except:
            print("INVALID")
    elif(move[1]==3):
        try:
            r.insert(0,tmp[0])
        except:
            print("INVALID")
      
                
    
   # print(p)    
    return [p,q,r]

In [7]:
move =[1, 2]
newstate = makeMove(state, move)
newstate

[[2, 3], [1], []]

In [8]:
# Handy function to convert the list-of-lists state and the move into nested tuples. Tuples are hashable, so the result can be used as a key when updating Q values.
def stateMoveTuple(state, move):
    p,q,r=getState(state)
    #print(p,q,r)
    return((tuple(p),tuple(q),tuple(r)),tuple(move))

In [9]:
stateMoveTuple(state, move)

(((1, 2, 3), (), ()), (1, 2))

## TrainQ


This function takes the arguments nRepetitions, learningRate, epsilonDecayFactor, validMovesF, and makeMoveF. In this function we learn Q, a form of reinforcement learning in which the agent learns to assign values to state-move pairs. 
  
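The Q-value of a state-move pair is the sum of all reinforcements the agent expects to receive from that point until the goal is reached, and the Q function is stored as a mapping (here a dictionary) from state-move pairs to values. In this implementation every move receives a reinforcement of -1, so a Q-value is roughly the negative of the number of moves still needed to reach the goal.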

If the agent knew the Q-values of every state-move pair, it could use this information to select an action for each state. 

Initially the agent has no idea what the Q-values of any state-move pairs are. The agent's goal is to settle on an optimal Q-value function, one that assigns the appropriate values to all state-move pairs. But Q-values depend on future reinforcements as well as current ones.


If a move in a given state causes something bad to happen, learn not to make that move in that state. If a move in a given state causes something good to happen, learn to make that move in that state.
If all moves in a given state cause something bad to happen, learn to avoid that state. That is, don't make moves in other states that would lead you into that bad state. If any move in a given state causes something good to happen, learn to like that state.

Because of Q-learning, the agent is able to learn high or low values for particular moves from a particular state, even when there is no immediate reinforcement associated with those moves.


To slowly transition from taking random actions to taking the action currently believed to be best, called the greedy action, we slowly decay a parameter, ϵ, from 1 down towards 0 and use it as the probability of selecting a random action. This is called the ϵ-greedy policy.
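
To get a feel for how quickly exploration fades, here is a small sketch of the decay schedule, assuming the decay factor 0.7 that is passed to `trainQ` later in this notebook and that ϵ is multiplied by the factor at the start of every repetition.

```python
epsilon = 1.0
decay = 0.7          # epsilonDecayFactor used in the trainQ calls in this notebook
schedule = []
for repetition in range(10):
    epsilon *= decay                     # decay at the start of each repetition
    schedule.append(round(epsilon, 4))
print(schedule)
```

After 10 repetitions ϵ is already below 0.03, so almost every move from then on is the greedy one.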


  
$$Q(s_t,a_t) = Q(s_t,a_t) + \rho\,\bigl(r_{t+1} + Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\bigr)$$
The term in parentheses is the temporal-difference error, and the equation above uses it to adjust the Q-value of the previous board and move. The learning rate, rho, controls the learning step size, that is, how fast learning takes place. The new Q-value for a state and move is a weighted combination of the old Q-value and the new estimate, which is the reinforcement received plus the Q-value of the next state and move. In this implementation the reinforcement is -1 for every move. On any given time step the agent has a choice: it can pick the move with the highest Q-value for the state it is in (exploitation), or it can pick a move at random, which happens when `np.random.uniform() < epsilon` (exploration).
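
A minimal sketch of these two pieces, the ϵ-greedy choice and the update rule, is shown below. The function names `epsilon_greedy` and `td_update` and the string keys are hypothetical and only for illustration. The reinforcement of -1 and the default Q-value of -1 for unseen pairs mirror the training code in this notebook. Because the update uses the Q-value of the move actually taken next, this on-policy form is commonly called SARSA.

```python
import numpy as np

def epsilon_greedy(Q, state_key, moves, epsilon, rng):
    """Pick a random move with probability epsilon, otherwise the greedy move."""
    if rng.uniform() < epsilon:
        return moves[rng.integers(len(moves))]            # exploration
    qs = [Q.get((state_key, tuple(m)), -1) for m in moves]
    return moves[int(np.argmax(qs))]                      # exploitation

def td_update(Q, old_key, new_key, rho, r=-1):
    """Q(s_t,a_t) += rho * (r + Q(s_t+1,a_t+1) - Q(s_t,a_t))."""
    Q[old_key] += rho * (r + Q[new_key] - Q[old_key])

# Worked example with rho = 0.5 and two state-move keys whose values start at 0.
Q = {"old": 0.0, "new": 0.0}
td_update(Q, "old", "new", rho=0.5)
first = Q["old"]     # 0 + 0.5 * (-1 + 0 - 0)       = -0.5
td_update(Q, "old", "new", rho=0.5)
second = Q["old"]    # -0.5 + 0.5 * (-1 + 0 + 0.5)  = -0.75
print(first, second)
```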




## TestQ : 

The testQ function takes the arguments (Q, maxSteps, validMovesF, makeMoveF). It is simpler than trainQ because no Q values are updated. Starting from the initial state, it uses the Q values learned by trainQ to pick the greedy move at each step: argmax selects the valid move with the highest Q value, the move is applied, and the new state is appended to the path. This repeats until the goal is found or maxSteps moves have been made. The function returns the path as a list of states.
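
The greedy rollout described above can be sketched independently of the puzzle size. Everything in this block is an assumption made for illustration: the names `greedy_path`, `toy_valid_moves`, and `toy_make_move`, the one-disk toy puzzle, and the hand-written `toy_Q`. State-move pairs missing from Q are given negative infinity here so that they are never preferred.

```python
import numpy as np

def greedy_path(Q, start, valid_moves, make_move, is_goal, max_steps):
    """Follow the highest-valued move at every step without updating Q."""
    state = start
    path = [state]
    for _ in range(max_steps):
        if is_goal(state):
            break
        moves = valid_moves(state)
        key = tuple(tuple(peg) for peg in state)
        qs = [Q.get((key, tuple(m)), -np.inf) for m in moves]
        state = make_move(state, moves[int(np.argmax(qs))])
        path.append(state)
    return path

# One-disk toy puzzle (hypothetical) to exercise the function.
def toy_valid_moves(state):
    src = next(i for i, peg in enumerate(state) if peg) + 1
    return [[src, dst] for dst in (1, 2, 3) if dst != src]

def toy_make_move(state, move):
    new_state = [list(peg) for peg in state]
    new_state[move[1] - 1].insert(0, new_state[move[0] - 1].pop(0))
    return new_state

toy_Q = {(((1,), (), ()), (1, 2)): -1.0,
         (((1,), (), ()), (1, 3)): 0.0}
path = greedy_path(toy_Q, [[1], [], []], toy_valid_moves, toy_make_move,
                   lambda s: s == [[], [], [1]], max_steps=5)
print(path)   # [[[1], [], []], [[], [], [1]]]
```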

In [10]:
def foundGoal(board):
    # Return True when the board is the goal state
    return board == [[],[],[1,2,3]]

def trainQ(nRepetitions, learningRate, epsilonDecayFactor, validMovesF, makeMoveF): 
    # Required imports so the function passes the grader file
    import numpy as np
    from random import choice
    from copy import deepcopy
    from copy import copy
    
    results = []                           #List to return i.e StepstoGoal.
    maxGames = nRepetitions                           # number of games
    rho = learningRate                              # learning rate
    epsilonExp = epsilonDecayFactor                      # rate of epsilon decay
    Q = {}                                  # initialize Q dictionary
    epsilon = 1.0                           # initial epsilon value
    printMoves = True                        # flag to print each board change                        
    for game in range(maxGames):
    
        epsilon *= epsilonExp
        step = 0
        board = [[1, 2, 3], [], []]        # Initial Step
        done = False
        # To debug which moves are taken, uncomment the if block below
        #  if printMoves:
        #     printState(board)
        while not done:
            step += 1
            validMoves1 = validMovesF(board)    # use the function passed in by the caller
            if np.random.uniform() < epsilon:
                # Explore: choose a random valid move
                move = validMoves1[np.random.choice(len(validMoves1))]
            else:
                # Exploit: choose the move with the highest Q value (unseen pairs default to -1)
                Qs = np.array([Q.get(stateMoveTuple(board,m), -1) for m in validMoves1]) 
                move= validMoves1[ np.argmax(Qs) ]  
            # Build the dictionary key from the board and the move
            key=stateMoveTuple(board,move)
            if key not in Q:
                 Q[key] = 0
            # Make the move using the function passed in by the caller
            movePlayed=makeMoveF(board,move)
            boardNew = deepcopy(movePlayed)
            if foundGoal(boardNew):
                Q[key] = 0
                done = True
            if step > 1:
                Q[stateMoveTuple(boardOld,moveOld)] += rho * (-1+Q[stateMoveTuple(board,move)] - Q[stateMoveTuple(boardOld,moveOld)])
            boardOld,moveOld = deepcopy(board),copy(move)
            board = boardNew
        results.append((step))    
    return (Q,results)

In [11]:
Q, stepsToGoal = trainQ(50, 0.5, 0.7, validMoves, makeMove)

In [12]:
Q

{(((), (1,), (2, 3)), (2, 3)): 0,
 (((), (1,), (2, 3)), (3, 1)): -0.9375,
 (((), (1, 2), (3,)), (2, 1)): -1.9999999999999467,
 (((), (1, 2), (3,)), (2, 3)): -2.3828125,
 (((), (1, 2), (3,)), (3, 1)): -2.85888671875,
 (((), (1, 2, 3), ()), (2, 1)): -3.84375,
 (((), (1, 2, 3), ()), (2, 3)): -3.9189453125,
 (((), (1, 3), (2,)), (2, 1)): -3.8134765625,
 (((), (1, 3), (2,)), (2, 3)): -3.59521484375,
 (((), (1, 3), (2,)), (3, 1)): -3.48681640625,
 (((), (2,), (1, 3)), (2, 1)): -1.8125,
 (((), (2,), (1, 3)), (3, 1)): -1.89892578125,
 (((), (2,), (1, 3)), (3, 2)): -1.90625,
 (((), (2, 3), (1,)), (2, 1)): -3.590576171875,
 (((), (2, 3), (1,)), (3, 1)): -3.453125,
 (((), (2, 3), (1,)), (3, 2)): -3.8642578125,
 (((), (3,), (1, 2)), (2, 1)): -4.2489013671875,
 (((), (3,), (1, 2)), (3, 1)): -4.0555419921875,
 (((), (3,), (1, 2)), (3, 2)): -3.8560791015625,
 (((1,), (), (2, 3)), (1, 2)): -0.5,
 (((1,), (), (2, 3)), (1, 3)): 0,
 (((1,), (), (2, 3)), (3, 2)): -1.375,
 (((1,), (2,), (3,)), (1, 2)): -1.

In [13]:
stepsToGoal

[30,
 215,
 29,
 27,
 30,
 32,
 11,
 14,
 25,
 19,
 9,
 9,
 33,
 26,
 11,
 7,
 33,
 7,
 9,
 7,
 10,
 15,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7,
 7]

In [14]:
def testQ(Q, maxSteps, validMovesF, makeMove): 
    # Import for grader file to work
    import numpy as np
    from random import choice
    from copy import deepcopy
    from copy import copy

    printMoves = True                        # flag to print each board change
    steps=0
   
    board1= [[1, 2, 3], [], []]   #initialize board
    board=deepcopy(board1)
    path=[board]
    while steps<maxSteps and not foundGoal(board):  #loop through each step
        #print(steps, board)
        steps=steps+1
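        validMoves1 = validMovesF(board)   # get valid moves from the function passed in
        # Look up the Q value of every valid move; pairs never seen in training
        # get -inf so that argmax never has to compare None values
        q=[]
        for m in validMoves1:
            q.append(Q.get(stateMoveTuple(board,m), -np.inf))
        move= validMoves1[ np.argmax(q) ]  # greedy move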
        board = makeMove(board,move)       #make move
        path.append(board)          #Append the found path
        
    return path

In [15]:
path1=testQ(Q, 20, validMoves, makeMove)
path1


[[[1, 2, 3], [], []],
 [[2, 3], [], [1]],
 [[3], [2], [1]],
 [[3], [1, 2], []],
 [[], [1, 2], [3]],
 [[1], [2], [3]],
 [[1], [], [2, 3]],
 [[], [], [1, 2, 3]]]

In [16]:
for s in path1:
    printState(s)
    


               1     
               2    
               3    
               -----
             

                     
               2    
               3   1
               -----
             

                     
                    
               3 2 1
               -----
             

                     
                 1  
               3 2  
               -----
             

                     
                 1  
                 2 3
               -----
             

                     
                    
               1 2 3
               -----
             

                     
                   2
               1   3
               -----
             

                   1 
                   2
                   3
               -----
             


In [17]:
%run -i A5grader.py


Testing validMoves([[1], [2], [3]])

--- 10/10 points. Correctly returned [[1, 2], [1, 3], [2, 3]]

Testing validMoves([[], [], [1, 2, 3]])

--- 10/10 points. Correctly returned ([3, 1], [3, 2])

Testing makeMove([[], [], [1, 2, 3]], [3, 2])

--- 10/10 points. Correctly returned [[], [1], [2, 3]]

Testing makeMove([[2], [3], [1]], [1, 2])

--- 10/10 points. Correctly returned [[], [2, 3], [1]]

Testing   Q, steps = trainQ(1000, 0.5, 0.7, validMoves, makeMove).

--- 10/10 points. Q dictionary has correct number of entries.

--- 10/10 points. The mean of the number of steps is 7.542 which is correct.

Testing   path = testQ(Q, 20, validMoves, makeMove).

--- 20/20 points. Correctly returns path of length 8, less than 10.

C:\Users\waradepratik Execution Grade is 80/80

 Remaining 20 points will be based on your text describing the trainQ and test! functions.

C:\Users\waradepratik FINAL GRADE is __/100


# Extra Credit

Task: Modify your code to solve the Towers of Hanoi puzzle with 4 disks instead of 3.
I have implemented separate trainQ and testQ functions because the code needed modifications for the fourth disk.
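
As an aside, the move rules themselves do not depend on the number of disks, so a single function could in principle cover both the 3-disk and the 4-disk case. The sketch below is a hypothetical alternative (`valid_moves_any` is not part of the graded code) and assumes the same state and move representation used throughout this notebook.

```python
def valid_moves_any(state):
    """Return all valid [source, destination] moves for any number of disks."""
    moves = []
    for src, src_peg in enumerate(state, start=1):
        if not src_peg:
            continue                      # nothing to move from an empty peg
        for dst, dst_peg in enumerate(state, start=1):
            # Valid if the destination is empty or its top disk is larger
            if dst != src and (not dst_peg or src_peg[0] < dst_peg[0]):
                moves.append([src, dst])
    return moves

print(valid_moves_any([[2, 3, 4], [1], []]))   # [[1, 3], [2, 1], [2, 3]]
print(valid_moves_any([[1], [2], [3]]))        # [[1, 2], [1, 3], [2, 3]]
```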

In [34]:
def printState_4disk(state):
   
    #creating board skeleton
    tableformat = '''
               {} {} {}
               {} {} {} 
               {} {} {}
               {} {} {}
               -----
             '''
    p,q,r = getState(state)
    # Pad peg p with blanks at the top so it always has four rows
   # print(len(p))
    if (len(p)==0):
        p=[' ',' ',' ',' ']
    elif (len(p)==1):
        p=[' ',' ',' ',p[0]]
    elif (len(p)==2):
        p=[' ',' ',p[0],p[1]]
    elif (len(p)==3):
        p=[' ',p[0],p[1],p[2]]
    elif (len(p)==4):
        p=[p[0],p[1],p[2],p[3]]
    # Pad peg q with blanks at the top so it always has four rows
    if (len(q)==0):
        q=[' ',' ',' ',' ']
    elif (len(q)==1):
        q=[' ',' ',' ',q[0]]
    elif (len(q)==2):
        q=[' ',' ',q[0],q[1]]
    elif (len(q)==3):
        q=[' ',q[0],q[1],q[2]]
    elif (len(q)==4):
        q=[q[0],q[1],q[2],q[3]]
    # Pad peg r with blanks at the top so it always has four rows
    if (len(r)==0):
        r=[' ',' ',' ',' ']
    elif (len(r)==1):
        r=[' ',' ',' ',r[0]]
    elif (len(r)==2):
        r=[' ',' ',r[0],r[1]]
    elif (len(r)==3):
        r=[' ',r[0],r[1],r[2]]
    elif (len(r)==4):
        r=[r[0],r[1],r[2],r[3]]
    #Inserting required values on given positions of board and printing it
    tableformat_final = tableformat.format(p[0],q[0],r[0],p[1],q[1],r[1],p[2],q[2],r[2],p[3],q[3],r[3])
    print (tableformat_final)

    
    

In [35]:
state=[[1,2,3,4],[],[]]
printState_4disk(state)


               1    
               2     
               3    
               4    
               -----
             


In [36]:
def validMoves_4disk(state):
    solution=[]
    p,q,r=getState(state)

    '''                     FOR LENGTH of 4 '''
    if(len(p)==4):
        return ([1,2],[1,3])
    if(len(q)==4):
        return ([2,1],[2,3])
    if(len(r)==4):
        return ([3,1],[3,2])
    
    '''                     FOR LENGTH of 3 '''
    if(len(p)==3):
        if( len(q))==0:
            solution.append( [1,2])
        else:    
            if(p[0]<q[0]):
                solution.append( [1,2])
        if( len(r))==0:
           # print("pratik")
            solution.append( [1,3])
        else:
            if(p[0]<r[0]):
                solution.append( [1,3])
       
    if(len(q)==3):
        if( len(p))==0:
            
            solution.append( [2,1])
        else:
            if(q[0]<p[0]):
                solution.append( [2,1])
        if( len(r))==0:
            #print("pratik")
            solution.append( [2,3])
        else:
            if(q[0]<r[0]):
                solution.append( [2,3])
    
    if(len(r)==3):
        if( len(p))==0:
            solution.append( [3,1])
        else:       
            if(r[0]<p[0]):
                solution.append( [3,1])
        if( len(q))==0:
           # print("pratik")
            solution.append( [3,2])
        else:
            if(r[0]<q[0]):
                solution.append( [3,2])
    '''                     FOR LENGTH of 2 '''
    #for P if len of 2
    if(len(p)==2):
        if( len(q))==0:
            solution.append( [1,2])
        else:       
            if(p[0]<q[0]):
                solution.append( [1,2])
        if( len(r))==0:
           # print("pratik")
            solution.append( [1,3])
        else:
            if(p[0]<r[0]):
                solution.append( [1,3])
    #for q if len of 2
    if(len(q)==2):
        if( len(p))==0:
            
            solution.append( [2,1])
        else:      
            if(q[0]<p[0]):
                solution.append( [2,1])
        if( len(r))==0:
            #print("pratik")
            solution.append( [2,3])
        else:
            if(q[0]<r[0]):
                solution.append( [2,3])
     #for r if len of 2
    if(len(r)==2):
        if( len(p))==0:
            solution.append( [3,1])
        else:      
            if(r[0]<p[0]):
                solution.append( [3,1])
        if( len(q))==0:
           # print("pratik")
            solution.append( [3,2])
        else:
            if(r[0]<q[0]):
                solution.append( [3,2])
                
    '''                     FOR LENGTH of 1 '''
    
    if(len(p)==1): 
        if(len(q)>0 ):
            if(p[0]<q[0]):
                solution.append( [1,2])
        if ( len(r)>0):
            if(p[0]<r[0]):
                solution.append( [1,3])
        
        if (len(q)==0):
                solution.append( [1,2])
        if(len(r)==0):
                solution.append( [1,3])
    
    
    if(len(q)==1): 
        if(len(p)>0 ):
            if(q[0]<p[0]):
                solution.append( [2,1])
        if ( len(r)>0):
            if(q[0]<r[0]):
                solution.append( [2,3])
        
        if (len(p)==0):
                solution.append( [2,1])
        if(len(r)==0):
                solution.append( [2,3])
    if(len(r)==1): 
        if(len(p)>0): 
            if(r[0]<p[0]):
                solution.append( [3,1])
        if (len(q)>0):
            if(r[0]<q[0]):
                solution.append( [3,2])
        
        if (len(p)==0):
                solution.append( [3,1])
        if(len(q)==0):
                solution.append( [3,2])
                
    return (solution)
   

In [37]:
state=[[2,3,4],[1],[]]
validMoves_4disk(state)

[[1, 3], [2, 1], [2, 3]]

In [38]:
def makeMove_4disk(states, move):
    newstate=[]
    tmp=[]
    p,q,r=getState(states)
    #print(p)
    if(move[0]==1):  
        try:
    #    print(p[0])
            tmp.append(p[0]) 
            p.pop(0)
        except:
            print("INVALID")
    elif(move[0]==2):
        try:
            tmp.append(q[0])
            q.pop(0)
        except:
            print("INVALID")
    elif(move[0]==3):
        try:
            tmp.append(r[0])
            r.pop(0)
        except:
            print("INVALID")
            
    if(move[1]==1):  
        try:
            p.insert(0,tmp[0])
        except:
            print("INVALID")
    elif(move[1]==2):
        try:
            q.insert(0,tmp[0])
        except:
            print("INVALID")
    elif(move[1]==3):
        try:
            r.insert(0,tmp[0])
        except:
            print("INVALID")        
    
   # print(p)    
    return [p,q,r]
    

In [39]:
move=[1,3]
state=[[1,2,3,4],[],[]]
makeMove_4disk(state,move)

[[2, 3, 4], [], [1]]

In [40]:
def foundGoal_4(board):
    # Return True when the board is the 4-disk goal state
    return board == [[],[],[1,2,3,4]]

def trainQ4(nRepetitions, learningRate, epsilonDecayFactor, validMovesF, makeMoveF): 
    # Required imports so the function passes the grader file
    import numpy as np
    from random import choice
    from copy import deepcopy
    from copy import copy
    
    results = []                           #List to return i.e StepstoGoal.
    maxGames = nRepetitions                           # number of games
    rho = learningRate                              # learning rate
    epsilonExp = epsilonDecayFactor                      # rate of epsilon decay
    Q = {}                                  # initialize Q dictionary
    epsilon = 1.0                           # initial epsilon value
    printMoves = True                        # flag to print each board change                        
    for game in range(maxGames):
    
        epsilon *= epsilonExp
        step = 0
        board = [[1, 2, 3,4], [], []]        # Initial Step
        done = False
        # To debug which moves are taken, uncomment the if block below
        #  if printMoves:
        #     printState(board)
        while not done:
            step += 1
            validMoves1 = validMovesF(board)    # use the function passed in by the caller
            if np.random.uniform() < epsilon:
                # Explore: choose a random valid move
                move = validMoves1[np.random.choice(len(validMoves1))]
            else:
                # Exploit: choose the move with the highest Q value (unseen pairs default to -1)
                Qs = np.array([Q.get(stateMoveTuple(board,m), -1) for m in validMoves1]) 
                move= validMoves1[ np.argmax(Qs) ]  
            # Build the dictionary key from the board and the move
            key=stateMoveTuple(board,move)
            if key not in Q:
                 Q[key] = 0
            # Make the move using the function passed in by the caller
            movePlayed=makeMoveF(board,move)
            boardNew = deepcopy(movePlayed)
            if foundGoal_4(boardNew):
                Q[key] = 0
                done = True
            if step > 1:
                Q[stateMoveTuple(boardOld,moveOld)] += rho * (-1+Q[stateMoveTuple(board,move)] - Q[stateMoveTuple(boardOld,moveOld)])
            boardOld,moveOld = deepcopy(board),copy(move)
            board = boardNew
        results.append((step))    
    return (Q,results)

def testQ4(Q, maxSteps, validMovesF, makeMove): 
    # Imports for grader file to work
    import numpy as np
    from random import choice
    from copy import deepcopy
    from copy import copy

    printMoves = True                        # flag to print each board change
    steps=0
   
    board1= [[1, 2, 3,4], [], []]   #initialize board
    board=deepcopy(board1)
    path=[board]
    while steps<maxSteps and not foundGoal_4(board): 
        #print(steps, board)
        steps=steps+1
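        validMoves1 = validMovesF(board)   # get valid moves from the function passed in
        # Look up the Q value of every valid move; pairs never seen in training
        # get -inf so that argmax never has to compare None values
        q=[]
        for m in validMoves1:
            q.append(Q.get(stateMoveTuple(board,m), -np.inf))
        move= validMoves1[ np.argmax(q) ]  # greedy move
        board = makeMove(board,move)       # makeMove here is the function passed in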
        path.append(board)
        
    return path

After trying a number of combinations of the number of repetitions, learning rate, and epsilon decay factor, I found the combination below, which gave me the shortest path.

In [41]:
Q, stepsToGoal = trainQ4(150, 0.5, 0.7, validMoves_4disk, makeMove_4disk)

In [42]:
stepsToGoal

[872,
 121,
 38,
 293,
 327,
 86,
 53,
 58,
 67,
 52,
 171,
 163,
 35,
 155,
 66,
 52,
 18,
 146,
 71,
 127,
 59,
 33,
 57,
 29,
 60,
 109,
 23,
 70,
 45,
 71,
 91,
 28,
 47,
 37,
 17,
 103,
 44,
 24,
 28,
 63,
 37,
 35,
 16,
 41,
 20,
 123,
 21,
 22,
 42,
 19,
 39,
 20,
 43,
 18,
 20,
 20,
 19,
 44,
 16,
 18,
 17,
 16,
 16,
 46,
 30,
 16,
 30,
 16,
 16,
 16,
 16,
 16,
 18,
 16,
 16,
 16,
 16,
 16,
 17,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 15]

In [43]:
path=testQ4(Q, 20, validMoves_4disk, makeMove_4disk)
path

[[[1, 2, 3, 4], [], []],
 [[2, 3, 4], [1], []],
 [[3, 4], [1], [2]],
 [[3, 4], [], [1, 2]],
 [[4], [3], [1, 2]],
 [[1, 4], [3], [2]],
 [[1, 4], [2, 3], []],
 [[4], [1, 2, 3], []],
 [[], [1, 2, 3], [4]],
 [[], [2, 3], [1, 4]],
 [[2], [3], [1, 4]],
 [[1, 2], [3], [4]],
 [[1, 2], [], [3, 4]],
 [[2], [1], [3, 4]],
 [[], [1], [2, 3, 4]],
 [[], [], [1, 2, 3, 4]]]

In [44]:
for s in path:
    printState_4disk(s)
    


               1    
               2     
               3    
               4    
               -----
             

                    
               2     
               3    
               4 1  
               -----
             

                    
                     
               3    
               4 1 2
               -----
             

                    
                     
               3   1
               4   2
               -----
             

                    
                     
                   1
               4 3 2
               -----
             

                    
                     
               1    
               4 3 2
               -----
             

                    
                     
               1 2  
               4 3  
               -----
             

                    
                 1   
                 2  
               4 3  
               -----
             

                    
          

Find values for number of repetitions, learning rate, and epsilon decay factor for which trainQ learns a Q function that testQ can use to find the shortest solution path. Include the output from the successful calls to trainQ and testQ.
  
Ans: The shortest path length is 15 moves (excluding the start state). Number of repetitions = 150, learning rate = 0.5, and epsilon decay factor = 0.7.
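
As a sanity check on these path lengths, the classic recursive solution needs 2^n - 1 moves for n disks, which gives 7 moves for 3 disks and 15 moves for 4 disks. The sketch below is not part of the reinforcement learning code. `hanoi_moves` is a hypothetical helper that produces the optimal move list in the same `[source, destination]` format used in this notebook.

```python
def hanoi_moves(n, src=1, dst=3, spare=2):
    """Return the optimal list of [source, destination] moves for n disks."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, spare, dst)      # move n-1 disks out of the way
            + [[src, dst]]                           # move the largest disk
            + hanoi_moves(n - 1, spare, dst, src))   # move n-1 disks back on top

print(len(hanoi_moves(3)))   # 7
print(len(hanoi_moves(4)))   # 15
```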