Initial state:
┏━━━┳━━━┳━━━┓
┃ ☼ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ┃ ☾ ┃
┗━━━┻━━━┻━━━┛
Desired state:
┏━━━┳━━━┳━━━┓
┃ ┃ ☾ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ☾ ┃ ☼ ┃
┗━━━┻━━━┻━━━┛
The files are utils.py, environment.py, and main.py.
The program solves the puzzle in 14 moves almost every time, in under 10 seconds. Just run main.py.
Requirements: numpy and any Python version above 3.6.
Environment-related methods and variables are implemented in this file. There is a class named env that creates and handles the game board.
There are 4 actions, indexed in polar-coordinate (counterclockwise) order. Actions are defined for the blank cell:
0: right
1: up
2: left
3: down
Three different rewards are set:
- Each move has a penalty of -10, so the agent tries to reduce the number of moves.
- Invalid moves (hitting the wall) have a penalty of -1000 to prevent the agent from choosing them.
- Reaching the goal state has a reward of 100.
Each state is stored as a 1-D tuple containing 4 moons, 4 suns, and 1 blank.
The state space is generated by taking all permutations of the initial state and is stored in the env class as a list.
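For illustration, here is a minimal sketch of how such a state space could be built with itertools; the names initial_state and state_space are assumptions, not necessarily the identifiers used in environment.py.

```python
from itertools import permutations

# Assumed flat 9-tuple layout of the initial board (row by row).
initial_state = ('sun', 'sun', 'sun',
                 'moon', 'sun', 'moon',
                 'moon', 'blank', 'moon')

# All distinct arrangements of the tiles; set() removes the duplicates
# caused by the repeated 'moon' and 'sun' symbols.
state_space = list(set(permutations(initial_state)))
print(len(state_space))  # 9! / (4! * 4!) = 630 unique states
```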
env.step() runs an epoch and returns the new state, the reward, and a done flag (whether or not the desired state has been reached).
A separate method executes the action on the given state and returns the new state.
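As a rough illustration of that behavior, here is a minimal, self-contained sketch of a step-like function; the flat-tuple layout, the swap logic, and the function signature are assumptions, while the action encoding and rewards follow the description above.

```python
# Blank-cell offsets on a 3x3 grid, indexed by action in polar-angle order:
# 0 = right, 1 = up, 2 = left, 3 = down.
MOVES = {0: (0, 1), 1: (-1, 0), 2: (0, -1), 3: (1, 0)}

def step(state, goal, action):
    """Apply `action` to the blank cell; return (new_state, reward, done)."""
    idx = state.index('blank')
    row, col = divmod(idx, 3)
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    if not (0 <= new_row < 3 and 0 <= new_col < 3):
        return state, -1000, False            # invalid move: hit the wall
    cells = list(state)
    swap = new_row * 3 + new_col
    cells[idx], cells[swap] = cells[swap], cells[idx]  # slide the neighbor tile
    new_state = tuple(cells)
    if new_state == goal:
        return new_state, 100, True           # reached the desired state
    return new_state, -10, False              # ordinary move penalty
```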
This file does the Q-learning process. It initializes a Q-table using the state space of the env class as keys and a list of four 0s as values. An entry looks like this:
{('moon', 'moon', 'moon', 'moon', 'sun', 'sun', 'sun', 'sun', 'blank'): [0, 0, 0, 0]}
Since actions are declared as integers (right = 0, and so on), the Q values can be accessed easily with Q_table[state][action].
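A minimal sketch of that initialization, assuming the state space is a list of 9-tuples like the one sketched for environment.py:

```python
from itertools import permutations

# Placeholder state space (see the earlier sketch); each state is a 9-tuple.
initial_state = ('sun', 'sun', 'sun',
                 'moon', 'sun', 'moon',
                 'moon', 'blank', 'moon')
state_space = set(permutations(initial_state))

# One entry per state, one Q value per action (0: right, 1: up, 2: left, 3: down).
Q_table = {state: [0, 0, 0, 0] for state in state_space}

# Q_table[state][action] then gives the value of taking `action` in `state`.
```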
The values for Q function are as follows:
- Learning rate: 0.7
- Discount factor: 0.9
- Exploration rate: 0.1
I found these values by testing different settings; since the problem is neither very complicated nor in need of much learning, there's no need to make the values decrease over time. For more serious problems, a decreasing exploration rate and learning rate are worth considering.
I run 1000 episodes, each with a maximum length of 1000 epochs, to make sure the agent finds the best solution. It won't take longer than 10 seconds on an average Intel processor.
In each epoch, the program checks the Q-table and chooses an action based on max(Q_table[state]). If more than one action has the maximum value, it chooses among them at random. It also chooses a completely random action when it decides to explore instead of exploit.
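A minimal sketch of that selection rule, assuming a helper of this shape (the name choose_action and the epsilon parameter are illustrative):

```python
import random

def choose_action(Q_table, state, epsilon=0.1):
    """Epsilon-greedy selection with random tie-breaking among best actions."""
    if random.random() < epsilon:
        return random.randrange(4)            # explore: any of the 4 actions
    q_values = Q_table[state]
    best = max(q_values)
    best_actions = [a for a, q in enumerate(q_values) if q == best]
    return random.choice(best_actions)        # exploit, breaking ties randomly
```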
The program performs the action in the game and updates Q_table based on the values returned by env.step(). If the move was invalid, the reward is -1000, and it is written directly into Q_table regardless of the previous value.
The formula for updating Q values is Q_table[old_state][action] = (1-alpha)*Q_table[old_state][action] + alpha*(reward + gamma * max(Q_table[new_state])) where:
- alpha is the learning rate
- gamma is the discount factor
- old_state is the state before taking the action
- new_state is the state after taking the action
- reward is returned by env.step() for taking action on old_state
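Combining the update formula with the invalid-move shortcut described above, a minimal sketch could look like this (the function name update_q is an assumption):

```python
ALPHA = 0.7   # learning rate
GAMMA = 0.9   # discount factor

def update_q(Q_table, old_state, action, reward, new_state):
    """Apply the Q-learning update; invalid moves overwrite the entry directly."""
    if reward == -1000:                       # invalid move (hit the wall)
        Q_table[old_state][action] = reward
        return
    old_q = Q_table[old_state][action]
    target = reward + GAMMA * max(Q_table[new_state])
    Q_table[old_state][action] = (1 - ALPHA) * old_q + ALPHA * target
```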
Each episode is made of epochs and ends either when the desired state is reached or when the maximum episode length (1000) is exceeded.
At the beginning of each episode, the game is reset to its initial state, and the program starts to run epochs.
The final episode is set not to explore, and it prints its states as the result.
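Structurally, the training loop described above could look roughly like this; env, choose_action, update_q, and print_board stand in for the actual interfaces and helpers, so this is a sketch rather than the exact code in main.py.

```python
EPISODES = 1000
MAX_EPOCHS = 1000

for episode in range(EPISODES):
    state = env.reset()                       # back to the initial board
    greedy = episode == EPISODES - 1          # final episode: no exploration
    for epoch in range(MAX_EPOCHS):
        action = choose_action(Q_table, state, epsilon=0.0 if greedy else 0.1)
        new_state, reward, done = env.step(action)
        update_q(Q_table, state, action, reward, new_state)
        if greedy:
            print_board(new_state)            # hypothetical pretty-printer
        state = new_state
        if done:                              # reached the desired state
            break
```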
Consists of Unicode characters for box drawing and game elements, plus some helper functions to clear the screen and change array dimensions.
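For illustration, those helpers could be as simple as the following; the names clear_screen and to_grid are assumptions.

```python
import os
import numpy as np

def clear_screen():
    """Clear the terminal on Windows ('cls') or POSIX ('clear') systems."""
    os.system('cls' if os.name == 'nt' else 'clear')

def to_grid(state):
    """Reshape a flat 9-element state into a 3x3 array for drawing the board."""
    return np.array(state).reshape(3, 3)
```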
The program output looks like this:
1. initializing state space... DONE!
2. running 1000 episodes for training...
██████████████████████████████████████████████████ DONE!
3. let's see the result.
STATE 1
┏━━━┳━━━┳━━━┓
┃ ☼ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ┃ ☾ ┃
┗━━━┻━━━┻━━━┛
STATE 2
┏━━━┳━━━┳━━━┓
┃ ☼ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┗━━━┻━━━┻━━━┛
STATE 3
┏━━━┳━━━┳━━━┓
┃ ☼ ┃ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┗━━━┻━━━┻━━━┛
STATE 4
┏━━━┳━━━┳━━━┓
┃ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┗━━━┻━━━┻━━━┛
STATE 5
┏━━━┳━━━┳━━━┓
┃ ☾ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ┃ ☼ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┗━━━┻━━━┻━━━┛
STATE 6
┏━━━┳━━━┳━━━┓
┃ ☾ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┗━━━┻━━━┻━━━┛
STATE 7
┏━━━┳━━━┳━━━┓
┃ ☾ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ☾ ┃ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┗━━━┻━━━┻━━━┛
STATE 8
┏━━━┳━━━┳━━━┓
┃ ☾ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ☾ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ┃
┗━━━┻━━━┻━━━┛
STATE 9
┏━━━┳━━━┳━━━┓
┃ ☾ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ☾ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ┃ ☼ ┃
┗━━━┻━━━┻━━━┛
STATE 10
┏━━━┳━━━┳━━━┓
┃ ☾ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ☾ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ┃ ☾ ┃ ☼ ┃
┗━━━┻━━━┻━━━┛
STATE 11
┏━━━┳━━━┳━━━┓
┃ ☾ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ┃ ☾ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ☾ ┃ ☼ ┃
┗━━━┻━━━┻━━━┛
STATE 12
┏━━━┳━━━┳━━━┓
┃ ☾ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ☾ ┃ ☼ ┃
┗━━━┻━━━┻━━━┛
STATE 13
┏━━━┳━━━┳━━━┓
┃ ☾ ┃ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ☾ ┃ ☼ ┃
┗━━━┻━━━┻━━━┛
STATE 14
┏━━━┳━━━┳━━━┓
┃ ┃ ☾ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ☾ ┃ ☼ ┃
┗━━━┻━━━┻━━━┛