Initial state:
┏━━━┳━━━┳━━━┓
┃ ☼ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ┃ ☾ ┃
┗━━━┻━━━┻━━━┛
Desired state:
┏━━━┳━━━┳━━━┓
┃ ┃ ☾ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ☾ ┃ ☼ ┃
┗━━━┻━━━┻━━━┛
The files are utils.py, environment.py, and main.py.
The program solves the puzzle in 14 moves almost every time, in under 10 seconds. Just run main.py.
Requirements: numpy and any Python version above 3.6.
Environment-related methods and variables are implemented in this file. There is a class named env that creates and handles the game board.
There are 4 actions, indexed in polar-coordinate (counterclockwise) order. Actions are defined for the blank cell:
0: right
1: up
2: left
3: down
Three different rewards are set:
- Each move has a penalty of -10, so the agent tries to reduce the number of moves.
- Invalid moves (hitting the wall) have a penalty of -1000 to prevent the agent from choosing them.
- Reaching the goal state has a reward of 100.
Each state is stored as a 1-D tuple containing 4 moons, 4 suns, and 1 blank.
The state space is generated by taking all permutations of the initial state and is stored in the env class as a list.
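For illustration, here is a minimal sketch of how such a state space could be built with itertools; the names initial_state and state_space are assumptions, not necessarily the identifiers used in environment.py.

```python
from itertools import permutations

# Assumed flat 9-tuple layout of the initial board (row by row).
initial_state = ('sun', 'sun', 'sun',
                 'moon', 'sun', 'moon',
                 'moon', 'blank', 'moon')

# All distinct arrangements of the tiles; set() removes the duplicates
# caused by the repeated 'moon' and 'sun' symbols.
state_space = list(set(permutations(initial_state)))
print(len(state_space))  # 9! / (4! * 4!) = 630 unique states
```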
env.step() runs an epoch and returns the new state, the reward, and a done flag (whether or not the desired state has been reached).
A separate method executes the action on the given state and returns the new state.
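As a rough illustration of that behavior, here is a minimal, self-contained sketch of a step-like function; the flat-tuple layout, the swap logic, and the function signature are assumptions, while the action encoding and rewards follow the description above.

```python
# Blank-cell offsets on a 3x3 grid, indexed by action in polar-angle order:
# 0 = right, 1 = up, 2 = left, 3 = down.
MOVES = {0: (0, 1), 1: (-1, 0), 2: (0, -1), 3: (1, 0)}

def step(state, goal, action):
    """Apply `action` to the blank cell; return (new_state, reward, done)."""
    idx = state.index('blank')
    row, col = divmod(idx, 3)
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    if not (0 <= new_row < 3 and 0 <= new_col < 3):
        return state, -1000, False            # invalid move: hit the wall
    cells = list(state)
    swap = new_row * 3 + new_col
    cells[idx], cells[swap] = cells[swap], cells[idx]  # slide the neighbor tile
    new_state = tuple(cells)
    if new_state == goal:
        return new_state, 100, True           # reached the desired state
    return new_state, -10, False              # ordinary move penalty
```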
This file does the Q-learning process. It initializes a Q-table using the state space of the env class as keys and a list of four 0s as values. An entry looks like this:
{('moon', 'moon', 'moon', 'moon', 'sun', 'sun', 'sun', 'sun', 'blank'): [0, 0, 0, 0]}
Since actions are declared as integers (right = 0, and so on), the Q values can be accessed easily with Q_table[state][action].
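A minimal sketch of that initialization, assuming the state space is a list of 9-tuples like the one sketched for environment.py:

```python
from itertools import permutations

# Placeholder state space (see the earlier sketch); each state is a 9-tuple.
initial_state = ('sun', 'sun', 'sun',
                 'moon', 'sun', 'moon',
                 'moon', 'blank', 'moon')
state_space = set(permutations(initial_state))

# One entry per state, one Q value per action (0: right, 1: up, 2: left, 3: down).
Q_table = {state: [0, 0, 0, 0] for state in state_space}

# Q_table[state][action] then gives the value of taking `action` in `state`.
```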
The values for Q function are as follows:
- Learning rate: 0.7
- Discount factor: 0.9
- Exploration rate: 0.1
I found these values by testing different settings; since the problem is neither very complicated nor in need of much learning, there's no need to make the values decrease over time. For more serious problems, a decreasing exploration rate and learning rate are worth considering.
I run 1000 episodes, each with a maximum length of 1000 epochs, to make sure the agent finds the best solution. It won't take longer than 10 seconds on an average Intel processor.
In each epoch, the program checks the Q-table and chooses an action based on max(Q_table[state]). If more than one action has the maximum value, it chooses among them at random. It also chooses a completely random action when it decides to explore instead of exploit.
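A minimal sketch of that selection rule, assuming a helper of this shape (the name choose_action and the epsilon parameter are illustrative):

```python
import random

def choose_action(Q_table, state, epsilon=0.1):
    """Epsilon-greedy selection with random tie-breaking among best actions."""
    if random.random() < epsilon:
        return random.randrange(4)            # explore: any of the 4 actions
    q_values = Q_table[state]
    best = max(q_values)
    best_actions = [a for a, q in enumerate(q_values) if q == best]
    return random.choice(best_actions)        # exploit, breaking ties randomly
```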
The program performs the action in the game and updates Q_table based on the values returned by env.step(). If the move was invalid, the reward is -1000, and it is written directly into Q_table regardless of the previous value.
The formula for updating Q values is Q_table[old_state][action] = (1-alpha)*Q_table[old_state][action] + alpha*(reward + gamma * max(Q_table[new_state])) where:
- alpha is the learning rate
- gamma is the discount factor
- old_state is the state before taking the action
- new_state is the state after taking the action
- reward is returned by env.step() for taking action on old_state
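Combining the update formula with the invalid-move shortcut described above, a minimal sketch could look like this (the function name update_q is an assumption):

```python
ALPHA = 0.7   # learning rate
GAMMA = 0.9   # discount factor

def update_q(Q_table, old_state, action, reward, new_state):
    """Apply the Q-learning update; invalid moves overwrite the entry directly."""
    if reward == -1000:                       # invalid move (hit the wall)
        Q_table[old_state][action] = reward
        return
    old_q = Q_table[old_state][action]
    target = reward + GAMMA * max(Q_table[new_state])
    Q_table[old_state][action] = (1 - ALPHA) * old_q + ALPHA * target
```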
Each episode is made of epochs and ends either when the desired state is reached or when the maximum episode length (1000) is exceeded.
At the beginning of each episode, the game is reset to its initial state, and the program starts to run epochs.
The final episode is set not to explore, and it prints its states as the result.
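Structurally, the training loop described above could look roughly like this; env, choose_action, update_q, and print_board stand in for the actual interfaces and helpers, so this is a sketch rather than the exact code in main.py.

```python
EPISODES = 1000
MAX_EPOCHS = 1000

for episode in range(EPISODES):
    state = env.reset()                       # back to the initial board
    greedy = episode == EPISODES - 1          # final episode: no exploration
    for epoch in range(MAX_EPOCHS):
        action = choose_action(Q_table, state, epsilon=0.0 if greedy else 0.1)
        new_state, reward, done = env.step(action)
        update_q(Q_table, state, action, reward, new_state)
        if greedy:
            print_board(new_state)            # hypothetical pretty-printer
        state = new_state
        if done:                              # reached the desired state
            break
```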
Consists of Unicode characters for box drawing and game elements, plus some helper functions to clear the screen and change array dimensions.
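For illustration, those helpers could be as simple as the following; the names clear_screen and to_grid are assumptions.

```python
import os
import numpy as np

def clear_screen():
    """Clear the terminal on Windows ('cls') or POSIX ('clear') systems."""
    os.system('cls' if os.name == 'nt' else 'clear')

def to_grid(state):
    """Reshape a flat 9-element state into a 3x3 array for drawing the board."""
    return np.array(state).reshape(3, 3)
```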
The program output looks like this:
1. initializing state space... DONE!
2. running 1000 episodes for training...
██████████████████████████████████████████████████ DONE!
3. let's see the result.
STATE 1
┏━━━┳━━━┳━━━┓
┃ ☼ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ┃ ☾ ┃
┗━━━┻━━━┻━━━┛
STATE 2
┏━━━┳━━━┳━━━┓
┃ ☼ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┗━━━┻━━━┻━━━┛
STATE 3
┏━━━┳━━━┳━━━┓
┃ ☼ ┃ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┗━━━┻━━━┻━━━┛
STATE 4
┏━━━┳━━━┳━━━┓
┃ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┗━━━┻━━━┻━━━┛
STATE 5
┏━━━┳━━━┳━━━┓
┃ ☾ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ┃ ☼ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┗━━━┻━━━┻━━━┛
STATE 6
┏━━━┳━━━┳━━━┓
┃ ☾ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┗━━━┻━━━┻━━━┛
STATE 7
┏━━━┳━━━┳━━━┓
┃ ☾ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ☾ ┃ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┗━━━┻━━━┻━━━┛
STATE 8
┏━━━┳━━━┳━━━┓
┃ ☾ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ☾ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ┃
┗━━━┻━━━┻━━━┛
STATE 9
┏━━━┳━━━┳━━━┓
┃ ☾ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ☾ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ┃ ☼ ┃
┗━━━┻━━━┻━━━┛
STATE 10
┏━━━┳━━━┳━━━┓
┃ ☾ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ☾ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ┃ ☾ ┃ ☼ ┃
┗━━━┻━━━┻━━━┛
STATE 11
┏━━━┳━━━┳━━━┓
┃ ☾ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ┃ ☾ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ☾ ┃ ☼ ┃
┗━━━┻━━━┻━━━┛
STATE 12
┏━━━┳━━━┳━━━┓
┃ ☾ ┃ ☼ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ☾ ┃ ☼ ┃
┗━━━┻━━━┻━━━┛
STATE 13
┏━━━┳━━━┳━━━┓
┃ ☾ ┃ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ☾ ┃ ☼ ┃
┗━━━┻━━━┻━━━┛
STATE 14
┏━━━┳━━━┳━━━┓
┃ ┃ ☾ ┃ ☼ ┃
┣━━━╋━━━╋━━━┫
┃ ☾ ┃ ☼ ┃ ☾ ┃
┣━━━╋━━━╋━━━┫
┃ ☼ ┃ ☾ ┃ ☼ ┃
┗━━━┻━━━┻━━━┛