# Exercise 3 - Overview


- Markov Decision Process (MDP)
    - 4-tuple ($S$, $A$, $P_{a}$, $R_{a}$)
        - $S$: finite set of states $S = \{s_1, s_2 ,..., s_n\}$
        - $A$: finite set of actions $A = \{a_1, a_2,..., a_m\}$
        - $P_a(s, s')$: transition probability matrix (probability that taking action $a$ in state $s$ leads to state $s'$)
        - $R_a(s, s')$: reward matrix (reward received for moving from state $s$ to state $s'$ via action $a$)
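
To make the 4-tuple concrete, here is a minimal sketch of a toy MDP with two states and two actions. All names and numbers are made up for illustration and have nothing to do with the warehouse data; the only point is that each $P_a$ must have rows that sum to 1 and that $R_a$ has the same shape as $P_a$.

```python
# Toy MDP with 2 states and 2 actions; every number here is invented.
states = ["s1", "s2"]
actions = ["a1", "a2"]

# P[a][s][s2]: probability of moving from state s to state s2 under action a.
P = {
    "a1": [[0.9, 0.1],
           [0.0, 1.0]],
    "a2": [[0.2, 0.8],
           [0.5, 0.5]],
}

# R[a][s][s2]: reward received for that same transition.
R = {
    "a1": [[0.0, 1.0],
           [0.0, 0.0]],
    "a2": [[0.0, 5.0],
           [-1.0, 0.0]],
}


def rows_are_stochastic(matrix, tol=1e-9):
    """Return True if every row is a valid probability distribution."""
    return all(
        abs(sum(row) - 1.0) <= tol and all(p >= 0.0 for p in row)
        for row in matrix
    )


def expected_reward(action, state_index):
    """Expected immediate reward of taking `action` in the given state."""
    probabilities = P[action][state_index]
    rewards = R[action][state_index]
    return sum(p * r for p, r in zip(probabilities, rewards))


all_valid = all(rows_are_stochastic(P[a]) for a in actions)
print(all_valid)
print(expected_reward("a2", 0))
```

The expected immediate reward per state and action is what a reward vector per action would contain if the full $R_a(s, s')$ matrix is collapsed.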


- Size of warehouse is either 2x2 or 3x2
- Separate start/stop position outside the storage space
- Robots can move to adjacent fields (but not diagonally)
- First position the robot can move into is always (1, 1)
- Three types of items, identified by color (white, blue, red)
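
The movement rule can be sketched as a small helper. This assumes 1-based `(x, y)` coordinates with `(1, 1)` as the entry field and treats the first number of "3x2" as the width; both are my assumptions, since the exercise text does not fix the convention, and `adjacent_fields` is a hypothetical name.

```python
def adjacent_fields(position, width, height):
    """Return the fields reachable in one non-diagonal step from `position`.

    Coordinates are assumed to be 1-based (x, y) pairs inside the storage space.
    """
    x, y = position
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    # Keep only candidates that are still inside the warehouse grid.
    return [
        (cx, cy)
        for cx, cy in candidates
        if 1 <= cx <= width and 1 <= cy <= height
    ]


print(adjacent_fields((1, 1), width=2, height=2))
print(adjacent_fields((2, 1), width=3, height=2))
```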


## Data Exploration

```python
import pandas as pd
import numpy as np
```

```python
# 2x2
data_dir_2x2 = 'data_2x2'
file_train_2x2 = 'warehousetraining2x2.txt'
file_order_2x2 = 'warehouseorder2x2.txt'
csv_data_file_train = f'{data_dir_2x2}\\{file_train_2x2}'
csv_data_file_order = f'{data_dir_2x2}\\{file_order_2x2}'
df_2x2_train = pd.read_csv(csv_data_file_train, sep='\t', header=None, names=["operation_type", "color"])
df_2x2_train['merged'] = df_2x2_train['operation_type'] + ' ' + df_2x2_train['color']
df_2x2_order = pd.read_csv(csv_data_file_order, sep='\t', header=None, names=["operation_type", "color"])
df_2x2_order['merged'] = df_2x2_order['operation_type'] + ' ' + df_2x2_order['color']
```

```python
# 3x2
data_dir_3x2 = 'data_3x2'
file_train_3x2 = 'warehousetraining3x2.txt'
file_order_3x2 = 'warehouseorder3x2.txt'
csv_data_file_train = f'{data_dir_3x2}\\{file_train_3x2}'
csv_data_file_order = f'{data_dir_3x2}\\{file_order_3x2}'
df_3x2_train = pd.read_csv(csv_data_file_train, sep='\t', header=None, names=["operation_type", "color"])
df_3x2_train['merged'] = df_3x2_train['operation_type'] + ' ' + df_3x2_train['color']
df_3x2_order = pd.read_csv(csv_data_file_order, sep='\t', header=None, names=["operation_type", "color"])
df_3x2_order['merged'] = df_3x2_order['operation_type'] + ' ' + df_3x2_order['color']
```

### 2x2 distributions

```python
print(f'distribution for {file_train_2x2}:')
val_counts_2x2_train = df_2x2_train['merged'].value_counts(normalize=False)
print(val_counts_2x2_train)
val_counts_2x2_train_norm = df_2x2_train['merged'].value_counts(normalize=True)
#np.save(f'{data_dir_2x2}\\train_dist_norm.npy', val_counts_2x2_train_norm)
print(val_counts_2x2_train_norm)
print()
print(f'distribution for {file_order_2x2}:')
val_counts_2x2_order = df_2x2_order['merged'].value_counts(normalize=False)
print(val_counts_2x2_order)
val_counts_2x2_order_norm = df_2x2_order['merged'].value_counts(normalize=True)
#np.save(f'{data_dir_2x2}\\order_dist_norm.npy', val_counts_2x2_order_norm)
print(val_counts_2x2_order_norm)
print()
```


```
distribution for warehousetraining2x2.txt:
restore red      2064
store red        2064
store white      1030
restore white    1029
store blue        995
restore blue      995
Name: merged, dtype: int64
restore red      0.252415
store red        0.252415
store white      0.125963
restore white    0.125841
store blue       0.121683
restore blue     0.121683
Name: merged, dtype: float64

distribution for warehouseorder2x2.txt:
store red        16
restore red      15
store white      10
restore blue      8
restore white     8
store blue        8
Name: merged, dtype: int64
store red        0.246154
restore red      0.230769
store white      0.153846
restore blue     0.123077
restore white    0.123077
store blue       0.123077
Name: merged, dtype: float64
```



### 3x2 distributions

```python
print(f'distribution for {file_train_3x2}:')
val_counts_3x2_train = df_3x2_train['merged'].value_counts(normalize=False)
print(val_counts_3x2_train)
val_counts_3x2_train_norm = df_3x2_train['merged'].value_counts(normalize=True)
np.save(f'{data_dir_3x2}\\train_dist_norm.npy', val_counts_3x2_train_norm)
print(val_counts_3x2_train_norm)
print()
print(f'distribution for {file_order_3x2}:')
val_counts_3x2_order = df_3x2_order['merged'].value_counts(normalize=False)
print(val_counts_3x2_order)
val_counts_3x2_order_norm = df_3x2_order['merged'].value_counts(normalize=True)
np.save(f'{data_dir_3x2}\\order_dist_norm.npy', val_counts_3x2_order_norm)
print(val_counts_3x2_order_norm)
print()
```

```
distribution for warehousetraining3x2.txt:
restore red      2989
store red        2989
restore white    1548
store white      1548
restore blue     1517
store blue       1517
Name: merged, dtype: int64
restore red      0.246862
store red        0.246862
restore white    0.127849
store white      0.127849
restore blue     0.125289
store blue       0.125289
Name: merged, dtype: float64

distribution for warehouseorder3x2.txt:
store white      13
restore white    13
store red        11
restore red      11
restore blue      6
store blue        6
Name: merged, dtype: int64
store white      0.216667
restore white    0.216667
store red        0.183333
restore red      0.183333
restore blue     0.100000
store blue       0.100000
Name: merged, dtype: float64
```



I am interested in the normalized distribution. Let's round the numbers to four decimals and check that they still sum to 1.

```python
# 2x2
print(f'distribution for {file_train_2x2}:')
print(val_counts_2x2_train_norm.round(decimals=4))
print('Does it still sum up to 1.0? ', end='')
# Compare with a tolerance; exact float equality can fail due to rounding error
print(np.isclose(np.sum(val_counts_2x2_train_norm.round(decimals=4)), 1.0))
print()

# 3x2
print(f'distribution for {file_train_3x2}:')
print(val_counts_3x2_train_norm.round(decimals=4))
print('Does it still sum up to 1.0? ', end='')
print(np.isclose(np.sum(val_counts_3x2_train_norm.round(decimals=4)), 1.0))
print()
```

```
distribution for warehousetraining2x2.txt:
restore red      0.2524
store red        0.2524
store white      0.1260
restore white    0.1258
store blue       0.1217
restore blue     0.1217
Name: merged, dtype: float64
Does it still sum up to 1.0? True

distribution for warehousetraining3x2.txt:
restore red      0.2469
store red        0.2469
restore white    0.1278
store white      0.1278
restore blue     0.1253
store blue       0.1253
Name: merged, dtype: float64
Does it still sum up to 1.0? True
```



## Further investigations

- Can we determine how long an item was stored?
    - No, not necessarily. If two or more items of the same color are stored at the same time and a 'restore' action is done, it is unknown which of these stored items was restored.
- 3x2 states: $4^6 * 6 = 24576$ (each of the 6 fields is empty or holds a white, blue, or red item, i.e. 4 options per field, times 6 different actions)
- 2x2 states: $4^4 * 6 = 1536$ (each of the 4 fields is empty or holds a white, blue, or red item, i.e. 4 options per field, times 6 different actions)
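
The state counts can be double-checked by brute-force enumeration. This sketch assumes that the factor 6 corresponds to the six store/restore and color combinations seen in the data files; `enumerate_states` and the constant names are hypothetical.

```python
from itertools import product

# Each field is either empty or holds one item of a given color (4 options).
FIELD_CONTENTS = ("empty", "white", "blue", "red")

# Assumption: the factor 6 is the set of operation/color pairs from the data.
OPERATIONS = ("store", "restore")
COLORS = ("white", "blue", "red")
TASKS = [f"{operation} {color}" for operation in OPERATIONS for color in COLORS]


def enumerate_states(n_fields):
    """Yield every (field contents, task) pair for a warehouse with n_fields fields."""
    for contents in product(FIELD_CONTENTS, repeat=n_fields):
        for task in TASKS:
            yield contents, task


n_states_2x2 = sum(1 for _ in enumerate_states(4))
n_states_3x2 = sum(1 for _ in enumerate_states(6))
print(n_states_2x2, n_states_3x2)  # 1536 24576
```

Enumerating the states in a fixed order like this also gives each state a stable index, which is needed for the rows and columns of the matrices in the next section.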

## MDP Toolbox usage
- Create a transition probability matrix (TPM) per action
    - 2x2: six matrices of size $1536 \times 1536$ (states x states per TPM, one TPM per action)
    - 3x2: six matrices of size $24576 \times 24576$ (states x states per TPM, one TPM per action)
- Group the TPMs
```
tpms = np.array([tpm0, tpm1, tpm2, tpm3, tpm4, tpm5])
```

- Create a reward matrix (basically one vector per action)
    - 2x2: one matrix of size $6 \times 1536$ (actions x states)
    - 3x2: one matrix of size $6 \times 24576$ (actions x states)
- Use a discount factor
- Create `PolicyIteration` and `ValueIteration` instances and run them
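
Before handing the arrays to the toolbox, it may help to check their shapes and that every TPM row sums to 1. The sketch below uses random placeholder matrices with toy sizes (4 states, 2 actions) rather than the real warehouse matrices, and `random_tpm` is a hypothetical helper. As far as I can tell, the pymdptoolbox documentation describes a two-dimensional reward array as states x actions, so an actions x states matrix may need to be transposed; please verify this against the installed version.

```python
import numpy as np

n_states, n_actions = 4, 2  # toy sizes, not the warehouse sizes

rng = np.random.default_rng(seed=0)


def random_tpm(n, rng):
    """Build a random row-stochastic n x n matrix as a stand-in for a real TPM."""
    raw = rng.random((n, n))
    return raw / raw.sum(axis=1, keepdims=True)


# Stack one TPM per action into a single (actions, states, states) array.
tpms = np.array([random_tpm(n_states, rng) for _ in range(n_actions)])

# Placeholder reward matrix laid out as (actions x states).
reward_by_action = rng.random((n_actions, n_states))

# Sanity checks: expected shape, non-negative entries, rows summing to 1.
shape_ok = tpms.shape == (n_actions, n_states, n_states)
rows_ok = bool(np.allclose(tpms.sum(axis=2), 1.0) and (tpms >= 0).all())

# Assumption: the solver wants (states x actions), hence the transpose.
reward_by_state = reward_by_action.T
print(shape_ok, rows_ok, reward_by_state.shape)
```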

```python
import mdptoolbox.mdp

# Define both solvers with the transition probability matrices, the reward matrix,
# a discount factor of 0.3, and at most 100 iterations
mdpresultPolicy = \
    mdptoolbox.mdp.PolicyIteration(tpms, rewardmatrix, 0.3, max_iter=100)
mdpresultValue = \
    mdptoolbox.mdp.ValueIteration(tpms, rewardmatrix, 0.3, max_iter=100)

# Run the MDP
mdpresultPolicy.run()
mdpresultValue.run()
```

- After running, `PolicyIteration` and `ValueIteration` both expose a `policy` vector, a `V` vector (the value function), and `iter`, the number of iterations it took
    - `PolicyIteration` and `ValueIteration` are both algorithms for solving MDPs (see https://en.m.wikipedia.org/wiki/Markov_decision_process#Algorithms)
    - The closer the discount factor is set to 1, the more iterations, and therefore time, the chosen algorithm needs to solve the MDP
    - The `policy` is what we'll use to instantiate an AI to run in the warehouse

```python
print('PolicyIteration:')
print(mdpresultPolicy.policy)
print(mdpresultPolicy.V)
print(mdpresultPolicy.iter)

print('ValueIteration:')
print(mdpresultValue.policy)
print(mdpresultValue.V)
print(mdpresultValue.iter)
```

```
PolicyIteration:
(0, 0, 0, 0)
(0.4740921452567589, 1.4285714285714286, 2.7443609022556394, 11.338985762278037)
1
ValueIteration:
(0, 0, 0, 0)
(0.4622244, 1.417, 2.728424, 11.321488)
4
```
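
To illustrate the remark about the discount factor, here is a minimal NumPy sketch of plain value iteration on a made-up MDP with two states and two actions. It is not the toolbox implementation: `value_iteration`, the toy matrices, and the stopping rule (largest change in `V` below `tol`) are my own assumptions, so the iteration counts will not match those reported by `mdptoolbox`. The reward array here is laid out as states x actions.

```python
import numpy as np


def value_iteration(tpms, rewards, discount, tol=1e-6, max_iter=10_000):
    """Plain value iteration.

    tpms has shape (actions, states, states), rewards has shape (states, actions).
    Returns the greedy policy, the value vector and the number of sweeps used.
    """
    values = np.zeros(tpms.shape[1])
    for iteration in range(1, max_iter + 1):
        # q_values[s, a] = R[s, a] + discount * sum over s2 of P_a[s, s2] * V[s2]
        q_values = rewards + discount * (tpms @ values).T
        new_values = q_values.max(axis=1)
        converged = np.max(np.abs(new_values - values)) < tol
        values = new_values
        if converged:
            break
    return q_values.argmax(axis=1), values, iteration


# Made-up toy MDP: 2 actions, 2 states.
toy_tpms = np.array([
    [[0.5, 0.5], [0.8, 0.2]],  # action 0
    [[0.0, 1.0], [0.1, 0.9]],  # action 1
])
toy_rewards = np.array([[5.0, 10.0], [-1.0, 2.0]])  # (states x actions)

iterations_needed = {}
for discount in (0.3, 0.9, 0.99):
    policy, values, n_iter = value_iteration(toy_tpms, toy_rewards, discount)
    iterations_needed[discount] = n_iter
    print(f"discount={discount}: {n_iter} iterations, policy={policy.tolist()}")
```

Each sweep shrinks the remaining error by roughly the discount factor, which is why the number of sweeps grows quickly as the discount approaches 1.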
   