In [None]:
from IPython.display import Image, HTML

# Temporal Difference Learning Minimizes Cost with Delayed Feedback in Supply Chains

## Background

### Delayed Feedback

* For example, when you are first playing a game of tic-tac-toe, the value of your first move is not known until you observe your opponent's move
* In this scenario the feedback from your move is **delayed**
* The action you execute in the environment isn't directly paired with feedback for your agent to learn from
* As seen below, one can **step back in time** & **update** the value of the previous states based on the state they led to

In [7]:
Image(url='https://raw.githubusercontent.com/pbelsey/Neural-Cog-Final-Project/master/td_tictactoe.png', width=500)

### **Temporal Difference (TD) Learning**
* A reinforcement learning method for prediction problems
    * **TD learning** uses intermediate predictions, in addition to the final outcome, to update a prediction
    * whereas **supervised learning** uses only the actual outcome to train an agent to form a prediction
* TD Learning updates the value of the original state-action pair in three steps:
    * form a target from the feedback received after taking the action plus the discounted value of the new state-action pair
    * subtract the current value of the original state-action pair from that target to get the prediction error
    * add the prediction error, scaled by the learning rate $\alpha$, to the value of the original state-action pair
* As a result,
    * the value of the original state-action pair **increases** if the target is **greater** than its current value, and
    * the value of the original state-action pair **decreases** if the target is **less** than its current value
* Much as the learning rate scales the prediction error, the value of the new state-action pair is scaled by $\gamma$, known as the **discount rate**

**TD-Learning**: 
$$Q(s_t, a_t) \leftarrow  Q(s_t, a_t) + \alpha (r_{t+1} + \gamma  Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))$$
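
The update above can be sketched in a few lines of Python. This is a minimal illustration, assuming a tabular Q stored in a dictionary keyed by (state, action); the function name `td_update`, the state and action labels, and the parameter values are hypothetical and are not taken from the project code.

```python
from collections import defaultdict

def td_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one TD update to Q[(s, a)] in place and return the prediction error."""
    # Target = feedback + discounted value of the new state-action pair.
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    # Move the old value toward the target by a fraction alpha.
    Q[(s, a)] += alpha * td_error
    return td_error

# Hypothetical example: the new state-action pair is already valued at 1.0.
Q = defaultdict(float)
Q[("s1", "a1")] = 1.0
err = td_update(Q, "s0", "a0", r=0.0, s_next="s1", a_next="a1", alpha=0.5, gamma=0.9)
print(err, Q[("s0", "a0")])
```

Even though the immediate feedback is 0, the value of the original pair rises because the pair it led to is valued above it.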


* TD Learning can be simplified by staying within the same state and allowing the agent to execute different actions and observe their outcomes in that state
* By keeping the states the same, the model can be simplified to

$$Q(a_t) \leftarrow  Q(a_t) + \alpha (r_{t+1} + \gamma  Q(a_{t+1}) - Q(a_t))$$

* Similarly, Q-Learning updates the value of a given action by adding the difference between the feedback and the value of the given action, scaled by the learning rate
* The differences between TD-Learning and Q-Learning are that, in Q-Learning:
    * The feedback is of the original action and not of the new action
    * The calculation of the difference between the feedback and the value of the original action does not include the value of the new action
    * There is no $\gamma$ parameter, since there is no new state-action value to discount
    
    

**Q-Learning**: 
$$Q(a_i) \leftarrow  Q(a_i) + \alpha (r_t - Q(a_i))$$
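
A small sketch can make the comparison concrete. It assumes action values stored in a plain dictionary; the function names, the action labels `order_4` and `order_8`, and all numbers are hypothetical. Setting $\gamma = 0$ in the stateless TD update removes the new-action term, which leaves the same arithmetic as the Q-Learning update.

```python
def q_update(q, a, r, alpha):
    """Q(a) <- Q(a) + alpha * (r - Q(a)); returns an updated copy."""
    q = dict(q)
    q[a] += alpha * (r - q[a])
    return q

def td_update_stateless(q, a, r, a_next, alpha, gamma):
    """Q(a) <- Q(a) + alpha * (r + gamma * Q(a_next) - Q(a)); returns an updated copy."""
    q = dict(q)
    q[a] += alpha * (r + gamma * q[a_next] - q[a])
    return q

# Hypothetical action values for two order sizes.
q0 = {"order_4": 0.0, "order_8": 2.0}

q_plain = q_update(q0, "order_4", r=1.0, alpha=0.5)
q_td = td_update_stateless(q0, "order_4", r=1.0, a_next="order_8", alpha=0.5, gamma=0.9)
q_td_gamma0 = td_update_stateless(q0, "order_4", r=1.0, a_next="order_8", alpha=0.5, gamma=0.0)

print(q_plain["order_4"], q_td["order_4"], q_td_gamma0["order_4"])
```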

We can use these similarities and differences between TD-Learning and Q-Learning in order to modify the Q-Learning model to execute a learning process that relies on delayed feedback.

### **The Beer Game**

* **“The Beer Game”** is a dynamic system scenario, focusing on a supply chain with five different levels from consumer to factory. The goal is to match the demand of one’s buyers exactly, minimizing both unsold products and backorders. 

* **Delays in the system** prevent ideal behavior. 

In [10]:
Image(url='https://raw.githubusercontent.com/pbelsey/Neural-Cog-Final-Project/master/beer-game-large.png', width=900)

### **Our Game**

* A supply chain comprising a:
    - Consumer (C), Store (S), Warehouse (W), and Factory (F) 

* For the Warehouse, a case of beer costs: 
    - \$0.50 to store, and \$1 when backordered 

* Orders from the Factory must be placed:
     - 1 week ahead of time, as they take 1 week to deliver. Therefore, the Warehouse must learn to predict the Store's needs 1 week beforehand
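
The Warehouse's costs and the one-week delivery delay can be sketched as follows. The two cost constants come from the description above; everything else (the function names, the order of events within a week, and the assumption that unmet demand carries over as negative inventory) is an illustrative assumption rather than the project's implementation.

```python
HOLDING_COST = 0.50    # dollars per case stored (from the text)
BACKORDER_COST = 1.00  # dollars per case backordered (from the text)

def weekly_cost(net_inventory):
    """Cost for one week; a negative net inventory counts as backordered cases."""
    if net_inventory >= 0:
        return HOLDING_COST * net_inventory
    return BACKORDER_COST * -net_inventory

def simulate(orders, demand, start_inventory=0):
    """Assumed week: last week's order arrives, demand ships, then a new order is placed."""
    inventory = start_inventory
    in_transit = 0
    costs = []
    for order, d in zip(orders, demand):
        inventory += in_transit  # delivery of the order placed one week ago
        inventory -= d           # ship to the Store; below zero means backorders
        costs.append(weekly_cost(inventory))
        in_transit = order       # this week's order arrives next week
    return costs

# Hypothetical run: constant demand of 4, with a larger first order to clear the backlog.
costs = simulate(orders=[8, 4, 4], demand=[4, 4, 4])
print(costs)
```

Because nothing arrives in the first week, the agent pays for backorders before any of its own orders can take effect, which is the delay it has to learn around.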

**Q-Agent Warehouse** learns the optimal number of cases of beer to order to supply the Store, based on cost minimization
* The agent uses a softmax policy to pick actions from the state-action space
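
A softmax policy can be sketched as below, assuming β acts as an inverse temperature so that a low β spreads probability across actions and a high β concentrates it on the best-valued one. The function names and the example values (negative expected costs for three order sizes) are hypothetical.

```python
import numpy as np

def softmax_probs(q_values, beta):
    """Return action probabilities proportional to exp(beta * Q)."""
    q = np.asarray(q_values, dtype=float)
    z = beta * (q - q.max())  # subtracting the max avoids overflow in exp
    expz = np.exp(z)
    return expz / expz.sum()

def choose_action(q_values, beta, rng):
    """Sample an action index according to the softmax probabilities."""
    p = softmax_probs(q_values, beta)
    return rng.choice(len(p), p=p)

# Hypothetical action values: negative expected cost for three order sizes.
q_values = [-3.0, -1.0, -2.0]
p_low = softmax_probs(q_values, beta=0.5)   # more exploratory
p_high = softmax_probs(q_values, beta=5.0)  # more greedy

rng = np.random.default_rng(0)
action = choose_action(q_values, beta=5.0, rng=rng)
print(p_low, p_high, action)
```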


# Problem: The Credit Assignment Problem  

#### How do we assign credit to an action when many decisions contribute to the outcome?
For instance, the backpropagation of error algorithm defines hidden unit error as the total weighted error signal coming from output units through connections between output units and a hidden unit.

In TD learning, errors are instead propagated backward in time by adjusting the values of earlier actions.
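
To illustrate how credit moves backward in time, here is a minimal sketch on a three-move episode where feedback arrives only after the last move. For brevity it tracks state values V rather than action values Q, and the chain, rewards, and parameter values are all hypothetical.

```python
def run_episode(V, rewards, alpha, gamma):
    """One pass through a fixed chain of states; V holds one extra terminal entry."""
    for t, r in enumerate(rewards):
        # TD update: move V[t] toward the reward plus the discounted next value.
        V[t] += alpha * (r + gamma * V[t + 1] - V[t])
    return V

rewards = [0.0, 0.0, 1.0]  # feedback arrives only after the final move
V = [0.0, 0.0, 0.0, 0.0]   # last entry is the terminal state and stays at 0
history = []
for episode in range(3):
    run_episode(V, rewards, alpha=0.5, gamma=1.0)
    history.append(list(V))

for values in history:
    print(values)
```

After the first episode only the final move has any value; each further episode passes some of that value one step earlier, so the first move receives credit only on the third pass.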

# Model: From Q-Learning to TD-Learning

In [8]:
from __future__ import division
import ADMCode
from ADMCode import visualize as vis
from ADMCode import believer_skeptic

import numpy as np
from numpy.random import sample as rs
import pandas as pd
import sys
import os

# from ipywidgets import interactive
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Temporary for now until push changes to PIP 
#sys.path.insert(0,'../ADMCode')
#import believer_skeptic

# Silence all warnings for cleaner notebook output. This single blanket filter
# also covers the RankWarning and matplotlib warnings.
warnings.filterwarnings("ignore")
sns.set(style='white', font_scale=1.3)



%matplotlib inline

### **Hypothesis:**
**An agent will learn to maintain a net inventory of 0 in fewer time-steps when:**
* α is higher
* β is lower
* γ is lower

High learning rates may be optimal because the value of certain actions will change depending on the environment. By using a liberal strategy (low β), the agent may be more likely to explore the option of ordering lower stock, even though it may seem counterintuitive to the goal of reducing the cost of backorders at a particular time-step. A lower discount rate will help the agent integrate the feedback despite the effect of time delays.

### **Testing:**
The hypotheses will be tested by choosing low, medium, and high values of α, β, and γ for the Warehouse in each trial, which will contain twenty rounds of the Beer Game. Other suppliers will be programmed to meet consumer demand for each week, which will be held constant. The revenue from selling each unit as well as the fees for backorders and excess units will also be held constant.
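
One way to lay out these trials is a grid of parameter settings. The sketch below assumes a full crossing of the three levels of each parameter; the numeric levels are placeholders chosen for illustration, not the values used in this project.

```python
from itertools import product

# Placeholder low / medium / high levels (assumed, not from the original study).
alpha_levels = [0.1, 0.5, 0.9]
beta_levels = [0.5, 2.0, 8.0]
gamma_levels = [0.1, 0.5, 0.9]
N_ROUNDS = 20  # rounds of the Beer Game per trial (from the text)

# One dictionary per trial condition.
conditions = [
    {"alpha": a, "beta": b, "gamma": g, "n_rounds": N_ROUNDS}
    for a, b, g in product(alpha_levels, beta_levels, gamma_levels)
]

print(len(conditions))
print(conditions[0])
```

Each dictionary describes one trial for the Warehouse agent, so results can later be grouped by α, β, or γ to compare how quickly net inventory settles near 0.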

# Results: 

In [None]:
graphs

# Conclusions 