# Why use the discounted reward sum? Why not just add up the rewards normally?

# Motivation 1: (psychology) Humans and other animals seem to instinctively apply a weighting factor to future rewards.
- The future rewards are inherently uncertain.
- The weighting factor is a way to factor in the risks associated with uncertain future rewards.

## Makes a lot of sense in the model-free setting
- The agent does not know the MDP (the state transition probabilities and the reward function) in advance.
- The value of a state, given a policy, is an expected value. In theory, the only way to find the value function exactly is to go through the MDP an infinite number of times (an infinite number of episodes).
- With only a finite number of episodes, the agent can never be sure it has the true value of a state.
- To compensate for that, it decides to calculate a different quantity that it expects to converge faster than the canonical (undiscounted) value function, that is, one that takes fewer samples to converge.
- The easiest way to do that is to discount the contributions from the most uncertain parts of the value function calculation: the future rewards.
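
The faster-convergence argument above can be illustrated with a small sketch. It is not CartPole: it assumes a hypothetical task whose per-step rewards are independent Gaussian draws (mean 1, standard deviation 1) over a fixed horizon of 50 steps, and the helper names `discounted_return` and `sample_returns` are made up for this example. Under those assumptions, the sampled returns are spread much more widely for `gamma = 1.0` than for `gamma = 0.9`, so fewer episodes are needed to estimate the discounted quantity to a given precision. Note that the two settings estimate different quantities, as the bullets above point out.

```python
import random
import statistics


def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated backward."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def sample_returns(gamma, n_episodes=2000, horizon=50, seed=0):
    """Sample episode returns for a toy task with noisy rewards (hypothetical)."""
    rng = random.Random(seed)
    returns = []
    for _ in range(n_episodes):
        # Assumption: rewards are i.i.d. Gaussian with mean 1 and std 1.
        rewards = [rng.gauss(1.0, 1.0) for _ in range(horizon)]
        returns.append(discounted_return(rewards, gamma))
    return returns


# Spread of the sampled returns with and without discounting.
spread_undiscounted = statistics.stdev(sample_returns(gamma=1.0))
spread_discounted = statistics.stdev(sample_returns(gamma=0.9))
print("std of undiscounted returns:", spread_undiscounted)
print("std of discounted returns:  ", spread_discounted)
```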

# Motivation 2: Reward sum is not well defined in MDPs without terminal states, but the discounted reward sum is well defined.
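
To make this concrete, consider a hypothetical task that pays a reward of 1 on every step and never terminates (an assumption for illustration; the helper `partial_return` is also made up for this sketch). The undiscounted partial sums grow without bound as more steps are added, while the discounted partial sums approach 1 / (1 - gamma). More generally, if every reward is bounded in magnitude by some R_max and 0 <= gamma < 1, the discounted sum is bounded by R_max / (1 - gamma).

```python
def partial_return(reward, gamma, n_steps):
    """Sum of gamma**t * reward over the first n_steps steps."""
    total = 0.0
    for t in range(n_steps):
        total += (gamma ** t) * reward
    return total


gamma = 0.9
# Geometric-series limit of the discounted sum for a constant reward of 1.
limit = 1.0 / (1.0 - gamma)

for n_steps in (10, 100, 1000):
    undiscounted = partial_return(1.0, 1.0, n_steps)   # keeps growing with n_steps
    discounted = partial_return(1.0, gamma, n_steps)   # approaches `limit`
    print(n_steps, undiscounted, discounted)
```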

# Motivation 3: Some rare RL problems naturally include a discount factor (because of the nature of the rewards)

- Financial markets.
- Because of the *interest rate*, 10 dollars right now can be worth more than 10 dollars four years from now: invested today, the 10 dollars might grow to 11 or 12 dollars through accrued interest.
- *The interest rate* therefore gives a natural discount factor in problems related to financial markets.
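
The link between an interest rate and a discount factor can be sketched as follows. The 3% annual rate is a made-up figure for illustration, and `discount_factor` and `present_value` are hypothetical helper names. With annual compounding at rate r, one dollar received a year from now is worth 1 / (1 + r) dollars today, which plays the role of gamma.

```python
def discount_factor(annual_interest_rate):
    """Per-year discount factor implied by an annually compounded rate."""
    return 1.0 / (1.0 + annual_interest_rate)


def present_value(future_amount, annual_interest_rate, years):
    """Value today of an amount received `years` from now."""
    return future_amount * discount_factor(annual_interest_rate) ** years


rate = 0.03  # hypothetical annual interest rate
years = 4

# What 10 dollars invested today grows to after four years.
grown_amount = 10.0 * (1.0 + rate) ** years
# What 10 dollars received in four years is worth today.
value_today = present_value(10.0, rate, years)

print("gamma per year:", discount_factor(rate))
print("10 dollars today grows to:", grown_amount)
print("10 dollars in 4 years is worth today:", value_today)
```

At this assumed rate, 10 dollars grows to roughly 11.26 dollars over four years, which is in the 11 to 12 dollar range mentioned above.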

In [1]:
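import gym
import numpy as np


class InitMod(gym.Wrapper):
    """Wrapper that forces every episode to start from a fixed state."""

    def __init__(self, env, initial_state):
        super().__init__(env)
        self.initial_state = initial_state

    def reset(self):
        # Assumes the classic gym API, where reset() returns only the observation.
        self.env.reset()
        # Overwrite the randomly sampled start state with a copy of the fixed one.
        self.unwrapped.state = np.array(self.initial_state, dtype=np.float64)
        return self.unwrapped.state


# State layout: (cart position, cart velocity, pole angle, pole angular velocity).
pole_right_init_cartpole_env = InitMod(env=gym.make("CartPole-v0"), initial_state=np.array([0, 0.01, 0.15, 0]))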

In [2]:
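episode_history = []

# Roll out one episode under a uniformly random policy.
observation = pole_right_init_cartpole_env.reset()
while True:
    action = pole_right_init_cartpole_env.action_space.sample()
    # Assumes the classic gym API, where step() returns a 4-tuple.
    next_observation, reward, done, _ = pole_right_init_cartpole_env.step(action)
    # Pair each observation with the reward received for acting from it.
    episode_history.append({"observation": observation, "reward": reward})
    observation = next_observation
    if done:
        break
pole_right_init_cartpole_env.close()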

In [3]:
episode_history

[{'observation': array([0.  , 0.01, 0.15, 0.  ]), 'reward': 1.0},
 {'observation': array([ 2.00000000e-04,  2.02687997e-01,  1.50000000e-01, -2.41851667e-01]),
  'reward': 1.0},
 {'observation': array([ 0.00425376,  0.39538451,  0.14516297, -0.48371596]),
  'reward': 1.0},
 {'observation': array([ 0.01216145,  0.58819155,  0.13548865, -0.72735653]),
  'reward': 1.0},
 {'observation': array([ 0.02392528,  0.39148251,  0.12094152, -0.39528519]),
  'reward': 1.0},
 {'observation': array([ 0.03175493,  0.19487074,  0.11303581, -0.06705158]),
  'reward': 1.0},
 {'observation': array([ 0.03565235,  0.38820588,  0.11169478, -0.32204175]),
  'reward': 1.0},
 {'observation': array([ 0.04341646,  0.58157475,  0.10525395, -0.57751759]),
  'reward': 1.0},
 {'observation': array([ 0.05504796,  0.77507623,  0.09370359, -0.83527599]),
  'reward': 1.0},
 {'observation': array([ 0.07054948,  0.96880178,  0.07699807, -1.09708095]),
  'reward': 1.0},
 {'observation': array([ 0.08992552,  1.16283015,  0.0

In [4]:
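value_samples_random_policy = {}
gamma = 0.9
backward_reward_sum = 0
# Walk the episode backward so each step reuses its successor's return:
# G_t = r_t + gamma * G_{t+1}.
for step in reversed(episode_history):
    backward_reward_sum = (gamma * backward_reward_sum) + step["reward"]
    # Single-episode sample of the discounted return from this observation.
    value_samples_random_policy[tuple(step["observation"])] = backward_reward_sum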

In [5]:
for key, value in value_samples_random_policy.items():
    print(key, value)

(0.28777123575678687, 2.1396226018779734, -0.18716203833204625, -2.8875378288098035) 1.0
(0.24889719409596556, 1.943702083041065, -0.1360320029267657, -2.5565017702640267) 1.9
(0.21394034626397745, 1.7478423915994057, -0.09128932759414701, -2.2371337666309348) 2.71
(0.18289641130054746, 1.5521967481714996, -0.05271770404489147, -1.9285811774627768) 3.439
(0.1557595273416303, 1.3568441979458585, -0.02012379027308613, -1.6296956885902667) 4.0951
(0.1325233929872728, 1.1618067177178732, 0.006658279181731782, -1.3391034727408955) 4.68559
(0.11318212191608223, 0.9670635535595291, 0.027763533835231263, -1.055262732674974) 5.217031
(0.08992551900138579, 1.1628301457348222, 0.055056455782348086, -1.3646460973558412) 5.6953279000000006
(0.07054948338639869, 0.968801780749355, 0.07699807474966608, -1.0970809483658996) 6.12579511
(0.05504795870648505, 0.775076233995682, 0.09370359455527014, -0.835275990280203) 6.5132155990000005
(0.04341646379355925, 0.5815747456462901, 0.10525394643344155, -0.57

# Goal of model-free RL: Irrespective of the MDP, find the policy that maximizes the expected discounted reward sum per episode (value function of the initial state)

# Goal of model-free RL: Irrespective of the MDP, find the policy that maximizes the expected discounted reward sum (value) of all states in the MDP
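
The two goals can be written out as follows. The notation is an assumption, since the notebook does not fix one: $V^\pi(s)$ is the value of state $s$ under policy $\pi$, $R_{t+k+1}$ is the reward received $k+1$ steps after time $t$, $\gamma$ is the discount factor, and $s_0$ is the initial state.

$$
V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]
$$

The first goal asks for a policy that maximizes the value of the initial state only:

$$
\pi^* \in \arg\max_\pi V^\pi(s_0)
$$

The second goal is stronger and asks for a policy that is at least as good as every other policy in every state:

$$
V^{\pi^*}(s) \ge V^\pi(s) \quad \text{for all states } s \text{ and all policies } \pi
$$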