<a href="https://colab.research.google.com/github/hanklin3/code/blob/master/6_7920_Fall_2024_Homework_5_problem_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup



In this problem, we will consider the discounted stochastic LQ control problem, as described in Lecture 3. Specifically, we will consider using Q-learning to approximate the Q-function of this problem. Below, we provide an instant of this problem through the environment **LQEnv**. The environment is a python class, whose interface is very similar to that of OpenAI gym.

The code block below imports important and relevant packages, defnies the environment, and includes helper functions like a policy evaluation function.

Kerb: hanklin

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
import gym
from gym import error, spaces, utils
from gym.utils import seeding

dtype = np.float32
class LQEnv(gym.Env):
  def __init__(self, n, m, seed = 0, gamma = 0.9, sigma = 0.2):
    """
    The LQ environment, initialized with the parameters
    n: state dimension
    m: action dimension
    seed: random seed
    gamma: discount factor
    sigma: noise variance
    """
    np.random.seed(seed)
    self.n = n
    self.action_space = spaces.Box(low = -np.inf, high = np.inf, shape =  [m,1], dtype = dtype )
    self.observation_space = spaces.Box(low = -np.inf, high = np.inf, shape =  [n,1], dtype = dtype )
    self.A = np.array([[0.0488135 , 0.21518937],
                       [0.10276338, 0.04488318]
                        ])
    self.B =  np.array([[-0.0763452 ,  0.14589411],
                        [-0.06241279,  0.391773  ]
                       ])
    self.R =  np.array([[1.57567331, 0.96575621],
                        [0.96575621, 1.40655837]
                        ])
    self.Q =  np.array([[1.67940376, 0.12099823],
                        [0.12099823, 0.51263764]
                        ])
    self.state = self.observation_space.sample()
    self.sigma = sigma
    self.gamma = gamma

  def step(self, action):
    """
    This method is the primary interface between environment and agent.
    Paramters:
        action: array of shape [m,1]

    Returns:
        output: (next state:array, cost:float, done:bool, None)

    """
    err_msg = f"{action!r} ({action.shape}) ({type(action)}) invalid"
    # assert self.action_space.contains(action), err_msg
    assert self.state is not None, "Call reset before using step method."
    w = self.sigma*np.random.randn()

    cost = (self.state.T.dot(self.Q)).dot(self.state) +  (action.T.dot(self.R)).dot(action)

    self.state = self.A.dot(self.state) + self.B.dot(action) + w

    done = False

    return self.state, cost, done, {}

  def reset(self, state = None):
    """
    This method resets the environment to its initial values.
    Paramters:
        state array
            set the env to specifc state (optional)
    Returns:
        observation:    array
                        the initial state of the environment
    """
    if state is None:
        # sample from a guassian with zero mean and std of 10
        self.state = 10*self.observation_space.sample()
    else:
        self.state = state
    return  self.state

  def close(self):
    """
    This method provides the user with the option to perform any necessary cleanup.
    """
    pass

  def sample_random_action(self):
      """
      sample actions from a normal dist with mean zero and std 10
      """
      return  10*env.action_space.sample()


def policy_evaluation(policy, env, T = 1000, N = 5):
    """
    This method evaluate the performance of a specifc policy on the env through simulating
    it on N trajectories each of length T.
    Paramters:
        policy: function that takes in state and return action
        env: instant of LQEnv
        T: number of iterations per trajectory
        N: number of trajectories
    Returns:
        output: mean discounted total cost

    """
    costs = []
    for ite in tqdm(range(N)):
        state = env.reset()
        gamma = env.gamma
        total_costs = 0
        for t in range(T):
            action = policy(state)
            # print('\npolicy_evaluation:action', action.shape, action)
            state, cost, _, _ = env.step(action)
            total_costs += cost * gamma**(t)
        costs.append(total_costs)
    return np.mean(costs)

def lin_policy(K):
    """
    helper function to define linear policies of the form u = L@x
    """
    def policy(state):
        return K.dot(state).astype(dtype)
    return policy


def Q_a(x,u, theta):
    """
    Q-function parameterized by the tuple/list theta
    """
    q =  x.T@theta[0].T@theta[0]@x
    q+= u.T@theta[1].T@theta[1]@u
    q+= 2*x.T@theta[2]@u + theta[3]
    return q[0,0]



## Initialize environemnt

Below, we initialize the environment that you need to interact with in this excercise.

In [None]:
n = 2
m = 2
env = LQEnv(n,m)

### (a) Assuming knowledge of the system matrices $A, B, R,$ and $Q$, compute $P$, the solution to the appropriate form of the Riccati equation, and the matrix $K$ that characterizes the optimal policy. Evaluate the policy performance over 100 trajectories each with length 1000.


In [None]:
import numpy as np
from scipy.linalg import solve_discrete_are

# Solve the discrete-time algebraic Riccati equation
P = solve_discrete_are(env.A, env.B, env.Q, env.R)

# Compute the optimal policy L = -(B^T P B + R)^-1 B^T P A
K = -np.linalg.inv(env.B.T @ P @ env.B + env.R) @ (env.B.T @ P @ env.A)


print("Solution to the Riccati equation (P):")
print(P)
print("\nOptimal policy matrix (K):")
print(K)

def optimal_policy(K):
  def policy(state):
     return -K_optimal @ state
  return policy  # Using the optimal gain matrix K

K = K.astype(dtype)
# policy = lin_policy(K)
policy = optimal_policy(K)
print('policy', policy)

# cost_mean = policy_evaluation(policy, env, T = 1000, N = 5)
cost_mean = policy_evaluation(policy, env, T = 1000, N = 100)
print('cost_mean', cost_mean)


  and should_run_async(code)


Solution to the Riccati equation (P):
[[1.68872555 0.14018861]
 [0.14018861 0.58521217]]

Optimal policy matrix (K):
[[ 0.03457423  0.07474569]
 [-0.04677614 -0.09387158]]
policy <function optimal_policy.<locals>.policy at 0x7b0e42b075b0>


100%|██████████| 100/100 [00:15<00:00,  6.59it/s]

cost_mean 244.43228610055655





In [None]:
print('cost_mean', cost_mean)

cost_mean 244.43228610055655


In [None]:
a1 = env.action_space.sample()
print(a1.shape, type(a1), a1)

(2, 1) <class 'numpy.ndarray'> [[-1.5008005 ]
 [ 0.45925373]]


### (b) Given an arbitrary $\Theta_0$, write the greedy policy with respect to $Q(x; u; \Theta_0)$ in terms of the parameters $\Theta_0$. What is the form of this greedy policy?


The greedy policy minimizes the Q-function with respect to the control input u.

$dQ(x;u;\theta_0)/du$ = 2$\theta_2^T$$\theta_2$$u$+2$\theta_3^T$$x$ = 0

$u = -(θ_{2}^Tθ_{2})^{-1}θ_{3}^Tx$

### (c) Implement Q-learning for this problem, where actions are chosen using the greedy policy (wrt to the current Q-function). Then train using samples chosen from a single arbitrarily long trajectory. Use the given function to initialize theta.

Hint: If your states/parameters grow very large during training, consider making the learning rate very small

In [None]:
def init_theta(n,m, fun = np.random.randn, seed = 1):
    np.random.seed(seed)
    A = fun(n,n)
    B = fun(m,m)
    C = fun(n,m)
    const = 0
    return [A,B,C, const]


In [None]:
import numpy as np

def Q_function(x, u, theta1, theta2, theta3, theta4):
    """Computes the quadratic approximation of the Q-function."""
    return (x.T @ theta1.T @ theta1 @ x + u.T @ theta2.T @ theta2 @ u
            + 2 * x.T @ theta3 @ u + theta4)

def greedy_policy(x, theta2, theta3, action_space):
    """Computes the greedy policy u based on the current parameters."""
    # u = -np.linalg.inv(theta2.T @ theta2) @ (theta3.T @ x)

    # # Compute the matrix to invert with regularization to avoid singular
    # regularized_matrix = theta2.T @ theta2 + reg * np.eye(theta2.shape[1])
    # # Calculate the greedy action
    # u = -np.linalg.inv(regularized_matrix) @ (theta3.T @ x)

    # avoid singluar using pseudo inverse
    u = -np.linalg.pinv(theta2.T @ theta2) @ (theta3.T @ x)

    return u


def Q_learning(env, theta1, theta2, theta3, theta4, alpha=0.001, gamma=0.9, num_steps=10000):
    """Q-learning with function approximation for the LQ control problem."""
    state = env.reset()  # Initialize the environment
    for step in range(num_steps):
        # Select action using greedy policy
        action = greedy_policy(state, theta2, theta3, env.action_space)
        # print("Action shape:", action.shape)

        # Perform the action and observe the next state and reward
        next_state, reward, is_done, _ = env.step(action)

        # Compute Q(s, a; Θ)
        current_Q = Q_function(state, action, theta1, theta2, theta3, theta4)

        # Compute the greedy action for the next state
        next_action = greedy_policy(next_state, theta2, theta3, env.action_space)

        # Compute Q(s', a'; Θ)
        next_Q = Q_function(next_state, next_action, theta1, theta2, theta3, theta4)

        # Update Q-learning parameters using the update rule
        td_error = reward + gamma * next_Q - current_Q

        # Update parameters
        theta1 += alpha * td_error * (np.outer(state, state) @ theta1)
        theta2 += alpha * td_error * (np.outer(action, action) @ theta2)
        theta3 += alpha * td_error * (np.outer(state, action))
        theta4 += alpha * td_error


        # Move to the next state
        state = next_state

    return theta1, theta2, theta3, theta4

# Initialize the environment and parameters
env = LQEnv(n=2, m=2)
theta1, theta2, theta3, theta4 = init_theta(2, 2)
print("Initial Parameters:")
print("Theta1:", theta1)
print("Theta2:", theta2)
print("Theta3:", theta3)
print("Theta4:", theta4)

# Run Q-learning
theta1, theta2, theta3, theta4 = Q_learning(env, theta1, theta2, theta3, theta4)

# Evaluate the resulting policy
print("Updated Parameters:")
print("Theta1:", theta1)
print("Theta2:", theta2)
print("Theta3:", theta3)
print("Theta4:", theta4)

Initial Parameters:
Theta1: [[ 1.62434536 -0.61175641]
 [-0.52817175 -1.07296862]]
Theta2: [[ 0.86540763 -2.3015387 ]
 [ 1.74481176 -0.7612069 ]]
Theta3: [[ 0.3190391  -0.24937038]
 [ 1.46210794 -2.06014071]]
Theta4: 0
Updated Parameters:
Theta1: [[ 1.68340042 -0.70572002]
 [-0.40154793 -1.05705392]]
Theta2: [[ 0.86740957 -2.30426342]
 [ 1.73912616 -0.75998766]]
Theta3: [[ 0.30515654 -0.23440247]
 [ 1.4641209  -2.07320247]]
Theta4: [[0.63044645]]


### Do the parameters converge? Is the resulting policy the same as the policy obtained in part (a)? If not, why?


In [None]:
# Optimal feedback gain matrix K from Part a (using the Riccati equation)
K_optimal = -np.linalg.inv(env.B.T @ P @ env.B + env.R) @ (env.B.T @ P @ env.A)

# Learned parameters from Q-learning in Part c
theta_2_learned = theta2  # Obtained from Q-learning
theta_3_learned = theta3  # Obtained from Q-learning

# Compute the learned policy gain matrix K_learned
K_learned = np.linalg.inv(theta_2_learned.T @ theta_2_learned) @ theta_3_learned.T

# Compare the two gain matrices
diff = np.linalg.norm(K_learned - K_optimal, 'fro')

print("Optimal K (from Part a):")
print(K_optimal)

print("\nLearned K (from Part c):")
print(K_learned)

print("\nDifference between K_optimal and K_learned (Frobenius norm):", diff)

# Threshold to determine if policies are similar
if diff < 1e-3:  # You can adjust this threshold based on precision needs
    print("\nThe learned policy is approximately the same as the optimal policy from Part a.")
else:
    print("\nThe learned policy is different from the optimal policy from Part a.")

Answer = 'It did not converge likely due to it did not explore all states to find the optimal policy'
print(Answer  )


Optimal K (from Part a):
[[ 0.03457423  0.07474569]
 [-0.04677614 -0.09387158]]

Learned K (from Part c):
[[ 0.09082672  0.15482267]
 [ 0.01141178 -0.26483182]]

Difference between K_optimal and K_learned (Frobenius norm): 0.20540180977561184

The learned policy is different from the optimal policy from Part a.
It did not converge likely due to it did not explore all states to find the optimal policy


Answer = 'It did not converge likely due to it did not explore all states to find the optimal policy, since we always do greedy and never explored.'

### (d) Train using  samples chosen from multiple trajectories. Each with length 25.  Also choose actions using an $\epsilon$-greedy policy for $\epsilon = 0.3$. Do the parameters converge? Is the resulting policy the same as the policy obtained in part (a)? Evaluate the policy performance over 100 trajectories each with length 1000 and compare its average cost with that of the optimal policy.

Hint: to sample actions randomly, use the method: env.sample_random_action()

In [None]:
import numpy as np

def epsilon_greedy_policy(x, theta1, theta2, theta3, epsilon, action_space):
    """Chooses an action using an epsilon-greedy policy."""
    if np.random.rand() < epsilon:
        # Random action
        return env.action_space.sample()
    else:
        # Greedy action
        return greedy_policy(x, theta2, theta3, action_space)

def Q_learning_multiple_trajectories(env, theta1, theta2, theta3, theta4, epsilon=0.3,
                                     alpha=1e-8, gamma=0.9, num_trajectories=100, trajectory_length=25):
    """Q-learning with function approximation and multiple trajectories."""
    for trajectory in range(num_trajectories):
        state = env.reset()  # Reset the environment at the start of each trajectory

        for t in range(trajectory_length):
            # Choose action using epsilon-greedy policy
            action = epsilon_greedy_policy(state, theta1, theta2, theta3, epsilon, env.action_space)

            # Perform action and observe the next state and reward
            next_state, reward, is_done, _  = env.step(action)

            # Compute Q(s, a; Θ)
            current_Q = Q_function(state, action, theta1, theta2, theta3, theta4)

            # Choose next action using greedy policy
            next_action = greedy_policy(next_state, theta2, theta3, env.action_space)

            # Compute Q(s', a'; Θ)
            next_Q = Q_function(next_state, next_action, theta1, theta2, theta3, theta4)

            # Compute TD error
            td_error = reward + gamma * next_Q - current_Q

            # # Update parameters using the TD error
            # theta1 += alpha * td_error * (np.outer(state, state) @ theta1)  # Update rule for theta1
            # theta2 += alpha * td_error * (np.outer(action, action) @ theta2)  # Update rule for theta2
            # theta3 += alpha * td_error * (np.outer(state, action))  # Update rule for theta3
            # theta4 += alpha * td_error * 1.0  # Update rule for theta4

            # Clip the TD error to avoid thetas converges to NaN
            clipped_td_error = np.clip(td_error, -1.0, 1.0)

            # Update parameters
            theta1 += alpha * clipped_td_error * (np.outer(state, state) @ theta1)
            theta2 += alpha * clipped_td_error * (np.outer(action, action) @ theta2)
            theta3 += alpha * clipped_td_error * (np.outer(state, action))
            theta4 += alpha * clipped_td_error

            # Move to the next state
            state = next_state
    return theta1, theta2, theta3, theta4

In [None]:
import time

def runs(epsilon, theta_seed=1):
  n = 2
  m = 2
  env = LQEnv(n, m)
  theta1, theta2, theta3, theta4 = init_theta(n, m, seed=theta_seed)

  # Run Q-learning with multiple trajectories
  time_start = time.time()
  theta1, theta2, theta3, theta4 = Q_learning_multiple_trajectories(
      env, theta1, theta2, theta3, theta4,
      num_trajectories=100, trajectory_length=1000,
      alpha=0.00001, epsilon=epsilon)
  time_end = time.time()
  print("Updated Parameters:")
  print("Theta1:", theta1)
  print("Theta2:", theta2)
  print("Theta3:", theta3)
  print("Theta4:", theta4)

  return theta1, theta2, theta3, theta4, time_end - time_start

def optimal_cost(n = 2, m = 2):
  env = LQEnv(n, m)
  # Solve the discrete-time algebraic Riccati equation
  P = solve_discrete_are(env.A, env.B, env.Q, env.R)
  # Compare the resulting policy with the optimal policy
  K_optimal = -np.linalg.inv(env.B.T @ P @ env.B + env.R) @ (env.B.T @ P @ env.A)  # Optimal policy from Part a
  policy = lin_policy(K_optimal)
  cost_mean = policy_evaluation(policy, env, T = 1000, N = 100)
  return cost_mean

def learned_cost(theta1, theta2, theta3, theta4):
  K_learned = np.linalg.inv(theta2.T @ theta2) @ theta3.T  # Learned policy from Q-learning
  policy = lin_policy(K_optimal)
  cost_mean = policy_evaluation(policy, env, T = 1000, N = 100)
  return cost_mean

def print_compare_cost(epsilon, avg_cost_optimal, avg_cost_learned):
  print('\n')
  print("Epsilon:", epsilon)
  # Print the results
  print("Average Cost of Optimal Policy:", avg_cost_optimal)
  print("Average Cost of Learned Policy:", avg_cost_learned)

  # Compare the two average costs
  if avg_cost_learned < avg_cost_optimal:
      print("The learned policy is better than the optimal policy.")
  else:
      print("The learned policy is worse than the optimal policy.")


def print_compare_policy(theta2, theta3, n = 2, m = 2):
  # Compare the resulting policy with the optimal policy
  env = LQEnv(n, m)
  K_optimal = -np.linalg.inv(env.B.T @ P @ env.B + env.R) @ (env.B.T @ P @ env.A)  # Optimal policy from Part a
  K_learned = np.linalg.inv(theta2.T @ theta2) @ theta3.T  # Learned policy from Q-learning

  diff = np.linalg.norm(K_learned - K_optimal, 'fro')

  print("Optimal feedback gain matrix K (from Part a):")
  print(K_optimal)

  print("\nLearned feedback gain matrix K (from Part d):")
  print(K_learned)

  print("\nDifference between K_optimal and K_learned (Frobenius norm):", diff)

  # Threshold to check if the policies are approximately the same
  if diff < 1e-3:
      print("\nThe learned policy is approximately the same as the optimal policy from Part a.")
  else:
      print("\nThe learned policy is different from the optimal policy from Part a.")

In [None]:
avg_cost_optimal = optimal_cost()

100%|██████████| 100/100 [00:14<00:00,  6.77it/s]


In [None]:
print('avg_cost_optimal', avg_cost_optimal)

avg_cost_optimal 237.18853852775806


In [None]:
runtimes = {}

In [None]:
epsilon = 0.3
theta1, theta2, theta3, theta4, runtimes[epsilon] = runs(epsilon, theta_seed=1)
avg_cost_learned = learned_cost(theta1, theta2, theta3, theta4)


Updated Parameters:
Theta1: [[ 1.51472038 -0.61475558]
 [-0.4552223  -1.0586355 ]]
Theta2: [[ 0.9877118  -2.17478617]
 [ 1.57947942 -0.91222193]]
Theta3: [[ 0.3134316  -0.2332111 ]
 [ 1.44995774 -2.04967514]]
Theta4: [[-0.05280974]]


100%|██████████| 100/100 [00:14<00:00,  6.90it/s]


In [None]:
print_compare_cost(epsilon, avg_cost_optimal, avg_cost_learned)




Epsilon: 0.3
Average Cost of Optimal Policy: 237.18853852775806
Average Cost of Learned Policy: 190.43095456513208
The learned policy is better than the optimal policy.


### Do the parameters converge? Is the resulting policy the same as the policy obtained in part (a)?


In [None]:
print_compare_policy(theta2, theta3)

Optimal feedback gain matrix K (from Part a):
[[ 0.03457423  0.07474569]
 [-0.04677614 -0.09387158]]

Learned feedback gain matrix K (from Part d):
[[ 0.14113878  0.11031566]
 [ 0.04914226 -0.29734088]]

Difference between K_optimal and K_learned (Frobenius norm): 0.25143851178909465

The learned policy is different from the optimal policy from Part a.


*type your answer here*
It doesn't look like the policy matrix is exactly the same, but the cost is pretty close to the optimal policy cost.
So YES, I would say it has converged.

### Evaluate the policy performance over 100 trajectories each with length 1000 and compare its average cost with that of the optimal policy.


### (e)  Try different values of epsilons $\in \{0.2,0.4,0.6, 0.8\}$ (using the same initialization of $\Theta$). With which value of epsilon does the algorithm converges faster?



In [None]:
epsilon = 0.2
theta1, theta2, theta3, theta4, runtimes[epsilon] = runs(epsilon, theta_seed=1)
avg_cost_learned = learned_cost(theta1, theta2, theta3, theta4)
print_compare_cost(epsilon, avg_cost_optimal, avg_cost_learned)

Updated Parameters:
Theta1: [[ 1.4499848  -0.67285978]
 [-0.34923284 -1.10862877]]
Theta2: [[ 1.0294868  -2.14965543]
 [ 1.53084867 -0.95898962]]
Theta3: [[ 0.30610166 -0.20934839]
 [ 1.43967336 -2.05060619]]
Theta4: [[0.00176851]]


100%|██████████| 100/100 [00:14<00:00,  6.94it/s]



Epsilon: 0.2
Average Cost of Optimal Policy: 237.18853852775806
Average Cost of Learned Policy: 218.5230098221142
The learned policy is better than the optimal policy.





In [None]:
epsilon = 0.4
theta1, theta2, theta3, theta4, runtimes[epsilon] = runs(epsilon, theta_seed=1)
print_compare_cost(epsilon, avg_cost_optimal, avg_cost_learned)

Updated Parameters:
Theta1: [[ 1.36153405 -0.62589886]
 [-0.33567115 -1.03539694]]
Theta2: [[ 1.17960776 -2.09961113]
 [ 1.4418526  -1.14886842]]
Theta3: [[ 0.31109595 -0.20980312]
 [ 1.43470282 -2.03158151]]
Theta4: [[-0.08288427]]


Epsilon: 0.4
Average Cost of Optimal Policy: 237.18853852775806
Average Cost of Learned Policy: 218.5230098221142
The learned policy is better than the optimal policy.


In [None]:
epsilon = 0.6
theta1, theta2, theta3, theta4, runtimes[epsilon] = runs(epsilon, theta_seed=1)
print_compare_cost(epsilon, avg_cost_optimal, avg_cost_learned)

Updated Parameters:
Theta1: [[ 1.43079401 -0.61688035]
 [-0.37646226 -0.99896519]]
Theta2: [[ 1.35437888 -2.13973392]
 [ 1.45226702 -1.36868318]]
Theta3: [[ 0.29516281 -0.22886941]
 [ 1.44078412 -2.00913236]]
Theta4: [[-0.17331113]]


Epsilon: 0.6
Average Cost of Optimal Policy: 237.18853852775806
Average Cost of Learned Policy: 218.5230098221142
The learned policy is better than the optimal policy.


In [None]:
epsilon = 0.8
theta1, theta2, theta3, theta4, runtimes[epsilon] = runs(epsilon, theta_seed=1)
print_compare_cost(epsilon, avg_cost_optimal, avg_cost_learned)

Updated Parameters:
Theta1: [[ 1.32827875 -0.50416198]
 [-0.4277696  -0.88029103]]
Theta2: [[ 1.47648568 -2.17259781]
 [ 1.43410515 -1.5007993 ]]
Theta3: [[ 0.28610879 -0.2226254 ]
 [ 1.43863098 -1.99766938]]
Theta4: [[-0.27432793]]


Epsilon: 0.8
Average Cost of Optimal Policy: 237.18853852775806
Average Cost of Learned Policy: 218.5230098221142
The learned policy is better than the optimal policy.


In [None]:
print(runtimes)

{0.2: 59.69315814971924, 0.4: 65.580899477005, 0.6: 75.0695641040802, 0.8: 76.54838824272156, 0.3: 62.00534701347351}


In [None]:
Answer = '0.2 epsilon has the fastest runtimes'
print(Answer)

0.2 epsilon has the fastest runtimes


###  (f) Is it fair to generalize the conclusions we have found about the optimal way to select epsilon and the trajectory length to other problems/environment? For example, if you have found that a single infinite trajectory does not work well, and larger values of epsilon works better, can you generalize this to other environments/problems like, for example, chess? why/why not?

*type your answer here*: The hyperparameters might not generalized to other environments/problems.
A more complex environment, like large state space and non-convex, high epsilon results may result in too much exploration that lead to poor learning since random action are unlikely to provide valuable feedbacks.
Smaller state space, and smoother and continuous dynamics (like LQ), a larger epsilon can help agent explore more and escape local optima.

A short trajectories may limit the agent's ability to learn longer-term dependencies, espeically when the rewards/cost are sparse and near the end of episode (like winning a game).
Longer trajectories on the other hands provides more data but is more computationally expensive.