# Hill-Climbing Approach

Hill climbing mirrors how we would intuitively search for a solution: start somewhere (randomly chosen weights), then gradually take steps toward the best position (add some noise to the weights and keep the change whenever the reward improves).

In [5]:
import gym
import numpy as np
import matplotlib.pyplot as plt

The idea is to improve gradually rather than jump around at random and hope to land on the best solution. As we may see, this is still not guaranteed to find the best solution, but it often does better than purely random search.

Noise scaling controls how far the next candidate is allowed to move from the current best weights. With a very high scaling, the method is essentially random search.
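To make the effect of noise scaling concrete, here is a minimal sketch that perturbs a fixed weight vector with a small and a large scaling. The helper `sample_candidate` and the example weights are assumptions introduced for illustration; they are not part of the notebook's code, although the sampling mirrors the `parameters = ...` line in `train`.

```python
import numpy as np

def sample_candidate(best_params, noise_scaling, rng):
    """Hypothetical helper: perturb best_params with uniform noise
    drawn from [-noise_scaling, noise_scaling) in each dimension."""
    noise = rng.uniform(-1.0, 1.0, size=best_params.shape) * noise_scaling
    return best_params + noise

rng = np.random.default_rng(0)
best_params = np.array([0.5, -0.2, 0.1, 0.8])  # illustrative weights

small_step = sample_candidate(best_params, 0.1, rng)   # stays close to best_params
large_step = sample_candidate(best_params, 10.0, rng)  # can land almost anywhere

# Largest per-weight change: bounded by the chosen noise scaling.
print(np.abs(small_step - best_params).max())
print(np.abs(large_step - best_params).max())
```

With a scaling of 0.1 every weight stays within 0.1 of its current value, so the search is local. With a scaling of 10 the current weights barely constrain the candidate, which is why a large scaling behaves like random search.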

Because the improvement is iterative, the initial weights play a major role in how well the agent performs. If you start at a position that is too far from the best reward, you won't find a good answer. That is why many initializations get stuck and never produce a result.

A fix is to increase the noise scaling on every iteration that brings no improvement. In other words, increase the randomness when you don't see the required change, since you are probably not in the right place.
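The fix described above could be sketched as follows. This is an illustration under stated assumptions, not the notebook's implementation: the function `hill_climb_adaptive`, the `growth`, `min_noise`, and `max_noise` parameters, the choice to reset the noise after an improvement, and the toy reward function are all hypothetical. A toy reward is used in place of CartPole so that the sketch runs without gym.

```python
import numpy as np

def hill_climb_adaptive(evaluate, n_params, iterations=1000,
                        noise_scaling=0.1, growth=1.5,
                        min_noise=0.1, max_noise=2.0, seed=0):
    """Hill climbing that widens the search whenever a step fails.

    evaluate: callable mapping a parameter vector to a scalar reward.
    All tuning values here are assumptions chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    best_params = rng.uniform(-1.0, 1.0, size=n_params)
    best_reward = evaluate(best_params)
    for _ in range(iterations):
        noise = rng.uniform(-1.0, 1.0, size=n_params) * noise_scaling
        candidate = best_params + noise
        reward = evaluate(candidate)
        if reward > best_reward:
            # Improvement: keep the candidate and return to small steps.
            best_params, best_reward = candidate, reward
            noise_scaling = min_noise
        else:
            # No improvement: increase the randomness, up to a cap.
            noise_scaling = min(max_noise, noise_scaling * growth)
    return best_params, best_reward

# Toy stand-in for the environment: reward is highest at `target`.
target = np.array([0.3, -0.7, 0.2, 0.9])

def toy_reward(params):
    return -np.sum((params - target) ** 2)

params, reward = hill_climb_adaptive(toy_reward, n_params=4)
print(params, reward)
```

To try this on CartPole, `evaluate` could be something like `lambda p: run_episode(env, p)`. The cap on the noise keeps the search from degenerating permanently into random search, and resetting after an improvement is one possible design choice among several.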



In [6]:
def train(submit):
    env = gym.make('CartPole-v0')
    if submit:
        env.monitor.start('cartpole-experiments/', force = True)
    
    episodes_per_update = 5
    noise_scaling = 0.1
    counter = 0
    bestparams = np.random.rand(4) * 2 - 1
    bestreward = 0
    for _ in range(10000):
        parameters = (np.random.rand(4) * 2 - 1)*noise_scaling + bestparams 
        counter += 1
        # This can further improve the solution: instead of judging a
        # candidate on a single episode, sum the reward over several
        # episodes and update on the total.
        # reward = 0
        # for _ in range(episodes_per_update):
        #     reward += run_episode(env, parameters)
        reward = run_episode(env, parameters)
        if reward > bestreward:
            bestreward = reward
            bestparams = parameters
            # considered solved if the agent lasts 200 timesteps
            if reward == 200:
                break
    
    if submit:
        for _ in range(100):
            run_episode(env, bestparams)
        env.monitor.close()
        
    return counter

In [7]:
def run_episode(env, parameters):
    # Run one episode with a linear policy and return the total reward.
    observation = env.reset()
    totalreward = 0
    for _ in range(200):
        # Push left (0) or right (1) depending on the sign of the
        # weighted sum of the observation.
        action = 0 if np.matmul(parameters, observation) < 0 else 1
        #env.render()
        observation, reward, done, info = env.step(action)  # take that action
        totalreward += reward
        if done:
            break
    return totalreward

In [8]:
# train returns the number of candidates tried; print it in thousands
r = train(submit = False)
print(r/1000.0)

0.059
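
The commented-out idea in `train`, judging each candidate over several episodes instead of one, could look roughly like the sketch below. The helper `average_reward` and the stand-in `fake_run_episode` are hypothetical names introduced here; the stand-in only imitates a noisy episode score so that the example runs without gym.

```python
import numpy as np

def average_reward(run_episode_fn, env, parameters, episodes_per_update=5):
    """Hypothetical helper: mean total reward over several episodes."""
    rewards = [run_episode_fn(env, parameters)
               for _ in range(episodes_per_update)]
    return float(np.mean(rewards))

# Stand-in for run_episode: returns a noisy score (assumption for the demo).
rng = np.random.default_rng(0)

def fake_run_episode(env, parameters):
    return 100.0 + rng.normal(scale=20.0)

single = fake_run_episode(None, np.zeros(4))
averaged = average_reward(fake_run_episode, None, np.zeros(4))
print(single, averaged)
```

In `train`, the call `run_episode(env, parameters)` would be replaced by `average_reward(run_episode, env, parameters, episodes_per_update)`. Averaging makes a single lucky episode less likely to lock in poor weights, at the cost of more episodes per candidate, and the `reward == 200` check would then only pass when every episode in the batch reaches 200.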
