# Question 2: Mountain Car with Q-Learning
Dataset Problem: Use OpenAI Gym's MountainCar-v0 environment to train a Q-learning agent.
This follows the CartPole example but uses the Mountain Car environment. The Q-learning code is similar, with the state and action spaces adjusted to fit Mountain Car: the observation is a continuous (position, velocity) pair, and there are three discrete actions.
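Because tabular Q-learning needs a finite state space, the continuous (position, velocity) observation has to be mapped to bin indices first. The following is a minimal standard-library sketch of that mapping, assuming the bounds used in this notebook (position in [-1.2, 0.6], velocity in [-0.07, 0.07]) and 20 bins per dimension. The helper names `interior_edges` and `to_bins` are illustrative, not part of Gym.

```python
from bisect import bisect_right

# Observation bounds and bin counts assumed from this notebook's settings.
LOW = (-1.2, -0.07)    # (position, velocity) lower bounds
HIGH = (0.6, 0.07)     # (position, velocity) upper bounds
NUM_BINS = (20, 20)

def interior_edges(low, high, n_bins):
    """Return the n_bins - 1 evenly spaced edges strictly inside (low, high)."""
    width = (high - low) / n_bins
    return [low + width * k for k in range(1, n_bins)]

EDGES = [interior_edges(lo, hi, n) for lo, hi, n in zip(LOW, HIGH, NUM_BINS)]

def to_bins(observation):
    """Map a continuous (position, velocity) pair to a tuple of bin indices.

    Counting how many interior edges lie at or below the value yields an
    index in 0..n_bins - 1, so out-of-range values fall into the end bins.
    """
    return tuple(bisect_right(edges, value) for edges, value in zip(EDGES, observation))

print(to_bins((-1.2, -0.07)))  # lowest bin on both axes
print(to_bins((-0.5, 0.001)))  # near a typical start state
print(to_bins((0.6, 0.07)))    # highest bin on both axes
```

Using only interior edges means values slightly outside the nominal range (for example from float32 rounding of the lower bound) still land in a valid bin instead of producing a negative index.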


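In this implementation, I use a simple tabular Q-learning agent rather than a network: the continuous state is discretized into a 20 × 20 grid, and the Q-table stores one value per grid cell and action.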

In [None]:
!pip install --upgrade gym numpy

In [1]:
!pip install numpy==1.23.5



In [5]:
import numpy as np
import gym

# Initialize environment
env = gym.make("MountainCar-v0")

# Hyperparameters
num_episodes = 10000
alpha = 0.1  # Learning rate
gamma = 0.99  # Discount factor
epsilon = 1.0  # Initial exploration rate for epsilon-greedy
epsilon_min = 0.01  # Floor on the exploration rate
decay_rate = 0.995  # Multiplicative epsilon decay applied once per episode

# Discretization parameters
num_bins = [20, 20]  # Number of bins for position and velocity
# Keep only the interior edges so np.digitize returns indices 0..num_bins - 1
state_bins = [
    np.linspace(-1.2, 0.6, num_bins[0] + 1)[1:-1],  # Position
    np.linspace(-0.07, 0.07, num_bins[1] + 1)[1:-1]  # Velocity
]

# Initialize Q-table
q_table = np.zeros((num_bins[0], num_bins[1], env.action_space.n))

def discretize_state(state):
    """Converts continuous state to a tuple of discrete bin indices."""
    return tuple(int(np.digitize(state[i], state_bins[i])) for i in range(len(state)))

# Training loop
for episode in range(num_episodes):
    state, _ = env.reset()
    state = discretize_state(state)
    total_reward = 0
    done = False

    while not done:
        # Epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state])

        new_state, reward, terminated, truncated, _ = env.step(action)
        new_state = discretize_state(new_state)
        done = terminated or truncated

        # Update Q-table (do not bootstrap from a terminal state)
        best_future_q = 0.0 if terminated else np.max(q_table[new_state])
        q_table[state][action] += alpha * (reward + gamma * best_future_q - q_table[state][action])

        state = new_state
        total_reward += reward

    # Decay epsilon
    epsilon = max(epsilon * decay_rate, epsilon_min)

    if episode % 1000 == 0:
        print(f"Episode {episode}, Total Reward: {total_reward}")

# Evaluation (greedy policy, no exploration)
state, _ = env.reset()
state = discretize_state(state)
done = False
total_reward = 0

while not done:
    action = np.argmax(q_table[state])
    new_state, reward, terminated, truncated, _ = env.step(action)
    state = discretize_state(new_state)  # Advance to the next state
    done = terminated or truncated
    total_reward += reward
    # env.render() is omitted: it only draws if render_mode was passed to gym.make()

env.close()
print(f"Total Reward: {total_reward}")

Episode 0, Total Reward: -200.0
Episode 1000, Total Reward: -200.0
Episode 2000, Total Reward: -153.0
Episode 3000, Total Reward: -200.0
Episode 4000, Total Reward: -115.0
Episode 5000, Total Reward: -163.0
Episode 6000, Total Reward: -144.0
Episode 7000, Total Reward: -137.0
Episode 8000, Total Reward: -152.0
Episode 9000, Total Reward: -180.0
Total Reward: -200.0
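When reading the training log, it helps to know how quickly exploration is annealed. The sketch below replays the per-episode decay rule from the training loop (`epsilon = max(epsilon * decay_rate, epsilon_min)`) and counts the episodes needed to reach the floor. The function name `episodes_until_floor` is hypothetical, and its defaults simply mirror the hyperparameters above.

```python
def episodes_until_floor(epsilon=1.0, epsilon_min=0.01, decay_rate=0.995):
    """Count the per-episode decay steps needed before epsilon sits at its floor."""
    episodes = 0
    while epsilon > epsilon_min:
        # Same update as the training loop: multiplicative decay, clipped below.
        epsilon = max(epsilon * decay_rate, epsilon_min)
        episodes += 1
    return episodes

print(episodes_until_floor())
```

With these settings the floor is reached after roughly the first 900 episodes, so most of the 10,000 training episodes run with only 1% random actions. A slower decay is one possible knob to try if the agent explores too little, though whether it helps here is untested.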


