# Simple Reinforcement Learning Implementation

***Summary:***

The assignment involves **implementing simple reinforcement learning (RL) classes** for an agent and an environment. The agent is a piece of code that aims to gain rewards through interactions, while the environment is an external model of the world that provides observations and rewards to the agent. The focus is on writing basic RL code with a dummy environment that gives random rewards regardless of the agent's actions. The environment class has methods for handling actions, providing observations, listing the available actions, and checking whether the episode has ended. The agent class includes a constructor and a step function that lets the agent observe the environment, choose an action, submit it, and collect the reward.

***Purpose:***

The purpose of the assignment is to provide a hands-on introduction to the implementation of basic reinforcement learning concepts. By creating simple classes for the agent and environment, the assignment aims to help learners understand the fundamental components of RL systems, such as the interaction between the agent and the environment, handling actions and observations, and collecting rewards. The use of a dummy environment with random rewards simplifies the focus on the implementation of the classes, making it accessible for learning purposes without the complexity of practical RL applications.

***SampleEnvironment Class:***

* Constructor (__init__): Initializes the environment with a fixed number of steps_left (set to 20 initially).

* get_observation Method: Returns a fixed observation, represented by a zero vector [0.0, 0.0, 0.0].

* get_actions Method: Returns a list of available actions, [0, 1].

* is_done Method: Checks if the number of remaining steps is zero, indicating the end of the episode.

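* action Method: Accepts an action as an argument, raises an exception if the episode is already over, decrements the steps_left counter, and returns a random reward using random.random(), a float in [0.0, 1.0). The chosen action has no effect on the reward.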

***Agent Class:***

* Constructor (__init__): Initializes the agent with an initial total_reward set to zero.

* step Method: Accepts an instance of the SampleEnvironment class. It obtains the current observation from the environment, prints it, gets the available actions, selects a random action, performs the action in the environment, and updates the agent's total_reward with the obtained reward.

***Main Execution (__main__ block):***

* Creates an instance of SampleEnvironment and Agent.

* Enters a loop until the environment is done.

* In each iteration, the agent takes a step in the environment by calling the step method.

* Prints the total reward obtained by the agent at the end of the episode.


Note: A method can return only one value, so the action method cannot return both random.random() and random.choice([0, 1]). The code shown below returns random.random(), which is why the rewards in the output are floats in [0.0, 1.0). If binary rewards are preferred, the line can be changed to return random.choice([0, 1]). Additionally, any stray line breaks or indentation inconsistencies in the code can be cleaned up for better readability.
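The difference between the two reward calls matters when reading the output further down. As a small illustration (the helper names `float_reward` and `binary_reward` are made up for this sketch and are not part of the notebook):

```python
import random

def float_reward() -> float:
    # random.random() returns a float in the half-open interval [0.0, 1.0)
    return random.random()

def binary_reward() -> int:
    # random.choice([0, 1]) returns exactly 0 or exactly 1
    return random.choice([0, 1])

random.seed(0)  # seeded only so the sketch is repeatable
float_samples = [float_reward() for _ in range(1000)]
binary_samples = [binary_reward() for _ in range(1000)]

print(all(0.0 <= r < 1.0 for r in float_samples))  # True
print(set(binary_samples) <= {0, 1})                # True
```

Both variants have an expected value of 0.5 per step, so either one works as a dummy reward; they differ only in whether the totals come out as floats or as whole numbers.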

In [1]:
import random
from typing import List

class SampleEnvironment:
    def __init__(self):
        self.steps_left = 20

    def get_observation(self) -> List[float]:
        return [0.0, 0.0, 0.0]

    def get_actions(self) -> List[int]:
        return [0, 1]

    def is_done(self) -> bool:
        return self.steps_left == 0

    def action(self, action: int) -> float:
        if self.is_done():
            raise Exception("Game is over")
        self.steps_left -= 1
        return random.random()

The Agent class is simple and includes only two methods: the constructor and a method that performs one step in the environment.

Initially, the constructor sets the total reward collected to zero.

The step function accepts an environment instance as an argument and allows the agent to perform the following actions:

* Observe the environment
* Make a decision about the action to take (in this dummy agent the choice is random and does not actually use the observation)
* Submit the action to the environment
* Get the reward for the current step

In [2]:
random.choice([0,1])

0

In [3]:
class Agent:
    def __init__(self):
        self.total_reward = 0.0

    def step(self, env: SampleEnvironment):
        current_obs = env.get_observation()
        print("Observation {}".format(current_obs))
        actions = env.get_actions()
        print(actions)
        reward = env.action(random.choice(actions))
        self.total_reward += reward
        print("Total Reward {}".format(self.total_reward))

In [4]:
if __name__ == "__main__":
    env = SampleEnvironment()
    agent = Agent()
    i = 0

    while not env.is_done():
        i += 1
        print("Steps {}".format(i))
        agent.step(env)

    print("Total reward got: %.4f" % agent.total_reward)

Steps 1
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 0.7871281417402183
Steps 2
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 0.8556236780426499
Steps 3
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 1.8511079531670174
Steps 4
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 1.9353772791736956
Steps 5
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 2.784935841401925
Steps 6
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 3.5213220934236906
Steps 7
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 4.1846044640479
Steps 8
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 4.350056926015854
Steps 9
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 4.909398134970374
Steps 10
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 5.66861173350634
Steps 11
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 6.27447420231237
Steps 12
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 6.347758195551178
Steps 13
Observation [0.0, 0.0, 0.0]
[0, 1]
Total Reward 6.40738997642766
Steps 14
Observation [0.0, 0.0, 0.

**Take note! With the current code, this is the result:**

Here's a breakdown of the results:

* Observation: The environment's observation is a fixed zero vector [0.0, 0.0, 0.0].

* Available Actions: The environment provides two available actions, [0, 1].

* Agent's Steps: In each step, the agent:

1. Obtains the current observation.
2. Prints the available actions.
3. Chooses a random action from the available actions.
4. Performs the chosen action in the environment.
5. Updates the agent's total_reward with the obtained reward.

* Total Reward: The total reward obtained by the agent after 20 steps is printed as 10.4146.

The results show the agent interacting with the environment, making random decisions, and accumulating rewards according to the dummy environment's logic. This demonstrates the basic structure of a reinforcement learning loop, in which an agent observes, acts, and receives rewards. Note that this agent does not yet learn anything from the rewards; it only adds them up.
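Because each reward is a random draw, the total from a single episode is noisy. A steadier picture comes from averaging over many episodes. The following is a minimal sketch under stated assumptions: `run_episode` is a hypothetical helper, the environment is a compact copy of the one above, and printing inside the loop is omitted.

```python
import random
from statistics import mean

class SampleEnvironment:
    """Compact copy of the notebook's dummy environment."""
    def __init__(self, steps: int = 20):
        self.steps_left = steps

    def get_actions(self):
        return [0, 1]

    def is_done(self) -> bool:
        return self.steps_left == 0

    def action(self, action: int) -> float:
        if self.is_done():
            raise Exception("Game is over")
        self.steps_left -= 1
        return random.random()  # the reward ignores the action

def run_episode() -> float:
    """Play one episode with random actions and return the total reward."""
    env = SampleEnvironment()
    total = 0.0
    while not env.is_done():
        total += env.action(random.choice(env.get_actions()))
    return total

random.seed(0)  # seeded only so the sketch is repeatable
totals = [run_episode() for _ in range(2000)]
print("Mean total reward over 2000 episodes: {:.2f}".format(mean(totals)))
```

The mean should land close to 10, since each of the 20 steps has an expected reward of 0.5. The 10.4146 reported above is one draw around that value.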



***To improve the code, we can consider several enhancements and modifications. Here are a few suggestions:***

These changes introduce a new environment class, **CustomEnvironment**, with an additional method **custom_action** that provides rewards based on a combination of a base reward and a bonus reward depending on the chosen action. The agent class, **ModifiedAgent**, is modified to consider these changes in the decision-making process. The main program is updated to use the new environment and agent classes.



***New Environment Class: CustomEnvironment***


* a) Create a new environment class called CustomEnvironment.


* b) Modify the __init__ method to include a parameter for the initial number of steps, which can vary for different instances of the environment.


* c) Implement a new method called custom_action that takes an action as input and returns a reward. This time, the reward should be a combination of a base reward and a bonus reward based on the action. For example, if the action is 0, the bonus reward could be 0.2, and if the action is 1, the bonus reward could be 0.5.

**In this version of the CustomEnvironment class:**

* The **__init__** method now accepts an additional parameter **initial_steps** to set the initial number of steps.

* The **custom_action** method has been implemented as requested, providing a reward based on a combination of a base reward and a bonus reward depending on the chosen action.

Use this **CustomEnvironment** class as a replacement for the **SampleEnvironment** class in your code.


In [5]:
import random
from typing import List

class CustomEnvironment:
    def __init__(self, initial_steps: int):
        self.steps_left = initial_steps

    def get_observation(self) -> List[float]:
        # Implement a dynamic observation (e.g., a random vector)
        return [random.random() for _ in range(3)]

    def get_actions(self) -> List[int]:
        return [0, 1]

    def is_done(self) -> bool:
        return self.steps_left == 0

    def custom_action(self, agent_action: int) -> float:
        if self.is_done():
            raise Exception("Game is over")
        self.steps_left -= 1

        # Implement a reward mechanism based on the chosen action
        base_reward = random.uniform(0.0, 1.0)
        bonus_reward = 0.2 if agent_action == 0 else 0.5

        return base_reward + bonus_reward


**Agent Class Modification: ModifiedAgent**

* a) Create a new agent class called ModifiedAgent.

* b) Modify the step method to incorporate the changes in the CustomEnvironment. Instead of calling random.choice(actions) directly, the agent should now pick an action through a dedicated decision method. In the version below that decision depends on the observation, because the environment does not expose its bonus rewards to the agent.

In this version of the **ModifiedAgent** class:

* The step method now accepts an instance of **CustomEnvironment** instead of **SampleEnvironment**.

* The **choose_action** method has been added to allow the agent to make decisions based on both available actions and observations from the environment.

* The example decision-making strategy uses only the first observation value; it does not inspect the bonus rewards. You can customize this strategy according to your requirements.

Use this **ModifiedAgent** class as a replacement for the **Agent** class in your code.

In [6]:
class ModifiedAgent:
    def __init__(self):
        self.total_reward = 0.0

    def step(self, env: CustomEnvironment):
        current_obs = env.get_observation()
        print("Observation {}".format(current_obs))
        actions = env.get_actions()

        # Delegate the decision to choose_action, which looks at the observation
        chosen_action = self.choose_action(actions, current_obs)

        reward = env.custom_action(chosen_action)
        self.total_reward += reward
        print("Chosen Action: {}, Total Reward {:.4f}".format(chosen_action, self.total_reward))

    def choose_action(self, actions: List[int], observation: List[float]) -> int:
        # Example strategy: choose action 0 if the first observation value
        # is less than 0.5, otherwise choose action 1. The actions list and
        # the bonus rewards are not used by this rule.
        return 0 if observation[0] < 0.5 else 1


**Main Program Modification:**

* a) Change the instantiation of the environment to use the new CustomEnvironment class.

* b) Update the agent instantiation to use the ModifiedAgent class.

In this version:

* The instantiation of the environment is changed to use the new **CustomEnvironment** class (**custom_env = CustomEnvironment(initial_steps=20)**).

* The agent instantiation is updated to use the **ModifiedAgent** class (**modified_agent = ModifiedAgent()**).

In [7]:
if __name__ == "__main__":
    custom_env = CustomEnvironment(initial_steps=20)
    modified_agent = ModifiedAgent()
    i = 0

    while not custom_env.is_done():
        i += 1
        print("Steps {}".format(i))
        modified_agent.step(custom_env)

    print("Total reward obtained: {:.4f}".format(modified_agent.total_reward))


Steps 1
Observation [0.5974081739106338, 0.13791546300113056, 0.55652214647664]
Chosen Action: 1, Total Reward 1.2911
Steps 2
Observation [0.5767904865520905, 0.6060165558667303, 0.8768420191085013]
Chosen Action: 1, Total Reward 2.7610
Steps 3
Observation [0.69334693284241, 0.7494663759537192, 0.1522173046974986]
Chosen Action: 1, Total Reward 3.9129
Steps 4
Observation [0.3813915525391739, 0.33007615796303, 0.9240473224669049]
Chosen Action: 0, Total Reward 4.2024
Steps 5
Observation [0.9222741532563912, 0.8546009067256808, 0.600385354881331]
Chosen Action: 1, Total Reward 5.1156
Steps 6
Observation [0.9372295368410759, 0.7272092603491358, 0.6440083797970427]
Chosen Action: 1, Total Reward 5.7128
Steps 7
Observation [0.6015965877733098, 0.19180200821671411, 0.13190095438946214]
Chosen Action: 1, Total Reward 6.8903
Steps 8
Observation [0.18012001515658282, 0.36402130878556493, 0.4510674754512063]
Chosen Action: 0, Total Reward 7.5530
Steps 9
Observation [0.48245605619107423, 0.284290

The provided output represents the steps and results obtained during the execution of the modified code. Let's break down the key information:

* **Steps 1 to 20:** Each step represents an interaction between the agent and the environment.
* **Observation:** The environment generates a dynamic observation at each step, represented as a list of three random values.
* **Chosen Action:** The agent makes decisions based on the custom decision-making strategy implemented in the ModifiedAgent class. In this case, the agent chooses between actions 0 and 1 based on the first value of the observation. If the first observation value is less than 0.5, the agent chooses action 0; otherwise, it chooses action 1.
* **Total Reward:** The total reward obtained by the agent is accumulated at each step. The reward is calculated based on the custom_action method in the CustomEnvironment class, which combines a base reward and a bonus reward based on the chosen action.
* **Total reward obtained:** The final total reward obtained by the agent after 20 steps is printed as **19.5162**.

The output demonstrates the interaction between the modified agent and the custom environment, where the agent's decision-making is influenced by the observations provided by the environment. The dynamic nature of the observations and the bonus rewards contribute to the variability in the agent's actions and the total reward obtained.
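As a rough sanity check on the reported total, the expected reward can be worked out from the code above. This assumes the first observation value and the base reward are independent uniform draws on [0, 1], which is what random.random() and random.uniform(0.0, 1.0) provide, so the base reward averages 0.5 and each action is chosen about half the time:

$$
\mathbb{E}[r \mid a = 0] = 0.5 + 0.2 = 0.7, \qquad \mathbb{E}[r \mid a = 1] = 0.5 + 0.5 = 1.0
$$

$$
\mathbb{E}[r] = 0.5 \cdot 0.7 + 0.5 \cdot 1.0 = 0.85, \qquad \mathbb{E}\left[\sum_{t=1}^{20} r_t\right] = 20 \cdot 0.85 = 17
$$

The reported 19.5162 is above this average, which is plausible for a single 20-step episode but should not be read as the typical result.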

# Conclusion

The point of this exercise is to demonstrate the process of making improvements to a basic reinforcement learning setup. By introducing a slightly richer environment (**CustomEnvironment**) and modifying the agent's decision-making strategy (**ModifiedAgent**), we take a first step toward an agent whose behavior depends on what it observes.

***Comparison with the Original Result:***

* **Dynamic Environment:** In the modified version, the environment provides a different observation at every step, instead of the fixed observation [0.0, 0.0, 0.0] in the original version. This gives the agent varying inputs to react to, although in this example the observations are random and unrelated to the rewards.

* **Custom Action and Bonus Rewards:** The modified environment introduces a custom_action method that combines a random base reward with a bonus that depends on the chosen action. Unlike the original version, the reward now depends on what the agent does, so there is something an agent could learn: action 1 carries the larger bonus.

* **Modified Decision-Making:** The modified agent class (ModifiedAgent) incorporates observations into the decision-making process, allowing the agent to adapt its actions based on the dynamic environment. In contrast, the original agent class (Agent) made random decisions without considering the environment's state.

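* **Learning Strategy:** In the modified version, the agent's decision to choose action 0 or 1 is determined by the first value of the observation. This is a fixed, hand-written rule rather than learning: the agent maps observations to actions but never updates that mapping from the rewards it receives. Since action 1 always earns the larger bonus, always choosing action 1 would yield a higher expected reward (1.0 per step) than this rule (0.85 per step).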
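To make the comparison above concrete, one could measure the average total reward of several fixed policies on the custom environment. The sketch below is an illustration under stated assumptions: `evaluate`, `threshold_policy`, `always_one`, and `random_policy` are hypothetical helpers that are not part of the notebook, and the environment is a compact copy of CustomEnvironment.

```python
import random
from statistics import mean
from typing import Callable, List

class CustomEnvironment:
    """Compact copy of the notebook's custom environment."""
    def __init__(self, initial_steps: int):
        self.steps_left = initial_steps

    def get_observation(self) -> List[float]:
        return [random.random() for _ in range(3)]

    def is_done(self) -> bool:
        return self.steps_left == 0

    def custom_action(self, agent_action: int) -> float:
        if self.is_done():
            raise Exception("Game is over")
        self.steps_left -= 1
        bonus_reward = 0.2 if agent_action == 0 else 0.5
        return random.uniform(0.0, 1.0) + bonus_reward

Policy = Callable[[List[float]], int]

def threshold_policy(obs: List[float]) -> int:
    # Same rule as ModifiedAgent.choose_action
    return 0 if obs[0] < 0.5 else 1

def always_one(obs: List[float]) -> int:
    # Ignores the observation and always takes the higher-bonus action
    return 1

def random_policy(obs: List[float]) -> int:
    # Same behaviour as the original Agent
    return random.choice([0, 1])

def evaluate(policy: Policy, episodes: int = 2000, steps: int = 20) -> float:
    """Average total reward of a policy over several episodes."""
    totals = []
    for _ in range(episodes):
        env = CustomEnvironment(initial_steps=steps)
        total = 0.0
        while not env.is_done():
            total += env.custom_action(policy(env.get_observation()))
        totals.append(total)
    return mean(totals)

random.seed(0)  # seeded only so the sketch is repeatable
results = {p.__name__: evaluate(p)
           for p in (threshold_policy, always_one, random_policy)}
for name, value in results.items():
    print("{:<18} {:.2f}".format(name, value))
```

With these reward definitions, `always_one` should average about 20 per episode, while the threshold rule and the random policy should both average about 17. The two are indistinguishable because the observation carries no information about the reward.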

***Potential Areas for Further Improvement:***

* **Advanced Learning Algorithms:** Consider implementing more advanced learning algorithms such as Q-learning or deep reinforcement learning techniques to enhance the agent's learning capabilities.

* **Tuning Parameters:** Experiment with different parameter values, such as bonus reward magnitudes, learning rates, and exploration-exploitation trade-offs, to fine-tune the agent's behavior.

* **Complex Observations:** Introduce more complex observations to challenge the agent and require it to learn more intricate strategies.

* **Evaluation Metrics:** Define and use appropriate evaluation metrics to assess the performance of the agent over multiple episodes or environments.
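As one possible starting point for the first two suggestions, the sketch below shows an epsilon-greedy agent that keeps a running average of the reward seen for each action. This is a simple bandit-style method, not Q-learning or deep reinforcement learning, and `EpsilonGreedyAgent` and its `epsilon` parameter are hypothetical names introduced here. The environment is a compact copy of CustomEnvironment.

```python
import random
from typing import Dict, List

class CustomEnvironment:
    """Compact copy of the notebook's custom environment."""
    def __init__(self, initial_steps: int):
        self.steps_left = initial_steps

    def is_done(self) -> bool:
        return self.steps_left == 0

    def custom_action(self, agent_action: int) -> float:
        if self.is_done():
            raise Exception("Game is over")
        self.steps_left -= 1
        bonus_reward = 0.2 if agent_action == 0 else 0.5
        return random.uniform(0.0, 1.0) + bonus_reward

class EpsilonGreedyAgent:
    """Hypothetical agent that estimates the average reward of each action."""
    def __init__(self, actions: List[int], epsilon: float = 0.1):
        self.epsilon = epsilon
        self.value: Dict[int, float] = {a: 0.0 for a in actions}
        self.count: Dict[int, int] = {a: 0 for a in actions}
        self.total_reward = 0.0

    def choose_action(self) -> int:
        # Explore with probability epsilon, otherwise exploit the best estimate
        if random.random() < self.epsilon:
            return random.choice(list(self.value))
        return max(self.value, key=self.value.get)

    def step(self, env: CustomEnvironment) -> None:
        action = self.choose_action()
        reward = env.custom_action(action)
        self.count[action] += 1
        # Incremental mean: new = old + (reward - old) / n
        self.value[action] += (reward - self.value[action]) / self.count[action]
        self.total_reward += reward

random.seed(0)  # seeded only so the sketch is repeatable
agent = EpsilonGreedyAgent(actions=[0, 1], epsilon=0.1)
for _ in range(200):  # the estimates persist across episodes
    env = CustomEnvironment(initial_steps=20)
    while not env.is_done():
        agent.step(env)

print({a: round(v, 2) for a, v in agent.value.items()})
```

After enough steps the estimates should approach 0.7 for action 0 and 1.0 for action 1, at which point the agent mostly picks action 1. The agent ignores the observation, which is reasonable here only because the observation does not affect the reward. The value of `epsilon` is the exploration-exploitation knob mentioned above.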

Overall, the exercise provides a foundation for understanding how modifications to the environment and agent can impact the learning process in reinforcement learning. The improvements aim to make the learning environment more realistic and the agent more adaptive to different situations.