# a) Problem formulation

#### State space $\mathcal{S}$

We will consider the following 4-dimensional state space:

$\mathcal{S} = \{(x, \theta, \dot x, \dot \theta)\} \cup \{end\}$

The state $end$ is a terminal state, reached when $|x| > 2.4$, when $|\theta| > 12.5$, or when the episode has ended.

#### Action space $\mathcal{A}$

An action is a force of fixed magnitude $N$ applied to either side of the cart, so the action space has two elements:

$\mathcal{A} = \{left, right\}$


#### Transition dynamics $\mathcal{P}$

In this problem, the transition probabilities are determined by the equations of physics, and the observation of each state is corrupted by uniform random noise of $\pm 0.05$ on each state component. Once the agent is in the terminal $end$ state, it stays there.


#### Rewards $\mathcal{R}$

At every time step, the agent receives $r_t=+1$ if $|x| < 2.4$ and $|\theta| < 12.5$, and $r_t=0$ otherwise.
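The reward rule above can be sketched in a few lines of Python. This is only an illustration: the names `X_LIMIT`, `THETA_LIMIT`, `in_bounds` and `reward` are hypothetical and not taken from the notebook, and the bounds are the values quoted in the text, in the same units as the text.

```python
# Bounds quoted in the text (names are hypothetical, chosen for this sketch).
X_LIMIT = 2.4       # bound on the cart position x
THETA_LIMIT = 12.5  # bound on the pole angle theta, in the units used in the text


def in_bounds(x, theta):
    """Return True while both x and theta are strictly inside their bounds."""
    return abs(x) < X_LIMIT and abs(theta) < THETA_LIMIT


def reward(x, theta):
    """Return +1 while the cart and pole are in bounds, and 0 otherwise."""
    return 1 if in_bounds(x, theta) else 0
```

Here the velocities $\dot x$ and $\dot \theta$ play no role, since the reward in the text depends only on $x$ and $\theta$.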

#### DQN 
In a classic tabular RL approach such as Q-learning, we would need to discretize the state space into a finite number of buckets so that each state can be visited infinitely often and its Q-value updated. Here the state space is continuous, so it is impossible to visit every state; we therefore use a deep neural network to approximate the Q-function.
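For reference, the two approaches can be written as follows. These are the standard textbook forms, not taken from the notebook's code. The learning rate $\alpha$, the discount factor $\gamma$, the network weights $w$ and the target-network weights $w^-$ are notation introduced here; we write $w$ rather than $\theta$ to avoid a clash with the pole angle.

Tabular Q-learning updates one table entry per transition $(s_t, a_t, r_t, s_{t+1})$:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

DQN instead adjusts the weights $w$ to reduce the squared error between the prediction and a target computed with the target network:

$$L(w) = \left( r_t + \gamma \max_{a'} Q(s_{t+1}, a'; w^-) - Q(s_t, a_t; w) \right)^2$$

When $s_{t+1} = end$, the target reduces to $r_t$.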



# b) Outline of the code

**Main**

- env = gym.make('CartPole-v0'): creates the gym CartPole-v0 environment.

The $env$ object contains the description of the environment: the state space, the action space, the step method (which outputs the next state given an input action), and all the parameters that describe the problem.

- agent = DQNAgent(state_size, action_size)

Next we create an instance of a DQN agent, which will operate in the environment env. Its attributes are initialized in the constructor, and its behaviour is implemented by the methods listed below.

We fill the array $test\_states$ using a uniform random policy. This variable will be used as examples of the different states of the environment for the training of the model.

After this first loop, we run $n=EPISODES$ episodes. At each time step, until the end of the episode ($t > 200$ or the pole is no longer vertical), we:
* take an action based on the state
* add the resulting sample to the memory of the agent
* train the model

After every episode, we update the $target\_model$.
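The episode loop described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the notebook's code: `StubEnv` and `StubAgent` are hypothetical stand-ins that let the loop run without gym or a neural network, `run_episode` is a name chosen here, the `step` method is assumed to return a `(next_state, reward, done, info)` tuple, and the five-argument form of `append_sample` is an assumption.

```python
import random


class StubEnv:
    """Hypothetical stand-in for the gym environment: ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return (0.0, 0.0, 0.0, 0.0)

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        # Assumed return convention: (next_state, reward, done, info).
        return (0.0, 0.0, 0.0, 0.0), 1.0, done, {}


class StubAgent:
    """Hypothetical stand-in for DQNAgent that only records what is called."""

    def __init__(self):
        self.samples = []
        self.train_calls = 0
        self.target_updates = 0

    def get_action(self, state):
        return random.choice([0, 1])  # left or right

    def append_sample(self, state, action, reward, next_state, done):
        self.samples.append((state, action, reward, next_state, done))

    def train_model(self):
        self.train_calls += 1

    def update_target_model(self):
        self.target_updates += 1


def run_episode(env, agent, max_steps=200):
    """Run one episode and return its score (the sum of rewards)."""
    state = env.reset()
    score = 0.0
    for _ in range(max_steps):
        action = agent.get_action(state)                # take an action
        next_state, reward, done, _ = env.step(action)
        agent.append_sample(state, action, reward, next_state, done)
        agent.train_model()                             # train at every step
        score += reward
        state = next_state
        if done:
            break
    agent.update_target_model()                         # once per episode
    return score
```

Calling `run_episode(StubEnv(), StubAgent())` runs the five steps of the stub environment, trains once per step, and updates the target model once at the end.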

**Methods of DQNAgent**

* $\_\_init\_\_(self, state\_size, action\_size)$: constructor of the class. Takes as parameters the input and output sizes of the NN and creates the agent, which consists of:
   * _memory_: deque that stores the samples from the environment.
   * _model_: NN used to predict the actions, updated at every time step.
   * _target_model_: NN used to predict the actions, updated at the end of each episode with the weights of _model_.


* $build\_model(self)$: returns the NN, with the loss and optimizer defined inside the function.

* $update\_target\_model(self)$: copies the weights from _model_ to _target_model_.

* $get\_action(self, state)$: based on the current state, gets an action from the model using an epsilon-greedy policy.

* $append\_sample(self, state)$: adds a new sample to _memory_.

* $train\_model(self)$: trains the _model_:
    * takes a batch from _memory_;
    * makes predictions with _model_ and _target_model_;
    * ....;
    * calls the method _fit_ to train the model.

* $plot\_data(self, episodes, scores, max\_q\_mean)$: plots the score per episode as well as the maximum q value per episode, averaged over precollected states.
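To make the roles of _memory_, $get\_action$ and $train\_model$ more concrete, here is a framework-free sketch. It is not the notebook's implementation and rests on several assumptions: `AgentSketch`, `q_fn`, `target_q_fn` and `build_targets` are hypothetical names, the two callables stand in for the predictions of _model_ and _target_model_, a sample is assumed to be a `(state, action, reward, next_state, done)` tuple, and the target rule is the standard DQN one, which may differ from the step elided in the list above. The call to _fit_ is left out because it depends on the NN library.

```python
import random
from collections import deque


class AgentSketch:
    """Hypothetical outline of the agent's bookkeeping, without any NN library."""

    def __init__(self, action_size, q_fn, target_q_fn,
                 epsilon=0.1, gamma=0.99, memory_size=2000):
        self.action_size = action_size
        self.q_fn = q_fn                # stands in for the predictions of model
        self.target_q_fn = target_q_fn  # stands in for those of target_model
        self.epsilon = epsilon          # exploration probability (assumed value)
        self.gamma = gamma              # discount factor (assumed value)
        self.memory = deque(maxlen=memory_size)

    def get_action(self, state):
        """Epsilon-greedy: random action with probability epsilon, else argmax Q."""
        if random.random() < self.epsilon:
            return random.randrange(self.action_size)
        q_values = self.q_fn(state)
        return max(range(self.action_size), key=lambda a: q_values[a])

    def append_sample(self, state, action, reward, next_state, done):
        """Store one transition; the oldest one is dropped when memory is full."""
        self.memory.append((state, action, reward, next_state, done))

    def build_targets(self, batch_size):
        """Sample a mini-batch and return (state, target Q-vector) pairs."""
        size = min(batch_size, len(self.memory))
        batch = random.sample(list(self.memory), size)
        pairs = []
        for state, action, reward, next_state, done in batch:
            target = list(self.q_fn(state))  # keep Q-values of untaken actions
            if done:
                target[action] = reward      # no bootstrapping from a terminal state
            else:
                target[action] = reward + self.gamma * max(self.target_q_fn(next_state))
            pairs.append((state, target))
        return pairs
```

In a full agent, the pairs returned by `build_targets` would be passed as inputs and targets to the training call, and copying the weights of _model_ into _target_model_ would refresh what `target_q_fn` represents.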

# c) Pseudo-code for DQN

```python
from IPython.display import Image

# Display the annotated DQN pseudo-code figure.
Image(url="pseudo_code_with_code_line.png")
```