# Cartpole-v0 
## Method 1: Deep Learning with TensorFlow
(https://pythonprogramming.net/openai-cartpole-neural-network-example-machine-learning-tutorial/)

### Packages to import

In [1]:
# the game environment
import gym
# so the agent can act randomly at first
import random
import numpy as np

import tflearn
# to build the neural network
from tflearn.layers.core import input_data, dropout, fully_connected
# for the final (training) layer
from tflearn.layers.estimator import regression


# to summarise the scores of the random games
from statistics import mean, median
# to count how often each accepted score occurs
from collections import Counter

In [2]:
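# learning rate
LR = 1e-3

# create the game environment
env = gym.make("CartPole-v0")
env.reset()

# game parameters
goal_steps = 200          # maximum number of steps per game
score_requirement = 70    # minimum score for a random game to be kept as training data
initial_games = 10000     # number of random games to play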



In [None]:
# play a few games with random actions first

def some_random_games_first():
    # Each of these is its own game.
    for episode in range(25):
        env.reset()
        # this is each frame, up to 200... but random play won't make it that far.
        for t in range(200):
            # This will display the environment
            # Only display if you really want to see it.
            # Takes much longer to display it.
            env.render()
            
            # This will just create a sample action in any environment.
            # In this environment, the action can be 0 or 1, which is left or right
            action = env.action_space.sample()
            
            # this executes the environment with an action, 
            # and returns the observation of the environment, 
            # the reward, if the env is over, and other info.
            observation, reward, done, info = env.step(action)
            if done:
                break
                
some_random_games_first()

### OpenAI Gym documentation
Observations

If we ever want to do better than take random actions at each step, it’d probably be good to actually know what our actions are doing to the environment.

The environment’s step function returns exactly what we need. In fact, step returns four values. These are:

- `observation` (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
- `reward` (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
- `done` (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and `done` being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
- `info` (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.

This is just an implementation of the classic “agent-environment loop”. Each timestep, the agent chooses an action, and the environment returns an observation and a reward.
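
The loop described above can be sketched without Gym installed. The `ToyEnv` class below is a hypothetical stand-in (it is not part of `gym`) that only mimics the `reset()` / `step()` interface and the four return values; `run_episode` is likewise an illustrative helper name.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for a Gym environment: the episode ends after max_steps."""

    def __init__(self, max_steps=10):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # initial observation

    def step(self, action):
        self.t += 1
        observation = float(self.t)       # what the agent sees next
        reward = 1.0                      # one point per step survived
        done = self.t >= self.max_steps   # time to reset?
        info = {}                         # diagnostic information (unused here)
        return observation, reward, done, info


def run_episode(env, policy):
    """Agent-environment loop: choose an action, step, accumulate the reward."""
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward


total = run_episode(ToyEnv(max_steps=10), lambda obs: random.randrange(2))
print(total)
```

With the real CartPole environment the structure is the same; only the observation (four floats) and the termination condition differ.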

In [3]:
# generate the training data from random games

def initial_population():
    
    # data that we are interested in training on
    # (observation and move made)
    # moves are random BUT a game is only kept IF score >= score_requirement
    training_data = []
    scores = []
    accepted_scores = []
    
    # iterate through however many games we want:
    # "_" => For ignoring the specific values
    for _ in range(initial_games):
        score = 0
        # store all the moves of this game, because we don't know
        # until the game ends whether we beat the score threshold
        game_memory = []
        # previous observation that we saw
        prev_observation = []
        
        # THE GAME for each frame in 200 (each _ => one game)
        for _ in range(goal_steps):
            # choose random action (0 or 1)
            action = random.randrange(0,2)
            # do it!
            observation, reward, done, info = env.step(action)
            # (random.randrange(0, 2) above is equivalent to action = env.action_space.sample())
            
            # notice that the observation is returned FROM the action
            # so we'll store the previous observation here, pairing
            # the prev observation to the action we'll take.
            #observation generated after the action
            if len(prev_observation) > 0 :
                game_memory.append([prev_observation, action])
            prev_observation = observation
            score+=reward
            
            # why? once `done` is True the episode has ended, so stop stepping
            if done: break
        
        # STORING THE GOOD GAMES
        # IF our score is higher than our threshold, we'd like to save
        # every move we made
        # NOTE the reinforcement methodology here. 
        # all we're doing is reinforcing the score, we're not trying 
        # to influence the machine in any way as to HOW that score is 
        # reached.
        if score >= score_requirement:
            accepted_scores.append(score)
            for data in game_memory:
                # convert to one-hot (this is the output layer for our neural network)
                if data[1] == 1:
                    output = [0,1]
                elif data[1] == 0:
                    output = [1,0]
                    
                # saving our training data
                training_data.append([data[0], output])

        # reset env to play again
        env.reset()
        # save overall scores
        scores.append(score)
    
    # just in case you wanted to reference later
    training_data_save = np.array(training_data)
    np.save('saved.npy',training_data_save)
    
    # some stats here, to further illustrate the neural network magic!
    print('Average accepted score:',mean(accepted_scores))
    print('Median score for accepted scores:',median(accepted_scores))
    print(Counter(accepted_scores))
    
    return training_data
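
The filtering and one-hot steps of `initial_population` can be isolated in a small sketch. The helper names `one_hot` and `keep_good_games` and the sample games are made up for illustration; they are not part of the original notebook.

```python
def one_hot(action, n_actions=2):
    """Encode action 0 as [1, 0] and action 1 as [0, 1]."""
    vec = [0] * n_actions
    vec[action] = 1
    return vec


def keep_good_games(games, score_requirement):
    """Keep the (observation, action) pairs of games that reached the threshold.

    games: list of (score, [(observation, action), ...]) tuples.
    """
    training_data = []
    for score, memory in games:
        if score >= score_requirement:
            for observation, action in memory:
                training_data.append([observation, one_hot(action)])
    return training_data


# Made-up example: one game above the threshold, one below it.
games = [
    (80.0, [([0.1, 0.0, 0.0, 0.0], 1), ([0.2, 0.0, 0.0, 0.0], 0)]),
    (20.0, [([0.3, 0.0, 0.0, 0.0], 1)]),
]
data = keep_good_games(games, score_requirement=70)
print(len(data))   # number of training samples kept
print(data[0][1])  # one-hot label of the first kept move
```

Only the moves from the first game survive, so the network is trained purely on what happened in games that scored well, not on why they scored well.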

In [4]:
def neural_network_model(input_size):
    network = input_data(shape=[None, input_size, 1], name='input')
    #input size here is 4

    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)
    # keep rate = 0.8

    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 512, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)
    #5 layers
    
    #last layer, 2 are the moves we can make
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=LR, loss='categorical_crossentropy', name='targets')
    model = tflearn.DNN(network, tensorboard_dir='log')

    return model


def train_model(training_data, model=False):
    # if a previously built model is passed in, keep training it; otherwise build a new one
    
    X = np.array([i[0] for i in training_data]).reshape(-1,len(training_data[0][0]),1)
    y = [i[1] for i in training_data]

    if not model:
        model = neural_network_model(input_size = len(X[0]))
    
    model.fit({'input': X}, {'targets': y}, n_epoch=5, snapshot_step=500, show_metric=True, run_id='openai_learning')
    return model
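
The reshape in `train_model` turns the list of 4-value observations into the `[None, input_size, 1]` shape declared by `input_data`. A minimal NumPy-only sketch, using made-up observations in place of real training data:

```python
import numpy as np

# Made-up samples in the same [observation, one_hot_action] layout as training_data.
training_data = [
    [np.array([0.1, 0.2, 0.3, 0.4]), [1, 0]],
    [np.array([0.5, 0.6, 0.7, 0.8]), [0, 1]],
    [np.array([0.9, 1.0, 1.1, 1.2]), [0, 1]],
]

# Stack the observations, then add a trailing axis of size 1.
X = np.array([sample[0] for sample in training_data]).reshape(
    -1, len(training_data[0][0]), 1
)
y = np.array([sample[1] for sample in training_data])

print(X.shape)  # (3, 4, 1)
print(y.shape)  # (3, 2)
```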

In [5]:
training_data = initial_population()

Average accepted score: 82.55263157894737
Median score for accepted scores: 78.5
Counter({82.0: 7, 70.0: 7, 73.0: 6, 74.0: 5, 75.0: 4, 77.0: 4, 71.0: 4, 86.0: 3, 95.0: 3, 80.0: 3, 78.0: 3, 85.0: 3, 72.0: 3, 93.0: 2, 92.0: 2, 90.0: 2, 76.0: 2, 105.0: 2, 79.0: 2, 108.0: 2, 99.0: 1, 87.0: 1, 83.0: 1, 89.0: 1, 101.0: 1, 160.0: 1, 117.0: 1})


In [6]:
model = train_model(training_data)

Training Step: 484  | total loss: 0.65698 | time: 1.108s
| Adam | epoch: 005 | loss: 0.65698 - acc: 0.6334 -- iter: 6144/6198
Training Step: 485  | total loss: 0.66130 | time: 1.118s
| Adam | epoch: 005 | loss: 0.66130 - acc: 0.6216 -- iter: 6198/6198
--


In [7]:
scores = []
choices = []
for each_game in range(10):
    score = 0
    game_memory = []
    prev_obs = []
    env.reset()
    for _ in range(goal_steps):
        env.render()

        if len(prev_obs)==0:
            action = random.randrange(0,2)
        else:
            action = np.argmax(model.predict(prev_obs.reshape(-1,len(prev_obs),1))[0])

        choices.append(action)
                
        new_observation, reward, done, info = env.step(action)
        prev_obs = new_observation
        game_memory.append([new_observation, action])
        score+=reward
        if done: break

    scores.append(score)

print('Average Score:',sum(scores)/len(scores))
print('choice 1:{}  choice 0:{}'.format(choices.count(1)/len(choices),choices.count(0)/len(choices)))
print(score_requirement)

Average Score: 200.0
choice 1:0.5005  choice 0:0.4995
70
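
The action selection used in the evaluation loop (reshape the observation, predict, take the `argmax`) can be tried without a trained network. In this sketch `fake_predict` is a hypothetical stand-in for `model.predict`: it simply favours pushing right when the pole angle is positive. It is an assumption for illustration, not the behaviour of the trained model.

```python
import numpy as np


def fake_predict(batch):
    """Hypothetical stand-in for model.predict: one [p_left, p_right] row per input."""
    rows = []
    for obs in batch:                 # obs has shape (4, 1)
        angle = obs[2, 0]             # pole angle is the third observation value
        p_right = 1.0 / (1.0 + np.exp(-10.0 * angle))
        rows.append([1.0 - p_right, p_right])
    return np.array(rows)


def choose_action(observation, predict):
    """Pick the action whose predicted probability is highest."""
    batch = observation.reshape(-1, len(observation), 1)  # shape (1, 4, 1)
    return int(np.argmax(predict(batch)[0]))


leaning_right = np.array([0.0, 0.0, 0.1, 0.0])
leaning_left = np.array([0.0, 0.0, -0.1, 0.0])
print(choose_action(leaning_right, fake_predict))
print(choose_action(leaning_left, fake_predict))
```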


### Summary of method 1

In [17]:
Xbis = np.array([i[0] for i in training_data]).reshape(-1,len(training_data[0][0]),1)

In [18]:
Xbis

array([[[-0.00479897],
        [-0.21833818],
        [-0.01911065],
        [ 0.25347615]],

       [[-0.00916573],
        [-0.02294864],
        [-0.01404113],
        [-0.04517283]],

       [[-0.0096247 ],
        [-0.21786646],
        [-0.01494459],
        [ 0.24304712]],

       ...,

       [[-0.32963446],
        [-1.30115486],
        [ 0.14115801],
        [ 1.354832  ]],

       [[-0.35565755],
        [-1.49773778],
        [ 0.16825465],
        [ 1.68813626]],

       [[-0.38561231],
        [-1.30491364],
        [ 0.20201738],
        [ 1.45221828]]])

In [19]:
ybis = [i[1] for i in training_data]

In [20]:
ybis

[[0, 1],
 [1, 0],
 [1, 0],
 [1, 0],
 [0, 1],
 [1, 0],
 [0, 1],
 [1, 0],
 [0, 1],
 [0, 1],
 [0, 1],
 [0, 1],
 [0, 1],
 [0, 1],
 [1, 0],
 [0, 1],
 [1, 0],
 [0, 1],
 [1, 0],
 [1, 0],
 [1, 0],
 [0, 1],
 [1, 0],
 [1, 0],
 [1, 0],
 [0, 1],
 [0, 1],
 [0, 1],
 [0, 1],
 [1, 0],
 [1, 0],
 [0, 1],
 [0, 1],
 [0, 1],
 [0, 1],
 [1, 0],
 [1, 0],
 [1, 0],
 [0, 1],
 [1, 0],
 [0, 1],
 [1, 0],
 [1, 0],
 [1, 0],
 [1, 0],
 [0, 1],
 [1, 0],
 [0, 1],
 [1, 0],
 [0, 1],
 [0, 1],
 [0, 1],
 [1, 0],
 [1, 0],
 [0, 1],
 [0, 1],
 [0, 1],
 [1, 0],
 [1, 0],
 [0, 1],
 [0, 1],
 [1, 0],
 [0, 1],
 [1, 0],
 [1, 0],
 [0, 1],
 [1, 0],
 [1, 0],
 [0, 1],
 [0, 1],
 [0, 1],
 [1, 0],
 [1, 0],
 [0, 1],
 [1, 0],
 [1, 0],
 [1, 0],
 [1, 0],
 [0, 1],
 [1, 0],
 [0, 1],
 [1, 0],
 [0, 1],
 [1, 0],
 [0, 1],
 [1, 0],
 [0, 1],
 [0, 1],
 [0, 1],
 [0, 1],
 [1, 0],
 [0, 1],
 [0, 1],
 [1, 0],
 [0, 1],
 [1, 0],
 [0, 1],
 [0, 1],
 [0, 1],
 [0, 1],
 [0, 1],
 [1, 0],
 [0, 1],
 [0, 1],
 [0, 1],
 [1, 0],
 [0, 1],
 [0, 1],
 [1, 0],
 [1, 0],
 [0, 1],
 

## Method 2: Q-learning

In [10]:
import math
from collections import deque

class QCartPoleSolver():
    def __init__(self, buckets=(1, 1, 6, 12,), n_episodes=1000, n_win_ticks=195, min_alpha=0.1, min_epsilon=0.1, gamma=1.0, ada_divisor=25, max_env_steps=None, quiet=False, monitor=False):
        
        self.buckets = buckets # down-scaling feature space to discrete range
        
        self.n_episodes = n_episodes # training episodes 
        
        self.n_win_ticks = n_win_ticks # average ticks over 100 episodes required for win
        
        self.min_alpha = min_alpha # lower bound of the (decaying) learning rate
        # Alpha (learning rate): determines how much the agent values newly acquired
        # information over what it has already learned. If alpha were 0 the agent would learn nothing new;
        # at a value of 1 it would only make decisions based on the most recent data.
        
        self.min_epsilon = min_epsilon # lower bound of the (decaying) exploration rate
        # Epsilon (exploration rate): to avoid getting stuck in a local optimum we make the agent explore.
        # Here, this means choosing a random action with probability epsilon (0 < epsilon < 1);
        # see choose_action below.
        
        self.gamma = gamma # discount factor
        # Gamma (discount factor): determines how strongly rewards arriving sooner are preferred over
        # later ones. A discount factor of 0 would only consider the next reward, while 1 gives all
        # future rewards equal weight. The goal is to keep the pole upright as long as possible,
        # so we weight all future rewards equally.
        
        self.ada_divisor = ada_divisor # controls how fast alpha and epsilon decay (only for development purposes)
        
        self.quiet = quiet

        self.env = gym.make('CartPole-v0')
        if max_env_steps is not None: self.env._max_episode_steps = max_env_steps
        if monitor: self.env = gym.wrappers.Monitor(self.env, 'tmp/cartpole-1', force=True) # record results for upload

        self.Q = np.zeros(self.buckets + (self.env.action_space.n,))

    def discretize(self, obs):
        upper_bounds = [self.env.observation_space.high[0], 0.5, self.env.observation_space.high[2], math.radians(50)]
        lower_bounds = [self.env.observation_space.low[0], -0.5, self.env.observation_space.low[2], -math.radians(50)]
        ratios = [(obs[i] + abs(lower_bounds[i])) / (upper_bounds[i] - lower_bounds[i]) for i in range(len(obs))]
        new_obs = [int(round((self.buckets[i] - 1) * ratios[i])) for i in range(len(obs))]
        new_obs = [min(self.buckets[i] - 1, max(0, new_obs[i])) for i in range(len(obs))]
        return tuple(new_obs)

    def choose_action(self, state, epsilon):
        return self.env.action_space.sample() if (np.random.random() <= epsilon) else np.argmax(self.Q[state])

    def update_q(self, state_old, action, reward, state_new, alpha):
        self.Q[state_old][action] += alpha * (reward + self.gamma * np.max(self.Q[state_new]) - self.Q[state_old][action])

    def get_epsilon(self, t):
        return max(self.min_epsilon, min(1, 1.0 - math.log10((t + 1) / self.ada_divisor)))

    def get_alpha(self, t):
        return max(self.min_alpha, min(1.0, 1.0 - math.log10((t + 1) / self.ada_divisor)))

    def run(self):
        scores = deque(maxlen=100)

        for e in range(self.n_episodes):
            current_state = self.discretize(self.env.reset())

            alpha = self.get_alpha(e)
            epsilon = self.get_epsilon(e)
            done = False
            i = 0

            while not done:
                # self.env.render()
                action = self.choose_action(current_state, epsilon)
                obs, reward, done, _ = self.env.step(action)
                new_state = self.discretize(obs)
                self.update_q(current_state, action, reward, new_state, alpha)
                current_state = new_state
                i += 1

            scores.append(i)
            mean_score = np.mean(scores)
            if mean_score >= self.n_win_ticks and e >= 100:
                if not self.quiet: print('Ran {} episodes. Solved after {} trials ✔'.format(e, e - 100))
                return e - 100
            if e % 100 == 0 and not self.quiet:
                print('[Episode {}] - Mean survival time over last 100 episodes was {} ticks.'.format(e, mean_score))

        if not self.quiet: print('Did not solve after {} episodes 😞'.format(e))
        return e

if __name__ == "__main__":
    solver = QCartPoleSolver()
    solver.run()
    # gym.upload('tmp/cartpole-1', api_key='')

[Episode 0] - Mean survival time over last 100 episodes was 19.0 ticks.
[Episode 100] - Mean survival time over last 100 episodes was 38.83 ticks.
[Episode 200] - Mean survival time over last 100 episodes was 164.99 ticks.
Ran 249 episodes. Solved after 149 trials ✔
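
Two ideas in the solver above are easier to follow in isolation: bucketing the continuous observation, and the logarithmic decay shared by `get_alpha` and `get_epsilon`. The sketch below rewrites both as standalone functions. The position and angle bounds (4.8 and 24 degrees) are assumed values standing in for `env.observation_space.high` / `low`, and `decayed_rate` is an illustrative name. It uses `obs - lower_bound`, which matches the solver's `obs + abs(lower_bound)` whenever the lower bound is negative.

```python
import math

BUCKETS = (1, 1, 6, 12)
# Assumed bounds: cart position, cart velocity, pole angle, pole angular velocity.
UPPER = [4.8, 0.5, math.radians(24), math.radians(50)]
LOWER = [-4.8, -0.5, -math.radians(24), -math.radians(50)]


def discretize(obs, lower=LOWER, upper=UPPER, buckets=BUCKETS):
    """Map each continuous value to a bucket index in [0, buckets[i] - 1]."""
    state = []
    for i in range(len(obs)):
        ratio = (obs[i] - lower[i]) / (upper[i] - lower[i])
        index = int(round((buckets[i] - 1) * ratio))
        state.append(min(buckets[i] - 1, max(0, index)))  # clip out-of-range values
    return tuple(state)


def decayed_rate(t, minimum=0.1, divisor=25):
    """Stay at 1.0 for the first `divisor` episodes, then decay down to `minimum`."""
    return max(minimum, min(1.0, 1.0 - math.log10((t + 1) / divisor)))


print(discretize([10.0, 10.0, 10.0, 10.0]))      # values above the bounds are clipped
print(discretize([-10.0, -10.0, -10.0, -10.0]))  # values below the bounds are clipped
for episode in (0, 24, 99, 249):
    print(episode, round(decayed_rate(episode), 3))
```

With `buckets=(1, 1, 6, 12)` the first two observation values always map to index 0, so the Q-table effectively depends only on the pole angle and angular velocity.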


### Summary of method 2

https://fr.wikipedia.org/wiki/Q-learning
The setting consists of an agent, a set of states $S$, and a set of actions $A$. By performing an action $a \in A$, the agent moves from one state to a new state. Executing an action in a specific state gives the agent a reward (a numerical value). The agent's goal is to maximize its total reward. It does so by learning the optimal action for each state, which is the action with the highest long-term reward. This reward is a weighted sum of the expected rewards of each future step starting from the current state. The weight of each step can be $\gamma^{\Delta t}$, where $\Delta t$ is the delay between the current step and the future step, and $\gamma$ is a number between 0 and 1 (that is, $0 \leq \gamma \leq 1$) called the discount factor.
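
As a small illustration of the weighted sum, the sketch below computes a discounted return for a fixed, made-up list of rewards (the function name `discounted_return` is illustrative). It ignores the expectation and simply applies the weight $\gamma^{\Delta t}$ to each reward.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**delay * reward over future steps (delay 0 = next reward)."""
    total = 0.0
    for delay, reward in enumerate(rewards):
        total += (gamma ** delay) * reward
    return total


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, 1.0))  # all rewards count equally
print(discounted_return(rewards, 0.5))  # later rewards count less
print(discounted_return(rewards, 0.0))  # only the next reward counts
```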

The algorithm computes an action-value function:

$$Q : S \times A \to \mathbb{R}$$

Before learning begins, the function $Q$ is initialized arbitrarily. Then, each time an action is chosen, the agent observes the reward and the new state (which depends on the previous state and the current action), and $Q$ is updated. The core of the algorithm is this update of the value function, which is corrected at each step as follows:

$$Q[s,a] := (1-\alpha)\,Q[s,a] + \alpha\left(r + \gamma \max_{a'} Q[s',a']\right)$$
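
The update rule here and the incremental form used in `update_q` are algebraically the same. The sketch below checks this numerically on a small made-up Q-table; the function names `update_blend` and `update_increment` are illustrative.

```python
import numpy as np


def update_blend(Q, s, a, r, s_next, alpha, gamma):
    """Q[s,a] := (1 - alpha) * Q[s,a] + alpha * (r + gamma * max_a' Q[s',a'])."""
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * np.max(Q[s_next]))


def update_increment(Q, s, a, r, s_next, alpha, gamma):
    """Same update written as Q[s,a] += alpha * (target - Q[s,a])."""
    Q[s][a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s][a])


# Made-up table with 3 states and 2 actions.
Q1 = np.zeros((3, 2))
Q1[1] = [2.0, 4.0]
Q2 = Q1.copy()

update_blend(Q1, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=1.0)
update_increment(Q2, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=1.0)

print(Q1[0][1], Q2[0][1])  # both forms give the same new value
```

Here the target is $r + \gamma \max_{a'} Q[s',a'] = 1 + 4 = 5$, and with $\alpha = 0.5$ the old value 0 moves halfway towards it.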

where $s'$ is the new state, $s$ is the previous state, $a$ is the chosen action, $r$ is the reward received by the agent, $\alpha$ is a number between 0 and 1 called the learning rate, and $\gamma$ is the discount factor.

An episode of the algorithm ends when $s_{t+1}$ is a terminal state. However, Q-learning can also learn in non-episodic tasks. If the discount factor is smaller than 1, the action value is finite even for infinite $\Delta t$.
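
The finiteness claim follows from a geometric series. As a sketch, assume every reward is bounded, $|r_t| \le r_{\max}$ (the symbol $r_{\max}$ is introduced here for illustration and does not appear in the original text):

```latex
\left| \sum_{\Delta t = 0}^{\infty} \gamma^{\Delta t}\, r_{t+\Delta t} \right|
\;\le\; r_{\max} \sum_{\Delta t = 0}^{\infty} \gamma^{\Delta t}
\;=\; \frac{r_{\max}}{1-\gamma},
\qquad 0 \le \gamma < 1 .
```

With $\gamma = 1$, as in the solver's default `gamma=1.0`, this bound no longer applies; the return stays finite only because each episode ends.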

N.B.: For each terminal state $s_f$, the value $Q(s_f, a)$ is never updated and keeps its initial value. Usually, $Q(s_f, a)$ is initialized to zero.