# Blackjack 
- The **learning** library implements several reinforcement learning algorithms (e.g. sarsa, qlearning, MC, sarsa_lambda, and q_lambda).
- We will compare the SARSA and Q-learning algorithms by training an agent to play blackjack.
- The **objective** of the exercise is for you to understand and explain how both algorithms are used to play this game. You may use the following reference for the SARSA and Q-learning algorithms: https://tcnguyen.github.io/reinforcement_learning/sarsa_vs_q_learning.html

### Import dependencies
- All Python implementations are in the folder **learning**.
- Read the SARSA and Q-learning implementations carefully (they are also used in exercise-01), then comment on them.

In [1]:
import numpy as np
from itertools import product

# These functions are inside the folder named learning.
from learning.model_free import Problem
from learning.model_free import sarsa
from learning.model_free import qlearning

In [2]:
# Write your comments in this cell
'''
SARSA (on-policy, it learns from the action actually taken): the updated Q value is
determined by two consecutive state/action pairs and the immediate reward the agent
receives while moving from the first state to the next.

Q-learning (off-policy): the Q value is updated from the best action available in the next state.
It converges to a solution that is optimal under the assumption that, after generating
experience and training, we switch over to the greedy policy.
'''

# "Greedy" means the policy only seeks to maximise its gain (Q-learning takes the max of Q).

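The difference described in the comments above can be illustrated with a minimal sketch of the two tabular updates. This is not the code from `learning.model_free`, which is not shown here; the function names `sarsa_update` and `qlearning_update` and the toy Q table are assumptions made for illustration only.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    """On-policy update: bootstrap from the action actually taken in s_next."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def qlearning_update(Q, s, a, r, s_next, alpha=0.1, gamma=1.0):
    """Off-policy update: bootstrap from the greedy (max) action in s_next."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# Hypothetical toy table: 2 states x 2 actions.
Q_sarsa = np.array([[0.0, 0.0], [1.0, 5.0]])
Q_qlearn = Q_sarsa.copy()

# Same transition (s=0, a=0, r=0, s_next=1). The behaviour policy explored
# a_next=0, whose value is 1.0, while the greedy action in s_next is worth 5.0.
sarsa_update(Q_sarsa, 0, 0, 0.0, 1, 0)
qlearning_update(Q_qlearn, 0, 0, 0.0, 1)

print(Q_sarsa[0, 0], Q_qlearn[0, 0])  # SARSA moves toward 1.0, Q-learning toward 5.0
```

With an epsilon-greedy behaviour policy the two targets coincide whenever the next action happens to be the greedy one, so the updates only differ on exploratory steps.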

### Create the game class
- This class inherits from the class **Problem** and adds simple methods to define the components of the game blackjack.
- As comments, clearly state the components of the Markov Decision Process used for the blackjack game.

In [27]:
class BlackJack(Problem):

    def __init__(self):
        # State: (sum of player's cards, dealer's showing card, usable ace).
        # (-1, -1, -1) at index 0 is the terminal state.
        self.states = [(-1, -1, -1)]
        self.states += [(i, j, k) for (i, j, k) in product(range(12, 22), range(1, 11), [0, 1])]
        self.a = ['hit', 'stick']  # Actions: index 0 = hit, index 1 = stick

        self.states_map = {s: i for i, s in enumerate(self.states)}

        Problem.__init__(self, len(self.states), len(self.a))

    def get_card(self):
        # Draw a rank in 1..13 uniformly; face cards (11-13) count as 10
        return min(10, np.random.randint(1, 14))

    def sample_initial_state(self):
        my_card = np.random.randint(12, 22)
        dealer_showing = self.get_card()
        usable_ace = np.random.randint(0, 2)  # 1 if the hand holds a usable ace, 0 otherwise

        return self.states_map[(my_card, dealer_showing, usable_ace)]

    def hand_value(self, sum_cards, usable_ace):
        # If the hand exceeds 21 with a usable ace, count the ace as 1 instead of 11 (subtract 10)
        if usable_ace and sum_cards > 21:
            return sum_cards - 10
        return sum_cards  # otherwise the sum stands as is

    def actions(self, s):
        (my_sum, _, usable_ace) = self.states[s]

        if self.hand_value(my_sum, usable_ace) >= 21:
            return [1]  # at 21 or more, only 'stick' is allowed
        else:
            return [0, 1]  # below 21, both 'hit' and 'stick' are allowed

    def is_final(self, s):
        return s == 0

    # Computes the next state and reward pair and whether the state is final
    def state_reward(self, s, a): 
        (my_sum, dealer_card, usable_ace) = self.states[s]  # Unpack the current state
        next_s = self.states_map[(my_sum, dealer_card, usable_ace)] 

        if a == 1:  # Stick
            if self.hand_value(my_sum, usable_ace) > 21:
                return 0, -1  # hand total > 21: the player busts, negative reward

            dealer_sum = dealer_card
            dealer_usable_ace = 0
            if dealer_card == 1:  # the dealer's showing card is an ace
                dealer_sum += 10  # count it as 11 (1 + 10)
                dealer_usable_ace = 1  # the dealer holds a usable ace

            while self.hand_value(dealer_sum, dealer_usable_ace) < self.hand_value(my_sum, usable_ace):  # dealer's total is below the player's
                card = self.get_card()  # draw a card

                dealer_sum += card  # add it to the dealer's total
                if card == 1:  # an ace
                    dealer_sum += 10  # counted as 11

                if card == 1 or dealer_usable_ace:  # the dealer drew an ace or already holds a usable one
                    if dealer_sum <= 21:
                        dealer_usable_ace = 1  # total is at most 21: keep the ace as 11
                    else:
                        dealer_sum -= 10  # otherwise the ace counts as 1
                        dealer_usable_ace = 0

                if self.hand_value(dealer_sum, dealer_usable_ace) == self.hand_value(my_sum, usable_ace) == 17:  # both at 17: draw, no reward
                    return 0, 0

            if self.hand_value(dealer_sum, dealer_usable_ace) > 21:  # the dealer busts: the player wins (reward +1)
                return 0, 1

            if self.hand_value(dealer_sum, dealer_usable_ace) == self.hand_value(my_sum, usable_ace):  # tie: no reward
                return 0, 0

            if self.hand_value(dealer_sum, dealer_usable_ace) < self.hand_value(my_sum, usable_ace):  # the player wins on total (reward +1)
                return 0, 1

            # if dealer_sum > my_sum:
            return 0, -1
        else:  # Hit
            card = self.get_card()
            my_sum += card
            if card == 1:
                my_sum += 10

            if card == 1 or usable_ace:  # we already hold a usable ace or have just drawn an ace
                if my_sum <= 21:
                    usable_ace = 1
                else:
                    my_sum -= 10
                    usable_ace = 0

            if self.hand_value(my_sum, usable_ace) > 21:
                return 0, -1

            # Only nonterminal case
            next_s = self.states_map[(my_sum, dealer_card, usable_ace)]
            return next_s, 0

        raise Exception('Unexpected state/action pair')  # Safety net: should be unreachable, since both branches return above

    def print_policy(self, policy):
        print('Usable ace:')
        for i, state in enumerate(self.states):
            if state[2]:
                print('Hand value: {0}, Dealer Showing: {1}, Action: {2}'.format(
                    self.hand_value(state[0], 1), state[1], self.a[policy[i]]))

        print('No usable ace:')
        for i, state in enumerate(self.states):
            if not state[2]:
                print('Hand value: {0}, Dealer Showing: {1}, Action: {2}'.format(
                    self.hand_value(state[0], 0), state[1], self.a[policy[i]]))

    def print_values(self, values):
        for i in range(len(values)):
            print('State {0}. Value: {1}'.format(self.states[i], values[i]))
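The `get_card` method above draws a rank uniformly from 1 to 13 and caps it at 10, so card values are not equally likely. A short sketch of the resulting distribution, computed exactly rather than sampled (the variable names `counts` and `probs` are illustrative and not part of the class):

```python
from collections import Counter
from fractions import Fraction

# np.random.randint(1, 14) draws an integer in 1..13 with equal probability;
# min(10, rank) maps the three face-card ranks (11, 12, 13) onto the value 10.
counts = Counter(min(10, rank) for rank in range(1, 14))
probs = {value: Fraction(n, 13) for value, n in sorted(counts.items())}

for value, p in probs.items():
    print(value, p)
```

Values 1 to 9 each have probability 1/13, while the value 10 has probability 4/13. Because every draw is independent, the class effectively models an infinite deck.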

In [None]:
# Write your comments in this cell
'''
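States  : (my_sum, dealer_card, usable_ace), with my_sum in 12..21, dealer_card in 1..10
          and usable_ace in {0, 1}, plus the terminal state (-1, -1, -1)

Actions : hit (index 0), stick (index 1)

Rewards : -1 (loss), 0 (draw, or a hit that does not bust), +1 (win)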

'''
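As a quick check on the state component listed above, the state space can be rebuilt outside the class in the same way `__init__` builds it. This is a standalone sketch; the names `terminal`, `states`, and `states_map` mirror the class attributes but are redefined here for illustration.

```python
from itertools import product

# Terminal state first, then every (player sum, dealer card, usable ace) triple.
terminal = (-1, -1, -1)
states = [terminal] + list(product(range(12, 22), range(1, 11), [0, 1]))
states_map = {s: i for i, s in enumerate(states)}

print(len(states))            # 1 terminal + 10 sums * 10 dealer cards * 2 ace flags
print(states_map[terminal])   # the terminal state sits at index 0, as is_final expects
```

The Q table learned by either algorithm therefore has 201 rows and 2 columns, one per action.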

### Main function with different tests
- Modify this code to test different scenarios and hyperparameters.
- Also, compare both algorithms by evaluating the evolution of the reward in each trial/step of the process. 
- Comment on the differences.

In [None]:
# Write your comments in this cell
'''
SARSA: on-policy. The policy being learned is the same as the behaviour policy that generates the experience.

Evolution of the reward per trial:
[In the reward plot, each state is (my sum, dealer card, usable ace)] --> each (my sum, dealer card) pair is plotted twice, once with a usable ace (1) and once without (0).
The rewards evolve better (higher) with SARSA, because SARSA is preferable in situations where we care about the agent's performance
during the process of learning / generating experience.

'''
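One way to put numbers on such a comparison is to roll out a fixed policy for many episodes and smooth the per-episode reward. The sketch below assumes only the `Problem` methods visible in this notebook (`sample_initial_state`, `state_reward`, `is_final`) and a policy indexed by state. The helpers `evaluate_policy` and `moving_average`, and the `CoinFlip` stand-in problem, are hypothetical and not part of the **learning** library. Note that this measures the quality of a policy after training; plotting the reward during learning would require recording rewards inside `sarsa` and `qlearning`, whose internals are not shown here.

```python
import numpy as np

def evaluate_policy(problem, policy, n_episodes=1000, max_steps=100):
    """Return the total reward of each episode when following `policy`."""
    returns = np.zeros(n_episodes)
    for ep in range(n_episodes):
        s = problem.sample_initial_state()
        for _ in range(max_steps):
            s, r = problem.state_reward(s, policy[s])
            returns[ep] += r
            if problem.is_final(s):
                break
    return returns

def moving_average(x, window=100):
    """Smooth a noisy reward sequence with a sliding mean."""
    return np.convolve(x, np.ones(window) / window, mode="valid")

class CoinFlip:
    """Hypothetical one-step stand-in for BlackJack so the sketch runs on its own."""
    def sample_initial_state(self):
        return 1
    def is_final(self, s):
        return s == 0
    def state_reward(self, s, a):
        # Action 1 always wins (+1), action 0 always loses (-1); both end the episode.
        return 0, (1 if a == 1 else -1)

returns = evaluate_policy(CoinFlip(), policy=[0, 1])
smooth = moving_average(returns)
print(returns.mean(), smooth.shape)
```

With the real game, `evaluate_policy(problem, pi)` could be called once with the SARSA policy and once with the Q-learning policy, and the two smoothed curves compared.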

In [30]:
def main():
    problem = BlackJack()
    
    print("****************************************************")
    print("SARSA algorithm")
    print("****************************************************")
    pi, v = sarsa(problem, 10000, epsilon=0.1, alpha=0.1, gamma=1.0)
    problem.print_policy(pi)
    problem.print_values(v)

    
    print("****************************************************")
    print("Q learning algorithm")
    print("****************************************************")
    pi, v = qlearning(problem, 10000, epsilon=0.1, alpha=0.1, gamma=1.0)

    problem.print_policy(pi)
    problem.print_values(v)


In [31]:
# Executing this line should also plot the evolution of the reward
# for each algorithm. You may modify any part of the code, just be
# careful to comment on what changes you are implementing.

main()

****************************************************
SARSA algorithm
****************************************************
Usable ace:
Hand value: -1, Dealer Showing: -1, Action: hit
Hand value: 12, Dealer Showing: 1, Action: hit
Hand value: 12, Dealer Showing: 2, Action: hit
Hand value: 12, Dealer Showing: 3, Action: hit
Hand value: 12, Dealer Showing: 4, Action: hit
Hand value: 12, Dealer Showing: 5, Action: hit
Hand value: 12, Dealer Showing: 6, Action: hit
Hand value: 12, Dealer Showing: 7, Action: hit
Hand value: 12, Dealer Showing: 8, Action: hit
Hand value: 12, Dealer Showing: 9, Action: hit
Hand value: 12, Dealer Showing: 10, Action: hit
Hand value: 13, Dealer Showing: 1, Action: hit
Hand value: 13, Dealer Showing: 2, Action: hit
Hand value: 13, Dealer Showing: 3, Action: hit
Hand value: 13, Dealer Showing: 4, Action: hit
Hand value: 13, Dealer Showing: 5, Action: hit
Hand value: 13, Dealer Showing: 6, Action: hit
Hand value: 13, Dealer Showing: 7, Action: hit
Hand value: 13, De